[External] Fwd: Reading very large text files into R
I think you need the fill=TRUE argument. See ?read.table
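
A minimal sketch of the fill=TRUE suggestion, assuming a toy file in which most rows have three fields and one late row has four (the data and the names txt, f, n_col and dat are made up for illustration). ?read.table notes that the number of columns is taken from the first five lines of input unless col.names is longer, so a wider row further down may still need col.names to be set explicitly:

```r
# Toy stand-in for the real file: most rows have 3 fields, a late row has 4.
txt <- c("a, 1, x", "b, 2, y", "c, 3, z", "d, 4, w",
         "e, 5, v", "f, 6, u", "g, 7, t, B")
f <- tempfile(fileext = ".txt")
writeLines(txt, f)

# Widest row in the whole file (count.fields looks at every line).
n_col <- max(count.fields(f, sep = ","))

# fill = TRUE pads the short rows; col.names fixes the full width up front,
# because read.table would otherwise only inspect the first five lines.
dat <- read.table(f, header = FALSE, sep = ",", fill = TRUE,
                  strip.white = TRUE,
                  col.names = paste0("V", seq_len(n_col)))
str(dat)
```

On a very large file the extra pass made by count.fields costs some time, so a known upper bound on the number of fields could be used in its place.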
On Sep 29, 2022, at 11:14, Enrico Schumann <es at enricoschumann.net> wrote:
On Thu, 29 Sep 2022, Nick Wray writes:
---------- Forwarded message ---------
From: Nick Wray <nickmwray at gmail.com>
Date: Thu, 29 Sept 2022 at 15:32
Subject: Re: [R] Reading very large text files into R
To: Ben Tupper <btupper at bigelow.org>

Hi Ben

Below is an example of the text (also in an attachment). The problem is the "B": there are quite a few of them scattered throughout the text doc, and they cause the error message on reading in (btw I don't need the "RAIN" column or the 1's after it or the last four elements). I have also attached the snippet as a text file.

1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 226918, RAIN, 1, 1, WAHRAIN, 5124, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 228562, RAIN, 1, 1, WAHRAIN, 491, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 231581, RAIN, 1, 1, WAHRAIN, 5213, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 232671, RAIN, 1, 1, WAHRAIN, 487, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 232913, RAIN, 1, 1, WAHRAIN, 5243, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B
1980-01-01 10:00, 234682, RAIN, 1, 1, WAHRAIN, 5271, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 235389, RAIN, 1, 1, WAHRAIN, 5279, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 236466, RAIN, 1, 1, WAHRAIN, 497, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 243350, RAIN, 1, 1, SREW, 484, 1001, 0, , 9, 0, , ,
1980-01-01 10:00, 243350, RAIN, 1, 1, WAHRAIN, 484, 1001, 0, 0, 9, 9, , ,

Thanks
Nick

On Thu, 29 Sept 2022 at 15:12, Ben Tupper <btupper at bigelow.org> wrote:
Hi Nick,

It's hard to know without seeing at least a snippet of the data. Could you do the following and paste the result into a plain text email? If you don't set your email client to plain text (from rich text or html) then we are apt to see a jumble of output on our email clients.

## start
x <- readLines(filename, n = 20)
cat(x, sep = "\n")
## end

Cheers,
Ben

On Thu, Sep 29, 2022 at 9:54 AM Nick Wray <nickmwray at gmail.com> wrote:
Hello

I may be offending the R purists with this question but it is linked to R, as will become clear. I have very large data sets from the UK Met Office in notepad form. Unfortunately, I can't read them directly into R because, for some reason, although most lines in the text doc consist of 15 elements, every so often there is a sixteenth one and R doesn't like this and gives me an error message because it has assumed that every line has 15 elements and doesn't like finding one with more. I have tried playing around with the text document, inserting an extra element into the top line etc, but to no avail.

Also unfortunately you need access permission from the Met Office to get the files in question so this link probably won't work:

https://catalogue.ceda.ac.uk/uuid/bbd6916225e7475514e17fdbf11141c1

So what I have done is simply to copy and paste the text docs into excel csv and then read them in, which is time-consuming but works. However the later datasets are over the excel limit of 1048576 lines. I can paste in the first 1048576 lines but then trying to isolate the remainder of the text doc to paste it into a second csv doc is proving v difficult - the only way I have found is to scroll down by hand and that's taking ages. I cannot find another way of editing the notepad text doc to get rid of the part which I have already copied and pasted.

Can anyone help with a) ideally being able to simply read the text tables into R or b) suggest a way of editing out the bits of the text file I have already pasted in without laborious scrolling?

Thanks
Nick Wray
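
Regarding option b), one possible sketch, assuming a small made-up file and a hypothetical count already_done standing in for the 1048576 lines already handled. The skip and nrows arguments of read.table let R start partway through a file, so the text doc would not need to be edited by hand. R itself does not have Excel's row limit, so if memory allows the whole file could also be read in one call:

```r
# Toy file standing in for a large text file (10 rows here, made-up values).
f <- tempfile(fileext = ".txt")
writeLines(sprintf("1980-01-01 10:00, %d, RAIN, %d", 1001:1010, 1:10), f)

already_done <- 4   # hypothetical; 1048576 in the Excel case described above

# Read everything that comes after the lines already handled.
rest <- read.table(f, header = FALSE, sep = ",", skip = already_done,
                   strip.white = TRUE)

# Or read a fixed-size block starting at the same point.
block <- read.table(f, header = FALSE, sep = ",", skip = already_done,
                    nrows = 3, strip.white = TRUE)
```

Here rest holds rows 5 to 10 of the toy file and block holds rows 5 to 7.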
[...]
--
Ben Tupper (he/him)
Bigelow Laboratory for Ocean Science
East Boothbay, Maine
http://www.bigelow.org/
https://eco.bigelow.org/
Maybe I have missed it, but could you please show how
you tried to read the table?
When I use your file with
read.table("sample text.txt", header = FALSE, sep = ",")
I get
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
## 1 1980-01-01 10:00 225620 RAIN 1 1 WAHRAIN 5091 1001 0 NA 9 0 NA NA
## 2 1980-01-01 10:00 226918 RAIN 1 1 WAHRAIN 5124 1001 0 NA 9 0 NA NA
## .....
## 7 1980-01-01 10:00 234362 RAIN 1 1 WAHRAIN 5265 1001 0 NA 10009 0 NA NA B
## 8 1980-01-01 10:00 234682 RAIN 1 1 WAHRAIN 5271 1001 0 NA 9 0 NA NA
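
A small sketch along the same lines, using two rows copied from the sample in this thread. It first checks the number of fields per line with count.fields, then repeats the read.table call with colClasses so that unwanted columns are skipped. Which columns to drop is an assumption here (3 to 5 and 12 to 15, based on the remark about the "RAIN" column, the 1's after it and the last four elements); the names txt, f, n_fields, cc and dat are illustrative only:

```r
# Two rows copied from the sample in this thread (one plain, one ending in B).
txt <- c(
  "1980-01-01 10:00, 225620, RAIN, 1, 1, WAHRAIN, 5091, 1001, 0, , 9, 0, , ,",
  "1980-01-01 10:00, 234362, RAIN, 1, 1, WAHRAIN, 5265, 1001, 0, , 10009, 0, , , B"
)
f <- tempfile(fileext = ".txt")
writeLines(txt, f)

# How many comma-separated fields does each line have?
n_fields <- count.fields(f, sep = ",")
print(table(n_fields))

# Assumption: columns 3 (RAIN), 4-5 (the 1's) and 12-15 are not needed.
cc <- rep(NA_character_, 15)   # NA lets read.table choose the type
cc[c(3:5, 12:15)] <- "NULL"    # "NULL" skips the column entirely

dat <- read.table(f, header = FALSE, sep = ",", strip.white = TRUE,
                  colClasses = cc)
str(dat)
```

Tabulating n_fields over the full file would show whether any line really has a different number of fields from the rest.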
--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net/
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.