Drop matching lines from readLines
4 messages · Santosh Srinivas, Mike Marchywka, Bert Gunter
----------------------------------------
From: santosh.srinivas at gmail.com
To: r-help at r-project.org
Date: Thu, 14 Oct 2010 11:27:57 +0530
Subject: [R] Drop matching lines from readLines

Dear R-group, I have some noise in my text file (coding issues!). I imported a 200 MB text file using readLines and used grep to find the lines with the error. What is the easiest way to drop those lines? I plan to write the "cleaned" data set back to my base file.
Generally for text processing I've been using utilities external to R, although there may be R alternatives that work better for you. You mention grep; I've suggested sed as a general way to fix formatting issues, and there is also a utility called "uniq" on Linux or Cygwin. I've gotten into the habit of using these for a variety of data-manipulation tasks, so that only clean data is fed into R.
$ echo -e 'a bc\na bc'
a bc
a bc
$ echo -e 'a bc\na bc' | uniq
a bc
$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.
Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines
  -D, --all-repeated[=delimit-method]  print all duplicate lines
                        delimit-method={none(default),prepend,separate}
                        Delimiting is done with blank lines
  -f, --skip-fields=N   avoid comparing the first N fields
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -u, --unique          only print unique lines
  -z, --zero-terminated  end lines with 0 byte, not newline
  -w, --check-chars=N   compare no more than N characters in lines
      --help     display this help and exit
      --version  output version information and exit
A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters.  Fields are skipped before chars.
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
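As the note in the help text says, uniq only merges *adjacent* duplicates, so for duplicates scattered through a file you would sort first. A quick illustration:

```shell
# 'a' appears twice but not adjacently, so plain uniq keeps both copies.
printf 'a\nb\na\n' | uniq          # -> a b a

# Sorting first makes the duplicates adjacent; sort -u does both steps at once.
printf 'a\nb\na\n' | sort | uniq   # -> a b
printf 'a\nb\na\n' | sort -u       # -> a b
```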
Thanks.
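Applied to the poster's actual problem, the external-utility route is a one-liner. A sketch, with 'errorpattern' and the file names as placeholders for whatever regex and file actually identify the noisy lines:

```shell
# Build a small stand-in for the poster's 200 MB file.
printf 'good line 1\nerrorpattern noise\ngood line 2\n' > dirty.txt

# grep -v inverts the match: keep only the lines that do NOT match the pattern.
grep -v 'errorpattern' dirty.txt > clean.txt

# Equivalent with sed: delete every line matching the pattern.
sed '/errorpattern/d' dirty.txt > clean2.txt
```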
If I understand correctly, the poster knows what regex error pattern to look for, in which case (memory capacity permitting -- but 200 MB should not be a problem, I think) isn't merely

cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

sufficient?
Cheers,
Bert
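For the full round trip the poster describes (read, filter, write back), a minimal sketch built around Bert's one-liner; the file names and the pattern are placeholders, and writing to a new file is safer than overwriting the original until the result is checked:

```r
# Build a small stand-in for the poster's 200 MB file.
writeLines(c("good line 1", "errorPatternregex noise", "good line 2"), "dirty.txt")

# Read the raw lines.
dirtyData <- readLines("dirty.txt")

# Drop every line matching the (placeholder) error pattern.
cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

# Write the cleaned lines to a new file rather than clobbering the original.
writeLines(cleanData, "clean.txt")
```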
On Thu, Oct 14, 2010 at 4:05 AM, Mike Marchywka <marchywka at hotmail.com> wrote:
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Bert Gunter Genentech Nonclinical Biostatistics
Yes, thanks ... that works.
-----Original Message-----
From: Bert Gunter [mailto:gunter.berton at gene.com]
Sent: 14 October 2010 21:26
To: Mike Marchywka
Cc: santosh.srinivas at gmail.com; r-help at r-project.org
Subject: Re: [R] Drop matching lines from readLines