how to delete specific rows in a data frame where the first column matches any string from a list
Andrew Choens wrote:
I regularly deal with a similar pattern at work. People send me these big long .csv files and I have to run them through some pattern analysis to decide which rows I keep and which rows I kill off. As others have mentioned, Perl is a good candidate for this task. Another option would be a quick SQL query. It should be a snap to pull this into something like Access or OOo Base ... or better yet, a real database like Postgres, MySQL, etc. In case you aren't too familiar with SQL, this could be done by deleting the rows using a self join (syntax varies by product). But if the pattern is as simple as it sounds and/or this is a one-time job, using SQL is overkill for the situation. I often use sed in places where Perl is overkill, but I can't think of any way to match from row to row with sed. If anyone knows how to do this with sed, it would (probably) be easier than trying to learn how to use Perl. And I would like to know how to do this with sed too.
(this is actually off-topic, but since it may be of general interest, i am keeping the response cc'd to r-help)

yes, you can do this with sed. suppose you have two files: one (say, sample.txt) with the data to be filtered, its record fields separated by, e.g., a tab character, and another (say, filter.txt) with the patterns to be matched, one per line. a row from the first file is passed to the output only if its second field does not match any of the patterns -- this corresponds to (a simplified version of) the original problem. then the following should do:

    sed "$(sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt)" sample.txt > filtered-sample.txt

(unless the patterns contain characters that interfere with the shell's or sed's syntax, in which case they'd have to be appropriately escaped.)

vQ
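
To see what the one-liner above does, it can help to look at the script that the inner sed generates before the outer sed runs it. The following is a minimal sketch, assuming GNU sed (the \+ and \t escapes are GNU extensions and may not work in other sed implementations). Only the file names sample.txt, filter.txt and filtered-sample.txt come from the message; the file contents and the intermediate script.sed file are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch existing files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: two tab-separated fields per record.
printf 'r1\tapple\nr2\tbanana\nr3\tcherry\nr4\tapplesauce\nr5\tpineapple\n' > sample.txt

# Hypothetical patterns, one per line.
printf 'apple\ncherry\n' > filter.txt

# Step 1: the inner sed wraps every pattern P as the sed command
#   /^[^\t]\+\tP/d
# i.e. "delete the line if P follows the first field and its tab".
sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt > script.sed
cat script.sed
# prints:
#   /^[^\t]\+\tapple/d
#   /^[^\t]\+\tcherry/d

# Step 2: the original one-liner, which feeds that generated script
# to the outer sed via command substitution.
sed "$(sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt)" sample.txt > filtered-sample.txt
cat filtered-sample.txt
# with GNU sed this leaves the r2 (banana) and r5 (pineapple) rows
```

Note that the generated patterns are anchored at the start of the second field but not at its end: in this sketch "apple" also removes the "applesauce" row, while "pineapple" survives. If whole-field matches are wanted, the generated pattern would need an additional end-of-field check.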