
Help with isolating and comparing data from two files.

3 messages · jim holtman, ajn21

#
Hello,

I was hoping that someone would be able to help me or at least point me in
the right direction regarding a problem I am having. I am a new R user, and
I've been trying to read tutorials but they haven't been much help to me so
far.

The problem is relatively simple as I've already created working solutions
in Java and Perl, but I need a solution in R as well. 

I have two text files, say pos.txt and reg.txt. In pos.txt, the data looks
like this, for example:

c22 1445  - CG 1 4
c22 1542 + CG 2 3
c22 1678 + CG 13 15
...

etc. for thousands of lines. The most important column is column 2, which
lists "position" (e.g. 1445, 1542, 1678). In reg.txt, data is listed as:

c22 1440 1500 cpg: 44 56 ......
c22 1520 1700 cpg: 56 87 ......
c22 1800 1900 cpg: 58 90 ......
...

where the values in column 2 are the "start" positions and the values in
column 3 are the "end" positions. There are 10 columns total but I just
listed the first few. Also, the text files are different lengths.


Essentially, I need to take each position listed in column 2 of pos.txt and
find the region in reg.txt (based on its start and end positions) that
contains it. Then I need to print:

c22 "start" "end" "position" + 1 5 

where the last 3 columns come from pos.txt as well (i.e. not every line ends
in + 1 5; each line ends with the values from the matching row of pos.txt).
Also, the position needs to be within the start and end position.

So far I've been able to use read.table to create a data frame for each text
file, and I've also named each column (e.g. reg.data$end) and I can output
each column individually. However, the problem I keep facing is how to
compare the numbers for "position" in pos.txt to the numbers for "start" and
"end" in reg.txt. I tried to use: 

if ((pos >= start) | (pos <= end))..

but an error comes up that says the files aren't the same length.
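
A note on that attempt, offered as a sketch rather than a diagnosis of the exact error: `if` expects a single TRUE/FALSE value, and comparing two vectors of different lengths recycles the shorter one, which is probably what the length complaint refers to. "Within" also needs both conditions to hold, so `&` rather than `|`. The vectors below are copied from the sample rows above; the names pos_v, start_v and end_v are made up for illustration, and outer() is just one way to compare every position with every region.

```r
pos_v   <- c(1445, 1542, 1678)   # positions from the pos.txt sample
start_v <- c(1440, 1520, 1800)   # region starts from the reg.txt sample
end_v   <- c(1500, 1700, 1900)   # region ends from the reg.txt sample

# inside[i, j] is TRUE when position i lies within region j (inclusive)
inside <- outer(pos_v, start_v, ">=") & outer(pos_v, end_v, "<=")

# Each row of pairs is one (position index, region index) match
pairs <- which(inside, arr.ind = TRUE)
print(pairs)
```

The matrix has one cell per position/region pair, so memory use grows with the product of the two file lengths; for very large files a loop over regions may be the safer choice.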

In Java and Perl I used nested loops to cycle through each element in one
file, and compare it to every element in the other file, and then printed to
a new text file. As such, I was trying to learn a bit more about arrays in
R, but if you know of a better way in R to do this then please let me know.

Any help is greatly appreciated.

Thank you,
AJ

--
View this message in context: http://r.789695.n4.nabble.com/Help-with-isolating-and-comparing-data-from-two-files-tp3543170p3543170.html
Sent from the R help mailing list archive at Nabble.com.
#
Is this what you are after?

> # (the archive dropped the console lines that began with ">"; the ">" lines
> # shown here, including the lapply() and do.call() calls, are reconstructed)
> pos
   V1   V2 V3 V4 V5 V6
1 c22 1445  - CG  1  4
2 c22 1542  + CG  2  3
3 c22 1678  + CG 13 15
> reg
   V1   V2   V3   V4 V5 V6     V7
1 c22 1440 1500 cpg: 44 56 ......
2 c22 1520 1700 cpg: 56 87 ......
3 c22 1800 1900 cpg: 58 90 ......
> result <- lapply(seq_len(nrow(reg)), function(i) {
+     # get indices of match
+     indx <- (pos$V2 >= reg$V2[i]) & (pos$V2 <= reg$V3[i])
+     if (!any(indx)) return(NULL)  # no match
+     # create new dataframe
+     cbind(reg[rep(i, sum(indx)), 1:3], pos[indx, ])
+ })
> do.call(rbind, result)
     V1   V2   V3  V1   V2 V3 V4 V5 V6
1   c22 1440 1500 c22 1445  - CG  1  4
2   c22 1520 1700 c22 1542  + CG  2  3
2.1 c22 1520 1700 c22 1678  + CG 13 15
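
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same per-region lookup. It assumes the sample rows from the question are read inline with read.table(text = ...) instead of from pos.txt and reg.txt. The helper name find_regions and the column labels chr, start, end, position and strand are illustrative and do not come from the original post.

```r
# Sample rows from the question, read inline instead of from the two files
pos <- read.table(text = "c22 1445 - CG 1 4
c22 1542 + CG 2 3
c22 1678 + CG 13 15", stringsAsFactors = FALSE)

reg <- read.table(text = "c22 1440 1500 cpg: 44 56
c22 1520 1700 cpg: 56 87
c22 1800 1900 cpg: 58 90", stringsAsFactors = FALSE)

# find_regions is a hypothetical helper name, not part of the reply above
find_regions <- function(pos, reg) {
  hits <- lapply(seq_len(nrow(reg)), function(i) {
    # positions that fall inside region i (inclusive on both ends)
    inside <- pos$V2 >= reg$V2[i] & pos$V2 <= reg$V3[i]
    if (!any(inside)) return(NULL)  # region contains no position
    # region columns are recycled across all matching positions
    data.frame(chr = reg$V1[i], start = reg$V2[i], end = reg$V3[i],
               position = pos$V2[inside], strand = pos$V3[inside],
               pos[inside, 5:6], row.names = NULL,
               stringsAsFactors = FALSE)
  })
  # NULL entries are dropped when the pieces are stacked
  do.call(rbind, hits)
}

result <- find_regions(pos, reg)
print(result)
```

The sketch keeps only the columns requested in the question (region, position, strand and the last two pos.txt columns). It does not check that column 1 matches between the two files, which would matter if they covered more than one chromosome.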

        
On Mon, May 23, 2011 at 12:00 AM, ajn21 <ajn21 at case.edu> wrote:

  
    
#
Jim,

Thank you for your help! That is precisely what I was looking for. From your
help I was able to edit the output and then print it to a txt file (because
I didn't want to print it all in the R console due to the thousands of
lines).
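
Writing the matches to a text file instead of the console might look like the sketch below. This is not AJ's actual code: the small data frame stands in for the real result, and the file name matches.txt and the temporary directory are placeholders.

```r
# Placeholder result; in practice this would be the data frame of matched rows
result <- data.frame(chr = "c22",
                     start = c(1440L, 1520L), end = c(1500L, 1700L),
                     position = c(1445L, 1542L), strand = c("-", "+"),
                     stringsAsFactors = FALSE)

out_file <- file.path(tempdir(), "matches.txt")  # hypothetical output path

# Space-separated, unquoted, without row names or a header line,
# which mirrors the layout of the input files
write.table(result, file = out_file, quote = FALSE, sep = " ",
            row.names = FALSE, col.names = FALSE)

written <- readLines(out_file)
print(written)
```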

R is a very powerful language, but it is rather difficult for me to learn
because it is so different from other languages I've used.

Regards,
AJ
