two difficult loop

4 messages · Bert Gunter, Jim Lemon, greg holly

#
Dear all;



I have two data sets, map and ref. A small part of each data set
is given below. The map data set has more than 27 million rows and ref has
about 560 rows. I need to run two different tasks. My R code for these
tasks is given below, but it does not work properly.

I sincerely appreciate your help.


Regards,

Greg



Task 1)

For example, the first and second columns of row 1 in ref are 29220 and
63933. I need R code that first looks at the first row in ref
(29220 and 63933), then sums the column map$rate and
gives the number of rows that are > 0.85, and then does the same for the
second, third, ... rows in ref. At the end I would like the table given below
(the results I need). Please note that all the values specified in the ref
data file exist in the map$reg column.



Task 2)

Again as an example, the first and second columns of row 1 in ref are 29220
and 63933. I need R code that gives the minimum map$p for the 29220-63933
interval in the map file, and then does the same for the second, third, ...
rows in ref.




#my attempt for the first question

temp<-map[order(map$reg, map$p),]

count<-1

temp<-unique(temp$reg

for(i in 1:length(ref) {

  for(j in 1:length(ref)

  {

temp1<-if (temp[pos[i]==ref[ref$reg1,] & (temp[pos[j]==ref[ref$reg2,]
& temp[cumsum(temp$rate)
count=count+1

    }

}

#my attempt for the second question



temp<-map[order(map$reg, map$p),]

count<-1

temp<-unique(temp$reg

for(i in 1:length(ref) {

  for(j in 1:length(ref)

  {

temp2<-if (temp[pos[i]==ref[ref$reg1,] & (temp[pos[j]==ref[ref$reg2,])

output<-temp2[temp2$p==min(temp2$p),]

    }

}



Data sets


  Data = map

  reg    p      rate
  10276  0.700  3.867e-18
  71608  0.830  4.542e-16
  29220  0.430  1.948e-15
  99542  0.220  1.084e-15
  26441  0.880  9.675e-14
  95082  0.090  7.349e-13
  36169  0.480  9.715e-13
  55572  0.500  9.071e-12
  65255  0.300  1.688e-11
  51960  0.970  1.163e-10
  55652  0.388  3.750e-10
  63933  0.250  9.128e-10
  35170  0.720  7.355e-09
  06491  0.370  1.634e-08
  85508  0.470  1.057e-07
  86666  0.580  7.862e-07
  04758  0.810  9.501e-07
  06169  0.440  1.104e-06
  63933  0.750  2.624e-06
  41838  0.960  8.119e-06


  Data = ref

  reg1   reg2
  29220  63933
  26441  41838
  06169  10276
  74806  92643
  73732  82451
  86042  93502
  85508  95082



  The results I need

  reg1   reg2   n
  29220  63933   12
  26441  41838   78
  06169  10276  125
  74806  92643   11
  73732  82451   47
  86042  93502   98
  85508  95082  219
#
Greg:

I was not able to understand your task 1. Perhaps others can.

My understanding of your task 2 is that for each row of ref, you wish
to find all rows of map such that the reg values in those rows fall
between the reg1 and reg2 values in ref (inclusive; change <= to < if
you don't want the endpoints), and then you want the minimum of the map$p
values of all those rows. If that is correct, I believe this will do
it (but caution: untested, as you failed to provide data in a
convenient form, e.g. using dput()):

task2 <- with(map,vapply(seq_len(nrow(ref)),function(i)
min(p[ref[i,1]<=reg & reg <= ref[i,2] ]),0))


If my understanding is incorrect, please ignore both the above and the
following:


The "solution" I have given above seems inefficient, so others may be
able to significantly improve it if you find that it takes too long.
OTOH, my understanding of your specification is that you need to
search for all rows in map data frame that meet the criterion for each
row of ref, and without further information, I don't know how to do
this without just repeating the search 560 times.
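
As a hedged illustration of the vapply() approach above, here is a small self-contained run on the 20 sample rows of map and the first three rows of ref from the original post. The data frames are retyped by hand, and reg is assumed to be numeric (so leading zeros such as 06169 are dropped); neither assumption is confirmed for the full data.

```r
# Sample rows retyped from the original post; reg is ASSUMED to be numeric.
map <- data.frame(
  reg = c(10276, 71608, 29220, 99542, 26441, 95082, 36169, 55572, 65255, 51960,
          55652, 63933, 35170,  6491, 85508, 86666,  4758,  6169, 63933, 41838),
  p   = c(0.700, 0.830, 0.430, 0.220, 0.880, 0.090, 0.480, 0.500, 0.300, 0.970,
          0.388, 0.250, 0.720, 0.370, 0.470, 0.580, 0.810, 0.440, 0.750, 0.960)
)
ref <- data.frame(reg1 = c(29220, 26441,  6169),
                  reg2 = c(63933, 41838, 10276))

# For each row of ref, take the smallest p among the map rows whose reg
# lies in [reg1, reg2], endpoints included.
task2 <- with(map, vapply(seq_len(nrow(ref)), function(i) {
  min(p[ref$reg1[i] <= reg & reg <= ref$reg2[i]])
}, numeric(1)))

result <- cbind(ref, min_p = task2)
print(result)   # min_p is 0.25, 0.43, 0.37 for the three intervals
```

Note that an interval matching no rows of map makes min() return Inf with a warning, so such rows of ref would need separate handling.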


Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sun, Jun 12, 2016 at 1:14 PM, greg holly <mak.hholly at gmail.com> wrote:
#
Hi Greg,
You've got a problem that you don't seem to have identified. Your
"reg" field in the "map" data frame can define at most 100000 unique
values. This means that each value will be repeated about 270 times.
Unless there are constraints you haven't mentioned, we would expect
that in 135 cases for each value, the values in each "ref" row will be
in the reverse order and the spans may overlap. I notice that you may
have tried to get around this by sorting the "map" data frame, but
then the order of the rows is different, and the number of rows
"between" any two values changes. Apart from this, it is almost
certain that the number of values of "p > 0.85" in the multiple runs
between each set of "ref" values will be different. It is possible to
perform both tasks that you mention, but only the second will yield a
unique or tied value for all of the cases. So your result data frame
will have an unspecified number of values for each row in "ref" for
the first task.
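
Both points can be checked with a few lines. This is only a minimal sketch: it takes 27 million as the row count, assumes reg is a five-digit numeric code, and uses the reg column retyped from the sample in the original post.

```r
# Rough repetition count: rows in map divided by the number of possible codes.
n_rows   <- 27e6   # "more than 27 million" rows, per the original post
n_values <- 1e5    # a five-digit code allows at most 100000 distinct values
reps <- n_rows / n_values   # 270

# reg column of the 20 sample rows (ASSUMED numeric, leading zeros dropped).
reg <- c(10276, 71608, 29220, 99542, 26441, 95082, 36169, 55572, 65255, 51960,
         55652, 63933, 35170,  6491, 85508, 86666,  4758,  6169, 63933, 41838)

# Even in the sample, 63933 occurs twice, so the run of rows "between"
# 29220 and 63933 is not uniquely defined by position.
start <- which(reg == 29220)   # row 3
ends  <- which(reg == 63933)   # rows 12 and 19
spans <- ends - start + 1      # 10 rows or 17 rows, depending on the match
print(spans)
```

With about 270 copies of each value in the full data, the number of candidate runs per row of ref grows accordingly.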

Jim
On Mon, Jun 13, 2016 at 6:14 AM, greg holly <mak.hholly at gmail.com> wrote:
#
Hi Jim;

Thanks so much for this info. I did not know this, as I am very new to
R. So do you think that it would be better to use !duplicated rather
than unique?

Thanks in advance,

Greg
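
For reference on the unique versus !duplicated question, here is a small sketch on four rows taken from the sample data (retyped by hand, reg assumed numeric). Both return the same distinct values; the practical difference is that unique() on a column returns a bare vector, while indexing the data frame with !duplicated() keeps the other columns for the first occurrence of each value. Whether either helps with the first task depends on constraints not stated in the thread.

```r
# Four sample rows; 63933 appears twice with different p values.
map <- data.frame(reg = c(29220, 63933, 35170, 63933),
                  p   = c(0.43, 0.25, 0.72, 0.75))

# unique() on the column returns only the distinct reg values (a vector).
regs_only <- unique(map$reg)

# !duplicated() gives a logical index, so whole rows can be kept;
# only the FIRST row for each reg value survives.
first_rows <- map[!duplicated(map$reg), ]

print(regs_only)
print(first_rows)
```

In the original attempt, temp <- unique(temp$reg) therefore replaces the sorted data frame with a vector of reg values, and the p and rate columns are no longer available afterwards.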
On Sun, Jun 12, 2016 at 7:06 PM, Jim Lemon <drjimlemon at gmail.com> wrote: