arithmetic problem

3 messages · Iain Gallagher, Gabor Grothendieck, William Dunlap

#
Hello list

I have a problem with a dataset (see toy example below) where I am trying to find the differences between two (or more) numbers and discard those observations which fall outside a set interval.

An example and further explanation:

   values      ind
1    2655      7A5
2    3028      7A5
3     689   ABBA-1
4    1336   ABBA-1
5    1560   ABBA-1
6    2820   ABLIM1
7    3339   ABLIM1
8     171    ACSM5
9     195    ACSM5
10     43 ADAMDEC1
11    129 ADAMDEC1
12   1105     AFF1
13   3202     AFF1
14    852     AFF3
15   2461     AFF3
16     45     AKT1
17    397     AKT1
18   1430     AQP2
19   2402     AQP2
20   2551 ARHGAP19

Each number in the values column above is associated with a label (in the ind column). For some inds there will be only 2 values but as can be seen from the data other inds have many values.

Here's what I want to do using the ABBA-1 data from above as an example:

calculate the differences between each value:

1560-1336 = 224
1336-689 = 647

then use these values to create an index that will allow me to pull out values between set limits. If I set the limits to between 200 and 300 then the index will reference rows 4 & 5 in the above data set.
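
The arithmetic for the ABBA-1 group can be sketched in a few lines of R. This is only an illustration of the calculation described above, not a full solution; the variable names (abba, d, in_range, pair) are made up for the example.

```r
# the three ABBA-1 values from the toy data, already sorted
abba <- c(689, 1336, 1560)

# differences between consecutive values
d <- diff(abba)                 # 647 224

# which differences fall strictly between the limits 200 and 300
in_range <- d > 200 & d < 300   # FALSE TRUE

# difference k sits between values k and k + 1 of the group
pair <- which(in_range)         # 2
abba[c(pair, pair + 1)]         # 1336 1560, i.e. rows 4 & 5 of the full table
```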

I hope this is reasonably clear and I appreciate any suggestions.

Thanks

Iain
#
Here we are assuming that

1. a row should be extracted if its value differs by between 200 and 300
from the prior or next value with the same ind.
2. the input is sorted by values within ind.
 If that's not the intention then modify the code accordingly.

First we read in the data into data frame DF.

Then we define between(x, min, max), a function that returns a logical
vector whose ith component is TRUE if x[i] is strictly between min and max.

Then we use ave() to get a selection vector.  In this case ave() returns a
vector of zeros and ones, and we convert that to the logical vector sel,
which defines the selection.

# read the data
Lines <- "values      ind
1    2655      7A5
2    3028      7A5
3     689   ABBA-1
4    1336   ABBA-1
5    1560   ABBA-1
6    2820   ABLIM1
7    3339   ABLIM1
8     171    ACSM5
9     195    ACSM5
10     43 ADAMDEC1
11    129 ADAMDEC1
12   1105     AFF1
13   3202     AFF1
14    852     AFF3
15   2461     AFF3
16     45     AKT1
17    397     AKT1
18   1430     AQP2
19   2402     AQP2
20   2551 ARHGAP19"
DF <- read.table(textConnection(Lines), header = TRUE)

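# TRUE where x lies strictly between min and max
between <- function(x, min, max) x > min & max > x

# within each ind, flag rows whose gap to the prior or next value is in range;
# the FALSE padding is coerced to 0, which is never in range
sel <- ave(DF$values, DF$ind, FUN = function(v)
	between(c(FALSE, diff(v)), 200, 300) | between(c(diff(v), FALSE), 200, 300)
) > 0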

DF[sel, ]
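
As a quick check of the 0/1 behaviour mentioned above, the sketch below rebuilds the toy data with data.frame() (an assumption made only so the snippet runs on its own) and keeps the raw ave() result before the "> 0" comparison. The name raw is made up for the illustration.

```r
# the toy data from the post, rebuilt without read.table()
DF <- data.frame(
  values = c(2655, 3028, 689, 1336, 1560, 2820, 3339, 171, 195, 43,
             129, 1105, 3202, 852, 2461, 45, 397, 1430, 2402, 2551),
  ind = rep(c("7A5", "ABBA-1", "ABLIM1", "ACSM5", "ADAMDEC1",
              "AFF1", "AFF3", "AKT1", "AQP2", "ARHGAP19"),
            times = c(2, 3, 2, 2, 2, 2, 2, 2, 2, 1))
)

between <- function(x, min, max) x > min & max > x

# same call as above, but keeping the raw ave() result before the "> 0" step
raw <- ave(DF$values, DF$ind, FUN = function(v)
  between(c(FALSE, diff(v)), 200, 300) | between(c(diff(v), FALSE), 200, 300))

is.numeric(raw)   # TRUE: the logical results were coerced to 0/1
which(raw > 0)    # 4 5: the two ABBA-1 rows that are 224 apart
```

The single-row group (ARHGAP19) is handled without special casing: diff() of one value is empty, so only the 0 padding is tested and the row is not selected.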



On Sat, May 30, 2009 at 10:13 AM, Iain Gallagher
<iaingallagher at btopenworld.com> wrote:
#
Since DF is sorted appropriately, we could speed that up by avoiding
the repeated function calls made by ave(), by and-ing into your
between() clauses the clause
    ind[-1]==ind[-length(ind)]
as in
    sel1 <- with(DF, {
        dv <- values[-1] - values[-length(values)]  # gaps between consecutive rows
        c(dv > 200 & dv < 300 & ind[-1] == ind[-length(ind)], FALSE)
    })
(This one just flags the lower of each pair.)
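
If both members of each pair are wanted, one possible way is to or the flag vector with a copy of itself shifted down by one row. The sketch below uses only the first seven rows of the toy data so that it runs on its own; the name both is made up for the illustration.

```r
# first seven rows of the toy data (7A5, ABBA-1, ABLIM1)
DF <- data.frame(
  values = c(2655, 3028, 689, 1336, 1560, 2820, 3339),
  ind = c("7A5", "7A5", "ABBA-1", "ABBA-1", "ABBA-1", "ABLIM1", "ABLIM1")
)

# flag the lower row of each in-range pair within the same ind
sel1 <- with(DF, {
  dv <- values[-1] - values[-length(values)]
  c(dv > 200 & dv < 300 & ind[-1] == ind[-length(ind)], FALSE)
})

# shift the flags down one row to pick up the upper row of each pair as well
both <- sel1 | c(FALSE, sel1[-length(sel1)])

which(sel1)   # 4
which(both)   # 4 5
```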

Someone recently proposed making a function like diff in which you
could insert the operator of your choice, like "==" here, instead of
the usual "-".  That might make code like this easier to understand.
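
A minimal sketch of what such a function might look like follows. The name diff_op and its arguments are hypothetical; this is not the proposal referred to above and not an existing base R function.

```r
# hypothetical helper: like diff(), but with a pluggable binary operator
diff_op <- function(x, op = "-") {
  f <- match.fun(op)
  # apply the operator to each element and its predecessor
  f(x[-1], x[-length(x)])
}

diff_op(c(689, 1336, 1560))                      # 647 224, same as diff()
diff_op(c("7A5", "7A5", "ABBA-1"), "==")         # TRUE FALSE
```

With it, the clause ind[-1]==ind[-length(ind)] could be written as diff_op(ind, "==").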