need help - R-help | R Mailing Lists

Fri, Aug 12, 2005 2:04 PM

Hi, there:
I think I need to rephrase my question, since I did not get any reply
last time. I don't think the question is that hard; I probably did
not make it clear:

I want to find cases like
35, 90, 330, 330, 335

from the rest which look like
3, 3, 3, 3.2, 3.3
4, 4.4, 4.5, 4.6, 4.7
....

Basically, there is one (or more) big 'gap' in the cases I seek.

Thanks,

weiwei

Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

Daniel Nordlund

Fri, Aug 12, 2005 2:21 PM

Weiwei,

You will have to specify what you mean by a big gap before anyone can help.  And I still don't understand what your data look like.  Is

35, 90, 330, 330, 335

supposed to represent a sequence or a row of a matrix (or data frame)?

Dan Nordlund
Bothell, WA

Weiwei Shi

Fri, Aug 12, 2005 2:33 PM

Hi, there:
Here is part of my previous email:

          [,1]      [,2]       [,3]       [,4]       [,5]
[1,] 34.216166 96.928587 330.125990 330.183222 330.201215
[2,]  2.819183  8.134491   8.275841   8.525256   8.828448
[3,]  2.819183  7.541680   7.550333   8.374636   8.690998
[4,]  4.672551  5.036353   5.072710   5.152218   5.223204
[5,]  5.470131  5.500513   5.674139   5.689151   5.770423
[6,]  4.480287  4.628300   4.797686   4.814106   4.823345

I want to separate the first 3 cases from the rest, and the criterion
is that I am looking for a "gap".

My approach is to compute std(eachrow)/median(eachrow) and set a threshold,
which is very naive, but fast and good enough. But I want something better
and more "academic". Please advise. I think clustering might help,
but it needs to be quick since t2 has 30000 rows.
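
For readers following along, the ratio-and-threshold approach described above might look roughly like this in R. This is only a sketch: the matrix is the six rows shown earlier, and the cutoff of 0.2 is an assumption picked for illustration, not a value given in the thread.

```r
# The six rows quoted above, rebuilt as a matrix (one case per row).
t2 <- matrix(c(
  34.216166, 96.928587, 330.125990, 330.183222, 330.201215,
   2.819183,  8.134491,   8.275841,   8.525256,   8.828448,
   2.819183,  7.541680,   7.550333,   8.374636,   8.690998,
   4.672551,  5.036353,   5.072710,   5.152218,   5.223204,
   5.470131,  5.500513,   5.674139,   5.689151,   5.770423,
   4.480287,  4.628300,   4.797686,   4.814106,   4.823345
), ncol = 5, byrow = TRUE)

# Per-row standard deviation divided by per-row median.
ratio <- apply(t2, 1, sd) / apply(t2, 1, median)

# Hypothetical cutoff, chosen by eye for these six rows only.
threshold <- 0.2

# Indices of rows whose spread is large relative to their median.
flagged <- which(ratio > threshold)
print(flagged)
```

On a 30000-row matrix the two apply() calls are a single pass each, so speed should not be a concern, but the threshold would need to be tuned on real data.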

Thanks,

Jim Lemon

Sat, Aug 13, 2005 2:28 PM

Hi Weiwei,

I think your method of defining a central value for the large proportion 
of values and then setting a criterion for outliers is valid (or at 
least as valid as many other ways of defining outliers). However, here
is a different method: sort the vector of values and then look for
a "gap" larger than a specified multiple (gap.prop) of the mean difference
between the smaller values. It returns the first value after the "gap"
(easily changed to all the values after). To account for vectors that
have negative values the minimum value is subtracted when calculating 
"newx" and then added to the result. For your data, a gap.prop of 20 
works, but the default value of 10 doesn't. It also won't work where 
large values are typical and small ones are the outliers (well, it will 
indicate where the "gap" is).
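
Since the attachment was scrubbed from the archive, the following is only a rough reconstruction of the idea described above, not the original find.first.gap code. The function name find_first_gap, the guard against a zero baseline, and the test vectors are assumptions made for illustration. Because it works on differences of the sorted values, this sketch needs no minimum-subtraction step to cope with negative values, and it will not necessarily reproduce the gap.prop behaviour reported above.

```r
# Sketch: return the first value after a "gap" in a numeric vector, where a
# gap is a difference between neighbouring sorted values that exceeds
# gap.prop times the mean of the differences below it. Returns NA if none.
find_first_gap <- function(x, gap.prop = 10) {
  sx <- sort(x)
  d <- diff(sx)                      # differences between sorted neighbours
  if (length(d) < 2) return(NA_real_)
  for (i in 2:length(d)) {
    base <- mean(d[seq_len(i - 1)])  # mean difference among smaller values
    # Skip a zero baseline (ties among the small values); otherwise any
    # positive difference would count as a gap.
    if (base > 0 && d[i] > gap.prop * base) {
      return(sx[i + 1])
    }
  }
  NA_real_
}

# Made-up vectors for illustration.
with_gap    <- find_first_gap(c(3, 3.1, 3.2, 3.3, 30))
without_gap <- find_first_gap(c(4, 4.4, 4.5, 4.6, 4.7))
print(with_gap)
print(without_gap)
```

As with the method described above, a gap that sits below the typical values (a single small outlier) is not detected by this sketch, because there are no smaller differences to compare it against.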

Jim
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: find.first.gap.R
Url: https://stat.ethz.ch/pipermail/r-help/attachments/20050813/99cdabfe/find.first.gap.pl