scatterplot of 100000 points and pdf file format

From: Ted.Harding at nessie.mcc.ac.uk

On 25-Nov-04 Ted Harding wrote:
'unique' will eat x for breakfast, indeed, but will have some
trouble chewing (x,y).

I still can't think of a neat way of doing that.

Best wishes,
Ted.
Sorry, I don't want to be misunderstood.
I didn't mean that 'unique' won't work for arrays.
What I meant was:

X<-round(rnorm(1e6),3);Y<-round(rnorm(1e6),3)
system.time(unique(X))
[1] 0.74 0.07 0.81 0.00 0.00
system.time(unique(cbind(X,Y)))
[1] 350.81   4.56 356.54   0.00   0.00
Do you know if the majority of that time is spent in unique() itself?  If so,
which method?  What I see is:
X<-round(rnorm(1e6),3);Y<-round(rnorm(1e6),3)
system.time(unique(X), gcFirst=TRUE)
[1] 0.25 0.01 0.26   NA   NA
system.time(unique(cbind(X,Y)), gcFirst=TRUE)
[1] 101.80   0.34 104.61     NA     NA
system.time(dat <- data.frame(x=X, y=Y), gcFirst=TRUE)
[1] 10.17  0.00 10.24    NA    NA
system.time(unique(dat), gcFirst=TRUE)
[1] 23.94  0.11 24.15    NA    NA

Andy
However, still rounding to 3 d.p. we can try packing:

Z<-100000000*X + 1000*Y
system.time(W<-unique(Z))
[1] 0.83 0.05 0.88 0.00 0.00
length(W)
[1] 961523

Though the runtime is small we don't get much reduction
and still W has to be unpacked.

With rounding to 2 d.p.

X<-round(rnorm(1e6),2);Y<-round(rnorm(1e6),2)
Z<-100000000*X + 1000*Y
system.time(W<-unique(Z))
[1] 1.31 0.01 1.32 0.00 0.00
length(W)
[1] 209882

so now it's about 1/5, but visible discretisation must be
getting close.

With 1 d.p.

X<-round(rnorm(1e6),1);Y<-round(rnorm(1e6),1)
Z<-100000000*X + 1000*Y
system.time(W<-unique(Z))
[1] 0.92 0.01 0.93 0.00 0.00
length(W)
[1] 4953

there's a good reduction (about 1/200) but the discretisation
would definitely now be visible. However, as I suggested before,
there's an issue of choice of constant (i.e. of the resolution
of the discretisation so that there's a useful reduction and
also the plot is acceptable).

I'd still like to learn of a method which avoids the
above method of packing, which strikes me as clumsy
(but maybe it's the best way after all).

Ted.
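
One way to sidestep the unpacking step, offered only as a sketch: duplicated() on the packed key gives a logical index into the original vectors, so the retained points can be read back directly instead of being decoded from W. The helper name thin_points and its digits argument are hypothetical, and the packing below assumes the rounded coordinates are small enough that the key stays an exact whole number in double precision.

```r
# Sketch: drop points that coincide after rounding, without unpacking a key.
# thin_points() and its arguments are illustrative names, not from the thread.
thin_points <- function(x, y, digits = 2) {
  # Scale the rounded coordinates to whole numbers so the key is exact.
  ix <- round(x * 10^digits)
  iy <- round(y * 10^digits)
  # Shift iy to be non-negative and pack both into a single numeric key.
  span <- diff(range(iy)) + 1
  key <- ix * span + (iy - min(iy))
  # duplicated() marks repeats; keep the first occurrence of each key.
  keep <- !duplicated(key)
  data.frame(x = ix[keep] / 10^digits, y = iy[keep] / 10^digits)
}

set.seed(1)
X <- rnorm(1e5)
Y <- rnorm(1e5)
pts <- thin_points(X, Y, digits = 1)
nrow(pts)                      # number of distinct rounded positions
plot(pts$x, pts$y, pch = ".")
```

The digits argument plays the role of the "choice of constant" discussed above: smaller values give more reduction and a coarser plot.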

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 25-Nov-04                                       Time: 01:45:48
------------------------------ XFMail ------------------------------

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! 
http://www.R-project.org/posting-guide.html

Another possibility might be to use a 2d kernel density estimate (e.g.
kde2d from library(MASS)).  Then for the high-density areas plot the
density contours, and for the low-density areas plot the individual
points.

Hadley
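
A minimal sketch of that idea, under stated assumptions: the simulated data, the 100 x 100 grid and the 1% cutoff for "low density" are arbitrary choices for illustration, and each point's density is read from the nearest lower grid node rather than interpolated.

```r
library(MASS)  # provides kde2d()

set.seed(1)
x <- rnorm(1e5)
y <- rnorm(1e5)

# Estimate the density on a regular grid spanning the data range.
dens <- kde2d(x, y, n = 100)

# Find the grid cell each point falls in and read off the density there
# (a coarse lookup, not an interpolation).
ix <- findInterval(x, dens$x, all.inside = TRUE)
iy <- findInterval(y, dens$y, all.inside = TRUE)
d_at_point <- dens$z[cbind(ix, iy)]

# Treat the lowest-density 1% of points as "sparse"; the cutoff is arbitrary.
cutoff <- quantile(d_at_point, 0.01)
sparse <- d_at_point <= cutoff

# Individual points in the sparse region, contours for the dense region.
plot(x[sparse], y[sparse], pch = ".", xlim = range(x), ylim = range(y),
     xlab = "x", ylab = "y")
contour(dens, levels = pretty(c(cutoff, max(dens$z)), 8), add = TRUE)
```

Only about 1% of the points are drawn individually, so the resulting file should be far smaller than a full scatterplot.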
Hi Andy,
I want to look into this a bit more systematically (I have
an idea why 'unique' may be taking longer on the array from
'cbind' than on the dataframe), but I will be doing this on
a much faster machine than I immediately have to hand, so
will report results (if interesting) later.

Meanwhile, I'm not sure what you mean by "which method?",
and I'm also wondering what "gcFirst" is about.

Thanks,
Ted.
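
For readers with the same two questions, a short sketch: unique() is an S3 generic, so "which method" refers to the class-specific function that gets dispatched, and gcFirst is an argument of system.time() that requests a garbage collection before timing starts. The small matrix and data frame below are made up for illustration.

```r
# unique() is an S3 generic: the class of its argument picks the method.
m  <- cbind(X = c(1, 1, 2), Y = c(3, 3, 4))
df <- data.frame(x = c(1, 1, 2), y = c(3, 3, 4))
class(m)         # a matrix, so unique.matrix() is dispatched
class(df)        # a data frame, so unique.data.frame() is dispatched
methods(unique)  # lists the available methods

# gcFirst = TRUE runs a garbage collection before timing starts, so
# garbage left over from earlier work does not distort the measurement.
timing <- system.time(u <- unique(df), gcFirst = TRUE)
timing
```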

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 25-Nov-04                                       Time: 14:30:39
------------------------------ XFMail ------------------------------
(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:
I want to look into this a bit more systematically (I have
an idea why 'unique' may be taking longer on the array from
'cbind' than on the dataframe),
Just look inside the functions. One is pasting columns together, the
other is using a paste() construct inside an apply() function. So with
two columns by 1e6 rows, one is doing one large paste and the other a
million small ones.
O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
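
The difference described above can be illustrated with a rough sketch. It mimics the two pasting strategies directly rather than calling the internals of unique(), which may differ between R versions; the "\r" separator and the reduced size n are assumptions made so the example runs quickly.

```r
set.seed(1)
n <- 1e4  # small enough to run quickly; the thread used 1e6
X <- round(rnorm(n), 3)
Y <- round(rnorm(n), 3)

# One vectorised paste over whole columns (the data-frame style).
t_cols <- system.time(
  key_cols <- paste(X, Y, sep = "\r")
)

# One paste call per row via apply() (the matrix style).
m <- cbind(X, Y)
t_rows <- system.time(
  key_rows <- apply(m, 1, function(r) paste(r, collapse = "\r"))
)

# Both keys identify the same number of distinct rows.
sum(!duplicated(key_cols)) == sum(!duplicated(key_rows))
rbind(columns = t_cols, rows = t_rows)
```

The per-row version pays the function-call overhead n times, which is where the gap between the matrix and data-frame timings would come from under this reading.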