
scatterplot of 100000 points and pdf file format

13 messages · Wolski, (Ted Harding), Marc Schwartz +5 more

#
Hi,

I want to draw a scatter plot with 1M and more points and save it as a pdf.
This makes the pdf file large.
So I tried to save the file first as a png and then convert it to pdf.
This looks OK when printed, but if viewed e.g. with Acrobat as a document
figure the quality is bad.

Does anyone know a way to reduce the size but keep the quality?


/E
#
On 24-Nov-04 Witold Eryk Wolski wrote:
If you want the PDF file to preserve the info about all the
1M points then the problem has no solution. The png file
will already have suppressed most of this (which is one
reason for poor quality).

I think you should give thought to reducing what you need
to plot.

Think about it: suppose you plot at a resolution of
1/200 inch per point, i.e. 200 points per inch (about the
limit at which the eye begins to see rough edges). Then you
have 40,000 points per square inch. If your 1M points are
separate but as closely packed as possible, this requires
25 square inches, or a 5x5 inch (= 12.7x12.7 cm) square.
And this would be solid black!

Presumably in your plot there is a very large number of
points which are effectively indistinguishable from other
points, so these could be eliminated without spoiling
the plot.

I don't have an obviously best strategy for reducing what
you actually plot, but perhaps one line to think along
might be the following:

1. Multiply the data by some factor and then round the
   results to an integer (to avoid problems in step 2).
   Factor chosen so that the result of (4) below is
   satisfactory.

2. Eliminate duplicates in the result of (1).

3. Divide by the factor you used in (1).

4. Plot the result; save plot to PDF.

As to how to do it in R: the critical step is (2),
which with so many points could be very heavy unless
done by a well-chosen procedure. I'm not expert enough
to advise about that, but no doubt others are.
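The four steps above can be sketched in R as follows. This is only an illustration: the 1e5 points stand in for the real data, and the factor of 100 (i.e. rounding to 2 d.p.) is an arbitrary choice; unique() on a data frame compares whole rows, which handles the (x, y)-pair case.

```r
# Sketch of the round / deduplicate / rescale / plot recipe above.
# x and y stand in for the real data; 1e5 points and the factor 100
# (i.e. 2 d.p. resolution) are arbitrary choices.
set.seed(1)
x <- rnorm(1e5); y <- rnorm(1e5)

f <- 100                                    # step 1: scale factor
d <- unique(data.frame(xi = round(x * f),   # step 2: drop duplicate
                       yi = round(y * f)))  #         (x, y) pairs
pdf("thinned.pdf")
plot(d$xi / f, d$yi / f, pch = ".")         # steps 3 and 4
dev.off()
```

The pdf then records only one point per occupied grid cell instead of one per observation.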

Good luck!
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 24-Nov-04                                       Time: 16:16:28
------------------------------ XFMail ------------------------------
#
On Wed, 2004-11-24 at 16:34 +0100, Witold Eryk Wolski wrote:
Hi Eryk!

Part of the problem is that in a pdf file, the vector based instructions
will need to be defined for each of your 10 ^ 6 points in order to draw
them.

When trying to create a simple example:

pdf()
plot(rnorm(1000000), rnorm(1000000))
dev.off()

The pdf file is 55 Mb in size.

One immediate thought was to try a ps file; using the above plot, the
ps file was "only" 23 Mb in size, so note that ps can be more efficient.

Going to a bitmap might result in a much smaller file, but as you note,
the quality does degrade as compared to a vector based image.

I tried the above to a png, then converted to a pdf (using 'convert')
and as expected, the image both viewed and printed was "pixelated",
since the pdf instructions are presumably drawing pixels and not vector
based objects.

Depending upon what you plan to do with the image, you may have to
choose among several options, resulting in tradeoffs between image
quality and file size.

If you can create the bitmap file explicitly in the size that you
require for printing or incorporating in a document, that is one way to
go and will preserve, to an extent, the overall fixed size image
quality, while keeping file size small.

Another option to consider for the pdf approach, if it does not
compromise the integrity of your plot, is to remove any duplicate data
points, if any exist; then you will not need what are in effect
redundant instructions in the pdf file. This may not be possible,
depending upon the nature of your data (i.e. doubles), without
considering some tolerance level for "equivalence".

Perhaps others will have additional ideas.

HTH,

Marc Schwartz
#
Hi,

I tried the ps idea, but I am using pdflatex.
You get an even larger size reduction if you convert the ps into a pdf
using ps2pdf, but unfortunately there is a quality loss.

I have found an almost-working solution:
a) Save the scatterplot without axes and with par(mar=c(0,0,0,0)) as a png.
b) Convert it with any program to pnm.
c) Read the pnm file using the pixmap package.
d) Add axis labels and lines afterwards with par(new=TRUE).

This looks the way I want it to look, but unfortunately
acroread and gv on Windows crash when I try to print the file.

png(file="pepslop.png",width=500,height=500)
par(mar=c(0,0,0,0))
X2<-rnorm(100000)
Y2<-X2*10+rnorm(100000)
plot(X2,Y2,pch=".",xlab="",ylab="",main="",axes=F)
dev.off()

library(pixmap)  # needed for read.pnm()
pdf(file="pepslop.pdf",width=7,height=7)
par(mar=c(3.2,3.2,1,1))
x <- read.pnm("pepslop.pnm" )
plot(x)
par(new=TRUE)
par(mar=c(3.2,3.2,1,1))
plot(X2,Y2,pch=".",xlab="",ylab="",main="",type="n")
mtext(expression(m[nominal]),side=1,line=2)
mtext(expression(mod(m[monoisotopic],1)),side=2,line=2)
legend(1000,4,expression(paste(lambda[DB],"=",0.000495)),col=2,lty=1,lwd=1)
abline(test,col=2,lwd=2)
dev.off()
Marc Schwartz wrote:
[...]
#
On Wed, 24 Nov 2004, Witold Eryk Wolski wrote:

Try the "hexbin" Bioconductor package, which gives hexagonally-binned 
density scatterplots. Even for tens of thousands of points this is often 
much better than a scatterplot.
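A minimal hexbin sketch (assuming the package is installed; the sample data and xbins value are illustrative). The pdf then contains one hexagon per occupied bin rather than one mark per observation, so its size no longer grows with the number of points:

```r
library(hexbin)                  # assumed installed (install.packages("hexbin"))
set.seed(1)
x <- rnorm(1e5); y <- rnorm(1e5)

bin <- hexbin(x, y, xbins = 80)  # bin the points into an 80-column hexagon grid
pdf("hexbin.pdf")
plot(bin)                        # one shaded hexagon per occupied bin
dev.off()
```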

 	-thomas
#
On Wed, 24 Nov 2004 Ted.Harding at nessie.mcc.ac.uk wrote:

unique will eat that for breakfast -- system.time() output
and the resulting count of unique values:
[1] 0.55 0.09 0.64 0.00 0.00
[1] 10001
#
On Wednesday 24 November 2004 07:34, Witold Eryk Wolski wrote:
I would strongly suggest a different method to present the data, such as a
contour plot or 3D bar plot.  An XY plot with a million points is unlikely to
be readable unless it is produced as a large format print.  At 200 DPI
printed, 1,000,000 discrete points require a minimum of a 5 inch (12.7
cm) by 5 inch area.  Besides being visually overwhelming, what
information would such a plot offer a viewer?

John
#
On Wed, 2004-11-24 at 17:43 +0100, Witold Eryk Wolski wrote:
Eryk,

I tried this approach and was able to print without problem here under
FC3, using acroread.

Also, I think that you left out:

test <- lm(Y2 ~ X2)

in the code above; otherwise abline() will fail. Also, I changed the x,y
coordinates of the legend, since (1000, 4) is outside the plot range for
the points that I generated here.

Interesting approach, it reduced the pdf file size to about 7 Mb.

BTW, any chance that there is a huge black hole in the center of
that...  ;-)

Best,

Marc
#
I recall some of our extreme value statistics people printing things
like this: several million points on a plot, most of which were in a
big, thick block of toner, and then a few hundred at the extremes,
which was where they were interested in looking.

  Of course these things took an hour to print on a PostScript printer 
at the time. I think I suggested only plotting points for which X > 
someThreshold. Saved on toner and time. Got a bit tricky in the 
bivariate case though, where you really needed to plot points outside 
some ellipse that you knew would otherwise be a big black blob, and then 
you filled that in with a black ellipse.

  Contours or aggregation weren't any use, since they were interested in
the point patterns of the extreme value data.

Baz
#
On Wed, 24-Nov-2004 at 10:22AM -0600, Marc Schwartz wrote:

        
|> On Wed, 2004-11-24 at 16:34 +0100, Witold Eryk Wolski wrote:
|> [...]
|> 
|> I tried the above to a png, then converted to a pdf (using 'convert')
|> and as expected, the image both viewed and printed was "pixelated",
|> since the pdf instructions are presumably drawing pixels and not vector
|> based objects.

Using bitmap( ... , res = 300), I get a bitmap file of 56 Kb.

It's rather slow, most of the time being taken up by gs, which I
suspect is converting the vector image.  The time would be much shorter
if, say, a circle of diameter 4 were left unplotted in the middle;
others have mentioned other ways of reducing redundant points.

A pdf file slightly larger than the png file can be made directly from
OpenOffice with the png imported into it.  For a plot 160 mm square,
this pdf printed unpixelated.

Depending on what size (dimensions) you need to finish up with, you
might find you could get away with a lower resolution than 300 dpi,
but I usually find 200 too ragged.
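A sketch of the bitmap() route described above; the file name, dimensions, and sample data are assumptions, and bitmap() requires Ghostscript on the path (which is also what makes it slow):

```r
# bitmap() renders through Ghostscript (gs must be installed), producing
# a fixed-resolution raster file of modest size.
set.seed(1)
x <- rnorm(1e5); y <- rnorm(1e5)

bitmap("scatter.png", type = "png16m",
       width = 6.3, height = 6.3, res = 300)  # ~160 mm square at 300 dpi
plot(x, y, pch = ".")
dev.off()
```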

HTH
#
On 24-Nov-04 Prof Brian Ripley wrote:
'unique' will eat x for breakfast, indeed, but will have some
trouble chewing (x,y).

I still can't think of a neat way of doing that.

Best wishes,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 25-Nov-04                                       Time: 00:37:15
------------------------------ XFMail ------------------------------
#
On 25-Nov-04 Ted Harding wrote:
Sorry, I don't want to be misunderstood.
I didn't mean that 'unique' won't work for arrays.
What I meant was (system.time() output for unique() on the
vector, then on the two-column array):
[1] 0.74 0.07 0.81 0.00 0.00
[1] 350.81   4.56 356.54   0.00   0.00

However, still rounding to 3 d.p. we can try packing:
[1] 0.83 0.05 0.88 0.00 0.00
[1] 961523

Though the runtime is small we don't get much reduction
and still W has to be unpacked.

With rounding to 2 d.p.
[1] 1.31 0.01 1.32 0.00 0.00
[1] 209882

so now it's about 1/5, but visible discretisation must be
getting close.

With 1 d.p.
[1] 0.92 0.01 0.93 0.00 0.00
[1] 4953

there's a good reduction (about 1/200) but the discretisation
would definitely now be visible. However, as I suggested before,
there's an issue of choice of constant (i.e. of the resolution
of the discretisation so that there's a useful reduction and
also the plot is acceptable).

I'd still like to learn of a method which avoids the
above method of packing, which strikes me as clumsy
(but maybe it's the best way after all).
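One alternative to explicit packing: duplicated() has a matrix method that compares rows, so (x, y) pairs can be deduplicated directly. A sketch on illustrative data:

```r
# Row-wise deduplication of (x, y) pairs without packing them into a
# single number first; the data here are illustrative.
set.seed(1)
x <- round(rnorm(1e5), 3)
y <- round(rnorm(1e5), 3)

keep <- !duplicated(cbind(x, y))  # matrix method: compares whole rows
xu <- x[keep]; yu <- y[keep]
```

Internally this may still paste the rows together, so it is a convenience rather than a guaranteed speed-up over packing.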

Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 25-Nov-04                                       Time: 01:45:48
------------------------------ XFMail ------------------------------
#
Prof Brian Ripley wrote:

?table -> reduces the data
and
?image -> shows it.

And this does exactly what I need (not my idea, but Thomas
Untern??her's).  Thanks Thomas.
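The table/image route can be sketched as follows (the sample data and the choice of 100 bins per axis are assumptions): cut() discretises each axis, table() counts the points per cell, and image() displays the grid, so the pdf size is fixed by the grid, not the number of points.

```r
# Bin the points on a grid and show the counts as an image.
set.seed(1)
x <- rnorm(1e5); y <- x * 10 + rnorm(1e5)

counts <- table(cut(x, breaks = 100),
                cut(y, breaks = 100))  # 100 x 100 grid of point counts

pdf("density.pdf")
image(log1p(counts))                   # log scale keeps sparse cells visible
dev.off()
```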


/E