Clean up a scatterplot with too much data

5 messages · DimmestLemming, Karl Ove Hufthammer, Dennis Murphy +1 more

#
I'm working with a lot of data right now, but I'm new to R, and not very good
with it, hence my request for help. What type of graph could I use to
straighten out things like...

http://r.789695.n4.nabble.com/file/n3711389/Untitled.png 

...this?

I want to see general frequencies. Should I use something like a 3D
histogram, or is there an easier way like, say, shading? I'm sure these are
both possible, but I don't know which is easiest or how to implement either
of them.

Thanks!

#
Hi,

One solution could be to subsample the data, or jitter the data (give it
some random noise). A more elegant solution, imho, is to use a 2d
histogram (a 3d histogram is not a good alternative; I think it is much
better to use color instead of a third dimension). I don't think this is
easy to make using the standard plot system in R, but ggplot2 handles it
nicely. This would mean learning ggplot2, but I would
highly recommend that anyway :). An example of the plot I have in mind
can be seen at:

http://had.co.nz/ggplot2/stat_bin2d.html

Just scroll down a bit for some examples.
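
For reference, a minimal sketch of the 2d histogram idea with ggplot2. The data frame `players` and its columns `minutes` and `kills_per_min` are simulated stand-ins (assumptions, not the original poster's data), and the bin count of 60 is arbitrary.

```r
library(ggplot2)

# Simulated stand-in for the real data; the column names are hypothetical.
set.seed(1)
n <- 50000
players <- data.frame(
  minutes       = rexp(n, rate = 1 / 300),
  kills_per_min = rgamma(n, shape = 4, rate = 8)
)

# 2d histogram: the plane is cut into rectangles and each rectangle
# is filled according to the number of points that fall inside it.
p <- ggplot(players, aes(x = minutes, y = kills_per_min)) +
  stat_bin2d(bins = 60)
print(p)
```

Changing `bins` trades detail for smoothness; a handful of values is usually worth trying.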

cheers,
Paul
On 08/02/2011 05:26 AM, DimmestLemming wrote:
#
DimmestLemming wrote:

            
Three nice alternatives:

example(smoothScatter)
example(sunflowerplot)
library(hexbin)
example(hexbinplot)

(And do remove the outliers before plotting.)
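
Applied to your own data rather than the built-in examples, the three suggestions might look roughly like the sketch below. The vectors `minutes` and `kills_per_min` are simulated placeholders, and the 99.9% quantile cutoff for "outliers" is only an illustrative assumption, not a recommended rule.

```r
# Simulated placeholder data; replace with the real vectors.
set.seed(1)
n <- 50000
minutes       <- rexp(n, rate = 1 / 300)
kills_per_min <- rgamma(n, shape = 4, rate = 8)

# Illustrative outlier filter (assumed cutoff): drop the top 0.1% of y values.
keep <- kills_per_min <= quantile(kills_per_min, 0.999)
x <- minutes[keep]
y <- kills_per_min[keep]

# 1. Smoothed colour density of the scatterplot (graphics package).
smoothScatter(x, y, xlab = "Minutes played", ylab = "Kills per minute")

# 2. Sunflower plot: works on coinciding points, so round the values first
#    and use a subsample to keep the flowers readable.
idx <- sample(length(x), 2000)
sunflowerplot(round(x[idx], -2), round(y[idx], 1),
              xlab = "Minutes played", ylab = "Kills per minute")

# 3. Hexagonal binning, if the hexbin package is installed.
if (requireNamespace("hexbin", quietly = TRUE)) {
  hb <- hexbin::hexbin(x, y, xbins = 50)
  hexbin::plot(hb, xlab = "Minutes played", ylab = "Kills per minute")
}
```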
#
In addition to the other responses (all of which I liked), a couple of
other alternatives to consider are 2D density plots (see ?kde2d in the
MASS package, for example) or geom_tile() in the ggplot2 package,
which you can think of as a 3D histogram projected to 2D with color
corresponding to (relative) frequency, as suggested by Paul Hiemstra.
geom_tile() is the rectangular-grid counterpart of a hexbin plot, but I
would start with the hexbin myself. I echo KOH's comment: make sure
you remove the outliers first, especially that one in the upper left
corner :)
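
A minimal sketch of the 2D density option using kde2d() from MASS, drawn with base graphics. The data are simulated placeholders and the 100 x 100 grid is an arbitrary choice.

```r
library(MASS)

# Simulated placeholder data; replace with the real vectors.
set.seed(1)
n <- 50000
minutes       <- rexp(n, rate = 1 / 300)
kills_per_min <- rgamma(n, shape = 4, rate = 8)

# Kernel density estimate evaluated on a 100 x 100 grid.
dens <- kde2d(minutes, kills_per_min, n = 100)

# Colour image of the estimated density with contour lines on top.
image(dens, xlab = "Minutes played", ylab = "Kills per minute")
contour(dens, add = TRUE)
```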

After looking at your plot, here's my question: why would you plot
kills/minute vs. minutes played? Doesn't the first variable render the
second one moot? Wouldn't kills vs. minutes played be a more relevant
(scatter)plot? If you have information on the skill level of the
players, you could incorporate that information into the plot as well.
There are several nice ways to go if this is the case.

If kills/minute is the more appropriate measure, a univariate density
plot would make sense, or a histogram.
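
As a rough sketch of the univariate option, a histogram with a kernel density curve overlaid. The vector `kills_per_min` is simulated here and stands in for the real variable.

```r
# Simulated placeholder data; replace with the real kills/minute values.
set.seed(1)
kills_per_min <- rgamma(50000, shape = 4, rate = 8)

# Histogram on the density scale so the curve and the bars are comparable.
hist(kills_per_min, breaks = 50, freq = FALSE,
     main = "", xlab = "Kills per minute")

# Kernel density estimate drawn over the histogram.
dens <- density(kills_per_min)
lines(dens, lwd = 2)
```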

HTH,
Dennis
On Mon, Aug 1, 2011 at 10:26 PM, DimmestLemming <NICOADAMS000 at gmail.com> wrote:
#
On 08/02/2011 01:07 PM, Dennis Murphy wrote:
When using geom_tile you need to bin the data yourself. I much prefer
using stat_bin2d, which does all the work for you.
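
To illustrate the difference, here is a sketch of what binning by hand for geom_tile() could look like. The data, the column names and the 40 x 40 grid are all assumptions made for the example; with stat_bin2d the whole binning step collapses into a single layer.

```r
library(ggplot2)

# Simulated placeholder data; the column names are hypothetical.
set.seed(1)
n <- 50000
players <- data.frame(
  minutes       = rexp(n, rate = 1 / 300),
  kills_per_min = rgamma(n, shape = 4, rate = 8)
)

# Manual binning: 40 equal-width intervals per axis.
x_breaks <- seq(min(players$minutes), max(players$minutes), length.out = 41)
y_breaks <- seq(min(players$kills_per_min), max(players$kills_per_min),
                length.out = 41)
xi <- cut(players$minutes, x_breaks, include.lowest = TRUE, labels = FALSE)
yi <- cut(players$kills_per_min, y_breaks, include.lowest = TRUE, labels = FALSE)

# Count the points in each occupied cell and attach the cell midpoints.
binned <- aggregate(count ~ xi + yi,
                    data = data.frame(xi = xi, yi = yi, count = 1),
                    FUN = sum)
binned$x <- (head(x_breaks, -1) + diff(x_breaks) / 2)[binned$xi]
binned$y <- (head(y_breaks, -1) + diff(y_breaks) / 2)[binned$yi]

# geom_tile() only draws what it is given: one tile per pre-computed cell.
p_tile <- ggplot(binned, aes(x = x, y = y, fill = count)) +
  geom_tile(width = diff(x_breaks)[1], height = diff(y_breaks)[1])
print(p_tile)

# The stat_bin2d equivalent does the counting internally.
p_bin <- ggplot(players, aes(x = minutes, y = kills_per_min)) +
  stat_bin2d(bins = 40)
print(p_bin)
```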

cheers,
Paul