ggplot2 / reshape / Question on manipulating data

4 messages · Pete Kazmier, Hadley Wickham

#
I'm an R newbie but recently discovered the ggplot2 and reshape
packages which seem incredibly useful and much easier to use for a
beginner.  Using the data from the IMDB, I'm trying to see how the
average movie rating varies by year.  Here is what my data looks like:
                             Title  Histogram VoteCount VoteMean Year
1                !Huff (2004) (TV) 0000000016       299      8.4 2004
8              'Allo 'Allo! (1982) 0000000125       829      8.6 1982
50              .hack//SIGN (2002) 0000001113       150      7.0 2002
56            1-800-Missing (2003) 0000000103       118      5.4 2003
66  Greatest Artists (2000) (mini) 00..000016       110      7.8 2000
77 00 Scariest Movie (2004) (mini) 00..000115       256      8.6 2004

The above data is not aggregated.  So after playing around with basic
R functionality, I stumbled across the 'aggregate' function and was
able to see the information in the manner I desired (average movie
rating by year).
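
The original command is not preserved in this archive. A minimal base R sketch of this kind of aggregation, assuming a data frame called `ratings` (a hypothetical name) that holds a few toy rows copied from the listing above, might look like this:

```r
# Toy data: a few rows copied from the listing above (not the full IMDB data).
ratings <- data.frame(
  Year     = c(2004, 1982, 2002, 2003, 2000, 2004),
  VoteMean = c(8.4, 8.6, 7.0, 5.4, 7.8, 8.6)
)

# Mean rating per year. When a bare vector is aggregated, the result
# column is named "x" by default.
byYear <- aggregate(ratings$VoteMean, list(Year = ratings$Year), mean)
print(byYear)
```

The result has one row per year, which can then be passed to 'plot'.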
Having just discovered ggplot2, I wanted to create the same graph but
augment it with a color attribute based on the total number of votes
in a year.  So first I tried to see if I could reproduce the above:
This did not work as expected because the x-axis contained labels for
each and every year making it impossible to read whereas the plot
created with basic R had nice x-axis labels.  How do I get 'qplot' to
treat the x-axis in a similar manner to 'plot'?

After playing around further, I was able to get 'qplot' to work in a
manner similar to 'plot' with regard to the x-axis labels by using
'melt' and 'cast'.  The 'qplot' now behaves correctly:
How do 'byYear' and 'byYear2' differ?  I am trying to use 'typeof' but
both seem to be lists.  However, they are clearly different in some
way because 'qplot' graphs them differently.

Finally, I'd like to use a color attribute to 'qplot' to augment each
point with a color based on the total number of votes for the year.
Using attributes with 'qplot' seems simple, but I'm having a hard time
grooming my data appropriately.  I believe this requires aggregation
by summing the VoteCount column.  Is there a way to cast the data
using different aggregation functions for various columns?  In my
case, I want the mean of the VoteMean column, and the sum of the
VoteCount column.  Then I want to produce a graph showing the average
movie rating per year but with each point colored to reflect the total
number of votes for that year.  Any pointers?

Thanks,
Pete
#
On 7/12/07, Pete Kazmier <pete-expires-20070910 at kazmier.com> wrote:
Have you tried using the movies dataset included in ggplot?  Or is
there some data that you want that is not in that dataset?

The problem is probably that Year is a factor - and factors are
labelled on every level (even if they overlap - which is a bug).
There's no terribly easy way to fix this, but the following will work:

qplot(as.numeric(as.character(Year)), x, data=byYear)
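
The conversion goes through character first because 'as.numeric' applied directly to a factor returns the internal level codes, not the labels. A small sketch with made-up years (the name `year_factor` is only for illustration):

```r
# A factor of years, as Year is assumed to be stored in byYear.
year_factor <- factor(c("1982", "2000", "2004"))

# Direct conversion returns the level codes, not the years.
codes <- as.numeric(year_factor)

# Converting to character first recovers the actual year values.
years <- as.numeric(as.character(year_factor))

print(codes)
print(years)
```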

Try using 'str' - it's much more helpful, and you should see the
difference quickly.
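
As a sketch of why 'typeof' is not informative here, consider two toy data frames that differ only in how Year is stored. This assumes, as suggested above, that the difference between 'byYear' and 'byYear2' is a factor column; the data frame names below are hypothetical.

```r
# Two toy data frames that differ only in the type of the Year column.
as_factor  <- data.frame(Year = factor(c("1982", "2004")), x = c(8.6, 8.5))
as_numeric <- data.frame(Year = c(1982, 2004), x = c(8.6, 8.5))

# typeof() reports the underlying storage, which is "list" for any data frame.
print(typeof(as_factor))
print(typeof(as_numeric))

# str() lists each column with its type, so the factor and numeric
# versions of Year can be told apart at a glance.
str(as_factor)
str(as_numeric)
```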

Not easily, unfortunately.  However, you could do:

cast(mratings, Year ~ variable, c(mean, sum),
     subset = variable %in% c("VoteMean", "VoteCount"))

which will give you a mean and sum for both.

Using the built-in movies data:

mm <- melt(movies, id=1:2, m=c("rating", "votes"))
msum <- cast(mm, year ~ variable, c(mean, sum))

qplot(year, rating_mean, data=msum, colour=votes_sum)
qplot(year, rating_mean, data=msum, colour=votes_sum, geom="line")
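
If only one statistic per column is wanted (the mean of VoteMean and the sum of VoteCount), a base R alternative is to aggregate each column separately and merge the results. This is a sketch that does not use reshape, with toy rows copied from the first message and a hypothetical data frame name `ratings`:

```r
# Toy data: a few rows copied from the listing in the first message.
ratings <- data.frame(
  Year      = c(2004, 1982, 2002, 2003, 2000, 2004),
  VoteCount = c(299, 829, 150, 118, 110, 256),
  VoteMean  = c(8.4, 8.6, 7.0, 5.4, 7.8, 8.6)
)

# One aggregate() call per statistic, then join the two results on Year.
mean_rating <- aggregate(VoteMean ~ Year, data = ratings, FUN = mean)
total_votes <- aggregate(VoteCount ~ Year, data = ratings, FUN = sum)
byYear <- merge(mean_rating, total_votes, by = "Year")
print(byYear)
```

The merged data frame has exactly one mean column and one sum column, so VoteCount could be mapped to colour when plotting VoteMean against Year.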

Hadley
#
"hadley wickham" <h.wickham at gmail.com> writes:
It's funny that you mention this because I had intended to write this
email about a month ago but was delayed due to other reasons.  In any
case, when I was typing this up last night, I wanted to recreate my
steps but I could not find the IMDB movie data I had used originally.
I searched everywhere to no avail so I downloaded the data myself and
groomed it.  Only now do I remember that I had used the movies dataset
included in ggplot.

Thanks!  This is the function I've been looking for in my quest to
learn about internal data types of R.  Too bad it has such a terrible
name!

Great!  This is exactly what I was looking to do.  By the way, does
any of your documentation use the movie dataset as an example?  I'm
curious what else I can do with the dataset.  For example, how can I
use ggplot's facets to see the same information by type of movie?  I'm
unsure of how to manipulate the binary variables into a single
variable so that it can be treated as levels.

Thanks!
Pete
#
On 7/12/07, Pete Kazmier <pete-expires-20070910 at kazmier.com> wrote:
A lot of the examples do use the movies data, but I don't think any of
it is particularly revealing.  You might want to look at the results
for the 2007 infovis visualisation challenge
(http://www.apl.jhu.edu/Misc/Visualization/) which uses similar data.
Submission isn't complete yet, but you can see my team's entry at
http://had.co.nz/infovis-2007/.  There are lots of interesting stories
to pursue.

I think I will update the movies data to include the first genre as
another column.  That will make it easier to facet by genre.
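
Until such a column exists, one way to collapse 0/1 genre indicators into a single factor is sketched below. The data frame `films`, its values, and the choice of three genre columns are made up for illustration; "first genre" is taken to mean the first flagged column in column order.

```r
# Toy data mimicking 0/1 genre indicator columns; the values are made up.
films <- data.frame(
  title  = c("A", "B", "C", "D"),
  Action = c(1, 0, 0, 0),
  Comedy = c(1, 1, 0, 0),
  Drama  = c(0, 1, 1, 0)
)
genre_cols <- c("Action", "Comedy", "Drama")

# For each row, take the name of the first indicator column equal to 1,
# or NA when no genre is flagged for that film.
first_genre <- apply(films[genre_cols] == 1, 1, function(flags) {
  if (any(flags)) genre_cols[which(flags)[1]] else NA_character_
})

# Store the result as a factor so it can serve as a faceting variable.
films$genre <- factor(unname(first_genre), levels = genre_cols)
print(films)
```

Films with several genres are assigned only the first one here, so this loses information; it is a convenience for faceting, not a full recoding.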

Hadley