Histogram from frequency data in pre-made bins - R-help

Sun, Aug 21, 2011 3:20 AM

Dear R user,
I am using UK census data on travel to work. The authorities have provided a
breakdown in each area by mode (car, bicycle etc.) and distance travelled
(0 - 2 km, 2 - 5 km etc.). Therefore, after processing, the data for Sheffield
look like this https://files.one.ubuntu.com/ej2VtVbJTEaelvMRlsocRg :

dshef <- read.table("distmodesheff.csv", sep=",", header=TRUE)
print(dshef)


     Dist  Tr Bici  Met  Pas  Foot   Bus   Car
1     2 >  45  571  491 2125 16644  4469 13494
2   2 - 5  80 1136 2540 4738  3659 17290 30212
3  5 - 10 217  466 2335 3994  1041 12963 35221
4 10 - 20 191   76  491 1333   332  2439 16322
5 20 - 30 168    6   25  235    41   175  3711
6 30 - 40  78    6    3  122    20    74  2179
7 40 - 60 349    6   21  261    96   333  3501
8    60 < 332   62  125  369   534   433  3276
9   Other 148   40   79  905   388   622  6481

It's interesting to look at the different distributions of different
transport modes: 

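attach(dshef)
# One row per transport mode, one column per distance band
rs <- rbind(Tr, Bici, Met, Pas, Foot, Bus, Car)

# Grouped bars: one cluster per distance band, one bar per mode
barplot(rs, beside=TRUE, names.arg=Dist, col=rainbow(7), legend.text=TRUE)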

http://r.789695.n4.nabble.com/file/n3758198/1.png 

This is brilliant, and creates output similar to that of OpenOffice Calc:

http://r.789695.n4.nabble.com/file/n3758198/egraphmini.jpg 

However, as you can see, the pre-made categories (0 - 2 km etc.) are
unevenly spaced bins within a continuous variable. That calls for a
histogram, in which frequency is represented by the area of each bar, not
its height. What I am looking for, for the vector Car for example, is
something like this:

# Repeat one arbitrary distance per bin, once for every car commuter in it
n <- c(rep(1.5, Car[1]), rep(3, Car[2]), rep(7.7, Car[3]),
       rep(15, Car[4]), rep(25, Car[5]), rep(35, Car[6]),
       rep(50, Car[7]), rep(100, Car[8]))

hist(n, breaks=c(0,2,5,10,20,30,40,60,200))

http://r.789695.n4.nabble.com/file/n3758198/2.png 

This produces a histogram, but it's a tedious and ugly way of getting there.
Also, this does not allow for trend-line analysis of the likely distribution
of the continuous variable distance: lines(density(n)), for example, results
in peaks around my arbitrary values.
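
One way to avoid expanding the counts into a long vector is to compute the bar heights directly, so that each bar's area equals the bin's share of the total. The following is only a sketch under assumptions: the counts are the Car column copied from the table above, the 200 km upper limit is the same arbitrary cap used in the hist() call, and the names car, widths and dens are made up for the illustration.

```r
# Car counts for the eight distance bands, copied from the table above
# (the "Other" row is left out, as in the rep() example)
car <- c(13494, 30212, 35221, 16322, 3711, 2179, 3501, 3276)

# Bin edges in km; the 200 km upper limit is an arbitrary cap (assumption)
breaks <- c(0, 2, 5, 10, 20, 30, 40, 60, 200)

widths <- diff(breaks)
# Bar height chosen so that height * width = proportion of commuters
dens <- car / (sum(car) * widths)

# Empty plot region, then one rectangle per bin
plot(range(breaks), c(0, max(dens)), type = "n",
     xlab = "Distance (km)", ylab = "Density")
rect(breaks[-length(breaks)], 0, breaks[-1], dens, col = "grey")
```

The areas of the rectangles sum to 1, which is the same scaling hist() applies when the breaks are unequal.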

Has anyone else encountered similar issues? I've searched high and low but
can find no solution other than creating a barplot with variable widths:
http://r.789695.n4.nabble.com/Histogram-using-frequency-data-td827927.html

Any ideas about how to resolve this issue would be greatly appreciated.
Eventually I hope to model the distribution of distances travelled in order
to estimate the mean distance within each bin.

Many thanks, 

Robin


--
View this message in context: http://r.789695.n4.nabble.com/Histogram-from-frequency-data-in-pre-made-bins-tp3758198p3758198.html
Sent from the R help mailing list archive at Nabble.com.

RobinLovelace

Mon, Aug 22, 2011 2:28 AM

Update: I have created an artificial distribution using uniform random
numbers:

n <- c(runif(Car[1], 0, 2), runif(Car[2], 2, 5), runif(Car[3], 5, 10),
       runif(Car[4], 10, 20), runif(Car[5], 20, 30), runif(Car[6], 30, 40),
       runif(Car[7], 40, 60), runif(Car[8], 60, 200))
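
The same sample can probably be built without writing out one runif() call per bin. This is a minimal sketch, assuming the Car counts from the table in the first post and the same bin edges; the names car, lower and upper, and the seed, are only for illustration.

```r
# Car counts and bin edges as in the first post (200 km cap is arbitrary)
car    <- c(13494, 30212, 35221, 16322, 3711, 2179, 3501, 3276)
breaks <- c(0, 2, 5, 10, 20, 30, 40, 60, 200)

lower <- head(breaks, -1)   # left edge of each bin
upper <- tail(breaks, -1)   # right edge of each bin

set.seed(1)                 # arbitrary seed, only for reproducibility

# One runif(count, lower, upper) call per bin, then flatten into one vector
n <- unlist(Map(runif, car, lower, upper))
```

Binning n again with cut(n, breaks) should give back the original counts, which is a quick check that nothing was lost.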

The resulting density estimate is very jumpy but should, in theory, allow
me to fit a distribution to it and then extract the bin means from a random
sample of the fitted distribution. Again, this is tedious and far from
ideal, but I cannot see any way around it. Also, the distributions I fit to
this artificial dataset shoot up to infinity as x -> 0.
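
An alternative that skips the artificial sample is to fit a distribution to the binned counts directly, by maximising the likelihood of the observed counts given the probability the curve assigns to each bin. The sketch below rests on several assumptions: a lognormal is used purely for illustration (nothing here shows it suits commuting distances), the last bin is treated as open-ended instead of capped at 200 km, the "Other" row is ignored, and names such as negloglik and part_exp are made up.

```r
# Car counts from the first post; last bin treated as "60 km and above"
car    <- c(13494, 30212, 35221, 16322, 3711, 2179, 3501, 3276)
breaks <- c(0, 2, 5, 10, 20, 30, 40, 60, Inf)

# Negative log-likelihood for binned counts under a lognormal:
# each bin contributes count * log(P(lower < X <= upper)).
# par[2] is log(sdlog), so that sdlog stays positive during the search.
negloglik <- function(par) {
  p <- diff(plnorm(breaks, meanlog = par[1], sdlog = exp(par[2])))
  -sum(car * log(pmax(p, 1e-300)))   # guard against log(0)
}

fit     <- optim(c(log(5), 0), negloglik)   # starting values are a guess
meanlog <- fit$par[1]
sdlog   <- exp(fit$par[2])

# Partial expectation E[X; X <= k] of a lognormal, in closed form
part_exp <- function(k) {
  exp(meanlog + sdlog^2 / 2) * plnorm(k, meanlog + sdlog^2, sdlog)
}

# Mean distance within each bin = partial expectation / bin probability
means <- diff(part_exp(breaks)) / diff(plnorm(breaks, meanlog, sdlog))
```

Comparing diff(plnorm(breaks, meanlog, sdlog)) with car / sum(car) gives a rough idea of how well the assumed curve reproduces the observed bin shares before the bin means are trusted.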

Any ideas anyone???


RobinLovelace

Mon, Aug 22, 2011 8:25 AM

Sorry to anyone who tried but failed to download the data - it seems not to
be there.

Here's a new link to it; please take a look.

http://ubuntuone.com/p/1C6U/
