Analysis of pre-calculated frequency distribution?
On Sun, 21 Nov 2004 Ted.Harding at nessie.mcc.ac.uk wrote:
On 21-Nov-04 Dan Bolser wrote:
Sorry for the dumb question, but I can't work out how to do this.

Quick version: how can I re-bin a given frequency distribution using new breaks without reference to the original data? The given distribution has integer-valued bins.

Long version: I am loading a frequency table into R from a file. The original data is very large, and it is a very simple process to get a frequency distribution from an SQL database, so all in all this is a convenient method for me. The point being that I don't start with 'raw' data. The data looks like this...
> dat
    COUNT FREQUENCY
1       1      5734
2       2      1625
[...]
365  9442         1
366 12280         1
[...]

People typically quote the curve in the form y ~ Cx^(-a). I want to use the binning method of parameter estimation given here:

http://www.ece.uc.edu/~annexste/Courses/cs690/Zipf,%20Power-law,%20Pareto%20-%20a%20ranking%20tutorial.htm

(bin the data with exponentially increasing bin widths within the data range). But I can't work out how to re-bin my existing frequency data.
Hi Dan,
Your starting point can be the fact that the number of cases
with property i ("in class i") is COUNT_i * FREQUENCY_i.
So if you construct a vector with these numbers in it you have
in effect reconstructed the original data.
I.e. N[i] <- COUNT[i]*FREQUENCY[i]
Cheers for this, I was trying this, but my results looked wrong with respect to the data shown on the webpage cited above. Thanks to James Holtman for the other suggestion - My confusion was coming from thinking I had to use hist, but in fact cut + tapply was the ticket. Cheers, Dan.
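A minimal sketch of the cut + tapply route mentioned above, not necessarily the exact code used in the thread. The data frame values are invented for illustration, the names n_bins, breaks, bin and rebinned are hypothetical, and the choice of powers of two as break points is an assumption. The sketch sums FREQUENCY within each bin; to bin the product suggested earlier in the thread instead, replace dat$FREQUENCY with dat$COUNT * dat$FREQUENCY.

```r
# Hypothetical frequency table in the same layout as `dat`
# (values invented purely for illustration).
dat <- data.frame(COUNT     = c(1, 2, 3, 4, 5, 8, 20, 100),
                  FREQUENCY = c(50, 20, 10, 6, 4, 2, 1, 1))

# Exponentially increasing upper break points 1, 2, 4, 8, ...
# extended until they cover the largest COUNT (an assumed scheme).
n_bins <- ceiling(log2(max(dat$COUNT))) + 1
breaks <- c(0, 2^(0:(n_bins - 1)))

# Assign each COUNT value to a bin (lower, upper], then sum the
# frequencies that fall into each bin.
bin      <- cut(dat$COUNT, breaks = breaks)
rebinned <- tapply(dat$FREQUENCY, bin, sum)

# Bins that received no rows come back as NA; treat them as zero.
rebinned[is.na(rebinned)] <- 0
rebinned
```

Because cut works on the COUNT values themselves rather than on row positions, this does not require the classes to be consecutive integers.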
which can be done in one stroke with

    N <- COUNT*FREQUENCY

One way (and maybe others can suggest better) to bin these classes non-uniformly could be:

Say you have k "upper" breakpoints for your k bins, say BP, so that e.g. if BP[1] = 2 then there are N[1]+N[2] cases with class <= 2, and if BP[2] = 5 then there are N[3] + N[4] + N[5] cases with class > 2 and class <= 5, and so on. In your case BP[k] = 366.

Let

    csN <- cumsum(N)

Then (if I've not overlooked something)

    diff(c(0,csN[BP]))

will give you the counts in your new bins.

E.g. (just to show it should work):
> N<-rep(1,31)
> BP<-c(1,3,7,15,31)
> csN <- cumsum(N)
> diff(c(0,csN[BP]))
[1] 1 2 4 8 16
> BP<-c(2,3,5,9,17,31)
> diff(c(0,csN[BP]))
[1] 2 1 2 4 8 14

I hope this matches the sort of thing you have in mind!

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
Date: 21-Nov-04 Time: 16:47:05
------------------------------ XFMail ------------------------------
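The cumsum/diff idea above can be wrapped in a small helper and combined with exponentially growing break points. This is only a sketch: the function name rebin, the doubling scheme 1, 3, 7, 15, ... and the placeholder counts are assumptions for illustration, not part of the original message. Note that csN[BP] indexes by position, so BP must hold row indices; these coincide with the class labels only when the classes run 1, 2, 3, ... without gaps.

```r
# Re-bin per-class counts N using upper break points BP,
# following the cumsum/diff idea from the message above.
# BP holds strictly increasing positions (row indices) in N.
rebin <- function(N, BP) {
  stopifnot(!is.unsorted(BP, strictly = TRUE),
            BP >= 1, BP <= length(N))
  diff(c(0, cumsum(N)[BP]))
}

# Doubling break points 1, 3, 7, 15, ... capped at the last row,
# built for a table with 366 rows as in the original question.
n  <- 366
BP <- unique(pmin(2^(1:ceiling(log2(n + 1))) - 1, n))

N <- rep(1, n)             # placeholder counts, one case per class
new_counts <- rebin(N, BP)

# Every case should land in exactly one new bin.
stopifnot(sum(new_counts) == sum(N))
```

With real data, N would be replaced by the per-row counts from the frequency table; the final check that the totals agree is a cheap guard against break points that fail to reach the last row.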