Analysis of pre-calculated frequency distribution? - R-help

Sun, Nov 21, 2004 6:35 AM

Sorry for the dumb question, but I can't work out how to do this.

Quick version, 

How can I re-bin a given frequency distribution using new breaks, without
reference to the original data? The given distribution has integer-valued
bins.


Long version,

I am loading a frequency table into R from a file. The original data is
very large, and it is a very simple process to get a frequency
distribution from an SQL database, so in all this is a convenient method
for me. Point being I don't start with 'raw' data.

The data looks like this...

             COUNT FREQUENCY
1                1 5734
2                2 1625
3                3  793
4                4  480
5                5  294
6                6  237
7                7  205
8                8  200
9                9  123
10              10  108
11              11   90
12              12   62
13              13   60
14              14   68
15              15   64
16              16   56
17              17   68
18              18   45
19              19   38
20              20   37
21              21   29
22              22   39
23              23   35
24              24   33
25              25   36
...
148            153    5
149            156    2
150            157    3
151            158    2
152            159    2
153            162    1
154            163    3
155            164    3
156            165    2
157            166    1
158            168    2
159            169    4
160            170    1
...
354           2106    1
355           2189    1
356           2194    1
357           2217    1
358           2246    1
359           2474    1
360           2801    1
361           3697    1
362           3702    1
363           7353    1
364           8738    1
365           9442    1
366          12280    1



This is a typical 'count / frequency' distribution in biology, where low
counts of a certain property are very frequent (across genomes, proteins,
ecosystems, etc...), and high counts of a certain property are very
rare.

In the above example a certain property occurs 12280 times with a
frequency of 1, another property occurs 9442 times with the same
frequency. At the other extreme, a certain property occurs once
with a frequency of 5734, and another property occurs twice with a
frequency of 1625.

This kind of distribution is variously known as a "zipf", a "power law", a
"Pareto", "scale free", "heavy tailed" or an "80:20" distribution, or
colloquially "the dominance of the few over the many". The term I choose is
a "log linear" distribution, because that makes no assumptions about the
underlying cause of the overall shape.

People typically quote the curve in the form of y ~ Cx^(-a). I want to use
the binning method of parameter estimation given here...

http://www.ece.uc.edu/~annexste/Courses/cs690/Zipf,%20Power-law,%20Pareto%20-%20a%20ranking%20tutorial.htm

(bin the data with exponentially increasing bin widths within the data
range).
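
A minimal sketch of what "exponentially increasing bin widths" could look
like in R. The helper name exp_breaks is hypothetical, and the doubling
factor (widths 1, 2, 4, 8, ...) is an assumption; the tutorial may use a
different growth factor.

```r
# Hypothetical helper: breaks 0, 1, 3, 7, 15, ... so that the bins
# (0,1], (1,3], (3,7], ... have widths 1, 2, 4, ... (doubling is assumed).
exp_breaks <- function(max_count) {
  upper <- 1
  # Keep adding upper edges until the last one covers max_count.
  while (upper[length(upper)] < max_count) {
    upper <- c(upper, 2 * upper[length(upper)] + 1)
  }
  c(0, upper)
}

exp_breaks(31)
# [1]  0  1  3  7 15 31
```

Calling exp_breaks(12280) would give breaks that cover the largest COUNT
in the table above.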

But I can't work out how to re-bin my existing frequency data.

Sorry for the long question, 
all the best
Dan.

(Ted Harding)

Sun, Nov 21, 2004 8:47 AM

On 21-Nov-04 Dan Bolser wrote:

Hi Dan,
Your starting point can be the fact that the number of cases
with property i ("in class i") is COUNT_i * FREQUENCY_i.

So if you construct a vector with these numbers in it you have
in effect reconstructed the original data.

I.e.  N[i] <- COUNT[i]*FREQUENCY[i]

which can be done in one stroke with N <- COUNT*FREQUENCY
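
As a small illustration, here is that product for the first five rows of
the table in the original post. The data frame name d is just a
placeholder. Note that N weights each class by its COUNT; summing
FREQUENCY on its own would instead count how many properties fall in
each class.

```r
# First five rows of the posted table (d is a hypothetical name).
d <- data.frame(COUNT     = 1:5,
                FREQUENCY = c(5734, 1625, 793, 480, 294))

# Vectorised product, one value per class.
N <- d$COUNT * d$FREQUENCY
N
# [1] 5734 3250 2379 1920 1470
```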

One way (and maybe others can suggest better) to bin these
classes non-uniformly could be:

  Say you have k "upper" breakpoints for your k bins,
  say BP, so that e.g. if BP[1] = 2 then there are N[1]+N[2]
  cases with class <= 2, and if BP[2] = 5 then there are
  N[3] + N[4] + N[5] cases with class > 2 and class <= 5,
  and so on. In your case BP[k] = 366.

  Let

    csN <- cumsum(N)

  Then (if I've not overlooked something)

    diff(c(0,csN[BP]))

  will give you the counts in your new bins.

E.g. (just to show it should work):

  > N<-rep(1,31)
  > BP<-c(1,3,7,15,31)
  > csN <- cumsum(N)
  > diff(c(0,csN[BP]))
  [1]  1  2  4  8 16


  > BP<-c(2,3,5,9,17,31)
  > diff(c(0,csN[BP]))
  [1]  2  1  2  4  8 14
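
One caveat, sketched under assumptions: csN[BP] indexes by row position,
so it lines up with class values only while COUNT runs 1, 2, 3, ... with
no gaps. In the posted table row 148 already has COUNT 153. If the
breakpoints are meant as COUNT values rather than row numbers,
findInterval() can translate them to rows first. The breakpoints below
are made up for illustration.

```r
# Rows 148-152 of the posted table: COUNT skips 154 and 155.
COUNT     <- c(153, 156, 157, 158, 159)
FREQUENCY <- c(5, 2, 3, 2, 2)

N   <- COUNT * FREQUENCY
csN <- cumsum(N)

# Upper breakpoints given as COUNT values (hypothetical choice).
BP <- c(155, 157, 159)

# For each breakpoint, the last row whose COUNT is <= that breakpoint
# (0 if there is none, hence the padded c(0, csN) below).
idx <- findInterval(BP, COUNT)

rebinned <- diff(c(0, c(0, csN)[idx + 1]))
rebinned
# [1] 765 783 634
```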

I hope this matches the sort of thing you have in mind!
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
Date: 21-Nov-04                                       Time: 16:47:05
------------------------------ XFMail ------------------------------

Dan Bolser

Sun, Nov 21, 2004 10:12 AM

On Sun, 21 Nov 2004 Ted.Harding at nessie.mcc.ac.uk wrote:

Cheers for this, I was trying this, but my results looked wrong with
respect to the data shown on the webpage cited above.

Thanks to James Holtman for the other suggestion - 

My confusion was coming from thinking I had to use hist, but in fact cut +
tapply was the ticket.
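
The suggestion itself is not quoted in this thread, so the following is
only a guess at the shape of a cut + tapply solution, using the first
seven rows of the posted table and made-up breaks. It sums FREQUENCY per
bin; to sum occurrences instead, pass COUNT * FREQUENCY to tapply.

```r
# First seven rows of the posted table (d is a hypothetical name).
d <- data.frame(COUNT     = 1:7,
                FREQUENCY = c(5734, 1625, 793, 480, 294, 237, 205))

# Breaks with doubling widths; cut() makes the intervals (0,1], (1,3], (3,7].
breaks <- c(0, 1, 3, 7)
bin    <- cut(d$COUNT, breaks = breaks)

# Sum the frequencies that fall into each new bin.
rebinned <- tapply(d$FREQUENCY, bin, sum)
rebinned
# (0,1] (1,3] (3,7]
#  5734  2418  1216
```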

Cheers,
Dan.
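
For the estimation step mentioned in the first post (y ~ Cx^(-a)), here
is a rough sketch on synthetic data with a known exponent of 2. Dividing
each binned sum by its bin width, using the geometric midpoint of each
bin, and fitting a straight line on log-log axes are all assumptions
about how the binning method might be applied; this is not taken from
the tutorial or from anyone's actual analysis.

```r
# Synthetic frequency table following freq = C * x^(-a), with a = 2.
x    <- 1:1023
freq <- 1e4 * x^(-2)

# Doubling bins: (0,1], (1,3], (3,7], ..., (511,1023].
breaks <- c(0, 2^(1:10) - 1)
bin    <- cut(x, breaks = breaks)
binned <- as.vector(tapply(freq, bin, sum))

# Normalise by bin width so wide bins are not over-weighted.
width   <- diff(breaks)
density <- binned / width

# Geometric midpoint of the integer values covered by each bin.
lower <- breaks[-length(breaks)] + 1
upper <- breaks[-1]
mid   <- sqrt(lower * upper)

# Slope of the log-log regression estimates -a.
fit   <- lm(log(density) ~ log(mid))
a_hat <- -coef(fit)[[2]]
a_hat   # should come out close to 2
```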