Skip to content

data

2 messages · Val, jim holtman

Val
#
Hi all,

I am trying to read and summarize  a big data frame( >10M records)

Here is the sample of my data
state,city,x
1,12,100
1,12,100
1,12,200
1,13,200
1,13,100
1,13,100
1,14,200
2,21,200
2,21,200
2,21,100
2,23,100
2,23,200
2,34,200
2,34,100
2,35,100

I want  get  the total count by state, and the  the number of cities
by state. The x variable is either 100 or 200 and count each

The result should look like as follows.

state,city,count,100's,200's
1,3,7,4,3
2,4,8,4,4

At the present I am doing it  in several steps and taking too long

Is there an efficient way of doing this?
#
This should be reasonably efficient with 'dplyr':
+ 1,12,100
+ 1,12,100
+ 1,12,200
+ 1,13,200
+ 1,13,100
+ 1,13,100
+ 1,14,200
+ 2,21,200
+ 2,21,200
+ 2,21,100
+ 2,23,100
+ 2,23,200
+ 2,34,200
+ 2,34,100
+ 2,35,100")
+             group_by(state) %>%
+             summarise(nCities = length(unique(city)),
+                 count = n(),
+                 `100's` = sum(x == 100),
+                 `200's` = sum(x == 200)
+                 )
# A tibble: 2 ? 5
  state nCities count `100's` `200's`
  <int>   <int> <int>   <int>   <int>
1     1       3     7       4       3
2     2       4     8       4       4


Or you can also use data.table:
+ 1,12,100
+ 1,12,100
+ 1,12,200
+ 1,13,200
+ 1,13,100
+ 1,13,100
+ 1,14,200
+ 2,21,200
+ 2,21,200
+ 2,21,100
+ 2,23,100
+ 2,23,200
+ 2,34,200
+ 2,34,100
+ 2,35,100")
+           count = .N,
+           `100's` = sum(x == 100),
+           `200's` = sum(x == 200)
+           )
+         , keyby = state
+         ]
   state nCities count 100's 200's
1:     1       3     7     4     3
2:     2       4     8     4     4



Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
On Sat, Dec 3, 2016 at 10:40 AM, Val <valkremk at gmail.com> wrote: