
big data?

6 messages · Peter Langfelder, David Winsemius, Mike Harwood +1 more

#
What tools do you like for working with tab-delimited text files 
up to 1.5 GB (under Windows 7 with 8 GB RAM)?


       Standard tools for smaller data sometimes grab all the available 
RAM, after which CPU usage drops to 3% ;-)


       The "bigmemory" project won the 2010 John Chambers Award but "is 
not available (for R version 3.1.0)".


       findFn("big data", 999) downloaded 961 links in 437 packages. 
That contains tools for data PostgreSQL and other formats, but I 
couldn't find anything for large tab delimited text files.


       Absent a better idea, I plan to write a function getField to 
extract a specific field from the data, then use that to split the data 
into 4 smaller files, which I think should be small enough that I can do 
what I want.
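
       A minimal sketch of such a getField (untested; it assumes a header 
row and uses base R's colClasses = "NULL" trick to skip the unwanted 
columns):

    getField <- function(file, field, sep = "\t") {
      hdr <- read.table(file, sep = sep, header = TRUE, nrows = 1)
      cls <- rep("NULL", ncol(hdr))     # "NULL" tells read.table to skip a column
      cls[names(hdr) == field] <- NA    # NA lets read.table guess the type
      read.table(file, sep = sep, header = TRUE, colClasses = cls)[[field]]
    }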


       Thanks,
       Spencer
#
Have you tried read.csv.sql from package sqldf?
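
For example, a minimal untested sketch (the file name and column names 
are made up; sep = "\t" handles the tab delimiter, and inside the SQL 
statement the file is referred to as 'file'):

    library(sqldf)
    # read two columns and a subset of the rows from a tab-delimited file
    dat <- read.csv.sql("big.tsv",
                        sql = "select id, value from file where value > 0",
                        header = TRUE, sep = "\t")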

Peter

On Tue, Aug 5, 2014 at 10:20 AM, Spencer Graves
<spencer.graves at structuremonitoring.com> wrote:
#
On Aug 5, 2014, at 10:20 AM, Spencer Graves wrote:

            
?data.table::fread
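
For instance (an untested sketch; the file and column names are made up, 
and select= restricts the read to the listed columns):

    library(data.table)
    dt <- fread("big.tsv", select = c("id", "value"))   # separator is auto-detected
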
There is the colbycol package, with which I have no experience, but I understand it is designed to partition data into column-sized objects.
#--- from its help file-----
cbc.get.col {colbycol}	R Documentation
Reads a single column from the original file into memory

Description

Function cbc.read.table reads a file, stores it column by column in disk files, and creates a colbycol object. Function cbc.get.col queries this object and returns a single column.
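
A rough, untested sketch of that workflow (the file and column names are 
hypothetical, and I am assuming the extra arguments are passed through to 
read.table):

    library(colbycol)   # archived on CRAN; may need installation from the archive
    cbc <- cbc.read.table("big.tsv", header = TRUE, sep = "\t")  # one on-disk object per column
    x   <- cbc.get.col(cbc, "value")                             # pull one column into memory
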
David Winsemius
Alameda, CA, USA
#
The read.table.ffdf function in the ff package can read in delimited files 
and store them on disk as individual columns.  The ffbase package provides 
additional data management and analytic functionality.  I have used these 
packages on 15 GB files of 18 million rows and 250 columns.
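
A minimal, untested sketch of that approach (the file name, column name, 
and chunk sizes are made up):

    library(ff)
    library(ffbase)
    # read the delimited file in chunks into an on-disk ffdf object
    dat <- read.table.ffdf(file = "big.tsv", header = TRUE, sep = "\t",
                           first.rows = 10000, next.rows = 50000)
    dim(dat)          # dimensions without loading the data into RAM
    x <- dat$value[]  # materialize a single column as an ordinary vector
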
On Tuesday, August 5, 2014 1:39:03 PM UTC-5, David Winsemius wrote:
#
Thanks to all who replied.  For the record, I will summarize here 
what I tried and what I learned:


       Mike Harwood suggested the ff package.  David Winsemius suggested 
data.table and colbycol.  Peter Langfelder suggested sqldf.


       sqldf::read.csv.sql allowed me to create an SQL command to read a 
column or a subset of the rows of a 400 GB tab-delimited file in roughly 
a minute on a 2.3 GHz dual core machine running Windows 7 with 8 GB 
RAM.  It also read a column of a 1.3 GB file in 4 minutes.  The 
documentation was sufficient to allow me to easily get what I wanted 
with a minimum of effort.


       If I needed to work with these data regularly, I might experiment 
with colbycol and ff:  The documentation suggested to me that these 
packages might allow me to get quicker answers to routine tasks after 
some preprocessing.  Of course, I could also do the preprocessing 
manually with sqldf.
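
       For example, such preprocessing might look like this (an untested 
sketch; the file name and the grp column used to define the pieces are 
made up):

    library(sqldf)
    # pull out the rows for one group and write them to a smaller file
    part <- read.csv.sql("big.tsv",
                         sql = "select * from file where grp = 1",
                         header = TRUE, sep = "\t")
    write.table(part, "part1.tsv", sep = "\t", row.names = FALSE, quote = FALSE)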


       Thanks, again.
       Spencer
On 8/6/2014 9:39 AM, Mike Harwood wrote:

  
    
#
Correcting a typo: 400 MB, not GB (thanks to David Winsemius for 
reporting it).  Spencer

