What tools do you like for working with tab-delimited text files up to 1.5 GB (under Windows 7 with 8 GB RAM)? Standard tools for smaller data sometimes grab all the available RAM, after which CPU usage drops to 3% ;-)

The "bigmemory" project won the 2010 John Chambers Award but "is not available (for R version 3.1.0)".

findFn("big data", 999) (from the sos package) downloaded 961 links in 437 packages. Those include tools for data in PostgreSQL and other formats, but I couldn't find anything for large tab-delimited text files.

Absent a better idea, I plan to write a function getField to extract a specific field from the data, then use it to split the data into 4 smaller files, each small enough for what I want to do.
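Something like the following rough sketch (untested; the input file name, the key column, and the four key values are invented for illustration):

#--- sketch: split a big tab-delimited file on one field -----
## extract field `fieldNum` from each line without building a data frame
getField <- function(lines, fieldNum, sep = "\t") {
  vapply(strsplit(lines, sep, fixed = TRUE), `[[`, character(1), fieldNum)
}

groups   <- c("A", "B", "C", "D")           # hypothetical key values
outFiles <- paste0("part_", groups, ".txt")

con <- file("bigfile.txt", open = "r")      # hypothetical input file
header <- readLines(con, n = 1)
for (f in outFiles) writeLines(header, f)   # repeat the header in each part

repeat {
  chunk <- readLines(con, n = 100000)       # modest chunks keep RAM bounded
  if (length(chunk) == 0L) break
  grp <- match(getField(chunk, fieldNum = 1), groups)
  for (i in seq_along(groups)) {            # rows with unmatched keys are dropped
    sel <- chunk[!is.na(grp) & grp == i]
    if (length(sel) > 0) write(sel, outFiles[i], append = TRUE)
  }
}
close(con)
#--------------------------------------------------------------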
Thanks,
Spencer
big data? · 6 messages · Peter Langfelder, David Winsemius, Mike Harwood, Spencer Graves
Have you tried read.csv.sql from package sqldf?

Peter
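For a tab-delimited file, that might look something like this (untested; the file and column names are invented; inside the SQL statement the table is referred to as "file"):

#--- sketch: read one column via sqldf::read.csv.sql -----
library(sqldf)

oneCol <- read.csv.sql("bigfile.txt",
                       sql = "select myColumn from file",
                       sep = "\t")
#----------------------------------------------------------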
On Aug 5, 2014, at 10:20 AM, Spencer Graves wrote:
> What tools do you like for working with tab-delimited text files up to 1.5 GB (under Windows 7 with 8 GB RAM)?

?data.table::fread
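For example (file and column names invented; recent versions of data.table also let fread read only some columns via select=, which should keep a 1.5 GB file comfortably within 8 GB of RAM):

#--- sketch: data.table::fread -----
library(data.table)

DT <- fread("bigfile.txt", sep = "\t", select = c("id", "value"))
#------------------------------------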
There is the colbycol package, with which I have no experience, but I understand it is designed to partition data into column-sized objects.
#--- from its help file -----
cbc.get.col {colbycol}    R Documentation

Reads a single column from the original file into memory

Description:
Function cbc.read.table reads a file, stores it column by column in disk files, and creates a colbycol object. Function cbc.get.col queries this object and returns a single column.
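Based on that help page, the workflow would presumably look something like this (untested, and the argument names are assumptions modeled on read.table conventions; file and column names invented):

#--- sketch: colbycol workflow -----
library(colbycol)

cbc <- cbc.read.table("bigfile.txt", sep = "\t", header = TRUE)
x   <- cbc.get.col(cbc, "value")   # pull a single column into memory
#------------------------------------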
David Winsemius
Alameda, CA, USA
The read.table.ffdf function in the ff package can read delimited files and store them to disk as individual columns. The ffbase package provides additional data management and analytic functionality. I have used these packages on 15 GB files of 18 million rows and 250 columns.
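A minimal sketch of that approach (file and column names are invented for illustration):

#--- sketch: ff / ffbase -----
library(ff)
library(ffbase)   # extra data-management verbs for ffdf objects

big <- read.table.ffdf(file = "bigfile.txt", header = TRUE, sep = "\t")
dim(big)                  # inspect dimensions without loading the data into RAM
x <- as.ram(big$value)    # pull a single column into RAM when needed
#------------------------------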
Thanks to all who replied. For the record, I will summarize here what I tried and what I learned:

Mike Harwood suggested the ff package. David Winsemius suggested data.table and colbycol. Peter Langfelder suggested sqldf.
sqldf::read.csv.sql allowed me to create an SQL command to read a column or a subset of the rows of a 400 GB tab-delimited file in roughly a minute on a 2.3 GHz dual-core machine running Windows 7 with 8 GB RAM. It also read a column of a 1.3 GB file in 4 minutes. The documentation was sufficient for me to get what I wanted with a minimum of effort.
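The calls were along these lines (reconstructed for illustration; the real file, column, and key names differ):

#--- sketch: row subset via sqldf::read.csv.sql -----
library(sqldf)

sub <- read.csv.sql("bigfile.txt",
                    sql = "select * from file where myKey = 'A'",
                    sep = "\t")
#-----------------------------------------------------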
If I needed to work with these data regularly, I might experiment with colbycol and ff: the documentation suggests that those packages could give quicker answers to routine tasks after some preprocessing. Of course, I could also do the preprocessing manually with sqldf.
Thanks, again.
Spencer
Spencer Graves, PE, PhD
President and Chief Technology Officer
Structure Inspection and Monitoring, Inc.
751 Emerson Ct.
San José, CA 95126
ph: 408-655-4567
web: www.structuremonitoring.com
Correcting a typo: 400 MB, not GB, in the summary above (thanks to David Winsemius for reporting it).

Spencer