I agree that quoted integer columns are not the most efficient way of
delivering csv-files. However, the sad reality is that one receives such
formats and still needs to read the data. Therefore it is not helpful to
state that one should 'consider "character" to be the correct colClass
in case an integer is surrounded by quotes'.
The philosophy of read.table.ffdf is delegating the actual csv-parsing
to a parse engine 'similarly' parametrized like 'read.table'. It is not
'bad coding practice' - but a conscious design decision - to assume that
the parse engine behaves consistently, which read.table does not yet: it
automatically recognizes a quoted integer column as 'integer', but when
asked to explicitly interpret the column as 'integer' it does refuse to
do so. So there is nothing wrong with read.table.ffdf (but something can
be improved about read.table). It is *not* the 'best solution [...] to
rewrite read.table.ffdf()' given that it nicely imports such data, see
4+1 ways to do so below.
Jens Oehlschl?gel
# --- first create a csv file for demonstration
-------------------------------
require(ff)
file <- "test.csv"
path <- "c:/tmp"
n <- 1e2
d <- data.frame(x=1:n, y=shQuote(1:n))
write.csv(d, file=file.path(path,file), row.names=FALSE, quote=FALSE)
# --- how to do it with read.table.ffdf
---------------------------------------
# 1 let the parse engine ignore colClasses and hope for the best
fixedengine <- function(file, ..., colClasses=NA){
read.csv(file, ...)
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="fixedengine")
df
# 2 Suspend colClasses(=NA) for the quoted integer column only
df <- read.csv.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", NA))
df
# 3 do your own type conversion using transFUN
# after reading the problematic column as character
# Being able to inject regexps is quite powerful isn't it?
# Or error handlinig in case of varying column format!
custominterp <- function(d){
d[[2]] <- as.integer(gsub('"', '', d[[2]]))
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", "character"), FUN="read.csv", transFUN=custominterp)
df
# 4 do your own line parsing and type conversion
# Here you can even handle non-standard formats
# such as varying number of columns
customengine <- function(file, header=TRUE, col.names, colClasses=NA,
nrows=0, skip=0, fileEncoding="", comment.char = ""){
l <- scan(file, what="character", nlines=nrows+header, skip=skip,
fileEncoding=fileEncoding, comment.char = comment.char)
s <- do.call("rbind", strsplit(l, ","))
if (header){
d <- data.frame(as.integer(s[-1,1]), as.integer(gsub('"','',s[-1,2])))
names(d) <- s[1,]
}else{
d <- data.frame(as.integer(s[,1]), as.integer(gsub('"','',s[,2])))
}
if (!missing(col.names))
names(d) <- col.names
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="customengine")
df
# 5 use a parsing engine that can apply colClasses to quoted integers
# Unfortunately Henry Bengtson's readDataFrame does not work as a
# parse engine for read.table.ffdf because read.table.ffdf expects
# the parse engine to read successive chunks from a file connection
# while readDataFrame only accepts a filename as input file spec.
# Yes it has 'skip', but using that would reread the file from scratch
# for each chunk (O(N^2) costs)
read.table() with quoted integers
5 messages · Jens Oehlschlägel, Joshua Ulrich, Milan Bouchet-Valat +1 more
On Thu, Oct 3, 2013 at 9:44 AM, Jens Oehlschl?gel
<Jens.Oehlschlaegel at truecluster.com> wrote:
I agree that quoted integer columns are not the most efficient way of delivering csv-files. However, the sad reality is that one receives such formats and still needs to read the data. Therefore it is not helpful to state that one should 'consider "character" to be the correct colClass in case an integer is surrounded by quotes'. The philosophy of read.table.ffdf is delegating the actual csv-parsing to a parse engine 'similarly' parametrized like 'read.table'. It is not 'bad coding practice' - but a conscious design decision - to assume that the parse engine behaves consistently, which read.table does not yet: it automatically recognizes a quoted integer column as 'integer', but when asked to explicitly interpret the column as 'integer' it does refuse to do
read.table() does not "automatically recognize a quoted integer column as 'integer'". If colClasses is not specified, it reads the entire column into a 'character' vector and then calls type.convert() on it. type.convert() does all the necessary work to determine what class the 'character' vector should be converted to. If colClasses is specified, quotes are not interpreted in non-'character' columns. You want scan() to allocate an 'integer' vector, and then ensure (on each read from the column in the file) that the value read is a valid 'integer' type, while interpreting quotes (which strtol does not do, so someone would have to write and test this new functionality). So your complaint is more with scan() than read.table(). And more with Strtoi() (and therefore strtol) than scan().
so. So there is nothing wrong with read.table.ffdf (but something can be
improved about read.table). It is *not* the 'best solution [...] to rewrite
read.table.ffdf()' given that it nicely imports such data, see 4+1 ways to
do so below.
Jens Oehlschl?gel
# --- first create a csv file for demonstration
-------------------------------
require(ff)
file <- "test.csv"
path <- "c:/tmp"
n <- 1e2
d <- data.frame(x=1:n, y=shQuote(1:n))
write.csv(d, file=file.path(path,file), row.names=FALSE, quote=FALSE)
# --- how to do it with read.table.ffdf
---------------------------------------
# 1 let the parse engine ignore colClasses and hope for the best
fixedengine <- function(file, ..., colClasses=NA){
read.csv(file, ...)
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="fixedengine")
df
# 2 Suspend colClasses(=NA) for the quoted integer column only
df <- read.csv.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", NA))
df
# 3 do your own type conversion using transFUN
# after reading the problematic column as character
# Being able to inject regexps is quite powerful isn't it?
# Or error handlinig in case of varying column format!
custominterp <- function(d){
d[[2]] <- as.integer(gsub('"', '', d[[2]]))
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", "character"), FUN="read.csv", transFUN=custominterp)
df
# 4 do your own line parsing and type conversion
# Here you can even handle non-standard formats
# such as varying number of columns
customengine <- function(file, header=TRUE, col.names, colClasses=NA,
nrows=0, skip=0, fileEncoding="", comment.char = ""){
l <- scan(file, what="character", nlines=nrows+header, skip=skip,
fileEncoding=fileEncoding, comment.char = comment.char)
s <- do.call("rbind", strsplit(l, ","))
if (header){
d <- data.frame(as.integer(s[-1,1]),
as.integer(gsub('"','',s[-1,2])))
names(d) <- s[1,]
}else{
d <- data.frame(as.integer(s[,1]),
as.integer(gsub('"','',s[,2])))
}
if (!missing(col.names))
names(d) <- col.names
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="customengine")
df
# 5 use a parsing engine that can apply colClasses to quoted integers
# Unfortunately Henry Bengtson's readDataFrame does not work as a
# parse engine for read.table.ffdf because read.table.ffdf expects
# the parse engine to read successive chunks from a file connection
# while readDataFrame only accepts a filename as input file spec.
# Yes it has 'skip', but using that would reread the file from scratch
# for each chunk (O(N^2) costs)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com
Le vendredi 04 octobre 2013 ? 07:34 -0500, Joshua Ulrich a ?crit :
On Thu, Oct 3, 2013 at 9:44 AM, Jens Oehlschl?gel <Jens.Oehlschlaegel at truecluster.com> wrote:
I agree that quoted integer columns are not the most efficient way of delivering csv-files. However, the sad reality is that one receives such formats and still needs to read the data. Therefore it is not helpful to state that one should 'consider "character" to be the correct colClass in case an integer is surrounded by quotes'. The philosophy of read.table.ffdf is delegating the actual csv-parsing to a parse engine 'similarly' parametrized like 'read.table'. It is not 'bad coding practice' - but a conscious design decision - to assume that the parse engine behaves consistently, which read.table does not yet: it automatically recognizes a quoted integer column as 'integer', but when asked to explicitly interpret the column as 'integer' it does refuse to do
read.table() does not "automatically recognize a quoted integer column as 'integer'". If colClasses is not specified, it reads the entire column into a 'character' vector and then calls type.convert() on it. type.convert() does all the necessary work to determine what class the 'character' vector should be converted to. If colClasses is specified, quotes are not interpreted in non-'character' columns.
That's pretty much the definition of "automatic". The fact that this is realized by type.convert() is really an implementation detail. But there's little point in discussing the question of whether this is automatic enough. Better concentrate on the actual result.
You want scan() to allocate an 'integer' vector, and then ensure (on each read from the column in the file) that the value read is a valid 'integer' type, while interpreting quotes (which strtol does not do, so someone would have to write and test this new functionality).
Yes, I think that's where the change should go. From a first look at
scan.c:extractItem(), it seems that adapting scan() to skip quotes in
the string before calling Strtoi() would not be too invasive and would
not create a significant overhead. No string copy would even be involved
since the pointer to the beginning of the string would just have to be
increased to skip the quote character, and the null character be added a
little earlier in the string.
So this line:
INTEGER(ans)[i] = Strtoi(buffer, 10);
would just have to be changed to something like:
char *quote;
if(buffer[0] == '\"' && (quote = strchr(buffer += 1, '\"')) != NULL)
*quote = '\0';
INTEGER(ans)[i] = Strtoi(buffer, 10);
For cleaner operation, the hardcoded '\"' could be replaced with the
contents of read.table()'s quote argument.
What do R core developers think about this small modification?
So your complaint is more with scan() than read.table(). And more with Strtoi() (and therefore strtol) than scan().
The complaint is about the combination of read.table() and scan(). It has nothing to do with strtol(), which has no reason to accept quotes as it's not designed to read CSV files... Regards
so. So there is nothing wrong with read.table.ffdf (but something can be
improved about read.table). It is *not* the 'best solution [...] to rewrite
read.table.ffdf()' given that it nicely imports such data, see 4+1 ways to
do so below.
Jens Oehlschl?gel
# --- first create a csv file for demonstration
-------------------------------
require(ff)
file <- "test.csv"
path <- "c:/tmp"
n <- 1e2
d <- data.frame(x=1:n, y=shQuote(1:n))
write.csv(d, file=file.path(path,file), row.names=FALSE, quote=FALSE)
# --- how to do it with read.table.ffdf
---------------------------------------
# 1 let the parse engine ignore colClasses and hope for the best
fixedengine <- function(file, ..., colClasses=NA){
read.csv(file, ...)
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="fixedengine")
df
# 2 Suspend colClasses(=NA) for the quoted integer column only
df <- read.csv.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", NA))
df
# 3 do your own type conversion using transFUN
# after reading the problematic column as character
# Being able to inject regexps is quite powerful isn't it?
# Or error handlinig in case of varying column format!
custominterp <- function(d){
d[[2]] <- as.integer(gsub('"', '', d[[2]]))
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", "character"), FUN="read.csv", transFUN=custominterp)
df
# 4 do your own line parsing and type conversion
# Here you can even handle non-standard formats
# such as varying number of columns
customengine <- function(file, header=TRUE, col.names, colClasses=NA,
nrows=0, skip=0, fileEncoding="", comment.char = ""){
l <- scan(file, what="character", nlines=nrows+header, skip=skip,
fileEncoding=fileEncoding, comment.char = comment.char)
s <- do.call("rbind", strsplit(l, ","))
if (header){
d <- data.frame(as.integer(s[-1,1]),
as.integer(gsub('"','',s[-1,2])))
names(d) <- s[1,]
}else{
d <- data.frame(as.integer(s[,1]),
as.integer(gsub('"','',s[,2])))
}
if (!missing(col.names))
names(d) <- col.names
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="customengine")
df
# 5 use a parsing engine that can apply colClasses to quoted integers
# Unfortunately Henry Bengtson's readDataFrame does not work as a
# parse engine for read.table.ffdf because read.table.ffdf expects
# the parse engine to read successive chunks from a file connection
# while readDataFrame only accepts a filename as input file spec.
# Yes it has 'skip', but using that would reread the file from scratch
# for each chunk (O(N^2) costs)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
I think this is not the right approach -- quoting is a transport-layer feature of the CSV format, not part of the application layer. Quotes should always be interpreted away from column data before any data is handed to the application layer. (CSV does not _have_ any application layer; type information is conspicuously absent.) If quoting is incorrectly treated as a feature of the values rather than the encoding of the values, there's just going to be the same problem with datetime columns, and any other column types. So I disagree -- parsing quotes is never the column data-converter's job, it's read.table's job. Please refer to this specification of CSV: http://kanspra.org/memberdirectory.csv particularly this part: "Fields may always be delimited with double quotes. The delimiters will always be discarded." and the implementation note which follows. Other CSV specs, like RFC 4180, contain similar statements. I think the only way to comply with "always" discarding delimiters is to do it in read.table. Peter
On Fri, Oct 4, 2013 at 6:58 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
Le vendredi 04 octobre 2013 ? 07:34 -0500, Joshua Ulrich a ?crit :
On Thu, Oct 3, 2013 at 9:44 AM, Jens Oehlschl?gel <Jens.Oehlschlaegel at truecluster.com> wrote:
I agree that quoted integer columns are not the most efficient way of delivering csv-files. However, the sad reality is that one receives such formats and still needs to read the data. Therefore it is not helpful to state that one should 'consider "character" to be the correct colClass in case an integer is surrounded by quotes'. The philosophy of read.table.ffdf is delegating the actual csv-parsing to a parse engine 'similarly' parametrized like 'read.table'. It is not 'bad coding practice' - but a conscious design decision - to assume that the parse engine behaves consistently, which read.table does not yet: it automatically recognizes a quoted integer column as 'integer', but when asked to explicitly interpret the column as 'integer' it does refuse to do
read.table() does not "automatically recognize a quoted integer column as 'integer'". If colClasses is not specified, it reads the entire column into a 'character' vector and then calls type.convert() on it. type.convert() does all the necessary work to determine what class the 'character' vector should be converted to. If colClasses is specified, quotes are not interpreted in non-'character' columns.
That's pretty much the definition of "automatic". The fact that this is realized by type.convert() is really an implementation detail. But there's little point in discussing the question of whether this is automatic enough. Better concentrate on the actual result.
You want scan() to allocate an 'integer' vector, and then ensure (on each read from the column in the file) that the value read is a valid 'integer' type, while interpreting quotes (which strtol does not do, so someone would have to write and test this new functionality).
Yes, I think that's where the change should go. From a first look at
scan.c:extractItem(), it seems that adapting scan() to skip quotes in
the string before calling Strtoi() would not be too invasive and would
not create a significant overhead. No string copy would even be involved
since the pointer to the beginning of the string would just have to be
increased to skip the quote character, and the null character be added a
little earlier in the string.
So this line:
INTEGER(ans)[i] = Strtoi(buffer, 10);
would just have to be changed to something like:
char *quote;
if(buffer[0] == '\"' && (quote = strchr(buffer += 1, '\"')) != NULL)
*quote = '\0';
INTEGER(ans)[i] = Strtoi(buffer, 10);
For cleaner operation, the hardcoded '\"' could be replaced with the
contents of read.table()'s quote argument.
What do R core developers think about this small modification?
So your complaint is more with scan() than read.table(). And more with Strtoi() (and therefore strtol) than scan().
The complaint is about the combination of read.table() and scan(). It has nothing to do with strtol(), which has no reason to accept quotes as it's not designed to read CSV files... Regards
so. So there is nothing wrong with read.table.ffdf (but something can be
improved about read.table). It is *not* the 'best solution [...] to rewrite
read.table.ffdf()' given that it nicely imports such data, see 4+1 ways to
do so below.
Jens Oehlschl?gel
# --- first create a csv file for demonstration
-------------------------------
require(ff)
file <- "test.csv"
path <- "c:/tmp"
n <- 1e2
d <- data.frame(x=1:n, y=shQuote(1:n))
write.csv(d, file=file.path(path,file), row.names=FALSE, quote=FALSE)
# --- how to do it with read.table.ffdf
---------------------------------------
# 1 let the parse engine ignore colClasses and hope for the best
fixedengine <- function(file, ..., colClasses=NA){
read.csv(file, ...)
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="fixedengine")
df
# 2 Suspend colClasses(=NA) for the quoted integer column only
df <- read.csv.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", NA))
df
# 3 do your own type conversion using transFUN
# after reading the problematic column as character
# Being able to inject regexps is quite powerful isn't it?
# Or error handlinig in case of varying column format!
custominterp <- function(d){
d[[2]] <- as.integer(gsub('"', '', d[[2]]))
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", "character"), FUN="read.csv", transFUN=custominterp)
df
# 4 do your own line parsing and type conversion
# Here you can even handle non-standard formats
# such as varying number of columns
customengine <- function(file, header=TRUE, col.names, colClasses=NA,
nrows=0, skip=0, fileEncoding="", comment.char = ""){
l <- scan(file, what="character", nlines=nrows+header, skip=skip,
fileEncoding=fileEncoding, comment.char = comment.char)
s <- do.call("rbind", strsplit(l, ","))
if (header){
d <- data.frame(as.integer(s[-1,1]),
as.integer(gsub('"','',s[-1,2])))
names(d) <- s[1,]
}else{
d <- data.frame(as.integer(s[,1]),
as.integer(gsub('"','',s[,2])))
}
if (!missing(col.names))
names(d) <- col.names
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="customengine")
df
# 5 use a parsing engine that can apply colClasses to quoted integers
# Unfortunately Henry Bengtson's readDataFrame does not work as a
# parse engine for read.table.ffdf because read.table.ffdf expects
# the parse engine to read successive chunks from a file connection
# while readDataFrame only accepts a filename as input file spec.
# Yes it has 'skip', but using that would reread the file from scratch
# for each chunk (O(N^2) costs)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On Fri, Oct 4, 2013 at 9:20 AM, Peter Meilstrup
<peter.meilstrup at gmail.com> wrote:
I think this is not the right approach -- quoting is a transport-layer feature of the CSV format, not part of the application layer. Quotes should always be interpreted away from column data before any data is handed to the application layer. (CSV does not _have_ any application layer; type information is conspicuously absent.) If quoting is incorrectly treated as a feature of the values rather than the encoding of the values, there's just going to be the same problem with datetime columns, and any other column types. So I disagree -- parsing quotes is never the column data-converter's job, it's read.table's job. Please refer to this specification of CSV:
Wrong URL -- I was checking my assumptions by googling examples of CSV files in the wild and seeing what my spreadsheet programs did with them (which practice would settle several errant beliefs in this thread.) The CSV specs I was referring to are: http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm http://tools.ietf.org/html/rfc4180
particularly this part: "Fields may always be delimited with double quotes. The delimiters will always be discarded." and the implementation note which follows. Other CSV specs, like RFC 4180, contain similar statements. I think the only way to comply with "always" discarding delimiters is to do it in read.table. Peter On Fri, Oct 4, 2013 at 6:58 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
Le vendredi 04 octobre 2013 ? 07:34 -0500, Joshua Ulrich a ?crit :
On Thu, Oct 3, 2013 at 9:44 AM, Jens Oehlschl?gel <Jens.Oehlschlaegel at truecluster.com> wrote:
I agree that quoted integer columns are not the most efficient way of delivering csv-files. However, the sad reality is that one receives such formats and still needs to read the data. Therefore it is not helpful to state that one should 'consider "character" to be the correct colClass in case an integer is surrounded by quotes'. The philosophy of read.table.ffdf is delegating the actual csv-parsing to a parse engine 'similarly' parametrized like 'read.table'. It is not 'bad coding practice' - but a conscious design decision - to assume that the parse engine behaves consistently, which read.table does not yet: it automatically recognizes a quoted integer column as 'integer', but when asked to explicitly interpret the column as 'integer' it does refuse to do
read.table() does not "automatically recognize a quoted integer column as 'integer'". If colClasses is not specified, it reads the entire column into a 'character' vector and then calls type.convert() on it. type.convert() does all the necessary work to determine what class the 'character' vector should be converted to. If colClasses is specified, quotes are not interpreted in non-'character' columns.
That's pretty much the definition of "automatic". The fact that this is realized by type.convert() is really an implementation detail. But there's little point in discussing the question of whether this is automatic enough. Better concentrate on the actual result.
You want scan() to allocate an 'integer' vector, and then ensure (on each read from the column in the file) that the value read is a valid 'integer' type, while interpreting quotes (which strtol does not do, so someone would have to write and test this new functionality).
Yes, I think that's where the change should go. From a first look at
scan.c:extractItem(), it seems that adapting scan() to skip quotes in
the string before calling Strtoi() would not be too invasive and would
not create a significant overhead. No string copy would even be involved
since the pointer to the beginning of the string would just have to be
increased to skip the quote character, and the null character be added a
little earlier in the string.
So this line:
INTEGER(ans)[i] = Strtoi(buffer, 10);
would just have to be changed to something like:
char *quote;
if(buffer[0] == '\"' && (quote = strchr(buffer += 1, '\"')) != NULL)
*quote = '\0';
INTEGER(ans)[i] = Strtoi(buffer, 10);
For cleaner operation, the hardcoded '\"' could be replaced with the
contents of read.table()'s quote argument.
What do R core developers think about this small modification?
So your complaint is more with scan() than read.table(). And more with Strtoi() (and therefore strtol) than scan().
The complaint is about the combination of read.table() and scan(). It has nothing to do with strtol(), which has no reason to accept quotes as it's not designed to read CSV files... Regards
so. So there is nothing wrong with read.table.ffdf (but something can be
improved about read.table). It is *not* the 'best solution [...] to rewrite
read.table.ffdf()' given that it nicely imports such data, see 4+1 ways to
do so below.
Jens Oehlschl?gel
# --- first create a csv file for demonstration
-------------------------------
require(ff)
file <- "test.csv"
path <- "c:/tmp"
n <- 1e2
d <- data.frame(x=1:n, y=shQuote(1:n))
write.csv(d, file=file.path(path,file), row.names=FALSE, quote=FALSE)
# --- how to do it with read.table.ffdf
---------------------------------------
# 1 let the parse engine ignore colClasses and hope for the best
fixedengine <- function(file, ..., colClasses=NA){
read.csv(file, ...)
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="fixedengine")
df
# 2 Suspend colClasses(=NA) for the quoted integer column only
df <- read.csv.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", NA))
df
# 3 do your own type conversion using transFUN
# after reading the problematic column as character
# Being able to inject regexps is quite powerful isn't it?
# Or error handlinig in case of varying column format!
custominterp <- function(d){
d[[2]] <- as.integer(gsub('"', '', d[[2]]))
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
colClasses=c("integer", "character"), FUN="read.csv", transFUN=custominterp)
df
# 4 do your own line parsing and type conversion
# Here you can even handle non-standard formats
# such as varying number of columns
customengine <- function(file, header=TRUE, col.names, colClasses=NA,
nrows=0, skip=0, fileEncoding="", comment.char = ""){
l <- scan(file, what="character", nlines=nrows+header, skip=skip,
fileEncoding=fileEncoding, comment.char = comment.char)
s <- do.call("rbind", strsplit(l, ","))
if (header){
d <- data.frame(as.integer(s[-1,1]),
as.integer(gsub('"','',s[-1,2])))
names(d) <- s[1,]
}else{
d <- data.frame(as.integer(s[,1]),
as.integer(gsub('"','',s[,2])))
}
if (!missing(col.names))
names(d) <- col.names
d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10,
FUN="customengine")
df
# 5 use a parsing engine that can apply colClasses to quoted integers
# Unfortunately Henry Bengtson's readDataFrame does not work as a
# parse engine for read.table.ffdf because read.table.ffdf expects
# the parse engine to read successive chunks from a file connection
# while readDataFrame only accepts a filename as input file spec.
# Yes it has 'skip', but using that would reread the file from scratch
# for each chunk (O(N^2) costs)
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
-- Joshua Ulrich | about.me/joshuaulrich FOSS Trading | www.fosstrading.com
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel