read columns of quoted numbers as factors

8 messages · Bernardo Rangel Tura, Mike Marchywka, James Hirschorn +3 more

#
Suppose I have a data file (possibly with a huge number of columns), where the 
columns with factors are coded as "1", "2", "3", etc ... The default behavior of 
read.table is to convert these columns to integer vectors. 

Is there a way to get read.table to recognize that columns of quoted numbers 
represent factors (while unquoted numbers are interpreted as integers), without 
explicitly setting them with colClasses ?
#
On Mon, 2010-10-04 at 09:39 -0700, james hirschorn wrote:
Hi James,

I think you can solve your problem using the colClasses option of the
read.table command, something like this:

read.table('name.of.table', colClasses = c(rep('integer', 30), rep('numeric', 5)))  # and so on for the remaining columns
#
On Oct 4, 2010, at 18:39 , james hirschorn wrote:

            
I don't think there's a simple way, because the modus operandi of read.table is to read everything as character and then see whether it can be converted to numeric, and at that point any quotes will have been lost.

One possibility, somewhat dependent on the exact file format, would be to temporarily set quote="", see which columns contain quote characters, and, on a second pass, read those columns as factors, using a computed colClasses argument. It will break down if you have space-separated columns with quoted multi-word strings, though.
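
A minimal sketch of the two-pass idea above, assuming a whitespace-separated file with a header row and no quoted multi-word strings (the case where it breaks down). The helper name read_quoted_as_factor and the sample data are hypothetical, made up for illustration:

```r
# Hypothetical helper (not part of base R): columns whose fields are all
# quoted digits, such as "1" or "23", are read as factors.
read_quoted_as_factor <- function(file, header = TRUE, sep = "") {
  # Pass 1: switch off quote handling so the quote characters survive,
  # and keep every field as a character string.
  raw <- read.table(file, header = header, sep = sep, quote = "",
                    colClasses = "character")
  # Flag a column when every non-missing field looks like "digits".
  is_quoted_num <- vapply(raw, function(x) {
    x <- x[!is.na(x)]
    length(x) > 0 && all(grepl('^"[0-9]+"$', x))
  }, logical(1))
  # Pass 2: normal read; NA in colClasses means "let read.table decide".
  classes <- rep(NA_character_, length(is_quoted_num))
  classes[is_quoted_num] <- "factor"
  read.table(file, header = header, sep = sep, colClasses = classes)
}

# Made-up sample file for illustration.
tmp <- tempfile()
writeLines(c('"id" "grp" "score"',
             '1 "1" 2.5',
             '2 "2" 3.5',
             '3 "1" 4.5'), tmp)
DF <- read_quoted_as_factor(tmp)
str(DF)  # id is integer, grp is a factor, score is numeric
```

The file is read twice, so for very large files the first pass could be limited with nrows, at the risk of misclassifying a column whose later rows differ.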

  
    
#
----------------------------------------
While this specific example may or may not lend itself to a solution within R,
I would just mention that it is not a faux pas to modify your data file
with something like sed or awk prior to feeding it to some program like R.
Quotes, spaces, commas, etc., may be something that the target app can handle
or it may just be easier to change the format with a familiar tool designed
for that.
#
Yes, your solution of setting quote="" would read the multi-word strings 
incorrectly. A more complicated version of your solution should work: First 
check which columns are identified as strings, and then apply your solution to 
the remaining columns.

I'm a newbie at R, but it seems to me that there is a "logical inconsistency" in 
R: write.table puts quotes around numbers when they form a column of factors, 
but does not put quotes for a column of integers. Since read.table is the "dual" 
of write.table it seems that it should treat quoted and unquoted columns 
differently, analogously to write.table. However, there does not even seem to be 
an option to make read.table behave analogously.
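
A small round trip illustrating the asymmetry described here. The data frame and the temporary file are made up for illustration:

```r
# A data frame with one integer column and one factor column of numbers.
df <- data.frame(nums = 1:3, facs = factor(c(10, 20, 30)))

tmp <- tempfile()
write.table(df, file = tmp, row.names = FALSE)
cat(readLines(tmp), sep = "\n")  # the factor column is written with quotes

# Reading the file back with default settings drops the distinction.
back <- read.table(tmp, header = TRUE)
sapply(df, class)    # nums: integer, facs: factor
sapply(back, class)  # nums: integer, facs: integer
```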


----- Original Message ----
From: peter dalgaard <pdalgd at gmail.com>
To: james hirschorn <j_hirschorn at yahoo.com>
Cc: r-help at r-project.org
Sent: Tue, October 5, 2010 7:25:52 AM
Subject: Re: [R] read columns of quoted numbers as factors
#
On Oct 5, 2010, at 8:41 PM, james hirschorn wrote:

            
Factors are internally represented as positive integers, but have a  
separate "layer" of their levels and labels. What I suspect you are  
seeing and calling "numbers" are the character-valued labels.

 > write.table(data.frame(nums = -1:-5, facs = factor(-1:-5)), file = "", row.names = FALSE)
"nums" "facs"
-1 "-1"
-2 "-2"
-3 "-3"
-4 "-4"
-5 "-5"

That does not seem at all "logically inconsistent" to me.
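
The two layers can be inspected directly. A short illustration using the same factor as in the example above (the variable names are arbitrary):

```r
f <- factor(-1:-5)

codes  <- as.integer(f)                # internal positive integer codes
labels <- levels(f)                    # character-valued labels
values <- as.integer(as.character(f))  # the original numbers, via the labels

codes   # 5 4 3 2 1
labels  # "-5" "-4" "-3" "-2" "-1"
values  # -1 -2 -3 -4 -5
```

Note that as.integer(f) returns the codes, not the numbers shown by print(f), which is why the conversion goes through as.character().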
#
On Mon, Oct 4, 2010 at 12:39 PM, james hirschorn <j_hirschorn at yahoo.com> wrote:
Although it's a bit messy, it's nevertheless only a few lines of code to
transform the quote-and-digit columns to non-numeric, read them in and
transform back. For example, if ! does not appear in the file we could
insert ! characters into the quote-and-digit columns and remove them
afterwards:

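L <- readLines("myfile.dat")
L2 <- gsub('"(\\d+)"', "!\\1", L) # insert ! in place of the quotes
# stringsAsFactors = TRUE is needed in R >= 4.0.0, where the default became FALSE
DF <- read.table(textConnection(L2), header = TRUE, stringsAsFactors = TRUE)

# remove ! (every factor column is processed; harmless for genuine string columns)
ix <- sapply(DF, is.factor)
DF[ix] <- lapply(DF[ix], function(x) factor(gsub("!", "", x)))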

str(DF)
#
On 10/06/2010 02:41 AM, james hirschorn wrote:
Probably more painful than that if column separators can appear in
strings. The best I can think of involves trying to reread the columns
that get classified as numeric with colClasses="numeric" and see if they
fail. A general solution likely requires changing scan() at C-level.

Yes, and far from the only such case in R. (Even more annoying to my
eyes is that factor levels get reordered alphabetically, so write.table
is really not an option for storage of data frames anyway).
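
A small illustration of the level reordering, with made-up data. It assumes stringsAsFactors = TRUE is passed explicitly, which is needed in R >= 4.0.0 to get a factor back at all:

```r
# A factor whose levels are in a deliberate, non-alphabetical order.
df <- data.frame(size = factor(c("small", "large", "medium"),
                               levels = c("small", "medium", "large")))

tmp <- tempfile()
write.table(df, file = tmp, row.names = FALSE)
back <- read.table(tmp, header = TRUE, stringsAsFactors = TRUE)

levels(df$size)    # "small" "medium" "large"
levels(back$size)  # "large" "medium" "small"
```

The values survive the round trip, but the level order does not, so anything that depends on that order has to be restored by hand after reading.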

However, the quoting of factor levels on output from write.table is not
happening to distinguish numbers from character strings. Rather, it is
for potentially multi-word level names.