I have a bunch of data sets that were created for the libsvm tool. They are in "colon-separated sparse format". For example,

1 5:1 27:3 345:10

is a row with the label "1" that has values only in columns 5, 27, and 345. I want to read these into a data.frame in R. Is there a simple way to do this?

--
Noah Silverman, M.S.
UCLA Department of Statistics
8117 Math Sciences Building
Los Angeles, CA 90095
Convert COLON separated format
5 messages · Noah Silverman, Hasan Diwan, Rui Barradas +2 more
Hello,
Here's a function that doesn't do it all but might help.
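fun <- function(x){
    # split the line on spaces and drop empty fields
    x1 <- unlist(strsplit(x, " "))
    x2 <- x1[nchar(x1) > 0]
    # the first field is the label
    i <- as.integer(x2[1])
    # the remaining fields alternate: column index, value
    x3 <- unlist(strsplit(x2[-1], ":"))
    j <- as.integer(x3[rep(c(TRUE, FALSE), length(x3)/2)])
    # dense vector, long enough to hold the largest column index
    y <- numeric(max(j))
    y[j] <- as.numeric(x3[rep(c(FALSE, TRUE), length(x3)/2)])
    list(row = i, line = y)
}

x <- "1 5:1 27:3 345:10"
fun(x)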
If you know that your labels, i.e., row numbers are consecutive, have
the function return just 'y', not a list.
Then use readLines to read the file in and lapply fun to it. Something like
ln <- readLines(filename)
lst <- lapply(ln, fun)
Then you'll have another problem: the lengths of the resulting vectors.
They won't all be the same, so to make a data.frame or matrix you'll need
extra work. Try the code above and let me know whether it's on the right track.
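One possible form of that "extra work", as a rough sketch rather than a finished solution: pad every row to the largest column index seen anywhere in the file, then bind the rows into a data.frame. The helper names parse_line and lines_to_df are made up for this illustration, and the code assumes every line has a numeric label followed by at least one index:value pair.

```r
# Parse one libsvm-style line into its label, column indices, and values.
# (parse_line is a hypothetical helper, not part of any package.)
parse_line <- function(x) {
  fields <- strsplit(trimws(x), "[ \t]+")[[1]]
  pairs  <- strsplit(fields[-1], ":", fixed = TRUE)
  list(label = as.numeric(fields[1]),
       col   = as.integer(vapply(pairs, `[`, "", 1)),
       val   = as.numeric(vapply(pairs, `[`, "", 2)))
}

# Pad every row to a common width (by default the largest column index
# found in any line) and return a dense data.frame.
lines_to_df <- function(lines, ncol = NULL) {
  parsed <- lapply(lines, parse_line)
  if (is.null(ncol)) ncol <- max(unlist(lapply(parsed, `[[`, "col")))
  m <- matrix(0, nrow = length(parsed), ncol = ncol)
  for (r in seq_along(parsed)) {
    m[r, parsed[[r]]$col] <- parsed[[r]]$val
  }
  data.frame(label = vapply(parsed, `[[`, 0, "label"), m)
}

ln <- c("1 5:1 27:3 345:10", "0 2:7 5:4")
df <- lines_to_df(ln)
dim(df)
```

The result is dense, so with many rows and large column indices it can use a lot of memory; in that case a sparse representation is probably the better choice.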
Also, take a look at package Matrix. It's a recommended package and it
implements sparse matrices.
Hope this helps,
Rui Barradas
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
If you want something that is fast, read the file in, strip off the colon/data, write it out to a temp file, and then read it back in. Here is a 355K-line file:

temp <- tempfile()
input <- readLines('/temp/colon.txt')
length(input)
[1] 355212
system.time(input <- gsub("(:[0-9]+)", "", input))
   user  system elapsed
   0.72    0.00    0.74
head(input)
[1] "1 5 27 345" "1 5 27 345" "1 5 27 345" "1 5 27 345" "1 5 27 345"
[6] "1 5 27 345"
writeLines(input, temp)
system.time(newInput <- read.table(temp))
   user  system elapsed
   1.08    0.02    1.13
dim(newInput)
[1] 355212      4
head(newInput)
  V1 V2 V3  V4
1  1  5 27 345
2  1  5 27 345
3  1  5 27 345
4  1  5 27 345
5  1  5 27 345
6  1  5 27 345
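Two things are worth noting about this approach: it keeps only the column indices (the values after each colon are discarded), and the test file has the same number of fields on every line. Below is a small in-memory variation for ragged lines, offered only as a sketch. The column names label and idx1, idx2, ... are invented for the example, and the broader pattern ":[^ \t]+" (which also removes non-integer values) is an assumption that goes beyond the original regular expression.

```r
input <- c("1 5:1 27:3 345:10", "0 2:7 5:4")

# Drop everything from each colon up to the next blank, leaving label + indices.
stripped <- gsub(":[^ \t]+", "", input)

# The widest line decides how many columns read.table should create.
width <- max(lengths(strsplit(stripped, "[ \t]+")))

# fill = TRUE pads the shorter lines with NA instead of raising an error.
idx <- read.table(text = stripped, fill = TRUE,
                  col.names = c("label", paste0("idx", seq_len(width - 1))))
idx
```

Reading from text = avoids the temporary file for small inputs; for a file of several hundred thousand lines the tempfile route shown above may well be the faster one.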
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
Matrix::spMatrix can help.
Read your data file with lns <- readLines("fileName") to get
something like
lns <- c("1 5:15 7:17 9:19",
"2 2:22 8:28",
"4 6:46")
Then use a function like the following that reformats the
data to the i=row,j=col,x=value vectors that spMatrix can use.
library(Matrix)  # provides spMatrix

f <- function(lns, nrow=NULL, ncol=NULL)
{
    # expect lines of the form "rowNum<whiteSpace>colNum:value[<whiteSpace>colNum:value ...]"
    # prefix every colNum:value pair with its row number to get "row:col:value" strings
    triples <- unlist(lapply(strsplit(lns, "[ \t]+"),
                             function(ln) paste(sep=":", ln[1], ln[-1])))
    triples <- strsplit(triples, ":")
    if (any(vapply(triples, length, 0) != 3)) stop("formatting error")
    # one row per nonzero entry: columns are i (row), j (column), x (value)
    ijx <- matrix(as.numeric(unlist(triples)), ncol=3, byrow=TRUE)
    if (is.null(nrow)) nrow <- max(ijx[,1])
    if (is.null(ncol)) ncol <- max(ijx[,2])
    spMatrix(nrow=nrow, ncol=ncol, i=ijx[,1], j=ijx[,2], x=ijx[,3])
}
Use it as

f(lns)
4 x 9 sparse Matrix of class "dgTMatrix"

[1,] .  . . . 15  . 17  . 19
[2,] . 22 . .  .  .  . 28  .
[3,] .  . . .  .  .  .  .  .
[4,] .  . . .  . 46  .  .  .

or, if you know the number of rows and columns, tell it:

f(lns, 10, 10)
10 x 10 sparse Matrix of class "dgTMatrix"

 [1,] .  . . . 15  . 17  . 19 .
 [2,] . 22 . .  .  .  . 28  . .
 [3,] .  . . .  .  .  .  .  . .
 [4,] .  . . .  . 46  .  .  . .
 [5,] .  . . .  .  .  .  .  . .
 [6,] .  . . .  .  .  .  .  . .
 [7,] .  . . .  .  .  .  .  . .
 [8,] .  . . .  .  .  .  .  . .
 [9,] .  . . .  .  .  .  .  . .
[10,] .  . . .  .  .  .  .  . .

Use as.matrix() on its output if you don't want to continue using the sparse matrix format.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
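The function above reads the first field of each line as a row number, whereas the original question describes it as a label. Here is a sketch of a variant that keeps the labels in a separate vector and uses the position of each line as its row. The name parse_libsvm is hypothetical, and the sketch assumes well-formed index:value pairs; the conversion at the end runs only if the Matrix package is installed.

```r
# Split libsvm-style lines into labels plus (i, j, x) triplets,
# where i is the position of the line in the input.
parse_libsvm <- function(lns) {
  fields <- strsplit(trimws(lns), "[ \t]+")
  labels <- as.numeric(vapply(fields, `[`, "", 1))
  n_feat <- lengths(fields) - 1L               # index:value pairs per line
  pairs  <- strsplit(as.character(unlist(lapply(fields, `[`, -1))),
                     ":", fixed = TRUE)
  if (any(lengths(pairs) != 2)) stop("formatting error")
  jx <- matrix(as.numeric(unlist(pairs)), ncol = 2, byrow = TRUE)
  list(labels = labels,
       i = rep(seq_along(lns), times = n_feat),
       j = jx[, 1],
       x = jx[, 2])
}

lns <- c("1 5:15 7:17 9:19", "0 2:22 8:28", "1 6:46")
p <- parse_libsvm(lns)

if (requireNamespace("Matrix", quietly = TRUE)) {
  m <- Matrix::sparseMatrix(i = p$i, j = p$j, x = p$x,
                            dims = c(length(lns), max(p$j)))
  # Dense data.frame with the label in the first column, as asked for
  # in the original question; only sensible if the data fit in memory.
  df <- data.frame(label = p$labels, as.matrix(m))
}
```

Keeping the labels apart from the matrix means repeated labels (for example several rows of class 1) do not collide, which they would if the label were used as the row index.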