
Convert COLON separated format

5 messages · Noah Silverman, Hasan Diwan, Rui Barradas +2 more

#
I have a bunch of data sets that were created for the libsvm tool.  They are in "colon separated sparse format".

i.e.

1  5:1  27:3  345:10

is a row with the label "1" that has values only in columns 5, 27, and 345.

I want to read these into a data.frame in R.  

Is there a simple way to do this?

--
Noah Silverman, M.S.
UCLA Department of Statistics
8117 Math Sciences Building
Los Angeles, CA 90095
#
Hello,

Here's a function that doesn't do it all but might help.

fun <- function(x){
     x1 <- unlist(strsplit(x, " "))
     x2 <- x1[nchar(x1) > 0]
     i <- as.integer(x2[1])
     x3 <- unlist(strsplit(x2[-1], ":"))
     j <- as.integer(x3[rep(c(TRUE, FALSE), length(x3)/2)])
     y <- numeric(max(j))
     y[j] <- as.numeric(x3[rep(c(FALSE, TRUE), length(x3)/2)])
     list(row = i, line = y)
}

x <- "1  5:1  27:3  345:10"
fun(x)

If you know that your labels, i.e., row numbers are consecutive, have 
the function return just 'y', not a list.
Then use readLines to read the file in and lapply fun to it. Something like

ln <- readLines(filename)
lst <- lapply(ln, fun)

Then you'll have another problem: the lengths of the results. They will 
generally not all be the same, so in order to make a data.frame or matrix 
you'll need extra work. Try the code above and say whether it's on the right track.
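
One possible form of that extra work, as a rough sketch: pad every vector with zeros up to the longest length and bind the rows together. The helper name pad_rows and the small example list are made up for illustration; in practice the list would hold the 'y' vectors returned by fun.

```r
# Hypothetical helper: pad each numeric vector with trailing zeros
# so that all rows have the same length, then bind them into a matrix.
pad_rows <- function(rows) {
  width <- max(lengths(rows))
  padded <- lapply(rows, function(r) c(r, numeric(width - length(r))))
  do.call(rbind, padded)
}

# Made-up stand-in for the per-line vectors produced by fun()
rows <- list(c(0, 0, 3), c(1), c(0, 2, 0, 0, 5))

m  <- pad_rows(rows)      # 3 x 5 numeric matrix
df <- as.data.frame(m)    # columns V1..V5
```

Padding with zeros assumes that a missing column means a zero value, which is the usual reading of the libsvm sparse format.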

Also, take a look at package Matrix. It's a recommended package and it 
implements sparse matrices.
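
As a rough illustration of how that might look, here is a sketch that uses the line number as the row index and keeps the labels in a separate vector. It assumes the Matrix package is installed and uses Matrix::sparseMatrix; the two example lines and all variable names are invented for the example.

```r
library(Matrix)

# Made-up example lines; in practice these would come from readLines()
lns <- c("1  5:1  27:3  345:10",
         "0  2:4  27:1")

# Split each line on whitespace; the first field is the label.
fields <- strsplit(trimws(lns), "[ \t]+")
labels <- as.numeric(vapply(fields, `[`, "", 1))

# The remaining fields are "col:value" pairs; record which line each came from.
pairs   <- lapply(fields, `[`, -1)
row_idx <- rep(seq_along(pairs), lengths(pairs))
kv      <- do.call(rbind, strsplit(unlist(pairs), ":", fixed = TRUE))
cols    <- as.integer(kv[, 1])

# One row per input line, one column per feature index.
x <- sparseMatrix(i = row_idx, j = cols, x = as.numeric(kv[, 2]),
                  dims = c(length(lns), max(cols)))
```

Keeping labels apart from x means lines may share a label without colliding on the same matrix row.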

Hope this helps,

Rui Barradas

On 09-10-2012 05:56, Noah Silverman wrote:
#
If you want something that is fast, read the file in, strip off the
colon/data, write it out to a temp and then read it back in.  Here is
a 355K line file:
[1] 355212
   user  system elapsed
   0.72    0.00    0.74
[1] "1  5  27  345" "1  5  27  345" "1  5  27  345" "1  5  27  345" "1  5  27  345" "1  5  27  345"
   user  system elapsed
   1.08    0.02    1.13
[1] 355212      4
  V1 V2 V3  V4
1  1  5 27 345
2  1  5 27 345
3  1  5 27 345
4  1  5 27 345
5  1  5 27 345
6  1  5 27 345

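
The commands that produced this output do not appear in the message. A minimal sketch of the approach described (strip the colon/data, write to a temp file, read it back) might look like the following. The regular expression, the six-line stand-in input, and the variable names are assumptions, not the original code.

```r
# Stand-in for readLines("yourfile"): six identical libsvm-style lines
lns <- rep("1  5:1  27:3  345:10", 6)

# Drop every ":value" suffix, leaving the label and the column indices
stripped <- gsub(":[^ \t]+", "", lns)

# Write the stripped lines to a temporary file and read them back
tmp <- tempfile()
writeLines(stripped, tmp)
# fill = TRUE pads rows that have fewer fields with NA
dat <- read.table(tmp, fill = TRUE)
unlink(tmp)
```

Note that this keeps only the label and the column indices; the values after each colon are discarded.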
On Tue, Oct 9, 2012 at 12:56 AM, Noah Silverman <noahsilverman at ucla.edu> wrote:

  
    
#
Matrix::spMatrix can help.

Read your data file with lns <- readLines("fileName") to get
something like
   lns <- c("1 5:15 7:17 9:19",
                 "2 2:22 8:28",
                 "4 6:46")
Then use a function like the following that reformats the
data to the i=row,j=col,x=value vectors that spMatrix can use.
   f <- function(lns, nrow=NULL, ncol=NULL)
   {
      # expect lines of the form "rowNum<whiteSpace>colNum:value[<whiteSpace>colNum:value ...]"
      triples <- unlist(lapply(strsplit(lns, "[ \t]+"), function(ln)paste(sep=":",ln[1],ln[-1])))
      triples <- strsplit(triples, ":")
      if (any(which <- vapply(triples, length, 0) != 3)) stop("formatting error")
      ijx <- matrix(as.numeric(unlist(triples)), ncol=3, byrow=TRUE)
      if (is.null(nrow)) nrow <- max(ijx[,1])
      if (is.null(ncol)) ncol <- max(ijx[,2])
      # Matrix:: prefix so this works without library(Matrix) being attached
      Matrix::spMatrix(nrow=nrow, ncol=ncol, i=ijx[,1], j=ijx[,2], x=ijx[,3])
   }
Use it as
   f(lns)
4 x 9 sparse Matrix of class "dgTMatrix"

[1,] .  . . . 15  . 17  . 19
[2,] . 22 . .  .  .  . 28  .
[3,] .  . . .  .  .  .  .  .
[4,] .  . . .  . 46  .  .  .

or, if you know the number of rows and columns, tell it:
   f(lns, nrow=10, ncol=10)
10 x 10 sparse Matrix of class "dgTMatrix"

 [1,] .  . . . 15  . 17  . 19 .
 [2,] . 22 . .  .  .  . 28  . .
 [3,] .  . . .  .  .  .  .  . .
 [4,] .  . . .  . 46  .  .  . .
 [5,] .  . . .  .  .  .  .  . .
 [6,] .  . . .  .  .  .  .  . .
 [7,] .  . . .  .  .  .  .  . .
 [8,] .  . . .  .  .  .  .  . .
 [9,] .  . . .  .  .  .  .  . .
[10,] .  . . .  .  .  .  .  . .

Use as.matrix() on its output if you don't want to continue
using the sparse matrix format.
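
A small sketch of that conversion, for illustration only: it rebuilds the 4 x 9 example with Matrix::sparseMatrix (used here in place of spMatrix) and then turns it into an ordinary matrix and a data.frame. The variable names are made up.

```r
library(Matrix)

# Same non-zero entries as the 4 x 9 example above
sm <- sparseMatrix(i = c(1L, 1L, 1L, 2L, 2L, 4L),
                   j = c(5L, 7L, 9L, 2L, 8L, 6L),
                   x = c(15, 17, 19, 22, 28, 46),
                   dims = c(4L, 9L))

dense <- as.matrix(sm)         # ordinary numeric matrix, zeros filled in
df    <- as.data.frame(dense)  # data.frame with columns V1..V9
```

For data with many columns the dense form can be very large, so it may be worth staying in the sparse format as long as the downstream functions accept it.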

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com