
"unsparse" a vector

10 messages · Bert Gunter, Petr Savicky, William Dunlap +1 more

#
Suppose I have a vector of strings:
c("A1B2","A3C4","B5","C6A7B8")
[1] "A1B2"   "A3C4"   "B5"     "C6A7B8"
where each string is a sequence of <column><value> pairs
(fixed width, in this example both value and name are 1 character, in
reality the column name is 6 chars and value is 2 digits).
I need to convert it to a data frame:
data.frame(A=c(1,3,0,7),B=c(2,0,5,8),C=c(0,4,0,6))
  A B C
1 1 2 0
2 3 0 4
3 0 5 0
4 7 8 6

how do I do that?
thanks.
#
To be clear, I can do that with nested for loops:

v <- c("A1B2","A3C4","B5","C6A7B8")
l <- strsplit(gsub("(.{2})","\\1,",v),",")
d <- data.frame(A=vector(length=4,mode="integer"),
                B=vector(length=4,mode="integer"),
                C=vector(length=4,mode="integer"))

for (i in 1:length(l)) {
  l1 <- l[[i]]
  for (j in 1:length(l1)) {
    d[[substring(l1[j],1,1)]][i] <- as.numeric(substring(l1[j],2,2))
  }
}


but I am afraid that handling 1,000,000 (=length(unlist(l))) strings in
a loop will kill me.

#
I suspect there are cleverer ways to do it, especially using packages
like stringr and gsubfn, but using base tools, you can hack it without
too much effort:

?gregexpr

is the key. To get started (x is your example vector of character strings; a call along the lines of gregexpr("..", x), matching each 2-character pair, gives the following):
[[1]]
[1] 1 3
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 1 3
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 1
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

[[4]]
[1] 1 3 5
attr(,"match.length")
[1] 2 2 2
attr(,"useBytes")
[1] TRUE

The components of the result give you indices of the start and stop
values for each "entry" in your final matrix/data frame. You can thus
lapply() on this list to get the column name-value pairs substrings
and decode them.

Alternatively, if all your names are really 6 characters and all your
values are really two digits,
?nchar and ?substring will get you the name-value substrings directly.
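A minimal sketch of that fixed-width route, assuming the toy widths from the example (1-character names, 1-digit values; the real data would use a chunk width of 8 instead of 2):

```r
x <- c("A1B2", "A3C4", "B5", "C6A7B8")
w <- 2                                   # name width + value width per pair
starts <- lapply(nchar(x), function(n) seq(1, n, by = w))
pairs  <- mapply(function(s, st) substring(s, st, st + w - 1), x, starts)
# pairs[[4]] is c("C6", "A7", "B8"); substring(., 1, 1) then gives the
# column names and substring(., 2, 2) the values
```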

I leave the niggling details to you (or to other helpeRs -- especially
those who can suggest a more elegant approach).

-- Bert
On Wed, Feb 8, 2012 at 12:34 PM, Sam Steingold <sds at gnu.org> wrote:

#
Sam:
On Wed, Feb 8, 2012 at 12:56 PM, Sam Steingold <sds at gnu.org> wrote:
Well, that depends on how "dead" you can stand being. Try it with a
1000 entry subvector and see how bad it gets. A few extra minutes of
computing time to save many more minutes of programming time seems a
reasonable tradeoff. Alternatively, see ?compile to compile your
solution into bytecode, which might give a few fold reduction in time
(or not). The calculation could also be parallelized using the
parallel package, I'm sure.
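A sketch of the byte-compilation route (assuming your loop is wrapped in a function, here called f purely for illustration):

```r
library(compiler)        # part of the standard distribution
fc <- cmpfun(f)          # byte-compile the loop-based function
# system.time(fc(v))     # then compare against system.time(f(v))
```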

-- Bert

#
in this case, many hours of computing time.
Error: object 'compile' not found
#
Sorry, it's in package compiler, now part of the standard distro. My bad.

-- Bert
On Wed, Feb 8, 2012 at 1:23 PM, Sam Steingold <sds at gnu.org> wrote:

#
On Wed, Feb 08, 2012 at 03:56:12PM -0500, Sam Steingold wrote:
Hi.

The inner loop may be vectorized.

  d <- as.matrix(d)
  for (i in 1:length(l)) {
    l1 <- l[[i]]
    d[i, substring(l1,1,1)] <- as.numeric(substring(l1,2,2))
  }

    A B C
  1 1 2 0
  2 3 0 4
  3 0 5 0
  4 7 8 6

If the number of rows of d is large, a matrix is probably
more efficient.

Hope this helps.

Petr Savicky.
#
When compute time is important it often helps
to loop over columns instead of over rows (assuming
there are fewer columns than rows, the usual
case).  E.g., putting your code into a function f0
and the column-looping version into f1:

f0 <- function(v) {
  n <- length(v)
  d <- data.frame(A=vector(length=n,mode="integer"),
                  B=vector(length=n,mode="integer"),
                  C=vector(length=n,mode="integer"))
  l <- strsplit(gsub("(.{2})","\\1,",v),",")
  for (i in seq_along(l)) {
    l1 <- l[[i]]
    for (j in seq_along(l1)) {
      d[[substring(l1[j],1,1)]][i] <- as.integer(substring(l1[j],2,2))
    }
  }
  d
}

f1 <- function(v) {
  n <- length(v)
  letters <- c("A", "B", "C")
  names(letters) <- letters
  data.frame(lapply(letters,
              function(letter) {
                  retval <- integer(n)
                  hasLetter <- grepl(letter, v)
                  retval[hasLetter] <- as.integer(
                     gsub(sprintf("^.*%s([[:digit:]]+).*$", letter), 
                          "\\1", 
                          v[hasLetter]))
                  retval
              }))
}

I get the following times for a 10,000-element v like yours
(f1 first, then f0; the TRUE confirms the results are identical):
user  system elapsed 
   0.13    0.00    0.14
user  system elapsed 
  10.75    0.19   10.99
[1] TRUE

If I double the length of v, your code takes 53 seconds (5x slower,
quadratic behavior?) while mine takes 0.17 (less than linear, suggesting
that its time is still dominated by the function call overhead for such
small input vectors).

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
loop is too slow.
it appears that sparseMatrix does what I want:

library(Matrix)          # sparseMatrix() comes from the Matrix package
ll <- sapply(l, length)  # lapply() would return a list; rep() needs a vector
i <- rep(1:4, ll)
vv <- unlist(l)
j1 <- as.factor(substring(vv,1,1))
t <- table(j1)
j <- position of elements of j1 in names(t)
sparseMatrix(i,j,x=as.numeric(substring(vv,2,2)), dimnames = list(NULL, names(t)))

so, the question is, how do I produce a vector of positions?

i.e., from vectors
[1] "A" "B" "A" "C" "A" "B"
and
[1] "A" "B" "C"
I need to produce a vector
[1] 1 2 1 3 1 2
of positions of the elements of the first vector in the second vector.

PS. Of course, I would much prefer a dataframe to a matrix...
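Putting the pieces together, a sketch of the full sparseMatrix route (assuming the Matrix package; match() fills the "vector of positions" gap, and the last line recovers a plain data frame as per the PS):

```r
library(Matrix)
v  <- c("A1B2", "A3C4", "B5", "C6A7B8")
l  <- strsplit(gsub("(.{2})", "\\1,", v), ",")
vv <- unlist(l)
cols <- sort(unique(substring(vv, 1, 1)))    # "A" "B" "C"
i <- rep(seq_along(l), sapply(l, length))    # row index of each pair
j <- match(substring(vv, 1, 1), cols)        # column index of each pair
m <- sparseMatrix(i, j, x = as.numeric(substring(vv, 2, 2)),
                  dimnames = list(NULL, cols))
as.data.frame(as.matrix(m))                  # data frame, if preferred
```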

#
On Wed, Feb 08, 2012 at 05:01:01PM -0500, Sam Steingold wrote:
This particular thing may be done as follows

  match(c("A", "B", "A", "C", "A", "B"), c("A", "B", "C"))
  [1] 1 2 1 3 1 2
As the final result or also as an intermediate result?

Changing individual rows in a data frame is much slower
than in a matrix.

Compare

  n <- 10000
  mat <- matrix(1:(2*n), nrow=n)
  df <- as.data.frame(mat)

  system.time( for (i in 1:n) { mat[i, 1] <- 0 } )

     user  system elapsed 
    0.021   0.000   0.021 

  system.time( for (i in 1:n) { df[i, 1] <- 0 } )

     user  system elapsed 
    4.997   0.069   5.084 

This effect is specific to working with rows. Working
with the whole columns is a different thing.

  system.time( {
  col1 <- df[[1]]
  for (i in 1:n) { col1[i] <- 0 }
  df[[1]] <- col1
  } )

    user  system elapsed 
   0.019   0.000   0.019 

Hope this helps.

Petr Savicky.