
"unsparse" a vector

10 messages · Bert Gunter, Petr Savicky, William Dunlap +1 more

#
Suppose I have a vector of strings:
c("A1B2","A3C4","B5","C6A7B8")
[1] "A1B2"   "A3C4"   "B5"     "C6A7B8"
where each string is a sequence of <column><value> pairs
(fixed width, in this example both value and name are 1 character, in
reality the column name is 6 chars and value is 2 digits).
I need to convert it to a data frame:
data.frame(A=c(1,3,0,7),B=c(2,0,5,8),C=c(0,4,0,6))
  A B C
1 1 2 0
2 3 0 4
3 0 5 0
4 7 8 6

how do I do that?
thanks.
#
To be clear, I can do that with nested for loops:

v <- c("A1B2","A3C4","B5","C6A7B8")
l <- strsplit(gsub("(.{2})","\\1,",v),",")
d <- data.frame(A=vector(length=4,mode="integer"),
                B=vector(length=4,mode="integer"),
                C=vector(length=4,mode="integer"))

for (i in 1:length(l)) {
  l1 <- l[[i]]
  for (j in 1:length(l1)) {
    d[[substring(l1[j],1,1)]][i] <- as.numeric(substring(l1[j],2,2))
  }
}


but I am afraid that handling 1,000,000 (=length(unlist(l))) strings in
a loop will kill me.

#
I suspect there are cleverer ways to do it, especially using packages
like stringr and gsubfn, but using base tools, you can hack it without
too much effort:

?gregexpr

is the key. To get started (x is your example vector of character strings; a call along the lines of gregexpr("..", x), matching each 2-character pair, gives the following):
[[1]]
[1] 1 3
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 1 3
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 1
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

[[4]]
[1] 1 3 5
attr(,"match.length")
[1] 2 2 2
attr(,"useBytes")
[1] TRUE

The components of the result give you indices of the start and stop
values for each "entry" in your final matrix/data frame. You can thus
lapply() on this list to get the column name-value pairs substrings
and decode them.

Alternatively, if all your names are really 6 characters and all your
values are really two digits,
?nchar and ?substring will get you the name-value substrings directly.
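A minimal sketch of that fixed-width route, assuming the toy widths from the example (1-character names, 1-digit values; the real data would use a chunk width of 8 instead of 2):

```r
x <- c("A1B2", "A3C4", "B5", "C6A7B8")
w <- 2                                   # name width + value width per pair
starts <- lapply(nchar(x), function(n) seq(1, n, by = w))
pairs  <- mapply(function(s, st) substring(s, st, st + w - 1), x, starts)
# pairs[[4]] is c("C6", "A7", "B8"); substring(., 1, 1) then gives the
# column names and substring(., 2, 2) the values
```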

I leave the niggling details to you (or to other helpeRs -- especially
those who can suggest a more elegant approach).

-- Bert
On Wed, Feb 8, 2012 at 12:34 PM, Sam Steingold <sds at gnu.org> wrote:

#
Sam:
On Wed, Feb 8, 2012 at 12:56 PM, Sam Steingold <sds at gnu.org> wrote:
Well, that depends on how "dead" you can stand being. Try it with a
1000 entry subvector and see how bad it gets. A few extra minutes of
computing time to save many more minutes of programming time seems a
reasonable tradeoff. Alternatively, see ?compile to compile your
solution into bytecode, which might give a few fold reduction in time
(or not). The calculation could also be parallelized using the
parallel package, I'm sure.
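A sketch of the byte-compilation route (assuming your loop is wrapped in a function, here called f purely for illustration):

```r
library(compiler)        # part of the standard distribution
fc <- cmpfun(f)          # byte-compile the loop-based function
# system.time(fc(v))     # then compare against system.time(f(v))
```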

-- Bert

#
in this case, many hours of computing time.
Error: object 'compile' not found
#
Sorry, it's in package compiler, now part of the standard distro. My bad.

-- Bert
On Wed, Feb 8, 2012 at 1:23 PM, Sam Steingold <sds at gnu.org> wrote:

#
On Wed, Feb 08, 2012 at 03:56:12PM -0500, Sam Steingold wrote:
Hi.

The inner loop may be vectorized.

  d <- as.matrix(d)
  for (i in 1:length(l)) {
    l1 <- l[[i]]
    d[i, substring(l1,1,1)] <- as.numeric(substring(l1,2,2))
  }

    A B C
  1 1 2 0
  2 3 0 4
  3 0 5 0
  4 7 8 6

If the number of rows of d is large, a matrix is probably
more efficient.

Hope this helps.

Petr Savicky.
#
When compute time is important it often helps
to loop over columns instead of over rows (assuming
there are fewer columns than rows, the usual
case).  E.g., putting your code into a function f0
and the column-looping version into f1:

f0 <- function(v) {
  n <- length(v)
  d <- data.frame(A=vector(length=n,mode="integer"),
                  B=vector(length=n,mode="integer"),
                  C=vector(length=n,mode="integer"))
  l <- strsplit(gsub("(.{2})","\\1,",v),",")
  for (i in seq_along(l)) {
    l1 <- l[[i]]
    for (j in seq_along(l1)) {
      d[[substring(l1[j],1,1)]][i] <- as.integer(substring(l1[j],2,2))
    }
  }
  d
}

f1 <- function(v) {
  n <- length(v)
  letters <- c("A", "B", "C")
  names(letters) <- letters
  data.frame(lapply(letters,
              function(letter) {
                  retval <- integer(n)
                  hasLetter <- grepl(letter, v)
                  retval[hasLetter] <- as.integer(
                     gsub(sprintf("^.*%s([[:digit:]]+).*$", letter), 
                          "\\1", 
                          v[hasLetter]))
                  retval
              }))
}

I get the following times for a 10,000-element v like yours
(f1 first, then f0; the TRUE confirms the results are identical):
user  system elapsed 
   0.13    0.00    0.14
user  system elapsed 
  10.75    0.19   10.99
[1] TRUE

If I double the length of v, your code takes 53 seconds (5x slower,
quadratic behavior?) while mine takes 0.17 (less than linear, suggesting
that its time is still dominated by the function call overhead for such
small input vectors).

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
loop is too slow.
it appears that sparseMatrix does what I want:

library(Matrix)          # sparseMatrix() comes from the Matrix package
ll <- sapply(l, length)  # lapply() would return a list; rep() needs a vector
i <- rep(1:4, ll)
vv <- unlist(l)
j1 <- as.factor(substring(vv,1,1))
t <- table(j1)
j <- position of elements of j1 in names(t)
sparseMatrix(i,j,x=as.numeric(substring(vv,2,2)), dimnames = list(NULL, names(t)))

so, the question is, how do I produce a vector of positions?

i.e., from vectors
[1] "A" "B" "A" "C" "A" "B"
and
[1] "A" "B" "C"
I need to produce a vector
[1] 1 2 1 3 1 2
of positions of the elements of the first vector in the second vector.

PS. Of course, I would much prefer a dataframe to a matrix...
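Putting the pieces together, a sketch of the full sparseMatrix route (assuming the Matrix package; match() fills the "vector of positions" gap, and the last line recovers a plain data frame as per the PS):

```r
library(Matrix)
v  <- c("A1B2", "A3C4", "B5", "C6A7B8")
l  <- strsplit(gsub("(.{2})", "\\1,", v), ",")
vv <- unlist(l)
cols <- sort(unique(substring(vv, 1, 1)))    # "A" "B" "C"
i <- rep(seq_along(l), sapply(l, length))    # row index of each pair
j <- match(substring(vv, 1, 1), cols)        # column index of each pair
m <- sparseMatrix(i, j, x = as.numeric(substring(vv, 2, 2)),
                  dimnames = list(NULL, cols))
as.data.frame(as.matrix(m))                  # data frame, if preferred
```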

#
On Wed, Feb 08, 2012 at 05:01:01PM -0500, Sam Steingold wrote:
This particular thing may be done as follows

  match(c("A", "B", "A", "C", "A", "B"), c("A", "B", "C"))
  [1] 1 2 1 3 1 2
As the final result or also as an intermediate result?

Changing individual rows in a data frame is much slower
than in a matrix.

Compare

  n <- 10000
  mat <- matrix(1:(2*n), nrow=n)
  df <- as.data.frame(mat)

  system.time( for (i in 1:n) { mat[i, 1] <- 0 } )

     user  system elapsed 
    0.021   0.000   0.021 

  system.time( for (i in 1:n) { df[i, 1] <- 0 } )

     user  system elapsed 
    4.997   0.069   5.084 

This effect is specific to working with rows. Working
with the whole columns is a different thing.

  system.time( {
  col1 <- df[[1]]
  for (i in 1:n) { col1[i] <- 0 }
  df[[1]] <- col1
  } )

    user  system elapsed 
   0.019   0.000   0.019 

Hope this helps.

Petr Savicky.