Successive subsets from a vector?

    embed(VECTOR, 5)[, 5:1]

gives the subsets, so something like

    apply(embed(VECTOR, 5)[, 5:1], 1, paste, collapse="")

does the job.
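
As a small illustration of why the columns are reversed, here is a sketch on the 13-element vector from the original question. It uses only base R; the intermediate names `m` and `windows` are introduced here for illustration. Each row of embed(VECTOR, 5) holds one window with the newest element first, so `[, 5:1]` restores the original order before pasting.

```r
# Small vector from the original question
VECTOR <- c(1, 4, 2, 6, 5, 0, 11, 10, 4, 3, 6, 8, 6)

# embed() returns one window per row, newest element first
m <- embed(VECTOR, 5)
dim(m)    # 9 rows (one per window), 5 columns
m[1, ]    # first window, reversed: 5 6 2 4 1

# Reverse the columns to restore the original order, then paste each row
windows <- m[, 5:1]
ADDRESSES <- apply(windows, 1, paste, collapse = "")
ADDRESSES[1:3]    # "14265" "42650" "265011"
```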

The following is a bit more efficient

    ind <- 1:(length(VECTOR)-4)
    do.call(paste, c(lapply(0:4, function(j) VECTOR[ind+j]), sep=""))

but, by looking at how embed() works, the first version could be made just as efficient.
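
The do.call() version may be easier to follow when unpacked into steps. The sketch below is only a restatement of the same idea on the small vector from the question; the names `k` and `shifted` are assumptions added for illustration. It builds one shifted copy of the vector per position in the window and lets the vectorised paste() join them element by element, so no per-window function call is needed.

```r
VECTOR <- c(1, 4, 2, 6, 5, 0, 11, 10, 4, 3, 6, 8, 6)
k <- 5                                    # window length (illustrative name)
ind <- seq_len(length(VECTOR) - k + 1)    # start index of every window

# One shifted copy of the vector per position inside the window
shifted <- lapply(0:(k - 1), function(j) VECTOR[ind + j])

# paste() is vectorised: element i of the result joins element i of each copy
ADDRESSES <- do.call(paste, c(shifted, sep = ""))
ADDRESSES
```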

Larger example:

    > VECTOR <- sample(1:10, 1e5, replace=TRUE)
    > system.time(apply(embed(VECTOR, 5)[, 5:1], 1, paste, collapse=""))
    [1] 5.73 0.05 5.81   NA   NA
    > system.time({ind <- 1:(length(VECTOR)-4)
    + do.call(paste, c(lapply(0:4, function(j) VECTOR[ind+j]), sep=""))
    + })
    [1] 1.00 0.01 1.01   NA   NA

The loop method took 195 secs.  Just preallocating the answer at the correct
length reduced this to 5 secs, e.g. use

    ADDRESSES <- character(length(VECTOR)-4)

Moral: don't grow vectors repeatedly.
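
To make the contrast concrete, here is a minimal sketch of the two loop variants on the small vector from the question. The names `grown` and `filled` are introduced here for illustration. Both loops produce the same result; only the allocation pattern differs, and any timing gap will depend on the R version and the machine.

```r
VECTOR <- c(1, 4, 2, 6, 5, 0, 11, 10, 4, 3, 6, 8, 6)
n <- length(VECTOR) - 4    # number of five-element windows

# Growing: the result starts empty and is extended on every iteration
grown <- c()
for (i in seq_len(n)) {
  grown[i] <- paste(VECTOR[i:(i + 4)], collapse = "")
}

# Preallocated: the result has its final length before the loop starts
filled <- character(n)
for (i in seq_len(n)) {
  filled[i] <- paste(VECTOR[i:(i + 4)], collapse = "")
}

identical(grown, filled)    # TRUE
```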

On Tue, 22 Aug 2006, kone wrote:

I'd like to pick every overlapping ("imbricated") five-element subset from a
vector. I guess there is some efficient way to do this without loops...
Here is a for-loop version and a model for the output:

    VECTOR=c(1,4,2,6,5,0,11,10,4,3,6,8,6);

    ADDRESSES=c();

You do not need the semicolons, and they just confuse readers.

    for(i in 1:(length(VECTOR)-4)){
        ADDRESSES[i]=paste(VECTOR[i:(i+4)],collapse="")
    }

    > ADDRESSES
    [1] "14265"   "42650"   "265011"  "6501110" "5011104" "0111043" "1110436" "104368"
    [9] "43686"

Atte Tenkanen
University of Turku, Finland


______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Thanks!

I have used tons of for- and while-loops (I'm ashamed to reveal these scripts, but I'm primarily a musician ;-) http://users.utu.fi/attenka/SetTheoryScripts.r), had a cup of cocoa or more, and mostly been happy ;-) Now I have so many new ways to do these things that it will take a while to ruminate on all the ideas here.

Atte

> The loop method took 195 secs.  Just assigning to an answer of the correct
> length reduced this to 5 secs.  e.g. use
>
>     ADDRESSES <- character(length(VECTOR)-4)
>
> Moral: don't grow vectors repeatedly.

Other languages (e.g. Java) grow the capacity of the vector independently
of the number of elements in it (I think Java doubles the capacity
whenever the vector is full), thus reducing the number of reallocations
from O(n) to O(log n).  I've always wondered why R doesn't do this.

Hadley
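
For readers unfamiliar with the doubling strategy mentioned above, the following is a rough sketch of it in plain R. Everything here is hypothetical and for illustration only: `make_buffer`, `push`, `values` and `reallocations` are invented names, and this says nothing about how R manages vectors internally (R may still copy the buffer on assignment). It only shows that doubling the capacity keeps the number of explicit extensions small.

```r
# Hypothetical helper: a buffer whose capacity doubles whenever it is full
make_buffer <- function(capacity = 1L) {
  data <- character(capacity)
  used <- 0L             # number of slots actually filled
  reallocations <- 0L    # how many times the buffer was extended

  push <- function(x) {
    if (used == length(data)) {
      # Buffer full: double the capacity in one step
      data <<- c(data, character(length(data)))
      reallocations <<- reallocations + 1L
    }
    used <<- used + 1L
    data[used] <<- x
    invisible(NULL)
  }

  list(push = push,
       values = function() data[seq_len(used)],
       reallocations = function() reallocations)
}

buf <- make_buffer()
for (i in 1:1000) buf$push(as.character(i))
length(buf$values())    # 1000
buf$reallocations()     # 10 extensions for 1000 appends
```

When the final length is known in advance, preallocating with character(n) remains simpler than any such scheme.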

At one point at least that was too expensive on memory/address space (and 
it may still be for 32-bit OSes). There is even a 'truelength' field in 
the vector header to allow for such a strategy, and the strategy is used 
in scan() and elsewhere.

In my experience it is relatively rare not to know the vector length in 
advance in R code.
Here is a solution that uses gsub with a positive lookahead Perl-style
regexp to do it:

    VECTOR <- c(1,4,2,6,5,0,11,10,4,3,6,8,6)
    e <- "([[:digit:]]+),(?=([[:digit:]]+),([[:digit:]]+),([[:digit:]]+),([[:digit:]]+))"
    out <- gsub(e, "\\1\\2\\3\\4\\5 ", paste(VECTOR, collapse = ","), perl = TRUE)
    head(strsplit(out, " ")[[1]], -1)  # uses head from R 2.4.0
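
The lookahead is what allows the windows to overlap: text matched inside `(?=...)` is captured but not consumed, so the next match can start at the very next number. A reduced sketch with windows of length two may make this easier to see. The string `s` and the names `without_lookahead` and `with_lookahead` are made up for this illustration.

```r
s <- "1,4,2,6"

# Without a lookahead the second number is consumed by the match,
# so the next window cannot start on it
without_lookahead <- gsub("([[:digit:]]+),([[:digit:]]+)", "\\1\\2 ", s, perl = TRUE)
without_lookahead    # "14 ,26 "

# With a lookahead the second number is captured but left in place,
# so every number except the last starts a window
with_lookahead <- gsub("([[:digit:]]+),(?=([[:digit:]]+))", "\\1\\2 ", s, perl = TRUE)
with_lookahead       # "14 42 26 6"
```

The trailing leftover ("6" here) is the reason the solution above drops the last piece with head(..., -1).
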