Is R the right choice for simulating first passage times of random walks?

3 messages · Paul Menzel, michael.weylandt at gmail.com (R. Michael Weylandt

Mon, Aug 1, 2011 5:49 AM #

Am Sonntag, den 31.07.2011, 23:32 -0500 schrieb R. Michael Weylandt :

I will try Dennis? solution right away but looked at your suggestions
first. Thank you very much.

That is a nice suggestion. For a symmetric random walk this is indeed
possible and equivalent to looking when the walk first hits zero.

That is the same idea as from Bill [1]. The problem is, when the walk
never returns to zero in a sample, `which.max(?everything FALSE)`
returns 1 [2]. That is no problem though, when we do not have to worry
about a walk starting with a positive value and adding 1 (+1) can be
omitted when we count the epochs of first hitting 0 instead of the time
of how long the walk stayed negative, which is always one less.

Additionally my check `(x>=0) & (xLag <0)` is redundant when we know we
start with a negative value. `(x>=0)` should be good enough in this
case.

The line above seems to look the columns instead of rows. I think the
following is correct since after the `apply()` above the random walks
are in the columns.

	R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)]

Welcome to my world. I would have never thought that simulating random
walks with a length of say a million would create that much data and
push common desktop systems with let us say 4 GB of RAM to their limits.

I left that one out, because as written above the check can be
shortened.

That is true. But to learn some optimization techniques on a simple
example is much appreciated and will hopefully help me later on for the
iterated random walk cases.

Nice!

I have to write those translations down. On first sight though we need
again to handle the case where it stays negative the whole time. `D`
then has length 0 and we have to count that for a walk longer than
`BIGNUMBER`.

Just testing for 0 for the iterated cases will not be enough for
iterated random walks since an iterated random walk can go from negative
to non-negative without being zero at this time/epoch.

I implemented all your suggestions and got the following.

-------- 8< -------- code -------- >8 --------
f4 <- function(n = 100000, # number of simulations
               length = 100000) # length of iterated sum
{
	R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n)
	R = apply(R,1,cumsum) ## this applies cumsum `row-wise' to R and will make your life INFINITELY better
	fTemp <- function(x) {
		if (x[1] >= 0 ) {
			return(1)
		}

		for (i in 1:length-1) {
			if (x[i] < 0 && x[i + 1] >= 0) {
				return(as.integer(i/2 + 2))
			}
		}
	}
	countNegative = apply(R,2,fTemp)
	tabulate(as.vector(countNegative), length)
}

f5 <- function(n = 100000, # number of simulations
               length = 100000) # length of iterated sum
{
	R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)

	R = apply(R,1,cumsum)
	R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)] # If the first element in the row is positive, flip the entire row

	R <- R>=0
	countNegative = apply(R,2,which.max)
	tabulate(as.vector(countNegative), length)
}

f6 <- function(n = 100000, # number of simulations
               length = 100000) # length of iterated sum
{
	x = cumsum(sample(c(-1L,1L), length*n,replace=T))
	D = diff(which(c(0, x) == 0))
	tabulate(D, max(D))
}
-------- 8< -------- code -------- >8 --------

The timings differ quite much which is expected though.

User      System verstrichen
      2.700       0.008       2.729

User      System verstrichen 
      4.457       0.004       4.475

User      System verstrichen
      8.033       0.380       8.739

User      System verstrichen
      9.640       0.812      10.588

User      System verstrichen
      4.208       0.328       4.606

So `f6` seems to be the most efficient setting right now and even is
slightly faster than `f1` with the for loops. But we have to keep in
mind that both operate on different data sets although `set.seed(1)` is
used and `f6` treats the problem totally different.

One other thought is that when reusing the walks starting with a positiv
term and flipping those we can probably also take the backward/reverse
walk (dual problem). I will try that too.


Thank you very much,

Paul


[1] https://stat.ethz.ch/pipermail/r-help/2011-July/285015.html
[2] https://stat.ethz.ch/pipermail/r-help/2011-July/285396.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110801/d4411d79/attachment.bin>

michael.weylandt at gmail.com (R. Michael Weylandt

Mon, Aug 1, 2011 9:43 AM #

I've only got a 20 minute layover, but three quick remarks:

1) Do a sanity check on your data size: if you want a million walks of a thousand steps, that already gets you to a billion integers to store--even at a very low bound of one byte each, thats already 1GB for the data and you still have to process it all and run the OS. If you bump this to walks of length 10k, you are in big trouble. 

Considered like that, it shouldn't surprise you that you are getting near memory limits. 

If you really do need such a large simulation and are willing to make the time/space tradeoff, it may be worth doing simulations in smaller batches (say 50-100) and aggregating the needed stats for analysis. Also, consider direct use of the rm() function for memory management. 

2) If you know that which.max()==1 can't happen for your data, might this trick be easier than forcing it through some tricky logic inside the which.max()

X=which.max(...)
if(X[1]==1) X=Inf # or whatever value

3) I dont have any texts at hand to confirm this but isn't the expected value of the first hit time of a RW infinite? I think a  handwaving proof can be squeezed out of the optional stopping theorem with T=min(T_a,T_b) for a<0<b and let a -> -Inf. 

If I remember right, this suggests you are trying to calculate a CI for a distribution with no finite moments, a difficult task to say the least. 

Hope these help and I'll write a more detailed reply to your notes below later,

Michael Weylandt

PS - what's an iterated RW? This is all outside my field (hence my spitball on #2 above)

PS2 - sorry about the row/column mix-up: I usually think of sample paths as rows...

On Aug 1, 2011, at 8:49 AM, Paul Menzel <paulepanter at users.sourceforge.net> wrote:

Am Sonntag, den 31.07.2011, 23:32 -0500 schrieb R. Michael Weylandt :

Glad to help -- I haven't taken a look at Dennis' solution (which may be far
better than mine), but if you do want to keep going down the path outlined
below you might consider the following:

I will try Dennis? solution right away but looked at your suggestions
first. Thank you very much.

Instead of throwing away a simulation if something starts negative, why not
just multiply the entire sample by -1: that lets you still use the sample
and saves you some computations: of course you'll have to remember to adjust
your final results accordingly.

That is a nice suggestion. For a symmetric random walk this is indeed
possible and equivalent to looking when the walk first hits zero.

This might avoid the loop:

x = ## Whatever x is.
xLag = c(0,x[-length(x)]) # 'lag' x by 1 step.
which.max((x>=0) & (xLag <0)) + 1 # Depending on how you've decided to count
things, this +1 may be extraneous.

The inner expression sets a 0 except where there is a switch from negative
to positive and a one there: the which.max function returns the location of
the first maximum, which is the first 1, in the vector. If you are
guaranteed the run starts negative, then the location of the first positive
should give you the length of the negative run.

That is the same idea as from Bill [1]. The problem is, when the walk
never returns to zero in a sample, `which.max(?everything FALSE)`
returns 1 [2]. That is no problem though, when we do not have to worry
about a walk starting with a positive value and adding 1 (+1) can be
omitted when we count the epochs of first hitting 0 instead of the time
of how long the walk stayed negative, which is always one less.

Additionally my check `(x>=0) & (xLag <0)` is redundant when we know we
start with a negative value. `(x>=0)` should be good enough in this
case.

This all gives you,

f4 <- function(n = 100000, # number of simulations
              length = 100000) # length of iterated sum
{
      R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)

      R = apply(R,1,cumsum)

                     R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row

The line above seems to look the columns instead of rows. I think the
following is correct since after the `apply()` above the random walks
are in the columns.

   R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)]

      fTemp <- function(x) {

                           xLag = c(0,x[-length(x)])
                           return(which.max((x>=0) & (xLag <0))+1)

      countNegative = apply(R,2,fTemp)
      tabulate(as.vector(countNegative), length)
}

That just crashed my computer though, so I wouldn't recommend it for large
n,length.

Welcome to my world. I would have never thought that simulating random
walks with a length of say a million would create that much data and
push common desktop systems with let us say 4 GB of RAM to their limits.

Instead, you can help a little by combining the lagging and the &
all in one.

f4 <- function(n = 100000, llength = 100000)
{
   R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)
   R = apply(R,1,cumsum)
   R[R[,1]==(1),] = -1 * R[R[,1]==(-1),] # If the first element in the row is positive, flip the entire row
   R = (cbind(rep(0,NROW(R)),R)<0)&(cbind(R,rep(0,NROW(R)))>=0)
   countNegative = apply(R,1,which.max) + 1
   return (tabulate(as.vector(countNegative), length) )

I left that one out, because as written above the check can be
shortened.

Of course, this is all starting to approach a very specific question that
could actually be approached much more efficiently if it's your end goal
(though I think I remember from your first email a different end goal):

That is true. But to learn some optimization techniques on a simple
example is much appreciated and will hopefully help me later on for the
iterated random walk cases.

We can use the symmetry and "restart"ability of RW to do the following:

x = cumsum(sample(c(-1L,1L),BIGNUMBER,replace=T)
D  = diff(which(x == 0))

Nice!

This will give you a vector of how long x stays positive or negative at a
time. Thinking through some simple translations lets you see that this set
has the same distribution as how long a RW that starts negative stays
negative.

I have to write those translations down. On first sight though we need
again to handle the case where it stays negative the whole time. `D`
then has length 0 and we have to count that for a walk longer than
`BIGNUMBER`.

Again, this is only good for answering a very specific question
about random walks and may not be useful if you have other more complicated
questions in sight.

Just testing for 0 for the iterated cases will not be enough for
iterated random walks since an iterated random walk can go from negative
to non-negative without being zero at this time/epoch.

I implemented all your suggestions and got the following.

-------- 8< -------- code -------- >8 --------
f4 <- function(n = 100000, # number of simulations
              length = 100000) # length of iterated sum
{
   R = matrix(sample(c(-1L,1L),length*n,replace=T),nrow=n)
   R = apply(R,1,cumsum) ## this applies cumsum `row-wise' to R and will make your life INFINITELY better
   fTemp <- function(x) {
       if (x[1] >= 0 ) {
           return(1)
       }

       for (i in 1:length-1) {
           if (x[i] < 0 && x[i + 1] >= 0) {
               return(as.integer(i/2 + 2))
           }
       }
   }
   countNegative = apply(R,2,fTemp)
   tabulate(as.vector(countNegative), length)
}

f5 <- function(n = 100000, # number of simulations
              length = 100000) # length of iterated sum
{
   R = matrix(sample(c(-1L,1L), length*n,replace=T),nrow=n)

   R = apply(R,1,cumsum)
   R[,R[1,]==(1)] = -1 * R[,R[1,]==(1)] # If the first element in the row is positive, flip the entire row

   R <- R>=0
   countNegative = apply(R,2,which.max)
   tabulate(as.vector(countNegative), length)
}

f6 <- function(n = 100000, # number of simulations
              length = 100000) # length of iterated sum
{
   x = cumsum(sample(c(-1L,1L), length*n,replace=T))
   D = diff(which(c(0, x) == 0))
   tabulate(D, max(D))
}
-------- 8< -------- code -------- >8 --------

The timings differ quite much which is expected though.

# f1 is using only for loops but only does half the calculations
# and does not yet flip random walks starting with a positive value.
set.seed(1) ; system.time( z1 <- f1(300, 1e5) )

      User      System verstrichen
     2.700       0.008       2.729

# f1 adapted with flips
set.seed(1) ; system.time( z1f <- f1withflip(300, 1e5) )

      User      System verstrichen 
     4.457       0.004       4.475

set.seed(1) ; system.time( z4 <- f4(300, 1e5) )

      User      System verstrichen
     8.033       0.380       8.739

set.seed(1) ; system.time( z5 <- f5(300, 1e5) )

      User      System verstrichen
     9.640       0.812      10.588

set.seed(1) ; system.time( z6 <- f6(300, 1e5) )

      User      System verstrichen
     4.208       0.328       4.606

So `f6` seems to be the most efficient setting right now and even is
slightly faster than `f1` with the for loops. But we have to keep in
mind that both operate on different data sets although `set.seed(1)` is
used and `f6` treats the problem totally different.

One other thought is that when reusing the walks starting with a positiv
term and flipping those we can probably also take the backward/reverse
walk (dual problem). I will try that too.


Thank you very much,

Paul


[1] https://stat.ethz.ch/pipermail/r-help/2011-July/285015.html
[2] https://stat.ethz.ch/pipermail/r-help/2011-July/285396.html

Paul Menzel

Mon, Aug 1, 2011 5:28 PM #

Am Montag, den 01.08.2011, 12:43 -0400 schrieb R. Michael Weylandt :

I already did that, saved the result and ran it again. I also found [1]
and will look to do these things in parallel, since the simulations do
not depend on each other. I hope I can avoid the matrix then and use
just `replicate()`.

I was looking for such a feature the last days and read to set the
variables to `NULL` to delete them somewhere. Now I found the correct
command. Thank you!

Noted for when I need that again.

That is indeed correct. The generating function of the first hitting
time of zero T? is (g_T?)(s) ? 1/s (1 - \sqrt(1 - s). Therefore

(g_T?)?(s) ? 1/s? (1 - \sqrt(1 - s) + 1/s (2(1 - s))^(-?) ? ? for s ? 1.

Apparently there are several ways to prove that.

I do not know. It scares me. ;-) For the symmetric simple random walk
S_n ? ?_{i=0}^n X_i I want to verify (1).

(1)	n^(-?) ~ p_n ? P(max_{1 ? k ? n} S_n < 0)

a(x) ~ b(x) means that the quotient converges to 1 for x ? ?.

[?]

I am sorry, I meant *integrated* random walk [3][4]. Basically that is
the integral (?area? ? can be negative).

	A_n ? ?_{i=0}^n S_i = ?_{i=0}^n (n - i + 1) X_i

Being 0, S? and X? can always be omitted. So I basically just need to do
one `cumsum()` more over the walks.

No problem at all. I already was confused that it looked differently
(transposed) after the first `apply()`. But it made sense.


Thanks,

Paul


[1] http://www.bioconductor.org/help/course-materials/2010/BioC2010/EfficientRProgramming.pdf
[2] http://www.steinsaltz.me.uk/probA/ProbALN13.pdf
[3] http://www-stat.stanford.edu/~amir/preprints/irw.ps
[4] http://arxiv.org/abs/0911.5456
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110802/9ef937bf/attachment.bin>