algorithm help

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110106/82118229/attachment.pl>
Hi, I am seeking help on designing an algorithm to identify the
locations of stretches of 1s in a vector of 0s and 1s. Below is
an simple example:

dat<-as.data.frame(cbind(a=c(F,F,T,T,T,T,F,F,T,T,F,T,T,T,T,F,F,F,F,T)
  ,b=c(4,12,13,16,18,20,28,30,34,46,47,49,61,73,77,84,87,90,95,97)))

dat
   a  b
1  0  4
2  0 12
3  1 13
4  1 16
5  1 18
6  1 20
7  0 28
8  0 30
9  1 34
10 1 46
11 0 47
12 1 49
13 1 61
14 1 73
15 1 77
16 0 84
17 0 87
18 0 90
19 0 95
20 1 97

In this dataset, "b" is sorted and denotes the location for each
number in "a". 
So I would like to find the starting & ending locations for each
stretch of 1s within "a", also counting the number of 1s in each
stretch as well.
Hope the results from the algorithm would be:

stretch   start   end   No.of.1s
1         13      20    4
2         34      46    2
3         49      77    4
4         97      97    1

I can imagine using for loops can do the job, but I feel it's not a
clever way to do this. Is there an efficient algorithm that can do
this fast?

Thanks for any suggestions.
John
The basic information you need can be got using rle() ("run length
encoding"). See '?rle'. In your example:

  rle(dat$a)
  # Run Length Encoding
  #   lengths: int [1:8] 2 4 2 2 1 4 4 1
  #   values : num [1:8] 0 1 0 1 0 1 0 1
  ## Note: F -> 0, T -> 1

The following has a somewhat twisted logic at the end, and may
be flawed, but you can probably adapt it!

  L <- rle(dat$a)$lengths
  V <- rle(dat$a)$values
  pos <- c(1,cumsum(L))
  V1 <- c(-1,V)
  1+pos[V1==0]
  # [1]  3  9 12 20
  ## Positions in the series dat$a where each run of "T" (i.e. 1)
  ##   starts

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Jan-11                                       Time: 22:57:44
------------------------------ XFMail ------------------------------
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110106/dc2367b8/attachment.pl>
-----Original Message-----
From: r-help-bounces at r-project.org 
[mailto:r-help-bounces at r-project.org] On Behalf Of array chip
Sent: Thursday, January 06, 2011 3:29 PM
To: ted.harding at wlandres.net
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] algorithm help

Thanks very much, Ted. Yes, it does what I need!

I made a routine to do this:

f.fragment<-function(a,b) {
    dat<-as.data.frame(cbind(a,b))

    L <- rle(dat$a)$lengths
    V <- rle(dat$a)$values
    pos <- c(1,cumsum(L))
    V1 <- c(-1,V)
    start<-1+pos[V1==0]
    end<-pos[V1==1]

    cbind(stretch=1:length(start),start=dat$b[start]
     ,end=dat$b[end],no.of.1s=L[V==1])

}

f.fragment(dat$a,dat$b)

     stretch start end no.of.1s
[1,]       1    13  20        4
[2,]       2    34  46        2
[3,]       3    49  77        4
[4,]       4    97  97        1
You need to be more careful about the first
and last rows in the dataset.  I think yours
only works when a starts with 0 and ends with 1.

  > f.fragment(c(1,1,0,0), c(11,12,13,14))
       stretch start end no.of.1s
  [1,]       1    NA  12        2
  > f.fragment(c(1,1,0,1), c(11,12,13,14))
       stretch start end no.of.1s
  [1,]       1    14  12        2
  [2,]       1    14  14        1
  > f.fragment(c(0,1,0,1), c(11,12,13,14))
       stretch start end no.of.1s
  [1,]       1    12  12        1
  [2,]       2    14  14        1
  > f.fragment(c(0,1,0,0), c(11,12,13,14))
       stretch start end no.of.1s
  [1,]       1    12  12        1
  [2,]       2    NA  12        1
  > f.fragment(c(1,1,1,1), c(11,12,13,14))
       stretch end no.of.1s
  [1,]       1  14        4
  [2,]       0  14        4
  > f.fragment(c(0,0,0,0), c(11,12,13,14))
       stretch start
  [1,]       1    NA

The following does better.  It keeps things as
logical vectors as long as possible, which tends
to work better when dealing with runs.
  f <- function(a, b) {
       isFirstIn1Run <- c(TRUE, a[-1] != a[-length(a)]) & a==1
       isLastIn1Run <- c(a[-1] != a[-length(a)], TRUE) & a==1
       data.frame(stretch=seq_len(sum(isFirstIn1Run)),
                  start = b[isFirstIn1Run],
                  end = b[isLastIn1Run],
                  no.of.1s = which(isLastIn1Run) - which(isFirstIn1Run)
+ 1)
  }
  > f(c(1,1,0,0), c(11,12,13,14))
    stretch start end no.of.1s
  1       1    11  12        2
  > f(c(1,1,0,1), c(11,12,13,14))
    stretch start end no.of.1s
  1       1    11  12        2
  2       2    14  14        1
  > f(c(0,1,0,1), c(11,12,13,14))
    stretch start end no.of.1s
  1       1    12  12        1
  2       2    14  14        1
  > f(c(0,1,0,0), c(11,12,13,14))
    stretch start end no.of.1s
  1       1    12  12        1
  > f(c(1,1,1,1), c(11,12,13,14))
    stretch start end no.of.1s
  1       1    11  14        4
  > f(c(0,0,0,0), c(11,12,13,14))
  [1] stretch  start    end      no.of.1s
  <0 rows> (or 0-length row.names)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
John

________________________________
From: "ted.harding at wlandres.net" <ted.harding at wlandres.net>

Cc: r-help at stat.math.ethz.ch
Sent: Thu, January 6, 2011 2:57:47 PM
Subject: RE: [R] algorithm help

On 06-Jan-11 22:16:38, array chip wrote:
Hi, I am seeking help on designing an algorithm to identify the
locations of stretches of 1s in a vector of 0s and 1s. Below is
an simple example:

dat<-as.data.frame(cbind(a=c(F,F,T,T,T,T,F,F,T,T,F,T,T,T,T,F,F,F,F,T)
  ,b=c(4,12,13,16,18,20,28,30,34,46,47,49,61,73,77,84,87,90,95,97)))

dat
   a  b
1  0  4
2  0 12
3  1 13
4  1 16
5  1 18
6  1 20
7  0 28
8  0 30
9  1 34
10 1 46
11 0 47
12 1 49
13 1 61
14 1 73
15 1 77
16 0 84
17 0 87
18 0 90
19 0 95
20 1 97

In this dataset, "b" is sorted and denotes the location for each
number in "a". 
So I would like to find the starting & ending locations for each
stretch of 1s within "a", also counting the number of 1s in each
stretch as well.
Hope the results from the algorithm would be:

stretch   start   end   No.of.1s
1         13      20    4
2         34      46    2
3         49      77    4
4         97      97    1

I can imagine using for loops can do the job, but I feel it's not a
clever way to do this. Is there an efficient algorithm that can do
this fast?

Thanks for any suggestions.
John

The basic information you need can be got using rle() ("run length
encoding"). See '?rle'. In your example:

  rle(dat$a)
  # Run Length Encoding
  #   lengths: int [1:8] 2 4 2 2 1 4 4 1
  #   values : num [1:8] 0 1 0 1 0 1 0 1
  ## Note: F -> 0, T -> 1

The following has a somewhat twisted logic at the end, and may
[[elided Yahoo spam]]

  L <- rle(dat$a)$lengths
  V <- rle(dat$a)$values
  pos <- c(1,cumsum(L))
  V1 <- c(-1,V)
  1+pos[V1==0]
  # [1]  3  9 12 20
  ## Positions in the series dat$a where each run of "T" (i.e. 1)
  ##   starts

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Jan-11                                       Time: 22:57:44
------------------------------ XFMail ------------------------------

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110106/4d785181/attachment.pl>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 06-Jan-11 22:16:38, array chip wrote:
Hi, I am seeking help on designing an algorithm to identify the
locations of stretches of 1s in a vector of 0s and 1s. Below is
an simple example:

dat<-as.data.frame(cbind(a=c(F,F,T,T,T,T,F,F,T,T,F,T,T,T,T,F,F,F,F,T)
  ,b=c(4,12,13,16,18,20,28,30,34,46,47,49,61,73,77,84,87,90,95,97)))

dat
   a  b
1  0  4
2  0 12
3  1 13
4  1 16
5  1 18
6  1 20
7  0 28
8  0 30
9  1 34
10 1 46
11 0 47
12 1 49
13 1 61
14 1 73
15 1 77
16 0 84
17 0 87
18 0 90
19 0 95
20 1 97

In this dataset, "b" is sorted and denotes the location for each
number in "a". 
So I would like to find the starting & ending locations for each
stretch of 1s within "a", also counting the number of 1s in each
stretch as well.
Hope the results from the algorithm would be:

stretch   start   end   No.of.1s
1         13      20    4
2         34      46    2
3         49      77    4
4         97      97    1

I can imagine using for loops can do the job, but I feel it's not a
clever way to do this. Is there an efficient algorithm that can do
this fast?

Thanks for any suggestions.
John
The basic information you need can be got using rle() ("run length
encoding"). See '?rle'. In your example:

  rle(dat$a)
  # Run Length Encoding
  #   lengths: int [1:8] 2 4 2 2 1 4 4 1
  #   values : num [1:8] 0 1 0 1 0 1 0 1
  ## Note: F -> 0, T -> 1

The following has a somewhat twisted logic at the end, and may
be flawed, but you can probably adapt it!

  L <- rle(dat$a)$lengths
  V <- rle(dat$a)$values
  pos <- c(1,cumsum(L))
  V1 <- c(-1,V)
  1+pos[V1==0]
  # [1]  3  9 12 20
  ## Positions in the series dat$a where each run of "T" (i.e. 1)
  ##   starts
A different approach would be to use the diff() function:

Where
diff(dat$a)
[1]  0  1  0  0  0 -1  0  1  0 -1  1  0  0  0 -1  0  0  0  1

is not equal 0, the value is changing from 0 to 1 or one to 0.
The indices of the first new value can be found by:
which(diff(dat$a)!=0) + 1
[1]  3  7  9 11 12 16 20

where it is changing from 0 to 1 is at
which(diff(dat$a)==1) + 1
[1]  3  9 12 20

where it is changing from 1 to 0 is at
which(diff(dat$a)==-1) + 1
[1]  7 11 16

By taking into consideration if the first value and the last values are
0 or 1, you can calculate the length.

Cheers,

Rainer
Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Jan-11                                       Time: 22:57:44
------------------------------ XFMail ------------------------------

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
- -- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Natural Sciences Building
Office Suite 2039
Stellenbosch University
Main Campus, Merriman Avenue
Stellenbosch
South Africa

Tel:        +33 - (0)9 53 10 27 44
Cell:       +27 - (0)8 39 47 90 42
Fax (SA):   +27 - (0)8 65 16 27 82
Fax (D) :   +49 - (0)3 21 21 25 22 44
Fax (FR):   +33 - (0)9 58 10 27 44
email:      Rainer at krugs.de

Skype:      RMkrug
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0m1dMACgkQoYgNqgF2egoQbACcCB3iFQ6SKYfL4KVX8AMAN9Gp
1awAn0Z+8KXnOmwCLu61gihc8xZIT++j
=O+xA
-----END PGP SIGNATURE-----