Hello,
duplicated() does not seem to work for a long vector. For example, if you download the data from https://docs.google.com/open?id=0B6-m45Jvl3ZmNmpaSlJWMXo5bmc (a vector with about 12,000 numbers) and then run the following code, which does duplicated() over the whole vector but just shows the last 30 elements:
data.frame( tail(verylong, 30), tail(duplicated(verylong), 30) )
you'll see that at the end of the very long vector everything is listed as a duplicate of the preceding element (even though it shouldn't be). On the other hand, if you run the following code, which just takes out the last 30 elements of the vector and does duplicated on them:
data.frame( tail(verylong, 30), duplicated(tail(verylong, 30)) )
you get the correct results (FALSE shows up wherever the value in the first column changes).
Does anyone know why this happens, and if there's a fix? I notice the documentation for duplicated() says: "Long vectors are supported for the default method of duplicated, but may only be usable if nmax is supplied." But I've tried running this with a high value of nmax given, and it still gives me the same problem. So far the only way I've figured out to get this duplicated()-like vector is to use a for loop going through one item at a time, but that takes about a minute to run.
Best, Steve Politzer-Ahles
-- Stephen Politzer-Ahles University of Kansas Linguistics Department http://people.ku.edu/~sjpa/
duplicated() with long vectors
6 messages · Stephen Politzer-Ahles, Sarah Goslee, Brian Ripley
Hi,
duplicated() doesn't just look at consecutive values, but anywhere in the object. Your 12320-element vector has only 48 distinct values, and all of them occur before the last 30 elements, so duplicated() returns TRUE for each of those last 30 elements. You might be looking for something involving rle(). What are you trying to accomplish?
Sarah
-- Sarah Goslee http://www.functionaldiversity.org
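To illustrate the point about duplicated() looking at all earlier positions, here is a minimal sketch on a toy vector. The vector `x` is hypothetical and merely stands in for `verylong`, which is not reproduced here.

```r
# Hypothetical toy data standing in for verylong
x <- c(1, 1, 2, 2, 1, 1, 3)

# duplicated() flags a value if it appeared at ANY earlier position,
# not only at the immediately preceding one
dup_anywhere <- duplicated(x)

# Run on the last three elements alone, the earlier occurrences are
# out of view, so the first 1 in the tail is no longer flagged
dup_tail <- duplicated(tail(x, 3))

print(tail(dup_anywhere, 3))  # flags computed over the whole vector
print(dup_tail)               # flags computed over the tail only
```

The two printed vectors differ at the first element of the tail, which matches the discrepancy described in the original question.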
Hi Sarah,
Thanks a lot for your explanation. I was mistakenly under the impression that duplicated() only looked at the immediately preceding element, not all preceding elements. What I was trying to do was get a vector saying, for each item, whether that item is the same as the preceding item. Now that I think of it, I could do this easily by copying the vector, shifting it over by one (removing the first element and adding something to the end), and then comparing the elements of the two vectors directly.
Best, Steve
Stephen Politzer-Ahles University of Kansas Linguistics Department http://people.ku.edu/~sjpa/
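The shift-and-compare idea described in this message could look roughly like the sketch below. The toy vector `x` and the name `same_as_prev` are hypothetical, and the sketch assumes the data contain no NA values; the first element has no predecessor, so it is marked FALSE here by choice.

```r
# Hypothetical toy data standing in for verylong (assumed to have no NAs)
x <- c(5, 5, 7, 7, 7, 5)
n <- length(x)

# x[-1] drops the first element, x[-n] drops the last, so the comparison
# lines up each element with the one immediately before it.
# Prepend FALSE because the first element has nothing to be compared with.
same_as_prev <- c(FALSE, x[-1] == x[-n])

print(data.frame(x, same_as_prev))
```

The result has the same length as the input, so it can sit next to the data in a data frame.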
On Wed, Dec 5, 2012 at 5:04 PM, Stephen Politzer-Ahles <politzerahless at gmail.com> wrote:
What I was trying to do was get a vector saying, for each item, whether that item is the same as the preceding item. Now that I think of it, I could do this easily by copying the vector, shifting it over one (by removing the first element and adding something to the end), and then just compare the elements of the two vectors directly.
Right. Did you look at rle() yet? Though for your particular simple case,
n <- length(verylong)  # assumed: n was not defined in the original post
system.time(verylong[1:(n-1)] == verylong[2:n])
   user  system elapsed
  0.001   0.000   0.002
is nearly instantaneous.
-- Sarah Goslee http://www.functionaldiversity.org
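For readers unfamiliar with the rle() suggestion, here is a minimal sketch of how it might be applied to this problem. The toy vector `x` is hypothetical, and rebuilding a per-element flag with sequence() is one possible approach rather than anything proposed in the thread.

```r
# Hypothetical toy data standing in for verylong
x <- c(5, 5, 7, 7, 7, 5)

# rle() collapses the vector into runs of consecutive equal values
runs <- rle(x)
print(runs$values)   # the value of each run
print(runs$lengths)  # how many times each value repeats consecutively

# sequence() counts 1, 2, ... within each run; any position past the
# first in its run equals the element immediately before it
same_as_prev <- sequence(runs$lengths) > 1
print(same_as_prev)
```

rle() is most useful when the run lengths themselves are of interest; for a plain same-as-previous flag, a direct element-wise comparison is simpler.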
On 05/12/2012 21:08, Sarah Goslee wrote:
Hi, duplicated() doesn't just look at consecutive values, but anywhere in the object. Since your 12320-element vector has only 48 separate values, and all of them occur before the last 30 elements, so duplicated() returns TRUE. You might be looking for something involving rle(). What are you trying to accomplish?
And BTW, 'long vector' is a technical term in R: not 12,000, but more than 2 billion elements. You will hear it a lot more in the run-up to the next 'minor' release of R (currently R-devel, maybe 2.16.0-to-be, which is the only version from which that quote comes that I am aware of). The posting guide asked for 'at a minimum' information: if you are using an unreleased development version of R you really must tell us (and should not be reporting to the R-help list).
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
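To put the "more than 2 billion elements" remark in concrete terms, the sketch below compares the length from this thread with the largest length an ordinary R vector can have, which is `.Machine$integer.max` (2^31 - 1). The variable names are hypothetical and chosen only for illustration.

```r
# Largest length of an ordinary (non-long) vector: 2^31 - 1
max_ordinary_length <- .Machine$integer.max
print(max_ordinary_length)

# Length of the vector discussed in this thread
n <- 12320

# A "long vector" in the technical sense would need a length above the limit
is_long <- n > max_ordinary_length
print(is_long)
```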
Sorry, that's my mistake, I should not have said 'long vector'; mine is just a normal vector. I'm not actually using a development version. Best, Steve
Stephen Politzer-Ahles University of Kansas Linguistics Department http://people.ku.edu/~sjpa/