use subset to trim data but include last per category

Hello,

I bumped into the following funny use-case. I have too much data for a given plot. I have the following data frame df:
str(df)
'data.frame':	5015 obs. of  5 variables:
 $ n          : Factor w/ 5 levels "1000","2000",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ iter       : int  10 20 30 40 50 60 70 80 90 100 ...
 $ Error      : num  1.05e-02 1.24e-03 3.67e-04 1.08e-04 4.05e-05 ...
 $ Duality_Gap: num  20080 3789 855 443 321 ...
 $ Runtime    : num  0.00536 0.01353 0.01462 0.01571 0.01681 ...

But if I plot e.g. Runtime vs log(Duality_Gap), I have too many observations, because I took a snapshot every 10 iterations rather than, say, every 500, and the plot looks very cluttered. So I would like to trim the data frame to include only those records for which iter is a multiple of 500, and so I do this:

df <- subset(df, iter %% 500 == 0)

This gives me almost exactly what I need except that the last and most important Duality Gap observations are of course gone due to the filtering ... I would like to change the subset clause to be iter %% 500 _or_ the record is the last per n (n is my problem size and category in this case) ... how can I do that?

I thought of adding a new Boolean column "last" that flags whether a given row is the last element per category, but this seems a bit too complicated. Is there a simpler condition construct that can be used with the subset command?

TIA,
Best regards,
Giovanni
dfthin <- df[ unique(c(which(df$iter %% 500 == 0), nrow(df))), ]

or

 dfthin <- subset(df, (iter %% 500 == 0) | (seq.int(nrow(df)==nrow(df)))

N.B. You should avoid using the name "df" for your variables, because it is the name of a built-in function that you are hiding by doing so. Others may be confused, and eventually you may want to use that function yourself. One solution is to use DF for your variables... another is to use more descriptive names.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi Jeff,

Thanks for your help, but this doesn't work; there are two problems. First, and most important, I need to keep the last record _per category_ (my category is n), not the last one globally. Second, there seems to be an issue with the subset variation, which ends up not filtering anything ... but this is a minor thing.

Best.
Giovanni
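
Both problems can be reproduced on a small example. The sketch below assumes the subset() line was run with the missing closing parenthesis added at the end; the toy data frame DF and the names asPosted and lastGlobal are hypothetical, made up for illustration. In the line as posted, the comparison sits inside seq.int(), so the second operand of `|` is seq.int(TRUE), which evaluates to 1 and is treated as TRUE for every row.

```r
# Hypothetical toy data: two categories, one snapshot every 10 iterations
DF <- data.frame(
  n    = factor(rep(c("1000", "2000"), times = c(120, 130))),
  iter = c(seq(10, 1200, by = 10), seq(10, 1300, by = 10))
)

# As posted (closing parenthesis added at the end, an assumption):
# nrow(DF) == nrow(DF) is TRUE, seq.int(TRUE) is 1, and `|` coerces
# that 1 to TRUE and recycles it, so every row passes the filter.
asPosted <- subset(DF, (iter %% 500 == 0) | (seq.int(nrow(DF) == nrow(DF))))

# With the parenthesis moved, the row index is compared to nrow(DF),
# which adds the globally last row only, not the last row per category.
lastGlobal <- subset(DF, (iter %% 500 == 0) | (seq.int(nrow(DF)) == nrow(DF)))

nrow(asPosted)
nrow(lastGlobal)
```

On this toy data the first call returns all 250 rows, while the second returns five: the multiples of 500 plus the final row of the last category only, so the last record for n = 1000 is still missing.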



> I would like to change the
> subset clause to be iter %% 500 _or_ the record is the last per n

If your data.frame df is sorted by n you can define the function
   isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
and use it as
   subset(df, iter %% 500 == 0 | isLastInRun(n)) 
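
A minimal sketch of the suggestion above on toy data. The data frame DF and the names thinned and thinned2 are hypothetical; the second variant, using base R's duplicated() with fromLast = TRUE, is an alternative not mentioned in the thread, shown for the case where the rows are not sorted by n.

```r
# Hypothetical toy data, sorted by n, one snapshot every 10 iterations
DF <- data.frame(
  n    = factor(rep(c("1000", "2000"), times = c(120, 130))),
  iter = c(seq(10, 1200, by = 10), seq(10, 1300, by = 10))
)

# TRUE wherever the next element differs from the current one, plus the
# final element; this marks the end of each category only if x is sorted
isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)

thinned <- subset(DF, iter %% 500 == 0 | isLastInRun(n))

# Alternative that does not rely on sorting: duplicated(fromLast = TRUE)
# is FALSE only for the last occurrence of each value of n
thinned2 <- subset(DF, iter %% 500 == 0 | !duplicated(n, fromLast = TRUE))

thinned
```

Both calls keep iterations 500, 1000 and 1200 for n = 1000 and 500, 1000 and 1300 for n = 2000, that is, every multiple of 500 plus the final record of each category.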

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
Of Giovanni Azua
Sent: Sunday, September 09, 2012 8:14 AM
To: r-help at r-project.org
Subject: [R] use subset to trim data but include last per category
