Skip to content

Extract

19 messages · Robert Knight, Ebert,Timothy Aaron, Val +5 more

Val
#
Hi All,

I want to extract new variables from a string and add it to the dataframe.
Sample data is csv file.

dat<-read.csv(text="Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11",header=TRUE)

The string column has  a maximum of five variables. Some rows have all
and others may not have all the five variables. If missing then  fill
it with NA,
Desired result is shown below,


Year,Sex,string, S1, S2, S3 S4,S5
2002,F,15 xc Ab, 15,xc,Ab, NA, NA
2003,F,14, 14,NA,NA,NA,NA
2004,M,18 xb 25 35 21,18, xb, 25, 35, 21
2005,M,13 25,13, 25,NA,NA,NA
2006,M,14 ac 256 AV 35, 14, ac, 256, AV, 35
2007,F,11, 11,NA,NA,NA,NA

Any help?
Thank you in advance.
#
I would split dat$string into it's own vector, break it apart at the spaces
into an array, and then place dat$year and dat$sex in positions 1 and 2 of
that newly created array.
On Fri, Jul 19, 2024, 12:52?PM Val <valkremk at gmail.com> wrote:

            

  
  
#
The desired result is odd.
1) It looks like the string is duplicated in the desired result. The first line of data has "15, xc, Ab",  and the desired result has "15, xc, Ab, 15, xc, Ab"
2) The example has S1 through S5, but the desired result has data for eight variables in the first line (not five).
3) The desired result has a different number of variables for each line.
4) Are you assuming that all missing data is at the end of the string? If there are 5 variables (S1 .... S5), do you know that "15, xc, Ab" is S1 = 15, S2 = 'xc', and S3 = 'Ab' rather than S2=15, S4='xc' and S5='Ab' ?

This isn't exactly what you asked for, but maybe I was confused somewhere. This approach puts string data into variables in order. In this approach one mixes string and numeric data. The string is not duplicated.

library(tidyr)

dat <- read.csv(text="Year,Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11", header=TRUE, stringsAsFactors=FALSE)

# split the 'string' column based on spaces
dat_separated <- dat |>
  separate(string, into = paste0("S", 1:5), sep = " ",
           fill = "right", extra = "merge")

Tim


-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Val
Sent: Friday, July 19, 2024 12:52 PM
To: r-help at R-project.org (r-help at r-project.org) <r-help at r-project.org>
Subject: [R] Extract

[External Email]

Hi All,

I want to extract new variables from a string and add it to the dataframe.
Sample data is csv file.

dat<-read.csv(text="Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11",header=TRUE)

The string column has  a maximum of five variables. Some rows have all and others may not have all the five variables. If missing then  fill it with NA, Desired result is shown below,


Year,Sex,string, S1, S2, S3 S4,S5
2002,F,15 xc Ab, 15,xc,Ab, NA, NA
2003,F,14, 14,NA,NA,NA,NA
2004,M,18 xb 25 35 21,18, xb, 25, 35, 21
2005,M,13 25,13, 25,NA,NA,NA
2006,M,14 ac 256 AV 35, 14, ac, 256, AV, 35
2007,F,11, 11,NA,NA,NA,NA

Any help?
Thank you in advance.

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Val
#
Thank you and sorry for the confusion.
The desired result should have 8 variables as a comma separated in
each line.  The string variable  is  considered as one variable.
The output of your script is wfine for me.  Thank you!
On Fri, Jul 19, 2024 at 1:00?PM Ebert,Timothy Aaron <tebert at ufl.edu> wrote:
#
Here is another way... for data analysis, the idiomatic result is usually more useful, though for presentation in a final result the wide result might be desired.

library(dplyr)
library(tidyr)

dat<-read.csv(text=
"Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11"
, header=TRUE )

idiomatic <- (
    dat
    %>% mutate( string = strsplit( string, " " ) )
    %>% unnest( cols = string )
    %>% group_by( Year, Sex )
    %>% mutate( s_name = paste0( "S", seq_along( string ) ) )
    %>% ungroup()
)
idiomatic # each row has unique Year, Sex, and s_name

wide <- (
    idiomatic
    %>% spread( s_name, string )
)
wide
On July 19, 2024 11:23:48 AM PDT, Val <valkremk at gmail.com> wrote:

  
    
#
I did not look closely at the solutions that you were offered, but
note that you did not specify in your post whether the numbers in your
string were to be character or numeric variables after they are broken
out into their own columns. I believe that they are character in the
solutions, but you should check this. If you want them as numeric,
e.g., for further processing, you will need to convert them. Or
vice-versa.

Bert
On Fri, Jul 19, 2024 at 9:52?AM Val <valkremk at gmail.com> wrote:
Val
#
Thank you Jeff and Bert for your help!
The components of the string  could be nixed (i.e,  numeric, character
or date). Once that is splitted it would be easy for me to format it
accordingly.
On Fri, Jul 19, 2024 at 2:10?PM Bert Gunter <bgunter.4567 at gmail.com> wrote:
1 day later
#
Val:
I wanted to add here a base R solution to your problem that I realize
you can happily ignore. However, in the course of puzzling over how to
do it using the R native pipe syntax ("|>") , I learned some new stuff
that I thought others might find useful, and it seemed sensible to
keep the code with this thread for comparison.

 I want to acknowledge that in the course of my labor, I posted a
query to R-Help to which Iris Simmons posted a very clever answer that
I would never have figured out myself and that is used below at the
end to change a subset of the names of the modified data frame via a
pipe.

Here's the whole solution starting from your (excellent!) example dat:

   dat <- dat$string |>
      strsplit(" ") |>
      sapply(FUN = \(x)c(x, rep(NA, 5 - length(x)))) |>
      t() |> cbind(dat, ..2 = _)

   ## And Iris's trick for changing a subset of attributes, i.e. the
"names", in a pipe
   dat |> names() |> _[4:8] <- paste0("s", 1:5)

## and here's the result:
Year Sex          string s1   s2   s3   s4   s5
1 2002   F        15 xc Ab 15   xc   Ab <NA> <NA>
2 2003   F              14 14 <NA> <NA> <NA> <NA>
3 2004   M  18 xb 25 35 21 18   xb   25   35   21
4 2005   M           13 25 13   25 <NA> <NA> <NA>
5 2006   M 14 ac 256 AV 35 14   ac  256   AV   35
6 2007   F              11 11 <NA> <NA> <NA> <NA>

As I noted previously, all columns beyond Sex are character

Cheers,
Bert
On Fri, Jul 19, 2024 at 12:26?PM Val <valkremk at gmail.com> wrote:
Val
#
Thank   you Bert!
However, the last line of the script.

dat |> names() |> _[4:8] <- paste0("s", 1:5)

is giving me an error as shown below
Error: pipe placeholder can only be used as a named argument

Thank you!
On Sat, Jul 20, 2024 at 7:41?PM Bert Gunter <bgunter.4567 at gmail.com> wrote:
#
We can use read.table for a base R solution

string <- read.table(text = dat$string, fill = TRUE, header = FALSE,
na.strings = "")
names(string) <- paste0("S", seq_along(string))
cbind(dat[-3], string)
On Fri, Jul 19, 2024 at 12:52?PM Val <valkremk at gmail.com> wrote:

  
    
#
I get no error. Please show the entirety of the code you used that
produced the error. Also, are you using a current R version? I am, and
if you are not, there might have been changes from your version to
mine that caused the error.

However, as you were already given satisfactory solutions before, and
Gabor has provided you a simple non-piped base R version that is
probably better anyway, feel free to ignore my request as a waste of
your time.

-- Bert
On Sun, Jul 21, 2024 at 9:36?AM Val <valkremk at gmail.com> wrote:
#
Nice! -- Let read.table do the work of handling the NA's.
However, even simpler is to use the 'colnames' argument of
read.table() for the column names no?

      string <- read.table(text = dat$string, fill = TRUE, header =
FALSE, na.strings = "",
col.names = paste0("s", 1:5))
      dat <- cbind(dat, string)

-- Bert

On Sun, Jul 21, 2024 at 10:16?AM Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
#
Fixing col.names=paste0("S", 1:5) assumes that there will be 5 columns and
we may not want to do that.  If there are only 3 fields in string, at the most,
we may wish to generate only 3 columns.
On Sun, Jul 21, 2024 at 2:20?PM Bert Gunter <bgunter.4567 at gmail.com> wrote:

  
    
#
As always, good point.
Here's a piped version of your code for those who are pipe
afficianados. As I'm not very skilled with pipes, it might certainly
be improved.
dat <-
      dat$string |>
         read.table( text = _, fill = TRUE, header = FALSE, na.strings = "")  |>
         (\(x)'names<-'(x,paste0("s", seq_along(x))))() |>
         (\(x)cbind(dat, x))()

-- Bert


On Sun, Jul 21, 2024 at 11:30?AM Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
#
Base R. Regarding code improvements:

1. Personally I find (\(...) ...)() notation hard to read (although by
placing (\(x), the body and )() on 3 separate lines it can be improved
somewhat). Instead let us use a named function. The name of the
function can also serve to self document the code.

2. The use of dat both at the start of the pipeline and then again
within a later step of the pipeline goes against a strict left to
right flow. In general if this occurs it is either a sign that we need
to break the pipeline into two or that we need to find another
approach which is what we do here.

We can use the base R code below. Note that the column names produced
by transform(S = read.table(...)) are S.V1, S.V2, etc. so to fix the
column names remove .V from all column names as in the fix_colnames
function shown. It does no harm to apply that to all column names
since the remaining column names will not match.

  fix_colnames <- function(x) {
    setNames(x, sub("\\.V", "", names(x)))
  }

  dat |>
     transform(S = read.table(text = string,
       header = FALSE, fill = TRUE, na.strings = "")) |>
       fix_colnames()

Another way to write this which does not use a separate defined
function nor the anonymous function notation is to box the output of
transform:

  dat |>
     transform(S = read.table(text = string,
       header = FALSE, fill = TRUE, na.strings = "")) |>
       list(x = _) |>
       with( setNames(x, sub("\\.V", "", names(x))) )

dplyr. Alternately use dplyr in which case we can make use of
rename_with . In this case read.table(...) creates column names V1,
V2, etc. and mutate does not change them so simply replacing V with S
at the start of each column name in the output of read.table will do.
Also we can pipe the read.table output directly to rename_with using a
nested pipeline, i.e. the second pipe is entirely within mutate rather
than after it) since mutate won't change the column names. The win
here is because, unlike transform, mutate does not require the S= that
is needed with transform (although it allows it had we wanted it).

  library(dplyr)

  dat |>
     mutate(read.table(text = string,
       header = FALSE, fill = TRUE, na.strings = "")  |>
      rename_with(~ sub("^V", "S", .x))
    )
On Sun, Jul 21, 2024 at 3:08?PM Bert Gunter <bgunter.4567 at gmail.com> wrote:

  
    
#
Excellent message, Gabor.

Many tools we use are quite flexible and I just want to mention dplyr does have ways to use something like mutate to rename a column, albeit rename(0 is more specifically designed to do the job.

Here is an example of how mutate() can rename by making a new column and removing the old by using a sort of pipeline within mutate():

mydata <- data.frame(a=1, b=2)
mutate(mydata, 
       c=a, 
       a=NULL, 
       d=b, 
       b=NULL)

The result:
c d
1 1 2

It is effectively the same as following up with a select as an alternative:

mydata |>
  mutate(c=a,
         d=b) |>
  select(c,d)

What people may not quite have grasped is that pipes are not a panacea and can be used alongside all kinds of other methods. Much of dplyr, such as shown above, but also in things like the filter() verb, does a sort of internal pipelining and can apply successive transformations before returning a result suitable for another part of a pipeline. Part of the philosophy was to make more functions where the first argument was something like a data.frame object (but it could be other things) that could be passed along in a pipeline. Trying to shoehorn in other functions that want the item in other positions makes for less intuitive code using place markers like period or underscore.

Pipelines are seen by many as a linear construct but as you point out, with careful design, you can make bigger pipelines that are more like graphs with some regions being a sub-pipeline and do fairly complex things, albeit hard for people to read and understand.

Maybe later, we can discuss again why some people insist on some kind of purity of using the base of languages that are not really expected to stay still but to evolve.


-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Gabor Grothendieck
Sent: Monday, July 22, 2024 7:49 AM
To: Bert Gunter <bgunter.4567 at gmail.com>
Cc: r-help at R-project.org (r-help at r-project.org) <r-help at r-project.org>
Subject: Re: [R] Extract

Base R. Regarding code improvements:

1. Personally I find (\(...) ...)() notation hard to read (although by
placing (\(x), the body and )() on 3 separate lines it can be improved
somewhat). Instead let us use a named function. The name of the
function can also serve to self document the code.

2. The use of dat both at the start of the pipeline and then again
within a later step of the pipeline goes against a strict left to
right flow. In general if this occurs it is either a sign that we need
to break the pipeline into two or that we need to find another
approach which is what we do here.

We can use the base R code below. Note that the column names produced
by transform(S = read.table(...)) are S.V1, S.V2, etc. so to fix the
column names remove .V from all column names as in the fix_colnames
function shown. It does no harm to apply that to all column names
since the remaining column names will not match.

  fix_colnames <- function(x) {
    setNames(x, sub("\\.V", "", names(x)))
  }

  dat |>
     transform(S = read.table(text = string,
       header = FALSE, fill = TRUE, na.strings = "")) |>
       fix_colnames()

Another way to write this which does not use a separate defined
function nor the anonymous function notation is to box the output of
transform:

  dat |>
     transform(S = read.table(text = string,
       header = FALSE, fill = TRUE, na.strings = "")) |>
       list(x = _) |>
       with( setNames(x, sub("\\.V", "", names(x))) )

dplyr. Alternately use dplyr in which case we can make use of
rename_with . In this case read.table(...) creates column names V1,
V2, etc. and mutate does not change them so simply replacing V with S
at the start of each column name in the output of read.table will do.
Also we can pipe the read.table output directly to rename_with using a
nested pipeline, i.e. the second pipe is entirely within mutate rather
than after it) since mutate won't change the column names. The win
here is because, unlike transform, mutate does not require the S= that
is needed with transform (although it allows it had we wanted it).

  library(dplyr)

  dat |>
     mutate(read.table(text = string,
       header = FALSE, fill = TRUE, na.strings = "")  |>
      rename_with(~ sub("^V", "S", .x))
    )
On Sun, Jul 21, 2024 at 3:08?PM Bert Gunter <bgunter.4567 at gmail.com> wrote:

  
    
#
Thanks.

I found this to be quite informative and a nice example of how useful
R-Help can be as a resource for R users.

Best,
Bert

On Mon, Jul 22, 2024 at 4:50?AM Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
#
I had missed that one can pass fix.empty.names = TRUE to transform and
if we do that then we can
put an unnamed data.frame in transform like we can with mutate so
making that change we have the following
base R solution where there is an inner nested pipeline within the
outer pipeline as with the dplyr example.

  transform(dat,
    read.table(text = string, header = FALSE, na.strings = "", fill =
TRUE), fix.empty.names = TRUE) |>
      list(x = _) |>
      with( setNames(x, sub("V", "S", names(x)) )
    )


On Mon, Jul 22, 2024 at 7:49?AM Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:

  
    
#
But have we lured you to the dark side with the tidyverse yet ;-)
On Mon, 22 Jul 2024, 15:22 Bert Gunter, <bgunter.4567 at gmail.com> wrote: