Extract

But have we lured you to the dark side with the tidyverse yet ;-)

Thanks.

I found this to be quite informative and a nice example of how useful
R-Help can be as a resource for R users.

Best,
Bert

On Mon, Jul 22, 2024 at 4:50 AM Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Base R. Regarding code improvements:

1. Personally I find (\(...) ...)() notation hard to read (although by
placing (\(x), the body and )() on 3 separate lines it can be improved
somewhat). Instead let us use a named function. The name of the
function can also serve to self document the code.

2. The use of dat both at the start of the pipeline and then again
within a later step of the pipeline goes against a strict left to
right flow. In general if this occurs it is either a sign that we need
to break the pipeline into two or that we need to find another
approach which is what we do here.

We can use the base R code below. Note that the column names produced
by transform(S = read.table(...)) are S.V1, S.V2, etc. so to fix the
column names remove .V from all column names as in the fix_colnames
function shown. It does no harm to apply that to all column names
since the remaining column names will not match.

  fix_colnames <- function(x) {
    setNames(x, sub("\\.V", "", names(x)))
  }

  dat |>
    transform(S = read.table(text = string,
      header = FALSE, fill = TRUE, na.strings = "")) |>
    fix_colnames()
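
To check the claim about the column names, the pipeline can be run on the
sample data from the original question, comparing names() before and after
the fix. This is only a sketch: `before` and `after` are names made up here
for illustration, and it assumes R 4.1 or later for the native pipe.

```r
# Sample data copied from the original question
dat <- read.csv(text = "Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11", header = TRUE)

# Remove the first ".V" found in each column name
fix_colnames <- function(x) {
  setNames(x, sub("\\.V", "", names(x)))
}

# `before` and `after` are illustrative names, not part of the thread
before <- dat |>
  transform(S = read.table(text = string,
    header = FALSE, fill = TRUE, na.strings = ""))
after <- fix_colnames(before)

names(before)  # new columns carry the S.V prefix added by transform
names(after)   # Year, Sex and string are unchanged; the new columns lose .V
```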

Another way to write this, which uses neither a separately defined
function nor the anonymous function notation, is to box the output of
transform:

  dat |>
    transform(S = read.table(text = string,
      header = FALSE, fill = TRUE, na.strings = "")) |>
    list(x = _) |>
    with(setNames(x, sub("\\.V", "", names(x))))

dplyr. Alternatively, use dplyr, in which case we can make use of
rename_with. Here read.table(...) creates column names V1, V2, etc.
and mutate does not change them, so simply replacing V with S at the
start of each column name in the output of read.table will do. Also,
we can pipe the read.table output directly to rename_with using a
nested pipeline (i.e. the second pipe is entirely within mutate rather
than after it) since mutate won't change the column names. This works
because, unlike transform, mutate does not require the S= that is
needed with transform (although mutate would accept it had we wanted
it).

  library(dplyr)

  dat |>
    mutate(read.table(text = string,
      header = FALSE, fill = TRUE, na.strings = "") |>
      rename_with(~ sub("^V", "S", .x))
    )

On Sun, Jul 21, 2024 at 3:08 PM Bert Gunter <bgunter.4567 at gmail.com>
wrote:
As always, good point.
Here's a piped version of your code for those who are pipe
aficionados. As I'm not very skilled with pipes, it might certainly
be improved.
dat <-
      dat$string |>
         read.table(text = _, fill = TRUE, header = FALSE, na.strings = "") |>
         (\(x)'names<-'(x,paste0("s", seq_along(x))))() |>
         (\(x)cbind(dat, x))()

-- Bert

On Sun, Jul 21, 2024 at 11:30 AM Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
Fixing col.names=paste0("S", 1:5) assumes that there will be 5 columns
and we may not want to do that. If there are only 3 fields in string,
at the most, we may wish to generate only 3 columns.
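
The difference can be sketched with a small made-up input that has at most
3 fields per row. The vector `x` and the names `fixed` and `adaptive` are
hypothetical and used only for this illustration.

```r
# Hypothetical input: no row has more than 3 fields
x <- c("15 xc Ab", "14", "13 25")

# Fixed names: col.names of length 5 forces five columns
fixed <- read.table(text = x, fill = TRUE, header = FALSE,
                    na.strings = "", col.names = paste0("S", 1:5))

# Names derived from the result: one name per column actually read
adaptive <- read.table(text = x, fill = TRUE, header = FALSE,
                       na.strings = "")
names(adaptive) <- paste0("S", seq_along(adaptive))

ncol(fixed)     # number of columns with fixed names
ncol(adaptive)  # number of columns when names follow the data
```

One caveat worth checking against ?read.table: when col.names is not given,
the number of columns is taken from the first five lines of input, so a row
further down with more fields may not be read as intended.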

On Sun, Jul 21, 2024 at 2:20 PM Bert Gunter <bgunter.4567 at gmail.com>
wrote:
Nice! -- Let read.table do the work of handling the NA's.
However, even simpler is to use the 'col.names' argument of
read.table() for the column names, no?

      string <- read.table(text = dat$string, fill = TRUE, header = FALSE,
                           na.strings = "", col.names = paste0("s", 1:5))
      dat <- cbind(dat, string)

-- Bert

On Sun, Jul 21, 2024 at 10:16 AM Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
We can use read.table for a base R solution

string <- read.table(text = dat$string, fill = TRUE, header = FALSE,
                     na.strings = "")
names(string) <- paste0("S", seq_along(string))
cbind(dat[-3], string)
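
For readers who want to run this against the sample data in the original
question, a self-contained sketch follows. It differs from the code above in
one deliberate respect: it uses cbind(dat, ...) rather than
cbind(dat[-3], ...), so the string column is kept, as in the desired output
shown in the question. The names `parts` and `result` are introduced here
only for illustration.

```r
# Sample data copied from the original question
dat <- read.csv(text = "Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11", header = TRUE)

# Split the string column on whitespace; short rows are padded by fill = TRUE
# and the resulting blank fields become NA via na.strings = ""
parts <- read.table(text = dat$string, fill = TRUE, header = FALSE,
                    na.strings = "")

# Name the new columns S1, S2, ... according to how many were read
names(parts) <- paste0("S", seq_along(parts))

# Keep the original columns (including string) and append the new ones
result <- cbind(dat, parts)
result
```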

On Fri, Jul 19, 2024 at 12:52 PM Val <valkremk at gmail.com> wrote:
Hi All,

I want to extract new variables from a string and add them to the
dataframe. The sample data is a csv file.

dat<-read.csv(text="Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11",header=TRUE)

The string column has a maximum of five variables. Some rows have all
five and others do not. If a variable is missing, then fill it with NA.
The desired result is shown below.

Year,Sex,string, S1, S2, S3, S4,S5
2002,F,15 xc Ab, 15,xc,Ab, NA, NA
2003,F,14, 14,NA,NA,NA,NA
2004,M,18 xb 25 35 21,18, xb, 25, 35, 21
2005,M,13 25,13, 25,NA,NA,NA
2006,M,14 ac 256 AV 35, 14, ac, 256, AV, 35
2007,F,11, 11,NA,NA,NA,NA

Any help?
Thank you in advance.

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

