extracting characters from a string

Hi,
You could try this:
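dat1 <- read.table(text = pub, sep = ",", fill = TRUE, stringsAsFactors = FALSE)
# "^ " removes a leading space, "\\w+$" removes the trailing initials, " $" removes the space left behind
dat2 <- as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors = FALSE)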

dat2
#       V1            V2       V3       V4
#1   Brown        Santos     Rome Don Juan
#2 Benigni
#3  Arstra Van den Hoops lamarque
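
Since an explanation of the pattern was requested, here is a minimal sketch that applies the two nested gsub() calls one step at a time to a single field. The variable names (x, step1, step2, no_initials) are illustrative only and are not part of the code above. The last line shows a limitation worth knowing: a field with no initials at all is emptied, because "\\w+$" then matches the whole name.

```r
# One field as it comes out of read.table(): leading space, trailing initials
x <- " Van den Hoops DD"

# "^ " matches a single leading space; "\\w+$" matches the last run of
# word characters (the initials). Both matches are replaced with "".
step1 <- gsub("^ |\\w+$", "", x)
print(step1)

# The space that separated the name from the initials is still there;
# " $" removes it.
step2 <- gsub(" $", "", step1)
print(step2)

# Caveat: with no initials, "\\w+$" matches the entire name.
no_initials <- gsub(" $", "", gsub("^ |\\w+$", "", "Benigni"))
print(no_initials)
```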
A.K.

----- Original Message -----
From: Biau David <djmbiau at yahoo.fr>
To: r help list <r-help at r-project.org>
Cc: 
Sent: Wednesday, January 23, 2013 12:38 PM
Subject: [R] extracting characters from a string

Dear All,

I have a data frame of vectors of publication names such as 'pub':

pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D')

pub <- rbind(pub1, pub2, pub3)

I would like to construct a data frame containing only the authors' last names, with each last name in its own column and each publication in its own row. Basically I want to get rid of the initials (max 2, always before a comma) and the spaces surrounding each last name. I would like to avoid a loop.

ps: If I could have even a short explanation of the code that extracts the values from the character string, that would also be great!

David

[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
1. Study a regular expression tutorial on the web to learn how to do this.

2. ?regex in R summarizes (tersely! -- but clearly) R's regular expression syntax.

3. ?grep tells you about R's regular expression manipulation functions.
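
As a minimal sketch of the functions those help pages cover, assuming a small made-up vector x in the shape of the poster's data, the calls below show matching, replacing and extracting. The variable names are illustrative only.

```r
x <- c("Brown DK", " Santos R", " Don Juan X")

# grepl(): does the pattern match? Here: does the element start with a space?
starts_with_space <- grepl("^ ", x)

# sub(): replace the first match. Here: remove one leading space.
no_lead <- sub("^ ", "", x)

# sub() again: remove a space followed by letters at the end of the string.
no_initials <- sub(" [[:alpha:]]+$", "", x)

# regexpr() + regmatches(): extract the first run of letters.
first_word <- regmatches(x, regexpr("[[:alpha:]]+", x))

print(starts_with_space)
print(no_lead)
print(no_initials)
print(first_word)
```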

-- Bert

Bert Gunter
Genentech Nonclinical Biostatistics

Hello,

Try the following.

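fun <- function(x, sep = ", "){
	# split each publication string into one element per author
	s <- unlist(strsplit(x, sep))
	# keep the first run of alphabetic characters in each element
	regmatches(s, regexpr("[[:alpha:]]*", s))
}

fun(pub)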

Hope this helps,

Rui Barradas


Hello,

I've just noticed that my first solution would only return the first set 
of alphabetic characters, such as "Van", not "Van den Hoops".
The following will solve that problem.

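fun2 <- function(x, sep = ", "){
	# one character vector of authors per publication
	x <- strsplit(x, sep)
	# locate the trailing " <initials>" in each author string
	m <- lapply(x, function(y) gregexpr(" [[:alpha:]]*$", y))
	# invert = TRUE keeps everything except the matched initials
	res <- lapply(seq_along(x), function(i)
		regmatches(x[[i]], m[[i]], invert = TRUE))
	res <- lapply(res, unlist)
	# drop the empty strings left behind by the removed match
	lapply(res, function(y) y[nchar(y) > 0])
}
fun2(pub)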
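
The result is a list with one element per publication, while the original question asked for a data frame with publications in rows and last names in columns. The following is only a sketch of one way to get there, not part of the function above. It assumes the initials are a trailing block of one or two letters, and it uses trimws() and lengths(), which need R 3.2.0 or later. The names authors, n, padded and res are illustrative.

```r
pub <- c("Brown DK, Santos R, Rome DF, Don Juan X",
         "Benigni D",
         "Arstra SD, Van den Hoops DD, lamarque D")

# Split on commas, trim surrounding spaces, then drop a trailing
# space plus one or two letters (the initials).
authors <- lapply(strsplit(pub, ","), function(a) {
  sub(" [[:alpha:]]{1,2}$", "", trimws(a))
})

# Pad every publication with NA up to the longest author list,
# so the rows can be bound into a rectangular structure.
n <- max(lengths(authors))
padded <- lapply(authors, function(a) c(a, rep(NA_character_, n - length(a))))

res <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(res)
```

Under that assumption, a multi-word surname whose last word has only one or two letters and no initials after it would be shortened, so the pattern should be checked against the real data.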

Hope this helps,

Rui Barradas
