extracting character values

Dear all,

I have a data frame of names (netw), with each cell containing the last name and initials of an author; some cells are NA. I would like to extract only the last name from each cell into a new data frame called 'res'.

Here is what I do:

res <- data.frame(matrix(NA, nrow=dim(netw)[1], ncol=dim(netw)[2]))

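# 'x' was not defined in the original post; looping over all columns of netw is assumed
for (i in seq_len(ncol(netw)))
{
  wh <- regexpr('[a-z]{3,}', as.character(netw[,i]))
  res[i] <- substring(as.character(netw[,i]), wh, wh + attr(wh,'match.length')-1)
}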

The problem is that I cannot properly extract 'complex' names such as ' van der hoops bf  ': here I only get 'van', whereas the real last name is 'van der hoops' and 'bf' are the initials. Basically, the last name always has at least 3 consecutive letters, but it may consist of several such groups of letters separated by one or more spaces; the cell may start with a space too; initials never have more than 2 letters.

Would someone have a nice idea for that? Thanks,

David

Maybe some people will, but an example of your data would actually help
them to help.

Your code is not reproducible without providing the netw object.

Best,
Uwe Ligges

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Hi,

Not sure this helps:
netw<-read.table(text="
lastname_initial, year
Aaron H, 1900
Beecher HW, 1947
Cannon JP, 1985
Stone WC, 1982
 van der hoops bf, 1948
NA, 1976
",sep=",",header=TRUE,stringsAsFactors=FALSE)

res1<-sub("^[[:space:]]*(.*?)[[:space:]]*$","\\1",gsub("\\w+$","",netw[,1]))
res1[!is.na(res1)]
#[1] "Aaron"         "Beecher"       "Cannon"        "Stone"
#[5] "van der hoops"
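
The call above removes whatever final word is present, so a cell that holds a last name without initials would lose part of the name. As a possible variation (a sketch only, not tested on the real data), the rule stated in the question, that initials never have more than 2 letters, can be put into the pattern and applied to every column to build 'res'. The helper name extract_lastname and the two-column example data are made up for illustration.

```r
# Hypothetical helper: trim the cell, then drop a trailing block of
# one or two letters (the initials). NA cells stay NA.
extract_lastname <- function(x) {
  x <- as.character(x)
  # remove leading and trailing whitespace
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)
  # remove the final 1-2 letter word together with the spaces before it
  sub("[[:space:]]+[[:alpha:]]{1,2}$", "", x)
}

# Made-up example data with the same kinds of cells as in the question
netw <- data.frame(
  author1 = c("Aaron H", "Beecher HW", " van der hoops bf  ", NA),
  author2 = c("Stone WC", NA, "Cannon JP", "de la cruz m"),
  stringsAsFactors = FALSE
)

# Apply the helper column by column; res has the same shape as netw
res <- as.data.frame(lapply(netw, extract_lastname), stringsAsFactors = FALSE)
res$author1
# [1] "Aaron"         "Beecher"       "van der hoops" NA
```

Note that a name whose last word has only one or two letters and no initials after it would still be shortened, and initials separated by spaces (e.g. 'H W') would need the pattern to be repeated.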
A.K.

----- Original Message -----
From: Biau David <djmbiau at yahoo.fr>
To: r help list <r-help at r-project.org>
Cc: 
Sent: Sunday, January 13, 2013 3:53 AM
Subject: [R] extracting character values
