
split character vector by multiple keywords simultaneously

4 messages · Andrew Robinson, Greg Snow, sunny

#
Hi. I have a character vector that looks like this:
[1] "Company name: The first company General Manager: John Doe I Managers: John Doe II, John Doe III"
[2] "Company name: The second company General Manager: Jane Doe I"
[3] "Company name: The third company Managers: Jane Doe II, Jane Doe III"

I know all the keywords, i.e. "Company name:", "General Manager:",
"Managers:" etc. I'm looking for a way to split this character vector into
multiple character vectors, with one column for each keyword and the
corresponding values for each, i.e.

        Company name  General Manager  Managers
1  The first company  John Doe I       John Doe II, John Doe III
2 The second company  Jane Doe I
3  The third company                   Jane Doe II, Jane Doe III

I have tried a lot to find something suitable but haven't so far. Any help
will be greatly appreciated. I am running R-2.12.1 on x86_64 linux.

Thanks.

--
View this message in context: http://r.789695.n4.nabble.com/split-character-vector-by-multiple-keywords-simultaneously-tp3497033p3497033.html
Sent from the R help mailing list archive at Nabble.com.
#
A hack would be to use gsub() to prepend e.g. XXX to the keywords that
you want, perform a strsplit() to break the lines into component
strings, and then substr() to extract the pieces that you want from
those strings.
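
For anyone following along, the gsub()/strsplit()/substr() idea could be sketched roughly as below for a single string. This is only an illustration under assumptions: the marker "XXX" is assumed never to occur in the text itself, the variable names (x, keywords, marked, pieces, values) are made up for the example, and startsWith() and trimws() need a more recent R than 2.12.

```r
# One input string and the known keywords (taken from the original post)
x <- "Company name: The first company General Manager: John Doe I Managers: John Doe II, John Doe III"
keywords <- c("Company name:", "General Manager:", "Managers:")

# Step 1: prepend a marker to every keyword (literal match, no regex)
marked <- x
for (k in keywords) {
  marked <- gsub(k, paste0("XXX", k), marked, fixed = TRUE)
}

# Step 2: split on the marker and drop the empty leading piece
pieces <- strsplit(marked, "XXX", fixed = TRUE)[[1]]
pieces <- pieces[pieces != ""]

# Step 3: for each piece, find the keyword it starts with and use substr()
# to keep only what follows that keyword
values <- character(0)
for (p in pieces) {
  k <- keywords[startsWith(p, keywords)]
  values[k] <- trimws(substr(p, nchar(k) + 1, nchar(p)))
}
values
```

The result is a named character vector with one entry per keyword found in the string; keywords that are absent simply do not appear.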

Cheers

Andrew
On Wed, May 04, 2011 at 04:08:40PM -0700, sunny wrote:

  
    
1 day later
#
Will all the keywords always be present in the same order?  Or are you looking for the keywords, but some may be absent or in different orders?

Look into the gsubfn package for some tools that could help.


1 day later
#
Andrew Robinson-6 wrote:
Thanks, that got me started. I am sure there are much easier ways of doing
this, but in case someone comes looking, here's my solution:

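# temp is assumed to hold the input character vector from the original post.
# The keywords must match the text exactly (gsub is case-sensitive):
keywordlist <- c("Company name:", "General Manager:", "Managers:")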

# Attach "XXX" to the beginning of each keyword (fixed=TRUE matches the
# keyword literally rather than as a regular expression):
for (i in seq_along(keywordlist)) {
  temp <- gsub(keywordlist[i], paste("XXX", keywordlist[i], sep=""), temp, fixed=TRUE)
}

# Split each row into a list:
temp <- strsplit(temp,"XXX")
# Eliminate empty elements:
temp <- lapply(temp, function(x) x[which(x!='')])

# Since each keyword happens to include a colon at the end, split each list
# element generated above into exactly two parts, pre-colon for the keyword
# and post-colon for the value. Since values may contain colons themselves,
# avoid spurious matches by using n=2 in the str_split_fixed function from the
# stringr package:
library(stringr)
temp <- lapply(temp, function(x) str_split_fixed(x, ':', n=2))

# Convert each list element into a data frame. The transpose makes sure that
# the first row of each data frame is the set of keywords. Each data frame has
# 2 rows - one with the keywords and the second with the values.
# stringsAsFactors=FALSE keeps the entries as character rather than factor:
temp <- lapply(temp, function(x) as.data.frame(t(x), stringsAsFactors=FALSE))

# Copy the first row of each data frame to the name of the corresponding
# column:
for (i in seq_along(temp)) {
  names(temp[[i]]) <- as.character(temp[[i]][1,])
}

# Now join all the data frames in the list by column names. This way it
# doesn't matter if some keywords are absent in some cases. rbind.fill comes
# from the plyr package:
library(plyr)
final_data <- do.call(rbind.fill, temp)

# We now have one large data frame with the odd numbered rows containing the
# keywords and the even numbered rows containing the values. Since we already
# have the keywords in the names, we can eliminate the odd numbered rows:
final_data <- final_data[seq(2, nrow(final_data), 2), ]
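
In case it helps someone, the same result can probably be reached with base R alone by locating each keyword and cutting out the text between consecutive keywords. The sketch below is a rough alternative, not the approach used above: the function name split_by_keywords is made up, it assumes each keyword occurs at most once per string and that no keyword is contained in another, and trimws() needs a more recent R than 2.12.

```r
# Hypothetical helper: one row per input string, one column per keyword,
# NA where a keyword is absent from a string.
split_by_keywords <- function(x, keywords) {
  out <- matrix(NA_character_, nrow = length(x), ncol = length(keywords))
  for (i in seq_along(x)) {
    # Position of the literal match of each keyword (-1 if absent)
    start <- sapply(keywords, function(k) regexpr(k, x[i], fixed = TRUE)[1])
    found <- which(start > 0)
    if (length(found) == 0) next
    # Order the keywords that were found by where they occur in the string
    found <- found[order(start[found])]
    # Each value runs from the end of its keyword to just before the next one
    from <- start[found] + nchar(keywords[found])
    to <- c(start[found][-1] - 1, nchar(x[i]))
    out[i, found] <- trimws(substring(x[i], from, to))
  }
  df <- as.data.frame(out, stringsAsFactors = FALSE)
  # Use the keywords without the trailing colon as column names
  names(df) <- sub(":$", "", keywords)
  df
}

x <- c(
  "Company name: The first company General Manager: John Doe I Managers: John Doe II, John Doe III",
  "Company name: The second company General Manager: Jane Doe I",
  "Company name: The third company Managers: Jane Doe II, Jane Doe III"
)
keywords <- c("Company name:", "General Manager:", "Managers:")
result <- split_by_keywords(x, keywords)
result
```

Because positions are looked up per string, the keywords may appear in any order and any of them may be missing; missing ones come out as NA rather than as empty strings.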

-S.

--
View this message in context: http://r.789695.n4.nabble.com/split-character-vector-by-multiple-keywords-simultaneously-tp3497033p3506776.html