I have a list of scientific names in a data set. I would like to split the
names into genus, species and subspecies. Not all names include a
subspecies. Could someone show me how to do this?
My example code is:
a <- matrix(c('genusA speciesA', 10,
              'genusB speciesAA', 20,
              'genusC speciesAAA subspeciesA', 15,
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)
aa <- data.frame(a)
colnames(aa) <- c('species', 'counts')
aa
# The code returns
                        species counts
1               genusA speciesA     10
2              genusB speciesAA     20
3 genusC speciesAAA subspeciesA     15
4 genusC speciesAAA subspeciesB     25
# I would like there to be 4 columns as below
genus   species     subspecies     counts
genusA  speciesA    no.subspecies  10
genusB  speciesAA   no.subspecies  20
genusC  speciesAAA  subspeciesA    15
genusC  speciesAAA  subspeciesB    25
I have tried using 'strsplit', but cannot get the desired result. Thank you
for any help with this.
Mark Miller
Gainesville, Florida
The following seems to work for your example. However, others
can probably propose a less clumsy version (but at least this
one breaks it down into its elements):
a <- matrix(c('genusA speciesA', 10,
              'genusB speciesAA', 20,
              'genusC speciesAAA subspeciesA', 15,
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)
a
#      [,1]                            [,2]
# [1,] "genusA speciesA"               "10"
# [2,] "genusB speciesAA"              "20"
# [3,] "genusC speciesAAA subspeciesA" "15"
# [4,] "genusC speciesAAA subspeciesB" "25"
A <- NULL
for (i in 1:nrow(a)) {
  Names <- unlist(strsplit(a[i,1], "[ ]+"))
  if (length(Names) == 2) Names <- c(Names, "no.subspecies")
  A <- rbind(A, c(Names, a[i,2]))
}
colnames(A) <- c("Genus","Species","Subspecies","Count")
A <- as.data.frame(A)
A$Count <- as.numeric(A$Count)
A
#    Genus    Species    Subspecies Count
# 1 genusA   speciesA no.subspecies     1
# 2 genusB  speciesAA no.subspecies     3
# 3 genusC speciesAAA   subspeciesA     2
# 4 genusC speciesAAA   subspeciesB     4
Hoping this helps!
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Nov-09 Time: 21:37:03
------------------------------ XFMail ------------------------------
OOPS! Sorry, I made an oversight in the code I posted just now
(and I didn't check the result carefully enough ... ).
The line which was A$Count <- as.numeric(A$Count) should have
been A$Count <- as.numeric(levels(A$Count)) (i.e. I overlooked
that A$Count as first constructed is a *factor*)!
So the full corrected code is as follows:
a <- matrix(c('genusA speciesA', 10,
              'genusB speciesAA', 20,
              'genusC speciesAAA subspeciesA', 15,
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)
a
#      [,1]                            [,2]
# [1,] "genusA speciesA"               "10"
# [2,] "genusB speciesAA"              "20"
# [3,] "genusC speciesAAA subspeciesA" "15"
# [4,] "genusC speciesAAA subspeciesB" "25"
A <- NULL
for (i in 1:nrow(a)) {
  Names <- unlist(strsplit(a[i,1], "[ ]+"))
  if (length(Names) == 2) Names <- c(Names, "no.subspecies")
  A <- rbind(A, c(Names, a[i,2]))
}
colnames(A) <- c("Genus","Species","Subspecies","Count")
A <- as.data.frame(A)
A$Count <- as.numeric(levels(A$Count))
A
#    Genus    Species    Subspecies Count
# 1 genusA   speciesA no.subspecies    10
# 2 genusB  speciesAA no.subspecies    15
# 3 genusC speciesAAA   subspeciesA    20
# 4 genusC speciesAAA   subspeciesB    25
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Nov-09 Time: 21:47:29
------------------------------ XFMail ------------------------------
OOPS^2!! Did it again. The version given below now does seem to work
properly: the last line is now changed (yet again) to
A$Count <- as.numeric(levels(A$Count)[unclass(A$Count)])
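
A small sketch of why the first two attempts returned the wrong counts,
assuming A$Count is a factor (which is what as.data.frame() produced from
a character matrix in R versions before 4.0.0). The vector f is only an
illustrative stand-in for A$Count, not part of the code above:

```r
# A factor stores integer codes that index into its sorted levels.
f <- factor(c("10", "20", "15", "25"))   # stand-in for A$Count

codes     <- as.numeric(f)               # integer codes: 1 3 2 4
sorted    <- as.numeric(levels(f))       # sorted unique values: 10 15 20 25
by_index  <- as.numeric(levels(f)[f])    # values in row order: 10 20 15 25
by_string <- as.numeric(as.character(f)) # same values, via the labels

print(codes)
print(sorted)
print(by_index)
print(by_string)
```

The as.numeric(as.character(...)) form is probably the safer choice today,
since it gives the same result whether the column is a factor or already a
character vector.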
I have a list of scientific names in a data set. I would like to split
the names into genus, species and subspecies. Not all names include a
subspecies. Could someone show me how to do this?
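strsplit should work for your example...
## strsplit needs a character vector, so use the species column
## (converted with as.character in case it is a factor)
parts <- strsplit(as.character(aa$species), " ")
data.frame(
  genus      = sapply(parts, "[", 1),
  species    = sapply(parts, "[", 2),
  subspecies = sapply(parts, "[", 3)  ## will be NA for missing subsp
)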
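
As a rough sketch, the same idea can be extended to give the four columns
asked for in the original question. The object name result is made up
for illustration, and the example data frame is rebuilt here so the
snippet runs on its own:

```r
# Rebuild the example data with plain character and numeric columns.
aa <- data.frame(
  species = c("genusA speciesA", "genusB speciesAA",
              "genusC speciesAAA subspeciesA",
              "genusC speciesAAA subspeciesB"),
  counts = c(10, 20, 15, 25),
  stringsAsFactors = FALSE
)

# Split each name on one or more spaces.
parts <- strsplit(aa$species, " +")

# Pick the 1st, 2nd and 3rd word of each name; a missing 3rd word is NA.
result <- data.frame(
  genus      = sapply(parts, `[`, 1),
  species    = sapply(parts, `[`, 2),
  subspecies = sapply(parts, `[`, 3),
  counts     = aa$counts,
  stringsAsFactors = FALSE
)

# Replace the NAs with the placeholder used in the question.
result$subspecies[is.na(result$subspecies)] <- "no.subspecies"
result
```

Keeping NA instead of "no.subspecies" may be preferable if the column is
later filtered or summarised, since NA is what R functions treat as missing.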
However, scientific names are often pretty messy - I often have datasets
like this...
x
[1] "Aquilegia caerulea James var. caerulea"
[2] "Aquilegia caerulea James var. ochroleuca Hook."
[3] "Aquilegia caerulea James var. pinetorum (Tidestrom) Payson ex Kearney & Peebles"
[4] "Aquilegia caerulea James"
[5] "Aquilegia chaplinei Standl."
[6] "Aquilegia chaplinei Standley ex Payson"
[7] "Aquilegia chrysantha Gray var. chrysantha"
[8] "Aquilegia chrysantha Gray"
So I first strip out author names using strsplit, and use grep to find
the subspecies/variety abbreviations:
noauthor <- function(x) {
  ## split each name into a vector of separate words
  y <- strsplit(x, " ")
  sapply(y, function(w) {
    ## positions of the rank abbreviations var., ssp., var and f.
    ## (add others to the pattern as needed)
    n <- grep("^var\\.$|^ssp\\.$|^var$|^f\\.$", w)
    ## paste together the first and second words plus each abbreviation
    ## and the word after it; sort() keeps the original word order in
    ## case the name includes both var. and ssp. (sometimes happens)
    paste(w[sort(c(1:2, n, n + 1))], collapse = " ")
  })
}
noauthor(x[1:8])
[1] "Aquilegia caerulea var. caerulea"
[2] "Aquilegia caerulea var. ochroleuca"
[3] "Aquilegia caerulea var. pinetorum"
[4] "Aquilegia caerulea"
[5] "Aquilegia chaplinei"
[6] "Aquilegia chaplinei"
[7] "Aquilegia chrysantha var. chrysantha"
[8] "Aquilegia chrysantha"
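
Once the authors are stripped, the cleaned names can be split into columns
in the same way as before. This is only a sketch: split_clean_names and
the column names rank and infra are hypothetical, and it assumes each
name has at most one rank abbreviation followed by one epithet:

```r
# Hypothetical helper: split author-free names into separate columns.
split_clean_names <- function(clean) {
  parts <- strsplit(clean, " +")
  data.frame(
    genus   = sapply(parts, `[`, 1),
    species = sapply(parts, `[`, 2),
    rank    = sapply(parts, `[`, 3),  # e.g. "var."; NA when absent
    infra   = sapply(parts, `[`, 4),  # epithet after the rank; NA when absent
    stringsAsFactors = FALSE
  )
}

clean <- c("Aquilegia caerulea var. caerulea",
           "Aquilegia caerulea",
           "Aquilegia chaplinei")
res <- split_clean_names(clean)
res
```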
Chris