splitting scientific names into genus, species, and subspecies - R-help

Wed, Nov 4, 2009 1:09 PM #

I have a list of scientific names in a data set.  I would like to split the
names into genus, species and subspecies.  Not all names include a
subspecies.  Could someone show me how to do this?

My example code is:


a <- matrix(c('genusA speciesA', 10,   
              'genusB speciesAA', 20,   
              'genusC speciesAAA subspeciesA', 15, 
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)

aa <- data.frame(a)

colnames(aa) <- c('species', 'counts')

aa


# The code returns

                        species counts
1               genusA speciesA     10
2              genusB speciesAA     20
3 genusC speciesAAA subspeciesA     15
4 genusC speciesAAA subspeciesB     25



# I would like there to be 4 columns as below

genus  species    subspecies    counts

genusA speciesA   no.subspecies   10
genusB speciesAA  no.subspecies   20
genusC speciesAAA subspeciesA     15
genusC speciesAAA subspeciesB     25


I have tried using 'strsplit', but cannot get the desired result.  Thank you
for any help with this.


Mark Miller
Gainesville, Florida

View this message in context: http://old.nabble.com/splitting-scientific-names-into-genus%2C-species%2C-and-subspecies-tp26204666p26204666.html
Sent from the R help mailing list archive at Nabble.com.

(Ted Harding)

Wed, Nov 4, 2009 1:37 PM #

On 04-Nov-09 21:09:42, Mark W. Miller wrote:

The following seems to work for your example. However, others
can probably propose a less clumsy version (but at least this
one breaks it down into its elements):

a <- matrix(c('genusA speciesA', 10,
              'genusB speciesAA', 20,   
              'genusC speciesAAA subspeciesA', 15, 
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)

a
#      [,1]                            [,2]
# [1,] "genusA speciesA"               "10"
# [2,] "genusB speciesAA"              "20"
# [3,] "genusC speciesAAA subspeciesA" "15"
# [4,] "genusC speciesAAA subspeciesB" "25"

A <- NULL
for( i in (1:nrow(a))){
  Names <- unlist(strsplit(a[i,1],"[ ]+"))
  if(length(Names)==2) Names <- c(Names,"no.subspecies")
  A <- rbind(A,c(Names,a[i,2]))
}
colnames(A) <- c("Genus","Species","Subspecies","Count")
A <- as.data.frame(A)
A$Count <- as.numeric(A$Count)

A
#    Genus    Species    Subspecies Count
# 1 genusA   speciesA no.subspecies     1
# 2 genusB  speciesAA no.subspecies     3
# 3 genusC speciesAAA   subspeciesA     2
# 4 genusC speciesAAA   subspeciesB     4

Hoping this helps!
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Nov-09                                       Time: 21:37:03
------------------------------ XFMail ------------------------------

(Ted Harding)

Wed, Nov 4, 2009 1:47 PM #

OOPS! Sorry, I made an oversight in the code I posted just now
(and I didn't check the result carefully enough ... ).

The line which was A$Count <- as.numeric(A$Count) should have
been A$Count <- as.numeric(levels(A$Count)) (i.e. I overlooked
that A$Count as first constructed is a *factor*)!

So the full corrected code is as follows:

a <- matrix(c('genusA speciesA', 10,
              'genusB speciesAA', 20,   
              'genusC speciesAAA subspeciesA', 15, 
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)

a
#      [,1]                            [,2]
# [1,] "genusA speciesA"               "10"
# [2,] "genusB speciesAA"              "20"
# [3,] "genusC speciesAAA subspeciesA" "15"
# [4,] "genusC speciesAAA subspeciesB" "25"

A <- NULL
for( i in (1:nrow(a))){
  Names <- unlist(strsplit(a[i,1],"[ ]+"))
  if(length(Names)==2) Names <- c(Names,"no.subspecies")
  A <- rbind(A,c(Names,a[i,2]))
}
colnames(A) <- c("Genus","Species","Subspecies","Count")
A <- as.data.frame(A)
A$Count <- as.numeric(levels(A$Count))

A
#    Genus    Species    Subspecies Count
# 1 genusA   speciesA no.subspecies    10
# 2 genusB  speciesAA no.subspecies    15
# 3 genusC speciesAAA   subspeciesA    20
# 4 genusC speciesAAA   subspeciesB    25

Ted.


(Ted Harding)

Wed, Nov 4, 2009 2:15 PM #

OOPS^2!! Did it again. The version given below now does seem to work
properly: the last line is now changed (yet again) to

  A$Count <- as.numeric(levels(A$Count)[unclass(A$Count)])
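
For anyone following the three attempts, here is a minimal sketch of what each conversion does to a factor. The vector f is a stand-in (my assumption) for A$Count as it would have been built by as.data.frame() at the time, when strings became factors by default. In R 4.0.0 and later that default is off, so A$Count would be character and levels() would return NULL; the as.character() form on the last line is one way that should work in either case.

```r
# Stand-in for A$Count when as.data.frame() converts strings to factors
f <- factor(c("10", "20", "15", "25"))

as.numeric(f)                      # internal codes, not values: 1 3 2 4
as.numeric(levels(f))              # sorted unique levels only:  10 15 20 25
as.numeric(levels(f)[unclass(f)])  # codes index the levels:     10 20 15 25
as.numeric(as.character(f))        # same values; also fine for character input
```

The first two lines reproduce the wrong Count columns from the earlier posts; the last two recover the original values in their original order.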

On 04-Nov-09 21:47:32, Ted Harding wrote:

a <- matrix(c('genusA speciesA', 10,
              'genusB speciesAA', 20,   
              'genusC speciesAAA subspeciesA', 15, 
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)

a
#      [,1]                            [,2]
# [1,] "genusA speciesA"               "10"
# [2,] "genusB speciesAA"              "20"
# [3,] "genusC speciesAAA subspeciesA" "15"
# [4,] "genusC speciesAAA subspeciesB" "25"

A <- NULL
for( i in (1:nrow(a))){
  Names <- unlist(strsplit(a[i,1],"[ ]+"))
  if(length(Names)==2) Names <- c(Names,"no.subspecies")
  A <- rbind(A,c(Names,a[i,2]))
}
colnames(A) <- c("Genus","Species","Subspecies","Count")
A <- as.data.frame(A)
A$Count <- as.numeric(levels(A$Count)[unclass(A$Count)])

A
#    Genus    Species    Subspecies Count
# 1 genusA   speciesA no.subspecies    10
# 2 genusB  speciesAA no.subspecies    20
# 3 genusC speciesAAA   subspeciesA    15
# 4 genusC speciesAAA   subspeciesB    25

Ted
[I plead hypocaffeinaemia]



Chris Stubben

Wed, Nov 4, 2009 2:19 PM #

Mark W. Miller wrote:

strsplit should work for your example...

parts <- strsplit(as.character(aa$species), " ")   ## aa$species may be a factor
data.frame(
  genus      = sapply(parts, "[", 1),
  species    = sapply(parts, "[", 2),
  subspecies = sapply(parts, "[", 3)   ## will be NA for missing subsp
)
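
A minimal sketch of how the strsplit approach could be carried through to the four-column layout asked for in the original question. The data frame is rebuilt here directly (rather than via a matrix) so that counts stays numeric, which is my assumption and not part of the original posts; the names subsp and result are likewise only illustrative.

```r
# Example data from the original question, built without the character matrix
aa <- data.frame(species = c("genusA speciesA", "genusB speciesAA",
                             "genusC speciesAAA subspeciesA",
                             "genusC speciesAAA subspeciesB"),
                 counts = c(10, 20, 15, 25))

# Split once on runs of spaces; as.character() guards against a factor column
parts <- strsplit(as.character(aa$species), "[ ]+")
subsp <- sapply(parts, "[", 3)   # NA where a name has no third word

result <- data.frame(genus      = sapply(parts, "[", 1),
                     species    = sapply(parts, "[", 2),
                     subspecies = ifelse(is.na(subsp), "no.subspecies", subsp),
                     counts     = aa$counts)
result
```

Keeping NA instead of the "no.subspecies" label may be the better choice if the column is going to be filtered or summarised later, since is.na() then does the work.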

However, scientific names are often pretty messy - I often have datasets
like this...
x
 [1] "Aquilegia caerulea James var. caerulea"                                         
 [2] "Aquilegia caerulea James var. ochroleuca Hook."                                 
 [3] "Aquilegia caerulea James var. pinetorum (Tidestrom) Payson ex Kearney & Peebles"
 [4] "Aquilegia caerulea James"                                                       
 [5] "Aquilegia chaplinei Standl."                                                    
 [6] "Aquilegia chaplinei Standley ex Payson"                                         
 [7] "Aquilegia chrysantha Gray var. chrysantha"                                      
 [8] "Aquilegia chrysantha Gray"       

So I first strip out the author names: strsplit breaks each name into words,
and grep finds the subspecies/variety abbreviations so that the word after
each one can be kept.

noauthor <- function(x) {
  ## split each name into a vector of separate words
  y <- strsplit(x, " ")
  sapply(y, function(x) {
    ## positions of the rank abbreviations var., ssp., var and f.
    n <- grep("^var\\.$|^ssp\\.$|^var$|^f\\.$", x)
    ## paste together the first and second words (genus and species) plus
    ## each rank abbreviation and the word after it; sort keeps the word
    ## order in case a name includes both var. and ssp. (sometimes happens)
    paste(x[sort(c(1:2, n, n + 1))], collapse = " ")
  })
}


noauthor(x[1:8])
[1] "Aquilegia caerulea var. caerulea"    
[2] "Aquilegia caerulea var. ochroleuca"  
[3] "Aquilegia caerulea var. pinetorum"   
[4] "Aquilegia caerulea"                  
[5] "Aquilegia chaplinei"                 
[6] "Aquilegia chaplinei"                 
[7] "Aquilegia chrysantha var. chrysantha"
[8] "Aquilegia chrysantha"    
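
Once the authors are stripped, the cleaned names could be split into columns in the same way as the simple example. This is only a sketch under the assumption that each name has at most one rank abbreviation (so at most four words); the helper word() and the column names rank and epithet are hypothetical, and the input is a hand-copied subset of the noauthor() output shown above.

```r
# A few cleaned names, copied from the noauthor() output
clean <- c("Aquilegia caerulea var. caerulea",
           "Aquilegia caerulea",
           "Aquilegia chaplinei",
           "Aquilegia chrysantha var. chrysantha")

parts <- strsplit(clean, " ")

# Hypothetical helper: the i-th word of every name, NA where it is absent
word <- function(i) sapply(parts, "[", i)

taxa <- data.frame(genus   = word(1),
                   species = word(2),
                   rank    = word(3),   # "var.", "ssp.", "f." or NA
                   epithet = word(4))   # variety/subspecies name or NA
taxa
```

Names carrying both var. and ssp. would need extra columns, so it is worth checking the word counts of parts before relying on this.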


Chris
