Webscraping the Plants Database - R-SIG-ecology

Wed, Jan 2, 2013 2:42 PM #

Hi Tim,

There's no need to scrape and parse: check out the Download PLANTS
database link on the left side of plants.usda.gov

Sarah

On Wed, Jan 2, 2013 at 5:48 PM, Tim Seipel <t.seipel at env.ethz.ch> wrote:

--
Sarah Goslee
http://www.functionaldiversity.org

Sarah Goslee

Wed, Jan 2, 2013 2:43 PM #

Sorry, hit send too quickly.

To get the parts you want, you need to go through the advanced search
and download. That tool will provide you with a neat csv file.


Tim Seipel

Wed, Jan 2, 2013 2:48 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-ecology/attachments/20130102/0ed79369/attachment.pl>

Scott Chamberlain

Wed, Jan 2, 2013 2:52 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-ecology/attachments/20130102/57504ddd/attachment.pl>

Tim Seipel

Wed, Jan 2, 2013 2:58 PM #

Thanks Sarah,
I didn't realize you can go through the advanced search page to get all the
fields!
Thanks,
Tim


Chris Stubben

Thu, Jan 3, 2013 9:36 AM #

Tim Seipel wrote

If you use the advanced search page, check the box in the bottom right
corner to "Display search URL for future use".  Depending on the fields you
select, that should give you something like...

http://plants.usda.gov/java/AdvancedSearchServlet?sciname=Astragalus%20miser&dsp_symbol=on&dsp_statefips=on&dsp_family=on&dsp_dur=on&dsp_grwhabt=on&dsp_nativestatuscode=on&dsp_fed_te_status=on&Synonyms=all&viewby=sciname

Just take that URL, move the sciname parameter to the end so you can paste
the species name onto it, and then use the readHTMLTable function in the
XML package.

library(XML)

url <-
"http://plants.usda.gov/java/AdvancedSearchServlet?dsp_symbol=on&dsp_statefips=on&dsp_family=on&dsp_dur=on&dsp_grwhabt=on&dsp_nativestatuscode=on&dsp_fed_te_status=on&Synonyms=all&viewby=sciname&sciname="

species <- "Astragalus miser"
# URLencode turns the space in the species name into %20
url2 <- paste(url, URLencode(species), sep="")
x <- readHTMLTable(url2)

These pages have lots of formatting tables, so the results are quite
messy. Sometimes it helps to count the number of rows in each table:

sapply(x, nrow)
NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL 
  65    1   56   46   43    9    1    1    1    1 

But once you find the table you need (#8), then you can just select that
directly.

x[[8]]

or use the which option to get table 8.  I also removed the newlines from
column names...
x2<-readHTMLTable(url2, which=8, stringsAsFactors=FALSE)
names(x2) <- gsub("(.*?)\r\n.*", "\\1", names(x2) )
x2

t(x2)
                   [,1]                                                    
Symbol             "ASMI9"                                                 
Scientific Name    "Astragalus miser"                                      
State and Province "USA (AZ, CO, ID, MT, NV, SD, UT, WA, WY), CAN (AB, BC)"
Family             "Fabaceae"                                              
Duration           "Perennial"                                             
Growth Habit       "Forb/herb"                                             
Native Status      "L48 (N), CAN (N)"                                      
Federal T/E Status ""     
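
The table position (#8 above) depends on the page layout, so it may be
safer to pick the table by its column names instead. Here is a minimal
sketch, assuming the results table is the only one with a "Symbol" column.
clean_names and find_table are hypothetical helper names, not part of the
XML package, and clean_names uses regexpr/substr rather than the gsub
pattern above.

```r
# Keep only the text before the first line break in each column name.
clean_names <- function(nm) {
  pos <- as.integer(regexpr("[\r\n]", nm))
  trimws(ifelse(pos > 0, substr(nm, 1, pos - 1), nm))
}

# Return the first table in a list (such as the one readHTMLTable returns)
# that has a column with the wanted name, or NULL if none does.
# Assumption: only the results table has a "Symbol" column.
find_table <- function(tables, column = "Symbol") {
  for (tab in tables) {
    if (is.data.frame(tab) && column %in% clean_names(names(tab))) {
      names(tab) <- clean_names(names(tab))
      return(tab)
    }
  }
  NULL
}
```

With the list x from above, find_table(x) should give the same table as
x[[8]], as long as the page still has that column.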


You should be able to wrap that in a loop and go through your species.
The same steps work for a second species:

species <- "Festuca idahoensis"
url2 <- paste(url, URLencode(species), sep="")
x2 <- readHTMLTable(url2, which=8, stringsAsFactors=FALSE)

names(x2) <- gsub("(.*?)\r\n.*", "\\1", names(x2) )

 t(x2)
                   [,1]                                                                    
Symbol             "FEID"                                                                  
Scientific Name    "Festuca idahoensis"                                                    
State and Province "USA (AZ, CA, CO, ID, MT, NM, NV, OR, SD, UT, WA, WY), CAN (AB, BC, SK)"
Family             "Poaceae"                                                               
Duration           "Perennial"                                                             
Growth Habit       "Graminoid"                                                             
Native Status      "L48 (N), CAN (N)"                                                      
Federal T/E Status ""       
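
The loop mentioned above could be sketched roughly as follows. It is only a
sketch: get_plants, read_table8 and base_url are hypothetical names, and it
assumes every species returns the results as table 8 with the same columns.
A species that is not found would need extra handling.

```r
# Search URL from above, with the sciname parameter left open at the end.
base_url <- paste0(
  "http://plants.usda.gov/java/AdvancedSearchServlet?",
  "dsp_symbol=on&dsp_statefips=on&dsp_family=on&dsp_dur=on&",
  "dsp_grwhabt=on&dsp_nativestatuscode=on&dsp_fed_te_status=on&",
  "Synonyms=all&viewby=sciname&sciname="
)

# Default reader: table 8, as in the examples above (needs the XML package).
read_table8 <- function(u) {
  XML::readHTMLTable(u, which = 8, stringsAsFactors = FALSE)
}

# Fetch the table for each species and stack the rows into one data frame.
# `reader` is any function that takes a URL and returns a data frame.
get_plants <- function(species, reader = read_table8) {
  rows <- lapply(species, function(sp) {
    tab <- reader(paste0(base_url, utils::URLencode(sp)))
    # Same column-name cleanup as in the examples above
    names(tab) <- gsub("(.*?)\r\n.*", "\\1", names(tab))
    tab
  })
  do.call(rbind, rows)
}
```

Calling get_plants(c("Astragalus miser", "Festuca idahoensis")) should then
return one data frame with a row for each species.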




Chris Stubben