Extracting desired numbers from complicated lines of web pages
Try this. Grouping the results by 'userid' is left as an exercise for
the reader; that may well be what you need, in which case you will also
want to check for values that do not exist. Also, for the last line you
did not say whether there are only those three values or whether there
could be more.
input <- readLines(textConnection('
[1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"
[2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"
[3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"
[4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"
[5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"
[6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"
[7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>
'))
# extract the data by brute force and then break apart into a dataframe
count <- lapply(input, function(.line){
    # each test looks for "<number> <tag>" and returns it as "number:tag"
    if (grepl('[0-9]+ Friends', .line))
        return(sub(".*>([0-9]+) (Friends).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Reviews", .line))
        return(sub(".*>([0-9]+) (Reviews).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Review Update", .line))
        return(sub(".*>([0-9]+) (Review Update).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ First", .line))
        return(sub(".*>([0-9]+) (First).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Fans", .line))
        return(sub(".*>([0-9]+) (Fans).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Local Photos", .line))
        return(sub(".*>([0-9]+) (Local Photos).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Useful", .line))
        return(c(  # vector with multiple values
            sub(".* ([0-9]+) (Useful).*", "\\1:\\2", .line)
            , sub(".* ([0-9]+) (Funny).*", "\\1:\\2", .line)
            , sub(".* ([0-9]+) (Cool).*", "\\1:\\2", .line)
        ))
    return(NULL)  # line carries none of the known tags
})
# create dataframe
df <- data.frame(do.call(rbind, strsplit(unlist(count), ":")))
names(df) <- c("Value", "Variable")
df
  Value      Variable
1   108       Friends
2   151       Reviews
3     5 Review Update
4     1         First
5     2          Fans
6    54  Local Photos
7  2022        Useful
8  1591         Funny
9  1756          Cool
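As a starting point for the exercise mentioned at the top, here is a rough sketch of the two missing pieces: reporting NA for a tag that never showed up, and pulling the 'userid' out of a line so results can be grouped by it. The names df_demo, expected, full, link_line and userid are made up for illustration, and the list of expected tags is an assumption, since the thread does not say which tags every page should carry.

```r
# Hypothetical stand-in for the data frame built above, with the "First"
# row left out to mimic a page on which that tag never appears.
df_demo <- data.frame(
    Value    = c("108", "151", "5", "2"),
    Variable = c("Friends", "Reviews", "Review Update", "Fans"),
    stringsAsFactors = FALSE
)

# Assumed list of tags expected on every page.
expected <- c("Friends", "Reviews", "Review Update", "First", "Fans")

# match() gives NA where an expected tag is absent, so indexing Value
# with it produces NA for the missing rows.
full <- data.frame(
    Value    = as.integer(df_demo$Value[match(expected, df_demo$Variable)]),
    Variable = expected,
    stringsAsFactors = FALSE
)
full

# Pull the userid out of a line that has one; lines without "userid="
# yield NA instead of the unchanged line that sub() would return.
link_line <- "<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"
userid <- if (grepl("userid=", link_line)) {
    sub('.*userid=([^"]+)".*', "\\1", link_line)
} else {
    NA_character_
}
userid
```

With a userid per page, the per-page data frames could be stacked with an extra userid column; how pages are delimited in the real input is not shown in the thread, so that part is left open.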
On Sun, Aug 5, 2012 at 11:16 AM, Shelby McIntyre <smcintyremobile at me.com> wrote:
I need to extract the indicated (bold & underlined) numbers from lines coming off web pages.
Of course I don't know ahead of time the location or length of the number. What I do know
are the tags "Friends", "Reviews", etc. In fact, it would be good to end up with
Value Variable
108 Friends
151 Reviews
5 Review Updates
NA First <-- assuming here that "First" did not show up on any line
etc.
Of particular trouble is line [7] which requires extracting 3 numbers 2022 (Useful), 1591 (Funny) and 1756 (Cool).
============== Extraction problem lines ===========
[1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"
[2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"
[3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"
[4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"
[5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"
[6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"
[7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>
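For a line like [7], where the number of values is not known in advance, one possible alternative to a separate sub() call per tag is to collect every "number word" pair in a single pass with gregexpr() and regmatches(). This is only a sketch: the pattern assumes each label is a single capitalised word, and the names votes_line, pairs, parts and votes_df are illustrative.

```r
# Line [7] from the example above.
votes_line <- '<p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>'

# Find every "<digits> <Capitalised word>" pair, however many there are.
pairs <- regmatches(votes_line,
                    gregexpr("[0-9]+ [A-Z][a-z]+", votes_line))[[1]]

# Split each pair on the space and bind into a two-column matrix.
parts <- do.call(rbind, strsplit(pairs, " "))

votes_df <- data.frame(
    Value    = as.integer(parts[, 1]),
    Variable = parts[, 2],
    stringsAsFactors = FALSE
)
votes_df
```

Because the pattern stops at the first word, a two-word label such as "Local Photos" would come back as "Local" only, so the pattern would need widening before applying it to the other lines.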
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.