
Extracting desired numbers from complicated lines of web pages

2 messages · Shelby McIntyre, jim holtman

Try this. I have left it as an exercise to the reader to group these
by 'userid', which might be needed; if so, you might also want to
check for non-existent values.  Also, for the last line you did not
say whether there are only those three values or whether there could
be more.


input <- readLines(textConnection('
+ [1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"
+
+  [2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"
+
+  [3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"
+
+  [4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&amp;userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"
+
+  [5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"
+
+  [6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"
+
+  [7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>
+
+         [[alternative HTML version deleted]]'))
# (opening line not in the archived post; lapply and 'result' are assumed)
result <- lapply(input, function(.line) {
+     if (grepl('[0-9]+ Friends', .line))
+         return(sub(".*>([0-9]+) (Friends).*", "\\1:\\2", .line))
+     if (grepl("[0-9]+ Reviews", .line))
+         return(sub(".*>([0-9]+) (Reviews).*", "\\1:\\2", .line))
+     if (grepl("[0-9]+ Review Update", .line))
+         return(sub(".*>([0-9]+) (Review Update).*", "\\1:\\2", .line))
+     if (grepl("[0-9]+ First", .line))
+         return(sub(".*>([0-9]+) (First).*", "\\1:\\2", .line))
+     if (grepl("[0-9]+ Fans", .line))
+         return(sub(".*>([0-9]+) (Fans).*", "\\1:\\2", .line))
+     if (grepl("[0-9]+ Local Photos", .line))
+         return(sub(".*>([0-9]+) (Local Photos).*", "\\1:\\2", .line))
+     if (grepl("[0-9]+ Useful", .line))
+         return(c(  # vector with multiple values
+             sub(".* ([0-9]+) (Useful).*", "\\1:\\2", .line)
+           , sub(".* ([0-9]+) (Funny).*", "\\1:\\2", .line)
+           , sub(".* ([0-9]+) (Cool).*", "\\1:\\2", .line)
+           ))
+     return(NULL)
+ })
  Value      Variable
1   108       Friends
2   151       Reviews
3     5 Review Update
4     1         First
5     2          Fans
6    54  Local Photos
7  2022        Useful
8  1591         Funny
9  1756          Cool
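
For anyone who wants something to paste and run, here is a minimal sketch of the same idea in base R. It is not the code from the post above. It swaps the chain of grepl/sub calls for a single gregexpr/regmatches pass, so the votes line can carry any number of values. The helper name extract_counts, the labels vector and the shortened sample lines are my own assumptions for illustration, and it assumes each HTML element arrives on one line.

```r
# Sample lines shaped like the ones in the post (shortened; one element per line)
input <- c(
  '<li id="friendCount"><a href="/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ">108 Friends</a></li>',
  '<li id="updatesCount">5 Review Updates</li>',
  '<li id="fanCount">2 Fans</li>',
  '<p id="review_votes" class="smaller"> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>'
)

# Labels to look for (assumed list; extend as needed)
labels <- c("Friends", "Reviews", "Review Update", "First", "Fans",
            "Local Photos", "Useful", "Funny", "Cool")

# Hypothetical helper: return every "<number>:<label>" pair found in one line
extract_counts <- function(line, labels) {
  pattern <- paste0("[0-9]+ (", paste(labels, collapse = "|"), ")")
  hits <- regmatches(line, gregexpr(pattern, line))[[1]]
  # turn "108 Friends" into "108:Friends" (only the first space is replaced)
  sub(" ", ":", hits, fixed = TRUE)
}

# Collect all pairs, then split them into the two columns
pairs <- unlist(lapply(input, extract_counts, labels = labels))
parts <- strsplit(pairs, ":", fixed = TRUE)
result <- data.frame(
  Value    = as.integer(vapply(parts, `[`, character(1), 1)),
  Variable = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
print(result)

# The userid, if grouping is needed, can be pulled from any line that carries one
userid <- unique(sub('.*userid=([^"&]+).*', "\\1",
                     input[grepl("userid=", input, fixed = TRUE)]))
```

Lines with no match simply contribute nothing, so there is no need to return NULL explicitly. If the data come from several users, the userid could be added as a third column before combining the per-user results.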

        
On Sun, Aug 5, 2012 at 11:16 AM, Shelby McIntyre <smcintyremobile at me.com> wrote: