regex - extracting src url

3 messages · Omar André Gonzáles Díaz, Bert Gunter, Martin Morgan

#
Hi, I have a data frame with a column containing HTML, like this:

<IMG SRC="
https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">


I need to get this:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?


I've got this so far:


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"
BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement


This is the code I've used:

carreras_normal$Impression.Tag..image. <-
  gsub("<img.+?src=[\"'](.*?)[\"'].*?>", "\\1",
       carreras_normal$Impression.Tag..image.,
       ignore.case = TRUE)



*But I still need to get rid of this part:*


https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=
?*\" BORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement*


Thank you for your help.

Omar Gonzáles.
#
?strsplit  #I think
My "solution" assumes a fixed format for the URLs, as shown in your
example. If that is not the case, it won't work.
+ BORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">'
[1] "<IMG SRC=\"https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?\"\nBORDER=\"0\" HEIGHT=\"1\" WIDTH=\"1\" ALT=\"Advertisement\">"
[[1]]
[1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"
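The strsplit() call that produced the output above did not survive in the archive. As a rough sketch only, and not necessarily the call that was originally posted, one way to use strsplit() here is to split on the double-quote character and take the second piece. This assumes, as noted above, that SRC is always the first quoted attribute in the tag. The names `parts` and `src` are made up for illustration.

```r
# The sample tag from this thread, stored in y as in the printed output above.
y <- '<IMG SRC="https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"\nBORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">'

# Split each string on the literal double-quote character.
# The pieces are: '<IMG SRC=', the URL, '\nBORDER=', '0', ...
parts <- strsplit(y, '"', fixed = TRUE)

# Take the second piece of each split; trimws() drops a stray line break
# in case the URL starts on a new line right after SRC=".
# (Assumption: SRC is the first quoted attribute in every tag.)
src <- trimws(vapply(parts, function(p) p[2], character(1)))
```

Because strsplit() is vectorised, the same lines should work on a whole data frame column by passing the column in place of y.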



Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Mar 21, 2016 at 9:44 PM, Omar André Gonzáles Díaz
<oma.gonzales at gmail.com> wrote:
#
On 03/22/2016 12:44 AM, Omar André Gonzáles Díaz wrote:
You're querying an XML string, so use XPath, e.g., via the XML package:

 > as.character(xmlParse(y)[["//IMG/@SRC"]])
[1] "https://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"

`xmlParse()` translates the character string into an XML document. `[[`
subsets the document to extract a single element. "//IMG/@SRC" follows
the XPath specification (the section at
https://www.w3.org/TR/xpath-31/#abbrev provides a quick guide): starting
from the 'root' of the document, it finds a node, at any depth, labeled
IMG and selects its attribute labeled SRC.

A variation, if there were several IMG tags to be extracted, would be

   xpathSApply(xmlParse(y), "//IMG/@SRC", as.character)
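For comparison, here is a base-R sketch that stays with a regular expression, in case the XML package is not available. It is only an illustration, not a general HTML parser: it assumes each string holds a single IMG tag whose SRC value contains no quote characters. It uses `[^>]` and `[^"']` rather than `.` so that line breaks inside the tag and the later quoted attributes do not end up in the captured group. The names `x`, `pattern` and `src_regex` are made up for illustration.

```r
# The sample tag from the original question; note the line break after SRC=".
x <- '<IMG SRC="\nhttps://ad.doubleclick.net/ddm/trackimp/N344006.1960500FACEBOOKAD/B9589414.130145906;dc_trk_aid=303019819;dc_trk_cid=69763238;ord=[timestamp];dc_lat=;dc_rdid=;tag_for_child_directed_treatment=?"\nBORDER="0" HEIGHT="1" WIDTH="1" ALT="Advertisement">'

# (?i)            case-insensitive match, so IMG/img and SRC/src both work
# [^>]*?          anything inside the tag before the src attribute
# ([^\"']*?)      the captured URL: everything up to the next quote
# [^>]*>          the rest of the tag, including any line breaks
pattern <- "(?i)<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']\\s*([^\"']*?)\\s*[\"'][^>]*>"

# Replace the whole tag with the captured URL.
src_regex <- sub(pattern, "\\1", x, perl = TRUE)
```

sub() is vectorised, so the data frame column can be passed in place of x; strings that do not match the pattern are returned unchanged.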