URLdecode problems

3 messages · Oliver Keyes, Jeff Newmiller, Hadley Wickham

#
Hey all,

So, I'm attempting to decode some (and I don't know why anyone did this)
URL-encoded user agents. Running URLdecode over them generates the error:

"Error in rawToChar(out) : embedded nul in string"

Okay, so there's an embedded nul - fair enough. Presumably decoding the URL
is exposing it in a format R doesn't like. Except when I try to dig down
and work out what an encoded nul looks like, in order to simply remove them
with something like gsub(), I end up with several different strings, all of
which apparently resolve to an embedded nul:
Error in rawToChar(out) : embedded nul in string: '0; @\0L'
In addition: Warning message:
In URLdecode("0;%20@%gIL") :
  out-of-range values treated as 0 in coercion to raw
Error in rawToChar(out) : embedded nul in string: ' \0e'
In addition: Warning message:
In URLdecode("%20%use") :
  out-of-range values treated as 0 in coercion to raw

I'm a relative newb to encodings, so maybe the fault is simply in my
understanding of how this should work, but - why are both strings being
read as including nuls, despite having different values? And how would I go
about removing said nuls?
#
I would guess that the original URLs were encoded somehow (non-ASCII), and the person who received them didn't understand how to deal with them either and URL-encoded them with the thought that they would not lose information that way. Unfortunately, they probably lost the meta information as to how the strings were originally encoded, and without that this turns into a detective job that will likely need C's ability (perhaps via Rcpp) to ignore type information to put things back. If you are lucky, all strings were originally encoded the same way... if really lucky, they were all UTF-8 or UTF-16 (which would have nuls and other odd bytes). Proceeding with the broken strings you have now will almost certainly not work. The fragments shown are not even vaguely recognizable as URLs, so I don't see how we can do anything meaningful with them.

Please read the Posting Guide. One point made there to note is that if C becomes part of the question then R-devel becomes the more appropriate list. The other is that for all of these lists plain text email is expected (not HTML).
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.
On September 1, 2014 9:02:33 AM PDT, Oliver Keyes <okeyes at wikimedia.org> wrote:
#
Hi Oliver,

I think you're being misled by the default behaviour of warnings: they
all get displayed at once, before control returns to the console.  If
you make them immediate, you get a slightly more informative error:
Warning in URLdecode("0;%20@%gIL") :
  out-of-range values treated as 0 in coercion to raw
Error in rawToChar(out) : embedded nul in string: '0; @\0L'
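
For reference, one way to make warnings immediate is options(warn = 1). The sketch below uses a hypothetical function, noisy(), as a stand-in for any call that warns and then fails; it is not part of URLdecode.

```r
# noisy() is a made-up stand-in: it warns first, then raises an error.
noisy <- function() {
  warning("out-of-range values treated as 0 in coercion to raw")
  stop("embedded nul in string")
}

old <- options(warn = 1)   # print each warning as soon as it is raised
result <- try(noisy())     # the warning now appears before the error
options(old)               # restore the previous warning setting
```

With the default warn = 0, the warning is deferred until the top-level call returns, which is why it shows up after the error message.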

So the out of range value (%g...) is getting converted to a raw(0),
aka a nul. Then rawToChar() chokes.
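
The two steps can be reproduced in isolation. This is only a sketch: the value 300 is an arbitrary out-of-range number chosen for illustration, not necessarily the value URLdecode computes for "%gI".

```r
# as.raw() only accepts 0..255; anything else becomes 00, with a warning.
bad <- suppressWarnings(as.raw(300))
print(bad)

# A raw vector that contains 00 cannot be converted to an R string.
bytes <- c(charToRaw("0; @"), bad, charToRaw("L"))
res <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))
print(res)

# Dropping the nul bytes first lets the conversion succeed.
clean <- rawToChar(bytes[bytes != as.raw(0)])
print(clean)
```

So the strings do not contain an encoded nul to search for with gsub(); the nul is created during decoding, whenever the two characters after a "%" are not valid hex digits.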

The code for URLdecode is simple enough that I'd recommend rewriting
it yourself to better handle bad inputs.
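
A minimal sketch of what such a rewrite might look like. The name url_decode_lenient and its policy are assumptions made for illustration: it leaves malformed escapes such as "%gI" untouched and silently drops "%00". It is not the base R implementation.

```r
# Hypothetical lenient decoder for a single string.
url_decode_lenient <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))
  bytes <- as.integer(charToRaw(x))
  hex   <- as.integer(charToRaw("0123456789abcdefABCDEF"))
  pct   <- 37L                      # byte value of "%"
  n <- length(bytes)
  out <- integer(0)
  i <- 1L
  while (i <= n) {
    # A valid escape is "%" followed by exactly two hex digits.
    is_escape <- bytes[i] == pct && i + 2L <= n &&
      all(bytes[i + 1:2] %in% hex)
    if (is_escape) {
      value <- strtoi(rawToChar(as.raw(bytes[i + 1:2])), base = 16L)
      if (value != 0L) out <- c(out, value)   # drop "%00" (assumed policy)
      i <- i + 3L
    } else {
      out <- c(out, bytes[i])       # literal byte, including a stray "%"
      i <- i + 1L
    }
  }
  rawToChar(as.raw(out))
}

print(url_decode_lenient("0;%20@%gIL"))
print(url_decode_lenient("%20%use"))
```

The result is not marked with an encoding, so escapes that decode to non-ASCII bytes may still need Encoding() or iconv() afterwards, depending on how the strings were originally encoded.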

Hadley
On Mon, Sep 1, 2014 at 11:02 AM, Oliver Keyes <okeyes at wikimedia.org> wrote: