Message-ID: <51E54EB1.7070101@ucdavis.edu>
Date: 2013-07-16T13:46:25Z
From: Duncan Temple Lang
Subject: Weird 'xmlEventParse' encoding issue
In-Reply-To: <51E3DFD0.1000302@cognition.uni-freiburg.de>

Hi Sascha

 Your code gives the correct results on my machine (OS X),
either reading from the file directly or via readLines() and passing
the text to xmlEventParse().

 The problem might be the version of the XML package or your environment
settings, so it is important to report your session information.
Please provide the output from

   sessionInfo()
   Sys.getenv()
   libxmlVersion()
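
 If it helps, the three calls can be wrapped so that the output ends up
in a single character vector that is easy to paste into a reply. This is
only a sketch: the function name collect_diagnostics() is made up for
illustration, and the libxmlVersion() call is skipped when the XML
package is not installed.

```r
# Hypothetical helper: gather the requested diagnostics as text lines.
collect_diagnostics <- function() {
  out <- c(
    "== sessionInfo() ==",
    capture.output(print(sessionInfo())),
    "== Sys.getenv() ==",
    capture.output(print(Sys.getenv()))
  )
  # libxmlVersion() lives in the XML package; only call it if available.
  if (requireNamespace("XML", quietly = TRUE)) {
    out <- c(out,
             "== libxmlVersion() ==",
             capture.output(print(XML::libxmlVersion())))
  }
  out
}

diagnostics <- collect_diagnostics()
writeLines(diagnostics)
```

 Note that Sys.getenv() can include private values (tokens, paths), so
it is worth reading through the output before posting it to the list.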


 D

On 7/15/13 4:41 AM, Sascha Wolfer wrote:
> Dear list,
> 
> I have got a weird encoding problem with the xmlEventParse() function from the 'XML' package.
> 
> I searched the web for an answer for several hours, and a Stack Exchange question did not bring a solution either :(
> 
> So here's the problem. I created a small XML test file, which looks like this:
> 
> <?xml version="1.0" encoding="iso-8859-1"?>
> <!DOCTYPE testFile>
> <s type="manual">auch der Schulleiter steht dafür zur Verfügung. Das ist seßhaft mit ? und ?...</s>
> 
> This file is encoded with the iso-8859-1 encoding which is also defined in its header.
> 
> I have three handler functions, defined as follows:
> 
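> # Start-element handler: begin collecting text when an <s> element opens.
> sE2 <- function (name, attrs) {
>   if (name == "s") {
>     get.text <<- TRUE
>   }
> }
> 
> # End-element handler: stop collecting text when the <s> element closes.
> eE2 <- function (name, attrs) {
>   if (name == "s") {
>     get.text <<- FALSE
>   }
> }
> 
> # Text handler: append non-empty text seen inside <s> to the global vector.
> tS2 <- function (content, ...) {
>   if (get.text && nchar(content) > 0) {
>     collected.text <<- c(collected.text, content)
>   }
> }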
> 
> I have one wrapper function around xmlEventParse(), defined as follows:
> 
> get.all.text <- function (file) {
>   t1 <- Sys.time()
>   read.file <- paste(readLines(file, encoding = ""), collapse = " ")
>   print(read.file)
>   assign("collected.text", c(), envir = .GlobalEnv)
>   assign("get.text", FALSE, envir = .GlobalEnv)
>   xmlEventParse(read.file, asText = TRUE,
>                 handlers = list(startElement = sE2,
>                                 endElement = eE2,
>                                 text = tS2),
>                 error = function (...) { },
>                 saxVersion = 1)
>   t2 <- Sys.time()
>   cat("That took", round(difftime(t2,t1, units="secs"), 1), "seconds.\n")
>   cat("Result of reading is in variable 'collected.text'.\n")
>   collected.text
> }
> 
> The output of calling get.all.text(<test file>) is as follows:
> [1] "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?> <!DOCTYPE testFile> <s type=\"manual\">auch der Schulleiter steht
> dafür zur Verfügung. Das ist seßhaft mit ? und ?...</s> "
> That took 0 seconds.
> Result of reading is in variable 'collected.text'.
> [1] "auch der Schulleiter steht daf"                        "??r zur Verf??gung. Das ist se??haft mit ?? und ??..."
> 
> Now the REALLY weird thing (for me) is that R obviously reads in the file correctly (first output) with 'readLines()'.
> Then this output is passed to xmlEventParse(). Afterwards the output is broken, and it sometimes also inserts weird breaks
> where special characters occur.
> 
> Do you have any ideas how to solve this problem?
> 
> I cannot use the xmlParse() function because I need the SAX functionality of xmlEventParse(). I also tried reading the
> file with xmlEventParse() directly (with asText = F). No changes...
> 
> Thanks a lot,
> Sascha W.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
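
A side note on the symptom in the quoted output (two characters where one
accented character was expected, and the text arriving in two pieces): one
possible explanation, which is only an assumption until the session
information is available, is that the handler receives UTF-8 bytes that R
does not mark as UTF-8, and that the text handler is called more than once
for a single text node. The sketch below uses base R only; the chunks are
made up to imitate that situation, and join_chunks() is a hypothetical
helper, not part of the XML package.

```r
# Hypothetical helper: mark each chunk as UTF-8, then join the chunks.
join_chunks <- function(chunks) {
  Encoding(chunks) <- "UTF-8"   # declare the bytes to be UTF-8 (no conversion)
  paste(chunks, collapse = "")
}

# Made-up chunks: "daf" followed by the raw UTF-8 bytes of "ür",
# created without an encoding mark, as an unmarked handler result might be.
unmarked_tail <- rawToChar(as.raw(c(0xc3, 0xbc, 0x72)))
chunks <- c("daf", unmarked_tail)

# Read as latin1, the two-byte sequence shows up as two characters.
as_latin1 <- unmarked_tail
Encoding(as_latin1) <- "latin1"
n_latin1 <- nchar(as_latin1)   # 3 characters for "ür"

# Marked as UTF-8 and joined, the word has the expected length.
fixed <- join_chunks(chunks)
n_fixed <- nchar(fixed)        # 5 characters
```

If this is what happens in your session, applying the same two steps to
collected.text after parsing may be enough. If the strings are already
marked correctly, the cause lies elsewhere and the session information
should show it.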