
tm: custom reader for readPlain

On Tuesday, 8 January 2013 at 15:56 -0500, Simon Kiss wrote:
You should create a reader function that takes as input the text
content you pasted at the end of your message, parses it as
appropriate, and returns a PlainTextDocument. The metadata can be set
using the meta() function on the document object before returning it.
You can see how this process works by looking at the readFactivaHTML.R
file from my tm.plugin.factiva package, and probably from other packages
too (do not use readFactivaXML.R as it uses a method that only works for
XML input). Of course, parsing the input will take some work, but it
shouldn't be too hard if you split each line into a field identifier
(the part before ":") and the value of the field, and create a character
vector from that.
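
As a rough illustration of that splitting step, here is a minimal sketch
in base R. The helper name parse_fields() and the field names TITLE, DATE
and TEXT are hypothetical, since I am only guessing at the layout of your
files; it also assumes one field per line, with the identifier ending at
the first ":".

```r
# Hypothetical helper: turn "FIELD: value" lines into a named character vector.
# Assumes one field per line and that the identifier ends at the first ":".
parse_fields <- function(lines) {
  # Ignore lines that have no colon separator at all
  lines <- lines[grepl(":", lines, fixed = TRUE)]
  # Position of the first colon on each line
  pos <- regexpr(":", lines, fixed = TRUE)
  keys <- trimws(substr(lines, 1, pos - 1))
  values <- trimws(substr(lines, pos + 1, nchar(lines)))
  setNames(values, keys)
}

# Made-up input, only to show the shape of the result
example <- c("TITLE: Budget vote delayed",
             "DATE: 2013-01-08",
             "TEXT: The vote was postponed until 10:30 on Monday.")
fields <- parse_fields(example)
fields[["DATE"]]
```

Only the first colon is treated as the separator, so values that contain
a colon themselves (such as a time) are kept whole. Inside your reader,
the text field would become the content of the PlainTextDocument and the
other fields would be set with meta().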

One thing you did not tell us is how the articles you need to import
are distributed. If each one is in a separate file, you can adapt
DirSource() from tm so that it calls your reader function on each file.
If they are all in one file, you need to create a custom source that
reads the file, splits it, and calls the reader function on the part
corresponding to each article; this latter approach is illustrated by
the HTML part of the FactivaSource.R file (again, skip the XML part).
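
For the single-file case, the splitting itself could look something like
the sketch below. This is base R only, not the code from FactivaSource.R,
and it rests on an assumption I cannot check: that every article starts
with a recognizable line (here a hypothetical "TITLE:" line). The helper
name split_articles() is made up as well.

```r
# Hypothetical helper: group the lines of one file into articles.
# Assumes each article begins with a line matching `start_pattern`.
split_articles <- function(lines, start_pattern = "^TITLE:") {
  starts <- grepl(start_pattern, lines)
  # cumsum() labels every line with the index of the article it belongs to;
  # lines before the first marker get index 0 and are dropped
  idx <- cumsum(starts)
  keep <- idx > 0
  unname(split(lines[keep], idx[keep]))
}

# Made-up input; in practice `lines` would come from readLines() on your file
lines <- c("TITLE: First story", "DATE: 2013-01-07", "TEXT: Body one.",
           "TITLE: Second story", "DATE: 2013-01-08", "TEXT: Body two.")
articles <- split_articles(lines)
length(articles)
```

Each element of the resulting list is the character vector for one
article, which is what the custom source would hand to the reader
function.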

Finally, maybe you can extract the articles in a different format,
ideally in XML, which is easier to use? Or maybe this newspaper is
available on Factiva, in which case my package will work for you?


Hope this helps