text analysis errors
On 2021-01-07 11:34 +1100, Jim Lemon wrote:
On Thu, Jan 7, 2021 at 10:40 AM Gordon Ballingrud <gob.allingrud at gmail.com> wrote:
Hello all,
I have asked this question on several forums without response, and although
I have made some progress on my own, I am stuck on how to resolve a
particular error message.
I have a question about text-analysis packages and code. The general idea
is that I am trying to perform readability analyses on a collection of
about 4,000 Word files. I would like to run any of a number of such
analyses, but the immediate problem is getting R to recognize the uploaded
files as data ready for analysis; so far I keep getting error messages.
Let me show what I have done. I have three separate commands because I
split the 4,000 files across three roughly equal folders: evidently the
collection was too voluminous to be read in its entirety. The folders are
called "WPSCASES" one through three. Here is my code, with the error
messages for each command recorded below:
token <- tokenize("/Users/Gordon/Desktop/WPSCASESONE/", lang = "en", doc_id = "sample")
The code is the same for the other folders; only the folder name differs.
The error message reads:
Error in nchar(tagged.text[, "token"], type = "width") : invalid multibyte string, element 348
The error messages are the same for the other two commands, but the
'element' number differs: it is 925 for the second folder and 4302 for
the third.
token2 <- tokenize("/Users/Gordon/Desktop/WPSCASES2/", lang = "en", doc_id = "sample")
token3 <- tokenize("/Users/Gordon/Desktop/WPSCASES3/", lang = "en", doc_id = "sample")
These are the other commands if that's helpful.
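[Editor's note: one hypothetical way to track down the offending "element" is to check which input files contain byte sequences that are not valid UTF-8, which is what R's "invalid multibyte string" error usually indicates. The sketch below assumes the folder contains plain-text files and uses Gordon's path from above; if the folder still holds raw .docx/.doc binaries, every file will be flagged, which would itself explain the error.]

```shell
# Flag files that are not valid UTF-8 (a common cause of
# "invalid multibyte string" errors in R). iconv exits non-zero
# when the input contains an illegal byte sequence.
for f in /Users/Gordon/Desktop/WPSCASESONE/*; do
  if ! iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
    echo "not valid UTF-8: $f"
  fi
done
```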
I've tried to discover whether the "element" that the error message
mentions corresponds to the file of that number in the files' order, but
since folder 3 does not have 4,300 files in it, that seems unlikely.
Please let me know if you can figure out how to fix this so that I can
start to use koRpus commands, like readability and its progeny.
Thank you,
Gordon
Hi Gordon,

It looks to me as though you may have to extract the text from the Word
files first, e.g. with Word's "Export As Text".
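[Editor's note: since the path /Users/Gordon suggests macOS, one way to batch-export Word files as text without opening Word is the built-in textutil tool. This is a sketch of that approach, not something Jim specified; the path is Gordon's folder from above.]

```shell
# macOS only: convert every .docx in the folder to plain text.
# textutil writes case1.txt alongside case1.docx by default.
cd /Users/Gordon/Desktop/WPSCASESONE/
textutil -convert txt *.docx
```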
Hi!

quanteda::tokenize says it needs a character vector or corpus as input:
https://www.rdocumentation.org/packages/quanteda/versions/0.99.12/topics/tokenize

... or is this tokenize from the tokenizers package? I found something
about doc_id here:
https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html

You can convert docx to markdown using pandoc:

pandoc --from docx --to markdown $inputfile

odt also works, and many other formats. I believe pandoc is included with
RStudio, but I have never used it from there myself, so take that
recommendation with a grain of salt.

To read old binary .doc files, I use wvHtml:

wvHtml $inputfile - 2> /dev/null | w3m -dump -T text/html

Rasmus