text analysis errors

On 2021-01-07 11:34 +1100, Jim Lemon wrote:
Hi!  

quanteda::tokenizer says it needs a 
character vector or 'corpus' as input

	https://www.rdocumentation.org/packages/quanteda/versions/0.99.12/topics/tokenize

... or is this tokenize from the 
tokenizers package?  I found something 
about 'doc_id' here:

	https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html

You can convert docx to markdown using 
pandoc:

	pandoc --from docx --to markdown $inputfile

odt also works, and many others.  
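
As a rough sketch, assuming pandoc is on your PATH, the command above 
could be wrapped in a loop to convert every docx file in the current 
directory.  The md_name helper is only a name made up for this example, 
and the output file names are an assumption about what you want:

```shell
#!/bin/sh
# Hypothetical helper: derive the Markdown file name from a .docx name,
# e.g. report.docx -> report.md
md_name() {
	printf '%s\n' "${1%.docx}.md"
}

# Convert every .docx file in the current directory to Markdown.
for f in ./*.docx; do
	[ -e "$f" ] || continue	# the glob matched nothing, so skip
	pandoc --from docx --to markdown --output "$(md_name "$f")" "$f"
done
```

The resulting .md files are plain text, so they should be readable 
into a character vector with readLines() on the R side.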

I believe pandoc is included in RStudio.  
But I have never used it from there 
myself, so that is really bad advice I 
think.

To read doc, I use wvHtml:

	wvHtml $inputfile - 2> /dev/null | w3m -dump -T text/html

Rasmus
