convert list to Dataframe
I did this on the source files, which were semicolon-delimited (the semicolons delimit the fields; I am not sure what character marks the start of a new tweet). After loading the tm package:
txt <- system.file("texts", "txt", package = "tm")
(twitter <- Corpus(DirSource(txt),
    readerControl = list(language = "lat")))
then
twitter <- tm_map(twitter, removeWords, stopwords("english"))
That last command took about an hour to complete.
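For the data-frame part of the question, one option is to parse the semicolon-delimited lines directly rather than coercing the corpus. This is only a minimal sketch under stated assumptions: the column names are guesses from the sample rows (the post never names the 14 fields), and the two input lines are copied from the sample quoted below. Tweets whose text contains extra semicolons will not have 14 fields, and read.table should stop with an error on such a line, so they would need cleaning first.

```r
# Two sample lines copied from the post. With real data this would be the
# full character vector of tweets, or a file path passed as `file =`.
lines <- c(
  "11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed Lessons For Governance From Campaigner-in-chief: President obama jumps campaign 09 tuesday.. http://bit.ly/2eHMaN;Florida;USA;FL;;;27.6648274;-81.5157535",
  "12210;10:47:37;20;10;2009;David_Stringer;William Hague heading Washington meets Gen. Jim Jones, Sen. John McCain others. Will Obama team raise worries EU ties?;London, England;United Kingdom;Greater London;Westminster;;51.5001524;-0.1262362"
)

# Hypothetical column names, inferred from the sample rows only.
col_names <- c("id", "time", "day", "month", "year", "user", "text",
               "location", "country", "region", "subregion", "extra",
               "lat", "lon")

twitterDF <- read.table(
  text = lines,
  sep = ";",
  quote = "",          # tweets contain stray quotes and apostrophes
  comment.char = "",   # tweets contain '#' hashtags
  col.names = col_names,
  stringsAsFactors = FALSE
)

str(twitterDF)
```

Turning off `quote` and `comment.char` matters here because tweet text routinely contains apostrophes and hashtags, which read.table would otherwise treat as quoting and comment markers.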
onyourmark wrote:
Hi. I have a huge list called twitter:
dim(twitter)
NULL
str(twitter)
List of 1
 $ :Classes 'PlainTextDocument', 'TextDocument', 'character' atomic [1:35575]
 11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed Lessons For Governance From Campaigner-in-chief: President obama jumps campaign 09 tuesday.. http://bit.ly/2eHMaN;Florida;USA;FL;;;27.6648274;-81.5157535
 12210;10:47:37;20;10;2009;David_Stringer;William Hague heading Washington meets Gen. Jim Jones, Sen. John McCain others. Will Obama team raise worries EU ties?;London, England;United Kingdom;Greater London;Westminster;;51.5001524;-0.1262362
 12355;10:47:53;20;10;2009;Singsabit;RT @Drudge_Report PAPER: Excuses wearing thin Obama, media pals... http://tinyurl.com/yfw6cd9;So. California;USA;CA;;;36.778261;-119.4179324
 12407;10:47:59;20;10;2009;obamavideonews;Obama News Obama Afghanistan troop decision timing (AFP) : AFP - Pres.. http://bit.ly/3KPUr8 #obama #video;USA;USA;;;;37.09024;-95.712891
 ...
  .. ..- attr(*, "Author")= chr(0)
  .. ..- attr(*, "DateTimeStamp")= POSIXlt[1:9], format: "2009-10-31 04:46:56"
  .. ..- attr(*, "Description")= chr(0)
  .. ..- attr(*, "Heading")= chr(0)
  .. ..- attr(*, "ID")= chr "1"
  .. ..- attr(*, "Language")= chr "en"
  .. ..- attr(*, "LocalMetaData")= list()
  .. ..- attr(*, "Origin")= chr(0)
 - attr(*, "CMetaData")=List of 3
  ..$ NodeID : num 0
  ..$ MetaData:List of 2
  .. ..$ create_date: POSIXlt[1:9], format: "2009-10-31 04:46:56"
  .. ..$ creator : Named chr ""
  .. .. ..- attr(*, "names")= chr "LOGNAME"
  ..$ Children: NULL
  ..- attr(*, "class")= chr "MetaDataNode"
 - attr(*, "DMetaData")='data.frame': 1 obs. of 1 variable:
  ..$ MetaID: num 0
 - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"

It contains tweets but in many languages. The "columns" are separated by semicolons. I am using the tm package and it is a "corpus".
It looks like this:

547282;06:37:17;21;10;2009;dani_jade18;@Laura_Whyte1 day :p;Huddersfield/Lincoln;United Kingdom;Kirklees;Kirklees;;53.6468475;-1.7727296
547283;06:37:17;21;10;2009;fabiomafra;algu?m traz mais lenha pro computador da facool? BOM DIA.;Belo Horizonte - MG - BR;Brazil;MG;;;-19.8157306;-43.9542226
547284;06:37:17;21;10;2009;romanotr;???, "????????? ??? ??????" ???????????? ?????? ????? ?? ???????? ?????, ?? 173 ?????? ?? 81 ????? ???????? ???????. ??????,??????...;Portugal Aveiro;Portugal;Aveiro;;;40.6411848;-8.6536169
547285;06:37:18;21;10;2009;Y_T_;Playing: Beth Orton <\;Someone's Daughter>\;;Kanazawa, Japan;Japan;Ishikawa Prefecture;;;36.5613254;136.6562051
Error: invalid input '547286;06:37:18;21;10;2009;Atogey;????????????????????????????????????????????????????????????????????????RT @zuola ???????????? @wenyunc

I want to convert it to "fields" or columns, so I thought I should convert it to a data frame. I tried:
twitterDF<-as.data.frame(twitter)
Error in sort.list(y) : invalid input '547286;06:37:18;21;10;2009;Atogey;????????????????????????????????????????????????????????????????????????RT @zuola ???????????? @wenyunchao ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????*????????????????????????????????????????????????????????????????????????????????????;???????????????;China;Zhejiang;;;28.695035;119.751054' in 'utf8towcs'
Can anyone suggest what I can do? P.S. Actually, I would love to remove all the non-English tweets but I have no clue about how to do that.
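On the P.S., one crude heuristic is to drop any tweet containing non-ASCII characters, using the fact that iconv() returns NA for elements it cannot convert. This is a rough sketch, not a language detector: it assumes the text is UTF-8, the two input strings are made-up examples, and it will both discard English tweets that contain accented or typographic characters and keep non-English tweets written in plain ASCII.

```r
# Made-up input: one plain-ASCII string and one with accented characters.
tweets <- c(
  "Obama News Obama Afghanistan troop decision timing",
  "caf\u00e9 da manh\u00e3"
)

# iconv() yields NA for any element that cannot be represented in ASCII
# (or that is not valid in the assumed source encoding).
ascii <- iconv(tweets, from = "UTF-8", to = "ASCII")

keep <- !is.na(ascii)
english_ish <- tweets[keep]
```

The same `keep` vector could be used to subset the rows of a data frame of tweets before any further processing, which may also sidestep the 'utf8towcs' error if it is caused by invalid multibyte input.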
View this message in context: http://old.nabble.com/convert-list-to-Dataframe-tp26148889p26148898.html Sent from the R help mailing list archive at Nabble.com.