
convert list to Dataframe

6 messages · onyourmark, Duncan Murdoch, David Winsemius

#
Hi. I have a huge list called twitter:
NULL
List of 1
 $ :Classes 'PlainTextDocument', 'TextDocument', 'character'  atomic
[1:35575] 11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed Lessons For
Governance From Campaigner-in-chief: President obama jumps  campaign 09 
tuesday.. http://bit.ly/2eHMaN;Florida;USA;FL;;;27.6648274;-81.5157535
12210;10:47:37;20;10;2009;David_Stringer;William Hague heading  Washington 
meets  Gen. Jim Jones, Sen. John McCain  others. Will Obama team raise
worries  EU ties?;London, England;United Kingdom;Greater
London;Westminster;;51.5001524;-0.1262362
12355;10:47:53;20;10;2009;Singsabit;RT @Drudge_Report PAPER: Excuses wearing
thin  Obama, media pals... http://tinyurl.com/yfw6cd9;So.
California;USA;CA;;;36.778261;-119.4179324
12407;10:47:59;20;10;2009;obamavideonews;Obama News Obama   Afghanistan
troop decision timing (AFP) : AFP - Pres.. http://bit.ly/3KPUr8 #obama
#video;USA;USA;;;;37.09024;-95.712891 ...
  .. ..- attr(*, "Author")= chr(0) 
  .. ..- attr(*, "DateTimeStamp")= POSIXlt[1:9], format: "2009-10-31
04:46:56"
  .. ..- attr(*, "Description")= chr(0) 
  .. ..- attr(*, "Heading")= chr(0) 
  .. ..- attr(*, "ID")= chr "1"
  .. ..- attr(*, "Language")= chr "en"
  .. ..- attr(*, "LocalMetaData")= list()
  .. ..- attr(*, "Origin")= chr(0) 
 - attr(*, "CMetaData")=List of 3
  ..$ NodeID  : num 0
  ..$ MetaData:List of 2
  .. ..$ create_date: POSIXlt[1:9], format: "2009-10-31 04:46:56"
  .. ..$ creator    : Named chr ""
  .. .. ..- attr(*, "names")= chr "LOGNAME"
  ..$ Children: NULL
  ..- attr(*, "class")= chr "MetaDataNode"
 - attr(*, "DMetaData")='data.frame':   1 obs. of  1 variable:
  ..$ MetaID: num 0
 - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"

It contains tweets in many languages. The "columns" are separated by
semicolons. I am using the tm package, and the object is a "corpus".

It looks like this:

547282;06:37:17;21;10;2009;dani_jade18;@Laura_Whyte1   day
:p;Huddersfield/Lincoln;United
Kingdom;Kirklees;Kirklees;;53.6468475;-1.7727296
547283;06:37:17;21;10;2009;fabiomafra;algu?m traz mais lenha pro computador
da facool? BOM DIA.;Belo Horizonte - MG -
BR;Brazil;MG;;;-19.8157306;-43.9542226
547284;06:37:17;21;10;2009;romanotr;???, "????????? ??? ??????" ????????????
?????? ????? ?? ???????? ?????, ?? 173 ?????? ?? 81 ????? ???????? ???????.
??????,??????...;Portugal Aveiro;Portugal;Aveiro;;;40.6411848;-8.6536169
547285;06:37:18;21;10;2009;Y_T_;Playing: Beth Orton &lt\;Someone's
Daughter&gt\;;Kanazawa, Japan;Japan;Ishikawa
Prefecture;;;36.5613254;136.6562051
Error: invalid input
'547286;06:37:18;21;10;2009;Atogey;????????????????????????????????????????????????????????????????????????RT
@zuola ???????????? @wenyunc

I want to convert it to "fields" or columns and so I thought I should
convert it to a dataframe. I tried
Error in sort.list(y) : 
  invalid input
'547286;06:37:18;21;10;2009;Atogey;????????????????????????????????????????????????????????????????????????RT
@zuola ???????????? @wenyunchao
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????*????????????????????????????????????????????????????????????????????????????????????;???????????????;China;Zhejiang;;;28.695035;119.751054'
in 'utf8towcs'
Can anyone suggest what I can do? 

P.S. Actually, I would love to remove all the non-English tweets but I have
no clue about how to do that.
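
One crude way to approach the P.S. is sketched below. It assumes the tweets have already been pulled out into a plain character vector (the vector `tweets` here is a made-up sample, not the real corpus) and it only keeps strings made entirely of ASCII characters. This is not language detection: a German tweet written without umlauts would still pass.

```r
# Hypothetical sample; the real strings would come from the corpus.
tweets <- c("Rachel master chef",
            "algu\u00e9m traz mais lenha",
            "Gestern: Achso, ja okey")

# TRUE when every code point in the string is below 128 (plain ASCII).
# enc2utf8() makes sure utf8ToInt() receives UTF-8 input.
is_ascii <- vapply(enc2utf8(tweets),
                   function(s) all(utf8ToInt(s) < 128L),
                   logical(1), USE.NAMES = FALSE)

ascii_only <- tweets[is_ascii]
```

The filter drops the Portuguese tweet but keeps the German one, which shows the limits of the heuristic.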
#
Three suggestions:

-- drop the idea of using a dataframe. It's only appropriate when the  
data is rectangular.
-- look at strsplit for separating at "@" characters.
-- post the output of dput() on your sample, since email is probably  
not capable of rendering this data without creating distortions.
#
Hello. The "fields" are separated by a ';'. I think that the data is
"rectangular" in the sense that there are about 15 fields for each row. Some
of the fields are empty. In the dput() output below, each row appears to be
enclosed in double quotes (").
Does this suggest anything?

Here is the end of the output for dput(twitter)

"4927861;05:04:14;28;10;2009;HOYTSTHEATRES;GameStop Brings  15K  Manage
Holiday Rush [Black Friday]
http://bit.ly/2d3OJg;Australia;Australia;;;;-25.274398;133.775136", 
"4927863;05:04:14;28;10;2009;padden;Rachel  master chef  cook 
anytime!;Sydney, Australia;Australia;NSW;;;-33.867139;151.207114", 
"4927878;05:04:17;28;10;2009;GSpotMagazine;The penalty  success   bored 
attentions  people  formerly snubbed you. -Mary Wilson Little
#quote;UK;United Kingdom;;;;55.378051;-3.435973", 
"4927885;05:04:20;28;10;2009;super_assassin;@triplejsr flight  conchords,
pleeeeeaaase :) thanks rosie
xx;Australia;Australia;;;;-25.274398;133.775136", 
"4927893;05:04:21;28;10;2009;SLMFE;Gestern:Achso,ja okey,um 5 nach las ich
jemanden komen der dir die Akupunkturnadel(zb 5!im Ohr!)entfernt..Um 10 n.
kommt immer noch keiner..;Germany;Germany;;;;51.165691;10.451526", 
"4927901;05:04:23;28;10;2009;mikesemple;HHS Secretary pushes health care
reform  rural America: By Christopher Smart The health-care crisis  ..
http://bit.ly/49Iqcu;London;United Kingdom;Greater
London;Westminster;;51.5001524;-0.1262362", 
"4927913;05:04:26;28;10;2009;coax_k;Facebook Headquarters  Studio O+A: San
Francisco based interior design firm Studio O+A  designed  ..
http://bit.ly/hdqWp;Sydney;Australia;NSW;;;-33.867139;151.207114"
), Author = character(0), DateTimeStamp = structure(list(sec =
56.4049999713898, 
    min = 46L, hour = 4L, mday = 31L, mon = 9L, year = 109L, 
    wday = 6L, yday = 303L, isdst = 0L), .Names = c("sec", "min", 
"hour", "mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXt", 
"POSIXlt"), tzone = "GMT"), Description = character(0), Heading =
character(0), ID = "1", Language = "en", LocalMetaData = list(), Origin =
character(0), class = c("PlainTextDocument", 
"TextDocument", "character"))), CMetaData = structure(list(NodeID = 0, 
    MetaData = structure(list(create_date = structure(list(sec =
56.4059998989105, 
        min = 46L, hour = 4L, mday = 31L, mon = 9L, year = 109L, 
        wday = 6L, yday = 303L, isdst = 0L), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
    ), class = c("POSIXt", "POSIXlt"), tzone = "GMT"), creator =
structure("", .Names = "LOGNAME")), .Names = c("create_date", 
    "creator")), Children = NULL), .Names = c("NodeID", "MetaData", 
"Children"), class = "MetaDataNode"), DMetaData = structure(list(
    MetaID = 0), .Names = "MetaID", row.names = c(NA, -1L), class =
"data.frame"), class = c("VCorpus", 
"Corpus", "list"))
#
On 01/11/2009 7:43 AM, onyourmark wrote:
It's a list, but more importantly it's a VCorpus and a Corpus.  You 
should use the functions appropriate to those classes to extract the 
strings making up the data, declare their encoding properly (or convert 
them to your native encoding), then use read.delim() on a textConnection 
to read them in.

Duncan Murdoch
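
A minimal sketch of the last two steps described above, assuming the strings have already been extracted from the corpus into a character vector and that the source files were UTF-8 (both are assumptions; the thread does not say). The vector `tweets` and the name `twit_df` are made up for illustration.

```r
# Hypothetical sample standing in for the strings extracted from the corpus.
tweets <- c(
  "4927863;05:04:14;28;10;2009;padden;Rachel  master chef  cook anytime!;Sydney, Australia;Australia;NSW;;;-33.867139;151.207114",
  "4927878;05:04:17;28;10;2009;GSpotMagazine;The penalty  success #quote;UK;United Kingdom;;;;55.378051;-3.435973"
)

# Assumption: the input is meant to be UTF-8. Converting UTF-8 to UTF-8 with
# 'sub' replaces invalid bytes instead of leaving strings that later fail
# with "invalid input".
tweets <- iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "?")

# Read the strings as semicolon-delimited records. Quoting and comment
# handling are switched off because tweets contain quotes and '#'.
con <- textConnection(tweets)
twit_df <- read.delim(con, sep = ";", header = FALSE, quote = "",
                      comment.char = "", stringsAsFactors = FALSE)
close(con)

dim(twit_df)
```

This only works if no tweet contains an unescaped ';' in its text; records with embedded or escaped semicolons would need to be handled before this step.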
#
On Nov 1, 2009, at 8:24 AM, onyourmark wrote:

            
There either are 15 fields or there aren't. You can't make a dataframe  
with an approximate number of fields. In the fragment below there  
appear to be 14 fields. Try:

twitfrag <- strsplit(c(
"4927861;05:04:14;28;10;2009;HOYTSTHEATRES;GameStop Brings  15K  Manage Holiday Rush [Black Friday] http://bit.ly/2d3OJg;Australia;Australia;;;;-25.274398;133.775136",
"4927863;05:04:14;28;10;2009;padden;Rachel  master chef  cook anytime!;Sydney, Australia;Australia;NSW;;;-33.867139;151.207114",
"4927878;05:04:17;28;10;2009;GSpotMagazine;The penalty  success   bored attentions  people  formerly snubbed you. -Mary Wilson Little #quote;UK;United Kingdom;;;;55.378051;-3.435973",
"4927885;05:04:20;28;10;2009;super_assassin;@triplejsr flight  conchords, pleeeeeaaase :) thanks rosie xx;Australia;Australia;;;;-25.274398;133.775136",
"4927893;05:04:21;28;10;2009;SLMFE;Gestern:Achso,ja okey,um 5 nach las ich jemanden komen der dir die Akupunkturnadel(zb 5!im Ohr!)entfernt..Um 10 n. kommt immer noch keiner..;Germany;Germany;;;;51.165691;10.451526",
"4927901;05:04:23;28;10;2009;mikesemple;HHS Secretary pushes health care reform  rural America: By Christopher Smart The health-care crisis  .. http://bit.ly/49Iqcu;London;United Kingdom;Greater London;Westminster;;51.5001524;-0.1262362",
"4927913;05:04:26;28;10;2009;coax_k;Facebook Headquarters  Studio O+A: San Francisco based interior design firm Studio O+A  designed  .. http://bit.ly/hdqWp;Sydney;Australia;NSW;;;-33.867139;151.207114"
), ";")
twitfrag

I think you will see some patterns emerging.
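
One way to check whether every record really has the same number of fields is sketched here. The vector `records` is a made-up sample (its second element is deliberately truncated), and the expected count of 14 is taken from the fragment above, not from the full data.

```r
# Hypothetical sample: the second record is deliberately truncated.
records <- c(
  "4927863;05:04:14;28;10;2009;padden;Rachel  master chef;Sydney, Australia;Australia;NSW;;;-33.867139;151.207114",
  "4927878;05:04:17;28;10;2009;GSpotMagazine;The penalty  success"
)

# Count semicolon-separated fields per record, with quote and comment
# handling disabled so characters inside the tweet text are left alone.
con <- textConnection(records)
n_fields <- count.fields(con, sep = ";", quote = "", comment.char = "")
close(con)

table(n_fields)                 # distribution of field counts
bad <- which(n_fields != 14)    # records that would break a data frame
```

If `bad` is empty for the full data, the rows are rectangular and can be read into a data frame; otherwise the listed records need a closer look first.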
They are strings (in our aRgot, objects of type character.) That is an  
effect of whatever processing you have done with components of the tm  
package, the entirety of which you are failing to share with us.
The whole point of using dput is to create a complete representation  
of an object.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
#
I did this on the source files, which were semicolon-delimited (the semicolons
delimit the fields; I am not sure what character marks the start of a new tweet).

After loading the tm package
+ readerControl = list(language = "lat")))

then

twitter <- tm_map(twitter, removeWords, stopwords("english"))

That last command took about an hour to complete.