If you read the data frame with read.csv() or one of the other read.*() functions, use the as.is=TRUE argument to prevent conversion to factors. If not, do the conversion first:
# Convert factors to characters
DataMatrix <- sapply(TF2list, as.character)
# Split the vector of hits
DataList <- sapply(DataMatrix[, 2], strsplit, split=",")
# Use the values in Regulator to name the parts of the list
names(DataList) <- DataMatrix[,"Regulator"]
# Now create a data frame
# How long is the longest list of hits?
mx <- max(sapply(DataList, length))
# Now add NAs to vectors shorter than mx
DataList2 <- lapply(DataList, function(x) c(x, rep(NA, mx-length(x))))
# Finally convert back to a data frame
TF2list2 <- do.call(data.frame, DataList2)
Try this on a portion of the list, say 25 lines, and print each object to see what is happening.
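For a quick check, the same steps can be run end to end on a toy data frame. This is only a sketch: the three-row input and the names csv_text, toy, hit_list, padded and toy_wide are made up for illustration, and it assumes the data are read with as.is = TRUE so that no factor conversion is needed.

```r
# Toy stand-in for TF2list (values are illustrative, not the real data)
csv_text <- 'Regulator,hits
R1,"A,B,C"
R2,"A,C"
R3,"B"'

# as.is = TRUE keeps both columns as character instead of factor
toy <- read.csv(text = csv_text, as.is = TRUE)

# Split each comma-separated string into a character vector
hit_list <- strsplit(toy$hits, split = ",")
names(hit_list) <- toy$Regulator

# Pad the shorter vectors with NA so every column has the same length
mx <- max(lengths(hit_list))
padded <- lapply(hit_list, function(x) c(x, rep(NA, mx - length(x))))

# One column per regulator
toy_wide <- do.call(data.frame, c(padded, stringsAsFactors = FALSE))
print(toy_wide)
```

Printing toy, hit_list and padded in turn shows what each step contributes.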
----------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77843-4352
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Matthew
Sent: Tuesday, April 30, 2019 4:31 PM
To: r-help at r-project.org
Subject: [R] Fwd: Re: transpose and split dataframe
Thanks for your reply. I was trying to simplify it a little, but must
have got it wrong. Here is the real dataframe, TF2list:
str(TF2list)
'data.frame':   152 obs. of  2 variables:
 $ Regulator: Factor w/ 87 levels "AT1G02065","AT1G13960",..: 17 6 6 54 54 82 82 82 82 82 ...
 $ hits     : Factor w/ 97 levels "AT1G05675,AT3G12910,AT1G22810,AT1G14540,AT1G21120,AT1G07160,AT5G22520,AT1G56250,AT2G31345,AT5G22530,AT4G11170,A"| __truncated__,..: 65 57 90 57 87 57 56 91 31 17 ...
And the first few lines resulting from dput(head(TF2list)):
dput(head(TF2list))
structure(list(Regulator = structure(c(17L, 6L, 6L, 54L, 54L,
82L), .Label = c("AT1G02065", "AT1G13960", "AT1G18860", "AT1G23380",
"AT1G29280", "AT1G29860", "AT1G30650", "AT1G55600", "AT1G62300",
"AT1G62990", "AT1G64000", "AT1G66550", "AT1G66560", "AT1G66600",
"AT1G68150", "AT1G69310", "AT1G69490", "AT1G69810", "AT1G70510", ...
This is another way of looking at the first 3 entries (Regulator is
tab-separated from hits):
	Regulator	hits
1	AT1G69490	AT4G31950,AT5G24110,AT1G26380,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G79680,AT3G02840,AT5G25260,AT5G57220,AT2G37430,AT2G26560,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT5G05300,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT5G52760,AT5G66020,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT2G02010,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT2G40180,AT1G59865,AT4G35180,AT4G15417,AT1G51820,AT1G06135,AT1G36622,AT5G42830
2	AT1G29860	AT4G31950,AT5G24110,AT1G05675,AT3G12910,AT5G64905,AT1G22810,AT1G14540,AT1G79680,AT1G07160,AT3G23250,AT5G25260,AT1G53625,AT5G57220,AT2G37430,AT3G54150,AT1G56250,AT3G23230,AT1G16420,AT1G78410,AT4G22030,AT1G69930,AT4G03460,AT4G11470,AT5G25250,AT5G36925,AT4G14450,AT2G30750,AT1G16150,AT1G02930,AT2G19190,AT4G11890,AT1G72520,AT4G31940,AT5G37490,AT4G08555,AT5G66020,AT5G26920,AT3G57460,AT4G23220,AT3G15518,AT2G43620,AT1G35210,AT5G46295,AT1G17147,AT1G11925,AT2G39200,AT1G02920,AT4G35180,AT4G15417,AT1G51820,AT4G40020,AT1G06135
3	AT1G29860	AT5G64905,AT1G21120,AT1G07160,AT5G25260,AT1G53625,AT1G56250,AT2G31345,AT4G11170,AT1G66090,AT1G26410,AT3G55840,AT1G69930,AT4G03460,AT5G25250,AT5G36925,AT1G26420,AT5G42380,AT1G16150,AT2G22880,AT1G02930,AT4G11890,AT1G72520,AT5G66020,AT2G43620,AT2G44370,AT4G15975,AT1G35210,AT5G46295,AT1G11925,AT2G39200,AT1G02920,AT4G14370,AT4G35180,AT4G15417,AT2G18690,AT5G11140,AT1G06135,AT5G42830
So, the goal would be to
first: Transpose the existing dataframe so that the factor Regulator
becomes a column name (column 1 name = AT1G69490, column 2 name =
AT1G29860, etc.) and the hits associated with each Regulator become
rows. Hits is a comma-separated 'list' (I do not know if
technically it is an R list), so it would have to be comma
'unseparated' with each entry becoming a row (col 1 row 1 = AT4G31950,
col 1 row 2 = AT5G24110, etc.); like this:
AT1G69490
AT4G31950
AT5G24110
AT1G05675
AT5G64905
... (I did not include all the rows)
I think it would be best to actually make the first entry a separate
dataframe (1 column with name = AT1G69490 and number of rows depending
on the number of hits), then make the second column (column name =
AT1G29860, and number of rows depending on the number of hits) into a
new dataframe and do a full join of the two dataframes; continue by
making the third column (column name = AT1G29860) into a dataframe and
full join it with the previous; continue for the 152 observations so
that the end result is a dataframe with 152 columns and number of rows
depending on the entry with the greatest number of hits. The full joins
I can do with dplyr, but getting up to that point seems rather difficult.
This would get me what my ultimate goal would be; each Regulator is a
column name (152 columns) and a given row has either NA or the same hit.
This seems very difficult to me, but I appreciate any attempt.
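A minimal sketch of the end result described above, where each row holds either the same hit or NA across columns, written in base R rather than with repeated dplyr full joins. It assumes the hits have already been split into a named list; the names hit_list, all_hits and aligned and the toy values are hypothetical.

```r
# Toy stand-in for the split hits (names and values are illustrative only)
hit_list <- list(R1 = c("A", "B", "C"),
                 R2 = c("A", "C"),
                 R3 = c("B", "D"))

# Every distinct hit gets its own row
all_hits <- unique(unlist(hit_list))

# For each regulator, keep the hit where it is present and NA elsewhere
aligned <- as.data.frame(
  lapply(hit_list, function(h) ifelse(all_hits %in% h, all_hits, NA)),
  stringsAsFactors = FALSE
)
print(aligned)
```

The number of rows here equals the number of distinct hits overall, not the length of the longest single list.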
Matthew
On 4/30/2019 4:34 PM, David L Carlson wrote: