Dear all:
I have a big data file of 60000 columns and 60000 rows, like this:
AA AC AA AA .......AT
CC CC CT CT.......TC
..........................
.........................
I want to transpose it, so that the output is a new file like this:
AA CC ............
AC CC............
AA CT.............
AA CT.........
....................
....................
AT TC.............
The key point is that I can't read it into R with read.table() because the
data is too large, so I tried this:
con <- file("silygenotype.txt", "r")   # "con", not "c", to avoid masking base::c
geno_t <- list()
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break         # end of file
  line <- unlist(strsplit(line, "\t"))
  geno_t <- cbind(geno_t, line)        # append this row as a new column
}
close(con)
write.table(geno_t, "xxx.txt")
It works, but it is too slow. How can I optimize it?
Thank you
Yao He
Second-year master's candidate
Department of Animal Genetics & Breeding
Room 436, College of Animal Science & Technology,
China Agriculture University, Beijing, 100193
E-mail: yao.h.1988 at gmail.com
How to transpose it in a fast way?
12 messages · Peter Langfelder, Ista Zahn, Peter Dalgaard +6 more
On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1988 at gmail.com> wrote:
> [original message quoted; snipped]
I hate to be negative, but this will also not work on a 60000 x 60000 matrix. At some point R will complain either about the lack of memory or about you trying to allocate a vector that is too long.

I think your best bet is to look at file-backed data packages (for example, package bigmemory). Look at this URL: http://cran.r-project.org/web/views/HighPerformanceComputing.html and scroll down to "Large memory and out-of-memory data". Some of the packages may have the functionality you are looking for and may do it faster than your code.

If this doesn't help, you _may_ be able to make your code work, albeit slowly, if you replace the cbind() by data.frame(). cbind() will in this case produce a matrix, and matrices are limited to 2^31 elements, which is less than 60000 times 60000. A data.frame is a special type of list and so _may_ be able to handle that many elements, given enough system RAM. There are experts on this list who will correct me if I'm wrong.

If you are on a Linux system, you can use split (type man split at the shell prompt to see help) to split the file into smaller chunks of, say, 5000 lines or so. Process each file separately, write it into a separate output file, then use the Linux utility paste to "paste" the files side by side into the final output.

Further, if you want to make it faster, do not grow geno_t by cbind'ing a new column to it in each iteration. Pre-allocate a matrix or data frame with an appropriate number of rows and columns and fill it out as you go. But it will still be slow, which I think is due to the inherent slowness of readLines and possibly strsplit.

HTH, Peter
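The split/transpose/paste recipe can be sketched in the shell on a toy file. The chunk size (2 lines here, roughly 5000 for the real data), the file names, and the awk one-liner used to transpose each chunk are illustrative assumptions, not part of the original suggestion:

```shell
# Tiny stand-in for the tab-separated genotype file (3 rows x 2 columns)
printf 'AA\tAC\nCC\tCT\nGG\tGT\n' > geno.txt

# 1. split the file into chunks of rows (2 here; ~5000 for the real data)
split -l 2 geno.txt chunk_

# 2. transpose each chunk separately -- awk only holds one chunk in memory
for f in chunk_*; do
  awk -F'\t' '{ for (i = 1; i <= NF; i++) a[i, NR] = $i; n = NF }
    END { for (i = 1; i <= n; i++) {
            row = a[i, 1]
            for (j = 2; j <= NR; j++) row = row "\t" a[i, j]
            print row } }' "$f" > "$f.t"
done

# 3. paste the transposed chunks side by side into the final output
paste chunk_*.t > geno_t.txt
```

Because split names the chunks in lexicographic order (chunk_aa, chunk_ab, ...), the paste step reassembles the columns in the original row order.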
On Wed, Mar 6, 2013 at 7:56 PM, Peter Langfelder
<peter.langfelder at gmail.com> wrote:
> On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1988 at gmail.com> wrote:
> > [original message snipped]
>
> I hate to be negative, but this will also not work on a 60000 x 60000 matrix. At some point R will complain either about the lack of memory or about you trying to allocate a vector that is too long.
Maybe this depends on the R version. I have not tried it, but the development version of R can handle much larger vectors. See http://stat.ethz.ch/R-manual/R-devel/library/base/html/LongVectors.html

Yao He, if you are feeling adventurous, you could give the development version of R a try.

Best, Ista
> [rest of Peter's reply quoted; snipped]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
1 day later
On Mar 7, 2013, at 01:18 , Yao He wrote:
> [original message quoted; snipped]
As others have pointed out, that's a lot of data!
You seem to have the right idea: If you read the columns line by line there is nothing to transpose. A couple of points, though:
- The cbind() is a potential performance hit, since it copies the list every time around the loop. Instead, preallocate with geno_t <- vector("list", 60000) and then assign
geno_t[[i]] <- <etc>
inside the loop.
- You might use scan() instead of readLines, strsplit
- Perhaps consider the data type as you seem to be reading strings with 16 possible values (I suspect that R already optimizes string storage to make this point moot, though.)
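Put together, the first two points might look like this (a minimal sketch, assuming the file is tab-separated; untested at the full 60000 x 60000 scale):

```r
n <- 60000                                 # number of rows in the input
con <- file("silygenotype.txt", "r")
geno_t <- vector("list", n)                # preallocate: no copy per append
for (i in seq_len(n)) {
  # scan() reads and splits one line in C code, replacing readLines+strsplit
  geno_t[[i]] <- scan(con, what = character(), nlines = 1,
                      sep = "\t", quiet = TRUE)
}
close(con)
# each element of geno_t is one input row, i.e. one column of the transpose
```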
--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
Perhaps you could process this with the Unix/Linux utility awk before reading the file into R. -Sohail
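For what it's worth, a toy sketch of such an awk transpose (my example, not part of Sohail's message; note that a plain awk transpose keeps the whole table in memory, so a 60000 x 60000 file would need tens of gigabytes of RAM):

```shell
# Toy tab-separated input: 2 rows x 2 columns
printf 'AA\tAC\nCC\tCT\n' > geno.txt

# Transpose with awk: store cell (row, col) as a[col, row], then print
# column by column. The whole file is held in awk's memory, so this only
# works if the table fits in RAM.
awk -F'\t' '
  { for (i = 1; i <= NF; i++) a[i, NR] = $i; ncol = NF }
  END {
    for (i = 1; i <= ncol; i++) {
      row = a[i, 1]
      for (j = 2; j <= NR; j++) row = row "\t" a[i, j]
      print row
    }
  }' geno.txt > geno_t.txt
```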
From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] On Behalf Of peter dalgaard [pdalgd at gmail.com]
Sent: Friday, March 08, 2013 5:08 AM
To: Yao He
Cc: R help
Subject: Re: [R] How to transpose it in a fast way?
On Mar 7, 2013, at 01:18 , Yao He wrote:
> [original message quoted; snipped]
> [Peter Dalgaard's reply quoted above; snipped]
You could use the fact that scan reads the data rowwise, and the fact
that arrays are stored columnwise:
# generate a small example dataset
exampl <- array(letters[1:25], dim=c(5,5))
write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
            sep="\t", quote=FALSE)
# and read...
d <- scan("example.dat", what=character())
d <- array(d, dim=c(5,5))
t(exampl) == d
Although this is probably faster, it doesn't help with the large size. You could use the n option of scan to read chunks/blocks and feed those to, for example, an ff array (which you ideally have preallocated).
HTH,
Jan
peter dalgaard <pdalgd at gmail.com> wrote:
> [quoted message snipped]
On Mar 8, 2013, at 6:01 AM, Jan van der Laan wrote:
> [Jan's example quoted above; snipped]
This might avoid creation of some of the intermediate copies:
MASS::write.matrix( matrix( scan("example.dat", what=character()), 5,5), file="fil.out")
I tested it up to a 5000 x 5000 file:
exampl <- array(letters[1:25], dim=c(5000,5000))
MASS::write.matrix( matrix( scan("example.dat", what=character()), 5000,5000), file="fil.out")
Read 25000000 items
Not sure of the exact timing; probably 5-10 minutes. The exampl object takes 200,001,400 bytes, and the run did not noticeably stress my machine; most of my RAM remained untouched. I'm going out on errands and will run timing on a 10K x 10K test case within a system.time() enclosure. scan did report successfully reading 100000000 items fairly promptly.
David.
> [quoted text snipped]
David Winsemius
Alameda, CA, USA
On Mar 8, 2013, at 9:31 AM, David Winsemius wrote:
On Mar 8, 2013, at 6:01 AM, Jan van der Laan wrote:
> [earlier messages quoted above; snipped]
system.time( {MASS::write.matrix( matrix( scan("example.dat", what=character()), 10000,10000), file="fil.out") } )
Read 100000000 items
user system elapsed
487.100 912.613 1415.228
system.time( {MASS::write.matrix( matrix( scan("example.dat", what=character()), 500,500), file="fil.out") } )
Read 250000 items
   user  system elapsed
  1.184   2.507   3.834

And so it seems to scale linearly:
3.834 * 100000000/250000
[1] 1533.6
-- David.
> [quoted text snipped]
On Mar 8, 2013, at 10:59 AM, David Winsemius wrote:
On Mar 8, 2013, at 9:31 AM, David Winsemius wrote:
On Mar 8, 2013, at 6:01 AM, Jan van der Laan wrote:
> [earlier messages quoted above; snipped]
However, another posting today reminds us that this would best be attempted in a version of R that can handle vectors with more than 2^31-1 elements:

> 10000^2 <= 2^31-1
[1] TRUE
> 60000^2 <= 2^31-1
[1] FALSE

R 3.0 is scheduled for release soon, and you can compile it from source if your machine is properly equipped. It supports long vectors, and I _think_ it may support such larger matrices.
David.
> [quoted text snipped]
David Winsemius
Alameda, CA, USA
On 03/08/2013 06:01 AM, Jan van der Laan wrote:
> [Jan's message quoted above; snipped]
I think it's worth asking what the overall goal is; all we get from this
exercise is another large file that we can't easily manipulate in R!
But nothing like a little challenge. The idea, I think, would be to transpose in chunks of rows, by scanning in some number of rows and writing them to a temporary file:
tpose1 <- function(fin, nrowPerChunk, ncol) {
v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
m <- matrix(v, ncol=ncol, byrow=TRUE)
fout <- tempfile()
write(m, fout, nrow(m), append=TRUE)
fout
}
Apparently the data is 60k x 60k, so we could maybe easily read 60k x 10k at a time from some file:
fl <- "big.txt"
ncol <- 60000L
nrowPerChunk <- 10000L
nChunks <- ncol / nrowPerChunk
fin <- file(fl); open(fin)
fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
close(fin)
'fls' is now a vector of file paths, each containing a transposed slice of the matrix. The next task is to splice these together. We could do this by taking a slice of rows from each file, cbind'ing them together, and writing to an output file:
splice <- function(fout, cons, nrowPerChunk, ncol) {
slices <- lapply(cons, function(con) {
v <- scan(con, character(), nmax=nrowPerChunk * ncol)
matrix(v, nrowPerChunk, byrow=TRUE)
})
m <- do.call(cbind, slices)
write(t(m), fout, ncol(m), append=TRUE)
}
We'd need to use open connections as inputs and output
cons <- lapply(fls, file); for (con in cons) open(con)
fout <- file("big_transposed.txt"); open(fout, "w")
xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk, nrowPerChunk))
for (con in cons) close(con)
close(fout)
As another approach, it looks like the data are from genotypes. If they really
only consist of pairs of A, C, G, T, then two pairs e.g., 'AA' 'CT' could be
encoded as a single byte
alf <- c("A", "C", "G", "T")
nms <- outer(alf, alf, paste0)
map <- outer(setNames(as.raw(0:15), nms),
setNames(as.raw(bitwShiftL(0:15, 4)), nms),
"|")
with e.g.,
> map[matrix(c("AA", "CT"), ncol=2)]
[1] d0
This translates the problem of representing the 60k x 60k array as a 3.6 billion
element vector of 60k * 60k * 8 bytes (approx. 30 Gbytes) to one of 60k x 30k =
1.8 billion elements (fits in R-2.15 vectors) of approx 1.8 Gbyte (probably
usable in an 8 Gbyte laptop).
Personally, I would probably put this data in a netcdf / hdf5 file. Perhaps I'd use snpStats or GWASTools in Bioconductor http://bioconductor.org.
Martin
> [Jan's message and Peter Dalgaard's reply quoted above; snipped]
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
3 days later
Thanks for everybody's help! I learned a lot from this discussion!

2013/3/10 jim holtman <jholtman at gmail.com>:
Did you check out the 'colbycol' package?

On Fri, Mar 8, 2013 at 5:46 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
On 03/08/2013 06:01 AM, Jan van der Laan wrote:
You could use the fact that scan reads the data rowwise, and the fact that
arrays are stored columnwise:
# generate a small example dataset
exampl <- array(letters[1:25], dim=c(5,5))
write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
sep="\t", quote=FALSE)
# and read...
d <- scan("example.dat", what=character())
d <- array(d, dim=c(5,5))
t(exampl) == d
Although this is probably faster, it doesn't help with the large size. You could
use the n option of scan to read chunks/blocks and feed those to, for example,
an ff array (which you ideally have preallocated).
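[Editor's note: the chunked variant could be sketched as below for the small 5 x 5 case; "example.dat" is the file written in the example above. For illustration the chunks go into an ordinary pre-allocated matrix — for the real 60000 x 60000 case that matrix would have to be a disk-backed array such as ff or bigmemory.]

```r
# Read the file in chunks with scan()'s n argument and fill a
# pre-allocated output; matrix(v, nrow = n) stores the scanned rows
# columnwise, which performs the transpose for free.
n <- 5L                      # 60000L for the real data
chunk <- 2L                  # rows per chunk (several thousand in practice)
out <- matrix(NA_character_, n, n)  # replace with a preallocated ff array
con <- file("example.dat", "r")
i <- 1L
while (i <= n) {
  k <- min(chunk, n - i + 1L)
  v <- scan(con, what = character(), n = k * n, quiet = TRUE)
  out[, i:(i + k - 1L)] <- matrix(v, nrow = n)  # rows become columns
  i <- i + k
}
close(con)
# out now equals t(exampl)
```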
I think it's worth asking what the overall goal is; all we get from this
exercise is another large file that we can't easily manipulate in R!
But nothing like a little challenge. The idea I think would be to
transpose in chunks of rows by scanning in some number of rows and writing
to a temporary file
tpose1 <- function(fin, nrowPerChunk, ncol) {
v <- scan(fin, character(), nmax=ncol * nrowPerChunk)
m <- matrix(v, ncol=ncol, byrow=TRUE)
fout <- tempfile()
write(m, fout, nrow(m), append=TRUE)
fout
}
Apparently the data is 60k x 60k, so we could maybe easily read 60k x 10k
at a time from some file
fl <- "big.txt"
ncol <- 60000L
nrowPerChunk <- 10000L
nChunks <- ncol / nrowPerChunk
fin <- file(fl); open(fin)
fls <- replicate(nChunks, tpose1(fin, nrowPerChunk, ncol))
close(fin)
'fls' is now a vector of file paths, each containing a transposed slice of
the matrix. The next task is to splice these together. We could do this by
taking a slice of rows from each file, cbind'ing them together, and writing
to an output
splice <- function(fout, cons, nrowPerChunk, ncol) {
slices <- lapply(cons, function(con) {
v <- scan(con, character(), nmax=nrowPerChunk * ncol)
matrix(v, nrowPerChunk, byrow=TRUE)
})
m <- do.call(cbind, slices)
write(t(m), fout, ncol(m), append=TRUE)
}
We'd need to use open connections as inputs and output
cons <- lapply(fls, file); for (con in cons) open(con)
fout <- file("big_transposed.txt"); open(fout, "w")
xx <- replicate(nChunks, splice(fout, cons, nrowPerChunk,
nrowPerChunk))
for (con in cons) close(con)
close(fout)
As another approach, it looks like the data are from genotypes. If they
really only consist of pairs of A, C, G, T, then two pairs e.g., 'AA' 'CT'
could be encoded as a single byte
alf <- c("A", "C", "G", "T")
nms <- outer(alf, alf, paste0)
map <- outer(setNames(as.raw(0:15), nms),
setNames(as.raw(bitwShiftL(0:15, 4)), nms),
"|")
with e.g.,
map[matrix(c("AA", "CT"), ncol=2)]
[1] d0

This translates the problem of representing the 60k x 60k array as a 3.6 billion element vector of 60k * 60k * 8 bytes (approx. 30 Gbytes) to one of 60k x 30k = 1.8 billion elements (fits in R-2.15 vectors) of approx 1.8 Gbyte (probably usable in an 8 Gbyte laptop).

Personally, I would probably put this data in a netcdf / hdf5 file. Perhaps I'd use snpStats or GWASTools in Bioconductor http://bioconductor.org.

Martin
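[Editor's note: to recover the genotype pairs from the packed bytes, the map can be inverted. A sketch building on the 'map' construction above; the unpack() helper is hypothetical, not from the original thread.]

```r
# Rebuild the encoding table, then decode one packed byte back into
# the two genotype strings it represents.
alf <- c("A", "C", "G", "T")
nms <- outer(alf, alf, paste0)
map <- outer(setNames(as.raw(0:15), nms),
             setNames(as.raw(bitwShiftL(0:15, 4)), nms),
             "|")
unpack <- function(byte) {
  lo <- as.integer(byte) %% 16L    # first genotype in the low nibble
  hi <- as.integer(byte) %/% 16L   # second genotype in the high nibble
  c(nms[lo + 1L], nms[hi + 1L])
}
unpack(as.raw(0xd0))               # "AA" "CT"
```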
HTH,
Jan

peter dalgaard <pdalgd at gmail.com> wrote:

On Mar 7, 2013, at 01:18, Yao He wrote:
Dear all:
I have a big data file of 60000 columns and 60000 rows like that:
AA AC AA AA .......AT
CC CC CT CT.......TC
..........................
.........................
I want to transpose it and the output is a new like that
AA CC ............
AC CC............
AA CT.............
AA CT.........
....................
....................
AT TC.............
The keypoint is I can't read it into R by read.table() because the
data is too large,so I try that:
c<-file("silygenotype.txt","r")
geno_t<-list()
repeat{
line<-readLines(c,n=1)
if (length(line)==0)break #end of file
line<-unlist(strsplit(line,"\t"))
geno_t<-cbind(geno_t,line)
}
write.table(geno_t,"xxx.txt")
It works but it is too slow ,how to optimize it???
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
Master candidate in 2nd year
Department of Animal Genetics & Breeding
Room 436, College of Animal Science & Technology
China Agricultural University, Beijing, 100193
E-mail: yao.h.1988 at gmail.com