Manage huge database - R-help

Barry Rowlingson · 2008-09-22T10:53:27Z

2008/9/22 José E. Lozano:
> Exactly, raw data, but a little more complex since all the 500000 variables
> are in text format, so the width is around 2,500,000.
> Thanks, I will check. Right now I am reading line by line the file. It's
> time consuming, but since I will do it only once, just to rearrange the data
> into smaller tables to query, it's ok.
A language like python, perl, or even awk might be able to help you slice your data up.
> Is genetic DNA data (indiv

Barry Rowlingson

Mon, Sep 22, 2008 3:53 AM

2008/9/22 José E. Lozano <lozalojo at jcyl.es>:

A language like python, perl, or even awk might be able to help you
slice your data up.

So is each line just ACCGTATAT etc etc?

 If you have fixed width fields in a file, so that every line is the
same length, then you can use random access methods to get to a
particular value - just multiply the line length in bytes (counting
the line-end characters) by the number of rows to skip and add the
number of columns to skip. In R you can do this with seek() on a
connection. This should be fast because it seeks by bytes, instead of
having to scan all the comma-separated stuff. The only problem comes
when your data doesn't quite conform, and you can end up reading junk.
When doing this, it's a good idea to test your dataset first to make
sure the lines and fields are right.

Example with dummy.dna:

aaaccctttgggaaa
gattacagattacaa
aaaaaaacccccggg
gggggtgggggtggg
aaaaaaaaaaccccc

 each line has 15 bases, and on my OS there's one additional invisible
character to mark the line end. Windows uses 2, but your data might
not be Windows format... So anyway, my multiplier is 16. Hence to get
a slice of the file of four columns from column 7 for some rows:

[1] "gatt"
[1] "cccc"
[1] "gggg"

 The speed of this should be independent of the size of your data file.
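
A minimal sketch of the same byte-offset idea, written in Python rather
than R. The function name read_slice and its arguments are made up for
this illustration, and it assumes one-byte line ends and rows and
columns counted from zero. Under those assumptions, reading four bytes
at offset 7 on the second, third and fourth lines of dummy.dna gives
the three strings shown above.

```python
import os
import tempfile


def read_slice(path, row, col, width, line_bytes):
    """Return `width` characters starting at zero-based (row, col).

    line_bytes is the full length of one line in bytes, line end included.
    """
    with open(path, "rb") as fh:  # binary mode, so offsets are plain bytes
        fh.seek(row * line_bytes + col)
        return fh.read(width).decode("ascii")


# Recreate the five-line dummy.dna with a one-byte "\n" line end.
lines = [
    "aaaccctttgggaaa",
    "gattacagattacaa",
    "aaaaaaacccccggg",
    "gggggtgggggtggg",
    "aaaaaaaaaaccccc",
]
path = os.path.join(tempfile.mkdtemp(), "dummy.dna")
with open(path, "wb") as fh:
    fh.write("".join(line + "\n" for line in lines).encode("ascii"))

line_bytes = 15 + 1  # 15 bases plus the line-end byte
slices = [read_slice(path, row, 7, 4, line_bytes) for row in (1, 2, 3)]
print(slices)  # ['gatt', 'cccc', 'gggg']
```

Each call does one seek and one short read, so the cost does not grow
with the number of lines that come before the one you want. With
Windows line ends the multiplier would be 17 instead of 16.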

Barry

José E. Lozano

Mon, Sep 22, 2008 4:00 AM

Exactly, A_G, A_A, G_G and the like.

Nice hint! I didn't think of this. But I fear that if I have missing values
in the file I won't be able to read the right information...

Yes, I am trying to figure out if all the lines have exactly the same length,
so that I can use a random access method to read them.

Thanks,
Jose Lozano

Ted Harding

Mon, Sep 22, 2008 9:03 AM

On 22-Sep-08 11:00:30, José E. Lozano wrote:

If you were using Linux, I would suggest a command along the lines of

  cat filename | awk '{print(length($0))}'

which would give you the length of each line. But since you have
around 2000 lines, to simply check whether they all have the same
length (in bytes/characters) you can extend the above to

  cat filename | awk '{print(length($0))}' | sort -u

which will present you with all the different line-lengths. If they
are all the same length you will get one number.

I just tested this on a file with lines exceeding 500,000 characters
in length, and it worked perfectly well even for such long lines.
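
For anyone without awk to hand, the same check can be sketched in
Python. This is only an illustration: the function name line_lengths
and the small demonstration file are invented here. Unlike
length($0) in awk, it counts the line-end bytes as well, so a file
that passes the check yields the seek multiplier directly.

```python
import os
import tempfile


def line_lengths(path):
    """Return the set of distinct line lengths in bytes, line ends included."""
    with open(path, "rb") as fh:  # binary mode keeps "\n" or "\r\n" on each line
        return {len(line) for line in fh}


# Small demonstration file; the third line is one base short.
path = os.path.join(tempfile.mkdtemp(), "check.dna")
with open(path, "wb") as fh:
    fh.write(b"aaaccc\ngattac\naaaaa\n")

lengths = line_lengths(path)
print(sorted(lengths))  # [6, 7]
```

A single value in the result means every line has the same length.
Note that a last line with no line end will show up as one or two
bytes shorter than the rest.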

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 22-Sep-08                                       Time: 17:03:21
------------------------------ XFMail ------------------------------