Hi,

I need some help finding a function to read a large text file into an array in R. The data are essentially presence / absence / NA data for many species and come as a grid, with each species name (after two spaces) at the beginning of the matrix defining the map for that species. An excerpt could therefore be:

  SPECIES1
999001099
900110109
011101000
901100101
110100019
901110019
  SPECIES2
999000099
900110119
011101100
901010101
110000019
900000019
  SPECIES3
999001099
900100109
011100010
901100100
110100019
901110019

where 9 is actually NA, 0 is absence and 1 is presence. The final array I want to create should have dimensions that are the x and y coordinates and the number of species (known in advance). (In this example dim = c(9,6,3).) It would be neat if the code could also read the species names into the appropriate names attribute, but this is a refinement that I could probably do myself if someone can help me read the data into R and into an array in the first place.

I'm currently thinking a line-by-line approach using readLines might be the best option, but I've got a very long file: well over 100 species, each a matrix of 70 x 100 data points. I expect that would make this option rather time consuming, especially as the next dataset has 1300 species and a much larger grid.

Any hints would be gratefully received.

Colin Beale
Macaulay Land Use Research Institute
reading long matrix
4 messages · Colin Beale, jim holtman, Gabor Grothendieck
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/r-help/attachments/20051222/a53b00ae/attachment.pl
One way to do this is to use read.fwf. I have borrowed Jim's
use of scan and use a similar calculation to get the indexes
of the species-name lines, breaks. From these we determine the
number of rows, nr, and columns, nc, which are assumed to be
the same for every species.
The second group of statements replaces every 9 with a space,
so that the values become NA when parsed as numbers, and then sets
up a text connection to the resulting character vector. read.fwf
reads from that connection nr rows at a time, and the result is
unlist'ed to a numeric vector, nums. The last statement
reshapes nums into an array and adds the species names as
the names of the last dimension.
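# read data in; scan splits on whitespace, so each species name and
# each row of digits becomes one element of L
# (replace "clipboard" with a file name to read from a file)
L <- scan("clipboard", what = "")
breaks <- grep("^[[:alpha:]]", L)  # positions of the species-name lines
nr <- breaks[2] - breaks[1] - 1; nc <- nchar(L[2])
# parse numbers
n <- length(L[-breaks]) / nr  # number of species
con <- textConnection(gsub("9", " ", L[-breaks]))
nums <- unlist(replicate(n, read.fwf(con, widths = rep(1, nc), n = nr)))
close(con)
# dimnames must be a list, hence list() rather than c();
# the dimensions are hard-coded here for the sample data
result <- array(nums, c(6,9,3), list(NULL, NULL, L[breaks]))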
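For anyone who wants to try the idea without the clipboard, here is a minimal self-contained sketch on the sample data from the original post. Note that it swaps in a different technique from the read.fwf code above: it splits each row into single characters with strsplit and converts them directly. The variable names rows and digits are only illustrative, and I have not timed this on a large file.

```r
# Sample data from the original post, held in a character vector so the
# example runs without a file or the clipboard.
L <- c("SPECIES1", "999001099", "900110109", "011101000",
       "901100101", "110100019", "901110019",
       "SPECIES2", "999000099", "900110119", "011101100",
       "901010101", "110000019", "900000019",
       "SPECIES3", "999001099", "900100109", "011100010",
       "901100100", "110100019", "901110019")

breaks <- grep("^[[:alpha:]]", L)   # positions of the species-name lines
rows   <- L[-breaks]                # the digit strings only
nr <- breaks[2] - breaks[1] - 1     # rows per species
nc <- nchar(rows[1])                # columns per row
n  <- length(breaks)                # number of species

# Split every row into single characters and convert them to integers
digits <- as.integer(unlist(strsplit(rows, "")))
digits[digits == 9L] <- NA          # 9 codes a missing value

# digits runs along each row first, so fill a (column, row, species)
# array and then swap the first two dimensions to (row, column, species).
result <- aperm(array(digits, c(nc, nr, n)), c(2, 1, 3))
dimnames(result) <- list(NULL, NULL, L[breaks])
```

With this, result[1, , "SPECIES1"] is the first row of the first map, with the 9s turned into NA.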
On 12/22/05, Colin Beale <c.beale at macaulay.ac.uk> wrote:
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
One correction. I had hard-coded the last statement for testing with the data provided. Change it to this for generality (note that the dimnames argument must be a list):
result <- array(nums, c(nr, nc, n), list(NULL, NULL, L[breaks]))
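The original post asked for dim = c(9,6,3), that is x by y by species, whereas the statement above gives rows by columns by species. If that ordering matters, aperm can swap the first two dimensions. A minimal sketch, using a stand-in array whose values are placeholders rather than the species data (xy is just an illustrative name):

```r
# Stand-in for `result`: 6 rows x 9 columns x 3 species, placeholder values
result <- array(seq_len(6 * 9 * 3), c(6, 9, 3),
                list(NULL, NULL, paste0("SPECIES", 1:3)))

# Swap the first two dimensions:
# (row, column, species) -> (column, row, species)
xy <- aperm(result, c(2, 1, 3))
dim(xy)   # 9 6 3
```

The species names travel with the third dimension, so xy[, , "SPECIES2"] is the transposed map for the second species.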
On 12/22/05, Gabor Grothendieck <ggrothendieck at gmail.com> wrote: