Dear All, I am observing a strange behavior and searching the archives and help pages didn't help much. I have a csv with a variable number of fields in each line. I use dataPoints <- read.csv(inputFile, head=FALSE, sep=";",fill =TRUE); to read it in, and it works. But - some lines are long and 'wrap', or split and continue on the next line. So when I check the dim of the frame, they are not correct and I can see when I do a printout that the lines is split into two in the frame. I checked the input file and all is good. an example of the input is: 37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet; where the last values occurs on the next line in the data frame. It does not have to be the last value, as in the follwong example, the word "kempten" starts the next line: 39;167757703;12;10.309295;47.724545;21903142 at N00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio; What could be the reason? I ws thinking about solving the issue by using a different separator, that I would use for the first 7 fields and concatenating all of the remaining values into a single stirng value, but could not figure out how to do such a substitution in R. Unfortunately, on my system I cannot specify a range for sed... Thanks for any help/pointers Martin
strange behavior when reading csv - line wraps
9 messages · Martin Tomko, jim holtman, (Ted Harding)
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090529/014a349f/attachment-0001.pl>
Jim, the two lines I put in are the actual problematic input lines. In these examples, there are no quotes nor # signs, although I have no means to make sure they do not occur in the inputs (any hints how I could deal with that?). I am trying to avoid as much pre-processing outside R as possible, and I have to process about 500 files with up to 3000 records each, so I need a more or less automated/batch solution. - so any string substitution will have to occur in R. But for the moment, I do not see a reaason for substitution, and the wrapping still occurs. Cheers Martin
jim holtman wrote:
You need to supply the actual input line so we can see what is
happening. Are you sure you do not have unbalanced quotes in your
input (try quote='') or do you have comment characters ("#") in your
input?
On Fri, May 29, 2009 at 3:15 PM, Martin Tomko <martin.tomko at geo.uzh.ch
<mailto:martin.tomko at geo.uzh.ch>> wrote:
Dear All,
I am observing a strange behavior and searching the archives and
help pages didn't help much.
I have a csv with a variable number of fields in each line.
I use
dataPoints <- read.csv(inputFile, head=FALSE, sep=";",fill =TRUE);
to read it in, and it works. But - some lines are long and 'wrap',
or split and continue on the next line. So when I check the dim of
the frame, they are not correct and I can see when I do a printout
that the lines is split into two in the frame. I checked the input
file and all is good.
an example of the input is:
37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
where the last values occurs on the next line in the data frame.
It does not have to be the last value, as in the follwong example,
the word "kempten" starts the next line:
39;167757703;12;10.309295;47.724545;21903142 at N00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
What could be the reason?
I ws thinking about solving the issue by using a different
separator, that I would use for the first 7 fields and
concatenating all of the remaining values into a single stirng
value, but could not figure out how to do such a substitution in
R. Unfortunately, on my system I cannot specify a range for sed...
Thanks for any help/pointers
Martin
______________________________________________
R-help at r-project.org <mailto:R-help at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
<http://www.r-project.org/posting-guide.html>
and provide commented, minimal, self-contained, reproducible code.
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090530/2db5723c/attachment-0001.pl>
In a private correspondence with Martin Tomko, I think the reason for the problem has been found. The numbers of ";"-separated fields in the 82 successive lines of his file are as follows: 01:26 02:26 03:33 04:33 05:12 06:12 07:12 08:12, 09:19 10:19 11:17 12:17 13:23 14:23 15:23 16:23, 17:23 18:23 19:23 20:23 21:23 22:23 23:23 24:23, 25:23 26:23 27:23 28:23 29:23 30:23 31:23 32:23, 33:23 34:23 35:23 36:23 37:23 38:23 39:23 40:23, 41:23 42:23 43:23 44:23 45:23 46:23 47:23 48:23, 49:23 50:23 51:23 52:23 53:23 54:23 55:23 56:23, 57:23 58:23 59:23 60:23 61:34 62:34 63:34 64:34, 65:13 66:13 67:38 68:38 69:20 70:20 71:44 72:20, 73:19 74:19 75:20 76:44 77:20 78:19 79:19 80:20, 81:25 82:25 So in the first 5 lines there is a maximum of 33 fields. Hence, since there is no header line, read.csv() decides to allocate 33 columns. (See ?read.csv). There are the following distinct numbers of fields in the lines: 12 13 17 19 20 23 25 26 33 34 38 44 so there are lines with 34, 38 and 44 fields. All lines in the CSV file end with ";", hence there is an implicit blank field at the end of each line. The lines with 34 fields have the 34th field blank, so after the break there is presumably a "quasi blank input line" where the 34th (blank) field has spilled over. Such input will be ignored with the default "blank.lines.skip = TRUE" option to read,csv(). The longer lines (2 with 38 fields, 2 with 44) will be split after the 33rd field, the remainder being taken as an additional input line. As a result, there are 82 (= 82+4) rows in the resulting dataframe. This explanation is compatible with what Martin has observed. The underlying forensic details were sniffed out with a couple of passes through 'awk' scripts. One solution is to call read.csv() with option "col.names=Xnn" where Xnn is a constructed character vector with elements such as "X01" "X02" ... "X44" (once one has determined, as above, that there is a maximum of 44 fields per line in the file). Ted.
On 30-May-09 19:43:47, jim holtman wrote:
It is still not clear to me exactly how you want to read the lines in. If the lines have a variable number of fields, and some of the lines might be wrapped, is there some way to determine where the start of each line is. If you are reading them in with read.csv, then the system is assuming that each line starts a new row. If this is not the case, then you will have to state the rules that determine where the lines start. You can always read the data in with 'scan' to separate each line and then do whatever processing is required to put together the rows in a data frame that you want. In one of your examples, you indicated that the line was split starting at the word "kempten"; if this is in the middle of the line, then you would have to create the break after reading the line in with 'scan' and then creating the rows in the dataframe. All of this can be done in R if you can state what the criteria is. On Sat, May 30, 2009 at 4:32 AM, Martin Tomko <martin.tomko at geo.uzh.ch>wrote:
Jim, the two lines I put in are the actual problematic input lines. In these examples, there are no quotes nor # signs, although I have no means to make sure they do not occur in the inputs (any hints how I could deal with that?). I am trying to avoid as much pre-processing outside R as possible, and I have to process about 500 files with up to 3000 records each, so I need a more or less automated/batch solution. - so any string substitution will have to occur in R. But for the moment, I do not see a reaason for substitution, and the wrapping still occurs. Cheers Martin jim holtman wrote:
You need to supply the actual input line so we can see what is
happening.
Are you sure you do not have unbalanced quotes in your input (try
quote='')
or do you have comment characters ("#") in your input?
On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
<martin.tomko at geo.uzh.ch<mailto:
martin.tomko at geo.uzh.ch>> wrote:
Dear All,
I am observing a strange behavior and searching the archives and
help pages didn't help much.
I have a csv with a variable number of fields in each line.
I use
dataPoints <- read.csv(inputFile, head=FALSE, sep=";",fill =TRUE);
to read it in, and it works. But - some lines are long and 'wrap',
or split and continue on the next line. So when I check the dim of
the frame, they are not correct and I can see when I do a printout
that the lines is split into two in the frame. I checked the input
file and all is good.
an example of the input is:
37;2175168475;13;8.522729;47.19537;16366682 at N00
;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switz
erland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;touris
mus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitri
otnet;
where the last values occurs on the next line in the data frame.
It does not have to be the last value, as in the follwong example,
the word "kempten" starts the next line:
39;167757703;12;10.309295;47.724545;21903142 at N00
;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavar
ia;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss
;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kup
pel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europ
aeischeunion;germanio;
What could be the reason? I ws thinking about solving the issue by using a different separator, that I would use for the first 7 fields and concatenating all of the remaining values into a single stirng value, but could not figure out how to do such a substitution in R. Unfortunately, on my system I cannot specify a range for sed... Thanks for any help/pointers Martin
______________________________________________ R-help at r-project.org <mailto:R-help at r-project.org> mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html<http://www.r-project.or g/posting-guide.html> <http://www.r-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve?
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 30-May-09 Time: 21:15:13 ------------------------------ XFMail ------------------------------
Dear Jim, with the help of Ted, we diagnosed that the cause is in the extreme variability in line length during reading in. As the table column number is apparently determined fro mthe first five lines, what exceeds this length gets automatically on the next line. I am now trying to find a way to read in the data despite this. I have no control over the table extent, the only thing that would make sense according to my data would be to read in a fixed number of columns and merge all remaining columns as a long string in the last one. No idea how to do this, though. Thanks Martin
jim holtman wrote:
It is still not clear to me exactly how you want to read the lines
in. If the lines have a variable number of fields, and some of the
lines might be wrapped, is there some way to determine where the start
of each line is.
If you are reading them in with read.csv, then the system is assuming
that each line starts a new row. If this is not the case, then you
will have to state the rules that determine where the lines start.
You can always read the data in with 'scan' to separate each line and
then do whatever processing is required to put together the rows in a
data frame that you want.
In one of your examples, you indicated that the line was split
starting at the word "kempten"; if this is in the middle of the line,
then you would have to create the break after reading the line in with
'scan' and then creating the rows in the dataframe. All of this can
be done in R if you can state what the criteria is.
On Sat, May 30, 2009 at 4:32 AM, Martin Tomko <martin.tomko at geo.uzh.ch
<mailto:martin.tomko at geo.uzh.ch>> wrote:
Jim,
the two lines I put in are the actual problematic input lines.
In these examples, there are no quotes nor # signs, although I
have no means to make sure they do not occur in the inputs (any
hints how I could deal with that?).
I am trying to avoid as much pre-processing outside R as possible,
and I have to process about 500 files with up to 3000 records
each, so I need a more or less automated/batch solution. - so any
string substitution will have to occur in R. But for the moment, I
do not see a reaason for substitution, and the wrapping still occurs.
Cheers
Martin
jim holtman wrote:
You need to supply the actual input line so we can see what is
happening. Are you sure you do not have unbalanced quotes in
your input (try quote='') or do you have comment characters
("#") in your input?
On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
<martin.tomko at geo.uzh.ch <mailto:martin.tomko at geo.uzh.ch>
<mailto:martin.tomko at geo.uzh.ch
<mailto:martin.tomko at geo.uzh.ch>>> wrote:
Dear All,
I am observing a strange behavior and searching the
archives and
help pages didn't help much.
I have a csv with a variable number of fields in each line.
I use
dataPoints <- read.csv(inputFile, head=FALSE, sep=";",fill
=TRUE);
to read it in, and it works. But - some lines are long and
'wrap',
or split and continue on the next line. So when I check the
dim of
the frame, they are not correct and I can see when I do a
printout
that the lines is split into two in the frame. I checked
the input
file and all is good.
an example of the input is:
37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
where the last values occurs on the next line in the data
frame.
It does not have to be the last value, as in the follwong
example,
the word "kempten" starts the next line:
39;167757703;12;10.309295;47.724545;21903142 at N00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
What could be the reason?
I ws thinking about solving the issue by using a different
separator, that I would use for the first 7 fields and
concatenating all of the remaining values into a single stirng
value, but could not figure out how to do such a
substitution in
R. Unfortunately, on my system I cannot specify a range for
sed...
Thanks for any help/pointers
Martin
______________________________________________
R-help at r-project.org <mailto:R-help at r-project.org>
<mailto:R-help at r-project.org <mailto:R-help at r-project.org>>
mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
<http://www.r-project.org/posting-guide.html>
<http://www.r-project.org/posting-guide.html>
and provide commented, minimal, self-contained,
reproducible code.
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
Martin Tomko
Postdoctoral Research Assistant
Geographic Information Systems Division
Department of Geography
University of Zurich - Irchel
Winterthurerstr. 190
CH-8057 Zurich, Switzerland
email: martin.tomko at geo.uzh.ch
site: http://www.geo.uzh.ch/~mtomko
mob: +41-788 629 558
tel: +41-44-6355256
fax: +41-44-6356848
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090531/dd2d97ed/attachment-0001.pl>
Ah!!! It was count.fields() which we had overlooked! We discoveered
a work-round which involved using
Data0 <- readLines(file)
to create a vector of strings, one for each line of the input file,
and then using
NF <- unlist(lapply(R0,function(x)
length(unlist(gregexpr(";",x,fixed=TRUE,useBytes=TRUE))))))
to count the number of occurrences of ";" (the separator) in each line.
(NF+1) produces the same result as count.fields(file,sep=";").
Thanks for pointing out the existence of count.fields()!
Ted.
On 31-May-09 15:04:23, jim holtman wrote:
You can do something like this: count the number of fields in each line
of
the file and use the max to determine the number of columns for
read.table:
file <- '/tempxx.txt'
maxFields <- max(count.fields(file)) # max
# now setup read.table for max number
input <- read.table(file, colClasses=rep(NA, maxFields), fill=TRUE,
col.names=paste("V", seq(maxFields), sep=''))
On Sun, May 31, 2009 at 6:06 AM, Martin Tomko
<martin.tomko at geo.uzh.ch>wrote:
Dear Jim, with the help of Ted, we diagnosed that the cause is in the extreme variability in line length during reading in. As the table column number is apparently determined fro mthe first five lines, what exceeds this length gets automatically on the next line. I am now trying to find a way to read in the data despite this. I have no control over the table extent, the only thing that would make sense according to my data would be to read in a fixed number of columns and merge all remaining columns as a long string in the last one. No idea how to do this, though. Thanks Martin jim holtman wrote:
It is still not clear to me exactly how you want to read the lines
in. If
the lines have a variable number of fields, and some of the lines
might be
wrapped, is there some way to determine where the start of each line
is.
If you are reading them in with read.csv, then the system is
assuming
that each line starts a new row. If this is not the case, then you
will
have to state the rules that determine where the lines start. You
can
always read the data in with 'scan' to separate each line and then do
whatever processing is required to put together the rows in a data
frame
that you want.
In one of your examples, you indicated that the line was split
starting
at the word "kempten"; if this is in the middle of the line, then you
would
have to create the break after reading the line in with 'scan' and
then
creating the rows in the dataframe. All of this can be done in R if
you can
state what the criteria is.
On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
<martin.tomko at geo.uzh.ch<mailto:
martin.tomko at geo.uzh.ch>> wrote:
Jim,
the two lines I put in are the actual problematic input lines.
In these examples, there are no quotes nor # signs, although I
have no means to make sure they do not occur in the inputs (any
hints how I could deal with that?).
I am trying to avoid as much pre-processing outside R as possible,
and I have to process about 500 files with up to 3000 records
each, so I need a more or less automated/batch solution. - so any
string substitution will have to occur in R. But for the moment, I
do not see a reaason for substitution, and the wrapping still
occurs.
Cheers
Martin
jim holtman wrote:
You need to supply the actual input line so we can see what is
happening. Are you sure you do not have unbalanced quotes in
your input (try quote='') or do you have comment characters
("#") in your input?
On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
<martin.tomko at geo.uzh.ch <mailto:martin.tomko at geo.uzh.ch>
<mailto:martin.tomko at geo.uzh.ch
<mailto:martin.tomko at geo.uzh.ch>>> wrote:
Dear All,
I am observing a strange behavior and searching the
archives and
help pages didn't help much.
I have a csv with a variable number of fields in each line.
I use
dataPoints <- read.csv(inputFile, head=FALSE, sep=";",fill
=TRUE);
to read it in, and it works. But - some lines are long and
'wrap',
or split and continue on the next line. So when I check the
dim of
the frame, they are not correct and I can see when I do a
printout
that the lines is split into two in the frame. I checked
the input
file and all is good.
an example of the input is:
37;2175168475;13;8.522729;47.19537;16366682 at N00
;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switz
erland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;touris
mus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitri
otnet;
where the last values occurs on the next line in the data
frame.
It does not have to be the last value, as in the follwong
example,
the word "kempten" starts the next line:
39;167757703;12;10.309295;47.724545;21903142 at N00
;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavar
ia;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss
;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kup
pel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europ
aeischeunion;germanio;
What could be the reason?
I ws thinking about solving the issue by using a different
separator, that I would use for the first 7 fields and
concatenating all of the remaining values into a single
stirng
value, but could not figure out how to do such a
substitution in
R. Unfortunately, on my system I cannot specify a range for
sed...
Thanks for any help/pointers
Martin
______________________________________________
R-help at r-project.org <mailto:R-help at r-project.org>
<mailto:R-help at r-project.org <mailto:R-help at r-project.org>>
mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html<http://www.r-pro
ject.org/posting-guide.html>
<http://www.r-project.org/posting-guide.html>
<http://www.r-project.org/posting-guide.html>
and provide commented, minimal, self-contained,
reproducible code.
-- Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
-- Martin Tomko Postdoctoral Research Assistant Geographic Information Systems Division Department of Geography University of Zurich - Irchel Winterthurerstr. 190 CH-8057 Zurich, Switzerland email: martin.tomko at geo.uzh.ch site: http://www.geo.uzh.ch/~mtomko mob: +41-788 629 558 tel: +41-44-6355256 fax: +41-44-6356848
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 31-May-09 Time: 16:24:27 ------------------------------ XFMail ------------------------------
Big thanks to Ted and Jim for all the help. Martin
(Ted Harding) wrote:
Ah!!! It was count.fields() which we had overlooked! We discoveered
a work-round which involved using
Data0 <- readLines(file)
to create a vector of strings, one for each line of the input file,
and then using
NF <- unlist(lapply(R0,function(x)
length(unlist(gregexpr(";",x,fixed=TRUE,useBytes=TRUE))))))
to count the number of occurrences of ";" (the separator) in each line.
(NF+1) produces the same result as count.fields(file,sep=";").
Thanks for pointing out the existence of count.fields()!
Ted.
On 31-May-09 15:04:23, jim holtman wrote:
You can do something like this: count the number of fields in each line
of
the file and use the max to determine the number of columns for
read.table:
file <- '/tempxx.txt'
maxFields <- max(count.fields(file)) # max
# now setup read.table for max number
input <- read.table(file, colClasses=rep(NA, maxFields), fill=TRUE,
col.names=paste("V", seq(maxFields), sep=''))
On Sun, May 31, 2009 at 6:06 AM, Martin Tomko
<martin.tomko at geo.uzh.ch>wrote:
Dear Jim,
with the help of Ted, we diagnosed that the cause is in the extreme
variability in line length during reading in. As the table column
number is
apparently determined fro mthe first five lines, what exceeds this
length
gets automatically on the next line.
I am now trying to find a way to read in the data despite this. I have
no
control over the table extent, the only thing that would make sense
according to my data would be to read in a fixed number of columns and
merge
all remaining columns as a long string in the last one. No idea how to
do
this, though.
Thanks
Martin
jim holtman wrote:
It is still not clear to me exactly how you want to read the lines
in. If
the lines have a variable number of fields, and some of the lines
might be
wrapped, is there some way to determine where the start of each line
is.
If you are reading them in with read.csv, then the system is
assuming
that each line starts a new row. If this is not the case, then you
will
have to state the rules that determine where the lines start. You
can
always read the data in with 'scan' to separate each line and then do
whatever processing is required to put together the rows in a data
frame
that you want.
In one of your examples, you indicated that the line was split
starting
at the word "kempten"; if this is in the middle of the line, then you
would
have to create the break after reading the line in with 'scan' and
then
creating the rows in the dataframe. All of this can be done in R if
you can
state what the criteria is.
On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
<martin.tomko at geo.uzh.ch<mailto:
martin.tomko at geo.uzh.ch>> wrote:
Jim,
the two lines I put in are the actual problematic input lines.
In these examples, there are no quotes nor # signs, although I
have no means to make sure they do not occur in the inputs (any
hints how I could deal with that?).
I am trying to avoid as much pre-processing outside R as possible,
and I have to process about 500 files with up to 3000 records
each, so I need a more or less automated/batch solution. - so any
string substitution will have to occur in R. But for the moment, I
do not see a reaason for substitution, and the wrapping still
occurs.
Cheers
Martin
jim holtman wrote:
You need to supply the actual input line so we can see what is
happening. Are you sure you do not have unbalanced quotes in
your input (try quote='') or do you have comment characters
("#") in your input?
On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
<martin.tomko at geo.uzh.ch <mailto:martin.tomko at geo.uzh.ch>
<mailto:martin.tomko at geo.uzh.ch
<mailto:martin.tomko at geo.uzh.ch>>> wrote:
Dear All,
I am observing a strange behavior and searching the
archives and
help pages didn't help much.
I have a csv with a variable number of fields in each line.
I use
dataPoints <- read.csv(inputFile, head=FALSE, sep=";",fill
=TRUE);
to read it in, and it works. But - some lines are long and
'wrap',
or split and continue on the next line. So when I check the
dim of
the frame, they are not correct and I can see when I do a
printout
that the lines is split into two in the frame. I checked
the input
file and all is good.
an example of the input is:
37;2175168475;13;8.522729;47.19537;16366682 at N00
;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switz
erland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;touris
mus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitri
otnet;
where the last values occurs on the next line in the data
frame.
It does not have to be the last value, as in the follwong
example,
the word "kempten" starts the next line:
39;167757703;12;10.309295;47.724545;21903142 at N00
;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavar
ia;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss
;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kup
pel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europ
aeischeunion;germanio;
What could be the reason?
I ws thinking about solving the issue by using a different
separator, that I would use for the first 7 fields and
concatenating all of the remaining values into a single
stirng
value, but could not figure out how to do such a
substitution in
R. Unfortunately, on my system I cannot specify a range for
sed...
Thanks for any help/pointers
Martin
______________________________________________
R-help at r-project.org <mailto:R-help at r-project.org>
<mailto:R-help at r-project.org <mailto:R-help at r-project.org>>
mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html<http://www.r-pro
ject.org/posting-guide.html>
<http://www.r-project.org/posting-guide.html>
<http://www.r-project.org/posting-guide.html>
and provide commented, minimal, self-contained,
reproducible code.
-- Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
-- Martin Tomko Postdoctoral Research Assistant Geographic Information Systems Division Department of Geography University of Zurich - Irchel Winterthurerstr. 190 CH-8057 Zurich, Switzerland email: martin.tomko at geo.uzh.ch site: http://www.geo.uzh.ch/~mtomko mob: +41-788 629 558 tel: +41-44-6355256 fax: +41-44-6356848
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 31-May-09 Time: 16:24:27 ------------------------------ XFMail ------------------------------
Martin Tomko
Postdoctoral Research Assistant
Geographic Information Systems Division
Department of Geography
University of Zurich - Irchel
Winterthurerstr. 190
CH-8057 Zurich, Switzerland
email: martin.tomko at geo.uzh.ch
site: http://www.geo.uzh.ch/~mtomko
mob: +41-788 629 558
tel: +41-44-6355256
fax: +41-44-6356848