Exceptional slowness with read.csv

Thanks, yeah, I think scan is more promising. I'll check it out.
No idea, but have you tried using ?scan to read those next 5 rows? It 
might give you a better idea of the pathologies that are causing 
problems. For example, an unmatched quote might result in some huge 
number of characters trying to be read into a single element of a 
character variable. As your previous respondent said, resolving such 
problems can be a challenge.

Cheers,
Bert
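A minimal sketch of the kind of inspection suggested above, assuming a tiny stand-in file (the data and the names tmp, raw_lines, and suspect are made up for illustration, not taken from the original file). It uses scan with sep = "\n" and quote = "" so that each physical line comes back as one string with no quote processing, then flags lines with an odd number of double quotes:

```r
# Hypothetical stand-in for the real file: row 2 has an unmatched quote.
tmp <- tempfile(fileext = ".csv")
writeLines(c('id,name,score',
             '1,"alice",10',
             '2,"bob,20',
             '3,"carol",30',
             '4,"dave",40'), tmp)

# Skip the header and the first data row, then read the next 3 lines raw.
# sep = "\n" returns one element per line; quote = "" turns off quote handling,
# so an unmatched quote cannot swallow the following lines.
raw_lines <- scan(tmp, what = character(), sep = "\n", quote = "",
                  skip = 2, nlines = 3, quiet = TRUE)

# Count double quotes on each line; an odd count hints at an unmatched quote.
n_quotes <- lengths(regmatches(raw_lines,
                               gregexpr('"', raw_lines, fixed = TRUE)))
suspect <- raw_lines[n_quotes %% 2 == 1]
print(suspect)
```

An odd quote count is only a heuristic: a legitimately quoted field that spans several lines would also be flagged, so the suspect lines still need a look by eye.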

On Mon, Apr 8, 2024 at 8:06 AM Dave Dixon <ddixon at swcp.com> wrote:

    Greetings,

    I have a csv file of 76 fields and about 4 million records. I know that
    some of the records have errors - unmatched quotes, specifically.
    Reading the file with readLines and parsing the lines with
    read.csv(text = ...) is really slow. I know that the first 2459465
    records are good. So I try this:

    > startTime <- Sys.time()
    > first_records <- read.csv(file_name, nrows = 2459465)
    > endTime <- Sys.time()
    > cat("elapsed time = ", endTime - startTime, "\n")

    elapsed time =  24.12598

    > startTime <- Sys.time()
    > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
    > endTime <- Sys.time()
    > cat("elapsed time = ", endTime - startTime, "\n")

    This appears to never finish. I have been waiting over 20 minutes.

    So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
    than (nrows = 2459465)?

    Thanks!

    -dave

    PS: readLines(n=2459470) takes 10.42731 seconds.
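Since readLines gets through the file quickly, one possible workaround is to pull the lines after the known-good block with readLines on an open connection and hand only that small slice to read.csv. The sketch below is an illustration under assumptions, not a diagnosis of the slowdown: the stand-in file, the presence of a header row, and the names n_good and next_records are all hypothetical. It passes quote = "" so that an unmatched quote cannot run on into later lines:

```r
# Hypothetical stand-in file: a header, two clean rows, then a bad row.
tmp <- tempfile(fileext = ".csv")
writeLines(c('id,name,score',
             '1,"alice",10',
             '2,"bob",20',
             '3,"carol,30',
             '4,"dave",40',
             '5,"erin",50'), tmp)

n_good <- 2  # number of data rows known to be clean (2459465 in the post)

# Read sequentially from one connection so each call continues where
# the previous one stopped.
con <- file(tmp, open = "r")
header <- readLines(con, n = 1)
invisible(readLines(con, n = n_good))  # discard the rows already known good
next_lines <- readLines(con, n = 5)    # up to five of the following lines
close(con)

# Parse just this slice. quote = "" disables quote handling, so every
# physical line becomes exactly one record (quote characters stay in the data).
elapsed <- system.time(
  next_records <- read.csv(text = c(header, next_lines), quote = "",
                           stringsAsFactors = FALSE)
)
print(next_records)
print(elapsed[["elapsed"]])
```

With quote = "", a comma inside a properly quoted field would be split into two fields, so this is meant for locating the damaged records rather than as the final import.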

    ______________________________________________
    R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
    https://stat.ethz.ch/mailman/listinfo/r-help
    PLEASE do read the posting guide
    http://www.R-project.org/posting-guide.html
    and provide commented, minimal, self-contained, reproducible code.
