readLines() segfaults on large file & question on how to work around

9 messages · Jennifer Lyon, Ista Zahn, Iñaki Ucar +4 more

#
Hi:

I have a 2.1GB JSON file. Typically I use readLines() and
jsonlite::fromJSON() to extract data from a JSON file.

When I try to read in this file using readLines(), R segfaults.

I believe the two salient issues with this file are:
1) its size
2) it is a single line (no line breaks)

I can reproduce this issue as follows
#Generate a big file with no line breaks
# In R
# in unix shell
cp alpha.txt file.txt
for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done

This generates a 2.3GB file with no line breaks

in R:
readLines("file.txt")
*** caught segfault ***
address 0x7cffffff, cause 'memory not mapped'

Traceback:
 1: readLines("file.txt")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3

I conclude:
 I am potentially running up against a limit in R, which should give a
reasonable error, but currently just segfaults.

My question:
Most of the content of the JSON is an approximately 100K x 6K JSON
equivalent of a dataframe, and I know R can handle much bigger than this
size. I am expecting these JSON files to get even larger. My R code lives
in a bigger system, and the JSON comes in via stdin, so I have absolutely
no control over the data format. I can imagine trying to incrementally
parse the JSON so I don't bump up against the limit, but I am eager for
suggestions of simpler solutions.

Also, I apologize for the timing of this bug report, as I know folks are
working to get out the next release of R, but like so many things I have no
control over when bugs leap up.

Thanks.

Jen
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: R-3.4.1/lib/libRblas.so
LAPACK: R-3.4.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.1
#
As a work-around I suggest readr::read_file.

--Ista
On Sep 2, 2017 2:58 PM, "Jennifer Lyon" <jennifer.s.lyon at gmail.com> wrote:

#
Thank you for your suggestion. Unfortunately, while R doesn't segfault
calling readr::read_file() on the test file I described, I get the error
message:

Error in read_file_(ds, locale) : negative length vectors are not allowed

Jen
On Sat, Sep 2, 2017 at 1:38 PM, Ista Zahn <istazahn at gmail.com> wrote:

#
2017-09-02 20:58 GMT+02:00 Jennifer Lyon <jennifer.s.lyon at gmail.com>:
As a workaround you can pipe something like "sed s/,/,\\n/g" before
your R script to insert line breaks.

Iñaki
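Iñaki's pipe idea, sketched concretely (hedged: the exact sed pattern is an assumption, and naively breaking on every comma would also split commas inside string values, so breaking between adjacent objects is safer):

```shell
# Insert a line break between adjacent JSON objects before the stream
# reaches R, so readLines() sees many short lines instead of one huge
# one. Assumes the text '},{' never occurs inside a string value and
# a GNU sed (where \n in the replacement text is a newline).
printf '[{"a":1},{"a":2},{"a":3}]' | sed 's/},{/},\n{/g'
```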
#
Jennifer, why don't you try SparkR?

https://spark.apache.org/docs/1.6.1/api/R/read.json.html
On 2 September 2017 at 23:15, Jennifer Lyon <jennifer.s.lyon at gmail.com> wrote:
#
On Sat, Sep 2, 2017 at 8:58 PM, Jennifer Lyon <jennifer.s.lyon at gmail.com> wrote:
If your data consists of one JSON object per line, this is called
'ndjson'. There are several packages specialized to read ndjson files:

 - corpus::read_ndjson
 - ndjson::stream_in
 - jsonlite::stream_in

In particular the 'corpus' package handles large files really well
because it has an option to memory-map the file instead of reading all
of its data into memory.

If the data is too large to read, you can preprocess it using
https://stedolan.github.io/jq/ to extract the fields that you need.

You really don't need hadoop/spark/etc for this.
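For the single-big-array case in this thread, jq (linked above) can reshape the file into ndjson first; a minimal sketch, assuming the top level is a JSON array:

```shell
# jq -c '.[]' streams each element of a top-level array onto its own
# line in compact form, i.e. it produces ndjson for the readers above.
printf '[{"x":1},{"x":2}]' | jq -c '.[]'
```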
#
Jeroen:

Thank you for pointing me to ndjson, which I had not heard of and is
exactly my case.

My experience:
jsonlite::stream_in - segfaults
ndjson::stream_in - my fault, I am running Ubuntu 14.04 and it is too old
      so it won't compile the package
corpus::read_ndjson - works!!! Of course it does a different simplification
     than jsonlite::fromJSON, so I have to change some code, but it works
     beautifully at least in simple tests. The memory-map option may be of
     use in the future.

Another correspondent said that strings in R can be at most 2^31-1
bytes long, which is why any "solution" that tries to load the whole
file into R first as a single string will fail.
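For reference, that limit works out to just under 2.1 GB, which the 2.3 GB test file exceeds, so a quick shell check makes the failure unsurprising:

```shell
# R limits a single character string to 2^31 - 1 bytes
# (a signed 32-bit length):
echo $(( 2**31 - 1 ))
```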

Thanks for suggesting a path forward for me!

Jen
On Sun, Sep 3, 2017 at 2:15 AM, Jeroen Ooms <jeroenooms at gmail.com> wrote:

#
Although the problem can apparently be avoided in this case, readLines
causing a segfault still seems unwanted behaviour to me. I can replicate
this with the example below (sessionInfo is further down):


# Generate an example file
l <- paste0(sample(c(letters, LETTERS), 1E6, replace = TRUE),
   collapse="")
con <- file("test.txt", "wt")
for (i in seq_len(2500)) {
   writeLines(l, con, sep ="")
}
close(con)


# Causes segfault:
readLines("test.txt")

The error reported by readr is also reproduced (a more informative
error message and checking for integer overflows would be nice). I will
report this to readr.

library(readr)
read_file("test.txt")
# Error in read_file_(ds, locale) : negative length vectors are not
# allowed
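The "negative length" message is consistent with the requested size overflowing a signed 32-bit length (a sketch of the arithmetic only; that this is exactly what happens inside readr is an assumption):

```shell
# 2500 repetitions of a 1e6-character line = 2.5e9 bytes; as a signed
# 32-bit integer that value wraps around to a negative number.
echo $(( 2500 * 1000000 - 2**32 ))
```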


--
Jan

 > sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1    tibble_1.3.3   Rcpp_0.12.12   rlang_0.1.2
On 03-09-17 20:50, Jennifer Lyon wrote:
#
As of R-devel 72925 one gets a proper error message instead of the crash.

Tomas
On 09/04/2017 08:46 AM, rhelp at eoos.dds.nl wrote: