Hi all,

I used `gzfile` and `gzcon` to read the same compressed file, but `gzcon` gave me a different result than `gzfile`. It looks as if `gzcon` does not handle the data correctly. I have posted an example below.

In the example, a portion (up to 4 MiB) of a compressed file is downloaded from Google Cloud as a raw vector, and the data is saved to a temp file. If I use `gzfile` to read the file, it returns the first 1000 lines successfully. However, if I wrap the raw vector in a raw connection and use `gzcon` to read from that connection, it returns only the first 884 lines, along with a warning (see the output).

code:
# install.packages("BiocManager")
# BiocManager::install("GCSConnection", version = "devel")
library(GCSConnection)
## Download data from cloud
uri <-
"gs://gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr1.vcf.bgz"
con <- gcs_connection(uri)
data <- readBin(con, raw(), 4*1024*1024)
close(con)
## write data to a file
file_path <- tempfile()
writeBin(data, file_path)
## Read the data using `gzfile`
con1 <- gzfile(file_path)
str(readLines(con1, 1000))
## Read the data using `gzcon`
## We create a raw connection from the raw vector
con2 <- gzcon(rawConnection(data))
str(readLines(con2, 1000))
output:
str(readLines(con1, 1000))
chr [1:1000] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
str(readLines(con2, 1000))
chr [1:884] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
Warning message:
In readLines(con2, 1000) :
  incomplete final line found on 'gzcon(data)'
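
To check whether the difference depends on this particular file, the same comparison can be run on a small gzip file created locally. This is only a sketch: the file contents and the names `lines_in`, `bytes`, `from_file` and `from_raw` are made up for illustration and are not part of the example above, and it does not reproduce the `.bgz` download.

```r
## Hypothetical local test data, not taken from the gnomAD file
lines_in <- sprintf("line %d", 1:5000)

## Write the lines to an ordinary gzip file
tmp <- tempfile(fileext = ".gz")
gz_out <- gzfile(tmp, "w")
writeLines(lines_in, gz_out)
close(gz_out)

## Load the compressed bytes into a raw vector, as in the example above
bytes <- readBin(tmp, raw(), file.size(tmp))

## Path 1: decompress from the file with `gzfile`
con_file <- gzfile(tmp)
from_file <- readLines(con_file)
close(con_file)

## Path 2: decompress from the raw vector with `gzcon`
con_raw <- gzcon(rawConnection(bytes))
from_raw <- readLines(con_raw)
close(con_raw)

## Compare both results against the original lines
identical(from_file, lines_in)
identical(from_raw, lines_in)
```

If both comparisons agree for a locally written file while the `.bgz` data still differs, that would suggest the discrepancy is tied to how that particular file is laid out rather than to `rawConnection` itself.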
I am not sure whether this is caused by a bug in `gzcon` or by my misuse of the function. The same result can be observed on R 4.0 and R 4.1 devel on Windows. Here is my session info; I hope it is helpful. Any suggestions and help would be appreciated.

R Under development (unstable) (2020-06-27 r78747)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
system code page: 65001

Best,
Jiefei