issue with unz()?
If you use check.names=FALSE in your call to read.csv you can see that the first column name starts with the 3 bytes ef bb bf, which is the UTF-8 "byte-order mark" that Microsoft applications like to put at the start of a text file stored in UTF-8.
v0514 <- read.csv(unz(temp, file0514[1]), stringsAsFactors=FALSE, check.names=FALSE)
names(v0514)[1]
[1] "???Accident_Index"
charToRaw(names(v0514)[1])
[1] ef bb bf 41 63 63 69 64 65 6e 74 5f 49 6e 64 65 78

I thought that adding fileEncoding="UTF-8-BOM" or perhaps encoding="UTF-8-BOM" would take care of the issue, but it does not work for me. You can remove those three bytes by hand with substring():
substring(names(v0514)[1],4)
[1] "Accident_Index"

Bill Dunlap
TIBCO Software
wdunlap tibco.com
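
The substring() removal above drops the first three characters unconditionally, so it would damage a name that has no byte-order mark. A slightly more defensive variant, offered only as a sketch: strip_bom() is a made-up helper name (not a base R function), and "Other_Column" is an invented column name used for illustration. It works on raw bytes rather than characters, on the assumption that a UTF-8 locale may count the BOM as a single character rather than three.

```r
# Hypothetical helper (not part of base R): remove a leading UTF-8
# byte-order mark (bytes ef bb bf) from each element of a character
# vector, leaving elements without a BOM untouched.
strip_bom <- function(x) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))

  # TRUE for elements whose first three bytes are exactly the BOM
  has_bom <- vapply(x, function(s) {
    b <- charToRaw(s)
    length(b) >= 3L && identical(b[1:3], bom)
  }, logical(1), USE.NAMES = FALSE)

  # Drop the three BOM bytes and convert the remainder back to a string
  x[has_bom] <- vapply(x[has_bom], function(s) {
    rawToChar(charToRaw(s)[-(1:3)])
  }, character(1), USE.NAMES = FALSE)

  x
}

# Illustration with a hand-built name that starts with the BOM bytes
nm <- c(rawToChar(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("Accident_Index"))),
        "Other_Column")
strip_bom(nm)
```

With the data frames from this thread it would be applied as names(v0514) <- strip_bom(names(v0514)), which should leave the names unchanged on a platform where the BOM was already dropped.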
On Thu, Feb 9, 2017 at 4:13 PM, jing hua zhao <jinghuazhao at hotmail.com> wrote:
Dear R-devel,

I appear to see differences in behavior of unz between Windows and Linux.

url0514 <- "http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19_Data_2005-2014.zip"
file0514 <- c("Vehicles0514.csv","Casualties0514.csv","Accidents0514.csv")
temp <- tempfile()
download.file(url0514,temp)
a0514 <<- read.csv(unz(temp, file0514[3]))
c0514 <<- read.csv(unz(temp, file0514[2]))
v0514 <<- read.csv(unz(temp, file0514[1]))

Under Windows, I noticed that there are variables ï..Accident_Index in objects [a|c|v]0514, but this is not the case if zip file contains only one file, i.e.,

file2015 <- c("Vehicles_2015.csv","Casualties_2015.csv","Accidents_2015.csv")
url2015 <- "http://data.dft.gov.uk/road-accidents-safety-data/RoadSafetyData_2015.zip"
download.file(url2015,temp)
v2015 <<- read.csv(unz(temp, file2015[1]))
c2015 <<- read.csv(unz(temp, file2015[2]))
a2015 <<- read.csv(unz(temp, file2015[3]))

so to combine [a|c|v]0514 and [a|c|v]2015, I need to add something like

names(a0514)[names(a0514)=="ï..Accident_Index"] <- "Accident_Index"
names(c0514)[names(c0514)=="ï..Accident_Index"] <- "Accident_Index"
names(v0514)[names(v0514)=="ï..Accident_Index"] <- "Accident_Index"

This is unnecessary under Linux (RHEL), since those Accident_Index columns have no ï.. prefix there.

Do I miss anything here?

Many thanks,

Jing Hua Zhao
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel