
cannot read iso639 table

7 messages · Sam Steingold, Peter Dalgaard, William Dunlap

#
line 109 did not have 5 elements ... but it did!
empty beginning of file ... but it's not!

details:
--8<---------------cut here---------------start------------->8---
get.language.ISO.table <- function () {
  socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
                open="r",encoding="utf-8");
  data <- read.table(socket, as.is = TRUE, sep = "|", header = FALSE,
                     col.names = c("a3bibliographic","a3terminologic",
                       "a2","english","french"));
  close(socket);
  data
}
language.ISO.table <- get.language.ISO.table()

Error in read.table(socket, as.is = TRUE, sep = "|", header = FALSE,
  col.names = c("a3bibliographic", : 
  empty beginning of file
--8<---------------cut here---------------end--------------->8---
the first line is _not_ blank, as one can see by downloading the
file with wget
  
In addition:
--8<---------------cut here---------------start------------->8---
Warning messages:
1: In read.table(socket, as.is = TRUE, sep = "|", header = FALSE, col.names = c("a3bibliographic",  :
  invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
--8<---------------cut here---------------end--------------->8---
what is invalid there? libreoffice calc opened the file just fine.

--8<---------------cut here---------------start------------->8---
2: In read.table(socket, as.is = TRUE, sep = "|", header = FALSE, col.names = c("a3bibliographic",  :
  incomplete final line found by readTableHeader on 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
--8<---------------cut here---------------end--------------->8---
indeed the final NL is missing. why is this a big deal?

when I download the file:

--8<---------------cut here---------------start------------->8---
read.table("ISO-639-2_utf-8.csv",encoding="utf-8", as.is = TRUE,
           sep = "|", header = FALSE,
            col.names = c("a3bibliographic","a3terminologic",
                       "a2","english","french"))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 109 did not have 5 elements
--8<---------------cut here---------------end--------------->8---

however
--8<---------------cut here---------------start------------->8---
Warning message:
In readLines("ISO-639-2_utf-8.csv", encoding = "utf-8") :
  incomplete final line found on 'ISO-639-2_utf-8.csv'
[1] "dgr|||Dogrib|dogrib"                         
[2] "din|||Dinka|dinka"                           
[3] "div||dv|Divehi; Dhivehi; Maldivian|maldivien"
--8<---------------cut here---------------end--------------->8---
all lines look legit to me.

so, why can't I read the file?

thanks.

ps. ubuntu; R 2.15.1 (2012-06-22) installed from cran using aptitude.
#
On Sep 13, 2012, at 19:42 , Sam Steingold wrote:

            
quote="" would seem to be your friend (apostrophes in the file are doing you in). I can't reproduce the "empty beginning" error, though.
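A minimal sketch of the quoting issue, using made-up sample lines rather than the real file: by default read.table treats both " and ' as quote characters, so an unpaired apostrophe can swallow field separators and line breaks. Passing quote="" switches quoting off. The names txt, cols and default.result are only for illustration.

```r
# Made-up sample in the same pipe-separated layout; two entries contain apostrophes.
txt <- c("aaa|||It's one|un",
         "bbb|||two|deux",
         "ccc|||Don't|trois")
cols <- c("a3bibliographic", "a3terminologic", "a2", "english", "french")

# Default quoting (quote = "\"'"): the outcome is not asserted here, since it
# may be an error or a mangled table; it is only captured for inspection.
default.result <- tryCatch(
  suppressWarnings(read.table(text = txt, sep = "|", as.is = TRUE,
                              header = FALSE, col.names = cols)),
  error = function(e) conditionMessage(e))

# Quoting disabled: every line splits into exactly five fields.
d <- read.table(text = txt, sep = "|", quote = "", as.is = TRUE,
                header = FALSE, col.names = cols)
dim(d)
d$english
```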

  
    
#
On Windows with R-2.15.1 in a 1252 locale, I had to read and discard
the initial 3 bytes (the byte-order mark?) to make things work:

  > socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",open="r",encoding="utf-8")
  > readChar(socket, nchars=3, useBytes=TRUE)
  [1] "???"
  > d <- read.table(socket, quote="", sep="|", stringsAsFactors=FALSE)
  > dim(d)
  [1] 485   5
  > head(d)
     V1 V2 V3             V4      V5
  1 aar    aa           Afar    afar
  2 abk    ab      Abkhazian abkhaze
  3 ace             Achinese    aceh
  4 ach                Acoli   acoli
  5 ada              Adangme adangme
  6 ady       Adyghe; Adygei  adygh?

If I discarded no initial bytes, I got:
  > socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",open="r",encoding="utf-8")
  > d <- read.table(socket, quote="", sep="|", stringsAsFactors=FALSE)
  Warning messages:
  1: In read.table(socket, quote = "", sep = "|", stringsAsFactors = FALSE) :
    invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
  2: In read.table(socket, quote = "", sep = "|", stringsAsFactors = FALSE) :
    incomplete final line found by readTableHeader on 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
  > dim(d)
  [1] 1 1
  > str(d)
  'data.frame':   1 obs. of  1 variable:
   $ V1: chr "?"
If I discarded only the initial 2 bytes, I got an "empty beginning of file" error:
  > socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",open="r",encoding="utf-8")
  > readChar(socket, nchars=2, useBytes=TRUE)
  [1] "??"
  > d <- read.table(socket, quote="", sep="|", stringsAsFactors=FALSE)
  Error in read.table(socket, quote = "", sep = "|", stringsAsFactors = FALSE) : 
    empty beginning of file
  In addition: Warning messages:
  1: In read.table(socket, quote = "", sep = "|", stringsAsFactors = FALSE) :
    invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
  2: In read.table(socket, quote = "", sep = "|", stringsAsFactors = FALSE) :
    incomplete final line found by readTableHeader on 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'

  > sessionInfo()
  R version 2.15.1 (2012-06-22)
  Platform: x86_64-pc-mingw32/x64 (64-bit)
  
  locale:
  [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
  [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
  [5] LC_TIME=English_United States.1252    
  
  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
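As a sketch of the same idea without relying on the locale: read the file as raw bytes, check whether the first three are the UTF-8 byte-order mark (EF BB BF), and drop them only if they are. The sample file below is made up and written to a temporary path; bom, path and has.bom are illustrative names, not part of the original code.

```r
# Build a small made-up sample file that starts with a UTF-8 BOM.
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw("aar||aa|Afar|afar\nabk||ab|Abkhazian|abkhaze\n")
path <- tempfile(fileext = ".txt")
writeBin(c(bom, body), path)

# Read everything as raw bytes and strip the BOM only when it is present.
bytes   <- readBin(path, what = "raw", n = file.info(path)$size)
has.bom <- length(bytes) >= 3 && identical(bytes[1:3], bom)
if (has.bom) bytes <- bytes[-(1:3)]

# Convert back to text, declare it as UTF-8, and parse it.
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"
d <- read.table(text = txt, sep = "|", quote = "", as.is = TRUE,
                header = FALSE,
                col.names = c("a3bibliographic", "a3terminologic",
                              "a2", "english", "french"))
d$a3bibliographic
```

For a remote file, the bytes would first have to be fetched (for instance with download.file to a temporary path); that step is left out here.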
#
confirmed - first 3 bytes are "\357\273\277"
alas, this is all I get:

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'

  a3bibliographic a3terminologic a2        english  french
1             aar             NA aa           Afar    afar
2             abk             NA ab      Abkhazian abkhaze
3             ace             NA          Achinese    aceh
4             ach             NA             Acoli   acoli
5             ada             NA           Adangme adangme
6             ady             NA    Adyghe; Adygei   adygh

note that the first non-ASCII character terminates the input.

so, I still cannot read the data from the URL.

I can read the file though - with quote="" (thanks Peter!) -
except that the first record is "\357\273\277aar".
#
It would be helpful if you showed your commands and printed
outputs, copied directly from your R session, from the beginning
to the end.  I put the call to sessionInfo() in my message because
it is probably relevant.  It is nice to completely include the original
email when responding to it so others can see the whole story in
one place.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Pragmatically, one can zap the BOM from the output with 

language.ISO.table[1,1] <- substring(language.ISO.table[1,1],2)

and be done with it.
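A slightly more defensive variant, as a sketch: remove the BOM only if the first value really starts with it, so the code is harmless when the BOM is absent. This assumes the strings were decoded as UTF-8, so that the BOM is the single character U+FEFF; first.col is a made-up stand-in for the first column of the table.

```r
# Made-up first-column values; the first one carries a leading BOM (U+FEFF).
first.col <- c("\ufeffaar", "abk", "ace")

# Strip the BOM only when it is actually at the start of the string.
first.col[1] <- sub("^\ufeff", "", first.col[1])
first.col
```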

It would be nicer to zap the BOM before read.table, though. It does work for me with the below (notice that the BOM is a single character if you don't use useBytes=).
get.language.ISO.table <- function () {
 socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
               open="r",encoding="utf-8");
 # skip the BOM: a single character once the connection decodes UTF-8
 readChar(socket, nchars=1)
 data <- read.table(socket, as.is = TRUE, sep = "|", header = FALSE,
                    col.names = c("a3bibliographic","a3terminologic",
                      "a2","english","french"), quote="");
 close(socket);
 data
}
On Sep 13, 2012, at 22:26 , William Dunlap wrote:
#
On Windows with locale "English_United States.1252" my R-2.15.1 could not
get that far:
  >  socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
  +                open="r",encoding="utf-8");
  > read.table(socket, quote="", sep="|")
    V1
  1  ?
  Warning messages:
  1: In read.table(socket, quote = "", sep = "|") :
    invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
  2: In read.table(socket, quote = "", sep = "|") :
    incomplete final line found by readTableHeader on 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
  > str(.Last.value)
  'data.frame':   1 obs. of  1 variable:
   $ V1: Factor w/ 1 level "?": 1
An initial readChar was the only way I could get it to work there.

Since Windows software seems to put a BOM at the top of a file to indicate that
it is using UTF-<something>, it would be nice if the connection code
at least had an option to deal with it.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
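Later R releases did add such an option: according to the ?file help page, from R 3.0.0 on the encoding name "UTF-8-BOM" is accepted for reading and removes a byte-order mark if one is present. It is not available in the R 2.15.1 used in this thread. A minimal sketch against a made-up local sample file (path and cols are illustrative names):

```r
# Write a small made-up sample file that starts with a UTF-8 BOM.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("aar||aa|Afar|afar\nabk||ab|Abkhazian|abkhaze\n")),
         path)

cols <- c("a3bibliographic", "a3terminologic", "a2", "english", "french")

# fileEncoding = "UTF-8-BOM" (R >= 3.0.0) drops the BOM while reading.
d <- read.table(path, fileEncoding = "UTF-8-BOM", sep = "|", quote = "",
                as.is = TRUE, header = FALSE, col.names = cols)
d$a3bibliographic
```

The same encoding name can be passed to url() or file() through the encoding argument when a connection is opened by hand.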