foreign::read.dbf fails to parse dbf properly

1 message · Roger Bivand

On Sat, 30 Jul 2022, r-devel-request at r-project.org wrote:

As you may have seen in the code or the help page, the first port from the 
then-current version of shapelib (https://github.com/OSGeo/shapelib) was 
made by Nicholas Lewin-Koh 20 years ago. The code is largely unchanged 
since then.

What is your OS and R version?

Reading F1_15.DBF (Fedora 36, locally built R 4.2.1), I see:
RESPONDENT      REPORT_YEA     SPPLMNT_NU     ROW_NUMBER       ROW_SEQ
  \xc2   :  350   \xe4\a:25774   NA's:25774   U      :  874   U      :  874
  \001   :  248                               C      :  864   C      :  864
  \002   :  208                               T      :  846   T      :  846
  x      :  206                               \004   :  845   \004   :  845
  \x85   :  197                               \002   :  840   \002   :  840
  (Other):24363                               (Other):20813   (Other):20813
  NA's   :  202                               NA's   :  692   NA's   :  692
...

which is why I ask about the OS: encoding may be a further problem, 
particularly with R >= 4.2 on Windows (UCRT).
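
One way to see how widespread the odd bytes are is to count, per character 
column, the values that are not valid UTF-8. This is only a minimal sketch: 
the helper name count_invalid_utf8 and the toy data frame are made up for 
illustration (they are not part of foreign), and the toy data stand in for 
the result of read.dbf("F1_15.DBF", as.is = TRUE).

```r
# Hypothetical helper: count strings that are not valid UTF-8 in each
# character column of a data frame (NA values are ignored).
count_invalid_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  vapply(df[is_chr],
         function(x) sum(!validUTF8(x[!is.na(x)])),
         integer(1))
}

# Toy stand-in for a data frame read with read.dbf(..., as.is = TRUE);
# "\xc2" is a lone lead byte and so is not valid UTF-8 on its own.
toy <- data.frame(A = c("ok", "\xc2", NA),
                  B = c("U", "C", "T"),
                  stringsAsFactors = FALSE)
bad <- count_invalid_utf8(toy)
print(bad)
```

A non-zero count only says that the bytes are not UTF-8; it does not say 
whether they are text in another encoding or simply misread binary data.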

The help page does say:

"The DBF format is documented but not much adhered to.  There is is
no guarantee this will read all DBF files."

and:

      'read.dbf' is based on C code from <http://shapelib.maptools.org/>
      which implements the 'XBASE' specification.  It can convert fields
      of type '"L"' (logical), '"N"' and '"F"' (numeric and float) and
      '"D"' (dates): all other field types are read as-is as character
      vectors.  A numeric field is read as an R integer vector if it is
      encoded to have no decimals, otherwise as a numeric vector.
      However, if the numbers are too large to fit into an integer
      vector, it is changed to numeric.  Note that is possible to read
      integers that cannot be represented exactly even as doubles: this
      sometimes occurs if IDs are incorrectly coded as numeric.
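
The integer/numeric rule quoted above can be checked with a small round 
trip through write.dbf() and read.dbf(). This is a sketch on toy data only 
(the column names are invented), and it says nothing about files written 
by other software:

```r
library(foreign)

# Toy data: one integer, one double and one character column.
df <- data.frame(ID = 1:3,
                 VALUE = c(1.5, 2.25, 3),
                 NAME = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

dbf <- tempfile(fileext = ".dbf")
write.dbf(df, dbf)

# as.is = TRUE keeps character fields as character vectors, not factors.
back <- read.dbf(dbf, as.is = TRUE)
str(back)
unlink(dbf)
```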

So pre-converting the files seems easier than retro-fitting read.dbf(), 
given the time since the function was first published. LibreOffice seems 
to see 40, and writes 40 in a more accessible way, which can be read by 
read.dbf().

Using ogr2ogr from GDAL (locally built 3.5.1, with its bundled shapelib, 
on Fedora 36 in a UTF-8 locale), https://gdal.org/programs/ogr2ogr.html, 
I get:

ogr2ogr -f CSV F1_15.csv F1_15.DBF
Warning 1: One or several characters couldn't be converted correctly from 
CP1252 to UTF-8.  This warning will not be emitted anymore

and it gives "(" not 40. So updating the shapelib files in foreign does 
not seem likely to help.
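
The pre-conversion route can also be scripted from R. A minimal sketch, 
assuming ogr2ogr is on the PATH and using only the -f CSV form shown 
above; the helper name dbf_via_csv is hypothetical, and reading the result 
as UTF-8 is an assumption based on the conversion warning:

```r
# Hypothetical helper: convert a DBF file to CSV with ogr2ogr, then read
# the CSV. Returns NULL (with a message) if the tool or the input file is
# missing, or if the conversion fails.
dbf_via_csv <- function(dbf, csv = tempfile(fileext = ".csv")) {
  if (!nzchar(Sys.which("ogr2ogr")) || !file.exists(dbf)) {
    message("ogr2ogr or input file not found: ", dbf)
    return(NULL)
  }
  # Same call as on the command line: ogr2ogr -f CSV <csv> <dbf>
  status <- system2("ogr2ogr", c("-f", "CSV", shQuote(csv), shQuote(dbf)))
  if (status != 0 || !file.exists(csv)) {
    message("ogr2ogr failed for ", dbf)
    return(NULL)
  }
  # Assumption: ogr2ogr wrote the CSV as UTF-8.
  read.csv(csv, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
}

res <- dbf_via_csv("F1_15.DBF")
```

Characters that ogr2ogr could not convert may still need attention after 
reading the CSV.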


In addition, there is an error under options("warn"=2L) in:

F1_EMAIL.DBF
Error in read.dbf(i) :
   (converted from warning) value |0| found in logical field

and possibly others which do not seem to relate to the field definition 
problem you identified.
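
To find which files fail in this way, something like the following could 
be used. It is a sketch only: the helper name read_dbf_strict is made up, 
and the toy files stand in for the real set of DBF files.

```r
library(foreign)

# Hypothetical helper: try to read each file with warnings promoted to
# errors, returning a data frame on success or the message on failure.
read_dbf_strict <- function(files) {
  old <- options(warn = 2L)
  on.exit(options(old), add = TRUE)
  out <- lapply(files, function(i) {
    tryCatch(read.dbf(i, as.is = TRUE),
             error = function(e) conditionMessage(e))
  })
  names(out) <- basename(files)
  out
}

# Toy check: one readable file written by write.dbf() and one missing file.
good <- tempfile(fileext = ".dbf")
write.dbf(data.frame(ID = 1:2, NAME = c("a", "b")), good)
res <- read_dbf_strict(c(good, file.path(tempdir(), "missing.dbf")))

# Names of the files that could not be read cleanly.
failed <- names(res)[vapply(res, is.character, logical(1))]
print(failed)
```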

Roger