
[R-pkg-devel] Using the connections interface to decode text

Hello R package developers,

Now that R_GetConnection(), R_new_custom_connection(),
R_ReadConnection(), R_WriteConnection() are marked as experimental, I'm
curious: is it a good idea to use the interface to decode text from a
user-provided connection? For example, this could be useful to stream
the data into a parser without loading it all into memory first.

R_ReadConnection() is like readBin(): it won't decode any text. On the
other hand, since R_new_custom_connection() is also part of the
interface, this implies that the user must know about struct Rconn and
what its functions do, including the UTF8out flag and how readLines()
uses it. (Without the readLines() trick, R will only attempt to decode
the data into the native encoding. With the readLines() trick, R will
only accept unopened connections and close them afterwards.)

The following example seems to work:

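// R_ExecWithCleanup(), R_CONNECTIONS_VERSION check omitted
#include <string.h>
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Connections.h>

SEXP readFromConn(SEXP sconn) {
  Rconnection conn = R_GetConnection(sconn);

  if (!conn->isopen) {
    // same trick as readLines(): request UTF-8 before opening in text mode
    conn->UTF8out = TRUE;
    strcpy(conn->mode, "rt");
    if (!conn->open(conn)) Rf_error("cannot open the connection");
  }

  for (;;) {
    int c = conn->fgetc(conn);
    if (c < 0 || c > 255) break; // R_EOF not declared
    Rprintf("%02x ", c);
  }

  Rprintf("\n");
  conn->close(conn);

  return R_NilValue;
}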

LC_ALL=en_GB.iso885915 luit R # non-UTF-8 locale

'\u5b98\u8a71' |> iconv('UTF-8', 'GBK') |> writeLines('gbk.txt')
.Call('readFromConn', file('gbk.txt', encoding = 'GBK'))
# e5 ae 98 e8 a9 b1 0a
'\u5b98\u8a71' |> charToRaw() # same UTF-8 as above
# [1] e5 ae 98 e8 a9 b1

Is it a good idea to adopt such an approach in a package? Would it be
better to read data as binary and decode it using Riconv?
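
For comparison, here is a rough sketch of what the binary-plus-iconv route might look like. It is only an illustration under several assumptions: it uses POSIX iconv() as a stand-in for Riconv() (which, as far as I can tell from R_ext/Riconv.h, takes the same arguments apart from a const-qualified input pointer), it takes the input from a memory buffer cut into chunks instead of calling R_ReadConnection(), and decode_chunks() is a made-up helper name. The point it tries to show is that a multibyte character can be split across two reads, so the undecoded tail has to be carried over to the next call.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decode `n` bytes of `in` (encoding `from`) to UTF-8,
 * feeding the converter at most `chunk` bytes at a time, as if every chunk
 * had come from a separate binary read of the connection.
 * Returns the number of bytes written to `out`, or (size_t)-1 on error. */
size_t decode_chunks(const char *from, const char *in, size_t n,
                     size_t chunk, char *out, size_t outsize)
{
    char buf[32];        /* carried-over tail followed by the new chunk */
    size_t pending = 0;  /* number of not yet decoded bytes in buf */
    char *outp = out;
    size_t outleft = outsize;
    int ok = 1;

    assert(chunk > 0 && chunk <= 16);

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) return (size_t)-1;

    while (ok && n > 0) {
        size_t take = n < chunk ? n : chunk;
        if (pending + take > sizeof buf) { ok = 0; break; }
        memcpy(buf + pending, in, take);
        pending += take; in += take; n -= take;

        char *inp = buf;
        size_t inleft = pending;
        size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
        /* EINVAL only means "incomplete sequence at the end of the input":
         * keep those bytes and retry once the next chunk has arrived.
         * Anything else (EILSEQ, E2BIG) is treated as a real failure. */
        if (rc == (size_t)-1 && errno != EINVAL) ok = 0;
        memmove(buf, inp, inleft);
        pending = inleft;
    }

    /* Bytes still pending at end of input mean a truncated character. */
    if (pending > 0) ok = 0;
    /* Flush any shift state (a no-op for UTF-8 output, needed in general). */
    if (ok && iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1) ok = 0;

    iconv_close(cd);
    return ok ? (size_t)(outp - out) : (size_t)-1;
}
```

Feeding it the two characters from the example above as UTF-16LE (chosen here only because I am certain of the byte values) in chunks of three bytes splits the second character across two calls, and the carry-over logic still yields the same UTF-8 bytes as the connection-based version. Compared to setting UTF8out, this keeps the package away from the internals of struct Rconn, at the cost of doing the buffering and the error handling by hand.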