[R-pkg-devel] Using the connections interface to decode text

3 messages · Ivan Krylov, Simon Urbanek

#
Hello R package developers,

Now that R_GetConnection(), R_new_custom_connection(),
R_ReadConnection(), R_WriteConnection() are marked as experimental, I'm
curious: is it a good idea to use the interface to decode text from a
user-provided connection? For example, this could be useful to stream
the data into a parser without loading it all into memory first.

R_ReadConnection() is like readBin(), it won't decode any text. On the
other hand, since R_new_custom_connection() is also part of the
interface, this implies that the user must know about struct Rconn and
what its functions do, including the UTF8out flag and how readLines()
uses it. (Without the readLines() trick, R will only attempt to decode
the data into the native encoding. With the readLines() trick, R will
only accept unopened connections and close them afterwards.)

The following example seems to work:

// R_ExecWithCleanup(), R_CONNECTIONS_VERSION check omitted
#include <string.h>
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Connections.h>

SEXP readFromConn(SEXP sconn) {
  Rconnection conn = R_GetConnection(sconn);

  if (!conn->isopen) {
    conn->UTF8out = TRUE; // ask for UTF-8 instead of the native encoding
    strcpy(conn->mode, "rt");
    if (!conn->open(conn)) Rf_error("cannot open the connection");
  }

  for (;;) {
    int c = conn->fgetc(conn);
    if (c < 0 || c > 255) break; // R_EOF not declared
    Rprintf("%02x ", c);
  }

  Rprintf("\n");
  conn->close(conn);

  return R_NilValue;
}

LC_ALL=en_GB.iso885915 luit R # non-UTF-8 locale

'\u5b98\u8a71' |> iconv('UTF-8', 'GBK') |> writeLines('gbk.txt')
.Call('readFromConn', file('gbk.txt', encoding = 'GBK'))
# e5 ae 98 e8 a9 b1 0a
'\u5b98\u8a71' |> charToRaw() # same UTF-8 as above
# [1] e5 ae 98 e8 a9 b1

Is it a good idea to adopt such an approach in a package? Would it be
better to read data as binary and decode it using Riconv?
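
For the second option (reading the data as binary and decoding it separately), the main complication is that a multibyte character can straddle two reads, so the undecoded tail has to be carried over to the next chunk. What follows is a minimal sketch under stated assumptions, not a tested package implementation: it uses POSIX iconv() instead of Riconv so that it compiles outside R, decode_chunked() is a hypothetical helper name, and the chunks are simulated from an in-memory buffer rather than read from a connection.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of R): decode `len` bytes of `src` from the
 * encoding `from` into UTF-8, feeding the converter at most `chunk` bytes at
 * a time, as if they came from successive binary reads.
 * Returns the number of bytes written to `dst`, or (size_t)-1 on error. */
size_t decode_chunked(const char *from, const char *src, size_t len,
                      size_t chunk, char *dst, size_t dstlen)
{
    assert(chunk > 0);
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) return (size_t)-1;

    char pending[64];      /* undecoded tail of the last chunk + next chunk */
    size_t npending = 0, pos = 0, outleft = dstlen;
    char *out = dst;
    int ok = 1;

    while (ok && pos < len) {
        /* Take the next chunk, limited by the room left in `pending`. */
        size_t n = len - pos < chunk ? len - pos : chunk;
        if (n > sizeof pending - npending) n = sizeof pending - npending;
        if (n == 0) { ok = 0; break; }   /* no progress possible */
        memcpy(pending + npending, src + pos, n);
        npending += n;
        pos += n;

        char *in = pending;
        size_t inleft = npending;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1 &&
            errno != EINVAL)
            ok = 0;   /* EILSEQ (invalid input) or E2BIG (dst too small) */

        /* EINVAL means the chunk ended inside a character: keep the
         * unconsumed bytes so the next chunk can complete them. */
        memmove(pending, in, inleft);
        npending = inleft;
    }
    if (npending != 0) ok = 0;           /* input ended mid-character */

    /* Flush any shift state (a no-op for UTF-8 output, needed in general). */
    if (ok && iconv(cd, NULL, NULL, &out, &outleft) == (size_t)-1) ok = 0;

    iconv_close(cd);
    return ok ? (size_t)(out - dst) : (size_t)-1;
}
```

In a package, the chunks would presumably come from R_ReadConnection() and the converter from Riconv_open(), Riconv() and Riconv_close() in R_ext/Riconv.h, which follow the iconv calling convention; that wiring is not shown here. A call such as decode_chunked("GBK", buf, n, 32, out, sizeof out) assumes the platform's iconv knows the GBK encoding.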
#
As the author of the custom connection API, I can say the answer is no: that was not the intention. The structure has to be exposed in order to implement the connection API for new connection types; there is no way around it, since the implementing code needs access to the internals. But it should not be used outside of that context, since it should be opaque to the *users* of the connections. So packages that do not implement new connections should not use the internal structures to access the internals of connections, because they are not intended to be part of the public API and may change. That's why this is strictly experimental: if we need to change it, only packages implementing new connections would have to adapt, but no one else should. Does that clarify?

Cheers,
Simon
#
On Fri, 20 Feb 2026 09:56:11 +1300
Simon Urbanek <simon.urbanek at R-project.org> wrote:

            
Duly noted, thank you for the explanation!