[R-pkg-devel] RFC: C backtraces for R CMD check via just-in-time debugging

9 messages · Rolf Turner, Kevin Ushey, Vladimir Dergachev +1 more

#
Hello,

This may be of interest to people who run lots of R CMD checks and have
to deal with resulting crashes in compiled code.

Every now and then, the CRAN checks surface a particularly nasty crash.
The R-level traceback stops in the compiled code. It's not obvious
where exactly the crash happens. Naturally, this never happened on the
maintainer's computer before and, in fact, is hard to reproduce.

Containers would help, but they cannot solve the problem completely.
Some problems only surface when there are more than 32 logical
processors, or during certain times of day. It may help to at least see
the location of the crash as it happens on the computer running the
check.

One way to provide that would be to run a special debugger that does
nothing most of the time, attaches to child threads and processes, and
produces backtraces when processes receive a crashing signal. There is
such a debugger for Windows [1], and there is now a proof of concept
for amd64 Linux [2]. 

I've just tried [2] on a 250-package reverse dependency check and saw a
lot of SIGSEGVs with rcx=00000000cafebabe or Java in the backtrace, but
other than that, it seems to work fine. Do you think it's worth
developing further?

The major downside of using a debugger like this is a noticeable change
in the environment: [v]fork(), clone() and exec() become slower,
attaching another tracer becomes impossible, SIGSEGVs may become much
slower (although I do hope that most software I rely upon doesn't care
about SIGSEGVs per second). On the other hand, these wrappers are as
transparent as they get and don't even need R -d to pass the arguments
to the child process.

The other way to provide C-level backtraces is a post-mortem debugger
(registered via the AeDebug registry key on Windows or
kernel.core_pattern sysctl on Linux). This avoids interference with the
process environment during normal execution, but requires more
integration work to collect the crash dumps, process them into usable
backtraces and associate them with the R CMD check runs. There are also
injectable DLLs like libbacktrace, but these have to interfere with the
process from the inside, which may be worse than ptrace() in terms of
observable environment changes. On glibc systems (but not musl, macOS,
Windows), R's SIGSEGV handler could be enhanced to call
backtrace_symbols_fd(), which should be safe (no malloc()) as long as
libgcc is preloaded.

Is adding C-level backtraces to R CMD checks worth the effort? Could it
be a good idea to add this on CRAN? If yes, how can I help?
#
On Sun, 3 Mar 2024 11:14:44 +0300
Ivan Krylov via R-package-devel <r-package-devel at r-project.org> wrote:

            
<SNIP>
Sounds like an excellent idea to me, but I am not really qualified to
judge.  Most of this stuff is way over my head.

cheers,

Rolf Turner
#
Would libSegFault be useful here?
https://lemire.me/blog/2023/05/01/under-linux-libsegfault-and-addr2line-are-underrated/
On Sun, Mar 3, 2024, 5:15 PM Rolf Turner <rolfturner at posteo.net> wrote:

            

  
  
#
On Sun, 3 Mar 2024 19:19:43 -0800
Kevin Ushey <kevinushey at gmail.com> wrote:

            
Glad to know it has been moved to
<https://github.com/zatrazz/glibc-tools/tree/main/libSegFault> and not
just removed altogether after the upstream commit
<https://sourceware.org/git/?p=glibc.git;a=commit;h=65ccd641bacea33be23d51da737c2de7543d0f5e>.

libSegFault is safer than, say, libsegfault [*] because it both
supports SA_ONSTACK (for when a SIGSEGV is caused by stack overflow)
and avoids functions like snprintf() (which depend on the locale code,
which may have been the source of the crash). The only correctness
problem that may still be unaddressed is potential memory allocations
in backtrace() when it loads libgcc on first use. That should be easy
to fix by calling backtrace() once in segfault_init(). Unfortunately,
libSegFault is limited to glibc systems, so a different solution will
be needed on Windows, macOS and Linux systems with the musl libc.

Google-owned "backward" [**] tries to do most of this right, but (1) is
designed to be compiled together with C++ programs, not injected into
unrelated processes and (2) will exit the process if it survives
raise(signum), which will interfere with both rJava (judging by the
number of Java-related SIGSEGVs I saw while running R CMD check) and R's
own stack overflow survival attempts.
1 day later
#
I use libunwind in my programs, works quite well, and simple to use.

Happy to share the code if there is interest.

best

Vladimir Dergachev
On Mon, 4 Mar 2024, Ivan Krylov via R-package-devel wrote:

            
1 day later
#
On Tue, 5 Mar 2024 18:26:28 -0500 (EST)
Vladimir Dergachev <volodya at mindspring.com> wrote:

            
Do you mean that you use libunwind in signal handlers? An example of
how to produce a backtrace without calling any async-signal-unsafe
functions would indeed be greatly useful.

Speaking of shared objects injected using LD_PRELOAD, I've experimented
some more, and I think that none of them would work with R without
additional adjustments. They install their signal handler very soon
after the process starts up, and later, when R initialises, it
installs its own signal handler, overwriting the previous one. For this
scheme to work, either R would have to cooperate, remembering a pointer
to the previous signal handler and calling it at some point (which
sounds unsafe), or the injected shared object would have to override
sigaction() and call R's signal handler from its own (which sounds
extremely unsafe).

Without that, if we want C-level backtraces, we either need to patch R
to produce them (using backtrace() and limiting this to glibc systems
or using libunwind and paying the dependency cost) or to use a debugger.
#
Hi Ivan,

Here is the piece of code I currently use:

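#define UNW_LOCAL_ONLY   /* only the current process is unwound */
#include <libunwind.h>
#include <stdio.h>

void backtrace_dump(void)
{
     unw_cursor_t    cursor;
     unw_context_t   context;

     unw_getcontext(&context);
     unw_init_local(&cursor, &context);

     /* Each successful unw_step() moves to the caller's frame */
     while (unw_step(&cursor) > 0)
     {
         unw_word_t  offset = 0, pc = 0;
         char        fname[64];

         unw_get_reg(&cursor, UNW_REG_IP, &pc);

         /* fname stays empty and offset stays 0 if the lookup fails */
         fname[0] = '\0';
         (void) unw_get_proc_name(&cursor, fname, sizeof(fname), &offset);

         fprintf(stderr, "0x%016lx : (%s+0x%lx)\n",
                 (unsigned long)(pc-(long)backtrace_dump), fname,
                 (unsigned long)offset);
     }
}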

To make it safe, one can simply replace fprintf() with a function that 
stores information into a buffer.

Several things to point out:

   * printing pc-(long)backtrace_dump works around address randomization, 
so that if you attach the debugger you can find the location again by 
using backtrace_dump+0xxxx (it does not have to be backtrace_dump, any 
symbol will do)

   * this works even if the symbols are stripped, in which case it finds an 
offset relative to the nearest available symbol - there are always some 
from the loader. Of course, in this case you should use the offsets and 
the debugger to find out what's wrong

   * you can call backtrace_dump() from anywhere, does not have to be a 
signal handler. I've taken to calling it when my programs detect some 
abnormal situation, so I can see the call chain.

   * this should work as a package, but I am not sure whether the offsets 
between package symbols and R symbols would be static or not. For R it 
might be a good idea to also print a table of offsets between some R 
symbol and all the loaded C packages R_init_RMVL(), at least initially.

   * R ought to know where packages are loaded, we might want to be clever 
and print out information on which package contains which function, or 
there might be identical R_init_RMVL() printouts.
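
The suggestion above of replacing fprintf() with a function that stores information into a buffer might look roughly like this. It is a sketch with hypothetical names (trace_buf, buf_put_frame(), buf_flush()), not code from the message. It formats the same "0x<pc> : (<name>+0x<offset>)" line using only plain stores and write(), avoiding stdio and the locale machinery.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

struct trace_buf {
    char   data[4096];
    size_t len;
};

static void buf_put_char(struct trace_buf *b, char c) {
    if (b->len < sizeof b->data)       /* silently truncate when full */
        b->data[b->len++] = c;
}

static void buf_put_str(struct trace_buf *b, const char *s) {
    while (*s) buf_put_char(b, *s++);
}

/* Appends v in hex, zero-padded to at least min_digits (1..16) digits. */
static void buf_put_hex(struct trace_buf *b, uint64_t v, int min_digits) {
    static const char digits[] = "0123456789abcdef";
    int started = 0;
    for (int i = 15; i >= 0; i--) {
        unsigned d = (unsigned) ((v >> (4 * i)) & 0xf);
        if (d != 0 || i < min_digits) started = 1;
        if (started) buf_put_char(b, digits[d]);
    }
}

/* Appends one line in the same layout as the fprintf() call above. */
void buf_put_frame(struct trace_buf *b, uint64_t pc,
                   const char *name, uint64_t offset) {
    buf_put_str(b, "0x");
    buf_put_hex(b, pc, 16);
    buf_put_str(b, " : (");
    buf_put_str(b, name);
    buf_put_str(b, "+0x");
    buf_put_hex(b, offset, 1);
    buf_put_str(b, ")\n");
}

/* Writes the buffer to fd and empties it; returns 0 on success. */
int buf_flush(struct trace_buf *b, int fd) {
    size_t done = 0;
    while (done < b->len) {
        ssize_t n = write(fd, b->data + done, b->len - done);
        if (n <= 0) return -1;
        done += (size_t) n;
    }
    b->len = 0;
    return 0;
}
```

For example, buf_put_frame(&b, 0xcafebabe, "do_work", 0x1c) appends "0x00000000cafebabe : (do_work+0x1c)" followed by a newline. The buffer can then be flushed from the handler with write(), or inspected later from ordinary code.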

best

Vladimir Dergachev
On Thu, 7 Mar 2024, Ivan Krylov wrote:

            
4 days later
#
Vladimir,

Thank you for the example and for sharing the ideas regarding
symbol-relative offsets!

On Thu, 7 Mar 2024 09:38:18 -0500 (EST)
Vladimir Dergachev <volodya at mindspring.com> wrote:

            
Is it ever possible for unw_get_reg() to fail (return non-zero) for
UNW_REG_IP? The documentation isn't clear about this. Then
again, if the process is so damaged it cannot even read the instruction
pointer from its own stack frame, any attempts at self-debugging must
be doomed.

Since package shared objects are mmap()ed into the address space and
(at least on Linux with ASLR enabled) mmap()s are supposed to be made
unpredictable, this offset ends up not being static. On Linux, R seems
to be normally built as a position-independent executable, so no matter
whether there is a libR.so, both the R base address and the package
shared object base address are randomised:

$ cat ex.c
#include <stddef.h>
#include <R.h>
void addr_diff(void) {
 ptrdiff_t diff = (char*)&addr_diff - (char*)&Rprintf;
 Rprintf("self - Rprintf = %td\n", diff);
}
$ R CMD SHLIB ex.c
$ R-dynamic -q -s -e 'dyn.load("ex.so"); .C("addr_diff");'
self - Rprintf = -9900928
$ R-dynamic -q -s -e 'dyn.load("ex.so"); .C("addr_diff");'
self - Rprintf = -15561600
$ R-static -q -s -e 'dyn.load("ex.so"); .C("addr_diff");'
self - Rprintf = 45537907472976
$ R-static -q -s -e 'dyn.load("ex.so"); .C("addr_diff");'
self - Rprintf = 46527711447632

That's true. Information on all registered symbols is available from
getLoadedDLLs().
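
Since the distance between a package symbol and an R symbol changes from run to run, one possible alternative is to report each address relative to the file it is mapped from, which does not depend on where the loader placed the object. The sketch below is Linux-only and reads /proc/self/maps; locate_address(), struct mapping and demo_symbol() are hypothetical names. It uses stdio, so it is meant for ordinary code (for example, printing a table when a package is loaded) and not for a signal handler.

```c
#include <stdint.h>
#include <stdio.h>

struct mapping {
    char          path[256];    /* file backing the mapping, "" if none */
    unsigned long file_offset;  /* offset of the address within that file */
};

/* Placeholder symbol to look up in the usage example. */
void demo_symbol(void) { }

/* Finds the mapping containing addr. Returns 1 if found, 0 otherwise. */
int locate_address(uintptr_t addr, struct mapping *out) {
    char line[512];
    int found = 0;
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) return 0;

    while (!found && fgets(line, sizeof line, maps)) {
        unsigned long start, end, offset;
        char perms[5];
        int fields;

        out->path[0] = '\0';
        /* Line layout: start-end perms offset dev inode [path] */
        fields = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255[^\n]",
                        &start, &end, perms, &offset, out->path);
        if (fields >= 4 && addr >= start && addr < end) {
            out->file_offset = (unsigned long) (addr - start) + offset;
            found = 1;
        }
    }
    fclose(maps);
    return found;
}

/* Prints "path+0xoffset" for addr, or the raw address if unmapped. */
void print_location(uintptr_t addr) {
    struct mapping m;
    if (locate_address(addr, &m))
        fprintf(stderr, "%s+0x%lx\n", m.path, m.file_offset);
    else
        fprintf(stderr, "0x%lx (no mapping found)\n", (unsigned long) addr);
}
```

Calling print_location((uintptr_t) &demo_symbol) prints the path of the object containing demo_symbol() and an offset that should stay the same across runs of the same binary. The reported path can be matched against the paths listed by getLoadedDLLs() to tell which package a frame belongs to.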
#
On Tue, 12 Mar 2024, Ivan Krylov wrote:

            
Not sure. I think it just returns what is in it, you will get a false 
reading if the stack is corrupted. The way that I see it - some printout 
is better than none, and having signs that stack is badly corrupted is a 
useful debugging clue.

Ok, so this is reasonably straightforward.

best

Vladimir Dergachev