
R on Solaris 10 x64

8 messages · Tai-Wei (David) Lin, Brian Ripley, Peter Dalgaard +3 more

#
Hi R Developers,

Greg is helping me with debugging R on Solaris 10 x64. Please let us
know if you have any thoughts or tips that can help us debug this.

Thanks,

David



************
Using default transfer plist
in vector_io: permuting
About to write

 *** caught segfault ***
address e8554000, cause 'memory not mapped'

Traceback:
 1: .External("do_hdf5save", call, sys.frame(sys.parent()), fileout,
 ..., PACKAGE = "hdf5")
 2: hdf5save(hdf5_Fstat, "Fstat", "geneNames", "genotype")
aborting ...
************

We've tried many things to debug it:

* dbx Runtime Checking (RTC) is not detecting any (meaningful) memory
access problems that I can see.

* The same on Solaris/SPARC.

* Neither does Valgrind on Linux.

* I've tried increasing the C stack size, assuming R could be running
out of stack size. Didn't help.

Running R under dbx (without RTC) until the crash shows this:

...
About to write
t@1 (l@1) signal SEGV (no mapping at the fault address) in _memcpy at 0xfe90444b
0xfe90444b: _memcpy+0x006b:     movaps   0x00000000(%esi),%xmm0
Current function is H5D_select_mgath
  379               HDmemcpy(tgath_buf,buf+off[curr_seq],curr_len);
(dbx) where
current thread: t@1
  [1] _memcpy(0x0, 0xfdebc707, 0x9f5c4f0), at 0xfe90444b
=>[2] H5D_select_mgath(_buf = 0x9f79580, space = 0x8966770, iter =
0x8045980, nelmts = 3120U, dxpl_cache = 0xfe170078, _tgath_buf =
0x9f5c4f0), line 379 in "H5Dselect.c"
  [3] H5D_contig_write(io_info = 0x804620c, nelmts = 3120ULL, mem_type =
0x97b05c8, mem_space = 0x8966770, file_space = 0x8966770, tpath =
0x8ee7078, src_id = 201326906, dst_id = 201326904, buf = 0x9f79580),
line 1418 in "H5Dio.c"
  [4] H5D_write(dataset = 0x8f169c0, mem_type_id = 201326906, mem_space
= 0x8966770, file_space = 0x8966770, dxpl_id = 671088643, buf =
0x9f79580), line 952 in "H5Dio.c"
  [5] H5Dwrite(dset_id = 335544330, mem_type_id = 201326906,
mem_space_id = 0, file_space_id = 0, plist_id = 671088643, buf =
0x9f79580), line 586 in "H5Dio.c"
  [6] vector_io(call = 0x97234ec, writeflag = 1, dataset = 335544330,
space = 268435472, obj = 0x98386a0), line 535 in "hdf5.c"
  [7] hdf5_write_vector(call = 0x97234ec, id = 67108867, symname =
0x9cf35d0 "geneNames", val = 0x98386a0), line 693 in "hdf5.c"
  [8] hdf5_save_object(call = 0x97234ec, fid = 67108867, symname =
0x9cf35d0 "geneNames", val = 0x98386a0), line 957 in "hdf5.c"
  [9] do_hdf5save(args = 0x9723284), line 1104 in "hdf5.c"
  [10] do_External(call = 0x86d62bc, op = 0x8371cd8, args = 0x972340c,
env = 0x9723594), line 832 in "dotcode.c"
  [11] Rf_eval(e = 0x86d62bc, rho = 0x9723594), line 445 in "eval.c"
  [12] Rf_evalList(el = 0x86d6230, rho = 0x9723594, op = 0x837226c),
line 1463 in "eval.c"
  [13] Rf_eval(e = 0x86d6214, rho = 0x9723594), line 438 in "eval.c"
  [14] do_begin(call = 0x86d56bc, op = 0x836709c, args = 0x86d61dc, rho
= 0x9723594), line 1107 in "eval.c"
  [15] Rf_eval(e = 0x86d56bc, rho = 0x9723594), line 431 in "eval.c"
  [16] Rf_applyClosure(call = 0x9723738, op = 0x83c0328, arglist =
0x97236e4, rho = 0x8379b1c, suppliedenv = 0x8379b38), line 614 in "eval.c"
  [17] Rf_eval(e = 0x9723738, rho = 0x8379b1c), line 455 in "eval.c"
  [18] Rf_ReplIteration(rho = 0x8379b1c, savestack = 0, browselevel = 0,
state = 0x8047328), line 256 in "main.c"
  [19] R_ReplConsole(rho = 0x8379b1c, savestack = 0, browselevel = 0),
line 305 in "main.c"
  [20] run_Rmainloop(), line 944 in "main.c"
  [21] Rf_mainloop(), line 951 in "main.c"
  [22] main(ac = 4, av = 0x80477ac), line 33 in "Rmain.c"
(dbx) p curr_len
curr_len = 24960U
(dbx) p curr_seq
curr_seq = 0
(dbx) p of
dbx: "of" is not defined in the scope
`libhdf5.so.0.0.0`H5Dselect.c`H5D_select_mgath:347`
dbx: see `help scope' for details
(dbx) p off
off = 0x8042960
(dbx) p tgath_buf
tgath_buf = 0x9f5c4f0
"\xd87\x83^H\xa8\xf3\x82^H0^X\x82^H^X\xd4\x81^H^P\x90\x81^H\xb8m\x80^H^H'\x80^H\x88^?^?^H\x908^?^H\xb0\xf7~^H\xd8\xad~^H\xf8\xb2~^H\xb8\x8e~^H\xe8]~^H\xe8\xcb\xed^HP\xe3}^Hh\xdd\xbb^H\x98\xc4}^H\xf0\xa0}^H\xa8r}^HH}\xc3^HpO|^HH^V|^H^X\xd8|^H\xc0\xb1|^H8=}^H\x90\xcd{^H^Pm{^H\xb8#{^Hx'{^H\x90\xf8x^HpKx^H^POx^H\xa8~w^H^H>w^H\xf0\xb2w^H\xc8^Ew^HX'x^H\xf8\xdbv^H"
(dbx) p buf
buf = 0x9f79580
"\xd87\x83^H\xa8\xf3\x82^H0^X\x82^H^X\xd4\x81^H^P\x90\x81^H\xb8m\x80^H^H'\x80^H\x88^?^?^H\x908^?^H\xb0\xf7~^H\xd8\xad~^H\xf8\xb2~^H\xb8\x8e~^H\xe8]~^H\xe8\xcb\xed^HP\xe3}^Hh\xdd\xbb^H\x98\xc4}^H\xf0\xa0}^H\xa8r}^HH}\xc3^HpO|^HH^V|^H^X\xd8|^H\xc0\xb1|^H8=}^H\x90\xcd{^H^Pm{^H\xb8#{^Hx'{^H\x90\xf8x^HpKx^H^POx^H\xa8~w^H^H>w^H\xf0\xb2w^H\xc8^Ew^HX'x^H\xf8\xdbv^H"
(dbx) p nseq
nseq = 1U
(dbx) p len
len = 0x804195c
(dbx) p len[0..2]
len[0..2] =
[0] = 24960U
[1] = 140025512U
[2] = 140013048U
(dbx)


The R code in question is:

...
        /* Loop, while sequences left to process */
        for(curr_seq=0; curr_seq<nseq; curr_seq++) {
            /* Get the number of bytes in sequence */
            curr_len=len[curr_seq];

            HDmemcpy(tgath_buf,buf+off[curr_seq],curr_len);

            /* Advance offset in gather buffer */
            tgath_buf+=curr_len;
        } /* end for */
...

where

./src/hdf5-1.6.5/src/H5private.h:
#define HDmemcpy(X,Y,Z)  memcpy((char*)(X),(const char*)(Y),Z)

Maybe the "curr_len = 24960U" value is too high. I have no way of
knowing what it should be in this case.
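
One rough way to sanity-check curr_len is to compare it with the element count in the same stack trace (nelmts = 3120) and to work out the last address the copy would read from buf. The sketch below does only that arithmetic. The helper names are hypothetical and not part of HDF5 or R, and the result does not by itself say what the element size should be for this dataset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for checking the numbers in the dbx session;
   they are not part of HDF5 or R. */

/* Bytes per element implied by a byte count and an element count.
   Returns 0 if nelmts is 0 or the division is not exact. */
size_t implied_element_size(size_t nbytes, size_t nelmts)
{
    if (nelmts == 0 || nbytes % nelmts != 0)
        return 0;
    return nbytes / nelmts;
}

/* Address of the last byte memcpy() would read from a source that
   starts at base and is len bytes long (len must be > 0). */
uintptr_t last_byte_read(uintptr_t base, size_t len)
{
    return base + (uintptr_t)len - 1;
}
```

With the values above, implied_element_size(24960, 3120) is 8, and last_byte_read(0x9f79580, 24960) is 0x9f7f6ff. If the memory buffer handed to H5Dwrite holds fewer than 8 bytes per element, the copy would run past its end.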

The crash could be caused by a compiler bug, although it's not very
likely. These crashes have occurred both with and without optimization,
with and without -g.
#
What did the maintainer of this unmentioned contributed package (hdf5) say
when you asked him?

[Hint: you *have* read the posting guide at

http://www.r-project.org/posting-guide.html

and done as it asks?]

There is no evidence here that this is anything to do with R itself.
On Fri, 13 Apr 2007, Tai-Wei (David) Lin wrote:

#
Prof Brian Ripley wrote:
Er, Brian, I think you are misfiring this time. As I understand it, they 
(David/Greg) are well aware that they are debugging an issue in a 
contributed package which occurs with an uncommon compiler on a maybe 
not so uncommon platform. They are just looking for hints on how to 
proceed to pinpoint the issue.

--------------

Two things caught my eye (apart from the fact that their "R code" is
clearly C): the dbx output doesn't show off[curr_seq], which could
actually be the culprit, and the _memcpy call on the stack looks odd:

_memcpy(0x0, 0xfdebc707, 0x9f5c4f0)


(What happened to the length, and how did NULL get in there?) If memcpy()
is a macro, I think I'd take a closer look at the include files and see
if something is getting expanded in an unintended way.
#
Peter,

Thanks for the hints.
(dbx) p off[curr_seq]
off[curr_seq] = 0
The dbx stack trace shows register values, which are not necessarily the
arguments to memcpy(), and apparently are not in this case.

The "guilty" line of code

HDmemcpy(tgath_buf,buf+off[curr_seq],curr_len);

translates into (due to macro "#define HDmemcpy(X,Y,Z) 
memcpy((char*)(X),(const char*)(Y),Z)")

memcpy((char*)(tgath_buf),(const char*)(buf+off[curr_seq]),curr_len);

and

(dbx) p tgath_buf
tgath_buf = 0x9f5c4f0
(definitely not NULL)

We've also asked for help at help at hdfgroup.org regarding the HDF5 
package.

This is under Solaris 10 x86, using the latest Sun Studio 
compiler/tools.

-Greg
#
[Peter Dalgaard]
Firing is always misfiring, at least in that it lacks elegance.  There 
are ways to speak without the heat, or otherwise, to merely stay silent.

Yet this particular aspect of things improved a lot, lately, on the 
various R lists.  Congratulations to all.
#
Greg Nakhimovsky wrote:
Hmm, and can you see "both ends" of tgath_buf and buf (in particular
tgath_buf[curr_len - 1] and buf[curr_len - 1])?

If only to "eliminate from our inquiries", I'd try running CPP over the 
code and see if the full macro expansion (including that from memcpy to 
_memcpy) is working as intended.
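
A lightweight way to look at the full expansion without leaving the compiler is to stringify the macro call, as in the sketch below. It reuses the HDmemcpy definition quoted from H5private.h. The STRINGIFY and EXPAND_AND_STRINGIFY macros and the hdmemcpy_expansion() function are hypothetical names introduced here. If the platform's <string.h> itself defines memcpy as a function-like macro, that expansion shows up in the string too, so the exact text is toolchain-dependent.

```c
#include <string.h>

/* Definition quoted from hdf5-1.6.5/src/H5private.h */
#define HDmemcpy(X,Y,Z)  memcpy((char*)(X),(const char*)(Y),Z)

/* Two-level stringification: the outer macro expands its argument
   fully before the inner one turns the result into a string literal. */
#define STRINGIFY(x) #x
#define EXPAND_AND_STRINGIFY(x) STRINGIFY(x)

/* Returns the text the preprocessor produces for the line in question.
   The identifiers are never evaluated, so they need not be declared. */
const char *hdmemcpy_expansion(void)
{
    return EXPAND_AND_STRINGIFY(HDmemcpy(tgath_buf,buf+off[curr_seq],curr_len));
}
```

Printing hdmemcpy_expansion() with puts() shows whether the three arguments survive the expansion in the expected order.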

I take it that you've already tried compiling the module at maximal 
warning level?
#
(dbx) p tgath_buf[curr_len - 1]
tgath_buf[curr_len-1] = '\0'
(dbx) p buf[curr_len-1]
dbx: cannot access address 0x9f7f6ff
(dbx) p tgath_buf[curr_len - 2]
tgath_buf[curr_len-2] = '\0'
(dbx) p buf[curr_len-2]
dbx: cannot access address 0x9f7f6fe
(dbx)

Something is definitely wrong with this memcpy() operation. I
suppose we'll need some help from the HDF5 folks to figure out
what the buf memory buffer is supposed to contain in this case.

After the macro expansion, this code looks like this:

         for(curr_seq=0; curr_seq<nseq; curr_seq++) {
             curr_len=len[curr_seq];
             memcpy((char *)(tgath_buf), (const char *)(buf + off[curr_seq]), curr_len);
             tgath_buf+=curr_len;
         }
No. We can try that, as well as lint.

Thanks.

-Greg
#
Hi David,
Tai-Wei (David) Lin wrote:
What's the initial size of tgath_buf? You need to make sure that you are not stepping
out of it, i.e. that the sum of len[i] over 0<=i<nseq is not greater than its initial size.
That's for the writing side. Same on the reading side: you need to make sure that
buf+off[i]+len[i]-1 is a safe place to be for every 0<=i<nseq.

Otherwise, expect bad things to happen. And they are generally not reproducible in a
consistent way. So even if this code never crashes on other systems, it doesn't mean that
it is not broken.
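
The two conditions above can be sketched as a checked version of the gather loop. This is only an illustration under assumptions: checked_gather() is a hypothetical helper, not an HDF5 function, and it presumes the caller knows the allocated sizes of both buffers (dst_size and src_size), which the loop in H5Dselect.c does not receive.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bounds-checked gather, not part of HDF5.
   Copies nseq sequences (src + off[i], len[i] bytes each) back to back
   into dst. Returns 0 on success, or -1 without copying anything if any
   sequence would read past src_size or the total would exceed dst_size. */
int checked_gather(unsigned char *dst, size_t dst_size,
                   const unsigned char *src, size_t src_size,
                   const size_t *off, const size_t *len, size_t nseq)
{
    size_t total = 0;
    size_t i;

    /* First pass: validate every sequence before touching memory. */
    for (i = 0; i < nseq; i++) {
        /* Reading side: off[i] + len[i] must not exceed src_size
           (written so the comparison itself cannot overflow). */
        if (off[i] > src_size || len[i] > src_size - off[i])
            return -1;
        /* Writing side: running total must stay within dst_size. */
        if (len[i] > dst_size - total)
            return -1;
        total += len[i];
    }

    /* Second pass: the same copy the original loop performs. */
    for (i = 0; i < nseq; i++) {
        memcpy(dst, src + off[i], len[i]);
        dst += len[i];
    }
    return 0;
}
```

Validating in a separate pass means a failing call leaves the destination untouched, which makes the failure reproducible rather than dependent on what happens to sit next to the buffers.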

Cheers,
H.