Skip to content
Prev 56747 / 63421 Next

R problems with lapack with gfortran

Dear Thomas,

thank you for your input. I've debugged one of the packages and I 
confirm that the breakage is related to passing of strings from C to 
Fortran. Indeed, BLAS and LAPACK define a large number of subroutines 
that take one or more explicit single-character strings as arguments. 
Other than that, BLAS has only one function (xerbla), which takes a 
string of unspecified length, LAPACK only has four (ilaenv, 
ilaenv2stage, lsamen, xerbla). The C interfaces to BLAS/LAPACK from 
Netlib depend on the historic behavior that explicit single-character 
strings are interoperable, concretely CBLAS and LAPACKE provide C 
interfaces/code that calls into Fortran BLAS/LAPACK without passing the 
1s as lengths of the character strings (functions taking a string of 
unspecified length are trivial and re-implemented in C). This has been 
working fine for very many years as the Fortran code never needed to 
access the length it knew was 1. R has been using the same practice, 
which long predates ISO_C_BINDING/BIND(C), and I've seen online 
discussions where people assumed interoperability of length 1 strings, 
once mentioning also a citation from Fortran 2003 Handbook that says "A 
Fortran character string with a length other than 1 is not 
interoperable" (which invites interpretation that length 1 strings were 
). I am not an expert to say whether the current Fortran standard 
requires that interoperability and I assume that it does not given this 
gfortran change.

This gfortran change breaks this interoperability: if a C function calls 
a Fortran function, passing it a single-character string for a parameter 
taking explicit single-character Fortran string, it may crash. I've 
debugged one case with R package BDgraph, this example 
"library(BDgraph); data.sim <- bdgraph.sim( n = 70, p = 5, size = 7, vis 
= TRUE )" crashes due to corruption of C stack by Fortran function 
DPOSV, when compiled with the new gfortran and with -O2. To see the 
problem, one can just look at the disassembly of DPOSV (LAPACK), neither 
the package nor R is not necessary:

SUBROUTINE DPOSV( UPLO, N, NRHS, A, LDA, B, LDB, INFO )
CHARACTER????????? UPLO

In one case, DPOSV calls DPOTRS before returning. The new gfortran with 
-O2 performs tail-call optimization, jumping to DPOTRS. In the annotated 
disassembly snippet, at 11747f1, DPOSV tries to ensure that there is 
constant 1 as string length of UPLO when tail-calling into DPOTRS, so it 
writes it to stack where there already should have been 1 as length of 
UPLO passed to DPOSV. But this argument has not been passed to DPOSV, so 
this causes stack corruption.

 ?1174ce:?????? 0f 85 62 ff ff ff?????? jne??? 117436 <dposv_+0xb6> <== jump if ERROR

 ???????? CALL DPOTRS( UPLO, N, NRHS, A, LDA, B, LDB, INFO )

 ? 1174d4:?????? 48 8b 04 24???????????? mov??? (%rsp),%rax <======= rax holds LDB

 ? 1174d8:?????? 4c 89 7c 24 68????????? mov??? %r15,0x68(%rsp) <=== save INFO to output param

 ? 1174dd:?????? 49 89 d8??????????????? mov??? %rbx,%r8 <========== pass LDA as LDA

 ? 1174e0:?????? 4c 89 e1??????????????? mov??? %r12,%rcx <========= pass A as A

 ? 1174e3:?????? 4c 8b 4c 24 08????????? mov??? 0x8(%rsp),%r9 <===== pass B as B

 ? 1174e8:?????? 4c 89 ea??????????????? mov??? %r13,%rdx <========= pass NRHS as NRHS

 ? 1174eb:?????? 48 89 ee??????????????? mov??? %rbp,%rsi <========= pass N as N

 ? 1174ee:?????? 4c 89 f7??????????????? mov??? %r14,%rdi <========= pass UPLO as UPLO

 ? 1174f1:?????? 48 c7 44 24 70 01 00??? movq?? $0x1,0x70(%rsp) <=== pass 1 hidden arg on stack CORRUPTS C STACK

 ? 1174f8:?????? 00 00

 ? 1174fa:?????? 48 89 44 24 60????????? mov??? %rax,0x60(%rsp) <=== pass LDB as LDB (stack)

 ????? END

 ? 1174ff:?????? 48 83 c4 28???????????? add??? $0x28,%rsp <== remove 5 vars from stack (sframe)

 ? 117503:?????? 5b????????????????????? pop??? %rbx

 ? 117504:?????? 5d????????????????????? pop??? %rbp

 ? 117505:?????? 41 5c?????????????????? pop??? %r12

 ? 117507:?????? 41 5d?????????????????? pop??? %r13

 ? 117509:?????? 41 5e?????????????????? pop??? %r14

 ? 11750b:?????? 41 5f?????????????????? pop??? %r15 <=== restore register to level before call

 ???????? CALL DPOTRS( UPLO, N, NRHS, A, LDA, B, LDB, INFO )

 ? 11750d:?????? e9 de 56 ef ff????????? jmpq?? cbf0 <dpotrs_ at plt> <=== tail call to dpotrs

Note that DPOSV never uses the length of the string (UPLO) from the 
hidden argument, the compiler clearly knows that its length is 1. In 
calls where the length is passed in registers, this does not cause 
trouble (like LSAME) and indeed is needed as the registers have 
different values

 ????? IF( .NOT.LSAME( UPLO, 'U' ) .AND. .NOT.LSAME( UPLO, 'L' ) ) THEN

 ? 117448:?????? b9 01 00 00 00????????? mov??? $0x1,%ecx

 ? 11744d:?????? ba 01 00 00 00????????? mov??? $0x1,%edx

 ? 117452:?????? 48 8d 35 bb 12 09 00??? lea??? 0x912bb(%rip),%rsi??????? # 1a8714 <ipivot.4261+0xd14>

 ? 117459:?????? 4c 89 f7??????????????? mov??? %r14,%rdi

 ? 11745c:?????? e8 1f 3d ef ff????????? callq? b180 <lsame_ at plt>

but it seems to me that the compiler could just refrain from setting the 
length to be 1 on the stack at 1174f1, since it knows it should have 
already been there. It would be a nice property if Fortran code that 
never accesses the hidden arguments with the lengths of the strings, 
because it knows what those lengths are, would also never write to those 
hidden arguments on the stack when it knows what they are (should be).

Before the gfortran change, DPOSV would call to DPOTRS normally (no 
tail-call optimization), so this problem would not occur (I tested with 
268974). By disabling tail call optimization via 
-fno-optimize-sibling-calls, the problem goes away also for other 
packages my colleagues have identified as crashing with the new 
gfortran. Did you know of any other optimization that could break this 
interoperability of 1-length strings? It would be really nice to users 
if this interoperability could be preserved, and if not by default than 
at least with some option.

Traditionally, BLAS/LAPACK implementations are interchangeable at 
dynamic linking time, using the Fortran interface that is however also 
used from C, without passing lengths for fixed 1-character strings. R 
supports this too, at least on some Linux distributions including 
Debian/Ubuntu it is packaged so that it runs with the BLAS/LAPACK 
implementation installed on the system. Even though this is probably not 
correct wrt to the todays Fortran standard (I don't know for sure), this 
is the common practice, and fixing this would not be easy - one would 
have to create a new interface to be used from C, separate from the 
Fortran one, and all software would have to start using that interface 
from C. In the current situation when the Fortran interface is used, 
confusion will arise with this gfortran change as different BLAS/LAPACK 
implementations are built by different Fortran compilers and use a 
different mix of Fortran/C for different computational subroutines. Note 
CBLAS could not be readily used as it itself breaks with the current 
gfortran change as well.

The same interoperability considerations apply to R packages, which 
include native code that calls from C or from Fortran into the (same) 
Fortran interface of BLAS/LAPACK. There would have to be a commonly 
accepted C interface instead by the BLAS/LAPACK implementations, and all 
of these packages would have to be modified to use that interface. If we 
created such a C interface just inside R and asked all package 
maintainers to update their packages, we would still have the problem 
with substitution of external BLAS(/LAPACK) implementations at dynamic 
linking time.

Indeed, it would be very hard to identify these problems by testing, 
because at least now the crashes are quite rate (for the tail-call 
optimization, a number of conditions have to be met to cause memory 
corruption, first the tail optimization has to happen, then the number 
of arguments has to be so large (on x86) that the lengths are passed on 
stack and not in registers, we have to be lucky for the memory 
corruption to map to a crash, etc).

So, any help we could get from you would be highly appreciated, be it 
just a compile option to keep the old behavior or an assurance that we 
are fine if we just disable the tail-call optimization. Appreciated by 
us but I believe also many others who use or develop BLAS/LAPACK, but 
may not have yet run into the problem, as they may not have been 
regularly testing bleeding-edge versions of compilers or may not have 
such a large code base to test as we have on CRAN.

Thanks
Tomas
On 4/24/19 11:32 PM, Thomas K?nig wrote: