Bug 237

Summary: NPTL bug configure check fails, but such usage of NPTL is actually incorrect.
Product: LinuxSampler Reporter: Hector Martin <marcan>
Component: otherAssignee: Christian Schoenebeck <cuse>
Status: CLOSED WONTFIX    
Severity: major    
Priority: P5    
Version: unspecified   
Hardware: PC   
OS: Linux   

Description Hector Martin 2015-02-23 17:47:07 CET
So I run Gentoo, and it seems I hit revival of this old bug; the NPTL bug check fails on my system.
https://bugs.gentoo.org/show_bug.cgi?id=194076

However, comment 15 in that thread seems to have been overlooked. As Evgeniy mentioned, the NPTL bug check is actually not legal code.

The manpage for pthread_setcanceltype states:

   Asynchronous cancelability
       Setting the cancelability type to PTHREAD_CANCEL_ASYNCHRONOUS is rarely
       useful. Since the thread could be canceled at any time, it cannot safely
       reserve resources (e.g. allocating memory with malloc(3)), acquire mutexes,
       [... snip ...]

       Functions  that  can  be safely asynchronously canceled are called
       async-cancel-safe functions.  POSIX.1-2001 requires only that
       pthread_cancel(3), pthread_setcancelstate(), and pthread_setcanceltype()
       be async-cancel-safe.  In general, other library functions can't be safely
       called from an asynchronously cancelable thread.

Therefore, calling pthread_mutex_lock from an asyncronously cancellable thread is not actually legal, so this isn't an NPTL bug, but actually a LinuxSampler bug.

Building with --enable-pthread-testcancel --disable-nptl-bug-check seems to make everything work correctly.
Comment 1 Christian Schoenebeck 2015-02-23 18:52:07 CET
That's not true. PTHREAD_CANCEL_ASYNCHRONOUS means that you are able to cancel ("kill") the created thread at any circumstances, especially it allows you to cancel the thread in question if it's i.e. currently suspended due to a mutex call. And the latter fact is the reason why some versions of the POSIX documentation state that PTHREAD_CANCEL_ASYNCHRONOUS leads to unsafe resource management. Which can be true after the point of time when the thread was canceled, but it does not result in unsafe resource management during all other times (before the thread was canceled).

In certain software applications this is indeed a severe issue, for example if you are implementing a DBMS software this might lead to undefined behavior regarding synchronized table resources (after the respective thread was cancelled). However in the use case of the sampler, usage of PTHREAD_CANCEL_ASYNCHRONOUS is neither a problem nor a bug, because the killed threads of the sampler don't leave any resources in unsafe state when they quit their execution. The resources those threads use are allocated and freed by a different thread.
Comment 2 Hector Martin 2015-02-23 19:10:21 CET
The documentation plainly states that an asynchronously cancellable thread cannot safely acquire mutexes, and that pthread_mutex_lock isn't among the functions that can be safely called from a thread using PTHREAD_CANCEL_ASYNCHRONOUS.

Here's a backtrace of the crash:

0x00007ffff6e0b3e8 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/libgcc_s.so.1
(gdb) bt
#0  0x00007ffff6e0b3e8 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/libgcc_s.so.1
#1  0x00007ffff6e0be4b in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/libgcc_s.so.1
#2  0x00007ffff6e0c180 in _Unwind_ForcedUnwind () from /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.2/libgcc_s.so.1
#3  0x00007ffff7bcc9c3 in __pthread_unwind () from /lib64/libpthread.so.0
#4  0x00007ffff7bc3c79 in sigcancel_handler () from /lib64/libpthread.so.0
#5  <signal handler called>
#6  0x00007ffff7bccbdc in __lll_lock_wait () from /lib64/libpthread.so.0
#7  0x00007ffff7bc776b in pthread_mutex_lock () from /lib64/libpthread.so.0
#8  0x7ab4c7200393146d in ?? ()
#9  0x0000000000000000 in ?? ()

What seems to be happening is that the sigcancel handler tries to unwind the stack, but pthread_mutex_lock isn't leaving the stack in an unwindable state while running. And it doesn't have to, since it's not async-cancel-safe.

The POSIX spec (2013) states:

3.24 Async-Cancel-Safe Function

A function that may be safely invoked by an application while the asynchronous form of cancellation is enabled. No function is async-cancel-safe unless explicitly described as such.

And pthread_mutex_lock is not such a function, therefore it is not safe to call while asynchronous cancellation is enabled. That's pretty clear cut.

If you think this is a bug, please point out where in the POSIX or glibc documentation it is stated that calling pthread_mutex_lock in an asynchronously cancellable thread is allowed under specific circumstances.
Comment 3 Christian Schoenebeck 2015-02-23 19:58:46 CET
If you are arguing based on the quoted POSIX specification, then yes, the behavior implied by LinuxSampler is not guaranteed by the POSIX specification. But personally I am rather interested about what is actually implemented on most systems. And as far as I can see it most NPTL implementations allow a safe combination of PTHREAD_CANCEL_ASYNCHRONOUS, pthread_mutex_lock() and pthread_cancel(). IRC Gentoo is the only Linux distribution where the discussed behavior was an issue so far and that's why this configure check actually exists. All other systems pass related test cases without any issue.

If you search for test cases against PTL implementations, your find the following one for example, which even goes further than our test case requirement:

http://fossies.org/linux/ltp/testcases/open_posix_testsuite/conformance/interfaces/pthread_setcanceltype/1-1.c

It not only implies that the locked mutex can be canceled in asynchronous cancellation mode, but it also expects the registered cancellation handler to be executed successfully.
Comment 4 Hector Martin 2016-02-03 13:07:06 CET
For what it's worth, the Gentoo developers agree that this is indeed broken code in LinuxSampler (that just happens to not be caught on most distros, but the additional hardening checks in Gentoo Hardened catch it).

https://bugs.gentoo.org/show_bug.cgi?id=537516#c4

I'm sure they'll be providing a distro patch to fix the issue, but it probably makes more sense to fix this for everyone instead of just for Gentoo Hardened. Even if things happen to work on other systems now, they are liable to break as compilers and libraries are updated. Code that contains undefined behavior tends to break further down the line even if it appears to work initially.

Of course, your code, your rules, so I'm not going to reopen the bug; feel free to reopen if you think it's worth looking at again.
Comment 5 Hector Martin 2016-02-03 13:09:21 CET
Seems something is broken about bugzilla email notifications (it complained about not being able to send e-mail). That being the case, I will reopen the bug just so it shows up on the open bugs list, since otherwise I assume nobody is going to notice my previous comment. Feel free to close again if you still think this is not an issue.
Comment 6 Christian Schoenebeck 2016-07-30 12:41:22 CEST
Email delivery of this Bugzilla installation was indeed broken for a while. There was a server upgrade in between and we struggled with several other issues related to that. But all services (including email delivery) work now again as expected.

Back to this issue: correct me if I am wrong, but as far as I can see it there is no easy alternative to resolve this overall issue once and for all. I mean an "easy" alternative would be a fix which would only require implementation changes to our most fundamental OS dependent C++ classes like "Thread", "Mutex", "Condition", etc. As far as I can see it however the current POSIX API does not provide any alternative to achieve that.

The only way to resolve this overall issue "the right way" would require not only changes to those fundamental OS dependent classes, but also to the entire LinuxSampler code base (i.e. engine implementations). And even worse: it would require more code all over the place (i.e. in the engine implementations, drivers,...) to get back to the same current behavior and would make the entire code base more error prone to common programming mistakes which could really end up in very serious misbehaviors like dead locks which then would affect all users.

Having said that, and since you said this NTPL issue had been addressed by Gentoo already, I fear I must be pragmatic on this one and decide that it would not be worth to change the entire LinuxSampler code base just to address a sole theoretical problem, and probably ending up with numerous new issues due to the overall code base changes all over the place. At least not until any other user of any system encounters this NTPL issue on his OS. And in all these years this issue was really just encountered on Gentoo, not on any other even exotic distro or commercial POSIX system.

However in case you know an "easy" alternative, just let me know!
Comment 7 Christian Schoenebeck 2019-02-27 15:37:36 CET
I am closing this report now due to the reasons described in detail in the previous comment.

If you still think this is an issue that should be resolved on our side, then feel free to reopen this report. Thanks!