TLS and shared library initialization

For my work on Rig, the ability to make variables thread-local is of critical importance.
Both Pthreads in the POSIX world and Windows offer ways to allocate a thread-local key, through which you can get and set a pointer-sized value.
Using this you can easily make any piece of data thread-local: you just allocate it on the heap and store its address under the thread-local key. This is usually called thread-local data or run-time TLS (thread-local storage).
But several compilers and operating systems support extensions to the C language that let you declare a variable as thread-local, with some restrictions, and access it as you usually would, without any specific library functions or API. This is what people normally refer to as TLS (or load-time TLS, to be more precise).
If TLS is available, it is clearly the preferred alternative: it's much easier to use, doesn't need any special kind of initialization, and is usually faster (due to possible compiler optimizations) or at the very least as fast as thread-local data.
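To make the difference concrete, here is a minimal sketch of both approaches on a Pthreads system with a compiler that accepts __thread. The function and variable names are made up for illustration, and error handling is reduced to the bare minimum.

```c
#include <pthread.h>
#include <stdlib.h>

/* Run-time TLS: one key, created once, maps to a per-thread pointer. */
static pthread_key_t counter_key;
static pthread_once_t counter_key_once = PTHREAD_ONCE_INIT;

static void counter_key_create(void)
{
    /* free() is called on a thread's value when that thread exits.
     * The return value is ignored here to keep the sketch short. */
    (void)pthread_key_create(&counter_key, free);
}

/* Returns the calling thread's counter, allocating it on first use.
 * Returns NULL if allocation or registration fails. */
int *runtime_tls_counter(void)
{
    int *counter;

    pthread_once(&counter_key_once, counter_key_create);
    counter = pthread_getspecific(counter_key);
    if (counter == NULL) {
        counter = calloc(1, sizeof *counter);
        if (counter != NULL && pthread_setspecific(counter_key, counter) != 0) {
            free(counter);
            counter = NULL;
        }
    }
    return counter;
}

/* Load-time TLS: the same per-thread counter as an ordinary-looking variable. */
static __thread int loadtime_counter;

int *loadtime_tls_counter(void)
{
    return &loadtime_counter;
}
```

The run-time variant needs a key, a one-time setup step and a heap allocation per thread; the load-time variant is a single declaration, which is why it is the preferred option wherever it works.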
The following table lists which OSes support TLS, with which compilers (and from which minimum version), and exactly which keyword is needed.

OS | load-time TLS | supported compilers | run-time TLS
Windows | __declspec(thread) | MSVC 2005, ICC 9.0 | Tls{Alloc,Free}
Linux | __thread | GCC 4.1.X, Clang 2.8, ICC 9.0, Solaris Studio 12 | pthread_key_{create,delete}
FreeBSD | __thread | GCC 4.1.X, Clang 2.8 | pthread_key_{create,delete}
MacOS X | None | None (unsupported) | pthread_key_{create,delete}
Solaris | __thread | GCC 3.4.X, Solaris Studio 12 | pthread_key_{create,delete}
OpenBSD | None | None (unsupported) | pthread_key_{create,delete}
NetBSD | None | None (segfaults) | pthread_key_{create,delete}
AIX | __thread | IBM XL C 11.X | pthread_key_{create,delete}

GCC, ICC, Solaris Studio and Clang all support __thread.
IBM's XL C compiler on AIX supports __thread with the -qtls option.
On Windows, VC++ and ICC both support __declspec(thread).
For this to work correctly, support is needed in the compiler, the linker and the C library/thread library.
Here is a snippet of code: a few C defines that try to safely tell if TLS is available.
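A minimal sketch of what such defines could look like, built only from the table above. The macro names RIG_TLS and RIG_HAS_TLS and the helper rig_tls_available() are hypothetical, the version thresholds are copied from the table rather than re-verified, and Solaris Studio and IBM XL C are left out for brevity.

```c
/* Hypothetical macros: RIG_TLS expands to the TLS keyword, if any. */
#if defined(__APPLE__) || defined(__OpenBSD__) || defined(__NetBSD__)
   /* The table lists no usable load-time TLS here, whatever the compiler. */
#elif defined(_MSC_VER) && (_MSC_VER >= 1400)              /* MSVC 2005 */
#  define RIG_TLS __declspec(thread)
#elif defined(__INTEL_COMPILER) && (__INTEL_COMPILER >= 900) /* ICC 9.0 */
#  if defined(_WIN32)
#    define RIG_TLS __declspec(thread)
#  else
#    define RIG_TLS __thread
#  endif
#elif defined(__clang__)        /* checked before GCC: Clang defines __GNUC__ */
#  if (__clang_major__ > 2) || (__clang_major__ == 2 && __clang_minor__ >= 8)
#    define RIG_TLS __thread
#  endif
#elif defined(__GNUC__)
   /* Conservative: 4.1 everywhere, although the table lists 3.4 on Solaris. */
#  if (__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)
#    define RIG_TLS __thread
#  endif
#endif

#if defined(RIG_TLS)
#  define RIG_HAS_TLS 1
static RIG_TLS int rig_tls_probe;   /* declared only where TLS is available */
#else
#  define RIG_HAS_TLS 0
#endif

/* Lets ordinary code ask which path was selected at compile time. */
int rig_tls_available(void)
{
#if RIG_HAS_TLS
    rig_tls_probe = 1;              /* touch the variable so it is really used */
#endif
    return RIG_HAS_TLS;
}
```

Code that needs a thread-local variable would then use RIG_TLS when RIG_HAS_TLS is 1 and fall back to the key-based functions from the last column of the table otherwise.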

Another question that came up right alongside the availability of TLS was how to have code executed when a shared library is loaded (at load-time before main() is entered, or at run-time when dlopen() or LoadLibrary() is called) and unloaded (when returning from main() or calling exit(), or at run-time when calling dlclose() or FreeLibrary()). That way initialization and destruction code can be run safely, both for miscellaneous purposes, such as more involved initialization of global data, and to initialize thread-local keys using the OS/thread library functions in case TLS isn't available.
At first I used custom synchronization code, based on atomic operations, to do this implicitly: every call to something that required initialization checked a shared variable indicating whether it was already done. But this is error-prone, hurts performance, and leaves the question of cleanup at unload open. Another way would be to explicitly require the user to call an initialization routine before using any library functionality, and a destruction routine when finished, but that approach is tedious and error-prone too. So I figured there must be a better, standard way to do this; it seemed like such a useful and common requirement that it would have been strange if there were no usable solution out there...
The dlopen(3) man page got me started, explaining that recent GCC versions support the two function attributes "constructor" and "destructor" to define initialization and destruction functions, which replace the old approach of having functions named _init and _fini. Using function attributes also lets you define multiple initialization and destruction functions. Furthermore, with GCC, you can specify a priority to control the order of execution. Clang does not support this, and I couldn't find anything indicating that Solaris Studio or ICC support it either, so it's probably better anyway not to depend on the calling order of different constructor/destructor functions, and to keep them independent of each other.
Windows provides a similar mechanism using DllMain(), in which you can put code you want called at various relevant events.
I also checked whether cleanup functions registered with atexit() would be called together with the destructors. While this is non-standard, it can be useful and is supported by a few major C libraries.
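As a minimal sketch of the attribute-based approach, assuming a compiler that accepts the GCC __attribute__ syntax: the rig_* names are hypothetical, and a real library would do its key creation or global setup where the flag is set.

```c
/* Hypothetical library state, set and cleared by the two hooks below. */
static int rig_initialized;

/* Runs when the library is loaded: before main(), or during dlopen(). */
__attribute__((constructor))
static void rig_library_init(void)
{
    rig_initialized = 1;
}

/* Runs when the library is unloaded: at process exit, or during dlclose(). */
__attribute__((destructor))
static void rig_library_fini(void)
{
    rig_initialized = 0;
}

/* Lets callers observe that the constructor has already run. */
int rig_is_initialized(void)
{
    return rig_initialized;
}
```

No priorities are used, in line with the advice above not to rely on the relative order of several constructors or destructors.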
The next table summarizes my findings on all this in an easily readable format.

OS | initialization (load & run-time) | destruction (load & run-time) | atexit() on process exit | atexit() on library unload
Windows | DllMain | DllMain | Yes | Yes
Linux | function __attribute__((constructor)) | function __attribute__((destructor)) | Yes | Yes (since glibc 2.2.3)
FreeBSD | function __attribute__((constructor)) | function __attribute__((destructor)) | Yes | No
MacOS X | function __attribute__((constructor)) | function __attribute__((destructor)) | Yes | No
Solaris | function __attribute__((constructor)) | function __attribute__((destructor)) | Yes | Yes (since Solaris 8)
OpenBSD | function __attribute__((constructor)) | function __attribute__((destructor)) | Yes | No
NetBSD | function __attribute__((constructor)) | function __attribute__((destructor)) | Yes | No

GCC, Clang and ICC directly support the __attribute__ syntax. Solaris Studio 12 does too (and seems to translate it to the corresponding #pragma it supports); older versions of Sun Studio may only support #pragma init() / #pragma fini(), though.
This requires support from both the compiler and the linker, which was the case on all tested platforms.

Posted by Luca Longinotti on 24 Feb 2011 at 18:00
Categories: C99, Programming

Interview with Linus Torvalds

A great interview with Linus Torvalds by ITWire that I wanted to make sure to share.
Especially the non-IT-related questions give some very interesting insights into the man behind Linux.

Posted by Luca Longinotti on 10 Feb 2011 at 01:16
Categories: CompSci

Am I main?, a tale of TIDs

During my work on Rig, which will also include thread abstraction, I stumbled upon the problem of getting some kind of ID to identify a running thread. I wanted to be able to do something akin to getpid() (or GetCurrentProcessId() on Windows), but for threads, not processes. Solving this on Windows was easy; the Unix world is another story.
The Pthreads API doesn't offer this functionality. The closest is pthread_self(), which returns the opaque type pthread_t, which can't (safely) be used directly to differentiate between threads. This means that to solve the problem I needed to enter the world of non-portable, OS-specific functionality: one of the reasons I wanted to use VMs in my previous post was in fact to try this out.
After reading a lot of documentation and trying out a few things, it became clear that each OS had a different way of getting this information.
Coincidentally, the next day someone on Stack Overflow asked an interesting question that turned out to be related: "How to determine if the current thread is the main one?", which I set out to answer. My answer already contains a good explanation of how to approach that problem, so I won't reiterate it here, and simply offer a helpful reference of my overall findings.

OS | Thread ID | Is thread main?
Windows | tid = GetCurrentThreadId(); | ???
Linux | tid = syscall(SYS_gettid); | tid == getpid()
FreeBSD | long lwpid; tid = lwpid; | pthread_main_np() != 0
MacOS X | tid = pthread_mach_thread_np(pthread_self()); | pthread_main_np() != 0
Solaris | tid = pthread_self(); | tid == 1
OpenBSD | Not available. | pthread_main_np() != 0
NetBSD | tid = _lwp_self(); | tid == 1
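As an illustration, the Linux row of the table could be wrapped like this. It is a Linux-only sketch, and the function names current_tid() and is_main_thread() are made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* makes sure syscall() is declared */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kernel thread ID of the calling thread (Linux only). */
pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* On Linux the main thread's TID equals the process ID. */
int is_main_thread(void)
{
    return current_tid() == getpid();
}
```

Called from main() this reports the main thread; called from a thread started with pthread_create() the TID differs from getpid(), so is_main_thread() returns 0.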

Posted by Luca Longinotti on 09 Feb 2011 at 17:00
Categories: C99, Programming

KVM, slow IO and strange options

In my quest for portability, I wanted to test a few things on several operating systems, mostly BSDs and Oracle (formerly Sun) Solaris.
Seeing as virtualization is the current hype, I decided to give Linux KVM a try, as it promised to be the more open solution while requiring less effort to set up. In my case, for a few dev VMs to try stuff on, that is kinda important: I don't want to spend hours maintaining this setup, but I also don't expect stellar performance for running heavy workloads on it.
Gentoo makes the installation quite easy: all you need is to enable KVM in your kernel and emerge app-emulation/qemu-kvm.

  • Clearly the kernel needs to have KVM support enabled for your CPU, but I have all the VirtIO stuff disabled. I don't need it: I tried VirtIO-blk to speed up IO performance but didn't notice any difference. It probably doesn't do much when you only have 1-2, at most 3, VMs running at any time, with not that much going on in them, for development.
  • qemu-kvm: be careful with the USE flags and the QEMU_*_TARGETS!

package.use entries:

media-libs/libsdl X audio video opengl xv
app-emulation/qemu-kvm aio sdl
# remember "alsa" if you use it, for both packages!

make.conf entries:

QEMU_SOFTMMU_TARGETS="arm i386 ppc ppc64 sparc sparc64 x86_64"

'aio' is important for native AsyncIO support and 'sdl' to get a window with your VM in it (unless you always want to use VNC to connect). Most people can also probably reduce QEMU_SOFTMMU_TARGETS to "i386 x86_64", but I wanted to keep the option to emulate some alternative architectures.
Once that was all done, KVM worked perfectly, and I started installing a Xubuntu image just to test it, but noticed that IO was incredibly slow, so I set out to improve its performance. I ended up with the following two Bash functions, which install VMs from ISOs and start them with somewhat usable performance. The options are explained below.

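# KVM support
# Set macaddr to a valid, random MAC address before use.

# kvm-start IMAGE: boot an existing disk image.
kvm-start() {
    /usr/bin/kvm -net nic,macaddr=random -net user -cpu host -smp 4 -m 768 -usb \
        -usbdevice tablet -vga cirrus -drive file="$1",cache=writeback,aio=native
}

# kvm-install IMAGE ISO: create a 6G raw image and boot the installer ISO.
kvm-install() {
    /usr/bin/qemu-img create -f raw "$1" 6G
    /usr/bin/kvm -net nic,macaddr=random -net user -cpu host -smp 4 -m 768 -usb \
        -usbdevice tablet -vga cirrus -drive file="$1",cache=writeback,aio=native \
        -cdrom "$2" -boot d
}
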
  • -drive's cache=writeback,aio=native are crucial for storage performance. While aio helped just a little, changing the cache mode to writeback massively improved IO performance! Also, raw disk images do perform better than qcow2!
  • -cpu host -smp 4 -m 768: -cpu host passes along all available CPU features, -smp 4 gives the guest four virtual CPUs, and raising memory from the default of 128 MB helps too.
  • -usb -usbdevice tablet was needed to fix the broken mouse (it just didn't react at all in my case!). It also makes it possible to move the mouse out of the VM's window and back without always having to press CTRL+ALT, but this also kinda depends on the OS you're emulating.
  • -vga cirrus enables support for resolutions up to 1024x768 and has very good compatibility all around. You could use -vga vmware for Linux guests to get very high resolutions, but it doesn't work that well with other (especially older) operating systems.
  • -net nic,macaddr=random -net user is for the standard, software-routed networking, documented as "slow", but more than fast enough for development work (though of course not for some kind of high-traffic, thousands-of-connections server). Remember to set a valid, random MAC address!

Posted by Luca Longinotti on 08 Feb 2011 at 17:40
Categories: Gentoo, Software
