A primer on threading, GIL and extensions in Python
Terminology:
python-runtime
: The actual (single) runtime loaded in memory that executes python language code. For this post we assume the CPython implementation; interpreter refers to the same thing.
thread
: A flow of execution; more than one such flow may be executed concurrently or in parallel.
GIL
: The Global Interpreter Lock, actually composed of a state, a condition variable, and a mutex. (We will treat the GIL as a plain lock for the following discussion.)
Since python also allows running more than one interpreter/VM, we assume only a single instance of that runtime for the following discussion.
The Python runtime allows creating threads
to increase the utilization of the underlying system resources. One thread may execute some computation while another is waiting on an I/O event.
Without threading, code has to follow a single sequential flow of instructions: if an instruction happens to be blocking, then all following instructions have to wait, even if some of them could be executed independently.
Threading lets us group a set of sequential instructions into its own independent execution flow.
In Python we group such instructions into a function, which can then be run concurrently.
The actual performance effects of the threading provided by the python runtime under different constraints are discussed throughout the post.
Python actually leverages OS/Kernel-level
threads for running functions concurrently, and those threads are scheduled by the OS itself. Python neither implements any kind of user-space scheduling nor influences OS scheduling in any form.
Using the OS threading API keeps the CPython implementation much simpler and easier to understand and extend.
But the python-runtime has little say (without adding a fair amount of code/complexity) in how those threads are scheduled by the underlying OS, which leads to some unintended behavior, like worse performance of python code on multi-core machines.
Each thread in python needs to acquire a lock (the GIL) before accessing any python data or calling any C API. At any point during the computation, at most one thread is executing python code.
The only benefit of using threading
in current GIL-based python implementations is improving the performance of I/O-bound code. For compute-bound threads we are always better off running the code in a single thread (at least for pure python).
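To see this concretely, here is a minimal sketch (numbers are illustrative and results vary by machine): a compute-bound function run on one thread versus split across two threads. With the GIL, the two-thread version is typically no faster, and often a bit slower due to switching overhead.

import threading
import time

def count_down(n):
    # pure-python CPU-bound loop, never releases the GIL voluntarily
    while n > 0:
        n -= 1

N = 10_000_000

# sequential: one thread does all the work
start = time.perf_counter()
count_down(N)
print("sequential :", time.perf_counter() - start)

# "parallel": two threads share the work, but only one can hold the GIL at a time
start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N // 2,))
t2 = threading.Thread(target=count_down, args=(N // 2,))
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads:", time.perf_counter() - start)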
We will see later how some compiled extensions can actually bypass the GIL, despite the common belief that every extension does it by default.
We will also try to understand how the GIL has allowed CPython to avoid a lot of the complexity that would come from managing concurrent access to python data and the C API from multiple threads. It helps to remember that multi-threaded code is much harder to manage and debug in any language ecosystem and should be written with extra care. Multi-threading doesn't always bring performance benefits, and only a few classes of algorithms are well suited for parallel execution.
Recent CPython implementations use a time-sharing mechanism to switch between threads, with a switching interval of about 5 milliseconds.
Earlier python implementations used the number of bytecode instructions
to set the switching interval, which was found to be biased, hence the move to time-based switching.
A newly created thread has to wait at least the switch interval sys.getswitchinterval()
before a currently running thread releases the GIL and signals the condition variable so that a waiting thread can actually be scheduled.
Even after this new thread gets scheduled by the OS, it still needs to acquire the GIL to actually do some work, and in doing so it competes with other threads, including the one that released the GIL in the first place.
There are generally no guarantees that a non-running/waiting thread will acquire the GIL even after the fixed interval.
The actual performance of your code depends on many factors, like how much I/O is involved, whether some thread is CPU-bound, etc.
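The switch interval itself is exposed to python code; a quick check (the values shown are defaults on my setup and may differ):

import sys

print(sys.getswitchinterval())   # 0.005 seconds (5 ms) by default
sys.setswitchinterval(0.05)      # let compute-bound threads hold the GIL longer before a forced drop
print(sys.getswitchinterval())   # 0.05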
Why:
With the possible releases of no-gil CPython, I find it necessary to have a sufficient understanding of the threading mechanism offered by the python runtime. Only then can we be aware of the problems that may occur with no-gil based python releases running older python code or compiled extensions.
We have to care about these extra details only if we are using threads or some async
code.
Simpler python scripts not using threads should keep working without any hiccups, but for now it remains difficult to estimate the performance impact for such code in upcoming python releases.
Most python code is expected to keep working: even with the GIL enabled, the OS-level threads are pre-empted by the OS, and the instructions actually being executed are machine instructions which don't map one-to-one to a single python function/routine.
But the Python runtime still guarantees atomicity for operations like indexing
and appending
provided by the built-in data structures, and these guarantees are expected to hold in future releases as well.
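To make the atomicity point concrete: a single call like list.append is one bytecode-level operation, while a compound update such as count += 1 expands into several bytecode instructions, so another thread can run in between them. A quick, illustrative way to see this (the function name is made up):

import dis

count = 0

def unsafe_increment():
    global count
    count += 1  # the read, the add, and the write back are separate bytecode instructions

dis.dis(unsafe_increment)
# exact opcodes vary by CPython version, but the read-modify-write is split
# across several instructions, so two threads can both read the same old value.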
How:
Take a look at code below:
import time
import threading

data_list = [] # accessible from both threads, i.e. in shared memory aka heap.

def append_some_data():
    for i in range(5):
        data_list.append("t_1") # the append operation is guaranteed to be atomic by python.
        time.sleep(0.1) # actually release then reacquire the GIL (helping in switching threads).

thread_1 = threading.Thread(target = append_some_data)
thread_1.start()

for i in range(5):
    data_list.append("t_0")
    time.sleep(0.1)

thread_1.join() # make sure thread_1 is finished too.
print(data_list)
# ['t_1', 't_0', 't_1', 't_0', 't_0', 't_1', 't_0', 't_1', 't_0', 't_1'] first run
# ['t_1', 't_0', 't_0', 't_1', 't_0', 't_1', 't_0', 't_1', 't_0', 't_1'] second run
The code above is supposed to reveal two points:
- Some operations are atomic by default, irrespective of the CPython flavour.
- Nothing can be said about the order of execution of different threads.
Every thread has to acquire the GIL before accessing any python data structure or calling the underlying C API. This condition makes it quite easy to manage threads without worrying much about thread-safety. Note that each thread is still scheduled by the OS but can only make progress after acquiring the GIL. Multi-core machines generally result in worse performance for I/O-bound python code, as we will discuss later.
For the first point: operations like indexing
, setting
, and append
are supposed to be atomic for both gil and no-gil releases, irrespective of the code actually used to implement such behaviour.
For GIL-based CPython, each thread can acquire the GIL once and do all such operations without further caution; the actual implementations of these operations are much simpler and free of any syncing logic.
But for free-threaded (no-gil) releases these operations have to be made atomic explicitly, to avoid violating these fundamental assumptions.
You can see the python 3.13
development code littered with Py_BEGIN_CRITICAL_SECTION
and Py_END_CRITICAL_SECTION
macros for various functions!
For the second point, the user is always responsible for synchronizing access to resources like files / sockets / std I/O
.
Most I/O operations in python release the GIL and only try to reacquire it after returning from the I/O code.
So if multiple threads write to the same file descriptor using f.write()
, the actual order is difficult to predict and may potentially lead to race conditions.
In this case the code really is running in pure multi-threading mode.
# I/O operation; UNDERSTAND THAT even with the GIL such I/O code may still create RACE CONDITIONS.
release_gil() # Py_BEGIN_ALLOW_THREADS; in the meantime other threads can run.
...
some_io_code() # like a system call or extension code.
...
acquire_gil() # Py_END_ALLOW_THREADS; reacquire the GIL and restore the thread state.
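For example, if several threads log lines to the same file, the ordering is up to the OS once the GIL is dropped around the actual write. A minimal sketch (the file name and messages are made up) that serializes the writes with a lock:

import threading

write_lock = threading.Lock()

def log_line(f, msg):
    # without the lock, the order of lines from different threads is unpredictable
    with write_lock:
        f.write(msg + "\n")

with open("shared.log", "w") as f:
    threads = [threading.Thread(target=log_line, args=(f, "from thread {}".format(i))) for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()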
By default:
- The GIL is released for most I/O operations (including calls like time.sleep()), which results in faster thread switching for such I/O-bound code. The runtime has to release the GIL to prevent a potential deadlock in case the I/O never materializes; other threads can keep running in the meantime. There is no way around it, as the runtime can't model the I/O latencies and the OS handles them anyway. I don't fully understand the situation with async code, where i/o operations are supposed to be non-blocking by definition, and the effects of releasing the GIL for such code.
- The GIL can also be released manually, using the C API macros Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS for simple cases. Some C extensions/modules like hashlib manually release the GIL for compute-intensive operations and acquire it back before running any pure python code (see the sketch after this list). Note that if the gil is released manually by a C extension, it is recommended not to touch the python runtime in any form before reacquiring it; extensions generally run such operations on privately owned buffers/memory not visible to the python runtime.
- For compute-bound threads the GIL is released after the sys.getswitchinterval() interval, which is around 5 ms; other threads are then supposed to acquire the released GIL. Setting this value higher would improve the performance of a compute-bound thread but defeats the main purpose of using threads. Setting it lower would theoretically improve responsiveness, but the difference may not be visible to the user.
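A small sketch of the hashlib case mentioned above (buffer size and timings are illustrative; the speedup depends on your CPython build and machine): because hashlib drops the GIL while hashing large buffers, two threads can actually hash on two cores at once.

import hashlib
import threading
import time

data = b"x" * (64 * 1024 * 1024)  # 64 MiB buffer to hash

def digest():
    hashlib.sha256(data).hexdigest()

start = time.perf_counter()
digest(); digest()
print("sequential :", time.perf_counter() - start)

start = time.perf_counter()
t1 = threading.Thread(target=digest)
t2 = threading.Thread(target=digest)
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads:", time.perf_counter() - start)  # noticeably less, since the GIL is dropped while hashing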
Since the running thread tries to reacquire the GIL just after releasing it, the OS treats acquiring the
GIL like any other instruction that needs to be executed.
On multi-core machines the OS does its best to schedule the waiting thread on a separate core and has no incentive to stop the thread that just released the gil, so most of the time this thread itself ends up acquiring the gil again.
The OS has no idea of the semantics of the GIL for the newly scheduled thread: that thread may wake up, fail to acquire the GIL, go back to waiting, and so on, which is why performance is generally worse on multi-core machines.
We can use locks
like threading.RLock()
or other variants to make sure we don't have any unexpected access (r/w) to a shared data structure until an operation
runs to completion.
Note that the definition of an operation
is up to the user: only after the operation
has concluded do we release the lock
, and if the user fails to release the lock
then other threads will never be able to access that shared data structure
and may end up in a deadlock.
Locks come with performance penalties for sure, but they are sometimes very necessary for the correctness guarantees of your code.
So if a user uses locks to control r/w access to shared data structures diligently, pure python code is supposed to run correctly independent of the GIL implementation.
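A minimal sketch of that idea (names are made up): the whole read-modify-write below is our operation, and the lock is held until it runs to completion, so the result is correct with or without the GIL.

import threading

balance = 0
balance_lock = threading.Lock()

def deposit(amount):
    global balance
    with balance_lock:          # acquired here ...
        current = balance
        current += amount
        balance = current       # ... released only after the whole operation is done

threads = [threading.Thread(target=deposit, args=(1,)) for _ in range(1000)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)  # 1000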
A bit more about switching threads:
Execution of user-level code goes through a couple of passes before it can actually be executed by the python VM. The first pass converts user-level code into a much smaller set of instructions, collectively referred to as bytecode
; this bytecode is then executed by the python VM installed on the system.
This part can be thought of as the front-end
of the python interpreter and is responsible for tokenizing
, parsing
, and building the AST
from the pure python code.
It is a pure software-level transformation and can be done independently and reproducibly on any platform.
The bytecode generated by the front-end is fully portable, and eventually the python VM executes this bytecode
instruction by instruction after doing all the necessary book-keeping.
Bytecode instructions sit one level below pure python code and are much easier for the python VM to execute, understand, and transform.
Note that each piece of our pure python code has to go through this series of steps even before doing the actual operation. This is part of the cost we bear in exchange for the flexibility offered by the language.
import dis # disassembly module.

def some_function():
    data = []
    for i in range(10):
        data.append(i)
    return data

dis.dis(some_function)
2 0 BUILD_LIST 0
2 STORE_FAST 0 (data)
3 4 LOAD_GLOBAL 0 (range)
6 LOAD_CONST 1 (10)
8 CALL_FUNCTION 1
10 GET_ITER
>> 12 FOR_ITER 7 (to 28)
14 STORE_FAST 1 (i)
4 16 LOAD_FAST 0 (data)
18 LOAD_METHOD 1 (append)
20 LOAD_FAST 1 (i)
22 CALL_METHOD 1
24 POP_TOP
26 JUMP_ABSOLUTE 6 (to 12)
5 >> 28 LOAD_FAST 0 (data)
30 RETURN_VALUE
By disassembling
the bytecode
for some_function
, we can observe a much simpler sequence of instructions which relates to the original pure python code.
The VM can be naively thought of as a simple stack machine. For example, BUILD_LIST
is one instruction whose execution result (a pointer) is pushed onto the stack, and STORE_FAST
saves/stores this pointer
in the data
reference (for later use down the execution pipeline) and pops it off the stack.
Similarly GET_ITER
turns the object on top of the stack into an iterator
, and every execution of the FOR_ITER
instruction fetches the next value from that iterator, which gets stored in the i
variable, and so on.
The actual amount of machine instructions or C-level code used to implement a list
is much larger, but thanks to abstraction the VM only needs to think about it at the BUILD_LIST
level.
The underlying code makes sure that the BUILD_LIST
implementation runs to completion before any such instruction from another thread can be executed.
To do this, the underlying VM code has to use a mutex or a lock
to guard the actual C code implementing/initializing the list, and we already have such a lock
in the form of the GIL, which is already held by the thread executing the instruction.
Let's try to understand a bit more about thread switching by building a very, very simple machine that executes some imaginary instructions.
import random
import time

# mapping thread id to a list of instructions to execute. 0 belongs to the main thread.
CODE = {
    0: ["NOP" for i in range(10000000)]  # for the main thread ..
}

def register_thread(instructions: list[str]):
    """A function to register a new thread.
    It provides an interface to run some instructions specific to a thread.
    """
    new_thread_id = max(CODE.keys()) + 1
    CODE[new_thread_id] = instructions  # provide the corresponding instructions to run.
    global SYS_COUNTER
    SYS_COUNTER = 2  # update this so the new thread gets a chance to run.

def switch_thread():
    # equivalent to the GIL being released by one thread and acquired by another.
    return random.choice(list(CODE.keys()))  # fair policy, uniformly random.

def resume_thread(id: int) -> int:
    """Runs one complete high-level instruction to completion."""
    print("Resuming thread: {}".format(id))
    if len(CODE[id]) == 1:
        print("\tExecuting: {}".format(CODE[id].pop(0)))
        print("EXIT THREAD {}".format(id))
        _ = CODE.pop(id)
        return 0
    print("\tExecuting: {}".format(CODE[id].pop(0)))
    time.sleep(0.5)
    return 1

if __name__ == "__main__":
    # at the beginning
    SYS_COUNTER = 100000  # a large initial value: how many instructions to run before switching.

    # simulating initial work in the main thread...
    for i in range(3):
        print("Main Thread, ID: 0")
        time.sleep(0.5)

    register_thread(["I", "AM", "BAT", "MAN"])   # simulating registering a new thread.
    register_thread(["I", "AM", "GROO", "OOT"])  # simulating registering a new thread.

    while True:
        thread_id = switch_thread()
        for i in range(SYS_COUNTER):
            flag = resume_thread(thread_id)
            if flag == 0:
                break
It may look like this...
Main Thread, ID: 0
Main Thread, ID: 0
Main Thread, ID: 0
Resuming thread: 1
Executing: I
Resuming thread: 1
Executing: AM
Resuming thread: 1
Executing: BAT
Resuming thread: 1
Executing: MAN
EXIT THREAD 1
Resuming thread: 0
Executing: NOP
Resuming thread: 0
Executing: NOP
Resuming thread: 2
Executing: I
Resuming thread: 2
Executing: AM
Resuming thread: 2
Executing: GROO
Resuming thread: 2
Executing: OOT
EXIT THREAD 2
Resuming thread: 0
Executing: NOP
Resuming thread: 0
Executing: NOP
......
The SYS_COUNTER
value is used as the switching
mechanism among concurrent threads; by default it can be set to a large value when there is only the main thread.
The implementation above uses register_thread
to register a thread with some default underlying runtime; in our implementation this sets SYS_COUNTER
to a much smaller value.
resume_thread
simulates the execution of a single atomic
instruction. After executing instructions SYS_COUNTER
times, the underlying runtime switches to a new thread.
switch_thread can be thought of as the equivalent of releasing and acquiring
the GIL; in our code it uses a fair policy by (uniformly) randomly selecting one of the threads.
We could instead update this policy to favour the already running
thread, with a penalty
every time this happens, to simulate the behavior of python on multi-core machines, as sketched below.
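A rough sketch of such a biased policy (the BIAS value is arbitrary) that could replace switch_thread in the simulation above; the main loop would then have to pass in the id of the thread that just ran:

import random

BIAS = 0.8  # arbitrary: how often the releasing thread immediately "re-acquires the GIL"

def switch_thread_biased(current_id):
    # favour the already running thread, the way a multi-core OS tends to let
    # the thread that just released the GIL grab it again right away.
    if current_id in CODE and random.random() < BIAS:
        return current_id
    return random.choice(list(CODE.keys()))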
Extensions and GIL relationship:
Python extensions are pieces of code written with the aim of extending the python ecosystem by exposing an interface compatible with the Python API.
Such extensions generally come in the form of compiled code and are loaded dynamically by the Python runtime, which leverages the functionality provided by the OS for loading DLLs
.
Like any other DLL, a compiled extension must be a valid DLL with a valid init
function as required by the OS loader.
The Python interpreter knows nothing about the actual functionality/code provided by the extension, even if it is loaded successfully.
Such extensions are loaded like any other pure python module, and hence any routine/function from the extension is called from a thread holding the GIL. No other thread can acquire the GIL until that extension routine returns and the GIL is released after that.
As said above, python cannot inspect the extension, so if there is some I/O code inside the extension it may take some time to finish. The idea with an extension is to run some compute-intensive code which would otherwise have taken a lot of time in pure python.
Calling extension routines with the GIL held implicitly guarantees thread safety: the extension doesn't need to do much book-keeping or worry about potential race conditions or a segfault resulting from the GC releasing some shared memory, as the gil is held by the calling thread and all other threads have to wait for the routine to return.
With no-gil or free-threading python, extension code would have to explicitly communicate with the underlying runtime using the C API if any other threads were to run in parallel. This puts a burden on developers to understand the C API more thoroughly when writing extensions for no-gil python versions. It is also difficult to estimate the impact of running older extensions on a no-gil python runtime.
The following example should illustrate the blocking nature of the python runtime when running extension code by default.
# pseudo code
import threading
import some_extension

thread_1 = threading.Thread(target = some_extension.foo) # asking python to run a routine from a compiled extension.
thread_1.start() # start() returns, but once foo (holding the GIL) starts executing,
                 # pure python code cannot make progress until foo returns.

# this code in the main thread was supposed to process some events;
# it stops running once the extension routine holds the GIL, until it is finished.
while True:
    event = get_event()
    process_event(event)
In the code above, if the extension routine foo
takes a significant amount of time, we waste python's potential to do some extra work while the extension is running.
The code above definitely exhibits this bad pattern:
- if the extension foo
is only using a single core on a multi-core machine;
- if the following code is not sharing any memory with the foo
code.
As python threads are OS-level
threads, we could (ideally) expect the OS to schedule and execute our python code on a separate core, if only the extension communicated with the python runtime (i.e. released the GIL).
Some extensions like hashlib
and zlib
use the pattern of releasing the GIL
while running independent compute-intensive code, so as not to starve other threads.
Access to the Python C API can be gained in various ways, such as compiling the extension against the python headers and dynamically linking to a specific python version.
This generally leads to either shipping a compiled extension for a particular python version or building
it on the user's system during installation of a package.
This building
doesn't go as smoothly on systems like windows
and android
, and is a major source of frustration when using relatively less popular packages.
Let us look at a pattern that showcases the expected performance benefits due to threading.
// some extension function (pseudo code)
void foo(PyObject *some_python_object) {
    // NOTE: by default this routine is called with the GIL held.
    // assert Gil_state() == 1

    // variable to store the thread state locally.
    PyThreadState *state;

    // increase the REF_COUNT for this python object. (Only if the extension requires it; otherwise the GC may release it!)
    Py_INCREF(some_python_object);

    // save the thread state into `state`, and release the GIL to be acquired by concurrent threads (if any).
    state = PyEval_SaveThread(); // could be called from any thread.

    some_compute_intensive_code(); // either on private memory (SAFE) or shared python memory/objects (UNSAFE, since after releasing the GIL it's open season!)

    PyEval_RestoreThread(state); // acquire the GIL and restore the thread state from the local variable.
    // assert Gil_state() == 1

    // decrement the reference count.
    Py_DECREF(some_python_object);
    return;
}
The above pattern highlights a protocol that could be followed by most extensions:
- If the code needs to access some python object, it should increment
the REF_COUNT so there are no unexpected side-effects while accessing shared memory.
- Extension code should release the GIL when doing compute-intensive work (specifically if it is not using all cores during that computation). This allows python to run code in other threads to attend to other events. Note that releasing the GIL is generally done when the extension no longer needs python objects/memory and instead updates/modifies some private memory.
- Not releasing the GIL does not allow other threads to run in the meantime, and events may fail to be attended to.
- For free-threading
(no-gil) I think it would be quite difficult to access shared memory without some kind of sync barrier
, potentially limiting the performance gained from using extensions
in the first place.
The above example should help in understanding that, without reading the code of all the modules imported by a python script, it is really difficult to predict the performance of that programme as a whole!
Shared libraries from any language and GIL:
A major advantage of using shared libraries
is that such libraries can be written in languages other than C, like rust
, nim
, zig
etc.
Such extensions use the implicit guarantee provided by the GIL, without involving any C API
, and are still able to speed up some compute-intensive part.
Using your favorite language to develop extensions does sound delightful, even for personal reasons. A user can spend a fair amount of time writing a shared lib, but once it is done it becomes very tempting to just call it from python for scripting and for integrating with other modules.
But as a user you don't want to mess with compiling C headers every time for a different python version; a lot of effort goes into managing that infrastructure and the idea may no longer seem delightful.
One main reason to compile against the C headers when building an extension is to have full access to the C API of the python runtime. Extensions compiled this way end up inheriting a lot of information about the underlying python runtime.
Apart from containing the necessary C functions, these extensions also include the necessary code, like PyInit_<MODULE_NAME>
, to make it possible for the python runtime to import them like any normal package.
Since we want to write the extension code in our desired language only once, we need some runtime tricks to get access to whichever python runtime our extension ends up loaded with. There exists a subset of the python API, called the limited API for obvious reasons, which is stable enough to be available for all python releases after 3.2. We can target this API instead and write our extension code only once, to be able to run on virtually all python releases.
Access to the Python C API can then be gained by dynamically resolving the C API symbols at runtime.
A user can create an extension (.so or .pyd/.dll) without caring about the actual python runtime, through trivial build steps for the desired language, and communicate with whichever runtime
the OS actually loaded.
Take a look at code below to understand more.
# Note that python3.x.x.dll is itself loaded by the OS at runtime for the python.exe process.
# I am writing this code on windows; change accordingly.
import win32api, win32process

# enumerate all loaded DLLs and their handles for `python.exe`.
for h in win32process.EnumProcessModules(win32process.GetCurrentProcess()):
    print(win32api.GetModuleFileName(h), h)

#C:\Users\random\miniconda3\python.exe 140695168221184
#C:\Windows\SYSTEM32\ntdll.dll 140710736232448
#C:\Windows\System32\KERNEL32.DLL 140710726598656
#C:\Windows\System32\KERNELBASE.dll 140710699925504
#C:\Windows\System32\ucrtbase.dll 140710693699584
#C:\Users\random\miniconda3\python310.dll 140709572050944 # this is the python runtime loaded.
#C:\Users\random\miniconda3\VCRUNTIME140.dll 140710583664640
# .......
Now call some Python C API dynamically.
# this is for demonstration; you would do the same from your language!
# now wrap it in the form of a PyDLL instance.
import ctypes
from ctypes import *
import sys

# PyDLL is like the CDLL class, but the GIL is NOT released during the function call,
# since we are calling the python C API.
pylib = PyDLL(name = "python310.dll", handle = 140709572050944) # handle from the enumeration above.

# Py_GetVersion is part of the limited API too.
pylib.Py_GetVersion.restype = c_char_p # a C character pointer aka null-terminated string.
python_version = pylib.Py_GetVersion()
print(python_version)
# b'3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]'
print(sys.version)
# '3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]'
It works, but now we are responsible for doing all the book-keeping when communicating with the Python runtime, like incrementing/decrementing the ref count for any data structure/memory shared with the python runtime.
The PyDLL
class is used to make sure that the GIL is not released while calling pylib
routines; a CDLL
instance by default releases the GIL around each call.
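A small illustration of that difference (a sketch assuming a POSIX system, where ctypes.CDLL(None) exposes libc and its sleep function): a blocking foreign call made through CDLL releases the GIL, so the main thread keeps making progress; the same call made through PyDLL would hold the GIL for the whole duration.

import ctypes
import threading
import time

libc = ctypes.CDLL(None)   # libc on POSIX; CDLL releases the GIL around each call

def blocking_call():
    libc.sleep(2)          # the GIL is released for the duration of this call

t = threading.Thread(target=blocking_call)
start = time.perf_counter()
t.start()
for _ in range(5):
    time.sleep(0.1)        # the main thread keeps running while libc.sleep blocks
    print("main thread alive at", round(time.perf_counter() - start, 1))
t.join()
# with ctypes.PyDLL the GIL would be held for those 2 seconds and the
# main-thread prints would only appear after libc.sleep returned.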
To be imported by the python interpreter as a normal module, the extension also needs to implement the necessary routines like PyInit_<ModuleName>
.
Such boilerplate code generation should be automated instead; I use an excellent package, nimpy
, that handles most of this stuff, allowing the generated extension to be imported with a plain import <module_name>
.
More ways to speed up python code ..
Python's popularity has also inspired various tools, each with its own benefits and limitations, to help speed up python code.
Python is a vast language and is continuously evolving, which makes it almost impossible for a single tool to just accept any valid python code and generate semantically equivalent but much faster code!
But understanding the limitations of a tool
can really help users speed up bottleneck parts, similar to a pre-compiled extension.
- Cython: a python-to-C compiler.
- Numba: an open-source JIT compiler that translates a subset of Python and NumPy into fast machine code.
- PyPy: an alternative python runtime.
and many more ...
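As a taste of what these tools look like in use, here is a sketch using Numba (it assumes numba and numpy are installed; the function and variable names are made up). The nogil=True option is directly relevant to this post: the compiled function releases the GIL while it runs, so it can be used from several threads in parallel.

import numba
import numpy as np

@numba.njit(nogil=True)           # compiled to machine code; releases the GIL while running
def sum_of_squares(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total

data = np.random.rand(10_000_000)
print(sum_of_squares(data))       # first call includes compilation time; later calls are fast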
Remarks:
Being able to integrate so much functionality from various third-party modules is one of the main reasons for python's popularity. It makes it much easier to prototype, test, and build very cool programmes while still getting reasonable performance. As most of the scientific modules are written in C/C++ independently of any Python VM assumptions, the presence of the GIL has made it very easy to just use them as python extensions.
As python moves towards free-threading
, I have my doubts about the performance of these extensions, which have shaped the rise of the language in its current form.
No-gil python is supposed to improve the performance of pure python code by leveraging all cores on a system, but those threads would then compete with the extensions' code
for their time on the CPU.
This could end up lowering performance
in many unexpected ways, and that is only after compiled extensions can somehow be made to run correctly in a shared multithreading environment
in the first place.
Some of the most used python packages like pytorch
, tensorflow
, and numpy
already try to utilize system resources to the fullest anyway.
But kudos to the authors for trying this endeavor; they are really smart people and may know something I have missed completely.
I myself would have loved to work on standard tooling that makes writing extensions much easier by doing rigorous checks at compilation time!
Running your code correctly, by synchronizing access to data structures
from multiple threads using locks
, should be the main priority for users, independent of the flavor of the underlying CPython.
As long as users stick to standard python objects, classes, and methods, their code will continue to run correctly, although the performance impact with the new runtime remains to be seen.