Millets: a practical memory-safety and thread-safety experiment
Greatness is nothing but a meditative repetition of a sequence of primitive actions ...
Based on the observation that almost every data-structure is backed by some form of dynamically allocated memory (the heap), and that algorithms allocate their required (scratch) memory at runtime, the framework discussed here tries to expose a borrow-first model for any such memory need. Memory as a resource should be the property of the object or data-structure, rather than of the type-system. Modern type-systems can model the intentions of a programmer by enforcing the required constraints at the type level, but not without a lot of effort on the compiler side, in the form of man-hours and the mental grit needed to keep improving it, especially once the community has reached critical mass. There is also little a compiler can do to suggest optimal data-layouts or optimize reallocations with only the narrow context at its disposal. A runtime framework (or some kind of JIT), on the other hand, could collect statistics to suggest more optimisations along this dimension!
This framework is a personal piece of code, shaped by personal needs and observations, and hence makes no claim to be a universal or complete solution in any form. It will hopefully expose the common extensions needed in pursuit of much improved memory-safety, without requiring any modifications to the (language) compiler code. Being a runtime library, it is much easier to experiment with based on personal needs, and one such library could eventually be backed by the compiler as the default one to build data-structures upon. I have come to believe this could even make compiler/PL development easier and faster to iterate on, which remains a very intensive effort in itself despite all the current tooling. Besides, the promise of robust concurrency primitives can easily take decades even with a lot of money and man-power, can still offer false guarantees, and tends to require a huge amount of optimization code at the compiler level.
Broader Ideas:
Any such runtime includes (significant) extra overhead, which is why such libraries are usually bundled inside the compiler itself, in the hope of finding optimization opportunities later; this is also one of the reasons such frameworks are rarely developed independently of the compiler.
But as indicated, we will try to enforce a borrow-first model for each memory allocation. If we succeed, we end up reducing memory-allocation calls (and the corresponding release calls), and may end up with a framework with much less overhead than intuition would initially estimate. Such a model ends up similar to Rust's shared XOR mutable requirement, but at the library level.
The framework is not tied to a particular programming language, but it is definitely easier to develop for a PL with manually managed memory, or ideally one providing information about an instance's/variable's scope so that the required code to release memory can be injected automatically.
This particular implementation has been developed in Nim, and for Nim.
The framework is supposed to double-check the user code for common memory-safety bugs, even in multi-threaded environments, and yet can be disposed of once a dry-run has found no issues. The user is expected, and even encouraged, to use direct memory access in pursuit of higher performance, but with more confidence!
The main motive is to do away with the redundant memory allocations/copies generally involved in value-semantics languages (i.e. each assignment produces a fresh copy of data, mostly to make compiler reasoning easier).
It will, however, include minimal reference-counting for release/production builds, to correctly determine the point during execution at which memory resources can be released. This part is modular too: for now I use Nim, which injects =destroy calls at the scope exit of an identifier, and it could easily be modified to rely on manual free calls instead.
This is all backed by an allocator that keeps track of all allocations; any leaked memory is detected by the allocator before the programme exits!
To summarize (and for my sanity), these are the major ideas:
- Borrow first: default to no-copy whenever access to memory is requested. On any new assignment (for the same types) we "share" the underlying physical (heap) memory; only the required stack-data is copied. Stack-data is minimal, and copying it is generally much faster than involving the heap, so any assignment incurs a very small cost. It is much easier to analyze when such "reference semantics" won't suffice than to first copy and later analyze the code for copy-elisions, at least for a runtime framework. Memory seems to exist on a dimension independent of the PL/type-system, more closely related to data-structures/objects, i.e. the runtime part of your code. Trapping read and write calls seems to be enough for the framework to perform its memory-safety checks.
- Detect invalid/dangling access to memory throughout code execution, i.e. try to detect the major and common memory-safety violations, particularly double-free and use-after-free, which make up the majority of such bugs. We should be able to clearly indicate the reason for a failed check (ideally which part of the literal code was responsible, which is a work in progress).
- Flexible: we don't have to satisfy all the constraints a type-system generally enforces to determine memory safety in languages like Rust; a type-system may require more guarantees before marking some code as memory safe, even when it could be proven that the existing semantics of that code would not have violated memory-safety at runtime. The framework should match user intentions more closely, making it easier to share some memory/record among two or more data-structures: a data-structure is free to drop/release some memory at any time, change a reader status to writer/owner, etc., without refactoring the whole codebase to satisfy compiler constraints. It is even easy to build an immutable data-structure that allows none of that. It seems much faster to experiment at the user-space level than to fight the compiler!
How:
Despite complex type-systems and compiler checks, the main way to manipulate data is still to load and store it from and to RAM at runtime; these two instructions are exposed by almost all modern CPUs. A simple and effective approach is to intercept all reads and writes to memory, and mark any access (read or write) as invalid if certain conditions are not met. We do not trap system calls; instead we compare the memory addresses of accesses against our own book-keeping. For that we expose a memory abstraction, which each data-structure can choose to include as a field so that all accesses are checked by the framework.
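As a rough illustration of what trapping reads and writes at the library level can look like, here is a minimal, self-contained sketch; the names (CheckedMem, check) are hypothetical, and the real UserRecord carries more meta-data and defers the decision to the book-keeper:
type
  CheckedMem[T] = object
    data: ptr UncheckedArray[T]
    len: int        # number of elements
    valid: bool     # cleared on release, so stale handles are caught

proc check[T](m: CheckedMem[T], i: int) =
  doAssert m.valid, "use-after-free detected"
  doAssert i >= 0 and i < m.len, "out-of-bounds access detected"

proc `[]`[T](m: CheckedMem[T], i: int): T =
  check(m, i)       # every read is trapped here
  m.data[i]

proc `[]=`[T](m: var CheckedMem[T], i: int, v: T) =
  check(m, i)       # every write is trapped here
  m.data[i] = v

var m = CheckedMem[int](data: cast[ptr UncheckedArray[int]](alloc0(4 * sizeof(int))),
                        len: 4, valid: true)
m[0] = 42
echo m[0]           # passes the checks
# m[4] = 1          # would be flagged: out of bounds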
Terminology:
- Allocator: a memory allocator built upon Glibc malloc, with the idea of reducing the memory-requesting (malloc) calls by doing extra work in user-space. Actual implementation details can be omitted for now.
- record: a contiguous chunk of memory returned by the allocator for a new/fresh demand of memory (technically a pointer and size pair). A data-structure may keep multiple such records based on its needs, but each fresh record requires a separate call.
- reference/UserMemory: aka the memory abstraction, a struct with the record size and pointer plus some meta-data, exposed to the user and always kept on the stack. A copy of each new reference is also maintained by the book-keeper, referring to the already allocated physical memory. Any assignment creates a new reference (just more stack-data).
- reference-id: each reference is marked by a unique id. For now a maximum of 64 live references is possible.
- reader: a reference that can only read the underlying memory. Based on their genesis during code execution, each reference has access to a consistent view of memory during its lifetime, unless explicitly allowed otherwise. All references acquire the reader status by default; this is somewhat equivalent to immutable references. Any attempt to read mutated memory is detected and results in a panic! Any number of readers is possible.
- writer: a reference that can both read and write the underlying memory. Only one such writer is possible at any time during code execution. The writer status can be assigned to any reference when requested explicitly; the previous writer, if any, is demoted to reader. Sometimes the term owner is used in place of writer!
- Bookkeeper: an object that keeps track of all active/live records and all the references that can access them. Any allocation/deallocation is required to inform the book-keeper.
Note that the bookkeeper is modular and independent of the allocator; each can be improved and adapted based on personal needs.
Bookkeeper:
As noted above, the main way to manipulate data remains loading and storing it from and to RAM at runtime. We won't need to trap system calls; instead we route all memory accesses through our API to trigger the memory-safety checks.
To make this convenient we expose a memory abstraction, the UserRecord type, to be composed into each data-structure that wishes to leverage such checks, without it ever having to care about any internal implementation details!
The bookkeeper implementation has a Record type, which keeps track of the necessary meta-data such as the size, the raw memory pointer returned by the allocator, and information about all the references to this record. Managing references is quite easy, as Nim exposes =copy and =sink hooks that are triggered on assignment statements.
We don't care much whether the Nim compiler triggers the =copy or the =sink hook, as we provide our own implementations which never do a deep copy. This is somewhat similar to the Copy trait in Rust: primitive types in Nim are copied, while user-defined types control this behaviour through the =copy and =sink hooks.
Each time a =copy is requested, we find a slot and a corresponding reference-id, which is captured as the reference_id field in UserRecord. Each reference on the book-keeper side further has its own meta-data, like the memory range for which it is valid and its reader/writer status, so we can easily detect any out-of-bounds access (independent of data-structure level checks). In a semantic sense we end up with one more reference to the physical memory; such references can be easily passed around, even across thread boundaries, as we will see later in this post.
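Conceptually, the book-keeper's side of one allocation could look something like this (a sketch with hypothetical field names, for illustration only):
type
  RefStatus = enum rsReader, rsWriter
  RefMeta = object
    id: uint8           # the reference-id handed back to the UserRecord on the stack
    status: RefStatus   # reader by default; writer only after an explicit request
    offset, len: int    # the memory range this reference is allowed to touch
  Record = object
    base: pointer       # raw pointer returned by the allocator
    size: int           # record size in bytes
    refs: seq[RefMeta]  # all live references to this record (at most 64 for now)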
For any =copy trigger, the book-keeper can update its reference count for that record. Since Nim offers ARC at the scope level, the corresponding =destroy hook is triggered whenever an instance of a type goes out of scope.
We use this opportunity to decrement the reference count for a record, and if this was the last reference to go out of scope, we can actually release the memory/record back to the allocator.
What the allocator does with that record/memory after the deallocation call is not the purview of the book-keeper, which is therefore fully independent of the allocator implementation, as long as the alloc and dealloc call signatures remain stable!
A lot of checks (some even redundant) validate such references on many occasions; generally, any mismatch between the (stack-level) UserRecord and the book-keeper is detected with a possible reason (and hopefully an explanation).
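A rough, self-contained sketch of how the =copy/=destroy hooks can drive this counting (Handle, refCounts and newHandle are hypothetical stand-ins for UserRecord and the book-keeper):
type
  Handle = object
    base: pointer
    refId: int

var refCounts: array[64, int]    # stand-in for the book-keeper's per-record counts

proc newHandle(id: int): Handle =
  # stand-in for allocRecord: the foremost reference starts the count at 1.
  result = Handle(base: alloc0(64), refId: id)
  refCounts[id] = 1

# `=destroy` runs at scope exit (Nim ARC); the last reference releases the record.
proc `=destroy`(h: Handle) =
  if h.base != nil:
    dec refCounts[h.refId]
    if refCounts[h.refId] == 0:
      dealloc(h.base)

# `=copy` never deep-copies the payload: only stack data is duplicated,
# and the book-keeper is told about one more live reference.
proc `=copy`(dst: var Handle, src: Handle) =
  dst.base = src.base
  dst.refId = src.refId          # the real framework issues a fresh reference-id here
  inc refCounts[src.refId]

var a = newHandle(0)
block:
  var b = a                      # triggers `=copy`: shares memory, bumps the count
  echo refCounts[0]              # 2
echo refCounts[0]                # 1 again: `b` went out of scope
echo a.refId                     # `a` is still live here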
Slices:
Aka temporary views: another very useful abstraction built on the initial idea of reducing memory allocations. Combined with memory-safety, it helps model many (performant) scenarios which occur regularly in codebases. We basically expose an API to further divide an already allocated (contiguous) memory region into a desired subset/range, at (almost) no cost. Any such slice is a child of another (non-slice) parent reference and is guaranteed to go out of scope, or be explicitly dropped, before its parent. Note that the slice implementation is only concerned with this temporariness guarantee; all reads/writes are done on the parent's behalf. This has made the implementation much cleaner; an earlier implementation was needlessly complex because it treated slices more differently than they needed to be. Since the main reason for reference-counting is the automatic release of (memory) resources, and that only depends on the number of active references, the above definition of a slice lets us argue that no number of slices can alter the release point of the corresponding parent, hence we can do away with all reference-counting logic for such slices. (Incrementing the shared counter is the major overhead in multi-threaded programmes.)
The cursor pragma can be triggered manually, as in var a {.cursor.} = b, to skip the copy/destroy cycle for a, or equivalently to exclude a from reference counting.
It ends up explicitly copying stack-data without =copy or =destroy calls.
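For example (plain Nim, independent of the framework):
var b = "some heap-allocated string"
# `a` is only a shallow, non-owning view of `b`: no `=copy` on assignment,
# and no `=destroy` when `a` goes out of scope.
var a {.cursor.} = b
echo a
echo b   # `b` remains the sole owner of the underlying buffer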
Slices or views arise from the practical need to have temporary access to some memory chunk at some point during code execution, generally for a very brief time.
For example, while reading a substring from an already available string, such that this substring may be a key in a dictionary; creating a new copy wouldn't make sense.
A slice in this document refers to a subset of memory of an already allocated record (which itself refers to a packed memory region). Slices are assumed to have a stride of one, so an offset and a len field are enough to point to the desired region inside a record unambiguously.
Most programming languages introduce this slicing mechanism on some native data types, like strings/sequences/vectors. The actual implementation may or may not deep-copy the newly created sliced instance; for example, an implementation may introduce (indirect) copy-on-write if the slice is ever written to.
Nim copies on every slice operation by default, and this is not optimal, to say the least, for codebases that use the string type heavily, like a search engine. Nim has the concept of openArray, which is just a combination of pointer and len, but openArray cannot be used as the type of an object field (without experimental flags). Another issue is that Nim strings are null-terminated to play well with C codebases, which makes it essentially impossible to create a read-only substring that can be used seamlessly with existing library routines, because standard-library functions depend on this null byte along with the len field.
I personally think null-terminated strings are not a good decision, as most C routines also allow a pointer-and-len style API, making zero-cost Nim-to-C string passing possible anyway, but it may have been more convenient this way for the language developers!
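For contrast, Nim's own openArray already gives a (pointer, len) view at call boundaries, though as noted above it cannot be stored in object fields without experimental flags:
proc firstWordLen(s: openArray[char]): int =
  # operates on a view: no copy of the underlying string is made
  for i, c in s:
    if c == ' ':
      return i
  return s.len

let line = "quick brown fox"
# `toOpenArray` hands out a (pointer, len) window into `line`: no allocation.
echo firstWordLen(line.toOpenArray(6, line.high))   # "brown fox" -> 5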
UserRecord is almost double the size of a raw pointer, doubling the stack usage compared to passing raw pointers to share memory. This may remind you of the fat-pointer concept used in Fil-C and similar experiments. The idea is to hang/collect more meta-data in order to do more checks. Almost all "intelligent" software is about collecting or utilizing more (meta) data to make more fine-grained predictions, maybe non-deterministic but highly probable nonetheless :)
type
  UserRecord*[T] = object
    # comes from the allocator.
    record_pointer: ptr UncheckedArray[T]
    record_payload: uint32
    record_size: uint32   # in bytes!
    when defined(noMemorySafety):
      discard
    else:
      reference_id*: uint8
      isSlice*: bool      # quick check whether this reference/stack-data represents a view/slice

proc allocRecord*[T](n_elements: Natural): UserRecord[T] =
  # allocates enough (contiguous) memory to hold n_elements of type T.
  let n_bytes = n_elements * sizeof(T)
  let raw_mem_pointer = a1_default.allocRecord(n_bytes, zeroin = zeroin_t)
  result = UserRecord[T](
    record_pointer: cast[ptr UncheckedArray[T]](raw_mem_pointer),
    record_payload: payload_counter,
    record_size: uint32(n_bytes),
    isSlice: false)
  inc payload_counter
# Enough memory to hold 32 int elements.
var x = allocRecord[int](n_elements = 32)
echo x.reference_id   # 0 (foremost reference)
var y = x             # a new (reader) reference.
echo y.reference_id   # something other than 0.
# warning on reading, as we never wrote anything to it. Because why not :)
echo x
makeMeAWriter(x)      # ask to be a writer.
x[0] = 42             # write (trapped; memory-safety checks are performed)
x[32] = 1739          # error, as outside of the allocated range.
echo y                # error, while reading y?
While reading y, i.e. on echo y, we get a fatal error: at var y = x, y was tied to a particular snapshot of the underlying physical memory, and any "unexpected" mutation has to be soundly detected by the framework for our model to work. I used the word "unexpected" because many times the user is aware that some modifications/mutations are done through a function abstraction.
We can do the following for the framework to trust the user's intentions.
# explicit acceptance of mutation.
expectMutation(y) # y is tied to current-snapshot of the underlying memory.
echo y
...
makeMeAWriter(y)
#expectMutation(y)
echo y # its still an error.
Note that robust mutation detection is one of the core goals of the framework and is rigidly enforced for every read/write call. In the above case a primitive like makeMeAWriter only informs the book-keeper that y may now write to the underlying memory. The request succeeds, but y would still be reading unexpectedly mutated memory, and hence the access fails until the mutation is explicitly expected. Rather than making makeMeAWriter responsible for checking mutation, and thereby adding complexity, some core checks are always done at the read/write level only. This allows us to expose more convenient APIs (in the future) with more confidence. The focus is on composing one-off/orthogonal responsibilities to get what the user intends.
...
makeMeAWriter(y)
expectMutation(y)
echo y
x[0] = 93 # prevented.. not a writer.
Minimal (current) API spec:
- makeMeAWriter: immediately makes some reference the owner of the memory from that point on. It doesn't complain (except maybe for slices, where it would be redundant). Any previous owner/writer (if any) is demoted to reader. Interestingly, the first thread-id (or concurrent-unit id) to access any byte automatically becomes the writer/owner of that byte as soon as a concurrent environment is active; thread-safety checks are supposed to (almost) detect race-conditions. During an active concurrent environment, reference-ids are replaced by thread ids and memory access becomes much more fine-grained, making it much easier to write concurrent algorithms!
There are, however, scenarios like the following:
var x: UserRecord[int]
proc mutate_me(y: var UserRecord) =
  y[0] = 9
mutate_me(x)
echo x   # framework fails to indicate that x is mutated.
Even though the mutate_me signature suggests that y (the reference) can/will be mutated, I would like the framework to clearly indicate this without me having to know/remember the detailed function signature. This could be solved with another counter in the stack-data, independent of the book-keeper, in the future.
For now we don't recommend passing var UserRecord as a parameter type!
- expectMutation: any attempt by a supposed reader reference to read mutated memory is detected. The purpose of this primitive is to assure the framework that the user is aware they are reading the desired, mutated memory, and should be allowed to read the current snapshot of memory at this point in time.
- delActiveReference: during reallocation, the framework complains if there are still references to the memory being reallocated. Either the user is aware of them, in which case this can simply be called before the reallocation is triggered, or the user is not aware, in which case the framework has detected a possible bug. An earlier version of the framework kind of did this in the background, and when the user tried to use the invalid reference some time in the future, it would tell them about the reallocation. That worked but was a bit complex, and I also feel each API should be explicitly called and have a semantically dedicated function; it is better to detect a bug or bad assumption as soon as possible. It can also make it faster to deallocate memory, by explicitly dropping unrequired references (see the sketch after this list).
- makeMeASlice: allows the creation of a new reference from an existing non-slice parent reference, and can easily create many such references to be used across different threads (at almost no cost). These are recommended and provide more fine-grained memory read/write checks.
- clone: an explicit fresh copy of a record at some point in time.
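A small sketch of the delActiveReference flow from the list above, using only the framework calls shown in this post (the manual grow loop stands in for whatever reallocation strategy a data-structure actually uses; exact signatures are assumptions):
# Sketch only: "grow" a record by hand while a second reference is still live.
var old = allocRecord[int](n_elements = 4)
makeMeAWriter(old)
for i in 0 ..< 4:
  old[i] = i

var extra = old                   # a second (reader) reference

var grown = allocRecord[int](n_elements = 8)
makeMeAWriter(grown)
for i in 0 ..< 4:
  grown[i] = old[i]

# Before the old record is released/reallocated, outstanding references should be
# dropped explicitly; otherwise the framework flags them as a possible bug.
delActiveReference(extra)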
Thread-Safety:
We should first try to define thread-safety in the context of this framework. We will concern ourselves with one main condition which, if and when ensured, will let us call the user code thread-safe.
Terminology:
- task: can be understood as a sequence of instructions to be executed on/by any of the concurrent-units.
- concurrent-unit: any abstraction responsible for executing one or more instructions, which can be interrupted and then resumed again. We assume that once assigned a sequence of instructions, it executes them in order. OS threads, green threads and virtual threads are examples of such abstractions. We assume unique ids exist to differentiate each concurrent-unit exposed by the language/OS runtime.
- concurrent-environment: assumed to be active as long as more than one concurrent-unit is active.
For this discussion, we concern ourselves only with read/write instructions capable of manipulating shared memory.
Any byte (associated with a record/chunk) must only be read during execution by a concurrent-unit if every write to it, at any point in time while the concurrent-environment is active, came from that same concurrent-unit.
The above statement just suggests one way of detecting race-conditions, but it cannot account for the non-deterministic runtime "scheduling" of concurrent-units (which is what makes thread-safety so tricky to resolve in reasonable time for all scenarios).
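A simplified sketch of that detection rule, assuming a flat table from byte address to the first concurrent-unit that wrote it (the real book-keeper tracks this per record and per reference):
import std/tables

type UnitId = int

var firstWriter: Table[int, UnitId]   # byte address -> concurrent-unit that first wrote it

proc noteWrite(address: int, unit: UnitId) =
  # the first unit to write a byte becomes its owner for the rest of the
  # concurrent-environment
  if address notin firstWriter:
    firstWriter[address] = unit

proc checkRead(address: int, unit: UnitId) =
  # reading a byte written by a different unit violates the condition above
  if address in firstWriter and firstWriter[address] != unit:
    echo "possible race-condition on byte ", address,
      ": written by unit ", firstWriter[address], ", read by unit ", unit

noteWrite(0x1000, 1)   # TaskB (unit 1) writes location L
checkRead(0x1000, 0)   # TaskA (unit 0) reads it -> flagged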
I think we could build machinery capable of detecting any race-condition, if only we could somehow assign/schedule each task on an independent concurrent-unit. I think it is possible to do so without modifying any of the compiler or language runtime.
Say we have two tasks, TaskA and TaskB, scheduled (by the OS or a user-space scheduler) on two concurrent-units, and assume there exists at least one race condition, such that a memory location L written during TaskB is read during TaskA without any barrier/sync protection, at some point while the concurrent-environment is active. We can visualize some of the schedules possible for those two tasks at runtime.
For scenarios A and C, an implementation based on the above definition would detect a race-condition: memory location L is read and written from different concurrent-units (the actual order doesn't matter), which violates the definition.
Since L was chosen arbitrarily, it is easy to prove that for scenarios A and C a race-condition would be detected for any such location (by the principle of mathematical induction).
We can also prove that the definition will always be violated unless both TaskA and TaskB are scheduled on the same concurrent-unit.
Following both observations, we can infer that if each task can be scheduled on an independent concurrent-unit, we can always detect race-conditions in a reasonable O(n) time, without needing to run any (intensive) simulations.
I am yet to fully test this hypothesis. If it is sound and I am not missing something obvious, it should be easy enough to schedule each task on an independent concurrent-unit, at least during safety checks, by using a bare-minimum scheduler (without any dependence on compiler or language runtime code). But for now, our thread-safety implementation almost detects race-conditions :)
Being able to detect/indicate race-conditions with high probability could make it possible to leverage data-structures in unexpected scenarios too. A PL compiler treating a data-structure as an indivisible entity, something to be either fully moved/owned by one thread or not at all, may not be the best strategy.
import millets

type
  MyObject = object
    shard1: Memory
    shard2: Memory
    ...

proc updateShard1(a: MyObject) =
  var b = a              # new reference.
  b.shard1[..] = ..      # Mutates!
  ....

var x1: MyObject = (...)   # initialize a new object / Memory.
# -------------------------
milSpawn updateShard1(x1)
x1.shard2[..] = ....       # update shard2 concurrently with shard1.
milSpawnSync()
# -----------------------------------------
# Ok, since we (the owner) modified this ourselves.
echo x1.shard2
# Error: modified by the other reference, we would be reading mutated memory.
echo x1.shard1
expectMutation(x1.shard1)  # We expect it.
echo x1.shard1
...
Spawn:
Our spawn leverages a very naive threadpool, which makes tasks available to its workers through a (naive) thread-safe queue.
There is huge literature experimenting with different forms of concurrency, some of them are nicely collected at: https://github.com/mratsim/weave-io-research/blob/d12c0b5/research/README.md
A minimal spawn like functionality is included, to make it easier to just schedule Nim procedure on one of the worker threads and possibly get speed up in multi-threaded environments.
Each PL has some defaults on how to handle ownership of the involved variables/data being accessed by the spawned procedure. Nim by default has value semantics, so deep copies are generally made for the involved variables. But Nim lacks (detailed) documentation for most of the concurrency primitives it exposes, so we write our own minimal code for better debugging and optimization opportunities!
Many programming languages have this kind of functionality to let users easily schedule native functions/procedures this way. But to make it seamless to pass a native routine/procedure to be run on a new thread/green-thread, most implementations do extra work in the background, which generally involves analysis/capturing of the required environment to prevent possible side-effects that could occur by naively accessing shared variables/memory in a concurrent environment.
Languages like Erlang/Go are quite popular for being able to run a large number of light (green) threads/coroutines, making it very easy to write network (I/O) applications. The general (or only allowed) way to pass information among such concurrent units is "message-passing", which is safer (but less performant due to the copying involved) and (relatively) easier to implement.
Rust bravely allows sharing memory (preventing full copies) among threads and can prevent race-conditions (a non-deterministic order of reads/writes to/from a physical memory location) by extending and enforcing ownership rules across threads too. It is almost impossible for any language/framework to prevent deadlocks resulting from user-specific code; as far as I understand, Rust cannot prevent those either, nor other complex bugs which may be very difficult to reproduce!
Building upon the initial point, capturing variables/data when spawning a closure is generally the first part of running that closure/procedure in a new thread. The following discussion should be valid for all types of concurrent-units!
We aim to carefully understand what should happen to a variable/data being moved/copied to another thread (concurrent-unit). The concurrent-unit is mainly concerned with a (user-space) function pointer it can jump to and start executing, in a rough sense. The framework/language has to handle everything else, like "serialized" access to the data/environment the procedure requires, injecting the necessary code (or asking the user to), preventing race-conditions, detecting any immutability violation, etc.
In simple terms, a language like Rust can robustly transfer the required environment's ownership to a new thread while fulfilling many/all of the conditions mentioned above. But an interesting question is: what does a language like Rust "sacrifice" in doing so?
Assuming this kind of "ownership" transfer works, we may have to recreate the "stack" for the new thread to continue from. Technically we can communicate with this nascent thread through shared/heap memory, since all threads share a heap within an (OS) process. Below is a very minimal way to do this:
proc test(data: tuple[x: int, y: float]) =
  echo int(data.x) + int(data.y)
milSpawn test((x: 42, y: 3.14))   # run `test` on a new thread (or one of the worker threads) with the desired data tuple.
This leads to code generation like this at compile time.
This is interesting since temp_ptr is just a pointer, and Nim generally offloads all raw-pointer shenanigans to the user. When temp_ptr goes out of scope, Nim will not call =destroy or anything else, since it is just a raw pointer and Nim wouldn't know what to do with it. It may be considered unsafe, and it puts some cognitive load on the developer to remember such details.
let temp_ptr = cast[ptr `data_arg_type`](c_malloc(csize_t(sizeof(tuple[x:int, y:float])))) # allocate some memory on shared Heap .
temp_ptr[] = (x:42, y:3.14) # should trigger `=copy` or `=sink` . (Both should be OK.)
# actual call to create a posix thread or putting data on a thread safe queue !
# temp_ptr goes out of scope ...
We allocate some memory to store the data tuple (or any argument passed to a spawned procedure).
In this exact case temp_ptr[] = (x: ...) triggers the copy of the data/environment into the newly allocated memory. Note that in Nim's case, even though temp_ptr is a raw pointer, temp_ptr[] still type-checks and hence triggers =copy (deep copy) or =sink (move of ownership), as it would for a normal instance on an assignment statement. We do it this way rather than an out-of-band copyMem because we need to make sure the ref-count is incremented before the actual spawning/creation of the new thread, to prevent any untimely release of resources!
So if =copy was triggered, it is the equivalent of creating a new object or increasing the reference count; if =sink was triggered, it basically means we transferred ownership to temp_ptr. And since nothing happens when temp_ptr goes out of scope, we now owe one extra copy or reference to account for. The other half happens like this in the nascent thread's context:
# in nascent thread..
var data_z:tuple[x:int, y:float]
copyMem(addr data_z, temp_ptr, csize_t(sizeof(tuple[x:int, y:float])))
# this data_z is now used by `test` procedure.
In reality we have to create a (uniform) wrapper for each procedure definition being spawned, so that the shared data-queue can have a rigid type and the code can be type-checked. temp_ptr, the data pointer, is one of the arguments of the wrapper and is hence available in the spawned thread.
We do an out-of-band copy to recreate the calling thread's (necessary) stack in the callee's stack, effectively transferring ownership to the callee thread. Note that we generally only need to copy stack data, as pointers are stored on the stack and the heap is already shared. Even in the case of a deep copy of some native data-structure, the pointer info in the calling thread becomes inaccessible as soon as temp_ptr goes out of scope, effectively making data_z the single owner of that instance of the data-structure.
Now when data_z goes out of scope in the nascent thread, the corresponding =destroy is called and we no longer owe that extra reference. Another cool thing is that if the second half of the code never runs (i.e. the spawned function is never scheduled), some particular record/memory is never released, as its reference count never reaches zero, tripping the allocator, which will then complain about memory missing from its records!
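Putting the two halves together, the per-procedure glue could look roughly like this (a hypothetical sketch; the real milSpawn generates an equivalent wrapper at compile time and enqueues it, together with the data pointer, on the thread-safe queue):
type SpawnArgs = tuple[x: int, y: float]

proc test(data: SpawnArgs) =
  echo int(data.x) + int(data.y)

# A uniform signature so every task fits the same slot in the work-queue.
proc testWrapper(raw: pointer) {.gcsafe.} =
  var data_z: SpawnArgs
  copyMem(addr data_z, raw, sizeof(SpawnArgs))   # recreate the needed stack data
  test(data_z)        # `=destroy` for data_z runs when the wrapper returns
  # the staging buffer behind `raw` can now be freed with the matching free call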
Currently milSpawn doesn't analyze the environment and just expects the user to pass all the required references/data through function parameters. So it is more rigid (but more explicit), and a casual user may not like this behaviour. Nim can analyze this, and I think it provides a way for the user to inject a corresponding closure signature through anonymous procs, but for now the framework prevents the closure calling convention.
A Rust example:
use std::thread;

fn main() {
    let v = vec![1, 2, 3];
    let handle = thread::spawn(|| {
        println!("Here's a vector: {v:?}");
    });
    handle.join().unwrap();
}
Main error seems to be:
error[E0373]: closure may outlive the current function, but it borrows `v`, which is owned by the current function
As far as I understand, this example should have compiled from a casual user's perspective like mine. But the error indicates that the closure may outlive v (the vec) because it borrows v for println, and v is owned by the main thread (hence could go out of scope before println gets to use it). But we can note that v cannot go out of scope (at least in this piece of code), as the main thread will be waiting for the spawned thread to finish.
I assume the compiler does its analysis only at boundaries like spawn, function calls, etc., not on every line of code (which makes sense), or that it doesn't/can't have a global view of the code to rule out some obvious false-positive scenarios!
But this code runs without any errors.
use std::thread;
use std::time::Duration;

fn main() {
    // let v = vec![1, 2, 3];
    thread::spawn(|| {
        let v = vec![1, 2, 3];
        thread::sleep(Duration::from_millis(2000));
        println!("Here's a vector: {v:?}");
    });
    let millis = Duration::from_millis(10);
    thread::sleep(millis);
    // handle.join().unwrap();
}
Almost always the new thread won't even get to run, but there is not even a warning from Rust about possibly bad/redundant code; it just compiles without any warning or error in my experiments. I think the main reason is that it is almost impossible to detect (more accurate) violations of borrow promises at compile time in reasonable time. The handle.join() call is decoupled from any of the borrowing information, even if the compiler could include the full function graph in its analysis. At compile time it becomes too big a problem to solve for more optimal solutions, and this clearly begs the question: could a (low-latency) runtime framework be more helpful in such scenarios?
import os
import millets # milCreateThread, milJoinThread # bookkeeper comes into scope
var thr:Thread[void]
proc simple_thread(){.thread.} =
sleep(5000)
echo "Exiting.."
# create a thread (we can detect concurrent environment is active)
milCreateThread[void](thr, simple_thread)
sleep(1000)
# milJoinThread(thr) # We wait for `simple_thread` to finish. (concurrent environment is inactive after this.)
# At this point, BookKeeper will go out of scope and will do final Checks.
# If we ever forget to call `join` or `sync`, the book-keeper will flag this as an error: possible `zombie` threads present!
In this case, since we can reliably detect the concurrent environment at runtime, it is much easier for the framework to indicate such errors.
This is just one example where the current implementation seems to follow the correct behaviour; as the framework gets stress-tested on bigger codebases it may fail to uphold its guarantees, as I don't have a proper mathematical model like type-theory to back it up, but that would be interesting in itself!
A much bigger discussion around the best possible semantics for memory-safety and multi-threading bugs is possible, and this blog post may not be the best platform for it, as I myself have a lot to learn. But kudos to Rust for at least making it easier to write (more) performant multi-threaded code by sharing memory, rather than just "suggesting" users do message-passing; languages like Go, with millions in funding, could do much better at detecting and indicating race-conditions!
For now, the thread-safety checks cannot take into account the presence of a user-provided barrier/sync to do away with false-positive race-condition detections. A spawn -> sync -> spawn sequence can model many such scenarios (see the sketch below), but the idea for the future is to have a dedicated primitive, like userDefinedBarrier, to further cement race-condition detection.
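The spawn -> sync -> spawn pattern mentioned above, sketched with the primitives used in this post (milInitThreadPool, milSpawn, milSpawnSync and MySeq.view); the worker procs and their bodies are placeholders:
import millets

proc phaseOne(buf: MySeq[int]) {.milSpawnArgs.} =
  for i in 0 ..< len(buf):
    buf[i] = i            # each task writes only its own view

proc phaseTwo(buf: MySeq[int]) {.milSpawnArgs.} =
  discard                 # reads what phase one wrote

milInitThreadPool(4)
var buf = newMySeq[int](init_size = 1024)

milSpawn phaseOne((buf: buf.view(offset = 0, len = 512, writable = true)))
milSpawn phaseOne((buf: buf.view(offset = 512, len = 512, writable = true)))
milSpawnSync(true)        # acts as the barrier: all phase-one writes are done

milSpawn phaseTwo((buf: buf.view(offset = 0, len = 1024)))
milSpawnSync(true)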
We can only check the memory locations actually accessed while safety checks are ON, hence we can prove thread-safety for those locations only. In production it is possible to encounter a location (never accessed during the checks) that triggers a race-condition; such cases bypass the thread-safety guarantees.
Data Structures:
Now we seem to have all the necessary abstractions and theory to create a minimal sequence/vector (a dynamically growable data-structure) that leverages our memory abstraction (UserRecord), and that we can hope to use in a more practical example.
type
  MySeq[T] = object
    offset: Natural = 0
    len: Natural = 0
    # memory-abstraction!
    memory: UserRecord[T]

proc `=destroy`[T](obj: MySeq[T]) =
  when T is SomeNumber:
    `=destroy`(obj.memory)
  else:
    for i in 0 ..< obj.len:
      `=destroy`(obj.memory[i])
    `=destroy`(obj.memory)

proc newMySeq[T](init_size: Natural = 16): MySeq[T] =
  result = default(MySeq[T])
  result.memory = allocRecord[T](init_size)
  makeMeAWriter(result.memory)
  result.len = init_size
  result.offset = 0
  return result

proc view[T](obj: MySeq[T],
             offset: Natural,
             len: Natural,
             writable: bool = false
            ): MySeq[T] =
  # A native, (almost) no-cost view over the underlying memory!
  result.len = len
  result.offset = obj.offset + offset   # a slice of a slice is also possible!
  result.memory = obj.memory.slice(
    offset = result.offset,
    len = result.len,
    writable = writable
  )
  return result
...
Below is an example that sums up 10 million int64 elements. The following code compiles nicely on Android too.
# Initialize the thread pool.
let n_workers = 8
milInitThreadPool(n_workers)

# Sample data.
var data = newMySeq[int](init_size = 10_000_000)
for i in 0 ..< len(data):
  data[i] = i

proc sumSlice(input: MySeq[int], output: MySeq[int]) {.milSpawnArgs.} =
  # Sums the data in `input` and stores the result in `output`.
  var sum: int = 0
  # makeMeAWriter(output)   # for a slice this can be skipped!
  for i in 0 ..< len(input):
    sum += input[i]
  output[0] = sum

proc sum(input: MySeq[int], n_threads: int): int =
  var output = newMySeq[int](n_threads)
  var work_acc = 0
  for i in 0 ..< n_threads:
    # get enough work for each task/thread.
    let temp_range = getThreadWork(
      len(input),
      i,
      n_threads
    )
    let work_size = temp_range.b - temp_range.a
    # get the corresponding slices/views for `input` and `output`.
    var inp_slice = input.view(
      offset = work_acc,
      len = work_size
    )
    var out_slice = output.view(
      offset = i,
      len = 1,
      writable = true
    )
    work_acc += work_size   # update the accumulator
    milSpawn sumSlice((input: inp_slice, output: out_slice))
  milSpawnSync(true)   # wait for all tasks to complete!
  for i in 0 ..< n_threads:
    result += output[i]
  return result
# PyTorch (2.5.0)
# Sum 10 million int64 elements.
import torch

# The OpenMP backend is supposed to be used for parallel reduction ops,
# rather than MKL/BLAS, based on my understanding of ATen/Parallel.h!
print(torch.__config__.parallel_info())
# ATen/Parallel:
#   at::get_num_threads()         : 4
#   at::get_num_interop_threads() : 4
# OpenMP 2019
#   omp_get_max_threads()         : 4
# Intel(R) oneAPI Math Kernel Library Version 2024.2.2-Product Build 20240823 for Intel(R) 64 architecture applications
#   mkl_get_max_threads()         : 4
# Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
# std::thread::hardware_concurrency() : 8
# Environment variables:
#   OMP_NUM_THREADS : [not set]
#   MKL_NUM_THREADS : [not set]
# ATen parallel backend: OpenMP

# data
data = torch.randint(high=1_000_000, size=(10_000_000,))
print(data.dtype)   # int64

torch.set_num_threads(1)
%timeit torch.sum(data)
torch.set_num_threads(2)
%timeit torch.sum(data)
torch.set_num_threads(4)
%timeit torch.sum(data)
Summing 10 million int64 elements , time in milliseconds.
Without the spawn/threadpool, the best performance I could get was 2.80 milliseconds with 4 threads for a pure fork-join pattern (on Windows).
For Android (OnePlus tablet), the best I could do was 5.92 milliseconds with 2 threads, compared to 6.24 milliseconds for a single thread, suggesting very high context-switching costs compared to desktop platforms. I currently don't fully understand how Android handles scheduling for big.LITTLE-like chipsets!
| | 1 thread | 2 threads | 4 threads |
|---|---|---|---|
| PyTorch | 3.90 | 3.15 | 2.58 |
| Ours | 3.90 | 3.15 | 2.55 |
(It seems we achieve parity with PyTorch, which passes the initial litmus test for all modules involved in the current implementation.)
Allocator:
To make this all work, a minimal memory allocator is implemented. It keeps track of all records through a recordIdentifier (a combination of pointer and size), which is supposed to be returned during each deallocation. Further, a payload counter is used to unambiguously differentiate between stale record identifiers (preventing double-free scenarios).
Since the payload counter is incremented on each allocation call (reliably, even across threads), it is possible that at some point during execution the same (physical) memory chunk just deallocated/released is handed out again by the next allocation call. If some stale stack/meta-data then tries to free it, we can detect this independently of the book-keeper!
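A condensed, self-contained sketch of how a payload counter disambiguates stale record identifiers (names are hypothetical; the real allocator keeps this per record, inside its arena blocks):
import std/tables

type RecordId = tuple[p: pointer, size: int, payload: uint32]

var payloadCounter: uint32
var live: Table[pointer, uint32]      # address -> payload of the currently live record

proc allocRecordSketch(size: int): RecordId =
  let p = alloc0(size)
  payloadCounter += 1
  live[p] = payloadCounter
  (p: p, size: size, payload: payloadCounter)

proc deallocRecordSketch(id: RecordId) =
  # A stale identifier either points at memory we no longer track, or carries a
  # payload from a previous incarnation of the same address; both cases are
  # caught here, independently of the book-keeper.
  doAssert id.p in live and live[id.p] == id.payload, "double-free / stale record detected"
  live.del(id.p)
  dealloc(id.p)

let a = allocRecordSketch(64)
deallocRecordSketch(a)
# deallocRecordSketch(a)   # a second call would be flagged as a double free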
Allocator is fully independent of all other modules, it should be quite easy to swap with more efficient/domain-specific memory allocators.
Technically, the allocator is made up of arena blocks with pre-determined sizes, starting at 1 MB, then 2 MB, 4 MB and so on, so it is fast and easy to find the right starting block for a new memory allocation request. Once such an arena block is initialized, we do not return it to the OS (even if fully unused) until a bigger block has already been initialized.
This has some shades of a generational GC: a lot of (recent) frequent allocations mostly come from a single block, and if they are only required for a very short time (without a more permanent allocation in between), arena-like semantics are enough, and fast enough, to support such bursts of allocations.
In my limited experiments I have found "fragmentation" to be somewhat proportional to the number of memory allocations; if we reduce such allocations in the first place, this really helps reduce "fragmentation", making it easier for the (arena-inspired) allocator to reclaim a good percentage of allocations. If the user is already careful about allocations and prefers buffer reuse, such programmes generally end up faster, with predictable memory bounds.
Minimal Ref counting:
Each record has its own atomic reference counter, as we only need to know when the counter becomes zero to deallocate the corresponding resources. All book-keeping code is replaced by this minimal ref-counting logic for release/production builds.
If the book-keeper passes all checks, coupled with the allocator's independent checks, this (atomic) ref-counting (should) just work seamlessly. Atomics can be a bit tricky to get right depending on the implementation and the API exposed, but for now it seems to work alright, and it can easily be replaced with an (independent) lock to find any bug related to the atomics logic.
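A minimal sketch of what that release-build counting can look like, using std/atomics (field and proc names are illustrative, not the framework's actual API):
import std/atomics

type ReleaseRecord = object
  base: pointer
  refCount: ptr Atomic[int]      # shared by every reference to this record

proc newReleaseRecord(size: int): ReleaseRecord =
  result.base = allocShared0(size)
  result.refCount = createShared(Atomic[int])
  result.refCount[].store(1)     # the foremost reference

proc retain(r: ReleaseRecord) =
  # a relaxed increment is enough: only the final count has to be exact
  discard r.refCount[].fetchAdd(1, moRelaxed)

proc release(r: ReleaseRecord) =
  # whichever reference drops the count to zero frees the record
  if r.refCount[].fetchSub(1, moAcquireRelease) == 1:
    deallocShared(r.base)
    deallocShared(r.refCount)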
I haven't found a way to get scope information about an identifier so as to inject the destroy call at only the required location, but I think it may be possible to implement (atomic) reference counting at compile time using just macros; this is an exercise for the future!
For scope-based ARC with multiple instances/references, =destroy calls, if not inlined, can add significant overhead depending on the =destroy implementation. Note also that for our framework the total induced latency is highly correlated with the number of non-slice references created for a record during its lifetime; if the user leverages slice references judiciously, a lot of this overhead can be mitigated!
In the end, the allocator will correctly detect leaked memory if any of the reference-counting logic is incorrect. This feature has correctly caught such leaks during development and refactoring of the framework itself, convincing me I am on the right path!
Naive Cost-Analysis:
Including any kind of runtime always results in some overhead; the only thing that matters is whether that overhead is worth it for a particular application.
With scope-based memory management, and depending on the actual structure of the code, the user may end up with a significant number of finalizer/destroy calls. Each such call can be argued to have two costly elements: decrementing the ref-count, and, if the reference count for a record becomes zero, releasing the corresponding resources.
Assuming we manage to keep allocations to the optimum, most of the (undesired) cost is due to (atomic) ARC. It is also possible to reduce the number of ARC calls by leaning into lent/borrow semantics, aka slices.
For equivalent (correct) code written in a manually memory-managed language, the user would have made allocations/copies only where necessary and would include the corresponding free calls at the right time. But the timeline of free calls matters: an untimely avalanche of destroy calls may end up producing undesired latency, though it is still predictable with scope-based MM.
I think the only way to measure the gains is to port a big enough real-life codebase, which I am in the process of doing; only then can I be sure of the exact practical benefits it offers for more generic codebases. I intend to publish a more detailed analysis, and the practical issues encountered, in the next post. But seeing it in action, working as intended in the few scenarios I tested, makes me hopeful :)
Despite the benefits, data-structures based on this memory management cannot interact directly with standard-library data-structures or code without some kind of marshalling, which would involve copying back and forth, exactly what we have been trying to eliminate. This is one of the reasons it remains more of a personal project, as I need to port the corresponding routines from the Nim standard library or write new ones from scratch. The Nim standard library is very readable, and I have found it quite easy to port some existing code in order to reduce allocations/reuse memory in a hot loop; having some standard/tested data-structures will be very convenient for all the future code I intend to write.
Final thoughts:
The core implementation was guided by the premise of reducing redundant allocations, by analyzing code for mutated memory: if memory was indeed mutated with other active references present, then cloning is suggested, and indeed required, for that specific code flow! Further, slices were introduced to share memory as much as possible without violating the basic conditions established earlier for memory-safety. I think slices coupled with thread-safety have made it much easier to leverage multiple cores with minimum effort, even in a general-purpose language, at least for me. I had been using raw pointers to share data among threads (it gets tiring/unsafe as a codebase evolves), and some things, like creating substrings without copying, were still not possible.
To be fair, Nim allows -d:useMalloc to quickly check if the default allocator is causing problems, but it comes with a significant penalty for any codebase leveraging dynamic memory.
I also really like primitives like inlined by-value structs/objects, ensureMove and expandArc, which make it easy to understand compiler decisions/optimizations at the pure language level.
Various competing Nim primitives like isRefUnique and isolated[T] are still incomplete and sometimes even misleading, with no warning. Even the allocator's concurrency issues are not mentioned in the official documentation, and I have spent my fair share of time debugging that.
But I have come to love the language, which just invites you to do whatever you want if you know the risks. I respect all the effort that has gone into a language already at version 2.0, without any big sponsors. I love the community, which has been hugely helpful; I believe most avid contributors in the open-source community are trying to do their best!
I understand that I will encounter various edge cases and practicality issues as I start using data-structures based on this memory management in bigger codebases, but this is something I intend to improve upon. It is part of the grind to make it stable enough for me to use every day without thinking!
And as I mature as a programmer, I would like to work on practical, real-life problems, like detecting a tennis ball or a badminton player with off-the-shelf cameras, for which a computer language plays a necessary but minimal part; it is just one piece of a much bigger puzzle.
It is okay for me if I am not able to model my intentions purely using a programming language, as long as some hacking puts me past the finish line. Don't we all secretly aspire to be hackers in the first place?
Resources:
I like to read about various programming languages' memory-management strategies, as they are moulded by practical requirements and trade-offs rather than purely theoretical approaches.
- Excalidraw was used for diagrams.