Detecting CPU Hardware Features:

Detecting the underlying CPU features is generally a prerequisite for writing optimized code that tries to squeeze out every possible bit of performance. It is usually the last step in optimizing an algorithm, after leveraging more generic capabilities like multi-threading. It involves understanding the computing architecture at the level of chip instructions, a level deeper than what the OS has to offer. Being able to detect hardware features allows us to dispatch an optimized routine rather than a generic one. Some libraries (like oneDNN) even use a JIT compiler to generate an optimized implementation of an operation at runtime, rather than shipping every possible implementation.
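To make the dispatch idea concrete, here is a minimal sketch. The names dotGeneric, dotUnrolled, DotProc and selectDot are hypothetical, and the "optimized" kernel is only a stand-in with four independent accumulators, not real SIMD code. The hasWideSimd flag is assumed to come from whatever detection routine is available.

```nim
# Hypothetical kernels: a portable fallback and a stand-in for a SIMD version.
proc dotGeneric(a, b: seq[float32]): float32 =
  for i in 0 ..< min(a.len, b.len):
    result += a[i] * b[i]

proc dotUnrolled(a, b: seq[float32]): float32 =
  # Stand-in for a vectorized kernel: four independent accumulators.
  let n = min(a.len, b.len)
  var acc: array[4, float32]
  var i = 0
  while i + 4 <= n:
    for k in 0 ..< 4:
      acc[k] += a[i + k] * b[i + k]
    i += 4
  result = acc[0] + acc[1] + acc[2] + acc[3]
  # Handle the remaining elements that did not fill a block of four.
  while i < n:
    result += a[i] * b[i]
    inc i

type DotProc = proc (a, b: seq[float32]): float32 {.nimcall.}

proc selectDot(hasWideSimd: bool): DotProc =
  # Pick the implementation once, based on a detected capability flag.
  if hasWideSimd:
    result = dotUnrolled
  else:
    result = dotGeneric
```

The selection would typically happen once at startup, with the returned proc stored and reused, so the feature check stays out of the hot loop.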

We will also look at some related topics, like extended assembly syntax, on the way to being able to detect such features.

Extended assembly syntax (EAS):

To get chip-level information we may need to use assembly code directly, bypassing the compiler. EAS makes it a bit easier to include occasional assembly code. But it does come with some quirks, as we need to do extra work to let the compiler understand our intentions with the assembly code we wish to execute. The idea with EAS is generally to use C variables directly in the assembly code to keep it simple.

The thing to note here is that the compiler never parses the assembly code; it just substitutes the operands. Since it never parses the instructions, it cannot know their side effects, or even whether they are correct. EAS lets us tell the compiler which variables are modified (written) or used (read) by the assembly code. So it is better to always list all the input and output operands clearly and exhaustively. We still need to understand more about the constraints that we can associate with the operands.

Constraints let us narrow down the operand forms that an instruction already allows. For example, the x86 add instruction cannot take two memory operands; using constraints we can select which variable should be in a register and which one comes from memory. Since variables may initially reside in memory, extra instructions beyond the assembly template may need to be generated to fulfill a constraint.

Take a look at the example below.

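proc sum_2():int {.exportc.} = 
    var
        a:int = 9
        b:int = 11

    # b = a + b        # should be 20.
    # return b

    # AT&T operand order: add source, destination (b += a).
    # b is both read and written by add, hence "+r" instead of "=r".
    asm """
      "add %1, %0\n\t"   
      :"+r"(`b`)         
      :"r"(`a`)
    """
    return b

Using the =r constraint for b doesn't behave as I expected: =r declares b as write-only, so the compiler is free to skip loading its initial value into the register, while add both reads and writes its destination. The + modifier (+r) declares a read-write operand and fixes this. Using =m appeared to work for me only because the template then operated directly on b's memory location, which still held 11; the compiler is not obliged to keep that value there for a write-only operand.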

My current understanding:

The compiler generates the assembly code by substitution only. Invalid combinations, like the m constraint on both a and b (add cannot take two memory operands), are caught later by the assembler.
Based on the constraints it may generate extra instructions in the final assembly; for example, using m on b makes sure the result is written back to memory.
The compiler needs to know which input and output operands are used, so that it doesn't optimize them away or reuse their registers.
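The register-versus-memory choice described above can be sketched as follows. This is only an illustration, assuming x86-64 and a GCC-compatible C backend; the proc and variable names (addFromMemory, total, addend) are hypothetical.

```nim
# Assumes x86-64 and a GCC-compatible backend (GNU extended asm).
proc addFromMemory(acc, x: int): int =
  var total = acc    # read and written by the template, hence "+r"
  var addend = x     # "m" asks for a memory operand, so use an addressable local
  asm """
    "add %1, %0\n\t"
    :"+r"(`total`)
    :"m"(`addend`)
    :"cc"
  """
  # "cc" tells the compiler that add modifies the flags register.
  return total
```

Here the compiler has to place total in a register and refer to addend through its memory location, so the substituted instruction becomes an add with a memory source and a register destination.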

For now I keep each asm block inside a dedicated function, rather than mixing it in (inlining), and test that function to check that it works as intended. I used to commit the variables I wish to return to memory with the =m constraint; declaring each operand with the modifier that matches its use (for example + for a read-write operand) is the more robust approach.

I intend to update the preceding paragraphs as my understanding improves.

The best way to learn more about this is the official GCC documentation.

CPUID:

cpuid is an instruction available on most x86 systems, which allows querying chip/processor information in a standard way, even on much older x86 systems. We can use it to detect features like AVX and FMA support. The Wikipedia page on CPUID is quite thorough and contains enough information to understand and use this instruction for our use case.

From the Wikipedia page itself:

The CPUID instruction takes no parameters as CPUID implicitly uses the EAX register to determine the main category of information returned.

There can be a lot of information associated with a processor, so it is divided into categories (leaves) referred to by numerical IDs. First we need to ascertain the maximum supported leaf. It can be obtained by setting the eax register to 0 and reading eax back after executing the cpuid instruction, as shown below.

proc cpuinfo(eaxi, ecxi: int32): tuple[eax, ebx, ecx, edx: int32]=
    # eaxi: the leaf (category) to query.
    # ecxi: the sub-leaf; only some leaves use it, otherwise pass 0.

    var eaxr, ebxr, ecxr, edxr: int32
    # cpuid overwrites eax, ebx, ecx and edx, so each result is declared as an
    # output bound to its register ("a", "b", "c", "d"). This also tells the
    # compiler that all four registers are modified.
    asm """
      "cpuid\n\t"
      :"=a"(`eaxr`), "=b"(`ebxr`), "=c"(`ecxr`), "=d"(`edxr`)
      :"a"(`eaxi`), "c"(`ecxi`)
    """
    return (eaxr, ebxr, ecxr, edxr)

let (eax, ebx, ecx, edx) = cpuinfo(eaxi = 0'i32, ecxi = 0'i32)
echo eax        # 22 on my system

proc cpuNameX86Basic():string =
  # Get the vendor string of the CPU from leaf 0.
  let (eax, ebx, ecx, edx) = cpuinfo(eaxi = 0x00000000'i32, ecxi = 0)
  # The 12 characters are stored in ebx, edx, ecx (in that order).
  # A trailing zero word is added so the bytes can be read as a cstring.
  var temp = (ebx, edx, ecx, 0x00000000'i32)
  result = $cast[cstring](addr temp[0])
  echo "basic max leaf: ", int(eax)

echo cpuNameX86Basic()  # would print GenuineIntel on my system.
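
The same vendor string can also be assembled byte by byte, which avoids the pointer cast. A minimal sketch, assuming the register values are passed as uint32 (results from cpuinfo above would need a cast[uint32](...)); the decodeVendor name is hypothetical.

```nim
proc decodeVendor(ebx, edx, ecx: uint32): string =
  ## Leaf 0 returns the 12-byte vendor ID in EBX, EDX, ECX (in that order).
  ## Each register holds four ASCII bytes, least significant byte first.
  result = newStringOfCap(12)
  for reg in [ebx, edx, ecx]:
    for shift in [0, 8, 16, 24]:
      result.add char((reg shr shift) and 0xFF)
```

Called with the values an Intel processor reports for leaf 0 (0x756e6547, 0x49656e69, 0x6c65746e), this yields the string GenuineIntel.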

My processor is from the Skylake family, on which the highest implemented function (leaf) is reported as 0x16, i.e. 22.
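
The maximum leaf matters because a leaf above it must not be trusted. As a sketch of how that guard could look, AVX2 is reported in leaf 7 (sub-leaf 0), EBX bit 5, so the check below only reads that bit when leaf 7 exists. The hasAvx2 name and its parameters are hypothetical; the two values are assumed to come from earlier cpuid queries.

```nim
proc hasAvx2(maxBasicLeaf, leaf7Ebx: uint32): bool =
  ## maxBasicLeaf: EAX returned by leaf 0.
  ## leaf7Ebx: EBX returned by leaf 7 with ECX (sub-leaf) set to 0.
  if maxBasicLeaf < 7'u32:
    return false            # leaf 7 is not implemented on this CPU
  result = ((leaf7Ebx shr 5) and 1'u32) != 0
```

This is also one of the leaves where the ecx input (the sub-leaf) is actually used.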

Generally, the availability of a feature is indicated by a single bit being set or cleared. For example, according to Wikipedia, FMA3 instruction support can be checked by querying leaf 1 and reading bit 12 of the ecx register.

let (eax, ebx, ecx, edx) = cpuinfo(eaxi = 1'i32, ecxi = 0'i32) 

proc check_bit(register:int32, bit:int32):bool = 
    # Parentheses are required: != binds tighter than `and` in Nim.
    result = (register and (1'i32 shl bit)) != 0

proc has_fma3():bool=
    let (eax, ebx, ecx, edx) = cpuinfo(eaxi = 1'i32, ecxi = 0'i32)
    result = check_bit(ecx, 12)
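
Several leaf 1 bits can be decoded in one go. The sketch below covers a few well-known ones (ECX: SSE3 bit 0, FMA bit 12, SSE4.1 bit 19, SSE4.2 bit 20, AVX bit 28; EDX: SSE bit 25, SSE2 bit 26). The type and proc names are hypothetical, and the registers are assumed to be passed as uint32.

```nim
type X86Features = object
  sse3, fma3, sse41, sse42, avx: bool   # reported in ECX of leaf 1
  sse, sse2: bool                        # reported in EDX of leaf 1

proc testBit(reg: uint32, bit: int): bool =
  ((reg shr bit) and 1'u32) != 0

proc decodeLeaf1(ecx, edx: uint32): X86Features =
  ## Decodes a handful of feature bits from the leaf 1 results.
  X86Features(
    sse3:  testBit(ecx, 0),
    fma3:  testBit(ecx, 12),
    sse41: testBit(ecx, 19),
    sse42: testBit(ecx, 20),
    avx:   testBit(ecx, 28),
    sse:   testBit(edx, 25),
    sse2:  testBit(edx, 26))
```

Note that for AVX (and FMA, which builds on it) the CPUID bit alone is not the whole story: the OS must also have enabled saving the wider registers, which is indicated by OSXSAVE (ECX bit 27) together with the value read through xgetbv. That extra check is left out of this sketch.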

Visit the Wikipedia page for more detailed information that can be extracted about the underlying processor.

Having a standard instruction makes things easy from a developer's point of view too. It can be used to make sure that the features required by an application are indeed present, or to convey useful error messages to the user if they are not.
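
A start-up check along those lines could look like the sketch below. The feature names are plain strings here, and missingFeatures and featureError are hypothetical helpers; the list of available features is assumed to have been filled from cpuid results.

```nim
import std/strutils

proc missingFeatures(required, available: openArray[string]): seq[string] =
  ## Returns the required features that were not detected, in order.
  for name in required:
    if name notin available:
      result.add name

proc featureError(missing: seq[string]): string =
  ## Builds a user-facing message, or an empty string if nothing is missing.
  if missing.len == 0:
    result = ""
  else:
    result = "this CPU lacks required features: " & missing.join(", ")
```

An application could call these once at startup and either exit with the message or fall back to a generic code path.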

GetAuxVal on ARM:

ARM systems, on the other hand, require developers to depend on the OS to expose hardware functionality. Standard special registers are indeed available on ARM chips to describe the implemented features, like ASIMD, FP16 support etc. But user-space applications cannot access those registers directly, as this requires EL1 (exception level 1): only code running at or above exception level 1 can read them. Using assembly code to read those registers directly from user space results in an illegal instruction.

I also came across some commit messages during my search indicating that the OS can trap the exceptions raised by such accesses and effectively emulate the expected behavior. But I haven't been able to see any evidence of this, at least on Android Linux: all such code resulted in an illegal instruction fault.

For now the standard approach is to get hardware feature information from the OS itself, which passes it in a vector when loading a user-space process or application. This can be used to get hardware-specific information at runtime. Commands like lscpu can also be used for a quick glance at the available hardware features.
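
To show what that vector looks like, here is a small sketch of a parser for its raw form: a sequence of (type, value) pairs of native word size, terminated by an AT_NULL entry. On Linux the raw bytes of the current process can be read from /proc/self/auxv. The proc names are hypothetical; AT_NULL (0) and AT_HWCAP (16) are the values used by Linux.

```nim
const
  AT_NULL = 0'u     # terminates the vector
  AT_HWCAP = 16'u   # hardware capability bit mask

proc parseAuxv(raw: string): seq[tuple[key, value: uint]] =
  ## Decodes native-word (type, value) pairs until AT_NULL or end of data.
  const word = sizeof(uint)
  var pos = 0
  while pos + 2 * word <= raw.len:
    var key, value: uint
    copyMem(addr key, unsafeAddr raw[pos], word)
    copyMem(addr value, unsafeAddr raw[pos + word], word)
    if key == AT_NULL:
      break
    result.add((key: key, value: value))
    pos += 2 * word

proc lookupAuxv(entries: seq[tuple[key, value: uint]], key: uint): uint =
  ## Returns the value stored for `key`, or 0 when the entry is absent.
  for e in entries:
    if e.key == key:
      return e.value
  return 0

# Possible usage on Linux:
#   let entries = parseAuxv(readFile("/proc/self/auxv"))
#   echo lookupAuxv(entries, AT_HWCAP)
```

In practice a library routine does this lookup for us, so the parser is mainly useful for understanding what is being passed.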

Specifically, glibc provides a routine called getauxval, which takes a key as its argument and returns the corresponding value.

From the Linux man page:

The getauxval() function retrieves values from the auxiliary vector, a mechanism that the kernel's ELF binary loader uses to pass certain information to user space when a program is executed. Each entry in the auxiliary vector consists of a pair of values: a type that identifies what this entry represents, and a value for that type. Given the argument type, getauxval() returns the corresponding value. The value returned for each type is given in the following list. Not all type values are present on all architectures.

For detecting CPU features we need to pass AT_HWCAP as the argument to getauxval; the returned bit-mask values are ABI and architecture specific. Since this is provided by glibc, it works on compatible UNIX systems only. I haven't looked at ways to collect such information from non-UNIX operating systems running on ARM.

Corresponding code to detect support for some hardware feature could look like this:

func getauxval(a:culong):culong {.importc:"getauxval", noDecl, header:"<sys/auxv.h>".}
var at_hwcap {.importc:"AT_HWCAP", noDecl, header:"<elf.h>".}: culong
let hwcaps = getauxval(at_hwcap) # get hardware capabilities.

{.pragma: aarch64features, noDecl, header:"<asm/hwcap.h>".}
var
    aes{.importc:"HWCAP_AES", aarch64features.} : culong
    crc32{.importc:"HWCAP_CRC32", aarch64features.} : culong       
    asimd{.importc:"HWCAP_ASIMD", aarch64features.} : culong      # neon/advanced simd instructions (mandatory on aarch64).
    fphp{.importc:"HWCAP_FPHP", aarch64features.} : culong        # float16 support, half-precision.
    asimdhp{.importc:"HWCAP_ASIMDHP", aarch64features.} : culong  # neon half-precision (generally true if half-precision (fphp) is true).

# Compare the masked value against 0 instead of converting it to bool,
# since the masked value can be any power of two, not just 0 or 1.
proc supports_aes*():bool=
    result = (hwcaps and aes) != 0

proc supports_asimd*():bool=
    result = (hwcaps and asimd) != 0
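
For logging or debugging it can help to turn the whole mask into names. The sketch below hard-codes the bit positions of the first few aarch64 HWCAP flags as defined by Linux (HWCAP_FP is bit 0, HWCAP_ASIMD bit 1, and so on up to HWCAP_CPUID at bit 11). The table and proc names are hypothetical, and real code would rather import the constants from the header as done above.

```nim
# Bit positions of the first aarch64 HWCAP flags on Linux.
const aarch64HwcapBits = [
  (0, "fp"), (1, "asimd"), (2, "evtstrm"), (3, "aes"),
  (4, "pmull"), (5, "sha1"), (6, "sha2"), (7, "crc32"),
  (8, "atomics"), (9, "fphp"), (10, "asimdhp"), (11, "cpuid")]

proc hwcapNames(mask: uint64): seq[string] =
  ## Lists the names of all known flags set in `mask`, lowest bit first.
  for entry in aarch64HwcapBits:
    if ((mask shr entry[0]) and 1'u64) != 0:
      result.add entry[1]
```

With the code above, the mask would be passed as hwcapNames(uint64(hwcaps)). The names match the ones that appear in the Features line of /proc/cpuinfo on aarch64.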

Final Remarks:

We took a look at ways to collect CPU/processor/chip information on x86 and ARM systems at runtime, with the idea of using that information programmatically. Such code doesn't depend on any third-party library, which should make it easier to compile and ship along with an application.

There is a whole lot to understand about how the different aspects at the lowest level of digital systems fit together. Sometimes I struggle to understand the actual interpretation of a statement even in the official manual, and resort to forming a hypothesis or finding some relevant code.

For ARM systems in particular it may be best to read the manual itself; the internet may not be of much help unless you already know quality resources. I used phind for most of my search, as both Google and DuckDuckGo were nearly worthless, always pointing to generic articles. You should always be able to get a link to the original source, and I like that phind at least does that, rather than shamelessly laundering somebody's code without any attribution.
