Detecting CPU hardware features:
Detecting the underlying CPU features is generally a prerequisite for writing optimized code that tries to squeeze out every possible bit of performance.
It is generally the last step in optimizing an algorithm, after leveraging more generic capabilities like multi-threading.
This involves understanding the computing architecture directly at the level of chip instructions, a level deeper than what the OS has to offer.
Being able to detect hardware features allows us to dispatch an optimized routine rather than a generic one.
Some libraries (like oneDNN) even use a JIT compiler to generate an optimized implementation for some operations at runtime, rather than shipping with all possible implementations.
We will also look at some related topics, like extended assembly syntax, in pursuit of being able to detect such features.
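To make the idea of dispatching concrete, here is a minimal sketch in Nim. All the names (SumProc, sumGeneric, sumOptimized, selectSum) are hypothetical, and the "optimized" routine is only a portable stand-in for code that would really use SIMD instructions. The feature flag is hard-coded here; in practice it would come from one of the detection methods described later.

```nim
# Hypothetical names throughout; this only illustrates the dispatch pattern.
type SumProc = proc (xs: seq[float32]): float32 {.nimcall.}

proc sumGeneric(xs: seq[float32]): float32 =
  # Portable fallback: a plain loop that runs on every CPU.
  for x in xs:
    result += x

proc sumOptimized(xs: seq[float32]): float32 =
  # Stand-in for a routine that would use SIMD instructions.
  # It handles two elements per iteration so the sketch stays portable.
  var i = 0
  while i + 1 < xs.len:
    result += xs[i] + xs[i + 1]
    i += 2
  if i < xs.len:
    result += xs[i]

proc selectSum(hasFastPath: bool): SumProc =
  # Choose the implementation once, based on a detected feature flag.
  result = sumGeneric
  if hasFastPath:
    result = sumOptimized

# In real code the flag would come from a CPUID or getauxval check.
let sumImpl = selectSum(hasFastPath = false)
echo sumImpl(@[1.0'f32, 2.0, 3.0, 4.0])
```

Selecting the routine once at startup keeps the feature check out of the hot path; callers only ever go through sumImpl.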
Extended assembly syntax (EAS):
To get chip-level information we may need to use assembly code directly, bypassing the compiler. EAS makes it a bit easier to include occasional assembly code, but it comes with some quirks: we need to do extra work to let the compiler understand our intentions for the assembly code we wish to execute. The general idea of EAS is to use C variables directly in assembly code to keep the code simple.
Note that the compiler never parses the assembly code; it just substitutes the operands. Since it never parses the instructions, it cannot know their side effects, or even whether they are valid.
EAS lets us tell the compiler which variables are modified (written) or used (read) by the assembly code, so it is better to always list all the input and output operands clearly and exhaustively.
We still need to understand more about the constraints that we can associate with the operands.
Constraints allow us to narrow down the operand forms that an instruction already permits. For example, the add instruction cannot take two memory operands, so when adding two variables at least one of them must be in a register; using constraints we can select which variable should be in a register and which one comes from memory.
Since variables initially reside in memory, extra instructions apart from the assembly template may need to be generated to fulfill a constraint.
Take a look at the example below.
proc sum_2(): int {.exportc.} =
  var
    a: int = 9
    b: int = 11
  # b = a + b  # should be 20.
  # return b
  asm """
    "add %1, %0\n\t"
    :"=r"(`b`)
    :"r"(`a`)
  """
  return b
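Using the =r constraint for b doesn't behave as I expect; I specifically have to use =m to commit b into memory before returning it. The compiler may feel no need to store b onto the stack before returning!
A likely reason is documented by GCC: the = modifier marks an operand as write-only, so the compiler is free not to load the initial value of b into the register it picks. The add instruction both reads and writes its destination, which is what the + modifier (as in +r) is meant for.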
According to me:
- The compiler generates the assembly code by substitution and some preliminary checking; for example, we cannot set the m constraint on both a and b, as the add instruction requires at least one register as an operand.
- Based on the constraints it may generate extra instructions in the final assembly; for example, using m on b will make sure the result is written back to memory.
- The compiler needs to know the input and output operands being used, so that it doesn't optimize them away or something!
For now I use this asm extension inside a dedicated function, rather than mixing it in (inlining), and test it to check that it works as intended.
I specifically commit the variables I wish to return to memory by using the =m constraint.
I intend to update the preceding paragraph as my understanding improves.
The best way to learn more about this is the official GNU GCC documentation.
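As a hedged sketch of what the GCC documentation suggests for a read-modify-write operand, the add example can be written with the + modifier on b, plus a "cc" clobber because add changes the flags. The name sum2 is hypothetical. The sketch assumes a GCC or Clang backend, and it falls back to plain Nim addition on non-x86 targets so that it still compiles there.

```nim
proc sum2(): int =
  var
    a: int = 9
    b: int = 11
  when defined(amd64) or defined(i386):
    # "+r": b is read AND written by the instruction, kept in a register.
    # "r" : a is read-only input, kept in a register.
    # "cc": add modifies the condition flags.
    asm """
      "add %1, %0\n\t"
      :"+r"(`b`)
      :"r"(`a`)
      :"cc"
    """
  else:
    # Assumption: no x86 assembler available, use ordinary addition.
    b = a + b
  return b

echo sum2()
```

With + the compiler knows it must load the current value of b before the instruction and keep the result afterwards, so no separate store to memory should be needed.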
CPUID:
cpuid is an instruction available on most x86 systems that allows querying chip/processor information in a standard way, even on much older x86 systems.
We can use it to detect features like avx and fma capabilities.
The Wikipedia page on CPUID is quite thorough and contains enough information to understand and use this instruction for our use case.
From the Wikipedia page itself:
The CPUID instruction takes no parameters as CPUID implicitly uses the EAX register to determine the main category of information returned.
There can be a lot of information associated with a processor, so it is divided into categories referred to by numerical IDs.
First we need to ascertain the maximum allowed category (leaf). It can be obtained by setting the eax register to 0 and reading the eax register after calling the cpuid instruction, as shown below.
proc cpuinfo(eaxi, ecxi: int32): tuple[eax, ebx, ecx, edx: int32] =
  # eaxi: contains the leaf level (node).
  # ecxi: the sub-leaf; sometimes used, but is generally 0.
  var eaxr, ebxr, ecxr, edxr: int32
  # The "=a", "=b", "=c", "=d" outputs tell the compiler that cpuid
  # overwrites eax, ebx, ecx and edx, and where to read the results from.
  asm """
    "cpuid\n\t"
    :"=a"(`eaxr`), "=b"(`ebxr`), "=c"(`ecxr`), "=d"(`edxr`)
    :"a"(`eaxi`), "c"(`ecxi`)
  """
  return (eaxr, ebxr, ecxr, edxr)

let (eax, ebx, ecx, edx) = cpuinfo(eaxi = 0'i32, ecxi = 0'i32)
echo eax # 22 on my system
proc cpuNameX86Basic(): string =
  # get basic name for CPU,
  let (eax, ebx, ecx, edx) = cpuinfo(eaxi = 0x00000000'i32, ecxi = 0'i32)
  var temp = (ebx, edx, ecx, 0x00000000'i32) # add null character so that cstring can be interpreted
  result = $cast[cstring](addr temp[0])
  echo "basic max leaf: ", int(eax)
  return result

echo cpuNameX86Basic() # would print GenuineIntel on my system.
My processor is from the Skylake family, on which the highest function implemented is returned as 0x16, aka 22.
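The vendor name for leaf 0 is spread over EBX, EDX and ECX (in that order), four ASCII characters per register with the lowest byte first. As a small sketch, the same decoding can be done without a pointer cast. The name vendorString is hypothetical, and the sample register values below are simply the ASCII encoding of "GenuineIntel", not values read from a live CPU.

```nim
proc vendorString(ebx, edx, ecx: uint32): string =
  # Walk the registers in the order EBX, EDX, ECX and peel off
  # one byte at a time, lowest byte first.
  for reg in [ebx, edx, ecx]:
    for shift in [0, 8, 16, 24]:
      result.add char(int((reg shr shift) and 0xFF'u32))

# "Genu", "ineI", "ntel" packed as little-endian 32-bit values.
echo vendorString(0x756e6547'u32, 0x49656e69'u32, 0x6c65746e'u32)  # GenuineIntel
```

When feeding it values from an int32-based routine, the registers would first need a cast[uint32](...) so that the shifts behave as unsigned.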
Generally, the availability of a feature is indicated by a bit being set or cleared.
For example, according to Wikipedia, FMA3 instruction support can be checked by setting the leaf to 1 and reading bit 12 (counting from 0) of the ecx register.
proc check_bit(register: int32, bit: int32): bool =
  # Parentheses are needed: `and` binds less tightly than `!=` in Nim.
  result = (register and (1'i32 shl bit)) != 0

proc has_fma3(): bool =
  let (eax, ebx, ecx, edx) = cpuinfo(eaxi = 1'i32, ecxi = 0'i32)
  result = check_bit(ecx, 12)
Visit the Wikipedia page for more detailed information that can be extracted about the underlying processor.
The availability of a standard instruction makes things easy from the developer's point of view too. It can be used to make sure that certain features required by an application are indeed present, or to convey useful error messages to the user if they are not.
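As a sketch of how such a check could be organized, the bits of the leaf 1 ECX value can be decoded into a Nim set and compared against what an application requires. The type and proc names are hypothetical. The bit positions (SSE3 = 0, FMA3 = 12, SSE4.1 = 19, SSE4.2 = 20, AVX = 28) are the ones listed on the Wikipedia page, and the sample value is made up rather than read from a CPU.

```nim
type
  X86Feature = enum
    xfSse3, xfFma3, xfSse41, xfSse42, xfAvx

# Bit position of each feature in ECX after cpuid with eax = 1.
const leaf1EcxBit: array[X86Feature, int] = [
  xfSse3: 0, xfFma3: 12, xfSse41: 19, xfSse42: 20, xfAvx: 28]

proc decodeLeaf1Ecx(ecx: uint32): set[X86Feature] =
  # Collect every feature whose bit is set in the register value.
  for f in X86Feature:
    if ((ecx shr leaf1EcxBit[f]) and 1'u32) != 0'u32:
      result.incl f

proc missing(required, available: set[X86Feature]): set[X86Feature] =
  # Features the application needs but the CPU did not report.
  required - available

# Made-up register value with only the FMA3 and AVX bits set.
let sample = (1'u32 shl 12) or (1'u32 shl 28)
let available = decodeLeaf1Ecx(sample)
let lacking = missing({xfFma3, xfSse42}, available)
if lacking != {}:
  echo "CPU is missing required features: ", lacking
```

Note that for AVX the CPU bit alone is not enough: the OS must also have enabled saving the wider registers (the OSXSAVE bit plus an XGETBV check), which this sketch leaves out.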
GetAuxVal on ARM:
ARM systems, on the other hand, require developers to depend on the OS to expose hardware functionality.
Standard special registers are indeed available on ARM chips to expose the implemented features, like ASIMD and FP16 support.
Userspace applications cannot access those registers directly, as this requires EL1, aka exception level 1. Only code running at or above exception level 1 can access those registers directly.
Using assembly code to read those registers directly from user space results in an illegal instruction.
I also came across some commit messages during my search indicating that the OS can trap the exceptions raised by such accesses, effectively emulating the expected behavior. The Linux arm64 documentation describes this emulation and advertises it through the HWCAP_CPUID bit of AT_HWCAP.
But I haven't been able to see any evidence of it, at least on Android Linux; all such code resulted in an illegal instruction interrupt by the hardware.
For now the standard approach is to collect the hardware feature information from the OS itself, which passes it along as a vector (the auxiliary vector) when loading a user-space process or application.
This can be used to get hardware-specific information at runtime.
Commands like lscpu can also be used for a quick glance at the available hardware features.
Specifically, glibc provides a routine called getauxval, which takes a key as its argument and returns the corresponding value.
From the Linux man page:
The getauxval() function retrieves values from the auxiliary vector, a mechanism that the kernel's ELF binary loader uses to pass certain information to user space when a program is executed. Each entry in the auxiliary vector consists of a pair of values: a type that identifies what this entry represents, and a value for that type. Given the argument type, getauxval() returns the corresponding value. The value returned for each type is given in the following list. Not all type values are present on all architectures.
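To illustrate the (type, value) layout described in the man page, here is a small sketch that walks such a list of pairs by hand. The names findAuxEntry and readAuxv are hypothetical. It assumes that Nim's uint has the same size as the C unsigned long used by the kernel, that AT_NULL is 0 and AT_HWCAP is 16 as in <elf.h>, and that Linux exposes the raw vector of the current process at /proc/self/auxv. The sample data is hand-made.

```nim
const
  atNull = 0'u    # AT_NULL terminates the vector
  atHwcap = 16'u  # AT_HWCAP, as defined in <elf.h>

proc findAuxEntry(words: openArray[uint], key: uint): tuple[found: bool, value: uint] =
  # The vector is a flat run of (type, value) pairs ending with AT_NULL.
  var i = 0
  while i + 1 < words.len and words[i] != atNull:
    if words[i] == key:
      return (true, words[i + 1])
    i += 2
  return (false, 0'u)

proc readAuxv(path = "/proc/self/auxv"): seq[uint] =
  # Assumption: Linux only. Reads the raw vector of the current process.
  var raw = readFile(path)
  result = newSeq[uint](raw.len div sizeof(uint))
  if result.len > 0:
    copyMem(addr result[0], addr raw[0], result.len * sizeof(uint))

# Hand-made sample: AT_HWCAP = 0b1010, AT_PAGESZ (6) = 4096, then AT_NULL.
let sample = [atHwcap, 0b1010'u, 6'u, 4096'u, atNull, 0'u]
echo findAuxEntry(sample, atHwcap)
```

On a Linux machine, findAuxEntry(readAuxv(), atHwcap) would be expected to return the same mask that getauxval reports, which can be handy where getauxval itself is not available.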
For detecting CPU features we need to pass AT_HWCAP as the argument to getauxval; the returned bit-mask values are ABI and architecture specific.
Since this is provided by glibc, it works on compatible UNIX-like systems (such as Linux) only. I haven't looked at ways to collect such information from non-UNIX operating systems running on ARM architectures.
Corresponding code to detect support for some hardware feature could look like this:
func getauxval(a: culong): culong {.importc: "getauxval", noDecl, header: "<sys/auxv.h>".}
var at_hwcap {.importc: "AT_HWCAP", noDecl, header: "<elf.h>".}: culong
let hwcaps = getauxval(at_hwcap) # get hardware capabilities.

{.pragma: aarch64features, noDecl, header: "<asm/hwcap.h>".}
var
  aes {.importc: "HWCAP_AES", aarch64features.}: culong
  crc32 {.importc: "HWCAP_CRC32", aarch64features.}: culong
  asimd {.importc: "HWCAP_ASIMD", aarch64features.}: culong # neon/advanced simd instructions. (must on aarch64)
  fphp {.importc: "HWCAP_FPHP", aarch64features.}: culong # float16 support, half-precision.
  asimdhp {.importc: "HWCAP_ASIMDHP", aarch64features.}: culong # neon half-precision, (generally true if half-precision (fphp) is true.)

proc supports_aes*(): bool =
  # Compare against 0 rather than converting the masked value to bool.
  result = (hwcaps and aes) != 0

proc supports_asimd*(): bool =
  result = (hwcaps and asimd) != 0
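The code above only compiles where the aarch64 Linux headers are present. As a hedged variation, the sketch below hard-codes the bit values listed for arm64 in the Linux <asm/hwcap.h> header and reports no capabilities on any other target, so the same file builds everywhere. The names hasCap, supportsAes and supportsAsimd are hypothetical, and hard-coding the values is my assumption for portability, not something the headers require.

```nim
const
  # Bit values as listed for arm64 in the Linux <asm/hwcap.h> header.
  hwcapAsimd = 1'u shl 1
  hwcapAes = 1'u shl 3
  hwcapCrc32 = 1'u shl 7
  hwcapFphp = 1'u shl 9
  hwcapAsimdhp = 1'u shl 10

when defined(linux) and defined(arm64):
  proc getauxval(kind: culong): culong {.importc, header: "<sys/auxv.h>".}
  # 16 is AT_HWCAP in <elf.h>.
  let hwcaps = uint(getauxval(16.culong))
else:
  # Not an aarch64 Linux target: report no capabilities.
  let hwcaps = 0'u

proc hasCap(mask, cap: uint): bool =
  # True when the capability bit is set in the mask.
  (mask and cap) != 0

proc supportsAes(): bool = hasCap(hwcaps, hwcapAes)
proc supportsAsimd(): bool = hasCap(hwcaps, hwcapAsimd)

echo "aes: ", supportsAes(), " asimd: ", supportsAsimd()
```

Keeping hasCap separate from the getauxval call also makes the bit logic easy to test with made-up masks on a development machine that is not ARM.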
Final Remarks:
We took a look at ways to collect CPU/processor/chip information for x86 and ARM systems at runtime, with the idea of using such information programmatically.
Such code shouldn't depend on any third-party library and should be easy to compile or ship along with an application.
There is a whole lot to understand about how the different aspects at the lowest level of digital systems fit together. Sometimes I struggle to understand the actual interpretation of a statement even in the official manual, and resort to forming a hypothesis or finding some relevant code.
For ARM systems in particular it may be best to read the manual itself; the internet may not be of much help unless you already know quality resources.
I used phind for most of my search, as both google and duckduckgo were nearly worthless, always pointing to generic articles.
You should always be able to get a link to the original source, and I like that at least phind does that rather than laundering somebody's code shamelessly without any attribution.