[RfC] RISC-V port and universal binary
Performance on Spacemit K3, thanks @edolnx for testing
master:
Total time (ms) : 65200
Nodes searched : 3493826
Nodes/second : 53586
riscv-scalable-port:
Total time (ms) : 15834
Nodes searched : 3493826
Nodes/second : 220653
Also thanks to @camel-cdr for guidance on RVV programming, and https://cloud-v.co for supplying an RVV instance to test with
passed STC:
LLR: 2.81 (-2.94,2.94) <0.00,2.00>
Total: 1152 W: 527 L: 108 D: 517
Ptnml(0-2): 0, 17, 167, 348, 44
https://tests.stockfishchess.org/tests/view/6a39895b3036e45021aeb368
## Summary
We've had a `riscv64` target for a while, but haven't really optimized for it, in particular the vector extension (RVV).
RVV, like SVE, is based on a scalable vector system where the vector length ranges from 128 to 65536. In practice implementations are between 128 and 2048, and 256 bits is quite common (e.g. the Spacemit K3 system above). Unfortunately this doesn't fit well into the rest of our code which assumes a fixed vector length, so what I've done is bypass the `VECTOR` ifdef (which now basically means "FIXED_LENGTH_VECTOR") and just have RVV-specific paths.
The ability to explicitly control `vl` makes the code quite readable, in my opinion. We use LMUL>1 in most places to take advantage of multi-vector instructions. Generally the LMULs were chosen to best support a 256-bit vlen, which is very common, but by virtue of how the vlen control works, the code works with any vlen. In a couple places, i.e., `get_changed_pieces` and `AffineTransformSparseInput::propagate`, we have separate implementations depending on the vlen, because the optimal LMUL varies a lot between implementations.
One little wrinkle is that `load_as` is compiled to a sequence of byte loads, because although unaligned loads are legal in RVA23, the spec says that they *may* be extremely slow (even though they usually aren't, in actual hw), so compilers are conservative. Thus I aligned the relevant buffers and made the semantics of `load_as` that the operand is aligned, by adding a runtime assertion.
### Universal binary
Adding a universal binary is pretty easy and we can just cross-compile. There are two targets: baseline rv64gc and riscv64-rva23, which is actually a smaller subset of RVA23 that also works on some older processors that don't support the full thing. We use clang because GCC, until recently, has a nasty bug with LTO and RVV.
Like the universal ARM and x86 builds, we check all the builds in CI. In this case we run bench with multiple vlens, 128 through 1024.
In the meantime I deleted the existing broken and unused riscv64 tests.
### Follow-ups
- Optimizations
- zvdot4a8i path
closes https://github.com/official-stockfish/Stockfish/pull/6920
No functional change