NEON dotprod speedup in sparse-input affine transform
Apply the accumulator-splitting strategy already used on the VNNI path to the NEON dotprod path as well.
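For readers unfamiliar with the idea, here is a minimal portable sketch of accumulator splitting in general. It is not the code from this patch: `dot4`, `dot_single` and `dot_split` are hypothetical names, the int8 operand types are an assumption, and a scalar helper stands in for the hardware dot-product instruction. Each update to a single accumulator has to wait for the previous one, while two accumulators form independent dependency chains that are summed once at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for one 4-way dot-product step (illustrative only):
// returns acc plus the sum of four int8 * int8 products.
inline std::int32_t dot4(std::int32_t acc, const std::int8_t* a, const std::int8_t* b) {
    for (int i = 0; i < 4; ++i)
        acc += std::int32_t(a[i]) * std::int32_t(b[i]);
    return acc;
}

// Single accumulator: every step depends on the result of the previous step.
std::int32_t dot_single(const std::int8_t* a, const std::int8_t* b, std::size_t n) {
    assert(n % 4 == 0);
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < n; i += 4)
        acc = dot4(acc, a + i, b + i);
    return acc;
}

// Split accumulators: two independent chains, combined once after the loop.
std::int32_t dot_split(const std::int8_t* a, const std::int8_t* b, std::size_t n) {
    assert(n % 8 == 0);
    std::int32_t acc0 = 0, acc1 = 0;
    for (std::size_t i = 0; i < n; i += 8) {
        acc0 = dot4(acc0, a + i, b + i);
        acc1 = dot4(acc1, a + i + 4, b + i + 4);
    }
    return acc0 + acc1;
}
```

Both functions return the same value for the same input, since integer addition is associative as long as nothing overflows. Only the order of the additions changes, which is consistent with a change that is not functional.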
Speedup measured locally with a profile-build on an Apple Silicon M3 Pro:
```
Result of 20 runs
==================
base (...kfish-master) = 1582485 +/- 12985
test (...parse-affine) = 1605204 +/- 13801
diff = +22720 +/- 1212
speedup = +0.0144
P(speedup > 0) = 1.0000
CPU: 11 x arm
Hyperthreading: off
```
Passed STC:
https://tests.stockfishchess.org/tests/view/69c04c71f690a4b7f5fb0cde
LLR: 2.95 (-2.94,2.94) <0.00,2.00>
Total: 80576 W: 20748 L: 20391 D: 39437
Ptnml(0-2): 161, 8472, 22658, 8843, 154
closes https://github.com/official-stockfish/Stockfish/pull/6682
No functional change