Use bitset representation for nnz and move computation into feature transformer
Passed avx2/bmi2 STC:
https://tests.stockfishchess.org/tests/view/69ff99eb9392f0c317213eda
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 20992 W: 5469 L: 5192 D: 10331
Ptnml(0-2): 38, 2197, 5751, 2470, 40
Passed avxvnni STC:
https://tests.stockfishchess.org/tests/view/6a00022b9392f0c317213f7c
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 47328 W: 12328 L: 12009 D: 22991
Ptnml(0-2): 112, 5148, 12847, 5423, 134
Passed NEON STC:
https://tests.stockfishchess.org/tests/view/69ff99d69392f0c317213ed8
LLR: 2.96 (-2.94,2.94) <0.00,2.00>
Total: 29600 W: 7664 L: 7376 D: 14560
Ptnml(0-2): 48, 3074, 8277, 3344, 57
Passed avx512icl non-regression:
https://tests.stockfishchess.org/tests/view/6a002b699392f0c317213f91
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 90112 W: 23130 L: 22975 D: 44007
Ptnml(0-2): 192, 9939, 24633, 10106, 186
Measurements from vondele (Neoverse V2, armv8-dotprod):
==== master ====
==== Bench: 2344696 ====
1 Nodes/second : 280724660
2 Nodes/second : 280647282
3 Nodes/second : 282055192
Average (over 3): 281142378
==== 50a44640a3 ====
==== Bench: 2344696 ====
1 Nodes/second : 284271937
2 Nodes/second : 285638071
3 Nodes/second : 284349426
Average (over 3): 284753144
The patch's benefit is non-uniform and is ~0 for avx512icl unfortunately -- although I think we should be able to find something there....
## Background/explanation
The idea here is to move the non-zero block computation back to the previous layer, and overlap the work better. Then, to avoid trying to emulate compress instructions on targets not supporting them (i.e., everything except for AVX512), we use a `pop_lsb` loop on a bitset enumerating the non-zero blocks, rather than first writing them out as indices.
An early AVX2 implementation worked on fishtest (https://tests.stockfishchess.org/tests/view/69c3410534b6988b1e472db4) and linrock demonstrated that it also worked for NEON (https://tests.stockfishchess.org/tests/view/69ce0dc73ddc0eccd617188f). Thanks to him for encouraging me to push this idea over the finish line.
To abstract over the particular NNZ representation (bitset or index list), we use `NNZInfo` and `NNZCursor`. The cursor is used to write to one or the other perspective of the NNZ list. I'd appreciate ideas on making the code cleaner/more readable as imo it's still a bit ugly.
## Fixing GCC 15 regression
Separately, vondele noticed a regression in ARM performance from GCC 15 caused by suboptimal codegen. See https://discord.com/channels/435943710472011776/813919248455827515/1502837886381461605 for more info, but the easiest fix that I could come up with was inserting a couple `asm` optimization barriers.
Single threaded data:
```
average: stockfish.master.13.3 1041619
average: stockfish.master.15.2 999214
average: stockfish.patched.15.2 1031369
```
There's definitely regressions elsewhere as well, which we shld chase down, but at least this should unblock the arm64 universal binary work.
closes https://github.com/official-stockfish/Stockfish/pull/6814
No functional change