Share correction history between threads
We did quite a few tests because this is a pretty involved change with
unknown scaling behavior, but results are decent.
[STC 10+0.1 1 thread, non-regression](https://tests.stockfishchess.org/tests/live_elo/6941ce3b46f342e1ec210180)
```
LLR: 2.93 (-2.94,2.94) <-1.75,0.25>
Total: 83200 W: 21615 L: 21452 D: 40133
Ptnml(0-2): 247, 9064, 22844, 9169, 276
```
[STC 5+0.05 8 threads](https://tests.stockfishchess.org/tests/live_elo/693dc38346f342e1ec20f555)
```
LLR: 3.48 (-2.94,2.94) <0.00,2.00>
Total: 58536 W: 15067 L: 14688 D: 28781
Ptnml(0-2): 87, 6474, 15781, 6825, 101
```
[LTC 20+0.2 8 threads](https://tests.stockfishchess.org/tests/live_elo/693f2afb46f342e1ec20f847)
```
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 27716 W: 7211 L: 6925 D: 13580
Ptnml(0-2): 8, 2674, 8207, 2962, 7
```
[LTC 10+0.1 64 threads](https://tests.stockfishchess.org/tests/live_elo/694003aa46f342e1ec20fac4)
```
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 16918 W: 4439 L: 4182 D: 8297
Ptnml(0-2): 3, 1493, 5213, 1744, 6
```
[NUMA test, 5+0.05 256 threads](https://tests.stockfishchess.org/tests/view/6941ee4e46f342e1ec210203)
```
LLR: 2.95 (-2.94,2.94) <0.00,2.00>
Total: 7124 W: 1910 L: 1678 D: 3536
Ptnml(0-2): 0, 560, 2211, 790, 1
```
[LTC 60+0.6 64 threads](https://tests.stockfishchess.org/tests/live_elo/6940a85346f342e1ec20fcde)
```
LLR: 2.94 (-2.94,2.94) <0.50,2.50>
Total: 15504 W: 4045 L: 3826 D: 7633
Ptnml(0-2): 0, 1002, 5530, 1219, 1
```
Bonus (courtesy of Viz): The 1 double kill in this last test was master
blundering a cool mate in 3: https://lichess.org/jyNZuRl4
Basically, the idea here is to share correction history between threads.
That way, T1 can reuse the correction values that T2 has already computed
for positions with the same pawn structure etc., so T1 can search more
efficiently. The table size per thread stays about the same, so we
shouldn't get a large increase in hash collisions; in fact, since the
threads mostly visit overlapping positions, the shared table holds fewer
distinct keys than the per-thread tables did combined, so I'd expect a
lower collision rate overall.
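As a rough sketch of the wiring (`CorrectionHistory`, `SearchThread`, and
`attachShared` are illustrative names, not the actual Stockfish types):

```cpp
#include <cstdint>
#include <vector>

// Every search thread points at one shared correction table instead of
// owning its own copy.
struct CorrectionHistory {
    std::vector<std::int16_t> table;  // sized and indexed as described below
};

struct SearchThread {
    CorrectionHistory* correctionHistory = nullptr;  // previously a per-thread member
};

void attachShared(std::vector<SearchThread>& threads, CorrectionHistory& shared) {
    // All threads now read and write the same table, so corrections learned
    // by one thread are immediately visible to the others.
    for (SearchThread& t : threads)
        t.correctionHistory = &shared;
}
```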
Although I came up with and implemented the idea independently,
[Caissa](https://github.com/Witek902/Caissa) was the first engine to
implement corrhist sharing (and corrhist in the first place) – this idea
is not completely novel.
The table size is rounded to a power of two. In particular, it's `65536
* nextPowerOfTwo(threadCount)`. That way, the indexing operation becomes
an AND of the key bits with a mask, rather than something more expensive
(e.g., a `mul_hi64`-style approach or a modulo).
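Roughly, the sizing and indexing look like this (`nextPowerOfTwo`,
`tableEntries`, and `indexOf` are illustrative helper names, not
necessarily what the patch uses):

```cpp
#include <cstddef>
#include <cstdint>

// Round up to the next power of two (returns n unchanged if it already is one).
constexpr std::size_t nextPowerOfTwo(std::size_t n) {
    std::size_t p = 1;
    while (p < n)
        p *= 2;
    return p;
}

constexpr std::size_t tableEntries(std::size_t threadCount) {
    // 65536 entries per (rounded-up) thread; the product is a power of two.
    return 65536 * nextPowerOfTwo(threadCount);
}

constexpr std::uint64_t indexOf(std::uint64_t key, std::uint64_t entries) {
    // Because entries is a power of two, "key % entries" reduces to one AND.
    return key & (entries - 1);
}
```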
The updates are racy, like the TT, but because `entry` is hoisted into a
register, there's no risk of writing back a value that's out of the
designated range `[-D, D]`. Various attempts at rewriting using atomics
led to substantial slowdowns, so we begrudgingly ignored the functions
in thread sanitizer, but at some point we'd like to make this better.
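A minimal sketch of the update pattern, assuming an illustrative bound `D`
and a gravity-style history formula (the exact formula and constants in the
patch may differ):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

constexpr int D = 1024;  // stand-in for the real correction bound

void updateEntry(std::int16_t& sharedEntry, int bonus) {
    int entry = sharedEntry;                       // hoist the (possibly racy) load into a register
    entry += bonus - entry * std::abs(bonus) / D;  // move toward bonus, bounded by D
    sharedEntry = std::int16_t(std::clamp(entry, -D, D));  // single racy store, always within [-D, D]
}
```

At worst a concurrent writer causes one update to be lost; the stored value
can never leave the designated range.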
We allocate one shared correction history per NUMA node, because the
penalty associated with crossing nodes is substantial – I get a 40% hit
with NPS=4 and 256 threads, which is intolerable. With separate tables
per NUMA node I get a 6% nodes-per-second penalty, which isn't ideal but
is apparently compensated for (see the 256-thread NUMA test above).
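A sketch of the allocation policy under hypothetical names
(`CorrectionHistoryTable`, `makeNodeLocalTables`); the point is that each
NUMA node gets its own shared table and threads only touch the table of
the node they are bound to:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct CorrectionHistoryTable { /* shared table as sketched above */ };

std::vector<std::unique_ptr<CorrectionHistoryTable>>
makeNodeLocalTables(std::size_t numaNodeCount) {
    std::vector<std::unique_ptr<CorrectionHistoryTable>> tables;
    tables.reserve(numaNodeCount);
    for (std::size_t n = 0; n < numaNodeCount; ++n)
        // Ideally allocated (and first touched) by a thread bound to node n,
        // so the pages land in that node's local memory.
        tables.push_back(std::make_unique<CorrectionHistoryTable>());
    return tables;
}
```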
closes https://github.com/official-stockfish/Stockfish/pull/6478
Bench: 2690604
Co-authored-by: Disservin <disservin.social@gmail.com>