Make shared history allocation aware of non-uniform cache access
Although shared history has been successful overall, it led to some speed
issues with large numbers of threads. Originally we just split by NUMA node,
but on systems with non-unified L3 caches (most AMD workstation and server
CPUs, and some Intel E-core based server CPUs), this can still lead to a speed
penalty at the default config. Thus, we decided to further subdivide the shared
history based on the L3 cache structure.
Based on this test, the original SPRTs, and speed experiments, we decided that
grouping L3 domains to reach 32 threads per SharedHistories was a reasonable
balance for affected systems – but we may revisit this in the future. See the
PR for full details.
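The grouping idea can be illustrated with a minimal sketch. It is not the code
from the PR: the function name group_l3_domains, its input format (one thread
count per L3 domain within a NUMA node), and the greedy in-order merging are
all assumptions made for illustration. Only the target of 32 threads comes
from the text above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not Stockfish's actual implementation).
// threadsPerL3[i] is the number of hardware threads sharing L3 domain i.
// Adjacent domains are merged greedily until a group holds at least
// `target` threads; each returned group lists the L3 domain indices that
// would share one SharedHistories instance.
std::vector<std::vector<std::size_t>>
group_l3_domains(const std::vector<std::size_t>& threadsPerL3,
                 std::size_t target = 32) {
    std::vector<std::vector<std::size_t>> groups;
    std::vector<std::size_t> current;
    std::size_t count = 0;

    for (std::size_t i = 0; i < threadsPerL3.size(); ++i) {
        current.push_back(i);
        count += threadsPerL3[i];

        // Close the group once it reaches the target thread count.
        if (count >= target) {
            groups.push_back(current);
            current.clear();
            count = 0;
        }
    }

    // Leftover domains that never reached the target form a final group.
    if (!current.empty())
        groups.push_back(current);

    return groups;
}
```

Under these assumptions, sixteen L3 domains of 16 threads each would be paired
into eight groups, while a single L3 shared by 64 threads would stay as one
group, matching the case where the patch leaves the default unchanged.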
In an extreme case, on a single-socket EPYC 9755 configured with 1 NUMA domain
per socket, the nps increases by roughly 25%, from:
Nodes/second : 182827480
to
Nodes/second : 229118365
In many cases, when L3 caches are shared between many threads, or when several
NUMA nodes are already configured per socket, this patch does not change the
default behavior. The default can be adjusted with the existing NumaPolicy
option.
closes https://github.com/official-stockfish/Stockfish/pull/6526
No functional change.