eternaleye wrote: ↑Fri Jul 26, 2024 6:42 pm
Speaking of lateral thinking in optimizations, I'm curious as to both why electric network updates do have such a heavy memory bandwidth impact, and whether that impact could be avoided in the common case.

This is not about programming, this is about CPU and memory hardware. For multithreading to speed things up, the multithreaded work must actually run simultaneously, and it runs simultaneously when one thread runs on one CPU core while another thread runs on a different CPU core at the same time.
The in-memory structure that holds the electric network info is an array. Because the items of an array are adjacent in memory, multiple items fit into one cache line, which speeds things up: they are loaded into the cache just once and then stay there over the course of the tick update.
One part of that array belongs to one electric network and another part belongs to another electric network. One thread iterates over the first network's part of the array, and a second thread iterates over the other network's part. This means the same array is accessed at the same time by two different threads, i.e. by two different CPU cores.
Memory access is fast, but to speed it up even more there is a memory cache that can be accessed faster than main memory. There are even multiple tiers of cache, typically three. The L1 cache is extremely fast but very small, about 32 KB per core, and it is not shared among CPU cores; that privacy is a big part of why it can be so fast, and it is the L1 cache that makes modern CPUs really fast. The next tier, the L2 cache, is a buffer between the L1 and the last tier, the L3 cache; the L2 cache is usually also private to each core. Only the L3 cache is shared between CPU cores, and partly due to that sharing it is considerably slower than the L1 cache.
Data is cached in blocks of 64 bytes (called the cache line size). Now, if two threads work in parallel on the same array, the 1st CPU core executing the 1st thread loads the data into its L1 cache, through the L2 cache, from the L3 cache. If the other thread on our 2nd CPU core then writes to some part of that array less than 64 bytes away, i.e. within the same cache line, the cache-coherency hardware invalidates that line in the 1st core's L1 cache, then loads the data from L3 through L2 into the L1 cache of the 2nd core.
The real impact starts when core 1 wants to access the next item. It is probably less than 64 bytes away, so normally it would already be in the L1 cache, which is the fastest possible access. However, it must be fetched from the L3 cache again, because the cache line was invalidated by the other core's access. That fetch in turn invalidates the line in the other core's L1 cache. In the end, all memory access for this kind of data runs at the speed of the L3 cache, not at the really fast speed of the L1 cache.
Of course, many items in that array are more than 64 bytes apart, so this kind of cache thrashing doesn't always take place. But it seems to take place often enough to destroy the performance gain of the parallelization.
You might think the fix is simply to make each item bigger than the cache line size. However, that means more memory traffic in general, because the data is bigger: more other data has to be evicted from the cache to make room for it, so everything else slows down. For bigger data structures this is usually what sets the speed limit (the L3 cache of Intel consumer CPUs is 6-30 MB).
Also see: https://igoro.com/archive/gallery-of-pr ... e-effects/