As some of you may know, I have multithreaded pathfinding in the A* Pathfinding Project. This is a great way to keep the game at a steady FPS while other threads carry out the heavy pathfinding. Until now I have only had a single separate thread for pathfinding. That works well, but there is room for so much more now that computers with more than 2 cores are becoming common. So I decided to try to implement that.
It didn’t work out the way I wanted at first. The reason is that I had made no separation between connectivity data (which is usually static) and path data (temporary data needed for path calculation); everything was stored in the same Node object. So I started working on separating them. What I ended up with is essentially a matrix: for each thread, a large array of NodeRun objects (for lack of a better name). The NodeRun object stores the temporary data (e.g. the G and H scores), and every NodeRun object is linked to a specific Node, so each thread has its own NodeRun object pointing at the same Node. This works quite well. Memory hasn’t increased that much; it should be about 12 bytes per node for 1 thread and then 8 extra bytes per node for each thread added. And as far as I can tell, per-thread performance has barely been affected.
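To make the idea concrete, here is a minimal sketch of that separation. The project itself is written in C#, but the same structure is easy to show in Java: static connectivity lives in a shared Node, while the per-search state (the NodeRun role) lives in arrays that each thread owns exclusively. Apart from "Node", "NodeRun" and the G/H scores, all names here are made up for illustration; this is not the project's actual code.

```java
// Static connectivity data, shared read-only by all pathfinding threads.
final class Node {
    final int index;        // position in each thread's NodeRun array
    final int[] neighbours; // indices of connected nodes
    Node(int index, int[] neighbours) {
        this.index = index;
        this.neighbours = neighbours;
    }
}

// Temporary per-search data (the "NodeRun" role): G and H scores and a
// parent pointer. One instance per node, per thread.
final class NodeRun {
    int g;           // cost from the start node
    int h;           // heuristic estimate to the goal
    int parent = -1; // index of the node we came from, -1 = none
}

// Each thread gets its own full array of NodeRun objects, so two threads
// can search the same graph at once without touching each other's state.
final class PerThreadSearchData {
    final NodeRun[] runs;
    PerThreadSearchData(int nodeCount) {
        runs = new NodeRun[nodeCount];
        for (int i = 0; i < nodeCount; i++) runs[i] = new NodeRun();
    }
}
```

The key property is that a worker only ever writes to its own `PerThreadSearchData`, which is why no locking is needed during the search itself.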
The awesome stuff reveals itself when you test the performance with more threads than 1.
My computer has 4 physical cores, but uses hyperthreading for a total of 8 logical cores. As you can see in the graph, I can get it to run as much as 3.75 times faster than with a single thread! The speedup is linear at first, but levels off when the number of threads increases above the number of physical cores. Memory usage increases linearly, at about 2 MB per added thread (for the 200*200 grid graph I’m testing on).
The only major drawback that I can see is that it will be slower to update the graphs. When a graph is updated, pathfinding must not be searching the graph at the same time, as that could lead to weird errors. When not using multithreading, the function can simply update the graph (if needed) between calculating paths. It is almost as easy when using a single thread, except that it must stop for a bit to let the Unity thread update the graphs, since many things such as raycasting must be called from the main Unity thread. When using more than one thread, every thread must finish calculating its current path, stop, and then notify the Unity thread that it can proceed with updating the graphs. This can lead to a decrease in performance because the threads spend a lot of time idling. One way to solve this is to group graph updates together in chunks, for example every 0.2 or 0.5 seconds, to avoid stopping the threads too often when the graphs are updated a lot.
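The batching idea can be sketched like this (in Java rather than the project's C#; the class, the callbacks and the flush logic are all hypothetical, just illustrating the "flush at most every 0.2–0.5 seconds" scheme): updates queue up on the main thread, and the workers are paused only once per flush, however many updates have accumulated.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical batcher: collects graph updates and applies them in one go,
// pausing the pathfinding threads at most once per flush interval.
final class GraphUpdateBatcher {
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private final double flushInterval; // e.g. 0.2 or 0.5 seconds
    private double lastFlushTime = 0;

    GraphUpdateBatcher(double flushIntervalSeconds) {
        this.flushInterval = flushIntervalSeconds;
    }

    // Main thread: request a graph update; it is applied at the next flush.
    void queueUpdate(Runnable update) { pending.add(update); }

    // Main thread, once per frame. Returns how many updates were applied.
    int tick(double now, Runnable pauseThreads, Runnable resumeThreads) {
        if (pending.isEmpty() || now - lastFlushTime < flushInterval) return 0;
        lastFlushTime = now;
        pauseThreads.run();   // wait for workers to finish their current paths
        int applied = 0;
        while (!pending.isEmpty()) { pending.poll().run(); applied++; }
        resumeThreads.run();  // let the workers search again
        return applied;
    }
}
```

With, say, ten updates queued in a burst, the workers stop once instead of ten times, which is the whole point of the chunking.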
Anyway, I’m happy with the transition. I could also simplify many functions to use atomic operations instead of custom-written lockless queues and the like. When I get around to rewriting the GraphUpdateObject system for this, I can also make sure it does not block the Unity thread while waiting for the pathfinding threads to stop; instead it simply notifies the pathfinding threads that they should pause, and once they have, the Unity thread can update the graphs some frames later. The previous approach could cause lag when updating graphs if a pathfinding thread was in the middle of calculating a really long path.
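That non-blocking handshake could look something like the following. Again this is a Java sketch with invented names, not the project's C# code: the main thread sets an atomic flag and checks back on a later frame instead of blocking, while each worker notices the flag between path calculations and parks itself.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical pause gate. The main thread never blocks: it requests a
// pause, polls allPaused() on later frames, updates the graphs, then resumes.
final class PauseGate {
    private final AtomicBoolean pauseRequested = new AtomicBoolean(false);
    private final AtomicInteger pausedCount = new AtomicInteger(0);
    private final int workerCount;

    PauseGate(int workerCount) { this.workerCount = workerCount; }

    // Main thread: ask the workers to pause; returns immediately.
    void requestPause() { pauseRequested.set(true); }

    // Main thread, some frames later: have all workers stopped?
    boolean allPaused() { return pausedCount.get() == workerCount; }

    // Main thread: graphs are updated, let the workers continue.
    void resume() {
        pauseRequested.set(false);
        synchronized (this) { notifyAll(); }
    }

    // Worker thread: call between path calculations; blocks while paused.
    void checkIn() throws InterruptedException {
        if (!pauseRequested.get()) return;
        pausedCount.incrementAndGet();
        synchronized (this) {
            while (pauseRequested.get()) wait();
        }
        pausedCount.decrementAndGet();
    }
}
```

Because `requestPause` returns immediately, a worker stuck on a really long path delays the graph update by a few frames instead of freezing the main thread for that long.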
I’m a bit concerned about mobile development though, with the increased memory usage. It is not really that much, though. I built two tests, one with the original code (similar to the latest release) and one with the new code, both using a 200*200 grid graph. The new code used 6 MB according to the GC, and the old code used about 5 MB (I only had 1 MB precision). But what’s interesting is that the old code allocated about 1.5 MB per second, while the new code allocated practically nothing (0.0 to 0.1). I don’t know if that is relevant, or if the GC is just acting odd, but it was repeatable at least. With 4 threads it peaked at 13 MB, though it started at around 8. I don’t know much about C#’s internal memory management, so I usually just measure the peak allocated memory after 20 seconds or so, once it has stabilized.