Ok, this might turn into a long one, but this stuff is always fun, though really complex. What I will be writing below is generally applicable to all programming, but there will be some C++-based stuff as well.
General Advice
1st - Remember Performance is Usually a Trade Off
The reason you usually hear "don't optimize early" is because performance is generally a trade-off between usability and ease of maintenance. A great example of this is OOP itself. OOP adds a ton of overhead within programs through the use of vtables, more complex memory footprints, and heavier reliance on indirect memory access. This is all usually bad performance-wise, but it is done to increase the ease of use and maintainability of the program.
2nd - Use a Good Profiling Tool
In general you can make some good guesses about where your bottlenecks are, but without a good profiling tool you can never know for sure. Even in my day job I have seen really smart, senior devs get their assumptions destroyed after running an actual profile on the software. This is also much more common than you might think, especially for the reasons I will outline in section 3 here.
I am a big fan of JetBrains tools and use those myself, but most engines already come with some good profiling tools included. I suggest not rolling your own if you can help it, as a homemade profiler will usually only give you a high-level, not very detailed picture of your performance. I know profilers can be a bit expensive, but there are a few good OSS ones out there.
3rd - Remember the Compiler is Smarter Than You
A massive benefit of compiled languages is that the compiler can do a TON to optimize your code, and modern compilers really do. Running the compiler with the highest optimization flags will butcher your code to make it run more efficiently. If you compare an unoptimized binary with an optimized one, they are nearly two different programs, and if you profile them the results can come out really different.
4th - 9/10 Times Its Memory
Memory access/creation is EXPENSIVE. And not just I/O: even RAM-to-CPU-cache and cache-to-cache transfers can be orders of magnitude slower than CPU-bound operations.
I can't count how many times I have been asked to look at a program only to see 30-50% of the run time spent waiting on malloc/free operations. This is especially true if you are using std::string in C++ (it's notorious for being really memory inefficient), and if you make a lot of allocations on the heap it can cause a lot of heap fragmentation that slows down memory access.
Patterns like Flyweight or Object Pool can help with some of that. Also, bulk reserving and loading data up front can help increase cache coherency by reducing memory fragmentation.
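To make the Object Pool idea concrete, here is a minimal sketch: all the slots are allocated up front in one contiguous block, so acquiring and releasing objects never touches malloc/free after construction. The `Particle`/`ParticlePool` names are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

struct Particle {
    float x = 0.0f, y = 0.0f;
    bool alive = false;
};

class ParticlePool {
public:
    explicit ParticlePool(std::size_t capacity) {
        slots_.resize(capacity);           // one bulk allocation up front
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i)
            free_.push_back(capacity - 1 - i);
    }

    // Returns nullptr when the pool is exhausted instead of heap-allocating.
    Particle* acquire() {
        if (free_.empty()) return nullptr;
        Particle* p = &slots_[free_.back()];
        free_.pop_back();
        p->alive = true;
        return p;
    }

    void release(Particle* p) {
        p->alive = false;
        free_.push_back(static_cast<std::size_t>(p - slots_.data()));
    }

private:
    std::vector<Particle> slots_;   // contiguous storage -> cache friendly
    std::vector<std::size_t> free_; // indices of unused slots
};
```

Because the slots live in one contiguous vector, iterating live particles also tends to be cache friendly, which ties into the ECS point below.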
ECS and other data-oriented programming patterns make heavy use of this fact, building highly performant systems by reducing memory accesses and allocations through cache coherency.
In reality, CPU bound operations do not contribute to performance issues as much as memory access tends to.
5th - Limit Your Problem Space
Sometimes you can't avoid an expensive operation. A good example of this is collision detection, which is usually somewhere around O(n^2). You can improve the performance of these operations by limiting the amount of data they have to process.
A great example of this is how most systems handle collision. They do what is called a two-pass check, where the first pass finds which entities could be colliding with each other, and the second pass is the more expensive check of whether they actually are colliding.
This may sound wasteful on the surface, since you are adding more overhead to an already really expensive operation, but in reality the worst-case scenario (all entities could be colliding with each other) is very rare, so that first pass dramatically reduces the number of checks the much more expensive second pass has to do.
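A rough sketch of the two-pass idea, with made-up types: a cheap axis-aligned bounding-box overlap test culls pairs before the (here, stand-in) exact test runs. Real engines pair this with spatial partitioning so the broad phase is not itself O(n^2).

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Circle {
    float x, y, r;
};

// Broad phase: cheap bounding-box overlap check.
bool aabbOverlap(const Circle& a, const Circle& b) {
    return std::abs(a.x - b.x) <= (a.r + b.r) &&
           std::abs(a.y - b.y) <= (a.r + b.r);
}

// Narrow phase: the "expensive" exact test (circle vs circle here).
bool exactOverlap(const Circle& a, const Circle& b) {
    float dx = a.x - b.x, dy = a.y - b.y;
    float rs = a.r + b.r;
    return dx * dx + dy * dy <= rs * rs;
}

std::vector<std::pair<std::size_t, std::size_t>>
findCollisions(const std::vector<Circle>& entities) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    for (std::size_t i = 0; i < entities.size(); ++i)
        for (std::size_t j = i + 1; j < entities.size(); ++j)
            if (aabbOverlap(entities[i], entities[j]) &&  // pass 1: cull
                exactOverlap(entities[i], entities[j]))   // pass 2: confirm
                hits.emplace_back(i, j);
    return hits;
}
```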
Another great example is changing your sim resolution. Maxis has a great talk on how they do this for The Sims 3: when sims are not on screen, the simulation for them becomes very low resolution, and only when they are on screen does the full simulation act on them. I am fairly sure this is the GDC talk where they discuss it, and it is also a great talk on some of the benefits of data-driven design:
6th - Do NOT Mistake Structural Issues for Performance Issues
This goes back a bit to point 3 above. The compiler is much, much, much smarter than any of us. Stuff that looks non-performant on the surface will usually be optimized out by the compiler. A great example is when people kept yelling about all the if statements in Yandere Sim. Structural and architectural issues do not usually translate into performance issues; they can only reliably be used to spot problems with extending or maintaining the code base.
I highly recommend the video done by dyc3 on the issue as he goes into more detail and does a really good job explaining it.
7th - Multithreading: Great Rewards, Greater Risk
Multithreading can be a really great way to increase the performance of your software in general, but you need to be very careful: it is a land filled with landmines that can cause much more harm than good.
Two of the most common issues with multithreading that can torpedo your performance are thrashing and atomic operations.
The easiest to explain first is atomic operations. These are parts of your code that cannot be done in parallel or cannot be interrupted. This can be intentional or unintentional, but depending on where such an operation sits, it can make your multithreaded code run as if it were on a single thread, since every thread has to wait its turn. Basically, one wrong mutex, or code that requires one, can make your program run as if it were single-threaded anyway.
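A sketch of that failure mode, with made-up function names: the first version takes a single hot mutex on every increment, so the threads mostly wait on each other; the second accumulates per-thread and combines once at the end, so the threads genuinely run in parallel. Both produce the same answer.

```cpp
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

long sumContended(int threads, long perThread) {
    long total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            for (long i = 0; i < perThread; ++i) {
                std::lock_guard<std::mutex> lock(m); // every iteration queues up here
                ++total;
            }
        });
    for (auto& th : workers) th.join();
    return total;
}

long sumPartitioned(int threads, long perThread) {
    std::vector<long> partial(threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&partial, t, perThread] {
            for (long i = 0; i < perThread; ++i)
                ++partial[t];                        // no shared lock at all
        });
    for (auto& th : workers) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Note the partitioned version still has adjacent counters that can share a cache line, which is exactly the thrashing problem described next.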
Second, thrashing. This usually refers to threads or processes fighting each other for resources while they execute (usually cache). Thread context switching is expensive, and if you don't have good cache coherency, threads can blow away the cache for other threads, causing them to spend a ton of time waiting on memory-bound operations to reload their state (see point 4 above).
This is common with OOP patterns, since they tend not to be very parallel friendly; DOP patterns make these issues much easier to avoid, since data and processing are more separated from each other.
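One common fix for cache fights between threads (false sharing) is to pad per-thread state out to its own cache line, so two cores writing neighbouring counters stop invalidating each other's cache. This is a sketch assuming a 64-byte cache line, which is typical on x86; C++17's `std::hardware_destructive_interference_size` is the portable way to ask when your implementation provides it.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed; typical x86 cache-line size

struct PaddedCounter {
    alignas(kCacheLine) long value = 0; // each counter owns a full line
};

static_assert(alignof(PaddedCounter) == kCacheLine,
              "counter should start on a cache-line boundary");
static_assert(sizeof(PaddedCounter) >= kCacheLine,
              "counter should span a full cache line");
```

Giving each thread a `PaddedCounter` instead of a bare `long` in a shared array trades a little memory for a lot less cache-line ping-pong.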
8th - Smaller is Not Better
When it comes to memory access, smaller is actually not always better. This is hardware specific, but it matters more that your memory layout plays nicely with how the CPU expects to load it. For example, most CPUs like to load data on natural alignment boundaries (e.g., 4 or 8 bytes), and data that straddles those boundaries can be quite a bit harder for the CPU to load and process. Luckily, as discussed in point 3, the compiler will usually make sure your memory layout is as optimized for the CPU as it can be.
Also, cache coherency matters much more for fast data access than a small memory footprint does.
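You can see the alignment point directly in struct padding: the compiler inserts gaps so each field sits on its natural boundary, and simply reordering members changes the total size. The exact sizes below assume a typical 64-bit ABI, so treat them as illustrative.

```cpp
#include <cstdint>

struct Loose {
    std::uint8_t  a;   // 1 byte + 3 bytes padding before b
    std::uint32_t b;   // 4 bytes, wants 4-byte alignment
    std::uint8_t  c;   // 1 byte + 3 bytes tail padding
};                     // usually 12 bytes

struct Packed {
    std::uint32_t b;   // 4 bytes
    std::uint8_t  a;   // 1 byte
    std::uint8_t  c;   // 1 byte + 2 bytes tail padding
};                     // usually 8 bytes, same fields

static_assert(sizeof(Loose) == 12, "assumes a typical 64-bit ABI");
static_assert(sizeof(Packed) == 8,  "assumes a typical 64-bit ABI");
```

Grouping members by size like `Packed` does is a cheap, low-risk way to shrink hot structs without fighting the alignment rules.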
C++ advice
Ok, here is where we get into some fun C++-specific advice, but this is getting long, so I will try to keep it short and stick to tips that are not commonly talked about or known.
1st - Stack vs Heap
Did you know you can allocate objects on the stack instead of the heap? A ton of people don't know about this trick in C++. It is commonly associated with a form of RAII called SBRM (Scope-Bound Resource Management). Allocating objects on the stack has several advantages, from preventing memory leaks to faster access of that memory, since stack allocation and access are much faster than doing anything on the heap.
The downside is it's not very dynamic, and you have a limited amount of stack space compared to the heap.
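A quick sketch of the same (made-up) object created three ways. The stack version is destroyed automatically when the scope ends, which is the SBRM/RAII behaviour, and involves no allocator call for the object itself.

```cpp
#include <memory>
#include <string>

struct Enemy {
    std::string name;
    int hp = 100;
};

void example() {
    Enemy onStack{"goblin"};                 // lives in this frame, freed on return

    auto owned = std::make_unique<Enemy>();  // heap, but RAII-managed
    owned->name = "orc";

    Enemy* raw = new Enemy{"troll"};         // heap, easy to leak
    delete raw;                              // you must remember to free it
}
```

(Note `Enemy` still owns a `std::string`, which may heap-allocate internally for long names; the stack win applies to the object itself.)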
2nd - Meta Programing: How to Abuse the Compiler & Preprocessor
Metaprogramming is one of those black-magic concepts in C++ that is really powerful but very, very complex. You will usually see it mentioned as a way to eliminate the overhead of vtables while maintaining some form of polymorphism.
This is because you can emulate a form of duck typing through templates. When you template out a method, object, or function, you can say you expect certain methods/properties on a type, and the compiler will check for these at compile time. This lets you enforce an interface through the compiler instead of through an abstract type like an interface class.
This gets even more powerful once you get into CRTP or the type traits machinery in newer versions of C++.
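Here is a minimal CRTP sketch of that idea, with made-up shape types: `Shape<T>` calls `T::areaImpl()` directly, so there is no vtable and the call can be inlined, yet every derived type must still provide the method or the code fails to compile, duck-typing style.

```cpp
template <typename Derived>
struct Shape {
    double area() const {
        // Resolved at compile time; no virtual dispatch.
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

struct Square : Shape<Square> {
    double side = 0.0;
    double areaImpl() const { return side * side; }
};

struct Rect : Shape<Rect> {
    double w = 0.0, h = 0.0;
    double areaImpl() const { return w * h; }
};

// A templated function enforcing the "interface" through the compiler.
template <typename A, typename B>
double totalArea(const Shape<A>& a, const Shape<B>& b) {
    return a.area() + b.area();
}
```

Passing a type without `areaImpl` to `totalArea` is a compile error rather than a runtime one, which is exactly the interface-through-the-compiler trade described above.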
3rd - Malloc or New Too Slow? Make Your Own!
I do not suggest anyone does this without a really, really, REALLY good reason, but you can override quite a few system-defined operations with your own custom ones if needed. This includes things like new, delete, malloc, and free.
It's very common to see power users of the language do this to track memory allocations and deallocations for debugging, but it is sometimes used to implement more performant memory allocation schemes.
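As a taste of the debugging use case, here is a sketch that replaces the global `operator new`/`operator delete` pair to count allocations. A production-grade replacement must also cover the aligned and nothrow overloads; this minimal version handles only the basic ones.

```cpp
#include <cstdlib>
#include <new>

// Globals tracking every allocation that goes through operator new.
static std::size_t g_allocCount = 0;
static std::size_t g_freeCount  = 0;

void* operator new(std::size_t size) {
    ++g_allocCount;
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept {
    if (p) ++g_freeCount;
    std::free(p);
}

// Sized delete (C++14) forwards to the unsized version.
void operator delete(void* p, std::size_t) noexcept {
    operator delete(p);
}
```

Comparing `g_allocCount` and `g_freeCount` at shutdown is a crude but effective leak check; swapping `std::malloc` for a custom arena is the performance variant of the same trick.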