What methods, patterns, or tricks do you employ when coding to help your game run faster?

My context for asking this is I'm currently building a game in C++, and I'm always looking for new ways of thinking to help improve how I code my game to make it run faster.

Measure, measure, measure the time. You don’t know what to speed up - let alone how - until you know where the time is being spent. Tools that do this are called profilers. Some are better than others; all are better than guessing.

To say anything more… what’s your game like, and what’s slow about it?

That is great advice and I'm already doing this. My game is a life management sim, with heavy npc and world simulations to bring the world to life. What's slow? Not sure anything is currently slow, but can things be faster? Always!

I guess if there was a long tent pole, so to speak, in my game it would be the npc turn calculation. Simple combination/permutation math dictates that there are 19,206,486,926,827,200 permutations of what the npc can decide to do in a given day, between locations, travel method, actions, and day periods. My big strategy that seems to be working in regard to this is culling options and possibilities as early as possible in the decision-making process, to cut the number of permutations up front.

Cool! One thing in your favor is that your NPCs are simulated people, so they don’t have to find the very best choice. This means that instead of starting from 520 choose 6 possibilities and culling down, you can start from 1 and build up!

One quick 'n dirty solution might be to generate… 24? random possible day plans per person and pick the best according to some measure. Or you could come up with a simple procedure to build a good enough plan. Something like a greedy first-pass, then a pass to consider simple possible changes/swaps. Or every day consider a few variations on yesterday’s (or same day last week’s) schedule, and pick the best.
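
Something like this, as a rough sketch (DayPlan, Npc, randomPlan, mutatePlan, and scorePlan are all made-up stand-ins for whatever your real types and scoring would be):

#include <random>

// Hypothetical helpers: randomPlan() builds a legal plan, mutatePlan() makes a
// small change (swap an action, change a travel method), scorePlan() rates it.
DayPlan planDayCheaply(const Npc& npc, std::mt19937& rng)
{
    DayPlan best = randomPlan(npc, rng);
    float bestScore = scorePlan(npc, best);

    // A couple dozen random plans instead of enumerating everything.
    for (int i = 0; i < 24; ++i)
    {
        DayPlan candidate = randomPlan(npc, rng);
        float score = scorePlan(npc, candidate);
        if (score > bestScore) { best = candidate; bestScore = score; }
    }

    // Improvement pass: keep any small tweak that scores better.
    for (int i = 0; i < 24; ++i)
    {
        DayPlan tweaked = mutatePlan(best, rng);
        float score = scorePlan(npc, tweaked);
        if (score > bestScore) { best = tweaked; bestScore = score; }
    }
    return best;
}

The "yesterday's schedule" variant would just seed the improvement pass with the previous day's plan instead of a random one.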

I mean, you could also optimize on the level of the code, but with math like this code quality is probably not the main issue. Better to relax the constraints in a harmless way.

This may not be relevant, but recently I’ve been wondering how to get efficient lookup of interactable elements, i.e. finding a table to use.
I realised the reason a lot of games I know use rooms is probably to break down that task. If you know a table will be in room X, you only need to figure out the closest room, then search within the room.
It saves you having to search the whole nav graph.
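
Roughly what I mean, as a sketch (Room, Interactable, Vec3, and distanceSquared are all made up here):

#include <limits>
#include <vector>

Interactable* findNearestTable(const Vec3& npcPos, std::vector<Room>& rooms)
{
    // 1. Find the closest room that actually contains a table.
    Room* bestRoom = nullptr;
    float bestDist = std::numeric_limits<float>::max();
    for (Room& room : rooms)
    {
        if (room.tables.empty()) continue;                // nothing useful here, skip it
        float d = distanceSquared(npcPos, room.center);
        if (d < bestDist) { bestDist = d; bestRoom = &room; }
    }
    if (!bestRoom) return nullptr;

    // 2. Only search within that room instead of the whole nav graph.
    Interactable* best = nullptr;
    bestDist = std::numeric_limits<float>::max();
    for (Interactable* table : bestRoom->tables)
    {
        float d = distanceSquared(npcPos, table->position);
        if (d < bestDist) { bestDist = d; best = table; }
    }
    return best;
}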

Well, this really depends on the data structures for your game, and the exact implementation details of that depend on the language your game is written in and your choices of how to architect it.

Is this a 3D game where you're trying to find all the interactable objects in 3D space for an npc to interact with?

I guess what I'm getting at is that, as the game developer, unlike the npc, we already know the where, what, when, why, and how of everything, so how you architect your data determines how you go about accessing it when the time comes.

Yeah, breaking large problems into smaller, more discrete problems is a common tactic, not just for optimization, but for coding in general. The fastest thing to do is nothing at all, so the more things you can find ways to ‘ignore’ without doing any processing on them, the better. This applies to @zdeerzzz as well:

Simple combination/permutation math dictates that there are 19,206,486,926,827,200 permutations of what the npc can decide to do in a given day,

If you can find branch points in what your npc can decide to do, then you don’t have to consider many of those options. I.e. if there’s a bunch of smithing actions, but they’re not a smith, or there’s no anvil, don’t even consider the smithing actions as possible tasks. Another thing you could do is look for dependencies. If, in your case, the npc can choose not to travel, only consider travel methods if they’ve already decided to travel somewhere. Finite State Machines are a very simple way to start organizing this kind of logic. There’s more complicated stuff for decision making, but I don't know what your level of familiarity with programming and game AI is.
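
For example, something along these lines (all the names here are invented, not from any actual game):

#include <vector>

// Whole categories of actions are skipped with one cheap check instead of
// being generated and scored one by one.
std::vector<Action> gatherCandidateActions(const Npc& npc, const World& world)
{
    std::vector<Action> candidates;

    if (npc.isSmith && world.hasAnvilNear(npc))
        appendSmithingActions(candidates);      // otherwise smithing is never even considered

    if (npc.isFarmer)
        appendFarmingActions(candidates);

    // Dependency: travel methods only matter once travel itself is chosen.
    if (npc.wantsToTravelToday())
        appendTravelOptions(candidates, npc);

    return candidates;
}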

One thing I find myself doing, when I’m not getting into the weeds of micro optimizations, is thinking of things in terms of overarching systems. Many modern programming languages are object oriented, and that’s fine and dandy, but it can obscure the fact that these aren’t just individual objects acting individually; they’re all part of a greater system.

A concrete example in a game I’ve worked on was fish pathing. We let users place fish in arbitrarily sized aquarium volumes, so there’s not much we can assume about where the fish are swimming. For the most part the idea is simple: try to pick a valid spot to swim to, get there and pause/do fish things, repeat.
The problem was that because of the arbitrary volumes, we had to do a bunch of ray casts (which are expensive) to figure out what were valid directions the fish could swim in. This meant users with many fish would lag like crazy. The solution was to make a fish AI system that kept a pointer (reference, whatever) to each fish, and instead of the fish ticking/updating each frame on its own, the system would work its way through the list of fish, only updating so many each frame. Now there’s an upper bound on how many fish can be raycasting, meaning it can only get so laggy. And fish just kinda hang out a lot, so it doesn’t look that weird if some of them don’t get updated for a while because there are hundreds of fish or what have you.
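
The shape of it was something like this (a sketch with made-up names, not the shipped code):

#include <algorithm>
#include <cstddef>
#include <vector>

class FishSystem
{
public:
    void registerFish(Fish* fish) { fish_.push_back(fish); }

    void tick(float dt)
    {
        // Upper bound on how many fish get the expensive raycast-heavy
        // repathing this frame, no matter how many fish exist.
        const std::size_t kMaxRepathsPerFrame = 8;
        std::size_t repaths = std::min(kMaxRepathsPerFrame, fish_.size());
        for (std::size_t i = 0; i < repaths; ++i)
        {
            fish_[next_]->pickNewSwimTarget();       // the laggy part
            next_ = (next_ + 1) % fish_.size();      // round-robin through the list
        }

        // Cheap movement still runs for every fish every frame.
        for (Fish* f : fish_) f->moveTowardTarget(dt);
    }

private:
    std::vector<Fish*> fish_;
    std::size_t next_ = 0;
};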

Okay and now some micro optimization stuff because I just want to nerd out.

If you do want to get into more micro optimization, learning more about how CPUs handle instructions, memory, branching, and so on is helpful. Mind you, most of the time this isn’t necessary at the game logic level unless you’re doing very particular things, but if you have many objects that do the same/similar things, there are some things you can do:

1 - group by logic
Each instruction has to be sent to the CPU, so CPUs cache instructions and basically try to ‘work ahead’ as much as possible. If your logic is changing all the time (e.g. you run A.Update() then B.Update() then A.Update() again) then the CPU can’t cache all the instructions, because it needs to run different ones all the time. The solution is to run all the A’s together, then all the B’s (again, this really only makes sense to do if you have many A’s and/or many B’s).
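
In code, the difference is just the ordering (sketch, assuming A and B each have some Update method):

#include <vector>

// Interleaved: the instruction stream keeps switching between A and B logic.
// for (Entity* e : mixedEntities) e->Update();

// Batched: all the A logic runs back to back, then all the B logic.
void updateWorld(std::vector<A>& as, std::vector<B>& bs)
{
    for (A& a : as) a.Update();
    for (B& b : bs) b.Update();
}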

2 - know your data
In a similar manner to instructions, CPUs cache memory as well. It turns out the physical distance from your RAM to the CPU matters, because CPUs are so fast that the time for the signal to travel is actually relatively slow. Thus, CPUs have a small cache on the chip itself, and when they need to get something you have in memory, they pull in all the surrounding memory too, in hopes that it contains whatever else they'll need next. You can take advantage of this by putting all your similar data in one spot, instead of allocating it wherever. E.g. if you have 500 of class Dingbot, and you compute something with each of their data every frame, it’s faster to keep them all in one packed array than to have them scattered about memory. This is because when you compute Dingbot[0], the CPU has likely pulled Dingbot[1, 2, …etc] into cache since they’re all next to each other in memory, so when you compute over Dingbot[1] it doesn’t need to go out to main memory again.
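
For instance, a sketch of the Dingbot case (the fields here are made up):

#include <vector>

struct Dingbot
{
    float x, y, z;
    float heat;
};

void simulate(std::vector<Dingbot>& bots, float dt)   // packed, contiguous storage
{
    for (Dingbot& d : bots)
        d.heat -= d.heat * 0.1f * dt;   // touching memory in order: the cache line
                                        // fetched for bots[0] usually already holds
                                        // bots[1], bots[2], ...
}

// Compare with std::vector<Dingbot*>, where each element can live in a
// completely different spot on the heap and each dereference can be a miss.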

3 - reduce your memory footprint
This goes with #2, and for much the same reason. Because the CPU cache is small (it has to fit on the chip), the smaller your data structures are, the more of them fit in the cache, and the less frequently the CPU has to fetch from main memory.
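
A tiny illustration (hypothetical structs, but the same idea as using uint8_t anywhere 255 is enough):

#include <cstdint>

struct FatPlanStep               // 24 bytes: ~2 steps per 64-byte cache line
{
    std::int64_t action;
    std::int64_t location;
    std::int64_t travelMethod;
};

struct SlimPlanStep              // 3 bytes: ~21 steps per 64-byte cache line
{
    std::uint8_t action;         // fine as long as these never exceed 255
    std::uint8_t location;
    std::uint8_t travelMethod;
};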

4 - reduce conditional branches
This follows from #1: if the CPU is trying to cache instructions ahead of time, then branching (i.e. if (true/false)) breaks this, because if it’s true it has to load one set of instructions, and if it’s false it has to load a different set. You’d be surprised how often I’ve been able to factor code out of branches, or branch less frequently, by organizing my code and my data better. An example: if I have class A plus class B : A and class C : A which inherit from A, and I know that the code branches based on class, then branching once and running the code for all of the B’s, then branching again and running the code for all of the C’s, rather than branching on each individual object, can dramatically reduce how often you branch.
An example (C++-ish):

// many branches
for (A* obj : children_of_A)
{
    if (obj->is_shiny)
        obj->ShinyCode();
    else
        obj->DullCode();
}

// versus less branched code

for (B* obj : b_class_objs)
{
    // we know all B's are shiny, so no branch
    obj->ShinyCode();
}

for (C* obj : c_class_objs)
{
    // we know all C's are dull, so no branch needed
    obj->DullCode();
}

In both cases we iterate over the same number of objects, but by knowing our data and separating it properly, we can reduce the number of branches.

In conclusion:

  • You probably don’t need to do most of this unless you’re doing something ambitious.
  • Know your data
  • Just like the real world “Cache is king baby”

check, check, and check.

So the interesting thing here is the prefetcher in a CPU actually handles random data in the heap fairly well, to a point; it's more important to check/know how quickly you're jumping from memory object to memory object. I have tested with both arrays of pointers and arrays of the objects themselves, and at a certain complexity of object, or of the function operating on the object, putting the data next to each other in memory really doesn't matter when the prefetcher is doing a good job keeping up with where the code is at during execution. And I'm not talking about an array of int's vs int*'s. This leads me back to @BadIdea4 and measuring. The takeaway here is not to sacrifice code simplicity for speed until you're sure it will actually produce a faster result.

I would add there is a clear exception to this, and that is conditional branching that gives an early out of a long loop, or of a loop containing nested loops.
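
E.g. something like this (names invented):

for (const Npc& npc : npcs)
{
    if (npc.asleep) continue;                   // early out: skip the whole inner loop
    for (const Action& action : actions)
    {
        if (!npc.canPerform(action)) continue;  // early out: skip the expensive scoring
        scoreActionForNpc(npc, action);
    }
}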

I have tested with both arrays of pointers and arrays of the objects themselves, and at a certain complexity of object, or of the function operating on the object, putting the data next to each other in memory really doesn't matter

Yeah, things break down when you’re dealing with larger objects doing various operations all over their data. You’d really have to find separable portions of the code that you can break off, along with some of the data, and tight-loop over it to really get anything out of it, and even then… but I guess that’s why it’s a micro optimization. Not really worthwhile unless you’re doing it hundreds of times a frame lol.

I would add there is a clear exception to this, and that is conditional branching that gives an early out of a long loop, or of a loop containing nested loops.

Oh of course, the fastest thing is to do nothing at all. Now if it’s something trivial like if (do_math) { x += 2 } it’s probably faster to just remove the branch and either always do it or do the branch outside of the loop, etc., but I figured that was maybe more of a given.

Another thing I didn’t mention was branch prediction, because that just makes things more complicated. If do_math is always (or nearly always) true and you’re tight-looping over it, the CPU will remember that it was true last time, assume it’s going to need the instructions for x += 2, and preload them. At that point you’re really only taking the hit of evaluating do_math. Then again, if it’s really always true, it’s trivial to just remove it.
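
Hoisting the check out looks like this (sketch, assuming some values array and count):

// Instead of branching on do_math once per element...
if (do_math)
{
    for (int i = 0; i < count; ++i)
        values[i] += 2;          // ...the tight loop itself has no branch at all
}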


So I’m investigating an interesting behavior. In the trace you can see the first human (npc) to have its turn planned took longer than the rest, which is interesting. I did not expect the first to be the slowest to process by a large margin: 10ms vs an average of 3.4ms for the rest. Also, the rest slowly get slower in order.

Can you get statistics about memory prefetching and time spent waiting for memory? I suspect the CPU was waiting on memory for the first npc and less so for the rest, but that’s just a guess.

Alternately, if you’re more cpu bound than memory bound, what’s with setGoals taking so long? Are your npcs changing their life goals every day, or does that function do something else?

A couple of things to keep in mind:

First, Chrome tracing has a JSON file intake limit of around 400 MB, so I'm limited in the number of npcs I can spawn when profiling (more npcs == more data points == bigger file). With the profiler disabled at runtime I normally run with 3000 npcs, which seems to be a manageable number and is currently within the bounds of my performance target.

Second, with the npc count being so low, it shrinks the rest of the world proportionately, meaning the time spent in the #genpossible part is about 1/20 of what it normally is. SetGoals time is inflated due to the lack of selection. I need to go work on things related to this next.

Third, SetGoals is another function called in planday, and it itself calls picktargetjob, picktargethouse, and updategoalscore. The first two will not run every time SetGoals is called; it just happens that they have to run the first time you plan, which is what I'm currently capturing. During gameplay you can expect the time to be much shorter.

Fourth, keep in mind that the profiler can only write to file from one thread, effectively forcing me to run the game single-threaded, which is not how it plays normally.

I have been mostly focused on #genpossible, #possible, and #possiblescore, as they started as the longest tent poles.

Here is a look when running the profiler with 1000 npcs, although I have no visualization (Chrome tracer):


This generated a 1.5 GB file of data points, which is why I run a lower npc count when checking incremental changes; it's much faster to process and open. Also, refer to point four: normally the 83 seconds to process all 1000 turns would have taken around 8 seconds when running multithreaded.

What does “selection” mean here?

Hm. What happens if you let the npcs start with some default goals, and call setGoals in rotation instead of all at the same time? So, npc #1 sets goals on the first day, npc #2 the second… Or maybe this isn’t a helpful idea.
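
A tiny sketch of what I mean (setGoals and planDay here are just stand-ins for your functions):

#include <cstddef>
#include <vector>

void planAllNpcs(std::vector<Npc>& npcs, std::size_t dayNumber)
{
    for (std::size_t i = 0; i < npcs.size(); ++i)
    {
        if (dayNumber % npcs.size() == i)   // only this npc rethinks its goals today
            setGoals(npcs[i]);
        planDay(npcs[i]);                   // everyone still plans their day
    }
}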

I notice many of these functions are called the same 100 times, and I assume it’s always in the same order. What if you merge them all into one function, and consider that as the thing to be optimized? Or at least the subtree of longest-taking functions, genPossible, possible, and possibleScore.

Also, could we see the code for those functions?

Have you tried a line profiler, or a sampling profiler?

Selection of target housing, jobs, or anything else the npc may need to pick from. The problem when the npc count is small has to do with how the world generation code works, and having to change my npc count around for profiling has exposed that it doesn't scale well with npc count. This is a later problem.

Ok, so I have profiled the overhead of a function call in C++ before, and we're talking less than 0.001ms of overhead (and that measurement includes the overhead of the calls the profiler itself made internally to take it). You can see that most of the time the scope is in one big function, planday, with exactly 4 cases of the function call overhead happening, so we're talking an overall impact of less than 0.004ms per call to planday. Consider that planday itself took 83ms total compared to the less than 0.004ms of time possibly saved; not the best thing to go after.

Yes, each npc will have run planday and all the functions it calls; no, an npc does not have the option to skip any of these (except, as mentioned earlier, with goalscore). Each npc must have a valid plan to execute on its turn, so to speak. Having an npc not participate in a day is not acceptable to my design goals.

Ok, when reading the breakdown of the different things called, note that all the #_______ entries are sections of a function I'm specifically profiling. They're not functions themselves, rather scopes or sub-scopes: everything in between { }. #GenPossible is a sub-scope surrounding the 6 for loops churning out the possible plans, the 6 actions taken across the day periods. #Possible is the scope of the innermost for loop, at which point we're looking at a specific plan. #Possiblescore is the sub-scope in #possible of the code that is scoring that specific plan. Keep in mind we are looking at ~0.01ms of time at the scope of #Possiblescore.

So, looking at the profile of the run with 1000 npcs, we generated ~3.6 million possible plans and scored them across our 1000 npcs. That's ~3,600 plans looked at per npc. Remember there were 19,206,486,926,827,200 possible permutations of day plans to look at for each npc, so you can see the culling really coming into play here.

I'm gonna pass on this; I don't feel like having to explain how the data is structured and what it all does. For you to make any sense of it you would probably just need to look at the whole project, which I'm not going to post. There are a lot of optimizations still to be made, not in the planday function, but in how the data is structured; I'm working my way out, so to speak.

I have done this with my custom built-in profiler: just put the line in question in its own sub-scope to get how long it took to run.

No, because it takes random samples, which I'm not interested in; I'd rather just look at the whole picture. Also, its results come with way more inaccuracies than what I'm doing (this has to do with how sampling profilers average data points since they don't have every data point; it's an inherent weakness of only sampling periodically).

I’m aware that C++ has very low function call overhead - one of the great upsides of C and C++! I was suggesting manual inlining as a way to see the sequence of instructions more clearly, to help you see potential optimizations. But given the next quote I don’t think that’s necessary.

Ah, I was not aware of that. I was assuming they were static methods as opposed to freestanding functions or something.

That’s totally fair and up to you. It’s your code! But explaining your code to someone else is one of the better ways to get unstuck on a programming problem, so I think you would probably find it worthwhile.

So, given what I have to go on, I can suggest two possible transformations:

(1)

What if instead of 6 loops in sequence, you had one loop of the six things in sequence? Not so much as an optimization in itself, but as a transformation that could lead to one.

(2)
It seems like #possible, #possibleScore, and #possibleUpdate are all pretty fast already, and being called many many times. Can you think of ways to have to call them less? Use some more heuristics about what would likely make a better plan to avoid even considering some plans/changes?

If the problem is that the npcs want and are trying to maximize different things, and have different stats and locations, maybe find a way to categorize them and use heuristics for each category? For example, non-drivers should probably have a smaller distance cutoff than drivers, so if you process them separately you only have to consider the further destinations for the drivers.

I guess what I’m getting at is do some work based on the activities in themselves to not have to consider every activity for every npc, or do so more aggressively.
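
For instance (made-up fields and cutoffs):

#include <vector>

void gatherDestinations(std::vector<Npc*>& npcs, const World& world)
{
    std::vector<Npc*> drivers, walkers;
    for (Npc* n : npcs)
        (n->canDrive ? drivers : walkers).push_back(n);

    // Walkers only ever consider nearby locations...
    for (Npc* n : walkers) considerLocationsWithin(*n, world, 2.0f);
    // ...while drivers get the larger cutoff.
    for (Npc* n : drivers) considerLocationsWithin(*n, world, 30.0f);
}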

It's 6 nested loops: a for in a for in a for in a for in a for in a for. We're generating a list of 6 numbers a, b, c, d, e, f, with each for loop being responsible for one of the 6 numbers. The result is the 6 actions taken to make up a day.

already doing this:

#travel builds a multi-level dictionary scoring travel from one location to another by method of travel, for all locations known by the npc and all valid methods

#travelmax picks the max score for each travel possibility (i.e. from location x to location y) and stores it in a map

#actions precomputes the scores of a given action during a given day period for the npc based on its goals, and only for actions the npc can even do (not bothering with an action an npc can't do)

#a_loc finds all valid locations for an action we just scored above

#a_locMax finds the best location for an action by travel score from travelmax and stores it in a map

#dp_possible filters actions by day period, deciding which ones we should even consider for a given period of the day. Basically, if it has a negative score during that period it's not kept. This is stored in a map.

This all allows us to only have to do 6 for loops for #Genpossible instead of way more. Our 6 loops iterate over the picked actions for each day period we found in dp_possible.

Also, since we did all the pre-work on scoring outside the loops, we skip right to adding up the scores from a_locMax and travelMax for each action. If the score is higher than the previous iteration through the loops, we also store the location and travel method in those maps, along with the actions picked, as our new best day plan.

All this removes all new memory allocations from inside genpossible, removes a lot of logic and math, and allows us to shortcut almost all of the permutations; we basically have pre-determined locations and travel methods. We also don't have to access any non-stack memory inside genpossible because of this, not even npc stats.
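
To give a rough idea of the shape without posting the real code, here's a heavily simplified sketch; the names and container shapes are illustrative, and the travel/location lookups are left out:

#include <array>
#include <cstdint>
#include <vector>

// Heavily simplified: only the action-score part is shown.
struct BestPlan { std::array<std::uint8_t, 6> actions{}; float score = -1e30f; };

BestPlan genPossibleSketch(const std::array<std::vector<std::uint8_t>, 6>& dpCandidates,
                           const std::array<std::array<float, 256>, 6>& actionScore)
{
    BestPlan best;
    std::array<std::uint8_t, 6> plan{};
    for (std::uint8_t a : dpCandidates[0]) { plan[0] = a;   // one loop per day period,
    for (std::uint8_t b : dpCandidates[1]) { plan[1] = b;   // iterating only the
    for (std::uint8_t c : dpCandidates[2]) { plan[2] = c;   // pre-filtered actions
    for (std::uint8_t d : dpCandidates[3]) { plan[3] = d;
    for (std::uint8_t e : dpCandidates[4]) { plan[4] = e;
    for (std::uint8_t f : dpCandidates[5]) { plan[5] = f;
        float score = 0.0f;
        for (int p = 0; p < 6; ++p)
            score += actionScore[p][plan[p]];                // just sums precomputed scores
        if (score > best.score) { best.score = score; best.actions = plan; }
    }}}}}}
    return best;
}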

What numeric type are you using for everything? Have you tried switching?

For any action, location, or travel method: uint8_t (none of these need to go over 255); for any score: float. There are some uints sprinkled throughout for traversing large arrays, vectors, and maps. Not really; it would mostly be switching from float to double or something of that nature, and a lot of the types are determined by having to store the data somewhere while trying to keep the memory footprint down.