Ok, this might turn into a long one, but this stuff is always fun, though really complex. What I will be writing below is generally applicable to all programming, but there will be some C++-based stuff as well.
General Advice
1st - Remember Performance is Usually a Trade Off
The reason you usually hear "don't optimize early" is because performance is generally a trade-off between usability and ease of maintenance. A great example of this is OOP itself. OOP adds a ton of overhead within programs through the use of vtables, more complex memory footprints, and heavier reliance on indirect memory access. This is all usually bad performance-wise, but it is done to increase the ease of use and maintainability of the program.
2nd - Use a Good Profiling Tool
In general you can make some good guesses about where your bottlenecks are, but without a good profiling tool you can never know for sure. Even in my day job I have seen really smart, senior devs get their assumptions destroyed after running an actual profile on the software. This is also much more common than you might think, especially for the reasons I will outline in section 3 here.
I am a big fan of JetBrains tools and use those myself, but most engines already come with some good profiling tools included. I suggest not rolling your own if you can help it, as a homemade profiler will usually only give you a high-level, not very detailed picture of your performance. I know profilers can be a bit expensive, but there are a few good OSS ones out there.
3rd - Remember the Compiler is Smarter Than You
A massive benefit of compiled languages is that the compiler can do a TON to optimize your code, and modern compilers really do. Running the compiler with the highest optimization flags will butcher your code to make it run more efficiently. If you compare an unoptimized binary with an optimized one, they are nearly two different programs, and if you profile them the results can come out really different.
4th - 9/10 Times Its Memory
Memory access/creation is EXPENSIVE. And not just I/O: even RAM-to-CPU-cache and cache-to-cache transfers can be orders of magnitude slower than CPU-bound operations.
I can't count how many times I have been asked to look at a program only to see 30-50% of the run time spent waiting on malloc/free operations. This is especially true if you are using std::string in C++ (it's notorious for being really memory inefficient), and if you make a lot of allocations on the heap it can cause a lot of heap fragmentation that slows down memory access.
Patterns like Flyweight or Object Pool can help with some of that. Also, bulk reserving and loading data up front can help increase cache coherency by reducing memory fragmentation.
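To make the Object Pool idea concrete, here is a minimal sketch: all the slots are allocated up front in one contiguous block, so acquiring and releasing objects never touches malloc/free after construction. The `Particle`/`ParticlePool` names are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

struct Particle {
    float x = 0.0f, y = 0.0f;
    bool alive = false;
};

class ParticlePool {
public:
    explicit ParticlePool(std::size_t capacity) {
        slots_.resize(capacity);           // one bulk allocation up front
        free_.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i)
            free_.push_back(capacity - 1 - i);
    }

    // Returns nullptr when the pool is exhausted instead of heap-allocating.
    Particle* acquire() {
        if (free_.empty()) return nullptr;
        Particle* p = &slots_[free_.back()];
        free_.pop_back();
        p->alive = true;
        return p;
    }

    void release(Particle* p) {
        p->alive = false;
        free_.push_back(static_cast<std::size_t>(p - slots_.data()));
    }

private:
    std::vector<Particle> slots_;   // contiguous storage -> cache friendly
    std::vector<std::size_t> free_; // indices of unused slots
};
```

Because the slots live in one contiguous vector, iterating live particles also tends to be cache friendly, which ties into the ECS point below.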
ECS and other data-oriented programming patterns make heavy use of this fact, building highly performant systems by reducing memory accesses and allocations through cache coherency.
In reality, CPU bound operations do not contribute to performance issues as much as memory access tends to.
5th - Limit Your Problem Space
Sometimes you can't avoid an expensive operation. A good example of this is collision detection, which is usually somewhere around O(n^2). You can improve the performance of these operations by limiting the amount of data they have to process.
A great example of this is how most systems handle collision. They do what is called a two-pass check, where the first pass finds which entities could be colliding with each other, and the second pass is the more expensive check of whether they actually are colliding.
This may sound wasteful on the surface, since you are adding more overhead to an already really expensive operation, but in reality the worst-case scenario (all entities could be colliding with each other) is very rare, so that first pass dramatically reduces the number of checks the much more expensive second pass has to do.
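A rough sketch of the two-pass idea, with made-up types: a cheap axis-aligned bounding-box overlap test culls pairs before the (here, stand-in) exact test runs. Real engines pair this with spatial partitioning so the broad phase is not itself O(n^2).

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Circle {
    float x, y, r;
};

// Broad phase: cheap bounding-box overlap check.
bool aabbOverlap(const Circle& a, const Circle& b) {
    return std::abs(a.x - b.x) <= (a.r + b.r) &&
           std::abs(a.y - b.y) <= (a.r + b.r);
}

// Narrow phase: the "expensive" exact test (circle vs circle here).
bool exactOverlap(const Circle& a, const Circle& b) {
    float dx = a.x - b.x, dy = a.y - b.y;
    float rs = a.r + b.r;
    return dx * dx + dy * dy <= rs * rs;
}

std::vector<std::pair<std::size_t, std::size_t>>
findCollisions(const std::vector<Circle>& entities) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    for (std::size_t i = 0; i < entities.size(); ++i)
        for (std::size_t j = i + 1; j < entities.size(); ++j)
            if (aabbOverlap(entities[i], entities[j]) &&  // pass 1: cull
                exactOverlap(entities[i], entities[j]))   // pass 2: confirm
                hits.emplace_back(i, j);
    return hits;
}
```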
Another great example is changing your sim resolution. Maxis has a great talk on how they do this for The Sims 3: when sims are not on screen, the simulation for them becomes very low resolution, and only when they are on screen does the full simulation act on them. I am fairly sure this is the GDC talk where they discuss it, and it is also a great talk on some of the benefits of data-driven design:
6th - Do NOT Mistake Structural Issues for Performance Issues
This goes back a bit to point 3 above. The compiler is much, much, much smarter than any of us. Stuff that looks non-performant on the surface will usually be optimized out by the compiler. A great example is when people kept yelling about all the if statements in Yandere Sim. Structural and architectural issues do not usually translate into performance issues; they can only reliably be used to spot problems with extending or maintaining the code base.
I highly recommend the video done by dyc3 on the issue as he goes into more detail and does a really good job explaining it.
7th - Multithreading: Great Rewards, Greater Risk
Multithreading can be a really great way to increase the performance of your software in general, but you need to be very careful: it is a land filled with landmines that can cause much more harm than good.
Two of the most common issues with multithreading that can torpedo your performance are thrashing and atomic operations.
The easiest to explain first is atomic operations. These are parts of your code that cannot be done in parallel or cannot be interrupted. This can be intentional or unintentional, but depending on where such an operation sits, it can make your multithreaded code run as if it were on a single thread, since every thread has to wait its turn. Basically, one wrong mutex, or code that requires one, can make your program run as if it were single-threaded anyway.
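A sketch of that failure mode, with made-up function names: the first version takes a single hot mutex on every increment, so the threads mostly wait on each other; the second accumulates per-thread and combines once at the end, so the threads genuinely run in parallel. Both produce the same answer.

```cpp
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

long sumContended(int threads, long perThread) {
    long total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            for (long i = 0; i < perThread; ++i) {
                std::lock_guard<std::mutex> lock(m); // every iteration queues up here
                ++total;
            }
        });
    for (auto& th : workers) th.join();
    return total;
}

long sumPartitioned(int threads, long perThread) {
    std::vector<long> partial(threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&partial, t, perThread] {
            for (long i = 0; i < perThread; ++i)
                ++partial[t];                        // no shared lock at all
        });
    for (auto& th : workers) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Note the partitioned version still has adjacent counters that can share a cache line, which is exactly the thrashing problem described next.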
Second, thrashing. This usually refers to threads or processes fighting each other for resources while they execute (usually cache). Thread context switching is expensive, and if you don't have good cache coherency, threads can blow away the cache for other threads, causing them to spend a ton of time waiting on memory-bound operations to reload their state (see point 4 above).
This is common with OOP patterns, since they tend not to be very parallel friendly; DOP patterns make these issues much easier to avoid, since data and processing are more separated from each other.
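One common fix for cache fights between threads (false sharing) is to pad per-thread state out to its own cache line, so two cores writing neighbouring counters stop invalidating each other's cache. This is a sketch assuming a 64-byte cache line, which is typical on x86; C++17's `std::hardware_destructive_interference_size` is the portable way to ask when your implementation provides it.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed; typical x86 cache-line size

struct PaddedCounter {
    alignas(kCacheLine) long value = 0; // each counter owns a full line
};

static_assert(alignof(PaddedCounter) == kCacheLine,
              "counter should start on a cache-line boundary");
static_assert(sizeof(PaddedCounter) >= kCacheLine,
              "counter should span a full cache line");
```

Giving each thread a `PaddedCounter` instead of a bare `long` in a shared array trades a little memory for a lot less cache-line ping-pong.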
8th - Smaller is Not Better
When it comes to memory access, smaller is actually not always better. This is hardware specific, but it matters more that your memory layout plays nicely with how the CPU expects to load it. For example, most CPUs like to load data on natural alignment boundaries (e.g., 4 or 8 bytes), and data that straddles those boundaries can be quite a bit harder for the CPU to load and process. Luckily, as discussed in point 3, the compiler will usually make sure your memory layout is as optimized for the CPU as it can be.
Also, cache coherency matters much more for fast data access than a small memory footprint does.
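You can see the alignment point directly in struct padding: the compiler inserts gaps so each field sits on its natural boundary, and simply reordering members changes the total size. The exact sizes below assume a typical 64-bit ABI, so treat them as illustrative.

```cpp
#include <cstdint>

struct Loose {
    std::uint8_t  a;   // 1 byte + 3 bytes padding before b
    std::uint32_t b;   // 4 bytes, wants 4-byte alignment
    std::uint8_t  c;   // 1 byte + 3 bytes tail padding
};                     // usually 12 bytes

struct Packed {
    std::uint32_t b;   // 4 bytes
    std::uint8_t  a;   // 1 byte
    std::uint8_t  c;   // 1 byte + 2 bytes tail padding
};                     // usually 8 bytes, same fields

static_assert(sizeof(Loose) == 12, "assumes a typical 64-bit ABI");
static_assert(sizeof(Packed) == 8,  "assumes a typical 64-bit ABI");
```

Grouping members by size like `Packed` does is a cheap, low-risk way to shrink hot structs without fighting the alignment rules.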
C++ advice
Ok, here is where we get into some fun C++-specific advice, but this is getting long, so I will try to keep it short and stick to tips that are not commonly talked about or known.
1st - Stack vs Heap
Did you know you can allocate objects on the stack instead of the heap? A ton of people don't know about this trick in C++. It is commonly associated with a form of RAII called SBRM (Scope-Bound Resource Management). Allocating objects on the stack has several advantages, from preventing memory leaks to faster access of that memory, since stack allocation and access are much faster than doing anything on the heap.
The downside is it's not very dynamic, and you have a limited amount of stack space compared to the heap.
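A quick sketch of the same (made-up) object created three ways. The stack version is destroyed automatically when the scope ends, which is the SBRM/RAII behaviour, and involves no allocator call for the object itself.

```cpp
#include <memory>
#include <string>

struct Enemy {
    std::string name;
    int hp = 100;
};

void example() {
    Enemy onStack{"goblin"};                 // lives in this frame, freed on return

    auto owned = std::make_unique<Enemy>();  // heap, but RAII-managed
    owned->name = "orc";

    Enemy* raw = new Enemy{"troll"};         // heap, easy to leak
    delete raw;                              // you must remember to free it
}
```

(Note `Enemy` still owns a `std::string`, which may heap-allocate internally for long names; the stack win applies to the object itself.)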
2nd - Meta Programing: How to Abuse the Compiler & Preprocessor
Metaprogramming is one of those black-magic concepts in C++ that is really powerful but very, very complex. You will usually see it mentioned as a way to eliminate the overhead of vtables while maintaining some form of polymorphism.
This is because you can emulate a form of duck typing through templates. When you template out a method, object, or function, you can say you expect certain methods/properties on a type, and the compiler will check for these at compile time. This lets you enforce an interface through the compiler instead of through an abstract type like an interface class.
This gets even more powerful once you get into CRTP or the type traits machinery in newer versions of C++.
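Here is a minimal CRTP sketch of that idea, with made-up shape types: `Shape<T>` calls `T::areaImpl()` directly, so there is no vtable and the call can be inlined, yet every derived type must still provide the method or the code fails to compile, duck-typing style.

```cpp
template <typename Derived>
struct Shape {
    double area() const {
        // Resolved at compile time; no virtual dispatch.
        return static_cast<const Derived*>(this)->areaImpl();
    }
};

struct Square : Shape<Square> {
    double side = 0.0;
    double areaImpl() const { return side * side; }
};

struct Rect : Shape<Rect> {
    double w = 0.0, h = 0.0;
    double areaImpl() const { return w * h; }
};

// A templated function enforcing the "interface" through the compiler.
template <typename A, typename B>
double totalArea(const Shape<A>& a, const Shape<B>& b) {
    return a.area() + b.area();
}
```

Passing a type without `areaImpl` to `totalArea` is a compile error rather than a runtime one, which is exactly the interface-through-the-compiler trade described above.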
3rd - Malloc or New Too Slow? Make Your Own!
I do not suggest anyone does this without a really, really, REALLY good reason, but you can override quite a few system-defined operations with your own custom ones if needed. This includes things like new, delete, malloc, and free.
It's very common to see power users of the language do this to track memory allocations and deallocations for debugging, but it is sometimes used to implement more performant memory allocation schemes.
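As a taste of the debugging use case, here is a sketch that replaces the global `operator new`/`operator delete` pair to count allocations. A production-grade replacement must also cover the aligned and nothrow overloads; this minimal version handles only the basic ones.

```cpp
#include <cstdlib>
#include <new>

// Globals tracking every allocation that goes through operator new.
static std::size_t g_allocCount = 0;
static std::size_t g_freeCount  = 0;

void* operator new(std::size_t size) {
    ++g_allocCount;
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept {
    if (p) ++g_freeCount;
    std::free(p);
}

// Sized delete (C++14) forwards to the unsized version.
void operator delete(void* p, std::size_t) noexcept {
    operator delete(p);
}
```

Comparing `g_allocCount` and `g_freeCount` at shutdown is a crude but effective leak check; swapping `std::malloc` for a custom arena is the performance variant of the same trick.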