.. _doc_cpu_optimization:

CPU optimization
================

Measuring performance
=====================

We have to know where the "bottlenecks" are to know how to speed up our program.
Bottlenecks are the slowest parts of the program that limit the rate at which
everything else can progress. Focusing on bottlenecks allows us to concentrate
our efforts on optimizing the areas which will give us the greatest speed
improvement, instead of spending a lot of time optimizing functions that will
lead to small performance improvements.

For the CPU, the easiest way to identify bottlenecks is to use a profiler.

CPU profilers
=============

Profilers run alongside your program and take timing measurements to work out
what proportion of time is spent in each function.

The Godot IDE conveniently has a built-in profiler. It does not run every time
you start your project: it must be manually started and stopped. This is
because, like most profilers, recording these timing measurements can
slow down your project significantly.

After profiling, you can look back at the results for a frame.

.. figure:: img/godot_profiler.png
   :alt: Screenshot of the Godot profiler

   Results of a profile of one of the demo projects.

.. note:: We can see the cost of built-in processes such as physics and audio,
          as well as seeing the cost of our own scripting functions at the
          bottom.

          Time spent waiting for various built-in servers may not be counted in
          the profilers. This is a known bug.

When a project is running slowly, you will often see an obvious function or
process taking a lot more time than others. This is your primary bottleneck, and
you can usually increase speed by optimizing this area.

For more info about using Godot's built-in profiler, see
:ref:`doc_debugger_panel`.

External profilers
~~~~~~~~~~~~~~~~~~

Although the Godot IDE profiler is very convenient and useful, sometimes you
need more power, and the ability to profile the Godot engine source code itself.

You can use a number of third-party profilers to do this, including
`Valgrind <https://www.valgrind.org/>`__,
`VerySleepy <http://www.codersnotes.com/sleepy/>`__,
`HotSpot <https://github.com/KDAB/hotspot>`__,
`Visual Studio <https://visualstudio.microsoft.com/>`__ and
`Intel VTune <https://software.intel.com/content/www/us/en/develop/tools/vtune-profiler.html>`__.

.. note:: You will need to compile Godot from source to use a third-party profiler.
          This is required to obtain debugging symbols. You can also use a debug
          build. However, note that the results of profiling a debug build will
          be different from a release build, because debug builds are less
          optimized. Bottlenecks are often in a different place in debug builds,
          so you should profile release builds whenever possible.

.. figure:: img/valgrind.png
   :alt: Screenshot of Callgrind

   Example results from Callgrind, which is part of Valgrind.

From the left, Callgrind is listing the percentage of time within a function and
its children (Inclusive), the percentage of time spent within the function
itself, excluding child functions (Self), the number of times the function is
called, the function name, and the file or module.

In this example, we can see nearly all time is spent under the
``Main::iteration()`` function. This is the master function in the Godot source
code that is called repeatedly. It causes frames to be drawn, physics ticks to
be simulated, and nodes and scripts to be updated. A large proportion of the
time is spent in the functions to render a canvas (66%), because this example
uses a 2D benchmark. Below this, we see that almost 50% of the time is spent
outside Godot code in ``libglapi`` and ``i965_dri`` (the graphics driver).
This tells us that a large proportion of CPU time is being spent in the
graphics driver.

This is actually an excellent example because, in an ideal world, only a very
small proportion of time would be spent in the graphics driver. This is an
indication that there is a problem with too much communication and work being
done in the graphics API. This specific profiling led to the development of 2D
batching, which greatly speeds up 2D rendering by reducing bottlenecks in this
area.

Manually timing functions
=========================

Another handy technique, especially once you have identified the bottleneck
using a profiler, is to manually time the function or area under test.
The specifics vary depending on the language, but in GDScript, you would do
the following:

::

    var time_start = OS.get_ticks_usec()

    # The function you want to time
    update_enemies()

    var time_end = OS.get_ticks_usec()
    # The parentheses matter: "%" binds more tightly than "-".
    print("update_enemies() took %d microseconds" % (time_end - time_start))

When manually timing functions, it is usually a good idea to run the function
many times (1,000 or more times), instead of just once (unless it is a very slow
function). The reason for doing this is that timers often have limited accuracy.
Moreover, processes are scheduled on the CPU in a haphazard manner. Therefore,
an average over a series of runs is more accurate than a single measurement.
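
As a minimal sketch of averaging over many runs, assuming the same
``update_enemies()`` function as above: the ``ITERATIONS`` constant and the
empty placeholder body are illustrative names, not part of any Godot API.

::

    extends Node

    # Hypothetical repeat count; raise it for very fast functions.
    const ITERATIONS = 1000

    func _ready():
        var time_start = OS.get_ticks_usec()
        for _i in range(ITERATIONS):
            update_enemies()
        var total_usec = OS.get_ticks_usec() - time_start
        # Convert to float so the division keeps its fractional part.
        var average_usec = float(total_usec) / ITERATIONS
        print("update_enemies() took %.2f microseconds on average" % average_usec)

    func update_enemies():
        pass  # Placeholder for the code under test.

The measured time includes the overhead of the loop itself, so this is best
used to compare two versions of a function rather than as an absolute figure.
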

As you attempt to optimize functions, be sure to either repeatedly profile or
time them as you go. This will give you crucial feedback as to whether the
optimization is working (or not).

Caches
======

CPU caches are something else to be particularly aware of, especially when
comparing timing results of two different versions of a function. The results
can be highly dependent on whether the data is in the CPU cache or not. CPUs
don't load data directly from the system RAM, even though it's huge in
comparison to the CPU cache (several gigabytes instead of a few megabytes). This
is because system RAM is very slow to access. Instead, CPUs load data from a
smaller, faster bank of memory called cache. Loading data from cache is very
fast, but every time you try to load a memory address that is not stored in
cache, the cache must make a trip to main memory and slowly load in some data.
This delay can result in the CPU sitting around idle for a long time, and is
referred to as a "cache miss".

This means that the first time you run a function, it may run slowly because the
data is not in the CPU cache. The second and later times, it may run much faster
because the data is in the cache. Due to this, always use averages when timing,
and be aware of the effects of cache.

Understanding caching is also crucial to CPU optimization. If you have an
algorithm (routine) that loads small bits of data from randomly spread out areas
of main memory, this can result in a lot of cache misses. A lot of the time, the
CPU will be waiting around for data instead of doing any work. Instead, if you
can make your data accesses localized, or even better, access memory in a linear
fashion (like a continuous list), then the cache will work optimally and the CPU
will be able to work as fast as possible.
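
The following is a rough sketch for observing the difference between linear
and scattered access yourself, assuming Godot 3.x. The ``COUNT`` constant and
the ``_sum_usec()`` helper are hypothetical names chosen for illustration. In
GDScript, interpreter overhead may hide much of the effect, so treat the
numbers as indicative only; the gap is usually far clearer in C++ or C#.

::

    extends Node

    # Hypothetical element count, chosen to be larger than typical CPU caches.
    const COUNT = 1000000

    func _ready():
        var data = PoolRealArray()
        data.resize(COUNT)
        for i in range(COUNT):
            data[i] = 1.0

        var linear = range(COUNT)     # 0, 1, 2, ...: visits memory in order.
        var scattered = range(COUNT)
        scattered.shuffle()           # Same indices, visited in random order.

        print("linear:    %d usec" % _sum_usec(data, linear))
        print("scattered: %d usec" % _sum_usec(data, scattered))

    # Sums data[i] for every index in `indices` and returns the elapsed time.
    func _sum_usec(data, indices):
        var total = 0.0
        var time_start = OS.get_ticks_usec()
        for i in indices:
            total += data[i]
        return OS.get_ticks_usec() - time_start
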

Godot usually takes care of such low-level details for you. For example, the
Server APIs make sure data is optimized for caching already for things like
rendering and physics. Still, you should be especially aware of caching when
using :ref:`GDNative <toc-tutorials-gdnative>`.

Languages
=========

Godot supports a number of different languages, and it is worth bearing in mind
that there are trade-offs involved. Some languages are designed for ease of use
at the cost of speed, and others are faster but more difficult to work with.

Built-in engine functions run at the same speed regardless of the scripting
language you choose. If your project is making a lot of calculations in its own
code, consider moving those calculations to a faster language.

GDScript
~~~~~~~~

:ref:`GDScript <toc-learn-scripting-gdscript>` is designed to be easy to use and iterate,
and is ideal for making many types of games. However, in this language, ease of
use is considered more important than performance. If you need to make heavy
calculations, consider moving some of your project to one of the other
languages.

C#
~~

:ref:`C# <toc-learn-scripting-C#>` is popular and has first-class support in Godot. It
offers a good compromise between speed and ease of use. Beware of possible
garbage collection pauses and leaks that can occur during gameplay, though. A
common approach to work around issues with garbage collection is to use *object
pooling*, which is outside the scope of this guide.

Other languages
~~~~~~~~~~~~~~~

Third parties provide support for several other languages, including `Rust
<https://github.com/godot-rust/godot-rust>`_ and `JavaScript
<https://github.com/GodotExplorer/ECMAScript>`_.

C++
~~~

Godot is written in C++. Using C++ will usually result in the fastest code.
However, on a practical level, it is the most difficult to deploy to end users'
machines on different platforms. Options for using C++ include
:ref:`GDNative <toc-tutorials-gdnative>` and
:ref:`custom modules <doc_custom_modules_in_c++>`.

Threads
=======

Consider using threads when making a lot of calculations that can run in
parallel to each other. Modern CPUs have multiple cores, each one capable of
doing a limited amount of work. By spreading work over multiple threads, you can
move further towards peak CPU efficiency.

The disadvantage of threads is that you have to be incredibly careful. As each
CPU core operates independently, they can end up trying to access the same
memory at the same time. One thread can be reading from a variable while another
is writing to it: this is called a *race condition*. Before you use threads,
make sure you understand the dangers and how to prevent these race conditions.

Threads can also make debugging considerably more difficult. The GDScript
debugger doesn't support setting up breakpoints in threads yet.

For more information on threads, see :ref:`doc_using_multiple_threads`.
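
As an illustrative sketch only, assuming the Godot 3.x ``Thread`` and ``Mutex``
classes, the example below splits a summation across two threads. The
``_sum_range()`` method and the range bounds are hypothetical. Each thread
works on its own local variable and only takes the mutex for the brief moment
it touches the shared ``total``, which is one common way to avoid a race
condition.

::

    extends Node

    var mutex = Mutex.new()
    var total = 0  # Shared between threads, so only touched under the mutex.

    func _ready():
        var workers = []
        for start in [0, 50000]:
            var worker = Thread.new()
            # Godot 3.x signature: start(instance, method, userdata).
            worker.start(self, "_sum_range", [start, start + 50000])
            workers.append(worker)

        # Always wait for threads to finish before using their results.
        for w in workers:
            w.wait_to_finish()
        print("Sum: %d" % total)

    # Runs on a worker thread. `bounds` is [first, one_past_last].
    func _sum_range(bounds):
        var partial = 0
        for i in range(bounds[0], bounds[1]):
            partial += i
        mutex.lock()
        total += partial
        mutex.unlock()
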

SceneTree
=========

Although Nodes are an incredibly powerful and versatile concept, be aware that
every node has a cost. Built-in functions such as ``_process()`` and
``_physics_process()`` propagate through the tree. This housekeeping can reduce
performance when you have very large numbers of nodes (usually in the thousands).

Each node is handled individually in the Godot renderer. Therefore, a smaller
number of nodes with more in each can lead to better performance.

One quirk of the :ref:`SceneTree <class_SceneTree>` is that you can sometimes
get much better performance by removing nodes from the SceneTree, rather than by
pausing or hiding them. You don't have to delete a detached node. You can, for
example, keep a reference to a node, detach it from the scene tree using
:ref:`Node.remove_child(node) <class_Node_method_remove_child>`, then reattach
it later using :ref:`Node.add_child(node) <class_Node_method_add_child>`.
This can be very useful for adding and removing areas from a game, for example.
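
A minimal sketch of this detach and reattach pattern might look like the
following. The ``ForestArea`` child node and the function names are
hypothetical and only serve the illustration.

::

    extends Node

    # Hypothetical child node containing one area of the level.
    onready var area = $ForestArea

    func unload_area():
        if area.get_parent() == self:
            # Detached nodes stop processing, but stay alive through `area`.
            remove_child(area)

    func load_area():
        if area.get_parent() == null:
            add_child(area)

    func discard_area():
        # A detached node is not freed along with the tree, so free it
        # explicitly once it is no longer needed.
        if area.get_parent() == null:
            area.free()

Keep in mind that a detached node still occupies memory until it is freed.
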

You can avoid the SceneTree altogether by using Server APIs. For more
information, see :ref:`doc_using_servers`.

Physics
=======

In some situations, physics can end up becoming a bottleneck. This is
particularly the case with complex worlds and large numbers of physics objects.

Here are some techniques to speed up physics:

- Try using simplified versions of your rendered geometry for collision shapes.
  Often, this won't be noticeable for end users, but can greatly increase
  performance.
- Try removing objects from physics when they are out of view / outside the
  current area, or reusing physics objects (maybe you allow 8 monsters per area,
  for example, and reuse these).

Another crucial aspect to physics is the physics tick rate. In some games, you
can greatly reduce the tick rate, and instead of, for example, updating physics
60 times per second, you may update them only 30 or even 20 times per second.
This can greatly reduce the CPU load.
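
As a short sketch, assuming Godot 3.x, the tick rate can be changed at runtime
through the ``Engine`` singleton. The value of 30 is only an example. The same
setting is also exposed in the Project Settings as
``physics/common/physics_fps``.

::

    extends Node

    func _ready():
        # Godot 3.x property; the default is 60 physics ticks per second.
        Engine.iterations_per_second = 30
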

The downside of changing physics tick rate is you can get jerky movement or
jitter when the physics update rate does not match the frames per second
rendered. Also, decreasing the physics tick rate will increase input lag.
It's recommended to stick to the default physics tick rate (60 Hz) in most games
that feature real-time player movement.

The solution to jitter is to use *fixed timestep interpolation*, which involves
smoothing the rendered positions and rotations over multiple frames to match the
physics. You can either implement this yourself or use a
`third-party addon <https://github.com/lawnjelly/smoothing-addon>`__.
Performance-wise, interpolation is a very cheap operation compared to running a
physics tick. It's orders of magnitude faster, so this can be a significant
performance win while also reducing jitter.
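
If you implement interpolation yourself, one possible approach in 2D is
sketched below. It assumes Godot 3.2 or later (for
``Engine.get_physics_interpolation_fraction()``) and a scene layout where this
script sits on a ``Node2D`` holding the visuals, parented to the physics body.
That layout and the variable names are assumptions made for the example, not
how the linked addon is necessarily implemented.

::

    extends Node2D

    # Transforms of the parent physics body at the last two physics ticks.
    var _prev = Transform2D()
    var _curr = Transform2D()

    func _ready():
        # Stop inheriting the parent's transform; we set our own every frame.
        set_as_toplevel(true)
        _prev = get_parent().global_transform
        _curr = _prev

    func _physics_process(_delta):
        # Children are processed after their parent, so the body has
        # already moved for this tick.
        _prev = _curr
        _curr = get_parent().global_transform

    func _process(_delta):
        # Fraction (0 to 1) of the way from the last physics tick to the next.
        var fraction = Engine.get_physics_interpolation_fraction()
        global_transform = _prev.interpolate_with(_curr, fraction)

The visuals trail the physics state by up to one tick, which is the usual
trade-off of this technique.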