.. _doc_gpu_optimization:

GPU optimization
================

Introduction
------------

The demand for new graphics features and progress almost guarantees that you
will encounter graphics bottlenecks. Some of these can be on the CPU side, for
instance in calculations inside the Godot engine to prepare objects for
rendering. Bottlenecks can also occur on the CPU in the graphics driver, which
sorts instructions to pass to the GPU, and in the transfer of these
instructions. Finally, bottlenecks can also occur on the GPU itself.

Where bottlenecks occur in rendering is highly hardware-specific.
Mobile GPUs in particular may struggle with scenes that run easily on desktop.

Understanding and investigating GPU bottlenecks is slightly different from the
situation on the CPU. This is because you can often only change performance
indirectly, by changing the instructions you give to the GPU. It may also be
more difficult to take measurements. In many cases, the only way of measuring
performance is by examining changes in the time spent rendering each frame.

Draw calls, state changes, and APIs
-----------------------------------

.. note:: The following section is not relevant to end-users, but provides
          background information that is relevant to later sections.

Godot sends instructions to the GPU via a graphics API (Vulkan, OpenGL, OpenGL
ES or WebGL). The communication and driver activity involved can be quite
costly, especially in OpenGL, OpenGL ES and WebGL. If we can provide these
instructions in a way that is preferred by the driver and GPU, we can greatly
increase performance.

Nearly every API command in OpenGL requires a certain amount of validation to
make sure the GPU is in the correct state. Even seemingly simple commands can
lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to
reduce these instructions to a bare minimum and group together similar objects
as much as possible so they can be rendered together, or with the minimum number
of these expensive state changes.

2D batching
^^^^^^^^^^^

In 2D, the costs of treating each item individually can be prohibitively high -
there can easily be thousands of them on the screen. This is why 2D *batching*
is used with OpenGL-based rendering methods. Multiple similar items are grouped
together and rendered in a batch, via a single draw call, rather than making a
separate draw call for each item. In addition, this means state changes,
material and texture changes can be kept to a minimum.

Vulkan-based rendering methods do not use 2D batching yet. Since draw calls are
much cheaper with Vulkan compared to OpenGL, there is less of a need to have 2D
batching (although it can still be beneficial in some cases).
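
The idea behind batching can be illustrated with a minimal Python sketch. It
is not Godot's implementation: the item tuples and the ``count_draw_calls``
helper are hypothetical, and the sketch assumes that only items which are
consecutive in draw order and share the same texture and material can be
merged into one draw call.

.. code-block:: python

    from itertools import groupby

    # Hypothetical draw items in draw order: (name, texture, material).
    # These tuples are for illustration only and are not a Godot API.
    items = [
        ("coin_1", "atlas.png", "default"),
        ("coin_2", "atlas.png", "default"),
        ("enemy", "enemy.png", "default"),
        ("coin_3", "atlas.png", "default"),
    ]

    def count_draw_calls(draw_list):
        """Count batches, merging consecutive items with equal texture and material."""
        return sum(1 for _ in groupby(draw_list, key=lambda item: (item[1], item[2])))

    unbatched_calls = len(items)             # one draw call per item
    batched_calls = count_draw_calls(items)  # coin_1 + coin_2, enemy, coin_3

Under these assumptions, the ``enemy`` item splits the coins into two batches
because it uses a different texture. This is one reason why sharing textures,
for example through an atlas, can help.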

3D batching
^^^^^^^^^^^

In 3D, we still aim to minimize draw calls and state changes. However, it can be
more difficult to batch together several objects into a single draw call. 3D
meshes tend to comprise hundreds or thousands of triangles, and combining large
meshes in real-time is prohibitively expensive. The cost of joining them quickly
exceeds any benefit as the number of triangles per mesh grows. A much better
alternative is to **join meshes ahead of time** (for meshes that are static in
relation to each other). This can be done by artists, or programmatically within
Godot using an add-on.

There is also a cost to batching together objects in 3D. Several objects
rendered as one cannot be individually culled. An entire city that is off-screen
will still be rendered if it is joined to a single blade of grass that is on
screen. Thus, you should always take objects' location and culling into account
when attempting to batch 3D objects together. Despite this, the benefits of
joining static objects often outweigh other considerations, especially for large
numbers of distant or low-poly objects.

For more information on 3D specific optimizations, see
:ref:`doc_optimizing_3d_performance`.

Reuse shaders and materials
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Godot renderer differs from many other renderers in that it is designed to
minimize GPU state changes as much as possible. :ref:`StandardMaterial3D
<class_StandardMaterial3D>` does a good job of reusing materials that need similar
shaders. If custom shaders are used, make sure to reuse them as much as
possible. Godot's priorities are:

- **Reusing Materials:** The fewer different materials in the
  scene, the faster the rendering will be. If a scene has a huge number
  of objects (in the hundreds or thousands), try reusing the materials.
  In the worst case, use atlases to decrease the number of texture changes.
- **Reusing Shaders:** If materials can't be reused, at least try to reuse
  shaders. Note: shaders are automatically reused between
  StandardMaterial3Ds that share the same configuration (features
  that are enabled or disabled with a check box) even if they have different
  parameters.

If a scene has, for example, 20,000 objects that each use a different
material, rendering will be slow. If the same scene has 20,000
objects, but only uses 100 materials, rendering will be much faster.
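
To see why the number of distinct materials matters, the following Python
sketch counts how often the bound material would change while drawing a list
of objects in order. It is a simplified model and not Godot's actual sorting
logic: ``count_material_changes`` and the material names are hypothetical,
and each change stands in for one of the expensive state changes described
earlier.

.. code-block:: python

    def count_material_changes(material_ids):
        """Count how often the bound material changes while drawing in order."""
        changes = 0
        current = None
        for material in material_ids:
            if material != current:
                changes += 1
                current = material
        return changes

    # 20,000 objects sharing 100 materials, interleaved in scene order.
    scene = [f"material_{i % 100}" for i in range(20_000)]

    changes_unsorted = count_material_changes(scene)        # changes on every object
    changes_sorted = count_material_changes(sorted(scene))  # one change per material

With only 100 materials, grouping objects by material brings the number of
changes down to 100. If every object had its own material, no ordering could
go below 20,000.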

Pixel cost versus vertex cost
-----------------------------

You may have heard that the lower the number of polygons in a model, the faster
it will be rendered. This is *really* relative and depends on many factors.

On modern PCs and consoles, vertex cost is low. GPUs originally only rendered
triangles. This meant that every frame:

1. All vertices had to be transformed by the CPU (including clipping).
2. All vertices had to be sent to the GPU memory from the main RAM.

Nowadays, all this is handled inside the GPU, greatly increasing performance. 3D
artists often have a mistaken intuition about polycount performance because 3D
modeling software (such as Blender, 3ds Max, etc.) needs to keep geometry in CPU
memory for it to be edited, reducing actual performance. Game engines rely on
the GPU more, so they can render many triangles much more efficiently.

On mobile devices, the story is different. PC and console GPUs are
brute-force monsters that can pull as much electricity as they need from
the power grid. Mobile GPUs are limited to a tiny battery, so they need
to be a lot more power efficient.

To be more efficient, mobile GPUs attempt to avoid *overdraw*. Overdraw occurs
when the same pixel on the screen is being rendered more than once. Imagine a
town with several buildings. GPUs don't know what is visible and what is hidden
until they draw it. For example, a house might be drawn and then another house
in front of it (which means rendering happened twice for the same pixel). PC
GPUs normally don't care much about this and just throw more pixel processors at
the hardware to increase performance (which also increases power consumption).

Using more power is not an option on mobile, so mobile devices use a technique
called *tile-based rendering* which divides the screen into a grid. Each cell
keeps the list of triangles drawn to it and sorts them by depth to minimize
*overdraw*. This technique improves performance and reduces power consumption,
but takes a toll on vertex performance. As a result, fewer vertices and
triangles can be processed for drawing.

Additionally, tile-based rendering struggles when there are small objects with a
lot of geometry within a small portion of the screen. This forces mobile GPUs to
put a lot of strain on a single screen tile, which considerably decreases
performance as all the other cells must wait for it to complete before
displaying the frame.
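
The per-tile triangle lists described above can be sketched in Python as
follows. This is a rough model, not how any particular GPU works: the 16-pixel
tile size, the ``bin_triangles`` helper and the bounding-box test (which can
assign a triangle to a tile it only nearly touches) are all assumptions made
for illustration.

.. code-block:: python

    from collections import Counter

    TILE_SIZE = 16  # assumed tile size in pixels; real hardware varies

    def bin_triangles(triangles, tile_size=TILE_SIZE):
        """Count how many triangles overlap each tile, using bounding boxes.

        Each triangle is a tuple of three (x, y) screen-space points.
        """
        load = Counter()
        for tri in triangles:
            xs = [point[0] for point in tri]
            ys = [point[1] for point in tri]
            first_tx, last_tx = int(min(xs)) // tile_size, int(max(xs)) // tile_size
            first_ty, last_ty = int(min(ys)) // tile_size, int(max(ys)) // tile_size
            for ty in range(first_ty, last_ty + 1):
                for tx in range(first_tx, last_tx + 1):
                    load[(tx, ty)] += 1
        return load

    # One large triangle spread over many tiles, plus 500 tiny triangles
    # packed into the tile at (0, 0).
    large = [((0, 0), (63, 0), (0, 63))]
    dense = [((1, 1), (3, 1), (1, 3))] * 500

    load = bin_triangles(large + dense)
    busiest_tile, busiest_count = load.most_common(1)[0]

In this example, the tile at ``(0, 0)`` ends up with far more triangles than
any other tile, which is the kind of imbalance the paragraph above warns
about.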

To summarize, don't worry too much about the overall vertex count on mobile, but
**avoid concentration of vertices in small parts of the screen**.
If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use
a lower level of detail (LOD) model. Even on desktop GPUs, it's preferable to
avoid having triangles smaller than the size of a pixel on screen.
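
A manual level-of-detail choice based on distance could look like the Python
sketch below. The distance thresholds and the ``select_lod`` helper are
hypothetical values chosen for illustration. Suitable thresholds depend on
the model's size and the target hardware.

.. code-block:: python

    import bisect

    # Hypothetical distances (in meters) at which to switch to a simpler mesh.
    LOD_DISTANCES = [10.0, 30.0, 80.0]

    def select_lod(distance, thresholds=LOD_DISTANCES):
        """Return 0 for the most detailed mesh, higher numbers for simpler ones."""
        return bisect.bisect_right(thresholds, distance)

    near_lod = select_lod(5.0)    # closer than every threshold
    far_lod = select_lod(200.0)   # beyond every threshold

Here a distance below 10 meters selects the most detailed mesh (index 0), and
anything at or beyond 80 meters selects the simplest one (index 3).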

Pay attention to the additional vertex processing required when using:

- Skinning (skeletal animation)
- Morphs (shape keys)

.. Not implemented in Godot 4.x yet. Uncomment when this is implemented.
   - Vertex-lit objects (common on mobile)

Pixel/fragment shaders and fill rate
------------------------------------

In contrast to vertex processing, the costs of fragment (per-pixel) shading have
increased dramatically over the years. Screen resolutions have increased: the
area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA
screen. That is 27 times the area! Also, the complexity of fragment shaders has
exploded. Physically-based rendering requires complex calculations for each
fragment.
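
The figures above can be checked with a few lines of Python. The sketch
assumes that "4K" means 3840×2160, which matches the pixel count quoted
above. The 60 frames per second rate is an illustrative assumption, and the
last line ignores overdraw, so it is a lower bound on shading work.

.. code-block:: python

    uhd_pixels = 3840 * 2160   # assumed 4K UHD resolution
    vga_pixels = 640 * 480

    area_ratio = uhd_pixels / vga_pixels

    # Fragments shaded per second at an assumed 60 FPS, with every pixel
    # shaded exactly once per frame (no overdraw).
    fragments_per_second = uhd_pixels * 60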

You can test whether a project is fill rate-limited quite easily. Turn off
V-Sync to prevent capping the frames per second, then compare the frames per
second when running with a large window, to running with a very small window.
You may also benefit from similarly reducing your shadow map size if using
shadows. Usually, you will find the FPS increases quite a bit using a small
window, which indicates you are to some extent fill rate-limited. On the other
hand, if there is little to no increase in FPS, then your bottleneck lies
elsewhere.
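
The comparison described above can be written down as a small Python helper.
This is only a sketch: the function names are hypothetical, the FPS values
are expected to be measured by hand with V-Sync disabled, and the 1.25
threshold is an arbitrary assumption rather than a recommended value.

.. code-block:: python

    def fill_rate_speedup(fps_large_window, fps_small_window):
        """Return how many times faster the project runs in the small window."""
        return fps_small_window / fps_large_window

    def looks_fill_rate_limited(fps_large_window, fps_small_window, threshold=1.25):
        """Guess whether fill rate is a bottleneck from two FPS measurements.

        The threshold is an arbitrary assumption; tune it to your own project.
        """
        return fill_rate_speedup(fps_large_window, fps_small_window) >= threshold

    likely_limited = looks_fill_rate_limited(60.0, 140.0)   # large FPS gain
    likely_elsewhere = looks_fill_rate_limited(60.0, 62.0)  # almost no gain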

You can increase performance in a fill rate-limited project by reducing the
amount of work the GPU has to do. You can do this by simplifying the shader
(perhaps turn off expensive options if you are using a :ref:`StandardMaterial3D
<class_StandardMaterial3D>`), or reducing the number and size of textures used.
Also, when using particles that are not unshaded, consider forcing vertex
shading in their material to decrease the shading cost.

.. seealso::

    On supported hardware, :ref:`doc_variable_rate_shading` can be used to
    reduce shading processing costs without impacting the sharpness of edges on
    the final image.

**When targeting mobile devices, consider using the simplest possible shaders
you can reasonably afford to use.**

Reading textures
^^^^^^^^^^^^^^^^

The other factor in fragment shaders is the cost of reading textures. Reading
textures is an expensive operation, especially when reading from several
textures in a single fragment shader. Also, consider that filtering may slow it
down further (trilinear filtering between mipmaps, and averaging). Reading
textures is also expensive in terms of power usage, which is a big issue on
mobile devices.

**If you use third-party shaders or write your own shaders, try to use
algorithms that require as few texture reads as possible.**

Texture compression
^^^^^^^^^^^^^^^^^^^

By default, Godot compresses textures of 3D models when imported using video RAM
(VRAM) compression. Video RAM compression isn't as efficient in size as PNG or
JPG when stored, but increases performance enormously when drawing large enough
textures.

This is because the main goal of texture compression is bandwidth reduction
between memory and the GPU.
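
To get a feel for the bandwidth savings, the Python sketch below compares the
size of an uncompressed texture with block-compressed equivalents. The bit
rates are assumptions for illustration: 32 bits per pixel for uncompressed
RGBA8, and 4 or 8 bits per pixel, which are the rates of common
block-compressed formats such as BC1 and BC3. The text above does not state
which formats Godot selects, and mipmaps are ignored.

.. code-block:: python

    def texture_bytes(width, height, bits_per_pixel):
        """Return the size in bytes of one mip level at the given bit rate."""
        return width * height * bits_per_pixel // 8

    MIB = 1024 * 1024

    uncompressed = texture_bytes(2048, 2048, 32)  # RGBA8, 8 bits per channel
    compressed_4bpp = texture_bytes(2048, 2048, 4)  # e.g. BC1 (assumed format)
    compressed_8bpp = texture_bytes(2048, 2048, 8)  # e.g. BC3 (assumed format)

    savings_4bpp = uncompressed // compressed_4bpp
    savings_8bpp = uncompressed // compressed_8bpp

Under these assumptions, a 2048×2048 texture shrinks from 16 MiB to 2 MiB or
4 MiB in video memory, so the GPU has to fetch far less data when sampling
it.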

In 3D, the shapes of objects depend more on the geometry than the texture, so
compression is generally not noticeable. In 2D, the perceived shapes depend more
on the contents of the textures themselves, so compression artifacts are more
noticeable.

As a warning, most Android devices do not support texture compression of
textures with transparency (only opaque), so keep this in mind.

.. note::

   Even in 3D, "pixel art" textures should have VRAM compression disabled as it
   will negatively affect their appearance, without improving performance
   significantly due to their low resolution.

Post-processing and shadows
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Post-processing effects and shadows can also be expensive in terms of fragment
shading activity. Always test the impact of these on different hardware.

**Reducing the size of shadow maps can increase performance**, both in terms of
writing and reading the shadow maps. On top of that, the best way to improve
performance of shadows is to turn shadows off for as many lights and objects as
possible. Smaller or distant OmniLights/SpotLights can often have their shadows
disabled with only a small visual impact.

Transparency and blending
-------------------------

Transparent objects present particular problems for rendering efficiency. Opaque
objects (especially in 3D) can be essentially rendered in any order and the
Z-buffer will ensure that only the frontmost objects get shaded. Transparent or
blended objects are different. In most cases, they cannot rely on the Z-buffer
and must be rendered in "painter's order" (i.e. from back to front) to look
correct.
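
Back-to-front ordering can be sketched in Python as a sort by distance from
the camera. The object dictionaries and the ``painter_order`` helper are
hypothetical and only illustrate the ordering. A real renderer has to handle
more cases, such as overlapping or intersecting objects.

.. code-block:: python

    def painter_order(objects, camera_position):
        """Return objects sorted from farthest to nearest to the camera."""
        def squared_distance(obj):
            return sum((a - b) ** 2 for a, b in zip(obj["position"], camera_position))
        return sorted(objects, key=squared_distance, reverse=True)

    # Hypothetical transparent objects with world-space positions.
    transparent_objects = [
        {"name": "window", "position": (0.0, 0.0, -2.0)},
        {"name": "smoke", "position": (0.0, 0.0, -10.0)},
        {"name": "glass", "position": (0.0, 0.0, -5.0)},
    ]

    draw_order = [
        obj["name"]
        for obj in painter_order(transparent_objects, (0.0, 0.0, 0.0))
    ]

The farthest object (``smoke``) is drawn first and the nearest (``window``)
last, so each nearer surface blends over what is already behind it.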

Transparent objects are also particularly bad for fill rate, because every item
has to be drawn even if other transparent objects will be drawn on top
later on.

Opaque objects don't have to do this. They can usually take advantage of the
Z-buffer by first writing only to the Z-buffer, then running the
fragment shader only on the "winning" fragment, which belongs to the object that
is at the front at a particular pixel.

Transparency is particularly expensive where multiple transparent objects
overlap. It is usually better to use transparent areas as small as possible to
minimize these fill rate requirements, especially on mobile, where fill rate is
very expensive. Indeed, in many situations, rendering more complex opaque
geometry can end up being faster than using transparency to "cheat".

Multi-platform advice
---------------------

If you are aiming to release on multiple platforms, test *early* and test
*often* on all your platforms, especially mobile. Developing a game on desktop
but attempting to port it to mobile at the last minute is a recipe for disaster.

In general, you should design your game for the lowest common denominator, then
add optional enhancements for more powerful platforms. For example, if you
target both desktop and mobile platforms, you may want to use the Compatibility
rendering method on both.

Mobile/tiled renderers
----------------------

As described above, GPUs on mobile devices work in dramatically different ways
from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers
split up the screen into regular-sized tiles that fit into super fast cache
memory, which reduces the number of read/write operations to the main memory.

There are some downsides though. Tiled rendering can make certain techniques
much more complicated and expensive to perform. Tiles that rely on the results
of rendering in different tiles or on the results of earlier operations being
preserved can be very slow. Be very careful to test the performance of shaders,
viewport textures and post-processing.