  1. <?xml version="1.0" encoding="iso-8859-1"?>
  2. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  3. "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  4. <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  5. <head>
  6. <link title="Purple" rel="stylesheet" href="manual-purple.css" type="text/css" />
  7. <title>V9938 VRAM timings</title>
  8. </head>
  9. <body>
  10. <h1>V9938 VRAM timings</h1>
  11. Measurements done by: Joost Yervante Damad, Alex Wulms, Wouter Vermaelen<br/>
  12. Analysis done by: Wouter Vermaelen<br/>
  13. Text written by: Wouter Vermaelen<br/>
  14. with help from the rest of the openMSX team.
  15. <h2>Introduction</h2>
  16. <p>This text describes in detail how, when and why the V9938 reads from and
  17. writes to VRAM in bitmap screen modes (screen 5, 6, 7 and 8). VRAM is accessed
  18. for bitmap and sprite rendering but also for VDP command execution or by
  19. CPU VRAM read/write requests.</p>
  20. <h5 id="motivation">motivation</h5>
  21. <p>Modern MSX emulators like blueMSX and openMSX are already fairly accurate.
  22. And for most practical applications like games or even demos they are already
  23. <i>good enough</i>. Though there are cases where you can still clearly see the
  24. difference between a real and an emulated MSX machine.</p>
  25. <p>For example the following pictures show the speed of the LINE command for
  26. different slopes of the line. The first two pictures are generated on different
  27. MSX emulators, the last picture is from a real MSX. Without going into all the
  28. details: lines are drawn from the center of the image to each point at the
  29. border. While the LINE commands are executing, the command color register is
  30. rapidly changed (at a fixed rate). So faster varying colors indicate a slower
  31. executing command.</p>
  32. <img src="line-speed-old-8.png" width="300" />
  33. <img src="line-speed-emu-8.png" width="300" />
  34. <img src="line-speed-real-8.png" width="300" />
  35. <p>From left to right these pictures show:</p>
  36. <ul>
  37. <li>(left) The output of MSX emulators that use Alex Wulms' original command
  38. engine emulation core. All(?) modern MSX emulators use this core, including
  39. blueMSX, OCM and (older versions of) openMSX. The output consists of squares, which
  40. indicates that the speed of a LINE command doesn't depend on the slope of the
  41. line.</li>
  42. <li>(center) The output of openMSX version 0.9.1. Here the command engine was
  43. tweaked to take the slope of the line into account, so the test now generates
  44. clean octagons.</li>
  45. <li>(right) The output of a real MSX. The overall shape is also an octagon,
  46. but there are also a lot of irregularities. These irregularities can be
  47. reproduced when running the test multiple times. So it must be a <i>real</i>
  48. effect, and not some kind of measurement noise.</li>
  49. </ul>
  50. <p>This test is derived from NYRIKKI's test program described in this (long) <a
  51. href="http://www.msx.org/forum/msx-talk/software-and-gaming/line">MRC forum
  52. thread</a>. This particular test is not that important. But because it
  53. generates a nice graphical output, it allows us to show the problem without going
  54. into too many technical details (yet).</p>
  55. <p>In most MSX applications these LINE speed differences, or small command
  56. speed differences in general, likely won't cause any problems. (Except of
  57. course in programs like this that specifically test for it.) But it would still
  58. be nice to improve the emulators.</p>
  59. <h5>measurements</h5>
  60. <p>To be able to improve openMSX further we need to have a good understanding
  61. of what it is exactly that causes these irregularities. It would be very hard
  62. to figure out this stuff only by using MSX test programs. It might be easier to
  63. look at the deeper hardware level. More specifically at the communication
  64. between the VDP (V9938) and the VRAM chips. This should allow us to see when
  65. exactly the VDP reads or writes which VRAM addresses.</p>
  66. <p>So at the 2013 MSX fair in Nijmegen we (some members of the openMSX team and
  67. I) connected a logic analyzer to the VDP-VRAM bus in a Philips NMS8250 machine.
  68. The following picture gives an impression of our measurement setup.</p>
  69. <img src="v9938-probes.jpg" />
  70. <p>Next we ran some MSX software that puts the VDP in a certain display mode.
  71. It enables/disables screen and/or sprite rendering. And it optionally executes
  72. VDP commands and/or accesses VRAM via the CPU. And while this test was running
  73. we could capture (small chunks of) the communication between the VDP and the
  74. VRAM. This gives us output (waveforms) like in the following image.</p>
  75. <img src="gtkwave.png" />
  76. <p>It's not so easy to go from this waveform data to meaningful results about
  77. how the VDP operates. This text also won't talk about this analysis process. If
  78. you're interested in the analysis or in the raw measurement data, you can find
  79. some more details in the <a
  80. href="https://sourceforge.net/mailarchive/message.php?msg_id=30375119">
  81. openmsx-devel mailing list archive</a>. The rest of this text will only discuss
  82. the final results of the analysis.</p>
  83. <p>Because one of the primary goals was to improve the command engine emulation
  84. in openMSX, the measurements mostly focused on the bitmap screen modes (a V9938
  85. doesn't allow commands in non-bitmap modes). So the following sections will
  86. only occasionally mention text or character modes. Because we used a V9938 we
  87. also couldn't test the YJK modes (screen 11 and 12). But it's highly likely
  88. that, from a VRAM access point of view, these modes behave the same as screen 8
  89. (or as we'll see later, the same as all the bitmap screen modes).</p>
  90. <h2>VRAM accesses</h2>
  91. <p>Before presenting the actual results of (the analysis of) the measurements,
  92. this section first explains the general workings of the VDP-VRAM communication.
  93. This is mostly a description of the functional interface of DRAM chips, but
  94. then specifically applied to the VDP case. Feel free to skim (or even skip)
  95. this section.</p>
  96. <p>Like most RAM in MSX machines, the video RAM connected to the VDP consists of
  97. DRAM chips. There exist many variations in DRAM chips. You can find a whole lot of
  98. information on <a
  99. href="http://en.wikipedia.org/wiki/Dynamic_random-access_memory">
  100. Wikipedia</a>. Most of the info in this section can also be found in the 'V9938
  101. Technical Data Book'. Often that book goes into a lot more detail than this
  102. text. Here I highlight (and simplify) the aspects that are relevant to
  103. understand the later sections in this text.</p>
  104. <h3>Connection between VDP and VRAM</h3>
  105. <p>Between the VDP and the VRAM chips there is an 8-bit data bus. This means
  106. that a single read or write access will transfer 1 byte of data.</p>
  107. <p>There is also an 8-bit address bus. Obviously 8 bits are not enough to
  108. address the full 128kB or even 192kB VRAM address space. Instead the address is
  109. transferred in two steps. First the row-address is transferred followed by the
  110. column address. (Usually) the row address corresponds to bits 15-8 of the full
  111. address, while the column address corresponds to bits 7-0.</p>
  112. <p>This still only allows addressing up to 64kB, though. To get to 128kB, there
  113. are 2 separate column-address-strobe signals (named CAS0 and CAS1). These two
  114. signals select one of the two available 64kB banks. So combined this
  115. gives 128kB. (Usually) you can interpret CAS0/CAS1 as bit 16 of the
  116. address.</p>
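<p>The address split described above can be sketched in a few lines of Python. This is only an illustration of the "usual" mapping (bit 16 selects the bank, bits 15-8 form the row, bits 7-0 the column); the function name <i>split_address</i> is made up for this example.</p>

```python
def split_address(addr):
    """Split a 17-bit VRAM address into (bank, row, column).

    Assumes the usual mapping: bit 16 -> CAS0/CAS1, bits 15-8 -> row,
    bits 7-0 -> column.
    """
    assert 0 <= addr < 0x20000   # 128kB address space
    bank = (addr >> 16) & 1      # 0 selects CAS0, 1 selects CAS1
    row = (addr >> 8) & 0xFF     # transferred first, together with RAS
    column = addr & 0xFF         # transferred second, together with CAS
    return bank, row, column

print(split_address(0x1A55F))    # (1, 165, 95)
```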
  117. <p>In the case of an MSX machine with 192kB VRAM there is a third signal:
  118. CASX. To simplify the rest of this text, this possibility is ignored. It
  119. doesn't fundamentally change anything anyway.</p>
  120. <p>Next to the data and address bus there are still some control signals. I've
  121. already mentioned the CAS signals (used to select the column address). There's
  122. a similar RAS (row address strobe) signal. And finally there's an R/W
  123. (read-write) signal that indicates whether the access is a read or a write.</p>
  124. <h3>Timing of the VDP-VRAM signals</h3>
  125. <p>When the VDP wants to read or write a byte from/to VRAM it has to
  126. <i>wiggle</i> the signals that connect the VDP to the VRAM in a certain way.
  127. This section describes the timing of those <i>wiggles</i>.</p>
  128. <p>The timing description in this section is different from the description in
  129. the 'V9938 Technical Data Book'. The Data Book has the <i>real</i> timings,
  130. including all the subtle details for how to build an actual working system.
  131. This text has all the timings rounded to integer multiples of VDP clock cycles.
  132. IMHO these simplified timings make the VDP-VRAM connection easier to understand
  133. from a <i>functional</i> point of view.</p>
  134. <h4>A single write</h4>
  135. <p>To write a single byte to VRAM, follow this schema:</p>
  136. <img src="dram-write.png">
  137. <ul>
  138. <li>Put the row address on the address bus and activate the RAS signal. Most
  139. signals are active-low, so activating means make the signal low.</li>
  140. <li>After one cycle (remember these are <i>functional</i> timings, especially
  141. in this step the <i>real</i> timing rules are more complex):
  142. <ul>
  143. <li>Activate (one of) the CAS signals.</li>
  144. <li>Put the column address on the address bus.</li>
  145. <li>Set the R/W signal. A low signal means write.</li>
  146. <li>Put the to-be-written data on the data bus.</li>
  147. </ul>
  148. </li>
  149. <li>After two cycles the CAS signal can be deactivated. At this point the
  150. value of the R/W signal doesn't matter anymore (it may have any value). But
  151. measurements show that the VDP restores the R/W signal to a high value at this
  152. point.</li>
  153. <li>Again one cycle later, the RAS signal can be deactivated.</li>
  154. <li>The RAS signal has to remain inactive for at least two cycles.</li>
  155. </ul>
  156. <p>So a full write cycle takes 6 VDP clock cycles.</p>
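<p>As a rough sketch, the write steps listed above can be written down as a small schedule whose durations add up to the 6 cycles. These are the rounded <i>functional</i> timings from this text, not the real electrical timings, and the names <i>WRITE_STEPS</i> and <i>schedule</i> are made up for this illustration.</p>

```python
# Functional (rounded) timeline of one VRAM write.
# Each entry is (duration in VDP cycles, what happens at the start of the step).
WRITE_STEPS = [
    (1, "activate RAS, row address on address bus"),
    (2, "activate CAS, column address on address bus, R/W low, data on data bus"),
    (1, "deactivate CAS, R/W back to high"),
    (2, "deactivate RAS and keep it inactive"),
]

def schedule(steps, start=0):
    """Return ([(start_cycle, action), ...], end_cycle) for a list of steps."""
    events = []
    cycle = start
    for duration, action in steps:
        events.append((cycle, action))
        cycle += duration
    return events, cycle

events, end = schedule(WRITE_STEPS)
for cycle, action in events:
    print(cycle, action)
print("total cycles:", end)  # total cycles: 6
```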
  157. <h4>A single read</h4>
  158. <p>Reads are very similar to writes, they follow this schema:</p>
  159. <img src="dram-read.png">
  160. <ul>
  161. <li>Put the row address on the address bus and activate the RAS signal.</li>
  162. <li>After one cycle:
  163. <ul>
  164. <li>Activate (one of) the CAS signals.</li>
  165. <li>Put the column address on the address bus.</li>
  166. <li>Set the R/W signal: a high value indicates a read. The VDP keeps this
  167. signal high between VRAM transactions. So in measurements you don't
  168. actually see this signal changing for reads.</li>
  169. </ul>
  170. </li>
  171. <li>After two cycles the read data is available on the data bus. The CAS signal
  172. can be deactivated now.</li>
  173. <li>After one cycle the RAS signal can be deactivated.</li>
  174. <li>Wait at least two cycles before starting the next VRAM transaction.</li>
  175. </ul>
  176. <p>So this is very similar to a write: address selection is identical.
  177. Obviously the R/W signal and the direction (and timing) of the information on
  178. the data bus is different. And just like a write, a full read cycle also takes
  179. 6 VDP cycles.</p>
  180. <h4>Page mode reads (burst read)</h4>
  181. <p>Often the VDP needs to read data from successive VRAM addresses. If those
  182. addresses all have the same row address, then there's a faster way to perform
  183. this compared to doing multiple reads like in the schema above.</p>
  184. <img src="dram-read-burst.png">
  185. <ul>
  186. <li>Put the (common) row address on the address bus and activate the RAS
  187. signal.</li>
  188. <li>After one cycle:
  189. <ul>
  190. <li>Put the first column address on the address bus.</li>
  191. <li>Activate (one of) the CAS signals.</li>
  192. <li>Set the R/W signal (though the VDP already has this signal in the
  193. correct state).</li>
  194. </ul></li>
  195. <li>After two cycles read the data from the data bus, and deactivate CAS.</li>
  196. <li>Two cycles later, put the 2nd column address on the address bus and
  197. re-activate (one of) the CAS signals.</li>
  198. <li>Again two cycles later read the data and deactivate CAS.</li>
  199. <li>It's possible to repeat this process for a 3rd, 4th, &hellip; byte.</li>
  200. <li>After one cycle deactivate the RAS signal.</li>
  201. <li>Wait at least two cycles before starting the next VRAM transaction.</li>
  202. </ul>
  203. <p>The above diagram shows a burst-length of only two bytes. It's also possible
  204. to have longer bursts. The VDP uses lengths up to 4 bytes (or 8, see next
  205. section).</p>
  206. <p>In this example reading two bytes takes 10 VDP cycles. Doing two single
  207. reads would take 2&times;6=12 cycles. When doing longer bursts, the savings
  208. become bigger. Doing a burst of N reads takes 2+4&times;N cycles compared to
  209. 6&times;N cycles for a sequence of single reads.</p>
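<p>The two cost formulas are easy to compare side by side. The following Python sketch just evaluates them for a few burst lengths; the function names are made up for this illustration.</p>

```python
def single_read_cycles(n):
    """Cost of n separate single reads: 6 cycles each."""
    return 6 * n

def burst_read_cycles(n):
    """Cost of one page mode burst of n reads: 2 + 4*n cycles."""
    return 2 + 4 * n

for n in (1, 2, 4):
    saved = single_read_cycles(n) - burst_read_cycles(n)
    print(n, burst_read_cycles(n), single_read_cycles(n), saved)
```

<p>For N=1 both formulas give 6 cycles, so a burst of length one is just a single read. Each extra byte in the burst saves 2 cycles.</p>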
  210. <p>In principle it's also possible to do burst-writes. Though the VDP doesn't
  211. use them (it never needs to write more than 1 byte in a sequence).</p>
  212. <h4>Multi-bank page mode reads</h4>
  213. <p>Burst reads are already faster than single-reads. But to be able to render
  214. screen 7 and 8 images, burst reads are still not fast enough. In these two
  215. screen modes, to be able to read the required data from VRAM fast enough, the
  216. VDP reads from two banks in parallel.</p>
  217. <img src="dram-read-burst-2banks.png">
  218. <p>There are 2 banks of 64kB. These two banks share the RAS control signal, but
  219. they each have their own CAS signal. The address and data signals are also
  220. shared. This allows to read from both banks <i>almost</i> in parallel:</p>
  221. <ul>
  222. <li>In burst mode it was possible to read one byte every 4 VDP cycles. For
  223. this the CAS signal had to be alternately two cycles high and two cycles
  224. low. The address and data buses are only used during 1 of these 4 cycles.</li>
  225. <li>Multi-bank mode uses both the CAS0 and the CAS1 signals. CAS0 is high when
  226. CAS1 is low and vice-versa. When looking at a single bank (which only sees one
  227. of the two CAS signals) this looks like a normal burst read. The only
  228. difference is that the RAS signal stays active, at the start or at the end,
  229. 2 cycles longer than strictly needed. But that's perfectly fine.</li>
  230. </ul>
  231. <p>So this schema gives (almost) double the VRAM-bandwidth. The only
  232. requirement is that you alternately read from bank0 and bank1. At first sight
  233. this requirement seems so strict that it is almost never possible to make use of
  234. this banked reading mode: to render screen 7 or 8 you indeed need to read many
  235. successive VRAM locations, not locations that alternately come from the
  236. 1st and 2nd 64kB bank.</p>
  237. <p>To make it possible to use banked reading mode, the VDP interleaves the two
  238. banks. This introduces the concept of <i>logical</i> and <i>physical</i>
  239. addresses:</p>
  240. <ul>
  241. <li><i>Logical</i> addresses are the addresses that a programmer of the VDP
  242. normally uses. For example the bitmap data for screen 8 (possibly) starts at
  243. address 0x00000 and runs up to (but not including) address 0x0D400.</li>
  244. <li><i>Physical</i> addresses are the addresses that actually appear on the
  245. signals between the VDP and the VRAM. So the combination of the row and column
  246. address and the CAS0 or CAS1 bank-selection.</li>
  247. </ul>
  248. <p>In most screen modes the logical and the physical addresses are the same.
  249. But in screen 7 and 8 there's a transformation between the two:</p>
  250. <p align="center">physical = ((logical &gt;&gt; 1) | (logical &lt;&lt; 16)) &amp; 0x1FFFF</p>
  251. <p>So the 17-bit logical address is rotated one bit to the right to get the
  252. physical address. The effect of this transformation is that all even logical
  253. addresses end up in physical bank0 while all odd logical addresses end up in
  254. physical bank1. So now when you read from successive logical addresses you read
  255. from alternating physical banks and thus it is possible to use banked read
  256. mode.</p>
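<p>A minimal Python sketch of this rotation, assuming a 17-bit (128kB) address space. The result is masked to 17 bits so that it really is a rotation. The function names are made up for this illustration.</p>

```python
ADDR_BITS = 17
ADDR_MASK = (1 << ADDR_BITS) - 1   # 0x1FFFF

def logical_to_physical(logical, interleaved=True):
    """Rotate the 17-bit logical address one bit to the right (screen 7/8).

    With interleaved=False (the other screen modes) the address is unchanged.
    """
    if not interleaved:
        return logical
    return ((logical >> 1) | (logical << (ADDR_BITS - 1))) & ADDR_MASK

def bank(physical):
    """Bank selected by bit 16: 0 means CAS0, 1 means CAS1."""
    return physical >> 16

for logical in range(4):
    physical = logical_to_physical(logical)
    print(hex(logical), hex(physical), bank(physical))
```

<p>Successive logical addresses 0, 1, 2, 3 map to physical 0x00000, 0x10000, 0x00001, 0x10001, so they alternate between bank0 and bank1.</p>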
  257. <p>Usually a VDP programmer doesn't need to be aware of this interleaving. But
  258. because interleaving is only enabled in screen 7 and 8, this effect can become
  259. visible when switching between screen modes. <i>An alternative design decision
  260. could have been to always interleave the addresses. I guess the V9938 designers
  261. didn't make this choice to allow for single chip configurations in case only
  262. 64kB VRAM is connected.</i></p>
  263. <p>The diagram above shows a read of 2&times;2 bytes, in reality the VDP only
  264. uses this schema to read 2&times;4 bytes. In principle it's also possible to
  265. write to two banks in parallel, but the VDP never needs this ability.</p>
  266. <h4>Refresh</h4>
  267. <p>DRAM chips need to be refreshed regularly. The VDP is responsible for doing
  268. this (there are DRAM chips that handle refresh internally, but the VDP doesn't
  269. use such chips). Many DRAM chips allow a refresh by only activating and
  270. deactivating the RAS signal, so without actually performing a read or write in
  271. between. When extrapolating from the above timing diagrams, this would only
  272. cost 4 cycles. Though the VDP doesn't actually use this RAS-without-CAS refresh
  273. mode. Instead it performs a regular read access which takes 6 cycles.</p>
  274. <p>Each time a read (or write) is performed on a certain row of a DRAM chip,
  275. that whole row is refreshed. So to refresh the whole RAM, the VDP has to
  276. periodically read (any column address of) each of the 256 possible rows.</p>
  277. <h2>Distribution of VRAM accesses</h2>
  278. <p>The previous section described the details of isolated (single or burst)
  279. VRAM accesses. This section will look at such accesses as indivisible units and
  280. examine how these units are grouped together and spread in time to perform all
  281. the VRAM related stuff the VDP has to do.</p>
  282. <p>The VDP can perform VRAM reads/writes for the following reasons:</p>
  283. <ul>
  284. <li>Refresh</li>
  285. <li>Bitmap rendering</li>
  286. <li>Sprite rendering</li>
  287. <li>CPU read/write</li>
  288. <li>Command read/write</li>
  289. </ul>
  290. <p>Note that next to bitmap modes, the VDP also has character and text modes. I
  291. didn't investigate those modes yet, so this text mostly ignores them.</p>
  292. <p>The rest of this text explains when in time (at which specific VDP
  293. cycles) accesses of each type are executed.</p>
  294. <p>We'll first focus on refresh and bitmap/sprite rendering. Later we'll add
  295. CPU and command engine. The reason for this split is that the first group has a
  296. fairly simple pattern: refreshes always occur at fixed moments in time.
  297. Enabling bitmap rendering only adds additional VRAM reads but has no influence
  298. on the timing of the refreshes. Similarly enabling sprite rendering adds even
  299. more reads without influencing the bitmap or refresh reads. CPU and command
  300. accesses on the other hand cannot simply be added to this schema without
  301. influencing each other. So those are postponed till a later section.</p>
  302. <h3>Horizontal line timing</h3>
  303. <p>The VDP renders a full frame line-by-line. For each line the VDP (possibly)
  304. has to read some bitmap and sprite data from VRAM. It's logical to assume (and
  305. the measurements confirm this) that the data fetches within one line occur at
  306. the same relative positions as the corresponding data fetches of another line.
  307. So if we can figure out the details for one line, we can extrapolate this to a
  308. whole frame. Similarly we can assume that different frames will have similar
  309. relative timings. So really all we need to know is the timing of one line.</p>
  310. <p><i>TODO: odd and even frames in interlace mode probably do have timing
  311. differences. Still need to investigate this.</i>
  312. </p>
  313. <p>Let's thus first look at what we already know about a horizontal display
  314. line. The 'V9938 Technical Data Book' contains the following timing info about
  315. (non-text mode) display lines.</p>
  316. <table>
  317. <tr><th>Description </th><th>Cycles </th><th>Length</th></tr>
  318. <tr><td>Synchronize signal</td><td>[0 - 100)</td><td> 100</td></tr>
  319. <tr><td>Left erase time </td><td>[100 - 202)</td><td> 102</td></tr>
  320. <tr><td>Left border </td><td>[202 - 258)</td><td> 56</td></tr>
  321. <tr><td>Display cycle </td><td>[258 - 1282)</td><td>1024</td></tr>
  322. <tr><td>Right border </td><td>[1282 - 1341)</td><td> 59</td></tr>
  323. <tr><td>Right erase time </td><td>[1341 - 1368)</td><td> 27</td></tr>
  324. <tr><td>Total </td><td>[0 - 1368)</td><td>1368</td></tr>
  325. </table>
  326. <p>So one display line is divided into 6 periods. The total length of one line is
  327. 1368 cycles. The previous section showed how long individual VRAM accesses
  328. take. The next sections will figure out how all the required accesses fit in
  329. this per-line budget of 1368 cycles.</p>
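<p>For emulation purposes the table can be turned into a small lookup. The following Python sketch copies the half-open intervals from the table and checks that they cover the 1368 cycles without gaps; the names <i>LINE_REGIONS</i> and <i>region_at</i> are made up for this illustration.</p>

```python
# (name, first cycle, one past the last cycle), copied from the table above.
LINE_REGIONS = [
    ("synchronize signal", 0, 100),
    ("left erase time", 100, 202),
    ("left border", 202, 258),
    ("display cycle", 258, 1282),
    ("right border", 1282, 1341),
    ("right erase time", 1341, 1368),
]

def region_at(cycle):
    """Return the name of the period that contains the given line-relative cycle."""
    for name, start, end in LINE_REGIONS:
        if start <= cycle < end:
            return name
    raise ValueError("cycle must be in [0, 1368)")

total = sum(end - start for _, start, end in LINE_REGIONS)
print(total)            # 1368
print(region_at(258))   # display cycle
```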
  330. <p>A note about the timing notation: in this text all the timing numbers are
  331. VDP cycles relative within one line. For example in the table above the display
  332. period starts at cycle 258. The display period of the next line will start at
  333. cycle 258+1368=1626, the next at cycle 2994 and so on. To make the values
  334. smaller, all cycle numbers will be folded to the interval [0, 1368). The
  335. starting point (cycle=0) has no special meaning. We could have taken any other
  336. point and called that the starting point. (For the current choice, the external
  337. VDP HSYNC pin gets activated at cycle=0, so it was a convenient point to
  338. synchronize the measurements on).</p>
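<p>The folding is simply a modulo operation. A small Python sketch, assuming the line length of 1368 cycles used throughout this text (the function names are made up for this illustration):</p>

```python
CYCLES_PER_LINE = 1368

def fold(cycle):
    """Fold an absolute cycle count to the interval [0, 1368)."""
    return cycle % CYCLES_PER_LINE

def line_and_offset(cycle):
    """Return (line number, cycle within that line)."""
    return divmod(cycle, CYCLES_PER_LINE)

print(fold(1626))             # 258
print(line_and_offset(2994))  # (2, 258)
```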
  339. <p><i>TODO horizontal set-adjust: The numbers in the above table are valid for
  340. horizontal set-adjust=0. Similarly all our measurements were done with
  341. set-adjust=0. Using different set-adjust values will make the left/right border
  342. bigger/smaller. I still need to figure out which timing values of the next
  343. sections are changed by this. E.g. are all the VRAM accesses in a line shifted
  344. as a whole, or are just the bitmap data fetches shifted while (some) other
  345. accesses remain fixed?</i></p>
  346. <p><i>TODO bits S1,S0 in VDP register R#9: The above table is valid for
  347. S1,S0=0,0. In other cases the length of a display line is only 1365 cycles
  348. instead of 1368. The rest of this text assumes a line length of 1368 cycles. I
  349. still need to figure out where exactly in the line this difference of 3 cycles
  350. is located.</i></p>
  351. <!-- numbers for 1365 cycles
  352. [0 - 100) (len= 100)
  353. [100 - 202) (len= 102)
  354. [202 - 258) (len= 56)
  355. [258 -1282) (len=1024)
  356. [1282-1339) (len= 57)
  357. [1339-1365) (len= 26)-->
  358. <h3>Sneak preview</h3>
  359. <p>The following image graphically summarizes the results of the rest of this
  360. section. This is a very wide image, it is much larger than what can be shown
  361. inline in this text (click to see the full image). It's highly recommended to
  362. open this image in an external image viewer that allows to easily zoom in and
  363. out and scroll the image.</p>
  364. <a href="vdp-timing.png">
  365. <img src="vdp-timing.png" width="1200" />
  366. </a>
  367. <p>Here's an overview of the most important items in this image:</p>
  368. <ul>
  369. <li>Horizontally there are 6 regions in the image (each has a slightly
  370. different background color). These regions correspond to the 'synchronize',
  371. 'left/right erase', 'left/right border' and 'display' regions in the table from
  372. the previous section.</li>
  373. <li>Horizontally you also see a timeline going from 0 to 1368 cycles. This
  374. corresponds to one full display line.</li>
  375. <li>Vertically there are 3 big groups: 'screen off', 'sprites off' and
  376. 'sprites on'; see the next section for why these groups are important.</li>
  377. <li>Within one vertical group there is one color-coded band and a set of
  378. RAS/CAS signals. Usually there's one RAS and 2 CAS signals, but the 'sprites
  379. off' group has 2 pairs of CAS signals. For the 'sprites off' and 'sprites on'
  380. groups there are subtle differences in the CAS0/1 signals between screen modes
  381. 5/6 and 7/8. But to save space these differences are only shown once.</li>
  382. <li>The colors in the color-coded band have the following meaning:
  383. <ul>
  384. <li>red: refresh read</li>
  385. <li>green: bitmap data read (dark-green is dummy bitmap read)</li>
  386. <li>yellow: sprite data read (brown is dummy sprite read)</li>
  387. <li>blue: potential CPU or command engine read or write</li>
  388. <li>dark-grey: dummy read</li>
  389. <li>light-grey: idle (no read or write)</li>
  390. </ul></li>
  391. <li>The CAS signals are drawn in either a full or a stippled line. Full means
  392. the signal is definitely high/low at this point. Stippled means, it can be high
  393. or low depending on whether there was a CPU request or VDP command executing at
  394. that point. Note that the RAS signal always toggles, even if there is no CPU or
  395. command access required.</li>
  396. </ul>
  397. <p>The next sections will go into a lot more detail. It's probably a good idea
  398. to have this (zoomed in) image open while reading those later sections.</p>
  399. <h3>3 operating modes</h3>
  400. <p>When looking from a VDP-VRAM interaction point of view, the VDP can operate
  401. in 3 modes:</p>
  402. <ul>
  403. <li>Screen disabled (sprite status doesn't matter). This is the same as
  404. vertical border.</li>
  405. <li>Screen enabled, sprites disabled.</li>
  406. <li>Screen enabled, sprites enabled.</li>
  407. </ul>
  408. <p>Note that the (bitmap) screen mode (screen 5, 6, 7, or 8) largely doesn't
  409. matter for the VRAM access pattern.</p>
  410. <p><i>TODO sprite fetching happens 1 line earlier than displaying those sprites
  411. (see below for details). This means that the last line of the vertical border
  412. before the display area likely uses a 'mixed mode' where it doesn't yet fetch
  413. bitmap data but it does already fetch sprite data. I didn't specifically
  414. measure this condition, so I can't really tell anything about this mixed mode.
  415. (One possibility is that it's just like a normal display line, but the fetched
  416. bitmap data is ignored.) Similarly the last line of the display area doesn't
  417. strictly need to fetch new sprite data.</i></p>
  418. <p>We'll now look at these 3 modes in more detail.</p>
  419. <h4>Screen disabled</h4>
  420. <h5>refresh</h5>
  421. <p>Screen rendering can be disabled via bit 6 in VDP register R#1. There's also
  422. no screen rendering when the VDP is showing a vertical border line. From a
  423. VRAM-access point of view both cases are identical.</p>
  424. <p>In this mode the VDP doesn't need to fetch any data from VRAM for
  425. rendering. It only needs to refresh the VRAM. As already mentioned earlier,
  426. the VDP uses a regular read to refresh the RAM, so this takes 6 cycles.</p>
  427. <p>The VDP executes 8 refresh actions per display line. They start at the
  428. following moments in time (the red blocks in the big timing diagram):</p>
  429. <table>
  430. <tr><td>284</td><td>412</td><td>540</td><td>668</td>
  431. <td>796</td><td>924</td><td>1052</td><td>1180</td></tr>
  432. </table>
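<p>The 8 start times in this table are evenly spaced: each one is 128 cycles after the previous one. That spacing is just an observation from the table, not something taken from the Data Book. A Python sketch that regenerates the table and marks the cycles occupied by a refresh (6 cycles each, as for a regular read) could look like this; all names are made up for this illustration.</p>

```python
REFRESH_LENGTH = 6   # a refresh is a regular read: 6 cycles
# Observed pattern: first refresh at cycle 284, then one every 128 cycles.
REFRESH_STARTS = [284 + 128 * k for k in range(8)]

def in_refresh(cycle):
    """True if the given line-relative cycle falls inside a refresh read."""
    return any(start <= cycle < start + REFRESH_LENGTH for start in REFRESH_STARTS)

print(REFRESH_STARTS)
# [284, 412, 540, 668, 796, 924, 1052, 1180]
```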
  433. <h5>refresh-addresses</h5>
  434. <p><i>I didn't investigate this refresh-address-stuff in detail because it
  435. doesn't matter for emulation accuracy</i>.</p>
  436. <p>The logical addresses used for refresh reads seem to be of the form:</p>
  437. <p align="center">N&times;0x10101 | 0x3F</p>
  438. <p>Where N increases on each refresh action. So on each refresh the row address
  439. increases by one, and successive refreshes alternate between the CAS0 and the CAS1
  440. signal (the column address doesn't matter for refresh). Note that this
  441. formula is for the logical address, in screen 7/8 this still gets transformed
  442. to a physical address. So in screen 7/8 a refresh action always uses the CAS1
  443. signal. That means that in screen 7/8 the DRAM chip(s) of bank0 actually do get
  444. refreshed using the RAS-without-CAS refresh mode.</p>
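<p>The following Python sketch evaluates the refresh address formula and the screen 7/8 rotation to illustrate the two claims above. It assumes that N is an 8-bit counter and that the address is truncated to 17 bits; both are assumptions for this illustration, not measured facts. All names are made up.</p>

```python
ADDR_MASK = 0x1FFFF  # 17-bit address space (128kB)

def refresh_logical_address(n):
    """Logical address of the n-th refresh read, following the observed formula."""
    n &= 0xFF  # assumption: the counter is 8 bits wide (256 rows)
    return ((n * 0x10101) | 0x3F) & ADDR_MASK

def to_physical(logical, interleaved):
    """In screen 7/8 (interleaved) the 17-bit address is rotated right by one bit."""
    if not interleaved:
        return logical
    return ((logical >> 1) | (logical << 16)) & ADDR_MASK

def bank_and_row(physical):
    """Return (bank, row): bit 16 selects CAS0/CAS1, bits 15-8 are the row."""
    return physical >> 16, (physical >> 8) & 0xFF

for n in range(4):
    logical = refresh_logical_address(n)
    print(n, bank_and_row(to_physical(logical, False)),
          bank_and_row(to_physical(logical, True)))
```

<p>Without interleaving the bank alternates and the row equals N. With interleaving the bank is always 1 (CAS1), because bit 0 of the logical address is always set, while the rows still cover all 256 values over 256 refreshes (in a different order).</p>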
  445. <p>The refresh timings are the same for all non-text screen modes. But in text
  446. modes there are only 7 refreshes per line and they are also located at
  447. different relative positions than in the table above. I didn't investigate
  448. this further.</p>
  449. <h5>dummy reads</h5>
  450. <p>Next to the refresh reads, in 'screen disabled' mode, the VDP still performs
  451. 4 reads of address 0x1FFFF. At the following moments (marked with dark-grey
  452. blocks on the timeline):</p>
  453. <table><tr><td>1236</td><td>1244</td><td>1252</td><td>1260</td></tr></table>
  454. <p>I can't imagine any use for these reads, so let's call them dummy reads. In all
  455. our measurements these dummy reads always re-occur in these same positions, so
  456. it's not a fluke in (only one of) the measurements.</p>
  457. <p>The refresh actions remain exactly the same in the other two modes. But
  458. these dummy reads are different in the mode 'sprites off' or disappear
  459. completely in the mode 'sprites on'. (This confirms that nothing 'useful' is
  460. done by these dummy reads).</p>
  461. <p>Anyway for emulation we can mostly ignore these dummy reads. It only matters
  462. that at these moments in time there cannot be CPU or command VRAM reads or
  463. writes.</p>
  464. <h4>Screen enabled, sprites disabled</h4>
  465. <h5>refresh and dummy reads</h5>
  466. <p>Refresh works exactly the same as in the previous mode. The dummy reads
  467. are a bit different. Now there are only 3 dummy reads at slightly different
  468. moments (also shown in dark-grey):</p>
  469. <table><tr><td>1242</td><td>1250</td><td>1258</td></tr></table>
  470. <p>The first of these 3 reads is always from address 0x1FFFF. The second and
  471. third dummy read have a pattern in their address. For example:</p>
  472. <table>
  473. <tr><th>1st</th><th>2nd</th><th>3rd</th></tr>
  474. <tr><td>0x1FFFF</td><td>0x03B80</td><td>0x03B82</td></tr>
  475. <tr><td>0x1FFFF</td><td>0x03C00</td><td>0x03C02</td></tr>
  476. <tr><td>0x1FFFF</td><td>0x03C80</td><td>0x03C82</td></tr>
  477. <tr><td>0x1FFFF</td><td>0x03D00</td><td>0x03D02</td></tr>
  478. </table>
  479. <p>This table shows the addresses of the 3 dummy reads for 4 successive display
  480. lines (this is data from an actual measurement, unfortunately our equipment
  481. could only buffer up to 4 lines). The lower 7 bits of the address of the 2nd
  482. read always seem to be zero. The address of the 3rd read is the same as for the
  483. 2nd read except that bit 1 is set to 1. When going from one line to the next,
  484. the address increases by 0x80. Our measurements captured 10 independent sets of
  485. 4 successive lines. Each time bits 16-15 were zero (bits 14-7 do take different
  486. values). This could be a coincidence, or it could be that these bits really
  487. aren't included in the counter. Note that again these are logical addresses (so
  488. still transformed for screen 7/8). I didn't investigate these dummy reads in
  489. more detail because they mostly don't matter for emulation.</p>
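<p>The address pattern in the table can be sketched like this. It is only an
illustration of the 4 measured lines: the function name and the idea of an
8-bit per-line counter in address bits 14-7 are my assumptions, not confirmed
VDP behaviour.</p>

```python
def dummy_read_addresses(line_counter):
    """Return the (1st, 2nd, 3rd) dummy-read addresses for one display line.

    line_counter is a hypothetical 8-bit value that ends up in address
    bits 14-7; bits 16-15 were always zero in the measurements.
    """
    first = 0x1FFFF                        # always the same address
    second = (line_counter & 0xFF) << 7    # lower 7 bits are zero
    third = second | 0x02                  # same address with bit 1 set
    return first, second, third


# Reproduce the four measured lines (counter values 0x77..0x7A).
for n in range(0x77, 0x7B):
    print(", ".join(f"0x{a:05X}" for a in dummy_read_addresses(n)))
```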
  490. <h5>bitmap reads</h5>
  491. <p>The major change compared to the previous mode is that now the VDP needs to
  492. fetch extra data for the bitmap rendering. These fetches happen in 32 blocks of
  493. 4 bytes (screen 5/6) or 8 bytes (screen 7/8). The fetches within one block
  494. happen in burst mode. This means that one block takes 18 cycles (screen 5/6) or
  495. 20 cycles (screen 7/8). Though later we'll see that the two spare cycles for
  496. screen 5/6 are not used for anything else, so for simplicity we can say that in
  497. all bitmap modes a bitmap-fetch-block takes 20 cycles. This is even clearer if
  498. you look at the RAS signal: this signal follows the exact same pattern in all
  499. (bitmap) screen modes, so in screen 5/6 it remains active for two cycles longer
  500. than strictly necessary.</p>
  501. <p>Actually before these 32 blocks there's one extra dummy block. This block
  502. has the same timing as the other blocks, but it always reads address 0x1FFFF.
  503. From an emulator point of view, these dummy reads don't matter; it only matters
  504. that at those moments no other VRAM accesses can occur.</p>
  505. <p>These 1+32 blocks start at the following moments in time (these
  506. are the green blocks in the big timing diagram):</p>
  507. <table>
  508. <tr><td>(195)</td><td> 227</td><td> 259</td><td> 291</td><td> 323</td>
  509. <td> 355</td><td> 387</td><td> 419</td><td> 451</td></tr>
  510. <tr><td> </td><td> 483</td><td> 515</td><td> 547</td><td> 579</td>
  511. <td> 611</td><td> 643</td><td> 675</td><td> 707</td></tr>
  512. <tr><td> </td><td> 739</td><td> 771</td><td> 803</td><td> 835</td>
  513. <td> 867</td><td> 899</td><td> 931</td><td> 963</td></tr>
  514. <tr><td> </td><td> 995</td><td>1027</td><td>1059</td><td>1091</td>
  515. <td>1123</td><td>1155</td><td>1187</td><td>1219</td></tr>
  516. </table>
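<p>The table is completely regular, so an emulator doesn't need to store it. A
minimal sketch (the constant and function names are made up for this
example):</p>

```python
BLOCK_PERIOD = 32      # cycles between the starts of two blocks
FIRST_BLOCK = 195      # start of the dummy preamble block
BLOCK_LENGTH = 20      # cycles per block (only 18 really used in screen 5/6)


def bitmap_block_starts():
    """Start cycles of the dummy block followed by the 32 real blocks."""
    return [FIRST_BLOCK + BLOCK_PERIOD * i for i in range(1 + 32)]


starts = bitmap_block_starts()
print(starts[0], starts[1], starts[-1])   # 195 227 1219
# Cycles left for other VRAM accesses between two blocks.
print(BLOCK_PERIOD - BLOCK_LENGTH)        # 12
```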
  517. <p><i>The following is only speculation: I wonder why there is such a dummy
  518. preamble block. Theoretically this <b>could</b> have been used (or reserved) to
  519. implement V9958-like horizontal scrolling without having to mask 8 border
  520. pixels. Unfortunately horizontal scrolling on a V9958 doesn't work like that
  521. :(</i></p>
  522. <h4>screen enabled, sprites enabled</h4>
  523. <h5>refresh, dummy reads, bitmap reads</h5>
  524. <p>Refresh and bitmap reads are exactly the same as in the previous mode. But
  525. the 3 or 4 dummy reads from the previous 2 modes are not present in this
  526. mode.</p>
  527. <h5>sprite reads</h5>
  528. <p><i>I've only investigated bitmap modes, so the description below applies
  529. only to sprite mode 2.</i></p>
  530. <p>For sprite rendering you need to:</p>
  531. <ul>
  532. <li>Figure out which sprites are visible: There are 32 positions in the
  533. sprite attribute table, and of those maximum 8 sprites can be visible
  534. (per line).</li>
  535. <li>For the visible sprites, fetch the required data so that it can actually
  536. be drawn. This data is: the x- and y-coordinates, the sprite pattern number,
  537. the pattern data and the color data.</li>
  538. </ul>
  539. <p>Figuring out which sprites are visible is done by reading the y-coordinates
  540. of each of the 32 possible sprites. These reads happen interleaved between the
  541. 32 block-reads of the bitmap data, so read one byte between each bitmap-block.
  542. Because of this interleaving it's not possible to use burst mode, so each read
  543. takes 6 cycles. There's also 1 dummy read of address 0x1FFFF at the end. The
  544. reads happen at these moments in time (yellow blocks between the green blocks in
  545. the diagram):</p>
  546. <table>
  547. <tr><td> 182</td><td> 214</td><td> 246</td><td> 278</td>
  548. <td> 310</td><td> 342</td><td> 374</td><td> 406</td></tr>
  549. <tr><td> 438</td><td> 470</td><td> 502</td><td> 534</td>
  550. <td> 566</td><td> 598</td><td> 630</td><td> 662</td></tr>
  551. <tr><td> 694</td><td> 726</td><td> 758</td><td> 790</td>
  552. <td> 822</td><td> 854</td><td> 886</td><td> 918</td></tr>
  553. <tr><td> 950</td><td> 982</td><td>1014</td><td>1046</td>
  554. <td>1078</td><td>1110</td><td>1142</td><td>1174</td><td>(1206)</td></tr>
  555. </table>
  556. <p>In the worst case, the 8 last sprites of the attribute table are visible. In
  557. that case all 32 reads are really required. Though even if the limit of 8
  558. visible sprites is reached earlier, the VDP continues fetching all 32+1 bytes.
  559. Also if one y-coordinate is equal to 216 (meaning that all later sprites are
  560. invisible), still all 32+1 fetches are executed.</p>
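<p>In other words: the visibility check can stop early, but the VRAM reads
never do. The sketch below illustrates this. The visibility test itself is a
simplified assumption (it ignores details like the exact vertical offset,
wrapping and magnification), and the names are hypothetical.</p>

```python
# Start cycles of the 32 y-coordinate reads plus the trailing dummy read.
Y_READ_STARTS = [182 + 32 * i for i in range(32 + 1)]


def scan_sprites(y_coords, line, height=16):
    """Return (visible sprite indices, number of VRAM reads) for one line.

    y_coords holds the 32 y-coordinates from the sprite attribute table.
    """
    visible = []
    terminated = False
    reads = 0
    for index, y in enumerate(y_coords):
        reads += 1                      # the read always happens
        if y == 216:
            terminated = True           # this and all later sprites are invisible
        if terminated or len(visible) == 8:
            continue
        if y <= line < y + height:      # simplified visibility test
            visible.append(index)
    reads += 1                          # trailing dummy read of 0x1FFFF
    return visible, reads


print(scan_sprites([10] * 32, line=12))
```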
  561. <p>Once the VDP has figured out which sprites are visible it needs to fetch the
  562. data to actually draw the sprites. This VRAM access pattern is relatively
  563. complex:</p>
  564. <ul>
  565. <li>In the worst case there are 8 visible sprites. This requires reading
  566. 8&times;6 bytes. Some of these reads can be done in burst mode, others are
  567. single byte reads.</li>
  568. <li>Even if there are fewer than 8 sprites to display, all read accesses
  569. still occur. It <i>seems</i> that the useless reads are duplicates of
  570. sprite 0. (Or is it the first visible sprite? I didn't look in detail because
  571. it's not important for our purpose. It only matters that the VRAM bus remains
  572. occupied).</li>
  573. <li>The data fetches happen in 4 chunks of 2 sprites each. Each chunk
  574. reads:
  575. <ul>
  576. <li>Y-coordinate, x-coordinate and pattern-number of 1st sprite. Burst of 3
  577. reads, takes 13(!) cycles.</li>
  578. <li>Y-coordinate, x-coordinate and pattern-number of 2nd sprite. Burst of 3
  579. reads, takes 13(!) cycles.</li>
  580. <li>Pause of 6 or 10(!) cycles</li>
  581. <li>2 pattern bytes of 1st sprite. Burst of 2 reads, takes 10 cycles.</li>
  582. <li>Color attribute of 1st sprite. Single read, takes 6 cycles.</li>
  583. <li>2 pattern bytes of 2nd sprite. Burst of 2 reads, takes 10 cycles.</li>
  584. <li>Color attribute of 2nd sprite. Single read, takes 6 cycles.</li>
  585. </ul></li>
  586. <li>Note that the burst of 3 reads only takes 13 instead of the expected 14
  587. cycles. If you look at the RAS/CAS signals you see that this uses an illegal(?)
  588. RAM access pattern: RAS is released together with CAS (even slightly before if
  589. you look at the raw measured data). But obviously this seems to work fine
  590. <i>&hellip; makes me wonder why the VDP doesn't always use this faster
  591. access pattern.</i></li>
  592. <li>Even for 8x8 sprites, the VDP always fetches 2 bytes of pattern-data per
  593. sprite line (and the 2nd byte is ignored).</li>
  594. <li>Note that the y-coordinate is fetched again. It was already fetched to
  595. figure out which sprites are visible.</li>
  596. <li>The positions in time of these reads (single or burst) are like this
  597. (yellow blocks (mostly) in the border period in the big timing diagram):
  598. <table>
  599. <tr><td>1238</td><td>1251</td><td>1270</td><td>1280</td><td>1286</td><td>1296</td></tr>
  600. <tr><td>1302</td><td>1315</td><td>1338</td><td>1348</td><td>1354</td><td>1364</td></tr>
  601. <tr><td> 2</td><td> 15</td><td> 34</td><td> 44</td><td> 50</td><td> 60</td></tr>
  602. <tr><td> 66</td><td> 79</td><td> 98</td><td> 108</td><td> 114</td><td> 124</td></tr>
  603. </table>
  604. Note that some of these fetches occur in the previous and some in the current
  605. display line. Though the start of the display line was chosen arbitrarily (we
  606. could have picked the starting point so that these numbers don't wrap). It only
  607. matters that all sprite data is fetched before the display rendering
  608. starts.</li>
  609. <li>Also note that the timing is slightly irregular: in the 1st, 3rd and 4th
  610. group there's a pause of 6 cycles, so exactly one other access fits in this
  611. gap. But in the 2nd group there's a pause of 10 cycles. Still only one other
  612. access fits in this gap, and the timing is 2+6+2, so 2 'wasted' cycles
  613. before and after that other access. <i>I suspect that these 2+2 cycles are
  614. related to the R#9 S1,S0 bits. TODO measure this</i>.</li>
  615. </ul>
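<p>The table of fetch positions follows directly from the chunk layout
described in the list above. The sketch below rebuilds it, assuming a display
line of 1368 VDP cycles (which is what the wrap from 1364 to 2 in the table
implies). The names are only for illustration.</p>

```python
CYCLES_PER_LINE = 1368          # assumption: length of one display line
FIRST_FETCH = 1238              # start of the first chunk
PAUSES = [6, 10, 6, 6]          # pause inside each of the 4 chunks


def sprite_fetch_starts():
    """Start cycle of every (burst or single) read, one row per chunk."""
    rows = []
    t = FIRST_FETCH
    for pause in PAUSES:
        # (duration of the access, idle cycles after it)
        accesses = [(13, 0), (13, pause), (10, 0), (6, 0), (10, 0), (6, 0)]
        row = []
        for length, idle in accesses:
            row.append(t % CYCLES_PER_LINE)
            t += length + idle
        rows.append(row)
    return rows


for row in sprite_fetch_starts():
    print(row)
```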
  616. <p>It's worth repeating that whenever sprites are enabled, the VDP
  617. <b>always</b> performs the same fetch-pattern: even if no sprites are
  618. actually visible, or if sprites are partially disabled (with y=216), and even
  619. with 8x8 vs 16x16 sprites, magnified or not. This confirms the fact that the
  620. VDP command engine is slowed down by the exact same amount in all these
  621. situations. Also all (bitmap) screen modes behave exactly the same with respect
  622. to sprite data fetches.</p>
  623. <h3>CPU and command reads/writes</h3>
  624. <h5>position of access slots</h5>
  625. <p>The previous sections explained when the VDP reads from VRAM for refresh and
  626. bitmap/sprite rendering (and even some dummy reads). Depending on the mode
  627. (screen/sprites enabled/disabled), this takes more or less of the available
  628. VRAM-bandwidth. The portion of the VRAM bandwidth that is not used for
  629. rendering can be used for CPU or command engine VRAM reads or writes.</p>
  630. <p>All CPU and command engine accesses are single (non-burst) accesses, so they
  631. take 6 cycles each. However it is <b>not</b> the case that whenever the VRAM
  632. bus is idle for 6 cycles, it can be used for CPU or command engine
  633. accesses.</p>
  634. <p>Instead there are fixed moments in time where a CPU or command access
  635. could <i>possibly</i> start, let's call these moments access slots. Each slot
  636. can be used for either CPU or command accesses (there are no slots that are
  637. uniquely reserved for either CPU or for commands). The position and the number
  638. of access slots depend <i>only</i> on the VDP mode (screen off, sprites off,
  639. sprites on), not for example on the number of actually visible sprites or on
  640. the (bitmap) screen mode.</p>
  641. <p>The 3 tables below show the number and the positions of the possible access
  642. slots for the 3 different modes (in the timing diagram these are the blue
  643. blocks):</p>
  644. <p><table>
  645. <caption>screen off, 154 possible slots</caption>
  646. <tr><td> 0</td><td> 8</td><td> 16</td><td> 24</td><td> 32</td>
  647. <td> 40</td><td> 48</td><td> 56</td><td> 64</td><td> 72</td></tr>
  648. <tr><td> 80</td><td> 88</td><td> 96</td><td> 104</td><td> 112</td>
  649. <td> 120</td><td> 164</td><td> 172</td><td> 180</td><td> 188</td></tr>
  650. <tr><td> 196</td><td> 204</td><td> 212</td><td> 220</td><td> 228</td>
  651. <td> 236</td><td> 244</td><td> 252</td><td> 260</td><td> 268</td></tr>
  652. <tr><td> 276</td><td> 292</td><td> 300</td><td> 308</td><td> 316</td>
  653. <td> 324</td><td> 332</td><td> 340</td><td> 348</td><td> 356</td></tr>
  654. <tr><td> 364</td><td> 372</td><td> 380</td><td> 388</td><td> 396</td>
  655. <td> 404</td><td> 420</td><td> 428</td><td> 436</td><td> 444</td></tr>
  656. <tr><td> 452</td><td> 460</td><td> 468</td><td> 476</td><td> 484</td>
  657. <td> 492</td><td> 500</td><td> 508</td><td> 516</td><td> 524</td></tr>
  658. <tr><td> 532</td><td> 548</td><td> 556</td><td> 564</td><td> 572</td>
  659. <td> 580</td><td> 588</td><td> 596</td><td> 604</td><td> 612</td></tr>
  660. <tr><td> 620</td><td> 628</td><td> 636</td><td> 644</td><td> 652</td>
  661. <td> 660</td><td> 676</td><td> 684</td><td> 692</td><td> 700</td></tr>
  662. <tr><td> 708</td><td> 716</td><td> 724</td><td> 732</td><td> 740</td>
  663. <td> 748</td><td> 756</td><td> 764</td><td> 772</td><td> 780</td></tr>
  664. <tr><td> 788</td><td> 804</td><td> 812</td><td> 820</td><td> 828</td>
  665. <td> 836</td><td> 844</td><td> 852</td><td> 860</td><td> 868</td></tr>
  666. <tr><td> 876</td><td> 884</td><td> 892</td><td> 900</td><td> 908</td>
  667. <td> 916</td><td> 932</td><td> 940</td><td> 948</td><td> 956</td></tr>
  668. <tr><td> 964</td><td> 972</td><td> 980</td><td> 988</td><td> 996</td>
  669. <td>1004</td><td>1012</td><td>1020</td><td>1028</td><td>1036</td></tr>
  670. <tr><td>1044</td><td>1060</td><td>1068</td><td>1076</td><td>1084</td>
  671. <td>1092</td><td>1100</td><td>1108</td><td>1116</td><td>1124</td></tr>
  672. <tr><td>1132</td><td>1140</td><td>1148</td><td>1156</td><td>1164</td>
  673. <td>1172</td><td>1188</td><td>1196</td><td>1204</td><td>1212</td></tr>
  674. <tr><td>1220</td><td>1228</td><td>1268</td><td>1276</td><td>1284</td>
  675. <td>1292</td><td>1300</td><td>1308</td><td>1316</td><td>1324</td></tr>
  676. <tr><td>1334</td><td>1344</td><td>1352</td><td>1360</td></tr>
  677. </table></p>
  678. <p><table>
  679. <caption>sprites off, 88 possible slots</caption>
  680. <tr><td> 6</td><td> 14</td><td> 22</td><td> 30</td><td> 38</td>
  681. <td> 46</td><td> 54</td><td> 62</td><td> 70</td><td> 78</td></tr>
  682. <tr><td> 86</td><td> 94</td><td> 102</td><td> 110</td><td> 118</td>
  683. <td> 162</td><td> 170</td><td> 182</td><td> 188</td><td> 214</td></tr>
  684. <tr><td> 220</td><td> 246</td><td> 252</td><td> 278</td><td> 310</td>
  685. <td> 316</td><td> 342</td><td> 348</td><td> 374</td><td> 380</td></tr>
  686. <tr><td> 406</td><td> 438</td><td> 444</td><td> 470</td><td> 476</td>
  687. <td> 502</td><td> 508</td><td> 534</td><td> 566</td><td> 572</td></tr>
  688. <tr><td> 598</td><td> 604</td><td> 630</td><td> 636</td><td> 662</td>
  689. <td> 694</td><td> 700</td><td> 726</td><td> 732</td><td> 758</td></tr>
  690. <tr><td> 764</td><td> 790</td><td> 822</td><td> 828</td><td> 854</td>
  691. <td> 860</td><td> 886</td><td> 892</td><td> 918</td><td> 950</td></tr>
  692. <tr><td> 956</td><td> 982</td><td> 988</td><td>1014</td><td>1020</td>
  693. <td>1046</td><td>1078</td><td>1084</td><td>1110</td><td>1116</td></tr>
  694. <tr><td>1142</td><td>1148</td><td>1174</td><td>1206</td><td>1212</td>
  695. <td>1266</td><td>1274</td><td>1282</td><td>1290</td><td>1298</td></tr>
  696. <tr><td>1306</td><td>1314</td><td>1322</td><td>1332</td><td>1342</td>
  697. <td>1350</td><td>1358</td><td>1366</td></tr>
  698. </table></p>
  699. <p><table>
  700. <caption>sprites on, 31 possible slots</caption>
  701. <tr><td> 28</td><td> 92</td><td> 162</td><td> 170</td><td> 188</td>
  702. <td> 220</td><td> 252</td><td> 316</td><td> 348</td><td> 380</td></tr>
  703. <tr><td> 444</td><td> 476</td><td> 508</td><td> 572</td><td> 604</td>
  704. <td> 636</td><td> 700</td><td> 732</td><td> 764</td><td> 828</td></tr>
  705. <tr><td> 860</td><td> 892</td><td> 956</td><td> 988</td><td>1020</td>
  706. <td>1084</td><td>1116</td><td>1148</td><td>1212</td><td>1264</td></tr>
  707. <tr><td>1330</td></tr>
  708. </table></p>
  709. <p>Note that even in the mode 'screen off', when the VRAM bus is otherwise
  710. mostly idle, the access slots are still at least 8 cycles apart. A single
  711. access takes only 6 cycles, so 2 cycles are wasted.</p>
  712. <p>Very roughly speaking in mode 'screen off' there are about twice as many
  713. access slots as in the mode 'sprites off' and about 5 times as many as in the
  714. mode 'sprites on'. This does however <b>not</b> mean that in these modes the
  715. command engine will execute respectively 2&times; and 5&times; as fast. Instead
  716. in the mode 'sprites on' the speed of command execution is mostly limited by
  717. the amount of available access slots, while in the mode 'screen off', the
  718. bottleneck is mostly the speed of the command engine itself.</p>
  719. <p>Also note that the access slots are not evenly spread in time. For
  720. example:</p>
  721. <ul>
  722. <li>In mode 'screen off', the slots are often only 8 cycles apart (measured
  723. from the start of the 1st to the start of the 2nd slot). Though starting
  724. at cycle=120 there's a gap of 44 cycles.</li>
  725. <li>In mode 'sprites off', during the horizontal border, the access slots are
  726. roughly 8 cycles apart like in the previous mode, but during the display
  727. period, the spacing is more like 26 or 32 cycles. The largest gap is now 54
  728. cycles starting at cycle=1212.</li>
  729. <li>In mode 'sprites on', the pattern is again completely different. Here the
  730. slots are roughly 32 or 64 cycles apart. (The border even has slightly larger
  731. gaps than the display area. So contrary to some speculations, the commands do
  732. not execute faster in the horizontal border in this mode). The largest gap is
  733. now 70 cycles, starting at cycle=92. There's also one location where the
  734. gap is only 8 cycles. (Though if you look at the measurements
  735. you'll see that the slot right after this smallest gap (at cycle=170) is rarely
  736. actually used, even though the command engine is starved for VRAM
  737. bandwidth).</li>
  738. </ul>
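<p>These gap sizes can be checked directly from the slot tables. The sketch
below does this for the mode 'sprites on'. It assumes a display line of 1368
VDP cycles to wrap from the last slot of a line to the first slot of the next
line.</p>

```python
CYCLES_PER_LINE = 1368   # assumption: length of one display line

SPRITES_ON_SLOTS = [
    28, 92, 162, 170, 188, 220, 252, 316, 348, 380,
    444, 476, 508, 572, 604, 636, 700, 732, 764, 828,
    860, 892, 956, 988, 1020, 1084, 1116, 1148, 1212, 1264,
    1330,
]


def gaps(slots):
    """Return (start cycle, distance to the next slot) for every slot."""
    following = slots[1:] + [slots[0] + CYCLES_PER_LINE]
    return [(a, b - a) for a, b in zip(slots, following)]


g = gaps(SPRITES_ON_SLOTS)
print(max(g, key=lambda x: x[1]))   # (92, 70)
print(min(g, key=lambda x: x[1]))   # (162, 8)
```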
  739. <p>These large gaps between the access slots are important. For example if the
  740. CPU is sending data to the VDP at too fast a rate, and this happens right at a
  741. moment where there are no access slots available, then some of the data sent by
  742. the CPU is lost. We'll see later in this text that this can even happen
  743. when the time between the incoming CPU requests is (slightly) larger than the
  744. size of the largest gap.</p>
  745. <h5>allocation of access slots</h5>
  746. <p>The access slots can be used for either CPU or VDP command reads or writes.
  747. This section explains how the slots are allocated to these two subsystems.</p>
  748. <p>The basic principle is very simple: the CPU or the command engine takes the
  749. first available access slot. And when the CPU and command engine both require
  750. an access slot at the same time, then the CPU gets priority. Though if you look
  751. at the details it is a bit more complicated:</p>
  752. <ul>
  753. <li>When the CPU sends a read or write VRAM request to the VDP, this request is
  754. put in a buffer until it can be handled.</li>
  755. <li>When the CPU sends a new request while there's still a previous request
  756. pending, the old request is lost. More on this below. <i>TODO most logical
  757. is that the old (not the new) request is lost, but actually check this. Though
  758. the Z80 might be too slow to be able to test this.</i></li>
  759. <li>Similarly when the VDP command engine needs to perform a VRAM read or
  760. write, this request is also put in a buffer. This is a different buffer than
  761. the one for CPU requests.</li>
  762. <li>In contrast to the CPU, the command engine is stalled when the command
  763. engine buffer holds a request. So command engine requests can never get
  764. lost.</li>
  765. <li>16 cycles in advance of an access slot the VDP checks whether there is
  766. either a pending CPU or command request. If there's a pending CPU request, that
  767. request will be executed (16 cycles later). If there's no CPU request but there
  768. is a command request then that one will be executed (16 cycles later). So the
  769. CPU takes priority over the command engine. And very importantly: if there's no
  770. request pending yet, then 16 cycles later nothing will be executed, not even if
  771. a request does arrive within those 16 cycles.</li>
  772. </ul>
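<p>The decision for a single slot can be written down like this. It's only a
sketch of the rules above, not how openMSX implements it. The function name
is hypothetical, and I assume that a request arriving exactly at the decision
moment is still seen (the measurements don't tell).</p>

```python
DECISION_LEAD = 16   # cycles between the decision and the access slot


def allocate_slot(slot_time, cpu_request_time, cmd_request_time):
    """Decide who uses the access slot at slot_time.

    The request times are the cycles at which the (still pending) CPU or
    command request arrived, or None if there is no such request.
    Returns "cpu", "cmd" or None.
    """
    decision_time = slot_time - DECISION_LEAD
    if cpu_request_time is not None and cpu_request_time <= decision_time:
        return "cpu"                 # the CPU has priority
    if cmd_request_time is not None and cmd_request_time <= decision_time:
        return "cmd"
    return None                      # the slot stays unused


print(allocate_slot(316, 240, 200))   # cpu
print(allocate_slot(252, 240, 200))   # cmd
print(allocate_slot(252, 240, None))  # None
```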
  773. <h5>cpu access slows down command execution</h5>
  774. <p>A surprising result (at least to me) of these measurements is that the
  775. speed of VDP command execution is reduced while simultaneously doing CPU VRAM
  776. accesses. Looking back this makes sense because the same VRAM access slots are
  777. shared between CPU and command engine and the CPU gets priority.</p>
  778. <p>This effect is clearly noticeable in the mode 'sprites on' but much less in
  779. the other two modes. This is easily explained by looking at the amount of
  780. available access slots in these modes.</p>
  781. <p>The most extreme situation occurs in this test. Execute a HMMV VDP command
  782. (this is the fastest command, see below) while simultaneously executing a long
  783. series of <code>OUT (#98),A</code> instructions (the fastest way to send CPU
  784. write requests). In our measurements, in the mode 'sprites on' the command
  785. execution speed was approximately cut in half! But in the other two modes, the
  786. execution speed was barely influenced. (Actually our test program wasn't
  787. accurate enough to measure any significant speed difference, but theoretically
  788. also in the latter two modes the execution speed should be reduced by a small
  789. amount).</p>
  790. <h5>too fast CPU access</h5>
  791. <p>The fastest way for the Z80 to send read or write VRAM requests to the VDP is
  792. by using a sequence of <code>IN A,(#98)</code> or <code>OUT (#98),A</code>
  793. instructions (of course such a sequence always writes the same value or ignores
  794. all but the last read value). This takes 12 Z80 clock cycles per request.
  795. (Instructions like <code>OUT (C),r</code> or <code>OUTI</code> are all slower).
  796. The VDP is clocked at 6&times; the Z80 speed. So when the Z80 sends multiple
  797. requests to the VDP, the distance between these requests, translated to
  798. VDP cycles, is at least 12&times;6=72 VDP cycles. Earlier we saw that the maximal gap
  799. between access slots was 70 VDP cycles, so at first sight there's no problem.
  800. However consider this scenario:</p>
  801. <ul>
  802. <li>Suppose we're in 'sprites on' mode. At time=236, we're 16 cycles before an
  803. access slot. Suppose there's no pending CPU nor command request at this
  804. time. So nothing will get executed at time=252.</li>
  805. <li>A bit later at time=240 there arrives a CPU write request. This request
  806. gets buffered.</li>
  807. <li>At time=252 there is an access slot, but nothing will get executed in this
  808. slot (because this slot wasn't allocated at time=236).</li>
  809. <li>At time=300 we're again 16 cycles before an access slot. Now there is a
  810. pending CPU request, so we'll execute that at time=316.</li>
  811. <li>At time=312 we receive a new CPU write request. This is 312-240=72 VDP
  812. cycles (or 12 Z80 cycles, the duration of an <code>OUT (#98),A</code>
  813. instruction) after the previous request. But the buffer still contains the
  814. previous unhandled request. The new request overwrites the old request!</li>
  815. <li>At time=316 there's an access slot and we've allocated this slot to the CPU
  816. (at time=300). So the pending CPU request gets executed. Though this writes the
  817. data from the new request, the data from the old request is never written!</li>
  818. </ul>
  819. <p>Note that this scenario used a gap of only 64 VDP cycles between access
  820. slots, while there were 72 cycles between the CPU requests. (And the largest
  821. gap between access slots is 70 cycles).</p>
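<p>This scenario is easy to replay in a tiny simulation. The sketch below only
models the CPU side (one-entry buffer, decision 16 cycles before each slot, no
command engine). The event ordering at equal timestamps is an assumption, and
all names are made up for this example.</p>

```python
DECISION_LEAD = 16
REQUEST, DECIDE, EXECUTE = 0, 1, 2   # processing order at equal times


def simulate_cpu_writes(slots, requests):
    """Return the values that actually reach VRAM.

    slots:    access-slot times (VDP cycles)
    requests: (time, value) pairs of CPU write requests
    """
    events = []
    for slot in slots:
        events.append((slot - DECISION_LEAD, DECIDE, slot))
        events.append((slot, EXECUTE, slot))
    for time, value in requests:
        events.append((time, REQUEST, value))

    buffer = None        # the single pending CPU request
    allocated = set()    # slots that were given to the CPU
    written = []
    for _, kind, payload in sorted(events, key=lambda e: e[:2]):
        if kind == REQUEST:
            buffer = payload             # overwrites a pending request
        elif kind == DECIDE:
            if buffer is not None:
                allocated.add(payload)
        elif payload in allocated and buffer is not None:
            written.append(buffer)
            buffer = None
    return written


# The scenario from the list above: only the 2nd value is written.
print(simulate_cpu_writes([252, 316], [(240, "old"), (312, "new")]))
```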
  822. <!--TODO tests on real machine:
  823. only lost in 'sprites on' mode ??
  824. OUT (#99),A -> easy lost
  825. OUT (C),A -> only very occasionally
  826. other OUT patterns always OK
  827. -->
  828. <h2>Command engine timing</h2>
  829. <p>The command engine needs access to VRAM. In the previous section we saw when
  830. the VDP will grant access to this subsystem: when there's an access slot
  831. available and when that slot is not already allocated to CPU access. In this
  832. section we'll see when exactly the command engine will generate VRAM access
  833. requests. Obviously the type (read or write) and the rate of these requests
  834. depend on the type of the VDP command that is executing.</p>
  835. <p>Some commands (like HMMV) only need to write to VRAM. Other commands (like
  836. LMMM) need 2 reads and 1 write per pixel. Many commands execute on a block (a
  837. rectangle) of pixels. Such a block is executed line by line (all pixels within
  838. one horizontal line are processed before moving to the next line). Moving from
  839. one line to the next takes some amount of time (but YMMM is an exception, see
  840. below). This means that e.g. a HMMM command on a 20x4 rectangle executes faster
  841. than on a 4x20 rectangle (same number of pixels in both cases, but a different
  842. rectangle shape).</p>
  843. <p>The following table summarizes the timing for all measured commands:</p>
  844. <table>
  845. <tr><th>Command</th><th>Per pixel</th><th>Per line</th></tr>
  846. <tr><td>HMMV</td><td>48 W </td><td>56</td></tr>
  847. <tr><td>YMMM</td><td>40 R 24 W </td><td>0 </td></tr>
  848. <tr><td>HMMM</td><td>64 R 24 W </td><td>64</td></tr>
  849. <tr><td>LMMV</td><td>72 R 24 W </td><td>64</td></tr>
  850. <tr><td>LMMM</td><td>64 R 32 R 24 W</td><td>64</td></tr>
  851. <tr><td>LINE</td><td>88 R 24 W </td><td>32</td></tr>
  852. </table>
  853. <p><i>TODO timing for PSET, POINT, SRCH</i></p>
  854. <p>I'll explain the notation in this table with an example. Take the LMMM
  855. command:</p>
  856. <ul>
  857. <li>Per pixel the LMMM command needs to:
  858. <ul><li>Read a byte from the source.</li>
  859. <li>Read a byte from the destination</li>
  860. <li>Calculate the result: extract the pixel value from source and
  861. destination, combine the two (possibly with a logical operation), insert
  862. the result in the destination byte. And write the result back to the
  863. destination.</li>
  864. </ul></li>
  865. <li>So per pixel, the LMMM command will generate 3 VRAM accesses: 2 reads
  866. followed by one write. Between these accesses there will be some amount of
  867. time.</li>
  868. <li>For LMMM the table lists '64 R 32 R 24 W'. Let's start at the 1st 'R'
  869. character, this represents the 1st read. Next there's the value 32 and a 2nd
  870. 'R', this means that the 2nd read comes <i>at least</i> 32 cycles after the 1st
  871. read. Then there's '24 W', meaning there are <i>at least</i> 24 cycles between
  872. the 2nd read and the write. And the initial value '64' means that there are
  873. <i>at least</i> 64 cycles between the write and the 1st read for the next
  874. pixel.</li>
  875. <li>When moving from one horizontal line to the next in a block command, there
  876. is some extra delay. For the LMMM command this takes 64 extra cycles. So
  877. 64+64=128 cycles from the last write of a line till the first read of the next
  878. line.</li>
  879. <li>Note that all these values are the <i>optimal</i> timing values. The actual
  880. delay can be longer because there is no access slot available or the slot is
  881. already allocated for CPU access.</li>
  882. </ul>
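<p>With this notation it's easy to estimate a lower bound for the duration of
a block command. In the sketch below the per-line overhead is charged once per
line, including the first line. That's an assumption, because the start-up
cost of a command was not measured. The function name is hypothetical.</p>

```python
# command: (minimal delays before each access of one step, extra cycles per line)
TIMING = {
    "HMMV": ([48], 56),
    "YMMM": ([40, 24], 0),
    "HMMM": ([64, 24], 64),
    "LMMV": ([72, 24], 64),
    "LMMM": ([64, 32, 24], 64),
}


def optimal_cycles(command, steps_per_line, lines):
    """Lower bound on the duration of a block command in VDP cycles.

    steps_per_line counts the 'per pixel' steps in one horizontal line.
    Waiting for access slots is not included.
    """
    per_step, per_line = TIMING[command]
    return lines * (steps_per_line * sum(per_step) + per_line)


print(optimal_cycles("HMMM", 20, 4))   # 4 * (20 * 88 + 64) = 7296
print(optimal_cycles("HMMM", 4, 20))   # 20 * (4 * 88 + 64) = 8320
```

<p>So the same 80 steps take about 14% longer in the tall rectangle than in
the wide one.</p>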
  883. <p>All the commands in the table above are block commands except for 'LINE'.
  884. For the LINE command the meaning of the columns 'Per pixel' and 'Per line' may
  885. not be immediately clear:</p>
  886. <ul>
  887. <li>The VDP uses the <a href="http://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm">
  888. Bresenham algorithm</a> to calculate which pixels are part of the line.</li>
  889. <li>This algorithm takes at each iteration one step in the <i>major</i>
  890. direction. The timings for such an iteration are written in the 'Per pixel'
  891. column for the LINE command.</li>
  892. <li>Depending on the slope of the line, in some iterations the Bresenham
  893. algorithm also takes a step in the <i>minor</i> direction. For the VDP such a
  894. minor step takes some extra time (32 cycles). This is written in the 'Per line'
  895. column of the LINE command. (If you look back at the very beginning of this
  896. text, these major and minor steps explain the general octagonal shapes in the
  897. images. The uneven distribution of the access slots explains the
  898. irregularities.)</li>
  899. </ul>
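<p>As a rough illustration of how the slope influences the speed: the sketch
below counts max(dx,dy) major steps and min(dx,dy) minor steps. That step
count is an assumption (the real VDP may do one step more or less), and the
result again ignores waiting for access slots.</p>

```python
def line_cycles(dx, dy):
    """Optimal duration of a LINE command in VDP cycles.

    dx, dy are the absolute lengths in pixels along both axes.
    """
    major = max(dx, dy)
    minor = min(dx, dy)
    return major * (88 + 24) + minor * 32


print(line_cycles(100, 0))     # 11200, horizontal line
print(line_cycles(100, 100))   # 14400, diagonal line
```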
  900. <p>Note that for the YMMM command there's no extra overhead when going from one
  901. horizontal line to the next. This might be related to the fact that a line of
  902. a YMMM command always starts at the left or right border of the screen.</p>
  903. <p><i>TODO What we didn't measure (also couldn't measure with our test setup)
  904. was the delay between the start of the command (when the CPU sends the command
  905. byte to the VDP) and the moment the command actually starts executing (e.g.
  906. when the first read or write command access is sent to VRAM). It's logical to
  907. assume that the 'per line' overhead also occurs at the start of the command.
  908. But it's possible there is also some additional 'per command' overhead.</i></p>
  909. <h5>speculation on the slowness of the command engine</h5>
  910. <p>When looking at the above table, we see that the command engine is very
  911. slow. For example in a HMMM command there are 24 cycles between reading a byte
  912. and writing that byte to the new location. Or in a LINE command it takes 32
  913. cycles to take a step in the minor direction. I <i>believe</i> there are two
  914. main reasons for this slowness:</p>
  915. <ul>
  916. <li>I believe that internally the VDP command engine subsystem runs at 1/8 of
  917. the master VDP clock frequency. This matches the observation that all values in
  918. the above table are multiples of 8. It also explains why the access slots are
  919. always at least 8 cycles apart (while a VRAM access only requires 6
  920. cycles).</li>
  921. <li>The command engine gets stalled whenever there's a pending command engine
  922. VRAM request. A VRAM request (CPU or command) only gets handled after it's been
  923. pending for at least 16 cycles. So combined this means the command engine gets
  924. stalled for 16 cycles on every VRAM request it makes. (Note that especially
  925. this point is just speculation).</li>
  926. </ul>
  927. <p>Taking these two points into account, the above table can be rewritten
  928. as:</p>
  929. <table>
  930. <tr><th>Command</th><th>Per pixel</th><th>Per line</th></tr>
  931. <tr><td>HMMV</td><td>(4&times;8+16) W </td><td>7&times;8</td></tr>
  932. <tr><td>YMMM</td><td>(3&times;8+16) R (1&times;8+16) W </td><td>0&times;8</td></tr>
  933. <tr><td>HMMM</td><td>(6&times;8+16) R (1&times;8+16) W </td><td>8&times;8</td></tr>
  934. <tr><td>LMMV</td><td>(7&times;8+16) R (1&times;8+16) W </td><td>8&times;8</td></tr>
  935. <tr><td>LMMM</td><td>(6&times;8+16) R (2&times;8+16) R (1&times;8+16) W</td><td>8&times;8</td></tr>
  936. <tr><td>LINE</td><td>(9&times;8+16) R (1&times;8+16) W </td><td>4&times;8</td></tr>
  937. </table>
  938. <p>When you look at the data in this way, the numbers already look more
  939. reasonable.</p>
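<p>The rewrite is purely mechanical, as this small check shows. The two
constants are exactly the two speculative points from above, so they are
assumptions and not measured facts.</p>

```python
MEASURED = {
    "HMMV": [48], "YMMM": [40, 24], "HMMM": [64, 24],
    "LMMV": [72, 24], "LMMM": [64, 32, 24], "LINE": [88, 24],
}
STALL = 16        # assumed stall per VRAM request
ENGINE_TICK = 8   # assumed command engine clock divider


def engine_ticks(delay):
    """Split a measured delay into engine ticks plus the fixed stall."""
    ticks, rest = divmod(delay - STALL, ENGINE_TICK)
    assert rest == 0, "delay does not fit the n*8+16 pattern"
    return ticks


for name, delays in MEASURED.items():
    print(name, [engine_ticks(d) for d in delays])
```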
  940. <h2>Next steps</h2>
  941. <p>All the information above <i>should</i> already be enough to significantly
  942. improve the accuracy of MSX emulators. In the coming months I plan to work on
  943. improving openMSX.</p>
  944. <ul>
  945. <li>First I'd like to improve the CPU-VRAM access stuff, so that e.g. too fast
  946. CPU accesses actually result in dropped requests.</li>
  947. <li>Next step is the timing of the VDP commands. This depends on the previous
  948. step because e.g. CPU access slows down command execution.</li>
  949. <li>A still later step could be to fetch the data required for display
  950. rendering (bitmap, sprites) at more accurate moments in time. This is lower
  951. priority because:
  952. <ul>
  953. <li>These effects are limited to the visual output. Errors can't influence
  954. the 'state' of the MSX machine. So it's impossible to write an MSX program
  955. that checks (= makes a decision based on) the rendering accuracy. (OTOH it is
  956. possible to check for dropped CPU requests or the speed of the
  957. commands).</li>
  958. <li>I don't know of any <i>existing</i> MSX software where this will make a
  959. noticeable difference. Maybe an idea for a <i>new</i> test is to vary the
  960. y-coordinates of the sprite(s) within one display line. Thus causing the
  961. sprite engine to use two different values in the two phases of sprite
  962. rendering.</li>
  963. <li>Hmm &hellip; or maybe there is an existing program: the <a
  964. href="http://users.skynet.be/bk263586/verti.zip">verti</a> demo. On current
  965. emulators the vertical bars are all equally wide. But on a real MSX there
  966. are wider and narrower bars, but all are multiples of 8 pixels.</li>
  967. </ul></li>
  968. </ul>
  969. <p>I'm afraid this will all still take quite a bit of work.</p>
  970. <p>Anyway, I hope the information in this document is useful, for (other) MSX
  971. emulator developers or for MSX developers in general.</p>
  972. <hr/>
  973. <p align="right" style="font-size:smaller;">
  974. 2013/03/30, Wouter Vermaelen
  975. </p>
  976. </body>
  977. </html>