- <?xml version="1.0" encoding="iso-8859-1"?>
- <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
- "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
- <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
- <head>
- <link title="Purple" rel="stylesheet" href="manual-purple.css" type="text/css" />
- <title>V9938 VRAM timings</title>
- </head>
- <body>
- <h1>V9938 VRAM timings</h1>
- Measurements done by: Joost Yervante Damad, Alex Wulms, Wouter Vermaelen<br/>
- Analysis done by: Wouter Vermaelen<br/>
- Text written by: Wouter Vermaelen<br/>
- with help from the rest of the openMSX team.
- <h2>Introduction</h2>
- <p>This text describes in detail how, when and why the V9938 reads from and
- writes to VRAM in bitmap screen modes (screen 5, 6, 7 and 8). VRAM is accessed
- for bitmap and sprite rendering but also for VDP command execution or by
- CPU VRAM read/write requests.</p>
- <h5 id="motivation">motivation</h5>
- <p>Modern MSX emulators like blueMSX and openMSX are already fairly accurate.
- And for most practical applications like games or even demos they are already
- <i>good enough</i>. Though there are cases where you can still clearly see the
- difference between a real and an emulated MSX machine.</p>
- <p>For example the following pictures show the speed of the LINE command for
- different slopes of the line. The first two pictures are generated on different
- MSX emulators, the last picture is from a real MSX. Without going into all the
- details: lines are drawn from the center of the image to each point at the
- border. While the LINE commands are executing, the command color register is
- rapidly changed (at a fixed rate). So faster varying colors indicate a slower
- executing command.</p>
- <img src="line-speed-old-8.png" width="300">
- <img src="line-speed-emu-8.png" width="300">
- <img src="line-speed-real-8.png" width="300">
- <p>From left to right these pictures show:</p>
- <ul>
- <li>(left) The output of MSX emulators that use Alex Wulms' original command
- engine emulation core. All(?) modern MSX emulators use this core, including
- blueMSX, OCM and (older versions of) openMSX. The output is a set of squares,
- which indicates that the speed of a LINE command doesn't depend on the slope
- of the line.</li>
- <li>(center) The output of openMSX version 0.9.1. Here the command engine was
- tweaked to take the slope of the line into account, so the test now generates
- clean octagons.</li>
- <li>(right) The output of a real MSX. The overall shape is also an octagon,
- but there are a lot of irregularities. These irregularities are reproduced
- when running the test multiple times. So they must be a <i>real</i>
- effect, and not some kind of measurement noise.</li>
- </ul>
- <p>This test is derived from NYRIKKI's test program described in this (long) <a
- href="http://www.msx.org/forum/msx-talk/software-and-gaming/line">MRC forum
- thread</a>. This particular test is not that important by itself. But because it
- generates a nice graphical output it makes it possible to show the problem
- without going into too many technical details (yet).</p>
- <p>In most MSX applications these LINE speed differences, or small command
- speed differences in general, likely won't cause any problems. (Except of
- course in programs like this that specifically test for it.) But it would still
- be nice to improve the emulators.</p>
- <h5>measurements</h5>
- <p>To be able to improve openMSX further we need a good understanding
- of what exactly causes these irregularities. It would be very hard
- to figure this out using only MSX test programs. It might be easier to
- look at a deeper hardware level, more specifically at the communication
- between the VDP (V9938) and the VRAM chips. This should allow us to see when
- exactly the VDP reads or writes which VRAM addresses.</p>
- <p>So at the 2013 MSX fair in Nijmegen we (some members of the openMSX team and
- I) connected a logic analyzer to the VDP-VRAM bus in a Philips NMS8250 machine.
- The following picture gives an impression of our measurement setup.</p>
- <img src="v9938-probes.jpg">
- <p>Next we ran some MSX software that puts the VDP in a certain display mode,
- enables/disables screen and/or sprite rendering, and optionally executes
- VDP commands and/or accesses VRAM via the CPU. While such a test was running
- we could capture (small chunks of) the communication between the VDP and the
- VRAM. This gives output (waveforms) like in the following image.</p>
- <img src="gtkwave.png">
- <p>It's not so easy to go from this waveform data to meaningful results about
- how the VDP operates. This text won't discuss that analysis process either. If
- you're interested in the analysis or in the raw measurement data, you can find
- some more details in the <a
- href="https://sourceforge.net/mailarchive/message.php?msg_id=30375119">
- openmsx-devel mailing list archive</a>. The rest of this text will only discuss
- the final results of the analysis.</p>
- <p>Because one of the primary goals was to improve the command engine emulation
- in openMSX, the measurements mostly focused on the bitmap screen modes (a V9938
- doesn't allow commands in non-bitmap modes). So the following sections will
- only occasionally mention text or character modes. Because we used a V9938 we
- also couldn't test the YJK modes (screen 11 and 12). But it's highly likely
- that, from a VRAM access point of view, these modes behave the same as screen 8
- (or as we'll see later, the same as all the bitmap screen modes).</p>
- <h2>VRAM accesses</h2>
- <p>Before presenting the actual results of (the analysis of) the measurements,
- this section first explains the general workings of the VDP-VRAM communication.
- This is mostly a description of the functional interface of DRAM chips, but
- then specifically applied to the VDP case. Feel free to skim (or even skip)
- this section.</p>
- <p>Like most RAM chips in MSX machines, the VDP uses DRAM chips for the video
- RAM. There are many variations of DRAM chips; you can find a whole lot of
- information on <a
- href="http://en.wikipedia.org/wiki/Dynamic_random-access_memory">
- Wikipedia</a>. Most of the info in this section can also be found in the 'V9938
- Technical Data Book'. Often that book goes into a lot more detail than this
- text. Here I highlight (and simplify) the aspects that are relevant to
- understand the later sections in this text.</p>
- <h3>Connection between VDP and VRAM</h3>
- <p>Between the VDP and the VRAM chips there is an 8-bit data bus. This means
- that a single read or write access will transfer 1 byte of data.</p>
- <p>There is also an 8-bit address bus. Obviously 8 bits are not enough to
- address the full 128kB or even 192kB VRAM address space. Instead the address is
- transferred in two steps. First the row-address is transferred followed by the
- column address. (Usually) the row address corresponds to bits 15-8 of the full
- address, while the column address corresponds to bits 7-0.</p>
- <p>Though this still only allows addressing up to 64kB. To get to 128kB, there
- are 2 separate column-address-select signals (named CAS0 and CAS1). These two
- signals select one of the two available 64kB banks. So combined this
- gives 128kB. (Usually) you can interpret CAS0/CAS1 as bit 16 of the
- address.</p>
- <p>In case of an MSX machine with 192kB VRAM there is a third signal:
- CASX. To simplify the rest of this text, this possibility is ignored; it
- doesn't fundamentally change anything anyway.</p>
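- <p>To make this address decomposition concrete, here's a small sketch (in C++,
- purely illustrative and ignoring the 192kB CASX case) of how a 17-bit physical
- address maps onto the row/column/CAS parts described above:</p>
- <pre>
- #include <cstdint>
- #include <cstdio>
- 
- // Decompose a 17-bit physical VRAM address into the parts that appear
- // on the VDP-VRAM bus (hypothetical helper, for illustration only).
- struct DramAddress {
-     uint8_t row;    // bits 15-8, put on the address bus first (with RAS)
-     uint8_t column; // bits 7-0, put on the address bus second (with CAS)
-     int     bank;   // bit 16: 0 -> CAS0 is used, 1 -> CAS1 is used
- };
- 
- DramAddress decompose(uint32_t physical) {
-     DramAddress a;
-     a.row    = (physical >> 8) & 0xFF;
-     a.column = physical & 0xFF;
-     a.bank   = (physical >> 16) & 1;
-     return a;
- }
- 
- int main() {
-     DramAddress a = decompose(0x1ABCD);
-     printf("row=0x%02X column=0x%02X bank=CAS%d\n", a.row, a.column, a.bank);
-     // prints: row=0xAB column=0xCD bank=CAS1
- }
- </pre>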
- <p>Next to the data and address bus there are also some control signals. I've
- already mentioned the CAS signals (used to select the column address). There's
- a similar RAS (row address select) signal. And finally there's a R/W
- (read/write) signal that indicates whether the access is a read or a write.</p>
- <h3>Timing of the VDP-VRAM signals</h3>
- <p>When the VDP wants to read or write a byte from/to VRAM it has to
- <i>wiggle</i> the signals that connect the VDP to the VRAM in a certain way.
- This section describes the timing of those <i>wiggles</i>.</p>
- <p>The timing description in this section is different from the description in
- the 'VDP Technical Data Book'. The Data Book has the <i>real</i> timings,
- including all the subtle details for how to build an actual working system.
- This text has all the timings rounded to integer multiples of VDP clock cycles.
- IMHO these simplified timings make the VDP-VRAM connection easier to understand
- from a <i>functional</i> point of view.</p>
- <h4>A single write</h4>
- <p>To write a single byte to VRAM, follow this schema:</p>
- <img src="dram-write.png">
- <ul>
- <li>Put the row address on the address bus and activate the RAS signal. Most
- signals are active-low, so activating a signal means making it low.</li>
- <li>After one cycle (remember these are <i>functional</i> timings, especially
- in this step the <i>real</i> timing rules are more complex):
- <ul>
- <li>Activate (one of) the CAS signals.</li>
- <li>Put the column address on the address bus.</li>
- <li>Set the R/W signal. A low signal means write.</li>
- <li>Put the to-be-written data on the data bus.</li>
- </ul>
- </li>
- <li>After two cycles the CAS signal can be deactivated. At this point the
- value of the R/W signal doesn't matter anymore (it may have any value). But
- measurements show that the VDP restores the R/W signal to a high value at this
- point.</li>
- <li>Again one cycle later, the RAS signal can be deactivated.</li>
- <li>The RAS signal has to remain inactive for at least two cycles.</li>
- </ul>
- <p>So a full write cycle takes 6 VDP clock cycles.</p>
- <h4>A single read</h4>
- <p>Reads are very similar to writes; they follow this schema:</p>
- <img src="dram-read.png">
- <ul>
- <li>Put the row address on the address bus and activate the RAS signal.</li>
- <li>After one cycle:
- <ul>
- <li>Activate (one of) the CAS signals.</li>
- <li>Put the column address on the address bus.</li>
- <li>Set the R/W signal: a high value indicates a read. The VDP keeps this
- signal high between VRAM transactions. So in measurements you don't
- actually see this signal changing for reads.</li>
- </ul>
- </li>
- <li>After two cycles the read data is available on the data bus. The CAS signal
- can be deactivated now.</li>
- <li>After one cycle the RAS signal can be deactivated.</li>
- <li>Wait at least two cycles before starting the next VRAM transaction.</li>
- </ul>
- <p>So this is very similar to a write: the address selection is identical.
- Obviously the R/W signal and the direction (and timing) of the information on
- the data bus are different. And just like a write, a full read cycle also takes
- 6 VDP cycles.</p>
- <h4>Page mode reads (burst read)</h4>
- <p>Often the VDP needs to read data from successive VRAM addresses. If those
- addresses all share the same row address, then there's a faster way to do
- this than performing multiple single reads as in the schema above.</p>
- <img src="dram-read-burst.png">
- <ul>
- <li>Put the (common) row address on the address bus and activate the RAS
- signal.</li>
- <li>After one cycle:
- <ul>
- <li>Put the first column address on the address bus.</li>
- <li>Activate (one of) the CAS signals.</li>
- <li>Set the R/W signal (though the VDP already has this signal in the
- correct state).</li>
- </ul>
- </li>
- <li>After two cycles read the data from the data bus, and deactivate CAS.</li>
- <li>Two cycles later, put the 2nd column address on the address bus and
- re-activate (one of) the CAS signals.</li>
- <li>Again two cycles later read the data and deactivate CAS.</li>
- <li>It's possible to repeat this process for a 3rd, 4th, … byte.</li>
- <li>After one cycle deactivate the RAS signal.</li>
- <li>Wait at least two cycles before starting the next VRAM transaction.</li>
- </ul>
- <p>The above diagram shows a burst length of only two bytes. It's also possible
- to have longer bursts. The VDP uses lengths up to 4 bytes (or 8, see the next
- section).</p>
- <p>In this example reading two bytes takes 10 VDP cycles. Doing two single
- reads would take 2×6=12 cycles. When doing longer bursts, the savings
- become bigger. Doing a burst of N reads takes 2+4×N cycles compared to
- 6×N cycles for a sequence of single reads.</p>
- <p>In principle it's also possible to do burst-writes. Though the VDP doesn't
- use them (it never needs to write more than 1 byte in a sequence).</p>
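- <p>As a quick sanity check on these cycle counts, the two formulas can be put
- side by side in a tiny sketch:</p>
- <pre>
- #include <cstdio>
- 
- // Cycle costs from the text: a burst of N reads takes 2 + 4*N VDP
- // cycles, a sequence of N single reads takes 6*N cycles.
- int burstCycles(int n)  { return 2 + 4 * n; }
- int singleCycles(int n) { return 6 * n; }
- 
- int main() {
-     for (int n : {1, 2, 4, 8}) {
-         printf("N=%d: burst=%2d single=%2d saved=%2d\n", n,
-                burstCycles(n), singleCycles(n),
-                singleCycles(n) - burstCycles(n));
-     }
-     // N=2 gives 10 vs 12 cycles, matching the example above.
- }
- </pre>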
- <h4>Multi-bank page mode reads</h4>
- <p>Burst reads are already faster than single reads. But to render screen
- 7 and 8 images, burst reads are still not fast enough. In these two
- screen modes the VDP reads from two banks in parallel to get the required
- data from VRAM fast enough.</p>
- <img src="dram-read-burst-2banks.png">
- <p>There are 2 banks of 64kB. These two banks share the RAS control signal, but
- they each have their own CAS signal. The address and data signals are also
- shared. This makes it possible to read from both banks <i>almost</i> in parallel:</p>
- <ul>
- <li>In burst mode it was possible to read one byte every 4 VDP cycles. For
- this the CAS signal had to be alternately two cycles high and two cycles
- low. The address and data buses are only used during 1 of these 4 cycles.</li>
- <li>Multi-bank mode uses both the CAS0 and the CAS1 signals. CAS0 is high when
- CAS1 is low and vice versa. When looking at a single bank (which only sees one
- of the two CAS signals) this looks like a normal burst read. The only
- difference is that the RAS signal remains active 2 cycles longer at the start
- or at the end than strictly needed. But that's perfectly fine.</li>
- </ul>
- <p>So this schema gives (almost) double the VRAM bandwidth. The only
- requirement is that you alternately read from bank0 and bank1. At first sight
- this requirement seems so strict that it is almost never possible to make use of
- this banked reading mode: to render screen 7 or 8 you indeed need to read many
- successive VRAM locations, not locations that alternately come from the
- 1st and 2nd 64kB bank.</p>
- <p>To make it possible to use banked reading mode, the VDP interleaves the two
- banks. This introduces the concept of <i>logical</i> and <i>physical</i>
- addresses:</p>
- <ul>
- <li><i>Logical</i> addresses are the addresses that a programmer of the VDP
- normally uses. For example the bitmap data for screen 8 (possibly) starts at
- address 0x00000 and runs up to address 0x0D400.</li>
- <li><i>Physical</i> addresses are the addresses that actually appear on the
- signals between the VDP and the VRAM. So the combination of the row and column
- address and the CAS0 or CAS1 bank-selection.</li>
- </ul>
- <p>In most screen modes the logical and the physical addresses are the same.
- But in screen 7 and 8 there's a transformation between the two:</p>
- <p align="center">physical = (logical >> 1) | (logical << 16)</p>
- <p>So the 17-bit logical address is rotated one bit to the right to get the
- physical address. The effect of this transformation is that all even logical
- addresses end up in physical bank0 while all odd logical addresses end up in
- physical bank1. So now when you read from successive logical addresses you read
- from alternating physical banks and thus it is possible to use banked read
- mode.</p>
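- <p>In code this transformation, and its inverse, could look like the following
- sketch (the '& 0x1FFFF' masks keep the result within 17 bits):</p>
- <pre>
- #include <cstdint>
- #include <cassert>
- 
- // Screen 7/8: rotate the 17-bit logical address one bit to the right.
- uint32_t logicalToPhysical(uint32_t logical) {
-     return ((logical >> 1) | (logical << 16)) & 0x1FFFF;
- }
- 
- // Inverse: rotate one bit to the left.
- uint32_t physicalToLogical(uint32_t physical) {
-     return ((physical << 1) | (physical >> 16)) & 0x1FFFF;
- }
- 
- int main() {
-     // Even logical addresses end up in bank0 (bit 16 == 0),
-     // odd logical addresses end up in bank1 (bit 16 == 1).
-     assert((logicalToPhysical(0x00000) >> 16) == 0);
-     assert((logicalToPhysical(0x00001) >> 16) == 1);
-     for (uint32_t a = 0; a < 0x20000; ++a) {
-         assert(physicalToLogical(logicalToPhysical(a)) == a);
-     }
- }
- </pre>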
- <p>Usually a VDP programmer doesn't need to be aware of this interleaving. But
- because interleaving is only enabled in screen 7 and 8, this effect can become
- visible when switching between screen modes. <i>An alternative design decision
- could have been to always interleave the addresses. I guess the V9938 designers
- didn't make this choice to allow for single chip configurations in case only
- 64kB VRAM is connected.</i></p>
- <p>The diagram above shows a read of 2×2 bytes; in reality the VDP only
- uses this schema to read 2×4 bytes. In principle it's also possible to
- write to two banks in parallel, but the VDP never needs this ability.</p>
- <h4>Refresh</h4>
- <p>DRAM chips need to be refreshed regularly. The VDP is responsible for doing
- this (there are DRAM chips that handle refresh internally, but the VDP doesn't
- use such chips). Many DRAM chips allow a refresh by only activating and
- deactivating the RAS signal, so without actually performing a read or write in
- between. When extrapolating from the above timing diagrams, this would only
- cost 4 cycles. Though the VDP doesn't actually use this RAS-without-CAS refresh
- mode. Instead it performs a regular read access which takes 6 cycles.</p>
- <p>Each time a read (or write) is performed on a certain row of a DRAM chip,
- that whole row is refreshed. So to refresh the whole RAM, the VDP has to
- periodically read (any column address of) each of the 256 possible rows.</p>
- <h2>Distribution of VRAM accesses</h2>
- <p>The previous section described the details of isolated (single or burst)
- VRAM accesses. This section will look at such accesses as indivisible units and
- examine how these units are grouped together and spread in time to perform all
- the VRAM related stuff the VDP has to do.</p>
- <p>The VDP can perform VRAM reads/writes for the following reasons:</p>
- <ul>
- <li>Refresh</li>
- <li>Bitmap rendering</li>
- <li>Sprite rendering</li>
- <li>CPU read/write</li>
- <li>Command read/write</li>
- </ul>
- <p>Note that next to bitmap modes, the VDP also has character and text modes. I
- didn't investigate those modes yet, so this text mostly ignores them.</p>
- <p>The rest of this text explains when in time (at which specific VDP
- cycles) accesses of each type are executed.</p>
- <p>We'll first focus on refresh and bitmap/sprite rendering. Later we'll add
- CPU and command engine. The reason for this split is that the first group has a
- fairly simple pattern: refreshes always occur at fixed moments in time.
- Enabling bitmap rendering only adds additional VRAM reads but has no influence
- on the timing of the refreshes. Similarly enabling sprite rendering adds even
- more reads without influencing the bitmap or refresh reads. CPU and command
- accesses on the other hand cannot simply be added to this schema without
- influencing each other. So those are postponed till a later section.</p>
- <h3>Horizontal line timing</h3>
- <p>The VDP renders a full frame line-by-line. For each line the VDP (possibly)
- has to read some bitmap and sprite data from VRAM. It's logical to assume (and
- the measurements confirm this) that the data fetches within one line occur at
- the same relative positions as the corresponding data fetches of another line.
- So if we can figure out the details for one line, we can extrapolate this to a
- whole frame. Similarly we can assume that different frames will have similar
- relative timings. So really all we need to know is the timing of one line.</p>
- <p><i>TODO: odd and even frames in interlace mode probably do have timing
- differences. Still need to investigate this.</i>
- </p>
- <p>Let's thus first look at what we already know about a horizontal display
- line. The 'V9938 Technical Data Book' contains the following timing info about
- (non-text mode) display lines.</p>
- <table>
- <tr><th>Description </th><th>Cycles </th><th>Length</th></tr>
- <tr><td>Synchronize signal</td><td>[0 - 100)</td><td> 100</td></tr>
- <tr><td>Left erase time </td><td>[100 - 202)</td><td> 102</td></tr>
- <tr><td>Left border </td><td>[202 - 258)</td><td> 56</td></tr>
- <tr><td>Display cycle </td><td>[258 - 1282)</td><td>1024</td></tr>
- <tr><td>Right border </td><td>[1282 - 1341)</td><td> 59</td></tr>
- <tr><td>Right erase time </td><td>[1341 - 1368)</td><td> 27</td></tr>
- <tr><td>Total </td><td>[0 - 1368)</td><td>1368</td></tr>
- </table>
- <p>So one display line is divided into 6 periods. The total length of one line is
- 1368 cycles. The previous section showed how long individual VRAM accesses
- take. The next sections will figure out how all the required accesses fit in
- this per-line budget of 1368 cycles.</p>
- <p>A note about the timing notation: in this text all the timing numbers are
- VDP cycles relative within one line. For example in the table above the display
- period starts at cycle 258. The display period of the next line will start at
- cycle 258+1368=1626, the next at cycle 2994 and so on. To make the values
- smaller, all cycle numbers will be folded to the interval [0, 1368). The
- starting point (cycle=0) has no special meaning. We could have taken any other
- point and called that the starting point. (For the current choice, the external
- VDP HSYNC pin gets activated at cycle=0, so it was a convenient point to
- synchronize the measurements on).</p>
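- <p>Expressed as code, going from an absolute cycle counter to this per-line
- notation is just a modulo operation (a trivial sketch):</p>
- <pre>
- #include <cassert>
- 
- constexpr int CYCLES_PER_LINE = 1368; // one display line (S1,S0 = 0,0)
- 
- // Fold an absolute cycle count into the interval [0, 1368); cycle 0 is
- // the point where the external HSYNC pin gets activated.
- int cycleInLine(long long absoluteCycle) {
-     return int(absoluteCycle % CYCLES_PER_LINE);
- }
- 
- int main() {
-     // The display period starts at 258, 258+1368=1626, 2994, ...
-     assert(cycleInLine( 258) == 258);
-     assert(cycleInLine(1626) == 258);
-     assert(cycleInLine(2994) == 258);
- }
- </pre>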
- <p><i>TODO horizontal set-adjust: The numbers in the above table are valid for
- horizontal set-adjust=0. Similarly all our measurements were done with
- set-adjust=0. Using different set-adjust values will make the left/right border
- bigger/smaller. I still need to figure out which timing values of the next
- sections are changed by this. E.g. are all the VRAM accesses in a line shifted
- as a whole, or are just the bitmap data fetches shifted and remain (some) other
- accesses fixed?</i></p>
- <p><i>TODO bits S1,S0 in VDP register R#9: The above table is valid for
- S1,S0=0,0. In other cases the length of a display line is only 1365 cycles
- instead of 1368. The rest of this text assumes a line length of 1368 cycles. I
- still need to figure out where exactly in the line this difference of 3 cycles
- is located.</i></p>
- <!-- numbers for 1365 cycles
- [0 - 100) (len= 100)
- [100 - 202) (len= 102)
- [202 - 258) (len= 56)
- [258 -1282) (len=1024)
- [1282-1339) (len= 57)
- [1339-1365) (len= 26)-->
- <h3>Sneak preview</h3>
- <p>The following image graphically summarizes the results of the rest of this
- section. It is a very wide image, much larger than what can be shown
- inline in this text (click to see the full image). It's highly recommended to
- open this image in an external image viewer that makes it easy to zoom in and
- out and to scroll the image.</p>
- <a href="vdp-timing.png">
- <img src="vdp-timing.png" width="1200">
- </a>
- <p>Here's an overview of the most important items in this image:</p>
- <ul>
- <li>Horizontally there are 6 regions in the image (each has a slightly
- different background color). These regions correspond to the 'synchronize',
- 'left/right erase', 'left/right border' and 'display' regions in the table from
- the previous section.</li>
- <li>Horizontally you also see a timeline going from 0 to 1368 cycles. This
- corresponds to one full display line.</li>
- <li>Vertically there are 3 big groups: 'screen off', 'no sprites' and
- 'sprites'; see the next section for why these groups are important.</li>
- <li>Within one vertical group there is one color-coded band and a set of
- RAS/CAS signals. Usually there's one RAS and 2 CAS signals, but the 'sprites
- off' group has 2 pairs of CAS signals. For the 'sprites off' and 'sprites on'
- groups there are subtle differences in the CAS0/1 signals between screen modes
- 5/6 and 7/8. But to save space these differences are only shown once.</li>
- <li>The colors in the color-coded band have the following meaning:
- <ul>
- <li>red: refresh read</li>
- <li>green: bitmap data read (dark-green is dummy bitmap read)</li>
- <li>yellow: sprite data read (brown is dummy sprite read)</li>
- <li>blue: potential CPU or command engine read or write</li>
- <li>dark-grey: dummy read</li>
- <li>light-grey: idle (no read or write)</li>
- </ul>
- </li>
- <li>The CAS signals are drawn as either a full or a stippled line. Full means
- the signal is definitely high/low at this point. Stippled means it can be high
- or low, depending on whether there was a CPU request or a VDP command executing
- at that point. Note that the RAS signal always toggles, even if there is no CPU
- or command access required.</li>
- </ul>
- <p>The next sections will go into a lot more detail. It's probably a good idea
- to have this (zoomed in) image open while reading those later sections.</p>
- <h3>3 operating modes</h3>
- <p>When looking from a VDP-VRAM interaction point of view, the VDP can operate
- in 3 modes:</p>
- <ul>
- <li>Screen disabled (sprite status doesn't matter). This is the same as in the
- vertical border.</li>
- <li>Screen enabled, sprites disabled.</li>
- <li>Screen enabled, sprites enabled.</li>
- </ul>
- <p>Note that the (bitmap) screen mode (screen 5, 6, 7, or 8) largely doesn't
- matter for the VRAM access pattern.</p>
- <p><i>TODO sprite fetching happens 1 line earlier than displaying those sprites
- (see below for details). This means that the last line of the vertical border
- before the display area likely uses a 'mixed mode' where it doesn't yet fetch
- bitmap data but it does already fetch sprite data. I didn't specifically
- measure this condition, so I can't really tell anything about this mixed mode.
- (One possibility is that it's just like a normal display line, but the fetched
- bitmap data is ignored.) Similarly the last line of the display area doesn't
- strictly need to fetch new sprite data.</i></p>
- <p>We'll now look at these 3 modes in more detail.</p>
- <h4>Screen disabled</h4>
- <h5>refresh</h5>
- <p>Screen rendering can be disabled via bit 6 in VDP register R#1. There's also
- no screen rendering when the VDP is showing a vertical border line. From a
- VRAM-access point of view both cases are identical.</p>
- <p>In this mode the VDP doesn't need to fetch any data from VRAM for
- rendering. It only needs to refresh the VRAM. As already mentioned earlier,
- the VDP uses a regular read to refresh the RAM, so this takes 6 cycles.</p>
- <p>The VDP executes 8 refresh actions per display line. They start at the
- following moments in time (the red blocks in the big timing diagram):</p>
- <table>
- <tr><td>284</td><td>412</td><td>540</td><td>668</td>
- <td>796</td><td>924</td><td>1052</td><td>1180</td></tr>
- </table>
- <h5>refresh-addresses</h5>
- <p><i>I didn't investigate this refresh-address-stuff in detail because it
- doesn't matter for emulation accuracy</i>.</p>
- <p>The logical addresses used for refresh reads seem to be of the form:</p>
- <p align="center">N×0x10101 | 0x3F</p>
- <p>Where N increases on each refresh action. So each refresh the row address
- increases by one, and alternately the CAS0 or the CAS1 signal gets used (the
- column address doesn't matter for refresh). Note that this formula is for the
- logical address; in screen 7/8 this still gets transformed to a physical
- address. So in screen 7/8 a refresh action always uses the CAS1 signal. That
- means that in screen 7/8 the DRAM chip(s) of bank0 actually do get refreshed
- using the RAS-without-CAS refresh mode.</p>
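- <p>Based on this observed pattern, the refresh addresses can be generated as in
- the following sketch; the explicit 17-bit masking is my own assumption:</p>
- <pre>
- #include <cstdint>
- #include <cstdio>
- 
- // Logical refresh address for the N'th refresh action, following the
- // observed pattern N*0x10101 | 0x3F (masked to 17 bits, an assumption).
- uint32_t refreshAddress(uint32_t n) {
-     return ((n * 0x10101) | 0x3F) & 0x1FFFF;
- }
- 
- int main() {
-     for (uint32_t n = 0; n < 4; ++n) {
-         uint32_t a = refreshAddress(n);
-         printf("N=%u address=0x%05X row=0x%02X cas=%u\n",
-                n, a, (a >> 8) & 0xFF, a >> 16);
-     }
-     // The row (bits 15-8) increases by one on each refresh, and bit 16
-     // toggles, so CAS0 and CAS1 each get used every other refresh.
- }
- </pre>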
- <p>The refresh timings are the same for all non-text screen modes. But in text
- modes there are only 7 refreshes per line and they are also located at
- different relative positions than in the table above. I didn't investigate
- this further.</p>
- <h5>dummy reads</h5>
- <p>Next to the refresh reads, in 'screen disabled' mode, the VDP still performs
- 4 reads of address 0x1FFFF. At the following moments (marked with dark-grey
- blocks on the timeline):</p>
- <table><tr><td>1236</td><td>1244</td><td>1252</td><td>1260</td></tr></table>
- <p>I can't imagine any use for these reads, so let's call them dummy reads. In all
- our measurements these dummy reads always re-occur at the same positions, so
- it's not a fluke in (only one of) the measurements.</p>
- <p>The refresh actions remain exactly the same in the other two modes. But
- these dummy reads are different in the mode 'sprites off' or disappear
- completely in the mode 'sprites on'. (This confirms that nothing 'useful' is
- done by these dummy reads).</p>
- <p>Anyway for emulation we can mostly ignore these dummy reads. It only matters
- that at these moments in time there cannot be CPU or command VRAM reads or
- writes.</p>
- <h4>screen enabled, sprites disabled</h4>
- <h5>refresh and dummy reads</h5>
- <p>Refresh works exactly the same as in the previous mode. The dummy reads
- are a bit different. Now there are only 3 dummy reads at slightly different
- moments (also shown in dark-grey):</p>
- <table><tr><td>1242</td><td>1250</td><td>1258</td></tr></table>
- <p>The first of these 3 reads is always from address 0x1FFFF. The second and
- third dummy read have a pattern in their address. For example:</p>
- <table>
- <tr><th>1st</th><th>2nd</th><th>3rd</th></tr>
- <tr><td>0x1FFFF</td><td>0x03B80</td><td>0x03B82</td></tr>
- <tr><td>0x1FFFF</td><td>0x03C00</td><td>0x03C02</td></tr>
- <tr><td>0x1FFFF</td><td>0x03C80</td><td>0x03C82</td></tr>
- <tr><td>0x1FFFF</td><td>0x03D00</td><td>0x03D02</td></tr>
- </table>
- <p>This table shows the addresses of the 3 dummy reads for 4 successive display
- lines (this is data from an actual measurement, unfortunately our equipment
- could only buffer up to 4 lines). The lower 7 bits of the address of the 2nd
- read always seem to be zero. The address of the 3rd read is the same as for the
- 2nd read except that bit 1 is set to 1. When going from one line to the next,
- the address increases by 0x80. Our measurements captured 10 independent sets of
- 4 successive lines. Each time bits 16-15 were zero (bits 14-7 do take different
- values). This could be a coincidence, or it could be that these bits really
- aren't included in the counter. Note that again these are logical addresses (so
- still transformed for screen 7/8). I didn't investigate these dummy reads in
- more detail because they mostly don't matter for emulation.</p>
- <h5>bitmap reads</h5>
- <p>The major change compared to the previous mode is that now the VDP needs to
- fetch extra data for the bitmap rendering. These fetches happen in 32 blocks of
- 4 bytes (screen 5/6) or 8 bytes (screen 7/8). The fetches within one block
- happen in burst mode. This means that one block takes 18 cycles (screen 5/6) or
- 20 cycles (screen 7/8). Though later we'll see that the two spare cycles for
- screen 5/6 are not used for anything else, so for simplicity we can say that in
- all bitmap modes a bitmap-fetch-block takes 20 cycles. This is even clearer if
- you look at the RAS signal: this signal follows the exact same pattern in all
- (bitmap) screen modes, so in screen 5/6 it remains active for two cycles longer
- than strictly necessary.</p>
- <p>Actually before these 32 blocks there's one extra dummy block. This block
- has the same timing as the other blocks, but it always reads address 0x1FFFF.
- From an emulator point of view, these dummy reads don't matter; it only matters
- that at those moments no other VRAM accesses can occur.</p>
- <p>The start of these 1+32 blocks are located at these moments in time (these
- are the green blocks in the big timing diagram):</p>
- <table>
- <tr><td>(195)</td><td> 227</td><td> 259</td><td> 291</td><td> 323</td>
- <td> 355</td><td> 387</td><td> 419</td><td> 451</td></tr>
- <tr><td> </td><td> 483</td><td> 515</td><td> 547</td><td> 579</td>
- <td> 611</td><td> 643</td><td> 675</td><td> 707</td></tr>
- <tr><td> </td><td> 739</td><td> 771</td><td> 803</td><td> 835</td>
- <td> 867</td><td> 899</td><td> 931</td><td> 963</td></tr>
- <tr><td> </td><td> 995</td><td>1027</td><td>1059</td><td>1091</td>
- <td>1123</td><td>1155</td><td>1187</td><td>1219</td></tr>
- </table>
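- <p>The pattern in this table is completely regular: block i starts at cycle
- 195 + 32×i, with i=0 being the dummy block. A one-liner reproduces it:</p>
- <pre>
- #include <cstdio>
- 
- // Start cycle of bitmap-fetch block i within a line; i=0 is the dummy
- // block, i=1..32 are the 32 real blocks.
- int bitmapBlockStart(int i) { return 195 + 32 * i; }
- 
- int main() {
-     for (int i = 0; i <= 32; ++i) printf("%d ", bitmapBlockStart(i));
-     printf("\n"); // 195 227 259 ... 1219, matching the table above
- }
- </pre>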
- <p><i>The following is only speculation: I wonder why there is such a dummy
- preamble block. Theoretically this <b>could</b> have been used (or reserved) to
- implement V9958-like horizontal scrolling without having to mask 8 border
- pixels. Unfortunately horizontal scrolling on a V9958 doesn't work like that
- :(</i></p>
- <h4>screen enabled, sprites enabled</h4>
- <h5>refresh, dummy reads, bitmap reads</h5>
- <p>Refresh and bitmap reads are exactly the same as in the previous mode. But
- the 3 or 4 dummy reads from the previous 2 modes are not present in this
- mode.</p>
- <h5>sprite reads</h5>
- <p><i>I've only investigated bitmap modes, so the stuff below applies
- only to sprite mode 2.</i></p>
- <p>For sprite rendering you need to:</p>
- <ul>
- <li>Figure out which sprites are visible: there are 32 positions in the
- sprite attribute table, and of those at most 8 sprites can be visible
- (per line).</li>
- <li>For the visible sprites, fetch the required data so that they can actually
- be drawn. This data is: the x- and y-coordinates, the sprite pattern number,
- the pattern data and the color data.</li>
- </ul>
- <p>Figuring out which sprites are visible is done by reading the y-coordinates
- of each of the 32 possible sprites. These reads happen interleaved between the
- 32 block-reads of the bitmap data, so one byte is read between each pair of
- bitmap blocks. Because of this interleaving it's not possible to use burst
- mode, so each read takes 6 cycles. There's also 1 dummy read of address 0x1FFFF
- at the end. The reads happen at these moments in time (yellow blocks between
- the green blocks in the diagram):</p>
- <table>
- <tr><td> 182</td><td> 214</td><td> 246</td><td> 278</td>
- <td> 310</td><td> 342</td><td> 374</td><td> 406</td></tr>
- <tr><td> 438</td><td> 470</td><td> 502</td><td> 534</td>
- <td> 566</td><td> 598</td><td> 630</td><td> 662</td></tr>
- <tr><td> 694</td><td> 726</td><td> 758</td><td> 790</td>
- <td> 822</td><td> 854</td><td> 886</td><td> 918</td></tr>
- <tr><td> 950</td><td> 982</td><td>1014</td><td>1046</td>
- <td>1078</td><td>1110</td><td>1142</td><td>1174</td><td>(1206)</td></tr>
- </table>
- <p>In the worst case, the last 8 sprites of the attribute table are visible. In
- that case all 32 reads are really required. Though even if the limit of 8
- visible sprites is reached earlier, the VDP continues fetching all 32+1 bytes.
- Also if one y-coordinate is equal to 216 (meaning that all later sprites are
- invisible), still all 32+1 fetches are executed.</p>
- <p>Once the VDP has figured out which sprites are visible it needs to fetch the
- data to actually draw the sprites. This VRAM access pattern is relatively
- complex:</p>
- <ul>
- <li>In the worst case there are 8 visible sprites. This requires reading
- 8×6 bytes. Some of these reads can be done in burst mode, others are
- single byte reads.</li>
- <li>Even if there are fewer than 8 sprites to display, all read accesses still
- occur. It <i>seems</i> that the useless reads are duplicates of
- sprite 0. (Or is it the first visible sprite? I didn't look in detail because
- it's not important for our purpose. It only matters that the VRAM bus remains
- occupied.)</li>
- <li>The data fetches happen in 4 chunks of 2 sprites each. Each chunk
- reads:
- <ul>
- <li>Y-coordinate, x-coordinate and pattern-number of 1st sprite. Burst of 3
- reads, takes 13(!) cycles.</li>
- <li>Y-coordinate, x-coordinate and pattern-number of 2nd sprite. Burst of 3
- reads, takes 13(!) cycles.</li>
- <li>Pause of 6 or 10(!) cycles.</li>
- <li>2 pattern bytes of 1st sprite. Burst of 2 reads, takes 10 cycles.</li>
- <li>Color attribute of 1st sprite. Single read, takes 6 cycles.</li>
- <li>2 pattern bytes of 2nd sprite. Burst of 2 reads, takes 10 cycles.</li>
- <li>Color attribute of 2nd sprite. Single read, takes 6 cycles.</li>
- </ul>
- </li>
- <li>Note that the burst of 3 reads only takes 13 instead of the expected 14
- cycles. If you look at the RAS/CAS signals you see that this uses an illegal(?)
- RAM access pattern: RAS is released together with CAS (even slightly before if
- you look at the raw measured data). But obviously this seems to work fine
- <i>… makes me wonder why the VDP doesn't always use this faster
- access pattern.</i></li>
- <li>Even for 8x8 sprites, the VDP always fetches 2 bytes of pattern-data per
- sprite line (and the 2nd byte is ignored).</li>
- <li>Note that the y-coordinate is fetched again. It was already fetched to
- figure out which sprites are visible.</li>
- <li>The positions in time of these reads (single or burst) are like this
- (yellow blocks (mostly) in the border period in the big timing diagram):
- <table>
- <tr><td>1238</td><td>1251</td><td>1270</td><td>1280</td><td>1286</td><td>1296</td></tr>
- <tr><td>1302</td><td>1315</td><td>1338</td><td>1348</td><td>1354</td><td>1364</td></tr>
- <tr><td> 2</td><td> 15</td><td> 34</td><td> 44</td><td> 50</td><td> 60</td></tr>
- <tr><td> 66</td><td> 79</td><td> 98</td><td> 108</td><td> 114</td><td> 124</td></tr>
- </table>
- Note that some of these fetches occur in the previous and some in the current
- display line. Though the start of the display line was chosen arbitrarily (we
- could have picked the starting point so that these numbers don't wrap). It only
- matters that all sprite data is fetched before the display rendering
- starts.</li>
- <li>Also note that the timing is slightly irregular: in the 1st, 3rd and 4th
- group there's a pause of 6 cycles; exactly one other access fits in this
- gap. But in the 2nd group there's a pause of 10 cycles. Also here only one
- other access fits in this gap, and the timing is 2+6+2, so 2 'wasted' cycles
- before and after that other access. <i>I suspect that these 2+2 cycles are
- related to the R#9 S1,S0 bits. TODO measure this</i>.</li>
- </ul>
- <p>It's worth repeating that whenever sprites are enabled, the VDP
- <b>always</b> performs the same fetch-pattern. So even if no sprites are
- actually visible, or if sprites are partially disabled (with y=216), and even
- with 8x8 vs 16x16 sprites, magnified or not. This confirms the fact that the
- VDP command engine is slowed down by the exact same amount in all these
- situations. Also all (bitmap) screen modes behave exactly the same with respect
- to sprite data fetches.</p>
- <h3>CPU and command reads/writes</h3>
- <h5>position of access slots</h5>
- <p>The previous sections explained when the VDP reads from VRAM for refresh and
- bitmap/sprite rendering (and even some dummy reads). Depending on the mode
- (screen/sprites enabled/disabled), this takes more or less of the available
- VRAM-bandwidth. The portion of the VRAM bandwidth that is not used for
- rendering can be used for CPU or command engine VRAM reads or writes.</p>
- <p>All CPU and command engine accesses are single (non-burst) accesses, so they
- take 6 cycles each. However it is <b>not</b> the case that whenever the VRAM
- bus is idle for 6 cycles, it can be used for CPU or command engine
- accesses.</p>
- <p>Instead there are fixed moments in time where a CPU or command access can
- <i>possibly</i> start; let's call these moments access slots. Each slot
- can be used for either CPU or command accesses (there are no slots that are
- uniquely reserved for the CPU or for commands). The position and the number
- of access slots <i>only</i> depend on the VDP mode (screen off, sprites off,
- sprites on), not for example on the number of actually visible sprites or on
- the (bitmap) screen mode.</p>
- <p>The 3 tables below show the number and the positions of the possible access
- slots for the 3 different modes (in the timing diagram these are the blue
- blocks):</p>
- <p><table>
- <caption>screen off, 154 possible slots</caption>
- <tr><td> 0</td><td> 8</td><td> 16</td><td> 24</td><td> 32</td>
- <td> 40</td><td> 48</td><td> 56</td><td> 64</td><td> 72</td></tr>
- <tr><td> 80</td><td> 88</td><td> 96</td><td> 104</td><td> 112</td>
- <td> 120</td><td> 164</td><td> 172</td><td> 180</td><td> 188</td></tr>
- <tr><td> 196</td><td> 204</td><td> 212</td><td> 220</td><td> 228</td>
- <td> 236</td><td> 244</td><td> 252</td><td> 260</td><td> 268</td></tr>
- <tr><td> 276</td><td> 292</td><td> 300</td><td> 308</td><td> 316</td>
- <td> 324</td><td> 332</td><td> 340</td><td> 348</td><td> 356</td></tr>
- <tr><td> 364</td><td> 372</td><td> 380</td><td> 388</td><td> 396</td>
- <td> 404</td><td> 420</td><td> 428</td><td> 436</td><td> 444</td></tr>
- <tr><td> 452</td><td> 460</td><td> 468</td><td> 476</td><td> 484</td>
- <td> 492</td><td> 500</td><td> 508</td><td> 516</td><td> 524</td></tr>
- <tr><td> 532</td><td> 548</td><td> 556</td><td> 564</td><td> 572</td>
- <td> 580</td><td> 588</td><td> 596</td><td> 604</td><td> 612</td></tr>
- <tr><td> 620</td><td> 628</td><td> 636</td><td> 644</td><td> 652</td>
- <td> 660</td><td> 676</td><td> 684</td><td> 692</td><td> 700</td></tr>
- <tr><td> 708</td><td> 716</td><td> 724</td><td> 732</td><td> 740</td>
- <td> 748</td><td> 756</td><td> 764</td><td> 772</td><td> 780</td></tr>
- <tr><td> 788</td><td> 804</td><td> 812</td><td> 820</td><td> 828</td>
- <td> 836</td><td> 844</td><td> 852</td><td> 860</td><td> 868</td></tr>
- <tr><td> 876</td><td> 884</td><td> 892</td><td> 900</td><td> 908</td>
- <td> 916</td><td> 932</td><td> 940</td><td> 948</td><td> 956</td></tr>
- <tr><td> 964</td><td> 972</td><td> 980</td><td> 988</td><td> 996</td>
- <td>1004</td><td>1012</td><td>1020</td><td>1028</td><td>1036</td></tr>
- <tr><td>1044</td><td>1060</td><td>1068</td><td>1076</td><td>1084</td>
- <td>1092</td><td>1100</td><td>1108</td><td>1116</td><td>1124</td></tr>
- <tr><td>1132</td><td>1140</td><td>1148</td><td>1156</td><td>1164</td>
- <td>1172</td><td>1188</td><td>1196</td><td>1204</td><td>1212</td></tr>
- <tr><td>1220</td><td>1228</td><td>1268</td><td>1276</td><td>1284</td>
- <td>1292</td><td>1300</td><td>1308</td><td>1316</td><td>1324</td></tr>
- <tr><td>1334</td><td>1344</td><td>1352</td><td>1360</td></tr>
- </table></p>
- <p><table>
- <caption>sprites off, 88 possible slots</caption>
- <tr><td> 6</td><td> 14</td><td> 22</td><td> 30</td><td> 38</td>
- <td> 46</td><td> 54</td><td> 62</td><td> 70</td><td> 78</td></tr>
- <tr><td> 86</td><td> 94</td><td> 102</td><td> 110</td><td> 118</td>
- <td> 162</td><td> 170</td><td> 182</td><td> 188</td><td> 214</td></tr>
- <tr><td> 220</td><td> 246</td><td> 252</td><td> 278</td><td> 310</td>
- <td> 316</td><td> 342</td><td> 348</td><td> 374</td><td> 380</td></tr>
- <tr><td> 406</td><td> 438</td><td> 444</td><td> 470</td><td> 476</td>
- <td> 502</td><td> 508</td><td> 534</td><td> 566</td><td> 572</td></tr>
- <tr><td> 598</td><td> 604</td><td> 630</td><td> 636</td><td> 662</td>
- <td> 694</td><td> 700</td><td> 726</td><td> 732</td><td> 758</td></tr>
- <tr><td> 764</td><td> 790</td><td> 822</td><td> 828</td><td> 854</td>
- <td> 860</td><td> 886</td><td> 892</td><td> 918</td><td> 950</td></tr>
- <tr><td> 956</td><td> 982</td><td> 988</td><td>1014</td><td>1020</td>
- <td>1046</td><td>1078</td><td>1084</td><td>1110</td><td>1116</td></tr>
- <tr><td>1142</td><td>1148</td><td>1174</td><td>1206</td><td>1212</td>
- <td>1266</td><td>1274</td><td>1282</td><td>1290</td><td>1298</td></tr>
- <tr><td>1306</td><td>1314</td><td>1322</td><td>1332</td><td>1342</td>
- <td>1350</td><td>1358</td><td>1366</td></tr>
- </table></p>
- <p><table>
- <caption>sprites on, 31 possible slots</caption>
- <tr><td> 28</td><td> 92</td><td> 162</td><td> 170</td><td> 188</td>
- <td> 220</td><td> 252</td><td> 316</td><td> 348</td><td> 380</td></tr>
- <tr><td> 444</td><td> 476</td><td> 508</td><td> 572</td><td> 604</td>
- <td> 636</td><td> 700</td><td> 732</td><td> 764</td><td> 828</td></tr>
- <tr><td> 860</td><td> 892</td><td> 956</td><td> 988</td><td>1020</td>
- <td>1084</td><td>1116</td><td>1148</td><td>1212</td><td>1264</td></tr>
- <tr><td>1330</td></tr>
- </table></p>
- <p>Note that even in the mode 'screen off', when the VRAM bus is otherwise
- mostly idle, the access slots are still at least 8 cycles apart. A single
- access takes only 6 cycles, so 2 cycles are wasted.</p>
- <p>Very roughly speaking in mode 'screen off' there are about twice as many
- access slots as in the mode 'sprites off' and about 5 times as many as in the
- mode 'sprites on'. This does however <b>not</b> mean that in these modes the
- command engine will execute respectively 2× and 5× as fast. Instead
- in the mode 'sprites on' the speed of command execution is mostly limited by
- the amount of available access slots, while in the mode 'screen off', the
- bottleneck is mostly the speed of the command engine itself.</p>
- <p>Also note that the access slots are not evenly spread in time. For
- example:</p>
- <ul>
- <li>In mode 'screen off', the slots are often only 8 cycles apart (measured
- from the start of the 1st to the start of the 2nd slot). Though starting
- at cycle=120 there's a gap of 44 cycles.</li>
- <li>In mode 'sprites off', during the horizontal border, the access slots are
- roughly 8 cycles apart like in the previous mode, but during the display
- period, the spacing is more like 26 or 32 cycles. The largest gap is now 54
- cycles starting at cycle=1212.</li>
- <li>In mode 'sprites on', the pattern is again completely different. Here the
- slots are roughly 32 or 64 cycles apart. (The border even has slightly larger
- gaps than the display area. So contrary to some speculations, the commands do
- not execute faster in the horizontal border in this mode). The largest gap is
- now 70 cycles, starting at cycle=92. There's even one location where the
- gap is only 8 cycles. (Though if you look at the measurements
- you'll see that the slot right after this smallest gap (at cycle=170) is rarely
- actually used, even though the command engine is starved for VRAM
- bandwidth.)</li>
- </ul>
- <p>These large gaps between the access slots are important. For example if the
- CPU is sending data to the VDP at too fast a rate, and this happens right at a
- moment where there are no access slots available, then some of the data sent by
- the CPU is lost. We'll see later in this text that this can even happen
- when the time between the incoming CPU requests is (slightly) larger than the
- size of the largest gap.</p>
- <h5>allocation of access slots</h5>
- <p>The access slots can be used for either CPU or VDP command reads or writes.
- This section explains how the slots are allocated to these two subsystems.</p>
- <p>The basic principle is very simple: the CPU or the command engine takes the
- first available access slot. And when the CPU and command engine both require
- an access slot at the same time, the CPU gets priority. Though if you look
- at the details it is a bit more complicated:</p>
- <ul>
- <li>When the CPU sends a read or write VRAM request to the VDP, this request is
- put in a buffer until it can be handled.</li>
- <li>If the CPU sends a new request while a previous request is still
- pending, then the old request is lost. More on this below. <i>TODO most logical
- is that the old (not the new) request is lost, but actually check this. Though
- the Z80 might be too slow to be able to test this.</i></li>
- <li>Similarly when the VDP command engine needs to perform a VRAM read or
- write, this request is also put in a buffer. This is a different buffer than
- the one for CPU requests.</li>
- <li>In contrast to the CPU, the command engine is stalled when the command
- engine buffer holds a request. So command engine requests can never get
- lost.</li>
- <li>16 cycles in advance of an access slot the VDP checks whether there is
- a pending CPU or command request. If there's a pending CPU request, that
- request will be executed (16 cycles later). If there's no CPU request but there
- is a command request, then that one will be executed (16 cycles later). So the
- CPU takes priority over the command engine. And, very importantly, if there's no
- request pending yet, then nothing will be executed 16 cycles later, not even if
- a request does arrive within those 16 cycles. (A small code sketch of this rule
- follows after this list.)</li>
- </ul>
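- <p>Below is a minimal sketch of this arbitration rule as I understand it. The
- one-entry buffers match the description above; that the chosen buffer is only
- consumed at the slot itself (and not already at the decision point 16 cycles
- earlier) is my interpretation, based on the 'too fast CPU access' scenario
- below:</p>
- <pre>
- #include <cstdint>
- #include <optional>
- 
- struct Request { uint32_t address; bool isWrite; uint8_t data; };
- 
- enum class Owner { none, cpu, cmd };
- 
- struct SlotArbiter {
-     std::optional<Request> cpuBuffer; // a new CPU request overwrites this
-     std::optional<Request> cmdBuffer; // command engine stalls while full
-     Owner owner = Owner::none;
- 
-     void cpuRequest(const Request& r) { cpuBuffer = r; } // may drop old one
-     bool cmdRequest(const Request& r) {                  // false -> stall
-         if (cmdBuffer) return false;
-         cmdBuffer = r;
-         return true;
-     }
- 
-     // Called 16 cycles before each access slot: a pending CPU request
-     // wins, otherwise a pending command request. If nothing is pending
-     // yet, the slot stays unused, even if a request arrives later.
-     void decide() {
-         owner = cpuBuffer ? Owner::cpu
-               : cmdBuffer ? Owner::cmd
-                           : Owner::none;
-     }
- 
-     // Called at the slot itself: consume the chosen subsystem's buffer.
-     std::optional<Request> execute() {
-         std::optional<Request> r;
-         if (owner == Owner::cpu)      { r = cpuBuffer; cpuBuffer.reset(); }
-         else if (owner == Owner::cmd) { r = cmdBuffer; cmdBuffer.reset(); }
-         owner = Owner::none;
-         return r;
-     }
- };
- 
- int main() {
-     SlotArbiter a;
-     a.cpuRequest({0x12345, true, 0xAB});
-     a.decide();                 // 16 cycles before the slot
-     return a.execute() ? 0 : 1; // at the slot: the CPU write executes
- }
- </pre>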
- <h5>cpu access slows down command execution</h5>
- <p>A surprising result (at least to me) of these measurements is that the
- speed of VDP command execution is reduced while simultaneously doing CPU VRAM
- accesses. Looking back this makes sense because the same VRAM access slots are
- shared between CPU and command engine and the CPU gets priority.</p>
- <p>This effect is clearly noticeable in the mode 'sprites on' but much less in
- the other two modes. This is easily explained by looking at the amount of
- available access slots in these modes.</p>
- <p>The most extreme situation occurs in the following test: execute a HMMV VDP
- command (this is the fastest command, see below) while simultaneously executing
- a long series of <code>OUT (#98),A</code> instructions (the fastest way to send
- CPU write requests). In our measurements, in the mode 'sprites on' the command
- execution speed was approximately cut in half! But in the other two modes, the
- execution speed was barely influenced. (Actually our test program wasn't
- accurate enough to measure any significant speed difference, but theoretically
- also in the latter two modes the execution speed should be reduced by a small
- amount.)</p>
- <h5>too fast CPU access</h5>
- <p>The fastest way for the Z80 to send read or write VRAM requests to the VDP is
- by using a sequence of <code>IN A,(#98)</code> or <code>OUT (#98),A</code>
- instructions (of course such a sequence always writes the same value or ignores
- all but the last read value). This takes 12 Z80 clock cycles per request.
- (Instructions like <code>OUT (C),r</code> or <code>OUTI</code> are all slower).
- The VDP is clocked at 6× the Z80 speed. So when the Z80 sends multiple
- requests to the VDP, the minimal distance between these requests, translated to
- VDP cycles, is at least 72 VDP cycles. Earlier we saw that the maximal gap
- between access slots was 70 VDP cycles, so at first sight there's no problem.
- However consider this scenario:</p>
- <ul>
- <li>Suppose we're in 'sprites on' mode. At time=236, we're 16 cycles before an
- access slot. Suppose there's no pending CPU nor command request at this
- time. So nothing will get executed at time=252.</li>
- <li>A bit later, at time=240, a CPU write request arrives. This request
- gets buffered.</li>
- <li>At time=252 there is an access slot, but nothing will get executed in this
- slot (because this slot wasn't allocated at time=236).</li>
- <li>At time=300 we're again 16 cycles before an access slot. Now there is a
- pending CPU request, so we'll execute that at time=316.</li>
- <li>At time=312 we receive a new CPU write request. This is 312-240=72 VDP
- cycles (or 12 Z80 cycles, the duration of an <code>OUT (#98),A</code>
- instruction) after the previous request. But the buffer still contains the
- previous unhandled request. The new request overwrites the old request!</li>
- <li>At time=316 there's an access slot and we've allocated this slot to the CPU
- (at time=300). So the pending CPU request gets executed. Though this writes the
- data from the new request, the data from the old request is never written!</li>
- </ul>
- <p>Note that this scenario used a gap of only 64 VDP cycles between access
- slots, while there were 72 cycles between the CPU requests. (And the largest
- gap between access slots is 70 cycles).</p>
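- <p>This scenario can be replayed with a small standalone simulation (event
- times taken from the scenario above; the one-entry buffer behaves as in the
- sketch from the previous section):</p>
- <pre>
- #include <cstdio>
- #include <optional>
- 
- int main() {
-     std::optional<int> buffer;  // one-entry CPU buffer (holds arrival time)
-     bool slotForCpu = false;
-     int executed = 0, lost = 0;
- 
-     for (int t = 230; t <= 320; ++t) {
-         if (t == 236 || t == 300)             // 16 cycles before a slot:
-             slotForCpu = buffer.has_value();  // allocate it (or not)
-         if (t == 240 || t == 312) {           // incoming CPU writes
-             if (buffer) ++lost;               // pending request overwritten!
-             buffer = t;
-         }
-         if ((t == 252 || t == 316) && slotForCpu) { // the slot itself
-             buffer.reset();                   // consume the buffered request
-             ++executed;
-             slotForCpu = false;
-         }
-     }
-     printf("executed=%d lost=%d\n", executed, lost); // executed=1 lost=1
- }
- </pre>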
- <!--TODO tests on real machine:
- only lost in 'sprites on' mode ??
- OUT (#99),A -> easy lost
- OUT (C),A -> only very occasionally
- other OUT patterns always OK
- -->
- <h2>Command engine timing</h2>
- <p>The command engine needs access to VRAM. In the previous section we saw when
- the VDP will grant access to this subsystem: when there's an access slot
- available and when that slot is not already allocated to CPU access. In this
- section we'll see when exactly the command engine will generate VRAM access
- requests. Obviously the type (read or write) and the rate of these requests
- depend on the type of the VDP command that is executing.</p>
- <p>Some commands (like HMMV) only need to write to VRAM. Other commands (like
- LMMM) need 2 reads and 1 write per pixel. Many commands execute on a block (a
- rectangle) of pixels. Such a block is executed line per line (all pixels within
- one horizontal line are processed before moving to the next line). Moving from
- one line to the next takes some amount of time (but YMMM is an exception, see
- below). This means that e.g. a HMMM command on a 20x4 rectangle executes faster
- than on a 4x20 rectangle (same amount of pixels in both cases, but a different
- rectangle shape).</p>
- <p>The following table summarizes the timing for all measured commands:</p>
- <table>
- <tr><th>Command</th><th>Per pixel</th><th>Per line</th></tr>
- <tr><td>HMMV</td><td>48 W </td><td>56</td></tr>
- <tr><td>YMMM</td><td>40 R 24 W </td><td>0 </td></tr>
- <tr><td>HMMM</td><td>64 R 24 W </td><td>64</td></tr>
- <tr><td>LMMV</td><td>72 R 24 W </td><td>64</td></tr>
- <tr><td>LMMM</td><td>64 R 32 R 24 W</td><td>64</td></tr>
- <tr><td>LINE</td><td>88 R 24 W </td><td>32</td></tr>
- </table>
- <p><i>TODO timing for PSET, POINT, SRCH</i></p>
- <p>I'll explain the notation in this table with an example. Take the LMMM
- command:</p>
- <ul>
- <li>Per pixel the LMMM command needs to:
- <ul><li>Read a byte from the source.</li>
- <li>Read a byte from the destination.</li>
- <li>Calculate the result: extract the pixel value from source and
- destination, combine the two (possibly with a logical operation), insert
- the result in the destination byte. And write the result back to the
- destination.</li>
- </ul></li>
- <li>So per pixel, the LMMM command will generate 3 VRAM accesses: 2 read
- followed by one write. Between these accesses there will be some amount of
- time.</li>
- <li>For LMMM the table lists '64 R 32 R 24 W'. Let's start at the 1st 'R'
- character; this represents the 1st read. Next there's the value 32 and a 2nd
- 'R', this means that the 2nd read comes <i>at least</i> 32 cycles after the 1st
- read. Then there's '24 W', meaning there are <i>at least</i> 24 cycles between
- the 2nd read and the write. And the initial value '64' means that there are
- <i>at least</i> 64 cycles between the write and the 1st read for the next
- pixel.</li>
- <li>When moving from one horizontal line to the next in a block command, there
- is some extra delay. For the LMMM command this takes 64 extra cycles. So
- 64+64=128 cycles from the last write of a line till the first read of the next
- line.</li>
- <li>Note that all these values are the <i>optimal</i> timing values. The actual
- delay can be longer because there is no access slot available or the slot is
- already allocated for CPU access (see the sketch after this list).</li>
- </ul>
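- <p>Under this optimal-timing assumption the per-pixel gaps can simply be
- summed to get a lower bound on the duration of a block command. A sketch
- (real execution is slower whenever a slot isn't immediately available):</p>
- <pre>
- #include <cstdio>
- 
- struct CommandTiming { int perPixel; int perLine; };
- 
- // Lower bound for a block command on a w x h rectangle, in VDP cycles.
- long long optimalCycles(CommandTiming t, int w, int h) {
-     return (long long)h * ((long long)w * t.perPixel + t.perLine);
- }
- 
- int main() {
-     CommandTiming hmmm = {64 + 24, 64};      // '64 R 24 W', 64 per line
-     CommandTiming lmmm = {64 + 32 + 24, 64}; // '64 R 32 R 24 W', 64 per line
-     // Same number of pixels, different shape -> different duration:
-     printf("HMMM 20x4:  %lld cycles\n", optimalCycles(hmmm, 20, 4));
-     printf("HMMM 4x20:  %lld cycles\n", optimalCycles(hmmm, 4, 20));
-     printf("LMMM 10x10: %lld cycles\n", optimalCycles(lmmm, 10, 10));
- }
- </pre>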
- <p>All the commands in the table above are block commands except for 'LINE'.
- For the LINE command the meaning of the columns 'Per pixel' and 'Per line' may
- not be immediately clear:</p>
- <ul>
- <li>The VDP uses the <a href="http://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm">
- Bresenham algorithm</a> to calculate which pixels are part of the line.</li>
- <li>This algorithm takes at each iteration one step in the <i>major</i>
- direction. The timings for such an iteration are written in the 'Per pixel'
- column for the LINE command.</li>
- <li>Depending on the slope of the line, in some iterations the Bresenham
- algorithm also takes a step in the <i>minor</i> direction. For the VDP such a
- minor step takes some extra time (32 cycles). This is written in the 'Per line'
- column of the LINE command. (If you look back at the very beginning of this
- text, these major and minor steps explain the general octagonal shapes in the
- images. The uneven distribution of the access slots explains the
- irregularities.)</li>
- </ul>
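- <p>The same idea gives a lower-bound estimate for a LINE command: count the
- major and minor Bresenham steps and apply the table entries. This sketch
- deliberately ignores access slot availability, which is exactly what causes
- the irregularities shown at the beginning of this text:</p>
- <pre>
- #include <cstdio>
- #include <cstdlib>
- 
- // Optimal LINE duration in VDP cycles: 88 + 24 per major step (the
- // '88 R 24 W' entry), plus 32 extra cycles per minor step.
- long long lineCycles(int dx, int dy) {
-     int adx = std::abs(dx), ady = std::abs(dy);
-     int major = adx >= ady ? adx : ady;
-     int minor = adx >= ady ? ady : adx;
-     return (long long)major * (88 + 24) + (long long)minor * 32;
- }
- 
- int main() {
-     // A diagonal line takes a minor step on every iteration, so it is
-     // slower than a horizontal line with the same major length. This is
-     // what produces the octagonal shape in the LINE speed test.
-     printf("line (100,0):   %lld cycles\n", lineCycles(100, 0));
-     printf("line (100,100): %lld cycles\n", lineCycles(100, 100));
- }
- </pre>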
- <p>Note that for the YMMM command there's no extra overhead when going from one
- horizontal line to the next. This might be related to the fact that a line of
- a YMMM command always starts at the left or right border of the screen.</p>
- <p><i>TODO What we didn't measure (also couldn't measure with our test setup)
- was the delay between the start of the command (when the CPU sends the command
- byte to the VDP) and the moment the command actually starts executing (e.g.
- when the first read or write command access is sent to VRAM). It's logical to
- assume that the 'per line' overhead also occurs at the start of the command.
- But it's possible there is also some additional 'per command' overhead.</i></p>
- <h5>speculation on the slowness of the command engine</h5>
- <p>When looking at the above table, we see that the command engine is very
- slow. For example in a HMMM command there are 24 cycles between reading a byte
- and writing that byte to the new location. Or in a LINE command it takes 32
- cycles to take a step in the minor direction. I <i>believe</i> there are two
- main reasons for this slowness:</p>
- <ul>
- <li>I believe that internally the VDP command engine subsystem runs at 1/8 of
- the master VDP clock frequency. This matches the observation that all values in
- the above table are multiples of 8. It also explains why the access slots are
- always at least 8 cycles apart (while a VRAM access only requires 6
- cycles).</li>
- <li>The command engine gets stalled whenever there's a pending command engine
- VRAM request. A VRAM request (CPU or command) only gets handled after it's been
- pending for at least 16 cycles. So combined this means the command engine gets
- stalled for 16 cycles on every VRAM request it makes. (Note that especially
- this point is just speculation).</li>
- </ul>
- <p>Taking these two points into account, the above table can be rewritten
- as:</p>
- <table>
- <tr><th>Command</th><th>Per pixel</th><th>Per line</th></tr>
- <tr><td>HMMV</td><td>(4×8+16) W </td><td>7×8</td></tr>
- <tr><td>YMMM</td><td>(3×8+16) R (1×8+16) W </td><td>0×8</td></tr>
- <tr><td>HMMM</td><td>(6×8+16) R (1×8+16) W </td><td>8×8</td></tr>
- <tr><td>LMMV</td><td>(7×8+16) R (1×8+16) W </td><td>8×8</td></tr>
- <tr><td>LMMM</td><td>(6×8+16) R (2×8+16) R (1×8+16) W</td><td>8×8</td></tr>
- <tr><td>LINE</td><td>(9×8+16) R (1×8+16) W </td><td>4×8</td></tr>
- </table>
- <p>When you look at the data in this way, the numbers already look more
- reasonable.</p>
- <h2>Next steps</h2>
- <p>All the information above <i>should</i> already be enough to significantly
- improve the accuracy of MSX emulators. In the following months I plan to work on
- improving openMSX.</p>
- <ul>
- <li>First I'd like to improve the CPU-VRAM access stuff, so that e.g. too fast
- CPU accesses actually result in dropped requests.</li>
- <li>Next step is the timing of the VDP commands. This depends on the previous
- step because e.g. CPU access slows down command execution.</li>
- <li>A still later step could be to fetch the data required for display
- rendering (bitmap, sprites) at more accurate moments in time. This is lower
- priority because:
- <ul>
- <li>These effects are limited to the visual output. Errors can't influence
- the 'state' of the MSX machine. So it's impossible to write an MSX program
- that checks (= makes a decision based on) the rendering accuracy. (OTOH it is
- possible to check for dropped CPU requests or the speed of the
- commands).</li>
- <li>I don't know of any <i>existing</i> MSX software where this will make a
- noticeable difference. Maybe an idea for a <i>new</i> test is to vary the
- y-coordinates of the sprite(s) within one display line. Thus causing the
- sprite engine to use two different values in the two phases of sprite
- rendering.</li>
- <li>Hmm … or maybe there is an existing program: the <a
- href="http://users.skynet.be/bk263586/verti.zip">verti</a> demo. On current
- emulators the vertical bars are all equally wide. But on a real MSX there
- are wider and narrower bars (all multiples of 8 pixels wide).</li>
- </ul>
- </li>
- </ul>
- <p>I'm afraid this will all still take quite a bit of work.</p>
- <p>Anyway, I hope the information in this document is useful, for (other) MSX
- emulator developers and for MSX developers in general.</p>
- <hr/>
- <p align="right" style="font-size:smaller;">
- 2013/03/30, Wouter Vermaelen
- </p>
- </body>
- </html>