123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200 |
- Some time ago (March 2015) Laurens Holst (aka Grauw) did some measurements
- on the duration of various R800 instructions. For his results see
- http://map.grauw.nl/resources/z80instr.php
- For the most part he was able to confirm the results we obtained earlier
- (described in the "r800test.txt" document). Though for the CALL
- instruction he 'sometimes' got a 1 cycle difference. Now, half a year
- later, I finally got around to investigate this in more detail.
- I used the following measuring method. Alex Wulms once created an MSX
- cartridge with a 3.57MHz counter on it. Writing to IO port 0x20 resets the
- counter and reading from IO ports 0x20-0x23 (atomically) reads it. So this
- is much like the MSX-turboR E6-timer, except that it ticks 14x faster. The
- R800 runs at 2 x 3.57MHz, so that means we can still only measure up-to 2
- cycles accurate. To work around this I always repeat the to-be-measured
- sequence twice. Here's an example program.
- org #c000
- di
- out (#20),a ; reset timer
- call func ; repeat to be measured
- call func ; sequence twice
- in a,(#20) ; read timer
- ld l,a
- in a,(#20)
- ld h,a
- ret
- func: ret
- Initially I run this program without the 2 'call func' instructions. After
- the program returns, register HL contains the value '5'. This is the
- time between reseting and reading the counter. This value should be
- subtracted from all future measurements.
- Note: usually these test programs give stable results. But once in a while
- you get a too high value. This can be explained by the R800 refresh stuff
- (see "r800-refresh.txt" for more details). By repeating the same test a
- few times it's often possible to avoid this refresh stuff.
- When re-inserting the two 'call func' instructions, I obtain the value
- '17', subtracting 5 gives 12. So executing 2 'call + ret' sequences takes
- 2x12 cycles. So a single 'call + ret' sequence takes 12 cycles. Later in
- this text I won't repeat this calculation, I'll directly give the length
- of the (single) sequence.
- So this confirms the result of my earlier measurements. See "r800test.txt"
- for details. There I also show that these 12 cycles decompose into 7
- cycles for the "call" instruction and 5 cycles for the "ret" instruction.
- The interesting thing is that Laurens 'sometimes' measured 8 cycles for
- the "call" instruction. Let's now make the following change in the program:
- ...
- func: nop ; <-- added this nop instruction
- ret
- So the executed sequence is now "call + nop + ret". Naively we'd expect
- this to take only 1 cycle more. Instead we measure 14 cycles (2 more).
- I also tested this program:
- ...
- call func
- nop
- call func
- nop
- ...
- func: ret
- The sequence is now "call + ret + nop". This does take 13 cycles (only 1
- more). So it's really "call" immediately followed by "ret" that is
- special.
- I measured a lot of other sequences as well. I'll summarize them in this
- table:
- sequence measured decomposed remark
- a) call, ret 12 7+5 NO penalty, call-ret is special
- b) call, ret, nop 13 7+5+1 NO penalty
- c) call, nop, ret 14 7+3+4 penalty on nop
- d) call, nop, nop, ret 15 7+3+1+4 penalty on 1st nop
- e) call, reti 15 7+8 penalty on reti, reti is NOT special
- f) call, nop, reti 16 7+3+6 penalty on nop
- g) call, pop hl 12 7+5 NO penalty, call-pop is special
- h) call, nop, pop hl 14 7+3+4 penalty on nop
- i) call, nop, nop, pop hl 15 7+3+1+4 penalty on 1st nop
- j) call, pop ix 14 7+7 penalty on "pop ix", different from "pop hl"!
- k) call, nop, pop ix 15 7+3+5 penalty on nop
- l) call, nop, nop, pop ix 16 7+3+1+5 penalty on 1st nop
- m) push hl, pop hl 11 6+5 no penalty,
- n) push hl, nop, pop hl 12 6+2+4 push-pop is not special
- o) push hl, ret 11 6+5 no penalty,
- p) push hl, nop, ret 12 6+2+4 push-ret is not special
- q) rst, ret 11 6+5 NO penalty, rst-ret is special
- r) rst, nop, ret 13 6+3+4 penalty on nop
- s) rst, nop, nop, ret 14 6+3+1+4 penalty on 1st nop
- t) rst, reti 14 6+8 penalty on reti, reti is NOT special
- u) rst, nop, reti 15 6+3+6 penalty on nop
- v) rst, nop, nop, reti 16 6+3+1+6 penalty on 1st nop
- w) rst, jp(hl) 9 6+3 penalty on jp(hl)
- x) rst, jp(ix) 10 6+4 penalty on jp(ix)
- Details:
- a) For simplicity I've shown the decomposition 12 -> 7+5. In reality the
- situation is more complex. The full sequence in the test-program is
- out ; call ; ret ; call ; ret ; in
- When we take page-breaks into account, we get the following for the 4
- middle instructions (I'm using the notation from the "r800test.txt"
- document):
- fffWw FRr FffWw FRr + page-break
- So the 1st "call" instruction takes 6 cycles, the 1st "ret" takes 5
- cycles, 2nd "call" takes 7, 2nd "ret" takes 5 and the next "in"
- instruction has a page-break-penalty on fetch. If we simplify this
- (artificially attribute the page-break for "in" to the 1st "call") we can
- say "call" takes 7 and "ret" takes 5 cycles.
- Notice that there's an even number of cycles between the "out" and "in"
- operation. So this inserts an IO-penalty on the IN instruction (to align
- the R800 bus at 7MHZ to the external cartridge bus at 3.5MHz, see
- "r800test.txt" for more details). In case of an empty sequence ("out"
- directly followed by "in"), we have zero cycles which is also even. So our
- calibration method (subtract 5 from the measured result) remains valid.
- d) I'll give another example of the simplified decomposition. The full
- sequence is:
- (out) ; call ; nop ; nop ; ret ; call ; nop ; nop ; ret ; (in)
- The middle part (without out-in) decomposes to
- fffWw Fx f fRr FffWw Fx f fWw + page-break
- 6 3 1 4 7 3 1 4 (+1)
- Here 'x' is an extra "call-penalty", see below for more details. So if we
- artificially move the page-break-penalty for "in" to the front we get
- 7+3+1+4 for call+nop+nop+ret. This same simplification is made for all
- decompositions in the table.
- c) Compared to 'a' you'd expect this sequence to take only 1 extra cycle
- (just like sequence b). Instead there's yet 1 more extra cycle. In
- general, when looking at more sequences and/or replacing 'nop' with other
- instructions with known duration (not shown in the table), we see that we
- always get an extra cycle whenever a call instruction is not _immediately_
- followed by a "ret" or "pop" instruction. We'll call this a "call-penalty"
- cycle.
- We could attribute this call-penalty cycle to the call instruction itself,
- thus say that the call instruction takes 7 or 8 cycles, depending on what
- instruction follows. For most practical purposes this explanation is good
- enough. Though if you look in detail, the fetch of the next instruction
- has to start 7 (not 8) cycles after the start of the call instruction
- (otherwise we cannot know whether there should be a penalty or not). So
- this call-penalty really happens during the _next_ instruction.
- It's likely that this call-penalty stuff can be explained in a simpler
- way if you look at the full-pipelined implementation of the R800.
- _Maybe_ a call somehow/sometimes introduces a late pipeline stall.
- Though because no details are known about the R800 pipeline, for now,
- I'll stick to this weird sequential explanation that a "call"
- instruction somehow induces a stall in the next instruction.
- e) In sequence 'a' and 'b' we saw that "call" immediately followed by
- "ret" has no call-penalty. This is _not_ the case for "reti" or "retn".
- g) h) i) A "call" immediately followed by "pop" also has no penalty.
- j) k) l) Though "pop" has to be a single-byte instruction ("pop af", "pop
- bc", "pop de" or "pop hl"). "pop ix" or "pop iy" do have the call-penalty.
- This is similar to how a multi-byte return instruction (reti) also gets
- the penalty.
- m) n) o) p) From a stack-usage point of view a "call" immediately followed
- by a "ret" or "pop" is similar to a "push" immediately followed by "ret"
- or "pop". Though the tests show there's no penalty for push-nop-pop
- compared to push-pop or for push-nop-ret compared to push-ret.
- q) r) s) t) u) v) The "rst" instruction behaves similar to "call", so
- there is a penalty on the next instruction except if that next instruction
- is a (single-byte) "ret" or "pop". The only difference is that the "rst"
- instruction itself takes 1 (only 1!) cycles less than a "call"
- instruction.
- TODO implement this stuff in openMSX.
|