r800-call.txt

Some time ago (March 2015) Laurens Holst (aka Grauw) measured the duration
of various R800 instructions. For his results see
  http://map.grauw.nl/resources/z80instr.php
For the most part he was able to confirm the results we obtained earlier
(described in the "r800test.txt" document). For the CALL instruction,
however, he 'sometimes' got a 1-cycle difference. Now, half a year later, I
finally got around to investigating this in more detail.

I used the following measuring method. Alex Wulms once created an MSX
cartridge with a 3.57MHz counter on it. Writing to IO port 0x20 resets the
counter, and reading from IO ports 0x20-0x23 reads it (atomically). So this
is much like the MSX-turboR E6-timer, except that it ticks 14x faster. The
R800 runs at 2 x 3.57MHz, so one counter tick spans 2 R800 cycles and a
single measurement is only accurate to within 2 cycles. To work around this
I always repeat the to-be-measured sequence twice. Here's an example
program.

        org  #c000
        di
        out  (#20),a    ; reset timer
        call func       ; repeat to be measured
        call func       ;   sequence twice
        in   a,(#20)    ; read timer
        ld   l,a
        in   a,(#20)
        ld   h,a
        ret
func:   ret

Initially I ran this program without the 2 'call func' instructions. After
the program returns, register HL contains the value '5'. This is the time
(in counter ticks) between resetting and reading the counter. This value
must be subtracted from all future measurements.

Note: usually these test programs give stable results. But once in a while
you get a value that is too high. This can be explained by the R800 refresh
(see "r800-refresh.txt" for more details). By repeating the same test a few
times it's usually possible to get a result that is not disturbed by a
refresh.

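One possible way to automate "repeat the test a few times" is sketched below
in Python. It assumes that a refresh can only make a reading larger, never
smaller, so the smallest reading is taken as the undisturbed one. That
selection rule, the function name and the sample readings are all
assumptions made for illustration; the text above only says that repeating
the test helps.

```python
def stable_reading(samples):
    """Return the smallest of several raw counter readings.

    Assumption: a refresh only ever adds time to a run, so the minimum
    over a few runs is the reading that was not disturbed.
    """
    if not samples:
        raise ValueError("need at least one reading")
    return min(samples)


# Hypothetical raw readings from four runs of the same test program;
# the 18 stands for a run that was hit by a refresh.
readings = [17, 17, 18, 17]
print(stable_reading(readings))
```

If the disturbed runs turn out to be common, taking the most frequent value
instead of the minimum would be an equally simple alternative.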
When re-inserting the two 'call func' instructions, I obtain the value
'17'; subtracting 5 gives 12 counter ticks. One tick is 2 R800 cycles, so
executing 2 'call + ret' sequences takes 2x12 cycles, and a single
'call + ret' sequence takes 12 cycles. Later in this text I won't repeat
this calculation, I'll directly give the length of the (single) sequence.

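The conversion from a raw counter value to a per-sequence cycle count can be
written out as a small Python sketch. The constants come from the text (a
baseline of 5 ticks, 2 R800 cycles per tick, the sequence repeated twice);
the function and parameter names are made up for this illustration.

```python
BASELINE_TICKS = 5    # reading with 'out' directly followed by 'in'
CYCLES_PER_TICK = 2   # R800 at 2 x 3.57MHz versus a 3.57MHz counter
REPEATS = 2           # the measured sequence is executed twice


def cycles_per_sequence(raw_ticks,
                        baseline=BASELINE_TICKS,
                        cycles_per_tick=CYCLES_PER_TICK,
                        repeats=REPEATS):
    """Convert a raw counter reading into R800 cycles for one sequence."""
    net_ticks = raw_ticks - baseline            # remove the out/in overhead
    total_cycles = net_ticks * cycles_per_tick  # ticks -> R800 cycles
    return total_cycles / repeats               # cycles for a single sequence


print(cycles_per_sequence(17))  # 12.0
```

Because the sequence is repeated twice and one tick is two cycles, each
extra tick in the raw reading corresponds to exactly one extra cycle per
sequence. By the same arithmetic a raw reading of 18 corresponds to 13
cycles.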
So this confirms the result of my earlier measurements. See "r800test.txt"
for details. There I also show that these 12 cycles decompose into 7
cycles for the "call" instruction and 5 cycles for the "ret" instruction.
The interesting thing is that Laurens 'sometimes' measured 8 cycles for
the "call" instruction. Let's now make the following change in the program:

        ...
func:   nop             ; <-- added this nop instruction
        ret

So the executed sequence is now "call + nop + ret". Naively we'd expect
this to take only 1 cycle more than "call + ret". Instead we measure 14
cycles (2 more).

I also tested this program:

        ...
        call func
        nop
        call func
        nop
        ...
func:   ret

The sequence is now "call + ret + nop". This does take 13 cycles (only 1
more). So it's really "call" immediately followed by "ret" that is
special.

I measured a lot of other sequences as well. I'll summarize them in this
table:

    sequence                  measured  decomposed  remark
 a) call, ret                    12     7+5         NO penalty, call-ret is special
 b) call, ret, nop               13     7+5+1       NO penalty
 c) call, nop, ret               14     7+3+4       penalty on nop
 d) call, nop, nop, ret          15     7+3+1+4     penalty on 1st nop
 e) call, reti                   15     7+8         penalty on reti, reti is NOT special
 f) call, nop, reti              16     7+3+6       penalty on nop
 g) call, pop hl                 12     7+5         NO penalty, call-pop is special
 h) call, nop, pop hl            14     7+3+4       penalty on nop
 i) call, nop, nop, pop hl       15     7+3+1+4     penalty on 1st nop
 j) call, pop ix                 14     7+7         penalty on "pop ix", different from "pop hl"!
 k) call, nop, pop ix            15     7+3+5       penalty on nop
 l) call, nop, nop, pop ix       16     7+3+1+5     penalty on 1st nop
 m) push hl, pop hl              11     6+5         no penalty,
 n) push hl, nop, pop hl         12     6+2+4         push-pop is not special
 o) push hl, ret                 11     6+5         no penalty,
 p) push hl, nop, ret            12     6+2+4         push-ret is not special
 q) rst, ret                     11     6+5         NO penalty, rst-ret is special
 r) rst, nop, ret                13     6+3+4       penalty on nop
 s) rst, nop, nop, ret           14     6+3+1+4     penalty on 1st nop
 t) rst, reti                    14     6+8         penalty on reti, reti is NOT special
 u) rst, nop, reti               15     6+3+6       penalty on nop
 v) rst, nop, nop, reti          16     6+3+1+6     penalty on 1st nop
 w) rst, jp(hl)                   9     6+3         penalty on jp(hl)
 x) rst, jp(ix)                  10     6+4         penalty on jp(ix)

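As a quick consistency check on the table, the Python sketch below
transcribes the "measured" and "decomposed" columns and verifies that every
decomposition adds up to its measured total. The data is copied from the
table; the variable and function names are only illustrative.

```python
# (row label, sequence, measured cycles, decomposition) copied from the table
TABLE = [
    ("a", "call, ret", 12, "7+5"),
    ("b", "call, ret, nop", 13, "7+5+1"),
    ("c", "call, nop, ret", 14, "7+3+4"),
    ("d", "call, nop, nop, ret", 15, "7+3+1+4"),
    ("e", "call, reti", 15, "7+8"),
    ("f", "call, nop, reti", 16, "7+3+6"),
    ("g", "call, pop hl", 12, "7+5"),
    ("h", "call, nop, pop hl", 14, "7+3+4"),
    ("i", "call, nop, nop, pop hl", 15, "7+3+1+4"),
    ("j", "call, pop ix", 14, "7+7"),
    ("k", "call, nop, pop ix", 15, "7+3+5"),
    ("l", "call, nop, nop, pop ix", 16, "7+3+1+5"),
    ("m", "push hl, pop hl", 11, "6+5"),
    ("n", "push hl, nop, pop hl", 12, "6+2+4"),
    ("o", "push hl, ret", 11, "6+5"),
    ("p", "push hl, nop, ret", 12, "6+2+4"),
    ("q", "rst, ret", 11, "6+5"),
    ("r", "rst, nop, ret", 13, "6+3+4"),
    ("s", "rst, nop, nop, ret", 14, "6+3+1+4"),
    ("t", "rst, reti", 14, "6+8"),
    ("u", "rst, nop, reti", 15, "6+3+6"),
    ("v", "rst, nop, nop, reti", 16, "6+3+1+6"),
    ("w", "rst, jp(hl)", 9, "6+3"),
    ("x", "rst, jp(ix)", 10, "6+4"),
]


def decomposition_sum(expr):
    """Add up a decomposition such as '7+3+1+4'."""
    return sum(int(part) for part in expr.split("+"))


# Collect the labels of rows whose decomposition does not match the total.
mismatches = [label for label, _seq, measured, decomposed in TABLE
              if decomposition_sum(decomposed) != measured]
print(mismatches)  # []
```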
Details:

a) For simplicity I've shown the decomposition 12 -> 7+5. In reality the
   situation is more complex. The full sequence in the test-program is
      out ; call ; ret ; call ; ret ; in
   When we take page-breaks into account, we get the following for the 4
   middle instructions (I'm using the notation from the "r800test.txt"
   document):
      fffWw FRr FffWw FRr + page-break
   So the 1st "call" instruction takes 6 cycles, the 1st "ret" takes 5
   cycles, the 2nd "call" takes 7, the 2nd "ret" takes 5 and the next "in"
   instruction has a page-break-penalty on fetch. If we simplify this
   (artificially attribute the page-break for "in" to the 1st "call") we
   can say "call" takes 7 and "ret" takes 5 cycles.

   Notice that there's an even number of cycles between the "out" and "in"
   operation. So this inserts an IO-penalty on the IN instruction (to align
   the R800 bus at 7MHz to the external cartridge bus at 3.5MHz, see
   "r800test.txt" for more details). In case of an empty sequence ("out"
   directly followed by "in"), we have zero cycles which is also even. So
   our calibration method (subtract 5 from the measured result) remains
   valid.

d) I'll give another example of the simplified decomposition. The full
   sequence is:
      (out) ; call ; nop ; nop ; ret ; call ; nop ; nop ; ret ; (in)
   The middle part (without out-in) decomposes to
      fffWw Fx    f     fRr   FffWw Fx    f     fRr   + page-break
      6     3     1     4     7     3     1     4     (+1)
   Here 'x' is an extra "call-penalty", see below for more details. So if
   we artificially move the page-break-penalty for "in" to the front we get
   7+3+1+4 for call+nop+nop+ret. This same simplification is made for all
   decompositions in the table.

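The cycle counts quoted for these access patterns can be reproduced with a
few lines of Python. This sketch assumes the reading of the notation that
the numbers above imply: a lowercase letter is a 1-cycle step, an uppercase
letter is the same access with a page-break and costs 2 cycles, and 'x' is
the single call-penalty cycle. The authoritative definition of the notation
is in "r800test.txt"; the function name is made up.

```python
def pattern_cycles(pattern):
    """Cycle count of one instruction written in the access notation.

    Assumption: lowercase letter = 1 cycle, uppercase letter = 2 cycles
    (access with a page-break).
    """
    total = 0
    for ch in pattern:
        if not ch.isalpha():
            raise ValueError("unexpected character: %r" % ch)
        total += 2 if ch.isupper() else 1
    return total


# Sequence a): call ; ret ; call ; ret
seq_a = "fffWw FRr FffWw FRr".split()
print([pattern_cycles(p) for p in seq_a])  # [6, 5, 7, 5]

# Sequence d): call ; nop ; nop ; ret, executed twice
seq_d = "fffWw Fx f fRr FffWw Fx f fRr".split()
print([pattern_cycles(p) for p in seq_d])  # [6, 3, 1, 4, 7, 3, 1, 4]

# Adding the page-break on the following 'in' gives twice the table value.
print(sum(pattern_cycles(p) for p in seq_a) + 1)  # 24 = 2 x 12
print(sum(pattern_cycles(p) for p in seq_d) + 1)  # 30 = 2 x 15
```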
c) Compared to 'a' you'd expect this sequence to take only 1 extra cycle
   (just like sequence b). Instead there's 1 more extra cycle on top of
   that. In general, when looking at more sequences and/or replacing 'nop'
   with other instructions with known duration (not shown in the table), we
   see that we always get an extra cycle whenever a call instruction is not
   _immediately_ followed by a "ret" or "pop" instruction. We'll call this
   a "call-penalty" cycle.

   We could attribute this call-penalty cycle to the call instruction
   itself, and thus say that the call instruction takes 7 or 8 cycles,
   depending on which instruction follows. For most practical purposes this
   explanation is good enough. But if you look in detail, the fetch of the
   next instruction has to start 7 (not 8) cycles after the start of the
   call instruction (otherwise we cannot know whether there should be a
   penalty or not). So this call-penalty really happens during the _next_
   instruction.

   It's likely that this call-penalty can be explained in a simpler way if
   you look at the fully pipelined implementation of the R800. _Maybe_ a
   call somehow/sometimes introduces a late pipeline stall. But because no
   details are known about the R800 pipeline, for now I'll stick to this
   weird sequential explanation that a "call" instruction somehow induces a
   stall in the next instruction.

e) In sequences 'a' and 'b' we saw that "call" immediately followed by
   "ret" has no call-penalty. This is _not_ the case for "reti" or "retn".

g) h) i) A "call" immediately followed by "pop" also has no penalty.

j) k) l) However, the "pop" has to be a single-byte instruction ("pop af",
   "pop bc", "pop de" or "pop hl"). "pop ix" and "pop iy" do have the
   call-penalty. This is similar to how a multi-byte return instruction
   (reti) also gets the penalty.

m) n) o) p) From a stack-usage point of view a "call" immediately followed
   by a "ret" or "pop" is similar to a "push" immediately followed by "ret"
   or "pop". However, the tests show there's no penalty for push-nop-pop
   compared to push-pop or for push-nop-ret compared to push-ret.

q) r) s) t) u) v) The "rst" instruction behaves similarly to "call", so
   there is a penalty on the next instruction except if that next
   instruction is a (single-byte) "ret" or "pop". The only difference is
   that the "rst" instruction itself takes 1 (only 1!) cycle less than a
   "call" instruction.

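The rules above can be collected into a small predictive model. The Python
sketch below is one reading of the table, not a description of the real
hardware. It assumes that the instruction directly after a call/rst/push
pays 1 extra cycle for a fetch page-break, and that after a call or rst it
pays 1 further call-penalty cycle unless it is a single-byte "ret" or
"pop". The base costs are the values the table shows for an instruction
that does not directly follow the leading instruction. All names are
hypothetical.

```python
# Cost of the leading instruction, in the simplified table decomposition.
LEAD_COST = {"call": 7, "rst": 6, "push hl": 6}

# Cost of an instruction that does NOT directly follow the lead.
BASE_COST = {"nop": 1, "ret": 4, "reti": 6, "pop hl": 4, "pop ix": 5,
             "jp(hl)": 1, "jp(ix)": 2}

# Single-byte instructions that escape the call-penalty.
PENALTY_FREE = {"ret", "pop af", "pop bc", "pop de", "pop hl"}


def predict_cycles(sequence):
    """Predict the table's 'measured' value for a list of instructions."""
    lead, *rest = sequence
    total = LEAD_COST[lead]
    for index, instr in enumerate(rest):
        cost = BASE_COST[instr]
        if index == 0:
            cost += 1  # assumed fetch page-break right after the lead
            if lead in ("call", "rst") and instr not in PENALTY_FREE:
                cost += 1  # the call-penalty cycle
        total += cost
    return total


print(predict_cycles(["call", "ret"]))         # 12, row a
print(predict_cycles(["call", "nop", "ret"]))  # 14, row c
print(predict_cycles(["call", "pop ix"]))      # 14, row j
print(predict_cycles(["rst", "jp(hl)"]))       # 9,  row w
```

Under these assumptions the model gives the measured value for every row of
the table, so it might serve as a starting point for the emulation, with the
caveat from (c) that the penalty really belongs to the instruction after the
call.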
TODO implement this stuff in openMSX.