r800-refresh.txt 13 KB

I was investigating whether the 'R' register behaves the same on Z80 and R800.
I initially wrote this program:

        di
        ld   hl,#0100
        xor  a
        ld   b,a
        ld   r,a
loop:   ld   a,r        ; (2)
        ld   (hl),a     ; (4)
        inc  hl         ; (1)
        djnz loop       ; (3)
        ei
        ret

On Z80, for every iteration of this loop, R is increased by 5 (+2 for 'ld a,r'
and +1 for each of the other 3 instructions). On R800, R is also increased by 5
each iteration, though sometimes it's increased by 6!

The increase by 6 instead of 5 happened every 18th or 19th (alternating)
iteration. One iteration takes 10 cycles (taking R800 page-breaks into
account), so that's on average every 185th clock cycle.

This number 185 is very close to the number of useful cycles in a full R800
refresh cycle. Currently (2010-08-16) R800 refresh in openMSX stalls the R800
for 26 cycles every 210 cycles (so that leaves 210-26=184 useful cycles).

So all this led me to the hypothesis that an R800 refresh cycle also increases
the R register by one.

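The arithmetic behind this comparison can be sketched in a few lines of Python.
This is only a check of the numbers quoted above; the function name is made up
for this illustration.

```python
def average_cycles_between_events(iteration_gaps, cycles_per_iteration):
    """Average number of clock cycles between two 'R increased by 6' events."""
    average_gap = sum(iteration_gaps) / len(iteration_gaps)
    return average_gap * cycles_per_iteration

# The extra increment showed up every 18th or 19th iteration (alternating),
# and one iteration of the first test loop takes 10 cycles.
observed = average_cycles_between_events([18, 19], 10)

# openMSX model at the time: stall 26 cycles out of every 210.
model_useful = 210 - 26

print(observed, model_useful)
```
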
This would be very useful because it would allow measuring in more detail how
refresh works exactly on R800 (the current model in openMSX (210/26) is not
100% correct). So that's what I'll do in the rest of this document.

New test program. This one records sequences longer than 256 samples, and it's
important that each iteration takes the same number of cycles, so I had to
replace the djnz instruction.

        di
        ld   hl,#0100
        xor  a
        ld   r,a
loop:   ld   a,r        ; (2)
        ld   (hl),a     ; (4)
        inc  hl         ; (1)
        ld   a,h        ; (1)
        cp   #c0        ; (2)
        jr   nz,loop    ; (3)
        [ some code to transform memory block #0100-#C000 into
          the differential of this block, so it's easy to see by
          how much R increased each iteration ]

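As an illustration of what "the differential of this block" means, here is a
minimal Python sketch. It assumes each entry is replaced by the 8-bit wrapped
difference with its successor (as an 8-bit 'sub' would produce); the function
name and the sample values are hypothetical.

```python
def difference_table(samples):
    """Return the 8-bit wrapped differences between consecutive R samples."""
    return [(b - a) & 0xFF for a, b in zip(samples, samples[1:])]

# Hypothetical samples: R advances by 7 per iteration and once by 8.
samples = [0x00, 0x07, 0x0E, 0x16, 0x1D]
print(" ".join(f"{d:02X}" for d in difference_table(samples)))

# The subtraction wraps around, so a sample pair like FC -> 03 still gives 7.
print(difference_table([0xFC, 0x03]))
```
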
Each iteration takes 13 cycles. Per iteration R is increased by either 7 or 8.
The difference table (a small portion) looked like this:

  07 07 07 07 07 07 07 07 07 07 07 07 07 08 07 07
  07 07 07 07 07 07 07 07 07 07 07 08 07 07 07 07
  07 07 07 07 07 07 07 07 07 08 07 07 07 07 07 07
  07 07 07 07 07 07 07 08 07 07 07 07 07 07 07 07
  07 07 07 07 07 07 08 07 07 07 07 07 07 07 07 07
  07 07 07 07 08 07 07 07 07 07 07 07 07 07 07 07
  07 07 08 07 07 07 07 07 07 07 07 07 07 07 07 07
  08 07 07 07 07 07 07 07 07 07 07 07 07 07 07 08
  07 07 07 07 07 07 07 07 07 07 07 07 07 07 08 07
  07 07 07 07 07 07 07 07 07 07 07 07 08 07 07 07
  ...

These tables always contain only two different numbers, in this case a series
of 7's followed by one 8, then again a series of 7's followed by one 8, and so
on. The actual number '7' or '8' is not important for timing, only the pattern
of these numbers in the table is important. So I'll use a more compact
notation for the table above:

  14 14 14 14 15 14 14 14 14 15 14 ...    (these are decimal numbers)

or even

  4*14 15 ...

This means there are 4 repetitions of '13*0x07 + 1*0x08' (length 14) followed
by one time '14*0x07 + 1*0x08' (length 15).

So this notation indicates how many iterations there are per refresh (counting
the iteration in which the refresh itself shows up).

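The conversion from a difference table to this compact notation can be sketched
as follows. This is an illustrative Python version, not the procedure actually
used for the measurements; it assumes the table contains exactly two values and
that the larger one marks a refresh. The function names are made up.

```python
from itertools import groupby

def refresh_run_lengths(diffs):
    """Lengths of the runs that end with the larger ('refresh') value."""
    refresh_value = max(diffs)
    lengths, run = [], 0
    for d in diffs:
        run += 1
        if d == refresh_value:
            lengths.append(run)
            run = 0
    return lengths  # a trailing, incomplete run is dropped

def compact(lengths):
    """Collapse repeated lengths, e.g. [14, 14, 15] -> '2*14 15'."""
    parts = []
    for value, group in groupby(lengths):
        n = len(list(group))
        parts.append(f"{n}*{value}" if n > 1 else str(value))
    return " ".join(parts)

# Hypothetical table: 13 sevens + 8, 13 sevens + 8, 14 sevens + 8, then a tail.
diffs = [7] * 13 + [8] + [7] * 13 + [8] + [7] * 14 + [8] + [7] * 5
lengths = refresh_run_lengths(diffs)
print(lengths)
print(compact(lengths))
```
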
For this test there are on average '(4 * 14 + 1 * 15) / 5 = 14.2' iterations
between two refresh cycles. Each iteration takes 13 cycles, so that's
14.2 * 13 = 184.6 useful clock cycles between refresh cycles.

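The same calculation, written as a small Python helper that reads the compact
notation directly. The helper name and the parsing rules ('N*L' means N runs of
length L, a bare 'L' means one run) are this sketch's own reading of the
notation.

```python
from fractions import Fraction

def useful_cycles(pattern, cycles_per_iteration):
    """Average useful cycles between refreshes for a pattern like '4*14 15'."""
    total_iterations = 0
    total_runs = 0
    for token in pattern.split():
        if "*" in token:
            count, length = (int(x) for x in token.split("*"))
        else:
            count, length = 1, int(token)
        total_iterations += count * length
        total_runs += count
    # Exact fraction, so no rounding creeps in before the final conversion.
    return Fraction(total_iterations, total_runs) * cycles_per_iteration

print(float(useful_cycles("4*14 15", 13)))
```
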
Next I realized this test program could be sped up by one cycle:

        ld   c,#c0
loop:   ld   a,r        ; (2)
        ld   (hl),a     ; (4)
        inc  hl         ; (1)
        ld   a,h        ; (1)
        cp   c          ; (1)
        jr   nz,loop    ; (3)

An iteration now takes 12 cycles (R is still increased by 7 or 8).
The difference table looks like this:

  16 3*15 16 2*15 16 2*15    (this sequence repeats all the time)

That's (3*16+7*15)/10 = 15.3 iterations
or 15.3*12 = 183.6 clock cycles between refreshes.

Then I started inserting extra instructions in this test program:

        ld   c,#c0
loop:   ld   a,r        ; (2)
        ld   (hl),a     ; (4)
        [***]           ; extra instruction(s) go here, see below
        inc  hl         ; (1)
        ld   a,h        ; (1)
        cp   c          ; (1)
        jr   nz,loop    ; (3)

* NOP
    13 cycles per iteration (R increases 8 or 9)
    pattern: 8*14 15
    cycles between refresh: (8*14+15)/9*13 = 183.44

* 2 x NOP
    14 cycles/iteration (R incr 9 or 10)
    pattern: 5*13 14
    cycles/refresh = 184.33

    I noticed the start of the difference table was a bit different, it went
    like this:
       9*13 14 7*13 14 5*13 14 5*13 14 5*13 14
    So it took some time before it stabilized on the '5*13 14' pattern. For
    the rest of the tests I didn't look at the start of the table anymore. I
    only searched for the 'stable' pattern.

    I ran this same test again, but now the (stable) pattern was
       9*13 14
    that gives
       cycles/refresh = 183.4

    This is very interesting behaviour: depending on some (yet unknown)
    initial conditions, _the_whole_test_ runs slightly faster or slower. This
    is interesting because it may give a clue as to why there is considerable
    variation in *some* speed measurements on a real R800 (see
    doc/r800-test.txt) while for other tests the results are much more stable.
    (Also, in the current R800 openMSX emulation, the speed measurements are
    always stable.)

    I didn't repeat the previous tests. Maybe they also show different
    patterns on different runs. I did repeat all later tests (there always
    seem to be only 1 or 2 different stable patterns).

* NEG (like 2 x NOP also takes 2 cycles)
    14 cycles/iteration (R incr 9 or 10)
    pattern: 7*13 14
    cycles/refresh = 183.75

* 3 x NOP
    15 cycles/iteration (R incr 10 or 11)
    pattern: 2*12 13
             3*12 13
    cycles/refresh = 185
                     183.75

* IM 1 (3 cycles, like 3 x NOP)
    15 cycles/iteration (R incr 9 or 10)
    pattern: same as 3 x NOP

* 4 x NOP
    16 cycles/iteration (R incr 11 or 12)
    pattern: 11 12
    cycles/refresh = 184

* LD (HL),A (4 cycles (with page-breaks))
    16 cycles/iteration (R incr 8 or 9)
    pattern: same as 4 x NOP

* 5 x NOP
    17 cycles/iteration (R incr 12 or 13)
    pattern: 4*11 10
             7*11 10 6*11 10
    cycles/refresh = 183.6
                     184.73

* BIT 0,(HL) (5 cycles)
    17 cycles/iteration (R incr 9 or 10)
    pattern: 5*11 10 4*11 10
             6*11 10
    cycles/refresh = 183.90
                     184.57

I still need to do more experiments, but some *guesses* so far: the useful
number of cycles always seems to be within (183, 185]. That's a variation of
more than one clock cycle. Maybe refresh also waits for an even clock cycle
(just like IO does). This could explain why there are two stable patterns for
some tests (in one case you have to insert an extra cycle to align to an even
number of cycles, in the other case you're already aligned). It could also
explain why there can be variation in speed between different runs of the same
test. And finally it would explain why the documentation (e.g. atoc or even
the turbo R datapack) talks about half clock cycles for the duration of the
refresh (in 50% of the cases it needs to add one cycle). Those docs talk about
a refresh of 21.5 cycles every 222 cycles. Those numbers seem wrong to me (they
don't match measurements on real hardware), but at least I now have an idea
about that half clock cycle.

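The (183, 185] claim can be checked against every stable pattern listed above.
The following Python sketch does that; the pattern strings and cycle counts are
copied from the test results in this document, while the helper and the list
layout are only illustrative.

```python
from fractions import Fraction

# (stable pattern, cycles per iteration), as listed in the tests above.
MEASUREMENTS = [
    ("4*14 15", 13), ("16 3*15 16 2*15 16 2*15", 12),
    ("8*14 15", 13), ("5*13 14", 14), ("9*13 14", 14), ("7*13 14", 14),
    ("2*12 13", 15), ("3*12 13", 15), ("11 12", 16),
    ("4*11 10", 17), ("7*11 10 6*11 10", 17),
    ("5*11 10 4*11 10", 17), ("6*11 10", 17),
]

def useful_cycles(pattern, cycles_per_iteration):
    """Average useful cycles between refreshes for one pattern."""
    runs = []
    for token in pattern.split():
        count, _, length = token.rpartition("*")  # '14' -> ('', '', '14')
        runs += [int(length)] * int(count or 1)
    return Fraction(sum(runs), len(runs)) * cycles_per_iteration

values = [useful_cycles(p, c) for p, c in MEASUREMENTS]
lowest, highest = min(values), max(values)
print(float(lowest), float(highest), float(highest - lowest))
print(all(183 < v <= 185 for v in values))
```

The spread between the lowest and highest value is larger than one cycle, which
is what motivates the "waits for an even clock cycle" guess.
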
---------

I implemented the above guess (refresh waits for an even clock cycle) in
openMSX revision 11643. I can now more or less reproduce the results above: the
number of useful clock cycles per refresh is also in the range (183, 185], but
the number per test is not the same as above (the 'stable patterns' differ
from the ones in the tests above). Also, in openMSX there's always only one
stable pattern, while on a real R800, for some tests, there clearly were two
different possible patterns.

New test: above we measured the number of useful clock cycles per 'refresh
cycle'. Now we're going to measure how many cycles the refresh itself takes.

The idea goes like this: by observing the R register we can detect that there
was a refresh, so by doing a longer test we can count how many refresh cycles
there were. We can combine this with measuring the total time of the test
(using the E6-timer). We can also calculate how many 'useful' cycles there
were in the whole test. So the difference between actual and useful cycles
must be the cycles spent on refresh. And finally, if we divide that by the
counted number of refreshes, we should know how many cycles a single refresh
takes.

Full test program:

        org  #c000

        di
        ld   hl,#0100
        ld   c,#bc
        ld   e,3
l1      ld   a,r        ; This loop waits till there was a refresh.
        ld   d,a        ; It's an attempt to get more stable measurements.
        ld   a,r        ; Note that this loop will not terminate in Z80
        sub  d          ; mode.
        sub  e
        jr   z,l1

        out  (#e6),a    ; reset the E6-timer

loop    ld   a,r        ; (2)   Actual test loop, same as in tests above
        ld   (hl),a     ; (4)   One iteration takes 12 cycles
        inc  hl         ; (1)
        ;[***]          ;       extra instructions here (see below)
        ld   a,h        ; (1)
        cp   c          ; (1)
        jr   nz,loop    ; (3/2) 3 cycles when jump is taken, 2 otherwise

        in   a,(#e6)
        ld   l,a
        in   a,(#e7)
        ld   h,a
        ld   (#be00),hl ; store total duration of the test

        ld   hl,#0100   ; Next we do some post-processing on the data:
l2      inc  hl         ; First calculate the difference table, same
        ld   a,(hl)     ; routine as in tests above (though not shown
        dec  hl         ; there).
        sub  (hl)
        ld   (hl),a
        inc  hl
        ld   a,h
        cp   c
        jr   nz,l2

        ld   hl,#bc00   ; clear the histogram area #bc00-#bdff
        ld   de,#bc00+1
        ld   bc,#200-1
        ld   (hl),0
        ldir

        ld   de,#100    ; Next we calculate a histogram.
        ld   hl,#bc00   ; We expect this histogram to be all zero, except for
l3      ld   a,(de)     ; two entries: the iterations with and without a
        ld   l,a        ; refresh cycle.
        inc  (hl)       ; There is a single other point in this histogram,
        jr   nz,nc      ; that's because we don't calculate the difference of
        inc  h          ; the very last iteration correctly (we'd have to
        inc  (hl)       ; sample the R register one more time outside the
        dec  h          ; loop). But for now we ignore this outlier.
nc      inc  de
        ld   a,d
        cp   h
        jr   nz,l3

        ei
        ret

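For readers who prefer a high-level view of the histogram step, here is a
minimal Python sketch of the same idea. It is not a translation of the routine
above: it assumes the difference table is already available as a list, and that
the stray outlier is rarer than both real values. The function name and sample
data are hypothetical.

```python
from collections import Counter

def count_refreshes(diffs):
    """Number of iterations in which R advanced by the larger common value."""
    histogram = Counter(diffs)
    # Keep the two most frequent values; a rare outlier is ignored this way.
    (a, count_a), (b, count_b) = histogram.most_common(2)
    return count_a if a > b else count_b

# Hypothetical table: two refreshes (0x08) plus one bogus last entry (0x66).
diffs = [7] * 13 + [8] + [7] * 14 + [8] + [0x66]
print(Counter(diffs))
print(count_refreshes(diffs))
```
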
Test results:

* no extra instructions:
    real machine:
      histogram: 0x07: 0xAECE
                 0x08: 0x0C31
                 0x66: 0x0001  -> belongs to 0x07, can be seen by the pattern
                                  in the difference table, I won't show this
                                  entry anymore in the next results
      E6-ticks: 0x5B75
      total cycles:    0x5B75 E6-ticks * 28 cycles/E6-tick = 655564 cycles
      useful cycles:   0xBB00 iterations * 12 cycles/iteration = 574464 cycles
      overhead cycles: 655564 - 574464 = 81100 cycles
      cycles per refresh = 81100 cycles / 0xC31 refreshes
                         = 25.985 cycles/refresh
    openMSX revision 11648:
      histogram: 0x07: 0xAECD+1
                 0x08: 0x0C32
      E6-ticks: 0x5B78
      --> 26.004 cycles/refresh

* extra instruction: NOP (1 cycle, increases R by 1)
    real machine:
      histogram: 0x8: 0xADCA
                 0x9: 0x0D36
      E6-ticks: 0x6318
      --> 26.011 cycles/refresh
    openMSX revision 11648:
      histogram: 0x8: 0xADC8
                 0x9: 0x0D38
      E6-ticks: 0x632A
      --> 26.144 cycles/refresh

* extra instruction: IM 1 (3 cycles, incr R 2)
    real machine:
      histogram: 0x9: 0xABCA
                 0xA: 0x0F36
      E6-ticks: 0x721B
      --> 25.636 cycles/refresh
    openMSX revision 11648:
      histogram: 0x9: 0xABBC
                 0xA: 0x0F44
      E6-ticks: 0x727E
      --> 26.25 cycles/refresh
    !! Big difference between openMSX and real MSX !!

* extra instructions: EXX ; MULUW HL,BC ; EXX (1+36+1 cycles, incr R 1+2+1)
    real machine:
      histogram: 0xB: 0x882D
                 0xC: 0x32D3
      E6-ticks: 0x17D33  (of course I couldn't measure how many times the
                          timer overflowed, but it must be one time)
      --> 26.042 cycles/refresh
    openMSX revision 11648:
      histogram: 0xB: 0x8801
                 0xC: 0x32FF
      E6-ticks: 0x17E80
      --> 26.669 cycles/refresh
    !! Big difference between openMSX and real MSX !!

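The worked calculation for the first result applies unchanged to all the
others. The Python sketch below redoes it for every entry, using only the
numbers listed above (28 cycles per E6-tick, 0xBB00 iterations, 12 cycles for
the bare loop plus the cycles of the extra instructions). The function name and
table layout are this sketch's own.

```python
E6_CYCLES_PER_TICK = 28   # value used in the calculation above
ITERATIONS = 0xBB00       # loop runs from #0100 up to #BC00

def cycles_per_refresh(e6_ticks, cycles_per_iteration, refreshes):
    """Overhead cycles (total minus useful) divided by the refresh count."""
    total = e6_ticks * E6_CYCLES_PER_TICK
    useful = ITERATIONS * cycles_per_iteration
    return (total - useful) / refreshes

RESULTS = [
    # label,            E6-ticks, cycles/iteration, refresh count
    ("none,  real",     0x5B75,  12,      0x0C31),
    ("none,  openMSX",  0x5B78,  12,      0x0C32),
    ("NOP,   real",     0x6318,  12 + 1,  0x0D36),
    ("NOP,   openMSX",  0x632A,  12 + 1,  0x0D38),
    ("IM 1,  real",     0x721B,  12 + 3,  0x0F36),
    ("IM 1,  openMSX",  0x727E,  12 + 3,  0x0F44),
    ("MULUW, real",     0x17D33, 12 + 38, 0x32D3),
    ("MULUW, openMSX",  0x17E80, 12 + 38, 0x32FF),
]

for label, ticks, cpi, refreshes in RESULTS:
    print(f"{label:16s} {cycles_per_refresh(ticks, cpi, refreshes):.3f}")
```

Like the hand calculation, this ignores the few cycles spent outside the loop
(timer reads, the final non-taken jump).
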
Preliminary conclusion:
  Refresh seems to take about 26 cycles, which is also the value we currently
  use in openMSX. However, especially for the 'IM 1' case, there's still a
  relatively big difference between openMSX and the real hardware. This needs
  more experiments.

One other difference between Z80 and R800:

        di
        xor  a
        ld   r,a
        ld   a,r
        ld   c,a        ; Z80: c=2   R800: c=1
        ld   a,r
        ld   b,a        ; Z80: b=5   R800: b=4
        ld   a,r        ; Z80: a=8   R800: a=7

And of course once in a while the numbers for R800 are (partially) increased by
one.

This difference can possibly be explained by a difference in the order of
storing the result to the 'R' register and increasing the 'R' register on each
M1 cycle.

  308. This difference can possibly be explained by a difference in the order of
  309. storing the result to the 'R' register and increasing the 'R' register on each
  310. M1 cycle.