perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] -- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

The tool is based on x86's load latency and precise store facility events
provided by Intel CPUs. These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for the cachelines with the highest contention, that is, the highest number of
HITM accesses (loads that hit a cacheline modified in another cache).

The basic workflow with this tool follows the standard record/report phases.
The user runs the record command to record event data and the report command
to display it.
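
Because the analysis is done per cacheline, each sampled memory address is
effectively split into a cacheline base address and an offset within that line.
The following is a minimal sketch of that split, not code taken from perf. It
assumes 64-byte cachelines, and the helper name `split_address` is hypothetical.

```python
# Assumption: 64-byte cachelines, which is typical for x86 CPUs.
CACHELINE_SIZE = 64


def split_address(addr, line_size=CACHELINE_SIZE):
    """Return (cacheline base address, offset within the cacheline).

    Hypothetical helper for illustration; line_size must be a power of two.
    """
    base = addr & ~(line_size - 1)  # clear the low bits that select the byte
    return base, addr - base


# Two different addresses that land in the same cacheline are candidates
# for contention when different CPUs write to them.
first = split_address(0x7F0010C8)
second = split_address(0x7F0010F0)
print(hex(first[0]), first[1])
print(hex(second[0]), second[1])
```

Two variables that share a base address but have different offsets are the
typical false sharing pattern that the report helps to locate.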

RECORD OPTIONS
--------------
-e::
--event=::
	Select the PMU event. Use 'perf mem record -e list'
	to list available events.

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-l::
--ldlat::
	Configure mem-loads latency.

-k::
--all-kernel::
	Configure all used events to run in kernel space.

-u::
--all-user::
	Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
	vmlinux pathname

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-i::
--input::
	Specify the input file to process.

-N::
--node-info::
	Show extra node info in report (see NODE INFO section)

-c::
--coalesce::
	Specify sorting fields for single cacheline display.
	The following fields are available: tid,pid,iaddr,dso
	(see COALESCE)

-g::
--call-graph::
	Set up callchain parameters.
	Please refer to the perf-report man page for details.

--stdio::
	Force the stdio output (see STDIO OUTPUT)

--stats::
	Display only statistic tables and force stdio mode.

--full-symbols::
	Display full length of symbols.

--no-source::
	Do not display Source:Line column.

--show-all::
	Show all captured HITM lines, with no regard to HITM % 0.0005 limit.

-f::
--force::
	Don't do ownership validation.

-d::
--display::
	Select the HITM type (rmt, lcl) to display and sort on. Defaults to total HITMs.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

The user can pass any 'perf record' option after the '--' mark, for example
(to enable callchains and system wide monitoring):

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.
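
To make the relationship between 'perf c2c record' and 'perf record' concrete,
the sketch below assembles a 'perf record' argument list from the defaults
listed above. It only illustrates the description in this section and is not
how perf builds its command line. The function `build_record_argv` and the
ordering of the arguments are assumptions.

```python
# Defaults as listed in this section.
DEFAULT_OPTIONS = ["-W", "-d", "--sample-cpu"]
DEFAULT_EVENTS = ["cpu/mem-loads,ldlat=30/P", "cpu/mem-stores/P"]


def build_record_argv(events=None, record_options=(), command=()):
    """Hypothetical helper: compose the equivalent 'perf record' argv.

    events         -- events given with '-e'; None selects the defaults
    record_options -- options the user passed after the '--' mark
    command        -- the workload to run, if any
    """
    argv = ["perf", "record"] + DEFAULT_OPTIONS
    for event in (events or DEFAULT_EVENTS):
        argv += ["-e", event]
    argv += list(record_options)  # passed through unchanged
    argv += list(command)
    return argv


# Roughly what '$ perf c2c record -- -g -a' asks for under these assumptions.
print(" ".join(build_record_argv(record_options=["-g", "-a"])))
```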

C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data
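
The workflow steps above can be sketched on a handful of made-up samples. This
is an illustration under stated assumptions, not the perf implementation: the
sample layout (dicts with 'addr' and 'hitm' keys), the 64-byte line size and
the function `rank_cachelines` are all hypothetical, and the "user settings"
are reduced to sorting by HITM count.

```python
from collections import defaultdict


def rank_cachelines(samples, line_size=64):
    """Group samples by cacheline and rank the cachelines by HITM count.

    samples -- iterable of dicts with 'addr' (int) and 'hitm' (bool)
    Returns a list of (cacheline base address, samples) pairs.
    """
    lines = defaultdict(list)
    for sample in samples:
        # Steps 1 and 2: bucket by cacheline address, keep access details.
        lines[sample["addr"] & ~(line_size - 1)].append(sample)

    def hitm_count(item):
        return sum(1 for sample in item[1] if sample["hitm"])

    # Step 3: most contended cachelines first.
    return sorted(lines.items(), key=hitm_count, reverse=True)


samples = [
    {"addr": 0x1008, "hitm": True},
    {"addr": 0x2000, "hitm": False},
    {"addr": 0x1030, "hitm": True},
    {"addr": 0x2010, "hitm": True},
]

# Step 4: display.
for index, (base, accesses) in enumerate(rank_cachelines(samples)):
    print(index, hex(base), len(accesses))
```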

In general, the perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data:
(Both stdio and TUI modes follow the same fields output)

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Total records
  - sum of all cacheline accesses

  Rmt/Lcl Hitm
  - cacheline percentage of all Remote/Local HITM accesses

  LLC Load Hitm - Total, Lcl, Rmt
  - count of Total/Local/Remote load HITMs

  Store Reference - Total, L1Hit, L1Miss
    Total - all store accesses
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1

  Load Dram
  - count of local and remote DRAM accesses

  LLC Ld Miss
  - count of all accesses that missed LLC

  Total Loads
  - sum of all load accesses

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - Llc, Rmt
  - count of LLC and Remote load hits

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl
  - % of Remote/Local HITM accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss
  - % of store accesses that hit/missed L1 for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the thread responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)
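
As a worked example of the per-offset 'HITM - Rmt, Lcl' columns, the sketch
below computes each offset's share of one cacheline's HITM accesses. It assumes
the percentage is taken relative to the cacheline's own total for that HITM
type. The input layout and the name `offset_hitm_percent` are hypothetical.

```python
from collections import Counter


def offset_hitm_percent(accesses, kind):
    """Share of a cacheline's HITMs of the given kind, per offset.

    accesses -- list of (offset, hitm_kind) pairs for ONE cacheline, where
                hitm_kind is "rmt", "lcl" or None for a non-HITM access
    kind     -- "rmt" or "lcl"
    """
    counts = Counter(offset for offset, k in accesses if k == kind)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {offset: 100.0 * n / total for offset, n in counts.items()}


accesses = [
    (0x08, "rmt"), (0x08, "rmt"), (0x08, "rmt"),
    (0x30, "rmt"), (0x30, "lcl"), (0x08, None),
]
print(offset_hitm_percent(accesses, "rmt"))
print(offset_hitm_percent(accesses, "lcl"))
```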

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores}
  - node IDs with a list of affected CPUs, in the following format:
      Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.
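
Coalescing can be thought of as grouping the accesses to a cacheline offset by
the chosen fields, so that accesses with equal values in those fields collapse
into one output row. The sketch below shows that idea on made-up records. It is
not the perf implementation, and the record layout and the function `coalesce`
are hypothetical.

```python
from collections import defaultdict

# Field names accepted by the -c/--coalesce option, as listed above.
FIELDS = ("tid", "pid", "iaddr", "dso")


def coalesce(accesses, spec="pid,iaddr"):
    """Count accesses per unique combination of the fields named in spec.

    accesses -- iterable of dicts holding a value for every name in FIELDS
    spec     -- comma separated field list, default as in this section
    """
    keys = spec.split(",")
    unknown = [key for key in keys if key not in FIELDS]
    if unknown:
        raise ValueError("unknown coalesce field(s): " + ",".join(unknown))
    groups = defaultdict(int)
    for access in accesses:
        groups[tuple(access[key] for key in keys)] += 1
    return dict(groups)


accesses = [
    {"tid": 11, "pid": 10, "iaddr": 0x400A10, "dso": "app"},
    {"tid": 12, "pid": 10, "iaddr": 0x400A10, "dso": "app"},
    {"tid": 12, "pid": 10, "iaddr": 0x400B20, "dso": "app"},
]
print(coalesce(accesses))          # default 'pid,iaddr'
print(coalesce(accesses, "tid"))   # one row per thread
```

With the default 'pid,iaddr' the two threads that execute the same code
address are merged into one row, while 'tid' keeps them apart.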

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:

  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on c2c tool for detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1]