'\" t
.\" Title: perf-amd-ibs
.\" Author: [FIXME: author] [see http://www.docbook.org/tdg5/en/html/author]
.\" Generator: DocBook XSL Stylesheets vsnapshot
.\" Date: 2024-06-20
.\" Manual: perf Manual
.\" Source: perf
.\" Language: English
.\"
.TH "PERF\-AMD\-IBS" "1" "2024\-06\-20" "perf" "perf Manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
perf-amd-ibs \- Support for AMD Instruction\-Based Sampling (IBS) with perf tool
.SH "SYNOPSIS"
.sp
.nf
\fIperf record\fR \-e ibs_op//
\fIperf record\fR \-e ibs_fetch//
.fi
.SH "DESCRIPTION"
.sp
Instruction\-Based Sampling (IBS) provides precise Instruction Pointer (IP) profiling support on AMD platforms\&. IBS has two independent components: IBS Op and IBS Fetch\&. IBS Op sampling provides information about instruction execution (micro\-op execution to be precise) with details like d\-cache hit/miss, d\-TLB hit/miss, cache miss latency, load/store data source, branch behavior etc\&. IBS Fetch sampling provides information about instruction fetch with details like i\-cache hit/miss, i\-TLB hit/miss, fetch latency etc\&. IBS is per\-smt\-thread i\&.e\&.
each SMT hardware thread contains standalone IBS units\&.
.sp
Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used through the Linux perf utility\&. The following sysfs directories are created at boot time if IBS is supported by the hardware and the kernel\&.
.sp
.if n \{\
.RS 4
.\}
.nf
/sys/bus/event_source/devices/ibs_op/
/sys/bus/event_source/devices/ibs_fetch/
.fi
.if n \{\
.RE
.\}
.sp
IBS Op PMU supports two events: cycles and micro ops\&. IBS Fetch PMU supports one event: fetch ops\&.
.sp
IBS PMUs do not have user/kernel filtering capability and thus require the CAP_SYS_ADMIN or CAP_PERFMON privilege\&.
.SH "IBS VS\&. REGULAR CORE PMU"
.sp
IBS gives samples with precise IP, i\&.e\&. the IP recorded with an IBS sample has no skid\&. The IP recorded by a regular core PMU, in contrast, has some skid (the sample was generated at IP X but perf records it at IP X+n)\&. Hence, the regular core PMU might not help for profiling with instruction\-level precision\&. Further, IBS provides additional information about the sample in question\&. On the other hand, the regular core PMU has its own advantages, such as a plethora of events, counting mode (less interference), up to 6 parallel counters, event grouping support, filtering capabilities etc\&.
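.sp
Whether IBS is an option at all depends on the PMU directories listed in DESCRIPTION being present\&. The following is a minimal Python sketch of that check, not something perf itself ships: the helper name available_ibs_pmus and its sysfs_root parameter are assumptions made for illustration\&.
.sp
.nf
```python
import os

# PMU directory names taken from the DESCRIPTION section.
IBS_PMUS = ("ibs_op", "ibs_fetch")


def available_ibs_pmus(sysfs_root="/sys/bus/event_source/devices"):
    """Return the IBS PMU names that have a directory under sysfs_root.

    Hypothetical helper: sysfs_root is a parameter only so that the
    check can be pointed at a different directory for testing.
    """
    return [pmu for pmu in IBS_PMUS
            if os.path.isdir(os.path.join(sysfs_root, pmu))]


if __name__ == "__main__":
    found = available_ibs_pmus()
    print("IBS PMUs present:", ", ".join(found) if found else "none")
```
.fi
.sp
An empty result means neither directory exists, in which case the ibs_op// and ibs_fetch// events are not available and only regular core PMU events remain\&.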
.sp
Three regular core PMU events are internally forwarded to IBS Op PMU when the precise_ip attribute is set:
.sp
.if n \{\
.RS 4
.\}
.nf
\-e cpu\-cycles:p becomes \-e ibs_op//
\-e r076:p becomes \-e ibs_op//
\-e r0C1:p becomes \-e ibs_op/cnt_ctl=1/
.fi
.if n \{\
.RE
.\}
.SH "EXAMPLES"
.SS "IBS Op PMU"
.sp
System\-wide profile, cycles event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op// \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
Per\-cpu profile (cpu10), cycles event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op// \-c 100000 \-C 10
.fi
.if n \{\
.RE
.\}
.sp
Per\-cpu profile (cpu10), cycles event, sampling freq: 1000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op// \-F 1000 \-C 10
.fi
.if n \{\
.RE
.\}
.sp
System\-wide profile, uOps event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1/ \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
Same command, but also capture IBS register raw dump along with perf sample:
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1/ \-c 100000 \-a \-\-raw\-samples
.fi
.if n \{\
.RE
.\}
.sp
System\-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1,l3missonly=1/ \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
Per process (upstream v6\&.2 onward), attaching to existing PID 1234, uOps event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1/ \-c 100000 \-p 1234
.fi
.if n \{\
.RE
.\}
.sp
Per process (upstream v6\&.2 onward), launching ls under perf, uOps event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1/ \-c 100000 \-\- ls
.fi
.if n \{\
.RE
.\}
.sp
To analyse the recorded profile in aggregate mode
.sp
.if n \{\
.RS 4
.\}
.nf
# perf report
/* Select a line and press \*(Aqa\*(Aq to drill down at instruction level\&. */
.fi
.if n \{\
.RE
.\}
.sp
To go over each sample
.sp
.if n \{\
.RS 4
.\}
.nf
# perf script
.fi
.if n \{\
.RE
.\}
.sp
Raw dump of IBS registers when profiled with \-\-raw\-samples
.sp
.if n \{\
.RS 4
.\}
.nf
# perf report \-D
/* Look for PERF_RECORD_SAMPLE */
.fi
.if n \{\
.RE
.\}
.sp
.if n \{\
.RS 4
.\}
.nf
Example register raw dump:
.fi
.if n \{\
.RE
.\}
.sp
.if n \{\
.RS 4
.\}
.nf
ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
        Val 1 CntCtl 0=cycles CurCnt 707
IbsOpRip:       ffffffff8204aea7
ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
        BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
        DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
        DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
        DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
        DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
        OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
IbsDCLinAd:     ff110008a5398920
IbsDCPhysAd:    00000008a5398920
.fi
.if n \{\
.RE
.\}
.sp
IBS applied in a real\-world use case
.sp
.if n \{\
.RS 4
.\}
.nf
A ~90% regression was observed in tbench with a specific scheduler hint,
which was counterintuitive\&. IBS profiles of the good and bad runs captured
using perf helped in identifying the exact cause of the problem:
.fi
.if n \{\
.RE
.\}
.sp
.if n \{\
.RS 4
.\}
.nf
https://lore\&.kernel\&.org/r/20220921063638\&.2489\-1\-kprateek\&.nayak@amd\&.com
.fi
.if n \{\
.RE
.\}
.SS "IBS Fetch PMU"
.sp
Similar commands can be used with Fetch PMU as well\&.
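.sp
The Op and Fetch command lines differ only in the PMU name and the config terms placed between the slashes (cnt_ctl and l3missonly for Op, rand_en for Fetch)\&. As a minimal sketch of that pattern, the Python helper below assembles such a command line\&. The name ibs_record_argv and its parameters are hypothetical and not part of perf; only the perf options already used in this page appear in its output\&.
.sp
.nf
```python
import shlex

IBS_PMUS = ("ibs_op", "ibs_fetch")


def ibs_record_argv(pmu, period, terms=None, target=("-a",)):
    """Build a perf record argument list for an IBS event.

    Hypothetical helper for illustration. terms maps config term names
    (for example cnt_ctl or rand_en) to values; target holds the
    options that select what to profile (system-wide by default).
    """
    if pmu not in IBS_PMUS:
        raise ValueError(f"unknown IBS PMU: {pmu}")
    # Config terms go between the two slashes, separated by commas.
    term_str = ",".join(f"{key}={value}" for key, value in (terms or {}).items())
    return ["perf", "record", "-e", f"{pmu}/{term_str}/",
            "-c", str(period), *target]


if __name__ == "__main__":
    print(shlex.join(ibs_record_argv("ibs_op", 100000, {"cnt_ctl": 1})))
    print(shlex.join(ibs_record_argv("ibs_fetch", 100000, {"rand_en": 1})))
```
.fi
.sp
Returning a list rather than a single string keeps each argument separate, so the result can be handed to a process launcher without further quoting\&.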
.sp
System\-wide profile, fetch ops event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_fetch// \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
System\-wide profile, fetch ops event, sampling period: 100000, Random enable
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_fetch/rand_en=1/ \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
.if n \{\
.RS 4
.\}
.nf
Random enable adds a small degree of variability to the sample period\&.
This helps in cases like long\-running loops where the PMU keeps tagging
the same instruction over and over because of a fixed sample period\&.
.fi
.if n \{\
.RE
.\}
.sp
etc\&.
.SH "PERF MEM AND PERF C2C"
.sp
perf mem is a memory access profiler tool and perf c2c is a shared data cacheline analyser tool\&. Both of them internally use IBS Op PMU on AMD\&. Below is a simple example of the perf mem tool\&.
.sp
.if n \{\
.RS 4
.\}
.nf
# perf mem record \-c 100000 \-\- make
# perf mem report
.fi
.if n \{\
.RE
.\}
.sp
A normal perf mem report output will provide a detailed memory access profile\&. However, it can also be aggregated based on output fields\&. For example:
.sp
.if n \{\
.RS 4
.\}
.nf
# perf mem report \-F mem,sample,snoop
Samples: 3M of event \*(Aqibs_op//\*(Aq, Event count (approx\&.): 23524876
Memory access                                 Samples  Snoop
N/A                                           1903343  N/A
L1 hit                                        1056754  N/A
L2 hit                                          75231  N/A
L3 hit                                           9496  HitM
L3 hit                                           2270  N/A
RAM hit                                          8710  N/A
Remote node, same socket RAM hit                 3241  N/A
Remote core, same node Any cache hit             1572  HitM
Remote core, same node Any cache hit              514  N/A
Remote node, same socket Any cache hit           1216  HitM
Remote node, same socket Any cache hit            350  N/A
Uncached hit                                       18  N/A
.fi
.if n \{\
.RE
.\}
.sp
Please refer to their man pages for more details\&.
.SH "SEE ALSO"
.sp
\fBperf-record\fR(1), \fBperf-script\fR(1), \fBperf-report\fR(1), \fBperf-mem\fR(1), \fBperf-c2c\fR(1)
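.SH "NOTES"
.sp
The ibs_op_ctl line in the example register raw dump under EXAMPLES can be reproduced by hand from the raw 64\-bit value\&. The Python sketch below does this under an assumption that should be checked against the AMD documentation for the target processor: the bit positions follow the ibs_op_ctl layout in the Linux kernel headers as understood here (max count in bits 0\-15 with an extension in bits 20\-26, L3MissOnly, En, Val and CntCtl in bits 16\-19, current count in bits 32\-58)\&. The function name decode_ibs_op_ctl is hypothetical\&.
.sp
.nf
```python
def decode_ibs_op_ctl(raw):
    """Split a raw 64-bit ibs_op_ctl value into named fields.

    Assumed layout (verify against AMD documentation):
      bits 0-15   max count, stored without its low 4 bits
      bit  16     L3MissOnly
      bit  17     En
      bit  18     Val
      bit  19     CntCtl (0 = cycles, 1 = micro ops)
      bits 20-26  upper extension of the max count
      bits 32-58  current count
    """
    opmaxcnt = raw & 0xFFFF
    opmaxcnt_ext = (raw >> 20) & 0x7F
    return {
        # Re-attach the extension bits, then restore the dropped low 4 bits.
        "MaxCnt": ((opmaxcnt_ext << 16) | opmaxcnt) << 4,
        "L3MissOnly": (raw >> 16) & 1,
        "En": (raw >> 17) & 1,
        "Val": (raw >> 18) & 1,
        "CntCtl": (raw >> 19) & 1,
        "CurCnt": (raw >> 32) & 0x7FFFFFF,
    }


if __name__ == "__main__":
    # Raw value copied from the example dump in EXAMPLES.
    fields = decode_ibs_op_ctl(0x000002C30006186A)
    print(" ".join(f"{name} {value}" for name, value in fields.items()))
```
.fi
.sp
For the value 000002c30006186a this yields MaxCnt 100000, L3MissOnly 0, En 1, Val 1, CntCtl 0 and CurCnt 707, matching the fields perf prints in the dump\&.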