'\" t
.\"     Title: perf-amd-ibs
.\"    Author: [FIXME: author] [see http://www.docbook.org/tdg5/en/html/author]
.\" Generator: DocBook XSL Stylesheets vsnapshot <http://docbook.sf.net/>
.\"      Date: 2024-06-20
.\"    Manual: perf Manual
.\"    Source: perf
.\"  Language: English
.\"
.TH "PERF\-AMD\-IBS" "1" "2024\-06\-20" "perf" "perf Manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
perf-amd-ibs \- Support for AMD Instruction\-Based Sampling (IBS) with perf tool
.SH "SYNOPSIS"
.sp
.nf
\fIperf record\fR \-e ibs_op//
\fIperf record\fR \-e ibs_fetch//
.fi
.SH "DESCRIPTION"
.sp
Instruction\-Based Sampling (IBS) provides precise Instruction Pointer (IP) profiling support on AMD platforms\&. IBS has two independent components: IBS Op and IBS Fetch\&. IBS Op sampling provides information about instruction execution (micro\-op execution, to be precise) with details like d\-cache hit/miss, d\-TLB hit/miss, cache miss latency, load/store data source, branch behavior, etc\&. IBS Fetch sampling provides information about instruction fetch with details like i\-cache hit/miss, i\-TLB hit/miss, fetch latency, etc\&. IBS is implemented per SMT thread, i\&.e\&. each SMT hardware thread has its own standalone IBS units\&.
.sp
Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used through the Linux perf utility\&. The following sysfs directories are created at boot time if IBS is supported by both the hardware and the kernel\&.
.sp
.if n \{\
.RS 4
.\}
.nf
/sys/bus/event_source/devices/ibs_op/
/sys/bus/event_source/devices/ibs_fetch/
.fi
.if n \{\
.RE
.\}
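.sp
A script can check for these directories before attempting to record\&. The following is a minimal sketch; the function name ibs_pmus_present and its optional sysfs root argument are illustrative assumptions and are not part of perf\&.
.sp
.if n \{\
.RS 4
.\}
.nf
```shell
# Hypothetical helper: print the name of each IBS PMU found under a
# sysfs root. The optional argument only exists so that the check can
# be pointed at another tree; it defaults to the real /sys.
ibs_pmus_present() {
    root="${1:-/sys}"
    for pmu in ibs_op ibs_fetch; do
        if [ -d "$root/bus/event_source/devices/$pmu" ]; then
            echo "$pmu"
        fi
    done
}

ibs_pmus_present
```
.fi
.if n \{\
.RE
.\}
.sp
Empty output suggests that IBS is not exposed by the running kernel on this machine\&.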
.sp
IBS Op PMU supports two events: cycles and micro ops\&. IBS Fetch PMU supports one event: fetch ops\&.
.sp
IBS PMUs do not have user/kernel filtering capability, and thus using them requires the CAP_SYS_ADMIN or CAP_PERFMON privilege\&.
.SH "IBS VS\&. REGULAR CORE PMU"
.sp
IBS gives samples with precise IP, i\&.e\&. the IP recorded with an IBS sample has no skid, whereas the IP recorded by the regular core PMU will have some skid (the sample was generated at IP X but perf would record it at IP X+n)\&. Hence, the regular core PMU might not help for profiling with instruction\-level precision\&. Further, IBS provides additional information about the sample in question\&. On the other hand, the regular core PMU has its own advantages, like a plethora of events, counting mode (less interference), up to 6 parallel counters, event grouping support, filtering capabilities, etc\&.
.sp
Three regular core PMU events are internally forwarded to the IBS Op PMU when the precise_ip attribute is set:
.sp
.if n \{\
.RS 4
.\}
.nf
\-e cpu\-cycles:p becomes \-e ibs_op//
\-e r076:p becomes \-e ibs_op//
\-e r0C1:p becomes \-e ibs_op/cnt_ctl=1/
.fi
.if n \{\
.RE
.\}
.SH "EXAMPLES"
.SS "IBS Op PMU"
.sp
System\-wide profile, cycles event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op// \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
Per\-cpu profile (cpu10), cycles event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op// \-c 100000 \-C 10
.fi
.if n \{\
.RE
.\}
.sp
Per\-cpu profile (cpu10), cycles event, sampling freq: 1000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op// \-F 1000 \-C 10
.fi
.if n \{\
.RE
.\}
.sp
System\-wide profile, uOps event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1/ \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
Same command, but also capture IBS register raw dump along with perf sample:
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1/ \-c 100000 \-a \-\-raw\-samples
.fi
.if n \{\
.RE
.\}
.sp
System\-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1,l3missonly=1/ \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
Per process, attaching to an existing PID (upstream v6\&.2 onward), uOps event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1/ \-c 100000 \-p 1234
.fi
.if n \{\
.RE
.\}
.sp
Per process, profiling a launched command (upstream v6\&.2 onward), uOps event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_op/cnt_ctl=1/ \-c 100000 \-\- ls
.fi
.if n \{\
.RE
.\}
.sp
To analyse recorded profile in aggregate mode
.sp
.if n \{\
.RS 4
.\}
.nf
# perf report
/* Select a line and press \*(Aqa\*(Aq to drill down at instruction level\&. */
.fi
.if n \{\
.RE
.\}
.sp
To go over each sample
.sp
.if n \{\
.RS 4
.\}
.nf
# perf script
.fi
.if n \{\
.RE
.\}
.sp
Raw dump of IBS registers when profiled with \-\-raw\-samples
.sp
.if n \{\
.RS 4
.\}
.nf
# perf report \-D
/* Look for PERF_RECORD_SAMPLE */
.fi
.if n \{\
.RE
.\}
.sp
.if n \{\
.RS 4
.\}
.nf
Example register raw dump:
.fi
.if n \{\
.RE
.\}
.sp
.if n \{\
.RS 4
.\}
.nf
ibs_op_ctl:     000002c30006186a MaxCnt    100000 L3MissOnly 0 En 1
        Val 1 CntCtl 0=cycles CurCnt       707
IbsOpRip:       ffffffff8204aea7
ibs_op_data:    0000010002550001 CompToRetCtr     1 TagToRetCtr   597
        BrnRet 0  RipInvalid 0 BrnFuse 0 Microcode 1
ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
        DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
        DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
        DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
        DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
        OpDcMissOpenMemReqs 12 DcMissLat     0 TlbRefillLat     0
IbsDCLinAd:     ff110008a5398920
IbsDCPhysAd:    00000008a5398920
.fi
.if n \{\
.RE
.\}
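.sp
The ibs_op_ctl line of the dump above can be decoded by hand as a cross\-check\&. The sketch below assumes the bit layout used by the kernel\(cqs IBS definitions (MaxCnt[19:4] in bits 0\-15, L3MissOnly in bit 16, En in bit 17, Val in bit 18, CntCtl in bit 19, MaxCnt[26:20] in bits 20\-26, CurCnt in bits 32\-58); verify it against the processor documentation for the part in use\&. The function name decode_ibs_op_ctl is illustrative\&.
.sp
.if n \{\
.RS 4
.\}
.nf
```shell
# Hypothetical helper: decode selected fields of a raw ibs_op_ctl value
# given as hex digits without a 0x prefix. The bit positions are an
# assumption taken from the kernel's IBS definitions.
decode_ibs_op_ctl() {
    val=$(( 0x$1 ))
    max_lo=$(( val & 0xffff ))           # MaxCnt bits 19:4
    max_ext=$(( (val >> 20) & 0x7f ))    # MaxCnt bits 26:20
    echo "MaxCnt $(( ((max_ext << 16) | max_lo) << 4 ))"
    echo "L3MissOnly $(( (val >> 16) & 1 ))"
    echo "En $(( (val >> 17) & 1 ))"
    echo "Val $(( (val >> 18) & 1 ))"
    echo "CntCtl $(( (val >> 19) & 1 ))"
    echo "CurCnt $(( (val >> 32) & 0x7ffffff ))"
}

# Raw value taken from the example dump above.
decode_ibs_op_ctl 000002c30006186a
```
.fi
.if n \{\
.RE
.\}
.sp
For the value in the dump this yields MaxCnt 100000, L3MissOnly 0, En 1, Val 1, CntCtl 0 and CurCnt 707, which agrees with the fields perf printed\&.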
.sp
IBS applied in a real\-world use case
.sp
.if n \{\
.RS 4
.\}
.nf
A ~90% regression was observed in tbench with a specific scheduler hint,
which was counterintuitive\&. IBS profiles of the good and bad runs, captured
using perf, helped identify the exact cause of the problem:
.fi
.if n \{\
.RE
.\}
.sp
.if n \{\
.RS 4
.\}
.nf
https://lore\&.kernel\&.org/r/20220921063638\&.2489\-1\-kprateek\&.nayak@amd\&.com
.fi
.if n \{\
.RE
.\}
.SS "IBS Fetch PMU"
.sp
Similar commands can be used with Fetch PMU as well\&.
.sp
System\-wide profile, fetch ops event, sampling period: 100000
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_fetch// \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
System\-wide profile, fetch ops event, sampling period: 100000, Random enable
.sp
.if n \{\
.RS 4
.\}
.nf
# perf record \-e ibs_fetch/rand_en=1/ \-c 100000 \-a
.fi
.if n \{\
.RE
.\}
.sp
.if n \{\
.RS 4
.\}
.nf
Random enable adds a small degree of variability to the sample period\&. This
helps in cases like long\-running loops where the PMU keeps tagging the same
instruction over and over because of a fixed sample period\&.
.fi
.if n \{\
.RE
.\}
.sp
etc\&.
.SH "PERF MEM AND PERF C2C"
.sp
perf mem is a memory access profiler tool and perf c2c is a shared data cacheline analyser tool\&. Both internally use the IBS Op PMU on AMD\&. Below is a simple example of the perf mem tool\&.
.sp
.if n \{\
.RS 4
.\}
.nf
# perf mem record \-c 100000 \-\- make
# perf mem report
.fi
.if n \{\
.RE
.\}
.sp
By default, perf mem report provides a detailed memory access profile\&. The output can also be aggregated by selected output fields\&. For example:
.sp
.if n \{\
.RS 4
.\}
.nf
# perf mem report \-F mem,sample,snoop
Samples: 3M of event \*(Aqibs_op//\*(Aq, Event count (approx\&.): 23524876
Memory access                                 Samples  Snoop
N/A                                           1903343  N/A
L1 hit                                        1056754  N/A
L2 hit                                          75231  N/A
L3 hit                                           9496  HitM
L3 hit                                           2270  N/A
RAM hit                                          8710  N/A
Remote node, same socket RAM hit                 3241  N/A
Remote core, same node Any cache hit             1572  HitM
Remote core, same node Any cache hit              514  N/A
Remote node, same socket Any cache hit           1216  HitM
Remote node, same socket Any cache hit            350  N/A
Uncached hit                                       18  N/A
.fi
.if n \{\
.RE
.\}
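.sp
Because the aggregated report is plain text, it can be post\-processed further\&. The following is a minimal sketch that sums the samples whose snoop result is HitM\&. It assumes the three\-column layout shown above (Samples as the second\-to\-last field, Snoop as the last), which may differ between perf versions; the function name hitm_samples is illustrative\&.
.sp
.if n \{\
.RS 4
.\}
.nf
```shell
# Hypothetical helper: sum the Samples column over rows whose Snoop
# column is HitM. Reads report text on stdin and assumes that Samples
# is the second-to-last whitespace-separated field of each row.
hitm_samples() {
    awk '$NF == "HitM" { sum += $(NF - 1) } END { print sum + 0 }'
}

# Example input: three rows copied from the report above.
hitm_samples <<EOF
L3 hit                                           9496  HitM
L3 hit                                           2270  N/A
Remote core, same node Any cache hit             1572  HitM
EOF
```
.fi
.if n \{\
.RE
.\}
.sp
A high HitM share relative to the total sample count is one hint that perf c2c may be worth running next\&.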
.sp
Please refer to their man pages for more details\&.
.SH "SEE ALSO"
.sp
\fBperf-record\fR(1), \fBperf-script\fR(1), \fBperf-report\fR(1), \fBperf-mem\fR(1), \fBperf-c2c\fR(1)