fi_rxm(7)                      Libfabric v1.22.0                      fi_rxm(7)

NAME
       fi_rxm - The RxM (RDM over MSG) Utility Provider

OVERVIEW
       The RxM provider (ofi_rxm) is a utility provider that supports
       FI_EP_RDM type endpoints emulated over FI_EP_MSG type endpoint(s) of
       an underlying core provider. FI_EP_RDM endpoints have a reliable
       datagram interface, and RxM emulates this by hiding the connection
       management of the underlying FI_EP_MSG endpoints from the user.
       Additionally, RxM can hide the memory registration requirements of a
       core provider such as verbs if the application does not support them.

REQUIREMENTS
   Requirements for core provider
       The RxM provider requires the core provider to support the following
       features:

       o  MSG endpoints (FI_EP_MSG)

       o  RMA read/write (FI_RMA) - used for implementing the rendezvous
          protocol for large messages.

       o  FI_OPT_CM_DATA_SIZE of at least 24 bytes.

   Requirements for applications
       Since RxM emulates RDM endpoints by hiding connection management, and
       connections are established only on demand (when the application
       tries to send data), the first several data transfer calls would
       return -FI_EAGAIN. Applications should be aware of this and retry
       until the operation succeeds.

       If an application has chosen manual data progress, it should also
       read the CQ so that connection establishment progresses. Not doing so
       would result in a stall. See also the ERRORS section in fi_msg(3).

SUPPORTED FEATURES
       The RxM provider currently supports the FI_MSG, FI_TAGGED, FI_RMA and
       FI_ATOMIC capabilities.

   Endpoint types
       The provider supports only FI_EP_RDM.

   Endpoint capabilities
       The following data transfer interfaces are supported: FI_MSG,
       FI_TAGGED, FI_RMA, FI_ATOMIC.

   Progress
       The RxM provider supports both FI_PROGRESS_MANUAL and
       FI_PROGRESS_AUTO. Manual progress in general has better connection
       scale-up and lower CPU utilization since there is no separate
       auto-progress thread.

   Addressing Formats
       FI_SOCKADDR, FI_SOCKADDR_IN

   Memory Region
       The FI_MR_VIRT_ADDR, FI_MR_ALLOCATED and FI_MR_PROV_KEY MR mode bits
       would be required from the application in case the core provider
       requires them.

LIMITATIONS
       When using the RxM provider, some limitations from the underlying MSG
       provider could also show up. Please refer to the corresponding MSG
       provider man pages to learn about those limitations.

   Unsupported features
       The RxM provider does not support the following features:

       o  op_flags: FI_FENCE.

       o  Scalable endpoints

       o  Shared contexts

       o  FABRIC_DIRECT

       o  FI_MR_SCALABLE

       o  Authorization keys

       o  Application error data buffers

       o  Multicast

       o  FI_SYNC_ERR

       o  Reporting unknown source addr data as part of completions

       o  Triggered operations

   Progress limitations
       When sending large messages, an application doing an sread or waiting
       on the CQ file descriptor may not get a completion when reading the
       CQ after being woken up from the wait. The application has to do
       sread or wait on the file descriptor again. This is needed because
       RxM uses a rendezvous protocol for large message sends. An
       application would get woken up from waiting on the CQ fd when the
       rendezvous protocol request completes, but it would have to wait
       again to get an ACK from the receiver indicating completion of the
       large message transfer by remote RMA read.

   FI_ATOMIC limitations
       The FI_ATOMIC capability will only be listed in the fi_info if the
       fi_info hints parameter specifies FI_ATOMIC. If FI_ATOMIC is
       requested, the message orders FI_ORDER_RAR, FI_ORDER_RAW,
       FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_SAR, and FI_ORDER_SAW cannot be
       supported. See the hints sketch at the end of this section.

   Miscellaneous limitations
       o  RxM protocol peers should have the same endianness, otherwise
          connections won't complete successfully. This enables better
          performance at run time as byte order translations are avoided.
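   Requesting FI_ATOMIC through hints (illustrative sketch)
       The following is a minimal sketch, in C, of requesting FI_ATOMIC via
       the fi_getinfo hints so that RxM reports the capability, per the
       FI_ATOMIC limitation above. The API version passed to FI_VERSION and
       leaving provider selection to the environment are assumptions of this
       example, not requirements of RxM.

           #include <stdio.h>
           #include <rdma/fabric.h>
           #include <rdma/fi_errno.h>

           int main(void)
           {
               struct fi_info *hints, *info;
               int ret;

               hints = fi_allocinfo();
               if (!hints)
                   return 1;

               /* RxM only reports FI_ATOMIC when the hints ask for it. */
               hints->ep_attr->type = FI_EP_RDM;
               hints->caps = FI_MSG | FI_TAGGED | FI_RMA | FI_ATOMIC;

               ret = fi_getinfo(FI_VERSION(1, 22), NULL, NULL, 0, hints, &info);
               if (ret) {
                   fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
                   fi_freeinfo(hints);
                   return 1;
               }

               /* info->caps now includes FI_ATOMIC, but the RAR/RAW/WAR/WAW/
                * SAR/SAW message orderings listed above remain unavailable. */
               fi_freeinfo(info);
               fi_freeinfo(hints);
               return 0;
           }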
RUNTIME PARAMETERS
       The ofi_rxm provider checks for the following environment variables.

       FI_OFI_RXM_BUFFER_SIZE
              Defines the transmit buffer size / inject size. Messages of
              size less than or equal to this are transmitted via an eager
              protocol, and larger messages are transmitted via a rendezvous
              or SAR (Segmentation And Reassembly) protocol. Transmit data
              would be copied up to this size (default: ~16k).

       FI_OFI_RXM_COMP_PER_PROGRESS
              Defines the maximum number of MSG provider CQ entries
              (default: 1) that would be read per progress (RxM CQ read).

       FI_OFI_RXM_ENABLE_DYN_RBUF
              Enables support for dynamic receive buffering, if available
              from the message endpoint provider. This feature allows direct
              placement of received message data into application buffers,
              bypassing RxM bounce buffers. This feature targets providers
              that provide internal network buffering, such as the tcp
              provider. (default: false)

       FI_OFI_RXM_SAR_LIMIT
              Set this environment variable to control the RxM SAR
              (Segmentation And Reassembly) protocol. Messages of size
              greater than this (default: 128 KB) would be transmitted via
              the rendezvous protocol.

       FI_OFI_RXM_USE_SRX
              Set this to 1 to use a shared receive context from the MSG
              provider, or 0 to disable using a shared receive context.
              Shared receive contexts reduce overall memory usage, but may
              increase message latency. If not set, verbs will not use
              shared receive contexts by default, but the tcp provider will.

       FI_OFI_RXM_TX_SIZE
              Defines the default TX context size (default: 1024).

       FI_OFI_RXM_RX_SIZE
              Defines the default RX context size (default: 1024).

       FI_OFI_RXM_MSG_TX_SIZE
              Defines the FI_EP_MSG TX size that would be requested
              (default: 128).

       FI_OFI_RXM_MSG_RX_SIZE
              Defines the FI_EP_MSG RX size that would be requested
              (default: 128).

       FI_UNIVERSE_SIZE
              Defines the expected number of ranks / peers an endpoint would
              communicate with (default: 256).

       FI_OFI_RXM_CM_PROGRESS_INTERVAL
              Defines the duration of time in microseconds between calls to
              RxM CM progression functions when using manual progress.
              Higher values may provide less noise for calls to fi_cq read
              functions, but may increase connection setup time
              (default: 10000).

       FI_OFI_RXM_CQ_EQ_FAIRNESS
              Defines the maximum number of message provider CQ entries that
              can be consecutively read across progress calls without
              checking to see if the CM progress interval has been reached
              (default: 128).

       FI_OFI_RXM_DETECT_HMEM_IFACE
              Set this to 1 to allow automatic detection of the HMEM iface
              of user buffers when such information is not supplied. This
              feature allows such buffers to be copied or registered (e.g.,
              in rendezvous) internally by RxM. Note that no extra memory
              registration is performed with this option. (default: false)

   Tuning
       Bandwidth

       To optimize for bandwidth, use higher values than the defaults for
       FI_OFI_RXM_TX_SIZE, FI_OFI_RXM_RX_SIZE, FI_OFI_RXM_MSG_TX_SIZE and
       FI_OFI_RXM_MSG_RX_SIZE, subject to the memory limits of the system
       and the TX and RX sizes supported by the MSG provider.
       FI_OFI_RXM_SAR_LIMIT is another knob that can be experimented with to
       optimize for bandwidth.

       Memory

       To conserve memory, ensure FI_UNIVERSE_SIZE is set to what is
       required. Similarly, check that the FI_OFI_RXM_TX_SIZE,
       FI_OFI_RXM_RX_SIZE, FI_OFI_RXM_MSG_TX_SIZE and FI_OFI_RXM_MSG_RX_SIZE
       environment variables are set only to the required values.

NOTES
       The data transfer API may return -FI_EAGAIN during on-demand
       connection setup of the core provider FI_EP_MSG endpoint. See
       fi_msg(3) for a detailed description of handling FI_EAGAIN.
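       As an illustration of the handling described above, the following is
       a minimal sketch, assuming an endpoint ep, a completion queue cq
       bound to it for transmit completions, a buffer buf of length len, and
       a destination address dest_addr that the application has already set
       up; the helper name post_send_retry is illustrative only.

           #include <rdma/fabric.h>
           #include <rdma/fi_endpoint.h>
           #include <rdma/fi_eq.h>
           #include <rdma/fi_errno.h>

           /* Post a send, progressing the CQ while the on-demand
            * connection is established.  Assumes the CQ was opened with
            * FI_CQ_FORMAT_TAGGED (or a smaller format) and that the
            * provider does not require FI_MR_LOCAL (desc is NULL). */
           static int post_send_retry(struct fid_ep *ep, struct fid_cq *cq,
                                      const void *buf, size_t len,
                                      fi_addr_t dest_addr)
           {
               struct fi_cq_tagged_entry comp;
               ssize_t ret;

               do {
                   ret = fi_send(ep, buf, len, NULL, dest_addr, NULL);
                   if (ret == -FI_EAGAIN) {
                       /* With manual progress, reading the CQ drives the
                        * underlying MSG connection setup forward.  A real
                        * application would also process any completion
                        * returned here instead of discarding it. */
                       (void) fi_cq_read(cq, &comp, 1);
                   }
               } while (ret == -FI_EAGAIN);

               return (int) ret;
           }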
   Troubleshooting / Known issues
       If an RxM endpoint is expected to communicate with more peers than
       the default value of FI_UNIVERSE_SIZE (256), CQ overruns can happen.
       To avoid this, set a higher value for FI_UNIVERSE_SIZE. A CQ overrun
       can make a MSG endpoint unusable.

       At higher numbers of ranks, there may be connection errors due to a
       node running out of memory. The workaround is to use shared receive
       contexts for the MSG provider (FI_OFI_RXM_USE_SRX=1) or to reduce the
       eager message size (FI_OFI_RXM_BUFFER_SIZE) and the MSG provider
       TX/RX queue sizes (FI_OFI_RXM_MSG_TX_SIZE / FI_OFI_RXM_MSG_RX_SIZE).

SEE ALSO
       fabric(7), fi_provider(7), fi_getinfo(3)

AUTHORS
       OpenFabrics.

Libfabric Programmer's Manual                2024-03-13              fi_rxm(7)