regex(3) Library Functions Manual regex(3)

regcomp, regexec, regerror, regfree - POSIX regex functions

Standard C library (libc, -lc)

#include <regex.h>
int regcomp(regex_t *restrict preg, const char *restrict regex,
            int cflags);
int regexec(const regex_t *restrict preg, const char *restrict string,
            size_t nmatch, regmatch_t pmatch[_Nullable restrict .nmatch],
            int eflags);
size_t regerror(int errcode, const regex_t *_Nullable restrict preg,
            char errbuf[_Nullable restrict .errbuf_size],
            size_t errbuf_size);
void regfree(regex_t *preg);
typedef struct {
    size_t    re_nsub;
} regex_t;
typedef struct {
    regoff_t  rm_so;
    regoff_t  rm_eo;
} regmatch_t;
typedef /* ... */  regoff_t;

regcomp() is used to compile a regular expression into a form that is suitable for subsequent regexec() searches.

On success, the pattern buffer at *preg is initialized. regex is a null-terminated string. The locale in effect when regexec() is called must be the same as the one in effect when regcomp() compiled the pattern.

After regcomp() succeeds, preg->re_nsub holds the number of subexpressions in regex. Thus, a value of preg->re_nsub + 1 passed as nmatch to regexec() is sufficient to capture all matches.
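As a minimal sketch of the sizing rule above, the helper below compiles a pattern and reports how many pmatch elements a caller would need. The name pmatch_slots_needed() is hypothetical and not part of the regex API; returning 0 on a compilation failure is likewise a choice made for this illustration only.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>): compile pattern with
 * cflags and return the number of pmatch elements needed to capture
 * the entire match plus every subexpression.  Returns 0 if the
 * pattern does not compile.
 */
size_t pmatch_slots_needed(const char *pattern, int cflags)
{
    regex_t  preg;
    size_t   slots;

    if (regcomp(&preg, pattern, cflags) != 0)
        return 0;

    slots = preg.re_nsub + 1;   /* element 0 is the entire match */
    regfree(&preg);
    return slots;
}
```

For instance, the extended pattern "(a)(b(c))" has three subexpressions, so the helper would report four elements.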

cflags is the bitwise OR of zero or more of the following:

REG_EXTENDED
    Use POSIX Extended Regular Expression syntax when interpreting regex. If not set, POSIX Basic Regular Expression syntax is used.

REG_ICASE
    Do not differentiate case. Subsequent regexec() searches using this pattern buffer will be case insensitive.

REG_NOSUB
    Report only overall success. regexec() will use only pmatch for REG_STARTEND, ignoring nmatch.

REG_NEWLINE
    Match-any-character operators don't match a newline.

    A nonmatching list ([^...]) not containing a newline does not match a newline.

    The match-beginning-of-line operator (^) matches the empty string immediately after a newline, regardless of whether eflags, the execution flags of regexec(), contains REG_NOTBOL.

    The match-end-of-line operator ($) matches the empty string immediately before a newline, regardless of whether eflags contains REG_NOTEOL.

regexec() is used to match a null-terminated string against the compiled pattern buffer in *preg, which must have been initialized with regcomp(). eflags is the bitwise OR of zero or more of the following flags:

REG_NOTBOL
    The match-beginning-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag may be used when different portions of a string are passed to regexec() and the beginning of the string should not be interpreted as the beginning of the line.

REG_NOTEOL
    The match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above).

REG_STARTEND
    Match [string + pmatch[0].rm_so, string + pmatch[0].rm_eo) instead of [string, string + strlen(string)). This allows matching embedded NUL bytes and avoids a strlen(3) on known-length strings. If any matches are returned (REG_NOSUB wasn't passed to regcomp(), the match succeeded, and nmatch > 0), they overwrite pmatch as usual, and the match offsets remain relative to string (not string + pmatch[0].rm_so). This flag is a BSD extension, not present in POSIX.
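The double role of pmatch[0] under REG_STARTEND (search range on input, match offsets on output) may be easier to follow in code. The sketch below assumes a C library that provides REG_STARTEND, such as glibc or the BSD libcs; search_range() and its parameters are hypothetical names chosen for this illustration.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search only the bytes [buf + start, buf + end)
 * for the extended regular expression pattern.  Returns 0 and stores
 * the match offsets (relative to buf) in *so and *eo on success;
 * otherwise returns the regcomp() or regexec() error code.
 * Assumes the C library supports the non-POSIX REG_STARTEND flag.
 */
int search_range(const char *pattern, const char *buf,
                 size_t start, size_t end,
                 regoff_t *so, regoff_t *eo)
{
    regex_t     preg;
    regmatch_t  pmatch[1];
    int         rc;

    rc = regcomp(&preg, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    /* On input, pmatch[0] delimits the bytes to be searched. */
    pmatch[0].rm_so = (regoff_t) start;
    pmatch[0].rm_eo = (regoff_t) end;

    rc = regexec(&preg, buf, 1, pmatch, REG_STARTEND);
    if (rc == 0) {
        /* On output, pmatch[0] is the match, still relative to buf. */
        *so = pmatch[0].rm_so;
        *eo = pmatch[0].rm_eo;
    }

    regfree(&preg);
    return rc;
}
```

Searching "aabbbaabb" for "b+" within the range [5, 9) skips the first run of b characters and reports the second one at offsets 7 to 9, measured from the start of the buffer rather than from offset 5.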

Unless REG_NOSUB was passed to regcomp(), it is possible to obtain the locations of matches within string: regexec() fills nmatch elements of pmatch with results: pmatch[0] corresponds to the entire match, pmatch[1] to the first subexpression, etc. If there were more matches than nmatch, they are discarded; if fewer, unused elements of pmatch are filled with -1s.

Each returned valid match (that is, one whose offsets are not -1) corresponds to the range [string + rm_so, string + rm_eo).
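As a sketch of how the pmatch offsets translate into substrings, the hypothetical helper below copies the text of one subexpression into a caller-supplied buffer. The name copy_group(), the use of REG_EXTENDED, and the truncate-to-fit policy are all assumptions made for this illustration.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: match the extended regular expression pattern
 * against text and copy the substring for subexpression number group
 * (0 = entire match) into out, truncating to fit.  Returns the number
 * of bytes copied, or -1 if the pattern is invalid, there is no
 * match, or the group did not take part in the match.
 */
int copy_group(const char *pattern, const char *text, size_t group,
               char *out, size_t out_size)
{
    regex_t      preg;
    regmatch_t  *pmatch;
    size_t       nmatch;
    int          len = -1;

    if (out_size == 0 || regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    nmatch = preg.re_nsub + 1;          /* enough for every group */
    pmatch = malloc(nmatch * sizeof(*pmatch));

    if (pmatch != NULL && group < nmatch
        && regexec(&preg, text, nmatch, pmatch, 0) == 0
        && pmatch[group].rm_so != -1)   /* -1 marks an unused element */
    {
        size_t n = (size_t) (pmatch[group].rm_eo - pmatch[group].rm_so);

        if (n >= out_size)
            n = out_size - 1;
        memcpy(out, text + pmatch[group].rm_so, n);
        out[n] = '\0';
        len = (int) n;
    }

    free(pmatch);
    regfree(&preg);
    return len;
}
```

Matching "([a-z]+)=([0-9]+)" against "key=42" yields "key=42" for group 0, "key" for group 1, and "42" for group 2. With "(a)|(b)" and the input "b", group 1 does not take part in the match, so its offsets are -1 and the helper returns -1.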

regoff_t is a signed integer type capable of storing the largest value that can be stored in either a ptrdiff_t type or a ssize_t type.

regerror() is used to turn the error codes that can be returned by both regcomp() and regexec() into error message strings.

If preg isn't a null pointer, errcode must be the latest error returned from an operation on preg.

If errbuf_size isn't 0, up to errbuf_size bytes are copied to errbuf; the error string is always null-terminated, and truncated to fit.
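Because regerror() reports the buffer size it needs even when errbuf_size is 0, a common idiom is to call it twice: once to learn the size and once to fill an exactly sized buffer. The sketch below illustrates this; regex_error_string() is a hypothetical helper, and the caller is assumed to free() the result.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a newly allocated, null-terminated
 * message for errcode, or NULL if allocation fails.  The caller is
 * responsible for calling free() on the result.
 */
char *regex_error_string(int errcode, const regex_t *preg)
{
    size_t  size;
    char   *msg;

    /* First call: errbuf_size is 0, so nothing is written;
       the return value is the required size, terminator included. */
    size = regerror(errcode, preg, NULL, 0);

    msg = malloc(size);
    if (msg != NULL)
        regerror(errcode, preg, msg, size);   /* second call fills it */

    return msg;
}
```

A caller would typically pass the nonzero value returned by regcomp() or regexec() together with the pattern buffer the call operated on, print the message, and then free it.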

regfree() deinitializes the pattern buffer at *preg, freeing any associated memory; *preg must have been initialized via regcomp().

regcomp() returns zero for a successful compilation or an error code for failure.

regexec() returns zero for a successful match or REG_NOMATCH for failure.

regerror() returns the size of the buffer required to hold the string.

The following errors can be returned by regcomp():

REG_BADBR
    Invalid contents of a brace interval expression, such as a non-numeric bound or a first bound larger than the second.

REG_BADPAT
    Invalid use of pattern operators such as group or list.

REG_BADRPT
    Invalid use of repetition operators such as using '*' as the first character.

REG_EBRACE
    Unmatched brace interval operators.

REG_EBRACK
    Unmatched bracket list operators.

REG_ECOLLATE
    Invalid collating element.

REG_ECTYPE
    Unknown character class name.

REG_EEND
    Nonspecific error. This is not defined by POSIX.

REG_EESCAPE
    Trailing backslash.

REG_EPAREN
    Unmatched parenthesis group operators.

REG_ERANGE
    Invalid use of the range operator; for example, the ending point of the range occurs prior to the starting point.

REG_ESIZE
    Compiled regular expression requires a pattern buffer larger than 64 kB. This is not defined by POSIX.

REG_ESPACE
    The regex routines ran out of memory.

REG_ESUBREG
    Invalid back reference to a subexpression.
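To see which of these codes a given malformed pattern produces, the return value of regcomp() can be inspected directly. compile_status() below is a hypothetical helper written for this illustration. Which code is reported for a particular malformed pattern can vary between implementations; the two pairings in the usage note are what glibc is expected to report and should be treated as an assumption elsewhere.

```c
#include <regex.h>

/*
 * Hypothetical helper: compile pattern with cflags and return the
 * regcomp() status (0 on success, otherwise one of the REG_* error
 * codes).  The pattern buffer is released again on success.
 */
int compile_status(const char *pattern, int cflags)
{
    regex_t  preg;
    int      rc;

    rc = regcomp(&preg, pattern, cflags);
    if (rc == 0)
        regfree(&preg);   /* free only a successfully compiled buffer */

    return rc;
}
```

On glibc, the extended pattern "(" is expected to yield REG_EPAREN, and the basic pattern consisting of "a" followed by a lone backslash is expected to yield REG_EESCAPE.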

For an explanation of the terms used in this section, see attributes(7).

Interface               Attribute       Value
regcomp(), regexec()    Thread safety   MT-Safe locale
regerror()              Thread safety   MT-Safe env
regfree()               Thread safety   MT-Safe

Standards: POSIX.1-2008.

History: POSIX.1-2001.

Prior to POSIX.1-2008, regoff_t was required to be capable of storing the largest value that can be stored in either an off_t type or a ssize_t type.

re_nsub is only required to be initialized if REG_NOSUB wasn't specified, but all known implementations initialize it regardless.

Both regex_t and regmatch_t may (and do) have more members, in any order. Always reference them by name.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

#define ARRAY_SIZE(arr) (sizeof((arr)) / sizeof((arr)[0]))

static const char *const str =
        "1) John Driverhacker;\n2) John Doe;\n3) John Foo;\n";
static const char *const re = "John.*o";

int main(void)
{
    const char  *s = str;   /* start of the not-yet-searched remainder */
    regex_t     regex;
    regmatch_t  pmatch[1];
    regoff_t    off, len;

    if (regcomp(&regex, re, REG_NEWLINE))
        exit(EXIT_FAILURE);

    printf("String = \"%s\"\n", str);
    printf("Matches:\n");

    for (unsigned int i = 0; ; i++) {
        if (regexec(&regex, s, ARRAY_SIZE(pmatch), pmatch, 0))
            break;

        /* rm_so and rm_eo are relative to s; off is relative to str. */
        off = pmatch[0].rm_so + (s - str);
        len = pmatch[0].rm_eo - pmatch[0].rm_so;
        printf("#%u:\n", i);
        printf("offset = %jd; length = %jd\n", (intmax_t) off,
                (intmax_t) len);
        /* The precision argument of %.*s must be an int. */
        printf("substring = \"%.*s\"\n", (int) len, s + pmatch[0].rm_so);

        s += pmatch[0].rm_eo;   /* resume the search after this match */
    }

    regfree(&regex);
    exit(EXIT_SUCCESS);
}

grep(1), regex(7)

The glibc manual section, Regular Expressions

2024-06-15 Linux man-pages 6.9.1