Excel REGEXEXTRACT function

The Excel REGEXEXTRACT function extracts matching text defined by a given pattern. Regex patterns are very flexible and can be configured to match numbers, email addresses, dates, and other values that have an identifiable structure. By default, REGEXEXTRACT will return the first match, but it can be configured to return all matches.

The REGEXEXTRACT function is only available in Excel 365 for now.

Purpose

Extract text with regex pattern

Return value

Text matching pattern

Syntax

=REGEXEXTRACT(text,pattern,[return_mode],[case_sensitivity])

text - The text value to extract from.
pattern - The pattern to extract.
return_mode - [optional] 0 = first match, 1 = all matches, 2 = capture groups.
case_sensitivity - [optional] 0 = Case sensitive, 1= Case-insensitive. Default is 0.

Using the REGEXEXTRACT function

The REGEXEXTRACT function extracts text matching a specific regex pattern from a given text string. For the advanced Excel user, this function is a major upgrade. Instead of working out complex formulas based on functions like LEFT, RIGHT, FIND, MID, etc., REGEXEXTRACT can target data very precisely with a single regex pattern. With REGEXEXTRACT, you can easily extract numbers, dates, times, email addresses, and other text with a recognizable structure. REGEXEXTRACT not only saves time but also reduces errors created by complicated workarounds.

Regular expressions ("regex" for short) are a powerful tool for pattern matching and text manipulation. Regex patterns can match simple strings or very specific sequences like phone numbers, email addresses, dates, and other text that has an identifiable pattern. Compared to the simple wildcards provided in older versions of Excel, regular expressions are a huge upgrade in functionality. They make it possible to match text in extremely specific ways that were impossible before now. While regex itself can be daunting to newcomers, it's a very powerful language. As you become more comfortable with the syntax, you can tackle increasingly complex data extraction challenges.

Example - Extracting numbers
Example - Extracting phone numbers
Example - Extracting email addresses
Example - Extracting dates
Example - Extract all hashtags
Example - Capturing groups
Regex terminology and reference

Example - Extracting numbers

The REGEXEXTRACT function provides a way to extract text values using regular expressions. To use REGEXEXTRACT, provide the text string to extract from and a regex pattern. For example, to extract a number from a text string, you can use REGEXEXTRACT with a pattern like "[0-9]+":

=REGEXEXTRACT("10 apples","[0-9]+") // returns "10"

You can see how this works in the worksheet below, where the formula in cell D5, copied down, is:

=REGEXEXTRACT(B5,"[0-9]+")

Regexextract example - extract numbers

The pattern "[0-9]+" means "match one or more characters, where each character is a digit from 0 to 9":

Square brackets [ ]: - In regex, square brackets define a character set. They mean "match any single character that is inside these brackets."
0-9: - This is a range that includes all digits from 0 to 9. It's a shorthand way of writing [0123456789].
The plus sign +: The "+" is a quantifier that means "one or more of the preceding element." In this case, it applies to the range [0-9].

In practice, this pattern will:

Match any sequence of one or more digits.
Match the entire number, not just a single digit.
Not match decimal points, negative signs, or any non-digit characters.

Note that you can also match one or more digits with the more compact pattern "\d+". The "\d" is a character class representing digits, so "\d+" also means "one or more digits".

Note that if a number is not found in the text, REGEXEXTRACT will return a #N/A error.

Example - Extracting phone numbers

Expanding on the example above, the worksheet below shows how to match and extract phone numbers in the format xxx-xxx-xxxx (i.e., a number like 888-123-1234) from text strings. The formula in cell D5, copied down, looks like this:

=REGEXEXTRACT(B5,"\d{3}-\d{3}-\d{4}")

Regexextract example - extract phone numbers

Here's how the pattern "\d{3}-\d{3}-\d{4}" works:

The "\d" matches any digit (0-9). "{3}" is a quantifier meaning "exactly 3 times". So "\d{3}" matches exactly 3 digits.
The hyphen ("-") is a "literal" and matches a literal hyphen character.
As before, the pattern "\d{3}" matches exactly 3 digits.
The ("-") matches a literal hyphen character.
The pattern "\d{4}" matches exactly 4 digits.

Example - Extracting email addresses

Another classic use of regex is matching and extracting email addresses. In the worksheet below, the formula in cell D5, copied down, is:

=REGEXEXTRACT(B5, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

This example shows how matching simple things like an email address can quickly become more complicated as the pattern expands to handle variations in format. You can see how it works in the worksheet below. Briefly, here's how it works:

The pattern "[a-zA-Z0-9.%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}" is designed to match the most common email address formats.
It begins with "[a-zA-Z0-9.%+-]+", which matches one or more characters that can be letters (both lowercase and uppercase), numbers, or certain special characters typically allowed in email usernames.
This is followed by a literal "@" symbol. After the "@", "[a-zA-Z0-9.-]+" matches one or more characters for the domain name, allowing letters, numbers, dots, and hyphens.
The "." matches a literal dot, which is followed by "[a-zA-Z]{2,}", matching two or more letters for the top-level domain (like .com, .org, etc.).

Regexextract example - extract email addresses

Note: although this pattern covers most standard email formats while excluding many invalid ones, it doesn't capture all possible valid email addresses.

Example - Extracting dates

Another classic problem in Excel is how to extract a date from a text string. Traditionally, you might use a formula based on the SEARCH function and the MID function like this:

=MID(A1,SEARCH("??/??/??",A1),8)+0

However, SEARCH only supports Excel's very primitive wildcards, so the formula above is error-prone. With REGEXEXTRACT, we can use a more robust formula like this:

=REGEXEXTRACT(A1,"\b\d{1,2}/\d{1,2}/\d{2,4}\b")+0

You can see how this works in the worksheet below:

Regexextract example - extract dates from text strings

For a full explanation and sample worksheet, see this page.

Example - Extract all hashtags

The return_mode argument in REGEXEXTRACT defaults to zero (0), which will return only the first match. To return all matches, set this value to 1. You can see how this works in the worksheet below, where the formula in cell D5 is:

=REGEXEXTRACT(B5,"#\w+",1)

With the pattern "#\w+", and return_type set to 1, REGEX will return all hashtags for each text string in column B.

REGEXEXTRACT example - extract all hashtags

Multiple matches are returned in an array that spills to the right. If needed, you can use the TEXTJOIN function to combine multiple values into a single text string.

Example - Capturing groups

Capturing groups are a key concept in regular expressions (regex) that provide the ability to isolate and extract specific portions of a matched pattern. These groups are created by enclosing part of a regex pattern in parentheses (), which tells the regex engine to "capture" or remember the text matched by that particular section. REGEXEXTRACT provides basic support for capturing groups by setting the third argument, return_mode, to 2. You can see how this works below, where the formula in cell D5 is:

=REGEXEXTRACT(B5,"(\d{4})-(\w+)-(\w+)",2)

Note that we are using three sets of parentheses (), which create three capturing groups inside the pattern. The value for return_mode is set to 2, which causes REGEXEXTRACT to return all three groups in an array that spills onto the worksheet:

REGEXEXTRACT example - using capturing groups

Regex terminology and reference

Because Regex is essentially a mini programming language, it has its own vocabulary. Here is a list of some important terminology:

Pattern - The actual sequence of characters that defines the regex.
Literal - Characters in a regex pattern that match themselves. For example, in cat, the literals are c, a, and t.
Metacharacter - Characters with special meanings in regex. For example, a period . matches any character, ^ matches the start of a string, $ matches the end of a string.
Character Class - A set of characters enclosed in square brackets [] that matches any one of the characters inside. For example, [aeiou] matches any vowel.
Quantifier - Specifies how many instances of a character, group, or character class must be present in the input for a match. For example, a* matches zero or more a's, a+ matches one or more a's, a{3} matches exactly three a's.