Purpose
Return value
Syntax
=REGEXEXTRACT(text,pattern,[return_mode],[case_sensitivity])
- text - The text value to extract from.
- pattern - The pattern to extract.
- return_mode - [optional] 0 = first match, 1 = all matches, 2 = capture groups.
- case_sensitivity - [optional] 0 = Case sensitive, 1= Case-insensitive. Default is 0.
How to use
The REGEXEXTRACT function extracts text matching a specific regex pattern from a given text string. For the advanced Excel user, this function is a major upgrade. Instead of working out complex formulas based on functions like LEFT, RIGHT, FIND, MID, etc., REGEXEXTRACT can target data very precisely with a single regex pattern. With REGEXESTRACT, you can easily extract numbers, dates, times, email addresses, and other text with a recognizable structure. REGEXEXTRACT not only saves time but also reduces errors created by complicated workarounds.
Table of Contents
- Example - Extracting numbers
- Example - Extracting phone numbers
- Example - Extracting email addresses
- Example - Extracting dates
- Example - Extract all hashtags
- Example - Capturing groups
- Regex terminology
Example - Extracting numbers
The REGEXEXTRACT function provides a way to extract text values using regular expressions. To use REGEXEXTRACT, provide the text string to extract from and a regex pattern. For example, to extract a number from a text string, you can use REGEXEXTRACT with a pattern like "[0-9]+":
=REGEXEXTRACT("10 apples","[0-9]+") // returns "10"
You can see how this works in the worksheet below, where the formula in cell D5, copied down, is:
=REGEXEXTRACT(B5,"[0-9]+")
The pattern "[0-9]+" means "match one or more characters, where each character is a digit from 0 to 9":
- Square brackets [ ]: - In regex, square brackets define a character set. They mean "match any single character that is inside these brackets."
- 0-9: - This is a range that includes all digits from 0 to 9. It's a shorthand way of writing [0123456789].
- The plus sign +: The "+" is a quantifier that means "one or more of the preceding element." In this case, it applies to the range [0-9].
In practice, this pattern will:
- Match any sequence of one or more digits.
- Match the entire number, not just a single digit.
- Not match decimal points, negative signs, or any non-digit characters.
Note that you can also match one or more digits with the more compact pattern "\d+". The "\d" is a character class representing digits, so "\d+" also means "one or more digits".
Example - Extracting phone numbers
Expanding on the example above, the worksheet below shows how to match and extract phone numbers in the format xxx-xxx-xxxx (i.e., a number like 888-123-1234) from text strings. The formula in cell D5, copied down, looks like this:
=REGEXEXTRACT(B5,"\d{3}-\d{3}-\d{4}")
Here's how the pattern "\d{3}-\d{3}-\d{4}" works:
- The "\d" matches any digit (0-9). "{3}" is a quantifier meaning "exactly 3 times". So "\d{3}" matches exactly 3 digits.
- The hyphen ("-") is a "literal" and matches a literal hyphen character.
- As before, the pattern "\d{3}" matches exactly 3 digits.
- The ("-") matches a literal hyphen character.
- The pattern "\d{4}" matches exactly 4 digits.
Example - Extracting email addresses
Another classic use of regex is matching and extracting email addresses. In the worksheet below, the formula in cell D5, copied down, is:
=REGEXEXTRACT(B5, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
This example shows how matching simple things like an email address can quickly become more complicated as the pattern expands to handle variations in format. You can see how it works in the worksheet below. Briefly, here's how it works:
- The pattern "[a-zA-Z0-9.%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}" is designed to match the most common email address formats.
- It begins with "[a-zA-Z0-9.%+-]+", which matches one or more characters that can be letters (both lowercase and uppercase), numbers, or certain special characters typically allowed in email usernames.
- This is followed by a literal "@" symbol. After the "@", "[a-zA-Z0-9.-]+" matches one or more characters for the domain name, allowing letters, numbers, dots, and hyphens.
- The "." matches a literal dot, which is followed by "[a-zA-Z]{2,}", matching two or more letters for the top-level domain (like .com, .org, etc.).
Note: although this pattern covers most standard email formats while excluding many invalid ones, it doesn't capture all possible valid email addresses.
Example - Extracting dates
Another classic problem in Excel is how to extract a date from a text string. Traditionally, you might use a formula based on the SEARCH function and the MID function like this:
=MID(A1,SEARCH("??/??/??",A1),8)+0
However, SEARCH only supports Excel's very primitive wildcards, so the formula above is error-prone. With REGEXEXTRACT, we can use a more robust formula like this:
=REGEXEXTRACT(A1,"\b\d{1,2}/\d{1,2}/\d{2,4}\b")+0
You can see how this works in the worksheet below:
For a full explanation and sample worksheet, see this page.
Example - Extract all hashtags
The return_mode argument in REGEXEXTRACT defaults to zero (0), which will return only the first match. To return all matches, set this value to 1. You can see how this works in the worksheet below, where the formula in cell D5 is:
=REGEXEXTRACT(B5,"#\w+",1)
With the pattern "#\w+", and return_type set to 1, REGEX will return all hashtags for each text string in column B.
Multiple matches are returned in an array that spills to the right. If needed, you can use the TEXTJOIN function to combine multiple values into a single text string.
Example - Capturing groups
Capturing groups are a key concept in regular expressions (regex) that provide the ability to isolate and extract specific portions of a matched pattern. These groups are created by enclosing part of a regex pattern in parentheses (), which tells the regex engine to "capture" or remember the text matched by that particular section. REGEXEXTRACT provides basic support for capturing groups by setting the third argument, return_mode, to 2. You can see how this works below, where the formula in cell D5 is:
=REGEXEXTRACT(B5,"(\d{4})-(\w+)-(\w+)",2)
Note that we are using three sets of parentheses (), which creates three capturing groups inside the pattern. The value for return_mode is set to 2, which causes REGEXEXTRACT to return all three groups:
Regex terminology
Because Regex is essentially a mini-language, it has its own vocabulary. Here is a list of some important terminology:
- Pattern - The actual sequence of characters that defines the regex.
- Literal - Characters in a regex that match exactly themselves. For example, in the regex cat, the literals are c, a, and t.
- Metacharacter - Characters with special meanings in regex. For example, a period (.) matches any character, ^ matches the start of a string, $ matches the end of a string.
- Character Class - A set of characters enclosed in square brackets [] that matches any one of the characters inside. For example, [aeiou] matches any vowel.
- Quantifier - Specifies how many instances of a character, group, or character class must be present in the input for a match. For example, a* matches zero or more as, a+ matches one or more as, a{3} matches exactly three as.
- Escape Sequence - A way to handle metacharacters as literals by preceding them with a backslash \. For example, \. matches a literal period.
- Group - A part of a regex pattern enclosed in parentheses () that can be referred to later. For example, (abc) matches the exact sequence abc.
- Alternation - The pipe | character used to match one thing or another. For example, cat|dog matches cat or dog.
- Anchor - Special characters that match positions within a string rather than actual characters. For example, ^ matches the start of a string, $ matches the end of a string.
- Wildcard - The dot . character, which matches any single character except a newline character. For example, c.t matches cat, cot, cut, etc.
- Boundary - Special sequences that match positions between characters. For example, \b matches a word boundary, \B matches a non-word boundary.