Table of Contents
- What is regex?
- A brief history of regex in Excel
- Regex vs. Excel wildcards
- How regex is useful
- The REGEXTEST function
- The REGEXEXTRACT function
- The REGEXREPLACE function
- Regex quick reference
- Matching substring vs. entire text
- Regex tips
- Summary
What is regex?
What is regex? Regex, short for Regular Expressions, is a powerful tool for pattern matching in text data. Using a combination of metacharacters, literal characters, character classes, and quantifiers, you can define complex search patterns to extract, validate, or manipulate text data. The main benefit of regex in Excel is the ability to work with text very precisely without resorting to complicated formulas that are hard to understand and maintain. In Excel, Regex support comes primarily from the introduction of three brand-new functions:
In addition to the three dedicated functions above, XLOOKUP and XMATCH have also been upgraded to support regex. Plus, you can use the functions above inside other formulas to instantly upgrade their capabilities. For example, you can use REGEXTEST inside the IF function as the logical test, which "upgrades" IF to support regex. In this article, I'll introduce Excel's new regex functions and provide examples of how these functions are helpful. But first, let's review how we got here.
A brief history of regex in Excel
Why doesn’t Excel support Regex? This is one of those questions that has bothered Excel power users for many years. It’s been a topic of heated debates and the cause of many clunky, complicated formulas. Although regex is a standard feature in many programming languages, it was notably absent from Excel for most of its history. Here's how Excel's text processing capabilities evolved:
- Excel has always supported basic wildcards (* and ?), but these are primitive compared to regex patterns. Users had to rely on combining functions like LEFT, RIGHT, MID, FIND, SEARCH, and SUBSTITUTE for pattern matching, resulting in complex, hard-to-maintain formulas.
- Power users worked around these limitations using VBA, Power Query, or custom add-ins for regex support, but these tools require different skills and are not available to all users.
- In Excel 2013, Microsoft added some regex-like capability with the FILTERXML function, which uses XPath queries for pattern matching. Still, this function is not widely used or available in Excel for Mac.
- In 2022, Microsoft improved text handling with TEXTSPLIT, TEXTBEFORE, and TEXTAFTER functions. These functions make it much easier to split text at specific locations in an Excel formula. However, they do not support regex.
- In December 2024, Excel introduced three regex functions: REGEXTEST, REGEXREPLACE, and REGEXEXTRACT. These functions modernize Excel's text-processing capabilities and bring it up to speed with other professional tools.
Now that Excel supports regex directly, many complicated formulas of the past can be drastically simplified.
Regex vs. Excel wildcards
Excel wildcards are like a toolbox with just two tools: * for "anything" and ? for "one thing." Sure, you can find "apple*" or "?at," but that's about it. They're the flip phone of pattern matching.
Regex is a different story. While wildcards are asking "Does this have an "a" followed by... stuff?", regex is performing complex queries like "Any word that starts with a capital letter, contains exactly two numbers, and ends with x or y but not z".
Want to match exactly three digits followed by optional whitespace and a hyphen? Try \d{3}\s*- Need to find an email address or validate a strong password? Regex has patterns for that. It can even "look ahead" in your text to match patterns only when they occur before something else.
You get the idea. Regex goes far beyond basic wildcards.
How regex is useful
Before we get into the details, let's look at a specific example of how regex can help simplify a formula. In the workbook below, the goal is to extract the numbers from the product codes in column B. With hundreds of functions available, you might think this is a simple problem in Excel, but it's not! The problem is that the numbers vary in length, and their location in the product code also changes. There's just no easy way to figure out where each number begins and ends. Instead, the formula below in cell D5 takes a "brute force" approach and simply removes all non-numeric characters. It looks like this:
=TEXTJOIN("",TRUE,IFERROR(MID(B5,SEQUENCE(LEN(B5)),1)+0,""))
It's not exactly obvious what this formula is doing, right? You can find an explanation here. It's pretty complicated, and I'm even cheating a bit because I'm assuming at least Excel 2021, which has the SEQUENCE function. In Excel 2019, things get uglier because we need to spin up our own number array with the volatile INDIRECT function and the ROW function:
=TEXTJOIN("",TRUE,IFERROR(MID(B5,ROW(INDIRECT("1:"&LEN(B5))),1)+0,""))
In Excel versions before 2019, the formula becomes even more complicated!
What about regex? Does it help in this case? Yes. Regex helps a lot! In the worksheet below, the new formula in cell D5 is based on the REGEXEXTRACT function. Here it is:
=REGEXEXTRACT(B5,"\d+")
Yep, that's the whole formula. Basically, we are asking REGEXEXTRACT for a sequence of 1 or more numbers. You can see the results below. I think you'll agree that this new formula is a lot simpler 🙂
Now that you have a taste of how regex can help simplify difficult formulas, let's look more closely at the three new regex functions.
The REGEXTEST function
The Excel REGEXTEST function tests for a given regex pattern. The result from REGEXTEST is TRUE or FALSE. For example, =REGEXTEST(A1,"[0-9]") will return TRUE if cell A1 contains any numeric digit, and =REGEXTEST(A1,"[A-Z]" will return TRUE if A1 contains any uppercase letters. REGEXTEST opens up new possibilities for data validation and text analysis directly within Excel formulas. In the worksheet below, REGEXTEST is configured to test addresses for one of four US states: MN, MT, ND, and SD. The formula in D5, copied down, looks like this:
=REGEXTEST(B5,"\b(MN|MT|ND|SD)\b")
The pattern "\b(MN|MT|ND|SD)\b" matches MN, MT, ND, or SD. The '\b' is a word boundary character. It will match a space and any punctuation that typically appears around a word. The '|' creates OR logic.
Note: While you probably don't need a TRUE or FALSE result in a case like this, you could use exactly the same formula above inside the FILTER function to list all addresses that contain MN, MT, ND, or SD.
Learn more about the REGEXTEST function.
The REGEXEXTRACT function
The REGEXEXTRACT function extracts specific information from a string based on a Regex pattern. It’s perfect for pulling out key pieces of data from messy text. For example, in the worksheet below, the goal is to extract telephone numbers in the format xxx-xxx-xxxx
from the text strings in column B. The formula in cell D5, copied down, looks like this:
=REGEXEXTRACT(B5,"\d{3}-\d{3}-\d{4}")
This example gives you a sense of regex's power and flexibility. This formula is looking for and extracting phone numbers that follow a pattern like this: "123-456-7890". Let's break down each part:
- '\d{3}' looks for exactly 3 digits
- '-' looks for a hyphen
- '\d{3}' looks for 3 more digits
- '-' looks for another hyphen
- '\d{4}' looks for 4 more digits
So, if you have text that contains something like "My phone number is 555-123-4567", this formula will pull out just "555-123-4567".
Learn more about the REGEXEXTRACT function.
The REGEXREPLACE function
The REGEXREPLACE function allows you to replace parts of a string that match a Regex pattern with something else. It’s very useful for cleaning up or reformatting text. You can think of REGEXREPLACE as a much more powerful version of the simplistic SUBSTITUTE function. For example, in the workbook below, REGEXREPLACE is configured to remove all non-numeric characters from the telephone numbers in column B. The formula in cell D5, copied down, is:
=REGEXREPLACE(B5,"[^0-9]","")
The pattern "[^0-9]" will match any character that is not a number. Since the replacement pattern is an empty string (""), the result is that REGEXREPLACE effectively "strips" all non-numeric characters from the input, and the final result contains numbers only.
Learn more about the REGEXREPLACE function.
Regex quick reference
Regex relies on patterns to match specific text. The table below contains some simple regex patterns. These patterns can be combined to create very capable text-matching formulas.
Pattern | Description and examples |
---|---|
abc | Matches the literal text 'abc'. Example: 'abc' matches 'abc', but not 'ABC' or 'ab'. |
. | Matches any single character except a newline. Example: 'c.t' matches 'cat', 'cot', 'c@t'. |
\d | Matches any digit (0-9). Example: '\d\d\d' matches '123', '999'. |
\w | Matches any word character (letter, digit, underscore). Example: '\w\w' matches 'ab', '1_', 'A9'. |
\s | Matches any whitespace character (space, tab, newline). Example: 'a\sb' matches 'a b'. |
\b | Matches a word boundary. Example: '\bcat\b' matches 'cat' in 'the cat sits' but not 'category'. |
[abc] | Matches any one character listed in brackets. Example: 'gr[ae]y' matches 'gray' and 'grey'. |
[a-z] | Matches any one character in the range. Example: '[a-z]' matches any lowercase letter. |
[^abc] | Matches any character NOT listed. Example: '[^0-9]' matches any non-digit. |
a* | Matches 0 or more 'a'. Example: 'ca*t' matches 'ct', 'cat', 'caaat'. |
a+ | Matches 1 or more 'a'. Example: 'ca+t' matches 'cat', 'caat', but not 'ct'. |
a? | Matches 0 or 1 'a'. Example: 'colou?r' matches 'color' and 'colour'. |
a{3} | Matches exactly 3 'a'. Example: 'a{3}' matches 'aaa', but not 'aa' or 'aaaa'. |
a{2,4} | Matches 2 to 4 'a'. Example: 'a{2,4}' matches 'aa', 'aaa', 'aaaa'. |
^abc | Matches 'abc' at start of string. Example: '^The' matches 'The cat' but not 'In The'. |
abc$ | Matches 'abc' at end of string. Example: '.com$' matches 'example.com'. |
(abc) | Groups pattern and captures match. Example: '(cat|dog)s' matches 'cats' and 'dogs'. |
(?:abc) | Non-capturing group. Like () but doesn't store the match. |
a|b | Matches 'a' or 'b'. Example: 'cat|dog' matches 'cat' or 'dog'. |
Matching substring vs. entire text
By default, regex will match any substrings that match the pattern. For example, the pattern cat
will match "cat", "catapult", "scatter", "concatenate", or "the top category" because "cat" appears as a substring in each text string. To match an entire string exactly (i.e., to match the exact text in a cell in Excel), we need to use special anchors:
-
^
(caret): Matches the start of the string. For example,^abc
will match "abc123" but not "123abc". -
$
(dollar sign): Matches the end of the string. For example,abc$
matches "123abc" but not "abc123".
To match the entire contents of a cell exactly, combine the ^
and the $
in a pattern. For example, the pattern ^abc$
will match "abc" but not "123abc456" or "abcd".
Without these anchors, a regex pattern will match substrings that appear anywhere in the string, which may or may not meet your needs. For instance, the pattern "abc" will match "abc" in "123abc456" or "abcd", but the pattern "^abc$" will only match "abc".
Regex tips
Regex patterns can get complicated fast. Here are some general tips for creating and debugging regex patterns:
- Start small - Test the simplest version of your pattern first, then add complexity. If '\d+' doesn't work, '\d{3}-\d{3}' probably won't either.
- Use REGEXTEST to validate your patterns against sample data. REGEXTEST returns TRUE or FALSE, so it is perfect for testing in Excel.
- Special characters (. * + ? etc.) need to be escaped with '\' to match these characters literally.
- Regular expressions are case-sensitive by default. You can disable case_sensitivity to 1 for a case-insensitive match for each of the regex functions in Excel.
- ^ and $ match start/end of the entire string, not individual lines
- If a pattern isn't matching, try making it more permissive. For example, \s* instead of just a space to handle variable whitespace.
Summary
The formal introduction of regex in Excel formulas is a game-changer. Many complicated formulas of the past will slowly disappear as people learn how to turn "spaghetti code" into elegant and reliable formulas based on regex. As you can see in the table above, regex has a learning curve. Because regex is like a mini-language, the large number of symbols and patterns can be intimidating. However, you don't need to master regex in order to use regex. A little goes a long way. The good news is there has never been a better time to learn to use regex. There are many great resources on the web to help you create the patterns you need. Here are a few of my favorites:
- regex101.com - https://regex101.com/
- regexr.com - https://regexr.com/
- ChatGPT - https://chatgpt.com/
- Claude - https://claude.ai/