Explanation
In this example, the goal is to look up the correct price of the product number entered in cell F4 using the product codes in column B. This problem is trickier than it looks. Each product code begins with 3 uppercase letters and ends with 2 or 3 uppercase letters. In the middle of the product code is a number between 2 and 4 digits. This is the number we want to use for a lookup value. Let's look at how to solve this problem with XLOOKUP and the newly released "regex match" feature. To provide some context on why you might need to use regex, let's start by looking at some XLOOKUP formulas that don't work. All formulas below refer to the worksheet shown above. Download the workbook above and follow along.
Simple XLOOKUP formula
By default, XLOOKUP will perform an exact match. If we search for 56, XLOOKUP will simply return a #N/A error because 56 does not appear as a lookup value:
=XLOOKUP("56",B5:B16,C5:C16) // returns #N/A
- lookup_value - "56"
- lookup_array - B5:B16 (the range containing the product codes)
- return_array - C5:C16 (the range containing the prices)
Note that I've enclosed the "56" in double quotes because the product codes are text values, not numbers. But whether we look for the number 56 or the text "56", the result will be #N/A because XLOOKUP is performing an exact match, and those values do not exist by themselves as product codes.
XLOOKUP with a wildcard match
If you know a little more about XLOOKUP, you might wonder if we can use a wildcard match. Indeed, you can enable a "wildcard character match" by providing the number 2 for match_mode argument like this:
=XLOOKUP("56",B5:B16,C5:C16,,2) // enable wildcard match
The wildcard match doesn't do anything by itself without wildcards, so the formula above will also return #N/A. To actually use a wildcard match, we need to provide some wildcards. A typical approach involved adding some asterisks (*) like this:
=XLOOKUP("*56*",B5:B16,C5:C16,,2) // returns 78
- lookup_value - "*56*"
- lookup_array - B5:B16 (product codes)
- return_array - C5:C16 (prices)
- if_not_found - omitted
- match_mode - 2 (wildcard match)
The asterisks (*) are wildcards that will match "zero or more characters". What we are doing here is looking for some number of characters, then "56," then some number of characters. This seems like it should work. However, the formula returns 78, which is not correct. Why? Well, if "56" only appeared once in the product codes, it would work. However, "56" appears inside three product codes: KP563MN, QR56DE, and HJ3456TU. What's happening here is that XLOOKUP is matching the first product code that contains 56, KP563MN. Standard XLOOKUP behavior when there are multiple matches. Can we make XLOOKUP find the right code? Well, we could take the wildcard match idea a bit further by switching from "*" to the "?" like this:
=XLOOKUP("???56??",B5:B16,C5:C16,,2) // returns 46
The "?" wildcard means "one character of any kind". The literal meaning of this wildcard pattern is "3 characters, followed by 56, followed by 2 characters. In fact, the formula above does return the correct price for PQR56DE, which is $46.00. However, there are some problems with this approach:
- Although PQR56DE has 2 letters ("DE") after the 56, not all product codes follow this pattern. Some have two letters at the end, and some have three letters.
- We could have a product code like ABC5612XY, which will match the pattern above because the "?" will match any character, even numbers. The "?" wildcard makes no distinction between letters and digits.
In summary, with basic wildcards, we don't have a good way to create a pattern that will reliably match all product codes. It's time to roll out the big guns: Regular Expressions, called "regex" for short.
What is regex? Regex, short for Regular Expressions, is a powerful tool for pattern matching in text data. Using a combination of metacharacters, literal characters, character classes, and quantifiers, you can define complex search patterns to extract, validate, or manipulate text data. Regular Expressions have been around for decades, but only recently arrived in Excel. The main benefit of regex in Excel is the ability to work with text very precisely without resorting to complicated formulas that are hard to create and maintain. If you are new to Regular Expressions, see this overview.
XLOOKUP with a regex match
To enable a regex match in XLOOKUP, provide 3 for the fifth argument, called match_mode. Going back to the original example above, once we enable regex, we have this formula:
=XLOOKUP("56",B5:B16,C5:C16,,3) // regex enabled
Its interesting to note that this formula by itself returns $78.00. This is the price for KPX563MN, the first product code that contains "56". In other words, just by enabling regex, we get a working "contains" type match. This is cool because it means we can get XLOOKUP to do a "contains" type match by providing 3 for match_mode. We don't need to use any wildcards or special symbols. However, the result above is incorrect for the same reason we saw above: Three codes contain 56, and XLOOKUP matches the first code that contains 56.
Actually, we don't even need to provide 56 as a text value like "56", because it will get evaluated as text automatically inside in the regex engine. We'll get the same result if we use the number 56 as the lookup value.
To get the formula working correctly, we need to beef up the regex pattern so that we aren't accidentally matching the wrong product code. The first step is to add a pattern to match the beginning of the code. As noted above, the product codes always start with 3 uppercase characters. We can enforce this pattern in our lookup formula by adding "[A-Z]{3}" to the start:
=XLOOKUP("[A-Z]{3}56",B5:B16,C5:C16,,3) // returns 78
The {3} is a quantifier specifying 3 uppercase letters A-Z. This is already quite a bit more robust than our best wildcard formula above because XLOOKUP will now only match the 56 when it comes directly after 3 uppercase letters. Unfortunately, this formula also returns an incorrect result because KPX563MN is the first code to pass this test. To prevent this problem, we need to add a pattern to match the uppercase letters at the end of the product code. This is tricker. We know the code will end with uppercase letters, but there might be 2 characters, and there might be 3. This is where the power of regex patterns starts to really shine. To match 2-3 uppercase letters at the end, we can add "[A-Z]{2,3}" to the end of the pattern. The formula now looks like this:
=XLOOKUP("[A-Z]{3}56[A-Z]{2,3}",B5:B16,C5:C16,,3) // returns 46
- lookup_value - "[A-Z]{3}56[A-Z]{2,3}" (regex pattern)
- lookup_array - B5:B16 (product codes)
- return_array - C5:C16 (prices)
- if_not_found - omitted
- match_mode - 3 (regex match)
The quantifier {2,3} means a minimum of 2 and a maximum of 3. The translation of the full pattern is "3 uppercase letters A-Z, followed by 56, followed by 2-3 uppercase letters A-Z". With this adjustment, the formula finally returns the correct result of 46. Better yet, we can use the same pattern to match any of these codes using only the number part as the lookup value. We now have a working solution, but of course, we don't want the number 56 hardcoded in the formula. The final step is to adjust the formula to get the lookup number from cell F4. We do this by concatenating a reference to F4 inside the regex pattern like this:
=XLOOKUP("[A-Z]{3}"&F4&"[A-Z]{2,3}",B5:B16,C5:C16,,3)
This is the final working formula seen in the worksheet shown. When a user types a new valid number in cell F4, the formula will return a new result.
To learn more about the symbols available for regex patterns, see the cheat sheet near the bottom of this page.
Matching the entire cell
By default, regex will match any substrings that match the pattern. For example, the pattern cat
will match "cat", "catapult", "scatter", "concatenate", or "the top category" because "cat" appears as a substring in each text string. To match the text in a cell exactly, we need to use special anchors:
-
^
(caret): Matches the start of the string. For example,^abc
will match "abc123" but not "123abc". -
$
(dollar sign): Matches the end of the string. For example,abc$
matches "123abc" but not "abc123".
To match the entire contents of a cell exactly, combine the ^
and the $
in a pattern. For example, the pattern ^abc$
will match "abc" but not "123abc456" or "abcd". Moving back to the example above, if we want to restrict the pattern to match the entire contents of a cell, we can modify the formula like this:
=XLOOKUP("^[A-Z]{3}"&F4&"[A-Z]{2,3}$",B5:B16,C5:C16,,3)
This new formula is more specific. It will continue to match a cell that contains "KPX563MN", but it will not match a cell that contains "KPX563MNXY" since the text ends with "MNXY" and the pattern only allows 3 or 3 uppercase letters a the end. The only change to this formula is the ^
at the start and the $
at the end of the pattern.