Normalizing Text with Regex Groups in Python

In this post we’re going to look at how regex groups can help clean messy text data, and in this specific, case by normalizing product numbers. The data comes from RFX instrument bids. The challenge here is that while a company may have a concrete way to identify their own products, these product IDs are commonly mistyped when customers request product. We’ll see how to fix common mistakes with regular expressions that match erroneous forms and use groups to pull out specific parts pf a string.

Basic Yamaha Product ID

Yamaha, the world’s largest instrument manufacturer, has designed a Product ID system which encodes descriptive information about the product. Identifying the inherent morphology allows someone to know which instrument it is, but also any special customizations or features it may have. Luckily for us, we don’t need to know that.

Let’s use the Yamaha Tuba, YBB-105WC, as our base example. For our purposes the pattern of the Yamaha wind instrument is:

Brand	Instrument	Hyphen	Series	Features	Mark/Iteration*
Y	BB	-	105	WC	∅
Yamaha	Tuba	delimiter	100 Series	With Case	Product Iteration

The Mark or iteration is usually written with Roman Numerals, such as YAS-200ADII. No Roman numeral means the product is the first iteration!

Since we don’t need to pull out this level of detail from IDs in our example, we’ll reduce this template even further.

Interestingly enough, this system was so successful that other companies noticed and started following a similar approach. Note the similarity between the Jupiter Tuba JTU1110 and Eastman Tuba EBB234. Any guesses on what the first letter stands for?

Examining the typos

Let’s look at the typos we’ll normalize, specifically at the Yamaha Alto Sax YAS-200ADII. This Product ID might be written a variety of ways such as:

yas200adii
yas 200
YAS 200AD
yas - 200
yas - 200ADII
YAS200AD II

And the list goes on.

Goal

Our goal is to get all Yamaha product IDs into a canonical form, reducing messy data and making them easier to work with. We will not be correcting typos or incomplete IDs such as TFL-222 instead of YFL-222 or YAS-280 in place of YAS-200ADII. Let’s operationalize our normalized form. Ideally:

ALL CAPS
No spaces
Has delineating hyphen
Mark (as in ‘mark Ⅰ’ or ‘mark Ⅱ’) is optional, but present if available

This would mean that the previous list of error examples will all be canonicalized as YAS-200II and whichever mark such as ‘II’ or otherwise will only be included if it’s already listed. It’s absence will only be considered a typo.

Libraries

We’ll be using the re library for regular expressions. We’ll also be using the Counter object from Collections to manipulate the data easier and generate some statistics.

import codecs
import re
from collections import Counter

Python Regex Groups

Part of the re library, groups allow you select and extract certain sub-patterns of a regular expression. So we’ll pull out the germane parts of an ID and and normalize them to our standard form.

The syntax of a group is simple: (?P<group_name>…) where group_name is a name we choose for this group and the ellipsis is the regex for the group.

For example, let’s separate ‘Y’ from ‘FL’ in ‘YFL’. We assign ‘FYL’ to string and for our pattern we make two groups in one regular expression. One named yamaha and the other instrument.

string = 'YFL'
pattern = '(?P<yamaha>y)(?P<instrument>[a-z]+)'
ex = re.match(pattern, string, re.IGNORECASE)

We can examine specific groups by using the group() method and pass it the group name as a string. Please note that re.match() was passed the re.IGNORECASE flag so that the regex is case insensitive.

print(ex.group('yamaha'))
print(ex.group('instrument'))

Y
FL

Simple enough, now for something more complex. Let’s designed a regex that’ll encompass the various iterations of Yamaha product ID. We’re going to break it into 3 groups: yamaha_instrument, series, and mark.

Note that the first group yamaha_instrument we’ll be conflating both the brand ‘Yamaha’ and ‘instrument since we don’t have a practical reason to distinguish them. Series will also conflate any [numeric]+[alpha]* pattern after the delimiter (that is hyphen in our canonical form). The group, mark, only will come to play if the the mark is separated by a space from the series.

We’ll first use the regex to extract all the Yamaha IDs from the data

yamaha_pattern = r'\b(?P<yamaha_instrument>Y[A-Z]{2})[\s-]*(?P<series>[0-9]+[A-Z]*)\s*(?P<mark>I*)\b'

with codecs.open('yamaha_skus.txt', 'r', encoding='utf8') as f:
    products = f.read().split('\n')
    

print(f'Total Number of Yamaha IDs: {len(products)}')
print(f'Sample of Product IDs: {products[:4]}')

Total Number of Yamaha IDs: 596
Sample of Product IDs: ['YAC1607', 'YAS - 26', 'YAS - 480', 'YAS - 480SY']

The table below describes this regex chunk by chunk. For a more thorough regex review, consult the re documentation. Notice

Regex component	Explanation
\b	word boundary
(?P<yamaha_instrument>Y[A-Z]{2})	a group name yamaha_instrument that consists of a y followed by two other alphabetic characters
[\s-]*	zero or more spaces or hyphens in any order
(?P<series>[0-9]+[A-Z]*)	a group named series which can consist one or more numbers followed by zero or more alphabetic characters
\s*	zero or more spaces
(?P<mark>I*)	named group called mark which consists of zero or more i’s
\b’	word boundary

products is a list of various Yamaha product IDs. The data comes from RFX data which can be found here.

Now that we have a regex that will separate all these product IDs, we’ll create a function that takes a product id and normalizes it.

def normalize_yamaha_product_id(product, pattern):
    'Return idealized form of Yamaha product ID'
    product_id = re.match(pattern, product, re.IGNORECASE)
    
    # The three groups.
    yamaha_instrument = product_id.group('yamaha_instrument').upper()
    series = product_id.group('series').upper()
    mark = product_id.group('mark').upper()
    
    # Put extracted groups into final normalized form.
    return f'{ yamaha_instrument}-{series}{mark}'

Let’s test our function and make sure it preforms well. With these basic tests we can tell it works as intended. Be sure to mess around with your own test case, and try to make it fail. What happens, for example, if you have ‘yas - 200 ADII’?

# Test: (input, correct_ouput).
test = [
    ('yas200adii', 'YAS-200ADII'),
    ('yas 200', 'YAS-200'),
    ('yas 200AD', 'YAS-200AD'),
    ('yas - 200ADII', 'YAS-200ADII'),
    ('YAS200AD II', 'YAS-200ADII')
]

for input, output in test:
    test_case = normalize_yamaha_product_id(input, yamaha_pattern)
    if test_case == output:
        print(f'Pass' )
    else:
        print(f'Fail: "{test_case}" is not equal to "{output}')

    

Pass
Pass
Pass
Pass
Pass

Everything passed! Now that we have tested our function and know it works. Let’s use it to normalize the IDs in products. We initialize a Counter object with products. This gives us an object where we can easily get the unique IDs and counts the various instances each individual ID. Let’s normalize just the unique IDs. The ‘Change’ column is True if the input is different than the output and False if there is no difference.

data = Counter(products)

total_changes = 0
counter = 0
print(f'No.\tOriginal\tNormalized\tChange')

ids = []
for product in list(data):
    counter += 1
    normalized_product = normalize_yamaha_product_id(product, yamaha_pattern)
    if product == normalized_product:
        change = False
        ids.append((counter, product, normalized_product, change))
    else:
        change = True
        total_changes += 1
        ids.append((counter, product, normalized_product, change))

for counter, product, normalized_product, change in ids:
    if counter < 10:
        print(f'{counter}\t{product.ljust(10)}\t{normalized_product.ljust(10)}\t{change}')  

No.	Original	Normalized	Change
YAC1607   	YAC-1607  	True
YAS - 26  	YAS-26    	True
YAS - 480 	YAS-480   	True
YAS - 480SY	YAS-480SY 	True
YAS 480   	YAS-480   	True
YAS 6211  	YAS-6211  	True
YAS 662 III	YAS-662III	True
YAS-200AD II	YAS-200ADII	True
YAS-200ADII	YAS-200ADII	False

Feel free to look through the results. It appears that the function is working as intended.

Analysis

Let’s examine how this function changed the data. For comparison we’ll also create a new Counter with the normalized forms.

normalized_data = Counter([normalize_yamaha_product_id(product, yamaha_pattern) for product in products])

num_of_products = len(list(data))
num_of_uniq_products = len(list(normalized_data))

percentage_decrease = round((num_of_products - num_of_uniq_products) / num_of_products * 100,2)

print(f'{num_of_products} unique IDs was reduced to {num_of_uniq_products} IDs')
print(f'That\'s a decrease of {percentage_decrease}%!')

260 unique IDs was reduced to 173 IDs
That's a decrease of 33.46%!

We can make a table comparing the top 10 Product Numbers before normalization and after normalization.

heading = 'No.'.ljust(5) + 'Old ID'.ljust(10) + 'Cnt'.ljust(5) + 'New ID'.ljust(10) + 'Cnt'
print(heading)
print('-'*30)

counter = 0
for old, new in zip(data.most_common(10), normalized_data.most_common(10)):
    counter += 1
    old_id    = old[0].ljust(10)
    old_count = str(old[1]).ljust(5)
    new_id    = new[0].ljust(10)
    new_count = str(new[1])
    print(f'{ str(counter).ljust(5) }{ old_id }{ old_count }{ new_id }{ new_count }')

No.  Old ID    Cnt  New ID    Cnt
------------------------------
  YBS-52    20   YBS-52    31
  YBB-105WC 16   YBB-105WC 24
  YEP-321   16   YEP-201   21
  YAS-26    12   YEP-321   20
  YCL-221   11   YAS-26    17
  YTS-480   11   YTS-480   17
  YBS52     10   YCL-221   15
  YEP-201   10   YCL-255   15
  YCL-255   9    YAS-480   13
 YHR-567   9    YCL-221II 12

From the table we can see that the YBS-52 is still the most common item requested, and that ‘YBS52’ has been removed from the list. As a few other items such as the YCL-255 and YHR-576 were removed from the Top 10, this new Top 10 list provides more accurate insights.

Now let’s look at the percentage increases of the Top 10 items. To get the percentage change:

\[\frac{\text{new number} - \text{old number}}{\text{old number}} \times{100}\]

print('Product ID'.ljust(11), 'Percentage Change')
print('-'*30)
for product, new_count in normalized_data.most_common(10):
    old_count = data[product]
    percentage_increase = round((new_count - old_count) / old_count * 100, 2)
    print(f'{product.ljust(12)}{percentage_increase}%')

Product ID  Percentage Change
------------------------------
YBS-52      55.0%
YBB-105WC   50.0%
YEP-201     110.0%
YEP-321     25.0%
YAS-26      41.67%
YTS-480     54.55%
YCL-221     36.36%
YCL-255     66.67%
YAS-480     62.5%
YCL-221II   71.43%

We can see that the new top products all received significant count increases when we started counting more accurately.

Conclusion

In this post we saw how we could extract substrings using regular expression named groups, and how we could use them to clean messy data. The cleaner the data, the easier it is to model and extract practical, actionable insights.

Let's Make a Tokenizer Part 4 (Categories: python, tokenization, nlp, normalization, project)
Let's Make a Tokenizer Part 3 (Categories: python, tokenization, nlp, normalization, project)
Let's Make a Tokenizer Part 2 (Categories: python, tokenization, nlp, normalization, project)
Let's Make a Tokenizer Part 1 (Categories: python, tokenization, nlp, normalization, project)
Tokenization (Categories: python, tokenization, nlp, normalization)
Internationalization & Localization in Python (Categories: python, internationalization, localization, gettext, text, translation)

« Creating Vowel Plots in R Internationalization & Localization in Python »

Daniel J. Dorado