Grammar
So as mentioned in the first post, the grammar is where the rules are specified. These grammatical rules check whether a token is ungrammatical. If a token passes a rule, it moves on to the next one; if it fails, it gets broken up, and the newly created tokens begin their own journey through the pipeline (a rough sketch of this check-and-split flow follows the rule list). In our grammar, we’ll specify 4 rules:
- initial_punctuation - separate token-initial punctuation
- final_punctuation - separate token-final punctuation
- all_punctuation - separate token-initial and final punctuation
- currency - separate currency symbols from quantity amounts
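To make that flow concrete, here is a minimal sketch of the check-and-split idea. The placeholder rule and the check_token helper are made up for illustration; this is not the real tokenizer, which we’ll build in a later post.
import re

# Placeholder rule for illustration: a word followed by final punctuation,
# captured as two groups.
example_rule = re.compile('^([A-Z]+)([.,!?]+)$', re.IGNORECASE)

def check_token(token, rule):
    """Return the token split along the rule's groups if it matches (fails),
    otherwise return the token unchanged."""
    match = rule.match(token)
    if match:
        # The token fails the rule, so break it up along its group boundaries.
        return list(match.groups())
    return [token]

print(check_token('cats,', example_rule))  # ['cats', ',']
print(check_token('cats', example_rule))   # ['cats']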
Language
Since a tokenizer deals with an orthographic system, there are a few things to keep in mind. This grammar is going to focus on en-US, that is, English from the United States. There are different writing standards in different areas, and they often conflict, so the rules in this grammar will follow a generic American English standard. In our small grammar this won’t matter too much, but it does change what we consider initial or final punctuation and how currency is written. With that being said, this tokenizer will not work well for other languages such as Spanish and French.
Regular Expressions
We’ll be defining rules for our grammar using regexes. The longer a regex gets, the more unwieldy and unreadable it becomes. So we’ll build each regex up by assigning smaller sets and characters to variables and joining them into larger expressions. A benefit of this style is that these components can be shared by multiple rules.
So let’s start by importing the re library and creating the basic character sets we’ll use in our rules. Sets are created by placing characters inside square brackets []. Add the following to our grammar.py file.
import re
ALPHA = '[A-Z]+'
DIGITS = '[0-9]'
BOS = '^'
EOS = '$'
PLUS = '+'
STAR = "*"
PERIOD = r'\.'
INITIAL_PUNCTUATION = '[\'"]'
FINAL_PUNCTUATION = '[\',!?":.]'
CURRENCY_SYMBOL = '[$£¥€]'
QUESTION_MARK = '?'
Some of these may seem ridiculous, like PLUS or PERIOD, but it makes the grammar very clear and readable, even to someone who later has to read and edit it. Maybe yourself, months after you last looked at it. Let’s review a few more of these in more detail.
- ALPHA is at least one alphabetic character. We’ll be compiling our final regexes with the re.IGNORECASE flag, so it isn’t necessary to include a-z in this alphabetic set. What does this mean for a token like café or naïve?
- BOS means Beginning of String and EOS means End of String.
- INITIAL_PUNCTUATION is punctuation that appears before a word. Note that the ' is escaped for the sake of the Python string, and that the \ itself is not part of either initial or final punctuation.
- FINAL_PUNCTUATION is punctuation that appears after a word. What about other punctuation that might appear here? Problem! By including ' as final punctuation, the possessive plural cats' won’t be allowed.
- CURRENCY_SYMBOL is the set of currency symbols the tokenizer will recognize. So it’ll separate $1.20 and £1.20 correctly, but not ₴1.20 or ₱1.20.
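To answer that question about café and naïve: the re.IGNORECASE flag makes [A-Z]+ match lowercase ASCII letters, but accented characters like é and ï are still outside the A-Z range, so those tokens won’t fully match ALPHA. A quick illustrative check, not part of grammar.py:
import re

ALPHA = '[A-Z]+'
alpha_only = re.compile('^' + ALPHA + '$', re.IGNORECASE)

print(bool(alpha_only.match('cats')))  # True: IGNORECASE covers a-z
print(bool(alpha_only.match('café')))  # False: é is outside the A-Z range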
Groups
When a token matches a rule, it fails, and we want to be able to split it up into smaller tokens. We’ll accomplish this by using regex groups, that is, anything in (). Anything we want to potentially separate needs to be in a group. For example, 'What needs two groups: one for ' and one for What, that is, a punctuation group and an alphabetic group.
So let’s create a function that takes a regex expression and returns it as a regex group.
def create_group(expression):
    return '(' + expression + ')'
This function is mostly for legibility. Notice it uses the characters and sets we defined earlier.
ALPHA_GROUP = create_group(ALPHA)
INITIAL_PUNCTUATION_GROUP = create_group(INITIAL_PUNCTUATION)
FINAL_PUNCTUATION_GROUP = create_group(FINAL_PUNCTUATION + PLUS)
FINAL_PUNCTUATION_STAR_GROUP = create_group(FINAL_PUNCTUATION + STAR)
CURRENCY_SYMBOL_GROUP = create_group(CURRENCY_SYMBOL)
CURRENCY_GROUP = create_group(DIGITS + PLUS + PERIOD + QUESTION_MARK + DIGITS + '{,2}')
ALPHA_PUNCTUATION_GROUP = create_group(ALPHA + FINAL_PUNCTUATION + STAR)
Notice we have two similar groups, FINAL_PUNCTUATION_GROUP and FINAL_PUNCTUATION_STAR_GROUP. Why do we need both? Also notice that create_group is sometimes passed a concatenated string rather than a single set. While it may be easier to just copy and paste what I’ve put as code, it would be wise to construct these yourself so you can grok how they’re built up.
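The difference matters once the rules are applied: the plus group requires at least one final-punctuation character, while the star group is satisfied by zero. A quick illustration, again not part of grammar.py:
import re

plus_group = re.compile("^([',!?\":.]+)$")  # one or more
star_group = re.compile("^([',!?\":.]*)$")  # zero or more

print(bool(plus_group.match('')))  # False: + needs at least one character
print(bool(star_group.match('')))  # True: * also matches nothing at all
This is why the currency rule below uses the star version: a price like $1.20 with no trailing punctuation should still match, while the final punctuation rule uses the plus version so that a bare word doesn’t.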
Rules
Now that we have base character sets and some groups, we can define rules. Notice that these rules are constructed solely of groups. That’s because if a token matches a rule, it has to be broken up, and each group represents a boundary at which the token needs to be split.
INITIAL_PUNCTUATION_TOKEN = INITIAL_PUNCTUATION_GROUP + ALPHA_PUNCTUATION_GROUP
FINAL_PUNCTUATION_TOKEN = ALPHA_GROUP + FINAL_PUNCTUATION_GROUP
PUNCTUATION_TOKEN = create_group(FINAL_PUNCTUATION) + FINAL_PUNCTUATION_GROUP
CURRENCY_TOKEN = CURRENCY_SYMBOL_GROUP + CURRENCY_GROUP + FINAL_PUNCTUATION_STAR_GROUP
Notice that for the PUNCTUATION_TOKEN rule we create a group right in the definition, since that group isn’t being used elsewhere. If a later rule would benefit from the same group, it would be better to define it as a constant variable like the other groups.
As you write your own rules, it’s important to avoid non-deterministic expressions, that is, patterns where the group boundaries are ambiguous.
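To see what a rule captures, you can compile one by hand in the same session where these constants are defined (the proper compilation step comes below) and inspect the groups it produces. Here I’m using the currency rule as an example:
import re

# Exploratory check only: compile the currency rule by hand.
rule = re.compile(BOS + CURRENCY_TOKEN + EOS, re.IGNORECASE)

print(rule.match('$1.20!').groups())  # ('$', '1.20', '!')
print(rule.match('$1.20').groups())   # ('$', '1.20', '')
print(rule.match('cats'))             # None: not a currency token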
Rule dictionary
Finally, let’s make a dictionary where the key is the rule name and the value is the string regex rule.
RULES_TO_EXPORT = {
    'initial_punctuation_token': INITIAL_PUNCTUATION_TOKEN,
    'final_punctuation_token': FINAL_PUNCTUATION_TOKEN,
    'punctuation_token': PUNCTUATION_TOKEN,
    'currency_token': CURRENCY_TOKEN,
}
Something to consider here is rule order. Since we’re not really normalizing text here, order isn’t a huge concern, but it is something to keep in mind. In our case, we want to optimize the tokenizer to run as fast as possible. This is why the PUNCTUATION_TOKEN rule comes after the other rules dealing with punctuation. Since the other rules break punctuation off first, a pure punctuation token is unlikely to exist on its own (unless it’s an emoticon), so checking for one before the other rules have done their splitting causes superfluous rule checks. On a small dataset there isn’t a noticeable difference, but on a larger dataset it could add up.
You may be concerned that Python dictionaries don’t remember ordering, but as of Python 3.7 insertion order is guaranteed (it was already the case in CPython 3.6 as an implementation detail).
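A quick way to convince yourself, run after RULES_TO_EXPORT is defined:
print(list(RULES_TO_EXPORT))
# ['initial_punctuation_token', 'final_punctuation_token',
#  'punctuation_token', 'currency_token']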
Rule compiling
The next order of business is to compile our rules. Let’s add a function that does just that. It compiles our string rule, but also adds BOS and EOS to the string. This guarantees that a rule has to match an entire token and not just a substring of it. Also notice that we pass the re.IGNORECASE flag.
def compile_rule(rule):
    """Return a case-insensitive rule matching an entire string."""
    return re.compile(BOS + rule + EOS, re.IGNORECASE)
So to create our final set of rules, I’ll create a new RULES dictionary where each value is compiled using our new function. With this in place, we’re almost ready to start building the key component of our pipeline: the tokenizer itself.
RULES = {key:compile_rule(value) for (key, value) in RULES_TO_EXPORT.items()}
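As a quick sanity check, not part of grammar.py, you can poke at the compiled rules from a Python shell; the .pattern attribute is also what the tests below inspect:
import grammar

rule = grammar.RULES['final_punctuation_token']
print(rule.pattern)                  # ^([A-Z]+)([',!?":.]+)$
print(rule.match('cats,').groups())  # ('cats', ',')
print(rule.match('cats'))            # None: nothing to split off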
Testing 1, 2, 3
Testing is a must for text normalization and building grammars. Tests verify that the rules do what we intend. For now we’re just going to add a simple test for each rule to verify that the regex we’ve constructed is the regex we intended. If you’re not familiar with testing or pytest, I highly suggest you look into it.
In our test_tokenizer.py file, import our newly created grammar, specifically the rules we just made. These tests are pretty basic: they just assert that our regexes are what we think they should be. Any time we make a change to a rule, the regex here will have to be updated, otherwise the test will fail. See if you can make a test fail. It’s all too easy.
A couple of notes about pytest if you’re unfamiliar with it. Naming is important. For example, in order for pytest to locate the test file, its name has to start with ‘test’! Also, once the file is found, each test function also has to start with ‘test’.
import grammar
RULES = grammar.RULES
def test_inital_punctuation_regex():
    test_regex = RULES['initial_punctuation_token'].pattern
    assert test_regex == """^([\'"])([A-Z]+[\',!?":.]*)$"""

def test_final_punctuation_regex():
    test_regex = RULES['final_punctuation_token'].pattern
    assert test_regex == """^([A-Z]+)([',!?":.]+)$"""

def test_all_punctuation_regex():
    test_regex = RULES['punctuation_token'].pattern
    assert test_regex == """^([',!?":.])([',!?":.]+)$"""

def test_currency_regex():
    test_regex = RULES['currency_token'].pattern
    assert test_regex == r"""^([$£¥€])([0-9]+\.?[0-9]{,2})([',!?":.]*)$"""
Once you’ve got the tests set up, open your terminal or command prompt and run the following command.
pytest -v
The -v
flag prints an itemized list of each test and whether it passed or failed.
platform win32 -- Python 3.7.4, pytest-5.2.1, py-1.8.0, pluggy-0.13.0 -- C:\Users\$USER\Anaconda3\python.exe
cachedir: .pytest_cache
rootdir: PATH\TO\FILE
plugins: arraydiff-0.3, doctestplus-0.4.0, openfiles-0.4.0, remotedata-0.3.2
collected 4 items
test_tokenizer.py::test_inital_punctuation_regex PASSED [ 25%]
test_tokenizer.py::test_final_punctuation_regex PASSED [ 50%]
test_tokenizer.py::test_all_punctuation_regex PASSED [ 75%]
test_tokenizer.py::test_currency_regex PASSED [100%]
================================================== 4 passed in 0.05s ==================================================
Testing the future
For now these are all the tests we’re going to add. When the tokenizer is built we’ll add tests for each rule to verify that the rules work harmoniously with each other and produce correct results. In the meantime, feel free to add additional tests that exercise an individual rule. The more test cases, the better.
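For example, a behavioral test for a single rule might look something like this; the test name and example strings here are my own, but it slots into test_tokenizer.py alongside the tests above:
def test_currency_rule_splits_price():
    rule = RULES['currency_token']
    # A price should split into symbol, amount, and (empty) trailing punctuation.
    assert rule.match('$1.20').groups() == ('$', '1.20', '')
    # A plain word shouldn't match the currency rule at all.
    assert rule.match('cats') is None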
In the next post, we’ll start work on the tokenizer itself, now that the grammar and exceptions components are complete. Feel free to peek ahead and look at the completed source code.