Exceptions

So let’s get started making the exceptions.py file. As you see this, is pretty basic. It’s just a Python dictionary with a few tokens that, for this example, we’re going to going to treat as exceptions. The first step in our tokenization pipeline is to see if it’s in this list. If a word is in this list we’ll add the associated tokens in to our final tokens list. Since everything here is an exception, the approved tokens here will not be checked against our grammar rules.

Things to consider adding here are more contractions. This tokenizer isn’t concerned about capitalization so What's and what's are treated separately. I won’t worry about it in this example, but feel free to adjust the tokenizer as needed.

While, I’m not adding too many exceptions for this example, experiment. What kind of other things might be included in here.

It’s not uncommon to have a lexicon for common abbreviations. Like Mr. and U.K.. As needs arise you may finding yourself needing to add to this list.

"""
Exception lexicon for tokenizer.
"""

LEXICON = {
    "don't" : ["do", "n't"],
    "isn't" : ["is", "n't"],
    "What's" : ["What", "'s"],
    "I'm" : ["I", "'m"],
}

Next Post

In the next post we’ll start building the grammar and write tests to ensure the rules are encapsulated as intended.