Introduction
In this post, we'll start building a simple English tokenizer in pure Python. Simple in the sense that we won't aim for perfect tokenization, just the most common kinds of tokens we might see. In preparation, you may want to brush up on Python regular expressions, since we'll be leaning on them heavily. We'll also be building a tokenizer class, so some comfort with classes will help. Feel free to peruse the final code for this project.
Set Up the Project
The first step is to create a new folder and add the following Python files:
1. tokenizer.py
2. grammar.py
3. exceptions.py
4. test_tokenizer.py
5. demo.py
Honestly, we could probably get away with just two files: one for tests, and one for everything else. But I'd like to keep this extensible to some degree, so that you could build multiple grammars for various languages or purposes and easily import them into your tokenizer.
File Overview
Let’s take a moment to get acquainted with our files.
Tokenizer
The tokenizer file is the main file. Here we'll build the tokenizer object and its internal logic.
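As a rough preview, a minimal skeleton might look like the sketch below. The names (Tokenizer, tokenize) and the structure are placeholders of mine, not the final implementation we'll end up with:

```python
# tokenizer.py -- a minimal sketch; names and structure are placeholders,
# not the final implementation from this series.

class Tokenizer:
    def __init__(self, grammar, exceptions):
        # grammar: the regex rules; exceptions: the custom-split lexicon
        self.grammar = grammar
        self.exceptions = exceptions

    def tokenize(self, text):
        """Split text into a list of token strings."""
        tokens = []
        for chunk in text.split():
            if chunk in self.exceptions:
                # a custom tokenization wins over the general rules
                tokens.extend(self.exceptions[chunk])
            else:
                tokens.append(chunk)  # grammar-based splitting goes here later
        return tokens
```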
Grammar
The grammar file is where we'll create the rules describing which kinds of tokens are NOT allowed. For example, after splitting on whitespace, chunks like "hi or you", (words with punctuation still attached) should not be allowed as tokens. These rules will be declared with regular expressions. However, since regexes can be difficult to build and decipher as they get longer, we'll build them up piece by piece into manageable parts, with various tests to make sure they work as intended (see the sketch below).
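To make the piece-by-piece idea concrete, here's a hedged sketch of how small regex fragments can compose into larger rules. The fragment names are mine and purely illustrative:

```python
# grammar.py -- a sketch of composing small regex pieces into larger rules.
# Pattern names are illustrative, not the final grammar.
import re

QUOTE = r"""["']"""     # a single quote character
WORD = r"[A-Za-z]+"     # a run of letters
PUNCT = r"[.,!?;:]"     # common trailing punctuation

# Compose the pieces: a word glued to leading or trailing punctuation
# is exactly the kind of token our rules should flag.
LEADING_QUOTE = re.compile(rf"^{QUOTE}{WORD}$")            # e.g. "hi
TRAILING_PUNCT = re.compile(rf"^{WORD}{QUOTE}?{PUNCT}$")   # e.g. you",

assert LEADING_QUOTE.match('"hi')
assert TRAILING_PUNCT.match('you",')
```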
Exceptions
Exceptions will hold a lexicon for tokens where we want a custom tokenization. For example, isn't may map to is and n't. It might also be a good idea to hold common abbreviations in the exceptions lexicon. Ours will only hold a couple for the example, but feel free to fill yours up.
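As a sketch, the lexicon could be a plain dictionary mapping a raw chunk to the list of tokens it should become. The entries below are illustrative, not a definitive list:

```python
# exceptions.py -- a sketch of the custom-tokenization lexicon.
# Entries are illustrative; fill yours up as you like.

EXCEPTIONS = {
    "isn't": ["is", "n't"],
    "don't": ["do", "n't"],
    "can't": ["ca", "n't"],
    "Dr.": ["Dr."],     # abbreviation: keep as a single token
    "e.g.": ["e.g."],
}
```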
Test_Tokenizer
We’ll be using the Pytest framework to test our grammar rules and tokenizer. This may seem unnecessary for a toy example, but it will let us see just how well (or badly) our tokenizer performs. And should we change our grammar later to work better, we’ll need tests to verify that the change works and that there are no unintended consequences. The more tests the better!
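A test might look something like the sketch below. It assumes the placeholder Tokenizer interface and EXCEPTIONS dictionary from the earlier sketches, so adjust the names to whatever you end up with:

```python
# test_tokenizer.py -- a sketch of Pytest-style checks.
# Assumes the placeholder Tokenizer and EXCEPTIONS from the earlier sketches.
import pytest

from exceptions import EXCEPTIONS
from tokenizer import Tokenizer

@pytest.mark.parametrize("text,expected", [
    ("isn't", ["is", "n't"]),
    ("hello world", ["hello", "world"]),
])
def test_tokenize(text, expected):
    tok = Tokenizer(grammar=None, exceptions=EXCEPTIONS)
    assert tok.tokenize(text) == expected
```

Run pytest from the project folder and both cases should pass against the sketch above.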
Demo
This is a simple file where we’ll import our tokenizer and see it in action by running it over several sentences.
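Something along these lines, again assuming the placeholder interface from the sketches above:

```python
# demo.py -- a sketch of running the tokenizer over a few sentences.
# Assumes the placeholder Tokenizer and EXCEPTIONS from the earlier sketches.

from exceptions import EXCEPTIONS
from tokenizer import Tokenizer

sentences = [
    "This isn't so bad.",
    "Tokenizing English text is fun!",
]

tok = Tokenizer(grammar=None, exceptions=EXCEPTIONS)
for sentence in sentences:
    print(sentence, "->", tok.tokenize(sentence))
```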
Next Post
In the next post, we’ll start by adding our exceptions.