Daniel J. Dorado

Computational Linguist

Natural Language Processing


Articles by category: nlp


2020
09 Jan 2020

Let's Make a Tokenizer Part 4

Congratulations on making it to the last post in the tokenizer series. Let’s jump right in and bring in our...

06 Jan 2020

Let's Make a Tokenizer Part 3

Grammar: So, as mentioned in the first post, the grammar is where rules are specified. These grammatical rules will check...

2019
29 Dec 2019

Let's Make a Tokenizer Part 2

Exceptions: So let’s get started making the exceptions.py file. As you can see, this is pretty basic. It’s just a Python...

28 Dec 2019

Let's Make a Tokenizer Part 1

Introduction: So in this post, we’ll start making a simple English tokenizer in pure Python. Simple in the sense that...

21 Dec 2019

Tokenization

Introduction: Tokenization is an important step in the NLP pipeline. It is often part of the text normalization process. Many...

13 Nov 2019

Normalizing Text with Regex Groups in Python

Normalizing Text with Regex Groups in Python In this post we’re going to look at how regex groups can help...