Introduction
Tokenization is an important step in the NLP pipeline. It is often part of the text normalization process. Many text transformations can’t be done until the text is tokenized. This post will examine what tokenization is and some of its challenges.
Tokenization
A token is a sequence of characters. For example, `cat`, `...`, `;)`, `What's`, and `R.S.V.P.` are all tokens. Notice that tokens aren’t necessarily words, like `:)`. This allows us to say that all words are tokens, but not all tokens are words. However, functionally, most tokens are words. The token `What?!!!!!` may not be useful in its own right, but `What`, `?`, `!`, `!`, `!`, `!`, `!` might be.
So tokenization is the process of splitting a text into tokens (essentially words). For example, `hi how are you?` would become `hi`, `how`, `are`, `you`, `?`.
At first glance, this may seem pretty simple. Python’s string `split()` method easily splits text into tokens. You might have to gnaw off some punctuation, but it’s essentially done.
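Here is a minimal sketch of that first pass; the regex is only meant to show the punctuation problem, not to be a real tokenizer:

```python
import re

text = "Hello, how are you?"

# Naive whitespace tokenization: punctuation stays glued to the words.
print(text.split())
# ['Hello,', 'how', 'are', 'you?']

# A slightly better first pass: peel punctuation off into its own tokens.
print(re.findall(r"\w+|[^\w\s]", text))
# ['Hello', ',', 'how', 'are', 'you', '?']
```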
`Hello, how are you?` in Japanese (according to Google Translate) is こんにちは、元気ですか? Japanese doesn’t use spaces like English does! This is like writing `Hello,howareyou?` That’s one loooong token! Punctuation helps a little to tease tokens apart: `こんにちは`, `、`, `元気ですか`, `?` (or `Hello`, `,`, `howareyou`, `?`); however, this still doesn’t give you accurate tokens. Tokenization for these languages often has to be done by hand for the best results, although there are a few algorithms, such as [Maximum Matching](https://medium.com/@anshul16/maximum-matching-word-segmentation-algorithm-python-code-3444fe4bd6f9), which are successful at tokenizing this kind of text depending on the language. Languages with shorter words seem to do better. Some languages, like Ancient Greek and Latin, don’t use punctuation. Good thing you probably won’t work with these languages.
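To get a feel for how maximum matching works, here’s a toy sketch: a greedy, left-to-right, longest-match lookup against a dictionary. The vocabulary below is a made-up handful of entries, not a real Japanese lexicon, so treat the output as illustrative only.

```python
def max_match(text, vocab, max_len=6):
    """Greedy (left-to-right) maximum matching word segmentation."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until we hit the
        # vocabulary; fall back to a single character if nothing matches.
        for j in range(min(len(text), i + max_len), i, -1):
            candidate = text[i:j]
            if candidate in vocab or j == i + 1:
                tokens.append(candidate)
                i = j
                break
    return tokens

toy_vocab = {"こんにちは", "元気", "です", "か"}
print(max_match("こんにちは、元気ですか?", toy_vocab))
# ['こんにちは', '、', '元気', 'です', 'か', '?']
```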
Ambiguity and Arbitrary Decisions
Sometimes there might be ambiguity as to what constitutes a word. `You shouldn't've done that!` Is `shouldn't've` one word, two words, or three (`should`, `not`, `have`)? What about `should`, `n't`, `'ve`?
Properly tokenizing text requires good knowledge of the language. It also requires making arbitrary decisions. Do we want to consider `what's` a single word or two words? In the end, a decision has to be made based on project needs.
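The choice really is arbitrary, and you can encode whichever convention the project needs. The toy regexes below (invented for this post, not taken from any library) show two of the options for `shouldn't've`:

```python
import re

text = "You shouldn't've done that!"

# Option 1: keep whole contraction chains together as single tokens.
print(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^\w\s]", text))
# ['You', "shouldn't've", 'done', 'that', '!']

# Option 2: split each clitic off, Penn-Treebank style.
print(re.findall(r"[A-Za-z]+(?=n't)|n't|'[A-Za-z]+|[A-Za-z]+|[^\w\s]", text))
# ['You', 'should', "n't", "'ve", 'done', 'that', '!']
```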
Another example is `gonna`. Clearly this word is a single token, but do we want it to be? We might also split it as `gon` and `na`. How might this affect part-of-speech tagging?
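One way to explore that question is to feed both tokenizations to an off-the-shelf tagger and compare the tags. The sketch below uses NLTK’s `pos_tag` (assuming NLTK and its tagger model are installed); the sentence is just an invented example.

```python
import nltk  # may require: nltk.download('averaged_perceptron_tagger')

# Two competing tokenizations of the same utterance.
unsplit = ["I", "'m", "gonna", "leave", "."]
split = ["I", "'m", "gon", "na", "leave", "."]

# Compare how the tagger treats "gonna" versus "gon" + "na".
print(nltk.pos_tag(unsplit))
print(nltk.pos_tag(split))
```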
Punctuation
In English, some tokens require punctuation, `E.T.A.` for example. So if punctuation is split off, you might get an erroneous tokenization of `E.T.A` and `.`, or something odder like `E`, `.`, `T`, `.`, `A`, `.`. Sometimes the punctuation may be fulfilling two roles: in `I like you, Mr.` the final period marks both the abbreviation and the end of the sentence. And what about `fo'c's'le`? Obviously very important.
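As a rough illustration (the sentence and regexes are invented for this post), compare what happens to abbreviations when punctuation is stripped versus when letter-period runs are kept together:

```python
import re

text = "Mr. Smith's E.T.A. is 5 p.m."

# Stripping punctuation wholesale shreds the abbreviations.
print(re.findall(r"\w+", text))
# ['Mr', 'Smith', 's', 'E', 'T', 'A', 'is', '5', 'p', 'm']

# Keeping single-letter-plus-period runs together rescues E.T.A. and p.m.,
# but still misses Mr. -- real tokenizers lean on an abbreviation lexicon.
print(re.findall(r"(?:[A-Za-z]\.)+|\w+(?:'\w+)*|[^\w\s]", text))
# ['Mr', '.', "Smith's", 'E.T.A.', 'is', '5', 'p.m.']
```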
That Apostrophe S
The `'s` has two meanings in English. It can either signal a contraction or the genitive clitic.
1. The book is John's.
2. No, it's the woman with the green hat's.
3. But the book's been in the library's collection for a while now.
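A tokenizer can split the `'s` off consistently, but by itself it cannot tell the genitive clitic from the contraction; that distinction is left to later stages such as part-of-speech tagging. A quick sketch (the regex is just for illustration):

```python
import re

sentences = [
    "The book is John's.",
    "No, it's the woman with the green hat's.",
    "But the book's been in the library's collection for a while now.",
]

# Every 's comes out as the same token, whether it means
# "is", "has", or possession.
for s in sentences:
    print(re.findall(r"'s\b|[A-Za-z]+|[^\w\s]", s))
```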
Final Apostrophe
Also note that English possessive plurals require a final apostrophe: `This is the libraries' book.` I guess this means we can’t strip off all punctuation…
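A tiny sketch of the problem (regexes invented for illustration): stripping every non-word character makes the possessive plural look like a plain plural, while keeping the trailing apostrophe preserves the distinction.

```python
import re

text = "This is the libraries' book."

# Strip all punctuation: the possessive marker vanishes.
print(re.findall(r"\w+", text))
# ['This', 'is', 'the', 'libraries', 'book']

# Keep a trailing apostrophe attached to its word.
print(re.findall(r"[A-Za-z]+'?|[^\w\s]", text))
# ['This', 'is', 'the', "libraries'", 'book', '.']
```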
Internal Punctuation
Not all languages use punctuation in the same way. A personal favorite is Armenian tonal punctuation, where punctuation (like !, ?, and :) is attached to the syllable in the word that you would like to emphasize. For example, with the question mark (I’ll be using English to illustrate the example):
Standard English punctuation:
What are you doing?
Armenian equivalents:
- Wha?t are you doing.
- What a?re you doing.
- What are you? doing.
- What are you do?ing.
While this articulates precisely what the writer wants to emphasize, it makes processing text more interesting. How should `do?ing` be tokenized? You have interwoven tokens. Should `doing` and `do?ing` be considered separate tokens? Something tells me that autocorrect in Armenian is probably more frustrating than in other languages.
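One possible (purely hypothetical) way to handle it is to pull the in-word mark out into an annotation, so `do?ing` and `doing` normalize to the same underlying token. The sketch below uses the ASCII `?` as a stand-in for the Armenian mark:

```python
import re

def split_emphasis(token):
    """Separate an in-word emphasis mark from the word it annotates.

    Returns (plain_word, index_of_emphasis), where the index is the
    character position the mark followed, or None if there is no mark.
    """
    match = re.search(r"\?", token)
    if match is None:
        return token, None
    return token.replace("?", ""), match.start()

print(split_emphasis("do?ing"))  # ('doing', 2)
print(split_emphasis("doing"))   # ('doing', None)
```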
Conclusion
By now I hope you’re thinking about how you want to deal with tokenization for some of your projects. Know of any interesting tokenization situations? Feel free to post them down below in the comments.