Introduction
You may have heard a lot about one of the hottest topics in computing today: Natural Language Processing (NLP). In this article, we will look at tokenization using the Natural Language Toolkit (NLTK) module for Python.
First of all, let's briefly discuss the NLTK module. The Natural Language Toolkit is a Python library that helps programmers work through the entire Natural Language Processing (NLP) workflow. It provides easy-to-use interfaces to lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
In this article, we will focus on tokenizing. The first question is: what is tokenizing? Tokenizing simply means splitting a body of text into sentences and words. For example, a sentence tokenizer can be used to get the list of sentences in a text, and a word tokenizer can be used to get the list of words in a string.
Let's start by installing NLTK 3. I assume you are familiar with pip. Open your shell and type the command given below to install NLTK:
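- pip install nltk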
As discussed above, the NLTK module has a lot of components, and we will need a few of them. To download these components, run nltk.download() from your Python interpreter or IDE:
- import nltk
- nltk.download()
A prompt window will open as soon as the command is executed. The options in this window appear in red, and they turn green once their installation completes, as shown in the images below:
I suggest choosing 'all' for the packages and then clicking 'download'. That way, all of the packages, such as the tokenizers, chunkers, etc., will be downloaded. Alternatively, you can download the packages manually.
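If you prefer to skip the GUI, you can also fetch only the resources used in this article from a script. Here is a minimal sketch, assuming the standard NLTK data package names 'punkt' (the sentence and word tokenizer models) and 'stopwords' (the stop word lists):
- import nltk
- nltk.download('punkt')      # models used by sent_tokenize and word_tokenize
- nltk.download('stopwords')  # stop word lists, used towards the end of this article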
Now that we have everything we need, it's time to get familiar with some basic terms we will be using often. Let's knock out some quick vocabulary:
- Corpus
A body of text, singular; corpora is the plural. Example: a collection of medical journals.
- Lexicon
Words and their meanings. Example: an English dictionary. Keep in mind, however, that different fields have different lexicons.
- Token
A token is an instance of a sequence of characters in a particular document that is grouped together as a useful semantic unit for processing. For example, when a sentence is "tokenized" into words, each word is a token. We can also tokenize a paragraph into sentences.
With that, let's show an example of how one might actually tokenize something into tokens with the NLTK module.
- from nltk.tokenize import sent_tokenize, word_tokenize
- TEXT = "Hello Mr. Buddy, hope you are doing well. I like to work in Python, and it's pretty simple too. I hope you will be enjoying this article"
- print(sent_tokenize(TEXT))
The first step in tokenizing the whole paragraph is splitting it into sentences. A naive approach would be to split on a period, or on a period followed by a space, but that breaks on abbreviations such as "Mr.". Splitting into words is an even bigger challenge because of contractions, commas, and other punctuation. But don't worry, NLTK is here to help: sent_tokenize handles the sentence splitting for us, and we can just as easily tokenize the sentences into words using word_tokenize.
The output of the above code will look like this:
['Hello Mr. Buddy, hope you are doing well.', "I like to work in Python, and it's pretty simple too.", 'I hope you will be enjoying this article']
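To see why sent_tokenize is preferable to a naive approach, here is a quick sketch that splits the same text on a period followed by a space; notice how it wrongly cuts the text right after the abbreviation "Mr.":
- # Naive splitting on ". " breaks the abbreviation "Mr." away from the name
- print(TEXT.split(". "))
- # ['Hello Mr', 'Buddy, hope you are doing well', "I like to work in Python, and it's pretty simple too", 'I hope you will be enjoying this article']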
So by now, we have done sentence tokenization. Let’s have a look at word tokenization.
- print(word_tokenize(TEXT))
Now our output is: ['Hello', 'Mr.', 'Buddy', ',', 'hope', 'you', 'are', 'doing', 'well', '.', 'I', 'like', 'to', 'work', 'in', 'Python', ',', 'and', 'it', "'s", 'pretty', 'simple', 'too', '.', 'I', 'hope', 'you', 'will', 'be', 'enjoying', 'this', 'article']
There are a few things to note here. First, notice that punctuation is treated as separate tokens. Second, the output contains many common words, such as 'I', 'to', and 'are', that add little meaning on their own; in NLP, these are called stop words.
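As a quick preview, here is a minimal sketch of how stop words could be filtered out using NLTK's built-in English stop word list (it assumes the 'stopwords' package was downloaded earlier):
- from nltk.corpus import stopwords
- STOP_WORDS = set(stopwords.words('english'))
- # Keep only the tokens that are not English stop words
- filtered = [word for word in word_tokenize(TEXT) if word.lower() not in STOP_WORDS]
- print(filtered)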
In my next article, I will explain the meaning of all these tokenized words.