Introduction
One of the common tasks that I have to do is identify nouns and verbs in text. If I am doing object-oriented programming, the nouns will help in deciding which classes to create. Even if I am not doing object-oriented programming, having a list of nouns and verbs will facilitate understanding.
Python Advanced Concepts
Some of the Python code used might be considered "advanced".
For example, we will use a list of tuples. In my opinion, going from ["txt_1", "txt_2"] to [("txt_1a", "txt_1b"), ("txt_2a", "txt_2b")] is no big deal. Refer to output 2 for how it applies in this write-up.
Also, Counter() is used to count the number of times that a particular tuple occurs. Refer to output 3 for how it applies in this write-up.
The contents of Counter() is a dictionary whose keys are tuples and whose values are integer counts. Refer to output 3 for how it applies in this write-up.
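As a quick illustration of both ideas, here is a minimal sketch; the (token, POS) pairs below are made up for the example, not taken from this write-up's text data:

```python
from collections import Counter

# A list of tuples, each pairing a hypothetical token with a POS tag
pairs = [("leaves", "NNS"), ("green", "JJ"), ("leaves", "NNS")]

# Counter() counts how many times each tuple occurs
counts = Counter(pairs)
print(counts[("leaves", "NNS")])  # 2

# update() adds counts from another iterable of pairs, in place
counts.update([("leaves", "NNS"), ("red", "JJ")])
print(counts[("leaves", "NNS")])  # 3
```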
Natural Language Toolkit (NLTK)
NLTK is used for tokenization and then assigning parts of speech (POS) to those tokens. Unfortunately, NLTK is large and complex. The reference section at the end provides links and short text snippets for the applicable NLTK components.
Computing Environment
Python version 3.12.9 is used.
NLTK version 3.9.1 is used.
I used Anaconda to install NLTK. It does not install the NLTK data by default, because downloading everything via nltk.download() takes approximately 2.5 GB. Consequently, I had to install punkt_tab in order to get nltk.word_tokenize() to work.
>>> import nltk
>>> nltk.download('punkt_tab')
Similarly, I had to install averaged_perceptron_tagger_eng (Source code for nltk.tag.perceptron, class nltk.tag.perceptron.AveragedPerceptron) in order to get nltk.pos_tag() to work.
>>> import nltk
>>> nltk.download('averaged_perceptron_tagger_eng')
Software Structure
To go to the file containing the working code, click here.
The structure of the code is as follows:
Create text data
Process the text to tokenize it, assign a POS to it, and then count the combination of tokens and POS.
Filter data such that only nouns and verbs remain.
Sort data such that the output has desired format.
Create Text Data
The test data is created as follows:
Code Snippet 1
TEXT_1 = """ This is chunk one of the text. The leaves on a tree are green. The leaves on a tree aren't red. """ TEXT_2 = """ This is chunk two of the text. The color of the car is red. The color of the car also contains blue. The color of the car isn't green. """ LIST_OF_STRINGS = [TEXT_1, TEXT_2]
TEXT_1 and TEXT_2 are easily modifiable, so readers can substitute their own test data. Also, it is easy to expand the data with TEXT_3, TEXT_4 and so on, because the processing is based on the list of text contained in LIST_OF_STRINGS.
Process Text: tokenize, assign POS to tokens, count combinations of token and POS
The processing of text is accomplished with the following code snippet.
Code Snippet 2
from collections import Counter

import nltk

# For the combination of a word and its part of speech,
# create a count of its associated appearances
for i, string in enumerate(LIST_OF_STRINGS):
    # Decompose string into tokens
    string_tokenized = nltk.word_tokenize(string)
    # Assign part of speech to each token
    string_pos_tag = nltk.pos_tag(string_tokenized)
    # Count the number of (token, part of speech) combinations
    if i == 0:
        count_token_pos = Counter(string_pos_tag)
    else:
        count_token_pos.update(Counter(string_pos_tag))
Tokenize
Tokenization is accomplished via nltk.word_tokenize().
Its output is a list of strings.
Output 1
['This', 'is', 'chunk', ... 'are', "n't", 'red', '.']
Since the nature of the material is introductory, we will not worry about edge cases like contractions ("n't").
Assign POS to Tokens
Association of a POS with a token is accomplished via nltk.pos_tag().
Its output is a list of tuples, where each tuple consists of a token and a POS.
Output 2
[('This', 'DT'), ('is', 'VBZ'), ('chunk', 'JJ'), ... ("n't", 'RB'), ('red', 'JJ'), ('.', '.')]
'DT' stands for determiner. 'VBZ' stands for a verb that is third person singular present. For the full list, refer to Alphabetical list of part-of-speech tags used in the Penn Treebank Project at UPenn.Edu
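To make the tags easier to read during exploration, a small lookup table can translate them. The subset of tags and the paraphrased descriptions below are my own illustrative sketch, not something provided by NLTK; consult the full Penn Treebank list for authoritative definitions:

```python
# Illustrative subset of the Penn Treebank tagset; descriptions paraphrased
PENN_TAG_DESCRIPTIONS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "RB": "adverb",
    "VBZ": "verb, third person singular present",
}

def describe(tagged_tokens):
    """Attach a human-readable description to each (token, tag) pair."""
    return [(token, tag, PENN_TAG_DESCRIPTIONS.get(tag, "unknown"))
            for token, tag in tagged_tokens]

print(describe([("This", "DT"), ("is", "VBZ")]))
```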
Count Combinations of Token and POS
Counter() is used to count the number of times that each combination of token and POS occurs.
The output of Counter(string_pos_tag) is
Output 3 (Just for processing TEXT_1)
Counter(
    {
        ('.', '.'): 3,
        ('The', 'DT'): 2,
        ('leaves', 'NNS'): 2,
        ...
        ('green', 'JJ'): 1,
        ("n't", 'RB'): 1,
        ('red', 'JJ'): 1
    }
)
Notice that the contents of Counter() is a dictionary whose keys are tuples. Each tuple key consists of a token and a POS, as shown in output 2. The integer value is simply the number of times that the tuple occurs.
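A short sketch of this structure, seeded with a couple of hypothetical entries mirroring output 3, shows the dictionary-like behavior:

```python
from collections import Counter

# Same shape as output 3: (token, POS) tuple keys, integer count values
count_token_pos = Counter({("leaves", "NNS"): 2, ("green", "JJ"): 1})

print(isinstance(count_token_pos, dict))   # True: Counter subclasses dict
print(count_token_pos[("leaves", "NNS")])  # 2
print(count_token_pos[("car", "NN")])      # 0: missing keys default to zero
print(count_token_pos.most_common(1))      # [(('leaves', 'NNS'), 2)]
```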
Filter Data for Nouns & Verbs
The first step is to identify the POS that correspond to nouns and verbs.
This is implemented via the following code snippet.
Code Snippet 3
NOUN_POS = ["NN", "NNS", "NNP", "NNPS"]
VERB_POS = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]
'NN' stands for a noun which is singular. 'NNS' stands for a noun which is plural. For the full list, refer to Alphabetical list of part-of-speech tags used in the Penn Treebank Project at UPenn.Edu
The filtering is implemented via the following code snippet.
Code Snippet 4
list_noun_count = []
list_verb_count = []

for token_pos in count_token_pos:
    if token_pos[1] in NOUN_POS:
        list_noun_count.append((token_pos[0], count_token_pos[token_pos]))
    elif token_pos[1] in VERB_POS:
        list_verb_count.append((token_pos[0], count_token_pos[token_pos]))
In the above code, there is nothing worthy of note. Just an if statement inside a for loop. If people need a refresher on iterating through a dictionary, consider reading "How to Iterate Through a Dictionary in Python" by Leodanis Pozo Ramos.
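For readers who prefer the .items() idiom, the same filter can also be written with tuple unpacking; the counts below are hypothetical stand-ins for the real count_token_pos:

```python
from collections import Counter

NOUN_POS = ["NN", "NNS", "NNP", "NNPS"]
VERB_POS = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]

# Hypothetical counts mirroring the shape of output 3
count_token_pos = Counter({("leaves", "NNS"): 2, ("is", "VBZ"): 4, ("green", "JJ"): 1})

# Same filter as code snippet 4, using .items() and tuple unpacking
list_noun_count = [(token, count)
                   for (token, pos), count in count_token_pos.items()
                   if pos in NOUN_POS]
list_verb_count = [(token, count)
                   for (token, pos), count in count_token_pos.items()
                   if pos in VERB_POS]

print(list_noun_count)  # [('leaves', 2)]
print(list_verb_count)  # [('is', 4)]
```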
The output of this step is provided below.
Output 4.A: List of nouns and their counts
[ ('text', 2), ('leaves', 2), ('tree', 2), ('color', 3), ('car', 3) ]
Output 4.B: List of verbs and their counts
[ ('is', 4), ('are', 2), ('contains', 1) ]
Notice that the output is a list of tuples. Also, each tuple consists of a string and an integer. This will be important when the lists are sorted.
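Why does that matter? Tuples compare element by element, so calling a plain .sort() on a list of (string, integer) tuples orders it alphabetically by the string first. A minimal sketch, using made-up pairs:

```python
# Tuples compare element by element, so (string, int) pairs
# sort alphabetically by the string when no key is given
pairs = [("tree", 2), ("car", 3), ("color", 3)]
pairs.sort()
print(pairs)  # [('car', 3), ('color', 3), ('tree', 2)]
```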
Sort Data & Generate Output
One set of output will be nouns and their associated counts. This needs to be sorted either by the nouns alphabetically or by their counts; otherwise, the data will be in arbitrary order. Similarly, another set of output will be verbs and their counts, which will also need to be sorted.
This is implemented via the following code snippet.
Code Snippet 5
# Sort data alphabetically
list_noun_count.sort()
list_verb_count.sort()

with open("noun_counts_alphabetized_by_noun.txt", "w", encoding="utf8") as file:
    for noun_count in list_noun_count:
        file.write(noun_count[0] + ", " + str(noun_count[1]) + "\n")

with open("verb_counts_alphabetized_by_verb.txt", "w", encoding="utf8") as file:
    for verb_count in list_verb_count:
        file.write(verb_count[0] + ", " + str(verb_count[1]) + "\n")

# Sort data by decreasing count
list_noun_count.sort(key=lambda noun_count: noun_count[1], reverse=True)
list_verb_count.sort(key=lambda verb_count: verb_count[1], reverse=True)

with open("noun_counts_by_decreasing_count.txt", "w", encoding="utf8") as file:
    for noun_count in list_noun_count:
        file.write(noun_count[0] + ", " + str(noun_count[1]) + "\n")

with open("verb_counts_by_decreasing_count.txt", "w", encoding="utf8") as file:
    for verb_count in list_verb_count:
        file.write(verb_count[0] + ", " + str(verb_count[1]) + "\n")
The first thing to note is that sorting a list is done in place. Consequently, we can just use list_noun_count.sort(). This is different from the built-in function sorted(), which returns a new sorted list.
The second thing to note is that we specified a key for sorting by count via lambda noun_count: noun_count[1]. Recall from outputs 4.A and 4.B that we are sorting a list of tuples. If people need a refresher on sorting, consider reading "How to Use sorted() and .sort() in Python" by David Fundakowski.
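Both points can be seen in a few lines; the pairs below are made up for the example:

```python
pairs = [("text", 2), ("car", 3), ("leaves", 2)]

# .sort() sorts in place and returns None ...
result = pairs.sort(key=lambda pair: pair[1], reverse=True)
print(result)  # None
print(pairs)   # [('car', 3), ('text', 2), ('leaves', 2)]

# ... while sorted() leaves its input alone and returns a new list
by_name = sorted(pairs)
print(by_name)  # [('car', 3), ('leaves', 2), ('text', 2)]
```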
Lastly, please make a habit of specifying utf8 as the encoding when creating a file. Unfortunately, when no encoding is specified, Python uses the OS default [locale.getencoding()]. This is especially important on the Windows operating system, because it defaults to cp1252.
The final output is provided below.
Output 5.A: Nouns sorted by count
car, 3 color, 3 leaves, 2 text, 2 tree, 2
Output 5.B: Verbs sorted by count
is, 4 are, 2 contains, 1
Alternative Approaches
An alternative to NLTK is spaCy. Googling "Identify Nouns & Verbs in Text Using spaCy" will obtain results that will help with starting that journey.
Summary
Determining nouns and verbs is simple. :-)
Tokenize using nltk.word_tokenize().
Associate a POS with each token using nltk.pos_tag().
Count the number of times that each combination of token and POS occurs by using Python's Counter class.
Filter such that only noun and verb POS remain.
Sort the data as needed to generate the desired output.
However, as is commonly said, the devil is in the details. If you are interested in the details, check out the reference section below.
References
Articles
A Good Part-of-Speech Tagger in about 200 Lines of Python by Matthew Honnibal
Alphabetical list of part-of-speech tags used in the Penn Treebank Project at UPenn.Edu
Tag: NN
Description: Noun, singular or mass
...
How to Iterate Through a Dictionary in Python by Leodanis Pozo Ramos
How to Use sorted() and .sort() in Python by David Fundakowski
NLTK Lookup Error at Stack Overflow
First answer said the missing module is 'the Perceptron Tagger', actually its name in nltk.download is 'averaged_perceptron_tagger'
You can use this to fix the error
nltk.download('averaged_perceptron_tagger')
What to download in order to make nltk.tokenize.word_tokenize work? at Stack Overflow
At home, I downloaded all nltk resources by nltk.download() but, as I found out, it takes ~2.5GB.
Try to download ... punkt_tab
import nltk
nltk.download('punkt_tab')
Books
Natural Language Toolkit (NLTK)
NLTK Documentation
All modules for which code is available (_modules)
Source code for nltk.tag.perceptron
TAGGER_JSONS = {
    "eng": {
        "weights": "averaged_perceptron_tagger_eng.weights.json",
        "tagdict": "averaged_perceptron_tagger_eng.tagdict.json",
        "classes": "averaged_perceptron_tagger_eng.classes.json",
    },
}
Introduction
The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.
Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:
>>> download('treebank')
[nltk_data] Downloading package 'treebank'...
[nltk_data] Unzipping corpora/treebank.zip.
NLTK Taggers
This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.
A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):
>>> tagged_tok = ('fly', 'NN')
An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
NB. Use pos_tag_sents() for efficient tagging of more than one sentence.
NLTK Tokenizer Package
Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the words and punctuation in a string:
>>> from nltk.tokenize import word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me ... two of them.\n\nThanks.'''
>>> word_tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
This particular tokenizer requires the Punkt sentence tokenization models to be installed.
Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
The NLTK data package includes a pre-trained Punkt tokenizer for English.
codecs - Codec registry and base classes
Codec: cp1252 - Aliases: windows-1252 - Language: Western Europe
Codec: utf_8 - Aliases: U8, UTF, utf8, cp65001 - Language: all languages
On Windows, it is the ANSI code page (ex: "cp1252").
In text mode, if encoding is not specified the encoding used is platform-dependent: locale.getencoding() is called to get the current locale encoding.