# Assignment 2: Parts-of-Speech Tagging (POS)

Welcome to the second assignment of Course 2 in the Natural Language Processing specialization. This assignment will develop skills in part-of-speech (POS) tagging, the process of assigning a part-of-speech tag (noun, verb, adjective, etc.) to each word in an input text. Tagging is difficult because some words can represent more than one part of speech at different times; that is, they are ambiguous. Consider the following example:

• The whole team played well. [adverb]
• You are doing well for yourself. [adjective]
• Well, this assignment took me forever to complete. [interjection]
• The well is dry. [noun]
• Tears were beginning to well in her eyes. [verb]

Distinguishing a word's part of speech helps you better understand the meaning of a sentence. This is critically important in search queries: identifying the proper noun, the organization, the stock symbol, or anything similar greatly improves everything from speech recognition to search. By completing this assignment, you will:

• Learn how parts-of-speech tagging works
• Compute the transition matrix A in a Hidden Markov Model
• Compute the emission matrix B in a Hidden Markov Model
• Implement the Viterbi algorithm
• Compute the accuracy of your own model

## Part 0: Data Sources

This assignment will use two tagged data sets collected from the Wall Street Journal (WSJ).

Here is an example 'tag set', or part-of-speech designation, describing each two- or three-letter tag and its meaning.

• One data set (WSJ-2_21.pos) will be used for training.
• The other (WSJ-24.pos) for testing.
• The tagged training data has been preprocessed to form a vocabulary (hmm_vocab.txt).
• The words in the vocabulary are words from the training set that were used two or more times.
• The vocabulary is augmented with a set of ‘unknown word tokens’, described below.

The training set will be used to create the emission, transition and tag counts.

The test set (WSJ-24.pos) is read in to create y.

• This contains both the test text and the true tag.
• The test set has also been preprocessed to remove the tags to form test_words.txt.
• This is read in and further processed to identify the end of sentences and handle words not in the vocabulary using functions provided in utils_pos.py.
• This forms the list prep, the preprocessed text used to test our POS taggers.

A POS tagger will necessarily encounter words that are not in its datasets.

• To improve accuracy, these words are further analyzed during preprocessing to extract available hints as to their appropriate tag.
• For example, the suffix ‘ize’ is a hint that the word is a verb, as in ‘final-ize’ or ‘character-ize’.
• A set of unknown tokens, such as '--unk_verb--' or '--unk_noun--', will replace the unknown words in both the training and test corpus and will appear in the emission, transition and tag data structures (see the sketch below).
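
As an illustration, here is a hedged sketch of this kind of suffix-based mapping. The actual rules live in utils_pos.py; the function name and the suffix lists here are illustrative, not the assignment's exact ones:

```python
def assign_unk(word):
    """Map an out-of-vocabulary word to an unknown-word token using
    character and suffix hints. Illustrative only: the assignment's
    utils_pos.py defines its own rules and suffix lists."""
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if any(not ch.isalnum() for ch in word):
        return "--unk_punct--"
    if any(ch.isupper() for ch in word):
        return "--unk_upper--"
    if word.endswith(("tion", "ment", "ness", "ity")):
        return "--unk_noun--"
    if word.endswith(("ize", "ify", "ate")):
        return "--unk_verb--"
    if word.endswith(("able", "ible", "ful", "ous")):
        return "--unk_adj--"
    if word.endswith("ly"):
        return "--unk_adv--"
    return "--unk--"

print(assign_unk("finalize"))  # --unk_verb--
```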

Implementation note:

• For Python 3.6 and beyond, dictionaries retain insertion order.
• Furthermore, their hash-based lookup makes them suitable for rapid membership tests.
• If _di_ is a dictionary, the expression _key_ in _di_ returns True if _di_ contains the key _key_, and False otherwise.

The dictionary vocab will utilize these features.
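
A minimal illustration of both properties (the words and values here are illustrative):

```python
# Build a dictionary from a list of words; insertion order is preserved.
vocab = {}
for i, word in enumerate(['!', '#', '$']):
    vocab[word] = i          # word -> unique integer ID

print('$' in vocab)          # True: hash-based membership test, O(1) on average
print('zebra' in vocab)      # False
print(list(vocab.items()))   # [('!', 0), ('#', 1), ('$', 2)], insertion order
```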

A few items of the training corpus list
['In\tIN\n', 'an\tDT\n', 'Oct.\tNNP\n', '19\tCD\n', 'review\tNN\n']

A few items of the vocabulary list
['!', '#', '$', '%', '&', "'", "''", "'40s", "'60s", "'70s", "'80s", "'86", "'90s", "'N", "'S", "'d", "'em", "'ll", "'m", "'n'", "'re", "'s", "'til", "'ve", '(', ')', ',', '-', '--', '--n--', '--unk--', '--unk_adj--', '--unk_adv--', '--unk_digit--', '--unk_noun--', '--unk_punct--', '--unk_upper--', '--unk_verb--', '.', '...', '0.01', '0.0108', '0.02', '0.03', '0.05', '0.1', '0.10', '0.12', '0.13', '0.15']

A few items at the end of the vocabulary list
['yards', 'yardstick', 'year', 'year-ago', 'year-before', 'year-earlier', 'year-end', 'year-on-year', 'year-round', 'year-to-date', 'year-to-year', 'yearlong', 'yearly', 'years', 'yeast', 'yelled', 'yelling', 'yellow', 'yen', 'yes', 'yesterday', 'yet', 'yield', 'yielded', 'yielding', 'yields', 'you', 'young', 'younger', 'youngest', 'youngsters', 'your', 'yourself', 'youth', 'youthful', 'yuppie', 'yuppies', 'zero', 'zero-coupon', 'zeroing', 'zeros', 'zinc', 'zip', 'zombie', 'zone', 'zones', 'zoning', '{', '}', '']

Vocabulary dictionary, key is the word, value is a unique integer
:0
!:1
#:2
$:3
%:4
&:5
':6
'':7
'40s:8
'60s:9
'70s:10
'80s:11
'86:12
'90s:13
'N:14
'S:15
'd:16
'em:17
'll:18
'm:19
'n':20

A sample of the test corpus
['The\tDT\n', 'economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n', 'will\tMD\n', 'be\tVB\n', 'taken\tVBN\n', 'from\tIN\n', 'several\tJJ\n', 'vantage\tNN\n']

The length of the preprocessed test corpus:  34199
This is a sample of the test_corpus:
['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken', 'from', 'several', '--unk--']


# Part 1: Parts-of-speech tagging

## Part 1.1 - Training

You will start with the simplest possible parts-of-speech tagger and build up to the state of the art.

In this section, you will find the words that are not ambiguous.

• For example, the word 'is' is a verb and it is not ambiguous.
• In the WSJ corpus, 86% of the tokens are unambiguous (meaning they have only one tag).
• About 14% are ambiguous (meaning they have more than one tag).

Before you start predicting the tags of each word, you will need to compute a few dictionaries that will help you to generate the tables.

#### Transition counts

• The first dictionary is the transition_counts dictionary, which counts the number of times each tag appears next to another tag.

This dictionary will be used to compute:

$$P(t_i \mid t_{i-1}) \tag{1}$$

This is the probability of a tag at position $i$ given the tag at position $i-1$.

In order for you to compute equation 1, you will create a transition_counts dictionary where

• The keys are (prev_tag, tag)
• The values are the number of times those two tags appeared in that order.
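
Though not required yet, it may help to see where these counts lead; here is a maximum-likelihood sketch of equation 1 (the assignment later adds smoothing on top of this form):

$$P(t_i \mid t_{i-1}) \approx \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$

Here $C(t_{i-1}, t_i)$ is transition_counts[(prev_tag, tag)] and $C(t_{i-1})$ is the total number of times prev_tag occurred.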

#### Emission counts

The second dictionary you will compute is the emission_counts dictionary. This dictionary will be used to compute:

$$P(w_i \mid t_i) \tag{2}$$

In other words, you will use it to compute the probability of a word given its tag.

In order for you to compute equation 2, you will create an emission_counts dictionary where

• The keys are (tag, word)
• The values are the number of times that pair showed up in your training set.
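
Analogously, a maximum-likelihood sketch of equation 2 (again, before smoothing):

$$P(w_i \mid t_i) \approx \frac{C(t_i, w_i)}{C(t_i)}$$

Here $C(t_i, w_i)$ is emission_counts[(tag, word)] and $C(t_i)$ is tag_counts[tag].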

#### Tag counts

The last dictionary you will compute is the tag_counts dictionary.

• The key is the tag
• The value is the number of times each tag appeared.

### Exercise 01

Instructions: Write a program that takes in the training_corpus and returns the three dictionaries mentioned above: transition_counts, emission_counts, and tag_counts.

• emission_counts: maps (tag, word) to the number of times it happened.
• transition_counts: maps (prev_tag, tag) to the number of times it has appeared.
• tag_counts: maps (tag) to the number of times it has occurred.

Implementation note: This routine utilizes defaultdict, which is a subclass of dict.

• A standard Python dictionary throws a KeyError if you try to access an item with a key that is not currently in the dictionary.
• In contrast, the defaultdict will create an item of the type of the argument, in this case an integer with the default value of 0.
• See defaultdict.
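
A minimal sketch of such a routine, assuming a helper get_word_tag(line, vocab) that splits a 'word\tTAG' line and maps out-of-vocabulary words to the --unk...-- tokens (utils_pos.py provides this kind of handling), and an assumed start-of-sentence tag '--s--':

```python
from collections import defaultdict

def create_dictionaries(training_corpus, vocab):
    """Count tag transitions, (tag, word) emissions, and tag occurrences."""
    emission_counts = defaultdict(int)
    transition_counts = defaultdict(int)
    tag_counts = defaultdict(int)

    prev_tag = '--s--'  # assumed start-of-sentence tag
    for i, line in enumerate(training_corpus, start=1):
        if i % 50_000 == 0:
            print(f"word count = {i}")
        # get_word_tag (assumed helper) splits 'word\tTAG\n' and maps
        # out-of-vocabulary words to the --unk...-- tokens
        word, tag = get_word_tag(line, vocab)
        transition_counts[(prev_tag, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev_tag = tag

    return emission_counts, transition_counts, tag_counts
```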
word count = 50000
word count = 100000
word count = 150000
word count = 200000
word count = 250000
word count = 300000
word count = 350000
word count = 400000
word count = 450000
word count = 500000
word count = 550000
word count = 600000
word count = 650000
word count = 700000
word count = 750000
word count = 800000
word count = 850000
word count = 900000
word count = 950000

Number of POS tags (number of 'states'): 46

### Exercise 06

Instructions: Implement the viterbi_forward algorithm and store the best path and best probability for every possible tag, for each word, in the matrices best_probs and best_paths, using the pseudocode below.

for each word in the corpus
    for each POS tag type that this word may be
        for each POS tag type that the previous word could be
            compute the probability that the previous word had a given POS tag, that the current word has a given POS tag, and that the POS tag would emit this current word
        retain the highest probability computed for the current word
        set best_probs to this highest probability
        set best_paths to the index k, representing the POS tag of the previous word which produced the highest probability


Please use math.log to compute the natural logarithm.

Hints

• Remember that when accessing emission matrix B, the column index is the unique integer ID associated with the word. It can be accessed by using the 'vocab' dictionary, where the key is the word, and the value is the unique integer ID for that word.
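
Here is a hedged sketch of this triple loop, assuming A and B have been smoothed so that every entry is nonzero, and that column 0 of best_probs and best_paths was initialized earlier. Log probabilities are summed, rather than raw probabilities multiplied, to avoid numerical underflow:

```python
import math

def viterbi_forward(A, B, test_corpus, best_probs, best_paths, vocab):
    """Fill in best_probs and best_paths, one column per word.

    A: (num_tags x num_tags) transition matrix, assumed smoothed (no zeros)
    B: (num_tags x vocab_size) emission matrix, assumed smoothed (no zeros)
    best_probs, best_paths: (num_tags x m) matrices, column 0 initialized
    vocab: dict mapping a word to its unique integer column index in B
    """
    num_tags = best_probs.shape[0]
    for i in range(1, len(test_corpus)):     # each word (column 0 is done)
        for j in range(num_tags):            # each tag the current word may be
            best_prob_i = float('-inf')
            best_path_i = None
            for k in range(num_tags):        # each tag the previous word could be
                # log prob of the best path ending in tag k at word i-1,
                # plus log transition k -> j, plus log emission of word i from tag j
                prob = (best_probs[k, i - 1]
                        + math.log(A[k, j])
                        + math.log(B[j, vocab[test_corpus[i]]]))
                if prob > best_prob_i:
                    best_prob_i = prob
                    best_path_i = k
            best_probs[j, i] = best_prob_i
            best_paths[j, i] = best_path_i
    return best_probs, best_paths
```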

Run the viterbi_forward function to fill in the best_probs and best_paths matrices.

Note that this will take a few minutes to run. There are about 34,000 words to process.

Words processed:     5000
Words processed:    10000
Words processed:    15000
Words processed:    20000
Words processed:    25000
Words processed:    30000

best_probs[0,1]: -24.7822
best_probs[0,4]: -49.5601


## Part 3.3 Viterbi backward

Now you will implement the Viterbi backward algorithm.

• The Viterbi backward algorithm gets the predictions of the POS tags for each word in the corpus using the best_paths and the best_probs matrices.

The example below shows how to walk backwards through the best_paths matrix to get the POS tags of each word in the corpus. Recall that this example corpus has three words: “Loss tracks upward”.

POS tag for ‘upward’ is RB

• Select the most likely POS tag for the last word in the corpus, ‘upward’, in the best_probs table.
• Look for the row in the column for ‘upward’ that has the largest probability.
• Notice that in row 28 of best_probs, the estimated probability is -34.99, which is larger than the other values in the column. So the most likely POS tag for ‘upward’ is RB, an adverb, at row 28 of best_probs.
• The variable z is an array that stores the unique integer ID of the predicted POS tags for each word in the corpus. In array z, at position 2, store the value 28 to indicate that the word ‘upward’ (at index 2 in the corpus), most likely has the POS tag associated with unique ID 28 (which is RB).
• The variable pred contains the POS tags in string form. So pred at index 2 stores the string RB.

POS tag for ‘tracks’ is VBZ

• The next step is to go backward one word in the corpus (‘tracks’). Since the most likely POS tag for ‘upward’ is RB, which is uniquely identified by integer ID 28, go to the best_paths matrix in column 2, row 28. The value stored in best_paths, column 2, row 28 indicates the unique ID of the POS tag of the previous word. In this case, the value stored here is 40, which is the unique ID for POS tag VBZ (verb, 3rd person singular present).
• So the previous word at index 1 of the corpus (‘tracks’), most likely has the POS tag with unique ID 40, which is VBZ.
• In array z, store the value 40 at position 1, and for array pred, store the string VBZ to indicate that the word ‘tracks’ most likely has POS tag VBZ.

POS tag for ‘Loss’ is NN

• In best_paths at column 1, the unique ID stored at row 40 is 20. 20 is the unique ID for POS tag NN.
• In array z at position 0, store 20. In array pred at position 0, store NN.

### Exercise 07

Implement the viterbi_backward algorithm, which returns a list of predicted POS tags for each word in the corpus.

• Note that the numbering of the index positions starts at 0 and not 1.
• m is the number of words in the corpus.
• So the indexing into the corpus goes from 0 to m - 1.
• Also, the columns in best_probs and best_paths are indexed from 0 to m - 1.

In Step 1:

• Loop through all the rows (POS tags) in the last column of best_probs and find the row (POS tag) with the maximum value.
• Convert the unique integer ID to a tag (a string representation) using the dictionary states.

Referring to the three-word corpus described above:

• z[2] = 28: For the word ‘upward’ at position 2 in the corpus, the POS tag ID is 28. Store 28 in z at position 2.
• states[28] is ‘RB’: The POS tag ID 28 refers to the POS tag ‘RB’.
• pred[2] = 'RB': In array pred, store the POS tag for the word ‘upward’.

In Step 2:

• Starting at the last column of best_paths, use best_probs to find the most likely POS tag for the last word in the corpus.
• Then use best_paths to find the most likely POS tag for the previous word.
• Update the POS tag for each word in z and in pred.

Referring to the three-word example from above, read best_paths at column 2 and fill in z at position 1.
z[1] = best_paths[z[2],2]
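
Putting steps 1 and 2 together, here is a minimal sketch, assuming states maps unique integer tag IDs to tag strings and is indexable as states[i]:

```python
import numpy as np

def viterbi_backward(best_probs, best_paths, states):
    """Walk best_paths from the last column to the first to recover tags.

    Returns pred, the list of predicted POS tag strings, one per word.
    """
    m = best_paths.shape[1]        # number of words in the corpus
    z = [None] * m                 # predicted tag IDs
    pred = [None] * m              # predicted tag strings

    # Step 1: the best tag for the last word is the row with the highest
    # probability in the last column of best_probs.
    z[m - 1] = int(np.argmax(best_probs[:, m - 1]))
    pred[m - 1] = states[z[m - 1]]

    # Step 2: walk backward; best_paths[z[i], i] stores the tag ID of the
    # previous word on the best path ending in tag z[i] at word i.
    for i in range(m - 1, 0, -1):
        z[i - 1] = int(best_paths[z[i], i])
        pred[i - 1] = states[z[i - 1]]

    return pred
```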

The small test following the routine prints the last few words of the corpus and their states to aid in debugging.

The prediction for pred[-7:m-1] is:
['see', 'them', 'here', 'with', 'us', '.']
['VB', 'PRP', 'RB', 'IN', 'PRP', '.']

The prediction for pred[0:8] is:
['DT', 'NN', 'POS', 'NN', 'MD', 'VB', 'VBN']
['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken']



Now you just have to compare the predicted labels to the true labels to evaluate your model on the accuracy metric!

# Part 4: Predicting on a data set

Compute the accuracy of your prediction by comparing it with the true y labels.

• pred is a list of predicted POS tags corresponding to the words of the test_corpus.

The third word is: temperature
Your prediction is: NN
Your corresponding label y is:  temperature NN


### Exercise 08

Implement a function to compute the accuracy of the Viterbi algorithm’s POS tag predictions.

• To split y into the word and its tag you can use y.split().
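
A minimal sketch of such an accuracy check, assuming each line of y has the form 'word\tTAG\n' (as in WSJ-24.pos) and that pred is aligned with y:

```python
def compute_accuracy(pred, y):
    """Fraction of test words whose predicted tag matches the true tag."""
    num_correct = 0
    total = 0
    for prediction, line in zip(pred, y):
        word_tag_tuple = line.split()
        if len(word_tag_tuple) != 2:   # skip blank or malformed lines
            continue
        word, tag = word_tag_tuple
        if prediction == tag:
            num_correct += 1
        total += 1
    return num_correct / total

# Example usage:
# print(f"Accuracy of the Viterbi algorithm is {compute_accuracy(pred, y):.4f}")
```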
##### Expected Output

Accuracy of the Viterbi algorithm is 0.9531

Congratulations! You were able to classify the parts of speech with 95% accuracy.

### Key Points and overview

In this assignment you learned about parts-of-speech tagging.

• In this assignment, you predicted POS tags by walking forward through a corpus and knowing the previous word.
• There are other implementations that use bidirectional POS tagging.
• Bidirectional POS tagging requires knowing the previous word and the next word in the corpus when predicting the current word’s POS tag.
• Bidirectional POS tagging provides more context for each prediction than the previous word alone.
• Since you have learned to implement the unidirectional approach, you have the foundation to implement other POS taggers used in industry.