Dependency Grammar and Dependency Structure
Parse trees in NLP, analogous to those in compilers, are used to analyze the syntactic structure of sentences. There are two main types of structures used:
- constituency structures
- dependency structures
The dependency structure of a sentence shows which words depend on (modify or are arguments of) which other words. These binary asymmetric relations between words are called dependencies and are depicted as arrows going from the head (or governor, superior, regent) to the dependent (or modifier, inferior, subordinate). Usually these dependencies form a tree structure. They are often typed with the name of a grammatical relation (subject, prepositional object, apposition, etc.). An example of such a dependency tree is shown below.
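As an illustration, here is a plausible set of typed dependencies for the sentence "I parsed this sentence correctly" (the same sentence used again later in these notes; the relation labels follow common conventions and are not copied from the original figure):

```python
# (relation, head, dependent) triples; ROOT is the artificial root node
dependencies = [
    ("root",   "ROOT",     "parsed"),
    ("nsubj",  "parsed",   "I"),          # nominal subject
    ("dobj",   "parsed",   "sentence"),   # direct object
    ("det",    "sentence", "this"),       # determiner
    ("advmod", "parsed",   "correctly"),  # adverbial modifier
]
```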

Dependency trees usually satisfy some constraints:
- Only one word is a dependent of ROOT.
- We don't want cycles A → B, B → A (the dependencies form a tree structure).
- A final issue is whether arrows can cross (non-projective) or not.
- Definition (projectivity): there are no crossing dependency arcs when the words are laid out in their linear order, with all arcs above the words (see the sketch after this list).
- Dependencies parallel to a CFG tree must be projective: they are formed by taking one child of each category as the head.
- But dependency theory normally does allow non-projective structures, to account for displaced constituents: you can't easily get the semantics of certain constructions right without these non-projective dependencies.
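To make the projectivity definition concrete, here is a small sketch of a crossing-arc check (the function name and input format are made up for this illustration; arcs are (head, dependent) pairs over word positions, with ROOT at position 0):

```python
def is_projective(arcs):
    """Return True if no two arcs cross when drawn above the words in linear order."""
    spans = [(min(h, d), max(h, d)) for h, d in arcs]
    for i, (l1, r1) in enumerate(spans):
        for l2, r2 in spans[i + 1:]:
            # Two arcs cross when one starts strictly inside the other but ends strictly outside it.
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True

# Arcs for "I parsed this sentence correctly" (ROOT at position 0): all nested or adjacent, so projective.
print(is_projective([(2, 1), (0, 2), (4, 3), (2, 4), (2, 5)]))  # True
```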
Parsing
Given a parsing model M and a sentence S, derive the optimal dependency graph D for S according to M.
- Dynamic programming: Eisner (1996) gives a clever algorithm with complexity $O(n^3)$ by producing parse items with heads at the ends rather than in the middle.
- Graph algorithms: you create a Minimum Spanning Tree for a sentence. McDonald et al.'s (2005) MSTParser scores dependencies independently using an ML classifier (he uses MIRA, for online learning, but it can be something else).
- Constraint satisfaction: edges that don't satisfy hard constraints are eliminated (Karlsson 1990, etc.).
- Transition-based parsing, or deterministic dependency parsing: greedy choice of attachments guided by good machine-learning classifiers, e.g., MaltParser (Nivre et al. 2008). This approach has proven highly effective.
Neural Transition-Based Dependency Parsing
A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between head words and the words that modify those heads. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:
- A stack of words that are currently being processed.
- A buffer of words yet to be processed.
- A list of dependencies predicted by the parser.
Initially, the stack only contains ROOT, the dependencies list is empty, and the buffer contains all words of the sentence in order. At each step, the parser applies a transition to the partial parse until its buffer is empty and the stack size is 1. The following transitions can be applied:
- SHIFT: removes the first word from the buffer and pushes it onto the stack.
- LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of the first item and removes the second item from the stack.
- RIGHT-ARC: marks the first (most recently added) item on the stack as a dependent of the second item and removes the first item from the stack.
On each step, your parser will decide among the three transitions using a neural network classifier. Go through the sequence of transitions needed for parsing the sentence “I parsed this sentence correctly”. The dependency tree for the sentence is shown below. At each step, give the configuration of the stack and buffer, as well as what transition was applied at this step and what new dependency was added (if any). The first three steps are provided below as an example.

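As a minimal sketch, the partial parse described above can be implemented as follows, using the shorthand transitions "S", "LA", and "RA" for SHIFT, LEFT-ARC, and RIGHT-ARC (the class and method names here are illustrative, not the assignment's starter code):

```python
class PartialParse:
    """Stack, buffer, and predicted dependencies for one sentence (illustrative sketch)."""

    def __init__(self, sentence):
        self.sentence = sentence
        self.stack = ["ROOT"]          # words currently being processed
        self.buffer = list(sentence)   # words yet to be processed
        self.dependencies = []         # (head, dependent) pairs predicted so far

    def parse_step(self, transition):
        if transition == "S":          # SHIFT: first buffer word moves onto the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":       # LEFT-ARC: second stack item becomes a dependent of the first
            dependent = self.stack.pop(-2)
            self.dependencies.append((self.stack[-1], dependent))
        elif transition == "RA":       # RIGHT-ARC: first stack item becomes a dependent of the second
            dependent = self.stack.pop()
            self.dependencies.append((self.stack[-1], dependent))


# The first three steps for "I parsed this sentence correctly" would plausibly be SHIFT, SHIFT, LEFT-ARC:
pp = PartialParse(["I", "parsed", "this", "sentence", "correctly"])
for t in ["S", "S", "LA"]:
    pp.parse_step(t)
print(pp.stack)         # ['ROOT', 'parsed']
print(pp.dependencies)  # [('parsed', 'I')]
```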
We are now going to train a neural network to predict, given the state of the stack, buffer, and
dependencies, which transition should be applied next. First, the model extracts a feature vector
representing the current state. The features can be represented as a list of integers $[w_1, w_2, \cdots, w_m]$, where $m$ is the number of features and each $0 \leq w_i < |V|$ is the index of a token in the vocabulary ($|V|$ is the vocabulary size). Our network first looks up an embedding for each word and concatenates them into a single input vector $x$.
We then compute our prediction by passing $x$ through a hidden layer $h$, computing logits $l$, and applying a softmax to obtain the predictions $\hat{y}$. We will train the model to minimize the cross-entropy loss between $\hat{y}$ and the correct transition; these computations are written out below.
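Concretely, one standard form of these computations is the following (the ReLU activation and the parameter names $E$, $W$, $U$, $b_1$, $b_2$ follow the usual CS224n formulation and are assumptions here):

$$x = [E_{w_1};\ E_{w_2};\ \ldots;\ E_{w_m}]$$

$$h = \mathrm{ReLU}(xW + b_1), \qquad l = hU + b_2, \qquad \hat{y} = \mathrm{softmax}(l)$$

$$J(\theta) = \mathrm{CE}(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i$$

where $E$ is the embedding matrix (row $E_w$ is the embedding of word $w$) and $y$ is a one-hot vector indicating the correct transition.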

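As a rough sketch, such a network can be written in PyTorch along the following lines (the class name, layer sizes, and defaults are illustrative assumptions, not the assignment's starter code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionClassifier(nn.Module):
    """Feedforward network mapping m feature indices to scores over the 3 transitions."""

    def __init__(self, vocab_size, n_features, embed_size=50, hidden_size=200, n_classes=3):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)         # E
        self.hidden = nn.Linear(n_features * embed_size, hidden_size)  # W, b1
        self.output = nn.Linear(hidden_size, n_classes)                # U, b2

    def forward(self, w):
        # w: (batch, n_features) tensor of integer feature indices
        x = self.embeddings(w).view(w.shape[0], -1)  # look up embeddings and concatenate
        h = F.relu(self.hidden(x))                   # hidden layer
        l = self.output(h)                           # logits; the softmax is applied inside the loss
        return l
```

Returning raw logits and folding the softmax into the loss (see the training sketch below) is the numerically stable idiom in PyTorch.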
Running the model
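A minimal training loop for the model above might look like the following sketch, assuming a hypothetical train_loader that yields (features, gold_transition) batches; dev-set evaluation is omitted:

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, n_epochs=10, lr=5e-4):
    """Train the transition classifier with cross-entropy loss (illustrative sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for epoch in range(n_epochs):
        total_loss = 0.0
        for features, gold_transitions in train_loader:      # integer features, gold transition ids
            optimizer.zero_grad()
            logits = model(features)                          # (batch, 3) transition scores
            loss = F.cross_entropy(logits, gold_transitions)  # softmax + cross-entropy in one step
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"epoch {epoch + 1}: average loss {total_loss / len(train_loader):.4f}")
```

At test time, the trained classifier is applied greedily: at each parser state the features are extracted, the highest-scoring legal transition is chosen, and the partial parse is advanced until the buffer is empty and only ROOT remains on the stack.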
References
- Slides and course notes of Stanford CS224n (Natural Language Processing with Deep Learning)