VBDU and MWTTR
Disclaimer: this blog entry is concerned with certain aspects of natural language processing and automated text analysis and may therefore appear excessively nerdy to the non-initiated. Read at your own risk.
VBDU and MWTTR - don’t worry, those aren’t breakfast cereals, government agencies or contagious diseases.
Every once in a while, I feel the need to brush up on my programming skills. Lately, most of what I’ve been doing has been centered around writing human-readable text (as opposed to machine-readable text, i.e. code) and therefore I felt a little PHP practice was in order.
The result of yesterday’s half-day coding session is this script. Read on for an explanation of what exactly it does.
Thinking of what I could possibly code, I remembered an interesting paper by Eniko Csomay that I came across a while ago. In it, Eniko suggests a methodology for segmenting texts into smaller units according to their internal structure*. How can you teach a machine (even in general terms) where one section of a text ends and a new one begins? In her article, Eniko suggests the following approach: if a text moves from one part to another (say the transition from analysis to conclusions in a scientific paper) it is plausible that the lexical material used changes. To simplify a little, one section is likely to use a specific bundle of recurring words, while another will use different terms. Eniko calls these sections vocabulary-based discourse units (VBDUs) and she has shown that variation between VBDUs can be used to find topical and argumentative shifts in a text.
How does it work in practice? VBDUs can be measured by taking a snapshot of N words from a text and then comparing it with another window of the same size that follows the first one.
Let me give an example:
text = Mary had a little lamb, John had a little pony
window size = 5 words
window 1 = Mary had a little lamb
window 2 = John had a little pony
The calculated difference between the two windows equals 2, because John and pony differ from Mary and lamb, while the rest of the words are identical.
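For the curious, here is a minimal Python sketch of that comparison. It is not the PHP script itself, and it rests on two assumptions of mine: words are lowercased with punctuation dropped, and the "difference" is the number of distinct words in window 2 that do not occur in window 1. The names `tokenize` and `window_difference` are made up for this illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def window_difference(window_a, window_b):
    """Count the distinct words in window_b that never occur in window_a.

    Assumption: this is one plausible reading of "difference"; the
    original script may define it differently.
    """
    return len(set(window_b) - set(window_a))

words = tokenize("Mary had a little lamb, John had a little pony")
window_1, window_2 = words[:5], words[5:10]
print(window_difference(window_1, window_2))  # 2 (john and pony)
```

With this definition the comparison is one-directional; counting the words unique to either window would give 4 for the same example, so the choice matters when comparing numbers across implementations.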
How can we calculate this variation for a text in its entirety? By moving through it, word by word.
If we move window 1 forward by a single word and do the same with window 2, the difference between the two windows may change. The example above isn’t terribly well-suited to demonstrate this, simply because the windows are very small, but if you boost window size to 50 or 100 words, you can get an idea of how this works.
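The word-by-word movement can be sketched in Python roughly as follows. Again, this is an illustration under assumptions rather than the actual script: the difference is counted as the distinct words of window 2 missing from window 1, the function name `sliding_differences` is hypothetical, and the third clause of the sample text ("Sue had a big dog") is invented so that there is enough material to slide over.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def sliding_differences(words, size):
    """Return the window difference at every position in the text.

    At position i, window 1 covers words[i:i+size] and window 2 covers
    the next `size` words. Both windows then advance by a single word.
    """
    diffs = []
    for i in range(len(words) - 2 * size + 1):
        first = set(words[i:i + size])
        second = set(words[i + size:i + 2 * size])
        # Assumed measure: distinct words in window 2 absent from window 1.
        diffs.append(len(second - first))
    return diffs

words = tokenize("Mary had a little lamb, John had a little pony, "
                 "Sue had a big dog")
print(sliding_differences(words, 5))  # [2, 2, 2, 2, 3, 3]
```

The list has one entry per position, which is what ends up on the x axis of a chart. Note that the last `2 * size - 1` words of a text never start a full pair of windows, so the series is shorter than the text.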
Another thing that I decided to implement in my little script is a measure called Moving Average Type-Token Ratio (MATTR)**. The terms types and tokens are used in computational linguistics to differentiate between unique words and total words in a text. To use the example from above, the sentence Mary had a little lamb, John had a little pony consists of 10 tokens (actually 11 if you count the comma), but only 7 types, because the words had, a and little occur twice and we only count each unique word once when looking at types.
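Counting types and tokens takes only a few lines of Python. This sketch assumes that case is ignored and that the comma is not counted as a token; `type_token_ratio` is a name chosen for the illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def type_token_ratio(words):
    """Return the number of unique words divided by the total number of words."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)

words = tokenize("Mary had a little lamb, John had a little pony")
print(len(words))               # 10 tokens
print(len(set(words)))          # 7 types
print(type_token_ratio(words))  # 0.7
```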
Comparing the ratio of unique words to total words is useful for several reasons. Generally, we can expect written texts which convey a lot of information to have a higher type-token ratio than (for example) spoken conversation, where certain material is likely to occur again and again (say, the pronouns I and you). This difference is not absolute, but there is a strong tendency for information-dense pieces of discourse (scientific papers, legal texts) to have a higher TTR than less dense material (casual conversation, probably most blogs).
However, there’s a minor methodological issue. TTR is tied to text length and tends to decrease the longer a text is - the amount of lexical material at our disposal is simply not infinite and therefore the ratio inevitably goes down.
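The solution to this problem can be integrated into our approach to VBDU analysis: instead of calculating TTR for the whole text, calculate it for each fixed-size window, so that every value is based on the same number of tokens. Compare the two windows, then move forward by a word and repeat the process.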
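Here is a rough Python sketch of the windowed approach, assuming the same simple tokenization as before. The function names are hypothetical, and averaging the per-window values into a single score follows the general idea of MATTR rather than any particular implementation.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def window_ttrs(words, size):
    """Return the type-token ratio of every window of `size` consecutive words."""
    return [len(set(words[i:i + size])) / size
            for i in range(len(words) - size + 1)]

def moving_average_ttr(words, size):
    """Return the mean of all per-window type-token ratios."""
    ratios = window_ttrs(words, size)
    if not ratios:
        raise ValueError("text is shorter than the window size")
    return sum(ratios) / len(ratios)

words = tokenize("Mary had a little lamb, John had a little pony")
print(len(set(words)) / len(words))  # 0.7 for the text as a whole
print(window_ttrs(words, 5))         # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(moving_average_ttr(words, 5))  # 1.0
```

The whole-text ratio drops to 0.7 because had, a and little repeat, while no single five-word window contains a repeated word. Since every window has the same length, the windowed values stay comparable no matter how long the text gets.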
Right, so what’s the result of all this? Lo and behold:
The VBDU Difference and Moving Window Type-Token Ratio Calculator (and no, that is probably not hyphenated correctly)
Go ahead and try it. Simply paste a text into the window, preferably over 1,000 words, and hit submit. A value of 100 for the window size seemed like a good idea to me - values of under 50 and over 250 appear to work less well.
The resulting chart is drawn using Google’s Visualization API and I think it looks quite spiffy. Here are two examples:
- A news report from the New York Times (source text, visualization)
- The first chapter of Edgar Allan Poe’s short novel The Narrative of Arthur Gordon Pym of Nantucket (source text, visualization)
How can the results be interpreted? The x axis represents the progression of the text - essentially we are moving through the text word by word from left to right. On the y axis three normalized scores are represented: the word-based variability between our two windows (VBDU_diff, light blue), the type-token ratio of the first window (TTR1, red) and the type-token ratio of the second window (TTR2, orange).
Great, so what does it all mean?
By itself, probably not too much. You’re unlikely to find a clear-cut correlation between VBDU_diff peaks (those places where the difference between the two windows is highest) and shifts in topic or section transitions. Language is just too tricky for something that simple. But I can imagine there being interesting shifts in word class percentages and the like from one part of a text to the next. Integrating a part-of-speech tagger would be interesting, but that’s something I’ll save for another day.
In the meantime, try the script and let me know if you find something interesting. Visualizations are stored on the server for now and you can retrieve them later by using the URL at the bottom of each page.
Oh and to be a bit meta, here’s the analysis of this blog entry. Hmmm….
* I need to note several things regarding my implementation of VBDU analysis:
- I’ve reproduced the procedure from memory, meaning it is likely to differ from the original implementation in some form and may incorporate infelicities or errors
- in addition to possible methodological flaws, simple programming bugs are also imaginable
- as a result, use this at your own risk and do not cite or use this script in a serious context (i.e. publishing) without contacting me first
** Michael A. Covington published an implementation of MATTR for Windows last year and explained the method to me at a conference. Essentially, I’ve just recreated MATTR in PHP, hopefully without any significant bugs.



