Tagging it
Wow, my timing has really been off recently. Only a month late, I realize Heather Hamilton had a linguistically interesting post up in March. I feel a little bad for always picking up Heather’s stuff, but the simple reason for that is that her blog is one of relatively few business blogs I really enjoy reading. The post that caught my eye isn’t exactly a “serious” one, but there’s still more to it than you might think at first glance.
Dear Bloggers and friends,
I’m so sorry I haven’t blogged since week before last (time frame). I have been incredibly (adverb) busy (adjective). And my inbox (noun) has been overflowing (adjective). I have some blog posts in mind and hope to get them up soon (lie).
What Heather did with her little text is called part-of-speech tagging, and it is a common technique in computational linguistics, although there it is usually done automatically rather than by hand. I ran Heather’s text through TreeTagger, which I also use for my corpus project, together with this very nifty Flash-based interface developed by my colleague Thomas Koller. Here’s the result:
I PP I
‘m VBP be
so RB so
sorry JJ sorry
I PP I
have VHP have
n’t RB n’t
blogged VVN <unknown>
since IN since
week NN week
before IN before
last JJ last
. SENT .
I PP I
have VHP have
been VBN be
incredibly RB incredibly
busy JJ busy
. SENT .
And CC and
my PP$ my
inbox NN <unknown>
has VHZ have
been VBN be
overflowing VVG overflow
. SENT .
I PP I
have VHP have
some DT some
blog NN <unknown>
posts NNS post
in IN in
mind NN mind
and CC and
hope VVP hope
to TO to
get VV get
them PP them
up RP up
soon RB soon
. SENT .
It might look a little cryptic at first, but once you know what the abbreviations stand for, the meaning becomes pretty clear. Each line has three columns: the word as it appears in the text, its word class, and its base form.
I (word) / personal pronoun (word class) / I (base form)
‘m (word) / the verb BE, present tense (word class) / be (base form)
so (word) / adverb (word class) / so (base form)
…
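For anyone who wants to play with output like this, here is a minimal sketch in Python of how the three columns could be read and spelled out the way I did above. It assumes the columns are separated by whitespace, as they are displayed here. The names Token, TAG_GLOSSARY, parse_line and describe are made up for this example, and the glossary covers only a handful of tags, not the full tagset.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # the word as it appears in the text
    tag: str    # the word class abbreviation
    lemma: str  # the base form


# Hypothetical, deliberately incomplete glossary: only the tags explained above.
TAG_GLOSSARY = {
    "PP": "personal pronoun",
    "VBP": "the verb BE, present tense",
    "RB": "adverb",
    "JJ": "adjective",
}


def parse_line(line: str) -> Token:
    """Split one 'word TAG lemma' line into its three columns."""
    word, tag, lemma = line.split()
    return Token(word, tag, lemma)


def describe(token: Token) -> str:
    """Spell out a token; tags missing from the glossary are shown as-is."""
    word_class = TAG_GLOSSARY.get(token.tag, token.tag)
    return f"{token.word} (word) / {word_class} (word class) / {token.lemma} (base form)"


sample = """I PP I
'm VBP be
so RB so
sorry JJ sorry"""

for line in sample.splitlines():
    print(describe(parse_line(line)))
```

Tags that are not in the glossary simply fall through unchanged, so the sketch does not break on the rest of the output.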
What’s neat about modern taggers such as TreeTagger is that they are able to deal with words they haven’t encountered before. The words blog, blogged and inbox are all unknown to TreeTagger (hence the <unknown> in the base-form column), but the program still “gets” that blog and inbox are nouns and that blogged is a verb form. This may not get us any closer to making machines understand human language, but it is pretty useful nonetheless. TreeTagger has chewed through well over 20,000 blog entries in my corpus database and it has encountered more than just a few odd words on the way, most of which it identified correctly.
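If you want to fish such words out of tagged output yourself, a rough sketch could look like the following. It is not how my corpus project does it; it simply assumes the three-column, whitespace-separated format shown above and treats <unknown> in the third column as the marker for a word the tagger did not know. The function name unknown_words is made up for this example.

```python
def unknown_words(tagged_text: str) -> list[tuple[str, str]]:
    """Return (word, tag) pairs for every line whose base form is <unknown>."""
    found = []
    for line in tagged_text.splitlines():
        parts = line.split()
        # Skip anything that is not a plain three-column line.
        if len(parts) == 3 and parts[2] == "<unknown>":
            found.append((parts[0], parts[1]))
    return found


sample = """blogged VVN <unknown>
since IN since
inbox NN <unknown>
blog NN <unknown>
posts NNS post"""

print(unknown_words(sample))
# [('blogged', 'VVN'), ('inbox', 'NN'), ('blog', 'NN')]
```

Even without a base form, the tag is still there, which is exactly the point: the tagger guesses the word class of a word it has never seen.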
Here are a few of my personal obscure favorites:
blog-fueled
jumpstarted
road-raged
Delusionville
nonsteroidals
Vegetopians
ululations
…and my personal favorite: unavialable. While that is exceedingly likely to just be a misspelling of unavailable, it could also mean “something that can’t be brought to fly”.
Hmm, or maybe not.




(On Apr 18th, 2007 at 10:33 pm)
This makes my post look a lot more interesting than it is! But your post actually made sense to me which I am finding a little frightening (in a good way).
I’m not sure if you know about “madlibs”, which inspired the post, as they are likely an American invention (and very much a product of the 70s). If you haven’t heard of them before, you should do some searches and check them out. You may be the one person to find some deeper meaning than the rest of us!
Also, please send me your resume when you start looking for a job : ) Seriously!