Screenshots of the Corporate Blogging Corpus
I feel guilty for not blogging enough lately, but I’ve just been too darn busy. Or maybe I should say I’ve felt too darn busy. If FT500 executives can find the time to blog, a leisure-spoiled PhD student with a laughable 30-hour workweek (that’s just the day job though, research comes on top of that) should really not complain.
Let’s just say that I have been distracted. And because I’m a nerd I feel the need to share the origin of my distraction with my readers.
Here are a few screenshots of what has been keeping me busy over the last few weeks:






In case you are wondering what on earth Corporati is exactly: it is a linguistic database (or corpus) that I’ve developed for the empirical part of my thesis project. It automatically indexes posts from a number of corporate blogs (about 120 at the moment) and performs statistical language analysis. Until recently, it could only count words and sentences and build a list of the most common words in the collection. Since last weekend, however, it can also automatically get grammatical information about the words in a text - whether something is a noun or an adjective, whether it is singular or plural, etc. I didn’t code that part myself but used this great tool. Automating the task (called part-of-speech tagging) is not just for lazy people. I have close to 9,000 posts in that database now… and I do hope to finish that PhD while I’m still young. Before statistical taggers were common, people (=brave/crazy linguists) did all tagging by hand. Ouch.
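For the curious, here is a rough idea of what the "count words and sentences, list the most common words" step can look like. This is a minimal Python sketch and not Corporati's actual code: the function name `basic_stats`, the regex-based tokenization and the sample text are all assumptions made up for illustration. A real corpus tool would need a sturdier tokenizer and sentence splitter.

```python
import re
from collections import Counter


def basic_stats(text, top_n=3):
    """Hypothetical helper: naive word/sentence counts for one post."""
    # Naive tokenization: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "most_common": Counter(words).most_common(top_n),
    }


# Made-up sample text, not taken from the corpus.
post = "We love blogs. Blogs love us! Do we blog enough?"
stats = basic_stats(post)
print(stats)
```

Part-of-speech tagging is a different beast, since it needs a trained model to decide whether a word is a noun or an adjective, which is why an existing tagger is the sensible choice there.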
Next time we return to our regularly scheduled program.
Note: I’m aware that it probably doesn’t look impressive at all to the non-linguist (not sure if most linguists would find it impressive either, but perhaps at least somewhat interesting). I plan to make it look prettier in the future, but since it’s mostly a research tool I doubt normal (non-nerd) people will want to use it anyway.