Screenshots of the Corporate Blogging Corpus
I feel guilty for not blogging enough lately, but I’ve just been too darn busy. Or maybe I should say I’ve felt too darn busy. If FT500 executives can find the time to blog, a leisure-spoiled PhD student with a laughable 30-hour workweek (that’s just the day job though, research comes on top of that) should really not complain.
Let’s just say that I have been distracted. And because I’m a nerd I feel the need to share the origin of my distraction with my readers.
Here are a few screenshots of what has been keeping me busy over the last few weeks:






In case you are wondering what on earth Corporati is exactly: it is a linguistic database (or corpus) that I’ve developed for the empirical part of my thesis project. It automatically indexes posts from a number of corporate blogs (about 120 at the moment) and performs statistical language analysis. Until recently, it could only count words and sentences and build a list of the most common words in the collection. Since last weekend, however, it can also automatically retrieve grammatical information about the words in a text - whether something is a noun or an adjective, whether it is singular or plural, etc. I didn’t code that part myself but used this great tool. Automating the task (called part-of-speech tagging) is not just for lazy people. I have close to 9,000 posts in that database now… and I do hope to finish that PhD while I’m still young. Before statistical taggers were common, people (=brave/crazy linguists) did all tagging by hand. Ouch.
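For readers curious what "counting words and sentences and building a list of the most common words" might look like in practice, here is a minimal sketch in Python. It is not Corporati's actual code: the function name `basic_stats`, the regular expressions, and the sample text are all assumptions made for illustration, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter


def basic_stats(text, top_n=3):
    """Count sentences and words and list the most frequent words (illustrative only)."""
    # Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Naive tokenisation: lowercase runs of letters and straight apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "sentences": len(sentences),
        "words": len(words),
        "top_words": Counter(words).most_common(top_n),
    }


if __name__ == "__main__":
    post = "I blog. I think you blog too! Do you blog?"
    print(basic_stats(post))
```

A real corpus tool would need a more careful tokenizer (abbreviations, URLs, curly apostrophes), but the counting logic stays the same.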
Next time we return to our regularly scheduled program.
Note: I’m aware that it probably doesn’t look impressive at all to the non-linguist (not sure if most linguists would find it impressive either, but perhaps at least somewhat interesting). I plan to make it look prettier in the future, but since it’s mostly a research tool I doubt normal (non-nerd) people will want to use it anyway.




(On Nov 8th, 2006 at 8:43 am)
great post, any interesting analysis you can tell us, without revealing too much of course?
(On Nov 8th, 2006 at 11:54 am)
Thanks for stopping by, John. There’s a great deal of variation in terms of style, but overall a significant number of business bloggers write in a very involved and immediate way. This is signalled by a high frequency of personal pronouns and certain constructions that are not traditionally used in written language, such as “I think” or “I guess” to qualify what you’re saying, directly addressing the readers (“What’s your take on this?”), incomplete sentences, informal expressions, etc. However, they tend to use longer and more complex sentences than private bloggers, and variation in spelling (“c u l8er”) is not too common.
I’ll be looking at stylistic differences between individual blogs soon… stay tuned.
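Two of the signals mentioned above, pronoun frequency and sentence length, are easy to approximate. The sketch below is only an illustration under stated assumptions: the pronoun list `PRONOUNS` is a small hand-picked set, the function name `involvement_profile` is made up, and the tokenisation is naive. It does not reproduce the measures actually used in the thesis.

```python
import re

# Hypothetical, deliberately short list of first- and second-person pronouns.
PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your"}


def involvement_profile(text):
    """Return pronouns per 100 words and mean sentence length in words."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Naive tokenisation: lowercase runs of letters and straight apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    pronouns = sum(1 for w in words if w in PRONOUNS)
    return {
        "pronouns_per_100_words": 100 * pronouns / len(words) if words else 0.0,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }


if __name__ == "__main__":
    print(involvement_profile("I think we agree. What is your take?"))
```

Comparing these two numbers across blogs gives a rough first impression of how "involved" the writing is, though a proper analysis would rely on part-of-speech tags rather than a word list.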
(On Nov 8th, 2006 at 8:52 pm)
interesting results.
(On Nov 8th, 2006 at 6:17 pm)
[…] The only one on the list that has not just been abandoned, but deleted. See archive.org for proof of its passing. No worries Dan, your five posts are safe for posterity in my indestructible linguistic database. Your blog on “HP’s industry leading support services which provide innovative support of HP products and also help customers manage their IT environment operations more efficiently across all vendor platforms” may be gone, but it is not forgotten. And believe me, in my statistics all those juicy adjectives make a nice dent under “suasive language”. […]
(On Nov 8th, 2006 at 4:47 pm)
I take it you didn’t need to use my script, looks pretty cool anyways.
(On Nov 8th, 2006 at 4:58 pm)
Hey Dan, sorry for not getting back to you! You are exactly right, I went for something else because I couldn’t get the results I was looking for. I ended up using TreeTagger (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/), a very nice statistical tagger from the University of Stuttgart. It works as follows: I write a text from my database to a file buffer, then call TreeTagger from PHP (it’s a Perl binary), let it tag the text and then write the result back to the DB. It is quite precise and very fast, and I’m really happy I managed to integrate it.
Kudos to you for your efforts regarding a PHP-based tagger. That would probably have been the cleanest solution, but I’m happy with the way it works for now.
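The round trip described above (database text to a file, external tagger, result back to the database) can be sketched roughly as follows. The original is done in PHP; this is a Python illustration under several assumptions: the `posts` table and its `body` and `tagged` columns are hypothetical, the tagger is assumed to take the input file path as its last argument and print its result to standard output, and `FAKE_TAGGER` is a stand-in that tags every word as `X` so the sketch runs without TreeTagger installed.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile


def tag_text(text, tagger_cmd):
    """Write text to a temporary file, run the tagger on it, and return its output."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as buf:
        buf.write(text)
        path = buf.name
    try:
        # Assumption: the tagger reads the file named by its last argument
        # and writes the tagged text to stdout.
        result = subprocess.run(
            tagger_cmd + [path], capture_output=True, text=True, check=True
        )
        return result.stdout
    finally:
        os.remove(path)


def tag_all_posts(conn, tagger_cmd):
    """Tag every untagged post and store the result; return how many were tagged."""
    rows = conn.execute("SELECT id, body FROM posts WHERE tagged IS NULL").fetchall()
    for post_id, body in rows:
        tagged = tag_text(body, tagger_cmd)
        conn.execute("UPDATE posts SET tagged = ? WHERE id = ?", (tagged, post_id))
    conn.commit()
    return len(rows)


# Stand-in for a real tagger command: prints each word followed by a dummy tag.
FAKE_TAGGER = [
    sys.executable,
    "-c",
    "import sys\n"
    "for w in open(sys.argv[1], encoding='utf-8').read().split(): print(w + '\\tX')",
]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT, tagged TEXT)")
    conn.execute("INSERT INTO posts (body) VALUES ('I blog a lot')")
    print(tag_all_posts(conn, FAKE_TAGGER), "post(s) tagged")
```

To use a real tagger, `FAKE_TAGGER` would be replaced by the actual command line as a list of strings; how that command should look depends on the local installation.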