Visualizing blog language data
I’ve been playing around with this great little tool for several days now and thought I’d share some of the results with you.
But first, here’s a brief recap of what I’ve been doing before I start throwing statistics at you.
I am in the process of building a textual database (or corpus, as linguists call it) of corporate and enterprise web logs. The purpose of this corpus is to investigate corporate blogs as a text type. In the current phase of my research, I am especially interested in the following questions:
- how do corporate blogs compare stylistically with non-corporate blogs, news texts and other types?
- is there a typical ‘corporate blogging style’ in terms of how people write?
- are there recognizable differences in style that correspond with differences in purpose or authorship (in other words, do CEOs, marketers, software developers, etc. have distinct styles)?
- how much variation is there stylistically between different blogs, different bloggers in the same hub (e.g. MSDN) and between different posts by the same blogger?
- are there patterns of change in style over time?
You might wonder what such a description is good for (well, apart from furthering the pursuit of knowledge and all that). I think that, on the practical level, it will enable us to better understand what people are trying to achieve with blogs and how they do it. Ultimately blogging is about good writing. The trouble is that ‘good’ is not easily defined, nor does it mean the same thing to everyone on every occasion. Blogging styles are highly dynamic and situation-dependent, and I think the most successful bloggers very consciously adopt different styles to address different people and issues.
Right, so what do I have so far?
One of the first measures I’ve implemented in my database is the f-score, a relatively simple formula for calculating how formal/informational or (on the other end of the scale) involved/context-dependent a text is. This is done by adding the frequencies of certain types of words together and subtracting those of others, under the assumption that (for example) nouns are more numerous in texts which are primarily informational, while a high frequency of pronouns indicates involvement. The formula looks like this:
0.5 * ((NOUNS + ADJECTIVES + PREPOSITIONS + DETERMINERS) - (PRONOUNS + VERBS + ADVERBS + INTERJECTIONS) + 100)
As you can guess, the results are potentially ambiguous – in other words, texts can have a very high or low score for a variety of reasons – and should be used with care. That being said, the measure produces some pretty interesting results.
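To make the formula concrete, here is a minimal Python sketch. It rests on two assumptions that are not stated in this post: the tag labels follow the Universal POS naming (NOUN, ADJ, ADP for prepositions, and so on), and each frequency is a percentage of all words in the text. With percentages the result always lands between 0 and 100; the scores reported in this post go higher than that, so the frequencies used here are presumably normalized differently. The two tag sequences are invented for illustration.

```python
from collections import Counter

# Word classes that raise the score (formal/informational) and those
# that lower it (involved/context-dependent). Tag names are an
# assumption (Universal POS labels), not the tagset used in this post.
FORMAL = ("NOUN", "ADJ", "ADP", "DET")
INVOLVED = ("PRON", "VERB", "ADV", "INTJ")


def f_score(tags):
    """Compute the f-score from a list of part-of-speech tags, one per word."""
    if not tags:
        raise ValueError("need at least one tagged word")
    counts = Counter(tags)
    total = len(tags)

    def pct(tag):
        # Assumption: a frequency is a percentage of all words in the text.
        return 100.0 * counts[tag] / total

    formal = sum(pct(t) for t in FORMAL)
    involved = sum(pct(t) for t in INVOLVED)
    return 0.5 * (formal - involved + 100)


# Hypothetical hand-tagged snippets, not taken from the corpus.
list_like = ["NOUN", "NOUN", "ADJ", "NOUN", "ADP", "DET", "NOUN", "NOUN"]
chatty = ["PRON", "VERB", "ADV", "PRON", "VERB", "DET", "NOUN", "INTJ"]

print(f_score(list_like))  # prints 100.0
print(f_score(chatty))     # prints 25.0
```

The noun-heavy, list-like sequence lands at the top of the scale and the pronoun- and verb-heavy one near the bottom, which is the ambiguity mentioned above: the number says which word classes dominate, not why.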
This is a chart of f-scores from Robert Scoble’s blog.
Each data point in the graph is the f-score for a single post, or the average for several posts made on a single day. As the graph shows, Scoble’s posts are fairly consistently in the 50s in August and September. They surge to over 100 in mid-October and make overall gains in November and December, though these gains aren’t really as significant as they might look at first. The more notable change is the high degree of variation in these months compared to the time span before that.
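The step of turning posts into data points (one score per post, or the average when several posts share a day) could be sketched as follows. The function name and the sample scores are hypothetical, and the sketch assumes each post already has a date and an f-score attached.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_points(posts):
    """Collapse (date, f_score) pairs into one data point per day."""
    by_day = defaultdict(list)
    for day, score in posts:
        by_day[day].append(score)
    # A day with a single post keeps that post's score;
    # a day with several posts gets their average.
    return sorted((day, mean(scores)) for day, scores in by_day.items())


# Invented scores for illustration only.
posts = [
    (date(2006, 9, 1), 52.0),
    (date(2006, 9, 2), 58.0),
    (date(2006, 9, 2), 50.0),
]

points = daily_points(posts)
for day, score in points:
    print(day.isoformat(), score)
```

Averaging per day keeps the time axis evenly readable, at the cost of hiding variation between posts made on the same day.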
You might wonder which posts exactly get a high or low f-score. Here are the entries with the highest score, by date.
Comparing new TailRank/DiggTech/TechMeme to Google Reader, 16 October 2006 (f-score 102)
Grapes on a Plane, 29 October 2006 (f-score 97)
The highs and lows of CES, 15 January 2007 (f-score 93)
Photo “training”, 21 January 2007 (f-score 106)
If you have a look at those posts, you’ll probably notice that they aren’t really in any way more formal than Scoble’s other writing. The difference is that they tend to be more informational, i.e. they have more, and more condensed, information crammed into them than most entries. Lists and enumerations immediately lead to a high score (because they usually translate into a high noun count), so Scoble’s entries written in a sort of telegraph style to convey information about a photowalk or CES score high. This doesn’t really discredit the f-score as a metric; it simply means that it’s context-sensitive. What’s important is that, with an overall mean score of 60, Scobleizer ranks on the extreme low end of the formal/informational vs involved/contextual scale. To Scoble, blogs really are conversations, not just metaphorically but in a quite literal stylistic way.
That’s the score for one source over time. Let’s compare a bunch of sources.
If you have trouble seeing anything on the chart, look for a little dropdown menu on the lower right hand side labeled dot size. Change it from ‘posts’ to ‘no selection’ and all the dots will be changed to have the same size, which should make the whole thing a lot easier to read.
The chart is a representation of scores for 137 different blogs, computed from data collected during the last five months. Each dot represents a single blog, with its average f-score on the x axis. The position of a dot on the y axis indicates the standard deviation of values within that blog, i.e. the degree of internal variation.
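The coordinates of a single dot could be computed along these lines. This is a sketch under assumptions: the post does not say whether the sample or the population standard deviation is used, so the sample version (`statistics.stdev`) is chosen here, and the function name and scores are made up.

```python
from statistics import mean, stdev


def blog_point(scores):
    """Return (x, y) for one blog: mean f-score and standard deviation."""
    if len(scores) < 2:
        # The sample standard deviation is undefined for a single post.
        raise ValueError("need at least two posts")
    # Assumption: sample standard deviation; the post does not specify.
    return mean(scores), stdev(scores)


# Invented per-post scores for one hypothetical blog.
x, y = blog_point([50.0, 60.0, 70.0])
print(f"mean={x:.1f}, deviation={y:.1f}")
```

A blog whose posts all score alike sits near the bottom of the chart regardless of how high its average is; one that mixes styles moves upward.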
The vast majority of the sources I’ve used are corporate blogs; after all, that’s what my research is about. But in addition I’ve also thrown in a few non-corporate sources, simply to be able to compare one type of blog with another. Thus the list contains 17 personal blogs randomly found via blogger.com, 1 A-list professional blogger (Scoble), 1 political blog hub (huffingtonpost.com) and 3 non-blog sources, namely editorials from the New York Times, the Washington Post and the LA Times collected in the course of this week (see below for a full list of sources).
The first things likely to catch your eye are the outliers. On the far right-hand side, there is one source simply tagged “Blog” (informative, I know) with a record f-score of 195 and a standard deviation of 92. That’s Ray Ozzie, Chief Software Architect of Microsoft. Now, if you have a look at his blog you might find that the best description for his writing is not so much formal, but rather “technical” or maybe “information-oriented”. The reasons for the high scores are the many compound nouns (things like development ecosystem, application components, clipboard data formats, etc.) coupled with the overall significant length of entries. Like the other outlier, Irving Wladawsky-Berger of IBM, Ozzie produces very long posts. Ozzie’s longest has 1,700 words, while Wladawsky-Berger is a close second with 1,500. Length tends to coincide with somewhat higher f-scores, but there are counter-examples: Heather Hamilton has one post with a whopping word count of over 2,000 and an f-score of only 105. Generally, brief posts tend to coincide with lower scores, but, as the example shows, there are exceptions.
Overall it is important to consider a few things, especially with regard to those sources with a high standard deviation and a high f-score:
- the deviation is often high simply because there aren’t many posts (for example, Ozzie only has 6 entries)
- several of the high-deviation blogs are hubs, i.e. they aggregate a number of individual blogs (e.g. MSDN and HuffPo)
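Both caveats can be illustrated with a few lines of Python. All numbers below are invented, and the sample standard deviation is an assumption, since the post does not say which variant it uses.

```python
from statistics import stdev

# All scores below are invented for illustration.

# Caveat 1: a single unusual post moves the deviation far more
# in a blog with few posts than in one with many.
few_posts = [60.0] * 5 + [150.0]
many_posts = [60.0] * 50 + [150.0]
sd_few = stdev(few_posts)
sd_many = stdev(many_posts)

# Caveat 2: a hub pools bloggers who are each consistent
# but differ from one another.
blogger_a = [50.0, 52.0, 54.0]
blogger_b = [90.0, 92.0, 94.0]
sd_a = stdev(blogger_a)
sd_b = stdev(blogger_b)
sd_hub = stdev(blogger_a + blogger_b)

print(f"short blog: {sd_few:.1f}, long blog: {sd_many:.1f}")
print(f"blogger A: {sd_a:.1f}, blogger B: {sd_b:.1f}, hub: {sd_hub:.1f}")
```

In this toy data the short blog's deviation is roughly three times that of the long blog, although both contain the same outlier, and the pooled hub deviates far more than either of its two bloggers does individually.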
But the cool part is that the remaining sources usually contain very conscious stylistic variation (Jonathan Schwartz is a prime example). In other words, they write differently to address different people and achieve different things, and this is, at least to some extent, stylistically visible. Compare that with the scores for the three newspaper editorials grouped together in the lower right area of the plot. They are surprisingly consistent if you consider that we’re looking at texts published in three different papers, written by an even larger number of journalists. That just shows that the editorial is a pretty solidified type of text in terms of style, while the (corporate) blog isn’t, at least not yet.
Anyway, I’ll wrap it up for now and save the more in-depth look for another post.
I Love Me, vol. I
Loic Le Meur Blog
Marcel Reichart Blog
Amazon Web Services Blog
Cisco High Tech Policy Blog
Digital Straight Talk
Direct2Dell, Dell’s Weblog
eBay Developers Program
EDS’ Next Big Thing Blog
From Edison’s Desk – GE Global Research Blog
Real Baking with Rose Levy Beranbaum
GM Fastlane Blog
Dan Socci’s Blog
ING Asia/Pacific’s Blog
Open for Discussion
Things That Make You Go Wireless
The Lobby from SPG
Jonathan Schwartz’s Weblog
Texas Instruments Video360 Blog
The Jason Calacanis Weblog
Boeing Blog: Randy’s Journal
Guided By History
Yahoo! Search Blog
The CEO’s Blog – John Mackey
The Bocada Blog
Michael M’s X10 Blog
Notes from MNR
Hu Yoshida’s Blog
Novell Open PR
Jeff Jaffe’s Blog
Thompson Holidays Blog
The Bovine Bugle
Stone Creek Coffee Blog
Speaking of Security
Jonathan Bruce’s WebLog
The Tinbasher Sheet Metal Blog
The NCC Weblog
Signs Never Sleep
Life at Wal-Mart
The Baby Blawg
life’s short…make it sweet…
I am the evil master genius
i want you
44 Words for 365 People
Discover Norwegian Music
my smiles arent a facade
[blog title unreadable: encoding damaged]
The Irony of Life
Over the Horizon
Forum Nokia Blogs
Nokia N90 Blog
Sparkle Like The Stars
Southwest Airlines Blog
Benra Blog: ZoomAlbum, Photos & Photo Sharing
WeatherBug Corporate Blog
CTO Blog – TalkBMC
Commentary from Cape Clear’s CEO [...]
QuickBooks Online Edition The Team Blog
The QuickBooks Team Blog
The Mindjet Blog
Warehousing and Distribution
The Official Salesforce Blog
Park City Mountain Resort
Scenic Nursery Gardening Blog
Lightning Labels Blog
Eriska, Scottish Islan
Outdoor Landscape Lighting
Thoughts of Beauty
Chevron Collectible Toy Cars
Ruby is Coming
am I lonely
Verizon – PoliBlog
The Student LoanDown
Emerson Process Experts
A Thousand Words
All My Eye
HuffPo Full Blog Feed
Open standards, open source, open minds, open opportunities
Marriott on the Move
Washington Post Editorials
LA Times Editorials