Visualizing blog language data

2007 February 9
by Cornelius

I’ve been playing around with this great little tool for several days now and thought I’d share some of the results with you.

But first, here’s a brief recap of what I’ve been doing before I start throwing statistics at you.

I am in the process of building a textual database (or corpus, as linguists call it) of corporate and enterprise web logs. The purpose of this corpus is to investigate corporate blogs as a text type. In the current phase of my research, I am especially interested in the following questions:

- how do corporate blogs compare stylistically with non-corporate blogs, news texts and other types?

- is there a typical ‘corporate blogging style’ in terms of how people write?

- are there recognizable differences in style that correspond with differences in purpose or authorship (in other words, do CEOs, marketers, software developers, etc. have distinct styles?)

- how much variation is there stylistically between different blogs, different bloggers in the same hub (e.g. MSDN) and between different posts by the same blogger?

- are there patterns of change in style over time?

You might wonder what such a description is good for (well, apart from furthering the pursuit of knowledge and all that). I think that, on the practical level, it will enable us to better understand what people are trying to achieve with blogs and how they do it. Ultimately blogging is about good writing. The trouble is that ‘good’ is not easily defined, nor does it mean the same thing to everyone on every occasion. Blogging styles are highly dynamic and situation-dependent and I think the most successful bloggers very consciously adopt different styles to address different people and issues.

Right, so what do I have so far?

One of the first measures I’ve implemented in my database is a relatively simple formula for calculating how formal/informational or (on the other end of the scale) involved/context-dependent a text is. This is done by adding the frequencies of certain types of words together and subtracting others, under the assumption that (for example) nouns are more numerous in texts which are primarily informational, while a high frequency of pronouns indicates involvement. The formula looks like this:


(see Heylighen and Dewaele 2002)
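
Written out, the measure as Heylighen and Dewaele define it looks like this. In their definition each "freq." term is the percentage of words in the text that belong to that category; the notation below is only one way of typesetting it:

```latex
F = \frac{(\text{noun freq.} + \text{adjective freq.} + \text{preposition freq.} + \text{article freq.})
        - (\text{pronoun freq.} + \text{verb freq.} + \text{adverb freq.} + \text{interjection freq.}) + 100}{2}
```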

As you can guess, the results are potentially ambiguous – in other words, texts can have a very high or low score for a variety of reasons – and should be used with care. That being said, the measure produces some pretty interesting results.

This is a chart of f-scores from Robert Scoble’s blog

Each data point in the graph is the f-score for a single post, or the average for several posts made on a single day. As the graph shows, Scoble’s posts are fairly consistently in the 50s in August and September. They surge to over 100 in mid-October and make overall gains in November and December, though these gains aren’t really as significant as they might look at first. The more notable change is the high degree of variation in these months compared to the time span before that.

You might wonder which posts exactly get a high or low f-score. Here are the entries with the highest score, by date.

Comparing new TailRank/DiggTech/TechMeme to Google Reader, 16 October 2006 (f-score 102)

Grapes on a Plane, 29 October 2006 (f-score 97)

The highs and lows of CES, 15 January 2007 (f-score 93)

Photo “training”, 21 January 2007 (f-score 106)

If you have a look at those posts, you’ll probably notice that they aren’t really in any way more formal than Scoble’s other writing. The difference is that they tend to be more informational, i.e. have more, and more condensed, information crammed into them than most entries. Lists and enumerations will immediately lead to a high score (because they usually translate into a high noun count), and for Scoble those entries which are written in a sort of telegraph style to convey information about a photowalk or CES thus have a high score. This doesn’t really discredit the f-score as a metric – it simply means that it’s context-sensitive. What’s important is that, with an overall mean score of 60, Scobleizer ranks on the extreme low end of the formal/informational vs involved/contextual scale. To Scoble, blogs really are conversations, not just metaphorically but in a quite literal stylistic way.

That’s the score for one source over time. Let’s compare a bunch of sources.

If you have trouble seeing anything on the chart, look for a little dropdown menu on the lower right hand side labeled dot size. Change it from ‘posts’ to ‘no selection’ and all the dots will be changed to have the same size, which should make the whole thing a lot easier to read.

The chart is a representation of scores for 137 different blogs, computed from data collected during the last five months. Each dot represents a single blog, with its average f-score on the x axis. The position of a dot on the y axis indicates the standard deviation of the scores within that blog, i.e. the degree of internal variation.
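
To make the two axes concrete, here is a minimal Python sketch of how one dot's coordinates could be derived from a blog's per-post scores. The function name and the sample numbers are made up for illustration, and the use of the population standard deviation (rather than the sample version) is an assumption, not necessarily what the chart uses.

```python
from statistics import mean, pstdev


def summarize(scores_by_blog):
    """Return {blog: (mean f-score, standard deviation)} for each blog.

    The mean is the x coordinate of a blog's dot, the standard
    deviation its y coordinate. Blogs without any posts are skipped.
    """
    return {
        blog: (mean(scores), pstdev(scores))
        for blog, scores in scores_by_blog.items()
        if scores
    }


# Hypothetical per-post f-scores, for illustration only.
example = {
    "steady blog": [58.0, 60.0, 62.0],
    "varied blog": [40.0, 60.0, 80.0],
}

for blog, (x, y) in summarize(example).items():
    print(f"{blog}: mean={x:.1f}, sd={y:.1f}")
```

Both made-up blogs have the same mean, so they would sit at the same x position, but the second would appear much higher on the y axis because its posts vary more.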

The vast majority of the sources I’ve used are corporate blogs – after all, that’s what my research is about. But in addition I’ve also thrown in a few non-corporate sources, simply to be able to compare one type of blog with another. Thus the list contains 17 personal blogs randomly found via [...], 1 a-list professional blogger (Scoble), 1 political blog hub and 3 non-blog sources, namely editorials from the New York Times, the Washington Post and the LA Times collected in the course of this week (see below for a full list of sources).

The first thing likely to catch your eye are the outliers. On the far right hand side, there is one source simply tagged “Blog” (informative, I know) with a record f-score of 195 and a standard deviation of 92. That’s Ray Ozzie, Chief Software Architect of Microsoft. Now, if you have a look at his blog you might find that the best description for his writing is not so much “formal” as “technical” or maybe “information-oriented”. The reasons for the high scores are the many compound nouns (things like development ecosystem, application components, clipboard data formats, etc.) coupled with the considerable length of his entries. Like the other outlier, Irving Wladawsky-Berger of IBM, Ozzie produces very long posts. Ozzie’s longest has 1,700 words, while Wladawsky-Berger is a close second with 1,500. Length tends to coincide with somewhat higher f-scores, and brief posts with lower ones, but there are counter-examples: Heather Hamilton has one post with a whopping word count of over 2,000 and an f-score of only 105.

Overall it is important to consider a few things, especially with regard to those sources with a high standard deviation and a high f-score:

- the deviation is often high simply because there aren’t many posts (for example, Ozzie only has 6 entries)

- several of the high-deviation blogs are hubs, i.e. they aggregate a number of individual blogs (e.g. MSDN and HuffPo)

But the cool part is that the remaining sources usually contain very conscious stylistic variation (Jonathan Schwartz is a prime example). In other words, they write differently to address different people and achieve different things, and this is – at least to some extent – stylistically visible. Compare that with the scores for the three newspaper editorials grouped together in the lower right area of the plot. They are surprisingly consistent if you consider that we’re looking at texts published in three different papers, written by an even larger number of journalists. Which just shows that the editorial is a pretty solidified type of text in terms of style, while the (corporate) blog isn’t – at least not yet.

Anyway, I’ll wrap it up for now and save the more in-depth look for another post.


iUpload InSights

Time Leadership

I Love Me, vol. I

Simply Albert

PR Thoughts

Occam’s Razor

Loic Le Meur Blog

CTO Blog


Marcel Reichart Blog


Amazon Web Services Blog

Cisco High Tech Policy Blog

Digital Straight Talk

Direct2Dell, Dell’s Weblog

eBay Developers Program

EDS’ Next Big Thing Blog

From Edison’s Desk – GE Global Research Blog

Real Baking with Rose Levy Beranbaum

GM Fastlane Blog

Google Blog

Dan Socci’s Blog

Kara R

ING Asia/Pacific’s Blog

Open for Discussion

One Louder



Things That Make You Go Wireless

The Lobby from SPG

Jonathan Schwartz’s Weblog

Texas Instruments Video360 Blog

The Jason Calacanis Weblog

Boeing Blog: Randy’s Journal

Guided By History


Yahoo! Search Blog

The CEO’s Blog – John Mackey


Kate’s Blog

The Bocada Blog

Michael M’s X10 Blog

Notes from MNR

Entrepreneurial Marketing

TiVo Blog

Guiness Blog

Hu Yoshida’s Blog

Forta Blog

Novell Open PR

Jeff Jaffe’s Blog


Mena’s Corner

Alan Meckler


Thompson Holidays Blog

Baby Babble

The Bovine Bugle

Stone Creek Coffee Blog


Speaking of Security

Hybrid Talk

Jonathan Bruce’s WebLog

The Tinbasher Sheet Metal Blog

The NCC Weblog

Signs Never Sleep


English Cut

Life at Wal-Mart


The DustBlog

The Baby Blawg

life’s short…make it sweet…


I am the evil master genius

i want you

44 Words for 365 People

neurotic kitten

Discover Norwegian Music

my smiles arent a facade

�?ů�?ð£з �?�? Ŧ�?ǿůĝ�?ŧ�?

Flying Tragic

The Irony of Life


Over the Horizon



developerWorks blogs

Irving Wladawsky-Berger

Forum Nokia Blogs

Nokia N90 Blog

Sparkle Like The Stars

FYI Blog

Southwest Airlines Blog

Benra Blog: ZoomAlbum, Photos & Photo Sharing

WeatherBug Corporate Blog

CTO Blog – TalkBMC

Commentary from Cape Clear’s CEO [...]

QuickBooks Online Edition The Team Blog

The QuickBooks Team Blog

The Mindjet Blog

Warehousing and Distribution

The Official Salesforce Blog

Park City Mountain Resort


TaylorMade Blogs

Scenic Nursery Gardening Blog

Lightning Labels Blog

Wiggly Wigglers


Eriska, Scottish Islan

Outdoor Landscape Lighting

Thoughts of Beauty

Stormhoek Winery

Chevron Collectible Toy Cars

MSDN Blogs

Ruby is Coming

am I lonely

Pineywoods Opinings

Tangent, Oregon

Verizon – PoliBlog

Ted’s Take

The Student LoanDown

Emerson Process Experts

A Thousand Words

Glenfiddich Blog

IT@Intel Blog

All My Eye

HuffPo Full Blog Feed

News@Cisco Notes

Mobile Visions

Open standards, open source, open minds, open opportunities

Marriott on the Move

NYT Editorials

Washington Post Editorials

LA Times Editorials

2007 February 9

[...] Blogging is an art, not a science Wait, it’s the other way around. Or it’s both. Oh forget it. Filed under: Blogging [...]

2007 February 9

Wonder if you could do an analysis for my blog. It would be interesting to know how I fare.

2007 February 9

Sure Kuma, I’ll index your blog with the next pass.

2007 February 9

Thank you very much, Cornelius. This is a very useful tool. I will be writing a review on this and will send you the URL. As you mention, the more frequent you post, the lower the f-score.

2007 February 9

It tends to work like that, although it depends on many (!) factors, and a high or low score can mean a lot of different things. Also, one shouldn’t get the idea that a high score is “better” in any way. Lower scores are characteristic for more conversational or “involved” blogs. Simplifying a lot: low scores = more like a conversation, high scores = more like a ‘traditional’ written text.

2007 February 11

Just posted my comments. Your research is very fascinating and I wish you all the best.

2007 February 20

[...] Krishna Kumar has written a great post summarizing what I’ve been doing lately with f-scores in blogs. My favorite quote: [...]

2007 February 22

[...] I must have buried my head too deep in the sand lately (the sand being my f-score stuff), because I somehow managed to overlook the whole event up to now. If I hadn’t I certainly would have submitted something there. [...]

2007 February 23

interesting research there.
blogging to me is like a daily NEED.
its not about the things i do or what i want. but its more abt what i feel, what i want them to know about my life. and my blog to me is like whole story of life’s journey.

i see my blog there in one of ur sources. i was rather shocked actually. well do continue ur research and all the best.


2007 May 9

[...] a) f-score (for details on what that is, read this post) [...]

2007 August 30


There might be a “small” problem with the computation of the F-score as given above and the F-score as defined by Heylighen and Dewaele (2002). You seem to use frequencies: ((6 nouns – 4 verbs) + 100) / 2 = 51 (F-score). However, [HD2002] work with percentages: ((60% nouns – 40% verbs) + 100) / 2 = 60 (F-score).

Whether this invalidates your results is difficult to say, but it may be clear that working with percentages scales for the size of a document, whereas working with frequencies does not. The [HD2002] formula has been designed to keep the F-score between 0% and 100%.

Interestingly, with percentages one can also apply it to very short documents (such as chats).
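
The difference is easy to see in a small Python sketch. The function names are made up for illustration, and only nouns and verbs are counted, as in the simplified example above, so this is not the full formula.

```python
def f_score_from_counts(nouns, verbs):
    """Simplified F-score computed on raw word counts."""
    return ((nouns - verbs) + 100) / 2


def f_score_from_percentages(nouns, verbs, total_words):
    """Simplified F-score computed on percentages of all words in the text."""
    noun_pct = 100 * nouns / total_words
    verb_pct = 100 * verbs / total_words
    return ((noun_pct - verb_pct) + 100) / 2


# A 10-word text with 6 nouns and 4 verbs.
print(f_score_from_counts(6, 4))           # 51.0
print(f_score_from_percentages(6, 4, 10))  # 60.0

# The same text repeated twice: the proportions are unchanged,
# but the raw counts have doubled.
print(f_score_from_counts(12, 8))           # 52.0
print(f_score_from_percentages(12, 8, 20))  # 60.0
```

With raw counts the score drifts as the text gets longer even though its style has not changed, and nothing stops it from going above 100. With percentages the score depends only on the proportions.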


2007 August 31

Thanks again for making me aware of this and for the great exchange of ideas we had, Anjo. I felt that there was something funny about the scores and in retrospect it’s certainly something I should have fixed earlier.

From what I can see so far it actually doesn’t seem to affect the tendencies I’ve discovered before though – in most cases they are actually more pronounced now. I’ll recalculate my scores and post something on that soon…
