Visualizing blog language data

2007 February 9
by Cornelius

I’ve been playing around with this great little tool for several days now and thought I’d share some of the results with you.

But first, here’s a brief recap of what I’ve been doing before I start throwing statistics at you.

I am in the process of building a textual database (or corpus, as linguists call it) of corporate and enterprise web logs. The purpose of this corpus is to investigate corporate blogs as a text type. In the current phase of my research, I am especially interested in the following questions:

- how do corporate blogs compare stylistically with non-corporate blogs, news texts and other types?

- is there a typical ‘corporate blogging style’ in terms of how people write?

- are there recognizable differences in style that correspond with differences in purpose or authorship (in other words, do CEOs, marketers, software developers, etc. have distinct styles?)

- how much variation is there stylistically between different blogs, different bloggers in the same hub (e.g. MSDN) and between different posts by the same blogger?

- are there patterns of change in style over time?

You might wonder what such a description is good for (well, apart from furthering the pursuit of knowledge and all that). I think that, on the practical level, it will enable us to better understand what people are trying to achieve with blogs and how they do it. Ultimately blogging is about good writing. The trouble is that ‘good’ is neither easily defined nor the same for everyone on every occasion. Blogging styles are highly dynamic and situation-dependent, and I think the most successful bloggers very consciously adopt different styles to address different people and issues.

Right, so what do I have so far?

One of the first measures I’ve implemented in my database is a relatively simple formula for calculating how formal/informational or (on the other end of the scale) involved/context-dependent a text is. This is done by adding the frequencies of certain types of words together and subtracting others, under the assumption that (for example) nouns are more numerous in texts which are primarily informational, while a high frequency of pronouns indicates involvement. The formula looks like this:

0.5 * ((NOUNS + ADJECTIVES + PREPOSITIONS + DETERMINERS) - (PRONOUNS + VERBS + ADVERBS + INTERJECTIONS) + 100)

(see Heylighen and Dewaele 2002)
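
As a rough illustration, the formula can be written as a small Python function. This is only a sketch: the function name, the dictionary keys and the `total_words` switch are placeholders made up for this example, not necessarily the code behind the database. With `total_words` left out it uses raw counts, exactly as in the formula above; with it, each count is first converted to a percentage of all words, which is how Heylighen and Dewaele state the measure.

```python
# Hypothetical sketch of the f-score formula; all names are illustrative only.
FORMAL = ("nouns", "adjectives", "prepositions", "determiners")
INVOLVED = ("pronouns", "verbs", "adverbs", "interjections")


def f_score(counts, total_words=None):
    """Return the f-score for a dict of part-of-speech counts.

    With total_words=None the raw counts are used, as in the formula above.
    With total_words set, each count becomes a percentage of all words first.
    """
    scale = 1.0 if total_words is None else 100.0 / total_words
    formal = sum(counts.get(tag, 0) for tag in FORMAL) * scale
    involved = sum(counts.get(tag, 0) for tag in INVOLVED) * scale
    return 0.5 * (formal - involved + 100)


sample = {"nouns": 6, "verbs": 4}  # a made-up ten-word text
print(f_score(sample))                  # raw counts
print(f_score(sample, total_words=10))  # percentages
```

Raw counts grow with the length of a text, so the first variant can climb past 100 for long posts, while the percentage variant always stays between 0 and 100.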

As you can guess, the results are potentially ambiguous – in other words, texts can have a very high or low score for a variety of reasons – and should be used with care. That being said, the measure produces some pretty interesting results.

This is a chart of f-scores from Robert Scoble’s blog:




Each data point in the graph is the f-score for a single post, or the average for several posts made on a single day. As the graph shows, Scoble’s posts are fairly consistently in the 50s in August and September. They surge to over 100 in mid-October and make overall gains in November and December, though these gains aren’t really as significant as they might look at first. The more notable change is the high degree of variation in these months compared to the time span before that.
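
The per-day averaging behind each data point could be sketched roughly like this in Python. The dates and scores below are invented, and `daily_points` is a hypothetical helper name, so take it as an illustration of the idea rather than the actual charting code.

```python
from collections import defaultdict
from statistics import mean

# (date, f-score) pairs; dates and values are invented for illustration.
posts = [
    ("2006-08-01", 52.0),
    ("2006-08-01", 60.0),
    ("2006-08-02", 61.0),
]


def daily_points(posts):
    """Average the f-scores of all posts that share a date."""
    by_day = defaultdict(list)
    for day, score in posts:
        by_day[day].append(score)
    # One data point per day, in chronological order.
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


print(daily_points(posts))
```

A day with a single post simply keeps that post's score, which is why individual high-scoring entries can show up as sharp spikes.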

You might wonder which posts exactly get a high or low f-score. Here are the entries with the highest score, by date.

Comparing new TailRank/DiggTech/TechMeme to Google Reader, 16 October 2006 (f-score 102)

Grapes on a Plane, 29 October 2006 (f-score 97)

The highs and lows of CES, 15 January 2007 (f-score 93)

Photo “training”, 21 January 2007 (f-score 106)

If you have a look at those posts, you’ll probably notice that they aren’t really in any way more formal than Scoble’s other writing. The difference is that they tend to be more informational, i.e. they have more, and more condensed, information crammed into them than most entries. Lists and enumerations immediately lead to a high score (because they usually translate into a high noun count), so those of Scoble’s entries that are written in a sort of telegraph style to convey information about a photowalk or CES have a high score. This doesn’t really discredit the f-score as a metric – it simply means that it’s context-sensitive. What’s important is that, with an overall mean score of 60, Scobleizer ranks on the extreme low end of the formal/informational vs involved/contextual scale. To Scoble, blogs really are conversations, not just metaphorically but in a quite literal stylistic way.

That’s the score for one source over time. Let’s compare a bunch of sources.




If you have trouble seeing anything on the chart, look for a little dropdown menu on the lower right-hand side labeled ‘dot size’. Change it from ‘posts’ to ‘no selection’ and all the dots will have the same size, which should make the whole thing a lot easier to read.

The chart is a representation of scores for 137 different blogs, computed from data collected during the last five months. Each dot represents a single blog, with its average f-score on the x axis. The position of a dot on the y axis indicates the standard deviation of the values within that blog, i.e. the degree of internal variation.
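
To make the two axes concrete, here is a minimal Python sketch of how one dot could be computed from a blog's per-post scores. The blog names and numbers are invented, `plot_point` is a hypothetical helper, and the choice of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

# Per-post f-scores keyed by blog; names and numbers are invented.
blogs = {
    "blog_a": [50.0, 60.0, 70.0],
    "blog_b": [90.0, 110.0],
}


def plot_point(scores):
    """Return (mean f-score, standard deviation) for one blog."""
    # stdev() is the sample standard deviation and needs at least two posts;
    # using the sample rather than the population form is an assumption here.
    return mean(scores), stdev(scores)


for name, scores in blogs.items():
    x, y = plot_point(scores)
    print(f"{name}: x={x:.1f}, y={y:.1f}")
```

With only a few posts per blog the estimate is unstable, so a single unusual entry can move a dot a long way up the y axis.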

The vast majority of the sources I’ve used are corporate blogs – after all, that’s what my research is about. But I’ve also thrown in a few non-corporate sources, simply to be able to compare one type of blog with another. Thus the list contains 17 personal blogs randomly found via blogger.com, 1 A-list professional blogger (Scoble), 1 political blog hub (huffingtonpost.com) and 3 non-blog sources, namely editorials from the New York Times, the Washington Post and the LA Times collected in the course of this week (see below for a full list of sources).

The first things likely to catch your eye are the outliers. On the far right-hand side, there is one source simply tagged “Blog” (informative, I know) with a record f-score of 195 and a standard deviation of 92. That’s Ray Ozzie, Chief Software Architect of Microsoft. Now, if you have a look at his blog you might find that the best description for his writing is not so much “formal” as “technical” or maybe “information-oriented”. The reasons for the high scores are the many compound nouns (things like development ecosystem, application components, clipboard data formats, etc.) coupled with the overall significant length of entries. Like the other outlier, Irving Wladawsky-Berger of IBM, Ozzie produces very long posts. Ozzie’s longest has 1,700 words, while Wladawsky-Berger is a close second with 1,500. Length tends to coincide with somewhat higher f-scores and brevity with lower ones, but there are counter-examples: Heather Hamilton has one post with a whopping word count of over 2,000 and an f-score of only 105.

Overall it is important to consider a few things, especially with regard to those sources with a high standard deviation and a high f-score:

- the deviation is often high simply because there aren’t many posts (for example, Ozzie only has 6 entries)

- several of the high-deviation blogs are hubs, i.e. they aggregate a number of individual blogs (e.g. MSDN and HuffPo)

But the cool part is that the remaining sources usually contain very conscious stylistic variation (Jonathan Schwartz is a prime example). In other words, they write differently to address different people and achieve different things, and this is – at least to some extent – stylistically visible. Compare that with the scores for the three newspaper editorials grouped together in the lower right area of the plot. They are surprisingly consistent if you consider that we’re looking at texts published in three different papers, written by an even larger number of journalists. Which just shows that the editorial is a pretty solidified type of text in terms of style, while the (corporate) blog isn’t – at least not yet.

Anyway, I’ll wrap it up for now and save the more in-depth look for another post.

Sources

iUpload InSights
http://hopper.iupload.com/default.asp

Time Leadership
http://www.jimestill.com/

I Love Me, vol. I
http://www.michaelocc.com/

Simply Albert
http://simplyalbert.blogspot.com/

ChristianLindholm.com
http://www.christianlindholm.com/christianlindholm/

PR Thoughts
http://www.prthoughts.com/

Occam’s Razor
http://mgoldberg.typepad.com/occams_razor/

Loic Le Meur Blog
http://www.loiclemeur.com/

CTO Blog
http://www.capgemini.com/ctoblog/

Lakattack
http://spreadlog.net/

Marcel Reichart Blog
http://marcellomedia.blogs.com/mrb/

stefan
http://stefan.21publish.com/

Amazon Web Services Blog
http://aws.typepad.com/

Cisco High Tech Policy Blog
http://blogs.cisco.com/gov/

Digital Straight Talk
http://www.digitalstraighttalk.com/

Direct2Dell, Dell’s Weblog
http://www.direct2dell.com/default.aspx

eBay Developers Program
http://ebaydeveloper.typepad.com/

EDS’ Next Big Thing Blog
http://www.eds.com/sites/cs/blogs/eds_next_big_thing_blog/default.aspx

From Edison’s Desk – GE Global Research Blog
http://www.grcblog.com/

Real Baking with Rose Levy Beranbaum
http://www.realbakingwithrose.com/

GM Fastlane Blog
http://fastlane.gmblogs.com/

Google Blog
http://googleblog.blogspot.com/

Dan Socci’s Blog
http://h20325.www2.hp.com/blogs/socci

Kara R
http://www.honeywellblogs.com/kara_r/

ING Asia/Pacific’s Blog
http://mycupofcha.ingblogs.com/

TinyScreenfuls.com
http://www.tinyscreenfuls.com/

Open for Discussion
http://csr.blogs.mcdonalds.com/default.asp

One Louder
http://blogs.msdn.com/heatherleigh/

NIKEBASKETBALL
http://blog.nikebasketball.com/

OraBlogs
http://www.orablogs.com/orablogs/

Things That Make You Go Wireless
http://businessblog.sprint.com/1/1/

The Lobby from SPG
http://www.thelobby.com/

Jonathan Schwartz’s Weblog
http://blogs.sun.com/jonathan

Texas Instruments Video360 Blog
http://blogs.ti.com/

The Jason Calacanis Weblog
http://www.calacanis.com/

Boeing Blog: Randy’s Journal
http://www.boeing.com/randy/

Guided By History
http://blog.wellsfargo.com/guidedbyhistory/

PlayOn
http://blogs.parc.com/playon/

Yahoo! Search Blog
http://www.ysearchblog.com/

The CEO’s Blog – John Mackey
http://www.wholefoodsmarket.com/blogs/jm/

Blog
http://www.nixonmcinnes.co.uk/about-us/blog/

Kate’s Blog
http://katesblog.u3.com/

The Bocada Blog
http://bocada.typepad.com/bocadablog/

Michael M’s X10 Blog
http://www.x10community.com/michaelm/

Notes from MNR
http://blogs.adobe.com/notesfrommnr/

Entrepreneurial Marketing
http://blogs.accenture.nl/EntrepreneurialMarketing/

TiVo Blog
http://blog.tivo.com/tivo_blog/

Guinness Blog
http://www.guinnessblog.co.uk/blogs/home.aspx?App=guinnessblog&allowAccess=4r7a6h

Hu Yoshida’s Blog
http://blogs.hds.com/hu/

Forta Blog
http://www.forta.com/blog/

Novell Open PR
http://www.novell.com/prblogs/

Jeff Jaffe’s Blog
http://www.novell.com/ctoblog/

Blog
http://rayozzie.spaces.live.com/blog/

Mena’s Corner
http://www.sixapart.com/about/corner/

Alan Meckler
http://weblogs.jupitermedia.com/meckler/

Infrablog
http://blogs.verisign.com/infrablog/

Thomson Holidays Blog
http://thomsonholidays.blogs.com/my_weblog/

Baby Babble
http://stonyfield.typepad.com/babybabble/

The Bovine Bugle
http://stonyfield.typepad.com/bovine/

Stone Creek Coffee Blog
http://sccv3.stonecreekcoffee.com/blog.cfm

bugBlog
http://rescuebugblog.typepad.com/rescue_bugblog/

Speaking of Security
http://www.rsasecurity.com/blog/

Hybrid Talk
http://hybridtalk.nyse.com/

Jonathan Bruce’s WebLog
http://jonathanbruceconnects.com/jonathan_bruce/

The Tinbasher Sheet Metal Blog
http://www.butlersheetmetal.com/tinbasherblog/

The NCC Weblog
http://www.northfieldconstruction.net/

Signs Never Sleep
http://signsneversleep.typepad.com/

ACCAbuzz
http://www.accabuzz.com/

English Cut
http://www.englishcut.com/

Life at Wal-Mart
http://walmartfacts.com/lifeatwalmart/

Scobleizer
http://scobleizer.wordpress.com/

The DustBlog
http://thedustblog.blogspot.com/

The Baby Blawg
http://babyblawg.blogspot.com/

life’s short…make it sweet…
http://dunlin.blogspot.com/

xbsg
http://mi50.blogspot.com/

I am the evil master genius
http://arnique.blogspot.com/

i want you
http://nuratikahnabilah.blogspot.com/

44 Words for 365 People
http://44for365.blogspot.com/

neurotic kitten
http://nkitten.blogspot.com/index.html

Discover Norwegian Music
http://discovernorwegianmusic.blogspot.com/

my smiles arent a facade
http://badass-freak.blogspot.com/

[title unreadable]
http://chibinyu.blog.com/

Flying Tragic
http://tragicflyer.blog.com/

The Irony of Life
http://mujerlatina319.blog.com/

cudgeland
http://cudge.blogspot.com/

Over the Horizon
http://blogs.zdnet.com/OverTheHorizon/

DaveBlog
http://blogs.netapp.com/dave/

Earthling
http://blogs.earthlink.net/

developerWorks blogs
http://www-03.ibm.com/developerworks/blogs/

Irving Wladawsky-Berger
http://irvingwb.typepad.com/

Forum Nokia Blogs
http://blogs.forum.nokia.com/author_group.html?id=2

Nokia N90 Blog
http://n90.bloggercomm.com/

Sparkle Like The Stars
http://www.sparklelikethestars.com/

FYI Blog
http://fyi.gmblogs.com/

Southwest Airlines Blog
http://www.blogsouthwest.com/

Benra Blog: ZoomAlbum, Photos & Photo Sharing
http://benra.typepad.com/

WeatherBug Corporate Blog
http://blog.weatherbug.com/

CTO Blog – TalkBMC
http://talk.bmc.com/blogs/blog-bishop/cto/

Commentary from Cape Clear’s CEO [...]
http://www.capeclear.com/annrai/

QuickBooks Online Edition The Team Blog
http://quickbooks_online_blog.typepad.com/

The QuickBooks Team Blog
http://www.quickbooks.blogs.com/

The Mindjet Blog
http://blog.mindjet.com/

Warehousing and Distribution
http://thirdpartylogistics.blogspot.com/

The Official Salesforce Blog
http://blogs.salesforce.com/

Park City Mountain Resort
http://parkcity.typepad.com/park_city_mountain_resort/

SunbeltBLOG
http://sunbeltblog.blogspot.com/

TaylorMade Blogs
http://www.taylormadeblogs.com/

Scenic Nursery Gardening Blog
http://www.scenicnursery.com/

Lightning Labels Blog
http://lightninglabels.typepad.com/blog/

Wiggly Wigglers
http://wigglywigglers.blogspot.com/

EIE FLUD
http://www.eieflud.co.uk/blog/

Eriska, Scottish Islan
http://www.isleoferiska.com/

Outdoor Landscape Lighting
http://www.residential-landscape-lighting-design.com/blogger.html

Thoughts of Beauty
http://www.overallbeauty.com/beauty-blog/

Stormhoek Winery
http://www.stormhoek.com/

Chevron Collectible Toy Cars
http://chevroncarsblog.com/

MSDN Blogs
http://blogs.msdn.com/

Ruby is Coming
http://rubyiscoming.blogspot.com/

am I lonely
http://rongsheng.blogspot.com/

Pineywoods Opinings
http://longleaf.blogspot.com/

Tangent, Oregon
http://tangentcity.blogspot.com/

Verizon – PoliBlog
http://poliblog.verizon.com/PoliBlog/Blogs/poliblog.aspx

Ted’s Take
http://ted.aol.com/

The Student LoanDown
http://blog.wellsfargo.com/StudentLoanDown/

Emerson Process Experts
http://www.emersonprocessxperts.com/

A Thousand Words
http://1000words.kodak.com/

Glenfiddich Blog
http://blog.glenfiddich.com/

IT@Intel Blog
http://blogs.intel.com/it/

All My Eye
http://allmyeye.blogspot.com/

HuffPo Full Blog Feed
http://www.huffingtonpost.com/theblog/

News@Cisco Notes
http://blogs.cisco.com/news/

Mobile Visions
http://blogs.cisco.com/wireless/

Open standards, open source, open minds, open opportunities
http://www-03.ibm.com/developerworks/blogs/page/BobSutor

Marriott on the Move
http://www.blogs.marriott.com/

NYT Editorials
http://topics.nytimes.com/top/opinion/editorialsandoped/editorials/

Washington Post Editorials
http://www.washingtonpost.com/wp-dyn/content/opinions/columnsandblogs/?nav%3Dleft&sub=new

LA Times Editorials
http://www.latimes.com/news/opinion/editorials/

13 Comments
2007 February 9

Wonder if you could do an analysis for my blog. It would be interesting to know how I fare.

2007 February 9

Sure Kuma, I’ll index your blog with the next pass.

2007 February 9

Thank you very much, Cornelius. This is a very useful tool. I will be writing a review on this and will send you the URL. As you mention, the more frequently you post, the lower the f-score.

2007 February 9

It tends to work like that, although it depends on many (!) factors, and a high or low score can mean a lot of different things. Also, one shouldn’t get the idea that a high score is “better” in any way. Lower scores are characteristic for more conversational or “involved” blogs. Simplifying a lot: low scores = more like a conversation, high scores = more like a ‘traditional’ written text.

2007 February 11

Just posted my comments at http://krishami.blogspot.com/2007/02/corporate-blogging-research.html. Your research is very fascinating and I wish you all the best.

2007 February 23

interesting research there.
blogging to me is like a daily NEED.
its not about the things i do or what i want. but its more abt what i feel, what i want them to know about my life. and my blog to me is like whole story of life’s journey.

i see my blog there in one of ur sources. i was rather shocked actually. well do continue ur research and all the best.

iyliee.

2007 August 30

Cornelius.

There might be a “small” problem with the computation of the F-score as given above and the F-score as defined by Heylighen and Dewaele (2002). You seem to use frequencies: ((6 nouns – 4 verbs) + 100) / 2 = 51 (F-score). However, [HD2002] work with percentages: ((60% nouns – 40% verbs) + 100) / 2 = 60 (F-score).

Whether this invalidates your results is difficult to say, but it should be clear that working with percentages scales for the size of a document, whereas working with frequencies does not. The [HD2002] formula has been designed to keep the F-score between 0% and 100%.

Interestingly, with percentages one can also apply it to very short documents (such as chats).

Anjo.

2007 August 31

Thanks again for making me aware of this and for the great exchange of ideas we had, Anjo. I felt that there was something funny about the scores and in retrospect it’s certainly something I should have fixed earlier.

From what I can see so far it actually doesn’t seem to affect the tendencies I’ve discovered before though – in most cases they are actually more pronounced now. I’ll recalculate my scores and post something on that soon…
