Dissecting Robert Scoble

2006 October 2
by Cornelius

Disclaimer: No bloggers were harmed in the course of this experiment.

As I’ve hinted at in the past, I’m in the process of building a textual database that contains thousands of posts culled from the RSS feeds of about a hundred corporate blogs, plus a comparison group of several “miscellaneous” blogs randomly picked through blogger.com and blog.com. The corpus currently has a little under 800,000 words and is expected to reach a round million words (or tokens) in about two to three weeks’ time.

So far, I’m calculating only a few basic statistics: post word count, post sentence count and average word/sentence/post length, along with a top 100 list of the most frequent words. Simple as these figures are, they nevertheless give a few interesting clues about the sources in question, especially when you compare one collection of blogs with another.

My test subject today will be Robert Scoble’s blog, Scobleizer. I’ll compare it to a) a large collection of other company blogs and b) a collection of randomly chosen non-corporate blogs. My reasons for picking Robert are pretty unspectacular. I happened to add him to the database fairly early on so that now I have a reasonable amount of data. Also, his immense popularity should make for some interesting results… note that I say “interesting” and not conclusive – a few language statistics don’t equate to the recipe for the Scoble Special Sauce of Blogging Fame. Anyway, let’s crunch a few numbers.

Scobleizer

Posts: 327

First Post to Last Post (FPLP): 2 August 2006, 03:26 - 30 September 2006, 22:07

Tokens / Types (Ratio): 17014 / 3743 (4.55)

Sentences (SC): 1950

Average Word Length (AWL): 4.9

Average Sentence Length (ASL): 10.1

Average Words per Post (AWpP): 52.9*

* not relevant because Scoble’s RSS doesn’t include complete posts but only summaries (the first 56 words)

Corporate Blogs

(Blogs: 107)

Posts: 4443

First Post to Last Post (FPLP): 2 May 2005, 00:00 - 2 October 2006, 00:50

Tokens / Types (Ratio): 667969 / 62230 (10.73)

Sentences (SC): 44350

Average Word Length (AWL): 5.5

Average Sentence Length (ASL): 15.9

Average Words per Post (AWpP): 155.1

Random Blogs Comparison Group

(Blogs: 18)

Posts: 576

First Post to Last Post (FPLP): 17 November 2004, 03:17 - 2 October 2006, 00:48

Tokens / Types (Ratio): 105253 / 16979 (6.2)

Sentences (SC): 10335

Average Word Length (AWL): 5.1

Average Sentence Length (ASL): 10.8

Average Words per Post (AWpP): 184.5

The stats

The first thing to note is that the three collections differ considerably in size. The Scobleizer collection has only 17,014 tokens (words), while both the corporate blog collection (667,969 tokens) and the random blogs comparison group (105,253 tokens) are much larger. This has strong implications for the accuracy of the figures, since statistics drawn from a larger sample are more reliable. The posts indexed in my database are not the total of posts made in those blogs, but only those which have been recorded since I began indexing a few months ago. Some entries date back almost two years (the earliest is from November 2004), which is simply due to the fact that some of the RSS feeds which were used go back that far.

You might be wondering what on earth types are. Don’t worry, it’s really simple: while tokens are all words in a text, types are all unique words. So while the sentence “The cat ate the mouse” has 5 tokens, it only has 4 types because “the” occurs twice. The token-type ratio for that sentence would be 5:4, or 1.25. As you can imagine, a long text will have a significantly larger number of tokens than types, since function words (pronouns, articles, prepositions etc.) are re-used all the time, while lexical words (something like “blog”, “Google” or “greenish”) occur a lot less often.
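To make the token/type distinction concrete, here is a minimal Python sketch. It is not the tool used to build the database: the tokenizer (lowercasing, then a simple regular expression) and the function names are assumptions made purely for illustration, and a real corpus would need more careful handling of punctuation, curly apostrophes and markup.

```python
import re

def tokenize(text):
    # Assumption: a token is a run of letters, digits or straight apostrophes,
    # compared case-insensitively (so "The" and "the" are the same type).
    return re.findall(r"[a-z0-9']+", text.lower())

def token_type_ratio(text):
    tokens = tokenize(text)   # every word occurrence
    types = set(tokens)       # every distinct word
    ratio = len(tokens) / len(types) if types else 0.0
    return len(tokens), len(types), ratio

n_tokens, n_types, ratio = token_type_ratio("The cat ate the mouse")
print(n_tokens, n_types, ratio)  # 5 tokens, 4 types, ratio 1.25
```

Whether “The” and “the” count as one type depends on the lowercasing step; without it, the example sentence would have 5 types instead of 4.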

The other statistics are pretty straightforward: the number of total posts in the database, the time span from the first to the last post, the total number of sentences and three averages: average word length (AWL), average sentence length (ASL) and average words per post (AWpP). AWL refers to the number of characters in a word, while ASL in turn refers to the number of words in a sentence. As mentioned above, Scoble’s AWpP value should be ignored, since his RSS feed does not include complete entries but only summaries.
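The three averages can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions rather than the actual indexing tool: the sentence splitter (break after “.”, “!” or “?” followed by whitespace), the word pattern and the name corpus_stats are all hypothetical, and the function expects a non-empty list of non-empty posts.

```python
import re

def split_sentences(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def words(text):
    # Assumption: a word is a run of letters, digits or straight apostrophes.
    return re.findall(r"[A-Za-z0-9']+", text)

def corpus_stats(posts):
    all_words = [w for post in posts for w in words(post)]
    sentences = [s for post in posts for s in split_sentences(post)]
    return {
        "posts": len(posts),
        "sentences": len(sentences),
        "awl": sum(len(w) for w in all_words) / len(all_words),  # characters per word
        "asl": len(all_words) / len(sentences),                  # words per sentence
        "awpp": len(all_words) / len(posts),                     # words per post
    }

stats = corpus_stats(["The cat ate the mouse. It was hungry.", "Blogs are fun."])
print(stats)
```

A splitter this simple will miscount around abbreviations and ellipses, so sentence counts (and therefore ASL) depend heavily on how sentence boundaries are detected.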

A cautious interpretation

The comparison shows that Robert Scoble uses shorter words and sentences than both the blogs in the random comparison group and those in the corporate blogging collection. Words are only slightly shorter (Scoble: 4.9; corporate blogs: 5.5; random blogs: 5.1), but variation in this category is normally not very strong, so the difference between Scoble and the corporate blogs seems notable. The differences in sentence length (10.1; 15.9; 10.8) are even more pronounced: on average, the other corporate blogs have much longer sentences than Scoble, who is again a little below the average value of the random blogs. Finally, it cannot be determined if Scoble’s posts are shorter than those in the other two collections (52.9*; 155.1; 184.5) because his RSS feed syndicates only summaries, though my personal bet would be that they are. This is also the only category where the random group scores higher than the corporates.
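To put rough numbers on “slightly” and “more pronounced”, the gaps can be expressed as percentages. The sketch below only re-uses the averages from the stat blocks above; the helper percent_shorter is a hypothetical name, and the result is plain arithmetic, not a test of statistical significance.

```python
# Averages copied from the three stat blocks above.
AVERAGES = {
    "Scobleizer": {"awl": 4.9, "asl": 10.1},
    "Corporate blogs": {"awl": 5.5, "asl": 15.9},
    "Random blogs": {"awl": 5.1, "asl": 10.8},
}

def percent_shorter(value, reference):
    # How much smaller `value` is than `reference`, as a percentage of `reference`.
    return (reference - value) / reference * 100

scoble = AVERAGES["Scobleizer"]
gaps = {}
for name in ("Corporate blogs", "Random blogs"):
    other = AVERAGES[name]
    gaps[name] = {
        key: round(percent_shorter(scoble[key], other[key]), 1)
        for key in ("awl", "asl")
    }
    print(name, gaps[name])
```

On these figures, Scoble’s sentences come out roughly 36% shorter than the corporate average and his words roughly 11% shorter, while the gaps to the random blogs are only about 6% and 4%.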

So what does this mean? In one sentence, it means that on average Robert Scoble seems to use noticeably shorter sentences than most other corporate bloggers, and that the words he uses are also somewhat shorter. Looking further, it appears that Scoble’s style – only speaking in terms of word and sentence length – is closer to that of non-corporate bloggers. However, these numerical statistics aren’t terribly exciting by themselves, which is why tomorrow I’ll take a peek at a list of the most frequently used words in our three source collections.

(to be continued)

Edit: My claim that Robert’s RSS feed does not contain full texts is bogus - my indexing tool was simply looking in the wrong place. I’ll correct the problem asap. Mea culpa.
