Oracle bloggers are storytellers, Microsoft bloggers are technocrats
Well, at least that’s what my corporate blogging corpus tells me. I’ve been tracking both companies’ developer blog networks for quite a while now (here and here) and my little stylometrics daemon has found some peculiar differences. So now that the holidays and my trip to California are both in the past, let’s get right back into the trenches of language analysis. Here’s the data.
OraBlogs.com (Oracle)
Posts: 1,234
Tracked from/to: 7 Aug 2006 to 19 Dec 2006*
Words (types): 100,805
Words (tokens): 177,488
Ratio: 0.57**
MSDN Blogs (Microsoft)
Posts: 1,411
Tracked from/to: 14 Sep 2006 to 19 Dec 2006*
Words (types): 204,790
Words (tokens): 455,947
Ratio: 0.45**
(* = I’m still tracking both feeds, I simply selected this date as a cut-off point)
(** = Note that the ratio is not particularly useful unless certain normalizations are performed)
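For reference, the ratio is simply types divided by tokens. A minimal sketch that reproduces the two figures from the counts listed above (the function name is made up for illustration, and no normalization is applied):

```python
def type_token_ratio(types: int, tokens: int) -> float:
    """Distinct word forms (types) divided by running words (tokens)."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return types / tokens


# Counts taken from the statistics above.
orablogs_ttr = type_token_ratio(100_805, 177_488)
msdn_ttr = type_token_ratio(204_790, 455_947)

print(f"OraBlogs: {orablogs_ttr:.2f}")
print(f"MSDN:     {msdn_ttr:.2f}")
```

Because longer texts repeat words more often, the raw ratio tends to drop as the token count grows, which is why the two values are not directly comparable.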
What these statistics tell us is that both blog networks have a comparable total number of posts (1,234 Oracle vs 1,411 Microsoft) and that I’ve been tracking them for roughly the same time (three months for Microsoft, four for Oracle). You can also see that MSDN has a higher daily output – the Microsofties appear to have produced more posts in less time. That doesn’t necessarily mean that they are individually more productive though, since we don’t know how many different authors are in each network. If MSDN has twice the number of bloggers they’ll obviously have no trouble producing a greater number of posts per day than the guys and girls over at Orablogs.com. Also keep in mind that I might occasionally miss posts as I rely on their RSS feeds for syndication and entries can “scroll past me” on a busy day.
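The daily output can be estimated from the post counts and tracking dates given above. This is only a back-of-the-envelope sketch: it assumes the tracking window is the plain difference between the two dates, and it ignores any posts missed in the feeds.

```python
from datetime import date


def posts_per_day(posts: int, start: date, end: date) -> float:
    """Average posts per day over the tracking window (end date exclusive)."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    return posts / days


cutoff = date(2006, 12, 19)
ora_rate = posts_per_day(1_234, date(2006, 8, 7), cutoff)
msdn_rate = posts_per_day(1_411, date(2006, 9, 14), cutoff)

print(f"OraBlogs: {ora_rate:.1f} posts/day")
print(f"MSDN:     {msdn_rate:.1f} posts/day")
```

Under these assumptions OraBlogs comes out at roughly 9 posts per day and MSDN at roughly 15.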
What is quite interesting, however, is the total number of words (tokens) for each hub. Oracle comes in with a word count of just over 177k, while Microsoft has a staggering 455k – more than 2.5x that number. It is relatively safe to assume that MSDN is simply the bigger network, but there’s more to it than that. Let’s look at the sentence count and a few averages.
OraBlogs
Sentences (SC): 10,271
Average Word Length (AWL): 4.1
Average Sentence Length (ASL): 17.3
Average Words per Post (AWpP): 143.8
MSDN
Sentences (SC): 33,803
Average Word Length (AWL): 4.6
Average Sentence Length (ASL): 13.5
Average Words per Post (AWpP): 323.1
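To make the four measures concrete, here is a rough sketch of how they could be computed from a list of posts. The tokenizer and sentence splitter below are deliberately naive stand-ins (simple regular expressions), not the tagger-based pipeline behind the numbers above, so real results would differ somewhat.

```python
import re


def corpus_averages(posts: list[str]) -> dict[str, float]:
    """Compute SC, AWL, ASL and AWpP for a list of post texts."""
    if not posts:
        raise ValueError("need at least one post")
    words: list[str] = []
    sentence_count = 0
    for post in posts:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", post.strip()) if s]
        sentence_count += len(sentences)
        # Naive tokenization: runs of letters, digits and apostrophes.
        words.extend(re.findall(r"[A-Za-z0-9']+", post))
    token_count = len(words)
    if token_count == 0 or sentence_count == 0:
        raise ValueError("posts contain no words")
    return {
        "SC": sentence_count,
        "AWL": sum(len(w) for w in words) / token_count,
        "ASL": token_count / sentence_count,
        "AWpP": token_count / len(posts),
    }


sample = ["I like Oracle. It is fast.", "We ship code!"]
stats = corpus_averages(sample)
print(stats)
```

Note that ASL and AWpP are just the token count divided by the sentence count and the post count respectively, so they can be checked against the totals listed earlier.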
MSDN has longer words, quite a bit above the average length for all blogs that I’m indexing (that global word length average stands at 4.2). This could potentially be caused by a whole lot of things, but my prime suspects are not really specific to the language of blogs: URLs and non-English words. URLs (web addresses) can be very long, and they give language-processing tools headaches because they look like words (a string with whitespace left and right). It is possible to extract them of course, but I haven’t implemented that yet. In most larger collections this shouldn’t make much of a difference, as the relative frequency of URLs is reasonably low, but things could be different with MSDN. The other issue is interference from non-English sources: MSDN has quite a few people who write in Russian, Mandarin Chinese, etc., and because the tagger doesn’t recognize these posts as non-English, they are likely to be misinterpreted.
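As an illustration of how much a single URL can skew the average, here is a minimal sketch that strips URLs before measuring word length. The regular expression is a simple assumption (anything starting with `http://`, `https://` or `www.` up to the next whitespace), and the sample sentence and address are invented.

```python
import re

# Assumed pattern: a URL is a scheme or "www." followed by non-whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def average_word_length(text: str, strip_urls: bool = True) -> float:
    """Mean length of whitespace-delimited words, optionally ignoring URLs."""
    if strip_urls:
        text = URL_PATTERN.sub(" ", text)
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


sample = "See http://example.com/some/very/long/path for the details"
with_urls = average_word_length(sample, strip_urls=False)
without_urls = average_word_length(sample)

print(f"with URLs:    {with_urls:.1f}")
print(f"without URLs: {without_urls:.1f}")
```

A pattern this loose will miss some addresses and catch trailing punctuation, so it is a starting point rather than a robust extractor.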
The differences in post and sentence length, however, are fairly unlikely to be error-induced. As it stands, MSDN bloggers write posts that are more than twice as long as those written by their colleagues at Oracle (323.1 vs 143.8 words per post). The Oracle guys write longer sentences, slightly above the corpus mean (16.6 words per sentence). By comparison, MSDN’s average sentence length of 13.5 seems relatively low. One of the lowest values I have in my corpus is 12.5 (from an American teenager’s blog); a high one is 42.9, from IBM’s Irving Wladawsky-Berger. Please don’t draw any quick conclusions from this though. I’m pretty sure that Ernest Hemingway could be in the single digits with most of his prose.
Shorter sentences seem to correlate with longer posts, something that isn’t really too surprising, but still interesting to see with live numbers. Normally you would factor in type-token ratio here to look at the lexical density, but I’m not ready to do that with such vast differences in total word count (there is a technique to avoid such issues, but I’ll spare you the details). Anyway, let’s move on to wordlists.
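One widely used way around the size problem, though not necessarily the technique alluded to above, is a standardized type-token ratio: compute the ratio over consecutive segments of equal length and average the results. A minimal sketch, with a made-up function name and a tiny segment size so the toy example is easy to follow (1,000 tokens per segment is a more typical choice):

```python
def standardized_ttr(tokens: list[str], segment_size: int = 1000) -> float:
    """Mean type-token ratio over consecutive full segments of equal size."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    ratios = []
    # Only full segments are used; a trailing partial segment is dropped.
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        ratios.append(len(set(segment)) / segment_size)
    if not ratios:
        raise ValueError("need at least one full segment")
    return sum(ratios) / len(ratios)


tokens = "the cat sat on the mat and the dog sat on the rug".split()
sttr = standardized_ttr(tokens, segment_size=4)
print(f"{sttr:.3f}")
```

Because every segment has the same length, corpora of very different sizes can be compared on this value.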
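| Rank | OraBlogs word | Tag | Freq | MSDN word | Tag | Freq |
|------|---------------|-----|------|-----------|-----|------|
| 1 | the | DT | 8095 | the | DT | 18155 |
| 2 | to | TO | 4354 | to | TO | 10036 |
| 3 | a | DT | 3650 | a | DT | 7408 |
| 4 | and | CC | 3175 | and | CC | 7261 |
| 5 | I | PP | 2996 | of | IN | 6406 |
| 6 | of | IN | 2889 | in | IN | 5202 |
| 7 | in | IN | 2358 | is | VBZ | 4642 |
| 8 | is | VBZ | 1961 | For | IN | 3856 |
| 9 | It | PP | 1794 | I | PP | 3836 |
| 10 | For | IN | 1665 | you | PP | 3821 |
| 11 | you | PP | 1628 | It | PP | 3171 |
| 12 | on | IN | 1392 | this | DT | 3127 |
| 13 | this | DT | 1341 | on | IN | 2810 |
| 14 | Oracle | NP | 992 | with | IN | 2100 |
| 15 | with | IN | 978 | are | VBP | 1996 |
| 16 | that | IN | 966 | that | IN | 1996 |
| 17 | be | VB | 855 | be | VB | 1965 |
| 18 | was | VBD | 836 | we | PP | 1861 |
| 19 | at | IN | 757 | can | MD | 1647 |
| 20 | my | PP$ | 713 | as | IN | 1634 |
| 21 | are | VBP | 693 | If | IN | 1581 |
| 22 | an | DT | 660 | that | WDT | 1518 |
| 23 | as | IN | 649 | will | MD | 1407 |
| 24 | from | IN | 649 | an | DT | 1313 |
| 25 | but | CC | 631 | from | IN | 1281 |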
I’ve highlighted a few things that look interesting to me. First of all, the general similarities are really not very surprising if you’ve seen this kind of thing before. Sometimes the news media comes up with sensationalist stories about the decline of our civilization, often with a pseudolinguistic angle à la “teenagers these days use only 100 different words in their speech”. Lists such as this one illustrate why that is a ridiculous argument. In both speech and writing we do most of the work using the same handy components over and over again. The definite article THE is usually in the number one position when you’re dealing with written English texts, and it is fairly often followed by TO, AND and OF, though the exact sequence varies from one type of text to the next. While there is a marked difference in such distributions between written and spoken language (makes sense when you think about it), the picture is otherwise quite consistent. For example, in any larger collection of text it is exceedingly likely that there will be very few or no nouns among the top 10 words, even though numerically nouns clearly dominate over other word classes in the lexicon. Function words dominate frequency lists because they are the bricks and mortar of writing.
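For anyone who wants to try this on their own texts, a bare-bones frequency list takes only a few lines. This sketch uses a simple regular-expression tokenizer and lowercases everything; unlike the lists above it does no part-of-speech tagging, so the tag column is missing.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 25) -> list[tuple[str, int]]:
    """Return the n most frequent lowercased word forms with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


sample = "the cat sat on the mat and the dog sat on the rug"
ranked = top_words(sample, n=3)

for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
```

Even in a toy sentence like this one, the definite article ends up in first place.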
I’ll leave it at that for today, but the second part of this little style comparison will be posted shortly. Stay tuned.