Oracle bloggers are storytellers, Microsoft bloggers are technocrats

Well, at least that’s what my corporate blogging corpus tells me. I’ve been tracking both companies’ developer blog networks for quite a while now (here and here) and my little stylometrics daemon has found some peculiar differences. So now that the holidays and my trip to California are both in the past, let’s get right back into the trenches of language analysis. Here’s the data.

OraBlogs.com (Oracle)

Posts: 1,234

Tracked from/to: 7 Aug 2006 to 19 Dec 2006*

Words (types): 100,805

Words (tokens): 177,488

Ratio: 0.57**

MSDN Blogs (Microsoft)

Posts: 1,411

Tracked from/to: 14 Sep 2006 to 19 Dec 2006*

Words (types): 204,790

Words (tokens): 455,947

Ratio: 0.45**

(* = I’m still tracking both feeds, I simply selected this date as a cut-off point)

(** = Note that the ratio is not particularly useful unless certain normalizations are performed)
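For reference, the ratio is simply the number of types divided by the number of tokens. Here is a minimal Python sketch, assuming a naive regex tokenizer. The tokenizer actually used for the figures above isn't described, so `tokenize` and `type_token_ratio` are illustrative names only.

```python
import re

def tokenize(text):
    # Naive assumption: a word is a run of letters or apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    # Types are distinct word forms; tokens are all running words.
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = tokenize("The cat sat on the mat and the dog sat too")
print(len(set(sample)), len(sample), round(type_token_ratio(sample), 2))

# The reported ratios can be recomputed from the counts listed above.
oracle_ratio = round(100_805 / 177_488, 2)
msdn_ratio = round(204_790 / 455_947, 2)
print(oracle_ratio, msdn_ratio)
```

Longer texts repeat more words, so the ratio tends to drop as the token count grows. That is why the raw figures for two collections of such different sizes aren't directly comparable.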

What these statistics tell us is that both blog networks have a comparable total number of posts (1,234 Oracle vs 1,411 Microsoft) and that I’ve been tracking them for roughly the same time (three months for Microsoft, four for Oracle). You can also see that MSDN has a higher daily output - the Microsofties appear to have produced more posts in less time. That doesn’t necessarily mean that they are individually more productive though, since we don’t know how many different authors are in each network. If MSDN has twice the number of bloggers they’ll obviously have no trouble producing a greater number of posts per day than the guys and girls over at Orablogs.com. Also keep in mind that I might occasionally miss posts as I rely on their RSS feeds for syndication and entries can “scroll past me” on a busy day.
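The daily output can be worked out from the post counts and tracking dates given above. A quick sketch in Python, assuming both end dates are inclusive (`posts_per_day` is a hypothetical helper, not part of any tool mentioned here):

```python
from datetime import date

def posts_per_day(posts, start, end):
    # Assumption: the tracking window includes both the start and end date.
    days = (end - start).days + 1
    return posts / days

oracle = posts_per_day(1234, date(2006, 8, 7), date(2006, 12, 19))
msdn = posts_per_day(1411, date(2006, 9, 14), date(2006, 12, 19))

print(f"OraBlogs: {oracle:.1f} posts/day")
print(f"MSDN:     {msdn:.1f} posts/day")
```

Under that assumption OraBlogs comes out at roughly 9 posts per day and MSDN at roughly 14.5, which matches the impression of a higher daily output on the Microsoft side.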

What is quite interesting, however, is the total number of words (tokens) for each hub. Oracle comes in with a word count of just over 177k, while Microsoft has a staggering 455k – more than 2.5x that number. We’re relatively safe to assume that MSDN is simply the bigger network, but there’s more to it than that. Let’s look at the sentence count and a few averages.

OraBlogs

Sentences (SC): 10,271

Average Word Length (AWL): 4.1

Average Sentence Length (ASL): 17.3

Average Words per Post (AWpP): 143.8

MSDN

Sentences (SC): 33,803

Average Word Length (AWL): 4.6

Average Sentence Length (ASL): 13.5

Average Words per Post (AWpP): 323.1
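Two of these averages can be recomputed from the totals already listed. This small Python sanity check assumes ASL is tokens divided by sentences and AWpP is tokens divided by posts, which the reported numbers bear out. The function name is hypothetical.

```python
def derived_averages(tokens, sentences, posts):
    # Assumed definitions: both averages are plain ratios of the raw totals.
    return {
        "ASL": round(tokens / sentences, 1),   # average sentence length
        "AWpP": round(tokens / posts, 1),      # average words per post
    }

orablogs = derived_averages(tokens=177_488, sentences=10_271, posts=1_234)
msdn = derived_averages(tokens=455_947, sentences=33_803, posts=1_411)

print("OraBlogs:", orablogs)
print("MSDN:    ", msdn)
```

Average word length can't be checked this way because it needs the character counts, which aren't listed.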

MSDN has longer words, quite a bit above the average length for all blogs that I’m indexing (that global word length average stands at 4.2). This could be caused by a whole lot of things, but my prime suspects are not really specific to the language of blogs: URLs and non-English words. URLs (web addresses) can be very long, and they give language-processing tools headaches because they look like words (a string with whitespace to the left and right). It is possible to filter them out of course, but I haven’t implemented that yet. In most larger collections this shouldn’t make too much of a difference, as the relative frequency of URLs is reasonably low, but things could be different with MSDN. The other issue is interference from non-English sources: MSDN has quite a few people who write in Russian, Mandarin Chinese, etc., and because the tagger doesn’t recognize these posts as non-English, they are likely to be misinterpreted.
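To illustrate how much a single URL can skew average word length, here is a rough Python sketch of a URL filter. It is not the filter used for this corpus (none exists yet, as noted above). The pattern only catches tokens that start with `http://`, `https://` or `www.`, and all names are made up for the example.

```python
import re

# Assumption: a token counts as a URL if it starts with a scheme or "www.".
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)

def average_word_length(tokens, drop_urls=False):
    if drop_urls:
        tokens = [t for t in tokens if not URL_RE.match(t)]
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0

tokens = "see http://example.com/a/very/long/path for the full story".split()

with_urls = average_word_length(tokens)
without_urls = average_word_length(tokens, drop_urls=True)
print(round(with_urls, 2), round(without_urls, 2))
```

In a six-word sentence the effect is dramatic. In a corpus of several hundred thousand tokens it would be far smaller, but it grows with the share of URLs in the text.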

The differences in post and sentence length, however, are fairly unlikely to be error-induced. As it stands, MSDN bloggers write posts that are more than twice as long as those written by their colleagues from Oracle (323.1 vs 143.8 words on average). The Oracle guys write longer sentences, slightly above the corpus mean (16.6 words per sentence). By comparison, MSDN’s average sentence length of 13.5 seems relatively low. One of the lowest values I have in my corpus is 12.5 (from an American teenager’s blog); a high one is 42.9, from IBM’s Irving Wladawsky-Berger. Please don’t draw any quick conclusions from this though. I’m pretty sure that Ernest Hemingway could be in the single digits with most of his prose.

Shorter sentences seem to correlate with longer posts, something that isn’t really too surprising, but still interesting to see with live numbers. Normally you would factor in the type-token ratio here to look at the lexical density, but I’m not ready to do that with vast differences in total word count (there is a technique to avoid such issues but I’ll spare you the details). Anyway, let’s move on to wordlists.
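One commonly used workaround for comparing texts of different sizes is a standardized type-token ratio: compute the ratio over fixed-size chunks and average the results. Whether this is the technique alluded to above is an assumption. The sketch below is only an illustration in Python, with a hypothetical function name and an arbitrary default chunk size.

```python
def standardized_ttr(tokens, chunk_size=1000):
    # Average the type-token ratio over consecutive, equally sized chunks.
    # A trailing partial chunk is dropped so every chunk is comparable.
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    return sum(ratios) / len(ratios) if ratios else None

# Tiny demonstration with a chunk size of 4 instead of 1000.
demo = ["a", "b", "a", "b", "c", "c", "d", "e", "f"]
print(standardized_ttr(demo, chunk_size=4))
```

Because every chunk has the same length, the resulting value no longer shrinks just because one collection has more than twice as many tokens as the other.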

OraBlogs

1 the DT 8095

2 to TO 4354

3 a DT 3650

4 and CC 3175

5 I PP 2996

6 of IN 2889

7 in IN 2358

8 is VBZ 1961

9 It PP 1794

10 For IN 1665

11 you PP 1628

12 on IN 1392

13 this DT 1341

14 Oracle NP 992

15 with IN 978

16 that IN 966

17 be VB 855

18 was VBD 836

19 at IN 757

20 my PP$ 713

21 are VBP 693

22 an DT 660

23 as IN 649

24 from IN 649

25 but CC 631

MSDN

1 the DT 18155

2 to TO 10036

3 a DT 7408

4 and CC 7261

5 of IN 6406

6 in IN 5202

7 is VBZ 4642

8 For IN 3856

9 I PP 3836

10 you PP 3821

11 It PP 3171

12 this DT 3127

13 on IN 2810

14 with IN 2100

15 are VBP 1996

16 that IN 1996

17 be VB 1965

18 we PP 1861

19 can MD 1647

20 as IN 1634

21 If IN 1581

22 that WDT 1518

23 will MD 1407

24 an DT 1313

25 from IN 1281

I’ve highlighted a few things that look interesting to me. First of all, the general similarities are really not very surprising if you’ve seen this kind of thing before. Sometimes the news media comes up with sensationalist stories about the decline of our civilization, often with a pseudolinguistic angle à la “teenagers these days use only 100 different words in their speech”. Lists such as this one illustrate why it’s a ridiculous argument. In both speech and writing we do most of the work using the same handy components over and over again. The definite article THE in English is usually in the number one position when you’re dealing with written texts, and it is fairly often followed by TO, AND and OF, though the exact sequence varies from one type of text to the next. While there is a marked difference in such distributions between written and spoken language (makes sense when you think about it), the picture is otherwise quite consistent. For example, in any larger collection of text it is exceedingly likely that there will be very few or no nouns among the top 10 words, even though numerically nouns clearly dominate over other word classes in the lexicon. Function words top the frequency lists because they are the bricks and mortar of writing.
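A ranked list like the ones above is easy to produce for any text. This is a bare-bones Python sketch that counts word forms only. The part-of-speech column would need a tagger, which is left out here, and the tokenizer is a naive assumption rather than the one used for the lists in this post.

```python
import re
from collections import Counter

def frequency_list(text, top_n=25):
    # Naive tokenization: lowercase runs of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

sample = "The cat and the dog and the bird sat in the sun"
for rank, (word, count) in enumerate(frequency_list(sample, top_n=3), start=1):
    print(rank, word, count)
```

Even in a toy sentence the function words float to the top, which is the same pattern the two lists show at scale.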

I’ll leave it at that for today, but the second part of this little style comparison will be posted shortly. Stay tuned.



License

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 License.