Oracle bloggers are storytellers, Microsoft bloggers are technocrats (III)

I think it’s about time that I finished up my little stylometric analysis of Oracle’s and Microsoft’s blog hubs that I started last month (part I, part II). While what I conducted was really just a quick glimpse at how certain linguistic features are distributed in both blogs, I think it still gave an impression of the differences in “blog culture” between the two companies.

Let’s look at the key observations again:

1. MSDN produces more posts per day than OraBlogs

This is hardly surprising, as there are likely to be more individual bloggers in Microsoft’s hub than in Oracle’s.

2. Individual posts in MSDN are on average twice as long as they are in OraBlogs

In conjunction with this, sentences in MSDN are generally shorter than they are in OraBlogs. There may be interference from non-English sources or from non-native English speakers here which could skew the numbers for MSDN somewhat.

3. Use of the first-person pronoun is marked in OraBlogs, while it is unmarked in MSDN

The use of I in OraBlogs is on par with the overall distribution in the blogs I’m tracking, whereas it is below average in MSDN. If we infer some level of personal involvement from this, it suggests that MSDN bloggers are less involved than most bloggers.

4. Use of modals to express possibility/futurity is marked in MSDN, while past-tense markers rank lower-than-average

That is, future events and possibility appear to be referenced more frequently in MSDN, whereas past events play a larger role in OraBlogs.

5. When contrasting the two sources, posts from OraBlogs tend to be more verbal, while posts from MSDN tend to be more nominal

Overall, MSDN exhibits a high noun frequency, while OraBlogs has a comparatively low one. This fits quite well with the findings noted earlier.

So what does all of this mean? Well, the headline already gave the verdict away. However, it’s a good idea to differentiate a bit more.

Oracle bloggers seem to relate things they are (or were) personally involved in more often than MSDN contributors. The latter seem to focus on present and future events and play a smaller role in their own writing, i.e. what they write relates less immediately to themselves than is the case with the orabloggers. Oracle’s bloggers also produce shorter posts, something that makes sense if you think about how “resource-intensive” writing a technical text usually is, compared to relating what you did on the weekend.

So there you have it: Microsoft is all business while Oracle has a knack for telling stories - at least in terms of how people at the two companies blog. Are these differences in style the result of different corporate cultures? Is it pure coincidence that there is a difference in how they define blogging? Or am I just overlooking something?

Let me know what you think.

Oracle bloggers are storytellers, Microsoft bloggers are technocrats (II)

Edit #1: As Justin Kestelyn points out, Orablogs.com is not Oracle’s official blog hub (blogs.oracle.com is).

 

Edit #2: Sadly, some of the charts in this post are still missing due to problems with a recent WordPress update. If I find the time, I will write a follow-up on this with new charts.

Welcome to part two of this class: Blog Stylistics 101. Last week we looked at some statistics and word lists comparing the OraBlogs and MSDN blog hubs. Today, let’s turn to the specific differences between the two hubs. I’ll start by giving you the updated word list, since the one used in the previous entry is already a tad stale by now.

OraBlogs

1 the DT 8874

2 to TO 4815

3 a DT 4098

4 and CC 3528

5 I PP 3339

6 of IN 3212

7 in IN 2618

8 is VBZ 2172

9 It PP 2002

10 For IN 1837

11 you PP 1767

12 on IN 1563

13 this DT 1469

14 with IN 1106

15 Oracle NP 1080

16 that IN 1074

17 be VB 939

18 was VBD 932

19 at IN 823

20 my PP$ 803

21 are VBP 757

22 an DT 748

23 as IN 736

24 from IN 700

25 but CC 699

MSDN

1 the DT 21791

2 to TO 11819

3 a DT 8811

4 and CC 8626

5 of IN 7701

6 in IN 6186

7 is VBZ 5687

8 I PP 4614

9 For IN 4610

10 you PP 4454

11 It PP 3864

12 this DT 3689

13 on IN 3317

14 with IN 2506

15 that IN 2411

16 are VBP 2394

17 be VB 2368

18 we PP 2334

19 as IN 1964

20 If IN 1926

21 can MD 1921

22 that WDT 1778

23 will MD 1682

24 from IN 1636

25 an DT 1563

 

First I’ve highlighted the pronouns I, WE, IT, YOU and the possessive determiner MY. The OraBloggers are a bit more egocentric (I #5) than the Microbloggers (I #8, WE #18), who appear to mention the team more frequently (Borg Collective, anyone?). Now, before you run amok with those numbers, there are of course a lot of possible factors and caveats. You can avoid I and WE by using IT or THERE-constructions, or by simply repeating the referenced noun phrase (maybe that’s the case with ORACLE at #15 – hard to say). Of course WE can refer to something other than the company; it can simply be an indicator that people tend to hang out in groups at Microsoft while Oracle devs are more solitary. WE can be the royal corporate we, as in we at Microsoft love our customers, or it can just be any group of people the speaker includes himself in, as in Bob and I, we had sandwiches for lunch. Technically, the second scenario is actually more likely, but common sense tells us that occurrences of this “general WE” shouldn’t be more frequent in Microsoft’s blogs than they are in Oracle’s unless there is some difference either in their behavior or in how they report it. But even when taking this and a number of other things into account, the difference seems at least worthy of closer investigation, especially the variation in first-person pronouns, which is pretty clearly marked.

The frequency of I is determined both by the author’s stylistic preference and by the subject matter. Generally, personal involvement of the author makes it very hard to omit I (referring to yourself in the third person is not really a viable strategy in English), but there are exceptions. For example, it is nearly impossible to report what you did last summer without using I, but it is quite possible to report how you conducted a scientific experiment with little or no use of that pronoun. In most news reporting there is no explicit voice that is linguistically detectable, even if the reporting journalist is clearly the individual who has experienced the events. Likewise, use or omission of I makes a big difference when expressing opinion or criticism. A presidential address typically contains no first-person reference to the speaker because the president is not offering his private opinion, but acting in his official function.

Assuming, however, that none of this is really typical of blogs (which tend to be quite involved, with lots of I-usage), the higher I-count in OraBlogs really signals more personal involvement compared with Microsoft. Or you can interpret it as egocentrism at Oracle vs. team-orientedness at Microsoft. Tricky, isn’t it?

Next I’ve marked past tense BE (#18 in OraBlogs) and the modals CAN (#21 in MSDN) and WILL (#23 in MSDN). It is notable that the modals rank higher than average in MSDN but lower in OraBlogs (corpus averages are #33, #32). Past-tense usage, on the other hand, is more frequent in Oracle’s blogs than in the corpus as a whole. Since the corpus includes personal blogs and other types which tend to have a knack for storytelling, the tendency is actually a relatively strong one. MSDN, by contrast, is more about future events and possibility than storytelling.

So far so good – let’s look at word classes.

 

[Chart missing: part-of-speech frequencies, OraBlogs (left), MSDN (right)]

This chart probably needs a little explanation. Start with the leftmost column, where the first line starts with “CC”. That column holds the part-of-speech tag, which labels word classes such as noun, verb, adjective etc. (CC itself marks coordinating conjunctions). The second column has the absolute frequency of that part of speech. So the adjective (JJ) count for OraBlogs is 10,428. That in turn means that adjectives make up 5.3% of all words in Oracle’s blogs. The graph in the column to the right of the percentage visualizes this accordingly, which is why the bar is so long for the NN type. NN stands for common noun (things like man, dog, or cable connector all belong to this category), a class that usually accounts for a large share of all words.
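To make the relationship between the absolute count and the percentage column concrete, here is a minimal Python sketch. It assumes the tagger output is already available as a plain list of tags; the function name and the tiny sample are made up for illustration and are not the code behind the chart.

```python
from collections import Counter

def pos_shares(tags):
    """Map each POS tag to (absolute count, percentage of all tagged words)."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: (count, 100.0 * count / total)
            for tag, count in counts.most_common()}

# Hypothetical mini-sample of tagger output, one tag per word.
sample_tags = ["DT", "NN", "VBD", "IN", "DT", "NN", "CC", "DT", "NN", "VBD"]
shares = pos_shares(sample_tags)

for tag, (count, percent) in shares.items():
    bar = "#" * round(percent / 5)  # crude text version of the chart's bar
    print(f"{tag:<4}{count:>4}{percent:>7.1f}%  {bar}")
```

The percentage is simply the tag's count divided by the total number of tagged words, so the bars are comparable between two collections of different size.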

 

So where are the differences? One notable thing is the higher IN-frequency in OraBlogs (9.1%) compared with MSDN (8.1%). The IN tag is used for both prepositions (e.g. behind, on) and subordinating conjunctions (e.g. whether, because), which makes it rather difficult to say what exactly is more frequent here. However, the higher IN-frequency in OraBlogs makes sense in light of the greater average sentence length – longer sentences demand either coordination (measured with the CC tag) or subordination. The other interesting thing is the frequency of NN (common nouns) and NP (proper nouns), because that’s where Microsoft’s bloggers score very high, much higher than Oracle’s, who are actually below the corpus average. So what are all those nouns needed for? My assumption is they’re mostly for talking about inanimate subjects – stuff – because that would fit with the comparatively low pronoun (PP) count. The table is actually incomplete; the figures for verbs (which would appear further down the list, after TO) are missing, but there isn’t a lot of observable variation there, except for a higher past-tense usage on the part of the Oracles.

 

Okay, enough to digest for one sitting. I’ll put the grand conclusion into the third part of this series. And yes, I’ll try to post that in less than a week from now. :-)

Oracle bloggers are storytellers, Microsoft bloggers are technocrats

Well, at least that’s what my corporate blogging corpus tells me. I’ve been tracking both companies’ developer blog networks for quite a while now (here and here) and my little stylometrics daemon has found some peculiar differences. So now that the holidays and my trip to California are both in the past, let’s get right back into the trenches of language analysis. Here’s the data.

OraBlogs.com (Oracle)

Posts: 1,234

Tracked from/to: 7 Aug 2006 to 19 Dec 2006*

Words (types): 100,805

Words (tokens): 177,488

Ratio: 0.57**

MSDN Blogs (Microsoft)

Posts: 1,411

Tracked from/to: 14 Sep 2006 to 19 Dec 2006*

Words (types): 204,790

Words (tokens): 455,947

Ratio: 0.45**

(* = I’m still tracking both feeds, I simply selected this date as a cut-off point)

(** = Note that the ratio is not particularly useful unless certain normalizations are performed)
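The ratio in the two tables is the number of types divided by the number of tokens. A minimal Python sketch reproduces both figures from the counts above; the function name is only illustrative and not part of the tooling used for this post.

```python
def type_token_ratio(types: int, tokens: int) -> float:
    """Return distinct word forms (types) divided by running words (tokens)."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return types / tokens

# Counts taken from the two tables above.
orablogs_ttr = type_token_ratio(100_805, 177_488)
msdn_ttr = type_token_ratio(204_790, 455_947)

print(f"OraBlogs: {orablogs_ttr:.2f}")
print(f"MSDN:     {msdn_ttr:.2f}")
```

Because this ratio tends to fall as a text collection grows, the lower MSDN value partly reflects the larger token count, which is why the footnote warns about normalization.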

What these statistics tell us is that both blog networks have a comparable total number of posts (1,234 Oracle vs 1,411 Microsoft) and that I’ve been tracking them for roughly the same time (three months for Microsoft, four for Oracle). You can also see that MSDN has a higher daily output - the Microsofties appear to have produced more posts in less time. That doesn’t necessarily mean that they are individually more productive though, since we don’t know how many different authors are in each network. If MSDN has twice the number of bloggers they’ll obviously have no trouble producing a greater number of posts per day than the guys and girls over at Orablogs.com. Also keep in mind that I might occasionally miss posts as I rely on their RSS feeds for syndication and entries can “scroll past me” on a busy day.
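The daily output can be estimated from the post counts and tracking dates given above. This is a rough sketch: it assumes the tracking window is simply the number of days between the two dates, and the function name is made up for illustration.

```python
from datetime import date

def posts_per_day(posts: int, start: date, end: date) -> float:
    """Average number of posts per day between two tracking dates."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be later than start")
    return posts / days

cutoff = date(2006, 12, 19)
orablogs_rate = posts_per_day(1_234, date(2006, 8, 7), cutoff)
msdn_rate = posts_per_day(1_411, date(2006, 9, 14), cutoff)

print(f"OraBlogs: {orablogs_rate:.1f} posts/day")
print(f"MSDN:     {msdn_rate:.1f} posts/day")
```

Since posts can be missed in the RSS feeds, both rates are best read as lower bounds.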

What is quite interesting, however, is the total number of words (tokens) for each hub. Oracle comes in with a word count of just over 177k, while Microsoft has a staggering 455k – more than 2.5x that number. We’re relatively safe to assume that MSDN is simply the bigger network, but there’s more to it than that. Let’s look at the sentence count and a few averages.

OraBlogs

Sentences (SC): 10,271

Average Word Length (AWL): 4.1

Average Sentence Length (ASL): 17.3

Average Words per Post (AWpP): 143.8

MSDN

Sentences (SC): 33,803

Average Word Length (AWL): 4.6

Average Sentence Length (ASL): 13.5

Average Words per Post (AWpP): 323.1
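The sentence-length and words-per-post averages follow directly from the token, sentence and post counts listed so far. The sketch below recomputes them in Python; the dictionary layout and names are assumptions for illustration only.

```python
# Counts taken from the tables above.
corpora = {
    "OraBlogs": {"tokens": 177_488, "sentences": 10_271, "posts": 1_234},
    "MSDN": {"tokens": 455_947, "sentences": 33_803, "posts": 1_411},
}

def averages(stats: dict) -> tuple:
    """Return (average sentence length, average words per post)."""
    asl = stats["tokens"] / stats["sentences"]
    awpp = stats["tokens"] / stats["posts"]
    return asl, awpp

results = {name: averages(stats) for name, stats in corpora.items()}
for name, (asl, awpp) in results.items():
    print(f"{name}: ASL {asl:.1f}, AWpP {awpp:.1f}")
```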

MSDN has longer words, quite a bit above the average length for all blogs that I’m indexing (that global word length average stands at 4.2). This could potentially be caused by a whole lot of things, but my prime suspects are not really specific to the language of blogs: URLs and non-English words. URLs (web addresses) can be very long, and they give language-processing tools headaches because they look like words (a string with whitespace left and right). It is possible to filter them out of course, but I haven’t implemented that yet. While in most larger collections this shouldn’t make too much of a difference, as the relative frequency of URLs is reasonably low, things could be different with MSDN. Another thing is interference from non-English sources – MSDN has quite a few people who write in Russian, Mandarin Chinese, etc., and because the tagger doesn’t recognize these as non-English sources, they are likely to be misinterpreted.
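To show how much a single URL can inflate the average word length, here is a small Python sketch of the kind of filter that could be applied. It is not the implementation used for this corpus (none exists yet); the regular expression, the function name and the sample sentence are all assumptions, and the pattern only catches tokens starting with http://, https:// or www.

```python
import re

# Hypothetical pattern: treats tokens that start like a web address as URLs.
URL_PATTERN = re.compile(r"^(?:https?://|www\.)", re.IGNORECASE)

def average_word_length(tokens, skip_urls=False):
    """Mean token length in characters, optionally ignoring URL-like tokens."""
    if skip_urls:
        tokens = [t for t in tokens if not URL_PATTERN.match(t)]
    if not tokens:
        raise ValueError("no tokens left to average")
    return sum(len(t) for t in tokens) / len(tokens)

# Whitespace tokenization, which is exactly what makes URLs look like words.
sample = "see http://example.com/a/very/long/path for details".split()
with_urls = average_word_length(sample)
without_urls = average_word_length(sample, skip_urls=True)

print(f"with URLs:    {with_urls:.1f}")
print(f"without URLs: {without_urls:.1f}")
```

A prefix check like this is deliberately simple; it would miss bare domains and would do nothing about the non-English text mentioned above.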

The differences in post and sentence length, however, are fairly unlikely to be error-induced. As it stands, MSDN bloggers write posts which are twice as long as those written by their colleagues from Oracle. The Oracle guys write longer sentences – slightly above the corpus mean (16.6 words per sentence). By comparison, MSDN’s average sentence length of 13.5 seems relatively low. One of the lowest values I have in my corpus is 12.5 (from an American teenager’s blog), a high one is 42.9, from IBM’s Irving Wladawsky-Berger. Please don’t draw any quick conclusions from this though. I’m pretty sure that Ernest Hemingway could be in the single digits with most of his prose.

Shorter sentences seem to correlate with longer posts, something that isn’t really too surprising, but still interesting to see with live numbers. Normally you would factor in type-token ratio here to look at the lexical density, but I’m not ready to do that with vast differences in total word count (there is a technique to avoid such issues but I’ll spare you the details). Anyway, let’s move on to wordlists.
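The post does not say which technique it has in mind. One commonly used option, shown here purely as an assumption, is a standardized type-token ratio: the ratio is computed over equally sized chunks of text and then averaged, so that collections of different size become comparable. The function names, the chunk size and the toy data are illustrative.

```python
def plain_ttr(tokens):
    """Type-token ratio over the whole token list."""
    return len(set(tokens)) / len(tokens)

def standardized_ttr(tokens, chunk_size=1000):
    """Mean type-token ratio over consecutive full chunks of chunk_size tokens."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        raise ValueError("need at least chunk_size tokens")
    ratios = [len(set(chunk)) / len(chunk) for chunk in chunks]
    return sum(ratios) / len(ratios)

# Toy data: the long text repeats the short one five times.
short_text = "a b c d".split()
long_text = short_text * 5

print(plain_ttr(short_text), plain_ttr(long_text))
print(standardized_ttr(short_text, 4), standardized_ttr(long_text, 4))
```

In the toy data the plain ratio drops from 1.0 to 0.2 just because the text got longer, while the chunked version stays the same for both.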

OraBlogs

1 the DT 8095

2 to TO 4354

3 a DT 3650

4 and CC 3175

5 I PP 2996

6 of IN 2889

7 in IN 2358

8 is VBZ 1961

9 It PP 1794

10 For IN 1665

11 you PP 1628

12 on IN 1392

13 this DT 1341

14 Oracle NP 992

15 with IN 978

16 that IN 966

17 be VB 855

18 was VBD 836

19 at IN 757

20 my PP$ 713

21 are VBP 693

22 an DT 660

23 as IN 649

24 from IN 649

25 but CC 631

MSDN

1 the DT 18155

2 to TO 10036

3 a DT 7408

4 and CC 7261

5 of IN 6406

6 in IN 5202

7 is VBZ 4642

8 For IN 3856

9 I PP 3836

10 you PP 3821

11 It PP 3171

12 this DT 3127

13 on IN 2810

14 with IN 2100

15 are VBP 1996

16 that IN 1996

17 be VB 1965

18 we PP 1861

19 can MD 1647

20 as IN 1634

21 If IN 1581

22 that WDT 1518

23 will MD 1407

24 an DT 1313

25 from IN 1281

I’ve highlighted a few things that look interesting to me. First of all, the general similarities are really not very surprising if you’ve seen this kind of thing before. Sometimes the news media come up with sensationalist stories about the decline of our civilization, often with a pseudolinguistic angle à la “teenagers these days use only 100 different words in their speech”. Lists such as this one illustrate why it’s a ridiculous argument. In both speech and writing we do most of the work using the same handy components over and over again. The definite article THE in English is usually in the number one position when you’re dealing with written texts, and it is fairly often followed by TO, AND and OF, though the exact sequence varies from one type of text to the next. While there is a marked difference in such distributions between written and spoken language (makes sense when you think about it), the picture is otherwise quite consistent. For example, in any larger collection of text it is exceedingly likely that there will be very few or no nouns among the top 10 words, even though numerically nouns clearly dominate over other word classes in the lexicon. In frequency lists, function words clearly dominate because they are the bricks and mortar of writing.

I’ll leave it at that for today, but the second part of this little style comparison will be posted shortly. Stay tuned.

License

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 License.