No “I” in corporate blogs?

2007 March 23
by Cornelius

I thought I’d post this quick bit as a belated response to Pranam Kolari’s piece on the frequency of “I” in blogs and to Doug Karr’s comment on the post (Dave Sifry also weighed in).

Pranam first looked at the relative frequency of the token “I” (which often signifies the first person pronoun in English) in blog posts and then at the overall total of posts containing “I”. Since, according to Nielsen BlogPulse, roughly 45% of all blog posts feature at least one occurrence of “I”, the total number of blog entries written in a given month can be estimated as a bit more than twice the number of Technorati matches for “I”.
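To make the arithmetic explicit, here is a minimal sketch of that estimate. The function name and the match count of 9,000,000 are made up for illustration; only the 45% share comes from the BlogPulse figure quoted above.

```python
def estimate_total_posts(matches_for_i: int, share_with_i: float = 0.45) -> float:
    """Scale the number of posts matching "I" up to an estimated total.

    share_with_i is the assumed fraction of all posts that contain "I"
    (0.45 is the BlogPulse figure quoted in the text).
    """
    if not 0 < share_with_i <= 1:
        raise ValueError("share_with_i must be in (0, 1]")
    return matches_for_i / share_with_i


# The multiplier is 1 / 0.45, i.e. "a bit more than twice".
multiplier = estimate_total_posts(1)
print(round(multiplier, 2))  # 2.22

# Hypothetical match count, not a real Technorati number.
print(round(estimate_total_posts(9_000_000)))  # 20000000
```

The estimate is only as good as the 45% assumption, which is exactly what the rest of this post questions.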

As Pranam explains:

The token “I” (1, 2) can provide interesting cues on the Blogosphere, other than signifying the obvious personal nature of blog posts. “I” sometimes use it to study the growth of the blogosphere (between David Sifry reports ofcourse), or just for fun to see how frequently indices of blog search engines are updated and if any of them are in a “breather” mode.

Apart from the fact that it is annoying having to resort to such complicated measures (wouldn’t it be nice if Technorati just provided us with these numbers?) my first impulse was to complain that “I” is not a very good candidate for this calculation. Firstly, it isn’t the most frequent word in English-language writing, at least in off-line contexts. A quick look at the BNC shows that it’s not even among the top 10.

BNC frequencies

rank | token | word class
1 | THE | AT0
2 | OF | PRF
3 | AND | CJC
4 | A | AT0
5 | IN | PRP
6 | TO | TO0
7 | IS | VBZ
8 | TO | PRP
9 | WAS | VBD
10 | IT | PNP

“I” comes in at number 16, making it an odd choice at first glance. But of course anyone who has ever worked with a search engine to do some kind of linguistic investigation will immediately know why Pranam didn’t pick “THE” instead: BlogPulse (like Google) ignores function words (all those listed above) and thus you don’t get any matches if you search for them.

Still, I’m very skeptical about measuring the size of the English-speaking blogosphere by counting the frequency of “I”, for several reasons. The most important one is that “I” may indicate the first person pronoun in English, but it can just as well mean something else. Look at the following results from Google for “I”-matches in languages other than English.

1,170,000 Arabic pages for i

1,320,000 Bulgarian pages for i

1,430,000 Catalan pages for i

6,860,000 Czech pages for i

13,300,000 Spanish pages for i

These are just examples - I think it is safe to say that you can get in excess of a million matches with basically any language, including some which are not represented in a large number of pages. In some cases these matches may come from agglutinative languages in which it is possible to compose what equates to a full sentence in English by “gluing” morphemes together*. “I” is a meaning-distinguishing morpheme in many languages and a word by itself in others (if I’m not mistaken it means “and” in Catalan, a minor variation in spelling compared to the Spanish “y”). Note that neither Technorati nor Google takes case into account, so “I” and “i” are both counted.

Apart from that, there are many brand names starting with “i” (iMode, iPod, iFilm), Roman numerals (i., ii., iii., ...) and other possible sources of error. “I” is a veritable minefield, at least from my point of view.
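A small sketch shows how case-insensitive matching picks up some of these false positives. This is not how Technorati or Google actually tokenise text (I don’t know their internals); the helper `count_token` and the sample strings are invented for illustration.

```python
import re


def count_token(text: str, token: str = "i", ignore_case: bool = True) -> int:
    """Count standalone occurrences of token, using word boundaries."""
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(rf"\b{re.escape(token)}\b", text, flags))


# Invented sample strings, one per source of error.
samples = {
    "english": "I think I am right.",
    "catalan": "pa i vi",                      # "bread and wine"
    "numeral": "i. Introduction ii. Methods",  # list numbering
}

for name, text in samples.items():
    insensitive = count_token(text)                       # "i" or "I"
    sensitive = count_token(text, "I", ignore_case=False)  # "I" only
    print(name, insensitive, sensitive)
# english 2 2
# catalan 1 0
# numeral 1 0
```

Matching case-sensitively removes the Catalan conjunction and the lower-case numeral in this toy example, but it would also drop every blogger who writes the pronoun as “i”.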

So let’s turn to the question of why and when it appears in blog posts as the English first person pronoun.

Doug Karr, commenting on Pranam’s post:

I believe what is missing is the implementation of corporate blogs and aggregation blogs. These blogs are less apt to utilize ‘I’ because they speak to an organization or to a technology. They are less likely to be personal.

Since I’ve been collecting what should by now be a representative sample of corporate blogs - 4.82 million words from 140 sources, as of today - I’m itching to answer that question. Let’s see.

Corporate Blog Frequencies

Rank | Word | POS | Frequency
1 | the | DT | 210540
2 | to | TO | 118966
3 | and | CC | 106665
4 | of | IN | 93750
5 | a | DT | 92646
6 | in | IN | 64756
7 | I | PP | 55424
8 | is | VBZ | 49713
9 | For | IN | 43940
10 | It | PP | 43651

This chart also gives us a better idea of how frequent “I” really is in blogs. When you compare it with the BNC table above you’ll see the difference. “I” is much more frequent in blogs than it is in most registers contained in the BNC, which is not that surprising when you take into account that the BNC contains no computer-mediated communication, which is considerably more likely to be direct interpersonal communication than material from the paper age.

But what about corporate blogs vs. blogs in general? My corpus contains 18 non-corporate blog sources for comparison. Here are the word frequencies for personal blogs:

Personal Blog Frequencies

Rank | Word | POS | Frequency
1 | the | DT | 11585
2 | and | CC | 7908
3 | to | TO | 7495
4 | of | IN | 5108
5 | a | DT | 5049
6 | i | NP | 4590
7 | I | PP | 4318
8 | in | IN | 3526
9 | It | PP | 3365
10 | you | PP | 3091

There’s a minor hiccup here: number 6 (“i”) and number 7 (“I”) are probably both instances of the first person pronoun. The trouble for my tagger is that those creative individual bloggers don’t bother with standard spelling and write “i” instead of “I”, which the tagger stubbornly interprets as “NP” (proper noun). If you add the two frequencies (4590 + 4318 = 8908), “I” comes in second, with a fair margin.
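The merge can be sketched in a few lines, using the counts from the personal blog table. Lower-casing every token is a simplification of my own for this example; on a full corpus it would also collapse proper nouns with ordinary words, so treat it as a rough check rather than a proper fix for the tagger.

```python
from collections import Counter

# Top-10 counts from the personal blog table, keyed by surface form.
personal = {
    "the": 11585, "and": 7908, "to": 7495, "of": 5108, "a": 5049,
    "i": 4590, "I": 4318, "in": 3526, "It": 3365, "you": 3091,
}


def merge_case(counts: dict[str, int]) -> Counter:
    """Add up the counts of tokens that differ only in case."""
    merged = Counter()
    for word, freq in counts.items():
        merged[word.lower()] += freq
    return merged


merged = merge_case(personal)
ranking = [word for word, _ in merged.most_common()]

print(merged["i"])             # 8908
print(ranking.index("i") + 1)  # 2
```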

As you can see, Doug’s basic assumption - that “I” is less frequent in corporate blogs than in blogs in general - is confirmed. However, “I” frequency in corporate blogs is still much higher than it is in company press releases or newspaper op-ed columns, where it is generally not even in the top 50. Or, to put it another way: when it comes to “I” count, blogs are blogs first and corporate, political, private etc. second. This isn’t even a stylistic thing that’s unique to blogs. It’s simply very difficult to write a text in English that is somehow concerned with actions, events or objects that have any kind of relation to you without using “I”. Only when you are not involved in any direct way in what’s happening can “I” be easily avoided (think of a Wikipedia entry or your VCR’s instruction manual). In established genres such as legal texts and scholarly articles, “I” is artificially avoided** to “foreground” the institution and “background” the individual. But the opposite is conventional in blogs: the author is normally very visible, and the blogging software structures entries so that every piece of text has a clearly visible author. It would therefore be perceived as quite unusual if such avoidance strategies were used.

Note that there’s a minor problem with my frequency lists: they don’t really give us the same information that Pranam looked up, because he checked how many blog posts contain “I”, while I checked how often it is used across sources. So let’s check for “has I” vs. “doesn’t have I” as well.

19502 total posts (100%)
10670 with “I” (54.7%)
8832 without “I” (45.3%)
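For anyone who wants to repeat the per-post check, here is a rough sketch of the idea. It is not the script I ran on my corpus; the regular expression, the function name and the four sample posts are assumptions made up for this example. Note that the pattern is case-sensitive, so lower-case “i” is not counted, while a capital Roman numeral “I.” would be.

```python
import re

# Standalone capital "I", including contractions such as "I’m".
I_TOKEN = re.compile(r"\bI\b")


def split_by_i(posts: list[str]) -> tuple[int, int, float]:
    """Return (posts with "I", posts without "I", share with "I")."""
    with_i = sum(1 for post in posts if I_TOKEN.search(post))
    share = with_i / len(posts) if posts else 0.0
    return with_i, len(posts) - with_i, share


# Invented sample posts.
posts = [
    "I think this works.",
    "The release ships on Monday.",
    "We tested it and I’m happy.",
    "iPod sales are up.",
]
print(split_by_i(posts))  # (2, 2, 0.5)

# The percentages above follow from the raw counts in the same way.
print(round(10670 / 19502 * 100, 1))  # 54.7
print(round(8832 / 19502 * 100, 1))   # 45.3
```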

As it stands, the share of posts with at least one occurrence of “I” in my sample is almost 10 percentage points higher than BlogPulse’s figure. Obviously my sample is much smaller than theirs, but when you consider that the vast majority of blogs in my corpus are institutional blogs, the 45% figure seems quite low. The simplest explanation that I can think of is that there is a large number of non-English sources indexed by BlogPulse and that these sources are “I”-free. While that doesn’t explain the disconnect between the BlogPulse and Technorati numbers, I think there are more sources of error in there than we can possibly account for. We had better stick to David’s numbers and assume that the English-language blogosphere hasn’t peaked - at least that’s my take.

* In the Eskimo-Aleut language Yup’ik the “word” angyaqegciuq is actually not a (single) word but an entire phrase, the English translation of which is “He has a good boat.”

** Common strategies to avoid “I” in academic writing are agentless passives (“It is assumed that…”), existential “there” constructions (“There is reason to assume that…”), use of the plural pronoun (“We demonstrate…”; in some cases natural, e.g. when there are multiple authors, in some cases as a purely stylistic device) and use of so-called inanimate agent constructions (“This paper argues that…”, “The data shows that…” etc.).

2 Comments
2007 March 23

Wow! This is fantastic. Thanks so much for taking the time to do this and reporting your findings!

2007 March 23

Glad you found it helpful! I’ll try to post some more findings from my data soon.
