Give me your blog and I’ll tell you who you are
Occasionally, like most nerds in academia, I wonder where exactly the usefulness of my research lies. What I do is fairly applied (in contrast to, say, theoretical syntax) but it still isn’t purely about solving real-life problems. Some people think that that’s a bad thing, while others completely oppose the view that science should generally be concerned with solving real-life problems.
Anyway, I recently came across a piece of applied research that’s both very interesting and mildly scary in its implications (although not very surprising if you’ve done research in the area in question).
The paper “Effects of Age and Gender on Blogging” by Schler et al. is a study of roughly 140 million words of running text by close to 20,000 bloggers. The authors of the paper explain their questions and objective as follows:
How do content and writing style vary between male and female bloggers and among bloggers of different ages? How much information can we learn about somebody simply by reading a text that they have authored? These are very basic questions that are both of fundamental theoretical interest and of great practical consequence in forensic and commercial domains.
Note that the authors aren’t referring to human readers here; they are talking about natural language analysis using a computer. What they’ve done is to look at whether certain patterns in language use correspond systematically with gender and age groups. In other words, is there a typical way of writing that distinguishes 14-year-old girls from 45-year-old men? If there is, it means you should be able to predict the age and gender of an author given enough textual material. Schler et al. looked at several kinds of features for their analysis and found that they could predict the age and gender of bloggers with 70% to 80% accuracy, or more in some cases.
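To make the idea of predicting a group from writing style a bit more concrete, here is a minimal sketch in Python. It is not the method used in the paper: the word lists, the neutral group labels and the nearest-centroid rule are all assumptions chosen to keep the example short. It only shows the general principle of turning a text into a few style measurements and assigning it to the group whose average measurements are closest.

```python
import re

# Toy word lists, assumed for illustration only; not the paper's feature set.
PRONOUNS = {"i", "you", "he", "she", "we", "they", "me", "my", "your", "it"}
ARTICLES = {"a", "an", "the"}
PREPOSITIONS = {"of", "in", "on", "at", "to", "for", "with", "by", "from"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(text):
    """Return the share of tokens that are pronouns, articles and prepositions."""
    tokens = tokenize(text)
    n = len(tokens) or 1  # avoid division by zero on empty input
    return (
        sum(t in PRONOUNS for t in tokens) / n,
        sum(t in ARTICLES for t in tokens) / n,
        sum(t in PREPOSITIONS for t in tokens) / n,
    )


def centroid(vectors):
    """Average a list of equal-length feature vectors component by component."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def train(labelled_texts):
    """Build one average feature vector per label from (label, text) pairs."""
    by_label = {}
    for label, text in labelled_texts:
        by_label.setdefault(label, []).append(style_features(text))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model, text):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    features = style_features(text)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, model[label]))

    return min(model, key=distance)


# Hypothetical training data with neutral labels.
model = train([
    ("pronoun_heavy", "I told you that she and I think we love it"),
    ("article_heavy", "the report on the state of the market in the city"),
])
```

With this toy model, `predict(model, "the cat sat on the mat in a box")` lands in the article-heavy group. A real system would need far more text per author and many more features than three word lists.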
Here’s one of their observations on gender:
First, note that for each age bracket, female bloggers use more pronouns and assent/negation words while male bloggers use more articles and prepositions. Also, female bloggers use blog words far more [things like “lol” and “ur”] than do male bloggers, while male bloggers use more hyperlinks than do female bloggers. All of this confirms and extends findings reported earlier in [1,5,7] and lends support to the hypothesis that female writing tends to emphasize what Biber [3] calls “involvedness”, while male writing tends to emphasize “information”.
The results are equally stereotypical when it comes to characteristic content words for gender and age groups. Males talk about gaming, google, india and democracy, while shopping, cute, boyfriend and pink give away females. Teens are prone to use homework, boring, crappy and mum, while twenty-somethings like bar, apartment, beer and dating. And once we’re in our 30s we’re suddenly more interested in marriage, tax, son and development. Note that this means these words are used more often by one gender or age group than by others, not that girls write only about shopping. It also explains the typicality of a word such as boyfriend: teenage girls are more likely to use it than other groups simply because they are more likely than male bloggers to have boyfriends.
In another table, Schler et al. look at which words are most strongly gendered, i.e. most likely to be used significantly more often by members of one sex than by the other. According to their results, men use the words money, job, sports and tv more often, while women use sleep, eating, sex, family, friends and words that express positive or negative emotions more frequently than the mean.
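The point about relative rather than absolute use can be illustrated with a small sketch. This is an assumption-laden toy, not the paper's procedure: the function names are made up, and the plain difference in rate per 10,000 words stands in for whatever measure the authors actually used. It ranks a group's words by how much more often that group uses them than a comparison group does.

```python
from collections import Counter


def rates_per_10k(tokens):
    """Map each word to its frequency per 10,000 tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 10_000 * count / total for word, count in counts.items()}


def distinctive_words(group_tokens, other_tokens, top=3):
    """Rank the group's words by how much higher their rate is than in the other group."""
    group = rates_per_10k(group_tokens)
    other = rates_per_10k(other_tokens)
    # Words the other group never uses count as rate 0 there.
    diffs = {word: rate - other.get(word, 0.0) for word, rate in group.items()}
    # Largest difference first; ties broken alphabetically for a stable order.
    return sorted(diffs, key=lambda word: (-diffs[word], word))[:top]


# Made-up miniature "corpora", purely for illustration.
teens = "homework is boring and homework is long".split()
thirties = "tax is boring and marriage is long".split()
```

Here both groups use "boring" equally often, so it is not distinctive for either one, while "homework" stands out for the first group even though it makes up only a fraction of what that group writes.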
So where does that leave us? Do we have to feel depressed over the fact that apparently our grammar gives away who we are? What about our individuality?
I don’t think it’s a big deal, for two reasons. Firstly, these are averages and their goal is to describe what is typical, not how you or I really write. Secondly, the goal here is to identify authors by age and gender in ambiguous cases (for example in forensic linguistics), not to make any blanket judgements about men and women or teenagers and grown-ups.
But of course it’s interesting to see that the way we express ourselves can be so very markedly gendered. Where does the male preference for articles come from? Why do females use more pronouns? And does our writing style really become “more male” as we age, as Schler et al. observe?
Another point to note is that prepositions and articles, which are used more frequently by male bloggers, are used with increasing frequency by all bloggers as they get older. Conversely, pronouns, assent/negation words and blog words, which are used more frequently by female bloggers, are used with decreasing frequency as bloggers get older. In short, the very same features that distinguish between male and female blogging style also distinguish between older and younger blogging style.
Or perhaps this just supports the idea that teenage girls have a unique subculture that sets them apart linguistically…



