Jul 8th, 2008 | Google, IBM, Linguistics, Many Eyes, Visualization | 3 Comments
Two years ago, the news that Google was going to make available the largest collection of n-grams to the global research community that had ever been compiled sparked a lot of interest. I was among those who immediately ordered those six DVDs… and ever since they have been resting dutifully on a shelf in my office, collecting dust and reminding me that I need to bring them into a more accessible format. Alas, so many things to do, so little time.
Something led me to look for information on that corpus this morning and I came across this. Sadly, the link to Chris Harrison’s site no longer seems to work, but when I saw his visualization I immediately thought of Many Eyes.
My reasoning goes a little something like this:
Google N-gram corpus hosted on Google Palimpsest servers + IBM’s Many Eyes = Fantastic web-based tool for linguists
To elaborate: Google has a gigantic database of word collocations that can be used as a baseline for all sorts of interesting analysis, but you can’t really do any of these things unless you have a user interface and enough computing juice to sift through almost 100 gigabytes of text data on the fly. On the other hand, solutions like Many Eyes are amazing, but currently there’s no way you can use them with a really big data set like the n-gram corpus, so their research utility is limited.
But it must be possible somehow to bring together
- the data to analyze
- the computing power required and
- the user interface needed to allow a non-technical person to interact with the data
and to put the whole thing on the Web. It’s Google’s stated intention to host data for us and they are the owner of the n-gram dataset, so I can’t imagine there being any licensing issues. And, as if to put a cherry on that sundae, here’s the announcement of a joint project by IBM, Google and the NSF to do exactly that kind of stuff. Put the 6 DVDs on a cloud, throw in a tweaked version of Many Eyes (think the word tree vis with a few extras) and construction grammarians everywhere will absolutely love it.
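To give a rough idea of the kind of query such a tool would have to answer, here is a minimal Python sketch. It assumes the n-gram files are plain text with one n-gram and its count per line, separated by a tab. The sample lines, their counts and the function name `collocates` are all invented for illustration, and a real service would need an index rather than a linear scan over almost 100 gigabytes.

```python
from collections import Counter
from typing import Iterable

# Hypothetical sample in the assumed "n-gram<TAB>count" layout; the counts are invented.
SAMPLE_LINES = [
    "strong tea\t1200",
    "strong coffee\t950",
    "powerful computer\t800",
    "strong computer\t40",
    "powerful tea\t5",
]

def collocates(lines: Iterable[str], node: str) -> Counter:
    """Return the words that directly follow `node` in bigram lines, with their counts."""
    result = Counter()
    for line in lines:
        # Split each line into the n-gram text and its frequency.
        ngram, _, count = line.rstrip("\n").partition("\t")
        words = ngram.split(" ")
        if len(words) == 2 and words[0] == node:
            result[words[1]] += int(count)
    return result

if __name__ == "__main__":
    for word, count in collocates(SAMPLE_LINES, "strong").most_common():
        print(word, count)
```

A word tree for a construction is essentially this lookup repeated for each following position, which is why the computing power matters as much as the interface.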
What do you think?
Mar 14th, 2008 | Linguistics, Teaching | No Comments
Here’s a bit of shameless self-promotion: I’ve made some updates to my CV to include two presentations and an article about eLanguage that was published last fall. And in the (I assume fairly unlikely) case that you’re interested in a blog-based introduction to English linguistics: I’ve assembled a table of contents for the topics that were covered in my class in the winter semester.
Nov 16th, 2007 | Academic Publishing, Linguistics, Open Access Publishing, Presentation | No Comments
As you can probably tell by the long pauses in between posts, I am still not quite back to my normal blogging routine, but thankfully things are picking up little by little. Last week I held two presentations, one at the Max Planck Institute for Evolutionary Anthropology in Leipzig (concerned with the eLanguage project) and another at the University of Paderborn (about using the Web for linguistic research). Oh, and I can announce with some degree of pride that I have published my first peer-reviewed article (in First Monday, together with Peter Reimer) which is also related to eLanguage.
WALS and eLanguage (MPI-EVA, Leipzig)
Corpora, Blogs and Linguistic Variation (Paderborn)
Puschmann, Cornelius, and Reimer, Peter. “DiPP and eLanguage: Two cooperative models for open access” First Monday [Online], Volume 12 Number 10 (1 October 2007)
Jul 30th, 2007 | Conferences, Linguistics, Presentation | No Comments
Phew, what a month! I’m certainly not complaining - July was terrifically productive and interesting, with two conferences on very different subjects (Open Access Publishing and Corpus Linguistics). But the consequence was a pretty uneven blogging schedule, with much activity two weeks ago followed by a long silence.
Anyway, I’m picking up the reins again this week. But let me start by linking to the presentation I held at Birmingham.
I really enjoyed the colloquium and learned a lot from the other participants, especially because of the diverse disciplinary set-up (Linguistics, Information Retrieval, Computer Sciences, …). As Serge Sharoff pointed out, we were all very much talking about the same thing, even though our approaches may be different. Thanks once more to the organizers, Serge and Marina Santini, for their effort!
Jul 21st, 2007 | Linguistics, Other Stuff, Visualization, Web 2.0 | No Comments
It’s amazing what kind of great data visualizations you can create with IBM’s web statistics tool Many Eyes (I’ve used it before). The Many Eyes team has recently added a simple concordancing function so that you can see in what context a given word is used. People doing literary studies can do some interesting things with such a tool, as this word cloud from the ME site demonstrates.
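For readers who haven’t met the term: a concordance simply lists every occurrence of a word together with its immediate context. The following is a minimal sketch of the idea in Python, not the Many Eyes implementation; the function name `concordance` and the `width` parameter are my own choices for illustration.

```python
import re

def concordance(text: str, keyword: str, width: int = 3) -> list[str]:
    """Return keyword-in-context lines with up to `width` words on either side."""
    # Very simple tokenizer: runs of word characters, keeping internal apostrophes.
    tokens = re.findall(r"\w+(?:['’]\w+)*", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog saw the cat."
    for line in concordance(sample, "cat", width=2):
        print(line)
```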


While I was already at it, I decided to create a word cloud for HuffingtonPost.com using 2175 entries made in the last six months. You get a fairly clear idea of the topics that were central in that time by looking at the cloud. In case you were wondering - the terms appear so large because I used the top 50 words with their individual frequencies instead of a raw text.


Jul 17th, 2007 | Academic Publishing, Conferences, Linguistics, Open Access Publishing, PKP 2007, Presentation | No Comments
After a whoppin’ 14 hours of sleep (preceded by an equally whoppin’ 28 hours awake - talk about the no-nonsense approach to jetlag!) I am now back on the ground in Germany. Just in case you’re curious about what’s next on the menu here at CorpBlawg: I plan to write something about blogging and text composition towards the end of the week, inspired by this interesting piece that Jakob Nielsen recently published on his site.
But before I do that I want to briefly point you to this post about the presentation that I held last week at PKP Vancouver. There’s excellent coverage of all presentations held at the conference and I really think that it’s a great idea to blog such an event - a perfect fit, since it was about scholarly publishing and open access to knowledge. While I’m at it, here are the slides for the presentation.
Jul 9th, 2007 | Blogosphere, Corporate Blogging, Google, Linguistics, Style | 12 Comments
(Edit: this post by Teresa Valdez Klein on the subject is also interesting.)
The tumult over the whole affair has been impossible to miss. A little over a week ago, Lauren Turner, a health care marketer at Google, wrote a blog entry in which she criticized Michael Moore’s new movie Sicko for its allegedly unfair depiction of health care companies. The piece was posted in Google’s new Health Advertising Blog and led to an outcry. Many in the blogosphere saw Turner’s recommendation to insurance companies - buy ads from Google to fix your image problems - as a sleazy and manipulative form of marketing (samples: this post by ZDNet’s Dan Farber and this bit by Mike Abundo calling for Turner to be fired). The company reacted with two meta posts, one by Turner, explaining that the views expressed in her initial post were purely her own and a second one in Google’s main corporate blog that also sought to douse the flames. Since the incident made Slashdot it can be considered a fairly bad moment for Google’s PR.
Most comments that I’ve read deal with the question of accountability - whose opinion is expressed in an official blog and where do we draw the line between personal opinion and the company’s official stance?
While I also want to deal with that question, my impression is that Turner’s (and thus Google’s) mistake is not primarily the opinion expressed in the post - that Sicko is biased and treats health care companies unfairly - but a failure to understand the communicative situation in which the exchange takes place. Turner manifests a fairly stunning lack of knowledge and sensitivity when it comes to blog sociology and that is why the piece caused such an uproar.
Let me elaborate, using several quotes from the post:
Lights, camera, action: the healthcare industry is back in the spotlight. (Not that it ever left the stage.) Next week, Michael Moore’s documentary film, Sicko, will start playing in movie theaters across America.
The New York Times calls Sicko a “cinematic indictment of the American health care system.” The film is generating significant buzz and is sure to spur a lively conversation about health coverage, care, and quality in America. While legislators, litigators, and patient groups are growing excited, others among us are growing anxious. And why wouldn’t they? Moore attacks health insurers, health providers, and pharmaceutical companies by connecting them to isolated and emotional stories of the system at its worst. Moore’s film portrays the industry as money and marketing driven, and fails to show healthcare’s interest in patient well-being and care.
These are the first two paragraphs of Turner’s piece and careful reading quickly reveals several interesting things. Firstly, the style is very journalesque. The lights, camera, action-enumeration in the first sentence could also be from a movie review or some other traditional journalistic text type (e.g. an editorial).
A slight shift occurs with the first instance of a personal pronoun (us). While the referent of the pronoun is at least somewhat ambiguous, it appears to be what could be called the ‘universal we’ that Turner uses - legislators, litigators, and patient groups are part of the American public, as are others among us. The referent of others is named a bit later: health insurers, health providers, and pharmaceutical companies are worried about the way the movie depicts them. The important detail here is that Turner does not place the two groups equally far away from herself. She could have simply written others are growing anxious or something similar, but by inserting among us she has placed herself (and arguably her employer) in direct proximity to her potential clients in the health care business. Of course that placement is quite deliberate - she wants to sell ads to these companies, after all - but it soon becomes clear why it is also highly problematic.
Sound familiar? Of course. The healthcare industry is no stranger to negative press. A drug may be a blockbuster one day and tolled as a public health concern the next. News reporters may focus on Pharma’s annual sales and its executives’ salaries while failing to share R&D costs. Or, as is often common, the media may use an isolated, heartbreaking, or sensationalist story to paint a picture of healthcare as a whole. With all the coverage, it’s a shame no one focuses on the industry’s numerous prescription programs, charity services, and philanthropy efforts.
I think you’ll agree that the entire paragraph is essentially a flowery declaration of love for the health care industry. Now, this isn’t surprising per se (again, this is a sales pitch), but the lack of balance is still noteworthy (the nasty press vs. the friendly insurance companies). But wait, there’s more.
Many of our clients face these issues; companies come to us hoping we can help them better manage their reputations through “Get the Facts” or issue management campaigns. Your brand or corporate site may already have these informational assets, but can users easily find them?
Note that here the pronominal references change. We becomes Google and the more distant our clients is replaced by you / your brand. Why is this significant?
Because the post starts out with no clear speaker and referent. There is no “I”, as in “I want to express my views on Sicko and the health care industry today” and no “you” as in “Dear John, how are you?”. The latter - that there is no clear referent - is perfectly normal for a blog, but the former is unusual. More importantly, these roles are only clearly assigned in the last two paragraphs.
We can place text ads, video ads, and rich media ads in paid search results or in relevant websites within our ever-expanding content network. Whatever the problem, Google can act as a platform for educating the public and promoting your message. We help you connect your company’s assets while helping users find the information they seek.
The pronominal reference at this point is clearly we = Google, you = health care companies. In other words, this is a message from Google to companies in that industry and while other people may also be reading it they are of no concern to the author. When a third party is introduced into the text (the public, later users), it is treated as though it were not a part of the exchange. Apart from pronominal use there are other signature characteristics of the text type that Mrs. Turner had in mind when writing this: verbs such as act, educate, promote, connect and help are indicative, as is the need to tart up nouns adjectivally (relevant websites, ever-expanding content network etc).
If you’re interested in learning more about issue management campaigns or about how we can help your company better connect its assets online, email us. We’d love to hear from you! Setting up these campaigns is easy and we’re happy to share best practices.
This is the equivalent of telling Bob that you think Mary is fat… while she is standing next to you. The public that needs to be educated is the elephant in the room and it doesn’t like to be talked down to. Turner appears to be unaware of this, however. She seems to assume either that only potential clients will read the blog and that her pitch will work with them, or (even worse) that the gullible and asinine public will read it but not be offended.
The moral of the story is simple: you should anticipate that your blog is a public forum, no matter how specialized and in-group it may seem. Corporate bloggers should also forget most of what they know about the language of marketing. Certain linguistic tropes (like the aforementioned super-dupering of products via excessive use of adjectives) are recognized immediately and have a lot of potential for negative interpretation.
Delivering a sales pitch like this through a blog is bad enough; priding yourself on how effectively your employer can manipulate public opinion for the right price is… well, I believe in American English it is called effing stupid. The problem is further aggravated by the fact that Turner’s claim - this is my opinion, not Google’s - is extremely weak.
In all but the last sentence we is the personal pronoun of choice, and that we clearly refers to the company. Obviously, Google as a corporate entity cannot have an opinion, but what is posted in an official corporate blog will understandably be interpreted as noted and accepted by someone further up the ladder (and it seems unlikely that there was no monitoring in Turner’s case).
Not understanding blog stylistics is at least a part of Turner’s failure. She has applied a language common in one context to a completely different and inappropriate one and the result is a bit like someone telling a bad joke aloud at a funeral. Clarifying that your views are your own by using I instead of the collective company we is a decent start.
Jul 4th, 2007 | Linguistics, Other Stuff, Robert Scoble, Technology, Web 2.0 | 2 Comments
Robert Scoble likes Google better than Microsoft (but not much) - and I have proof of that. He also holds his wife Maryam dearer than his company PodTech, but sadly she is outranked by Twitter and Apple. Ah, cruel World 2.0 capitalism.
How do I know? Simple, I have a list of 1,587 posts with 273,994 running words of text that Mr. Scoble has produced between 2 Aug 2006 and 4 Jul 2007. That translates into 18,362 sentences. An average Scoble blog entry has a length of 172.6 words, with 14.9 words per sentence and an average word length of 3.8; all of which is fairly - deep breath - average for a blog.
All, except for the word count. It’s pretty impressive, especially when you consider that he’s been at it for almost 6 years (I believe he started in October 2001 - correct me if I’m wrong). That’s 69 months of blogging, which translates into a staggering estimated 1.65 million words. That would make him roughly twice as productive as William Shakespeare, who (only) managed 884,647 words in his entire lifetime, though in all fairness it has to be noted that Mr. Scoble didn’t have to write all that with a quill pen.
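If you want to check the arithmetic, here is a small Python sketch that re-derives the averages from the counts given above and redoes the extrapolation. The extrapolation method is my assumption for this sketch (a constant writing rate, with the sample period measured in days and converted to average-length months); done this way the total comes out at roughly 1.7 million words, in the same ballpark as the estimate above.

```python
from datetime import date

# Counts and dates as stated in the post.
POSTS = 1_587
WORDS = 273_994
SENTENCES = 18_362
SAMPLE_START = date(2006, 8, 2)
SAMPLE_END = date(2007, 7, 4)
BLOGGING_MONTHS = 69          # October 2001 to July 2007
SHAKESPEARE_WORDS = 884_647

words_per_post = WORDS / POSTS
words_per_sentence = WORDS / SENTENCES

# Assumption: constant output, with months of average length (365.25 / 12 days).
sample_days = (SAMPLE_END - SAMPLE_START).days
sample_months = sample_days / (365.25 / 12)
estimated_total = WORDS / sample_months * BLOGGING_MONTHS

print(f"words per post:     {words_per_post:.1f}")
print(f"words per sentence: {words_per_sentence:.1f}")
print(f"estimated total:    {estimated_total:,.0f}")
print(f"vs. Shakespeare:    {estimated_total / SHAKESPEARE_WORDS:.2f}x")
```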
And here are his favorite nouns, by frequency (the number after the word indicates how often it occurs).
1 Google 1015
2 blog 779
3 Microsoft 776
4 people 688
5 video 503
6 stuff 393
7 things 365
8 something 357
9 way 354
10 Web 343
11 lot 322
12 today 320
13 time 301
14 thing 290
15 link 280
16 Apple 267
17 week 259
18 Search 258
19 world 256
20 post 245
21 videos 229
22 bloggers 220
23 interview 217
24 Twitter 215
25 blogs 213
26 company 206
27 one 199
28 Maryam 199
29 update 197
30 day 195
31 fun 193
32 someone 192
33 news 190
34 team 185
35 companies 178
36 lots 177
37 iPhone 175
38 service 172
39 Steve 171
40 show 171
41 site 170
42 TechMeme 169
43 business 165
44 phone 160
45 Windows 159
46 conference 158
47 year 158
48 PodTech 153
49 minutes 153
50 developers 151
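A ranked list like the one above boils down to counting tokens. The sketch below shows the basic idea in Python under loud assumptions: it does no part-of-speech tagging, so it cannot restrict the list to nouns the way I did, and the tiny stop list, the sample text and the function name `top_words` are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real noun list would need a part-of-speech tagger.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on", "i", "my", "about"}

def top_words(text: str, n: int = 50) -> list[tuple[str, int]]:
    """Return the n most frequent words outside the stop list, with case preserved."""
    tokens = re.findall(r"[A-Za-z]+", text)
    counts = Counter(t for t in tokens if t.lower() not in STOP_WORDS)
    return counts.most_common(n)

SAMPLE = "Google bought a video site. I posted a video about Google on my blog. Google!"

# Print rank, word and frequency in the same layout as the list above.
for rank, (word, freq) in enumerate(top_words(SAMPLE, 3), start=1):
    print(rank, word, freq)
```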
Jun 17th, 2007 | Conferences, Linguistics, Open Access Publishing, Presentation | No Comments
Note: this post is largely about my conference travel plans for the next two months and a few academic issues. I promise to publish something more clearly corporate-blogging-related very soon.
As already mentioned in the last post, I was at the University of Osnabrück last month to present something related to methodology in corpus linguistics. My basic claim was that there are merits to using blogs as corpus data, because it allows us to effectively analyze the language use of countless individuals. This adds a level of granularity when making generalizations about “language as such” - what may be frequent in one speaker’s use may be non-existent in another’s. Anyway, you can have a look at the presentation slides if you like:
Turning to the future, I’m excited about my first trip to Canada next month. I am presenting at the 2007 PKP Scholarly Publishing Conference in Vancouver and the title of the talk is “eLanguage.net: Shifting the paradigm in linguistics from academic publishing to scholarly communication”. It’s a hot topic for me, as I’m the lead developer for eLanguage, the Linguistic Society of America’s platform for open access, peer-reviewed electronic journals (see my previous post). We hope to expand our talk into an article for a special issue of First Monday, to be published after the conference.
Finally, I’m also presenting at the Corpus Linguistics 2007 in Birmingham (U.K., not U.S.). My presentation is part of the colloquium Towards a Reference Corpus of Web Genres, which will be mainly concerned with computational approaches to automatically classifying types of web pages according to their linguistic (and other) content. Read my abstract here (and yeah - of course it’ll be about my corporate blog collection).
I’ll be in Vancouver from the 10th to the 15th of July and in Birmingham on the 26th and 27th, also July. Be sure to say hello if you’re there and have the burning desire to talk about corporate blogging, linguistics, or the new Modest Mouse album.
As you can see, the work of the poorly paid PhD student is never done. Not that other people don’t get a whole lot more blogging done, despite a schedule that is far busier than mine for sure…
May 10th, 2007 | Blogging and Gender, Linguistics, Style | No Comments
Occasionally, like most nerds in academia, I wonder where exactly the usefulness of my research lies. What I do is fairly applied (in contrast to, say, theoretical syntax) but it still isn’t purely about solving real-life problems. Some people think that that’s a bad thing, while others completely oppose the view that science should generally be concerned with solving real-life problems.
Anyway, I recently came across a piece of applied research that’s both very interesting and mildly scary in its implications (although not very surprising if you’ve done research in the area in question).
The paper Effects of Age and Gender on Blogging by Schler et al is a study of roughly 140 million words of running text by close to 20,000 bloggers. The authors of the paper explain their questions and objective as follows:
How do content and writing style vary between male and female bloggers and among bloggers of different ages? How much information can we learn about somebody simply by reading a text that they have authored? These are very basic questions that are both of fundamental theoretical interest and of great practical consequence in forensic and commercial domains.
Note that the authors aren’t referring to human readers here, they are talking about natural language analysis using a computer. What they’ve done is to look at whether certain patterns in language use correspond with gender and age groups in a systematic way. In other words, is there a typical way of writing that distinguishes 14-year-old girls from 45-year-old men? If there is, it means you should be able to predict the age and gender of an author given enough textual material. Schler et al looked at several kinds of features for their analysis and found that they could predict the age and gender of bloggers with 70%-80% accuracy – or more, in some cases.
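To make “features” a bit more concrete, here is a minimal Python sketch of how a text could be turned into numbers that a classifier might use: the share of tokens that are pronouns and the share that are articles. This is not the feature set or the code used by Schler et al; the word lists are short illustrative assumptions and the function name `style_features` is hypothetical.

```python
import re

# Short illustrative word lists (assumptions, not the lists used in the paper).
PRONOUNS = {"i", "me", "my", "you", "your", "he", "she", "we", "us", "they", "it"}
ARTICLES = {"a", "an", "the"}

def style_features(text: str) -> dict[str, float]:
    """Return the share of tokens that are pronouns and the share that are articles."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "pronouns": sum(t in PRONOUNS for t in tokens) / total,
        "articles": sum(t in ARTICLES for t in tokens) / total,
    }

if __name__ == "__main__":
    print(style_features("I love my cat"))
    print(style_features("The report on the budget"))
```

Computed over thousands of blogs, rates like these are the sort of input from which a statistical model could learn which values are typical of which group.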
Here’s one of their observations on gender:
First, note that for each age bracket, female bloggers use more pronouns and assent/negation words while male bloggers use more articles and prepositions. Also, female bloggers use blog words far more [things like “lol” and “ur”] than do male bloggers, while male bloggers use more hyperlinks than do female bloggers. All of this confirms and extends findings reported earlier in [1,5,7] and lends support to the hypothesis that female writing tends to emphasize what Biber [3] calls “involvedness”, while male writing tends to emphasize “information”.
The results are equally stereotypical when it comes to characteristic content words for gender and age groups. Males talk about gaming, google, india and democracy, while shopping, cute, boyfriend and pink give away females. Teens are prone to use homework, boring, crappy and mum while twenty-somethings like bar, apartment, beer and dating. And once we’re in our 30s we’re suddenly more interested in marriage, tax, son and development. Note that these words are used more often by one gender/age group than by others, not that girls write only about shopping. This also explains the typicality of a word such as boyfriend – teenage girls are more likely to use this than other groups simply because they are more likely to have boyfriends than males.
In another table, Schler et al look at which words are most strongly gendered, i.e. most likely to be used significantly more often by members of one sex over the other. According to their results, men use the words money, job, sports and tv more often, while women use sleep, eating, sex, family, friends and words that express positive or negative emotions more frequently than the mean.
So where does that leave us? Do we have to feel depressed over the fact that apparently our grammar gives away who we are? What about our individuality?
I don’t think it’s a big deal, for two reasons. Firstly, these are averages and their goal is to describe what is typical, not how you or I really write. Secondly, the goal here is to identify authors by age and gender in ambiguous cases (for example in forensic linguistics), not to make any blanket label judgements about men and women or teenagers and grown-ups.
But of course it’s interesting to see that the way we express ourselves can be so very markedly gendered. Where does the male preference for articles come from? Why do females use more pronouns? And does our writing style really become “more male” as we age, as recognized by Schler et al?
Another point to note is that prepositions and articles, which are used more frequently by male bloggers, are used with increasing frequency by all bloggers as they get older. Conversely, pronouns, assent/negation words and blog words, which are used more frequently by female bloggers, are used with decreasing frequency as bloggers get older. In short, the very same features that distinguish between male and female blogging style also distinguish between older and younger blogging style.
Or perhaps this just supports the idea that teenage girls have a unique subculture that sets them apart linguistically…