Two years ago, the news that Google was going to make the largest collection of n-grams ever compiled available to the global research community sparked a lot of interest. I was among those who immediately ordered those six DVDs… and ever since they have been resting dutifully on a shelf in my office, collecting dust and reminding me that I need to bring them into a more accessible format. Alas, so many things to do, so little time.
Something led me to look for information on that corpus this morning and I came across this. Sadly, the link to Chris Harrison’s site no longer seems to work, but when I saw his visualization I immediately thought of Many Eyes.
My reasoning goes a little something like this:
Google N-gram corpus hosted on Google Palimpsest servers + IBM’s Many Eyes = Fantastic web-based tool for linguists
To elaborate: Google has a gigantic database of word collocations that can be used as a baseline for all sorts of interesting analysis, but you can’t really do any of these things unless you have a user interface and enough computing juice to sift through almost 100 gigabytes of text data on the fly. On the other hand, solutions like Many Eyes are amazing, but currently there’s no way you can use them with a really big data set like the n-gram corpus and therefore their research utility is limited.
But it must be possible somehow to bring together
the data to analyze
the computing power required and
the user interface needed to allow a non-technical person to interact with the data
and to put the whole thing on the Web. It’s Google’s stated intention to host data for us and they are the owner of the n-gram dataset, so I can’t imagine there being any licensing issues. And, as if to put a cherry on that sundae, here’s the announcement of a joint project by IBM, Google and the NSF to do exactly that kind of stuff. Put the 6 DVDs on a cloud, throw in a tweaked version of Many Eyes (think the word tree vis with a few extras) and construction grammarians everywhere will absolutely love it.
Last Wednesday, I had the opportunity to give a presentation on new forms of scholarly publishing, Open Access and Open Research at a virtual meeting organized by Catalina Danis of IBM’s Social Computing Group. It was great, although precious little time for discussion remained, due to a slightly overambitious (i.e. too voluminous) presentation on my part. Thankfully, the session next week will be used for discussion and I am very much looking forward to that. Once more, a big thanks to Catalina for inviting me and to everyone who attended.
And now, after an exciting trip into the world of science blogging, we return to our regularly scheduled program.
I’ve been meaning to write something on knowledge blogs (that I’ve previously referred to as expert or industry blogs) as one specific subgenre of corporate blogs for quite a while now. Several recent conversations on the subject have further increased my interest and yesterday I realized that I have been sitting on an exclusive interview with a knowledge blog expert for several months - something that I should absolutely share.
Knowledge blogs are written with the intention of providing insight and information into a topic a company blogger has substantial expertise in. They can be public-facing or have restricted access, but in both cases the target audience is usually a specialized one. A public-facing knowledge blog (or a limited-access blog that allows providing access to affiliates) can be written for customers who seek information and instruction, partners who collaborate in a project, experts at academic institutions, consultants etc. I imagine a typical intranet blog is likely to be more bidirectional than a public-facing one, meaning it is likely to be used for internal communication, partly replacing email, whereas a blog that is accessible to everyone (like the one I’ll present in a moment) is normally used for instruction, making the exchange between blogger and reader more unidirectional.
Software companies like Microsoft, IBM, Sun, SAP and Adobe use public-facing knowledge blogs on a large scale for the purposes mentioned above. The very technical nature of their products makes customer service a largely informational challenge and many of the customers are not end-users, but second-level developers who use specialized development tools to in turn create end-user products.
One extremely successful example of a knowledge blog from the IT sector (and obviously there are many) is If broken it is, fix it you should which is maintained by Tess Ferrandez. Tess is “an escalation engineer in PSS (product support services) at Microsoft, mostly dealing with ASP.NET but anything .NETish works” (from her about page). The application of terms such as “knowledge” and “expert” becomes natural when you take a look at what Tess writes about. To someone not educated in debugging ASP.NET applications virtually every sentence in the blog will be completely opaque, but to Tess’ sizable international audience her troubleshooting tips are invaluable.
Independently of whether or not you have a grasp of the subject matter, it becomes apparent quite quickly when reading If broken it is that Tess has a knack for explaining highly complex problems in an accessible way. Another aspect that intrigues me is that she often frames problems in a tone that resembles storytelling - there’s an arc of suspense, from the initial situation (something doesn’t work) to the discovery of the root of the problem and its resolution. Notably this kind of framing is the direct inversion of how issues are presented in a classical knowledge base. Contextual data (e.g. what the engineer thinks or experiences while he is working on the problem) is omitted. There is no sequence of events; instead facts are presented outside of time. For example, compare this entry from Tess’ blog with the knowledge base article it cites. The knowledge base article has no identifiable author (there is no “I”, like there is in the blog) and the sequence of topics does not map to a sequence of events. By contrast, Tess’ debugging examples are narratives; they don’t contain an objectively-detached analysis of a piece of software but the subjective-experiential description of how she approaches, assesses and fixes a problem. We learn by example.
There’s a lot I could write about why I think this is a very promising approach and what it has to do with how we process information, but I’ll save that for another post.
Here are Tess’ answers to 10 questions I asked her via email. I plan to conduct more of these interviews and use them for my thesis, to accurately describe the practitioner’s perspective on corporate blogging.
Once more, I would like to thank Tess for allowing me to interview her.
E-mail interview with Tess Ferrandez
Cornelius: What (if anything) do you enjoy most about blogging?
Tess: I enjoy the instant feedback from people reading the blog, and I enjoy teaching and debugging so blogging is the perfect venue for me to teach debugging and make sure that people don’t have to run into issues that they could easily avoid if they knew about them.
Cornelius: Did someone else encourage you to blog or did you start of your own accord?
Tess: I started on my own accord, we keep telling customers the same thing over and over in emails and I figured that a) I could avoid having to reinvent the wheel all the time b) other people that don’t call support could benefit from this knowledge and c) if it is documented somewhere people will trust it more since it is something that is already known and not something that was made up to fit the evidence from the dumps.
Cornelius: Do you publish in certain intervals or create a schedule for publication?
Tess: I don’t have a schedule, I blog when I have something that I think is interesting to write about and when I have time to blog. My blog posts are pretty sporadic, one blog post one month and 5 the next.
Cornelius: What prompts you to write a piece?
Tess: When I have had a case that was either extremely interesting or when I find that I see the same issue over and over.
Cornelius: How would you describe your goals when writing a piece?
Tess: My goals are that the posts should be interesting to as many people as possible, so I mostly blog about issues that will affect a lot of different developers. My goals are also that it should be easy to digest while at the same time contain enough detail to be useful, so I structure the content in a way that you can either read it all if you are interested in the details or just read the bottom line if you are just interested in the solution. The primary purpose of the posts are to show common issues and their solutions but also provide debugging tips so that people can resolve similar issues on their own.
Cornelius: Has your employer made any suggestions to you regarding topics that should be avoided (e.g. for legal reasons) or made any suggestions to you on what to blog about?
Tess: Not really, however I avoid four things:
1. Naming customers
2. Naming 3rd party components
3. Providing information about items that are either confidential or that I know are prone to change to avoid confusion.
These are pretty much the same rules that apply to any communication we have with customers, they expect to be able to trust us so we should not leave out any information about them, and in terms of 3rd party products, if I haven’t tested them myself in a formal way I can’t really expect to be able to express a formal opinion about them.
Cornelius: What kind of reactions do you get from colleagues, clients etc. regarding your blog?
Tess: Only positive, a lot of my colleagues have started blogging after they saw my blog and how many readers I got, i.e. how many people benefit from it, and I have seen a trend of these blogs being very successful.
My blog gets about 100 000 web hits and 400 000 RSS hits a month, and if something I write even helps 1 % of those that would be a good return on investment.
I almost get emails on a weekly basis with positive comments from readers and customers which is extremely encouraging and prompts me to write even more.
Cornelius: Do you put a lot of care into formal aspects like spelling, grammar etc?
Tess: I try not to misspell too many words :) but I don’t fret about it too much, after all my blog is not about linguistics :)
Cornelius: Oh, linguists get these things wrong all the time, don’t worry.
The reason I ask is mainly because some people (Robert Scoble, for example) say that to them blogs are conversations, so that in contrast to expository writing where you check, revise and edit a lot it’s mostly about speed and efficiency.
Your posts are very informational and complex and thus you probably spend more time planning and editing than someone like Scoble, who posts 4 or 5 very short pieces per day.
Cornelius: Has your approach to blogging changed over time?
Tess: Yes and no, after writing a lot of posts I can tell which posts are going to get a lot of hits and which ones aren’t, and also what people tend to search for when they get to my blog, so I try to keep titles etc. relevant so that more people can reach it and see immediately if it is relevant or not.
Cornelius: Do personal experiences play a role in your blogging?
Tess: I am not sure how to answer that. My blog is about personal experiences with issues that I have worked but I am not sure if that is what you are looking for.
Cornelius: My bad, the question wasn’t phrased very well. What I meant was: do you ever refer to things that aren’t strictly work-related, things that you would describe as personal? Obviously you don’t post pictures of your cat (though some tech people do) but do you ever use anecdotes or stories in your posts?
Tess: I would say no, I don’t post much about personal experiences, in fact I think the only personal post I have made so far was when I got blog tagged.
The main reason is because I don’t think that is what people reading my blog are interested in, but having said that I would use personal references if it adds to the story, i.e. if something in my personal life could act as an analogy to explain something complex.
I do add a lot of personal comments though to make the posts more readable because I don’t want them to be stale and dry, but on the other hand I would never tell stories about my family and friends in the blog because I want to keep it informational rather than “here is what i did today”.
That’s essentially the brave question that Phil Hall asks over at Strumpette (found via Blog Campaigning) in a very interesting post. He summarizes his own attitude as follows.
I would like to make a statement that many PR people will view as apostasy: I think corporate blogs are, on the whole, a waste of time.
Well, he isn’t the first to make such an outrageous claim, though it could be that he’s the first person in PR. He continues by arguing that even those company blogs that perpetrate it aren’t really written for consumers but target the media crowd.
People like me are looking for quality goods at reasonable prices. Reading the blog posting of some CEO ruminating on this-and-that is of no value to folks like me.
[Just a quick stylistic observation: it’s genuinely cute (and clever) to start a sentence with the phrase people like me and then end the next one with folks like me if you’re the former president of Open City Communications, a New York PR agency, and former editor of PR News. I imagine that PR executives with book deals are not entirely on par with the majority of people shopping at Wal-Mart in terms of income. But perhaps that’s just my dirty mind. It doesn’t hurt his argument either - I just assume that somewhere in PR school you learn that it’s always better to phrase personal opinions in the “folks-like-me-plural”.]
Hall then raises several familiar points: consumers don’t care about company blogs, blogging is risky because of litigation, a comment-enabled blog gives trolls and haters a platform, etc. He closes asking for examples of interesting corporate blogs.
But beyond those examples – sorry, but I am not aware of corporate blogs being used as anything more than a poorly-disguised sales vehicle. If you know of some genuinely clever examples of the format, please share them here – I would love to learn about them and have a reason to change my negative opinion.
I think there are quite a few counter-examples, though his criticism that many company blogs are boring and manipulative is certainly legitimate. My impression is that many smart implementations of blogging exist to improve company-internal communication. I’ve commented on the MSDN and Oracle blog hubs before - they represent knowledge management resources which enable tech experts to exchange ideas and improve products. I’m pretty sure Joe User doesn’t care about ASP.NET errors, but to people writing code for a living it’s clearly a relevant issue. Internal blogs have become a fixture in the tech sector and it seems they have potential in other areas as well. For a rare and valuable piece of empirical research on internal corporate blogging at IBM see Kolari et al (thanks to Pranam for pointing me to it).
Let’s look at other applications of corporate blogging as well. Apart from marketing there’s PR, customer relations management, recruiting, communication, lobbying and strategy blogging, plus countless hybrids. All of these functions target different groups of people (look here for a -certainly incomplete- list and more thoughts on the issue). Thus it is quite possible, nay, likely that Joe Consumer is not the target audience for XYZ Corp’s CEO blog. The target audience is partners, investors, competitors and of course journalists, who can be counted on to follow such a blog quite closely.
In that context it is interesting that Hall brings up the SEC.
And what about the investment community? Yeah, can you imagine the SEC giving the thumbs up for publicly-traded companies using blogs to communicate with investors?
Yes, I can. While no decision has been made yet (to my knowledge), I think Cox’s comment serves as an indicator that blogs may soon be used for exactly that purpose.
With such an audience, the idea that posts are edited and reviewed carefully before publication is perfectly plausible - and then again, why not? The idea that blogs must be unedited and highly personal confuses the historical origin of blogs as web-based diaries with their status today. In other words: you can use blogs purely as a means of publishing content on-line, or you can adopt a “bloggy” style of writing. There are no rules when it comes to how you write - you can rehash ad copy or explain your corporate strategy, write about annoying business trips or how to make cranberry walnut bread. All that is corporate blogging and all of it, presumably, somehow serves a purpose for the companies that sponsor it.
So corporate blogs can potentially serve a number of purposes, many of which are outside the scope of marketing or PR. Huge global players such as IBM need sophisticated tools to communicate and coordinate their efforts internally - most people will agree that email is no longer the appropriate tool for that. Beyond internal communication corporate blogs are relevant where they address specific people with some kind of stake in the company’s actions: disgruntled consumers, activists, potential employees, competitors, shareholders, journalists, bloggers. The only thing that won’t work is starting a blog about toilet paper because that’s what you happen to sell. If you can’t make it relevant to anyone, don’t start a corporate blog. The chic of blogging alone won’t do.
But in the end this is less about how companies (or institutions in general) can use blogging as an effective tool and more about how employee blogging will change companies in the long run. Corporate hierarchies partly exist to manage the flow of information inside an organization. Executives are supposed to know and understand internal processes and manage them effectively. But once everyone in an organization is more or less connected with everyone else the overall need for a strict hierarchy is at least somewhat diminished.
Now, I’m no utopian suggesting that organizations will somehow be crowd-governed in the future, but it seems plausible to assume that the monopoly of a few (management, PR, communications dept) to exclusively represent a company to “the outside world” and to control the flow of information internally is fading. Of course nobody is going to care about anything you have to say just because they buy your products. But that doesn’t mean that there aren’t a lot of people listening quite closely - for other reasons. My impression is that “the long tail of corporate blogging” - i.e. employee blogging - will matter more than glitzy PR texts or marketing copy in the long run. I believe this because our conception of public vs. personal communication is in the process of changing radically and in that light it seems illogical to assume that institutions will somehow be spared from the effects.
Perhaps the whole question of who drives the changes vs. who is driven by them follows the inverted logic of the classic Slashdot meme: in Soviet Russia, corporate blog writes you.
I’ve been playing around with this great little tool for several days now and thought I’d share some of the results with you.
But first, here’s a brief recap of what I’ve been doing before I start throwing statistics at you.
I am in the process of building a textual database (or corpus, as linguists call it) of corporate and enterprise web logs. The purpose of this corpus is to investigate corporate blogs as a text type. In the current phase of my research, I am especially interested in the following questions
- how do corporate blogs compare stylistically with non-corporate blogs, news texts and other types?
- is there a typical ‘corporate blogging style’ in terms of how people write?
- are there recognizable differences in style that correspond with differences in purpose or authorship (in other words, do CEOs, marketers, software developers, etc have distinct styles?)
- how much variation is there stylistically between different blogs, different bloggers in the same hub (e.g. MSDN) and between different posts by the same blogger?
- are there patterns of change in style over time?
You might wonder what such a description is good for (well, apart from furthering the pursuit of knowledge and all that). I think that, on the practical level, it will enable us to better understand what people are trying to achieve with blogs and how they do it. Ultimately blogging is about good writing. The trouble is, neither is ‘good’ easily defined, nor is it always the same to everyone on any occasion. Blogging styles are highly dynamic and situation-dependent and I think the most successful bloggers very consciously adapt different styles to address different people and issues.
Right, so what do I have so far?
One of the first measures I’ve implemented into my database is a relatively simple formula for calculating how formal/informational or (on the other end of the scale) involved/context-dependent a text is. This is done by adding the frequencies of certain types of words together and subtracting others, under the assumption that (for example) nouns are more numerous in texts which are primarily informational, while a high frequency of pronouns indicates involvement. The formula looks like this:
As you can guess, the results are potentially ambiguous - in other words, texts can have a very high or low score for a variety of reasons - and should be used with care. That being said, the measure produces some pretty interesting results.
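To make the idea of adding some word-class frequencies and subtracting others more concrete, here is a minimal Python sketch. Only the roles of nouns and pronouns come from the description above. The tag names, the other categories and the scaling are assumptions for illustration, loosely modeled on Heylighen and Dewaele's F-measure, and not necessarily the exact formula behind the numbers in this post.

```python
from collections import Counter

# Coarse part-of-speech categories (hypothetical tag names for this sketch).
FORMAL_TAGS = {"NOUN", "ADJ", "PREP", "ART"}      # assumed to signal informational style
INVOLVED_TAGS = {"PRON", "VERB", "ADV", "INTJ"}   # assumed to signal involved style


def f_score(tags):
    """Return a formality score for a sequence of coarse POS tags."""
    if not tags:
        raise ValueError("need at least one tagged word")
    counts = Counter(tags)
    total = len(tags)
    # Frequencies are expressed as percentages of all words in the text.
    formal = sum(counts[t] for t in FORMAL_TAGS) * 100.0 / total
    involved = sum(counts[t] for t in INVOLVED_TAGS) * 100.0 / total
    # Assumed scaling: shift and halve so this sketch stays between 0 and 100.
    return (formal - involved + 100.0) / 2.0
```

With this scaling a text made up only of nouns lands at the top of the range and one made up only of pronouns and verbs at the bottom. A different scaling would shift the numbers without changing the ordering of texts.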
Each data point in the graph is the f-score for a single post, or the average for several posts made on a single day. As the graph shows, Scoble’s posts are fairly consistently in the 50s in August and September. They surge to over 100 in mid-October and make overall gains in November and December, though these gains aren’t really as significant as they might look at first. The more notable change is the high degree of variation in these months compared to the time span before that.
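The way several posts from one day collapse into a single data point can be sketched as follows. The function name and the sample values are made up for illustration, and this is only an assumption about how such an aggregation might be done, not the actual code behind the graph.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_points(posts):
    """Collapse (date, score) pairs into one averaged data point per day."""
    by_day = defaultdict(list)
    for day, score in posts:
        by_day[day].append(score)
    # One point per day, in chronological order.
    return [(day, mean(scores)) for day, scores in sorted(by_day.items())]


# Made-up scores for illustration only.
posts = [
    (date(2006, 10, 16), 102.0),
    (date(2006, 10, 15), 50.0),
    (date(2006, 10, 15), 60.0),
]
points = daily_points(posts)
```

Here the two posts from 15 October end up as one point with their average score, while the single post from 16 October keeps its own score.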
You might wonder which posts exactly get a high or low f-score. Here are the entries with the highest score, by date.
Comparing new TailRank/DiggTech/TechMeme to Google Reader, 16 October 2006 (f-score 102)
If you have a look at those posts, you’ll probably notice that they aren’t really in any way more formal than Scoble’s other writing. The difference is that they tend to be more informational, i.e. they have more, and more condensed, information crammed into them than most entries. Lists and enumerations will immediately lead to a high score (because they usually translate into a high noun count), so those of Scoble’s entries which are written in a sort of telegraph style to convey information about a photowalk or CES have a high score. This doesn’t really discredit the f-score as a metric - it simply means that it’s context-sensitive. What’s important is that, with an overall mean score of 60, Scobleizer ranks on the extreme low end of the formal/informational vs involved/contextual scale. To Scoble, blogs really are conversations, not just metaphorically but in a quite literal stylistic way.
That’s the score for one source over time. Let’s compare a bunch of sources.
If you have trouble seeing anything on the chart, look for a little dropdown menu on the lower right hand side labeled dot size. Change it from ‘posts’ to ‘no selection’ and all the dots will be changed to have the same size, which should make the whole thing a lot easier to read.
The chart is a representation of scores for 137 different blogs, computed from data collected during the last five months. Each dot represents a single blog and its average f-score on the x axis. The position of a dot on the y axis indicates the standard deviation of values inside of that blog, i.e. the degree of internal variation.
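For readers who want to see how the two coordinates of a dot can be derived, here is a small sketch. The blog names and scores are hypothetical, and using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption on my part.

```python
from statistics import mean, pstdev


def blog_summary(scores_by_blog):
    """Map each blog name to (mean score, standard deviation of its posts)."""
    return {
        name: (mean(scores), pstdev(scores))
        for name, scores in scores_by_blog.items()
        if scores  # skip blogs with no scored posts
    }


# Hypothetical blogs and scores, for illustration only.
example = {
    "steady_blog": [60.0, 62.0, 58.0],
    "varied_blog": [40.0, 120.0],
}
summary = blog_summary(example)
```

The first value of each pair would be the position on the x axis and the second the position on the y axis. Note how a blog with only two very different posts gets a large deviation, which matters when reading the outliers.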
The vast majority of the sources I’ve used are corporate blogs - after all that’s what my research is about. But in addition I’ve also thrown in a few non-corporate sources, simply to be able to compare one type of blog with another one. Thus the list contains 17 personal blogs randomly found via blogger.com, 1 a-list professional blogger (Scoble), 1 political blog hub (huffingtonpost.com) and 3 non-blog sources, namely editorials from the New York Times, the Washington Post and the LA Times collected in the course of this week (see below for a full list of sources).
The first things likely to catch your eye are the outliers. On the far right hand side, there is one source simply tagged “Blog” (informative, I know) with a record f-score of 195 and a standard deviation of 92. That’s Ray Ozzie, Chief Software Architect of Microsoft. Now, if you have a look at his blog you might find that the best description for his writing is not so much formal, but rather “technical” or maybe “information-oriented”. The reasons for the high scores are the many compound nouns (things like development ecosystem, application components, clipboard data formats, etc) coupled with the overall significant length of entries. Like the other outlier, Irving Wladawsky-Berger of IBM, Ozzie also produces very long posts. Ozzie’s longest has 1,700 words, while Wladawsky-Berger is a close second with 1,500. Length tends to coincide with somewhat higher f-scores; however, there are counter-examples. Heather Hamilton has one post with a whopping word count of over 2,000 and an f-score of only 105. Generally brief posts tend to coincide with lower scores, but, as the example shows, there are exceptions.
Overall it is important to consider a few things, especially with regard to those sources with a high standard deviation and a high f-score:
- the deviation is often high simply because there aren’t many posts (for example, Ozzie only has 6 entries)
- several of the high-deviation blogs are hubs, i.e. they aggregate a number of individual blogs (e.g. MSDN and HuffPo)
But the cool part is that the remaining sources usually contain very conscious stylistic variation (Jonathan Schwartz is a prime example). In other words, they write differently to address different people and achieve different things and this is - at least to some extent - stylistically visible. Compare that with the scores for the three newspaper editorials grouped together in the lower right area of the plot. They are surprisingly consistent if you consider that we’re looking at texts published in three different papers, written by an even larger number of journalists. Which just shows that the editorial is a pretty solidified type of text in terms of style, while the (corporate) blog isn’t - at least not yet.
Anyway, I’ll wrap it up for now and save the more in-depth look for another post.