Why don’t we put the Google N-gram corpus on the Web?

2008 July 8
by Cornelius

Two years ago, the news that Google was going to make the largest collection of n-grams ever compiled available to the global research community sparked a lot of interest. I was among those who immediately ordered the six DVDs… and ever since they have been resting dutifully on a shelf in my office, collecting dust and reminding me that I need to get them into a more accessible format. Alas, so many things to do, so little time.

Something led me to look for information on that corpus this morning and I came across this. Sadly, the link to Chris Harrison’s site no longer seems to work, but when I saw his visualization I immediately thought of Many Eyes.

My reasoning goes a little something like this:

Google N-gram corpus hosted on Google Palimpsest servers + IBM’s Many Eyes = Fantastic web-based tool for linguists

To elaborate: Google has a gigantic database of word collocations that can be used as a baseline for all sorts of interesting analyses, but you can’t really do any of these things unless you have a user interface and enough computing juice to sift through almost 100 gigabytes of text data on the fly. On the other hand, solutions like Many Eyes are amazing, but currently there’s no way to use them with a really big data set like the n-gram corpus, so their research utility is limited.
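To give a rough idea of what “sifting” means in practice, here is a minimal Python sketch that streams over n-gram lines and tallies which words most often follow a given word. It assumes each line holds the space-separated tokens, a tab, and then a count; that layout, the function names `parse_ngrams` and `top_followers`, and the sample counts are all illustrative assumptions, not something taken from the DVDs themselves.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def parse_ngrams(lines: Iterable[str]) -> Iterator[Tuple[List[str], int]]:
    """Yield (tokens, count) pairs from lines assumed to look like 'w1 w2<TAB>count'."""
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab: everything before it is the n-gram, after it the count.
        ngram, _, count = line.rpartition("\t")
        if not ngram or not count.isdigit():
            continue  # skip lines that do not match the assumed layout
        yield ngram.split(" "), int(count)


def top_followers(lines: Iterable[str], target: str, k: int = 5) -> List[Tuple[str, int]]:
    """Return the k most frequent words that directly follow `target`."""
    followers: Counter = Counter()
    for tokens, count in parse_ngrams(lines):
        if len(tokens) >= 2 and tokens[0] == target:
            followers[tokens[1]] += count
    return followers.most_common(k)


# Made-up sample data, for illustration only.
sample = [
    "strong tea\t120",
    "strong coffee\t300",
    "powerful computer\t250",
    "strong coffee\t50",
    "strong argument\t90",
]

print(top_followers(sample, "strong", k=2))
```

Because `lines` can be any iterable, the same function would work on an open file handle and read it one line at a time. A single linear pass like this over the whole corpus would presumably still be far too slow for an interactive tool, which is exactly why the hosted computing power matters.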

But it must be possible somehow to bring together

  • the data to analyze
  • the computing power required and
  • the user interface needed to allow a non-technical person to interact with the data

and to put the whole thing on the Web. It’s Google’s stated intention to host data for us and they are the owner of the n-gram dataset, so I can’t imagine there being any licensing issues. And, as if to put a cherry on that sundae, here’s the announcement of a joint project by IBM, Google and the NSF to do exactly that kind of stuff. Put the 6 DVDs on a cloud, throw in a tweaked version of Many Eyes (think the word tree vis with a few extras) and construction grammarians everywhere will absolutely love it.

What do you think?

4 Comments
2008 July 9

Not sure why the link to Chris Harrison’s work seems not to work for you, but it seems fine from here: trigram viz.

2008 July 9

It seems the site was offline for a bit yesterday (I got a server error), but you’re right – everything is working again now. Thanks for pointing that out.

2008 July 14

Many Eyes Visualization Of Business Blogging Factors Within The F500…

Reading my RSS feeds tonight I found Cornelius Puschmann’s excellent blog, CorpBlawg. He wrote a post about the Google N-gram corpus, and in the course of discussing the post he mentioned IBM’s Many Eyes site, a site for data…

Trackback
2009 July 5

[...] CorpBlawg (Cornelius Puschmann): Why don’t we put the Google N-gram corpus on the Web [...]

Pingback
