Why don’t we put the Google N-gram corpus on the Web?
Two years ago, Google sparked a lot of interest with the news that it would make the largest collection of n-grams ever compiled available to the global research community. I was among those who immediately ordered the six DVDs… and ever since they have been resting dutifully on a shelf in my office, collecting dust and reminding me that I need to get them into a more accessible format. Alas, so many things to do, so little time.
Something led me to look for information on that corpus this morning and I came across this. Sadly, the link to Chris Harrison’s site no longer seems to work, but when I saw his visualization I immediately thought of Many Eyes.
My reasoning goes a little something like this:
Google N-gram corpus hosted on Google Palimpsest servers + IBM’s Many Eyes = Fantastic web-based tool for linguists
To elaborate: Google has a gigantic database of word collocations that can be used as a baseline for all sorts of interesting analysis, but you can’t really do any of this unless you have a user interface and enough computing juice to sift through almost 100 gigabytes of text data on the fly. On the other hand, solutions like Many Eyes are amazing, but currently there’s no way to use them with a really big data set like the n-gram corpus, so their research utility is limited.
But it must be possible somehow to bring together
- the data to analyze
- the computing power required and
- the user interface needed to allow a non-technical person to interact with the data
and to put the whole thing on the Web. It’s Google’s stated intention to host data for us and they own the n-gram dataset, so I can’t imagine there being any licensing issues. And, as if to put a cherry on that sundae, here’s the announcement of a joint project by IBM, Google and the NSF to do exactly that kind of thing. Put the six DVDs in the cloud, throw in a tweaked version of Many Eyes (think the word tree visualization with a few extras) and construction grammarians everywhere will absolutely love it.
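To make the idea a bit more concrete, here is a minimal sketch of the data side of such a tool: it streams n-gram lines one at a time (so nothing has to fit in memory at once) and collects everything that follows a chosen word into a nested tree of counts, which is roughly what a word tree view would draw. I am assuming each line looks like `w1 w2 w3<TAB>count`; the function names, the sample lines and their counts are made up for illustration and are not taken from the actual corpus or from Many Eyes.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_ngrams(lines: Iterable[str]) -> Iterator[Tuple[List[str], int]]:
    """Yield (tokens, count) from lines assumed to look like 'w1 w2 w3<TAB>count'."""
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab: everything before it is the n-gram text.
        text, _, count = line.rpartition("\t")
        if not text or not count.isdigit():
            continue  # skip blank or malformed lines
        yield text.split(" "), int(count)


def build_word_tree(lines: Iterable[str], root: str) -> Dict:
    """Aggregate all n-grams starting with `root` into a nested tree of counts."""
    tree = {"count": 0, "children": {}}
    for tokens, count in parse_ngrams(lines):
        if tokens[0] != root:
            continue
        tree["count"] += count
        node = tree
        # Walk down the tree one token at a time, adding counts along the path.
        for token in tokens[1:]:
            node = node["children"].setdefault(token, {"count": 0, "children": {}})
            node["count"] += count
    return tree


def top_continuations(tree: Dict, k: int = 3) -> List[Tuple[str, int]]:
    """Return the k most frequent words directly following the root."""
    ranked = sorted(tree["children"].items(), key=lambda kv: (-kv[1]["count"], kv[0]))
    return [(word, node["count"]) for word, node in ranked[:k]]


# Hypothetical sample lines; the counts are invented for this example.
SAMPLE = [
    "serve as the\t500",
    "serve as a\t300",
    "serve the purpose\t200",
    "serves as the\t900",
]

tree = build_word_tree(SAMPLE, "serve")
print(top_continuations(tree))  # [('as', 800), ('the', 200)]
```

Because `build_word_tree` only needs an iterable of lines, the same function should work on an open file handle (for example one returned by `gzip.open(path, "rt")`) instead of the in-memory sample. A single pass over the full data set would still be far too slow to run on the fly, which is exactly why the computing power needs to come from somewhere else.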
What do you think?

(On Jul 8th, 2008 at 4:12 pm)
not sure why the link to Chris Harrison’s work seems not to work for you, but it seems fine from here: trigram viz.
(On Jul 8th, 2008 at 4:25 pm)
It seems the site was offline for a bit yesterday (I got a server error), but you’re right - everything is working again now. Thanks for pointing that out.
(On Jul 8th, 2008 at 4:45 am)
Many Eyes Visualization Of Business Blogging Factors Within The F500…
Reading my RSS feeds tonight I found Cornelius Puschmann’s excellent blog, CorpBlawg. He wrote a post about the Google N-gram corpus, and in the course of discussing the post he mentioned IBM’s Many Eyes site, a site for data…