Jul 8th, 2008 | Google, IBM, Linguistics, Many Eyes, Visualization | 3 Comments
Two years ago, the news that Google was going to make the largest collection of n-grams ever compiled available to the global research community sparked a lot of interest. I was among those who immediately ordered those six DVDs… and ever since they have been resting dutifully on a shelf in my office, collecting dust and reminding me that I need to bring them into a more accessible format. Alas, so many things to do, so little time.
Something led me to look for information on that corpus this morning and I came across this. Sadly, the link to Chris Harrison’s site no longer seems to work, but when I saw his visualization I immediately thought of Many Eyes.
My reasoning goes a little something like this:
Google N-gram corpus hosted on Google Palimpsest servers + IBM’s Many Eyes = Fantastic web-based tool for linguists
To elaborate: Google has a gigantic database of word collocations that can be used as a baseline for all sorts of interesting analysis, but you can’t really do any of these things unless you have a user interface and enough computing juice to sift through almost 100 gigabytes of text data on the fly. On the other hand, solutions like Many Eyes are amazing, but currently there’s no way you can use them with a really big data set like the n-gram corpus, and therefore their research utility is limited.
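To give a rough idea of what "sifting through" the n-gram data means, here is a minimal sketch in Python. It assumes each line of a corpus file holds an n-gram and its count separated by a tab; the function name `continuations` and the sample lines are made up for illustration, and a real tool would need an index rather than a linear scan over 100 gigabytes.

```python
import io
from collections import Counter

def continuations(lines, prefix, top_n=5):
    """Return the most frequent words that follow `prefix` in n-gram count lines.

    Assumes each line looks like "w1 w2 ... wn<TAB>count" (an assumption
    about the file layout, not something stated in this post).
    """
    prefix_tokens = prefix.split()
    counts = Counter()
    for line in lines:
        ngram, _, count = line.rstrip("\n").partition("\t")
        if not count:
            continue  # skip lines without a tab-separated count
        tokens = ngram.split(" ")
        if (len(tokens) > len(prefix_tokens)
                and tokens[:len(prefix_tokens)] == prefix_tokens):
            # Tally the word that comes right after the prefix.
            counts[tokens[len(prefix_tokens)]] += int(count)
    return counts.most_common(top_n)

# Invented sample lines, standing in for a real corpus file.
sample = io.StringIO(
    "the more the\t120\n"
    "the more you\t300\n"
    "the less you\t80\n"
)
print(continuations(sample, "the more"))
```

Because the function reads line by line, it never holds the whole file in memory, but every query still touches every line. That is exactly why the hosting and computing power matter.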
But it must be possible somehow to bring together
- the data to analyze
- the computing power required and
- the user interface needed to allow a non-technical person to interact with the data
and to put the whole thing on the Web. It’s Google’s stated intention to host data for us and they are the owner of the n-gram dataset, so I can’t imagine there being any licensing issues. And, as if to put a cherry on that sundae, here’s the announcement of a joint project by IBM, Google and the NSF to do exactly that kind of stuff. Put the 6 DVDs on a cloud, throw in a tweaked version of Many Eyes (think the word tree vis with a few extras) and construction grammarians everywhere will absolutely love it.
What do you think?
May 16th, 2008 | Digital Humanities, Many Eyes, Project Bamboo, Web 2.0, iScience | No Comments
I’ve recently discovered Project Bamboo, an initiative that describes itself on the project website as a multi-institutional, interdisciplinary, and inter-organizational effort that brings together researchers in arts and humanities, computer scientists, information scientists, librarians, and campus information technologists to tackle the question:
“How can we advance arts and humanities research through the development of shared technology services?”
Come again? At first, the concept of shared technology services may seem a little vague. But a closer look at the full project proposal makes it fairly clear what is meant.
While academics use digital technology and the Net for a wide variety of things (research, teaching, publishing, communication), all of these uses have a degree of improvisation to them. Very few of the tools we use are developed specifically for the context of science and research, and sometimes this limitation shows.
For example, I’ve started to use del.icio.us to tag all books I read in Google Books (see what I’ve recently tagged). Del.icio.us is an all-purpose bookmark management application, yet the ability to collaboratively create bibliographies with colleagues in the same subfield makes it a useful tool for researchers. Del.icio.us is not the only example - Google Documents can be used to collaboratively work on a publication and SlideShare is great for making your presentations available directly and linking them to your CV (see my own), instead of just offering them for download. But for other, more specialized tasks there is still a severe lack of tools.
A few months ago, a colleague of mine needed a corpus (a collection of texts for linguistic analysis) for her research. Corpora exist in a wide variety of shapes and sizes, but the specific issue she was working on made it necessary for her to create an entirely new corpus (built from blog texts) instead of working with material from more traditional sources (newspapers, fiction etc). In addition, she had only a basic working knowledge of corpora and the ways in which they can be used.
We approached the problem from two different angles. I helped her build a specialized corpus by using a piece of software that I had developed for my own work on blogs. To analyze the data, I pointed her to two interesting functions of Many Eyes, a web-based application for visualizing statistical information: tag clouds and word trees.
Tag clouds (or, in this case, word clouds) make it possible to visualize how often a word occurs in a piece of writing. Simply paste a text into the appropriate form field on the site and Many Eyes will do the rest (have a look at this cloud for Shakespeare’s complete works for a nice example).
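Under the hood, a word cloud is little more than a word frequency count. The following Python sketch shows the idea; the function name `word_frequencies` and the simple tokenization rule are my own assumptions and not how Many Eyes actually does it.

```python
import re
from collections import Counter

def word_frequencies(text, top_n=10):
    """Count how often each word occurs, ignoring case and punctuation."""
    # Very simple tokenizer: runs of letters and apostrophes count as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

sample = "To be, or not to be, that is the question."
for word, count in word_frequencies(sample, top_n=3):
    print(word, count)
```

A cloud then simply scales the font size of each word with its count.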
Word trees visualize textual data in another way, allowing the reader in essence to navigate from one word to the next.
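The data behind a word tree can be sketched as a nested structure that records which words follow a chosen starting word, and how often. This is a simplified illustration in Python, not the Many Eyes implementation; `build_word_tree` and its `depth` parameter are hypothetical names, and the sketch ignores sentence boundaries.

```python
import re

def build_word_tree(text, root, depth=3):
    """Build a nested dict of the words following each occurrence of `root`.

    Each node maps a word to {"count": n, "next": {...}}, where "next"
    holds the words that follow on that branch.
    """
    words = re.findall(r"[a-z']+", text.lower())
    tree = {}
    for i, word in enumerate(words):
        if word != root:
            continue
        node = tree
        # Walk up to `depth` words to the right of this occurrence.
        for following in words[i + 1 : i + 1 + depth]:
            entry = node.setdefault(following, {"count": 0, "next": {}})
            entry["count"] += 1
            node = entry["next"]
    return tree

tree = build_word_tree("I love blogs. I love corpora. I read blogs.", "i", depth=2)
print(tree["love"]["count"])         # occurrences of "i love"
print(sorted(tree["love"]["next"]))  # words seen after "i love"
```

Branches that share a beginning are merged, which is what lets the reader navigate from one word to the next.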
There are of course specialized tools for corpus analysis that do a whole lot more than this in terms of statistics, and Many Eyes lacks a whole range of features that a genuine linguistic research tool would need (say, differentiating between different word classes). Yet Many Eyes has several advantages that the more specialized tools lack. It is
- web-based
- freely accessible
- easy to use
and
- versatile
In a sense, the points above make all the difference. Desktop-based software is under all sorts of constraints: you have to acquire it, install it, figure out how to get data into and out of it, keep it up to date and do all sorts of other “chores” that have little to do with your main objective. And then you can’t even share your data and collaborate as easily as you can on the Web. In other words, you’re using a program, not a service.
Of course Project Bamboo is not just about developing new tools (well, at least not in my mind). The assumption has long been that as soon as someone puts a useful service on the web, a user community will magically appear. This may be true of web video, blogging, wikis and many other services with a broad appeal, all of which can and should be used much more in academia. But with more specialized services, adoption is something that should be actively supported. In other words: we need to do more than just develop tools. We should work to popularize general-purpose services like del.icio.us and document ways in which they can be appropriated for research and teaching - and (most importantly) how they can be connected to one another. At the same time, just putting developers and researchers into a room together can produce impressive results.
A great example of both a mashup of services and a new way of looking at data is the Web version of the World Atlas of Language Structures (WALS). It’s a combination of Google Maps with the print version of the atlas, which shows the distribution of linguistic features across the world’s languages (say, which languages have definite articles). Not only is WALS Online more convenient to use than both the print version and the CD-ROM that comes with it (not to forget it is also free), but it makes entirely new uses possible. Think about collaborative annotation or linking research articles directly to WALS. Imagine a paper that lives on the Web and shows a map section from WALS in a side window, with the text flowing around it.
Developing services like WALS and getting them out there has the potential to completely transform academia in the long run, making it much more collaborative and transparent than it is today. It will be exciting to see what role Project Bamboo plays in that context.
Edit: I forgot to include a link to the project outline, plus a workshop transcript and some background information.
Nov 1st, 2007 | Chrysler, Corporate Blogging, Johnson & Johnson, Many Eyes, Marriott, Palm Inc, Visualization | 1 Comment
If blogs were people, this would be a little bit like a beauty pageant. I’ve taken four blogs from my corpus of company blogs and analyzed them using IBM’s Many Eyes. Many Eyes is a hosted software tool for quick and simple data visualization - you should try it out if you ever have something statistical to present.
Here are the four (randomly picked) candidates.
1. JNJ BTW
Posts: 52
Words: 17077
Sentences: 729
Average Word Length (AWL): 4.8
Average Sentence Length (ASL): 23.4
Average Words per Post (AWpP): 328.4
Word Cloud:

Word Tree:

2. Chrysler Blog
Posts: 59
Words: 13341
Sentences: 780
Average Word Length (AWL): 4.6
Average Sentence Length (ASL): 17.1
Average Words per Post (AWpP): 226.1
Word Cloud:

Word Tree:

3. The Official Palm Blog
Posts: 46
Words: 9262
Sentences: 446
Average Word Length (AWL): 4.5
Average Sentence Length (ASL): 20.8
Average Words per Post (AWpP): 201.3
Word Cloud:

Word Tree:

4. Marriott on the Move
Posts: 60
Words: 4937
Sentences: 305
Average Word Length (AWL): 4.5
Average Sentence Length (ASL): 16.2
Average Words per Post (AWpP): 82.3
Word Cloud:

Word Tree:

All four candidates have around 50 entries, with word counts ranging from roughly 5,000 (Marriott on the Move) to about 17,000 (JNJ BTW). I’ve picked different starting terms for the word trees, depending on the respective company’s industry, but you can easily search inside a tree for any word that occurs in the blog.
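For anyone who wants to compute similar figures for their own blog corpus, here is a minimal Python sketch of the three averages listed for each candidate (AWL, ASL, AWpP). The tokenization and sentence-splitting rules are assumptions on my part and may not match the tool used for the numbers above, so results on the same texts could differ slightly; `blog_stats` is a hypothetical helper name.

```python
import re

def blog_stats(posts):
    """Compute simple descriptive statistics for a list of post texts."""
    words = []
    sentences = 0
    for post in posts:
        # Words: runs of letters, digits and apostrophes (an assumption).
        words.extend(re.findall(r"[A-Za-z0-9']+", post))
        # Sentences: non-empty chunks between ., ! and ? (also an assumption).
        sentences += len([s for s in re.split(r"[.!?]+", post) if s.strip()])
    if not words or not sentences:
        raise ValueError("need at least one word and one sentence")
    n_words = len(words)
    return {
        "posts": len(posts),
        "words": n_words,
        "sentences": sentences,
        "awl": round(sum(len(w) for w in words) / n_words, 1),   # letters per word
        "asl": round(n_words / sentences, 1),                    # words per sentence
        "awpp": round(n_words / len(posts), 1),                  # words per post
    }

print(blog_stats(["Hello world. This is fun!", "One more post."]))
```

The definitions are easy to check against the figures above: for JNJ BTW, 17077 words / 52 posts ≈ 328.4 words per post, and 17077 words / 729 sentences ≈ 23.4 words per sentence.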