Many Eyes

Phrase Net Guide

When to use a phrase net

A phrase net diagrams the relationships between different words used in a text. It uses a simple form of pattern matching to provide multiple views of the concepts contained in a book, speech, or poem. The image below is a word graph made from an article taken from the IBM web site. The program has drawn a network of words, where two words are connected if they appear together in a phrase of the form "X and Y":

[Image: word graph explanation]

For instance, "the" is connected to other words by thicker arrows. Usually you would choose to hide common words. If we do so, then "smarter" appears to be a central word. The result of this simple pattern matching scheme is a surprisingly coherent view of some of the concepts in the article. In a book, for example, you might find a large cluster of the main characters and their relationships on one side, with separate clusters related to emotion and attitude.

[Image: phrase net screenshot]

How phrase nets work

Phrase net analyzes a text by looking for pairs of words that fit particular patterns. You can specify this pattern by using asterisks as wildcard characters. For instance, the pattern "* and *" will match phrases like "play and sing" or "vexation and regret." Punctuation matters, so it will not match "left, and then". You can choose from some useful defaults or you can type your own patterns in the field below the list.
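
To make the wildcard idea concrete, here is a minimal Python sketch of how a two-asterisk pattern could be turned into a regular expression. It is not the Many Eyes implementation: the function names `compile_pattern` and `find_pairs` are hypothetical, and treating a "word" as a run of ASCII letters is an assumption made for illustration.

```python
import re

# Assumption: a "word" is a run of ASCII letters. The real tool's tokenizer may differ.
WORD = r"([A-Za-z]+)"

def compile_pattern(pattern: str) -> re.Pattern:
    """Turn a two-asterisk pattern such as "* and *" into a compiled regex."""
    if pattern.count("*") != 2:
        raise ValueError("a pattern needs exactly two asterisks")
    # Escape the literal text around the asterisks so punctuation is matched as-is.
    before, middle, after = (re.escape(part) for part in pattern.split("*"))
    # The second slot sits inside a lookahead, so a word that ends one match
    # can also start the next one (needed for patterns like "* *").
    return re.compile(
        rf"{before}\b{WORD}\b(?={middle}\b{WORD}\b{after})", re.IGNORECASE
    )

def find_pairs(text: str, pattern: str) -> list:
    """Return (first word, second word) tuples for every match, lowercased."""
    regex = compile_pattern(pattern)
    return [(m.group(1).lower(), m.group(2).lower()) for m in regex.finditer(text)]

text = "They play and sing and dance. He left, and then returned."
print(find_pairs(text, "* and *"))
```

In this sketch "left, and then" produces no pair, because the comma stands between "left" and the literal " and " that the pattern requires.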

After you specify a pattern, the program creates a network diagram of the words it finds as matches. Two words are connected if they occur in the same phrase. The size of a word is proportional to the number of times it occurs in a match; the thickness of an arrow between words tells you how many times those two words occur in the same phrase. The color of a word indicates whether it is more likely to be found in the first or second slot of a pattern. The darker the word, the more often it appears in the first position.
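
The quantities described above can be computed from a list of matched word pairs. The sketch below is an illustration only: `build_network` and the `first_share` field are hypothetical names, and using the fraction of first-slot appearances as the basis for a word's darkness is an assumption about how the color might be derived.

```python
from collections import Counter

def build_network(pairs):
    """Summarize matched (first, second) word pairs as node and edge statistics."""
    # Edge weight: how many times this exact ordered pair matched (arrow thickness).
    edges = Counter(pairs)
    first = Counter(a for a, _ in pairs)    # appearances in the first slot
    second = Counter(b for _, b in pairs)   # appearances in the second slot
    nodes = {}
    for word in set(first) | set(second):
        total = first[word] + second[word]
        nodes[word] = {
            "size": total,                        # drives the word's size
            "first_share": first[word] / total,   # 1.0 = always first (darkest)
        }
    return nodes, edges

pairs = [("play", "sing"), ("sing", "dance"), ("play", "sing")]
nodes, edges = build_network(pairs)
print(nodes["sing"])             # size 3, first in one of three appearances
print(edges[("play", "sing")])   # this pair matched twice
```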

Defining patterns

Matching different patterns gives different views of the text. Each text is unique, so it is worth experimenting. For instance, looking for the pattern "* and *" will often highlight key related concepts. In contrast, the pattern "* 's *" will often result in a diagram of the main people and the things they possess. The simplest pattern is "* *", which links words if they come in immediate succession; this often provides a surprisingly clear view, especially for short documents. Sometimes there is a special pattern that will provide information on a particular document. For example, applying "* begat *" to the King James Bible yields a rough family tree.

There are three ways to specify a pattern. The easiest is to choose one of the defaults from the list on the left. A second way is to type a pattern with two asterisks for the "slots" of the pattern. Note that you need exactly two asterisks for the pattern to work. Finally, there's an advanced programmers-only option, which is to use a "regular expression" with two capturing groups. For an introduction to regular expressions, read this tutorial (java.sun.com/docs/books/tutorial/essential/regex/).
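
As a rough illustration of the regular-expression option, the Python sketch below extracts pairs with an expression that has two capturing groups. The tutorial linked above covers Java's regex syntax, which differs in places from Python's `re` module, and the helper name `pairs_from_regex` is hypothetical.

```python
import re

def pairs_from_regex(text, expression):
    """Return (group 1, group 2) for each match of a two-group expression."""
    regex = re.compile(expression)
    # Mirror the rule that a pattern needs exactly two slots.
    if regex.groups != 2:
        raise ValueError("expression needs exactly two capturing groups")
    return [(m.group(1), m.group(2)) for m in regex.finditer(text)]

verse = "Abraham begat Isaac; and Isaac begat Jacob; and Jacob begat Judas and his brethren;"
print(pairs_from_regex(verse, r"(\w+) begat (\w+)"))
```

Each tuple is one parent-to-child link, which is how the "* begat *" pattern yields a rough family tree.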

Filtering results

Not all matching words are shown in the visualization. Very common English words, such as "the" or "of," are usually not informative in this kind of display, and are removed by default. If you do want to see them, uncheck the "Hide common words" box.

In addition, if the network contains more than 50 words, it often becomes hard to read. By default, the diagram only shows the top 50 most frequent matches. In some cases you may want to change this setting, either to make the network smaller, or to include more words. To do so, type a new number in the "Show top:" box and press Enter.
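
The two filters can be sketched in Python as follows. This is a guess at the behavior, not the tool's code: the `COMMON_WORDS` set is a tiny illustrative stand-in for whatever list the tool uses, `filter_pairs` is a hypothetical name, and ranking words by how often they appear in matches is an assumption about what "top 50" means.

```python
from collections import Counter

# Illustrative stand-in; the tool's actual common-word list is not documented here.
COMMON_WORDS = {"the", "of", "a", "an", "and", "to", "in"}

def filter_pairs(pairs, top=50, hide_common=True):
    """Drop pairs containing common words, then keep only the most frequent words."""
    if hide_common:
        pairs = [
            (a, b) for a, b in pairs
            if a not in COMMON_WORDS and b not in COMMON_WORDS
        ]
    # Count how often each word takes part in a match, in either slot.
    counts = Counter(word for pair in pairs for word in pair)
    keep = {word for word, _ in counts.most_common(top)}
    # A pair survives only if both of its words made the cut.
    return [(a, b) for a, b in pairs if a in keep and b in keep]

pairs = [
    ("smarter", "planet"), ("smarter", "planet"), ("the", "planet"),
    ("smarter", "cities"), ("energy", "water"),
]
print(filter_pairs(pairs, top=2))
```

Passing `hide_common=False` corresponds to unchecking the "Hide common words" box, and `top` plays the role of the "Show top:" field.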

Interaction and highlighting

As with other Many Eyes network visualizations, you can pan by right-clicking and dragging, and you can zoom either by using the mouse wheel or by dragging to define an area to zoom to. Click the "Reset view" button to fit the entire network on the screen.

Move the mouse over a word to see how many times it occurred in a match, or over an arrow to see how many times a particular pair of words occurred. You can also click on a word to highlight it in orange; this can be helpful when making comments.

Data format

Phrase net accepts free (unstructured) text data. It can handle documents with up to about a million words.

Expert notes

This is an experimental technique that can be viewed as a halfway point between the tag cloud and the word tree. We're very interested in any comments. The visualization itself owes a debt to Peter Cho's diagram of news stories and Franco Moretti's work on literary style.

An experiment brought to you by IBM Research and the IBM Cognos software group