T9 Synonyms


"Swinger" is to "Pygmies" as "Encode" is to ...?

My friends Samantha and Greg stayed with me out in San Francisco this past weekend. Somewhere in between football playoff games and our second or third six-pack, Greg and I got talking about text messaging. Specifically, we were talking about interesting T9 synonyms. Greg thought they would make good SAT questions. Granted, he also made fun of me for writing this, so what does he know.

T9 synonyms are word pairs that, when entered using T9 text messaging shorthand, produce the same sequence of key-presses. To determine the key sequence, observe the relationship between each digit on the cell phone keypad and the letters printed below it. For instance, the number "2" corresponds to the letters "a", "b", and "c". By replacing each letter of a word with its numeric equivalent, a word can be mapped to a series of key-presses. For example, the word "aged" can be compressed into the sequence "2-4-3-3".

Converting a word to its T9 key sequence is trivial and fast. For example, the following Perl code can do it in one line:


    $t9word =~ tr/a-z/22233344455566677778889999/;
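
The one-liner assumes the word is already lowercase. The later snippets call a `get_t9` helper; wrapped as a sub, it might look like this sketch:

```perl
# Sketch of a get_t9 helper built around the tr/// one-liner.
# Lowercases its input first so mixed-case words map correctly.
sub get_t9
{
    my ( $word ) = @_;
    my $t9 = lc $word;
    $t9 =~ tr/a-z/22233344455566677778889999/;
    return $t9;
}

print get_t9( 'aged' ), "\n";   # prints 2433
```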

Note that while each word maps to only one key sequence, the inverse is not true. For example, the key sequence "2-4-3-3" not only represents the word "aged," but also the words "chef" and "aide." When two words share the same key sequence they are considered T9 synonyms.

Say you had a key sequence and wanted to get the list of the words that it maps back to. The straightforward (i.e., naïve) approach would be to generate all the permutations of strings that can be formed from that particular sequence and check them each in turn against a dictionary.

Of course, for every 2-digit sequence there are between 9 and 16 potential words, depending on the keys (as keys "7" and "9" have an extra letter). And for every 3-digit sequence there are between 27 and 64 potential words. In fact, a 7-digit sequence (in the worst case) could have up to 16,384 permutations to examine.
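
The brute-force approach might be sketched like this (the `%letters` map and the dictionary hash passed in are illustrative):

```perl
# Letters printed under each T9 digit (keys 2-9).
my %letters = ( 2 => 'abc', 3 => 'def', 4 => 'ghi', 5 => 'jkl',
                6 => 'mno', 7 => 'pqrs', 8 => 'tuv', 9 => 'wxyz' );

# Expand a key sequence into every candidate string, then keep
# only the candidates that appear in the dictionary hash.
sub naive_lookup
{
    my ( $sequence, %dictionary ) = @_;
    my @candidates = ( '' );
    foreach my $digit ( split //, $sequence )
    {
        my @next;
        foreach my $letter ( split //, $letters{ $digit } )
        {
            push @next, map { $_ . $letter } @candidates;
        }
        @candidates = @next;    # grows by 3x or 4x per digit
    }
    return grep { $dictionary{ $_ } } @candidates;
}
```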

While 10,000 dictionary lookups may be a trivial amount of work for a modern CPU, it is still worth considering faster alternatives, particularly since this would typically be computed on a cell phone while the user is interacting with it and waiting for feedback.

As one simple alternative, the T9 dictionary can be generated in advance and reverse indexed by key sequence. If that index is a hash, then each entry can be a list of words that correspond to the key sequence.

Here's a little bit of Perl that can build up that dictionary:


    my %T9_Dictionary;

    foreach my $word ( get_words( ) )
    {
        push( @{ $T9_Dictionary{ get_t9( $word ) } }, $word );
    }
  

At the cost of some space, this dictionary is now precomputed and can return almost immediately with a list of words that match each key sequence. In fact, this dictionary will come in useful in answering some questions about T9 synonyms later on.
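
Looking words up is then a single hash access. The `get_synonyms` and `get_num_synonyms` helpers used in the sorting code later might look like this sketch (shown with a toy dictionary entry so it stands alone):

```perl
# Toy stand-in for the real precomputed dictionary.
my %T9_Dictionary = ( '2433' => [ 'aged', 'aide', 'chef' ] );

# Return the list of words for a key sequence (empty list if none).
sub get_synonyms
{
    my ( $t9_word ) = @_;
    return @{ $T9_Dictionary{ $t9_word } || [] };
}

# Count the words sharing a key sequence.
sub get_num_synonyms
{
    my ( $t9_word ) = @_;
    return scalar get_synonyms( $t9_word );
}
```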

However, most cell phones that support T9 text entry have an interface that quickly displays potential matches for partial sequences, even if the sequence does not have a word associated with it directly. For example, the sequence "2-2" does not form a complete word, but the cell phone would still like to start suggesting words such as "act" or "cat".

But how to do this, if the T9 dictionary is only populated by key sequences that fully map to words?

One approach would be to start appending new digits to the sequence and looking the results up. If none of those exist, add yet another digit. Then another, and so on. In the worst case -- when no word can be formed at all -- you are back to the same number of lookups you faced before building the dictionary, and this time you wouldn't even know when to stop looking. Even if you decided that no word could be longer than 8 letters, you are still talking about tens of thousands of lookups whenever a match can't be found.

A way around this would be to pre-compute all of those partial prefixes as well. But there are a lot of them. In fact, even limited to 8-character words, there are more than ten million prefixes (8 + 8^2 + ... + 8^8, or roughly 19 million) that you'd need to index. So that's out of the question.

It turns out that a tree structure can help here. It doesn't much matter what kind of tree: each node will have at most 8 children (one per keypad digit with letters), the depth of the tree will be at most the length of the longest word in your word list, and -- even in a world with no redundant T9 synonyms -- there will be at most one leaf node per word in the dictionary.

The tree is built up from the initial word list. The algorithm is easy:

For each word, find the related T9 key sequence. Then start at the root node and look up the first digit in that sequence. If a child node exists for that digit, follow the pointer to the child. If no such child exists, create a new one for that digit. (Since there are exactly 8 potential children per node, a simple array can serve as a constant-time index for their pointers. In fact, storing the digit isn't even necessary, turning this into more of a trie structure.)

Finally, if this is the last digit in the T9 word, append the word to that node. Otherwise, move on to the next digit and repeat the above steps.
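
Those insertion steps might be sketched like this, using hashes for nodes and an eight-slot array for children (slot 0 holds digit 2); the names are illustrative:

```perl
# The root of the trie: no words, eight empty child slots.
my %root = ( 'words' => [], 'children' => [ (undef) x 8 ] );

# Insert a word by walking (and creating) nodes digit by digit.
sub trie_insert
{
    my ( $word ) = @_;
    my $t9 = lc $word;
    $t9 =~ tr/a-z/22233344455566677778889999/;

    my $node = \%root;
    foreach my $digit ( split //, $t9 )
    {
        my $slot = $digit - 2;              # keys 2-9 map to slots 0-7
        $node->{children}[ $slot ] ||= { 'words' => [],
                                         'children' => [ (undef) x 8 ] };
        $node = $node->{children}[ $slot ];
    }
    push @{ $node->{words} }, $word;        # last digit: record the word
}
```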

A node, using basic Perl types as data structures, could look like this:


    # This is the node for the sequence 7-2-4-3-7

    my %node = ( 'digit' => '7',
                 'words' => [ 'pager', 'pages', 'rages', 'raids', 'sages' ],
                 'children' => [ \%child_2, \%child_3, ..., \%child_8 ] );

Using a tree instead of a hash is interesting when the application is dynamically interacting with a user. Specifically, the application can start walking the tree as soon as the user types in a digit. Every subsequent key-press in that sequence just moves the pointer deeper down the tree. At any point in time, finding the associated word (or words) is as simple and as instantaneous as checking the list at that location. (And of course, the full sequence up to that point is simply the digits that were traversed on the way down.)

Moreover, an easy (and fast) algorithm exists for finding potential completions. If no words are associated with the current node (i.e., the sequence so far), just begin a breadth-first traversal of the child nodes until you find one with a word. That word then becomes a candidate completion. Continuing the traversal in this order lets the user cycle between potential completions. (A feature my cell phone doesn't even have.)
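
That breadth-first search might be sketched as follows, against the same node shape as above (it returns the words on the nearest descendant that has any, or the empty list if no word exists below this point):

```perl
# Breadth-first search from a node for the nearest completions.
sub find_completions
{
    my ( $start ) = @_;
    my @queue = ( $start );
    while ( my $node = shift @queue )
    {
        # The first node with words is the shallowest completion.
        return @{ $node->{words} } if @{ $node->{words} };
        push @queue, grep { defined } @{ $node->{children} };
    }
    return;    # no completion exists below this node
}
```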

(As a minor aside, this tree can also be easily compressed, as long chains of "only children" that lead to a leaf can be combined into a single node with a multi-digit sequence. This avoids the wasted space of multiple wordless nodes, at the cost of some extra logic and necessitating that the digits are again stored in each node.)

But the hash is all we need to determine some interesting things about T9 synonyms. For instance, once the T9 dictionary has been built, this code sorts by the number of synonyms and prints them out:


    foreach my $t9_word ( sort synonym_count_sort get_t9_words( ) )
    {
        print_t9_word( $t9_word );
    }

    sub get_t9_words
    {
        return keys %T9_Dictionary;
    }

    sub synonym_count_sort
    {
        return get_num_synonyms( $b ) <=> get_num_synonyms( $a );
    }

Here are the T9 key sequences with the most synonyms:

  1. 22737: acres, bards, barer, bares, baser, bases, caper, capes, cards, cares, cases
  2. 46637: goner, goods, goofs, homer, homes, honer, hones, hoods, hoofs, inner
  3. 2273: acre, bard, bare, base, cape, card, care, case
  4. 729: paw, pay, Paz, raw, ray, saw, sax, say
  5. 76737: pores, poser, poses, roper, ropes, roses, sorer, sores

Or you can print out all of the words with more than one synonym, sorting with the longest words first:


    foreach my $t9_word ( sort length_sort get_t9_words_with_synonyms( ) )
    {
        print_t9_word( $t9_word );
    }

    sub get_t9_words_with_synonyms
    {
        return grep( get_num_synonyms( $_ ) > 1, get_t9_words( ) );
    }

    sub length_sort
    {
        return length( $b )  <=> length( $a );
    }

Here are the longest T9 sequences with at least one synonym:

  1. 25287876746242: claustrophobia, claustrophobic
  2. 2427228374937: characterizer, characterizes
  3. 2474784264937: Christianizer, Christianizes
  4. 7445676744937: philosophizer, philosophizes
  5. 732663448737: reconfigurer, reconfigures

But you will notice that those long key sequences make for rather boring synonyms: when only one character differs, the association between the words is obvious.

The sort can be improved by calculating the distance between pairs of synonyms. The distance between two words can be defined as the number of positions at which their characters differ. And the synonym distance for a T9 key sequence can be defined as the maximum distance over all pairs of its words.

The code for this is a little longer. And in a real program it would be necessary to build a distance table up front to expedite the sort.


    foreach my $t9_word ( sort distance_sort get_t9_words( ) )
    {
        print_t9_word( $t9_word );
    }

    sub get_word_distance
    {
        my ( $word_a, $word_b ) = @_;
        length $word_a == length $word_b
            or die( 'words must be the same length' );
        my $distance = 0;
        my @a = split( //, lc $word_a );
        my @b = split( //, lc $word_b );
        for ( my $i = 0; $i < scalar @a; $i++ )
        {
            $distance++ if $a[$i] ne $b[$i];
        }
        return $distance;
    }

    sub get_max_distance
    {
        my ( @words ) = @_;

        my $max = 0;
        for ( my $i = 0; $i < scalar( @words ) - 1; $i++ )
        {
             for ( my $j = $i + 1; $j < scalar( @words ); $j++ )
             {
                 my $distance = get_word_distance( $words[$i], $words[$j] );
                 $max = $distance if $distance > $max;
             }
        }
        return $max;
    }

    sub get_synonym_distance
    {
        my ( $t9_word ) = @_;
        return get_max_distance( get_synonyms( $t9_word ) );
    }

    sub distance_sort
    {
        return get_synonym_distance( $b ) <=> get_synonym_distance( $a );
    }

And finally, here are the most "interesting" T9 synonyms (i.e., the ones with the greatest word distances):

  1. 2668687: amounts, contour
  2. 2787433: astride, brushed, crushed
  3. 6333673: Medford, offense
  4. 726337: pander, sander, scoffs
  5. 7946437: pygmies, swinger

(Actually, I find the fact that "lips" and "kiss" are also T9 synonyms to be far more interesting, but that is more subjective, and a lot harder to code.)

If you are so inclined, you can download the complete T9 Perl script. I have no idea why I wrote any of this in the first place, other than idle curiosity. But this is typical of the type of stuff I would ask a candidate to code during a job interview, so maybe it is worth reading.