Archive for the ‘nlp’ tag

Optimizing WordPos


WordPos is pretty slow. Running getPOS() on a 512-word corpus (220 lookups after removing stopwords and duplicates) takes about 8 seconds (dual-core, 64-bit Win7). Clearly there is room for optimization.

WordNet index files are simple text files, one line per word,  sorted by the word.  The entry for ‘awesome’ in index.adj is

awesome a 1 1 & 1 0 01282510

‘a’ (or ‘s’) means it’s an adjective, and the last 8-digit number is the byte offset into data.adj, which contains the definition. A word can have multiple senses (synsets), in which case there are additional space-separated byte offsets.
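
To make the format concrete, here is a minimal sketch of parsing out just the pieces mentioned above; parseIndexLine() is a hypothetical helper, not part of natural or wordpos, and it ignores the other fields on the line:

// Sketch: extract just the fields discussed above from an index line.
// (WordNet index lines carry more fields than this; we only grab the word,
// the part-of-speech character, and the trailing 8-digit data-file offsets.)
function parseIndexLine(line) {
  var fields = line.trim().split(' ');
  return {
    word: fields[0],                           // e.g. 'awesome'
    pos: fields[1],                            // 'a' (or 's') for adjectives
    offsets: fields.slice(2).filter(function (f) {
      return /^\d{8}$/.test(f);                // byte offsets into data.adj
    })
  };
}

console.log(parseIndexLine('awesome a 1 1 & 1 0 01282510'));
// { word: 'awesome', pos: 'a', offsets: [ '01282510' ] }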

Natural’s lookup() does a binary search for the word in the index file. The largest index file is index.noun, with 117,798 words, taking O(log N) ≈ 17 iterations to find the correct line.

Bucketize

What if we use a secondary index to speed things up, using the first two characters of the word as a hash key to bucketize the entries, then searching within the bucket? The table below shows the result using a 2-char hash key:

Index       #Words     #Buckets   Avg bucket size   O(log #words)   O(log avg bucket size)
Noun        117,798    679        173               17              8
Adjective   21,479     416        51                14              6
Verb        11,529     225        51                14              6
Adverb      4,481      213        21                12              5

(The last two columns are binary-search iteration counts: over the whole index file vs. within a bucket.)



So now a word can typically be found in fewer than half the iterations as before. Indeed, getPOS() now takes about 4.2 seconds. Still not fast enough.
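
The bucket counts above (and in the 3-char table below) can be reproduced with a small throwaway script along these lines; bucketStats() is a hypothetical helper, not part of wordpos, and it assumes the WordNet index file is in the current directory:

// Throwaway script: count buckets and average bucket size for a given key length.
// WordNet index files start with a license header whose lines begin with spaces,
// so those lines are skipped.
var fs = require('fs');

function bucketStats(indexFile, keyLen) {
  var buckets = {}, total = 0;
  fs.readFileSync(indexFile, 'utf8').split('\n').forEach(function (line) {
    if (!line || line[0] === ' ') return;                // skip blanks and header
    var key = line.split(' ')[0].slice(0, keyLen);       // first keyLen chars of the word
    buckets[key] = (buckets[key] || 0) + 1;
    total++;
  });
  var numBuckets = Object.keys(buckets).length;
  return { words: total, buckets: numBuckets, avg: Math.round(total / numBuckets) };
}

console.log(bucketStats('index.noun', 2));
// roughly { words: 117798, buckets: 679, avg: 173 }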

What if we bucketize based on a 3-char hash key?

Index       #Words     #Buckets   Avg bucket size   O(log #words)   O(log avg bucket size)
Noun        117,798    4493       26                17              5
Adjective   21,479     2326       9                 14              3
Verb        11,529     1623       7                 14              3
Adverb      4,481      1170       3.8               12              2



The number of iterations drops by half again, and getPOS() now takes about 2.2 seconds. That’s nearly a four-fold improvement in performance over the original scheme.

Fast-index

These secondary index files, which we’ll call the fast-index files, are saved as JSON objects and loaded into memory for fast lookup.  The format of the fast-index file is:

{
  // <meta info>
  "offsets":{
    ...
    "awe":[74929,"awf"]
    ...
  }
}

“awe” is the 3-char key of the bucket, 74929 is the bucket’s starting byte offset in the main index file, and “awf” is the key of the next bucket (the next key is essential because object literals have no inherent sort order). So the algorithm becomes:

find key (first 3 chars of word)
look it up in offsets: O(1)
if it exists:
  get offset of key and offset of next key
  read main index file between the two offsets
  binary search the data just read: O(log avg bucket size)
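
A rough sketch of this lookup (assuming fastIndex is the parsed JSON shown above and indexPath points to the main index file; synchronous reads are used only to keep the sketch short, while wordpos itself is async):

var fs = require('fs');

// Rough sketch of the bucketed lookup, not the actual wordpos internals.
function find(word, fastIndex, indexPath) {
  var key = word.slice(0, 3);                          // 3-char bucket key
  var bucket = fastIndex.offsets[key];
  if (!bucket) return null;                            // no such bucket: word not in index

  var start = bucket[0],
      next = fastIndex.offsets[bucket[1]],
      end = next ? next[0] : fs.statSync(indexPath).size;  // last bucket runs to EOF

  var buf = Buffer.alloc(end - start),
      fd = fs.openSync(indexPath, 'r');
  fs.readSync(fd, buf, 0, buf.length, start);          // read only this bucket
  fs.closeSync(fd);

  // binary search the bucket's (sorted) lines for the word
  var lines = buf.toString().split('\n'),
      lo = 0, hi = lines.length - 1;
  while (lo <= hi) {
    var mid = (lo + hi) >> 1,
        w = lines[mid].split(' ')[0];
    if (w === word) return lines[mid];                 // the full index line
    if (w < word) lo = mid + 1; else hi = mid - 1;
  }
  return null;
}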

These fast-index files add up to less than a 200 KB memory footprint, which is quite manageable. (The four main index files total about 6 MB.)

The bottleneck

2.2 seconds isn’t too bad for looking up 220 words across 4 files and 6 MB of data (about 10 msec/word). But it’s not web-scale! It turns out natural’s WordNet module is pretty conservative in its disk reads for the binary search: it reads one character at a time to find the enclosing EOLs of a line. This is done asynchronously, and each byte read incurs a callback.

Now, with the bucketized scheme, we know precisely the boundaries of each bucket. For example, index.noun buckets average 26 entries (lines). Assuming a line with a couple of synset offsets averages about 42 bytes, we have 26 × 42 ≈ 1 KB to read on average. The worst case is the ‘gen’ bucket in index.noun with 3,859 entries, or about 167 KB to read. The next biggest bucket has about 1,200 entries, or 50 KB, and bucket sizes decline quickly from there. This is still quite manageable, especially if care is taken to properly cache the disk reads.

So, reading a bucket at a time, using split('\n') to break the buffer into lines, then binary searching the lines for the word, getPOS() now takes about 70 msec, a 30-fold improvement over the previous best of 2.2 seconds. Now we’re talking web-scale.

Clearly, the bottleneck was the disk reads.

Finer grain

70 msec is pretty good, down from the original 8 seconds. Any additional optimization should be weighed against the effort put into it. However, one thing is obvious: if we’re looking up, e.g.,

wordpos.getAdjectives('It was so awesome that I was awestruck', console.log)

both ‘awesome’ and ‘awestruck’ fall in the same ‘awe’ bucket, so we shouldn’t read that bucket from disk twice. Instead, we queue up the callbacks for each bucket and process them once the bucket has been read.
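
A sketch of that queueing idea (with hypothetical helper names; not the actual wordpos internals):

// Queue callbacks per bucket: the first request for a bucket triggers the disk
// read; later requests for the same bucket simply wait for that read to finish.
var pending = {};                         // bucket key -> callbacks waiting on it

function readBucket(key, readFromDisk, callback) {
  if (pending[key]) {                     // a read for this bucket is already in flight
    pending[key].push(callback);
    return;
  }
  pending[key] = [callback];
  readFromDisk(key, function (err, lines) {
    var waiting = pending[key];
    delete pending[key];
    waiting.forEach(function (cb) { cb(err, lines); });   // flush the queue once
  });
}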

Ready to use

These and other optimizations are in wordpos 0.1.4, available on GitHub. The fast-index files are generated at install time and placed in the same folder as WNdb’s dictionary files. If any error occurs or the JSON files can’t be read, the original lookup method is used.

Written by Moos

May 23rd, 2012 at 1:11 am

Posted in blog


Introducing WordPos


Wordpos (github, MIT license) is a node.js library and npm module built on top of natural’s WordNet module. While natural.WordNet lets one lookup() a word in the WordNet database, I needed a simple way to pick out all the adjectives in a small corpus.

WordNet data files are split into the four principal parts of speech: nouns, verbs, adverbs, and adjectives. The lookup() method naturally looks in all four files for the definition. If we’re just looking for adjectives, this can be a bit inefficient.

Hence the wordpos module.  To use:

var WordPOS = require('wordpos'),
    wordpos = new WordPOS();

wordpos.getAdjectives('The angry bear chased the frightened little squirrel.', function(result){
    console.log(result);
});
// [ 'little', 'angry', 'frightened' ]

Similar functions exist: getNouns(), getVerbs(), and getAdverbs(). If you need all of them:

wordpos.getPOS('The angry bear chased the frightened little squirrel.', console.log)
// output:
  {
    nouns: [ 'bear', 'squirrel', 'little', 'chased' ],
    verbs: [ 'bear' ],
    adjectives: [ 'little', 'angry', 'frightened' ],
    adverbs: [ 'little' ],
    rest: [ 'the' ]
  }

Certain stopwords and duplicates are removed before the lookup. result.rest contains the words not found in the dictionary.

Underneath, the getX() functions use an efficient isX(), which looks up the word in the index files but does not look up the definition in the data files. The isX() functions are exposed as well:

wordpos.isVerb('fish', console.log); // true
wordpos.isNoun('fish', console.log);  // true
wordpos.isAdjective('fishy', console.log); // true
wordpos.isAdverb('fishly', console.log);  // false

Note that all these functions are async. To install:

npm install wordpos

This will also install the WNdb module containing the WordNet files (about 10 MB compressed).

Written by Moos

May 18th, 2012 at 6:53 pm

Posted in blog


npm install WNdb


WordNet is a large lexical database of English words, including definitions, synonym sets (synsets), and part-of-speech data. Now it has come to node.js through NaturalNode’s natural NLP module. natural will download the WordNet files on first invocation, but I found it more convenient to download the database ahead of time rather than on demand.

So I’ve added the WordNet database files as an npm module. To install:

npm install WNdb

This will install the complete WordNet 3.0 data files (about 34 MB uncompressed) and make them available to your program.

See the npm registry and WNdb on github for more info.

Written by Moos

May 11th, 2012 at 10:45 am

Posted in blog
