le accidental occurrence

42at

Archive for the ‘wordpos’ tag

Optimizing WordPos

with one comment

WordPos is pretty slow. A 512-word corpus (220 lookups less stopwords and duplicates) with getPOS() takes about 8 seconds (dual core 64-bit win7).  Clearly there is room for optimization.

WordNet index files are simple text files, one line per word,  sorted by the word.  The entry for ‘awesome’ in index.adj is

awesome a 1 1 & 1 0 01282510

‘a’ (or ‘s’) means it’s an adjective, and the last 8-digit number is the byte offset in the file data.adj which contains the definition.  Words can have multiple synonyms, so there could be more byte offsets, space separated.

Natural’s lookup() does a binary search of the word in the index file.  The largest index file is index.noun with 117,798 words taking O(log N) = 17 iterations to find the correct line.

Bucketize

What if we use a secondary index to speed things up, using the first two characters of the word as a hash to bucketize the entries, then search within the bucket.  Table below shows the result using a 2-char hash key:

index #words #buckets avg
bucket size
Noun 117,798 679 173
Adjective 21,479 416 51
Verb 11,529 225 51
Adverb 4,481 213 21
O(#words) O(avg
bucket size)
 17 8
14 6
14 6
12 5



So now a word can be found typically in less than half the iterations as before.  Indeed getPOS() now takes about 4.2 seconds.  Still not fast enough.

What if we bucketize based on 3-char hash key.

index #words #buckets avg
bucket size
Noun 117,798 4493 26
Adjective 21,479 2326 9
Verb 11,529 1623 7
Adverb 4,481 1170 3.8
O(#words) O(avg
bucket size)
 17 5
14 3
14 3
12 2



Number of iterations drop by half again, and getPOS() now takes about 2.2 seconds.   That’s a four-fold increase in performance over the original scheme.

Fast-index

These secondary index files, which we’ll call the fast-index files, are saved as JSON objects and loaded into memory for fast lookup.  The format of the fast-index file is:

{
  // <meta info>
  "offsets":{
    ...
    "awe":[74929,"awf"]
    ...
  }
}

“awe” is the 3-char key of the bucket, 74929 is the starting offset in the main index file, and “awf” is the next bucket key (next key is essential as Object literals have no inherent sort order). So the algorithm becomes:

find key (first 3 chars of word)
look it up in offsets: O(1)
if it exists:
  get offset of key and offset of next key
  read main index file between the two offsets
  binary search read data: O(log avg)

These fast-index files add up to less than 200 KB memory footprint, which is quite manageable. (Four main index files are about 6 MB.)

The bottleneck

2.2 seconds isn’t too bad to look up 220 words in 4 files and 6 MB of data (that’s about 10 msec/word).  But it’s not web-scale!  Turns out natural’s WordNet module is pretty conservative in its disk read for the binary search.  It reads one character at a time to find the enclosing EOLs of a line.  This is done asynchronously and each byte read incurs a callback.

Now with the bucketized scheme, we know precisely the boundary of the bucket.  For example, index.noun buckets average 26 entries (or lines).  Assume each line averages two synonyms is about 42 bytes, we have 26 × 42 ≈ 1 KB to read on average.  The worst case is the ‘gen’ bucket for index.noun with 3859 entries, or about 167 KB that have to be read.  Next biggest bucket has about 1200 entries or 50 KB, and it continues to decline fast.  This is still quite manageable, specially if care is taken to properly cache the disk reads.

So reading  bucket-at-a-time, using split(‘\n’) to break the buffer into lines, then binary searching for the word on each line, now getPOS() takes about 70 msec, a 30-fold improvement over the previous best of 2.2 seconds.   Now we’re talking web-scale.

Clearly the bottleneck is the disk read.

Finer grain

70 msec is pretty good from the original 8 seconds.  Any additional optimization should be weighed against the effort put in to it.  However, one thing is obvious.  If we’re looking up, e.g.,

wordpos.getAdjectives('It was so awesome that I was awestruck', console.log)

Both ‘awesome’ and  ‘awestruck’ belong in the same ‘awe’ bucket, so we shouldn’t read the ‘awe’ bucket from disk twice.  Instead, we’ll queue up the callbacks for each bucket and process them once the bucket has been read.

Ready to use

These and other optimizations are in wordpos 0.1.4, available on github.  The fast-index files are generated at install time and placed in the same folder as WNdb’s dictionary folder.  If any errors occurs or the json files can’t be read, the original lookup method is used.

Written by Moos

May 23rd, 2012 at 1:11 am

Posted in blog

Tagged with , , ,