wordpos at le accidental occurrence

Archive for the ‘wordpos’ tag

Optimizing WordPos

WordPos is pretty slow. A 512-word corpus (220 lookups less stopwords and duplicates) with getPOS() takes about 8 seconds (dual core 64-bit win7). Â Clearly there is room for optimization.

WordNet index files are simple text files, one line per word, Â sorted by the word. Â The entry for ‘awesome’ in index.adj is

awesome a 1 1 & 1 0 01282510

‘a’ (or ‘s’) means it’s an adjective, and the last 8-digit number is the byte offset in the file data.adj which contains the definition. Â Words can have multiple synonyms, so there could be more byte offsets, space separated.

Natural’s lookup() does a binary search of the word in the index file. Â The largest index file is index.noun with 117,798 words taking O(log N) = 17 iterations to find the correct line.

Bucketize

What if we use a secondary index to speed things up, using the first two characters of the word as a hash to bucketize the entries, then search within the bucket. Â Table below shows the result using a 2-char hash key:

index	#words	#buckets	avg bucket size
Noun	117,798	679	173
Adjective	21,479	416	51
Verb	11,529	225	51
Adverb	4,481	213	21

O(#words)	O(avg bucket size)
Â 17	8
14	6
14	6
12	5

So now a word can be found typically in less than half the iterations as before. Â Indeed getPOS() now takes about 4.2 seconds. Â Still not fast enough.

What if we bucketize based on 3-char hash key.

index	#words	#buckets	avg bucket size
Noun	117,798	4493	26
Adjective	21,479	2326	9
Verb	11,529	1623	7
Adverb	4,481	1170	3.8

O(#words)	O(avg bucket size)
Â 17	5
14	3
14	3
12	2

Number of iterations drop by half again, and getPOS() now takes about 2.2 seconds. Â That’s a four-fold increase in performance over the original scheme.

Fast-index

These secondary index files, which we’ll call the fast-index files, are saved as JSON objects and loaded into memory for fast lookup. Â The format of the fast-index file is:

{
  // <meta info>
  "offsets":{
    ...
    "awe":[74929,"awf"]
    ...
  }
}

“awe” is the 3-char key of the bucket, 74929 is the starting offset in the main index file, and “awf” is the next bucket key (next key is essential as Object literals have no inherent sort order). So the algorithm becomes:

find key (first 3 chars of word)
look it up in offsets: O(1)
if it exists:
  get offset of key and offset of next key
  read main index file between the two offsets
  binary search read data: O(log avg)

These fast-index files add up to less than 200 KB memory footprint, which is quiteÂ manageable. (Four main index files are about 6 MB.)

The bottleneck

2.2 seconds isn’t too bad to look up 220 words in 4 files and 6 MB of data (that’s about 10 msec/word). Â But it’s not web-scale! Â Turns out natural’s WordNet module is pretty conservative in its disk read for the binary search. Â It reads one character at a time to find the enclosing EOLs of a line. Â This is done asynchronously and each byte read incurs a callback.

Now with the bucketized scheme, we know precisely the boundary of the bucket. Â For example, index.noun buckets average 26 entries (or lines). Â Assume each line averages two synonyms is about 42 bytes, we have 26Â Ã— 42Â â‰ˆ 1 KB to read on average. Â The worst case is the ‘gen’ bucket for index.noun with 3859 entries, or about 167 KB that have to be read. Â Next biggest bucket has about 1200 entries or 50 KB, and it continues to decline fast. Â This is still quite manageable, specially if care is taken to properly cache the disk reads.

So reading Â bucket-at-a-time, using split(‘\n’) to break the buffer into lines, then binary searching for the word on each line, now getPOS() takes about 70 msec, a 30-fold improvementÂ over the previous best of 2.2 seconds. Â Now we’re talking web-scale.

Clearly the bottleneck is the disk read.

Finer grain

70 msec is pretty good from the original 8 seconds. Â Any additionalÂ optimizationÂ should be weighed against the effort put in to it. Â However, one thing is obvious. Â If we’re looking up, e.g.,

wordpos.getAdjectives('It was so awesome that I was awestruck', console.log)

Both ‘awesome’ and Â ‘awestruck’ belong in the same ‘awe’ bucket, so we shouldn’t read the ‘awe’ bucket from disk twice. Â Instead, we’ll queue up the callbacks for each bucket and process them once the bucket has been read.

Ready to use

These and other optimizations are in wordpos 0.1.4, available on github. Â The fast-index files are generated at install time and placed in the same folder as WNdb’s dictionary folder. Â If any errors occurs or the json files can’t be read, the original lookup method is used.

Written by Moos

May 23rd, 2012 at 1:11 am

Posted in blog

Tagged with nlp, wndb, wordnet, wordpos

le accidental occurrence

Archive for the ‘wordpos’ tag

Optimizing WordPos

Bucketize

Fast-index

The bottleneck

Finer grain

Ready to use

Projects

Tags

Archives

I Read