Widcode
A byte-aligned encoding for lists of non-negative ints, using fewer bytes for smaller ints. This is intended for lists of word ids (wids). The ordinary string .find() method can be used to find the encoded form of a desired wid-string in an encoded wid-string. As in UTF-8, the initial byte of an encoding can't appear in the interior of an encoding, so find() can't be fooled into starting a match "in the middle" of an encoding. Unlike UTF-8, the initial byte does not tell you how many continuation bytes follow; and there's no ASCII superset property.
Details:
+ Only the first byte of an encoding has the high bit (0x80, the sign bit of a signed byte) set.
+ The first byte has 7 bits of data.
+ Bytes beyond the first in an encoding have the high bit clear, followed by 7 bits of data.
+ The first byte doesn't tell you how many continuation bytes follow. You can tell by searching for the next byte with the high bit set (or the end of the string).
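The boundary rule in the last bullet can be sketched as follows. This is only an illustration, assuming Python 3 bytes objects; the name split_encodings is hypothetical and is not part of this module.

```python
def split_encodings(code):
    """Split an encoded wid-string (bytes) into one chunk per encoded wid.

    Hypothetical helper: a new encoding starts at every byte whose
    high bit (0x80) is set.
    """
    chunks = []
    start = 0
    for i in range(1, len(code)):
        if code[i] & 0x80:           # high bit set: next encoding begins here
            chunks.append(code[start:i])
            start = i
    if code:                         # the last encoding runs to the end of the string
        chunks.append(code[start:])
    return chunks
```

Each chunk is then one first byte followed by zero or more continuation bytes.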
The int to be encoded can contain no more than 28 bits (that is, it must be less than 2**28). The encoding uses the fewest bytes that can hold the value:
+ If it contains no more than 7 bits, 0abcdefg, the encoding is 1abcdefg.
+ If it contains 8 through 14 bits, 00abcdef ghijkLmn, the encoding is 1abcdefg 0hijkLmn.
+ If it contains 15 through 21 bits, 000abcde fghijkLm nopqrstu, the encoding is 1abcdefg 0hijkLmn 0opqrstu.
+ If it contains 22 through 28 bits, 0000abcd efghijkL mnopqrst uvwxyzAB, the encoding is 1abcdefg 0hijkLmn 0opqrstu 0vwxyzAB.
Static tables _encoding and _decoding capture all encodes and decodes for 14 or fewer bits.
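The bit layouts above can be sketched in a few lines. This is a minimal illustration, assuming Python 3 bytes objects; encode_wid, encode, decode and find_wids are hypothetical names, and the sketch computes everything with shifts rather than using the _encoding and _decoding tables. The end-of-match check in find_wids is an addition of this sketch: find() cannot start a match inside an encoding, but a match can still stop before continuation bytes that belong to a longer encoding.

```python
def encode_wid(w):
    """Encode one wid (0 <= w < 2**28) as 1 to 4 bytes."""
    if not 0 <= w < (1 << 28):
        raise ValueError("wid must fit in 28 bits: %r" % (w,))
    groups = [w & 0x7F]              # low 7 bits
    w >>= 7
    while w:                         # peel off further 7-bit groups
        groups.append(w & 0x7F)
        w >>= 7
    groups.reverse()                 # most significant group first
    groups[0] |= 0x80                # only the first byte gets the high bit
    return bytes(groups)

def encode(wids):
    """Encode a list of wids as one bytes string."""
    return b"".join(encode_wid(w) for w in wids)

def decode(code):
    """Decode a bytes string produced by encode() back into a list of wids."""
    wids = []
    for b in code:
        if b & 0x80:                 # first byte of a new encoding
            wids.append(b & 0x7F)
        elif not wids:
            raise ValueError("encoded string starts with a continuation byte")
        else:                        # continuation byte: append 7 more bits
            wids[-1] = (wids[-1] << 7) | b
    return wids

def find_wids(doc_code, phrase_wids):
    """Return the byte offset of phrase_wids inside doc_code, or -1."""
    needle = encode(phrase_wids)
    pos = doc_code.find(needle)
    while pos >= 0:
        end = pos + len(needle)
        # Accept only if the match ends at the end of the string or
        # right before the first byte of another encoding.
        if end == len(doc_code) or doc_code[end] & 0x80:
            return pos
        pos = doc_code.find(needle, pos + 1)
    return -1
```

For example, encode([5, 129, 7]) yields the bytes 85 81 01 87 (hex), and find_wids on that string with [129, 7] returns offset 1. A plain find() for the encoding of [1] would also report offset 1, because 81 is a prefix of the encoding of 129, which is why the sketch checks the byte after the match.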