Text Indexes

Text indexes combine an inverted index and a lexicon to support text indexing and searching. A text index can be created without passing any arguments:

>>> from zope.index.text.textindex import TextIndex
>>> index = TextIndex()

By default, it uses an "Okapi" inverted index and a lexicon with a pipeline consistening is a simple word splitter, a case normalizer, and a stop-word remover.

We index text using the index_doc method:

>>> index.index_doc(1, u"the quick brown fox jumps over the lazy dog")
>>> index.index_doc(2,
...    u"the brown fox and the yellow fox don't need the retriever")
>>> index.index_doc(3, u"""
... The Conservation Pledge
... =======================
...
... I give my pledge, as an American, to save, and faithfully
... to defend from waste, the natural resources of my Country;
... it's soils, minerals, forests, waters and wildlife.
... """)
>>> index.index_doc(4, u"Fran\xe7ois")
>>> word = (
...     u"\N{GREEK SMALL LETTER DELTA}"
...     u"\N{GREEK SMALL LETTER EPSILON}"
...     u"\N{GREEK SMALL LETTER LAMDA}"
...     u"\N{GREEK SMALL LETTER TAU}"
...     u"\N{GREEK SMALL LETTER ALPHA}"
...     )
>>> index.index_doc(5, word + u"\N{EM DASH}\N{GREEK SMALL LETTER ALPHA}")
>>> index.index_doc(6, u"""
... What we have here, is a failure to communicate.
... """)
>>> index.index_doc(7, u"""
... Hold on to your butts!
... """)
>>> index.index_doc(8, u"""
... The Zen of Python, by Tim Peters
...
... Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... Flat is better than nested.
... Sparse is better than dense.
... Readability counts.
... Special cases aren't special enough to break the rules.
... Although practicality beats purity.
... Errors should never pass silently.
... Unless explicitly silenced.
... In the face of ambiguity, refuse the temptation to guess.
... There should be one-- and preferably only one --obvious way to do it.
... Although that way may not be obvious at first unless you're Dutch.
... Now is better than never.
... Although never is often better than *right* now.
... If the implementation is hard to explain, it's a bad idea.
... If the implementation is easy to explain, it may be a good idea.
... Namespaces are one honking great idea -- let's do more of those!
... """)

Then we can search using the apply method, which takes a search string:

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown fox').items()]
[(1, '0.6153'), (2, '0.6734')]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'quick fox').items()]
[(1, '0.6153')]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown python').items()]
[]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'dalmatian').items()]
[]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown or python').items()]
[(1, '0.2602'), (2, '0.2529'), (8, '0.0934')]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'butts').items()]
[(7, '0.6948')]

The outputs are mappings from document ids to integer scored. Items with higher scores are more relevent.

We can use unicode characters in search strings:

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u"Fran\xe7ois").items()]
[(4, '0.7427')]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(word).items()]
[(5, '0.7179')]

We can use globbing in search strings:

>>> [(k, "%.3f" % v) for (k, v) in index.apply('fo*').items()]
[(1, '2.179'), (2, '2.651'), (3, '2.041')]

Text indexes support basic statistics:

>>> index.documentCount()
8
>>> index.wordCount()
114

If we index the same document twice, once with a zero value, and then with a normal value, it should still work:

>>> index2 = TextIndex()
>>> index2.index_doc(1, [])
>>> index2.index_doc(1, ["Zorro"])
>>> [(k, "%.4f" % v) for (k, v) in index2.apply("Zorro").items()]
[(1, '0.4545')]

Tracking Changes

In order to have as few writes as possible, the index doesn't change its state if we index a docid with the same values twice. To test this behaviour we have to create a simple data manager.

>>> class DM:
...     def __init__(self):
...         self.called = 0
...     def register(self, ob):
...         self.called += 1
...     def setstate(self, ob):
...         ob.__setstate__({'x': 42})

If we index a document the first time it changes the state of the underlying index. At the start _p_changed is False.

>>> index._p_jar = index.index._p_jar = DM()
>>> index.index._p_changed
False
>>> index.index_doc(100, u"a new funky value")
>>> index.index._p_changed
True

Now for testing we set the changed flag to false again.

>>> index.index._p_changed = False

If we index it a second time, the underlying index should not be changed.

>>> index.index_doc(100, u"a new funky value")
>>> index._p_changed is index.index._p_changed is False
True
>>> index.index._p_changed = False

But if we change it the state changes too.

>>> index.index_doc(100, u"an even newer funky value")
>>> index.index._p_changed
True

The same as for index_doc applies to unindex_doc, if an object is unindexed that is not indexed no indexes chould change state.

>>> index._p_changed = index.index._p_changed = False
>>> index.unindex_doc(100)
>>> index.index._p_changed
True
>>> index._p_changed = index.index._p_changed = False
>>> index.unindex_doc(100)
>>> index._p_changed is index.index._p_changed is False
True