
Architecture Overview

Zoie is a realtime indexing and search system, and as such needs relatively close coupling between its logically distinct Indexing and Searching subsystems: as soon as a document is made available to be indexed, it must be immediately searchable.

The Zoie System is the primary component of Zoie; it incorporates both Indexing (by implementing DataConsumer<V>) and Search (by implementing IndexReaderFactory<ZoieIndexReader<R extends IndexReader>>).
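In code terms, the two roles look roughly like the following (signatures paraphrased from the interfaces named above and simplified for readability; the actual Zoie interfaces carry a few additional methods and generic parameters):

// Indexing side: a DataProvider pushes batches of data events into the system.
public interface DataConsumer<V> {
  void consume(Collection<DataEvent<V>> events) throws ZoieException;
}

// Search side: the system hands out read-only index readers and takes them back
// once the caller is done searching with them.
public interface IndexReaderFactory<R extends IndexReader> {
  List<R> getIndexReaders() throws IOException;
  void returnIndexReaders(List<R> readers);
}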

Configuration

Zoie can be configured via Spring:

<!-- An instance of a DataProvider:
     FileDataProvider recurses through a given directory and provides the DataConsumer
     indexing requests built from the gathered files.
     In the example, this provider needs to be started manually, and it is done via jmx.
-->
<bean id="dataprovider" class="proj.zoie.impl.indexing.FileDataProvider">
  <constructor-arg value="file:${source.directory}"/>
  <property name="dataConsumer" ref="indexingSystem" />
</bean>


<!--
  an instance of an IndexableInterpreter:
  FileIndexableInterpreter converts a text file into a lucene document, for example
  purposes only
-->
<bean id="fileInterpreter" class="proj.zoie.impl.indexing.FileIndexableInterpreter" />

<!-- A decorator for an IndexReader instance:
     The default decorator is just a pass through, the input IndexReader is returned.
-->
<bean id="idxDecorator" class="proj.zoie.impl.indexing.DefaultIndexReaderDecorator" />

<!-- A zoie system declaration, passed as a DataConsumer to the DataProvider declared above -->
<bean id="indexingSystem" class="proj.zoie.impl.indexing.ZoieSystem" init-method="start" destroy-method="shutdown">

  <!-- disk index directory-->
  <constructor-arg index="0" value="file:${index.directory}"/>

  <!-- sets the interpreter -->
  <constructor-arg index="1" ref="fileInterpreter" />

  <!-- sets the decorator -->
  <constructor-arg index="2">
    <ref bean="idxDecorator"/>
  </constructor-arg>

  <!-- set the Analyzer, if null is passed, Lucene's StandardAnalyzer is used -->
  <constructor-arg index="3">
    <null/>
  </constructor-arg>

  <!-- sets the Similarity, if null is passed, Lucene's DefaultSimilarity is used -->
  <constructor-arg index="4">
    <null/>
  </constructor-arg>

  <!-- The following parameters control when batched indexing is triggered:
       indexing is triggered by whichever of the two events below happens first.
  -->

  <!-- Batch size: how many items to put on the queue before indexing is triggered -->
  <constructor-arg index="5" value="1000" />

  <!-- Batch delay: how long to wait (in milliseconds) before indexing is triggered -->
  <constructor-arg index="6" value="300000" />

  <!-- flag turning on/off real time indexing -->
  <constructor-arg index="7" value="true" />
</bean>

<!-- a search service -->
<bean id="mySearchService" class="com.mycompany.search.SearchService">
  <!-- IndexReader factory that produces index readers to build Searchers from -->
  <constructor-arg ref="indexingSystem" />
</bean>
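For reference, the same wiring can be sketched directly in Java. The constructor-argument order below mirrors the eight <constructor-arg> entries above; the index path is a placeholder, and the exact constructor signature and generic parameters may differ slightly between Zoie versions, so treat this as illustrative rather than definitive:

File indexDir = new File("/path/to/index");        // disk index directory (placeholder path)

ZoieSystem<IndexReader, File> indexingSystem = new ZoieSystem<IndexReader, File>(
    indexDir,                                       // arg 0: disk index directory
    new FileIndexableInterpreter(),                 // arg 1: interpreter
    new DefaultIndexReaderDecorator(),              // arg 2: decorator
    null,                                           // arg 3: Analyzer (null -> Lucene's StandardAnalyzer)
    null,                                           // arg 4: Similarity (null -> Lucene's DefaultSimilarity)
    1000,                                           // arg 5: batch size
    300000,                                         // arg 6: batch delay in milliseconds
    true);                                          // arg 7: real-time indexing on

indexingSystem.start();                             // same lifecycle as init-method="start"
// ... hook up a DataProvider and a search service here ...
indexingSystem.shutdown();                          // same lifecycle as destroy-method="shutdown"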

Indexing

Documents get into the Zoie System for addition to Lucene indices by way of a decoupled DataProvider abstraction, which indexes via push: Zoie implements the DataConsumer interface, the natural partner to DataProvider. What follows is a brief call-stack walk-through of indexing (a sketch of the interpretation step appears after the list):

  • DataProvider is running on its own thread/pool/remote machine/etc., and controls the flow of DataEvent<V> objects by calling...
  • DataConsumer.consume(Collection<DataEvent<V>>), which in this case is the ZoieSystem, which in turn delegates to an internal
  • ZoieIndexableInterpreter, whose job is to iterate over the DataEvent<V> objects and turn each one into a ZoieIndexable via convertAndInterpret(V data); these resultant objects provide
  • Document objects via buildIndexingReqs().
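As a concrete illustration of the interpretation step, here is a sketch of a hypothetical interpreter for a made-up MyRecord value type. The convertAndInterpret() and buildIndexingReqs() hooks are the ones described above; MyRecord, the field names, and the abstract base classes are assumptions, and exact signatures vary across Zoie and Lucene versions:

// Hypothetical value type pushed by the DataProvider; not part of Zoie.
public class MyRecord {
  long id;
  String text;
  boolean deleted;
}

// Converts each MyRecord into a ZoieIndexable that knows how to build Lucene documents.
public class MyRecordInterpreter extends AbstractZoieIndexableInterpreter<MyRecord> {
  @Override
  public ZoieIndexable convertAndInterpret(final MyRecord record) {
    return new AbstractZoieIndexable() {
      @Override
      public long getUID() { return record.id; }            // unique id Zoie uses for updates/deletes
      @Override
      public boolean isDeleted() { return record.deleted; } // a delete event carries no document
      @Override
      public IndexingReq[] buildIndexingReqs() {
        Document doc = new Document();
        doc.add(new Field("content", record.text, Field.Store.NO, Field.Index.ANALYZED));
        return new IndexingReq[] { new IndexingReq(doc) };
      }
    };
  }
}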

RAM-to-Disk Index Segment Copy:

Prior to 1.4.0, the RAM index and the disk index each tokenized document data and built inverted indexes separately. In 1.4.0, we eliminated this duplicate work: the disk index now copies index segments from the RAM index instead of going through tokenization and inversion again. This greatly reduces CPU load and disk I/O when documents are flushed to the disk index.

Overview: What enables real-time searchability in Zoie is the fact that ZoieSystem contains three IndexDataLoader objects:

  • a RAMLuceneIndexDataLoader, which is a simple wrapper around a RAMDirectory,
  • a DiskLuceneIndexDataLoader, which can index directly to the FSDirectory in batches via an intermediary
  • BatchedIndexDataLoader, whose primary job is to queue up and batch DataEvents that need to be flushed to disk

All write requests that come in through the DataProvider are teed off into a "currentWritable" RAMDirectory and into an in-memory queue of DataEvents, which the BatchedIndexDataLoader accumulates until it reaches a threshold batch size (and a minimum delay time has passed), after which the batch is added to the disk index.

Searching

ZoieSystem, acting as an IndexReaderFactory, provides an "expert" search API (note that these IndexReader instances are always read-only, and thus cannot be used to modify the index, only to search it) for clients of ZoieSystem who need access to the IndexReader internals (for faceting, caching, etc.). For clients who do not need or want such an expert API, an upcoming Zoie release will provide a simpler "Searcher Factory" interface that compartmentalizes the IndexReader internals by wrapping the IndexReaders in a MultiSearcher.

ZoieSystem's getIndexReaders() returns a list of ZoieIndexReader instances.
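A minimal sketch of that search path, assuming the Spring configuration above (the field name and query are illustrative, and the Lucene calls reflect older 2.x/3.x-era APIs, so they may need adjusting for the Lucene version in use): obtain the readers, search them, and always return them so Zoie can release the underlying resources:

List<ZoieIndexReader<IndexReader>> readers = indexingSystem.getIndexReaders();
try {
  // Combine the current readers (RAM and disk) into one searchable view.
  MultiReader multiReader = new MultiReader(
      readers.toArray(new IndexReader[readers.size()]), false);
  IndexSearcher searcher = new IndexSearcher(multiReader);

  TopDocs hits = searcher.search(new TermQuery(new Term("content", "zoie")), 10);
  // ... render hits ...
} finally {
  indexingSystem.returnIndexReaders(readers);  // readers must always be handed back
}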

See the Code Samples wiki for examples.
