This page is outdated

This page is outdated. So are some other pages about feeding and searching data in eXist.

See the new pages:

http://trac.talia.discovery-project.eu/wiki/ExistFeedingSpecs

http://trac.talia.discovery-project.eu/wiki/ExistSearchRequest

http://trac.talia.discovery-project.eu/wiki/ExistNormalSearchResult

http://trac.talia.discovery-project.eu/wiki/ExistMacrocontributionSearchResult

http://trac.talia.discovery-project.eu/wiki/ExistMediaSearchResult




Search index in XML database

This document is not complete. It is the tentative specification for keeping a fulltext search index, with metadata, in the eXist XML database. It describes the XML data that will be fed into the eXist database as well as the interfaces used to interact with the index.

General Architecture

The search system consists of several elements:

  • The eXist database itself, running on an application server
  • A servlet to communicate with the Talia "clients"
  • A Talia "client" module that communicates with the servlet
  • A servlet that accepts new XML documents to be posted into the eXist database
  • A Talia service that "feeds" new XML documents into the eXist system

Queries to the eXist system

To execute a search query, Talia will prepare a query and post it to the search servlet. The query consists of key-value pairs. The first element of each pair names the "field" that will be searched in the XML documents; the second element is the condition that the field must satisfy.

An Example:

fulltext    "eine Frau" AND sinne
author      Nietzsche

This would search for XML elements where the document data contains "eine Frau" (as a phrase) and 'sinne'. The author tag of the XML must contain 'Nietzsche'.

The query will be taken by the search servlet, which will transform it into an XQuery expression and use that to query the eXist database. The servlet will then reformat the result from the eXist database and send an XML document with the search results back to Talia.
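The translation step could look roughly like the following Python sketch. It is only an illustration under several assumptions: the function names, the collection path /db/talia, and the mapping of the "fulltext" key to the <data> element are all hypothetical, and the standard XPath contains() function (a case-sensitive substring match) stands in for whatever fulltext matching the servlet will really use. Adjacent terms without an operator are assumed to be ANDed, and malformed conditions are not checked.

```python
import re

# Matches either a quoted phrase or a single whitespace-delimited word.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')
# Conservative check so that a field name can be used as an element name.
FIELD_NAME = re.compile(r"[A-Za-z_][\w.-]*")


def condition_to_predicate(element, condition):
    """Turn one condition into an XPath predicate on the given element."""
    if not FIELD_NAME.fullmatch(element):
        raise ValueError("invalid field name: %r" % element)
    parts = []
    for phrase, word in TOKEN.findall(condition):
        if word in ("AND", "OR"):
            parts.append(word.lower())
            continue
        if parts and parts[-1] not in ("and", "or"):
            parts.append("and")  # assumption: adjacent terms are implicitly ANDed
        term = (phrase or word).replace("'", "''")  # XQuery escapes ' by doubling it
        parts.append("{}[contains(., '{}')]".format(element, term))
    return "(" + " ".join(parts) + ")"


def build_xquery(pairs, collection="/db/talia"):
    """Combine all key-value pairs into one query (collection path is hypothetical)."""
    predicates = []
    for field, condition in pairs:
        # Assumption: the "fulltext" key searches the <data> element of the feed XML.
        element = "data" if field == "fulltext" else field
        predicates.append(condition_to_predicate(element, condition))
    return "collection('{}')/document[{}]".format(collection, " and ".join(predicates))


if __name__ == "__main__":
    print(build_xquery([("fulltext", '"eine Frau" AND sinne'),
                        ("author", "Nietzsche")]))
```

Writing each test as element[contains(., '...')] keeps the predicate valid when a document carries several elements with the same name, such as more than one author.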

Problem: We must also be able to use boolean operators

Adding XML to the system

The 'index' will be built by a service in Talia. It will scan all documents and create an appropriate XML document for each one. It will then send the XML to the "feed" servlet, which will store it in eXist.

Scenarios/Use Cases

These are the use cases for the search functionality. They describe how the search can be used by scholars. The search functionality will be designed to enable these use cases.

Global Fulltext search

A scholar wants to search the whole site for a keyword from scholar mode. This will return all hits from the whole database. For example, she may search the whole NietzscheSource site for the word "wanderer". In the search, OR and AND operators can be used, and phrases can be searched for by enclosing them in quotation marks.

Open questions on this:

  • How will the structure be shown in the result list? E.g., if the search phrase is found on different pages of a book or in different versions of the same transcription, how will it show up?
  • How many results should be shown? For one, this is a tradeoff between performance and overview; in addition, it will be more difficult to show "hierarchies" of results if not all results are available.

Global property search (combined with Fulltext)

This will work like the global fulltext search, but it will also allow the scholar to specify criteria on the metadata "properties". E.g. the search above can have the additional criterion "the author must contain 'D'Iorio'". The property search also allows for "AND" and "OR" queries.

Open questions: See above.

Simple Mode search

The main search works as in the examples above. However, the results will be restricted to those documents that are part of the given macrocontribution.

Rationale

The search will store the data in a simple key-value format. That means that the content in the XML database will basically consist of "fields" that have a "value" attached (see also the XML example).

The names of the fields are (except for some default fields) not specified in advance. The search servlet will take key-value combinations (as stated above) and search the fields given in the query if they exist.

The reason for this approach is that it keeps advanced logic and the information about the internal structure out of the search database. This means the actual code of the search servlet will not have to be changed if the data structure changes. It will also be possible to add new "fields" as necessary.

XML feed format

(Proposed format)

<document>
        <!-- These are "fixed" elements specified by the XML format -->
        <id>siglum</id> <!-- The siglum/identifier for the document -->
        <order>1</order> <!-- The ordering for the document -->
        
        <!-- The following are "free" key-value pairs. They are not required by 
             the XML format, but can be searched upon if they exist. The 
             following are given as examples.
             We could potentially also use a different XML syntax if that is easier,
             such as <item key="xxx">value</item>
        -->
        <title>A cool title for the document</title>
        <author>D'Iorio, Paolo</author>
        <author>Reigem, Oystein</author>
        <book>NII-6</book>
        <page>NII-6,2</page>
        <macrocontribution>diorio-edition</macrocontribution>
        
        <!-- The actual data for the fulltext search -->
        <data>
        </data>
        
</document>
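
As an illustration of how the feeding service might produce this format, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function name and its parameters are hypothetical and not part of the specification. Properties are passed as a list of pairs rather than a dictionary so that a key such as "author" can occur more than once.

```python
import xml.etree.ElementTree as ET


def build_feed_document(siglum, order, properties, text):
    """Serialize one document in the proposed feed format (hypothetical helper)."""
    doc = ET.Element("document")
    # "Fixed" elements required by the format.
    ET.SubElement(doc, "id").text = siglum
    ET.SubElement(doc, "order").text = str(order)
    # "Free" key-value pairs; a key may be repeated (e.g. several authors).
    for key, value in properties:
        ET.SubElement(doc, key).text = value
    # The actual data for the fulltext search.
    ET.SubElement(doc, "data").text = text
    return ET.tostring(doc, encoding="unicode")


if __name__ == "__main__":
    print(build_feed_document(
        "siglum", 1,
        [("title", "A cool title for the document"),
         ("author", "D'Iorio, Paolo"),
         ("author", "Reigem, Oystein"),
         ("book", "NII-6"),
         ("page", "NII-6,2"),
         ("macrocontribution", "diorio-edition")],
        "eine Frau"))
```

Building the tree with a library rather than by string concatenation means special characters in values are escaped and every element is closed.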

Flattening the Hierarchy

This format does not express any hierarchy between elements. Instead the hierarchy will be "flattened" into the key-value approach: If a document has children (subparts) and the document has a property "X", then the property will also be added to the search XML of all children (and their children, etc.)
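
The flattening rule can be sketched in Python as a recursive walk that hands each document's properties down to its descendants. The nested-dictionary input shape ("id", "properties", "children") is an assumption made for the example and says nothing about how Talia actually stores its documents.

```python
def flatten(node, inherited=()):
    """Yield (id, properties) for a node and all of its descendants.

    Each descendant receives the properties of all its ancestors in
    addition to its own, so no hierarchy has to be stored in the index.
    """
    props = list(inherited) + list(node.get("properties", []))
    yield node["id"], props
    for child in node.get("children", []):
        yield from flatten(child, props)


if __name__ == "__main__":
    # Hypothetical input: a book with one page as its child.
    book = {
        "id": "NII-6",
        "properties": [("book", "NII-6"), ("author", "D'Iorio, Paolo")],
        "children": [
            {"id": "NII-6,2", "properties": [("page", "NII-6,2")]},
        ],
    }
    for doc_id, props in flatten(book):
        print(doc_id, props)
```

In this example the page "NII-6,2" ends up with the book and author properties of its parent as well as its own page property, so a search on the author field also finds the page.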

Result Format