This page is outdated

This page is outdated, as are some other pages about feeding and searching data in eXist.

See the new pages:

http://trac.talia.discovery-project.eu/wiki/ExistFeedingSpecs

http://trac.talia.discovery-project.eu/wiki/ExistSearchRequest

http://trac.talia.discovery-project.eu/wiki/ExistNormalSearchResult

http://trac.talia.discovery-project.eu/wiki/ExistMacrocontributionSearchResult




Description of the eXist Search in Hyper. See also #585

Data is stored in an XML database system called eXist, which runs as a servlet under Tomcat.

The data stored in eXist is text content taken from contribution files in the Hyper file system, together with metadata taken from the Hyper PostgreSQL database. The data in eXist is therefore a converted, reformatted copy of data held elsewhere in the system.

XML data storage

We store in eXist the contributions of Hyper that contain text. More specifically, we store, where possible, the displayed version of each contribution: not the XML source but the transformed HTML. If a contribution exists in several versions (e.g. diplomatic vs. linear, writing layer 0 vs. 1 vs. 2), we store it several times, once for each displayed (transformed) version.

For contributions in PDF format we don't store HTML but plain text. (My plan was to store HTML for PDF contributions as well, as similar to the PDF as possible, but I ran into problems converting PDF to HTML.)

When the HTML (or text) is stored in eXist, it is wrapped inside an XML document that also contains metadata (author, title, etc.). For details see below.
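
The wrapping described above can be sketched in Python as follows. This is only an illustration, not the Feed servlet's actual code (which is Java). The function name `wrap_contribution`, its parameters and the choice to store the HTML as escaped text inside `data` are assumptions. Only the element names are taken from the XML data sample further down this page, and only a subset of the metadata elements is shown.

```python
import xml.etree.ElementTree as ET


def wrap_contribution(content, doc_id, contrib_type, siglum,
                      authors, title, language, date):
    """Wrap one displayed version (HTML or plain text) in an XML envelope.

    `authors` is a list of (lastname, firstname) tuples.
    Hypothetical helper; the real feed code may differ.
    """
    document = ET.Element("document")
    version = ET.SubElement(document, "version")
    metadata = ET.SubElement(version, "metadata")

    for tag, value in (("id", doc_id), ("type", contrib_type), ("siglum", siglum)):
        ET.SubElement(metadata, tag).text = value

    authors_el = ET.SubElement(metadata, "authors")
    for lastname, firstname in authors:
        author = ET.SubElement(authors_el, "author")
        ET.SubElement(author, "lastname").text = lastname
        ET.SubElement(author, "firstname").text = firstname

    for tag, value in (("title", title), ("language", language), ("date", date)):
        ET.SubElement(metadata, tag).text = value

    # Assumption: the content is stored as escaped text inside <data>.
    # The real system might embed the HTML as child elements instead.
    ET.SubElement(version, "data").text = content

    return ET.tostring(document, encoding="unicode")
```

Calling the function once per displayed version of a contribution would yield one XML document per version, matching the storage scheme described above.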

Metadata search in eXist

The search forms have one field for full-text search criteria (words and/or phrases) and several fields for metadata search criteria. If only metadata criteria are filled in, the search is done not in eXist but in the PostgreSQL database.
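
The routing rule above can be expressed as a small function. This is a sketch under assumptions: the name `choose_backend`, the dictionary of metadata fields and the error raised for an empty form are hypothetical and do not come from the Hyper code.

```python
def choose_backend(full_text, metadata_criteria):
    """Decide where a search request is executed.

    full_text: content of the full-text field (may be empty).
    metadata_criteria: dict mapping metadata field names to entered strings.
    Returns "exist" or "postgres".
    """
    has_text = bool(full_text and full_text.strip())
    has_metadata = any(v and v.strip() for v in metadata_criteria.values())

    if has_text:
        # Any full-text criterion sends the search to eXist.
        return "exist"
    if has_metadata:
        # Only metadata criteria: search the PostgreSQL database.
        return "postgres"
    # Assumption: an entirely empty form is rejected.
    raise ValueError("empty search request")
```

If all search were moved into eXist, this function would collapse to always returning "exist".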

Not long ago Danilo decided we should change the system so that all searches are done in eXist. That requires also storing in eXist the contributions that don't contain text. This has not been implemented, because Hyper was moved down the priority list. The XML we must store for text-less contributions will be very similar to the XML for textual contributions: it will contain the same metadata elements, but no HTML/text content.

Feeding the eXist database

The eXist database is fed by a Java servlet (the Feed servlet), also running under Tomcat. It is started manually. It was meant to be started by a cron job or similar, but that hasn't been implemented.

The Feed servlet generates the whole content of the eXist database from scratch. (When I discussed macro contributions with Danilo in Lecce in February/March we considered having some incremental feed process for the data specific to macro contributions.)

Servlets

A couple of Java servlets perform the actual search. The interface sends requests to these servlets, and they return the results to the interface.

Search modes

I have implemented three search pages: Full Mode Normal Search (FMNS), Research Studies Normal Search (RSNS) and Critical Edition Normal/Advanced Search (CEAS). The first two are quite similar, and use the same servlet (StandardSearch). CEAS has its own servlet (CritEdSearch).

The StandardSearch servlet returns the search result as XML, to be converted to HTML by the interface. The CritEdSearch servlet hasn't reached the same maturity level and returns HTML directly. CEAS will in any case be changed and enhanced to become a more general macro contribution search.

XML Data sample

<?xml version='1.0' encoding='utf-8'?>
<document>

<!-- many contributions in hyper
    have several displayed versions,
    e.g. linear layer 1 vs diplomatic layer 1.
    the exist db contains one xml document
    for each displayed version.
    the "version" element represents the version.
    the reasons for the extra "document" element
    around the "version" element are historical:
    at one time the developer thought it might be useful
    to have all the versions of the same contribution
    in a common "document" element.
    this might still turn out to be useful, btw -->
<version>

 <metadata>
   <!-- it's useful to have a human-recognizable id
        for the documents stored in an eXist db.
        we use as an id the string
        siglum/something/.../something
        from the static link,
        but with "/" replaced by "_" -->
   <id>...</id>
   <!-- contribution type, e.g. "essays" -->
   <type>...</type>
   <siglum>...</siglum>
   <authors>
     <author>
       <lastname>...</lastname>
       <firstname>...</firstname>
     </author>
     ...
   </authors>
   <title>...</title>
   <!-- a "standard" title here if there is no title in the postgresql db -->
   <standard_title>...</standard_title>

   <!-- iso two-letter language code -->
   <language>...</language>
   <!-- iso date format, e.g. "2007-11-27" -->
   <date>...</date>
   <!-- critical edition data -
        one element per critical edition the contribution belongs to -->
   <critical_edition>
     <!-- siglum of the critical edition -->
     <siglum>...</siglum>
     <!-- name of the critical edition -->
     <name>...</name>
     <!-- siglum of the work -->
     <work_siglum>...</work_siglum>
     <!-- name of the work -->
     <work_name>...</work_name>
     <!-- name of the related material -->
     <related_material_name>...</related_material_name>
     <!-- the related material's position within the work
          (not within the critical edition) -->
     <position_within_work>...</position_within_work>
     <!-- siglum of the related material -->
     <related_material_siglum>...</related_material_siglum>
     <!-- a hierarchical position string,
          made up of work siglum, underscore,
          and position within work (5 digits, zero filled),
          e.g. "WS_00013" -->
     <position>...</position>
     <!-- an element to allow "All" search.
          similar to the "position" element.
          made up of "all", underscore,
          and position within work (5 digits, zero filled),
          e.g. "all_00013" -->
     <all_position>...</all_position>
   </critical_edition>
 </metadata>
 <!-- the "data" element contains the html of the contribution
      (or the plain text, for pdf contributions) -->
 <data>...</data>

</version>

</document>
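
The derived values in the sample (`id`, `position`, `all_position`) follow simple string rules that can be sketched as below. The function names are hypothetical and the Feed servlet itself is written in Java. Only the formats (the "/" to "_" replacement and the five-digit zero-filled position) come from the comments in the sample.

```python
def make_document_id(static_link_path):
    """Turn "siglum/something/.../something" into an id by replacing "/" with "_"."""
    return static_link_path.replace("/", "_")


def make_position(work_siglum, position_within_work):
    """Work siglum, underscore, position within work as 5 zero-filled digits."""
    return f"{work_siglum}_{position_within_work:05d}"


def make_all_position(position_within_work):
    """Same as make_position, but with the fixed prefix "all" for "All" search."""
    return f"all_{position_within_work:05d}"


print(make_position("WS", 13))      # WS_00013
print(make_all_position(13))        # all_00013
```

Zero-filling keeps plain string comparison consistent with numeric order, which is presumably why the position is padded to a fixed width.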