How to feed eXist with contribution data
(Lingo comment: The eXist XML database system keeps its data in a "database", but the "database" can be divided into many "collections". When we say "database" below, we really mean "collection". Granted, we currently have all the data in the same collection, which makes the distinction less relevant, but in the future we might have more than one collection.)
(Media contributions have their own specs at the end of the document.)
General
The servlet FeedExist executes requests for feeding contribution data to the eXist database.
If the site is www.mycommunity.org, the port for the web application server is 8080, and the servlets reside in a web application named myapplication, the URL for a feed is the following:
http://www.mycommunity.org:8080/myapplication/FeedExist/store
The servlet can feed data for one contribution at a time.
The request is a POST request, with the XML containing the contribution data in a POST parameter named xml.
The requests are issued by a "feeding service" in Talia. The service first issues a special request to purge (clear) the eXist database of old data. It then repeatedly requests the feed servlet to store a contribution, until all are stored.
The "purge" request is done with the following URL:
http://www.mycommunity.org:8080/myapplication/FeedExist/purge
The "purge" request is a POST request, with no parameters.
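As an illustration of the request sequence described above (one purge, then one store per contribution), here is a minimal Python sketch that only builds the requests. The base URL is the example URL from this page. The form-encoded body and the Content-Type header are assumptions, since this page does not say how the servlet expects the xml parameter to be encoded. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Example base URL from this page; adjust host, port and web application name.
BASE_URL = "http://www.mycommunity.org:8080/myapplication/FeedExist"


def build_purge_request(base_url=BASE_URL):
    """POST request with no parameters that asks the servlet to purge old data."""
    return Request(base_url + "/purge", data=b"", method="POST")


def build_store_request(contribution_xml, base_url=BASE_URL):
    """POST request carrying ONE contribution document in the 'xml' parameter.

    Assumption: the parameter is sent form-encoded (not confirmed by the spec).
    """
    body = urlencode({"xml": contribution_xml}).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"}
    return Request(base_url + "/store", data=body, headers=headers, method="POST")


def build_feed_requests(contribution_xmls, base_url=BASE_URL):
    """Purge first, then one store request per contribution, in order."""
    requests = [build_purge_request(base_url)]
    requests.extend(build_store_request(x, base_url) for x in contribution_xmls)
    return requests
```

A feeding service would send each request in turn, e.g. with urllib.request.urlopen(request), and stop on the first error.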
For each XML document with contribution data that the feed servlet is asked to store, it does some necessary processing before storing.
Below is a sketch and discussion of the main structure of the format. At the bottom of this document there are the detailed specs themselves. They are not a DTD or XML schema, but more like an example of an XML document to be fed to eXist.
The detailed specs also contain comments about the processing the servlet does before storing.
The main structure/elements of the format
Document, and root element
The data for each contribution is stored as an XML document with a root element <talia:source>. (source was chosen as the name for the root element instead of contribution because there is a chance that in the future we must store material data in addition to contribution data, to support some kinds of macrocontributions.)
Metadata
First there is a <talia:metadata> element with metadata for the contribution. These are data common to all the contribution's versions (see below).
These data are used in "normal" search, which combines full text search with search on metadata criteria. They are not used in macrocontribution search, at least not at the time of writing (2008-09-17).
"Versions"
Each contribution might have different versions that are displayed to the user. E.g., a Nietzsche transcription might have a base version stored as an XML (HNML) encoded document, but be shown to the user as various XSLT-transformed HTML versions: "diplomatic layer 0", "diplomatic layer 1", etc., and "linear layer 0", "linear layer 1", etc. The main reason for storing data in eXist is to support full text search, and such search must be done on the transformed versions. For this reason the document stored in eXist contains a <talia:version> element for each of the transformed versions, with data specific to the version, in particular the transformed text (HTML) itself, which is used for full text search. The various <talia:version> elements are kept in a <talia:versions> element.
Note that there must always be one <talia:version> element present, even if the contribution contains no text.
Note that the base XML version is not among the versions stored. If XML-based search is to be implemented in Talia, the base XML must also be stored in eXist, though perhaps in a different structure.
The version text (HTML) data are used in both "normal" search and macrocontribution search.
Macrocontribution data
Finally there are data about which macrocontributions the contribution belongs to. There is one <talia:macrocontribution> element for each macrocontribution that the contribution belongs to - possibly none. The <talia:macrocontribution> elements are kept in a <talia:macrocontributions> element.
If the contribution doesn't belong to any macrocontribution, there shouldn't be any <talia:macrocontribution> or <talia:macrocontributions> elements present.
But do macrocontributions really contain contributions? Don't they rather contain versions? This matter has been discussed (between Øystein and Danilo at least) but no conclusion has been reached. Currently we store macrocontribution data at the contribution level, so from the point of view of the eXist data and the servlet all versions are equally relevant. The user interface for macrocontribution search must use search parameters such that only the relevant version(s) are retrieved. E.g., the critical edition search for the Nietzsche Talia must search for the versions that are tagged as "preferred" (the linear version with the highest writing layer).
The macrocontribution data are only used in macrocontribution search (of course).
The main structure/elements
<?xml version="1.0" encoding="UTF-8"?>
<talia:source xmlns:talia="http://trac.talia.discovery-project.eu/wiki/Exist#">
<talia:metadata>
....
</talia:metadata>
<talia:versions>
<talia:version>
...
</talia:version>
...
</talia:versions>
<talia:macrocontributions>
<talia:macrocontribution>
...
</talia:macrocontribution>
...
</talia:macrocontributions>
</talia:source>
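The skeleton above, together with the two rules stated earlier (at least one <talia:version> always, and no <talia:macrocontributions> element when the contribution belongs to no macrocontribution), can be sketched in Python with the standard ElementTree module. The helper names talia and build_source_skeleton are hypothetical. The sketch only builds empty elements and is not the feeding service's actual code.

```python
import xml.etree.ElementTree as ET

TALIA_NS = "http://trac.talia.discovery-project.eu/wiki/Exist#"
# Make the serializer emit the "talia:" prefix instead of a generated one.
ET.register_namespace("talia", TALIA_NS)


def talia(tag):
    """Qualified tag name in the talia namespace (ElementTree '{ns}tag' form)."""
    return "{%s}%s" % (TALIA_NS, tag)


def build_source_skeleton(version_count=1, macrocontribution_count=0):
    """Build an empty <talia:source> skeleton following the rules above."""
    if version_count < 1:
        raise ValueError("there must always be at least one <talia:version>")
    source = ET.Element(talia("source"))
    ET.SubElement(source, talia("metadata"))
    versions = ET.SubElement(source, talia("versions"))
    for _ in range(version_count):
        ET.SubElement(versions, talia("version"))
    # Omit the wrapper element entirely when there are no macrocontributions.
    if macrocontribution_count > 0:
        wrapper = ET.SubElement(source, talia("macrocontributions"))
        for _ in range(macrocontribution_count):
            ET.SubElement(wrapper, talia("macrocontribution"))
    return source
```

Calling ET.tostring(build_source_skeleton(2, 1), encoding="unicode") gives a string that can then be filled in and sent in the xml parameter.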
The structure of a Talia macrocontribution, and how macrocontribution data are stored in eXist
These are issues that can benefit from some explanation.
A macrocontribution is a hierarchy of materials, with contributions related to the leaf nodes. The hierarchy has everywhere at least two levels (root node not counted). Typically the hierarchy is a sequence of books, where each book is a sequence or hierarchy of chapters and/or pages and/or paragraphs.
The data stored in eXist about macrocontributions are stored per contribution. For each contribution, some information is stored about which macrocontribution(s) it belongs to, and about its path in the hierarchy.
In the example XML document at the bottom of this document, we have a contribution related to a paragraph in a chapter in a book in a macrocontribution. This might be the most complex hierarchical relation a contribution will have to a macrocontribution.
Note that the chapter typically is a special level (granule), because we want to have it in the left menu of the result, but not in the search form (the advanced macrocontribution search form). We'll come back to that in a minute.
A macrocontribution might be complex also in other ways - ways that cannot be shown in our example with a single contribution. E.g., in the "real world" the macrocontribution might contain books with a mix of pages and paragraphs. Pages and paragraphs overlap, and in the ideal system the structure should be realized as a graph with two overlapping and cross-related hierarchies. Instead Talia will assume a simplified structure - a hierarchy where the pages come first, then the paragraphs.
Another example of complexity is a macrocontribution with a book with chapters containing pages, but with the first pages not being part of a chapter.
As just stated: A macrocontribution is a hierarchy of materials, with contributions related to the leaf nodes. The hierarchy has everywhere at least two levels (root node not counted). But in the search and result interface the user meets two versions of this hierarchy - a complete hierarchy, and a simplified one. The complete hierarchy might e.g have all the levels book, chapter and paragraph. The user meets this complete hierarchy in the left menu of the search result list. (To be precise, the user normally sees a subset of the hierarchy, because a search usually returns a subset of the macrocontribution's contributions, and not all of them. But the important thing is that all levels of the complete hierarchy are present in the left menu, also chapter.) Then we have a simplified hierarchy, e.g book and paragraph, which the user meets in the search form itself (the advanced search form). The form has dropdown listboxes for two levels only, e.g, book and paragraph. Here the chapter level has been suppressed.
Now back to which data are stored in eXist: When a contribution belongs to a macrocontribution, the data stored have information about the contribution's path in the complete hierarchy (see the talia:macrocontribution/talia:path element). The information that is stored about the path is enough to support the building of the left menu. For each node (material) in the path we store (a) the granularity level (book or chapter, etc), (b) the URI of the material and (c) the title of the material.
For the nodes in the simplified hierarchy we also store (d) a position value - a six-digit, zero-filled integer value. For a paragraph in a chapter in a book, the value might be "000036", indicating that the paragraph is the 36th paragraph in the book (not in the chapter). For the chapter node there is no position value.
In a macrocontribution search, the information in the "simplified" nodes can be used to sort the found contributions in the correct order.
It can also be used for some kinds of search - so-called "slice" or "range" search.
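To make the ordering and the "slice" search on the simplified nodes concrete, here is a small Python sketch. It assumes each found contribution carries the position values of its simplified nodes, outermost level first. The dictionary layout, the contribution URIs and the function names are made up for the illustration. Because the positions are zero-filled to a fixed width, comparing them as strings gives the same result as comparing them as numbers.

```python
def simplified_sort_key(contribution):
    """Positions of the simplified nodes, outermost first, e.g. ('000003', '000101')."""
    return tuple(contribution["positions"])


def order_contributions(contributions):
    """Sort contributions by book position, then by position within the book."""
    return sorted(contributions, key=simplified_sort_key)


def slice_search(contributions, first, last):
    """Keep the contributions whose positions lie in the inclusive range first..last."""
    return [c for c in order_contributions(contributions)
            if first <= simplified_sort_key(c) <= last]


# Hypothetical data: three contributions with (book, paragraph) positions.
contributions = [
    {"uri": "http://a.b.c/c3", "positions": ["000003", "000101"]},
    {"uri": "http://a.b.c/c1", "positions": ["000001", "000200"]},
    {"uri": "http://a.b.c/c2", "positions": ["000003", "000036"]},
]

ordered_uris = [c["uri"] for c in order_contributions(contributions)]
# Everything in book 3: from its first possible position to its last.
book_three = slice_search(contributions, ("000003", "000000"), ("000003", "999999"))
```

This sketch ignores mixed granularities and duplicate positions.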
But to make search and ordering more efficient, the feed servlet builds a new, single element with a single value from various node element values. That new element is called <talia:search_key> (even though it is also used for ordering, and not just search).
(Why do we have positions within the simplified hierarchy and not the complete one? Answer: The simplified position information is easier for the feeding service to find than the complete position information. And the simplified position information is sufficient to support ordering and search.)
Here is an example <talia:search_key> value, broken into pieces just to make it more readable:
http://a.b.c/ABC
book
000003
http://a.b.c/ABC/hhh
para
000101
http://a.b.c/ABC/jjj
Between the pieces there are delimiters "." that are not shown here.
This is the value for a contribution related to the 101st paragraph of the 3rd book of the macrocontribution with URI http://a.b.c/ABC.
The numbers (positions) in the value are what the ordering and the slice search rely on.
The strings book and para are there to support macrocontributions with a mixed structure, like books with both pages and paragraphs. Strictly speaking the book value at the higher level might be unnecessary, but the para at the lower level helps distinguish between page 101 and paragraph 101, and makes sure that the pages come before the paragraphs.
In general we would like to have larger granules before smaller ones. By coincidence the values book, chap, page, para, zone, which represent the granules in descending size, sort in ascending alphabetical order, so they can be used directly for sorting. (The feed servlet derives these values directly from the <talia:granularity> values Book, Chapter, Page, Paragraph, Zone that it receives from the feeding service.)
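The mapping and its sorting property can be checked with a few lines of Python. The dictionary below only restates the five granularity values and short codes named above. The names GRANULARITY_CODES and granularity_code are hypothetical, not the servlet's.

```python
# <talia:granularity> value -> short code used inside the search key.
# Listed from the largest granule to the smallest.
GRANULARITY_CODES = {
    "Book": "book",
    "Chapter": "chap",
    "Page": "page",
    "Paragraph": "para",
    "Zone": "zone",
}


def granularity_code(granularity):
    """Return the short code for a <talia:granularity> value."""
    return GRANULARITY_CODES[granularity]


codes_by_size = list(GRANULARITY_CODES.values())
# Alphabetical order equals size order, so plain string sorting is enough.
codes_sorted = sorted(codes_by_size)
# Page 101 sorts before paragraph 101 within the same book.
page_before_para = "page.000101" < "para.000101"
```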
The URI values for book and paragraph are there because the position values might not be unique (!). Explanation: The creators of macrocontributions sometimes make mistakes and give the same position to more than one material. In some cases they might even do this deliberately, because there might be materials that have no specific order. But in the search we do want to have a predetermined order, so the URIs are there to make the positions unique.
Btw - the developer of the search (Øystein) takes his chances that the siglum-like value at the end of the URI is sufficient to guarantee uniqueness:
http://a.b.c/ABC
book
000003
hhh
para
000101
jjj
Haha.
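Putting the pieces together, a search key of the short (siglum) form could be assembled roughly as follows. This is a sketch under assumptions, not the feed servlet's actual code. It treats every node that has a position as part of the simplified hierarchy and skips nodes without one (such as chapters). It takes the last path segment of the URI as the siglum. The function names and the dictionary layout of a node are hypothetical.

```python
GRANULARITY_CODES = {
    "Book": "book",
    "Chapter": "chap",
    "Page": "page",
    "Paragraph": "para",
    "Zone": "zone",
}


def siglum(uri):
    """Last path segment of a URI, e.g. 'hhh' for 'http://a.b.c/ABC/hhh'."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def build_search_key(macrocontribution_uri, nodes):
    """Join the macrocontribution URI and the simplified nodes with '.' delimiters.

    nodes: path from the outermost to the leaf node; each node is a dict with
    'granularity', 'uri' and, for simplified nodes only, 'position'.
    """
    parts = [macrocontribution_uri]
    for node in nodes:
        position = node.get("position")
        if position is None:
            continue  # assumption: no position means "not in the simplified hierarchy"
        parts.append(GRANULARITY_CODES[node["granularity"]])
        parts.append("%06d" % int(position))  # six digits, zero-filled
        parts.append(siglum(node["uri"]))
    return ".".join(parts)


path = [
    {"granularity": "Book", "uri": "http://a.b.c/ABC/hhh", "position": "000003"},
    {"granularity": "Chapter", "uri": "http://a.b.c/ABC/iii"},
    {"granularity": "Paragraph", "uri": "http://a.b.c/ABC/jjj", "position": "000101"},
]
key = build_search_key("http://a.b.c/ABC", path)
```

For this path the key is http://a.b.c/ABC.book.000003.hhh.para.000101.jjj, the same value as in the pieces listed above.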
Note: In theory a leaf level material, e.g., a paragraph, might have more than one contribution. But there are currently no data in the system that say anything about the order of such contributions. A macrocontribution search will retrieve them in undefined order.
The details of the format structure of the xml to be given to the feed servlet by the feeding service
<?xml version="1.0" encoding="UTF-8"?>
<talia:source xmlns:talia="http://trac.talia.discovery-project.eu/wiki/Exist#">
<talia:metadata>
<talia:maintype>contribution</talia:maintype>
<talia:type>transcription</talia:type>
<talia:subtype>hnml</talia:subtype>
<talia:uri>http://a.b.c/ccc</talia:uri>
<talia:authors>
<talia:author>
<talia:firstname>Andrew</talia:firstname>
<talia:lastname>Williams</talia:lastname>
<talia:uri>http://a.b.c/qwert</talia:uri>
</talia:author>
...
</talia:authors>
<talia:title>...</talia:title>
<talia:standard_title>...</talia:standard_title>
<talia:language>it</talia:language>
<talia:date>2006-03-03</talia:date>
</talia:metadata>
<talia:versions>
<talia:version>
<talia:version_type>xxx</talia:version_type>
<talia:version_layer>xxx</talia:version_layer>
<talia:preferred>true</talia:preferred>
<!--
here follows
- either a <talia:content> element
containing the version content to be stored,
typically html
- or a URI to where the content can be found
- but not both.
the <talia:content> element must contain the html in utf-8 format,
and with XML entities escaped
(so that the content at first sight looks like text content to the servlet).
to emphasize: the characters "<", ">", "&" should be escaped.
they should be represented as entities "&lt;", "&gt;", "&amp;".
(do _not_ escape html entities in general.)
the feed servlet takes the content (or gets it from the URI),
and does some conversion.
it cleans html to valid xhtml (which is a form of XML).
if the content is a pdf, it extracts the text (which is also XML compatible).
if the content has no text, e.g a facsimile, the result is empty.
afterwards the content is stored in a <talia:content> element.
the <talia:uri> element is not stored
-->
<talia:content>...</talia:content>
OR
<talia:uri>...</talia:uri>
<!--
here the feed servlet will place a duplicate of the
<talia:macrocontributions> element, to speed up search and ordering
-->
</talia:version>
...
</talia:versions>
<talia:macrocontributions>
<talia:macrocontribution>
<talia:uri>http://a.b.c/ABC</talia:uri>
<talia:path>
<!--
the feed servlet will set a 'talia:leaf' attribute
in each <talia:node> element.
values: "false" and "true".
this will make the data better suited
for uri-to-search_key conversion
(see comment elsewhere)
-->
<talia:node>
<talia:granularity>Book</talia:granularity>
<talia:uri>http://a.b.c/ABC/hhh</talia:uri>
<talia:title>Xxxx xx</talia:title>
<!-- position of book within MC.
a 6-digit zero-filled integer -->
<talia:position>000003</talia:position>
</talia:node>
<talia:node>
<talia:granularity>Chapter</talia:granularity>
<talia:uri>http://a.b.c/ABC/iii</talia:uri>
<talia:title>Yyy yyyy yyy</talia:title>
<!-- no position info for chapter -->
</talia:node>
<talia:node>
<talia:granularity>Paragraph</talia:granularity>
<talia:uri>http://a.b.c/ABC/jjj</talia:uri>
<talia:title>Zzz z zzzz zzzz</talia:title>
<!-- position of paragraph within book.
a 6-digit zero-filled integer -->
<talia:position>000101</talia:position>
</talia:node>
</talia:path>
<!--
here the feed servlet will insert a <talia:search_key> element
with a value that can be used for ordering,
and to find the contribution
when searching for slices.
it will contain info for the two main levels,
not chapter.
when the search servlet runs a macrocontribution search
it will get some search values from the search interface as URIs.
sometimes it needs the search values as <talia:search_key> values instead.
then the servlet must do preliminary queries to convert
the URI values to "search keys".
for that purpose it needs the <talia:search_key> element,
and the <talia:uri> element of the leaf <talia:node>.
that's why the feed servlet flags the <talia:node> elements
with a 'talia:leaf' attribute.
<talia:search_key>http://a.b.c/ABC.book.000003.hhh.para.000101.jjj</talia:search_key>
-->
</talia:macrocontribution>
...
</talia:macrocontributions>
</talia:source>
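The escaping rule for <talia:content> in the comments above (escape only the characters "<", ">" and "&", and leave everything else alone) corresponds to what Python's xml.sax.saxutils.escape does by default. The sketch below shows a feeding service preparing the element as a string. The function name wrap_content is hypothetical.

```python
from xml.sax.saxutils import escape


def wrap_content(html):
    """Return a <talia:content> element whose body is the HTML as escaped text.

    escape() replaces only '&', '<' and '>' with XML entities; other characters,
    such as accented letters, are left untouched.
    """
    return "<talia:content>" + escape(html) + "</talia:content>"


html = "<p>Jenseits von Gut & Böse é</p>"
content_element = wrap_content(html)
```

When the servlet parses the document, the element's text is the original HTML string again, which it can then clean to XHTML.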
Special case: Media contributions. The details of the format structure of the xml to be given to the feed servlet by the feeding service
<?xml version="1.0" encoding="UTF-8"?>
<talia:source xmlns:talia="http://trac.talia.discovery-project.eu/wiki/Exist#">
<talia:metadata>
<talia:maintype>contribution</talia:maintype>
<!--
this is how we distinguish the media contributions
from the other contribution types:
we have the value "AvMedia" in the <talia:type> element
-->
<talia:type>AvMedia</talia:type>
<!--
I think ideally this element should be present,
with a value of "audio" or "video".
but for the time being the element is not necessary,
and can be left out
-->
<talia:subtype>audio</talia:subtype>
<!--
the URI of the contribution
-->
<talia:uri>http://a.b.c/ccc</talia:uri>
<!--
this is a new element for media contributions -
the URL of the audio/video file
-->
<talia:url>http://d.e.f/fff</talia:url>
<!--
this is a new element for media contributions -
the URL of a thumbnail image for the audio/video file
-->
<talia:thumbnail_url>http://d.e.f/ggg</talia:thumbnail_url>
<talia:authors>
<talia:author>
<!--
unfortunately the data we have for RAI media contributions
have author names as one field, with first name first.
instead of trying to divide the name up properly
and store the first name in <talia:firstname>
and the last name in <talia:lastname>,
we just store it in _one_ of these fields (elements).
we decided to store it in the <talia:firstname> element.
but if somebody decides it should rather be stored in <talia:lastname>
that's ok too. it makes no difference to the search.
search on author will presumably work quite well
even if the name is not properly stored in two elements.
-->
<talia:firstname>Andrew Williams</talia:firstname>
<!--
I'm not sure if the <talia:lastname> is necessary,
but it's safest to have it
-->
<talia:lastname></talia:lastname>
<!--
the names in RAI media contributions have no URI,
and the element can be left out.
but perhaps it will be used in the future.
because of that it is mentioned here
-->
<talia:uri>http://a.b.c/qwert</talia:uri>
</talia:author>
...
</talia:authors>
<talia:title>...</talia:title>
<talia:standard_title>...</talia:standard_title>
<talia:language>it</talia:language>
<!--
this is the publication date.
all contributions have a publication date.
-->
<talia:date>2006-03-03</talia:date>
<!--
this is a new element for media contributions.
this is the creation date of the media
-->
<talia:creation_date>2000-11-22</talia:creation_date>
<!--
I assume we don't need to store series and category
-->
<!--
a new element for media contributions.
the length in time of the audio or video.
I am not sure exactly what it contains.
but I think it is only used for display,
so it doesn't matter too much if the value isn't formalized
but perhaps it should have a more specific name? media_length? not very important
-->
<talia:length>...</talia:length>
<!--
a new element for media contributions.
I'm not sure if we want to store it.
but it does no harm at least
-->
<talia:bibliography>...</talia:bibliography>
<!--
a new element for media contributions.
contains one or more <talia:keyword> elements,
each with a keyword from a controlled (?) vocabulary.
(at least I assume there will always be at least one keyword.
but if a contribution has no keyword elements,
it will of course not be found with a keyword search)
-->
<talia:keywords>
<!--
(a new element for media contributions.)
-->
<talia:keyword>...</talia:keyword>
...
</talia:keywords>
</talia:metadata>
<!--
(we use the <talia:versions> and <talia:version> elements
even if media contributions have no transformed versions.
this is nothing particular to media contributions.
also some other contributions never have versions.
we still use the <talia:versions> and <talia:version> elements for their content)
-->
<talia:versions>
<talia:version>
<!--
these are elements used for other contributions.
I don't know if these elements are at all relevant
for media contributions.
they can be skipped for now at least
-->
<talia:version_type>xxx</talia:version_type>
<talia:version_layer>xxx</talia:version_layer>
<!--
this element, however, we should have, to be on the safe side
-->
<talia:preferred>true</talia:preferred>
<!--
for other contributions than media contributions here follows
- either a <talia:content> element
containing the version content to be stored,
typically html
- or a <talia:uri> element with a URI to where the content can be found
- but not both.
for media contributions we always have a <talia:content> element,
never a <talia:uri> element.
this <talia:content> element always contains a <talia:abstract> element,
which contains the abstract itself, as pure text.
even if there is no abstract, there should be a <talia:abstract> element here
(an empty one, in that case).
-->
<talia:content>
<talia:abstract>...</talia:abstract>
</talia:content>
</talia:version>
...
</talia:versions>
</talia:source>
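For media contributions the version part is simple enough to sketch in a few lines. The Python sketch below builds one <talia:version> with a <talia:preferred> element and a <talia:content> element that always contains a <talia:abstract>, empty when there is no abstract, as the comments above require. The helper names are hypothetical. The optional <talia:version_type> and <talia:version_layer> elements are left out.

```python
import xml.etree.ElementTree as ET

TALIA_NS = "http://trac.talia.discovery-project.eu/wiki/Exist#"
ET.register_namespace("talia", TALIA_NS)


def talia(tag):
    """Qualified tag name in the talia namespace (ElementTree '{ns}tag' form)."""
    return "{%s}%s" % (TALIA_NS, tag)


def build_media_version(abstract=None):
    """Build the <talia:version> element for a media contribution.

    The <talia:abstract> element is always present; it is empty when the
    contribution has no abstract.
    """
    version = ET.Element(talia("version"))
    ET.SubElement(version, talia("preferred")).text = "true"
    content = ET.SubElement(version, talia("content"))
    ET.SubElement(content, talia("abstract")).text = abstract or ""
    return version


with_abstract = build_media_version("An interview about the archive.")
without_abstract = build_media_version()
```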
