Talia System Description

This document is designed as a global overview of the talia_core functionality, mainly for developers that want to use the source code. It may refer to other, more in-depth documents where appropriate. It is helpful if the reader has a basic understanding of RDF and the semantic web, as well as of Ruby on Rails.

System startup

The first step of the Talia startup happens when the talia_core/lib/talia_core.rb file is included. This will in turn load the talia_dependencies.rb file, which will attempt to load all the modules on which Talia depends (currently "assit", "activerdf_net7" and "semantic_naming"). The modules are loaded by the TLoad module from the loader_helper file. It will first check if the submodule directories exist at the same level as the talia_core directory. If these are not found, the modules are loaded as gems.
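
The fallback described above can be sketched roughly as follows. This is not the TLoad code; dependency_source is a hypothetical helper that only reports where a module would be loaded from, assuming the sibling directory carries the same name as the module.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not part of Talia): report where a dependency would
# be loaded from. A directory next to talia_core wins, otherwise the gem.
def dependency_source(core_dir, name)
  sibling = File.join(File.dirname(core_dir), name)
  File.directory?(sibling) ? [:directory, sibling] : [:gem, name]
end

# Demo layout: only "assit" exists next to talia_core.
Dir.mktmpdir do |root|
  core = File.join(root, "talia_core")
  FileUtils.mkdir_p(core)
  FileUtils.mkdir_p(File.join(root, "assit"))

  %w[assit activerdf_net7 semantic_naming].each do |mod|
    kind, target = dependency_source(core, mod)
    puts "#{mod}: #{kind} (#{target})"
  end
end
```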

The loader will also load some parts of Rails (if not already in a Rails application) and set up the file system paths for the rest of the startup.

In a Rails application, the "real" Talia startup will be triggered through a Rails initializer file, commonly config/initializers/talia_initializer.rb. By default that will set up the load paths, set the environment (development/production/testing) and call TaliaCore::Initializer.run(configuration_file). The initialization logic is found in the initializer.rb file.

Configuration files

The main configuration file for Talia is config/talia_core.yml. In addition, the usual database.yml file is used for the database connection, and a rdfstore.yml file to configure the connection to the RDF data store.

The talia_core.yml file also contains the namespaces that are recognized by Talia. The namespaces "rdf", "xsd", "rdfs" and "owl" must not be configured here; they are set by default inside the semantic_naming module.

All configuration options will be available on the TaliaCore::CONFIG hash after initialization. Users may add "custom" options that are not recognized by talia_core itself, but can be read from TaliaCore::CONFIG. See the Initializer documentation for details about the configuration options.

ActiveSource

TaliaCore::ActiveSource and its subclasses are at the heart of the talia_core. ActiveSource is intended to be a "semantic" replacement for ActiveRecord::Base. The main development goal is to have a behavior as similar to ActiveRecord::Base as possible. (ActiveSource is currently a subclass of ActiveRecord::Base but may be rewritten in the future to a generic class conforming to the ActiveModel API).

TaliaCore::Source is a subclass that adds "intelligent" handling for property access (allowing, for example, source.rdf::type-like accesses).

The classes are found in the active_source.rb and source.rb files.

Dual backend store

The ActiveSource records and all the semantic data exist in the SQL database, but all semantic data is duplicated in the RDF store. The active_sources table contains only the URI of each source, and the type for the single table inheritance (STI). The semantic_relations table contains the semantic triple, with the subject_id field referring to the "subject" source, the predicate_uri containing the URI of the predicate of the triple and object_id referencing an "object" source or a SemanticProperty record as the object. The semantic_properties table contains the object properties that are literal values, at this point only strings; these may also contain longer texts.
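
As an illustration of how the three tables combine into triples, here is a small in-memory sketch. The rows are plain hashes rather than ActiveRecord objects, the example URIs are made up, and the :object_type key used to tell "object" sources from literal properties is an assumption, not a confirmed column.

```ruby
# Simplified stand-ins for the tables; :object_type is an assumed discriminator.
SKETCH_SOURCES = [
  { :id => 1, :uri => "http://example.org/book",   :type => "Source" },
  { :id => 2, :uri => "http://example.org/author", :type => "Source" }
].freeze
SKETCH_PROPERTIES = [{ :id => 1, :value => "A title" }].freeze
SKETCH_RELATIONS = [
  { :subject_id => 1, :predicate_uri => "http://purl.org/dc/terms/title",
    :object_id => 1, :object_type => :property },
  { :subject_id => 1, :predicate_uri => "http://purl.org/dc/terms/creator",
    :object_id => 2, :object_type => :source }
].freeze

# Resolve every relation row into a [subject, predicate, object] triple.
def resolve_triples(relations, sources, properties)
  relations.map do |rel|
    subject = sources.find { |row| row[:id] == rel[:subject_id] }[:uri]
    object =
      if rel[:object_type] == :property
        properties.find { |row| row[:id] == rel[:object_id] }[:value]
      else
        sources.find { |row| row[:id] == rel[:object_id] }[:uri]
      end
    [subject, rel[:predicate_uri], object]
  end
end

resolve_triples(SKETCH_RELATIONS, SKETCH_SOURCES, SKETCH_PROPERTIES).each do |triple|
  puts triple.join(" ")
end
```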

The SQL data will be replicated to the RDF store when the record is saved (see also CreatingRdfInTalia). The logic for the automatic handling of the RDF is in the talia_core/lib/talia_core/active_source_parts/rdf_handler.rb file.

Accessing semantic properties

The ActiveSource and Source classes make an effort to allow the user to access semantic properties in a similar way as database fields. The following methods will work on all ActiveSource objects:

# This will return a URI composed of the given namespace and
# an appended "local" part
N::RDF.local

# Access a property rdf:local, all the following are equivalent
source.predicate(:rdf, :local)
source[N::RDF.local]
source["http://www.w3.org/1999/02/22-rdf-syntax-ns#local"]

# Add a property rdf:local, all the following are equivalent
source.predicate_set(:rdf, :local, "value")
source[N::RDF.local] << "value"
source["http://www.w3.org/1999/02/22-rdf-syntax-ns#local"] << "value"

# Add a property only if it does not already exist
source.predicate_set_uniq(:rdf, :local, "value")

# Replace the given predicate with the given value
source.predicate_replace(:rdf, :local, "value")

The Source class also supports the "intelligent" interface:

source.rdf::local # Reading
source.rdf::local << "value" # Writing

All methods that return multiple values will return an object of the class SemanticCollectionWrapper, which (mostly) acts like a normal collection/array. The source.namespace call does in fact return a DummyHandler object:

# This 
source.rdf::local
# is equivalent to
dummy = source.rdf
# Dummy contains the DummyHandler
dummy::local # == dummy.local
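
To make the two-step lookup more concrete, the following is a minimal, self-contained sketch of how such a namespace handler could be built on method_missing. SketchSource and SketchNamespaceHandler are hypothetical stand-ins; the real Source and DummyHandler classes may work differently.

```ruby
# Stand-in for the object returned by source.rdf: it remembers the namespace
# and turns any further zero-argument call into a property lookup.
class SketchNamespaceHandler
  def initialize(source, namespace_uri)
    @source = source
    @namespace_uri = namespace_uri
  end

  def method_missing(name, *args)
    return super unless args.empty? && property_name?(name)
    @source[@namespace_uri + name.to_s]
  end

  def respond_to_missing?(name, include_private = false)
    property_name?(name) || super
  end

  private

  # Plain identifiers only; conversion hooks such as to_ary are left alone.
  def property_name?(name)
    name.to_s.match?(/\A[a-z]\w*\z/) && !name.to_s.start_with?("to_")
  end
end

# Stand-in for a source: properties live in a hash keyed by predicate URI.
class SketchSource
  NAMESPACES = { :rdf => "http://www.w3.org/1999/02/22-rdf-syntax-ns#" }.freeze

  def initialize
    @properties = Hash.new { |hash, uri| hash[uri] = [] }
  end

  def [](uri)
    @properties[uri]
  end

  def method_missing(name, *args)
    namespace = NAMESPACES[name]
    return super unless namespace && args.empty?
    SketchNamespaceHandler.new(self, namespace)
  end

  def respond_to_missing?(name, include_private = false)
    NAMESPACES.key?(name) || super
  end
end

source = SketchSource.new
source.rdf::local << "value"   # same as source.rdf.local << "value"
p source["http://www.w3.org/1999/02/22-rdf-syntax-ns#local"]
```

In Ruby, obj::name with a lowercase name is an ordinary method call, which is why the source.rdf::local notation needs no special parser support.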

Mass assignment and defined properties

All Talia sources can take a hash of "attributes", like ActiveRecord::Base does. If an attribute corresponds to a database field, it will be handled in the normal way (e.g. "uri"). All other attributes will be treated as predicate URIs:

TaliaCore::ActiveSource.new(:uri => N::LOCAL.newsource, N::RDF.value => [ "value1", "value2" ])
# The following will ADD a new value to rdf:value
source.update_attributes(:uri => N::LOCAL.othervalue, N::RDF.value => "newstuff")
# Talia-specific: This will REPLACE rdf:value
source.rewrite_attributes(:uri => N::LOCAL.othervalue, N::RDF.value => [ "value3", "value4" ])

The #rewrite_attributes method is a special case of #update_attributes. Whereas the "update" method will leave the existing values for a property and add new ones, the "rewrite" method will replace the existing values with the given ones.

Mass assignments from a hash of attributes are handled in two parts: The first is ActiveSource.split_attribute_hash, found in lib/talia_core/active_source_parts/class_methods.rb. This method returns a hash which contains the :db_attributes (for database fields) and :semantic_attributes (all semantic properties) as two separate hashes. The :db_attributes are fed into the default mass-update methods (update_attributes, new of the base class). The :semantic_attributes are fed into the add_semantic_attributes method of ActiveSource. The adding of semantic attributes can either remove the existing values (overwrite) or not.
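
The two-part handling can be sketched as follows. These are simplified stand-ins, not the code from class_methods.rb: the list of database fields is assumed, and the semantic store is a plain hash of arrays.

```ruby
# Assumed database columns; everything else counts as a predicate URI.
SKETCH_DB_FIELDS = %w[uri type].freeze

# Split an attribute hash into database fields and semantic properties.
def split_attributes(attributes)
  db, semantic = attributes.partition { |key, _| SKETCH_DB_FIELDS.include?(key.to_s) }
  { :db_attributes => db.to_h, :semantic_attributes => semantic.to_h }
end

# Add values to the store; with overwrite the existing values are dropped
# first (the "rewrite" behaviour), otherwise they are kept ("update").
def add_semantic_attributes(store, attributes, overwrite)
  attributes.each do |predicate, values|
    store[predicate] = [] if overwrite || !store.key?(predicate)
    store[predicate].concat(Array(values))
  end
  store
end

value_uri = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"
parts = split_attributes(:uri => "http://example.org/new", value_uri => %w[value1 value2])
store = add_semantic_attributes({}, parts[:semantic_attributes], false)
add_semantic_attributes(store, { value_uri => "newstuff" }, false)        # appends
add_semantic_attributes(store, { value_uri => %w[value3 value4] }, true)  # replaces
p store
```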

However, this way of using the methods does not play well with standard forms. Forms in Rails return a hash with simple strings as keys, and simple strings as values. To work around this limitation, subclasses of ActiveSource can use the singular_property and multi_property helpers to declare "static" attributes that work very much like Rails attributes. See TaliaModelsInRails for more information.

Handling and Caching of semantic properties/predicates

The reading and writing of semantic properties is mainly contained in the files predicate_handler.rb, semantic_collection_wrapper.rb and semantic_collection_item.rb. The main goals here are to query more efficiently and cache some of the values in order to avoid unnecessary queries.

  • Each SemanticCollectionWrapper holds all properties for a single predicate on a single source. It will only load the data when the collection is first accessed (using load!). When the data is loaded it will use a special query to get everything with a single SQL request (find_fat_relations) and it is also possible to provide the data from outside (init_from_fat_rels) so that an outside process can fill multiple wrappers from the same dataset.
  • The SemanticCollectionWrapper is modified in memory. When the source is saved, it will call the save_items! method on the wrappers. This will save all the relation triples from the wrapper to the database. It will skip the step if the values were never accessed, and will skip all elements known to be already in the database.
  • The SemanticCollectionItem is mainly a wrapper around a "fat relation". This is a SemanticRelation with additional fields that contain the related values or uris. Since additional fields are present, a "fat" relation can never be saved to the database.
  • In the ActiveSource object (see the predicate_handler.rb file), all the currently loaded wrappers are cached. On saving the source, all wrappers are loaded and saved as described above. This file also contains the logic for "prefetching" a number of relations. This allows the base class to load a number of sources, including all related properties, with a single query.
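
The lazy loading and selective saving described in the list above can be illustrated with a small stand-alone class. This is only a sketch of the idea: LazyPredicateValues is a hypothetical name, the loader block stands in for the single "fat relations" query, and save_items! returns the values it would write instead of touching a database.

```ruby
class LazyPredicateValues
  def initialize(&loader)
    @loader = loader  # stands in for the query that fetches existing values
    @items  = nil     # nil means "never accessed"
  end

  def loaded?
    !@items.nil?
  end

  def <<(value)
    load!
    @items << { :value => value, :persisted => false }
    self
  end

  def values
    load!
    @items.map { |item| item[:value] }
  end

  # Returns the values that needed saving. Nothing happens if the collection
  # was never accessed, and already persisted items are skipped.
  def save_items!
    return [] unless loaded?
    pending = @items.reject { |item| item[:persisted] }
    pending.each { |item| item[:persisted] = true }
    pending.map { |item| item[:value] }
  end

  private

  # Run the loader once, on first access.
  def load!
    return if loaded?
    @items = @loader.call.map { |value| { :value => value, :persisted => true } }
  end
end

wrapper = LazyPredicateValues.new { ["stored value"] }
wrapper << "new value"
p wrapper.values
p wrapper.save_items!
```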

RDF handling

The RDF store is updated each time a source is saved. The logic for handling the RDF updates is found in the file lib/talia_core/active_source_parts/rdf_handler.rb. The user/developer may disable the automatic RDF creation, which can be used as an optimization in some cases, by unsetting the autosave_rdf property. The automatic creation is triggered through the auto_create_rdf callback; the main logic is in create_rdf.

Writing the RDF

The writing of the rdf has three "modes".

  • :create is the standard mode. It will only write back the properties that are currently cached and loaded. The old values for each of those properties are removed separately.
  • :force forces the RDF for the source to be completely rewritten. This will remove all the properties of the source in a single request, rewriting all of them.
  • :false is a special mode for sources that are not yet in the database. It will write all cached properties, but does not attempt to delete any old values.
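
The difference between the three modes can be summarised as a list of store operations. The function below is a hypothetical illustration, not the create_rdf implementation; it accepts both false and :false for the third mode, since the exact value is an assumption here.

```ruby
# Returns the operations a mode would perform, as [action, predicate] pairs.
# cached_predicates: predicates currently loaded in memory
# all_predicates:    every predicate the source has
def rdf_write_plan(mode, cached_predicates, all_predicates)
  case mode
  when :create
    # Remove and rewrite each cached predicate individually.
    cached_predicates.flat_map { |pred| [[:delete, pred], [:write, pred]] }
  when :force
    # One bulk removal, then everything is written again.
    [[:delete_all]] + all_predicates.map { |pred| [:write, pred] }
  when false, :false
    # New source: nothing to remove, just write what is cached.
    cached_predicates.map { |pred| [:write, pred] }
  else
    raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end

p rdf_write_plan(:create, %w[rdf:type], %w[rdf:type dct:title])
p rdf_write_plan(:force,  %w[rdf:type], %w[rdf:type dct:title])
p rdf_write_plan(false,   %w[rdf:type], %w[rdf:type dct:title])
```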

Accessing the RDF

The "RDF object" for a source is available through the source.my_rdf accessor. This object has a similar interface to ActiveSource for accessing properties:

# Reads the data from the RDF
source.my_rdf[N::RDF.local]
# Reads the data from the SQL database
source[N::RDF.local]

Finders and queries

Additions to the default find method of ActiveRecord are found in lib/talia_core/active_source_parts/finders.rb. The main changes are described in the RDoc documentation and include the :find_through option to search for sources based on the value of a semantic property:

# Search for sources where N::RDF.local has a value of "value"
TaliaCore::ActiveSource.find(:all, :find_through => [N::RDF.local, "value"])

The finders may be reworked in the future. Their main weakness is that they only work on the SQL database, not on the RDF.

RDF Queries

At the moment all RDF Queries are done directly through the ActiveRDF query interface. The queries are modelled after the SPARQL query language:

ActiveRDF::Query.new(TaliaCore::ActiveSource).select(:book).where(:book, N::DCT.creator, :author).where(:author, N::FOAF.name, "Danilo").execute

The class passed to the new method will be used for new "Resource" objects (represented by URIs). Literal values will be returned as strings. The above query would return an ActiveSource object for each "book" where the creator has a name of "Danilo".

Creating sources

The logic for the ActiveSource.new method (like the other general class methods) is found in the lib/talia_core/active_source_parts/class_methods.rb file. Note that in the special case of calling new with a single string (a URI) as a parameter, the method will return the existing source if there is one. This is probably slated to change, and it will not happen if the URI is passed in as part of a hash.

If only a single string is passed to new (or find) that consists only of numbers, or looks like "0000-something", it will be treated as a numeric database id. In addition, if any URI passed to the source does not appear to be a valid URI, it is considered to be the local part of a URI in the "local" namespace. See the to_uri_s method for reference.
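
A rough sketch of these rules follows. It is not the to_uri_s method: the regular expressions are guesses at what "consists only of numbers", "looks like 0000-something" and "valid URI" mean, and SKETCH_LOCAL_NAMESPACE is a placeholder for the configured local namespace.

```ruby
SKETCH_LOCAL_NAMESPACE = "http://example.org/local/"  # placeholder value

# Classify a single string argument as a database id or a URI.
def interpret_identifier(str)
  if str =~ /\A\d+(-.+)?\z/
    # "42" or "0042-something": treated as a numeric database id
    [:id, str.to_i]
  elsif str =~ /\A[a-z][a-z0-9+.\-]*:/i
    # Starts with a scheme, so it is taken as a full URI
    [:uri, str]
  else
    # Anything else becomes the local part of a URI in the local namespace
    [:uri, SKETCH_LOCAL_NAMESPACE + str]
  end
end

%w[42 0042-something http://example.org/book mysource].each do |input|
  p interpret_identifier(input)
end
```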

XML export

Talia can create XML/RDF and XML representations of sources, and import sources from XML. Most code for the XML handling can be found in lib/talia_core/active_source_parts/xml and lib/talia_util/xml.

Building XML and RDF

The base class for creating XML is the TaliaUtil::Xml::BaseBuilder. It uses the standard XML builder internally. This base class is used to write "builders" that contain specialized XML writing code. The TaliaUtil::Xml::RdfBuilder is one of those subclasses; it turns an array of triples into an RDF/XML file.

There are two special builders that are used specifically from the ActiveSource objects: The TaliaCore::ActiveSourceParts::Xml::SourceBuilder can write a source object in a simple XML format, which is the "Talia internal XML" format. The TaliaCore::ActiveSourceParts::Xml::RdfBuilder does the same, but produces RDF/XML as its output.

XML Import

Talia has a built-in facility for importing XML data in the "internal" format, and the TaliaCore::ActiveSourceParts::Xml::GenericReader class and friends provide an easy way to create "readers" for almost any XML format imaginable. The readers will contain handler blocks, which are internally used to create instance methods on the Reader object(s) using define_method.

The xml import can be started from the command line, using the talia_core:xml_import rake task:

$ rake talia_core:xml_import xml=import_data.xml

RDF Import

The Talia framework can also import RDF data, through the RDF.rb library. You can install the plugins to RDF.rb to support various RDF file formats. To use this, just use one of the RDF reader subclasses as the Reader for the xml_import task.

RDF data can also be imported directly into the RDF store. This means that resources from the RDF are not available as Sources in Talia, and that queries work only through the ActiveRDF::Query interface. The import logic is contained in lib/talia_core/rdf_import.rb. The import can be run as a rake task:

$ rake talia_core:rdf_import rdf_syntax=rdfxml files=test.rdf

Ordered Collections

The Collection class in lib/talia_core/collection.rb is a source that contains an ordered collection of sources. For compatibility with standard RDF collections, the contained sources are connected to the collection using relations (predicates) of http://www.w3.org/1999/02/22-rdf-syntax-ns#_<order number>. SQL operations will use the rel_order field for sorting - the number contained in that field will be the same as the order number of the relation.

The collection contains an internal array, where the position of an element in the array corresponds to the position in the collection. Most operations of the Collection class are passed through to the array and the collections will behave like arrays for practical purposes. This also means that standard methods like #sort can be used on it.

All operations take place in memory, and on saving the collection will be completely re-written on the data store (see the #rewrite_order_relations method). The collection class will bypass the standard callbacks for rewriting the RDF, and will force a complete RDF rewrite after the order_relations were rewritten.
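
The mapping from array positions to ordering predicates can be sketched like this. It is an illustration rather than the #rewrite_order_relations code, and it assumes the order numbers start at 1, as they do for standard RDF container membership properties (rdf:_1, rdf:_2, ...).

```ruby
SKETCH_RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Build one relation per element; the array position decides both the
# predicate and the rel_order value.
def order_relations(collection_uri, element_uris)
  element_uris.each_with_index.map do |element_uri, index|
    order = index + 1  # assumed 1-based numbering
    { :subject   => collection_uri,
      :predicate => "#{SKETCH_RDF_NS}_#{order}",
      :object    => element_uri,
      :rel_order => order }
  end
end

pages = %w[http://example.org/page_b http://example.org/page_a]
# Because the collection behaves like an array, sorting it and rebuilding
# the relations is enough to renumber everything.
order_relations("http://example.org/book", pages.sort).each { |rel| p rel }
```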

Semantic Routing

Talia includes a default controller for active sources (in generators/talia_base/templates/app/controllers/sources_controller.rb). It will also write a new default route:

# Default semantic dispatch
map.connect ':dispatch_uri.:format', :controller => 'sources', :action => 'dispatch',
   :requirements => { :dispatch_uri => /[^\.]+/ }

This will route all unrecognised requests (that is, those not handled by another controller) to the SourcesController#dispatch action. It will do the following:

  • Try to find a source with the URI that was typed into the browser (a trailing ".<format>" will be ignored for that)
  • If no source is found, a RecordNotFound (404) is raised - that means that you will also see this 404 if one of the other routes is not configured properly
  • If the source is found, the controller will attempt to find a template for the source (see #template_for, #map_templates_in and #template_map):
    • Templates are in app/views/sources/semantic_templates
    • If there is more than one template for a given source, one of them is selected (note that in this case, the selection may even change between different calls)
    • Templates in semantic_templates/default match the Ruby runtime class, that is the class name of the source (without the module name). E.g. a default/collection.html.erb would match "TaliaCore::Collection" sources.
    • Templates named semantic_templates/<other namespace>/<type name> will match RDF types. For example foaf/friend.html.erb would match sources that have an RDF type of foaf:Friend
    • If no matching template is found semantic_templates/default/default.html.erb is used
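
The template lookup above could be sketched as a pure function. This is a simplified, hypothetical version of what #template_for does: it works on a list of template names (relative to semantic_templates, without extensions), assumes class and type names are simply downcased, and always picks the first match, whereas the real selection among several matches is not guaranteed to be stable.

```ruby
# class_name: Ruby class of the source, e.g. "TaliaCore::Collection"
# rdf_types:  [namespace, type name] pairs, e.g. [["foaf", "Friend"]]
# available:  template names that exist, e.g. ["default/collection"]
def sketch_template_for(class_name, rdf_types, available)
  candidates = ["default/#{class_name.split('::').last.downcase}"]
  candidates += rdf_types.map { |namespace, type| "#{namespace}/#{type.downcase}" }
  candidates.find { |candidate| available.include?(candidate) } || "default/default"
end

available = %w[default/collection foaf/friend]
puts sketch_template_for("TaliaCore::Collection", [], available)
puts sketch_template_for("TaliaCore::Source", [%w[foaf Friend]], available)
puts sketch_template_for("TaliaCore::Source", [], available)
```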

Within the template, the current source is available as @source

Data File handling

Each source can have zero or more DataRecords attached. The data records are found in talia_core/lib/talia_core/data_types. The main interface is defined in data_record.rb - the data record defines a file-like interface on a byte sequence. However, the base class does not assume file storage, and there are even cases where the real data is not available through the standard interface (e.g. IIP, see also the MediaLink data type).

Each record also has a MIME type, and the mime type determines how the record is created during "automatic loading".

File Records

File-based storage is realised through subclasses of FileRecord. This class only contains the base API, most of the file storage code is in lib/talia_core/data_types/file_store.rb. The create/load logic is in lib/talia_core/data_loader.rb.

Files are stored in the "data directory" (configured in talia_core.yml), in the following way: <ModelClassName>/<xxx>/<db id of data record>, where xxx is the last three digits of the database id. E.g. an XmlData record with the id 19837 would end up as XmlData/837/19837. There are a number of helpers for creating data and temporary paths, found in lib/talia_core/data_types/path_helpers.rb (http://github.com/net7/talia_core/blob/master/lib/talia_core/data_types/path_helpers.rb).

The file records keep an internal @file_handle (in case the corresponding file is open). When a record is created from a new file (usually using the #create_from_file or the #create_from_data method), or when a new file is attached (using #file=(v)), the data is only written to the "official" directory when the db record is saved.
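
The directory layout can be expressed in a few lines. sketch_data_path is a hypothetical helper, not one of the path_helpers.rb methods, and the zero-padding for ids shorter than three digits is an assumption.

```ruby
# Build <data directory>/<ModelClassName>/<last three digits>/<id>.
def sketch_data_path(data_directory, class_name, id)
  bucket = id.to_s.rjust(3, "0")[-3..-1]  # assumed: short ids are zero-padded
  File.join(data_directory, class_name, bucket, id.to_s)
end

puts sketch_data_path("/var/talia/data", "XmlData", 19837)
puts sketch_data_path("/var/talia/data", "XmlData", 7)
```

Spreading the records over up to 1000 sub-directories keeps any single directory from growing too large.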

Writing the data (write_file_after_save) is an operation that depends on a number of conditions, both from the data itself, and the system settings.

  • If the original data is a file, then the file_data_to_write should be a DataPath object (see #create_from_file)
  • If the original data is a string/binary array, then the data will not be a DataPath (see #create_from_data)
  • In the latter case, #save_cached_data is called, which writes the data to the correct data file
  • In case the original data is a file, things get a bit more involved (#copy_data_file and #copy_or_move):
    • If the delete_original flag was passed during the creation of the record, the file will be moved to its new location (which is often much quicker than a copy operation)
    • The move operation, at this point, makes a call to the system's mv command. This is due to problems that FileUtils sometimes has under jruby. It should soon be modified to go back to the FileUtils version, or use the same approach as the copy operation (see below)
    • If the delete_original flag is not set, Talia will attempt to copy the file:
      • If delay_file_copies is set, the files will not be copied; instead the process will create a plain text file that contains cp commands that can be executed as a shell script. Obviously, this is blazing fast (since nothing is done)
      • If fast_copies is set, the system will use the FileUtils.copy method to do the actual copy. Note that this has caused some problems in the past when a large number of big files was copied using jruby.
      • Otherwise a call will be made to the system's cp command to do the actual copy. This is less portable but appeared to be most stable
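
The decision tree in the list above can be condensed into a small function. This sketch only names the strategy that would be chosen; the returned symbols are made up for illustration, while delete_original, delay_file_copies and fast_copies are the flags and settings described above.

```ruby
# Decide how a source file reaches the data directory.
def file_transfer_strategy(delete_original:, delay_file_copies: false, fast_copies: false)
  return :system_mv        if delete_original    # move via the mv command
  return :write_cp_script  if delay_file_copies  # only record cp commands for later
  return :fileutils_copy   if fast_copies        # copy inside the Ruby process
  :system_cp                                     # default: call the cp command
end

p file_transfer_strategy(:delete_original => true)
p file_transfer_strategy(:delete_original => false, :delay_file_copies => true)
p file_transfer_strategy(:delete_original => false, :fast_copies => true)
p file_transfer_strategy(:delete_original => false)
```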

Mime Types and loading/creating of records

The default mime type mapping, and methods to access/manipulate it, are found in lib/talia_core/data_types/mime_mapping.rb. The Mime types are registered as standard MIME types for rails, and are stored in the database along with the records. The Mime types are used when creating a new data record through the "data loader" facility. The "data loader" is contained in lib/talia_core/data_types/data_loader.rb, and it is used when a new record is created through #create_from_url. The method can take either a file path or a web URI as a parameter, and the behaviour depends on what is passed.

The creation of the record (and determination of the correct MIME type) will work in the following way:

  • In case a MIME type is passed as an option, this type is always used, and no attempt is made to automatically detect the MIME type
  • To determine if the uri is a file, the method checks if it exists on the file system. If not it is assumed to be a web URL, otherwise it is obviously a file. (Any "file://" is stripped, allowing the use of "file://" URIs as well)
  • In case the uri is a file, the method #mime_by_location is used to determine the MIME type from the file location (if the mime type is not already known). It checks if the file extension matches one of those for the registered MIME types, and this MIME type is used for the record.
  • In case the uri refers to a web resource, a connection will be opened to retrieve the data. The system will first attempt to use the MIME type that was sent by the server. If the server doesn't supply any MIME type, the loader will try to use the "extension", as above.
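
The order of precedence in the list above can be sketched as follows. The helpers are simplified stand-ins with an assumed extension table; they do not open any connection, so the MIME type sent by the server is passed in as a parameter.

```ruby
# Assumed extension table; the real mapping lives in mime_mapping.rb.
SKETCH_EXTENSION_TYPES = {
  ".xml" => "text/xml",
  ".jpg" => "image/jpeg",
  ".txt" => "text/plain"
}.freeze

# A location counts as a file if it exists on disk; "file://" is stripped.
def sketch_file_location?(uri)
  File.exist?(uri.sub(%r{\Afile://}, ""))
end

# Guess the MIME type from the extension of a path or URL.
def sketch_mime_by_location(location)
  SKETCH_EXTENSION_TYPES[File.extname(location).downcase]
end

# An explicit type always wins, then the server-supplied type (web resources
# only), then the extension.
def sketch_resolve_mime_type(location, explicit: nil, server_type: nil)
  return explicit if explicit
  server_type || sketch_mime_by_location(location)
end

p sketch_resolve_mime_type("scan.JPG")
p sketch_resolve_mime_type("http://example.org/doc.xml", :server_type => "application/xml")
p sketch_resolve_mime_type("scan.jpg", :explicit => "image/tiff")
```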

After the loader has opened the data location and determined the mime type, the creation is passed to #open_and_create. This method will use the #loader_type_from method to determine the "loader" for the given MIME type.

Data Loaders

A loader for data can be two things in Talia: Either there is only a data record class to use for a specific MIME type (a subclass of FileRecord), or the system has a class plus the name of a loader method to call.

In the first case, the procedure to create the new data record is quite easy: A new record of the given type is created, and #create_from_file or #create_from_data is called with the current data file (or in-memory data).

If the system has the name of a loader method (as a symbol), it will attempt to call that method on the data record class provided for the loader. The loader method will be called with the parameters mime_type, location, source, is_file - which are the MIME type of the new record, the "location" (file name) to use, the IO source (or location) for the data and an indication whether or not the source is a file system file.

For an example of what a loader method can do, see the IipLoader module (the "loader method" is create_iip). If you configure the use of the Iip loader, this method will be called to create a new record.

Configuring Data Loaders

The default configuration for loading is defined in the mime_mapping.rb file. As you can see, the default uses only data classes, and will simply create a new record for the configured MIME type when used.

Different loaders can be configured in the initialiser file for Talia, usually app/initializers/talia.rb. Example:

TaliaCore::DataTypes::MimeMapping.add_mapping(:jpeg, TaliaCore::DataTypes::ImageData, :create_iip)

This configures the loader to use the create_iip loader for the MIME type "jpeg". If a new data object is determined to be a jpeg, the system will call TaliaCore::DataTypes::ImageData.create_iip with the parameters described above.
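
The two kinds of loader can be illustrated with a self-contained sketch. SketchMimeMapping and SketchRecord are hypothetical stand-ins for MimeMapping and the data record classes; the argument lists of create_from_file and create_from_data are assumptions, and only the four loader parameters (mime_type, location, source, is_file) come from the description above.

```ruby
# Stand-in data record class with a plain API and one loader method.
class SketchRecord
  attr_reader :origin

  def create_from_file(location, path)
    @origin = [:file, location, path]
    self
  end

  def create_from_data(location, data)
    @origin = [:data, location, data]
    self
  end

  # A "loader method": receives the four parameters described above.
  def self.create_special(mime_type, location, source, is_file)
    [:special, mime_type, location, is_file]
  end
end

# Stand-in for the MIME mapping: a class, optionally with a loader method.
class SketchMimeMapping
  def initialize
    @loaders = {}
  end

  def add_mapping(mime_type, record_class, loader_method = nil)
    @loaders[mime_type] = [record_class, loader_method]
  end

  def create_record(mime_type, location, source, is_file)
    record_class, loader_method = @loaders.fetch(mime_type)
    if loader_method
      # Class plus loader method: hand everything over to the loader
      record_class.public_send(loader_method, mime_type, location, source, is_file)
    elsif is_file
      record_class.new.create_from_file(location, source)
    else
      record_class.new.create_from_data(location, source)
    end
  end
end

mapping = SketchMimeMapping.new
mapping.add_mapping(:xml, SketchRecord)
mapping.add_mapping(:jpeg, SketchRecord, :create_special)
p mapping.create_record(:xml, "doc.xml", "<doc/>", false).origin
p mapping.create_record(:jpeg, "scan.jpg", "/tmp/scan.jpg", true)
```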

Additional Source types

The Talia Core contains some additional source types, which are in lib/talia_core/source_types. E.g. there are types for DcResource and MarcontResource, which define source types with special fields for particular types of resources.

Workflow

There is some remaining code for a "workflow" feature in the core. This may be pulled soon, and possibly re-implemented only when needed.