Importing data and creating the editions for the Discovery sites

This page describes in all gory detail how to find the data files and import them into the Discovery installations (which means that this information is only interesting for the Discovery webmasters).

Where are the files found

First of all, let's see where the import XML files are stored when created (usually by Paolo):

ILIESI sites

These are the modern, socratics, presocratics and laertius sites on the daphnet server: the XML files are found in

/Library/WebServer/Documents/fabrica2/export

Here you will find 4 directories, one per site, each containing the import files. In the modern case, since it also has a facsimile edition (as opposed to having only text editions, like the other 3 sites), you will find two directories, one for texts and the other for facsimiles.

The facsimile images themselves, which you need when you have to create their pyramidal versions, are found in

/Library/WebServer/Documents/fabrica2/facsimiles

Here you will find some directories, each named after the author the facsimiles belong to (e.g.: Vico).

(FYI, the XML/HTML files containing the transcriptions are found in /Library/WebServer/Documents/fabrica2/texts. You usually won't use them directly, but they may be useful for testing purposes.)

Oxford sites

NietzscheSource and WittgensteinSource: bad news, the files are stored on a different machine (known as http://mini.maison.ox.ac.uk or 163.1.59.101 on the internet), because that is the server running Fabrica (for both the Nietzsche and Wittgenstein scholar groups). Up to now, the process has been to copy the files from http://mini.maison.ox.ac.uk to the actual server (known as http://www.nietzschesource.org or 163.1.59.100 or... other names) and to process them there. Please note that the facsimiles are also found on the "mini" and must be moved to the actual server before processing them (for creating the pyramidal versions and importing them).

On http://mini.maison.ox.ac.uk, you will find a couple of directories named "critical" and "facsimile" in

/Library/WebServer/Documents/Nietzsche-Export 

They contain, respectively, the XML files for the text edition and for the facsimile edition.

On the same machine, but in

/Library/WebServer/Documents/Witt-Export

there are a couple of directories named "text" and "fax". They contain the XML files for the text edition and for the facsimile edition of Wittgenstein, respectively.

/Library/WebServer/Documents/fabrica2

contains, among others, the two "texts" and "facsimiles" directories.

/Library/WebServer/Documents/fabrica2/texts 

contains two directories, one for nietzschesource and the other for wittgensteinsource. "BTE" (which stands for "Bergen Text Edition") contains the transcriptions of Wittgenstein's books. The other, "eKGWB" (a German abbreviation whose expansion is not recorded here), contains the transcriptions of Nietzsche's books. You usually won't use these files directly, as they are used by the importers at import time, but you may be interested in them for testing purposes.

/Library/WebServer/Documents/fabrica2/facsimiles 

also contains two directories, one for nietzschesource and the other for wittgensteinsource. "BFE" ("Bergen Facsimile Edition") contains the facsimiles of Wittgenstein's books. "DFGA" contains the facsimiles of Nietzsche's books. Unfortunately, you will need to move these files to the other server (163.1.59.100) in order to create their pyramidal versions and to import them.
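The copy from the "mini" to the actual server could be sketched as follows, assuming rsync over SSH is available on both machines. The account name (admin) and the destination (/path/to/imports) are placeholders, not real settings. The function only prints the commands, so you can review them before running anything.

```shell
#!/bin/sh
# Sketch: print the rsync commands that would pull the export and
# facsimile directories from the "mini" to the current server.
# SRC_USER and DEST are placeholders (assumptions): adjust them.
SRC_HOST=163.1.59.101
SRC_USER=admin               # hypothetical account name
DEST=/path/to/imports        # pseudo path

print_copy_cmds() {
    for dir in \
        /Library/WebServer/Documents/Nietzsche-Export \
        /Library/WebServer/Documents/Witt-Export \
        /Library/WebServer/Documents/fabrica2/facsimiles
    do
        # -a keeps permissions and timestamps, -v lists the copied files
        echo "rsync -av $SRC_USER@$SRC_HOST:$dir $DEST/"
    done
}

print_copy_cmds
```

Since the facsimiles are large, it is probably wise to run the real copy inside a screen as well.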

How to import and create editions

Now, let's see how to import these data into Talia.

All the commands found below are supposed to be run inside the talia.sh shell. Also, for your own peace of mind, you should use screen.

Screen manual here: http://sunsite.ualberta.ca/Documentation/Gnu/screen-3.9.4/html_chapter/screen_toc.html

So a typical session will start with (let's pretend we are importing nietzschesource data, we are root here):

## start screen
sh-3.2# screen -h 5000
## move to the desired directory (for example /Library/WebServer/nietzsche_source/)
sh-3.2# cd /Library/WebServer/<discovery source>/
## set this var so we'll work with the production environment
sh-3.2# export RAILS_ENV=production
## open the talia shell
sh-3.2# ./talia.sh shell

Opening JRuby environment for Talia (using /bin/sh)...
talia-bash /Library/WebServer/<discovery source> >> 

Now we are ready.
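If you tend to forget the screen step, a small guard like the one below could be run before opening the talia shell. It is only a sketch and relies on the fact that GNU screen sets the STY environment variable in the shells it starts.

```shell
#!/bin/sh
# Sketch: warn when the current shell is not running inside GNU screen.
# GNU screen exports STY in every window it creates.
in_screen() {
    [ -n "${STY:-}" ]
}

if in_screen; then
    echo "inside screen session: $STY"
else
    echo "not inside screen: start one with 'screen -h 5000' first"
fi
```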

First of all, if you want to completely delete the old data in the installation, you must perform a "clear_all". (Of course you'll lose all the data in there, so pay attention!)

rake discovery:clear_all

Let's see an example, now.

This is what was done last time we imported ModernSource's data:

rake discovery:clear_all;

jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/modern/text/Modern\ Source-pages.xml; 
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/modern/text/Modern\ Source-texts.xml; 

jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Bruno nick=Bruno header=modern catalog=Bruno; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Kant nick=Kant header=modern catalog=Kant; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Baumgarten nick=Baumgarten header=modern catalog=Baumgarten; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Descartes nick=Descartes header=modern catalog=Descartes; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Vico nick=Vico header=modern catalog=Vico; 

rake talia_core:setup_ontologies ontology_folder=ontologies/modern --trace; rake talia_core:owl_to_rdfs_update --trace;

First, all data are wiped away. Then the two import files are imported. Then the editions are created. See how ModernSource, having more than one catalog, needs several create_critical_edition invocations, one per catalog. These will be found as links in the home page of the site.
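The "one invocation per catalog" part can also be generated instead of typed. The helper below is a sketch (edition_cmds is a made-up name, not part of Talia) and assumes, as in the example above, that name, nick and catalog are the same word. It only prints the command lines, it does not run them.

```shell
#!/bin/sh
# Sketch: print one create_critical_edition command per catalog.
# Assumes name, nick and catalog are identical, as in the example above.
edition_cmds() {
    header=$1
    shift
    for catalog in "$@"; do
        printf '%s\n' "jruby -J-Xmx2000m \`which rake\` discovery:create_critical_edition name=$catalog nick=$catalog header=$header catalog=$catalog;"
    done
}

edition_cmds modern Bruno Kant Baumgarten Descartes Vico
```

Editions whose nick or name differ from the catalog name still have to be written by hand.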

The last line reloads the ontologies into the DB. For some reason they are lost during imports/edition creations.

Please note that you will need to restart tomcat at this point.

To restart tomcat, just kill the process (it will be restarted automatically):

kill `cat /Library/Tomcat/logs/tomcat.pid`

After a while, tomcat will restart. In a minute or so you will be able to see the site running again (with the new data).
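A slightly more defensive variant of the kill command is sketched below: it checks that the pid file exists and is not empty before killing anything. The function name and the dry-run argument are inventions for this example.

```shell
#!/bin/sh
# Sketch: kill Tomcat only if the pid file exists and is not empty.
# Pass "echo" as second argument for a dry run that prints the command.
restart_tomcat() {
    pidfile=${1:-/Library/Tomcat/logs/tomcat.pid}
    runner=${2:-}
    if [ ! -s "$pidfile" ]; then
        echo "no pid file at $pidfile, is tomcat running?" >&2
        return 1
    fi
    $runner kill "$(cat "$pidfile")"
}

# Dry run: prints the kill command instead of executing it.
restart_tomcat /Library/Tomcat/logs/tomcat.pid echo || true
```

Calling restart_tomcat without arguments performs the real kill on the default pid file.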

Let's now analyze each installation on its own.

ModernSource

ModernSource has both critical and facsimile editions.

Please note that when you are importing facsimile data, it's very important that you use the "prepared_images" parameter, as otherwise the original images are not copied to the right place (and you won't be able to use them in the site).

rake discovery:clear_all;

jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/modern/text/Modern\ Source-catalogues.xml;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/modern/text/Modern\ Source-pages.xml; 
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/modern/text/Modern\ Source-texts.xml; 
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/modern/text/Modern\ Source-users.xml;

jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Bruno nick=Bruno header=modern catalog=Bruno; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Kant nick=Kant header=modern catalog=Kant; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Baumgarten nick=Baumgarten header=modern catalog=Baumgarten; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Descartes nick=Descartes header=modern catalog=Descartes; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Vico nick=Vico header=modern catalog=Vico; 

jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/modern/Facsmile/Modern\ Source-pages.xml ; 
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/modern/Facsmile/Modern\ Source-fax.xml prepared_images=/Users/iliesi/prepared_images/;
jruby -J-Xmx2000m `which rake` discovery:create_color_facsimile_edition nick='VicoFacsimile' name='Vico Facsimile' header=modern catalog=Vico;

rake talia_core:setup_ontologies ontology_folder=ontologies/modern --trace; rake talia_core:owl_to_rdfs_update --trace;

Of course these commands are valid provided that:

  • the XML import files are stored in the /Library/WebServer/Documents/fabrica2/export/etc directory
  • the pyramidal images are stored in /Users/iliesi/prepared_images/

If this is not the case, please change the commands accordingly.
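Since a wrong path is only discovered once the import reaches it (possibly hours later), it may be worth checking the inputs first. The following is a sketch; check_inputs is a hypothetical helper, and the two paths are simply taken from the ModernSource commands above.

```shell
#!/bin/sh
# Sketch: report which of the given files or directories are missing.
# Returns non-zero if at least one path does not exist.
check_inputs() {
    missing=0
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example with two of the ModernSource paths used above.
check_inputs \
    "/Library/WebServer/Documents/fabrica2/export/modern/text/Modern Source-pages.xml" \
    /Users/iliesi/prepared_images/ \
    && echo "all inputs found" || echo "fix the paths before importing"
```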

Socratics

rake discovery:clear_all; 

jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/socratics/Socratics\ Source-pages.xml ;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/socratics/Socratics\ Source-texts.xml ; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition nick=Socratics name=Socratics catalog=Socratics header=socratics

rake talia_core:setup_ontologies ontology_folder=ontologies/socratics --trace; rake talia_core:owl_to_rdfs_update --trace

Presocratics

rake discovery:clear_all;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/presocratics/Presocratics\ Source-pages.xml ;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/presocratics/Presocratics\ Source-texts.xml ;
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition name=Presocratics nick=Presocratics catalog=Presocratics header=presocratics; 

rake talia_core:setup_ontologies ontology_folder=ontologies/presocratics --trace; rake talia_core:owl_to_rdfs_update --trace;

Laertius

rake discovery:clear_all; 

jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/laerzio/Laerzio\ Source-pages.xml ;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/Library/WebServer/Documents/fabrica2/export/laerzio/Laerzio\ Source-texts.xml ;  
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition nick=Laertius name=Laertius header=laerzio catalog=Laertius;

rake talia_core:setup_ontologies ontology_folder=ontologies/laertius --trace; rake talia_core:owl_to_rdfs_update --trace;

With this, the ILIESI platforms are covered. Let's now move on to the 163.1.59.100 server, where NietzscheSource and WittgensteinSource are found.

Please note that, since the import files and the facsimile files are stored on a different server (163.1.59.101), we need to move all these files somewhere on the 163.1.59.100 one. Since I cannot tell where you are going to move these files, I will use pseudo paths here. You will need to adjust them to fit the actual locations, sorry.

WittgensteinSource

rake discovery:clear_all;

jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/path/to/imports/text/Wittgenstein%20Source-pages.xml;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/path/to/imports/text/Wittgenstein%20Source-texts.xml;

jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/path/to/imports/fax/Wittgenstein%20Source-pages.xml ; 
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/path/to/imports/fax/Wittgenstein%20Source-fax.xml prepared_images=/Users/daniel/prepared-BFE/;

jruby -J-Xmx2000m `which rake` discovery:create_critical_edition nick=BTEn name="Bergen Text Edition | Normalized" header=wittgenstein_dark catalog=BTE version=norm --trace; 
jruby -J-Xmx2000m `which rake` discovery:create_critical_edition nick=BTEd name="Bergen Text Edition | Diplomatic" header=wittgenstein_dark catalog=BTE version=dipl --trace ; 
jruby -J-Xmx2000m `which rake` discovery:create_color_facsimile_edition nick=BFE name="Bergen Facsimile Edition" header=wittgenstein_dark catalog=BFE --trace;

rake talia_core:setup_ontologies ontology_folder=ontologies/wittgenstein --trace; rake talia_core:owl_to_rdfs_update --trace

Again, the pyramidal images are supposed to be found at "/Users/daniel/prepared-BFE/" here. If they are not, please change the related command accordingly.

NietzscheSource

rake discovery:clear_all;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/path/to/imports/texts/Nietzsche%20Source-pages.xml;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/path/to/imports/texts/Nietzsche%20Source-texts.xml;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/path/to/imports/fax/Nietzsche%20Source-pages.xml;
jruby -J-Xmx2000m `which rake` discovery:xml_import xml=/path/to/imports/fax/Nietzsche%20Source-fax.xml prepared_images=/Users/daniel/prepared/;

jruby -J-Xmx2000m `which rake` discovery:create_critical_edition nick=eKGWB name=eKGWB header=nietzsche_yellow catalog=eKGWB;
jruby -J-Xmx2000m `which rake` discovery:create_color_facsimile_edition nick=DFGA name=DFGA header=nietzsche_blue catalog=DFGA;

rake talia_core:setup_ontologies ontology_folder=ontologies/nietzsche --trace; rake talia_core:owl_to_rdfs_update --trace

Final Notes

First of all, do you see the semicolons (;) at the end of each line? They are there because you really want to run all the commands on a single line (separated by semicolons). The import and edition creation processes are very long, so you usually want to run them all together (in a screen!) and check on them from time to time (or more probably go to bed and forget about them until the day after ;) ).

These are the times I tracked the last time nietzschesource's data were imported:

KGW-pages: 4 hours (~16000 items)

KGW-texts: 6 hours (~15500 items)

KGW-critical_edition_creation: 50 hours

DEF-pages: 15 hours (14336 items)

DEF-fax: 17 hours (~14000 items)

DEF-facsimile_edition_creation: unknown (14000 items; the server was shut down during the import, doh! I don't have this data at hand, sorry)

As you can see, these are very long processes, and you don't want to make them even longer by having them wait for you to "hit the enter key"! So just run all the commands on one line, and wait...
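To collect timings like the ones above without watching the terminal, each step can be wrapped in a small function that logs its exit status and duration. This is only a sketch: run_step and the log file name are inventions, and "date +%s" is assumed to be supported (it is on both GNU and BSD/macOS date).

```shell
#!/bin/sh
# Sketch: run one step, then append its exit status and duration
# (in seconds) to a log file. The log file name is an assumption.
LOG=${LOG:-import-times.log}

run_step() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: exit $status, $((end - start))s" >> "$LOG"
    return "$status"
}

# Hypothetical usage inside the talia shell, still on one line:
#   run_step clear rake discovery:clear_all; run_step ontologies rake talia_core:setup_ontologies ontology_folder=ontologies/modern --trace
```

With semicolons the chain keeps going even if a step fails, so the logged exit status is the quickest way to spot a broken step the morning after.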

Backing up

Once the processes have ended, and everything is imported and the editions are created, you may (should!) want to back up the data just imported. A script for backing up/restoring is available at lib/script/quick_backup in each installation.

For creating a backup, you can use the following command line (we are, of course, inside a talia.sh shell here, and possibly in a screen):

jruby lib/script/quick_backup backup -data -iip

This will start the backup process (it will ask you for the MySQL root user's password), which backs up the whole database, all the data files on the filesystem (facsimiles, XML transcriptions, etc.), and also the iip pyramidal images. That is everything.

You can invoke the command without the -data or the -iip parameter (or both): it will then skip the data files or the iip images respectively (without any of those parameters, only the databases are backed up).

The process will create a (quite large) tgz file, whose name contains the date and time of creation, containing all the backed up files.

To restore a backup, you will first need to un-tar the backup file e.g.:

tar zxvf backup-20090417-164755.tgz

After it has been expanded, you will be able to restore it (WARNING: of course this will overwrite the data you have in your installation, meaning that you may mess it up and end up losing files if you don't know what you are doing!). This command:

jruby lib/script/quick_backup restore backup-20090417-164755

will restore the backup contained in the directory backup-20090417-164755 (which was created by the tar zxvf backup-... command above).
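Given how large the archives are, it may be worth checking that one is readable before unpacking and restoring it. A minimal sketch follows (check_backup is a made-up helper); "tar tzf" only lists the archive contents and extracts nothing.

```shell
#!/bin/sh
# Sketch: check that a backup archive is readable before unpacking it.
# "tar tzf" lists the contents without extracting anything.
check_backup() {
    archive=$1
    if tar tzf "$archive" > /dev/null 2>&1; then
        echo "ok: $archive"
    else
        echo "unreadable: $archive" >&2
        return 1
    fi
}

# Example (archive name taken from the text above):
#   check_backup backup-20090417-164755.tgz && tar zxvf backup-20090417-164755.tgz
```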