Update daemon

Table of contents

   Introduction
   Metadata collections
      Configuring collections
   Installation
   Administration
      Starting and stopping the daemon
      Sending update requests
      Logs
      Database housekeeping

Introduction

The Update daemon keeps the ACIS database tables up-to-date with the metadata collections that an ACIS service uses. Ivan Kurmanov also sometimes calls it “RI daemon”. RI stands for RePEc-Index, which is a historical name for the modules behind the daemon.

The daemon reads data about documents, software components, series, and institutions from the collections’ data files and inserts it into corresponding tables. If a record disappears from the underlying metadata collection, the RI daemon deletes corresponding data from the tables.

The RI daemon is called a daemon because it does not do anything by its own will. It sits there and waits for somebody to send it a request. Update requests are, basically, orders for work. The only way the daemon communicates with the world is by writing a log.

Sometimes ACIS itself sends an update request to the daemon. Sometimes you might want to send an update request to the daemon.

Metadata collections

Collections is how ACIS thinks about your metadata. You may have zero, one or more separate collections. If you have zero, ACIS won’t be of much use. Each collection has an identifier, a type and is stored in data files somewhere in the local file system. These things are specified in the collections’ configuration (see below).

Collection identifiers must be unique for any particular ACIS service installation. “ACIS” is a reserved collection identifier.

Collection type must be “AMF” if your collection will consist of AMF files. (For RePEc’s ReDIF it is “RePEcRec”.) Other collection types may be defined in the future.

A collection consists of any number of data files with stable names, grouped into any directory structure that fits. The directory structure must not contain circular symbolic links. The update daemon must have file-system access to the files and directories of the collection.

Each file of a collection may contain zero, one or many data records. Each data record must have a globally unique identifier. Identifiers are treated in case-insensitive manner. Having been lowercased they still must be unique.

If two or more records in a collection have the same identifier, the update daemon excludes them all from the ACIS database. All other records are processed and their data is saved to the ACIS database.

For AMF collections only files with .amf.xml name extension (case-insensitive) are treated as data files. Other files are ignored.

If data of a collection has changed, you want ACIS to check and process it to reflect these changes in the database. Use bin/updareq utility is for this. If your collection changes often, you will want to run this utility at regular intervals.

Configuring collections

You configure the collections with metadata-collections parameter of the main.conf file. For each collection you put the collection identifier into this parameter and you specify its type and path to it with metadata-X-type and metadata-X-home parameters, where X is the collection identifier.

If you have several collections, separate their identifiers in metadata-collections with a space.

ACIS” is a reserved collection identifier. The system uses it for its own generated data.

You use collection identifiers when you request a data update of a collection with bin/updareq utility.

Right now ACIS understands data in two metadata formats: RePEc’s ReDIF and AMF. Support for other metadata formats and collection structures can be added. It is not very difficult to develop, provided the data is file-based or has a pretty simple way to map records to files.

The parsed collections configuration is written into the {HOME}/RI/collections file.

Installation

Normally, the update daemon is installed when you install ACIS and no special installation is needed.

If you need a separate installation of the update daemon or if you want to manually upgrade the update daemon, you may follow these instructions. Grab the latest RePEc-Index package from http://acis.openlib.org/code/. You unpack the package and then run:

$ RePEc-Index-0.XX/install.sh HOME

where HOME is the path to your ACIS home directory (or the directory you want to install the daemon to).

The actual daemon script will be installed to bin/control_daemon.pl. But you do not normally need it. ACIS includes everything you need to work with it, like scripts to start and stop the daemon and to send update requests to it.

I suggest that for a serious ACIS installation you make a private copy of Berkeley DB. This will protect you from system-wide software updates. Read why and how to do that.

Administration

Starting and stopping the daemon

Use bin/rid start and bin/rid stop to start and stop update daemon, respectively.

You better run the daemon all the time while ACIS is working. If for some reason it were not running for some time, it is not a catastrophe. Some update requests may be lost, but generally you can recover by running “bin/updareq ACIS /”.

Sending update requests

bin/updareq COLLECTION PATH [TOO_OLD]
Sends an update request to the daemon. It asks to update file or directory PATH in collection COLLECTION. TOO_OLD is time in seconds. If a file was last time processed more than TOO_OLD seconds ago, the daemon will process it again (even if it didn’t change since). By default, TOO_OLD is 86400*12 seconds, which means 12 days.

Logs

The main log of the update daemon is RI/daemon.log. It is the general log of requests coming in and what processing channel took it for processing.

All details of processing particular requests go into logs RI/update_ch0.log, RI/update_ch1.log, … RI/update_ch5.log. Each of these correspond to a processing channel.

These logs protocol what is being done, what files are read, what records are found in them and so on. If there were any problems with processing data, it will be logged in there.

Database housekeeping

Recent versions of update daemon use Berkeley DB Transactional Data Store for its database. This causes it to work a little slower, when compared to plain file data storage, but it gives us parallel processing feature and great fault-protection.

The database files are stored in {HOME}/RI/data. General files (so called “database environment”) are stored in this directory, and files for specific collections are stored in subdirectories of it. For example, data a for collection called “bliss” would live in {HOME}/RI/data/bliss.

When data is added or modified in the database, the Berkeley DB library creates sequentially numbered log files, e.g. RI/data/log.0000000001. If you have lots of data going in and getting modified in the database, number of these files can grow fast and they will occupy a real huge amount of disk space.

To solve this you need to do two things:

  1. initiate checkpoints regularly;

  2. delete unnecessary log files.

You can read about it in the Berkeley DB documentation. Basically, you can do it this way:

db_checkpoint -1 -h {HOME}/data
db_archive -d -h {HOME}/RI/data

db_checkpoint and db_archive utilities are from the Berkeley DB library package. Some care you need here, because you may have these utilities installed system-wide. And those can be of wrong version.
Therefore, if you have installed a private copy of Berkeley DB, then you’ll have to refer to these utilities by full path.

For instance, in Ivan’s case, he had to use /home/ivan/lib/bdb/bin/db_checkpoint and /home/ivan/lib/bdb/bin/db_archive. So, Ivan uses a crontab entry which runs a little script like this:

 #!/bin/sh
 /home/ivan/lib/bdb/bin/db_checkpoint -1 -h /opt/ACIS/RI/data
 /home/ivan/lib/bdb/bin/db_archive -d -h /opt/ACIS/RI/data

Here are some documentation links, just in case you need to know more:


If you want to know more about the daemon, you may look into internals doc.