Introduction
Metadata collections
Configuring collections
Installation
Administration
Starting and stopping the daemon
Sending update requests
Logs
Database housekeeping
The Update daemon keeps the ACIS database tables up-to-date with the metadata collections that an ACIS service uses. Ivan Kurmanov also sometimes calls it “RI daemon”. RI stands for RePEc-Index, which is a historical name for the modules behind the daemon.
The daemon reads data about documents, software components, series, and institutions from the collections’ data files and inserts it into corresponding tables. If a record disappears from the underlying metadata collection, the RI daemon deletes corresponding data from the tables.
The RI daemon is called a daemon because it does not do anything of its own accord. It sits and waits for somebody to send it a request. Update requests are, basically, orders for work. The only way the daemon communicates with the world is by writing a log.
Sometimes ACIS itself sends an update request to the daemon. Sometimes you might want to send an update request to the daemon.
Collections are how ACIS thinks about your metadata. You may have zero, one or more separate collections. If you have zero, ACIS won’t be of much use. Each collection has an identifier and a type, and is stored in data files somewhere in the local file system. These things are specified in the collections’ configuration (see below).
Collection identifiers must be unique for any particular ACIS service installation. “ACIS” is a reserved collection identifier.
Collection type must be “AMF” if your collection will consist of AMF files. (For RePEc’s ReDIF it is “RePEcRec”.) Other collection types may be defined in the future.
A collection consists of any number of data files with stable names, grouped into any directory structure that fits. The directory structure must not contain circular symbolic links. The update daemon must have file-system access to the files and directories of the collection.
Each file of a collection may contain zero, one or many data records. Each data record must have a globally unique identifier. Identifiers are treated case-insensitively: they must still be unique after being lowercased.
If two or more records in a collection have the same identifier, the update daemon excludes them all from the ACIS database. All other records are processed and their data is saved to the ACIS database.
For AMF collections only files with the .amf.xml name extension (case-insensitive) are treated as data files. Other files are ignored.
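To see which files in an AMF collection directory would count as data files under this rule, something like the following may help. It is only a sketch: list_amf_files is a made-up helper name, not part of ACIS, and it relies on the -iname test that GNU and BSD find provide.

```shell
#!/bin/sh
# List files under a collection directory whose names end in .amf.xml,
# ignoring case, mirroring the data-file rule described above.
# list_amf_files is a hypothetical helper, not part of ACIS.
list_amf_files() {
    find "$1" -type f -iname '*.amf.xml' | sort
}
```

Calling list_amf_files with the path to a collection prints one matching file per line; anything it does not print would be ignored.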
If the data of a collection has changed, you want ACIS to check and process it to reflect these changes in the database. The bin/updareq utility is for this. If your collection changes often, you will want to run this utility at regular intervals.
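For regular runs, a small wrapper script called from cron is one option. The sketch below is an illustration only: ACIS_HOME, UPDAREQ and request_updates are names invented here, /opt/ACIS is just an example home directory, and it assumes that a PATH of “/” covers a whole collection.

```shell
#!/bin/sh
# Request an update of every collection named on the command line.
# ACIS_HOME, UPDAREQ and request_updates are illustrative names.
ACIS_HOME=${ACIS_HOME:-/opt/ACIS}
UPDAREQ=${UPDAREQ:-$ACIS_HOME/bin/updareq}

request_updates() {
    for id in "$@"; do
        # Assumption: a PATH of "/" stands for the whole collection.
        "$UPDAREQ" "$id" / || echo "update request for $id failed" >&2
    done
}
```

A cron job could then source this file and run request_updates with the identifiers of the collections that change often.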
You configure the collections with the metadata-collections parameter of the main.conf file. For each collection you put the collection identifier into this parameter, and you specify its type and the path to it with the metadata-X-type and metadata-X-home parameters, where X is the collection identifier.
If you have several collections, separate their identifiers in metadata-collections with a space.
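As an illustration, a setup with two collections might look like the fragment printed below. This is a sketch under assumptions: the identifiers bliss and repec and both paths are made up, and the name=value line syntax is assumed rather than taken from this document, so check it against your existing main.conf.

```shell
#!/bin/sh
# Print an example main.conf fragment for two hypothetical collections.
# Assumption: main.conf takes one name=value pair per line.
example_collections_conf() {
    cat <<'EOF'
metadata-collections=bliss repec
metadata-bliss-type=AMF
metadata-bliss-home=/data/bliss
metadata-repec-type=RePEcRec
metadata-repec-home=/data/repec
EOF
}

example_collections_conf
```

Each identifier listed in metadata-collections gets its own pair of type and home parameters.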
“ACIS” is a reserved collection identifier. The system uses it for its own generated data.
You use collection identifiers when you request a data update of a collection with the bin/updareq utility.
Right now ACIS understands data in two metadata formats: RePEc’s ReDIF and AMF. Support for other metadata formats and collection structures can be added. It is not very difficult to develop, provided the data is file-based or has a pretty simple way to map records to files.
The parsed collections configuration is written into the {HOME}/RI/collections file.
Normally, the update daemon is installed when you install ACIS and no special installation is needed.
If you need a separate installation of the update daemon or if you want to manually upgrade the update daemon, you may follow these instructions. Grab the latest RePEc-Index package from http://acis.openlib.org/code/. You unpack the package and then run:
$ RePEc-Index-0.XX/install.sh HOME
where HOME is the path to your ACIS home directory (or the directory you want to install the daemon to).
The actual daemon script will be installed to bin/control_daemon.pl. But you do not normally need it. ACIS includes everything you need to work with it, like scripts to start and stop the daemon and to send update requests to it.
I suggest that for a serious ACIS installation you make a private copy of Berkeley DB. This will protect you from system-wide software updates. Read why and how to do that.
Use bin/rid start and bin/rid stop to start and stop the update daemon, respectively.
It is best to keep the daemon running all the time while ACIS is working. If for some reason it was not running for some time, it is not a catastrophe. Some update requests may be lost, but generally you can recover by running “bin/updareq ACIS /”.
bin/updareq COLLECTION PATH [TOO_OLD]
This requests an update of PATH in collection COLLECTION. TOO_OLD is a time in seconds.
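Because TOO_OLD is given in seconds, a value thought of in days has to be multiplied by 86400. A minimal sketch of that arithmetic, with days_to_seconds being a helper name made up for this example:

```shell
#!/bin/sh
# Convert a number of days into seconds, for use as TOO_OLD.
# days_to_seconds is a hypothetical helper, not part of ACIS.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

days_to_seconds 12   # 86400*12 = 1036800 seconds
```

With this, a request such as bin/updareq bliss / "$(days_to_seconds 2)" would pass a TOO_OLD of two days; bliss stands in for one of your own collection identifiers.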
If a file was last processed more than TOO_OLD seconds ago, the daemon will process it again (even if it has not changed since). By default, TOO_OLD is 86400*12 seconds, which means 12 days.
The main log of the update daemon is RI/daemon.log. It is the general log of incoming requests and of which processing channel took each one.
All details of processing particular requests go into the logs RI/update_ch0.log, RI/update_ch1.log, … RI/update_ch5.log. Each of these corresponds to a processing channel.
These logs record what is being done, what files are read, what records are found in them and so on. If there were any problems with processing data, they will be logged there.
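When looking for a problem it can be handy to view the tail of all these logs at once. The following is a rough sketch, assuming the RI directory sits directly under the ACIS home directory; show_recent_logs is an invented name, not an ACIS tool.

```shell
#!/bin/sh
# Print the last lines of the main log and of each channel log.
# show_recent_logs is a hypothetical helper; $1 is the ACIS home
# directory and $2 the number of lines per log (default 20).
show_recent_logs() {
    home=$1
    lines=${2:-20}
    for log in "$home"/RI/daemon.log "$home"/RI/update_ch[0-5].log; do
        [ -f "$log" ] || continue      # skip logs that do not exist yet
        printf '== %s ==\n' "$log"
        tail -n "$lines" "$log"
    done
}
```

For example, show_recent_logs /opt/ACIS 50 would print up to 50 lines from each log that exists, each preceded by a header naming the file.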
Recent versions of the update daemon use the Berkeley DB Transactional Data Store for their database. This makes the daemon a little slower than with plain file data storage, but it gives us parallel processing and great fault protection.
The database files are stored in {HOME}/RI/data. General files (the so-called “database environment”) are stored in this directory, and files for specific collections are stored in subdirectories of it. For example, data for a collection called “bliss” would live in {HOME}/RI/data/bliss.
When data is added or modified in the database, the Berkeley DB library creates sequentially numbered log files, e.g. RI/data/log.0000000001. If you have lots of data going in and getting modified in the database, the number of these files can grow fast and they will occupy a really huge amount of disk space.
To solve this you need to do two things:
initiate checkpoints regularly;
delete unnecessary log files.
You can read about it in the Berkeley DB documentation. Basically, you can do it this way:
db_checkpoint -1 -h {HOME}/RI/data
db_archive -d -h {HOME}/RI/data
The db_checkpoint and db_archive utilities are from the Berkeley DB library package. Some care is needed here, because you may have these utilities installed system-wide, and those can be of the wrong version. Therefore, if you have installed a private copy of Berkeley DB, you’ll have to refer to these utilities by full path. For instance, in Ivan’s case, he had to use /home/ivan/lib/bdb/bin/db_checkpoint and /home/ivan/lib/bdb/bin/db_archive. So, Ivan uses a crontab entry which runs a little script like this:
#!/bin/sh
/home/ivan/lib/bdb/bin/db_checkpoint -1 -h /opt/ACIS/RI/data
/home/ivan/lib/bdb/bin/db_archive -d -h /opt/ACIS/RI/data
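A slightly more defensive variant of the same script is sketched below. It is not Ivan’s script: the variable names BDB_BIN and DB_HOME and the function name housekeep are invented here, and the default paths are simply the ones from the example above. It checks that both utilities exist where expected and only deletes log files if the checkpoint succeeded.

```shell
#!/bin/sh
# Checkpoint the database, then delete log files that are no longer needed.
# BDB_BIN, DB_HOME and housekeep are illustrative names; adjust both paths.
BDB_BIN=${BDB_BIN:-/home/ivan/lib/bdb/bin}
DB_HOME=${DB_HOME:-/opt/ACIS/RI/data}

housekeep() {
    for tool in db_checkpoint db_archive; do
        if [ ! -x "$BDB_BIN/$tool" ]; then
            echo "missing $BDB_BIN/$tool" >&2
            return 1
        fi
    done
    # Only remove old log files if the checkpoint succeeded.
    "$BDB_BIN/db_checkpoint" -1 -h "$DB_HOME" &&
        "$BDB_BIN/db_archive" -d -h "$DB_HOME"
}
```

Pointing BDB_BIN at the private Berkeley DB copy keeps a system-wide db_checkpoint or db_archive of the wrong version from being picked up by accident.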
Here are some documentation links, just in case you need to know more:
If you want to know more about the daemon, you may look into internals doc.
Generated: Wed Aug 29 22:59:09 2007
ACIS project, acis@openlib.org