Interoperability with document submission services

Table of contents

   Introduction and cooperation levels
   Level 1. Export document metadata
      Generating AMF
      Configuring a collection in ACIS
   Level 2. Personal identification fields in document metadata
   Level 3. Personal identification aid
      The tools
         ACIS’ names and records database table
         ACIS::PIDAID module
         ACIS::PIDAID::XML module
   Level 4. Metadata update request
      Logic of interaction
      The /meta/update screen
         Configuration
         Processing requests (Implementation technical details)
         Response
      Data mirroring function (in perl)
      Update request helper module ACIS::MetaUpdate::Request
         Configuration
         The acis_metaupdate_request( OBJECT, [PARAMETERS] ) function

Introduction and cooperation levels

The ACIS stage 3 plan (“Saskatoon”) sets the development goal of making ACIS cooperate with external document submission services. The aim is to include newly submitted documents in the corresponding personal profiles quickly and automatically, so that users do not have to do it manually. At the same time it will support the creation of richer metadata, where the document’s and the author’s records are connected with each other.

Several levels of cooperation between an ACIS service and a submission service are possible. The higher the level, the easier the user experience will be and/or the more precise and complete the collected metadata will be.

Level 0: A submission service and an ACIS service exist and function completely independently. There’s no data exchange between them. Nothing interesting happens.

Level 1: A submission service provides all of its document metadata to an ACIS system. The data is in a stable simple directory and file structure and is in a compatible format. ACIS processes it on a regular basis, notices changed/new records and adds them to its database just as any other data (that is coming from any other source). Compatible metadata formats for ACIS are AMF and ReDIF.

This requires some transport method for the metadata files to get them to the ACIS machine, but that’s out of scope of ACIS.

Level 2: A submission service includes personal identification data into its author and editor sections of the document metadata. Personal identification data is one of:

If document metadata contains such data, ACIS can uniquely identify the document’s authors and editors. It will include the document into their personal research profiles. A feature called ARPU — automatic research profile update — makes this possible.

In a submission service such data can only come from users. So on this level the service must allow its users to enter such information and it must include it into the generated metadata.

Level 3: A submission service helps its users to find personal identification data and include it into the document metadata.

When a user fills in author/editor info for a document being submitted, the document service may search a database of the ACIS-registered persons. If it finds any personal records which match the user input, it offers them for the user to choose from. If the user accepts one of the personal records, the chosen record’s identifier is included in the document metadata. Document metadata is then exported to ACIS in the usual way.

This level requires two things. First, ACIS needs to expose its personal records database to the submission service. Second, the submission service must search the database, display the options in an unobtrusive way, and give the user an easy method to choose an option.

The Level 3 section of the EPrints-related stuff in ACIS describes in detail how the user interface of a document submission service can be enhanced.

Level 4: The submission service immediately notifies an ACIS service about each new document submission. The ACIS service requests this data in real time over the network (HTTP or FTP), stores it locally and processes it. If known personal identifiers are found in this data, ACIS notices them and the works are added to the corresponding personal profiles in a matter of minutes.

Usage of the term level in this document is arguable. Only level 1 is really required for other levels. You can build level 4 (metadata update request) into a service without having levels 2-3 (person identification aid) there.

Not having a better term I stick with this one. The way the levels are listed in here is the most logical order to implement them. We first do the most simple but useful things, then less simple and less useful.

Level 1. Export document metadata

Let’s say you run a document submission service. Let’s say you want an ACIS service to consume document metadata that your service is collecting. Then you need to build a metadata collection.

Metadata collections are what the ACIS update daemon keeps track of. The update daemon is what ACIS uses to process document and other metadata that comes from outside. It monitors the data, processes it and puts it into the ACIS database.

In terms of the ACIS update daemon, a metadata collection is a set of files with stable names that contain data. Files may be grouped into an arbitrary directory structure. (Exception: no circular symbolic links are allowed.) Each file contains zero, one or more metadata records. Each record has a unique identifier. Identifiers are treated case-insensitively: they must still be unique after being lowercased.

The data records which have colliding (non-unique) record identifiers are excluded from ACIS database.
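
The uniqueness rule can be checked on the submission-service side before the files reach ACIS. Here is a minimal sketch, assuming the record identifiers have already been extracted into a list. The function name find_colliding_ids is hypothetical and is not part of ACIS.

```perl
use strict;
use warnings;

# Return the identifiers that collide once lowercased.
# Hypothetical helper, not part of ACIS.
sub find_colliding_ids {
    my @ids = @_;
    my %seen;    # lowercased id => list of original spellings
    push @{ $seen{ lc $_ } }, $_ for @ids;

    # keep only the groups that have more than one member
    return map { @{ $seen{$_} } }
           grep { @{ $seen{$_} } > 1 }
           sort keys %seen;
}

my @bad = find_colliding_ids( 'GFIO:ZXCVBN', 'gfio:zxcvbn', 'GFIO:QWERTY' );
print "colliding: @bad\n" if @bad;
```

Records reported by such a check would be the ones ACIS excludes, so it may be worth fixing them before export.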

Update daemon must have file-system access to the files of the collection. It is your responsibility to organize transport of files from your service machine to the ACIS service machine.

For a document submission service it will probably be easiest to put one record into one file.

There are two basic ways for a system to export its data into metadata files. One is to generate it on the fly as things are happening. Another way is to keep things in a database and then, when necessary, run an export script, which will query the database and create the necessary files.

In our case — in case of a document service exporting metadata to ACIS — it will be much better to generate metadata files on the fly, immediately after there’s a document record submission or change. It is especially true if you plan to implement level 4 later, because recent user action will immediately get reflected in the data. Without it level 4 will not work reliably.

It will also help ACIS, because the files of records that didn’t change will keep the same last-modified timestamp. ACIS then won’t have to read and process them every time, which is especially good if there are a lot of files.
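
One way to keep timestamps stable is to rewrite a file only when its content really changes. The following is a sketch under that assumption; write_if_changed is a hypothetical helper name, and $content is expected to be a byte string (for example, already UTF-8 encoded XML).

```perl
use strict;
use warnings;

# Write $content to $path only when it differs from what is already there,
# so that unchanged records keep their last-modified timestamp.
# Hypothetical helper; returns 1 if the file was (re)written, 0 otherwise.
sub write_if_changed {
    my ( $path, $content ) = @_;

    if ( -e $path ) {
        open my $in, '<:raw', $path or die "Cannot read $path: $!";
        local $/;                      # slurp mode, limited to this block
        my $old = <$in>;
        close $in;
        return 0 if defined $old and $old eq $content;
    }

    # write under a temporary name first, then rename, so that a reader
    # never sees a half-written file
    my $tmp = "$path.tmp";
    open my $out, '>:raw', $tmp or die "Cannot write $tmp: $!";
    print {$out} $content;
    close $out or die "Cannot close $tmp: $!";
    rename $tmp, $path or die "Cannot rename $tmp to $path: $!";
    return 1;
}
```

The rename step is a design choice, not an ACIS requirement: it simply avoids exposing partial files to whatever transports or reads the collection.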

Generating AMF

AMF is an XML-based format for academic metadata. It is documented on the amf.openlib.org website.

There is open-source Perl software for parsing and generating AMF, called AMF-perl. It is available from the code/ section of the ACIS website.

Here is a little example. It generates AMF for a document titled “New AMF text noun”, authored by Joe Spark, and having full-text in a PDF file. The id of the record is GFIO:ZXCVBN.

use AMF::Parser;
use AMF::Record;

my $new  = AMF::Record -> new( ID => 'GFIO:ZXCVBN', TYPE => 'text' );  # create a record

$new -> adjective( 'title', { 'xml:lang' => 'en' }, "New AMF text noun" );  # assign english-language title

my $person = AMF::Record -> new( TYPE => 'person' );                   # create a person record
$person -> adjective( 'givenname',  {}, "Joe"   );
$person -> adjective( 'familyname', {}, "Spark" );
$new -> verb( "hasauthor", {}, $person );                              # assign author to the record

my $file = AMF::AdjContainer -> new;                                   # create an  adjective container
$file -> adjective( 'url',    {}, "http://file.service.site/great/file.pdf" );
$file -> adjective( 'format', {}, "application/pdf" );

$new -> adjcontainer( 'file', {}, $file );                             # assign the file container to the record

my $amf = $new -> stringify;                                           # get resulting AMF/XML string

To build an actual AMF file you need one more step. AMF records always come wrapped in an amf element. It looks like this:

<amf xmlns='http://amf.openlib.org'
     xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
     xsi:schemaLocation='http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd'>

... <!-- amf records go here -->

</amf>

AMF files need to have the .amf.xml filename extension (case-insensitive) to be processed by ACIS.
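
The wrapping step and the file naming could be sketched as follows. Both function names are hypothetical, and the way a file name is derived from the record id is only an assumption: any stable name with the right extension will do.

```perl
use strict;
use warnings;

# Wrap already-serialized AMF records (e.g. results of stringify())
# into the amf document element. Hypothetical helper.
sub wrap_amf_records {
    my @records = @_;
    return join( "\n",
        q{<amf xmlns='http://amf.openlib.org'},
        q{     xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'},
        q{     xsi:schemaLocation='http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd'>},
        @records,
        q{</amf>},
    ) . "\n";
}

# Derive a file name with the required extension from a record id.
# The naming scheme is an assumption made for this example.
sub amf_filename_for {
    my ($id) = @_;
    ( my $name = lc $id ) =~ s/[^a-z0-9_.-]+/_/g;
    return "$name.amf.xml";
}
```

With the example record above, wrap_amf_records( $amf ) would give the file content and amf_filename_for( 'GFIO:ZXCVBN' ) a candidate file name.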

Configuring a collection in ACIS

For an ACIS service to use a metadata collection it needs to be added to the collections configuration.

Level 2. Personal identification fields in document metadata

The presence of metadata fields which allow personal identification in author/editor data is important for the further levels. But not much else can be said here, because it depends completely on the particular service you are building these features into (and the software it runs on).

It may be that you’ll have to add a personal identification field to author/editor metadata of documents. It may be that you’ll use an existing field for personal identifiers. It may involve changing submission forms and database structures of the application or just changing the visible label of a field and end-user instructions.

It is important that the actual personal identification data entered into these fields (e.g. the personal identifiers) gets exported to ACIS (see level 1).

Level 3. Personal identification aid

The Saskatoon document says:

Imagine a scenario. A user submits a document to a document archive. She enters the names of the document authors among other document information. The document archive searches for personal records, whose names or name variations match to what the submitting user entered. If it finds any personal records which look appropriate, the service might suggest it to the user, so that she can identify the author herself.

Building these features into a real document submission service would involve a lot of work. This work is specific to the particular service or to the software it runs on. We implemented it for EPrints. In this section we describe some general tools, which can be used by anyone who takes this path.

The tools

ACIS’ names and records database table

The names and records tables of an ACIS service database are the most basic tool you’ll need. An ACIS administrator controls who can access the tables and how. See How to expose ACIS database for Person Identification Aid for more details.

ACIS::PIDAID module

The module searches the ACIS’ personal database and returns results as a list of personal data records.

Configuration

Parameters:

There are two ways of specifying the configuration.

First: directly define variable $ACIS::PIDAID::CONF:

$ACIS::PIDAID::CONF = {
  host => 'acis.super.edu',
  port => '9099',
  db   => 'ACIS',
  user => 'peter',
  pass => 'jolly',
  max_results => '25', # optional, default: 15
};

Second: pass a hash of these parameters to the PIDAID object constructor new() as a reference.

my $conf = {
  host => 'acis.super.edu',
  port => '9099',
  db   => 'ACIS',
  user => 'peter',
  pass => 'jolly',
  max_results => '25', # optional, default: 15
};
my $pidaid = ACIS::PIDAID -> new( $conf );

Constructor: new( [CONF] )

Usage example:

my $pidaid = ACIS::PIDAID -> new( $conf );

The CONF parameter is optional. If it is not defined, the constructor will try to use the global variable $ACIS::PIDAID::CONF instead. It will die if neither that nor CONF is defined. It will also die if it fails to connect to the specified database.

Subroutine returns the created ACIS::PIDAID object reference on success.

Method: find_by_name( LASTNAME [, FIRSTNAME ] )

Usage example:

my $persons = $pidaid -> find_by_name( $last, $first );

The method searches the names and records tables for matching personal records by last name and, optionally, by first name. If any records are found, it packs the data into a list of records. find_by_name() returns a reference to the list on success. It returns a reference to an empty list if no matching records are found. It returns the string “too many” if more than max_results records are found.
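
Because the return value is either a list reference or a plain string, calling code has to tell the cases apart. A small sketch of such a check follows; classify_result is a hypothetical helper, not part of ACIS::PIDAID.

```perl
use strict;
use warnings;

# Classify a find_by_name() result into one of the documented outcomes.
# Hypothetical helper; returns 'too many', 'found', 'none' or 'invalid'.
sub classify_result {
    my ($result) = @_;
    return 'too many'
        if defined $result and !ref($result) and $result eq 'too many';
    return 'invalid' unless ref($result) eq 'ARRAY';
    return @$result ? 'found' : 'none';
}

# Typical use, assuming $pidaid was created as shown earlier:
#   my $persons = $pidaid -> find_by_name( $last, $first );
#   my $outcome = classify_result( $persons );
```

A submission form could then offer the records for 'found', stay silent for 'none', and ask the user to type more of the name for 'too many'.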

If the LASTNAME value ends with an asterisk character “*”, then it is assumed to be just the beginning of the names to search. For instance, if $last is "Mendel*", it would match both “Mendel” and “Mendelssohn” surnames.

The FIRSTNAME, if non-empty, will always be matched in the same manner — against beginnings of the first name & possibly the middle name of the person. It can be said that FIRSTNAME has an implicit asterisk in the end.

If the LASTNAME is an empty string, the search is assumed to match any lastname. If both LASTNAME and FIRSTNAME are empty strings, it will return “too many” string.
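
The matching rules above can be expressed as a predicate over a single record. This is only an illustration of the rules, not the way the module queries the database. The field names familyname and givenname are taken from the XML example further below, and the case-insensitive comparison is an assumption.

```perl
use strict;
use warnings;

# Does one personal record match the LASTNAME / FIRSTNAME arguments?
# Illustrative only; comparison is case-insensitive by assumption.
sub name_matches {
    my ( $person, $last, $first ) = @_;
    my $family = lc $person->{familyname};
    my $given  = lc $person->{givenname};
    $last  = lc( defined $last  ? $last  : '' );
    $first = lc( defined $first ? $first : '' );

    if ( $last =~ s/\*$// ) {            # trailing asterisk: prefix match
        return 0 unless index( $family, $last ) == 0;
    }
    elsif ( length $last ) {             # no asterisk: whole-name match
        return 0 unless $family eq $last;
    }                                    # empty LASTNAME matches anyone

    # FIRSTNAME always behaves as if it ended with an asterisk
    return 0 if length $first and index( $given, $first ) != 0;
    return 1;
}
```

The case where both arguments are empty is not handled here; as described above, find_by_name() answers that with the “too many” string.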

Each record in the return list is represented by a hash of the following structure:

Method: find_by_shortid( SHORTID )

Usage example:

my $persons = $pidaid -> find_by_shortid( "pre32" );

SHORTID is somebody’s short-id, an ACIS-assigned short alphanumeric identifier. The method will check the ACIS records database for this identifier and will return the matching personal record, if any. Returns a list reference:

Structure of the resulting list items is exactly as in find_by_name().

Method: find_by_email( EMAIL )

Usage example:

my $persons = $pidaid -> find_by_email( $email );

EMAIL is somebody’s personal email address, as entered by the user. The method will check whether it is an email address known to ACIS and will return the matching personal record, if any. Returns a list reference:

The data structure is exactly as in find_by_name().

ACIS::PIDAID::XML module

This module provides one simple helper function: make_xml(). It takes a list reference as parameter and returns an XML string.

my $xml = ACIS::PIDAID::XML::make_xml( $persons );

The parameter is assumed to be a result of find_by_something function from ACIS::PIDAID. The resulting XML will consist of a <list> document element. The <list> element will contain zero, one or many <person>…</person> elements, one per item of the given $persons list. Something like this:

 <list>
   <person>
      ...
   </person>
   <person>
     ...
   </person>
   ...
 </list>

Each <person> element will contain all the fields of a given record, like this:

 <person>
   <shortid>pfo12</shortid>
   <id>repec:per:2005-07-21:samuel_fortnight</id>
   <namefull>Samuel L. Fortnight</namefull>
   <givenname>Samuel</givenname>
   <familyname>Fortnight</familyname>
   <profile_url>http://acis.zet/profile/pfo12/</profile_url>
   <namelast>Fortnight, Samuel</namelast>
   <homepage>http://www.super.edu/staff/fortnight/index.html</homepage>
 </person>

The order of the elements inside person element is random and not to be relied upon.

Basic XML-specific escaping will be applied to values to produce well-formed XML. The produced XML string will be in UTF-8 encoding.

If the $persons parameter is not a list reference, but, instead, is a string "too many", the function will return "<toomany/>" XML.
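
To make the described output concrete, here is an illustrative re-implementation of that behaviour. It is not the code of ACIS::PIDAID::XML; the function names are hypothetical, fields are sorted only to make the output stable, and UTF-8 encoding of the result is left out.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Sketch of the documented make_xml() behaviour (not the module's code).
sub sketch_make_xml {
    my ($persons) = @_;
    return '<toomany/>'
        if defined $persons and !ref($persons) and $persons eq 'too many';

    my $xml = "<list>\n";
    for my $p (@$persons) {
        $xml .= " <person>\n";
        for my $field ( sort keys %$p ) {
            next unless defined $p->{$field};
            $xml .= "  <$field>" . xml_escape( $p->{$field} ) . "</$field>\n";
        }
        $xml .= " </person>\n";
    }
    return $xml . "</list>\n";
}
```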

Level 4. Metadata update request

Level 4 is primarily specified in section 3.2 Requests for metadata processing of the Saskatoon document.

The metadata update request system consists of three main parts:

  1. the /meta/update screen in ACIS, which accepts metadata update requests;

  2. ACIS service-specific mirroring function, invoked by the /meta/update screen for every authorized request;

  3. the helper module for clients ACIS::MetaUpdate::Request, written in Perl, which makes sending requests to ACIS (see point 1) easy.

Usage of the helper module (point 3) is optional. It simplifies access to the metadata update request feature for applications and services, but nothing stops them from using it directly, i.e. by sending HTTP requests to the /meta/update screen of an ACIS service. Of course, such requests must follow this specification to be successful (this part, at least).
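
For clients that do not use the helper module, a direct request might look like the sketch below. It uses the core HTTP::Tiny module; the parameter names id and obj are the ones mentioned in the description of the mirroring function, while the function names and the 30-second timeout are assumptions made for the example.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the request URL for the /meta/update screen (GET variant).
sub build_update_url {
    my ( $target_url, $archive_id, $object ) = @_;
    my $query = HTTP::Tiny->new->www_form_urlencode(
        { id => $archive_id, obj => $object } );
    return "$target_url?$query";
}

# Send the request and return the HTTP status code of the response.
# HTTP::Tiny reports connection problems as status 599.
sub send_update_request {
    my ( $target_url, $archive_id, $object ) = @_;
    my $response = HTTP::Tiny->new( timeout => 30 )
        ->get( build_update_url( $target_url, $archive_id, $object ) );
    return $response->{status};
}
```

A POST request would work as well, since the screen accepts both methods; HTTP::Tiny offers post_form() for that.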

Logic of interaction

Here is the basic logic of what happens between a document submission service and an ACIS service, if all 4 levels are implemented and configured properly:

The /meta/update screen

A screen is a basic unit of the ACIS web interface. It is a URL-referenced part of the system, providing certain functionality. Screens are distinguished and addressed by URLs relative to the system’s base-url. A screen handles HTTP requests, may accept and process input and usually generates an HTTP response (like an HTML page). Most screens are for human users.

The /meta/update screen is primarily for document submission services. It accepts two parameters, via either the POST or the GET method: id, the archive identifier, and obj, the object (filename) to update.

Configuration

These parameters in [ACIS] section of the ACIS configuration file main.conf configure the screen:

meta-update-clients

Specifies IP addresses from which to accept metadata update requests and archive identifiers for which to accept requests from these IP addresses.

Parameter value is a whitespace-separated list of IP/archive specifications, where each specification is of the form ID + "@" + IP. Here: ID is identifier of an archive. IP is a numeric IP address of the machine the archive will send requests from. "@" is a literal “commercial at” character.

Example:

meta-update-clients=zetta@81.25.34.190 albina:pao@118.2.42.9

ACIS will use this list to authorize the incoming update requests. For a request to be authorized, a pair of the request’s originating IP address and the request’s archive id must be present in the list.
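
The authorization check can be pictured like this. It is a sketch of the rule as described, not the code of ACIS; both function names are hypothetical.

```perl
use strict;
use warnings;

# Parse a meta-update-clients value into a lookup table keyed by "ID@IP".
sub parse_update_clients {
    my ($value) = @_;
    my %allowed;
    for my $spec ( split ' ', $value ) {
        # split at the last "@": everything before it is the archive id
        my ( $id, $ip ) = $spec =~ /^(.+)\@([^\@]+)$/ or next;
        $allowed{"$id\@$ip"} = 1;
    }
    return \%allowed;
}

# A request is authorized only if its (archive id, IP) pair is listed.
sub is_authorized {
    my ( $allowed, $archive_id, $remote_ip ) = @_;
    return $allowed->{"$archive_id\@$remote_ip"} ? 1 : 0;
}

my $allowed = parse_update_clients('zetta@81.25.34.190 albina:pao@118.2.42.9');
print is_authorized( $allowed, 'zetta', '81.25.34.190' ) ? "accepted\n" : "refused\n";
```

Note that the pair is checked as a whole: a known archive id coming from an unlisted address is refused, and so is a listed address with another archive id.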

meta-update-object-fetch-func

Name of the mirroring function for ACIS to call when a metadata update request has arrived and has been successfully authorized. The function must be written in the Perl programming language and must be accessible to ACIS at its runtime (it must be loaded already). Details of the purpose and interface are below.

Example:

meta-update-object-fetch-func=RePEc::ACIS::meta_update_fetch

Processing requests (Implementation technical details)

Requests for the screen are processed by the ACIS::Web::MetaUpdate module. Two main functions of this module power it:

These functions in this order are the designated processors of the screen in the screens.xml file.

The authorize_request() function does update request authorization: checks originating IP address of the request and the archive id against the configuration meta-update-clients list. In case of authorization failure it clears the processors queue, builds and produces appropriate response (403 Access Forbidden).

The handle_request()’s work is to:

Response

The /meta/update screen’s response always consists of a certain HTTP response Status header and the response body. Response body is a very simple HTML page, built after the following template:

<html><head><title>[STATUS]</title></head>
<body><h1>[STATUS]</h1>
<address>ACIS /meta/update</address></body></html>

where [STATUS] is the resulting HTTP response status string, consisting of a 3-digit number and a word or several words.

In case the request was not authorized by its IP address & archive id, the screen generates response with 403 Access Forbidden status.

If the request was authorized, the status depends on the value returned by the mirroring function. It is expected that the function returns a valid hash structure (as documented below). If the returned value is not a valid hash structure, the response status is 500 Internal Server Error.

If returned value is valid, the screen’s response status depends on the value of status key of the returned hash. Here are the corresponding screen statuses:

mirroring result, status key    screen response status
ok                              200 OK
archive unknown                 404 Not Found
can not fetch                   204 No Content

The generated response also contains HTTP headers to prohibit response caching by (possible) proxies and the user agent.
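
The status mapping and the body template can be put together in a few lines. This is a sketch and not the code of ACIS::Web::MetaUpdate; the function names are hypothetical, and treating an unrecognized status value as an invalid result (500) is an assumption.

```perl
use strict;
use warnings;

# Mirroring result status => screen response status, as in the table above.
my %SCREEN_STATUS = (
    'ok'              => '200 OK',
    'archive unknown' => '404 Not Found',
    'can not fetch'   => '204 No Content',
);

# Pick the response status for a value returned by the mirroring function.
sub screen_status_for {
    my ($result) = @_;
    return '500 Internal Server Error'
        unless ref($result) eq 'HASH'
           and defined $result->{status}
           and exists $SCREEN_STATUS{ $result->{status} };
    return $SCREEN_STATUS{ $result->{status} };
}

# Fill the response body template with the status string.
sub response_body {
    my ($status) = @_;
    return "<html><head><title>$status</title></head>\n"
         . "<body><h1>$status</h1>\n"
         . "<address>ACIS /meta/update</address></body></html>\n";
}
```

A client can rely on the HTTP status alone; the body only repeats it in a human-readable form.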

Data mirroring function (in perl)

The function name is specified via configuration. ACIS will call it as: Func( $acis, $id, $obj );, where $acis is a reference to the ACIS::Web object, $id is the archive identifier (id parameter of the update request), $obj is the object (filename) to update (obj parameter of the update request).

The administrator can use the require-modules configuration parameter to make sure the function is available to Perl when ACIS needs it.

The data mirroring function:

  1. fetches the object $obj from archive $id;
  2. saves it to a file in one of the local metadata collections, that are configured in ACIS;
  3. returns success/failure status and details.

The return value must be a reference to a hash with the following keys:

hash key      value
status        one of:
                • "archive unknown": if function does not know how to access the archive.
                • "can not fetch": archive is known, but it failed to connect and fetch the object from it.
                • "ok": archive is known, object is fetched and saved.
collection    id of the collection, to which the object file was saved;
pathname      pathname of the object file, relative to the collection’s root point.

Note: collection and pathname are only needed if the status is "ok".

How the function accesses the archive and fetches the requested object is completely a matter of implementation. It may be a simple HTTP GET request to the archive’s website, an FTP session or some other transport operation.

Note that the ACIS service administrator has to supply the function; it is not part of ACIS. This is because the function depends largely on the site-specific configuration of metadata exchange with other parties.
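
As a starting point, a mirroring function could look like the sketch below. Everything site-specific in it is hypothetical: the %ARCHIVES table, its keys, the URLs and the directories. The fetch step uses the core HTTP::Tiny module and is kept replaceable, and the check against ".." in object names is an extra precaution, not something ACIS asks for.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Hypothetical site-specific table: archive id => where to fetch from
# and where to store. None of these names come from ACIS itself.
our %ARCHIVES = (
    michigan => {
        base_url   => 'http://archive.example.org/amf/',
        collection => 'michigan',
        root       => '/var/acis/collections/michigan',
    },
);

# The transport is replaceable (FTP, local copy, ...). Must return true
# on success. The default mirrors the URL into the file over HTTP.
our $FETCH = sub {
    my ( $url, $file ) = @_;
    return HTTP::Tiny->new( timeout => 30 )->mirror( $url, $file )->{success};
};

sub meta_update_fetch {
    my ( $acis, $id, $obj ) = @_;

    my $archive = $ARCHIVES{$id}
        or return { status => 'archive unknown' };

    # refuse object names that could escape the collection directory
    return { status => 'can not fetch' }
        if $obj =~ m{^/} or $obj =~ m{(?:^|/)\.\.(?:/|\z)};

    my $file = "$archive->{root}/$obj";
    make_path( dirname($file) );
    $FETCH->( $archive->{base_url} . $obj, $file )
        or return { status => 'can not fetch' };

    return {
        status     => 'ok',
        collection => $archive->{collection},
        pathname   => $obj,
    };
}
```

The $acis argument is not used in this sketch; a real function might use it to read configuration or to log.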

Update request helper module ACIS::MetaUpdate::Request

This is a Perl module for applications and services that wish to send metadata update requests to an ACIS service.

The module exports one function into the caller’s namespace: acis_metaupdate_request().

Usage example:

use ACIS::MetaUpdate::Request;

$ACIS::MetaUpdate::Request::CONF = {
   'request-target-url' => 'http://acis.super.org/meta/update',
   'archive-id'         => 'michigan',
   'log-filename'       => '/home/michigan/super-org-metaupdate.log',
};

...
acis_metaupdate_request( $filename );

Configuration

Parameters:

There are two ways to specify configuration:

The acis_metaupdate_request( OBJECT, [PARAMETERS] ) function

Example:

acis_metaupdate_request( $file, 'archive-id' => 'furry' );

The function will send update requests for a particular metadata record that your document service has produced.

OBJECT is the pathname of the file or other object that ACIS shall update from the originating “archive”.

PARAMETERS is an optional list of parameters and their values (PAR1, VAL1, PAR2, VAL2, …). It must always contain an even number of items. The parameters are configuration parameters and override the corresponding values specified via $ACIS::MetaUpdate::Request::CONF, if any.

The request-target-url and archive-id configuration parameters are required. Both must have a defined and non-empty value.

The function will die on insufficient configuration, will return 1 on success, undef otherwise.
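
The override and validation rules can be illustrated with a small helper. This is a sketch of the documented rules and not the module's own code; effective_conf is a hypothetical name.

```perl
use strict;
use warnings;

# Merge per-call PARAMETERS over a global configuration hash and check
# the required parameters. Hypothetical helper.
sub effective_conf {
    my ( $global, @parameters ) = @_;
    die "PARAMETERS must contain an even number of items\n"
        if @parameters % 2;

    my %conf = ( %{ $global || {} }, @parameters );    # later pairs win

    for my $required (qw( request-target-url archive-id )) {
        die "configuration parameter $required is required\n"
            unless defined $conf{$required} and length $conf{$required};
    }
    return \%conf;
}
```

Since acis_metaupdate_request() dies on insufficient configuration, a caller that must not be interrupted may want to wrap the call in eval and also check the return value for undef.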

If the log-filename parameter was specified, the operation will be logged to that file. The log will include full information about the request that has been sent (i.e. the target URL, the archive id, the OBJECT value) and the status of the response it got, if any. Every log record will be accompanied by the date and time of the request.