Introduction and cooperation levels
Level 1. Export document metadata
Generating AMF
Configuring a collection in ACIS
Level 2. Personal identification fields in document metadata
Level 3. Personal identification aid
The tools
ACIS’ names and records database table
ACIS::PIDAID module
ACIS::PIDAID::XML module
Level 4. Metadata update request
Logic of interaction
The /meta/update screen
Configuration
Processing requests (Implementation technical details)
Response
Data mirroring function (in perl)
Update request helper module
ACIS::MetaUpdate::Request
Configuration
The acis_metaupdate_request( OBJECT, [PARAMETERS] ) function
The ACIS stage 3 plan (“Saskatoon”) sets the development goal: to make ACIS cooperate with external document submission services. The aim is to include newly submitted documents in the corresponding personal profiles quickly and automatically, so that users do not have to do it manually. At the same time it will support the creation of richer metadata, in which the document’s and the author’s records are connected with each other.
Several levels of cooperation between an ACIS service and a submission service are possible. The higher the level, the easier the user experience and/or the more precise and complete the collected metadata.
Level 0: A submission service and an ACIS service exist and function completely independently. There’s no data exchange between them. Nothing interesting happens.
Level 1: A submission service provides all of its document metadata to an ACIS system. The data is in a stable simple directory and file structure and is in a compatible format. ACIS processes it on a regular basis, notices changed/new records and adds them to its database just as any other data (that is coming from any other source). Compatible metadata formats for ACIS are AMF and ReDIF.
This requires some transport method for the metadata files to get them to the ACIS machine, but that’s out of scope of ACIS.
Level 2: A submission service includes personal identification data into its author and editor sections of the document metadata. Personal identification data is one of:
a personal record identifier, known to the ACIS service;
a personal record short-id, assigned by the ACIS service;
an email address, known to ACIS as email address of a registered person.
If document metadata contains such data, ACIS can uniquely identify the document’s authors and editors. It will include the document into their personal research profiles. A feature called ARPU — automatic research profile update — makes this possible.
In a submission service such data can only come from users. So on this level the service must allow its users to enter such information and it must include it into the generated metadata.
Level 3: A submission service helps its users to find personal identification data and include it into the document metadata.
When a user fills in author/editor info for a document being submitted, the document service may search a database of the ACIS-registered persons. If it finds any personal records which match the user input, it offers them for the user to choose from. If the user accepts one of the personal records, the chosen record’s identifier is included in the document metadata. Document metadata is then exported to ACIS in the usual way.
This level requires two things. First, ACIS needs to expose its personal records database to the submission service. Second, the submission service must search the database, display the options in an unobtrusive way and give the user an easy method to choose an option.
The Level 3 section of the EPrints-related documentation in ACIS describes in detail how the user interface of a document submission service can be enhanced.
Level 4: A submission service immediately notifies an ACIS service about each new document submission. The ACIS service requests this data in real time via the web (HTTP or FTP), stores it locally and processes it. If known personal identifiers are found in this data, the works are added to the corresponding personal profiles in a matter of minutes.
Usage of the term level in this document is arguable. Only level 1 is really required for other levels. You can build level 4 (metadata update request) into a service without having levels 2-3 (person identification aid) there.
Not having a better term, I stick with this one. The order in which the levels are listed here is the most logical order to implement them. We first do the simplest but most useful things, then the less simple and less useful ones.
Let’s say you run a document submission service. Let’s say you want an ACIS service to consume document metadata that your service is collecting. Then you need to build a metadata collection.
Metadata collections are what the ACIS update daemon keeps track of. The update daemon is what ACIS uses to process document and other metadata coming from outside. It monitors the data, processes it and puts it into the ACIS database.
In terms of the ACIS update daemon, a metadata collection is a set of files with stable names and with data in them. Files may be grouped into an arbitrary directory structure. (Exception: no circular symbolic links are allowed.) Each file contains zero, one or more metadata records. Each record has a unique identifier. Identifiers are treated in a case-insensitive manner: they must still be unique after being lowercased.
Data records with colliding (non-unique) record identifiers are excluded from the ACIS database.
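The uniqueness rule can be checked on the exporting side before the files are handed over. The following is a minimal sketch only; `find_id_collisions` is a hypothetical helper written for this illustration and is not part of ACIS.

```perl
use strict;
use warnings;

# Return the lowercased identifiers that occur more than once
# when the given record identifiers are compared case-insensitively.
sub find_id_collisions {
    my @ids = @_;
    my %seen;    # lowercased id => list of original spellings
    push @{ $seen{ lc $_ } }, $_ for @ids;
    my @colliding = sort grep { @{ $seen{$_} } > 1 } keys %seen;
    return @colliding;
}

# 'GFIO:ZXCVBN' and 'gfio:zxcvbn' collide once lowercased.
my @collisions = find_id_collisions( 'GFIO:ZXCVBN', 'gfio:zxcvbn', 'GFIO:QWERTY' );
print "colliding id: $_\n" for @collisions;
```

Records reported by such a check would be the ones ACIS leaves out of its database.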
Update daemon must have file-system access to the files of the collection. It is your responsibility to organize transport of files from your service machine to the ACIS service machine.
For a document submission service it will probably be easiest to put one record into one file.
There are two basic ways for a system to export its data into metadata files. One is to generate the files on the fly as things are happening. The other is to keep things in a database and, when necessary, run an export script which queries the database and creates the necessary files.
In our case, that of a document service exporting metadata to ACIS, it is much better to generate metadata files on the fly, immediately after a document record is submitted or changed. This is especially true if you plan to implement level 4 later, because a recent user action will then be reflected in the data immediately. Without it level 4 will not work reliably.
It also helps ACIS, because the files of records that didn’t change keep the same last-modified timestamp. ACIS then does not have to read and process them every time, which is especially good if there are a lot of files.
AMF is an XML-based format for academic metadata. It is documented on the amf.openlib.org website.
There is open-source Perl software for parsing and generating AMF called AMF-perl. It is available from the code/ section of the ACIS website.
Here is a little example. It generates AMF for a document titled “New AMF text noun”, authored by Joe Spark, and having full-text in a PDF file. The id of the record is GFIO:ZXCVBN.
use AMF::Parser;
use AMF::Record;
my $new = AMF::Record -> new( ID => 'GFIO:ZXCVBN', TYPE => 'text' ); # create a record
$new -> adjective( 'title', { 'xml:lang' => 'en' }, "New AMF text noun" ); # assign english-language title
my $person = AMF::Record -> new( TYPE => 'person' ); # create a person record
$person -> adjective( 'givenname', {}, "Joe" );
$person -> adjective( 'familyname', {}, "Spark" );
$new -> verb( "hasauthor", {}, $person ); # assign author to the record
my $file = AMF::AdjContainer -> new; # create an adjective container
$file -> adjective( 'url', {}, "http://file.service.site/great/file.pdf" );
$file -> adjective( 'format', {}, "application/pdf" );
$new -> adjcontainer( 'file', {}, $file ); # assign the file container to the record
my $amf = $new -> stringify; # get resulting AMF/XML string
To build an actual AMF file you need one more step. AMF records always come wrapped in an amf element. It looks like this:
<amf xmlns='http://amf.openlib.org'
xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:schemaLocation='http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd'>
... <!-- amf records go here -->
</amf>
AMF files need to have the .amf.xml name extension (case-insensitive) to be processed by ACIS.
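As an illustration of the wrapping and naming rules, here is a hedged sketch of how a service might write one record per file. Both function names are hypothetical. Writing only when the content differs is an assumption of this sketch, meant to keep the last-modified timestamp of unchanged records stable, and so is the UTF-8 output encoding.

```perl
use strict;
use warnings;

# Wrap one or more AMF record strings into the <amf> document element.
sub wrap_amf {
    my @records = @_;
    return join "\n",
        q{<amf xmlns='http://amf.openlib.org'},
        q{     xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'},
        q{     xsi:schemaLocation='http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd'>},
        @records,
        q{</amf>},
        '';
}

# Write $content to $path unless the file already holds exactly that
# content. Returns 1 if the file was written, 0 if it was left alone.
sub write_if_changed {
    my ( $path, $content ) = @_;
    if ( -e $path ) {
        open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
        my $old = do { local $/; <$in> };
        close $in;
        return 0 if defined $old && $old eq $content;
    }
    open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
    print {$out} $content;
    close $out or die "Cannot close $path: $!";
    return 1;
}

# The placeholder stands for the string returned by $new -> stringify.
print wrap_amf('<!-- AMF record string goes here -->');
```

A service would then call something like `write_if_changed( "$dir/gfio-zxcvbn.amf.xml", wrap_amf($amf) )`, where the directory and file name are up to the service.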
For an ACIS service to use a metadata collection it needs to be added to the collections configuration.
Presence of the metadata fields which allow personal identification in author/editor data is important for further levels. But not much else can be said here, because it completely depends on particular service you are building these features into (and the software it runs on).
It may be that you’ll have to add a personal identification field to author/editor metadata of documents. It may be that you’ll use an existing field for personal identifiers. It may involve changing submission forms and database structures of the application or just changing the visible label of a field and end-user instructions.
It is important that the actual personal identification data entered via these fields (e.g. the personal identifiers) gets exported to ACIS (see level 1).
Saskatoon document says:
Imagine a scenario. A user submits a document to a document archive. She enters the names of the document authors among other document information. The document archive searches for personal records, whose names or name variations match to what the submitting user entered. If it finds any personal records which look appropriate, the service might suggest it to the user, so that she can identify the author herself.
Building these features into a real document submission service would involve a lot of work. This work is specific to the particular service or to the software it runs on. We implemented it for EPrints. In this section we describe some general tools, which can be used by anyone who takes this path.
The names and records tables of an ACIS service database are the most basic tool you’ll need. An ACIS administrator controls who can access the tables and how. See How to expose ACIS database for Person Identification Aid for more details.
The module searches the ACIS’ personal database and returns results as a list of personal data records.
Parameters:
MySQL access parameters:
host: hostname
port: port number
db: database name
user: username
pass: password
max_results: maximum number of results to return. This specifies the maximum number of matching personal records to return from a search operation. Default: 15.
There are two ways of specifying the configuration.
First: directly define variable $ACIS::PIDAID::CONF:
$ACIS::PIDAID::CONF = {
host => 'acis.super.edu',
port => '9099',
db => 'ACIS',
user => 'peter',
pass => 'jolly',
max_results => '25', # optional, default: 15
};
Second: pass a hash of these parameters to the PIDAID object constructor new() as a reference.
my $conf = {
host => 'acis.super.edu',
port => '9099',
db => 'ACIS',
user => 'peter',
pass => 'jolly',
max_results => '25', # optional, default: 15
};
my $pidaid = ACIS::PIDAID -> new( $conf );
Usage example:
my $pidaid = ACIS::PIDAID -> new( $conf );
The CONF parameter is optional. If it is not defined, the subroutine will try to use the global variable $ACIS::PIDAID::CONF instead. It will die if neither that nor CONF is defined. It will also die if it fails to connect to the specified database.
Subroutine returns the created ACIS::PIDAID object reference on success.
Usage example:
my $persons = $pidaid -> find_by_name( $last, $first );
The method searches the names and records tables for matching personal records by last name and, optionally, by first name. If any records are found, it packs the data into a list of records. find_by_name() returns a reference to the list on success. It returns a reference to an empty list if no matching records are found. It returns the string “too many” if more than max_results records are found.
If the LASTNAME value ends with an asterisk character “*”, it is treated as just the beginning of the names to search for. For instance, if $last is "Mendel*", it would match both the “Mendel” and “Mendelssohn” surnames.
The FIRSTNAME, if non-empty, is always matched in the same manner: against the beginning of the person’s first name and possibly middle name. In effect, FIRSTNAME has an implicit asterisk at the end.
If the LASTNAME is an empty string, the search matches any last name. If both LASTNAME and FIRSTNAME are empty strings, the method returns the “too many” string.
Each record in the returned list is represented by a hash of the following structure:
shortid: the person’s record short-id;
id: the person’s record id (handle);
profile_url: URL of a detailed page about the person;
namelast: full name of the person in the format “Lastname, Firstname M.”;
familyname: the “Lastname” part of the namelast value;
givenname: the “Firstname M.” part of the namelast value; includes middle names or initials if known;
namefull: full name of the person as she identified herself to ACIS, usually in the format “Firstname M. Lastname”; not necessarily equal to familyname and givenname combined;
homepage: URL of the person’s homepage, if known.
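Because the return value of find_by_name() is either a list reference or the string “too many”, calling code has to tell the cases apart before it offers anything to the user. The sketch below shows one possible way to do so. `suggestions_for` and its return structure are hypothetical, and the sample record only imitates the documented hash structure; no ACIS module is loaded.

```perl
use strict;
use warnings;

# Turn a find_by_name()-style result into choices for a submission form.
sub suggestions_for {
    my ($result) = @_;

    # The search matched more than max_results records.
    return { status => 'too many', choices => [] }
        if defined $result && !ref $result && $result eq 'too many';

    die "Unexpected search result\n" unless ref($result) eq 'ARRAY';

    my @choices = map {
        +{ label => "$_->{namelast} ($_->{shortid})", value => $_->{shortid} }
    } @$result;

    return { status => ( @choices ? 'found' : 'none' ), choices => \@choices };
}

# Sample data shaped like the records described above.
my $sample = [
    {
        shortid    => 'pfo12',
        id         => 'repec:per:2005-07-21:samuel_fortnight',
        namelast   => 'Fortnight, Samuel',
        familyname => 'Fortnight',
        givenname  => 'Samuel',
        namefull   => 'Samuel L. Fortnight',
    },
];
my $outcome = suggestions_for($sample);
print "$outcome->{status}: $_->{label}\n" for @{ $outcome->{choices} };
```

In the “too many” case a service would typically ask the user to enter a more specific name instead of showing a list.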
Usage example:
my $persons = $pidaid -> find_by_shortid( "pre32" );
SHORTID is somebody’s short-id, a short alphanumeric identifier assigned by ACIS. The method checks the ACIS records database for this identifier and returns the matching personal record, if any. Returns a list reference:
an empty list reference when no record was found;
a one-item list reference when a matching record was found.
The structure of the resulting list items is exactly as in find_by_name().
Usage example:
my $persons = $pidaid -> find_by_email( $email );
EMAIL is somebody’s personal email address, as entered by the user. The method checks whether it is an email address known to ACIS and returns the matching personal record, if any. Returns a list reference:
an empty list reference when no record was found;
a one-item list reference when the matching record was found.
The data structure is exactly as in find_by_name().
This module provides one simple helper function: make_xml(). It takes a list reference as its parameter and returns an XML string.
my $xml = ACIS::PIDAID::XML::make_xml( $persons );
The parameter is assumed to be the result of a find_by_something function from ACIS::PIDAID. The resulting XML will consist of a <list> document element. The <list> element will contain zero, one or many <person>…</person> elements, one per item of the given $persons list. Something like this:
<list>
<person>
...
</person>
<person>
...
</person>
...
</list>
Each <person> element will contain all the fields of a given record, like this:
<person>
<shortid>pfo12</shortid>
<id>repec:per:2005-07-21:samuel_fortnight</id>
<namefull>Samuel L. Fortnight</namefull>
<givenname>Samuel</givenname>
<familyname>Fortnight</familyname>
<profile_url>http://acis.zet/profile/pfo12/</profile_url>
<namelast>Fortnight, Samuel</namelast>
<homepage>http://www.super.edu/staff/fortnight/index.html</homepage>
</person>
The order of the elements inside the person element is random and not to be relied upon.
Basic XML-specific escaping will be applied to values to produce well-formed XML. The produced XML string will be in UTF-8 encoding.
If the $persons parameter is not a list reference but is instead the string "too many", the function will return the "<toomany/>" XML.
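For services that cannot load the module, the described output is simple enough to approximate by hand. The following sketch is an independent illustration of the format described above, not the module’s actual code. `persons_to_xml` and `xml_escape` are hypothetical names, keys are sorted only to make the output predictable, and the result is a Perl character string that the caller would still have to encode as UTF-8.

```perl
use strict;
use warnings;

# Escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Build a <list> of <person> elements from a list of record hashes,
# or <toomany/> if the search result was the string "too many".
sub persons_to_xml {
    my ($persons) = @_;
    return '<toomany/>'
        if defined $persons && !ref $persons && $persons eq 'too many';
    die "Expected a list reference\n" unless ref($persons) eq 'ARRAY';

    my $xml = "<list>\n";
    for my $person (@$persons) {
        $xml .= " <person>\n";
        for my $field ( sort keys %$person ) {
            next unless defined $person->{$field};
            $xml .= sprintf "  <%s>%s</%s>\n",
                $field, xml_escape( $person->{$field} ), $field;
        }
        $xml .= " </person>\n";
    }
    return $xml . "</list>\n";
}

print persons_to_xml( [ { shortid => 'pfo12', namelast => 'Fortnight, Samuel' } ] );
```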
Level 4 is primarily specified in section 3.2 Requests for metadata processing of the Saskatoon document.
The metadata update request system consists of three main parts:
1. the /meta/update screen in ACIS, which accepts metadata update requests;
2. an ACIS service-specific mirroring function, invoked by the /meta/update screen for every authorized request;
3. the helper module for clients, ACIS::MetaUpdate::Request, written in Perl, which makes sending requests to ACIS (see point 1) easy.
Usage of the helper module (point 3) is optional. It simplifies access to the metadata update request feature for applications and services, but nothing stops them from using it directly, i.e. by sending HTTP requests to the /meta/update screen of an ACIS service. Of course, such requests must follow this specification to be successful (this part, at least).
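A direct request could look roughly like the sketch below. It assumes the two request parameters of the screen, id and obj, sent via GET, and uses the core HTTP::Tiny module. The function names, the timeout and the sample object path are assumptions of this sketch; it is not the helper module’s implementation.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build the URL of a GET request to the /meta/update screen.
sub update_request_url {
    my ( $screen_url, $archive_id, $object ) = @_;
    my $query = HTTP::Tiny->new->www_form_urlencode(
        { id => $archive_id, obj => $object } );
    return "$screen_url?$query";
}

# Send the request and return the HTTP status code of the response.
# HTTP::Tiny reports its own failures (e.g. no connection) as status 599.
sub request_metadata_update {
    my ( $screen_url, $archive_id, $object ) = @_;
    my $url      = update_request_url( $screen_url, $archive_id, $object );
    my $response = HTTP::Tiny->new( timeout => 30 )->get($url);
    return $response->{status};
}

# Only the URL is built here; nothing is sent over the network.
print update_request_url( 'http://acis.super.org/meta/update',
    'michigan', 'papers/wp01.amf.xml' ), "\n";
```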
Here is the basic logic of what happens between a document submission service and an ACIS service, if all 4 levels are implemented and configured properly:
1. A user submits a document to a document submission service. This possibly involves personal identification aid, levels 2 and 3.
2. The document submission service exports metadata to a local web-accessible AMF file. That’s level 1.
3. The document submission service sends a metadata update request to an ACIS service about the new/updated datafile. It either uses the ACIS::MetaUpdate::Request helper module or sends a correct HTTP request by itself; that’s level 4 already.
4. The ACIS service fetches the datafile and processes it. This involves the /meta/update screen, a site-specific mirroring function and the usual ACIS processing with the update daemon and APU. That’s level 4 as well.
The /meta/update screen
A screen is a basic unit of the ACIS web interface. It is a URL-referenced part of the system, providing certain functionality. Relative URLs are used to distinguish and address screens, relative to the system’s base-url. A screen handles HTTP requests, may accept and process input and usually generates HTTP responses (like an HTML page). Most screens are for human users.
The /meta/update screen is primarily for document submission services. It accepts two parameters, either via POST or GET method:
id: identifier of the archive that is sending the request;
obj: relative pathname to the file which contains the data to update.
These parameters in the [ACIS] section of the ACIS configuration file main.conf configure the screen:
meta-update-clients
Specifies IP addresses from which to accept metadata update requests and archive identifiers for which to accept requests from these IP addresses.
The parameter value is a whitespace-separated list of IP/archive specifications, where each specification is of the form ID + "@" + IP. Here ID is the identifier of an archive, IP is the numeric IP address of the machine the archive will send requests from, and "@" is a literal “commercial at” character.
Example:
meta-update-clients=zetta@81.25.34.190 albina:pao@118.2.42.9
ACIS will use this list to authorize the incoming update requests. For a request to be authorized, a pair of the request’s originating IP address and the request’s archive id must be present in the list.
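The authorization rule can be pictured with a few lines of Perl. This is a sketch of the rule as described, not the code of ACIS itself. `parse_clients` and `is_authorized` are hypothetical names, and matching the archive id exactly (case-sensitively) is an assumption.

```perl
use strict;
use warnings;

# Parse a meta-update-clients value into a lookup table of "ID@IP" pairs.
sub parse_clients {
    my ($value) = @_;
    my %allowed;
    for my $spec ( split ' ', $value ) {
        # An archive id may contain ":" (as in "albina:pao"),
        # so only the last "@" separates the id from the IP address.
        my ( $id, $ip ) = $spec =~ /^(.+)\@([^\@]+)$/
            or die "Malformed client specification: $spec\n";
        $allowed{"$id\@$ip"} = 1;
    }
    return \%allowed;
}

# A request is authorized only if its (archive id, IP) pair is listed.
sub is_authorized {
    my ( $allowed, $archive_id, $remote_ip ) = @_;
    return exists $allowed->{"$archive_id\@$remote_ip"} ? 1 : 0;
}

my $clients = parse_clients('zetta@81.25.34.190 albina:pao@118.2.42.9');
print is_authorized( $clients, 'zetta', '81.25.34.190' )
    ? "authorized\n"
    : "403 Access Forbidden\n";
```

Note that a known archive id sent from an unlisted address is rejected, just like an unknown id from a listed address.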
meta-update-object-fetch-func
Name of the mirroring function for ACIS to call when a metadata update request has arrived and has been successfully authorized. The function must be written in the Perl programming language and must be accessible to ACIS at its runtime (it must be loaded already). Details of the purpose and interface are below.
Example:
meta-update-object-fetch-func=RePEc::ACIS::meta_update_fetch
Requests for the screen are processed by the ACIS::Web::MetaUpdate module. Two main functions of this module power it:
authorize_request()
handle_request()
These functions, in this order, are the designated processors of the screen in the screens.xml file.
The authorize_request() function does update request authorization: it checks the originating IP address of the request and the archive id against the meta-update-clients configuration list. In case of authorization failure it clears the processors queue, then builds and produces the appropriate response (403 Access Forbidden).
The work of handle_request() is to:
1. call the data mirroring function and interpret its return value;
2. if mirroring was successful, request update daemon processing for the acquired/updated datafile via the RePEc::Index::UpdateClient module. To request update daemon processing for a file, one does:
require RePEc::Index::UpdateClient;
RePEc::Index::UpdateClient::send_update_request( $collection, $file );
$file here is a relative pathname to the file or directory to update in collection $collection;
3. return an appropriate response to the requesting party; see below.
The /meta/update screen’s response always consists of a certain HTTP response Status header and the response body. The response body is a very simple HTML page, built from the following template:
<html><head><title>[STATUS]</title></head>
<body><h1>[STATUS]</h1>
<address>ACIS /meta/update</address></body></html>
where [STATUS] is the resulting HTTP response status string, consisting of a 3-digit number and a word or several words.
If the request was not authorized by its IP address and archive id, the screen generates a response with the 403 Access Forbidden status.
If the request was authorized, the status depends on the value returned by the mirroring function. The function is expected to return a valid hash structure (as documented below). If the returned value is not a valid hash structure, the response status is 500 Internal Server Error.
If the returned value is valid, the screen’s response status depends on the value of the status key of the returned hash. Here are the corresponding screen statuses:
mirroring result, status key | screen response status |
---|---|
ok | 200 OK |
archive unknown | 404 Not Found |
can not fetch | 204 No Content |
The generated response also contains HTTP headers to prohibit response caching by (possible) proxies and the user agent.
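The status rules above amount to a small lookup. The sketch below restates them in Perl for illustration; it is not the ACIS::Web::MetaUpdate code. `response_status` and `response_page` are hypothetical names, and treating an unrecognized status value as an invalid result (500) is an assumption.

```perl
use strict;
use warnings;

# Mirroring result status => screen response status, as in the table.
my %HTTP_STATUS_FOR = (
    'ok'              => '200 OK',
    'archive unknown' => '404 Not Found',
    'can not fetch'   => '204 No Content',
);

# Choose the response status for an authorized request from the value
# returned by the mirroring function.
sub response_status {
    my ($result) = @_;
    return '500 Internal Server Error'
        unless ref($result) eq 'HASH' && defined $result->{status};
    return $HTTP_STATUS_FOR{ $result->{status} } // '500 Internal Server Error';
}

# Fill the response body template with the status string.
sub response_page {
    my ($status) = @_;
    return "<html><head><title>$status</title></head>\n"
         . "<body><h1>$status</h1>\n"
         . "<address>ACIS /meta/update</address></body></html>\n";
}

print response_page( response_status( { status => 'archive unknown' } ) );
```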
The function name is specified via configuration. ACIS will call it as Func( $acis, $id, $obj ); where $acis is a reference to the ACIS::Web object, $id is the archive identifier (the id parameter of the update request) and $obj is the object (filename) to update (the obj parameter of the update request).
The administrator can use the require-modules configuration parameter to make sure the function is available to Perl when ACIS needs it.
The data mirroring function fetches the object $obj from archive $id and saves it locally.
The return value must be a reference to a hash with the following keys:
hash key | value |
---|---|
status | result of the mirroring operation: "ok", "archive unknown" or "can not fetch"; |
collection | id of the collection, to which the object file was saved; |
pathname | pathname of the object file, relative to the collection’s root point. |
Note: collection and pathname are only needed if the status is "ok".
How the function accesses the archive and fetches the requested object is entirely a matter of implementation. It may be a simple HTTP GET request to the archive’s website, an FTP session or some other transport operation.
Note that the ACIS service administrator has to supply the function; it is not part of ACIS. This is because the function depends heavily on the site-specific configuration of metadata exchange with other parties.
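To give an idea of the size of the task, here is a hedged sketch of what such a site-specific function might look like when the archive is reachable over HTTP. Everything in it is an assumption made for illustration: the %ARCHIVES table, its URLs and directories, the overridable $FETCH callback and the path checks. Only the call signature and the returned hash follow the interface described above.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Spec;
use HTTP::Tiny;

# Hypothetical site table: archive id => where to fetch from and
# which local collection (and directory) receives the files.
our %ARCHIVES = (
    zetta => {
        base_url   => 'http://zetta.example.org/amf/',
        collection => 'zetta',
        root       => '/var/acis/collections/zetta',
    },
);

# Overridable transport: fetch $url into $path, true on success.
our $FETCH = sub {
    my ( $url, $path ) = @_;
    my $res = HTTP::Tiny->new( timeout => 30 )->mirror( $url, $path );
    return $res->{success};
};

sub meta_update_fetch {
    my ( $acis, $id, $obj ) = @_;

    my $archive = $ARCHIVES{$id}
        or return { status => 'archive unknown' };

    # Refuse empty names, absolute paths and parent-directory references.
    return { status => 'can not fetch' }
        if !defined $obj
        || $obj eq ''
        || $obj =~ m{^/}
        || $obj =~ m{(?:^|/)\.\.(?:/|$)};

    my $local = File::Spec->catfile( $archive->{root}, split m{/}, $obj );
    eval { make_path( dirname($local) ); 1 }
        or return { status => 'can not fetch' };

    $FETCH->( $archive->{base_url} . $obj, $local )
        or return { status => 'can not fetch' };

    return {
        status     => 'ok',
        collection => $archive->{collection},
        pathname   => $obj,
    };
}
```

Keeping the transport in a separate callback makes it easy to swap HTTP for FTP or a local copy without touching the status logic.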
This is a Perl module for applications and services, wishing to send metadata update requests to an ACIS service.
The module exports one function into the caller’s namespace: acis_metaupdate_request().
Usage example:
use ACIS::MetaUpdate::Request;
$ACIS::MetaUpdate::Request::CONF = {
'request-target-url' => 'http://acis.super.org/meta/update',
'archive-id' => 'michigan',
'log-filename' => '/home/michigan/super-org-metaupdate.log',
};
...
acis_metaupdate_request( $filename );
Parameters:
request-target-url: the URL of the /meta/update screen of an ACIS installation to send requests to; required parameter;
archive-id: identifier of the archive to send requests on behalf of; must be negotiated with the ACIS service administrator; required parameter;
log-filename: name of the log file to write; optional.
There are two ways to specify configuration:
First: set the global variable $ACIS::MetaUpdate::Request::CONF:
$ACIS::MetaUpdate::Request::CONF = {
'request-target-url' => 'http://acis.super.org/meta/update',
'archive-id' => 'michigan',
'log-filename' => '/home/michigan/super-org-metaupdate.log',
};
Second: pass configuration parameters along in the acis_metaupdate_request() function call; see below.
Example:
acis_metaupdate_request( $file, 'archive-id' => 'furry' );
The function sends an update request for a particular metadata record that your document service has produced.
OBJECT is the pathname of the file or other object that ACIS shall update from the originating “archive”.
PARAMETERS is an optional list of parameters and their values (PAR1, VAL1, PAR2, VAL2, …). It must always contain an even number of items. The parameters are configuration parameters and override the corresponding values specified via $ACIS::MetaUpdate::Request::CONF, if any.
The request-target-url and archive-id configuration parameters are required. Both must have a defined and non-empty value.
The function dies on insufficient configuration, returns 1 on success and undef otherwise.
If the log-filename parameter was specified, the operation will be logged to that file. The log will include full information about the request that has been sent (i.e. the target URL, the archive id, the OBJECT value) and the status of the response it got, if any. Every log record will be accompanied by the date and time of the request.
Generated: Wed Aug 29 22:59:09 2007
ACIS project, acis@openlib.org