ACIS Stage Three Plan

“Saskatoon” paper

By Thomas Krichel and Ivan Kurmanov.

This is the plan for stage 3 of ACIS [1].  It is also available in MS Word format, 71 kb.

ACIS Project is funded by the Open Society Institute.

To understand much of this document requires some knowledge of the existing ACIS software and its first running implementation in the RePEc Author Service (RAS). We start with this background, then set out the goals and outline the implementation plan.

1. Background

1.1. Submitting documents

Our experience with RAS shows that academics are eager to share their personal and research data. Many users write to us asking how to add new documents to their research profiles. Some are unhappy because they expect to find this feature and have searched for it without success [2]. Some users reach their aim by submitting their documents to EconWPA, which supplies its document data to RAS. They then wait until the data reaches ACIS and ACIS processes it, and finally claim the document in their research profile. For many others this doesn’t work, because EconWPA is a working papers archive, not suitable for articles and books.

In the meantime, ACIS has been fitted with a new feature called “automatic research profile update”, ARPU for short. ARPU works offline. It runs research searches on behalf of the users. If it finds a match between one of a person’s name variations and an author name of a document, it offers the document to the user. If a document’s metadata points directly to a particular person via an identifier, ARPU adds the document to the user’s research profile. In either case ARPU sends an email to the person, explaining what happened and what was found. The user can then accept or refuse the offer, or correct a mistake if one was made. No email is sent if ARPU has not found anything interesting.
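The two ARPU outcomes described above can be sketched roughly as follows. This is an illustration only, not the ACIS code: the class names, the `normalize` helper and the exact matching rule are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    author_names: list
    # Person identifiers quoted directly in the metadata, if any.
    author_ids: list = field(default_factory=list)


@dataclass
class Person:
    short_id: str
    name_variations: set


def normalize(name):
    """Lower-case and collapse whitespace (an assumed matching rule)."""
    return " ".join(name.lower().split())


def arpu_pass(person, documents):
    """Split new documents into auto-claimed ones and suggestions."""
    variations = {normalize(v) for v in person.name_variations}
    claimed, suggested = [], []
    for doc in documents:
        if person.short_id in doc.author_ids:
            claimed.append(doc)    # direct identifier: add to the profile
        elif any(normalize(a) in variations for a in doc.author_names):
            suggested.append(doc)  # name match only: offer to the user
    # An email is sent only when there is something to report.
    return {
        "claimed": claimed,
        "suggested": suggested,
        "send_email": bool(claimed or suggested),
    }
```

A document that quotes the person's identifier ends up in `claimed` even if the author name is spelled in an unknown way, which is the property the rest of this plan relies on.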

One implication of ARPU is highly relevant for further development. A new document record that quotes the identifier of a particular person will be claimed automatically within minutes after ACIS finds it [3]. This feature lays the foundation we are going to build upon.

1.2.  Some other technical details

Each personal record in ACIS gets two identifiers. The primary one is known as the record identifier and includes a date and a form of the person’s name. The second one is known as the short-id. It is short and doesn’t tell you much about the person identified. ACIS ensures a one-to-one mapping between these two identifiers, which means they can usually be used interchangeably [4].

ACIS uses a MySQL database engine to store some of its data. A MySQL database can be accessed transparently over the network. MySQL includes a comprehensive access authorization system, based on users and privileges. There are free, open-source tools, such as phpMyAdmin, that help to manage users and privileges and to set who can access what and in what way.

ACIS works on external metadata collected in local data files. ACIS does not include any transport mechanism to get these data files from their original place to the ACIS installation machine. In the case of RAS, this data is collected from many independent providers.

2. Goals

Users want to add their documents to their research profiles. One approach is to build a document submission facility into ACIS on our own. This is a costly task if one wants to do it well, and we don’t have the resources for it. Another approach is to integrate existing open-source software, such as EPrints, to provide such a facility. But integrating two different pieces of software into a single application requires more time and effort than we can spend.

Within reasonable time and with reasonable effort we can make ACIS cooperate with external submission services. Well-organized cooperation can simplify the user experience and provide for better quality of the collected metadata.

Imagine the following scenario. A user submits a document to a document archive. She enters the names of the document authors among other document information. The document archive searches for personal records whose names or name variations match what the submitting user entered. If it finds any personal records which look appropriate, the service might suggest them to the user, so that she can identify the author herself. When the submission is ready, the archive notifies an ACIS installation about it. ACIS processes the new document data and the document is automatically added to the personal profiles of the identified authors. ACIS sends them an email notifying them of this addition.

In ACIS stage three we aim at such a user experience.

3. Implementation strategy

The goal requires some development on the ACIS side and some development on the side of other software. We see the EPrints software as the best candidate for a reference implementation of cooperation-with-ACIS features. To simplify possible implementation of these features in other document archive software, they need to be implemented as re-usable modules with a simple, clear interface.

Cooperation between a document archive and an ACIS installation consists of two loosely-related interactions.  First, a document archive accesses the ACIS database to search for personal record data. This is the author identification aid. Second, a document archive requests ACIS processing for a new or updated document record. This is the request for metadata processing.

3.1. Author identification aid module

Identifying a person may happen in several ways. First, a user may enter the full name or last name of the person. This may lead to one or more possible personal records. Second, a user may enter the person’s email address. ACIS may recognize this as the email address of a registered person. Finally, a user may enter a personal identifier or a short-id. Thus the module for author identification must support searches on these fields and return a list of personal records, if any are found.

Each personal record returned will contain: full personal name, short-id, homepage URL, URL of a detailed profile page about the person.
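As a rough sketch of the module's interface, the record below carries the four fields just listed, and a small helper decides which kind of search a user's input calls for. The field names, the function name and the two regular expressions are assumptions for illustration; in particular, the patterns only approximate the identifier shapes described in footnote 4.

```python
import re
from dataclasses import dataclass


@dataclass
class PersonalRecord:
    full_name: str
    short_id: str
    homepage_url: str
    profile_url: str


# Assumed shapes: a short-id is a few letters plus a counter; a record
# identifier (handle) contains a date, e.g. "prefix:1965-06-05:some_name".
SHORT_ID_RE = re.compile(r"^[a-z]{2,3}\d+$")
HANDLE_RE = re.compile(r"^\S+:\d{4}-\d{2}-\d{2}:\S+$")


def classify_query(text):
    """Guess which field a user's input should be searched against."""
    q = text.strip()
    if "@" in q:
        return "email"
    if HANDLE_RE.match(q):
        return "handle"
    if SHORT_ID_RE.match(q):
        return "short-id"
    return "name"
```

An archive's submission form could call `classify_query` on whatever the user typed and then run the matching search, returning a list of `PersonalRecord` objects.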

The module will query a local or remote ACIS database.  It will be configurable with MySQL access parameters. One will have to get permission and access parameters from an ACIS installation administrator to use it.

ACIS will be adjusted to create and maintain appropriate database tables [5] for easy and efficient work of this module.  This only requires a number of slight changes to the existing tables.
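The kind of lookup such tables would support might look like the sketch below. The table and column names are hypothetical, since the real ACIS tables are not specified here, and the example uses Python's built-in SQLite as a stand-in for MySQL so that it runs without a server. The sample row is made up.

```python
import sqlite3

# Hypothetical schema; the actual ACIS tables may look quite different.
SCHEMA = """
CREATE TABLE persons (short_id TEXT PRIMARY KEY, handle TEXT,
                      full_name TEXT, homepage TEXT, profile_url TEXT);
CREATE TABLE name_variations (short_id TEXT, variation TEXT);
"""

FIELDS = ("full_name", "short_id", "homepage", "profile_url")


def find_by_name(conn, name):
    """Return records having a name variation equal to `name`, ignoring case."""
    rows = conn.execute(
        """SELECT DISTINCT p.full_name, p.short_id, p.homepage, p.profile_url
           FROM persons p
           JOIN name_variations v ON v.short_id = p.short_id
           WHERE lower(v.variation) = lower(?)
           ORDER BY p.short_id""",
        (name.strip(),),
    ).fetchall()
    return [dict(zip(FIELDS, row)) for row in rows]


def demo_connection():
    """Build an in-memory database with one made-up person."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO persons VALUES (?, ?, ?, ?, ?)",
        ("tk1", "RePEc:per:1965-06-05:thomas_krichel", "Thomas Krichel",
         "http://example.org/~tk", "http://example.org/profile/tk1"),
    )
    conn.executemany(
        "INSERT INTO name_variations VALUES (?, ?)",
        [("tk1", "Thomas Krichel"), ("tk1", "Krichel, Thomas")],
    )
    return conn
```

Passing the user's input as a query parameter, rather than pasting it into the SQL string, matters here because the text comes straight from a public submission form.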

3.2. Requests for metadata processing

Whenever a document archive accepts a new submission or a change to an existing one, it may want an ACIS installation to process it as soon as possible. It will then send an HTTP request to ACIS, identifying itself and specifying what needs to be updated. This assumes that the archive is already a supplier of metadata to the installation. New or updated document data will probably get processed anyway, but such requests can speed up the process.

The request may be either an HTTP GET or a POST request. Either way, the request must include two parameters: “id” and “obj”. Parameter “id” contains an identifier of the data supplier, i.e. the document archive. Parameter “obj” points to the particular object which the archive wants ACIS to process.

In RePEc terms, “id” will contain an archive identifier (“RePEc:bil”) and “obj” will contain a pathname to the new/updated file that needs to be mirrored & processed, relative to the root directory of the archive’s access point (“bilpap/yr2005.rdf”).
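On the client side, building such a request is mostly a matter of encoding the two parameters. A minimal sketch, assuming a hypothetical endpoint URL (the plan does not fix one) and a GET request:

```python
from urllib.parse import urlencode


def build_update_url(endpoint, archive_id, obj):
    """Return the GET URL asking ACIS to process one object."""
    # urlencode percent-escapes ":" and "/" so both values survive intact.
    query = urlencode({"id": archive_id, "obj": obj})
    return f"{endpoint}?{query}"


# Example values taken from the RePEc case above; the host is made up.
url = build_update_url("http://acis.example.org/update",
                       "RePEc:bil", "bilpap/yr2005.rdf")
```

The same two parameters could equally be sent as the body of a POST request.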

Upon receiving such a request ACIS will:

  1. Authorize the request
  2. Identify the requesting party (the archive)
  3. Fetch the object (file), which was requested to update
  4. Put it on the data processing queue

ACIS will authorize the request by checking its originating IP address against a list of allowed IP addresses. These IP addresses are maintained by the administrator of the archive. If IP authentication fails, ACIS sends an HTTP response with “403 Access Forbidden” status code.
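The IP check itself is simple. The sketch below is one possible way to do it with Python's standard `ipaddress` module; allowing network ranges in addition to single addresses is an assumption of the example, not something the plan requires.

```python
import ipaddress


def is_authorized(remote_addr, allowed):
    """True if remote_addr equals, or falls inside, any allowed entry."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # a malformed address is never authorized
    for entry in allowed:
        # strict=False accepts both "192.0.2.0/24" and a bare "192.0.2.7".
        if addr in ipaddress.ip_network(entry, strict=False):
            return True
    return False
```

A request for which `is_authorized` returns `False` would get the 403 response.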

If authentication is successful, ACIS will pass the update request to a data-mirroring function. This function checks if it knows the archive. If it does, it fetches the object, probably a file, and stores it in a local copy of the archive’s data. If it does not, it returns an undefined value and ACIS returns an HTTP response with “204 No Content” status.

If archive identification and object (file) fetching were successful, ACIS saves the object (file) on the local file system and processes it [6]. The new or updated file and its data go through the usual data processing.

If gathering the data from the document archive was successful, ACIS sends an HTTP response with “200 OK” status.
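Put together, the server-side handling described in this section could be sketched as below. The function and parameter names are invented for the example, the sample mirror function only computes a local path instead of really fetching anything, and the 400 response for missing parameters is an assumption, since the plan does not say what happens in that case.

```python
from collections import deque


def handle_update_request(remote_addr, params, allowed_ips, mirror, queue):
    """Return the HTTP status code for one update request."""
    if remote_addr not in allowed_ips:
        return 403                        # request not authorized
    archive_id, obj = params.get("id"), params.get("obj")
    if not archive_id or not obj:
        return 400                        # assumption: not covered by the plan
    local_path = mirror(archive_id, obj)  # None if the archive is unknown
    if local_path is None:
        return 204
    queue.append(local_path)              # hand over to data processing
    return 200


def sample_mirror(archive_id, obj):
    """Stand-in mirroring function with one known archive (made-up path)."""
    known = {"RePEc:bil": "/var/acis/data/bil"}
    root = known.get(archive_id)
    if root is None:
        return None
    return f"{root}/{obj}"
```

With `queue = deque()`, a request from an allowed address for `RePEc:bil` puts one path on the queue; the same request naming an unknown archive leaves the queue untouched.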

The name of the data-mirroring function will be configurable by the administrator. Since ACIS can be used on a set of data in files of completely different structures and using different underlying technology, an appropriate data-mirroring function will have to be written for it externally [7].
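One common way to make a function configurable by name is to store its dotted name in the configuration and resolve it at run time. The sketch below shows the idea in Python purely for illustration; it says nothing about how ACIS itself will load the function, and `resolve_function` is an invented name.

```python
import importlib


def resolve_function(dotted_name):
    """Turn a configured name such as 'package.module.func' into a callable."""
    module_name, _, func_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {dotted_name!r}")
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError(f"{dotted_name!r} is not callable")
    return func
```

An installation working with RePEc data and one working with OAI data would then differ only in the configured name and in the externally written function behind it.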

The data processing request interface will be implemented in two parts: a client part and a server part. The client part will be a re-usable module with a simple interface, for use by document archives. The server part will be an integral part of ACIS.

This update request interface will allow a document service like EconWPA to “push” its recent additions to the RePEc Author Service in near real time.

3.3. Reference implementation in EPrints

The EPrints software will be enhanced with three features:

  1. AMF metadata output;
  2. Author identification;
  3. Metadata processing request.

Configuration parameters will be introduced for each of these items. The EPrints source code changes will be coordinated with its developers, in the hope that we will not have to maintain a separate version.

3.3.1. AMF export in EPrints

EPrints, by default, generates metadata in Dublin Core for all the documents contained in an EPrints archive. Provided such metadata has a DC:creator field and a DC:title field, as well as an identifier, such documents can readily be given a primitive AMF translation. Here DC:title becomes AMF:title, and DC:creator becomes AMF:/hasauthor/person/name. In addition, it will be possible to capture the handle of the document in EPrints as AMF:id, and the URL of the HTML page in EPrints as AMF:displaypage. This is a minimum conversion at this stage. It implies that there is only one collection for the whole of the archive, and that all documents in the archive belong to that collection. Software funded under this project will implement this basic conversion and store its output, one file per record, on the EPrints machine.
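The field mapping just described can be sketched as follows. The element names title, hasauthor, person, name and displaypage come from the mapping above; the wrapping `amf` and `text` elements, the use of an `id` attribute, and the omission of XML namespaces are assumptions of this example and should be checked against the AMF specification.

```python
import xml.etree.ElementTree as ET


def dc_to_amf(dc, handle, page_url):
    """Convert minimal Dublin Core fields into a primitive AMF record."""
    root = ET.Element("amf")                           # assumed wrapper
    text = ET.SubElement(root, "text", {"id": handle})  # assumed record element
    ET.SubElement(text, "title").text = dc["title"]
    for creator in dc["creator"]:
        # One hasauthor/person/name branch per DC:creator value.
        hasauthor = ET.SubElement(text, "hasauthor")
        person = ET.SubElement(hasauthor, "person")
        ET.SubElement(person, "name").text = creator
    ET.SubElement(text, "displaypage").text = page_url
    return ET.tostring(root, encoding="unicode")
```

Writing one file per record then amounts to calling `dc_to_amf` once per document and saving each result under a name derived from the document handle.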

4. Additional work

4.1. Implementation for EconWPA

For EconWPA, we offer to rewrite their submission script, and we will furnish a script that implements instant claim processing. This script will come with a full AMF conversion of the internal metadata format, which has traditionally been converted to ReDIF first. The result will then be pushed to the RePEc Author Service via the metadata update interface to be built.


Footnotes
1:

Its first draft was written when Ivan visited Thomas in Novosibirsk 2004-07-22 to 2004-07-28. His travel was funded by the Ford Foundation via its grant to the Socionet project. This is a much revised version of 2004-12-01.

2:

While providing such a facility seems to be an obvious way to meet the expectations of users, we have no plans to do so. In fact, doing this has been ruled out since the Montreal document. There are two reasons. First, if every user is able to create new documents in the ACIS documents database, this opens the door to a flood of low-quality and duplicate material. Second, there are tools and services for archiving documents, and there’s no need to duplicate their functionality. Otherwise ACIS would become another EPrints or DSpace.

3:

The time it takes depends on the configuration.

4:

The identifier of a person combines an expression of the name of the person, in ASCII transliteration, with a date in the lifetime of the person. If the person does not give us a date, we use the date of registration. In each ACIS installation a prefix is added. For RePEc, the prefix is “RePEc:per:”. Thus RePEc:per:1965-06-05:thomas_krichel is a typical identifier. This identifier is called the person handle. The person handle is an intelligent identifier. When looking at the identifier a human should be able to find the person in question. The use of the date means that the identifier is long-lived. The scheme can be used without any problem for several centuries, which is what we aim at in building personal identification. Thus intelligence and longevity are the attraction of this scheme. However, they also imply drawbacks. First, some people object to their birthdays being made public. Second, if the name of the person changes, she may be tempted to change the handle. Third, the handle is long, and we know, from our experience with users, that they prefer short handles. As an aside, HoPEc, the precursor system, did not show handles at all, since the dates buried in them were used as passwords.
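As an illustration of the scheme, the sketch below builds a handle from a name and a date. The transliteration rule (strip accents, lower-case, join words with underscores) is an assumption chosen to reproduce the example above; the rule ACIS really applies may differ.

```python
import unicodedata
from datetime import date


def make_handle(name, registered, given_date=None, prefix="RePEc:per:"):
    """Build a person handle of the form <prefix><date>:<ascii_name>."""
    # Fall back to the registration date if the person gave no date.
    day = given_date or registered
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    slug = "_".join(ascii_name.lower().split())
    return f"{prefix}{day.isoformat()}:{slug}"
```

For instance, `make_handle("Thomas Krichel", date(2004, 7, 22), date(1965, 6, 5))` yields the typical identifier quoted above.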

These problems with the use of personal handles led Ivan to construct an alternative scheme to refer to a person, known as the “short-id”. The short-id is formed by taking the first letters of the person’s first and last name and appending a counter to ensure uniqueness. This is a typically dumb identifier. It is short, but relies on a feature of ACIS and will not be of much use in future centuries when ACIS is likely to be gone. While ACIS is around, it is safe to deploy short-ids. Strictly speaking, though, a short-id does not identify the person. Instead, it identifies the record in ACIS that describes the person.
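A short-id generator along these lines might look like the following sketch. Taking exactly one letter from each name and starting the counter at 1 are assumptions; the text above only says that first letters and a counter are used.

```python
def make_short_id(first_name, last_name, taken):
    """Return the first free short-id for this name; `taken` holds used ids."""
    stem = (first_name[:1] + last_name[:1]).lower()
    counter = 1
    while f"{stem}{counter}" in taken:
        counter += 1
    return f"{stem}{counter}"
```

The counter is what ties the identifier to one ACIS installation: uniqueness can only be checked against the set of ids that installation has already issued.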

5:

Users of document archives must be given access to the latest version of the ACIS database. However, they do not need to see the whole database, but only a part of it. They should not be given access to the whole database anyway, because some elements, such as email addresses, must be hidden from public access. At document archive services, users will interact with interfaces that will allow them:

  • to verify a known short-id
  • to query the ACIS database for short-ids and/or handles matching a name variation

This will be achieved using the MySQL database used by ACIS. Such databases can be accessed remotely over the network. Currently, ACIS creates some tables for its own work. We will document those existing tables and define new tables intended especially for external use.

By default, these tables will cover

  • name data (first name, last name etc)
  • name variations
  • handles
  • short ids

By default, these tables will be open for public access by anyone. But ACIS installation administrators can easily restrict access to them, if they so wish. There are free, open-source tools, such as phpMyAdmin, which can help with the task.

6:

In fact, it issues a request to the ACIS update daemon to process it.

7:

But note that a data-mirroring mechanism has to exist anyway, to get the data into ACIS in the first place. All that has to be done is to allow such software to process a small set of records. Ideally there should be only one. Under OAI, this is trivial.