ACIS Stage D Plan
This paper discusses three additional pieces of work to be accomplished in Stage D of ACIS. These are
- full-text file recognition,
- document to document associations,
- fuzzy author name search.
Full-text file recognition
In an ideal world, the document metadata in a digital library would always be complete and correct. Every document would have its full text available, and the full-text links would always point to the actual documents. But that is in an ideal world.
One way to populate a digital library with full-text contents is to look at metadata on papers, and find the corresponding full-text files on the Internet using automated means. However, the automated association between metadata and full-text file is not reliable.
Even in the more formal collections, there is often confusion about what the full text is and how it should be treated. As a result, many metadata collections point to intermediate pages, which usually hold a link to the full text, rather than to the full text itself.
Therefore, an ACIS installation can invite authors to review full-text links for their documents and let them report wrong links or intermediate pages.
So we add an archival profile to ACIS for this. The archival profile contains associations between document records and their full-text files (the URLs). The author cannot change it directly; it remains read-only. However, the author may express opinions about items in the profile, i.e. the document-URL associations.
When the author enters the archival profile, we present her with a set of papers from her research profile. This is the set of papers for which full-text file data is available. For each paper, there are one or more full-text file URLs. For each file, the user can make a choice in one or two menus. The first menu is the recognition menu, the second is the permission menu. The permission menu is only shown when certain values are chosen in the recognition menu.
In the recognition menu, there are four mutually exclusive options.
- This is not a file related to this paper.
- This is a web page describing the paper.
- This is a full-text file for another version of the paper.
- This is a full-text file for this version of the paper.
The default is the fourth (last) option.
The options will be expressed in shorter words, but there will be a help button.
The permission menu is only shown if one of the last two options has been chosen. In the permission menu, there are three options.
- This full-text file may be archived as is.
- This full-text file may be archived, but check for updates.
- This full-text file may not be archived.
The default is the first option.
Co-authors may make contradictory choices. An ACIS installation records such choices, but does not resolve them.
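The two menus and the handling of contradictory choices can be sketched as a small data model. This is only an illustration under assumptions: the names Recognition, Permission, Opinion, make_opinion and record are hypothetical and not part of ACIS, and the in-memory dictionary stands in for whatever storage an installation would use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Recognition(Enum):
    NOT_RELATED = 1        # not a file related to this paper
    DESCRIPTION_PAGE = 2   # a web page describing the paper
    OTHER_VERSION = 3      # full text of another version
    THIS_VERSION = 4       # full text of this version (default)


class Permission(Enum):
    ARCHIVE_AS_IS = 1          # default
    ARCHIVE_CHECK_UPDATES = 2
    DO_NOT_ARCHIVE = 3


def permission_menu_shown(recognition: Recognition) -> bool:
    """The permission menu only appears for the last two recognition options."""
    return recognition in (Recognition.OTHER_VERSION, Recognition.THIS_VERSION)


@dataclass(frozen=True)
class Opinion:
    author_id: str
    document_id: str
    url: str
    recognition: Recognition
    permission: Optional[Permission]  # None when the permission menu is not shown


def make_opinion(author_id: str, document_id: str, url: str,
                 recognition: Recognition = Recognition.THIS_VERSION,
                 permission: Permission = Permission.ARCHIVE_AS_IS) -> Opinion:
    """Build an opinion, dropping the permission when it does not apply."""
    chosen = permission if permission_menu_shown(recognition) else None
    return Opinion(author_id, document_id, url, recognition, chosen)


# Hypothetical store: one entry per (document, URL), one opinion per author.
Store = Dict[Tuple[str, str], Dict[str, Opinion]]


def record(store: Store, opinion: Opinion) -> None:
    """Record an author's choice; contradictory choices are kept side by side."""
    key = (opinion.document_id, opinion.url)
    store.setdefault(key, {})[opinion.author_id] = opinion
```

Keying the store by author keeps every co-author's choice, so nothing has to be resolved at recording time.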
The input format is AMF. It is structured in one of the following two forms.
<text id=".."> <file> <url>url</url> </file> </text>
<text id=".."> <hasversion> <text> <file> <url>url</url> </file> </text> </hasversion> </text>
The latter form would be used if the full-text has been automatically found.
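As a rough sketch of how the two forms could be read, the following uses Python's standard xml.etree.ElementTree. It assumes the fragments appear exactly as above, without XML namespaces; a real AMF file may declare a namespace, which would require qualified tag names. The function name fulltext_urls and the example identifiers and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def fulltext_urls(amf_fragment: str) -> List[Tuple[str, str, bool]]:
    """Return (document id, url, automatically_found) for one <text> element."""
    text = ET.fromstring(amf_fragment)
    doc_id = text.get("id", "")
    found = []
    # First form: the file is attached directly to the document record.
    for url in text.findall("./file/url"):
        found.append((doc_id, (url.text or "").strip(), False))
    # Second form: the file hangs off a nested version, which the plan
    # reserves for full texts that have been found automatically.
    for url in text.findall("./hasversion/text/file/url"):
        found.append((doc_id, (url.text or "").strip(), True))
    return found


direct = ('<text id="example:p1"> <file> <url>http://example.org/p1.pdf</url>'
          ' </file> </text>')
nested = ('<text id="example:p2"> <hasversion> <text> <file>'
          ' <url>http://example.org/p2.pdf</url> </file> </text>'
          ' </hasversion> </text>')

print(fulltext_urls(direct))
print(fulltext_urls(nested))
```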
Document to document associations
Often, there are different versions of the same document.
Sometimes there are versions published in different channels. Such versions may be linked. The author may wish to make this link explicit. This is an administrative link.
Sometimes, an author may write a paper on a subject, and later write a second paper that’s a development of the first paper. The author may wish to make this explicit. This is a thematic link.
We would like to allow some generic links between papers.
Each author can only create links between papers that she has authored, i.e. that are in her research profile. Thus, if author A1 and author A2 have collaborated on paper P1 and author A2 and author A3 have collaborated on paper P2, and P2 is a development of paper P1, then only author A2 can make the link. As a result, some authors will have the opportunity to make many links, while others will have few.
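The rule amounts to intersecting the author sets of the two papers. A minimal illustration, in which the authorship dictionary and the function name are assumptions made for the example:

```python
from typing import Dict, List, Set


def authors_who_can_link(authorship: Dict[str, Set[str]],
                         source: str, destination: str) -> List[str]:
    """Authors who have both papers in their research profile."""
    return sorted(authorship[source] & authorship[destination])


# The example from the text: A1 and A2 wrote P1, A2 and A3 wrote P2.
authorship = {"P1": {"A1", "A2"}, "P2": {"A2", "A3"}}
print(authors_who_can_link(authorship, "P1", "P2"))  # ['A2']
```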
The link is always of a certain type. Link types are configured via ACIS configuration. Each link type has a name, and an explanatory string. Link types can be symmetric or directed. Directed link types may have a reverse type equivalent. In that case, each link of such type has a reverse equivalent.
If links are enabled in ACIS configuration and if there are two or more documents in the research profile, a document links profile is available to the user.
When the user enters the document links profile, the existing links screen is shown, if there are any. If there are no links, the create links screen is shown.
The existing links screen shows each document that has at least one link attached. Below each such document, the linked documents are listed. For each linked document, the type of link is shown, with a “delete” button next to it.
At the bottom of the existing links screen, there is a hyperlink to the create links screen.
If a document link is deleted, it disappears from both source and destination documents. The existing links page is refreshed.
When a link of a directed type is deleted, and if the link type admits a reverse, then the reverse link is deleted too.
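The behaviour of link types on creation and deletion can be sketched as follows. This is an illustration under assumptions, not the ACIS implementation: the class names, the link type names "develops", "developed-by" and "same-work", and the choice to store a symmetric link as two entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

Link = Tuple[str, str, str]  # (source document, link type name, destination document)


@dataclass(frozen=True)
class LinkType:
    name: str
    explanation: str
    symmetric: bool = False
    reverse: Optional[str] = None  # name of the reverse type, directed types only


class LinkStore:
    def __init__(self, types: Iterable[LinkType]) -> None:
        self.types = {t.name: t for t in types}
        self.links: Set[Link] = set()

    def _group(self, source: str, type_name: str, destination: str) -> Set[Link]:
        """The link plus the entry that must appear or vanish together with it."""
        link_type = self.types[type_name]
        group = {(source, type_name, destination)}
        if link_type.symmetric:
            group.add((destination, type_name, source))
        elif link_type.reverse is not None:
            group.add((destination, link_type.reverse, source))
        return group

    def create(self, source: str, type_name: str, destination: str) -> None:
        self.links |= self._group(source, type_name, destination)

    def delete(self, source: str, type_name: str, destination: str) -> None:
        # Deleting either direction removes the link from both documents.
        self.links -= self._group(source, type_name, destination)

    def links_of(self, document: str) -> List[Link]:
        """Links to list below a document on the existing links screen."""
        return sorted(link for link in self.links if link[0] == document)


store = LinkStore([
    LinkType("develops", "is a development of", reverse="developed-by"),
    LinkType("developed-by", "was developed into", reverse="develops"),
    LinkType("same-work", "is another version of", symmetric=True),
])
store.create("P2", "develops", "P1")
print(store.links_of("P1"))  # [('P1', 'developed-by', 'P2')]
store.delete("P1", "developed-by", "P2")
print(store.links)  # set()
```

For this to work in both directions, each type of a reverse pair has to name the other as its reverse in the configuration.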
In the create links screen, there is a table with three columns. The source documents list is on the left, the destination documents list is on the right, and the link type list is in the middle. The user can choose exactly one item from each list. There is a “create link” button at the top and at the bottom.
Fuzzy author name search
Up until now, searches in the author name expressions data only look for exact occurrences of name variations. No fuzzy searching is conducted. The main reason is the computational expense of such searches.
We should do a limited amount of fuzzy searching. With a good strategy, we can limit the amount of computations while improving the completeness of the dataset.
Fuzzy searching makes sense because there are misspellings of author names in the name expressions data.
Misspellings are not variants of a person’s name. They are errors in the author name data, and they are not systematic. For example, if a person is called Müller, then Muller and Mueller are variants, since they reflect two common ways of writing the name where umlaut input is difficult. But Müler would be a mistake. Being an error, it would appear only rarely in the dataset.
With the database structures in ACIS, it is not hard to check how common a name is among all the name expressions and name variations. That is an easy way to limit the amount of computation for fuzzy searches: only rare names would be checked.
By default, we would only run fuzzy searches for author name expressions that appear only once and are currently not covered by any name variation of any author.
For each author, and for each name variation of the author, ACIS can search all name expressions in the database, calculate the Levenshtein distance, and if it is below a critical level, suggest the document where the expression occurs to the author.
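For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. A minimal sketch using the standard two-row dynamic programming scheme follows; it is not taken from ACIS, and a production system might prefer an optimized library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("Müller", "Müler"))  # 1
```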
Since this is a lot of work, the amount of searching will be configurable within an ACIS installation. Some parameters are
- The number of characters n at the start of a name variation that have to match the name expression exactly. When matching a name variation against the set of name expressions, only those expressions whose first n characters match exactly are selected for the fuzzy search. The default is 3.
- The minimum number of characters m that a name variation would have to have in order to qualify for being fuzzy matched. The default is 7.
- The maximum number of occurrences that a name expression may have in the name expression table and still be considered for fuzzy matching. The default is 1. If this parameter is set to 0, no maximum is checked.
- The maximum number of occurrences that a name expression may have in the name variations table and still be considered for fuzzy matching. The default is 0, i.e. a name expression should not be present among the name variations. Set it to -1 to disable this limit.
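The four parameters above act as a cheap filter that runs before any distance is computed. The sketch below shows one way they could be combined. The names FuzzyConfig, expression_is_rare and candidates are hypothetical, the two count dictionaries stand in for the name expression and name variations tables, and the name strings are invented examples. Prefix matching is assumed to be case-sensitive.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FuzzyConfig:
    prefix_length: int = 3         # n: leading characters that must match exactly
    min_variation_length: int = 7  # m: shorter name variations are skipped
    max_expression_count: int = 1  # 0 disables this limit
    max_variation_count: int = 0   # -1 disables this limit


def expression_is_rare(expression: str, expression_counts: Dict[str, int],
                       variation_counts: Dict[str, int], cfg: FuzzyConfig) -> bool:
    """Apply the two occurrence limits to one name expression."""
    if (cfg.max_expression_count != 0
            and expression_counts.get(expression, 0) > cfg.max_expression_count):
        return False
    if (cfg.max_variation_count != -1
            and variation_counts.get(expression, 0) > cfg.max_variation_count):
        return False
    return True


def candidates(variation: str, expression_counts: Dict[str, int],
               variation_counts: Dict[str, int], cfg: FuzzyConfig) -> List[str]:
    """Name expressions worth a Levenshtein comparison with this variation."""
    if len(variation) < cfg.min_variation_length:
        return []
    prefix = variation[:cfg.prefix_length]
    return sorted(e for e in expression_counts
                  if e.startswith(prefix)
                  and expression_is_rare(e, expression_counts, variation_counts, cfg))


expression_counts = {"Mueler, Hans": 1, "Mueller, Hans": 12,
                     "Muller, Hans": 1, "Meuller, Hans": 1}
variation_counts = {"Mueller, Hans": 1, "Muller, Hans": 1}
print(candidates("Mueller, Hans", expression_counts, variation_counts, FuzzyConfig()))
# ['Mueler, Hans']
```

With the defaults, "Mueller, Hans" itself is dropped because it is common and already a known variation, while "Meuller, Hans" is dropped because its first three characters differ.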
We define the distance level of a name expression to a name variation as the Levenshtein distance between the name expression and the name variation, divided by the string length of the name variation. The default critical distance level is 1/m. If the distance level is below the critical level, the document is suggested for the author’s research profile.
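A short worked example of this threshold follows. It takes the Levenshtein distance as a given number and uses exact fractions to avoid rounding at the boundary. The function names are hypothetical, and "below" is read here as strictly below, which is an assumption.

```python
from fractions import Fraction


def distance_level(levenshtein_distance: int, variation: str) -> Fraction:
    """Levenshtein distance divided by the length of the name variation."""
    if not variation:
        raise ValueError("name variation must not be empty")
    return Fraction(levenshtein_distance, len(variation))


def is_suggested(levenshtein_distance: int, variation: str, m: int = 7) -> bool:
    """True if the distance level is strictly below the critical level 1/m."""
    return distance_level(levenshtein_distance, variation) < Fraction(1, m)


variation = "Mueller, Hans"  # 13 characters
print(is_suggested(1, variation))  # True:  1/13 < 1/7
print(is_suggested(2, variation))  # False: 2/13 > 1/7
print(is_suggested(1, "Krichel"))  # False: 1/7 is not below 1/7
```

Under the strict reading and the default m of 7, a name variation of exactly seven characters never produces a fuzzy suggestion, since a single edit already reaches the critical level.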
The first draft was written on 2006-06-31 by Thomas Krichel. It was much reworked at a meeting of Thomas Krichel and Ivan V. Kurmanov in Moscow on 2006-09-03 and 2006-09-04. Ivan and Thomas are grateful to Evgenya G. Stupina for hospitality, and to William L. Goffe for comments on a previous version.