Citations in ACIS

document source in Markdown format

Table of contents

   Introduction
      Why? What for?
      Workflow overview
   Citation profile: user interface
      The most interesting document
      Per-document screens: identified and potential tabs
      “Auto suggestions” screen
      Document list screen
      Refused citations screen
      Citation profile overview screen
   Automatic citation profile update
      Automatic Update Preferences screen
      Notification emails
   Citations data input
   Citations data output
   Administration and configuration
      Input data
      Citations screens parameters
   Technical implementation details / developer’s guide
      Processing the citations data
         Citations table
         Citation string normalization:
         Cutting out the editors part
      Search for citations and ACPU
         Automatic citation profile update
         Search workflow
      Citation to document similarity
         Similarity assessment function interface
         The default citation-document similarity assessment algorithm
      Citation suggestions
         Citation to document similarity table: cit_doc_similarity
         other citation suggestions table: cit_sug
         Old suggestions table: cit_old_sug
      Similarity matrix
         Data structure
         main matrix operations
      Userdata structure
      Storing and passing citations around
      Modules
         ACIS::Citations
         ACIS::Citations::Utils
         ACIS::Citations::Input
         ACIS::Citations::Suggestions
         ACIS::Citations::SimMatrix
         ACIS::Citations::Search
         ACIS::Web::Citations::Profile
         ACIS::Web::Citations
         ACIS::Citations::AutoUpdate
      Other notes
         SysProf table profile parameters
         Consistency maintentance
         Database maintenance

Introduction

Why? What for?

Authors have a strong desire to find out

We want to give authors such data. And that is our reason no. 1.

This is not only to caress their vanity, but also to support further research dialogue, especially when this data is published in an accessible and reusable way.

Automated citation parsing and indexing systems exist (CiteSeer, also see Open Citation) and are able to provide such data, when applied to an appropriate collection. But automated citation processing is technically a very complicated task and mistakes are not rare.

But we believe that the authors themselves can verify and filter automatically produced citation data. Thus, with the authors’ help, we will get improved citation data. Now that is our reason no. 2.

By publishing the resulting data in a simple, documented form, an ACIS service will support creation of additional services based on that data.

Therefore, we follow a two-level approach. First level: autonomous citation indexing. Citations get parsed, extracted and indexed with an automated reference and citation parsing system. That happens outside ACIS, by systems like CiteSeer.

Second level happens in ACIS: human users review, validate and filter the data.

Workflow overview

Here is what happens and what we do with citations.

A citation indexing service processes the same set of data on which an ACIS service is running. It parses the documents’ full texts and finds citations and tries to identify, which citations point to which documents. It then encodes its resuls, i.e. the citations data, in the specified input format.

The data is then supplied to the ACIS service. (How the data is transfered is out of the scope of our concern.) Service administrator will have to get things configured for this.

Then ACIS starts offering citations to its users and they create their own citation profiles. This means they identify citations as pointing to their own particular documents. They reject some citations which are not relevant to them. At this stage ACIS software tries hard to be helpful and make the process easy.

As a result of this interaction, ACIS collects improved citations data, and includes it into the personal data that ACIS exports. Third-party projects then may use it for building additional user services based on the data.

In this document we first describe the user interface, then the data formats, then the administrator’s configuration parameters. Technical implementation details are probably interesting to only a few; therefore, they come last.

Citation profile: user interface

Citation profile is a part of the ACIS web interface, where the users deal with citations. The users’ are given power to:

ACIS will search for citations in the database (among the data provided on input), attempt to match the citations to the documents of the user’s research profile, and offer them to the user for review and identification.

Design requirements. The screens of the citation profile should look and work similar to those of the research profile. Consistency has to be maintained.

The most interesting document

Before we offer any citations to the user for identification, we will assess all the citations found for her. By matching (comparing) the citations to the descriptions of the documents of her research profile, we will find:

  1. which document is the most interesting in a sense that it has most potential new citations pointing to it (weighted by their respective similarity measure);

  2. which citations more closely resemble which document’s description.

This will give us the order, in which to present the documents and citations to the user: we will always show the most interesting document first (see auto suggestions) and the most relevant citations at the top. Depending on the similarity level, some of the citations will be pre-selected for identification by default.

Per-document screens: identified and potential tabs

Main work on citations happens on per-document screens, which means the user operates on citations for one particular item of her research profile at a time. The title and other details of the document are shown in large type at the top of the screen. Then there are two tabs on the page, labeled identified and potential.

In the identified tab, the user sees the list of all the citations that she has identified to the document. She can remove the citations that have been mistakenly identified. Such citations become avalailable for all documents of her (to which they match). They are then displayed as new citations, even in the potential tab for the current document.

identified citations screen snapshot

In the potential tab we offer citations for user to identify. We only offer citations that we found similar to the description of the document.

Two kinds of citations may be offered: “new” and “old”. Old citations are the ones that we have already suggested to the user for this particular document in the past. (Thus a citation can only be old for a certain document.) A citation becomes old only after we suggested it to a user and the user submitted the form, but did not select this citation for inclusion. If she didn’t submit the form, which included this citation as new, we will still consider it new.

potential citations screen snapshot

We present the potential citations as shown in the prototype page linked above. The old citations will not be shown by default, only if the user requests it. Both old and new citations are ordered by decreasing similarity of the citation to the document. Some of the new citations, i.e. for which the similarity is higher than a certain threshold, will have the checkbox checked by default. The threshold level is configureable by system administrator.

The “Not my work” button removes a citation from the pool of potential citations for all the documents, and puts it into the refused citations list.

If the user presses the SUBMIT FORM button, all checked citations will be identified as pointing to that document. They will no longer be suggested for identification to other documents.

When there is at least one other document with new citations for it, we show a checkbox near the submit button, labeled “Show next document with new citations”. This checkbox will be checked by default if the current page offers any new citations. If the checkbox is checked when the form is submitted, we process the user’s choices and then redirect her to the next most interesting document.

If we have no potential citations for the document in question, we will show a message saying:

We have no citations to suggest for this document at this time.

“Auto suggestions” screen

When the system has some new unidentified citations for a user, and the user enters citation profile, we take the user to the AUTO SUGGESTIONS screen.

On it we always show the user the potential tab for a document with new citations. The document we show there is always the most interesting document. When user deals with these citations, another document would (very likely) become the most interesting. If she doesn’t uncheck the “move on” checkbox (“Show next document with new citations”), she would be shown the next most interesting document immediately after dealing with the current one. If the user follows this simple procedure, she will be taken through all the new potential citations for her documents, going through the most cited documents first.

The screen is always accessible via “auto suggestions” link-item in the right-hand top corner menu (under “Citations:”). So if user deviates from the procedure, it would be easy for her to continue. When displaying this screen, we highlight this link as a link to the current screen (and it won’t be a link).

For navigation between the documents the users have the “Next” and “Prev” links to the right and left of the document title (respectively). The “Next” link leads to the second most interesting document (if there is such a document).

What if there are no new citations?

If there are no documents for which we have any new citations, we show a brief screen with this message:

We are sorry to tell you that we have no potential
citations data for the documents in your research
profile.

If the user has no documents in the research profile, we show this message instead:

We are sorry to tell you that we have no potential
citations data for the documents in your research
profile, because your research profile is empty.

Document list screen

A table of all documents (research profile items) is shown on the “Document list” screen. It also provides numbers of citations identified to each document and number of new potential citations for each. Each line in the table provides both a link to the identified tab for that document and a link to the potential tab for it. The table may be sorted by the number of citations identified, or by the number of new potential citations for it.

Each document in the table will be presented by a title and a link to its further details.

The screen will also display the total number of identified citations.

Refused citations screen

On the Refused Citations screen users will deal with citations, for which user has pressed the “Not my work” button in the past. Those are the citations to works that the user has told us she is not an author of. She can remove citations, which then may be suggested to the same or other documents, depending on similarity found.

Citation profile overview screen

The main page of the citation profile gives a brief introduction and overview of the profile, with links to appropriate screens. This is similar to research profile overview screen. It offers links to the document list screen, to the auto suggestions screen (i.e. the most interesting document, if there is such a document at the time), and a link to the refused citations screen.

If there are new potential citations for some documents of a person, on the first enterance of the person into his citation profile, we redirect him to the auto suggestions page. But we do this only once per user session to avoid confusion.

Automatic citation profile update

Similar to the automatic research profile update, ACIS may sometimes update citation profiles for its users automatically. There are three such cases, when a citation can be added to the profile:

Then if the citation profile is updated, an email message is sent to the user (the owner of the profile) to notify her about the changes.

Automatic Update Preferences screen

Some users most certainly won’t trust automatic changes or experience problems with it. Therefore, even though these features are enabled by default for all users, they can disable them. To disable these features, users use the Automatic Update Preferences screen. There is a similar screen in the research profile.

Difference between the first two cases of automatic citaiton addition is difficult to explain to users, and, therefore, it will be regulated by one user option.

The screen offers several yes/no questions-options.

AUTOMATIC CITATION PROFILE UPDATE

We can do automatic additions to your citation profile
in certain cases (explained below).  An email
notification will be sent to you every time we change
your profile automatically.  Here you may disable or
enable automatic changes to your profile.

When we know with high certainty that a new citation is
pointing to one of your works, should we automatically
add it to your identified citations? (You can always fix
it later, if a mistake happens.)

   (*) yes
   ( ) no

When one of your co-authors has identified a citation as
pointing to your co-authored document, should we add
it to your profile also?

   (*) yes
   ( ) no

The notification message will arrive from address:

    acis@somewhere.org

Please add this address to your spam filter's whitelist
to make sure it does not stop these messages.

A link to this screen is labeled “AUTO UPDATE” in the citation profile menu (right-hand top corner navigation menu).

Notification emails

Automatic citation profile update and research profile update are separate functions of ACIS. Notification messages generated by these functions are separate.

To avoid bombarding users with notification emails, we don’t do automatic updates for a single personal profile more often than once in two weeks.

An automatic citation profile update notification message may look like this:

From: acis@somewhere.org
To: A.U. Thor <author@email.edu>
Subject: [GAS] citations automatically added

Dear A.U. Thor,

This is an automatic message from Great Author
Service. You don't need to reply.

We ran a search for citations to your documents in our
service and found some, which we think point to these
your documents, see below.  We added these citations to
your profile, but if there's an error, you can fix it.

Document: 

    Imperfect Capital Markets Need Not be Inefficient
    paper by David Webb & David De Meza
    http://ideas.repec.org/p/fmg/fmgdps/dp0007.html

Citations:

  * in: Exact joint forecasts for vector aggressive 
    models by Lu-Sam Xiu
    http://ideas.repec.org/p/xiu/xiudps/2001.html

    cited as: Webb D., De Meza D. 1999. Imperfect 
    Capital Markets Need Not be Inefficient.  Financial 
    Markets Group, Discussion papers.  

    ...


To review and correct these items, log into your profile
at this address:
http://gas.somewhere.org/citations/doclist

If necessary, review and change your preferences with
regard to future automatic citation profile updates at:
http://gas.somewhere.org/citations/autoupdate

Your password in our service is:

    supersecret

---

The message was automatically generated by Great Author Service.

http://gas.somewhere.org/

Citations data input

Data comes from automatic reference and citation parsing systems, e.g. CitEc, CiteSeer.

For ACIS to process and use citation data, the data must be encoded in the Academic Metadata Format and the system must be configured appropriately.

Two kinds of citations are expected in the input data:

Identified are the citations, which have an autonomously identified citation targets. That is, the software on the citations data provider side has found that such a citation points to a particular document in the same set of documents and the identifier of that cited document is included with the citation itself.

Therefore, the unidentified citations are those, which point to an unknown document, which may be present in the document database of ACIS or may not be present.

Identified citations will be encoded in this way:

<text ref="text">
   <references>
     <acis:referencestring> text of citation </acis:referencestring>
     <text ref="text2"/>
   </references>
</text>

where acis is a prefix for the namespace “http://acis.openlib.org/” and text2 is the identifier of the document that the citation points to. The citation string comes in the text content of the foreign-namespace element referencestring.

While it is technically valid AMF to have several text nouns as children of the references verb, ACIS will only accept the first one. Thus

<text ref="text">
   <references>
     <acis:referencestring> text of citation 1 </acis:referencestring>
     <text ref="text1"/>
     <acis:referencestring> text of citation 2</acis:referencestring>
     <text ref="text2"/>
   </references>
</text>

will ignore citation 2. To get both citation 1 and citation 2, you have to say

<text ref="text">
   <references>
     <acis:referencestring> text of citation 1 </acis:referencestring>
     <text ref="text1"/>
   </references>
   <references>
     <acis:referencestring> text of citation 2</acis:referencestring>
     <text ref="text2"/>
   </references>
</text>

Unidentified citations are written as:

<text ref="text">
   <reference>
     <literal> unidentified citation </literal>
   </reference>
</text>

If data comes in such a form:

<text ref="text">
   <references>
     <text ref="text2"/>
   </references>
</text>

then nothing is being done with it.

Citations data output

We will output citations data as part of the personal profile AMF datafiles. The citations data will be added to the document data, i.e. the research profile. Personal AMF data always includes a person noun. Here is an incomplete example showing how would identified citations added to the documents data:

<person id="donald_duck">
  <name>Donald Duck</name>
  <isauthorof>
    <text ref="text0">
       <isreferencedby>
          <text ref='text4'/>
       </isreferencedby>
    </text>
  </isauthorof>
  <isauthorof>
     <text ref="text20">
       <isreferencedby>
          <text ref='text5'/>
       </isreferencedby>
       <isreferencedby>
          <text ref='text6'/>
       </isreferencedby>
     </text>
  </isauthorof>
</person>

The identified citations would not be displayed on the personal profile pages, generated for each profile in ACIS.

Administration and configuration

Input data

Citation data can be “fed” to an ACIS service only in the form of a collection of metadata files.

That means you should configure a new metadata collection via the metadata-collections parameter. To make ACIS aware, that it should expect citation data in it, set the collection’s type (metadata-X-type parameter) to “CitationsAMF”.

It may look like this:

metadata-collections='RePEc CitEc'
metadata-RePEc-home=/data/RePEc/amf
metadata-RePEc-type=AMF
metadata-CitEc-home=/data/CitEc/amf
metadata-CitEc-type=CitationsAMF

The data files must have .amf.xml name extension to be processed by ACIS.

Citations screens parameters

The citation profile is enabled and configured via the the main.conf configuration file. (Just as every other part of the ACIS web interface.)

See Citations screens parameters in another document.

Technical implementation details / developer’s guide

Processing the citations data

The data processing flow. RePEc::Index and RePEc::Index::Collections::CitationsAMF to ACIS::Citations::Input via the collection’s proc parameter.

RePEc::Index::Collections::CitationsAMF uses AMF::Parser to parse the files. Citations originating from a document will comprise one record.

ACIS::Citations::Input saves the data into the citations table and removes disappeared citations.

Citations table

The fields:

FULLTEXT INDEX (nstring), INDEX (trgdocid)

Citation string normalization:

Cutting out the editors part

The original citation string may contain editor names, which we don’t want, because they will cause the citation to be found mistakenly for authors.

The regular expression:

s/\b((I|i)n\W.+?\Wed\..*)$//

should do the trick.

Search for citations and ACPU

Via APU. ACIS::APU -> ACIS::Citations::AutoUpdate -> ACIS::Citations::Search

Automatic citation profile update

Former ARPU has been extended to cover citations and they now jointly known as APU: automatic profile update.

APU has a number of configuration parameters:

minimum-apu-period-days minimum APU period (default: 3 weeks or 21 day(?)).

echo-apu-mails, boolean — whether to send a copy of the APU mails to the admin or not.

disable-citation-mails, boolean — whether to send citation mails at all.

ACPU itself includes 3 things:

  1. profile maintenance (clean-up)
  2. search

Every time ACPU is done, a timestamp is added to the sysprof table for that record (‘last-auto-citations-time’).

When APU is about to run, it chooses those records, which have their ACPU or ARPU timestamps oldest, just as it is now done for ARPU only.

Search workflow

For each research profile item of the user, we do something like:

select * from citations where trgdocid = 'repec:fdd:fodooo:555';

and it finds us citations which where pre-identified as pointing to that document. Pre-identified means identified by the citations data provider.

For each name variation of the user we do:

select * from citations where MATCH (nstring) AGAINST
('JOHN MALKOVICH' IN BOOLEAN MODE)

(this is a simplified variation of the actual query.)

Then we exclude (from what’s found) the already known: identified and refused citations. Then we exclude citations already suggested.

Then those that are left we compare to the RP documents and look at what came out of it. If anything interesting came out — we add those. But we do not add a citation if it has already been suggested in the past, i.e. if it is old. Ok, let’s look closer at it.

THE WORKFLOW

Citation to document similarity

Similarity assessment function interface

The citation-document similarity function (see citation-document-similarity-func) will be called by ACIS like this:

$sim = func( CIT, DOC );

where CIT is a reference to a citation record, DOC is a reference to a document record and $sim is a similarity value returned by the function.

CIT

is a reference to a hash:

{ ostring => CITSTR, 
  nstring => NORCITSTR }

Here CITSTR is the original citation string, and NORCITSTR is the normalized citation string.

DOC

is a reference to a hash:

{ title   => TITLE,
  authors => [ AUTHOR1NAME, AUTHOR2NAME, ... ],
  type    => TYPE,
  location => LOCATION
}

Here TITLE is the work’s title (a string), AUTHOR1NAME, AUTHOR2NAME and so on are the names of the authors (strings), and TYPE is one of “article”, “paper”, “book”, “software”, “chapter”, “text” (when no particular type is known).

LOCATION is a string built by joining the following items: the series or journal name, the paper’s number in the series (if it is present), the issue/volume/pages. For AMF we can take all the adjectives of the serial adjective container.

The default citation-document similarity assessment algorithm

For a given document and a given citation:

  1. Look at the author names first. Split the names of the authors of the document into a list of words, strip all non-letter non-whitespace characters and then strip single whitespace-separated letters from it.

    We then search for approximate matches for each of these words in the normalized citation string. (String::Approx) If there is a match for any of the author-name-words, it is an “author pass”. Otherwise, the match is aborted with zero similarity measure returned.

    foreach ( @names ) {
      if ( amatch( $_, $citation ) ) {
        $pass = 1;
        last;
      }
    }

    If the document is an author pass, it will be ranked according to the string similarity of the title only, see next point.

  2. Compare titles. Take the normalized citation string and take the normalized title of the research item. Find where in the citation string the first word of the title is present.

    Take as many characters of the citation string as there are in the title and compare them to the work’s title (with String::Similarity).

    (Alternatively, find where in the citation string the last word of the title is present and take a substring from start to end. Compare in the same way.)

    (Alternatively, do a String::Approx amatch on the citation string, with the document’s title used as the pattern. — No, this will result in just yes/no value, but we need a numeric measure.)

Get the comparison result as a number between 0 and 1.

Citation suggestions

Citation to document similarity table: cit_doc_similarity

Table cit_doc_similarity fields:

PRIMARY KEY (cnid,dsid)

other citation suggestions table: cit_sug

Table cit_sug fields:

PRIMARY KEY (cnid,dsid,reason)

Old suggestions table: cit_old_sug

Table cit_old_sug fields:

PRIMARY KEY (psid,dsid,cnid)

Similarity matrix

Data structure

matrix:

“dxx44” — research profile item’s sid

$matrix -> {new} {dxx44}[0] — the first of the new citations suggested for research profile item dxx44

$matrix -> {new} {dxx44}[1]->{similar} — similarity of the second citation’s to dxx44 document

$matrix -> {citations} {‘dcen4-n6Ci5dmzFdBAtQx4K+mEjg’} {dxx44} — array of where this citation is suggested.

$matrix -> {citations} {‘dcen4-n6Ci5dmzFdBAtQx4K+mEjg’} {dxx44}[0][0] — ‘new’ or ‘old’

$matrix -> {citations} {‘dcen4-n6Ci5dmzFdBAtQx4K+mEjg’} {dxx44}[0][1] — reference to the particular suggestion hash

Similarity values in the matrix and in the cit_doc_similarity table are all in the percent scale. They are produced by multiplying the similarity values (calculated by the cit-to-document comparison function) by 100.

main matrix operations

Userdata structure

The citations branch of the userdata structure:

Each citation item in the userdata is a hash with these items:

Storing and passing citations around

Citations will be stored as:

  1. citation table records

  2. in-memory suggestions

  3. userdata items (see above)

Basically, the citations will go from 1 to 2 and from 2 to 3. At each step, some hash keys may be added to a citation or may be removed.

But sometimes a citation is removed from userdata and then it may be re-considered as a suggestion (3->2).

Modules

ACIS::Citations

ACIS::Citations::Utils

and other useful stuff

ACIS::Citations::Input

use ACIS::Citations::Utils;

ACIS::Citations::Suggestions

ACIS::Citations::SimMatrix

use ACIS::Citations::Suggestions;

exported:

object methods:

internal:

ACIS::Citations::Search

use ACIS::Citations::Utils; use ACIS::Citations::Suggestions;

exportable:

ACIS::Web::Citations::Profile

use ACIS::Citations::Input; use ACIS::Web::SysProfile;

ACIS::Web::Citations

ACIS::Citations::AutoUpdate

Other notes

SysProf table profile parameters

Consistency maintentance

Every once in a while we check that every citation that is mentioned in the personal profile is still present in the citations table, and clear it, if it isn’t there.

We will remember the date of last cleaning run and at each user login (real or in APU) we will see if the profile needs cleaning already. (last-cit-prof-check-time in sysprof).

In a similar way, we ensure that every document for which we have the identified citations is still present in the database.

Database maintenance

It would be highly desirable to backup the citations, citations_deleted and other related tables regularly. It is important for consistency of the data and stability of the user profiles.