Citations in ACIS =================

document source in Markdown format

## Introduction ### Why? What for? Authors have a strong desire to find out * how many times they have been cited, * which papers are citing their work, * which of their own papers have made the most impact as measured by citations. We want to give authors such data. And that is our reason no. 1. This is not only to caress their vanity, but also to support further research dialogue, especially when this data is published in an accessible and reusable way. Automated citation parsing and indexing systems exist (CiteSeer, also see [Open Citation](http://opcit.eprints.org/)) and are able to provide such data, when applied to an appropriate collection. But automated citation processing is technically a very complicated task and mistakes are not rare. But we believe that the authors themselves can verify and filter automatically produced citation data. Thus, with the authors' help, we will get improved citation data. Now that is our reason no. 2. By publishing the resulting data in a simple, documented form, an ACIS service will support creation of additional services based on that data. Therefore, we follow a two-level approach. First level: autonomous citation indexing. Citations get parsed, extracted and indexed with an automated reference and citation parsing system. That happens outside ACIS, by systems like CiteSeer. Second level happens in ACIS: human users review, validate and filter the data. ### Workflow overview Here is what happens and what we do with citations. A citation indexing service processes the same set of data on which an ACIS service is running. It parses the documents' full texts and finds citations and tries to identify, which citations point to which documents. It then encodes its resuls, i.e. the citations data, in the specified input format. The data is then supplied to the ACIS service. (How the data is transfered is out of the scope of our concern.) Service administrator will have to get things configured for this. Then ACIS starts offering citations to its users and they create their own citations profiles. This means they identify citations as pointing to their own particular documents. They reject some citations which are not relevant to them. At this stage ACIS software tries hard to be helpful and make the process easy. As a result of this interaction, ACIS collects improved citations data, and includes it into the personal data that ACIS exports. Third-party projects then may use it for building additional user services based on the data. In this document we first describe the user interface, then the data formats, then the administrator's configuration parameters. Technical implementation details are probably interesting to only a few; therefore, they come last.

Citations profile: user interface

Citations profile is a part of the ACIS web interface, where the users deal with citations. The users' are given power to: - review citations that have been identified as referring to their works; - certify that particular citations point to their particular documents, with ability to review their choices and correct errors in their earlier-made choices; - refuse errorneously-suggested citations and review previously refused citations to undo errornously-refused items. ACIS will search for citations in the database (among the data provided on input), attempt to match the citations to the documents of the user's research profile, and offer them to the user for review and identification. *Design requirements.* The screens of the citations profile should look and work similar to those of the research profile. Consistency has to be maintained.

The most interesting document

Before we offer any citations to the user for identification, we will assess all the citations found for her. By matching (comparing) the citations to the descriptions of the documents of her research profile, we will find: 1. which document is the most *interesting* in a sense that it has most potential new citations pointing to it (weighted by their respective similarity measure); 2. which citations more closely resemble which document's description. This will give us the order, in which to present the documents and citations to the user: we will always show the most interesting document first (see auto suggestions) and the most relevant citations at the top. Depending on the similarity level, some of the citations will be pre-selected for identification by default. ### Per-document screens: identified and potential tabs Main work on citations happens on per-document screens, which means the user operates on citations for one particular item of her research profile at a time. The title and other details of the document are shown in large type at the top of the screen. Then there are two tabs on the page, labeled *identified* and *potential*. In the *identified* tab, the user sees the list of all the citations that she has identified to the document. She can remove the citations that have been mistakenly identified. Such citations become avalailable for all documents of her (to which they match). They are then displayed as new citations, even in the potential tab for the current document.

identified citations screen snapshot

In the potential tab we offer citations for user to identify. We only offer citations that we found similar to the description of the document. Two kinds of citations may be offered: "new" and "old". Old citations are the ones that we have already suggested to the user for this particular document in the past. (Thus a citation can only be old for a certain document.) A citation becomes old only after we suggested it to a user and the user submitted the form, but did not select this citation for inclusion. If she didn't submit the form, which included this citation as new, we will still consider it new.

potential citations screen snapshot

We present the potential citations as shown in the prototype page linked above. The old citations will not be shown by default, only if the user requests it. Both old and new citations are ordered by decreasing similarity of the citation to the document. Some of the new citations, i.e. for which the similarity is higher than a certain threshold, will have the checkbox checked by default. The threshold level is configureable by system administrator. The "Not my work" button removes a citation from the pool of potential citations for all the documents, and puts it into the refused citations list. If the user presses the SUBMIT FORM button, all checked citations will be identified as pointing to that document. They will no longer be suggested for identification to other documents. When there is at least one other document with new citations for it, we show a checkbox near the submit button, labeled "Show next document with new citations". This checkbox will be checked by default if the current page offers any new citations. If the checkbox is checked when the form is submitted, we process the user's choices and then redirect her to the next most interesting document. If we have no potential citations for the document in question, we will show a message saying: We have no citations to suggest for this document at this time.

"Auto suggestions" screen

When the system has some new unidentified citations for a user, and the user enters citations profile, we take the user to the AUTO SUGGESTIONS screen. On it we always show the user the *potential* tab for a document with new citations. The document we show there is always the most interesting document. When user deals with these citations, another document would (very likely) become the most interesting. If she doesn't uncheck the "move on" checkbox ("Show next document with new citations"), she would be shown the next most interesting document immediately after dealing with the current one. If the user follows this simple procedure, she will be taken through all the new potential citations for her documents, going through the most cited documents first. The screen is always accessible via "auto suggestions" link-item in the right-hand top corner menu (under "Citations:"). So if user deviates from the procedure, it would be easy for her to continue. When displaying this screen, we highlight this link as a link to the current screen (and it won't be a link). For navigation between the documents the users have the "Next" and "Prev" links to the right and left of the document title (respectively). The "Next" link leads to the second most interesting document (if there is such a document). **What if there are no new citations?** If there are no documents for which we have any new citations, we show a brief screen with this message: We are sorry to tell you that we have no potential citations data for the documents in your research profile. If the user has no documents in the research profile, we show this message instead: We are sorry to tell you that we have no potential citations data for the documents in your research profile, because your research profile is empty. ### Document list screen A table of all documents (research profile items) is shown on the "Document list" screen. It also provides numbers of citations identified to each document and number of new potential citations for each. Each line in the table provides both a link to the identified tab for that document and a link to the potential tab for it. The table may be sorted by the number of citations identified, or by the number of new potential citations for it. Each document in the table will be presented by a title and a link to its further details. The screen will also display the total number of identified citations. ### Refused citations screen On the Refused Citations screen users will deal with citations, for which user has pressed the "Not my work" button in the past. Those are the citations to works that the user has told us she is not an author of. She can remove citations, which then may be suggested to the same or other documents, depending on similarity found. ### Citations profile overview screen The main page of the citations profile gives a brief introduction and overview of the profile, with links to appropriate screens. This is similar to research profile overview screen. It offers links to the document list screen, to the auto suggestions screen (i.e. the most interesting document, if there is such a document at the time), and a link to the refused citations screen. If there are new potential citations for some documents of a person, on the first enterance of the person into his citations profile, we redirect him to the auto suggestions page. But we do this only once per user session to avoid confusion.

Automatic citation profile update

Similar to the automatic research profile update, ACIS may sometimes update citations profiles for its users automatically. There are three such cases, when a citation can be added to the profile: * when a citation has been identified by the autonomous citation indexing service as pointing to a document of a user; * when we find that a new citation has a very high similarity to one of the documents of a user (as measured by the citation-to-document similarity assessment algorithm); * when two or more users have collectively authored a document, and that document is included into their research profiles, and one of the users has identified a citation as pointing to that document. Then if the citations profile is updated, an email message is sent to the user (the owner of the profile) to notify her about the changes. ### Automatic Update Preferences screen Some users most certainly won't trust automatic changes or experience problems with it. Therefore, even though these features are enabled by default for all users, they can disable them. To disable these features, users use the Automatic Update Preferences screen. There is a similar screen in the research profile. Difference between the first two cases of automatic citaiton addition is difficult to explain to users, and, therefore, it will be regulated by one user option. The screen offers several yes/no questions-options. AUTOMATIC CITATIONS PROFILE UPDATE We can do automatic additions to your citation profile in certain cases (explained below). An email notification will be sent to you every time we change your profile automatically. Here you may disable or enable automatic changes to your profile. When we know with high certainty that a new citation is pointing to one of your works, should we automatically add it to your identified citations? (You can always fix it later, if a mistake happens.) (*) yes ( ) no When one of your co-authors has identified a citation as pointing to your co-authored document, should we add it to your profile also? (*) yes ( ) no The notification message will arrive from address: acis@somewhere.org Please add this address to your spam filter's whitelist to make sure it does not stop these messages. A link to this screen is labeled "AUTO UPDATE" in the citations profile menu (right-hand top corner navigation menu). ### Notification emails Automatic citations profile update and research profile update are separate functions of ACIS. Notification messages generated by these functions are separate. To avoid bombarding users with notification emails, we don't do automatic updates for a single personal profile more often than once in two weeks. An automatic citation profile update notification message may look like this: From: acis@somewhere.org To: A.U. Thor Subject: [GAS] citations automatically added Dear A.U. Thor, This is an automatic message from Great Author Service. You don't need to reply. We ran a search for citations to your documents in our service and found some, which we think point to these your documents, see below. We added these citations to your profile, but if there's an error, you can fix it. Document: Imperfect Capital Markets Need Not be Inefficient paper by David Webb & David De Meza http://ideas.repec.org/p/fmg/fmgdps/dp0007.html Citations: * in: Exact joint forecasts for vector aggressive models by Lu-Sam Xiu http://ideas.repec.org/p/xiu/xiudps/2001.html cited as: Webb D., De Meza D. 1999. Imperfect Capital Markets Need Not be Inefficient. Financial Markets Group, Discussion papers. ... To review and correct these items, log into your profile at this address: http://gas.somewhere.org/citations/doclist If necessary, review and change your preferences with regard to future automatic citations profile updates at: http://gas.somewhere.org/citations/autoupdate Your password in our service is: supersecret --- The message was automatically generated by Great Author Service. http://gas.somewhere.org/

Citations data input

Data comes from automatic reference and citation parsing systems, e.g. CitEc, CiteSeer. For ACIS to process and use citation data, the data must be encoded in the Academic Metadata Format and the system must be configured appropriately. Two kinds of citations are expected in the input data: - identified; - unidentified. *Identified* are the citations, which have an autonomously identified citation targets. That is, the software on the citations data provider side has found that such a citation points to a particular document in the same set of documents and the identifier of that cited document is included with the citation itself. Therefore, the *unidentified* citations are those, which point to an unknown document, which may be present in the document database of ACIS or may not be present. Identified citations will be encoded in this way: text of citation where acis is a prefix for the namespace "http://acis.openlib.org/" and text2 is the identifier of the document that the citation points to. The citation string comes in the text content of the foreign-namespace element referencestring. While it is technically valid AMF to have several text nouns as children of the references verb, ACIS will only accept the first one. Thus text of citation 1 text of citation 2 will ignore citation 2. To get both citation 1 and citation 2, you have to say text of citation 1 text of citation 2 Unidentified citations are written as: unidentified citation If data comes in such a form: then nothing is being done with it.

Citations data output

We will output citations data as part of the personal profile AMF datafiles. The citations data will be added to the document data, i.e. the research profile. Personal AMF data always includes a person noun. Here is an incomplete example showing how would identified citations added to the documents data: Donald Duck The identified citations would not be displayed on the personal profile pages, generated for each profile in ACIS. ## Administration and configuration

Input data

Citation data can be "fed" to an ACIS service only in the form of a collection of metadata files. That means you should configure a new metadata collection via the metadata-collections parameter. To make ACIS aware, that it should expect **citation** data in it, set the collection's type (metadata-X-type parameter) to "CitationsAMF". It may look like this: metadata-collections='RePEc CitEc' metadata-RePEc-home=/data/RePEc/amf metadata-RePEc-type=AMF metadata-CitEc-home=/data/CitEc/amf metadata-CitEc-type=CitationsAMF The data files **must** have **.amf.xml** name extension to be processed by ACIS.

Citations screens parameters

The citations profile is enabled and configured via the the main.conf configuration file. (Just as every other part of the ACIS web interface.) See Citations screens parameters in another document. ## Technical implementation details / developer's guide ### Processing the citations data *The data processing flow.* RePEc::Index and RePEc::Index::Collections::CitationsAMF to ACIS::Citations::Input via the collection's proc parameter. RePEc::Index::Collections::CitationsAMF uses AMF::Parser to parse the files. Citations originating from a document will comprise one record. ACIS::Citations::Input saves the data into the citations table and removes disappeared citations. - find the document's sid and details URL - load from the citations table all the known citations originating from the document - find new / disappeared - normalize new - remove editor names from new - save new - remove disappeared #### Citations table The fields: - unique numeric citation id: cnid SERIAL - long citation id: clid CHAR(38) NOT NULL PRIMARY KEY: srcdocsid + '-' + checksum (MD5 checksum of the original citation string in base64 encoding - 22 chars) - full original citation string: ostring TEXT NOT NULL - normalized searchable citation string: nstring VARCHAR(255) NOT NULL - refers to document, id: trgdocid VARCHAR(255)

FULLTEXT INDEX (nstring), INDEX (trgdocid)

#### Citation string normalization: - Unicode decomposition, - convert letters to upper case, - remove all non-letter, non-digit and non-whitespace characters, - compact all whitespace, - remove leading and trailing whitespace. #### Cutting out the editors part The original citation string may contain editor names, which we don't want, because they will cause the citation to be found mistakenly for authors. The regular expression: s/\b((I|i)n\W.+?\Wed\..*)$// should do the trick. ### Search for citations and ACPU Via APU. ACIS::APU -> ACIS::Citations::AutoUpdate -> ACIS::Citations::Search #### Automatic citations profile update Former ARPU has been extended to cover citations and they now jointly known as APU: automatic profile update. APU has a number of configuration parameters: minimum-apu-period-days minimum APU period (default: 3 weeks or 21 day(?)). echo-apu-mails, boolean -- whether to send a copy of the APU mails to the admin or not. disable-citation-mails, boolean -- whether to send citation mails at all. ACPU itself includes 3 things: 1. profile maintenance (clean-up) 2. search Every time ACPU is done, a timestamp is added to the sysprof table for that record ('last-auto-citations-time'). When APU is about to run, it chooses those records, which have their ACPU or ARPU timestamps oldest, just as it is now done for ARPU only. #### Search workflow For each research profile item of the user, we do something like: select * from citations where trgdocid = 'repec:fdd:fodooo:555'; and it finds us citations which where pre-identified as pointing to that document. Pre-identified means identified by the citations data provider. For each name variation of the user we do: select * from citations where MATCH (nstring) AGAINST ('JOHN MALKOVICH' IN BOOLEAN MODE) (this is a simplified variation of the actual query.) Then we exclude (from what's found) the already known: identified and refused citations. Then we exclude citations already suggested. Then those that are left we compare to the RP documents and look at what came out of it. If anything interesting came out -- we add those. But we do not add a citation if it has already been suggested in the past, i.e. if it is old. Ok, let's look closer at it. THE WORKFLOW - by the person's research profile documents: - find citations, that were pre-identified for a document - filter out identified citations - filter out refused citations - filter out old, suggested in the past items (cit\_old_sug table) - then, for the rest of citations: - auto add them, if preferences allow - the person's name: - find citations, that match the person's name variations - filter out identified citations - filter out refused citations - filter out old, suggested in the past items (cit\_old_sug table) - then, for the rest of citations: - compare them to the person's documents - comparison results would be written to the cit\_doc_similarity table - then, if auto-add is enabled for this profile - find which of the new citations have high similarity measure to a document (and which have the highest, if there's several of them) - auto add them, if preferences allow ### Citation to document similarity

Similarity assessment function interface

The citation-document similarity function (see citation-document-similarity-func) will be called by ACIS like this: $sim = func( CIT, DOC ); where CIT is a reference to a citation record, DOC is a reference to a document record and $sim is a similarity value returned by the function.

CIT

is a reference to a hash:

{ ostring => CITSTR, 
  nstring => NORCITSTR }

Here CITSTR is the original citation string, and NORCITSTR is the normalized citation string.

DOC

is a reference to a hash:

{ title   => TITLE,
  authors => [ AUTHOR1NAME, AUTHOR2NAME, ... ],
  type    => TYPE,
  location => LOCATION
}

Here TITLE is the work's title (a string), AUTHOR1NAME, AUTHOR2NAME and so on are the names of the authors (strings), and TYPE is one of "article", "paper", "book", "software", "chapter", "text" (when no particular type is known).

LOCATION is a string built by joining the following items: the series or journal name, the paper's number in the series (if it is present), the issue/volume/pages. For AMF we can take all the adjectives of the serial adjective container.

The default citation-document similarity assessment algorithm

For a given document and a given citation: 1. Look at the author names first. Split the names of the authors of the document into a list of words, strip all non-letter non-whitespace characters and then strip single whitespace-separated letters from it. We then search for approximate matches for each of these words in the normalized citation string. (String::Approx) If there is a match for any of the author-name-words, it is an "author pass". Otherwise, the match is aborted with zero similarity measure returned.

foreach ( @names ) {
      if ( amatch( $_, $citation ) ) {
        $pass = 1;
        last;
      }
   }

If the document is an author pass, it will be ranked according to the string similarity of the title only, see next point. 2. Compare titles. Take the normalized citation string and take the normalized title of the research item. Find where in the citation string the first word of the title is present. Take as many characters of the citation string as there are in the title and compare them to the work's title (with [String::Similarity](http://search.cpan.org/perdoc?String::Similarity)). (Alternatively, find where in the citation string the last word of the title is present and take a substring from start to end. Compare in the same way.) (Alternatively, do a String::Approx amatch on the citation string, with the document's title used as the pattern. -- No, this will result in just yes/no value, but we need a numeric measure.) Get the comparison result as a number between 0 and 1. - Also see [Digital Libraries and Autonomous Citation Indexing](http://citeseer.ist.psu.edu/aci-computer/aci-computer99.html) paper and its section "Methods" for a brief discussion of what CiteSeer team tried. ### Citation suggestions #### Citation to document similarity table: cit\_doc_similarity Table cit\_doc_similarity fields: - citation id: cnid BIGINT UNSIGNED NOT NULL - document short id: dsid CHAR(15) NOT NULL - similarity measure 0...100: similar TINYINT NOT NULL - when compared: time DATETIME NOT NULL

PRIMARY KEY (cnid,dsid)

#### other citation suggestions table: cit_sug Table cit_sug fields: - citation id: cnid BIGINT UNSIGNED NOT NULL - document short id: dsid CHAR(15) NOT NULL - reason: reason CHAR(20) NOT NULL - when suggested: time DATETIME NOT NULL

PRIMARY KEY (cnid,dsid,reason)

#### Old suggestions table: cit\_old_sug Table cit\_old_sug fields: - short id of the person to whom we suggested it: psid CHAR(15) NOT NULL - document short id: dsid CHAR(15) NOT NULL - citation id: cnid BIGINT UNSIGNED NOT NULL

PRIMARY KEY (psid,dsid,cnid)

### Similarity matrix #### Data structure matrix: - new (hash): - document sid1 | citations list (each with reason & similarity) - old (hash): - document sid1 | citations list (each with reason & similarity) - totals_new (hash): - document sid1 | total similarity of the new citations, a real positive value - doclist (list): list of document sids in the order of decreasing similarity (based on totals_new) - citations (hash): - citation cnid | hash: - docsid | list of [ 'new', $suggestion ] or ['old', $suggestion ] - psid: personal record short id "dxx44" -- research profile item's sid $matrix -> {new} {dxx44}[0] -- the first of the new citations suggested for research profile item dxx44 $matrix -> {new} {dxx44}[1]->{similar} -- similarity of the second citation's to dxx44 document $matrix -> {citations} {'dcen4-n6Ci5dmzFdBAtQx4K+mEjg'} {dxx44} -- array of where this citation is suggested. $matrix -> {citations} {'dcen4-n6Ci5dmzFdBAtQx4K+mEjg'} {dxx44}[0][0] -- 'new' or 'old' $matrix -> {citations} {'dcen4-n6Ci5dmzFdBAtQx4K+mEjg'} {dxx44}[0][1] -- reference to the particular suggestion hash Similarity values in the matrix and in the cit\_doc_similarity table are all in the percent scale. They are produced by multiplying the similarity values (calculated by the cit-to-document comparison function) by 100. #### main matrix operations - load the matrix - calculate totals - get most interesting document (dsid) - consider new citations - remove citation - citation status change new to old ### Userdata structure The **citations** branch of the userdata structure: - identified: *hash* - document short id: a list of citations - next doc short id: another list of citations - ... - refused: a list of citations - meta: hash - co-auth-auto-add: auto co-authorship update option - auto-identified-auto-add: auto pre-identified update option - auto-add-similarity-threshold: a number - last-searched-nameset-date: date of the last searched name variations set - potential new & old citations we will store in the cit_doc_similarity, cit_sug and cit_old_sug tables

Each citation item in the userdata is a hash with these items:

- cnid - ostring - srcdocid - srcdoctitle - srcdocauthors - srcdocurlabout - reason (if manually added) - autoaddreason (if automatically added) - autoadded (if automatically added) ### Storing and passing citations around Citations will be stored as: 1. citation table records 2. in-memory suggestions 3. userdata items (see above) Basically, the citations will go from 1 to 2 and from 2 to 3. At each step, some hash keys may be added to a citation or may be removed. But sometimes a citation is removed from userdata and then it may be re-considered as a suggestion (3->2). ### Modules #### ACIS::Citations #### ACIS::Citations::Utils - normalize_string( string ) - build_citations\_index( citlist, [index] ); - build a hash of [ cnid: citation ] pairs ??? - get_document\_authors(); - for co-authors' claims - cit_document\_similarity( cit, doc ) -- default cit-doc similarity func and other useful stuff #### ACIS::Citations::Input use ACIS::Citations::Utils; - process_citation( cit ) - normalize, cut the editors, calculate the checksum - save_citation( cit ) - check_citation( cit ) - reload citation from the citations table or return undef otherwise #### ACIS::Citations::Suggestions #### ACIS::Citations::SimMatrix use ACIS::Citations::Suggestions; exported: - load_similarity\_matrix( record ); - returns matrix structure as specified above object methods: - most\_interesting\_doc( ); - remove_citation( $matrix, $cit ); ... internal: - _calculate_totals( $matrix ); #### ACIS::Citations::Search use ACIS::Citations::Utils; use ACIS::Citations::Suggestions; exportable: - search\_for\_document( id ) - find pre-identified citations - search\_for\_personal\_names( names ) ... #### ACIS::Web::Citations::Profile use ACIS::Citations::Input; use ACIS::Web::SysProfile; #### ACIS::Web::Citations - acis\_citations\_enabled() - check that citations-profile parameter has a true value in main.conf; returns true or false [15 min] - prepare() [30 min] - prepare_identified() [30 min] - process_identified() [60 min] - prepare_potential() [30 min] - process_potential() [60 min] - process_not\_my\_work() [30 min] - prepare_refused() [30 min] - process_refused() [60 min] - auto_suggestions() [30 min] - prepare_doclist() [30 min] #### ACIS::Citations::AutoUpdate ### Other notes #### SysProf table profile parameters - last profile check time (seconds since the epoch): last-cit-prof-check-time - last citation search date/time (seconds): last-auto-citations-time #### Consistency maintentance Every once in a while we check that every citation that is mentioned in the personal profile is still present in the citations table, and clear it, if it isn't there. We will remember the date of last cleaning run and at each user login (real or in APU) we will see if the profile needs cleaning already. (last-cit-prof-check-time in sysprof). In a similar way, we ensure that every document for which we have the identified citations is still present in the database. #### Database maintenance It would be highly desirable to backup the citations, citations_deleted and other related tables regularly. It is important for consistency of the data and stability of the user profiles. $Id: citations.text,v 2.57 2007/04/24 15:37:35 ivan Exp $