Stage D: full-text links, document-to-document links and fuzzy name searches

Table of contents

   Fuzzy name searching
      implementation components
   Full-text file recognition
      user choices
      data storage
      data tables
      implementation components
   Document to document links
      screen design

Fuzzy name searching



distance level = Levenshtein distance / string length of the name variation

The default critical distance level is 1/m, i.e. 1/7. For example, if the Levenshtein distance is 2 edits and the name variation’s length is 20, the distance level is 2/20 = 0.1; 0.1 is less than 1/7, so this would be a positive match.
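The distance-level rule can be sketched in a few lines. This is only an illustrative Python sketch, not the ACIS code: the function names are hypothetical, and it assumes that a match requires a level strictly below the critical level and that the divisor is the character length of the name variation.

```python
CRITICAL_LEVEL = 1 / 7  # default critical distance level quoted in the text


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_level(name_expression: str, name_variation: str) -> float:
    """Levenshtein distance divided by the length of the name variation."""
    if not name_variation:
        raise ValueError("name variation must not be empty")
    return levenshtein(name_expression, name_variation) / len(name_variation)


def is_fuzzy_match(name_expression: str, name_variation: str,
                   critical: float = CRITICAL_LEVEL) -> bool:
    # Assumption: "less than" the critical level means strictly less.
    return distance_level(name_expression, name_variation) < critical
```

With these definitions, `is_fuzzy_match("Jon Smith", "John Smith")` is true (1 edit over 10 characters), while `is_fuzzy_match("Jon Smyth", "John Smith")` is false (2 edits over 10 characters).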



Run the bin/fuzzy_search_table script regularly to update the rare table.

implementation components

  1. Given a set of name expressions (i.e. document author names), compare them to a set of name variations and find close matches.

  2. Given a name variation and an ACIS configuration, find all relevant name expressions that should be fuzzy-compared to that variation.

mysql> select name,count(name) from testrdb.res_creators_separate group by name;

mysql> select name,count(name) from names group by name;



Full-text file recognition


  1. input data — support LoTEc data format

  2. improve / fix the screen



<text id=".."> 

automatically found:

<text id=".."> 

user choices

In the recognition menu, there are four mutually exclusive options.

  1. This is not a file related to this paper.
  2. This is a web page describing the paper.
  3. This is a full-text file for another version of the paper.
  4. This is a full-text file for this version of the paper.

The default is option 4.

The options will be expressed in shorter words, but there will be a help button.

The permission menu is only shown if one of the last two options has been chosen. In the permission menu, there are three options:

  1. This full-text file may be archived as is.
  2. This full-text file may be archived, but check for updates.
  3. This full-text file may not be archived.

The default is option 1.

Co-authors may make contradictory choices. An ACIS installation records such choices, but does not resolve them.

Short versions of the menu items, first menu (“recognition”):

  1. wrong (n)
  2. abstract page (d)
  3. full-text file of another version (r)
  4. correct full-text file (y)

second menu (“permission”):

  1. may archive (y)
  2. archive, but check for updates (c)
  3. do not archive (n)
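The two menus and their defaults can be modelled as small lookup tables. The following Python sketch is illustrative only: the function name `normalize_choice` is hypothetical, and it assumes that a permission value is simply discarded when the permission menu would not have been shown.

```python
# Short codes and labels taken from the two menus above.
RECOGNITION = {
    "n": "wrong",
    "d": "abstract page",
    "r": "full-text file of another version",
    "y": "correct full-text file",
}
PERMISSION = {
    "y": "may archive",
    "c": "archive, but check for updates",
    "n": "do not archive",
}

DEFAULT_RECOGNITION = "y"    # option 4 of the recognition menu
DEFAULT_PERMISSION = "y"     # option 1 of the permission menu
FULL_TEXT_CODES = {"r", "y"}  # the last two recognition options


def normalize_choice(recognition=None, permission=None):
    """Return a (recognition, permission) pair with defaults applied."""
    recognition = recognition or DEFAULT_RECOGNITION
    if recognition not in RECOGNITION:
        raise ValueError(f"unknown recognition code: {recognition!r}")
    if recognition not in FULL_TEXT_CODES:
        # Assumption: no permission is recorded when the menu is not shown.
        return (recognition, None)
    permission = permission or DEFAULT_PERMISSION
    if permission not in PERMISSION:
        raise ValueError(f"unknown permission code: {permission!r}")
    return (recognition, permission)
```

For example, `normalize_choice()` yields `("y", "y")`, and `normalize_choice("d", "c")` yields `("d", None)` because an abstract page has no archiving permission.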


conf parameter:

data storage

The URLs are stored in the ft_urls table. The user decisions (choices) are stored in the ft_urls_choices table (see below). They are not stored in userdata, unlike every other piece of data given by users.

data tables

table ft_urls:

PRIMARY KEY( dsid, checksum ), index url_i(url(30)), index source_i(source(50))

table ft_urls_choices:

primary key prim(dsid, checksum, psid), index t_i(time), index psid_i(psid)

implementation components

  1. input processing, ft_urls table making

  2. recording user choices, inserting into ft_urls_choices

  3. reading previous user choices

  4. user screen

One idea is that we can create a special vacuum process for the ft_urls_choices table. It would scan the table and if for the same person/document/url there are several decisions, it would delete the old ones, and only leave the most recent one.

(It could also, optionally, move the old ones to a separate archive table.)
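The vacuum idea can be sketched on in-memory rows. This is a hedged Python illustration, not a description of an existing ACIS process: it assumes that psid identifies the person, dsid the document and checksum the URL, that several rows per such triple can exist, and that each row carries a hypothetical `choice` field next to `time`.

```python
def vacuum_choices(rows):
    """Split rows into (kept, archived).

    For each (psid, dsid, checksum) triple only the row with the latest
    time is kept; all older rows are returned separately so that the
    caller can delete them or move them to an archive table.
    """
    rows = list(rows)
    latest = {}
    for row in rows:
        key = (row["psid"], row["dsid"], row["checksum"])
        if key not in latest or row["time"] > latest[key]["time"]:
            latest[key] = row
    kept_ids = {id(row) for row in latest.values()}
    kept = [row for row in rows if id(row) in kept_ids]
    archived = [row for row in rows if id(row) not in kept_ids]
    return kept, archived
```

Returning the older rows instead of dropping them keeps the optional archive step a decision of the caller.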

Document to document links



link type — has a name, description and, possibly, a reverse link type name; some link types have no reverse.


$HOME/ is an example configuration file. Rename to $HOME/doclinks.conf.xml and edit to your liking.


  1. get configuration

  2. get all current links for a record - an optional, but potentially useful abstraction level

    my $l = getdoclinks( $record );
    savedoclinks( $record, $l );

  3. get current links for a document

    $l->fordocument( $dsid );
    $l->todocument( $dsid );   #?
    $l->fromdocument( $dsid ); #?
    $l->label($type)
    $l->allexpanded;
    $l->all_compact;
    $l->count;

  4. create a link, drop a link

    $l->add($dsid1,$type,$dsid2);
    $l->drop($dsid1,$type,$dsid2);

  5. screen
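To make the intended behaviour of the link collection concrete, here is a minimal Python sketch of a comparable object. It is an illustration under assumptions, not the ACIS interface: the class name `DocLinks`, the Python method names, the link type names in the usage note, and the reading of "links for a document" as links in either direction are all hypothetical.

```python
class DocLinks:
    """A set of (source dsid, link type, target dsid) triples."""

    def __init__(self):
        self.links = set()

    def add(self, dsid1, link_type, dsid2):
        """Create a link; adding the same link twice has no effect."""
        self.links.add((dsid1, link_type, dsid2))

    def drop(self, dsid1, link_type, dsid2):
        """Remove a link; dropping a missing link is silently ignored."""
        self.links.discard((dsid1, link_type, dsid2))

    def from_document(self, dsid):
        """Links that start at the given document."""
        return sorted(link for link in self.links if link[0] == dsid)

    def to_document(self, dsid):
        """Links that point at the given document."""
        return sorted(link for link in self.links if link[2] == dsid)

    def for_document(self, dsid):
        """Assumption: all links touching the document, in either direction."""
        return sorted(link for link in self.links if dsid in (link[0], link[2]))

    def count(self):
        return len(self.links)
```

A short usage example: after `l = DocLinks()`, `l.add("d1", "is-version-of", "d2")` and `l.add("d3", "cites", "d1")`, the call `l.for_document("d1")` returns both links, while `l.to_document("d1")` returns only the one from "d3".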

screen design