Stage D: full-text links, document-to-document links and fuzzy name searches

Table of contents

   Fuzzy name searching
      TODO
      notions
      configuration
      administration
      implementation components
      tools
   Full-text file recognition
      TODO
      input
      user choices
      configuration
      data storage
      data tables
      implementation components
   Document to document links
      TODO
      notions
      configuration
      components
      screen design

Fuzzy name searching

TODO

notions

distance level = Levenshtein distance / string length of the name variation

The default critical distance level is 1/m with m = 7, i.e. 1/7. For example, if the Levenshtein distance is 2 edits and the name variation’s length is 20, the distance level is 2/20 = 0.1; 0.1 is less than 1/7 (about 0.143), so this would be a positive match.
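The distance-level rule can be sketched in a few lines. This is an illustration in Python, not ACIS code: the function names `levenshtein`, `distance_level` and `is_fuzzy_match` are hypothetical, and the strict "less than" comparison is an assumption taken from the worked example.

```python
from fractions import Fraction


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Default critical distance level from the text: 1/7.
DEFAULT_CRITICAL_LEVEL = Fraction(1, 7)


def distance_level(name_expression: str, variation: str) -> Fraction:
    """Levenshtein distance divided by the length of the name variation."""
    if not variation:
        raise ValueError("name variation must not be empty")
    return Fraction(levenshtein(name_expression, variation), len(variation))


def is_fuzzy_match(name_expression: str, variation: str,
                   critical: Fraction = DEFAULT_CRITICAL_LEVEL) -> bool:
    """Assumption: a match requires a level strictly below the critical level."""
    return distance_level(name_expression, variation) < critical
```

Exact fractions are used so that a comparison against 1/7 is not affected by floating-point rounding.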

configuration

administration

Run the bin/fuzzy_search_table script regularly to update the rare table.

implementation components

  1. Given a set of name expressions (i.e. document author names), compare them to a set of name variations and find close matches.

  2. Given a name variation and an ACIS configuration, find all relevant name expressions that should be fuzzy-compared to that variation.

mysql> select name,count(name) from testrdb.res_creators_separate group by name;

mysql> select name,count(name) from names group by name;

tools

String::Approx

Full-text file recognition

TODO

  1. input data — support LoTEc data format ftp://all.repec.org/LotEc

  2. improve / fix the screen

input

authoritative:

<text id=".."> 
 <file>
  <url>url</url>
 </file>
</text>

automatically found:

<text id=".."> 
 <hasversion>
  <text>
   <file>
    <url>url</url>
   </file>
  </text>
 </hasversion>
</text>

user choices

In the recognition menu, there are four mutually exclusive options.

  1. This is not a file related to this paper.
  2. This is a web page describing the paper.
  3. This is a full-text file for another version of the paper.
  4. This is a full-text file for this version of the paper.

The default is option 4.

The options will be expressed in shorter words, but there will be a help button.

The permissions menu is only shown if one of the last two options has been chosen. In the permissions menu, there are three options:

  1. This full-text file may be archived as is.
  2. This full-text file may be archived, but check for updates.
  3. This full-text file may not be archived.

The default is option 1.

Co-authors may make contradictory choices. An ACIS installation records such choices, but does not resolve them.

Short versions of the menu items, first menu (“recognition”):

  1. wrong (n)
  2. abstract page (d)
  3. full-text file of another version (r)
  4. correct full-text file (y)

second menu (“permission”):

  1. may archive (y)
  2. archive, but check for updates (c)
  3. do not archive (n)
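The two menus and their defaults could be modeled as in the following Python sketch. The short codes and defaults come from the lists above; the names `RECOGNITION`, `PERMISSION`, `needs_permission_menu` and `validate_choice` are hypothetical, and dropping a permission answer when the permission menu is not shown is an assumption.

```python
# Short codes of the first menu ("recognition").
RECOGNITION = {
    "n": "wrong",
    "d": "abstract page",
    "r": "full-text file of another version",
    "y": "correct full-text file",
}

# Short codes of the second menu ("permission").
PERMISSION = {
    "y": "may archive",
    "c": "archive, but check for updates",
    "n": "do not archive",
}

DEFAULT_RECOGNITION = "y"  # option 4 of the recognition menu
DEFAULT_PERMISSION = "y"   # option 1 of the permission menu


def needs_permission_menu(recognition):
    """The permission menu is shown only for the last two recognition options."""
    return recognition in ("r", "y")


def validate_choice(recognition=DEFAULT_RECOGNITION, permission=None):
    """Return a (recognition, permission) pair, applying the defaults."""
    if recognition not in RECOGNITION:
        raise ValueError("unknown recognition code: %r" % (recognition,))
    if not needs_permission_menu(recognition):
        # Assumption: any permission answer is ignored in this case.
        return recognition, None
    if permission is None:
        permission = DEFAULT_PERMISSION
    if permission not in PERMISSION:
        raise ValueError("unknown permission code: %r" % (permission,))
    return recognition, permission
```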

configuration

conf parameter:

data storage

The URLs are stored in the ft_urls table. The user decisions (choices) are stored in the ft_urls_choices table (see below). Unlike every other piece of data given by users, they are not stored in userdata.

data tables

table ft_urls:

PRIMARY KEY( dsid, checksum ), index url_i(url(30)), index source_i(source(50))

table ft_urls_choices:

primary key prim(dsid, checksum, psid), index t_i(time), index psid_i(psid)

implementation components

  1. input processing, ft_urls table making

  2. recording user choices, inserting into ft_urls_choices

  3. reading previous user choices

  4. user screen

One idea is to create a special vacuum process for the ft_urls_choices table. It would scan the table and, if there are several decisions for the same person/document/URL, delete the old ones and leave only the most recent one.

(Optionally, it could also move the old ones to a separate archive table.)
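The vacuum idea can be sketched on in-memory rows. This Python sketch is an illustration under assumptions, not the ACIS implementation: the `Choice` record and the `vacuum` function are hypothetical, the `choice` field name is invented for the example, and `time` is taken to be any comparable timestamp.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Choice(NamedTuple):
    dsid: str      # document id
    checksum: str  # identifies the URL
    psid: str      # person id
    time: int      # e.g. a Unix timestamp
    choice: str    # hypothetical field holding the user's decision


def vacuum(rows: Iterable[Choice]) -> Tuple[List[Choice], List[Choice]]:
    """Split rows into (kept, archived).

    For each person/document/URL key only the most recent decision is
    kept; older ones are returned separately so that a caller can either
    discard them or move them to an archive.
    """
    latest = {}
    archived = []
    for row in rows:
        key = (row.dsid, row.checksum, row.psid)
        current = latest.get(key)
        if current is None:
            latest[key] = row
        elif row.time >= current.time:
            archived.append(current)  # the newer row supersedes the old one
            latest[key] = row
        else:
            archived.append(row)
    return list(latest.values()), archived
```

Returning the superseded rows instead of dropping them covers both variants: plain deletion and the optional archive table.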

Document to document links

TODO

notions

link type — has a name, description and, possibly, a reverse link type name; some link types have no reverse.
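As an illustration of this notion, a link type and its optional reverse could be represented as below. This is a Python sketch under assumptions: the `LinkType` class, the `reverse_link` helper and the three example link types are all made up for the example and are not taken from any ACIS configuration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LinkType:
    name: str
    description: str
    reverse: Optional[str] = None  # name of the reverse link type, if any


# Hypothetical link types, for illustration only.
LINK_TYPES = {
    t.name: t
    for t in (
        LinkType("updates", "this document updates the other", "updated-by"),
        LinkType("updated-by", "this document is updated by the other", "updates"),
        LinkType("related", "the documents are related"),  # no reverse
    )
}


def reverse_link(source: str, type_name: str,
                 target: str) -> Optional[Tuple[str, str, str]]:
    """Return the reverse of a (source, type, target) link, or None."""
    reverse = LINK_TYPES[type_name].reverse
    if reverse is None:
        return None
    return (target, reverse, source)
```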

configuration

$HOME/doclinks.conf.xml.eg is an example configuration file. Rename to $HOME/doclinks.conf.xml and edit to your liking.

components

  1. get configuration

  2. get all current links for a record - an optional, but potentially useful abstraction level

    my $l = getdoclinks( $record );
    savedoclinks( $record, $l );

  3. get current links for a document

    $l->fordocument( $dsid );
    $l->todocument( $dsid );   #?
    $l->fromdocument( $dsid ); #?
    $l->label($type);
    $l->allexpanded;
    $l->all_compact;
    $l->count;

  4. create a link, drop a link

    $l->add($dsid1,$type,$dsid2);
    $l->drop($dsid1,$type,$dsid2);

  5. screen
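The link-handling components listed above could behave roughly as in this Python sketch. It mirrors the add, drop, count and per-document calls, but everything else is an assumption: the `DocLinks` class is hypothetical, links are taken to be (source, type, target) triples, and the "from", "to" and "for" lookups are assumed to mean outgoing, incoming and either direction.

```python
class DocLinks:
    """In-memory collection of (source dsid, link type, target dsid) triples."""

    def __init__(self):
        self._links = []  # kept in insertion order, without duplicates

    def add(self, source, link_type, target):
        link = (source, link_type, target)
        if link not in self._links:
            self._links.append(link)

    def drop(self, source, link_type, target):
        link = (source, link_type, target)
        if link in self._links:
            self._links.remove(link)

    def from_document(self, dsid):
        """Links going out of the document (assumed meaning)."""
        return [l for l in self._links if l[0] == dsid]

    def to_document(self, dsid):
        """Links pointing at the document (assumed meaning)."""
        return [l for l in self._links if l[2] == dsid]

    def for_document(self, dsid):
        """Links that involve the document in either role (assumed meaning)."""
        return [l for l in self._links if dsid in (l[0], l[2])]

    def count(self):
        return len(self._links)
```

A short usage example:

```python
links = DocLinks()
links.add("d1", "updates", "d2")
links.add("d3", "related", "d1")
links.drop("d3", "related", "d1")
```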

screen design