Stage D: full-text links, document-to-document links and fuzzy name searches

   Fuzzy name searching
      TODO
      notions
      configuration
      administration
      implementation components
      tools
   Full-text file recognition
      TODO
      input
      user choices
      configuration
      data storage
      data tables
      implementation components
   Document to document links
      TODO
      notions
      configuration
      components
      screen design

Fuzzy name searching

TODO

documentation
fuzzy search enabled for default background search (not only under APU)?
XXX marks?

notions

distance level = Levenshtein distance / string length of the name variation

The default critical distance level is 1/m, where m is the minimum variation length (fuzzy-name-search-min-variation-length), i.e. 1/7 by default. For example, if the Levenshtein distance is 2 edits and the name variation’s length is 20, the distance level is 2/20 = 0.1; 0.1 is less than 1/7, so this would be a positive match.
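The distance level rule can be sketched in a few lines of Perl. This is only an illustration, not the ACIS code: the function names are hypothetical, the Levenshtein routine is hand-written to keep the example free of dependencies (the tools section lists String::Approx instead), and treating the threshold as a strict "less than" is an assumption based on the example above.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain dynamic-programming Levenshtein distance
# (insertions, deletions and substitutions all cost 1).
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# distance level = Levenshtein distance / length of the name variation.
# Assumes a non-empty variation (the minimum variation length guarantees it).
sub distance_level {
    my ($variation, $expression) = @_;
    return levenshtein($variation, $expression) / length($variation);
}

# Hypothetical helper: positive match when the level is below 1/m.
sub is_fuzzy_match {
    my ($variation, $expression, $m) = @_;
    $m = 7 unless defined $m;    # documented default for m
    return distance_level($variation, $expression) < 1 / $m ? 1 : 0;
}

# One edit over 15 characters is below 1/7, so this counts as a match.
print "match\n" if is_fuzzy_match('Johansson, Erik', 'Johanson, Erik');
```

Dividing by the variation's length means that longer names tolerate more edits than short ones under the same threshold.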

configuration

fuzzy-name-search — on or off
fuzzy-name-search-min-common-prefix — The number of characters n at the start of a name variation that must match the name expression exactly. Default: 3.
fuzzy-name-search-min-variation-length — The minimum number of characters m that a name variation must have to qualify for fuzzy matching. The default is 7.
fuzzy-name-search-max-name-occurr-in-doc-names — The maximum number of occurrences of a name expression in the document author names table at which it is still considered for fuzzy matching. The default is 1. If this parameter is set to 0 or is not set, no maximum is checked.
fuzzy-name-search-max-name-occurr-in-name-variations — The maximum number of occurrences of a name expression in the name variations table at which it is still considered for fuzzy matching. The default is 0, i.e. a name expression must not be present among the name variations. Set it to -1 to disable this limit.
fuzzy-name-search-via-web — whether fuzzy name searches are run when a research search is initiated by the online user. (When a search is APU-initiated, this is controlled by fuzzy-name-search.)
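As an illustration of how the prefix and length parameters might narrow the work before any distance is computed, here is a minimal Perl sketch. The %conf hash, the function names and the case-sensitive comparison are assumptions made for the example; only the parameter names and defaults come from the list above.

```perl
use strict;
use warnings;

# Hypothetical configuration hash; keys mirror the parameter names above,
# values are the documented defaults.
my %conf = (
    'fuzzy-name-search-min-common-prefix'    => 3,
    'fuzzy-name-search-min-variation-length' => 7,
);

# True if the variation is long enough to be fuzzy matched at all.
sub variation_qualifies {
    my ($variation, $conf) = @_;
    my $m = $conf->{'fuzzy-name-search-min-variation-length'};
    return length($variation) >= $m ? 1 : 0;
}

# True if the first n characters of the variation and of the name
# expression are identical (case-sensitive here, which is an assumption).
sub shares_prefix {
    my ($variation, $expression, $conf) = @_;
    my $n = $conf->{'fuzzy-name-search-min-common-prefix'};
    return 0 if length($variation) < $n || length($expression) < $n;
    return substr($variation, 0, $n) eq substr($expression, 0, $n) ? 1 : 0;
}

# Keep only the name expressions worth passing to the distance computation.
sub candidates {
    my ($variation, $expressions, $conf) = @_;
    return () unless variation_qualifies($variation, $conf);
    return grep { shares_prefix($variation, $_, $conf) } @$expressions;
}

my @names = ('Johanson, Erik', 'Jonsson, Erik', 'Andersson, Eva');
print join('; ', candidates('Johansson, Erik', \@names, \%conf)), "\n";
```

The exact prefix test is cheap, so applying it first keeps the more expensive edit-distance comparison to a small set of name expressions.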

administration

Run the bin/fuzzy_search_table script regularly to update the rare table.

implementation components

Given a set of name expressions (i.e. document author names), compare them to a set of name variations and find close matches.
Given a name variation and an ACIS configuration, find all relevant name expressions that should be fuzzy-compared to that variation.

mysql> select name,count(name) from testrdb.res_creators_separate group by name;

mysql> select name,count(name) from names group by name;
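The two queries above report how often each name occurs. Assuming the first counts document author names and the second counts name variations (the notes do not say so explicitly), the two occurrence limits from the configuration section could be applied roughly as follows. The function name and the in-memory count hashes are hypothetical.

```perl
use strict;
use warnings;

# $doc_counts and $var_counts map a name to its number of occurrences,
# in the shape the two GROUP BY queries would report them.
sub considered_for_fuzzy_matching {
    my ($name, $doc_counts, $var_counts, $conf) = @_;

    # 0 or unset means "no maximum" for the document author names table.
    my $max_doc = $conf->{'fuzzy-name-search-max-name-occurr-in-doc-names'};
    if ($max_doc) {
        return 0 if ($doc_counts->{$name} // 0) > $max_doc;
    }

    # Default 0: the name must not appear among the name variations.
    # A value of -1 disables the check.
    my $max_var =
        $conf->{'fuzzy-name-search-max-name-occurr-in-name-variations'} // 0;
    if ($max_var >= 0) {
        return 0 if ($var_counts->{$name} // 0) > $max_var;
    }

    return 1;
}

my %doc_counts = ('Johanson, Erik' => 1, 'Smith, John' => 12);
my %var_counts = ('Smith, John' => 1);
my %conf = ('fuzzy-name-search-max-name-occurr-in-doc-names' => 1);

for my $name (sort keys %doc_counts) {
    my $ok = considered_for_fuzzy_matching($name, \%doc_counts, \%var_counts, \%conf);
    print "$name: ", ($ok ? 'consider' : 'skip'), "\n";
}
```

Only rare names pass both limits, which fits the idea of a precomputed table of rare names refreshed by the administration script.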

tools

String::Approx

Full-text file recognition

TODO

input data — support LoTEc data format ftp://all.repec.org/LotEc
- configure test.acis to use that
improve / fix the screen
- add listed works count? - not critical, maybe
- scroll the page vertically so that the menu is in the screen middle? - maybe; not critical
- add explanations or links to explanations — Thomas hasn’t yet sent me the wording corrections and/or explanations

input

authoritative:

<text id=".."> 
 <file>
  <url>url</url>
 </file>
</text>

automatically found:

<text id=".."> 
 <hasversion>
  <text>
   <file>
    <url>url</url>
   </file>
  </text>
 </hasversion>
</text>

user choices

In the recognition menu, there are four mutually exclusive options.

1. This is not a file related to this paper.
2. This is a web page describing the paper.
3. This is a full-text file for another version of the paper.
4. This is a full-text file for this version of the paper.

The default is option 4.

The options will be expressed in shorter words, but there will be a help button.

The permission menu is shown only if one of the last two options has been chosen. In the permission menu, there are three options:

1. This full-text file may be archived as is.
2. This full-text file may be archived, but check for updates.
3. This full-text file may not be archived.

The default is option 1.

Co-authors may make contradictory choices. An ACIS installation records such choices, but does not resolve them.

Short versions of the menu items, first menu (“recognition”):

wrong (n)
abstract page (d)
full-text file of another version (r)
correct full-text file (y)

second menu (“permission”):

may archive (y)
archive, but check for updates (c)
do not archive (n)

configuration

conf parameter:

full-text-urls-recognition — on or off

data storage

The URLs are stored in the ft_urls table. The user decisions (choices) are stored in the ft_urls_choices table. (See below.) They are not stored in userdata, unlike every other piece of data given by users.

data tables

table ft_urls:

dsid char(15) not null
url blob not null
checksum char(16) binary not null
nature ENUM('authoritative','automatic') not null
source varchar(255) not null

PRIMARY KEY( dsid, checksum ), index url_i(url(30)), index source_i(source(50))

table ft_urls_choices:

dsid char(15) not null
checksum char(16) binary not null
psid char(15) not null
choice char(2) not null — first char: d|y|r|n (corresponding menu items: 2|4|3|1); second char: y|c|n (1|2|3)
time datetime not null

primary key prim(dsid, checksum, psid), index t_i(time), index psid_i(psid)
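To make the two-character choice code concrete, here is a small Perl sketch that decodes it back into the menu wording. The function name and the returned structure are hypothetical. It also assumes that the second character may be missing when the permission menu was not shown, which the notes do not specify.

```perl
use strict;
use warnings;

# First character: recognition menu (items 1 to 4 are n, d, r, y).
my %RECOGNITION = (
    n => 'not a file related to this paper',
    d => 'web page describing the paper',
    r => 'full-text file for another version of the paper',
    y => 'full-text file for this version of the paper',
);

# Second character: permission menu (items 1 to 3 are y, c, n).
my %PERMISSION = (
    y => 'may be archived as is',
    c => 'may be archived, but check for updates',
    n => 'may not be archived',
);

# Decode a choice value; returns undef for anything malformed.
sub decode_choice {
    my ($choice) = @_;
    my ($rec, $perm) = $choice =~ /\A([dyrn])([ycn])?\s*\z/
        or return;
    # The permission menu only applies to the last two recognition options.
    my $applies = ($rec eq 'r' || $rec eq 'y') && defined $perm;
    return {
        recognition => $RECOGNITION{$rec},
        permission  => ($applies ? $PERMISSION{$perm} : undef),
    };
}

my $d = decode_choice('yc');
print "$d->{recognition}; $d->{permission}\n";
```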

implementation components

input processing, ft_urls table making
recording user choices, inserting into ft_urls_choices
reading previous user choices
user screen

One idea is to create a special vacuum process for the ft_urls_choices table. It would scan the table and, where several decisions exist for the same person/document/URL, delete the old ones and leave only the most recent.

(Optionally, it could also move the old ones to a separate archive table.)
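The vacuum idea can be sketched over in-memory rows. Note that with the primary key listed for ft_urls_choices (dsid, checksum, psid) the table can hold only one row per combination, so this sketch assumes that several rows per combination are actually kept, for example under a different key. The function name and the row layout are hypothetical.

```perl
use strict;
use warnings;

# Each row is a hash ref with the ft_urls_choices columns. The checksum
# stands in for the URL, as it does in the table's key.
# Returns (rows to keep, rows that could be deleted or archived).
sub vacuum_choices {
    my (@rows) = @_;
    my (%latest, @old);
    # 'YYYY-MM-DD HH:MM:SS' strings sort chronologically as plain strings.
    for my $row (sort { $a->{time} cmp $b->{time} } @rows) {
        my $key = join "\0", @$row{qw(psid dsid checksum)};
        push @old, $latest{$key} if exists $latest{$key};
        $latest{$key} = $row;
    }
    return ([ values %latest ], \@old);
}

my ($keep, $old) = vacuum_choices(
    { dsid => 'd1', checksum => 'c1', psid => 'p1', choice => 'n',  time => '2007-01-01 10:00:00' },
    { dsid => 'd1', checksum => 'c1', psid => 'p1', choice => 'yy', time => '2007-02-01 10:00:00' },
    { dsid => 'd1', checksum => 'c1', psid => 'p2', choice => 'd',  time => '2007-01-15 10:00:00' },
);
printf "keep %d, remove %d\n", scalar @$keep, scalar @$old;
```

Returning the superseded rows instead of discarding them leaves the caller free to delete them or to copy them into an archive table first.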

Document to document links

TODO

screen: improve destination document listing (menu); for instance, follow Sune’s recommendation. Or Thomas’ idea of alphabetical order.
use Ajax to send the new link data and update the page?
use Ajax to delete a link?
document
fix XXX marks?

notions

link type — has a name, description and, possibly, a reverse link type name; some link types have no reverse.
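A link type with an optional reverse could be modelled as below. This is a sketch only: of the names used, only 'follow-up' appears in these notes. The other type names, the descriptions and the reverse_link helper are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical link type table, keyed by link type name.
my %LINK_TYPES = (
    'follow-up' => {
        description => 'this document is a follow-up to the other one',
        reverse     => 'followed-by',
    },
    'followed-by' => {
        description => 'the other document is a follow-up to this one',
        reverse     => 'follow-up',
    },
    'see-also' => {
        description => 'related reading',
        reverse     => undef,    # this type has no reverse
    },
);

# Return the implied reverse link as (source, type, target), or an empty
# list when the type is unknown or has no reverse.
sub reverse_link {
    my ($from, $type, $to) = @_;
    my $t = $LINK_TYPES{$type} or return;
    return unless defined $t->{reverse};
    return ($to, $t->{reverse}, $from);
}

print join(' ', reverse_link('d1', 'follow-up', 'd2')), "\n";
```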

configuration

document-document-links-profile — yes / no

$HOME/doclinks.conf.xml.eg is an example configuration file. Rename to $HOME/doclinks.conf.xml and edit to your liking.

components

get configuration
- parse it (ACIS::DumpXML::Parse)
- check it (validate)
- query configuration:
  
  $lab = $conf->type('follow-up')->label;
  $rev = $conf->type('follow-up')->rev;
  $rnm = $conf->type('follow-up')->rev->name;
  
  or
  
  $lab = $conf->label('follow-up');
  $rna = $conf->rev('follow-up');
  $rla = $conf->revlabel('follow-up');
get all current links for a record - an optional, but potentially useful abstraction level

my $l = getdoclinks( $record );
savedoclinks( $record, $l );
get current links for a document

$l->fordocument( $dsid );
$l->todocument( $dsid );   #?
$l->fromdocument( $dsid ); #?
$l->label($type);
$l->allexpanded;
$l->all_compact;
$l->count;
create a link, drop a link

$l->add($dsid1,$type,$dsid2);
$l->drop($dsid1,$type,$dsid2);
screen
- test on IE/Windows and in other browsers
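The add, drop, count and fordocument calls listed under components could be backed by something as small as the following in-memory class. It is a stand-in to show the intended behaviour, not the ACIS implementation. The package name is hypothetical, and treating a repeated add of the same link as a no-op is an assumption.

```perl
use strict;
use warnings;

package DocLinks::Sketch;    # hypothetical class name

sub new {
    my ($class) = @_;
    return bless { links => {} }, $class;
}

# A link is identified by (source dsid, link type, target dsid).
sub _key { return join "\0", @_ }

# Adding the same link twice keeps a single copy.
sub add {
    my ($self, @link) = @_;
    $self->{links}{ _key(@link) } = [@link];
    return $self;
}

# Returns 1 if the link existed and was removed, 0 otherwise.
sub drop {
    my ($self, @link) = @_;
    return delete $self->{links}{ _key(@link) } ? 1 : 0;
}

sub count {
    my ($self) = @_;
    return scalar keys %{ $self->{links} };
}

# All links that start at or point to the given document.
sub fordocument {
    my ($self, $dsid) = @_;
    return grep { $_->[0] eq $dsid || $_->[2] eq $dsid }
           values %{ $self->{links} };
}

package main;

my $l = DocLinks::Sketch->new;
$l->add('d1', 'follow-up', 'd2');
$l->add('d3', 'follow-up', 'd1');
print $l->count, " links\n";
```

Keying links by the full triple makes drop a single hash delete and rules out duplicate links without an extra scan.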

screen design

use Yahoo! UI javascript library?
use http://script.aculo.us/?
