Fuzzy name searching
TODO
notions
configuration
administration
implementation components
tools
Full-text file recognition
TODO
input
user choices
configuration
data storage
data tables
implementation components
Document to document links
TODO
notions
configuration
components
screen design
documentation
fuzzy search enabled for default background search (not only under APU)?
XXX marks?
distance level = Levenshtein distance / string length of the name variation
The default critical distance level is 1/m, i.e. 1/7 with the default m = 7. For example, if the Levenshtein distance is 2 edits and the name variation’s length is 20, the distance level is 2/20 = 0.1; 0.1 is less than 1/7 (about 0.143), so this would be a positive match.
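The arithmetic above can be sketched in a few lines. This is an illustration in Python, not ACIS code: the function names levenshtein, distance_level and is_fuzzy_match are hypothetical, and treating the critical level as a strict upper bound is an assumption based on the "less than 1/7" wording.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + cost))   # substitution or match
        prev = cur
    return prev[-1]


def distance_level(variation: str, expression: str) -> float:
    """Levenshtein distance divided by the length of the name variation.

    Assumes a non-empty variation; the minimum variation length
    setting would normally guarantee that.
    """
    return levenshtein(variation, expression) / len(variation)


def is_fuzzy_match(variation: str, expression: str, m: int = 7) -> bool:
    """Assumption: a match requires the level to be strictly below 1/m."""
    return distance_level(variation, expression) < 1 / m


# A 20-character variation that differs from the expression by 2 edits.
variation = "alexander kowalewski"
expression = "alexander kovalevski"
print(levenshtein(variation, expression))     # 2
print(distance_level(variation, expression))  # 0.1
print(is_fuzzy_match(variation, expression))  # True
```

Dividing by the variation's length means longer names tolerate more edits, while short names must be nearly identical.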
fuzzy-name-search — on or off
fuzzy-name-search-min-common-prefix — The number of characters n at the start of a name variation that have to match the name expression exactly. Default: 3.
fuzzy-name-search-min-variation-length — The minimum number of characters m that a name variation must have to qualify for fuzzy matching. The default is 7.
fuzzy-name-search-max-name-occurr-in-doc-names — The maximum number of times a name expression may occur in the document author names table and still be considered for fuzzy matching. The default is 1. If this parameter is set to 0 or is not set, no maximum is checked.
fuzzy-name-search-max-name-occurr-in-name-variations — The maximum number of times a name expression may occur in the name variations table and still be considered for fuzzy matching. By default, the maximum is 0, i.e. a name expression should not be present among the name variations. Set it to -1 to disable this limit.
fuzzy-name-search-via-web — Whether fuzzy name searches should be run when a research search is initiated by the online user. (When a search is APU-initiated, this is controlled by fuzzy-name-search.)
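To show how these parameters might interact, here is a hedged Python sketch of a pre-filter that decides whether a name variation and a name expression are worth fuzzy-comparing at all. It is not the ACIS implementation: the FuzzyConfig class, its field names and the is_candidate function are hypothetical, the occurrence counts are assumed to be looked up elsewhere, and the prefix comparison is assumed to be case-sensitive.

```python
from dataclasses import dataclass


@dataclass
class FuzzyConfig:
    # Hypothetical field names; defaults follow the parameter descriptions.
    enabled: bool                           # fuzzy-name-search (on or off)
    min_common_prefix: int = 3              # n
    min_variation_length: int = 7           # m
    max_occurr_in_doc_names: int = 1        # 0 means no maximum
    max_occurr_in_name_variations: int = 0  # -1 means no maximum


def is_candidate(variation: str, expression: str,
                 doc_name_count: int, variation_count: int,
                 conf: FuzzyConfig) -> bool:
    """Return True if the pair passes all configured thresholds.

    doc_name_count:  occurrences of the expression among document author names
    variation_count: occurrences of the expression among name variations
    """
    if not conf.enabled:
        return False
    if len(variation) < conf.min_variation_length:
        return False
    n = conf.min_common_prefix
    if variation[:n] != expression[:n]:
        return False
    if (conf.max_occurr_in_doc_names > 0
            and doc_name_count > conf.max_occurr_in_doc_names):
        return False
    if (conf.max_occurr_in_name_variations >= 0
            and variation_count > conf.max_occurr_in_name_variations):
        return False
    return True


conf = FuzzyConfig(enabled=True)
print(is_candidate("kowalewski", "kowalevski", 1, 0, conf))  # True
print(is_candidate("kowalewski", "kovalevski", 1, 0, conf))  # False: prefix differs
```

Only pairs that pass such a filter would then go on to the more expensive distance computation.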
Run the bin/fuzzy_search_table script regularly to update the rare table.
given a set of name expressions (i.e. document author names) compare it to a set of name variations and find close matches.
given a name variation and an ACIS configuration, find all relevant name expressions, which should be fuzzy-compared to that variation.
mysql> select name,count(name) from testrdb.res_creators_separate group by name;
mysql> select name,count(name) from names group by name;
String::Approx
input data — support LoTEc data format ftp://all.repec.org/LotEc
improve / fix the screen
add listed works count? - not critical, maybe
scroll the page vertically so that the menu is in the screen middle? - maybe; not critical
add explanations or links to explanations — Thomas hasn’t yet sent me the wording corrections and/or explanations
authoritative:
<text id="..">
  <file>
    <url>url</url>
  </file>
</text>
automatically found:
<text id="..">
  <hasversion>
    <text>
      <file>
        <url>url</url>
      </file>
    </text>
  </hasversion>
</text>
In the recognition menu, there are four mutually exclusive options.
The default is option 4.
The options will be expressed in shorter words, but there will be a help button.
The permission menu is only shown if one of the last two options has been chosen. In the permission menu, there are two options.
The default is option 1.
Co-authors may make contradictory choices. An ACIS installation records such choices, but does not resolve them.
Short versions of the menu items, first menu (“recognition”):
second menu (“permission”):
conf parameter:
The URLs are stored in the ft_urls table. The user decisions (choices) are stored in the ft_urls_choices table. (See below.) They are not stored in userdata! (Contrary to where every other piece of data given by users is stored.)
table ft_urls:
PRIMARY KEY( dsid, checksum ), index url_i(url(30)), index source_i(source(50))
table ft_urls_choices:
primary key prim(dsid, checksum, psid), index t_i(time), index psid_i(psid)
input processing, ft_urls table making
recording user choices, inserting into ft_urls_choices
reading previous user choices
user screen
One idea is that we can create a special vacuum process for the ft_urls_choices table. It would scan the table and if for the same person/document/url there are several decisions, it would delete the old ones, and only leave the most recent one.
(Optionally, it could also move the old ones to a separate archive table.)
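The vacuum idea can be illustrated on plain in-memory rows. The following Python sketch is only a model of the logic, not a database job: the vacuum_choices function is hypothetical, the column names psid, dsid, checksum and time are taken from the key and index definitions above, the choice column is an assumption, and so is the premise that several rows for the same person/document/URL can coexist.

```python
def vacuum_choices(rows):
    """Split rows into (kept, archived).

    For each (psid, dsid, checksum) key only the row with the greatest
    time is kept; all older rows are returned separately so the caller
    can delete them or move them to an archive table.
    """
    latest = {}
    for row in rows:
        key = (row["psid"], row["dsid"], row["checksum"])
        # On equal timestamps the row seen first stays (an arbitrary choice).
        if key not in latest or row["time"] > latest[key]["time"]:
            latest[key] = row
    kept = list(latest.values())
    archived = [row for row in rows if not any(row is k for k in kept)]
    return kept, archived


rows = [
    {"psid": "p1", "dsid": "d1", "checksum": "c1", "choice": "a", "time": 1},
    {"psid": "p1", "dsid": "d1", "checksum": "c1", "choice": "b", "time": 5},
    {"psid": "p2", "dsid": "d1", "checksum": "c1", "choice": "a", "time": 3},
]
kept, archived = vacuum_choices(rows)
print(len(kept), len(archived))  # 2 1
```

Returning the superseded rows instead of discarding them keeps the optional archive step a decision for the caller.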
screen: improve destination document listing (menu); for instance, follow Sune’s recommendation. Or Thomas’ idea of alphabetical order.
use Ajax to send the new link data and update the page?
use Ajax to delete a link?
document
fix XXX marks?
link type — has a name, description and, possibly, a reverse link type name; some link types have no reverse.
$HOME/doclinks.conf.xml.eg is an example configuration file. Rename to $HOME/doclinks.conf.xml and edit to your liking.
get configuration
parse it (ACIS::DumpXML::Parse)
check it (validate)
query configuration:
$lab = $conf->type('follow-up')->label;
$rev = $conf->type('follow-up')->rev;
$rnm = $conf->type('follow-up')->rev->name;
or
$lab = $conf->label('follow-up');
$rna = $conf->rev('follow-up');
$rla = $conf->revlabel('follow-up');
get all current links for a record - an optional, but potentially useful abstraction level
my $l = getdoclinks( $record );
savedoclinks( $record, $l );
get current links for a document
$l->fordocument( $dsid );
$l->todocument( $dsid ); #?
$l->fromdocument( $dsid ); #?
$l->label($type);
$l->allexpanded;
$l->all_compact;
$l->count;
create a link, drop a link
$l->add($dsid1,$type,$dsid2);
$l->drop($dsid1,$type,$dsid2);
screen
use Yahoo! UI javascript library?
use http://script.aculo.us/?
Generated: Wed Aug 29 22:59:09 2007
ACIS project, acis@openlib.org