Current version: ver.1.03
(c) Author: Yevgeniy Gorokh, UCR
goroh@ucr.kiev.ua
Ukrainian National Cancer Registry, Kyiv, Ukraine
http://www.i.com.ua/~ucr
1. What is it.
LINK_IT is an utility for record linkage. It was developed in the Ukrainian
National Cancer Registry, Kyiv, as a result of a long-time experience. LINK_IT
attempts to make linkage process more easy and helpful in practical usage. The
main task of this tool is searching and processing records which describe the
same person in different databases. "Processing" means, that we don't stop after
saying: "Here and there are two records related to the same person... Great!".
Usually we have to provide exchange of any kind of information between two data
sets - for example, new cancer cases from hospital-based to population-based
cancer registry, and fresh follow-up data for persons existing in both registries
- in the other way.
Of course, LINK_IT could not transfer something by itself. It only carries out the searching of linked (as well as unlinked!) records in two data sets, and, may be, prepares the "package" of useful data for export, which would be processed in the external system: cancer registry software, for example.
LINK_IT is provided on "AS IS" basis, without warranty of any kind. It helps us to solve our problems, and (may be) it could be useful also for you. Software development and its distribution for other people is not my main task; so, any kind of trouble or disappointment caused by using LINK_IT, is only your problem. You can try and use LINK_IT freely. There is no kind of copyright protection or limitation for "unregistered users" included into this software - I have no time and wish for doing it. However, if you decided that LINK_IT really helps in your work and makes profit to you - it would be great to REGISTER it and support our registry and me for further development of this software. I will be glad to spend my time for improving LINK_IT and making it more useful for you. Especially if your organization or scientific project is able to support such an improvement ;)
2. Main features and advantages (?) of LINK_IT.
3. Known problems and disadvantages.
4. Contacts.
You can freely contact me for any question related to record linkage and LINK_IT software via e-mail: goroh@ucr.kiev.ua Any comment or idea will be kindly appreciated!
Some information about application of record linkage in the practice of Ukrainian
Cancer Registry is available on our web site: http://www.i.com.ua/~ucr
If you need record linkage software but LINK_IT is not so good for you - contact
with prof. Rainer Schnell,
Germany, or with Australian developers of FEBRL software - Tim
Churches and Peter Christen.
FEBRL homepage is: http://datamining.anu.edu.au/projects/linkage.html
5. Special remarks.
Short details about working with lINK_IT (manual?...) see below.
Good luck! ;)
-= Working with LINK_IT =-
Of course, it is not complete manual. This is minimal information you need to know about installation and usage of LINK_IT. If it is not enough - ask me. I prefer to answer your questions, rather than writing large amounts of text and noone will use it :)
0. Installation.
Link-it is provided as self-extracted archive. Copy LINK_IT package into empty
directory and run it. After extracting, following files and directories will
be created:
link_it.exe - main executable module. Run it!
SPEC\ - directory with service files
SAMPLE\ - SAMPLE project
SPEC\ directory have to be acccessible as a subdirectory from that place where
you run Link_it.exe. Now it includes following files:
TRANSLAT.CHR - list of chars to be replaced until data extraction
UKR_RUS.CHR - another char replacements (Ukrainian to Russian, DOS)
PARTREPL.DBF - table for parts of worlds to be changed
NAMEREPL.DBF - "look-up" table of names to be replaced
PATRREPL.DBF - "look-up" table of patronymic names to be replaced
ADDRREPL.DBF - table of non-identifying words in address strings
FIELDS.STR - list of default comparator functions for some fields
CFG.STR - pattern for creating configuration file
LINK.STR - pattern for creating list of comparators
PACK.STR - pattern for creating list of blocking expressions
STR_EXP.STR - pattern for creating library of export formats
STR_EXPF.STR - pattern for creating files with structure for DBF export
STR_QUE.STR - pattern for creating library of queries
S_DESCR.STR - pattern for creating source structure files
BL1. PLL - palette file (non-editable)
*.DBF and *.STR files are really Dbase files and you can
copy and edit them if you want. Beware of deleting something from patterns!
You can edit *.CHR files by simple text editor like Notepad (but NOT
by MS World or any other who adds own special information into files!).
Every new or existing project is placed into separate subdirectory. It means
that all automatically created by Link_It files will be created there. To access
the project, select its directory using 'Open project' and F10 key. Or
run Link_It.exe with the path in command-line parameter:
>link_it.exe C:\LINK_IT\MY_PROJECTS\SAMPLE
or, for example,
>link_it MY_PROJECTS\SAMPLE
1. "Project options"
You determine the name of project (it will be written on the top frame), mode
(searching duplicate records, or linking two different data sets), date format
for export, end template file for clerical review of matches. Only MODE is obligatory.
You are free to copy certain files from one project to another, etc. - if you
are sure what that files are used for. We made "template directories" with ajusted
structures for data conversion, blocking expressions, comparators, queries for
processing and extracting results of linkage, etc. for our most often linkage
activities, and it saves a lot of time and allows us to implement typical regular
jobs quickly.
2. "Define data sets"
First, I suppose that you have findings obtained from two different places (particular
case is to extract one of them from your own data). It is your "source
data". For providing linkage, you have to convert both of them into
similar form and charset ("synchronize"). It would be made by
"extracting" procedure, but after determining the structure of
databases, and functions for extracting information from source to target for
every field. "Synchronized" data have to be stored in your project directory,
in files SOURCE1.DBF, SOURCE2.DBF. If you copy that files from earlier
created projects, or prepared them by alternative tools outside Link_it - you
can skip "Define data sets" stage.
2.1. "Data set No :" - allows svitching between 1st | 2nd data sets. For "Searching duplicates" mode, you could not select 2nd data set, and it will be automatically copied from the 1st.
2.2. "Select source file" - selecting the file which will
be processed for data extraction. By the way, the structure of Link_it source
file is created into file SOURCE1.STR or SOURCE2.STR. If structure
file already exists in project directory (for example, copied from "template
directory" or from previous project), you can use that (ajusted previously?)
file. Answer "No" on question about reloading, in that case.
The extension of source file determines the type of data extraction: *.TXT
files are processed as formatted text files by the foolowing rules:
1) Every string consists of fields for one person, and every field in every
stryng starts at the same position.
2) 1st string of the file have to include field names, and starting positions
of fields are fixed by "|" character (#124).
3) Every string ends by the DOS string separator (Chr#13+Chr#10)
As the example, see file SAMPLES\deputees.txt
The *.DBF files are processed as... as DBASE files, simply :)
2.3. "Define field attributes" - editing the structure
of future data set, and determining functions for extracting information from
the source. Bacically, you hafe automatically recognized structure of source
data. But you are free to add fields into structure, change their names, types
etc. May be useful, for example, to add field "BIRTHYEAR" in addition
to existing "BIRTHDATE", fill it by function YEAR and use for
pocket expression; add SOUNDEX field for the same task; or, include both
source and translated Names\Surnames\Address etc., for further review or export
processing. See bottom screen lines for hot keys.
Fields available in file SOURCEx.STR (source field attributes):
#N - used for sorting
FIELD_NAME - 10 UPPERCASE chars, Letter first, no spaces and special
chars available exept "_"
TYPE - "N","C","D" or "L" (dBASE field types).
LENGTH - field length (<=255)
DECIMALS - used for "N" type only
STATUS - "KEY" means that field (combination of fields) is unique
identifier in the data set. No need to insert it manually. If no KEY
fields were defined - it would be created automatically after finishing. "LINK"
means that field will be used for matching. Link_It clears "LINK" fields
in multiple records related to the same person in one data set, so only one
set of LINKing fields is processed for one person irrespective of number of
records described that person. "INFO" - any other field, and you are
free to create and process INFO fields for any need (queries, filters, data
exchange). "SERVICE" is status for auto-made field named "PRIMARY",
which marks "secondary" records related to one person as .F.
SOURCE FIELD - name of the field in the source database; value of that
field is used as an argument for an extraction function. Obligatory for DBF
type of source.
START AT - position of the TXT source file, which marks the beginning
of....
# OF CHARs the appropriate peace of data, and length of that peace. Used
for TXT type of source only, instead of "SOURCE FIELD"
FUNCTION - determines the method of extracting the value of certain field
from source into data set. First parameter is the value of source data AUTOMATICALLY,
so in PARAMETER 1, ..4 you include only ADDITIONAL parameters, if need.
The list of functions and parameters available see below. But any CLIPPER 5.1
function call or even expression is available (but DON'T FOGET about skipping
the 1st parameter!) Use F7 functional key for assistance in filling that
things. Empty FUNCTION means that field will be filled by the appropriate
source value with no any modification (unless type conversion, if need).
2.4. "Define data settings" - determining the "look-up"
tables, which will be used while data extraction. If you are pointed for any
look-up table, which not exist in project directory but found in SPEC\
directory - it will be copied into project directory while loading project.
Every project uses its own copy of look-up tables, so you can edit them independently.
- 'Character replacements Table :' - that table is single one in the
text file, not DBASE. Characters from 2nd string will substitute appropriate
chars from 1st string in any field processed by functions TRAN_STR, TRAN_NAME,
TRAN_ADDR.
- 'NAME replacements Table :' - see file NAMEREPL.DBF for example.
- 'PATRONYMIC replacements Table :' - see PATRREPL.DBF Items from
"TRUE_NAME" field will substitute appropriate words from "NAME"
in any field processed by functions TRAN_STR, TRAN_NAME, TRAN_ADDR.
- 'PARTS OF WORD replacements Table :' - see PARTREPL.DBF
Works as previous tables, but in the last turn - mostly for surnames and rare
names which were not found in look-up tables. Tranclates typical suffixes, endings,
parts of words if you can determine correspondig ones in two languages. But
be careful with it! Sometimes I find fanny "translations" caused by this table.
Put "exceptions" higher than shorter items to process them with priority.
- 'PARTS OF ADDRESS replacements Table :' table to REMOVE typical
non- informative words from address string by TRAN_ADDR function.
- 'TYPE of external data source (TXT or DBF):' - you can change default
selection, but be careful! Wherefore you do it!?
2.5. "Extract data from source file" - when all of previous strange things are filled in, run it. Of course, troubles and program breaks are expected, but.. Try! What if you are lucky!? :) In any case, you can restart Link_it, change something and run it again. Look carefully into extracted data and statistics to make sure that all is ok. As I said, further improvement of interface and error tracking is expected if someone will use it. And I am ready to help with advice now.
3. "Find suspicious links"
3.1. "Comparator functions settings" - here you define pairs of fields
with the same meaning, and functions to compare these values during linkage.
The summa of all the results of comparator functions is the "total weight",
which determines the similarity of two records. Comparator functions have several
parameters, setting the reaction on different types of differences.
Usually:
First parameter - what returned if one of argument fields is EMPTY. (Sometimes
it means that any value from second database is proper!)
Second parameter - % reaction if two comparing values are completely
equal (usually 100).
Third and Fourth parameters, if exist, could define some special situations:
for example, processing of Initial letters instead of Names or Patronimics;
Partial coincidence of Dates, etc. See the description of functions.
In addition, each comparator has two "Weight coefficients":
Weight - the proportional investment (x100) of certain comparator into
summary weight. Default is 100, but you can change it to avoid changing of separate
parameters in comparator call.
Decreasing Weight - value for decreasing summary weight if comparator
retuns zero. Default decreasing weight is 0, but sometimes it might be useful
to exclude definitely different records.
For easier setting comparators and parameters, use F7 functional key
and see the assistance line in the bottom of frame.
3.2. "Define blocking packages" - to decrease number of
compared records. You can work with no any blocking - but... For M records in
the 1st and N records in the 2nd data set you will wait while M*N records to
be compared... You can use separate field (might be the same in both data
sets), for example NYSIIS, or any expression (also might be executed
on both). For example,
SOUNDEX(surname)
or
LEFT(SURNAME,3)+LEFT(NAME,1)+LEFT(ICD,3)
Optimal is to define TWO completely different expressions or fields as two blocking
packages. Even if link is missed from one of blocks, it should be keeped by
another.
3.3. "Define treshold values" - if summary weight of all comparators exceeds "Lower threshold" value, two records are decided as suspicious to be linked, and these identifiers are added into LINKS database for further review or processing. Higher threshold is not so essential now, and it is available only for filter under clerical review.
3.4. "Begin/continue searching suspicious" - running of the process. Link_it moves in the 1st data set, and for every record looks for appropriate records in the second data set (using blocking expressions). By the way, that is why it would be better to use shorter data set as 1st. You can stop that by F8 functional key, and continue later. I recomend you to stop after several links found and review them. Maybe, you'll have ideas about thresholds, comparator weights or your data sets at all. You can change any settings and restart. If it works too slow - look into your blocking expressions and into linkage statistics.
4. "Work with list of pairs"
That mode is also known as "Clerical review". But, it is also
useful for ajusting linkage settings. If you see "strange" pairs in the list
- use F3 key to look what values your comparator functions return. Maybe,
it is a chance to change parameters and run linkage again (by the way, in that
case you can say to add new links (and change existing weights), instead of
overwriting previous found and reviewed matches!).
Ideology of review is based on special markings. You can mark certain pair (or group of pairs under filter) as: "L" - recognized as link; "U" - unlinked; " " (<space>) - not reviewed yet; or any other character if you'll remember what it means :) You can use that markers for filtering and in queries - for data processing and extracting. And vice versa, you can set filter on last query result and review or mark it. The combination of these tools allows us to find a lot of nice things easy! Enjoy it :)
See functional keys on the bottom. To see pair of records, use <Enter>.
You can customize screen form for comfortable review (default
is the simplest list of concurrent fields from both data sets, which is generated
automatically). The name of screen form file is available in linkage settings.
See the file SAMPLE\review.frm for example ("~~~~" means places
where queued field or expression value would be inserted).
5. "Processing linkage results"
This item includes two essential parts. First is query system,
which could calculate the amount of records (or pairs of records) satisfying
any query expression. And the second part is export processor,
which generates text or dBASE file filled by information. That information is
available from both of data sets related by the identifier trougth the set of
"links" into matched pairs.
5.1. "MATCHES preparing" - that option makes two things:
1) removing all the pairs recognized as "Unlinked" while clerical review
- to prevent their participation in linkage results processing. But remember
that you could not revise your decision after that!
2) Matched pairs related to the same person from the 1st data set could be marked
as "@" character. It means that more than 1 link was found in the 2nd
data set for the same person, ant it could cause problems while data transfer.
Please, don't forget to review all that pairs, and, maybe, leave one more actual
match.
5.2. Query system.
Query system allows sequential computation of query expression. Query expression
must return LOGICAL value, and .T. means that current record or match
is satisfying the query, and its identifier will be added to the List of query
results.
"Current set" option changes the host database. Default is the "Set
of matches" (file LINKS.DBF). In this case, Link_It traverses all
the records from LINKS and positiones on related records from 1st and
2nd data sets. So, any field from LINKS and both of data sets could be
obtained by the query. To point on any field from 1st (2nd) data set, use alias
DATA1-> or DATA2-> before field name. Fields from LINKS
data set could be available without alias. Also, any Clipper (FoxPro?) function
call or operation is available. For example, the query like:
left(DATA1->ICD10,3)#left(DATA2->ICD10,3)
means that we'll find all the matched pairs where matched people have different
ICD10 code even on 3-digit level.
Or, query
(MARK='#').and.(DATA1->AGE<=21)
finds all the young people marked by "#" character while clerical review (whatever
it meant).
And, of course, the simplest query:
.T.
makes ALL the items as a query result, because it satisfies anywhere :)
Most often function calls and operations are below:
.AND. .OR. .NOT. - logical
operators;
= # < > <= >= - relation
operators;
+ - / * - ariphmetical
operators;
LEFT(String,NumberOfPos)
,
RIGHT(String,NumberOfPos)
,
SUBSTR(String,Start,NumberOfPos)
- return substrings of string
YEAR(Date) ,
MONTH(Date) ,
DAY(Date) - return Numeric values from Date
EMPTY(VariableOfAnyType)
- .T. if value is empty (or 0), of course
STR(VariableOfAnyType[,StringLength])
- converts to String
VAL(String) - converts
to Numeric; 0 if impossible
AT(SubString,String) -
returns position; 0 if not found
I am ready to provide you more detailed information about sintax and functions of Clipper computer language (very similar to FoxPro) - if you really want to use Link_it! Or, see the query library from the SAMPLE project. You can experiment with queries easily, because wrong queries could not crash anything or even break Link_It (it is a rare place in Link_It where I provided error-tracking today ;)
"Current set" option pointed on 1st or 2nd data set could be useful
for finding that records, which have NO matched record found in another data
set. While Link_It is processing the data set different of "Set of Matches",
the LOGICAL wariable LINKED is available. So, the query:
.NOT.LINKED
with 1st data set as "current" finds all the people who was not found in the
2nd data set. Maybe, it means that you have to prepare the list of that persons
and send to the collegues who afforded you the 2nd data set...
The query result is presented as a list of identifiers (file ILIST.IDL for the pairs, and *.ID1 or *.ID2 for 1st or 2nd data set as a current). "ILIST.*" is a default name. But you can save the list into another file and use it in future - including further search traversing only records from that file. The last result of a query is available as a filter for clerical review, and for data export. And, of course, you'll see on screen the amount of records satisfying your finished query.
5.3. "Data export"
Main idea of LInk_It is don't stop after finding matches, but prepare something
useful for export to the source system. Query system helps us to find the group
of records (pairs of records) satisfying the certain condition. The same approach
has been used for the file of results. Link_It traverses on the last query result
and allows access to the fields from both of data sets at the same time (don't
foget about aliases DATA1-> and DATA2->). Thus, you can produse
export files combining, for example, identifiers from 1st data set and actual
information from the 2nd (that is why you include identifiers from the source
systems - now you can update information automatically!). On that principle
works, for example, exchange of follow-up data between two systems. We find
those people who has later follow-up in the 2nd data set, by the query:
DATA1->FOLLOWUP_D < DATA2->FOLLOWUP_D
and extract identifier fields from 1st data set and date and status of follow-up
from the 2nd. And vise versa, of course. It is really easy to make tool for
updating your database by such an information, which includes original identifier
of your source system!
For setting up the export format, add new record into the list of export. Link_It
allows two types of export: dBASE files, and text files. DBASE needs the file
with structure description (see file SAMPLE\for_priz.str, for example).
Text file also needs the form description file - similar to your form for clerical
review. You can prepare many text and dBASE export forms for different needs,
and copy them from one project to another. If you not assigned the file name,
Link_It will ask you. See the SAMPLE\ project for details.
Description forms can include any fields (with ALIAS) or function calls - as
well as in query system. It is enough for us to provide any data exchange that
we need. But, of course, the same job could be done by SQL queries. If you prefer
SQL - no problem. SOURCE1.DBF, LINKS.DBF and SOURCE2.DBF files
make easy relational structure, and such an expirienced researcher could open
it in MS ACCESS or in ORACLE and shake out all he wants :)
6. Files.
As I said, you can copy files from one project to another, if use similar data.
It really saves a lot of time. But what files?
LINKAGE.CFG - dBASE file, includes project and source data sets description.
You can copy it from earlier project if it is similar, but don't foget to change
the name of project, source data files, etc.
SOURCE1.STR, SOURCE2.STR - these files could be automatically created.
They contain the structure of the source data sets. But, you'll probably change
and justify something (field types, names and length, especially for TXT sources;
add new computable fields; fill translating or extracting function calls; etc.)
So, it would be useful to save these *.STR files to the "template" directory,
and copy that files into project every time when you'll use data set with the
similar structure (probably, you'll prepare typical query in your system for
providing data for linkage, and its structure will be the same in a lot of linkage
projects).
!!! Don't answer "YES" on a question after selecting data set - about
reloading the structure, if you have copied ajusted files *.STR.
SOURCE1.DBF, SOURCE2.DBF - "cleaned" (translated, "synchronized", etc.)
files after "extracting source data" step. That files are used by Link_It
for linkage, queries and data export. If you have THE SAME source file available
to be linked with different sources - you can copy that file (for example, SOURCE1.DBF)
and don't spend time for extracting and justifying it in each project. Really
you can skip the step "Define data sets", if you have already existing
files both SOURCE1.DBF and SOURCE1.DBF.
L_FIELDS.CFG - the list of comparing items. If two projects have the
same linkage fields and the same functions to compare them - you can copy that
file from one project to anouther. Be careful to prevent using missed fields!
L_FIELDS.CFG is dBASE file.
P_FIELDS.CFG - the list of pocket (blocking) expressions. You can copy
it if blocking principles and variables are the same. DBASE file.
QUERY.STR - query library. DBASE file.
Other *.STR files probably contain structures of dBASE files for export.
*.FRM files contain forms of TEXT files for export, or screen forms for
clerical review.
Most of files mentioned above could be useful in different projects if data sets of those projects have THE SAME STRUCTURE. If field names or types of source data have been changed, it cause breaking of Link_It with error message. It is not mortally, of course. You can change something and run Link_It again.
Look-up tables and character set converion files have been described above. Of corce, you can copy them. Good idea is to place most useful into SPEC directory - Link_It can copy tables from here if they were described in LINKAGE.CFG but not exist in the project directory.
*.ID? files contain lists of query result identifiers. Mostly, no need
to save or copy them, bacause every new data set will have new unique record
identifiers.
*.LOG files contain information about process of data extraction or finding
matches. Look here sometimes.
*.NTX files are index files, they are created automatically. No need
to save or copy it.
That is all for today. Ask me for more information!
With regards,
Yevgeniy Gorokh, Ukrainian Cancer Registry, Kyiv
goroh@ucr.kiev.ua