-= LINK_IT =-
record matching software


Download Link_It

Current version: ver.1.03


(c) Author: Yevgeniy Gorokh, UCR
goroh@ucr.kiev.ua

Ukrainian National Cancer Registry, Kyiv, Ukraine
http://www.i.com.ua/~ucr

 

1. What is it.
LINK_IT is an utility for record linkage. It was developed in the Ukrainian National Cancer Registry, Kyiv, as a result of a long-time experience. LINK_IT attempts to make linkage process more easy and helpful in practical usage. The main task of this tool is searching and processing records which describe the same person in different databases. "Processing" means, that we don't stop after saying: "Here and there are two records related to the same person... Great!". Usually we have to provide exchange of any kind of information between two data sets - for example, new cancer cases from hospital-based to population-based cancer registry, and fresh follow-up data for persons existing in both registries - in the other way.

Of course, LINK_IT could not transfer something by itself. It only carries out the searching of linked (as well as unlinked!) records in two data sets, and, may be, prepares the "package" of useful data for export, which would be processed in the external system: cancer registry software, for example.

LINK_IT is provided on "AS IS" basis, without warranty of any kind. It helps us to solve our problems, and (may be) it could be useful also for you. Software development and its distribution for other people is not my main task; so, any kind of trouble or disappointment caused by using LINK_IT, is only your problem. You can try and use LINK_IT freely. There is no kind of copyright protection or limitation for "unregistered users" included into this software - I have no time and wish for doing it. However, if you decided that LINK_IT really helps in your work and makes profit to you - it would be great to REGISTER it and support our registry and me for further development of this software. I will be glad to spend my time for improving LINK_IT and making it more useful for you. Especially if your organization or scientific project is able to support such an improvement ;)

2. Main features and advantages (?) of LINK_IT.

3. Known problems and disadvantages.

4. Contacts.

You can freely contact me for any question related to record linkage and LINK_IT software via e-mail: goroh@ucr.kiev.ua Any comment or idea will be kindly appreciated!

Some information about application of record linkage in the practice of Ukrainian Cancer Registry is available on our web site: http://www.i.com.ua/~ucr

If you need record linkage software but LINK_IT is not so good for you - contact with prof. Rainer Schnell, Germany, or with Australian developers of FEBRL software - Tim Churches and Peter Christen. FEBRL homepage is: http://datamining.anu.edu.au/projects/linkage.html

5. Special remarks.

Short details about working with lINK_IT (manual?...) see below.


Good luck! ;)


-= Working with LINK_IT =-


Of course, it is not complete manual. This is minimal information you need to know about installation and usage of LINK_IT. If it is not enough - ask me. I prefer to answer your questions, rather than writing large amounts of text and noone will use it :)


0. Installation.

Link-it is provided as self-extracted archive. Copy LINK_IT package into empty directory and run it. After extracting, following files and directories will be created:
link_it.exe - main executable module. Run it!
SPEC\ - directory with service files
SAMPLE\ - SAMPLE project
SPEC\ directory have to be acccessible as a subdirectory from that place where you run Link_it.exe. Now it includes following files:
TRANSLAT.CHR - list of chars to be replaced until data extraction
UKR_RUS.CHR - another char replacements (Ukrainian to Russian, DOS)
PARTREPL.DBF - table for parts of worlds to be changed
NAMEREPL.DBF - "look-up" table of names to be replaced
PATRREPL.DBF - "look-up" table of patronymic names to be replaced
ADDRREPL.DBF - table of non-identifying words in address strings
FIELDS.STR - list of default comparator functions for some fields
CFG.STR - pattern for creating configuration file
LINK.STR - pattern for creating list of comparators
PACK.STR - pattern for creating list of blocking expressions
STR_EXP.STR - pattern for creating library of export formats
STR_EXPF.STR - pattern for creating files with structure for DBF export
STR_QUE.STR - pattern for creating library of queries
S_DESCR.STR - pattern for creating source structure files
BL1. PLL - palette file (non-editable)
*.DBF and *.STR files are really Dbase files and you can copy and edit them if you want. Beware of deleting something from patterns!
You can edit *.CHR files by simple text editor like Notepad (but NOT by MS World or any other who adds own special information into files!).
Every new or existing project is placed into separate subdirectory. It means that all automatically created by Link_It files will be created there. To access the project, select its directory using 'Open project' and F10 key. Or run Link_It.exe with the path in command-line parameter:
>link_it.exe C:\LINK_IT\MY_PROJECTS\SAMPLE
or, for example,
>link_it MY_PROJECTS\SAMPLE

1. "Project options"

You determine the name of project (it will be written on the top frame), mode (searching duplicate records, or linking two different data sets), date format for export, end template file for clerical review of matches. Only MODE is obligatory.
You are free to copy certain files from one project to another, etc. - if you are sure what that files are used for. We made "template directories" with ajusted structures for data conversion, blocking expressions, comparators, queries for processing and extracting results of linkage, etc. for our most often linkage activities, and it saves a lot of time and allows us to implement typical regular jobs quickly.

2. "Define data sets"

First, I suppose that you have findings obtained from two different places (particular case is to extract one of them from your own data). It is your "source data". For providing linkage, you have to convert both of them into similar form and charset ("synchronize"). It would be made by "extracting" procedure, but after determining the structure of databases, and functions for extracting information from source to target for every field. "Synchronized" data have to be stored in your project directory, in files SOURCE1.DBF, SOURCE2.DBF. If you copy that files from earlier created projects, or prepared them by alternative tools outside Link_it - you can skip "Define data sets" stage.

2.1. "Data set No :" - allows svitching between 1st | 2nd data sets. For "Searching duplicates" mode, you could not select 2nd data set, and it will be automatically copied from the 1st.

2.2. "Select source file" - selecting the file which will be processed for data extraction. By the way, the structure of Link_it source file is created into file SOURCE1.STR or SOURCE2.STR. If structure file already exists in project directory (for example, copied from "template directory" or from previous project), you can use that (ajusted previously?) file. Answer "No" on question about reloading, in that case.
The extension of source file determines the type of data extraction: *.TXT files are processed as formatted text files by the foolowing rules:
1) Every string consists of fields for one person, and every field in every stryng starts at the same position.
2) 1st string of the file have to include field names, and starting positions of fields are fixed by "|" character (#124).
3) Every string ends by the DOS string separator (Chr#13+Chr#10)
As the example, see file SAMPLES\deputees.txt
The *.DBF files are processed as... as DBASE files, simply :)

2.3. "Define field attributes" - editing the structure of future data set, and determining functions for extracting information from the source. Bacically, you hafe automatically recognized structure of source data. But you are free to add fields into structure, change their names, types etc. May be useful, for example, to add field "BIRTHYEAR" in addition to existing "BIRTHDATE", fill it by function YEAR and use for pocket expression; add SOUNDEX field for the same task; or, include both source and translated Names\Surnames\Address etc., for further review or export processing. See bottom screen lines for hot keys.
Fields available in file SOURCEx.STR (source field attributes):
#N - used for sorting
FIELD_NAME - 10 UPPERCASE chars, Letter first, no spaces and special chars available exept "_"
TYPE
- "N","C","D" or "L" (dBASE field types).
LENGTH - field length (<=255)
DECIMALS - used for "N" type only
STATUS - "KEY" means that field (combination of fields) is unique identifier in the data set. No need to insert it manually. If no KEY fields were defined - it would be created automatically after finishing. "LINK" means that field will be used for matching. Link_It clears "LINK" fields in multiple records related to the same person in one data set, so only one set of LINKing fields is processed for one person irrespective of number of records described that person. "INFO" - any other field, and you are free to create and process INFO fields for any need (queries, filters, data exchange). "SERVICE" is status for auto-made field named "PRIMARY", which marks "secondary" records related to one person as .F.
SOURCE FIELD - name of the field in the source database; value of that field is used as an argument for an extraction function. Obligatory for DBF type of source.
START AT - position of the TXT source file, which marks the beginning of....
# OF CHARs the appropriate peace of data, and length of that peace. Used for TXT type of source only, instead of "SOURCE FIELD"
FUNCTION - determines the method of extracting the value of certain field from source into data set. First parameter is the value of source data AUTOMATICALLY, so in PARAMETER 1, ..4 you include only ADDITIONAL parameters, if need. The list of functions and parameters available see below. But any CLIPPER 5.1 function call or even expression is available (but DON'T FOGET about skipping the 1st parameter!) Use F7 functional key for assistance in filling that things. Empty FUNCTION means that field will be filled by the appropriate source value with no any modification (unless type conversion, if need).

2.4. "Define data settings" - determining the "look-up" tables, which will be used while data extraction. If you are pointed for any look-up table, which not exist in project directory but found in SPEC\ directory - it will be copied into project directory while loading project. Every project uses its own copy of look-up tables, so you can edit them independently.
- 'Character replacements Table :' - that table is single one in the text file, not DBASE. Characters from 2nd string will substitute appropriate chars from 1st string in any field processed by functions TRAN_STR, TRAN_NAME, TRAN_ADDR.
- 'NAME replacements Table :' - see file NAMEREPL.DBF for example.
- 'PATRONYMIC replacements Table :' - see PATRREPL.DBF Items from "TRUE_NAME" field will substitute appropriate words from "NAME" in any field processed by functions TRAN_STR, TRAN_NAME, TRAN_ADDR.
- 'PARTS OF WORD replacements Table :' - see PARTREPL.DBF
Works as previous tables, but in the last turn - mostly for surnames and rare names which were not found in look-up tables. Tranclates typical suffixes, endings, parts of words if you can determine correspondig ones in two languages. But be careful with it! Sometimes I find fanny "translations" caused by this table. Put "exceptions" higher than shorter items to process them with priority.
- 'PARTS OF ADDRESS replacements Table :' table to REMOVE typical non- informative words from address string by TRAN_ADDR function.
- 'TYPE of external data source (TXT or DBF):' - you can change default selection, but be careful! Wherefore you do it!?

2.5. "Extract data from source file" - when all of previous strange things are filled in, run it. Of course, troubles and program breaks are expected, but.. Try! What if you are lucky!? :) In any case, you can restart Link_it, change something and run it again. Look carefully into extracted data and statistics to make sure that all is ok. As I said, further improvement of interface and error tracking is expected if someone will use it. And I am ready to help with advice now.

 

3. "Find suspicious links"

3.1. "Comparator functions settings" - here you define pairs of fields with the same meaning, and functions to compare these values during linkage. The summa of all the results of comparator functions is the "total weight", which determines the similarity of two records. Comparator functions have several parameters, setting the reaction on different types of differences.
Usually:
First parameter - what returned if one of argument fields is EMPTY. (Sometimes it means that any value from second database is proper!)
Second parameter - % reaction if two comparing values are completely equal (usually 100).
Third and Fourth parameters, if exist, could define some special situations: for example, processing of Initial letters instead of Names or Patronimics; Partial coincidence of Dates, etc. See the description of functions.
In addition, each comparator has two "Weight coefficients":
Weight - the proportional investment (x100) of certain comparator into summary weight. Default is 100, but you can change it to avoid changing of separate parameters in comparator call.
Decreasing Weight - value for decreasing summary weight if comparator retuns zero. Default decreasing weight is 0, but sometimes it might be useful to exclude definitely different records.
For easier setting comparators and parameters, use F7 functional key and see the assistance line in the bottom of frame.

3.2. "Define blocking packages" - to decrease number of compared records. You can work with no any blocking - but... For M records in the 1st and N records in the 2nd data set you will wait while M*N records to be compared... You can use separate field (might be the same in both data sets), for example NYSIIS, or any expression (also might be executed on both). For example,
SOUNDEX(surname)
or
LEFT(SURNAME,3)+LEFT(NAME,1)+LEFT(ICD,3)
Optimal is to define TWO completely different expressions or fields as two blocking packages. Even if link is missed from one of blocks, it should be keeped by another.

3.3. "Define treshold values" - if summary weight of all comparators exceeds "Lower threshold" value, two records are decided as suspicious to be linked, and these identifiers are added into LINKS database for further review or processing. Higher threshold is not so essential now, and it is available only for filter under clerical review.

3.4. "Begin/continue searching suspicious" - running of the process. Link_it moves in the 1st data set, and for every record looks for appropriate records in the second data set (using blocking expressions). By the way, that is why it would be better to use shorter data set as 1st. You can stop that by F8 functional key, and continue later. I recomend you to stop after several links found and review them. Maybe, you'll have ideas about thresholds, comparator weights or your data sets at all. You can change any settings and restart. If it works too slow - look into your blocking expressions and into linkage statistics.

4. "Work with list of pairs"

That mode is also known as "Clerical review". But, it is also useful for ajusting linkage settings. If you see "strange" pairs in the list - use F3 key to look what values your comparator functions return. Maybe, it is a chance to change parameters and run linkage again (by the way, in that case you can say to add new links (and change existing weights), instead of overwriting previous found and reviewed matches!).

Ideology of review is based on special markings. You can mark certain pair (or group of pairs under filter) as: "L" - recognized as link; "U" - unlinked; " " (<space>) - not reviewed yet; or any other character if you'll remember what it means :) You can use that markers for filtering and in queries - for data processing and extracting. And vice versa, you can set filter on last query result and review or mark it. The combination of these tools allows us to find a lot of nice things easy! Enjoy it :)


See functional keys on the bottom. To see pair of records, use <Enter>. You can customize screen form for comfortable review (default is the simplest list of concurrent fields from both data sets, which is generated automatically). The name of screen form file is available in linkage settings. See the file SAMPLE\review.frm for example ("~~~~" means places where queued field or expression value would be inserted).

5. "Processing linkage results"

This item includes two essential parts. First is query system, which could calculate the amount of records (or pairs of records) satisfying any query expression. And the second part is export processor, which generates text or dBASE file filled by information. That information is available from both of data sets related by the identifier trougth the set of "links" into matched pairs.

5.1. "MATCHES preparing" - that option makes two things:
1) removing all the pairs recognized as "Unlinked" while clerical review - to prevent their participation in linkage results processing. But remember that you could not revise your decision after that!
2) Matched pairs related to the same person from the 1st data set could be marked as "@" character. It means that more than 1 link was found in the 2nd data set for the same person, ant it could cause problems while data transfer. Please, don't forget to review all that pairs, and, maybe, leave one more actual match.

5.2. Query system.
Query system allows sequential computation of query expression. Query expression must return LOGICAL value, and .T. means that current record or match is satisfying the query, and its identifier will be added to the List of query results.
"Current set" option changes the host database. Default is the "Set of matches" (file LINKS.DBF). In this case, Link_It traverses all the records from LINKS and positiones on related records from 1st and 2nd data sets. So, any field from LINKS and both of data sets could be obtained by the query. To point on any field from 1st (2nd) data set, use alias DATA1-> or DATA2-> before field name. Fields from LINKS data set could be available without alias. Also, any Clipper (FoxPro?) function call or operation is available. For example, the query like:
left(DATA1->ICD10,3)#left(DATA2->ICD10,3)
means that we'll find all the matched pairs where matched people have different ICD10 code even on 3-digit level.
Or, query
(MARK='#').and.(DATA1->AGE<=21)
finds all the young people marked by "#" character while clerical review (whatever it meant).
And, of course, the simplest query:
.T.
makes ALL the items as a query result, because it satisfies anywhere :)
Most often function calls and operations are below:
.AND. .OR. .NOT. - logical operators;
= # < > <= >= - relation operators;
+ - / * - ariphmetical operators;
LEFT(String,NumberOfPos) ,
RIGHT(String,NumberOfPos) ,
SUBSTR(String,Start,NumberOfPos) - return substrings of string
YEAR(Date) ,
MONTH(Date) ,
DAY(Date)
- return Numeric values from Date
EMPTY(VariableOfAnyType) - .T. if value is empty (or 0), of course
STR(VariableOfAnyType[,StringLength]) - converts to String
VAL(String) - converts to Numeric; 0 if impossible
AT(SubString,String) - returns position; 0 if not found

I am ready to provide you more detailed information about sintax and functions of Clipper computer language (very similar to FoxPro) - if you really want to use Link_it! Or, see the query library from the SAMPLE project. You can experiment with queries easily, because wrong queries could not crash anything or even break Link_It (it is a rare place in Link_It where I provided error-tracking today ;)

"Current set" option pointed on 1st or 2nd data set could be useful for finding that records, which have NO matched record found in another data set. While Link_It is processing the data set different of "Set of Matches", the LOGICAL wariable LINKED is available. So, the query:
.NOT.LINKED
with 1st data set as "current" finds all the people who was not found in the 2nd data set. Maybe, it means that you have to prepare the list of that persons and send to the collegues who afforded you the 2nd data set...

The query result is presented as a list of identifiers (file ILIST.IDL for the pairs, and *.ID1 or *.ID2 for 1st or 2nd data set as a current). "ILIST.*" is a default name. But you can save the list into another file and use it in future - including further search traversing only records from that file. The last result of a query is available as a filter for clerical review, and for data export. And, of course, you'll see on screen the amount of records satisfying your finished query.

5.3. "Data export"
Main idea of LInk_It is don't stop after finding matches, but prepare something useful for export to the source system. Query system helps us to find the group of records (pairs of records) satisfying the certain condition. The same approach has been used for the file of results. Link_It traverses on the last query result and allows access to the fields from both of data sets at the same time (don't foget about aliases DATA1-> and DATA2->). Thus, you can produse export files combining, for example, identifiers from 1st data set and actual information from the 2nd (that is why you include identifiers from the source systems - now you can update information automatically!). On that principle works, for example, exchange of follow-up data between two systems. We find those people who has later follow-up in the 2nd data set, by the query:
DATA1->FOLLOWUP_D < DATA2->FOLLOWUP_D
and extract identifier fields from 1st data set and date and status of follow-up from the 2nd. And vise versa, of course. It is really easy to make tool for updating your database by such an information, which includes original identifier of your source system!
For setting up the export format, add new record into the list of export. Link_It allows two types of export: dBASE files, and text files. DBASE needs the file with structure description (see file SAMPLE\for_priz.str, for example). Text file also needs the form description file - similar to your form for clerical review. You can prepare many text and dBASE export forms for different needs, and copy them from one project to another. If you not assigned the file name, Link_It will ask you. See the SAMPLE\ project for details.
Description forms can include any fields (with ALIAS) or function calls - as well as in query system. It is enough for us to provide any data exchange that we need. But, of course, the same job could be done by SQL queries. If you prefer SQL - no problem. SOURCE1.DBF, LINKS.DBF and SOURCE2.DBF files make easy relational structure, and such an expirienced researcher could open it in MS ACCESS or in ORACLE and shake out all he wants :)

6. Files.

As I said, you can copy files from one project to another, if use similar data. It really saves a lot of time. But what files?

LINKAGE.CFG - dBASE file, includes project and source data sets description. You can copy it from earlier project if it is similar, but don't foget to change the name of project, source data files, etc.
SOURCE1.STR, SOURCE2.STR - these files could be automatically created. They contain the structure of the source data sets. But, you'll probably change and justify something (field types, names and length, especially for TXT sources; add new computable fields; fill translating or extracting function calls; etc.) So, it would be useful to save these *.STR files to the "template" directory, and copy that files into project every time when you'll use data set with the similar structure (probably, you'll prepare typical query in your system for providing data for linkage, and its structure will be the same in a lot of linkage projects).
!!! Don't answer "YES" on a question after selecting data set - about reloading the structure, if you have copied ajusted files *.STR.
SOURCE1.DBF, SOURCE2.DBF - "cleaned" (translated, "synchronized", etc.) files after "extracting source data" step. That files are used by Link_It for linkage, queries and data export. If you have THE SAME source file available to be linked with different sources - you can copy that file (for example, SOURCE1.DBF) and don't spend time for extracting and justifying it in each project. Really you can skip the step "Define data sets", if you have already existing files both SOURCE1.DBF and SOURCE1.DBF.
L_FIELDS.CFG - the list of comparing items. If two projects have the same linkage fields and the same functions to compare them - you can copy that file from one project to anouther. Be careful to prevent using missed fields! L_FIELDS.CFG is dBASE file.
P_FIELDS.CFG - the list of pocket (blocking) expressions. You can copy it if blocking principles and variables are the same. DBASE file.
QUERY.STR - query library. DBASE file.
Other *.STR files probably contain structures of dBASE files for export. *.FRM files contain forms of TEXT files for export, or screen forms for clerical review.

Most of files mentioned above could be useful in different projects if data sets of those projects have THE SAME STRUCTURE. If field names or types of source data have been changed, it cause breaking of Link_It with error message. It is not mortally, of course. You can change something and run Link_It again.

Look-up tables and character set converion files have been described above. Of corce, you can copy them. Good idea is to place most useful into SPEC directory - Link_It can copy tables from here if they were described in LINKAGE.CFG but not exist in the project directory.

*.ID? files contain lists of query result identifiers. Mostly, no need to save or copy them, bacause every new data set will have new unique record identifiers.
*.LOG files contain information about process of data extraction or finding matches. Look here sometimes.
*.NTX files are index files, they are created automatically. No need to save or copy it.


That is all for today. Ask me for more information!


With regards,
Yevgeniy Gorokh, Ukrainian Cancer Registry, Kyiv
goroh@ucr.kiev.ua