Mark Sharp
Independent Study in LIS 16:194:698, Spring 2002
Weekly Report #2 --- 2/15/02
PC-KIMMO TERM CONFLATION EXPT. #1
I installed PC-KIMMO version 2.1.8 (11 May 2000, supplied on a floppy by SIL) and ENGLEX (28 Nov 1995, downloaded as ENGLEX20B5.ZIP from ftp://ftp.sil.org/software/unix/) on my C: drive. The ENGLEX rules and lexicon files are easily loaded with these two commands:
PC-KIMMO> load rules c:\work\englex\english.rul
PC-KIMMO> load lexicon c:\work\englex\english.lex
The lex file has INCLUDE statements for all the separate files of nouns, verbs, affixes, etc. Then PC-KIMMO is ready to use in batch or interactive mode, as clearly documented in Antworth's book.
Next I copied the following abstract from a Medline (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed) search on "PPAR gamma" (a receptor of pharmaceutical interest).
==================================================================================
==================================================================================
This abstract was pasted into a blank Microsoft Word document, where punctuation was removed, case converted to lower, and hard returns inserted between words, resulting in a file of words, one per line, in the same order as in the text. This file was saved in txt format to a file named ppar1.txt and run as a batch through the "recognize" function of PC-KIMMO with the output logged to a file named pck1.txt, as follows.
PC-KIMMO> log c:\work\pck1.txt
PC-KIMMO> file recognize c:\work\ppar1.txt
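The Word preprocessing described above (strip punctuation, lower the case, one word per line) could also be scripted. The following is only a minimal sketch in Python, not the procedure actually used; the function name tokenize is hypothetical, and it assumes that every run of non-alphanumeric characters acts as a word break, which matches how strings like "sato@rinshoken.or.jp" appear later in this report.

```python
import re

def tokenize(text):
    """Lower-case the text and split it on any run of characters
    that are not ASCII letters or digits."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def one_word_per_line(text):
    """Return the words of the text, one per line, in their original order."""
    return "\n".join(tokenize(text))

if __name__ == "__main__":
    print(one_word_per_line("Biol Pharm Bull 2002 Jan;25(1):81-6"))
```

The resulting string can be saved as the batch input file for "file recognize".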
The section of pck1.txt representing the article's title is shown here:
================================================================================
dehydrotrametenolic
*** NONE ***
acid
`acid `acid
induces
in`duce+s in`duce+PL
in`duce+s in`duce+3SG
preadipocyte
*** NONE ***
differentiation
`differ+ent+y+ation `differ+AJR27a+AJR14+NR23a
differ`entiate+ion `different+NR23e
and
and and
sensitizes
`sense+ite+ize+s `sense+NR32+VR6a+PL
`sense+ite+ize+s `sense+NR32+VR6a+3SG
`sense+ite+ize+s `sense+NR32+VR6a+PL
`sense+ite+ize+s `sense+NR32+VR6a+3SG
animal
`animal `animal
models
`model+s `model+PL
`model+s `model+3SG
of
of of
noninsulin
non+`insulin NEG3+`insulin
dependent
de+`pend+ent REV2+`pend+AJR27a
de`pendent de`pend
de`pend+ent de`pend+AJR27a
diabetes
dia`betes dia`betes
mellitus
*** NONE ***
to
to to
to INF
insulin
`insulin `insulin
===============================================================================
A couple of observations can be made already. First, PC-KIMMO does not do stemming on roots it does not recognize (i.e., roots that are not in the lexicon files), even if the affix is very common; e.g., dehydrotrametenolic and preadipocyte. This means we would either have to enrich the lexicon files with a lot of scientific terminology or change the affix recognition algorithm to be independent of the root. In my autoencoder, stemming is independent of the root, but it must be controlled for unwanted anomalies by excepting certain cases (e.g., basement). The idea of exceptions might not fit easily into the logic of PC-KIMMO.
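To illustrate what root-independent stemming with exceptions means, here is a minimal sketch in Python. It is not the autoencoder stemmer itself: the suffix list, the exception list, and the minimum stem length are all hypothetical values chosen only for this example.

```python
# Hypothetical suffix and exception lists, for illustration only.
SUFFIXES = ("ation", "ment", "ic")
EXCEPTIONS = {"basement"}   # words that must never be stemmed
MIN_STEM = 3                # do not leave a stem shorter than this

def strip_suffix(word):
    """Strip the longest matching suffix without consulting any lexicon,
    unless the word is listed as an exception."""
    if word in EXCEPTIONS:
        return word
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word
```

Because no lexicon lookup is involved, an unknown root such as dehydrotrametenolic is still stemmed, while basement is protected by the exception list.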
Second, recognition of a compound surface form produces two lexical form outputs, one with the affixes themselves and one with the implied part-of-speech "gloss", e.g.,
surface form: differentiation
lexical form (affixes)    lexical form (gloss)
`differ+ent+y+ation       `differ+AJR27a+AJR14+NR23a
differ`entiate+ion        `different+NR23e

surface form: sensitizes
lexical form (affixes)    lexical form (gloss)
`sense+ite+ize+s          `sense+NR32+VR6a+PL
`sense+ite+ize+s          `sense+NR32+VR6a+3SG
`sense+ite+ize+s          `sense+NR32+VR6a+PL
`sense+ite+ize+s          `sense+NR32+VR6a+3SG
Since we are interested mainly in term conflation, the most direct approach would be to use this output to find the root of each word, then normalize all words to their roots in situ in the text before proceeding with further text analysis (such as term statistics or proximity mapping). The gloss form enables this most easily because the root is identifiable as the only morpheme that begins with a lower case letter.
However, in some cases, multiple gloss forms yielded multiple roots, such as differ and different for the surface form differentiation. Having to make manual choices might be prohibitively labor intensive. In this experiment I took the first root, whatever it was.
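The root-finding rule just described (the root is the only gloss morpheme beginning with a lower case letter; take the first root when there are several parses) can be sketched in Python as follows. This is an illustration, not the Word and Excel procedure actually used. The function names are hypothetical, and the sketch assumes the log layout shown in the excerpt above: one surface word per line, followed by either "*** NONE ***" or one or more lines holding a lexical form and a gloss form separated by whitespace.

```python
def root_from_gloss(gloss):
    """Return the first gloss morpheme that starts with a lower case letter
    (after dropping the ` stress mark), or None if there is none."""
    for morpheme in gloss.split("+"):
        bare = morpheme.replace("`", "")
        if bare[:1].islower():
            return bare
    return None

def first_roots(log_lines):
    """Pair each surface word with the root of its first parse that has one.
    Unrecognized words are paired with None."""
    results = []
    current = None
    for line in log_lines:
        fields = line.split()
        if len(fields) == 1 and fields[0].isalnum():
            # A surface word echoed from the one-word-per-line input file.
            current = [fields[0], None]
            results.append(current)
        elif len(fields) == 2 and current is not None and current[1] is None:
            # A parse line: lexical form, then gloss form.
            current[1] = root_from_gloss(fields[1])
        # "*** NONE ***" lines and later parses leave the entry unchanged.
    return [tuple(entry) for entry in results]
```

Applied to the log excerpt above, this pairs differentiation with differ and noninsulin with insulin, and leaves dehydrotrametenolic without a root.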
Using a series of Microsoft Word and Excel tricks*, PC-KIMMO's roots were identified and placed in the text in place of their corresponding surface forms, resulting in the following transformation. Roots which differ from the surface form are indicated by bold underlined text. If any conflation resulted (i.e., fewer forms of the word in this passage of text), the root is marked with [Y].
*(Clearly, there are more direct ways of doing this. This would be a good C programming exercise for me someday!)
===============================================================================
1 biol pharm bull 2002 jan 25 1 81 6
dehydrotrametenolic acid induce [Y] preadipocyte differ and sense [Y] animal model of insulin [Y] pend diabetes mellitus to insulin
sato m tai t nunoura y yajima y kawashima s tanaka k
part of medicine pharmacology search and develop tokyo metropolitan institute for medic science japan
sato rinshoken or jp
we cent cove that the triterpene acid compound dehydrotrametenolic acid mote adipocyte differ in vitro and act [Y] as an insulin sense in vivo this nature duct have be [Y] isolate from dry sclerotia of poria coco wolf polyporaceae a well know tradition chinese medicine plant we mine the effect of dehydrotrametenolic acid on plasma glucose concentrate in obese hyperglycemic db db mouse dehydrotrametenolic acid can reduce hyperglycemia in mouse [Y] model of insulin pend diabetes mellitus niddm and act as an insulin sense as indicate by the result of the glucose tolerate test these terpenoids and thiazolidine type of diabetic agent such as ciglitazone although structure late share many biology act [Y] both induce adipose convert active peroxisome proliferate active [Y] receive gamma ppar gamma in vitro and reduce hyperglycemia in animal model of niddm dehydrotrametenolic acid be [Y] a promise candidate for a new type of insulin sense [Y] drug this find be very port for the develop of insulin sense [Y] that be not of the thiazolidine type
pmid 11824563 pubmed in process
===============================================================================
In this Medline record there were 224 words, of which 162 (72%) were recognized by PC-KIMMO/ENGLEX, of which 59 (26% of 224) resulted in stemming to a shorter root, of which only 11 (5% of 224) were conflated with another occurrence of the same root. The first two percentages would likely remain stable no matter how many Medline records were pooled, but the last one would be expected to rise with increasing size of the text corpus.
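The tallies above were made by hand. A minimal Python sketch of the same bookkeeping follows. The function names are hypothetical, and the definition of "conflated" used here (a stemmed occurrence whose root is shared with at least one other surface form in the text) is my assumption about how the [Y] marks would be assigned, not a rule taken from PC-KIMMO.

```python
from collections import defaultdict

def percent(part, total):
    """Percentage of total, rounded to the nearest whole number."""
    return round(100 * part / total)

def conflation_stats(pairs):
    """pairs: (surface, root) tuples in text order; root is None when the
    word was not recognized. Returns counts of word occurrences."""
    recognized = [(s, r) for s, r in pairs if r is not None]
    stemmed = [(s, r) for s, r in recognized if r != s]
    forms_by_root = defaultdict(set)
    for surface, root in recognized:
        forms_by_root[root].add(surface)
    # Assumed definition: the root stands for more than one surface form.
    conflated = [(s, r) for s, r in stemmed if len(forms_by_root[r]) > 1]
    return {
        "total": len(pairs),
        "recognized": len(recognized),
        "stemmed": len(stemmed),
        "conflated": len(conflated),
    }

if __name__ == "__main__":
    # Check the percentages reported for this Medline record.
    print(percent(162, 224), percent(59, 224), percent(11, 224))  # 72 26 5
```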
In the following table I have juxtaposed the original text (in blue straight Roman font) with the normalized text (in black italic Arial font) and tried to highlight (in red italic Arial) the problems with the results. Each problem is flagged only once, though it may recur (the exception is the serious problem with "non-", which is flagged twice). See the text following the table for a discussion.
Row | Text | Problems flagged
1 | 1 biol pharm bull 2002 jan 25 1 81 6 |
2 | dehydrotrametenolic acid induce [Y] preadipocyte differ and sense [Y] animal model of noninsulin-dependent diabetes mellitus to insulin. | [dehydro-]; [-ic acid]; [pre-]; sense=sensitiz?; NO! non- ≠; pend=depend?
3 | Sato M, Tai T, Nunoura Y, Yajima Y, Kawashima S, Tanaka K. |
4 | part of medicine pharmacology search and develop tokyo metropolitan Institute for Medical Science, Japan. sato@rinshoken.or.jp | part=department?; [-ology]; search=research?; medical ≠ medicinal?
5 | we cent cove that the triterpene acid compound dehydrotrametenolic acid | cent=recent?; cove=discover?
6 | mote adipocyte differ in vitro and act [Y] as an insulin sense in vivo this | mote=promote?
7 | nature duct have be [Y] isolate from dry sclerotia of poria coco wolf polyporaceae | NO! duct ≠ product; [-aceae]
8 | a well know tradition chinese medicine plant we mine the effect of | mine=examine?
9 | dehydrotrametenolic acid on plasma glucose concentrate in obese hyperglycemic db db | [hyper-]
10 | mouse dehydrotrametenolic acid can reduce hyperglycemia in mouse [Y] model of insulin | NO!
11 | pend diabetes mellitus niddm and act as an insulin sense as indicate by the |
12 | result of the glucose tolerate test these terpenoids and thiazolidine type of diabetic | [-oids]; NO! anti- ≠
13 | agent such as ciglitazone although structure late share many biology act [Y] | NO! late ≠ relate
14 | both induce adipose convert active peroxisome proliferate active [Y] receive | receive=receptor?
15 | gamma ppar gamma in vitro and reduce hyperglycemia in animal model of niddm | PPAR=?; NIDDM=?
16 | dehydrotrametenolic acid be [Y] a promise candidate for a new type of insulin sense [Y] drug |
17 | This finding is very important for the development of insulin sensitizers that are not of the thiazolidine type. | NO! port ≠ important
18 | pmid 11824563 pubmed in process |
Row 2: One might want to stem chemical affixes such as "dehydro-" and "-ic acid", especially the latter, since the acid and anionic ("-ate") forms are virtually interchangeable in biochemistry. Prefixes are usually dangerous stemming candidates (see below "non-" and "anti-") but "pre-" seems pretty safe; preadipocytes are definitely conflatable with adipocytes in this record. It might not be a good idea in this domain to conflate "sensitiz-" to "sense" which refers to the plus strand of DNA (the minus strand is "antisense"). The same goes for "depend-" to "pend". "non-" should definitely NOT be stemmed under any circumstances for obvious reasons. In this case doing so conflates two entirely different diseases, NIDDM and IDDM.
Row 4: We might question the conflation of "department" to "part" or even to "depart", and also "research" to "search". On the other hand, the seemingly obvious conflation of "medical" and "medicinal" to the same root is not accomplished by PC-KIMMO/ENGLEX. "-ology" would be a desirable suffix to stem in this domain.
Row 5,6,8: More questionable prefix-based conflations: "recent" to "cent", "discover" to "cove", "promote" to "mote", and "examine" to "mine".
Row 7: Another total prefix crash, "product" to "duct"; these have distinct meanings in this domain (e.g., tear ducts). Organism scientific names at the family level of taxonomy are often based on a "type genus" name with the same root (e.g., Polyporaceae and Polyporus), and conflating the two levels can be useful.
Row 9: "hyper-" (but not "hypo-") seems like another safe prefix to stem; certainly hyperglycemia = glycemia.
Row 10: The noninsulin ≠ insulin problem again.
Row 12,13: Same for antidiabetic ≠ diabetic and relate ≠ late. Also "-oids" is a useful chemical suffix to stem, since it means "derived from" or "similar to" (terpene, in this case).
Row 14: Pretty uncomfortable conflating "receptor" to "receive". Receptors are super-important in this domain, almost a sacred word.
Row 15: Failure to conflate scientific acronyms. Perhaps this can be fixed by enriching the ENGLEX abbreviation file, but note my comment in report #1: where is the link in that file, since the spelled-out reference is a comment?
Row 17: One last big prefix crash, "important" to "port" (portal means liver in this domain).
To sum up, using PC-KIMMO and ENGLEX out-of-the-box for free-text term conflation in the biomedical/pharmaceutical research domain does not seem very promising. At a minimum the negative and reversive prefix sections of the affix.lex control file need to be commented out. Other possible next steps include enriching the lexicon files with scientific terminology, such as the suffixes from my autoencoder stemmer or terms from the Merck dictionaries. Many of those terms are phrases, however, and it might be very difficult to separate the nouns, verbs, adjectives, etc.
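As an alternative to commenting out sections of affix.lex, the unwanted parses could be rejected after the fact when choosing a root. The following Python sketch shows the idea; it is not a tested fix. The function names are hypothetical, and the pattern for negative and reversive prefix glosses is an assumption generalized from the two tags that appear in the log excerpt above (NEG3 and REV2).

```python
import re

# Assumed tag pattern, generalized from NEG3 and REV2 in the log excerpt.
BLOCKED_PREFIX = re.compile(r"^(NEG|REV)\d+[a-z]?$")

def root_from_gloss(gloss):
    """Return the first gloss morpheme that starts with a lower case letter
    (after dropping the ` stress mark), or None if there is none."""
    for morpheme in gloss.split("+"):
        bare = morpheme.replace("`", "")
        if bare[:1].islower():
            return bare
    return None

def safe_root(surface, glosses):
    """Take the root of the first parse that contains no negative or
    reversive prefix; fall back to the surface form if none qualifies."""
    for gloss in glosses:
        if any(BLOCKED_PREFIX.match(m) for m in gloss.split("+")):
            continue
        root = root_from_gloss(gloss)
        if root is not None:
            return root
    return surface
```

With the parses shown earlier, noninsulin would be left unstemmed (its only parse uses NEG3), and dependent would fall through to the second parse and yield depend rather than pend. This does nothing for prefix crashes such as "product" to "duct", whose glosses are not shown in this report.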