Mark Sharp

Independent Study in LIS 16:194:698, Spring 2002

Weekly Report #2 --- 2/15/02

PC-KIMMO TERM CONFLATION EXPT. #1

I installed PC-KIMMO version 2.1.8 (11 May 2000, supplied on a floppy by SIL) and ENGLEX (28 Nov 1995, downloaded as ENGLEX20B5.ZIP from ftp://ftp.sil.org/software/unix/) on my C: drive. The ENGLEX rules and lexicon files are easily loaded with these two commands:

PC-KIMMO> load rules c:\work\englex\english.rul

PC-KIMMO> load lexicon c:\work\englex\english.lex

The lex file has INCLUDE statements for all the separate files of nouns, verbs, affixes, etc. Then PC-KIMMO is ready to use in batch or interactive mode, as clearly documented in Antworth's book.

Next I copied the following abstract from a Medline (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed) search on "PPAR gamma" (a receptor of pharmaceutical interest).

 

==================================================================================

1: Biol Pharm Bull 2002 Jan;25(1):81-6
Dehydrotrametenolic acid induces preadipocyte differentiation and sensitizes animal models of noninsulin-dependent diabetes mellitus to insulin.

Sato M, Tai T, Nunoura Y, Yajima Y, Kawashima S, Tanaka K.

Department of Medicinal Pharmacology Research and Development, Tokyo Metropolitan Institute for Medical Science, Japan. sato@rinshoken.or.jp

We recently discovered that the triterpene acid compound dehydrotrametenolic acid promotes adipocyte differentiation in vitro and acts as an insulin sensitizer in vivo. This natural product has been isolated from dried sclerotia of Poria cocos WOLF (Polyporaceae), a well-known traditional Chinese medicinal plant. We examined the effects of dehydrotrametenolic acid on plasma glucose concentration in obese hyperglycemic db/db mice. Dehydrotrametenolic acid can reduce hyperglycemia in mouse models of noninsulin-dependent diabetes mellitus (NIDDM) and act as an insulin sensitizer as indicated by the results of the glucose tolerance test. These terpenoids and thiazolidine type of antidiabetic agents such as Ciglitazone, although structurally unrelated, share many biological activities: both induce adipose conversion, activate peroxisome proliferator-activated receptor gamma (PPAR gamma) in vitro, and reduce hyperglycemia in animal models of NIDDM. Dehydrotrametenolic acid is a promising candidate for a new type of insulin-sensitizing drug. This finding is very important for the development of insulin sensitizers that are not of the thiazolidine type.

PMID: 11824563 [PubMed - in process]

==================================================================================

This abstract was pasted into a blank Microsoft Word document, where punctuation was removed, case converted to lower, and hard returns inserted between words, resulting in a file of words, one per line, in the same order as in the text. This file was saved in txt format to a file named ppar1.txt and run as a batch through the "recognize" function of PC-KIMMO with the output logged to a file named pck1.txt, as follows.

PC-KIMMO> log c:\work\pck1.txt

PC-KIMMO> file recognize c:\work\ppar1.txt
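The Word preprocessing step described above could also be scripted. The following is a minimal Python sketch, not the procedure actually used for this experiment. The names tokenize and write_word_list are hypothetical, and the sketch assumes that every punctuation mark acts as a word break.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words in order, dropping punctuation."""
    # Every run of letters or digits is one word; anything else is a separator.
    return re.findall(r"[a-z0-9]+", text.lower())

def write_word_list(text, path):
    """Write the words to a file, one per line, in their original order."""
    with open(path, "w", encoding="ascii") as out:
        for word in tokenize(text):
            out.write(word + "\n")
```

For example, tokenize("sato@rinshoken.or.jp") gives the four words sato, rinshoken, or, jp.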

The section of pck1.txt representing the article's title is shown here:

================================================================================

dehydrotrametenolic

*** NONE ***

acid

`acid `acid

induces

in`duce+s in`duce+PL

in`duce+s in`duce+3SG

preadipocyte

*** NONE ***

differentiation

`differ+ent+y+ation `differ+AJR27a+AJR14+NR23a

differ`entiate+ion `different+NR23e

and

and and

sensitizes

`sense+ite+ize+s `sense+NR32+VR6a+PL

`sense+ite+ize+s `sense+NR32+VR6a+3SG

`sense+ite+ize+s `sense+NR32+VR6a+PL

`sense+ite+ize+s `sense+NR32+VR6a+3SG

animal

`animal `animal

models

`model+s `model+PL

`model+s `model+3SG

of

of of

noninsulin

non+`insulin NEG3+`insulin

dependent

de+`pend+ent REV2+`pend+AJR27a

de`pendent de`pend

de`pend+ent de`pend+AJR27a

diabetes

dia`betes dia`betes

mellitus

*** NONE ***

to

to to

to INF

insulin

`insulin `insulin

===============================================================================

A couple of observations can be made already. First, PC-KIMMO does not do stemming on roots it does not recognize (i.e., roots that are not in the lexicon files), even if the affix is very common; e.g., dehydrotrametenolic and preadipocyte. This means we would either have to enrich the lexicon files with a lot of scientific terminology or change the affix recognition algorithm to be independent of the root. In my autoencoder, stemming is independent of the root, but must be controlled for unwanted anomalies by excepting certain cases (e.g., basement). The idea of exceptions might not fit easily into the logic of PC-KIMMO.
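The idea of root-independent stemming with an exception list can be illustrated with a short Python sketch. This is not the stemmer referred to above: the suffix list, the exception list, the minimum stem length, and the name strip_suffix are all hypothetical and chosen only to show the control flow.

```python
# Hypothetical suffix and exception lists, for illustration only.
SUFFIXES = ("ation", "ment", "ic", "s")
EXCEPTIONS = {"basement"}
MIN_STEM = 3  # assumed minimum length of what remains after stripping

def strip_suffix(word):
    """Remove one known suffix without consulting a lexicon of roots."""
    if word in EXCEPTIONS:
        return word  # excepted words are returned unchanged
    # Try longer suffixes first.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word
```

With these lists, "development" is reduced to "develop" and "dehydrotrametenolic" to "dehydrotrametenol" even though neither root is in any lexicon, while "basement" is left alone.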

Second, recognition of a compound surface form produces two lexical form outputs, one with the affixes themselves and one with the implied part-of-speech "gloss", e.g.,

surface form

differentiation

lexical form (affixes) lexical form (gloss)

`differ+ent+y+ation `differ+AJR27a+AJR14+NR23a

differ`entiate+ion `different+NR23e

surface form

sensitizes

lexical form (affixes) lexical form (gloss)

`sense+ite+ize+s `sense+NR32+VR6a+PL

`sense+ite+ize+s `sense+NR32+VR6a+3SG

`sense+ite+ize+s `sense+NR32+VR6a+PL

`sense+ite+ize+s `sense+NR32+VR6a+3SG

 

Since we are interested mainly in term conflation, the most direct approach would be to use this output to find the root of each word, then normalize all words to their roots in situ in the text before proceeding with further text analysis (such as term statistics or proximity mapping). The gloss form enables this most easily because, ignoring the ` mark, the root is identifiable as the only morpheme that begins with a lowercase letter.

However, in some cases, multiple gloss forms yielded multiple roots, such as differ and different for the surface form differentiation. Having to make manual choices might be prohibitively labor intensive. In this experiment I took the first root, whatever it was.
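The root-picking rule can be sketched in Python as follows. This is only an illustration of the rule as stated, not the Word and Excel procedure actually used; the function names are hypothetical, and the sketch assumes that affix glosses always begin with a capital letter or a digit (as AJR27a, PL and 3SG do in the log).

```python
def root_from_gloss(gloss):
    """Return the root morpheme of one gloss string, or None if there is none."""
    for morpheme in gloss.split("+"):
        bare = morpheme.replace("`", "")  # ignore the ` mark
        if bare[:1].islower():            # affix glosses start with a capital or digit
            return bare
    return None

def first_root(surface, glosses):
    """Take the root of the first analysis; keep the surface form otherwise."""
    if glosses:
        root = root_from_gloss(glosses[0])
        if root is not None:
            return root
    return surface  # unrecognized word (*** NONE ***) or gloss with no root
```

Taking the first analysis is what turns "differentiation" into "differ" rather than "different", and "dependent" into "pend" rather than "depend".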

Using a series of Microsoft Word and Excel tricks*, PC-KIMMO's roots were identified and placed in the text in place of their corresponding surface forms, resulting in the following transformation. Roots which differ from the surface form are indicated by bold underlined text. If any conflation resulted (i.e., fewer forms of the word in this passage of text), the root is marked with [Y].

*(Clearly, there are more direct ways of doing this. This would be a good C programming exercise for me someday!)
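One of those more direct ways might look like the Python sketch below. The function name normalize and the dictionary root_of (surface form to chosen root) are hypothetical, and the flagging rule is only one reading of the [Y] definition above: a changed word is flagged when its root stands for more than one distinct surface form in the passage. It may not reproduce every [Y] in the hand-made result.

```python
from collections import defaultdict

def normalize(tokens, root_of):
    """Replace each token by its root and flag roots that merged several forms."""
    # Collect the distinct surface forms that end up under each root.
    forms = defaultdict(set)
    for token in tokens:
        forms[root_of.get(token, token)].add(token)
    out = []
    for token in tokens:
        root = root_of.get(token, token)
        # Flag only tokens that were changed and whose root now covers
        # more than one distinct surface form in the passage.
        if root != token and len(forms[root]) > 1:
            out.append(root + " [Y]")
        else:
            out.append(root)
    return " ".join(out)
```

For the tokens acid, induces, models, both, induce, with "induces" mapped to "induce" and "models" to "model", this yields "acid induce [Y] model both induce".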

 

===============================================================================

1 biol pharm bull 2002 jan 25 1 81 6

dehydrotrametenolic acid induce [Y] preadipocyte differ and sense [Y] animal model of insulin [Y] pend diabetes mellitus to insulin

sato m tai t nunoura y yajima y kawashima s tanaka k

part of medicine pharmacology search and develop tokyo metropolitan institute for medic science japan

sato rinshoken or jp we

cent cove that the triterpene acid compound dehydrotrametenolic acid mote adipocyte differ in vitro and act [Y] as an insulin sense in vivo this nature duct have be [Y] isolate from dry sclerotia of poria coco wolf polyporaceae a well know tradition chinese medicine plant we mine the effect of dehydrotrametenolic acid on plasma glucose concentrate in obese hyperglycemic db db mouse dehydrotrametenolic acid can reduce hyperglycemia in mouse [Y] model of insulin pend diabetes mellitus niddm and act as an insulin sense as indicate by the result of the glucose tolerate test these terpenoids and thiazolidine type of diabetic agent such as ciglitazone although structure late share many biology act [Y] both induce adipose convert active peroxisome proliferate active [Y] receive gamma ppar gamma in vitro and reduce hyperglycemia in animal model of niddm dehydrotrametenolic acid be [Y] a promise candidate for a new type of insulin sense [Y] drug this find be very port for the develop of insulin sense [Y] that be not of the thiazolidine type

pmid 11824563 pubmed in process

===============================================================================

 

In this Medline record there were 224 words, of which 162 (72%) were recognized by PC-KIMMO/ENGLEX. Of the recognized words, 59 (26% of 224) were stemmed to a shorter root, and of those only 11 (5% of 224) were conflated with another occurrence of the same root. The first two percentages would likely remain stable no matter how many Medline records were pooled, but the last one would be expected to rise with increasing size of the text corpus.
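The percentages can be checked with a few lines of Python; the counts are the ones given above, and all percentages are taken relative to the 224 words of the record.

```python
total, recognized, stemmed, conflated = 224, 162, 59, 11

def pct(count, whole=total):
    """Percentage of the whole, rounded to the nearest integer."""
    return round(100 * count / whole)

print(pct(recognized), pct(stemmed), pct(conflated))  # 72 26 5
```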

 

In the following table I have juxtaposed the original text (in blue straight Roman font) with the normalized text (in black italic Arial font) and tried to highlight (in red italic Arial) the problems with the results. Problems are only flagged once but may reoccur (except a big problem with "non-" is flagged twice). See the text following the table for a discussion.

1
1: Biol Pharm Bull 2002 Jan;25(1):81-6

1 biol pharm bull 2002 jan 25 1 81 6

2
Dehydrotrametenolic acid induces preadipocyte differentiation and sensitizes animal models

dehydrotrametenolic acid induce [Y] preadipocyte differ and sense [Y] animal model

[dehydro-]….……[-ic acid]……..…[pre-]…………………..sense=sensitiz?

of noninsulin-dependent diabetes mellitus to insulin.
of insulin [Y] pend diabetes mellitus to insulin

NO! non- ≠ …pend=depend?

3

Sato M, Tai T, Nunoura Y, Yajima Y, Kawashima S, Tanaka K.
sato m tai t nunoura y yajima y kawashima s tanaka k

4
Department of Medicinal Pharmacology Research and Development, Tokyo Metropolitan

part of medicine pharmacology search and develop tokyo metropolitan

part=department?……[-ology]...search=research?

Institute for Medical Science, Japan. sato@rinshoken.or.jp
institute for medic science japan sato rinshoken or jp

……..medical ≠ medicinal?

5
We recently discovered that the triterpene acid compound dehydrotrametenolic acid

we cent cove that the triterpene acid compound dehydrotrametenolic acid

cent=recent?…cove=discover?

6
promotes adipocyte differentiation in vitro and acts as an insulin sensitizer in vivo. This

mote adipocyte differ in vitro and act [Y] as an insulin sense in vivo this

mote=promote?

7
natural product has been isolated from dried sclerotia of Poria cocos WOLF (Polyporaceae),

nature duct have be [Y] isolate from dry sclerotia of poria coco wolf polyporaceae

…NO! duct ≠ product………………………………………………………[-aceae]

8
a well-known traditional Chinese medicinal plant. We examined the effects of

a well know tradition chinese medicine plant we mine the effect of

………………………………………………….mine=examine?

9
dehydrotrametenolic acid on plasma glucose concentration in obese hyperglycemic db/db

dehydrotrametenolic acid on plasma glucose concentrate in obese hyperglycemic db db

………………………………………………………………………[hyper-]

10
mice. Dehydrotrametenolic acid can reduce hyperglycemia in mouse models of noninsulin-

mouse dehydrotrametenolic acid can reduce hyperglycemia in mouse [Y] model of insulin

……………………………………………………………………………………….NO!

11
dependent diabetes mellitus (NIDDM) and act as an insulin sensitizer as indicated by the

pend diabetes mellitus niddm and act as an insulin sense as indicate by the

12
results of the glucose tolerance test. These terpenoids and thiazolidine type of antidiabetic

result of the glucose tolerate test these terpenoids and thiazolidine type of diabetic

.……………………………………….……[-oids]………………………NO! anti- ≠

13
agents such as Ciglitazone, although structurally unrelated, share many biological activities:

agent such as ciglitazone although structure late share many biology act [Y]

………………………………………….NO! late ≠ relate

14
both induce adipose conversion, activate peroxisome proliferator-activated receptor gamma

both induce adipose convert active peroxisome proliferate active [Y] receive

………………………………………………………………………receive=receptor?

15
(PPAR gamma) in vitro, and reduce hyperglycemia in animal models of NIDDM.

gamma ppar gamma in vitro and reduce hyperglycemia in animal model of niddm

PPAR=?………………………………………………………………..NIDDM=?

16
Dehydrotrametenolic acid is a promising candidate for a new type of insulin-sensitizing drug.

dehydrotrametenolic acid be [Y] a promise candidate for a new type of insulin sense [Y] drug

17

This finding is very important for the development of insulin sensitizers that are not of the thiazolidine type.
this find be very port for the develop of insulin sense [Y] that be not of the thiazolidine type

……………..NO! port ≠ important

18
PMID: 11824563 [PubMed - in process]

pmid 11824563 pubmed in process

 

 

Row 2: One might want to stem chemical affixes such as "dehydro-" and "-ic acid", especially the latter, since the acid and anionic ("-ate") forms are virtually interchangeable in biochemistry. Prefixes are usually dangerous stemming candidates (see below "non-" and "anti-") but "pre-" seems pretty safe; preadipocytes are definitely conflatable with adipocytes in this record. It might not be a good idea in this domain to conflate "sensitiz-" to "sense", which refers to the plus strand of DNA (the minus strand is "antisense"). The same goes for "depend-" to "pend". "non-" should definitely NOT be stemmed under any circumstances, for obvious reasons. In this case doing so conflates two entirely different diseases, NIDDM and IDDM.

Row 4: We might question the conflation of "department" to "part" or even to "depart", and also "research" to "search". On the other hand, the seemingly obvious conflation of "medical" and "medicinal" to the same root is not accomplished by PC-KIMMO/ENGLEX. "-ology" would be a desirable suffix to stem in this domain.

Rows 5, 6, 8: More questionable prefix-based conflations: "recent" to "cent", "discover" to "cove", "promote" to "mote", and "examine" to "mine".

Row 7: Another total prefix crash, "product" to "duct" – these have distinct meanings in this domain (e.g. tear ducts). Organism scientific names at the family level of taxonomy are often based on a "type genus" name with the same root – e.g. Polyporaceae and Polyporus – and conflating the two levels can be useful.

Row 9: "hyper-" (but not "hypo-") seems like another safe prefix to stem; certainly hyperglycemia = glycemia.

Row 10: The noninsulin ≠ insulin problem again.

Rows 12, 13: Same for antidiabetic ≠ diabetic and relate ≠ late. Also "-oids" is a useful chemical suffix to stem, since it means "derived from" or "similar to" (terpene, in this case).

Row 14: Pretty uncomfortable conflating "receptor" to "receive". Receptors are super-important in this domain, almost a sacred word.

Row 15: Failure to conflate scientific acronyms. Perhaps this can be fixed by enriching the ENGLEX abbreviation file, but note my comment in report #1: where is the link in that file, since the spelled-out reference is a comment?

Row 17: One last big prefix crash, "important" to "port" (portal means liver in this domain).

 

To sum up, using PC-KIMMO and ENGLEX out-of-the-box for free-text term conflation in the biomedical/pharmaceutical research domain does not seem very promising. At a minimum the negative and reversive prefix sections of the affix.lex control file need to be commented out. Other possible next steps include enriching the lexicon files with scientific terminology, such as the suffixes from my autoencoder stemmer or terms from the Merck dictionaries. Many of those terms are phrases, however, and it might be very difficult to separate the nouns, verbs, adjectives, etc.
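Short of editing affix.lex, the same effect could be approximated after the fact by skipping any analysis whose gloss contains a negative or reversive prefix. The Python sketch below is an illustration only. It assumes that such prefixes are always glossed with names beginning NEG or REV (as NEG3 and REV2 are in the log above); the function names are hypothetical.

```python
BLOCKED_PREFIX_GLOSSES = ("NEG", "REV")  # assumed from NEG3 and REV2 in the log

def uses_blocked_prefix(gloss):
    """True if any morpheme gloss names a negative or reversive prefix."""
    return any(m.startswith(BLOCKED_PREFIX_GLOSSES) for m in gloss.split("+"))

def safe_root(surface, glosses):
    """Root of the first analysis that has no negative or reversive prefix."""
    for gloss in glosses:
        if uses_blocked_prefix(gloss):
            continue  # skip analyses such as NEG3+`insulin
        for morpheme in gloss.split("+"):
            bare = morpheme.replace("`", "")  # ignore the ` mark
            if bare[:1].islower():
                return bare
    return surface  # nothing acceptable: leave the word as it is
```

Under this filter "noninsulin" is left unstemmed, and "dependent" falls through to its second analysis and becomes "depend" instead of "pend".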