Mark Sharp --- msharp@scils.rutgers.edu

Independent Study in LIS 16:194:698, Spring 2002

Weekly Report #3 --- 2/19/02

 

1. CONCLUSIONS FROM PC-KIMMO TERM CONFLATION EXPT. #1

Last week's experiment with PC-KIMMO/ENGLEX suggested that its term recognition and conflation functionality might be insufficiently robust for analysis of unrestricted text in the biomedical/pharmaceutical domain. This might mean that using it for our purposes would require a great deal of manual encoding of linguistic knowledge, which is not acceptable. However, the limitations of the experiment must be kept in mind:

  1. PC-KIMMO and ENGLEX were used without modification. Both could be modified to enhance the desired functionality. For example, ENGLEX could be enriched with biomedical terminology using the knowledge sources of the Unified Medical Language System (UMLS) (http://www.nlm.nih.gov/research/umls/).
  2. Using the root of the first lexical gloss form for term conflation was expedient but might not be the best strategy. Certainly the fine granularity of the organization of the ENGLEX files should permit differentiating desirable from undesirable conflation. Part-of-speech or other information might be combined with the root to improve performance. For example, undesirable conflation of opposites by prefix stemming (non-, anti-, re-, pro-) might be avoided in this way. However, it might take some time to dissect the ENGLEX files to do this.

Therefore, PC-KIMMO should not be ruled out for our purpose. But using it will require further development.

 

2. AUTOENCODER TERM CONFLATION EXPT. #1

Background

The autoencoder is a computer program designed to translate free-text ("verbatim") phrases into controlled vocabulary ("dictionary") terms. It does this by progressively "normalizing" the verbatim, a little at a time, until it matches a normalized dictionary term. Normalizing algorithms include converting to lower case, removing punctuation and noisewords, tokenizing words, converting British to American spelling, stemming suffixes, compressing phrases, and interchanging synonyms and close values. It starts with the most conservative changes (case) and gradually becomes more radical (up to close value switching). Dictionary match attempts are made at various points along the way. If it finds an "exact match" (defined as all tokens matched in both verbatim and dictionary term) at any point, it stops. Therefore, the accuracy of a match depends on where in the pipeline it was made: the earlier, the more accurate. If it gets all the way through without finding an exact match, it lists all the "possible matches" it found. A possible match has one or more matching tokens but also some unmatched tokens. Possible matches and multiple exact matches can serve as suggestions for manual encoding.
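To make the control flow concrete, here is a minimal Python sketch of a progressive normalize-and-match loop of the kind described above. It is not the autoencoder's actual code: the step functions, the noiseword list, the toy suffix stripper, and the three-term dictionary are all assumptions invented for illustration, and only four of the normalization stages are represented.

```python
import re

# Hypothetical data: the real noiseword lists and dictionaries are not shown in this report.
NOISEWORDS = {"the", "of", "and", "in", "a", "an"}
DICTIONARY = ["Diabetes Mellitus", "Headache", "Insulin Resistance"]

def lower(tokens):
    return [t.lower() for t in tokens]

def strip_punct(tokens):
    cleaned = [re.sub(r"[^\w]", "", t) for t in tokens]
    return [t for t in cleaned if t]

def drop_noise(tokens):
    return [t for t in tokens if t not in NOISEWORDS]

def crude_stem(tokens):
    # Toy suffix stripper standing in for the real stemmer.
    return [re.sub(r"(es|s|ic|e)$", "", t) if len(t) > 4 else t for t in tokens]

# Ordered from most conservative to most radical.
STEPS = [lower, strip_punct, drop_noise, crude_stem]

def normalize(text, depth):
    """Apply the first `depth` normalization steps to a phrase."""
    tokens = text.split()
    for step in STEPS[:depth]:
        tokens = step(tokens)
    return tokens

def encode(verbatim):
    """Stop at the first depth that yields an exact match; otherwise list possible matches."""
    possible = set()
    for depth in range(1, len(STEPS) + 1):
        v = set(normalize(verbatim, depth))  # sets ignore token order and repeats
        exact = []
        for term in DICTIONARY:
            d = set(normalize(term, depth))
            if v == d:
                exact.append(term)       # all tokens matched on both sides
            elif v & d:
                possible.add(term)       # some tokens matched, some did not
        if exact:
            return {"status": "exact", "depth": depth, "terms": exact}
    return {"status": "possible", "depth": len(STEPS), "terms": sorted(possible)}
```

In this sketch the returned depth plays the role of "where in the pipeline the match was made", so a lower depth corresponds to a more trustworthy match.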

In linguistic terms, the autoencoder uses syntactic analysis (including morphological analysis and lexical knowledge) to do conflation of content words and phrases. Its case, spelling, phrase compression, suffix stemming, and synonym switching functionality are directly applicable to the free text term conflation problem we are concerned with. Unlike PC-KIMMO, the autoencoder is rule-based rather than lexicon-based, so the normalization algorithms are immediately applicable to unrestricted text, assuming the text can be parceled into small, verbatim-sized chunks (sentences?). Also unlike PC-KIMMO, the autoencoder has already been "tuned" to the biomedical/pharmaceutical domain.

Method

I took the same MEDLINE abstract that was used for the PC-KIMMO experiment, divided it into sentences, and ran the sentences as individual verbatim through both the drug and medical domain versions of the autoencoder. Of course, no exact dictionary term matches were found, so the normalization pipeline ran to completion for all inputs, and the normalized verbatim was pasted back into the abstract format for comparison to the original. Close value switches were then manually reversed, simulating what the normalization output would look like without close value switching (see below).
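The sentence division was done by hand for this experiment. For larger runs, a crude automatic splitter along the following lines might serve as a starting point. This is only a sketch under the assumption that a sentence ends in ".", "!" or "?" followed by whitespace and a capital letter; the function name is hypothetical, and the rule would mis-split author lists and abbreviations such as "Sato M. Tai T."

```python
import re

def split_sentences(text):
    """Naive splitter: break after . ! ? when followed by whitespace and a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

abstract = (
    "We examined the effects of dehydrotrametenolic acid on plasma glucose "
    "concentration in obese hyperglycemic db/db mice. Dehydrotrametenolic acid "
    "is a promising candidate for a new type of insulin-sensitizing drug. "
    "This finding is very important."
)

# Each element would be submitted to the autoencoder as one verbatim.
verbatims = split_sentences(abstract)
```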

Results & Discussion

The three versions of the abstract are shown below (original=black, drug=green, medical=blue) followed by a line-by-line comparison with my comments (red). Words which were altered by normalization are shown in bold underlined. Noiseword removal is indicated by <>.

Overall the results were more encouraging than the PC-KIMMO results, but also indicated that significant effort would have to be applied to adapting the autoencoder for text mining purposes. Many of the autoencoder's functions are heuristics that have been developed specifically for matching verbatim to dictionary terms in a clinical research context. This has both a semantic aspect (the verbatim tends to be short noun phrases) and a domain aspect (the verbatim is often formulated as responses to specific questions, e.g. "Reason for physician visit"). These features have a focusing effect that eliminates the vast majority of possible free text expressions. Therefore, while quite robust for the often challenging world of clinical verbatim, the autoencoder may not be very robust for MEDLINE text. For example, the unfortunate stemming of "important" to "import" observed in this test has never been corrected by tuning because it has not been a problem. These words rarely if ever appear in medical verbatim or dictionary terms: "important" is expressed more precisely as "severe", "chronic", etc., and the concept of "importing" is not relevant to clinical reporting. But in the biomolecular world of MEDLINE, importing is an important (sorry!) concept relating to the transport of drugs and other substances from one biological compartment to another, say, from the cytoplasm into the cell nucleus.
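One inexpensive form of tuning for cases like "important" is a protected-word list consulted before stemming. The Python sketch below illustrates the idea only; the suffix rule and the contents of the list are assumptions, not the autoencoder's actual rules.

```python
import re

# Hypothetical exception list; entries would be added as problems are observed.
PROTECTED = {"important", "importance"}

def toy_stem(word):
    # Toy suffix stripper standing in for the real stemmer.
    return re.sub(r"(ance|ant|ing|ed|s)$", "", word)

def stem_with_exceptions(word, protected=PROTECTED):
    w = word.lower()
    if w in protected:
        return w              # leave protected words untouched
    return toy_stem(w)
```

With this in place, "importing" and "imported" would still conflate to "import", while "important" would pass through unchanged.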

Other systematic problems include:

Technical problems include:

[END OF NARRATIVE]

Results Detail

original input=black, drug output=green, medical output=blue, my comments=red

=================================================================================

ORIGINAL ABSTRACT

1: Biol Pharm Bull 2002 Jan;25(1):81-6
Dehydrotrametenolic acid induces preadipocyte differentiation and sensitizes animal models of noninsulin-dependent diabetes mellitus to insulin.

Sato M, Tai T, Nunoura Y, Yajima Y, Kawashima S, Tanaka K.

Department of Medicinal Pharmacology Research and Development, Tokyo Metropolitan Institute for Medical Science, Japan. sato@rinshoken.or.jp

We recently discovered that the triterpene acid compound dehydrotrametenolic acid promotes adipocyte differentiation in vitro and acts as an insulin sensitizer in vivo. This natural product has been isolated from dried sclerotia of Poria cocos WOLF (Polyporaceae), a well-known traditional Chinese medicinal plant. We examined the effects of dehydrotrametenolic acid on plasma glucose concentration in obese hyperglycemic db/db mice. Dehydrotrametenolic acid can reduce hyperglycemia in mouse models of noninsulin-dependent diabetes mellitus (NIDDM) and act as an insulin sensitizer as indicated by the results of the glucose tolerance test. These terpenoids and thiazolidine type of antidiabetic agents such as Ciglitazone, although structurally unrelated, share many biological activities: both induce adipose conversion, activate peroxisome proliferator-activated receptor gamma (PPAR gamma) in vitro, and reduce hyperglycemia in animal models of NIDDM. Dehydrotrametenolic acid is a promising candidate for a new type of insulin-sensitizing drug. This finding is very important for the development of insulin sensitizers that are not of the thiazolidine type.

PMID: 11824563 [PubMed - in process]
=================================================================================
DRUG DOMAIN AUTOENCODER NORMALIZATION
1 biol pharm bull 2002 jan 25 1 81 6
dehydrotrametenol acid induc preadipocyt differentiation <> sensit animal model of noninsulin dependent diabet mellitu to insulin
sat m tai t nunour y yajim y kawashim s tanak k
department of medicinal pharmacolog research <> development toky metropolitan institut for medical scienc japan sat rinshokenorjp
we recentl discover that <> triterp acid compound dehydrotrametenol acid promot adipocyt differentiation <> vitr <> act acet salicyl acid an insulin sensit <> viv thi natural <> has been isolat from dri sclerot of por coc wolf polyporac a well known traditional chines medicinal plant we examin <> effect of dehydrotrametenol acid <> plasm glucos concentration <> obes hyperglycem db db mic dehydrotrametenol acid can reduc hyperglycem <> mous model of noninsulin dependent diabet mellitu niddm <> act as an insulin sensit acet salicyl acid indicat by <> result of <> glucos toleranc test thes terpenoid <> thiazolidin typ of antidiabet agent such acet salicyl acid ciglitazon although structurall unrelat shar man biological activit both induc adipos conversion activat peroxisom proliferat activat recept gamm ppar gamm <> vitr <> reduc hyperglycem <> animal model of niddm dehydrotrametenol acid is a promising candidat for alph new typ of insulin sensitizing <> thi finding is ver important for <> development of insulin sensit that ar not of <> thiazolidin typ
pmid 11824563 pubm <> process
=================================================================================
MEDICAL DOMAIN AUTOENCODER NORMALIZATION
1 biol pharm bull <> jan 25 1 81 <>
dehydrotrametenol acid induc preadipocyt diff <> allerg anim model <> noninsulin depend diabet mellit <> insulin
sato m tai t nunor y yajim y kawashim <> tanak k
depart <> medicin pharmacolog research <> develop tokyo metropolitan institut <> drug sci japan sato rinshokenorjp
we rec discov <> <> triterp acid compound dehydrotrametenol acid promot adipocyt diff <> vitro <> react <> <> insulin allerg <> vivo <> natur product <> <> isol <> dry sclerot <> por coco wolf polyporac a decreas known tradit chines medicin plant we exam <> <> <> dehydrotrametenol acid <> plasm glucos concentr <> obes hyperglycem db db mous dehydrotrametenol acid <> reduc hyperglycem <> mous model <> noninsulin depend diabet mellit niddm <> act <> <> insulin sensit <> indic <> <> <> <> <> glucos tol test thes terpenoid <> thiazolidin <> <> antidiabes agent <> <> ciglitazon although structur unrel shar <> biolog act <> induc adipos convert act peroxisom proliferator act receptor gamm ppar gamm <> vitro <> defic hyperglycem <> anim model <> niddm dehydrotrametenol acid <> a promis candid <> a new <> <> insulin allerg drug <> find <> very import <> <> develop <> insulin allerg <> <> not <> <> thiazolidin <>
pmid 11824563 pubm <> process
=================================================================================
[Note: double blanks have been underlined to keep the HTML interpreter from collapsing them.]
=================================================================================
1
1: Biol Pharm Bull 2002 Jan;25(1):81-6
1__biol pharm bull 2002 jan 25 1__81 6
1__biol pharm bull <>__ jan 25 1__81 <>
Note that MEDLINE's abbreviated journal title could be simulated by stemming ("bulletin"/"bull" might be unfortunate, however). Note that the numbers 6 and 2002 were deleted as noisewords by medical.
=================================================================================
2
Dehydrotrametenolic acid induces preadipocyte differentiation and
dehydrotrametenol__ acid induc__ preadipocyt__differentiation <>
dehydrotrametenol__ acid induc__ preadipocyt__diff____________<>
"-ic" and other adjectival and adverbial morpho-syntactic identifying suffixes are stemmed by both domains; thus tagging would have to be done before term conflation. Actually we should conflate "-ic acid" with "-ate" in this domain. "induc"=induce, inducer, inducers, induces, induced, inducing, induction; this is good. preadipocyt~adipocyt needs to be addressed somehow. differentiation ==>"diff" shows the power of Paice modular stemming; does it go too far?
=================================================================================
3
sensitizes animal models of noninsulin-dependent diabetes mellitus
sensit____ animal model__of noninsulin dependent diabet__ mellitu
sensit____ anim__ model__<> noninsulin depend____diabet__ mellit
"sensit"=sensitive, sensitivity, sensitize, sensitizes, sensitized, sensitizing, sensitizer, sensitizers BUT NOT "sense" as per PC-KIMMO (this is good). "anim"=animal, animus, animated, etc., maybe too much. "depend" is good, "diabet"=diabetes, diabetic, diabetically (good). "-s" in mellitus is seen as plural by drug stemmer, maybe can fix. "-us" is stemmed by medical stemmer to conflate Latin male singular (-us) with female (-a) and plural (-i) forms of same root, usually works well.
=================================================================================
4
to insulin.
to insulin
<> insulin
Note domain difference in noiseword recognition.
=================================================================================
5
Sato M, Tai T, Nunoura Y, Yajima Y, Kawashima S, Tanaka K.
sat__m__tai t__nunour__y__yajim__y__kawashim__s__tanak__k
sato m__tai t__nunor__ y__yajim__y__kawashim__<> tanak__k
uh-oh, stemming proper names, definitely not good, preprocess tagging or something must protect proper names. (there are many proper noun recognizing algorithms in the literature)
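As a first approximation of such protection, capitalized tokens that are not sentence-initial could be exempted from stemming. The Python sketch below is an assumption-laden illustration, not one of the published algorithms: the stemmer is a toy, and the heuristic would miss a sentence-initial name (such as "Sato" in the author line above) and would wrongly protect capitalized common words.

```python
import re

def toy_stem(word):
    # Toy stand-in for the stemmer: drops a final a, o, e, or s from longer words.
    return re.sub(r"(a|o|e|s)$", "", word) if len(word) > 3 else word

def normalize_protecting_names(sentence):
    """Lowercase everything, but stem only tokens that do not look like proper names."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    out = []
    for i, tok in enumerate(tokens):
        looks_proper = tok[0].isupper() and i > 0   # crude: capitalized, not first token
        out.append(tok.lower() if looks_proper else toy_stem(tok.lower()))
    return out
```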
=================================================================================
6
Department of Medicinal Pharmacology Research and Development, Tokyo
department of medicinal pharmacolog__research <>__development__toky
depart____ <> medicin__ pharmacolog__research <>__develop______tokyo
department=depart? maybe not, but better than PC-KIMMO "part". Do we want pharmacolog~pharmaceut (~pharm?)? research ≠ search is good, "develop" is good. Tokyo needs proper noun protection.
=================================================================================
7
Metropolitan Institute for Medical Science, Japan. sato@rinshoken.or.jp
metropolitan institut__for medical scienc__ japan__sat__rinshoken or jp
metropolitan institut__<>__medic__ sci______japan__sato rinshoken or jp
medicin ≠ medic, same problem as PC-KIMMO, do we want to conflate them?
=================================================================================
8
We recently discovered that the triterpene acid compound
we recentl__discover__ that <>__triterp____acid compound
we rec______discov____ <>__ <>__triterp____acid compound
ugh, both mess up "recently", shows fallibility of this approach. Not sure "-ene" is expendable either.
=================================================================================
9
dehydrotrametenolic acid promotes adipocyte differentiation in vitro
dehydrotrametenol__ acid promot__ adipocyt__differentiation <> vitr
dehydrotrametenol__ acid promot__ adipocyt__diff____________<> vitro
"promot"=promoter, promotes, etc. (good), vs PC-KIMMO "mote" (bad). Drug stemmer seems to whack -o too indiscriminately for free narrative text.
=================================================================================
10
and acts as an insulin sensitizer in vivo. This natural product has
<>__act__as an insulin sensit____ <> viv__ thi__natural <>______has
<>__act__<> <> insulin sensit____ <> vivo__ <>__natur__ product <>
more bad stemming of narrative by drug stemmer, it really depends on clean verbatim to be effective, medical stemmer is more robust from bitter experience with dirty verbatim. No product=duct problem like PC-KIMMO.
=================================================================================
11
been isolated from dried sclerotia of Poria cocos WOLF (Polyporaceae),
been isolat__ from dri__ sclerot__ of por__ coc__ wolf__polyporac
<>__ isol____ <>__ dry__ sclerot__ <> por__ coco__wolf__polyporac
"isol" shows results of "-ate" stemming. Note dri=dry and Latin suffix stemming.
=================================================================================
12
a well-known traditional Chinese medicinal plant. We examined the
a well known traditional chines__medicinal plant__we examin__ <>
a well known tradit______chines__medicin__ plant__we exam____ <>
Chinese = proper noun. exam=examine, examination, etc. (good)
=================================================================================
13
effects of dehydrotrametenolic acid on plasma glucose concentration in
effect__of dehydrotrametenol__ acid <> plasm__glucos__concentration <>
<>______<> dehydrotrametenol__ acid <> plasm__glucos__concentr______<>
"concentr" another consequence of "-ate" stemming.
=================================================================================
14
obese hyperglycemic db/db mice. Dehydrotrametenolic acid can reduce
obes__hyperglycem__ db db mic__ dehydrotrametenol__ acid can reduc
obes__hyperglycem__ db db mous__dehydrotrametenol__ acid <>__reduc
"obes"=obese, obesity; "hyperglycem"=hyperglycemic, hyperglycemia; mice=mous; "reduc"… all good.
=================================================================================
15
hyperglycemia in mouse models of noninsulin-dependent diabetes mellitus
hyperglycem__ <> mous__model__of noninsulin dependent diabet__ mellitu
hyperglycem__ <> mous__model__<> noninsulin depend____diabet__ mellit
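The medical stemmer's "mellit" here comes from its Latin ending rule. A minimal Python sketch of that kind of conflation is given below, assuming only the three endings mentioned earlier (-us, -a, -i) and a hypothetical minimum stem length to keep short words intact; the real stemmer's rules are certainly more extensive.

```python
import re

LATIN_ENDINGS = re.compile(r"(us|a|i)$")

def latin_stem(word, min_stem=4):
    """Strip -us / -a / -i so that Latin singular and plural forms share a root."""
    w = word.lower()
    stem = LATIN_ENDINGS.sub("", w)
    # Keep the word whole if stripping would leave too little (e.g. "data").
    return stem if len(stem) >= min_stem else w
```

Under these assumptions "nucleus" and "nuclei" conflate, while "vivo" is left alone.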
=================================================================================
16
(NIDDM) and act as an insulin sensitizer as indicated by the results of
niddm__ <>__act as an insulin sensit____ as indicat__ by <>__result__of
niddm__ <>__act <> <> insulin sensit____ <> indic____ <> <>__<>______<>
=================================================================================
17
the glucose tolerance test. These terpenoids and thiazolidine type of
<>__glucos__toleranc__test__thes__terpenoid__<>__thiazolidin__typ__of
<>__glucos__tol______ test__thes__terpenoid__<>__thiazolidin__<>__ <>
tolerance==>"tol" too much? "-ene" but not "-oid" probably bad. "thiazolidin" is IUPAC combining form (good). Note medical considers "type" a noiseword.
=================================================================================
18
antidiabetic agents such as Ciglitazone, although structurally
antidiabet__ agent__such as ciglitazon__ although structurall
antidiabes__ agent__<>__ <> ciglitazon__ although structur
no "anti-" stemming problem like PC-KIMMO, but "-etic" = "-es"? have to fix that
=================================================================================
19
unrelated, share many biological activities: both induce adipose
unrelat____shar__man__biological activit____ both induc adipos
unrel______shar__<>__ biolog____ act________ <>__ induc adipos
"unrel-" = "-ate" stemming again. "many" = medical noiseword. activities, activation, action, acting = act? "adipos"=adipose, adiposity (good)
=================================================================================
20
conversion, activate peroxisome proliferator-activated receptor gamma
conversion__activat__peroxisom__proliferat__ activat__ recept__ gamm
convert____ act______peroxisom__proliferator act______ receptor gamm
medical knows conversion=convert, but drug knows "-or" can be stemmed. Greek letters should be protected. Also we could compress long spelled-out forms to acronyms for PPAR, NIDDM, IDDM, etc.
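The acronym compression suggested here could be done as a preprocessing pass before tokenizing. The Python sketch below assumes a small hand-built table of long forms (hypothetical; in practice it might be derived from a knowledge source) and treats hyphens and spaces in the long form as interchangeable.

```python
import re

# Hypothetical table of spelled-out forms and their acronyms.
ACRONYMS = {
    "noninsulin-dependent diabetes mellitus": "NIDDM",
    "peroxisome proliferator-activated receptor": "PPAR",
}

def compress_acronyms(text, table=ACRONYMS):
    """Replace each spelled-out form with its acronym, longest forms first."""
    for long_form in sorted(table, key=len, reverse=True):
        words = re.split(r"[-\s]+", long_form)
        # Allow any run of hyphens/whitespace between the words of the long form.
        pattern = re.compile(r"[-\s]+".join(map(re.escape, words)), re.IGNORECASE)
        text = pattern.sub(table[long_form], text)
    return text
```

Note that text which already gives the acronym in parentheses, as this abstract does, would come out as "NIDDM (NIDDM)", so a follow-up step would be needed to drop the duplicate.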
=================================================================================
21
(PPAR gamma) in vitro, and reduce hyperglycemia in animal models of
ppar__gamm__ <> vitr__ <>__reduc__hyperglycem__ <> animal model__of
ppar__gamm__ <> vitro__<>__reduc__hyperglycem__ <> anim__ model__<>
=================================================================================
22
NIDDM. Dehydrotrametenolic acid is a promising candidate for a new type
niddm__dehydrotrametenol__ acid is a promising candidat__for a new typ
niddm__dehydrotrametenol__ acid <> a promis____candid____<>__a new <>
"candid-" = "-ate" stemming result.
=================================================================================
23
of insulin-sensitizing drug. This finding is very important for the
of insulin sensitizing <>____thi__finding is ver__important for <>
<> insulin sensit______drug__<>__ find____<> very import____<>__<>
uh-oh, important ≠ import, same problem as PC-KIMMO.
=================================================================================
24
development of insulin sensitizers that are not of the thiazolidine type.
development of insulin sensit______that ar__not of <>__thiazolidin__typ
develop____ <> insulin sensit______<>__ <>__not <> <>__thiazolidin__<>
"are" should not be stemmed but rather conflated with "be"
=================================================================================
25
PMID: 11824563 [PubMed - in process]
pmid__11824563__pubm____ <> process
pmid__11824563__pubm____ <> process__
=================================================================================