# User:Mgawlik

## Personal Information

My name is Michal Gawlik (pronounced me-how). I am 100% Polish, bilingual, and very proud of my heritage. I am currently a sophomore at Marquette University, working on an Electrical Engineering major. For the 2010 REU program I will be working with Dr. Craig Struble in the Bistro Lab on the Intelligent Discovery of Acronyms and Abbreviations (IDA2) project, started by Adam Mallen last year. The general topic of this research is Natural Language Processing (NLP).

## Final result

Paper: Media:Comparison of abbreviation recognition algorithms.pdf

Poster: Media:Poster- Comparison of algorithms.pdf

Presentation: Media:Presentation- Comparison of algorithms.pdf

## Week 1: May 31- June 4

**May 31**

- Memorial Day- no work
- Took train to Milwaukee, settled in at the Men's Catholic House

**June 1**

- Attended introductory meeting for REU program.
- Browsed wiki from last year's IDA2 project, including Adam Mallen's work log.
- Attended REU talk about research practices.
- Practiced using LaTeX (for typesetting), Subversion (for version control), and Make (for building automation). These tools were discussed at last week's Bistro Lab meeting.
- Prepared for weekly lab meeting by reading Ashelford et al., "At Least 1 in 20 16S rRNA Sequence Records Currently Held in Public Repositories Is Estimated To Contain Substantial Anomalies." While not directly related to my research project, it gave me a chance to practice reading research papers.
- Time= 4 hours

**June 2**

- Worked through some LingPipe tutorials (Spelling Correction, Text Classification). LingPipe is a set of Java classes that are used for linguistic analysis.
- Met with Dr. Struble. Discussed my current understanding of NLP, worked some examples, and planned next few weeks.
- Attended weekly Bistro Lab meeting. Praful Aggarwal (a graduate student working with Dr. Struble) led presentation/discussion on software called Pintail (used to detect chimeric sequences in a public genomic database).
- Went to the library and checked out two books (recommended by Dr. Struble).
- "Foundations of Statistical Natural Language Processing" by Manning, Schutze
- "Programming for Corpus Linguistics: How to Do Text Analysis with Java" by Oliver Mason

- Time= 5.5 hours

**June 3**

- Set up my wiki, updated my work log.
- Will be leaving Milwaukee for long weekend (housesitting while family goes on vacation). Nevertheless, will continue to work diligently.
- Create list of basic NLP terms/definitions
- Time= 2.5 hours

## Week 2: June 7- 11

**June 7**

- More basic research/learning (added to list of terms)
- Read Schwartz, Ariel S., and Marti A. Hearst. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Pacific Symposium on Biocomputing 2003.
- Paper outlines algorithm that is currently used by IDA2
- Came up with ideas/questions to consider for improving elements of the IDA2 acronym finder

- Time= 6 hours
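
For reference, here is a rough sketch of the long-form matching step as I understand it from the Schwartz and Hearst paper. It is a Python paraphrase (the project code itself is Java), the name `find_best_long_form` and the sample strings are only illustrative, and it leaves out the paper's candidate-window and filtering rules, so it should not be read as the IDA2 implementation.

```python
def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    Returns the shortest suffix of `candidate` (starting at a word boundary)
    that contains every letter/digit of `short_form` in order, or None.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            # Skip punctuation in the short form (e.g. hyphens).
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Back up to the beginning of the word that holds the first match.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


print(find_best_long_form("HSF", "the heat shock transcription factor"))
# prints: heat shock transcription factor
```

The right-to-left scan is what keeps the algorithm simple: it needs no training data, only the text immediately before the parentheses.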

**June 8**

- Read introduction to Manning, Schutze's "Foundations of Statistical Natural Language Processing"
- Sections 1.1-1.4= Introduction
- 2.1.1-2.1.10= Mathematics Essentials
- 4.1-4.3= Corpus-Based Work
- 6.1-6.3= "Statistical Inference: n-gram Models over Sparse Data"

- Began writing basic n-gram model of my own. Corpus used is Mark Twain's "Tom Sawyer" (from Project Gutenberg)
- Time= 7 hours

**June 9**

- Continue writing n-gram code
- Resolved all problems with I/O
- Some improvements in tokenization

- Attend weekly lab meeting (read, discuss article on Genotype-Imputation Accuracy)
- Reread Schwartz, Hearst paper (jot down more notes, ideas)
- Time= 6 hours

**June 10**

- Reread Schutze 6.1-6.3 (dealing with n-gram models and estimators)
- Continue working on n-gram model code
- Had epiphany on how to calculate probabilities
- Stored trigrams in HashMap (key= word, value= frequency)

- Meet with Dr. Struble to discuss progress
- Looked at my code together, discussed it

- After meeting, improved/cleaned up code
- Got rid of unnecessary I/O
- Overall, shorten code from 140 to 90 lines

- Time= 10 hours
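
A minimal sketch of the kind of trigram model described in this entry, assuming a plain maximum-likelihood estimate (count of the trigram divided by count of its two-word context). It is written in Python for brevity rather than the Java used for the actual program; `train_trigram_model`, `trigram_probability`, and the toy sentence are illustrative names and data, and a `Counter` stands in for the HashMap.

```python
from collections import Counter


def train_trigram_model(tokens):
    """Count each trigram and the two-word context it starts with."""
    trigram_counts = Counter()
    context_counts = Counter()
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        context_counts[(w1, w2)] += 1
    return trigram_counts, context_counts


def trigram_probability(model, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for unseen contexts."""
    trigram_counts, context_counts = model
    context = context_counts[(w1, w2)]
    if context == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / context


model = train_trigram_model("the cat sat on the cat mat".split())
print(trigram_probability(model, "the", "cat", "sat"))  # prints: 0.5
```

Unseen trigrams get probability zero here, which is exactly the sparse-data problem that the estimators in sections 6.1-6.3 are meant to address.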

**June 11**

- Final edits to n-gram program
- Fix how probability is calculated
- Play with different corpora/texts (Shakespeare, etc.)

- Meet with Dr. Struble
- Get IDA2 code from lab repository
- Discuss how it works and possible shortfalls

- Time= 3 hours

## Week 3: June 14- 18

**June 14**

- Become familiar with IDA2 code
- Run it with some small input (varied success)
- Find Schwartz, Hearst original code here

- Short meeting with Dr. Struble (discuss progress, future work)
- Test IDA2 algorithm with data used by Schwartz, Hearst (1000 MEDLINE abstracts, found here)
- Find that our version of the algorithm finds 190 fewer short form/long form pairs than the original
- Clean up/fix labels in data (will use to determine precision/recall, determine differences in code)

- Time= 9.5 hours

**June 15**

- Continue struggling with poorly labeled data
- Create list of all abbreviations found by algorithm
- Organized into categories: Matches, Partials, Wrong, and Missing
- Calculate precision and recall based on this
- Precision: 90.16% (577 pairs correct/ 640 pairs found)
- Recall: 60.48% (577 pairs found/ 954 total pairs)

- Time= 6.5 hours
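
The two measures above come straight from the three counts, as in this small Python sketch (the function and parameter names are just illustrative):

```python
def precision_recall(correct, found, total):
    """Precision = correct / pairs found; recall = correct / pairs in the labeled data."""
    precision = correct / found if found else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


# Counts from the June 15 run of the IDA2 code.
p, r = precision_recall(correct=577, found=640, total=954)
print(f"Precision: {p:.2%}, Recall: {r:.2%}")
# prints: Precision: 90.16%, Recall: 60.48%
```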

**June 16**

- Double check the generated list of pairs found by the algorithm
- Get Schwartz/Hearst's original algorithm and run it on same data
- Precision: 94.44% (Paper claims 95%)
- Recall: 80.18% (Paper claims 82%)
- Start comparing pairs found from both algorithms (looking for differences that suggest error in code)

- Meeting with Dr. Struble
- Resolve some small problems with program
- Learn basics of database management in 30 min.

- Attend weekly lab meeting (another of Dr. Struble's students gave his thesis defense for practice)
- Time= 5.5 hours

## Week 4: June 21-25

**June 21**

- Clean up process of categorizing pairs (match, miss, etc.)
- Write new method to do it automatically (using generated list of pairs from algorithm and actual list)
- Fix some more errors in data (like labeling, but did not correct grammatical/spelling errors in abstracts)

- Realize that part of the reason the IDA2 code has much lower precision/recall is how the text is passed in/parsed
- Goes line by line, but about 160 pairs are on two lines

- Time= 10 hours
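
A rough Python sketch of how such an automatic categorization could work. The rules for "Partials" and "Missing" are my assumptions for illustration (a partial has the right short form and a long form that overlaps the labeled one; a missing pair is a labeled pair with neither a match nor a partial), and `categorize_pairs` is a hypothetical name, not the actual Java method.

```python
def categorize_pairs(found, gold):
    """Sort (short form, long form) pairs into matches, partials, wrong, missing."""
    found, gold = set(found), set(gold)

    gold_by_short = {}
    for short, long_form in gold:
        gold_by_short.setdefault(short, set()).add(long_form)

    matches = found & gold
    partials, wrong = set(), set()
    for short, long_form in found - gold:
        candidates = gold_by_short.get(short, ())
        # Assumed rule: right short form, long form too short or too long.
        if any(long_form in g or g in long_form for g in candidates):
            partials.add((short, long_form))
        else:
            wrong.add((short, long_form))

    partial_shorts = {short for short, _ in partials}
    missing = {(s, l) for s, l in gold - matches if s not in partial_shorts}
    return {"matches": matches, "partials": partials, "wrong": wrong, "missing": missing}


gold = [("HSF", "heat shock transcription factor"),
        ("NLP", "natural language processing"),
        ("DNA", "deoxyribonucleic acid")]
found = [("HSF", "heat shock transcription factor"),
         ("NLP", "language processing"),
         ("REU", "are you")]
for name, pairs in categorize_pairs(found, gold).items():
    print(name, sorted(pairs))
```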

**June 22**

- Fix error in I/O causing IDA2 program to miss pairs (now finding 802 instead of 637 pairs)
- Read in all lines and separate into sentences (save temporarily to ArrayList)
- Generate new list of pairs from IDA2
- Start categorizing and comparing (again)

- IDA2 algorithm is a lot better than initially predicted:
- Precision= 92.52% (742 pairs correct/ 802 pairs found)
- Recall= 76.18% (742 pairs found/ 974 total pairs)

- Time= 6.5 hours
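
The I/O fix amounts to joining the lines before splitting into sentences, so a pair that wraps across a line break stays in one piece. A minimal Python sketch of that idea, assuming a deliberately naive sentence splitter (the regular expression and the name `split_into_sentences` are illustrative, not what the Java code does):

```python
import re


def split_into_sentences(lines):
    """Join wrapped lines into one string, then split on sentence-ending punctuation."""
    text = " ".join(line.strip() for line in lines if line.strip())
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace and a capital.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [s for s in sentences if s]


lines = [
    "We studied heat shock transcription",
    "factor (HSF) in yeast. Results were",
    "clear.",
]
for sentence in split_into_sentences(lines):
    print(sentence)
```

With line-by-line processing, "heat shock transcription" and "factor (HSF)" would be seen separately and the pair would be lost.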

**June 23**

- Attend REU talk- "Expressing yourself verbally and in writing"
- Weekly lab meeting- just talk about everyone's recent progress
- Find differences between pairs found by each algorithm
- Original algorithm finds pairs with no capital letters
- Also handles pairs with parentheses and punctuation better

- Collect information about context of each pair that is not found by both algorithms
- Time= 6 hours

**June 24**

- Continue working with pairs found by algorithms
- Get ideas on how to fix IDA2
- One recurring problem is nested parentheses
- Another problem is whether or not the short form has a capital letter

- Time= 2 hours
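
One possible way to deal with nested parentheses is to track nesting depth and keep only the outermost spans. This is a Python sketch of that idea under my own assumptions, not the fix that eventually went into IDA2; `outermost_parentheticals` is a hypothetical helper.

```python
def outermost_parentheticals(text):
    """Return the contents of each top-level (...) span, keeping inner parentheses intact."""
    spans = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i])
    return spans


print(outermost_parentheticals("tumor necrosis factor (TNF (alpha)) and interleukin (IL)"))
# prints: ['TNF (alpha)', 'IL']
```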

## Week 5: June 28-July 2

**June 28**

- Create table/graph to compare performance of algorithms
- Search for new research papers to read (used Google Scholar, looked for papers that cited Schwartz/Hearst)
- Think of some more possible improvements to algorithm (still cannot explain some discrepancies)
- Time= 7 hours

**June 29**

- Continue working with pairs found by algorithms
- Think of other possible improvements to IDA2 (new algorithm, etc.)
- Learn some basics about databases and SQL (used by IDA2 to store all acronym/abbreviations found)
- Read from Raghu Ramakrishnan's "Database Management Systems" (old edition [1997])
- Section 1= Introduction to Database Systems
- 2= The Relational Model
- 4= File Organizations and Indexes
- 9= SQL: The Query Language

- Time= 8 hours

**June 30**

- Continue learning/practicing DBMS/SQL
- Take break from algorithm (to clear head)
- Time= 3.5 hours

**July 1**

- Make improvements to both algorithms
- Reconcile differences in matches
- Fix problem with certain partials (nested parentheses) and wrong pairs (with no space before parentheses)

- Recalculate precision and recall of both algorithms
- SH*- Precision= 94.36% (786 pairs correct/ 833 pairs found), Recall= 80.70% (786 pairs found/ 974 total pairs)
- IDA2*- Precision= 94.42% (779 pairs correct/ 825 pairs found), Recall= 79.98% (779 pairs found/ 974 total pairs)

- Time= 7 hours

**July 2**

- Document changes in algorithms (and results)
- Find, print several papers to read (6-8)
- Looked for papers cited by Schwartz & Hearst, or that cited their paper

- Read:
- Larkey et al Acrophile: An Automated Acronym Extractor and Server
- Paper discusses a project very similar to IDA2. I was specifically interested in the 4 algorithms they used to find abbreviation/acronym pairs
- Some of their algorithms could detect and define short forms that were not inside parentheses (using stop words).
- However, the precision and recall of these algorithms are very poor compared to S&H (at most, about 20%).

- Park, Byrd Hybrid text mining for finding abbreviations and their definitions.
- Cited by S&H (just like the previous one). It introduces an algorithm that uses a simple alignment scheme like S&H, but also creates a "RuleBase" of different patterns.
- S&H can only define abbreviations/acronyms if they are right next to each other, while this one can define them even if the long/short form pairs are offset from each other.

- Time= 8.5 hours

## Week 6: July 5-9

**July 5**

- Read another paper that S&H had cited: Using Compression to identify acronyms in text.
- Not as helpful as the other papers- published over 20 years ago, algorithm only dealt with acronyms
- Clever use of a threshold based on ratio of acronym to definition length.

- Read two other papers that cited S&H in their references (written in 2005/2006, 2-3 years after S&H).
- Torii et al A comparison study on algorithms of detecting long forms for short forms in biomedical text.
- Compares three different algorithms- S&H, Chang et al (CSA), and ALICE (see below). Each algorithm uses a different technique (alignment-based, machine learning, and template/rule-based).
- Found that a majority of acronyms/abbreviations were found by all algorithms (but reported higher than expected precision/recall for all, due to curation of corpus).
- ALICE had slightly better precision/recall (and was newer than S&H and CSA by several years), so I decided to take a closer look at it.

- Ao, Takagi "ALICE: An Algorithm to Extract Abbreviations from MEDLINE" (published in the Journal of the American Medical Informatics Association, Sept/Oct 2005; not available for free to public, printed copy through Marquette library).
- Uses very similar approach to S&H (assuming that acronym/abbreviation is in parentheses), but includes conditions/templates to increase precision/recall (conditions created after examination of biomedical texts).
- Reports 97% precision (similar to S&H) and 95% recall (over 13% better than S&H) on a similar corpus to S&H's.

- Time= 5 hours

**July 6**

- Read two more papers that cite S&H:
- Sohn et al Abbreviation definition identification based on automatic precision estimates
- Their approach to finding abbreviations/extensions has similar success as S&H, but has the unique ability to also estimate its precision.
- The paper also has several links to free, annotated PubMed records (I intend to use them as corpora in further testing).

- Gaudan et al Resolving abbreviations to their senses in Medline
- This paper introduces a method that relies on several NLP processes (mainly word-sense disambiguation and part-of-speech tagging) to find local as well as global abbreviations (ones where the extension is not present in the paper/abstract).
- I found the paper slightly more confusing than the others, so I will have to reread it at a later time. The topic is interesting because it is one of the topics Adam Mallen (the previous REU student who worked on IDA2) suggested for further work.

- Search online for some corpora to use, but run into some problems (Medstract Gold Standard is not available)
- Time= 5 hours

**July 7**

- Skimmed another article: Dannells Automatic Acronym Extraction
- This caught my eye because it used an algorithm similar to S&H, but able to find offset pairs, too
- It also tests that algorithm and a couple of machine learning ones on a Swedish corpus, a supposed first.

- Begin preparing presentation for Friday's REU Meeting.
- Also update my list of sources/references (after the reading spree)

- Time= 2 hours

**July 8**

- Continue preparing for presentation for REU Meeting (put together PowerPoint slides, practice).
- Time= 3 hours

**July 9**

- Before meeting, practice presentation some more (minor edits)
- Attend REU Meeting
- Listen to midterm presentations by all students
- Give feedback on aspects of presentation

- Time= 5.5 hours

**July 10**

- Continue searching for annotated corpora (with limited success)
- Download a couple algorithms, have not had chance to take a look at them yet
- Update wiki page
- Time= 4.5 hours

## Week 7: July 12-16

**July 12**

- Revisit IDA2 code and simplify it (remove some unnecessary code)
- Download C/C++ Development Kit (CDT) for Eclipse IDE, try to get other algorithms to work
- Time= 8 hours

**July 13**

- Work at making corpus used by ALICE (Ao, Takagi) compatible with IDA2, SH.
- Plan to use it to see if changes in IDA2 result in similar performance on other corpus

- Struggle with getting other algorithms to work (might require different OS)
- Time= 8 hours

**July 15**

- Get corpus of 1000 PubMed abstracts to work (list of abbreviations provided on Ao, Takagi's page)
- Run S&H, new IDA2 algorithm on it
- Precision= 92.2% (970 correct pairs found/ 1052 total pairs found)
- Recall= 88.6% (970 correct pairs found/ 1095 total correct pairs)

- Find some discrepancy with results (need to double check I/O and parsing of abstracts)
- Also come up with some ideas to improve IDA2 algorithm (implemented 07/26)
- Treat "/" as a delimiter
- Change all "{}" and "[]" to parentheses "()"

- Time= 8 hours
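
The two preprocessing ideas above could look roughly like this in Python. This is a sketch under my own assumptions (for example, that treating "/" as a delimiter means splitting tokens on it); `normalize_brackets` and `tokenize` are illustrative names, not the IDA2 methods.

```python
import re

# Map curly and square brackets onto parentheses.
BRACKETS = str.maketrans({"{": "(", "}": ")", "[": "(", "]": ")"})


def normalize_brackets(text):
    """Change all {} and [] to () so one parenthesis rule covers them."""
    return text.translate(BRACKETS)


def tokenize(text):
    """Split on whitespace and on "/" so both act as word delimiters."""
    return [token for token in re.split(r"[\s/]+", text) if token]


print(normalize_brackets("heat shock factor [HSF]"))  # prints: heat shock factor (HSF)
print(tokenize("protein/DNA binding"))                # prints: ['protein', 'DNA', 'binding']
```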

**July 16**

- Meet with Dr. Struble
- Discuss progress during his vacation
- Throw around plans for rest of program
- Plan to work together closely for remainder

- Document results of algorithms on new corpora more carefully
- Properly save/update all files (make copies, too) before departure
- Time= 6 hours

## Week 8: July 19-23

- Family vacation to Grand Teton, Yellowstone, and Glacier National Parks (in WY and MT).

## Week 9: July 26-30

**July 26**

- Implement changes to IDA2 mentioned in 07/15 log
- Meet with Dr. Struble
- Get help with VirtualBox, Fedora Core (needed to retrieve another algorithm technique)
- Review LaTeX and how to use various tools for preparation of: Poster, Paper, and Presentation

- Finally get VirtualBox to work properly on my computer, get Fedora Core 13
- Still some problems with starting machine, no luck retrieving new algorithm

- Create a third (and final) corpus- based on Medline articles gathered by Sohn et al
- Still need to fix some bugs in I/O to have algorithms run over it

- Time= 8 hours

**July 27**

- Finally get second algorithm, ALICE (from Ao & Takagi), to work
- Run ALICE on two corpora (BioText from Schwartz & Hearst, and ALICE)
- My test (ALICE algorithm on ALICE corpus) result matches result by Ao & Takagi (97% precision, 95% recall)
- Online implementation of ALICE available here

- Time= 4 hours

**July 28**

- Fix I/O problems with third corpus (Ab3P, from Sohn et al)
- REU Meeting: clear up expectations for end of program, attempt to discuss progress
- Find third algorithm that I want to compare: BIOADI (Biomedical Abbreviation Definition Identifier)
- Uses machine learning instead of alignment/rule-based approach
- Online implementation of BIOADI available here

- Read article describing the algorithm: Kuo et al BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature
- Time= 8 hours

**July 29**

- Get third algorithm, BIOADI, to work
- Required reformatting of corpora (no page breaks, PMID above abstract)
- When testing algorithm on Ab3P corpus (the one originally used by Kuo et al) get similar results (94% precision, 80% recall)

- Rerun all three algorithms (S&H, ALICE, BIOADI) on all three corpora (BioText, A&T, Ab3P)
- Save results to Excel file (as I had before), recalculate precision, recall of algorithms
- Also include new performance measure: F1-score (harmonic mean of precision, recall)

- Time= 8 hours
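
The F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A small Python sketch, using the rounded BIOADI figures from this entry only as sample input (`f1_score` is an illustrative name):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures from the BIOADI test on the Ab3P corpus (94% precision, 80% recall).
print(f"{f1_score(0.94, 0.80):.3f}")  # prints: 0.864
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so an algorithm cannot score well by trading all of its recall for precision.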

**July 30**

- Read sources on poster presentation, provided by Dr. Brylow and Factor
- Begin creating draft of poster
- Find suitable PowerPoint template (intended to do LaTeX but had problems)
- Create outline of poster, reorganize template

- Complete very rough draft of poster
- Time= 8 hours

**July 31**

- Add tables, graphs, and equations to poster
- Edit the content of poster
- Time= 4 hours

## Week 10: August 2-6

**Aug 2**

- Update wiki after abandoning it for a week
- Email poster to lab mates for constructive criticism
- Edit poster based on Dr. Struble's suggestions (less text, more examples)

- Start thinking about paper outline, content
- Explore LaTeX template provided by Dr. Struble

- Time= 5 hours

**Aug 3**

- Fix some errors in tables, graphs
- Continue editing poster
- Create presentation outline
- Learn to use beamer package in LaTeX
- Time= 8 hours

**Aug 4**

- Finish presentation draft
- Bistro Lab Meeting
- Get further comments on poster
- Informal presentation of work
- Get comments on presentation

- Final edit of poster (send to Dr. Brylow)
- Reformat some parts of presentation
- Simpler slides
- Better tables, graphs

- Time= 9 hours

**Aug 5**

- More edits of presentation
- More examples in slides
- Rewrite introduction, conclusion

- Time= 3 hours

**Aug 6**

- Work on draft of paper
- Finish Algorithms, Methods section
- Redo tables in LaTeX

- Time= 4 hours

## Week 11: August 9-12

**Aug 9**

- Continue working on paper
- Finish Methods, Results/Discussion
- Write Introduction

- Time= 3 hours

**Aug 10**

- Continue working on paper
- Write Conclusion, Future Work
- Write Abstract, References (some problems)
- Edit all sections

- REU Poster Session
- Answer questions about research
- View other posters

- Send paper to Dr. Struble for help (also peer edit with Victor Blas)
- Time= 5.5 hours

**Aug 11**

- Finish working on research paper
- Edit slides for presentation
- Bistro Lab Meeting
- Discuss P=NP problem

- Time= 4.5 hours

**Aug 12**

- Practice presentation
- REU Meeting
- Give presentation, "Comparison of abbreviation recognition algorithms"
- Listen to other presentations
- Post-REU survey

- Time= 4 hours