Schedule
- Mar. 14, 2011: Registration opens
- May 1, 2011: Initial registration extended
- May 15, 2011: Extended registration closes
- May 27, 2011: Returned fully executed Data Use Agreement and login credentials
- June 1, 2011: Training data released
- August 1, 2011: Test data released
- August 3, 2011, 11:59pm Eastern Time: Deadline for uploading system outputs
- September 1, 2011: Short papers due
- September 21, 2011: Invitations to present at workshop
- October 2011: Workshop
2007 International Challenge: Classifying Clinical Free Text Using Natural Language Processing
2007 Challenge
Download Data Set
Data sets from the 2007 challenge may be downloaded at the CMC Resource Catalog.
Purpose
To challenge the international Natural Language Processing (NLP) research community to create and train computational intelligence algorithms that automate the assignment of ICD-9-CM codes to clinical free text.
Introduction
It is surprisingly hard for computers to handle free text as smoothly and effectively as humans do. So far, the results of the numerous efforts to achieve this have been mixed. Indeed, at times it has appeared that the complexities of free text are such as to render the effort futile. Not so; in fact, successive attempts to address the problem of converting free text into actionable knowledge have advanced the science of natural language processing and led to demand for software that simulates and complements what people are able to do.
We are sponsoring an international challenge task on the automated processing of clinical free text. Even with advances in structured vocabularies, many hospitals continue to electronically store some patient data as free text. This practice produces terabytes of information that, beyond the clinical visit, has limited utility because of its volume and limited accessibility. Natural language processing can potentially uncover implicit structure in this data, rendering it accessible to targeted search engines as well as special-purpose systems dedicated to billing, quality assurance and discovery. This challenge offers participants an opportunity to test new algorithms or apply existing ones. Additionally, the Challenge provides full access to a carefully anonymised body of clinical data suitable for training and testing.
Competition Process
All participants will be required to register. On 1 Feb 2007, participants will be given access to a training data set, which they will use to develop their algorithms.
The test data set will be made available on 1 Mar 2007. Participants will use their algorithms to process the test data, and will submit their results in XML format, along with a brief description of their methods. For complete details about the data formats and evaluation process, download the Challenge Details document.
Competition results will be announced on 1 Apr 2007 and posted to the results page.
Travel Subsidy Awards
- First Place: US$1,000.00
- Second Place: US$500.00
- Third Place: US$250.00
Other Benefits of Participation
The competition provides an international opportunity for research groups to demonstrate the applicability of their natural language processing and artificial intelligence research in the medical domain. Results will also be published, although the publication venues are not yet finalized; they will most likely include conference proceedings, journal publications and possibly a book. All publications related to the challenge will include the appropriate participant(s) as co-authors.
2007 Challenge Organizers
The Medical NLP Challenge is sponsored by the Computational Medicine Center, a collaborative medical research center between Cincinnati Children's Hospital Medical Center and the University of Cincinnati Medical Center that uses data and computational systems to make disease more preventable, illness more predictive and treatment more personalized. Cincom Systems, an Ohio-based software and services company involved with the center, is providing the travel subsidy awards. Researchers from the Department of Linguistics at The Ohio State University and from the Center for Computational Pharmacology at the University of Colorado Health Sciences Center also participated in organizing the challenge.
Computational Medicine Center
The Computational Medicine Center combines the best in the fields of genetics, medicine, computer science and biological science, taking medicine to a new level. With funding and support from Ohio's Third Frontier Project and the National Institutes of Health (NIH), the center continues to build its team of talented research physicians and experts in bioinformatics, genomics, genetics, epidemiology, computer science, math and statistics. To learn more about the center, visit its web site, or contact John Pestian, PhD, director.
Cincinnati Children's Hospital Medical Center
With 475 registered beds and more than 8,500 employees, Cincinnati Children's Hospital Medical Center is a leading medical research and teaching hospital consistently ranking among the top 10 pediatric hospitals in the nation. Cincinnati Children's is also the second-highest ranking recipient of research grants from the National Institutes of Health among pediatric institutions.
University of Cincinnati Medical Center
The University of Cincinnati Medical Center houses some of the university's most innovative science and research laboratories, as well as four academic colleges and a group of patient care facilities and resources. With nearly $263 million in research funding, more than 290,000 patients and more than 3,000 aspiring health professionals, the UC Medical Center is one of the most innovative academic health research complexes in the nation.
The Ohio State University Department of Linguistics
The Ohio State University linguistics program has a strongly theoretical orientation, with a research focus on the development of a general theory of human language as well as detailed accounts of the structure, development and variation of individual languages. The character of the Ohio State linguistics program is also that of a "pure science" rather than an "applied science": linguistic phenomena are studied primarily for their own sake and as an aspect of human cognition, rather than being examined for their practical application in fields such as second language education. Experimental laboratory research in speech production, speech perception, the mental processes of sentence understanding, etc., is an important feature of many areas of study in the program. The department also has a rapidly growing focus in computational linguistics.
Center for Computational Pharmacology
The mission of the Center for Computational Pharmacology at the University of Colorado Health Sciences Center is the creation of novel algorithms and knowledge-based tools for the analysis and interpretation of high-throughput molecular biology data. The center's ultimate goal is to transform the process of drug design through the use of advanced computational techniques, particularly machine learning and knowledge-based approaches applied to high-throughput molecular biology data. Researchers at the center create novel algorithms for the analysis and interpretation of gene expression arrays, proteomics, metabolomics and combinatorial chemistry assays. They also create tools for building, maintaining and applying ontologies and knowledge-bases of molecular biology, and for knowledge-driven inference from multiple biological data types. A major focus of the center is the development and application of natural language processing techniques for information extraction from and management of the biomedical literature.
2007 Final Results
Here are the final results of the 2007 Computational Medicine Center International Challenge: Classifying Clinical Free Text Using Natural Language Processing.
MICRO-AVERAGED F-1 mean = 0.7670
MICRO-AVERAGED F-1 std.dev. = 0.1325
MICRO-AVERAGED F-1 median = 0.7985
GREEN: MICRO-AVERAGED F-1 is within 1 standard deviation of the mean
ORANGE: MICRO-AVERAGED F-1 is between 1 and 2 standard deviations from the mean
RED: MICRO-AVERAGED F-1 is more than 2 standard deviations from the mean
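The color bands above can be read as a threshold rule on each team's distance from the mean. The following is a minimal sketch, assuming the bands measure absolute distance from the mean in either direction; the helper name `band` is hypothetical and is not part of the challenge materials. The mean and standard deviation are the values reported above.

```python
MEAN = 0.7670     # micro-averaged F-1 mean, as reported above
STD_DEV = 0.1325  # micro-averaged F-1 standard deviation, as reported above


def band(score, mean=MEAN, std_dev=STD_DEV):
    """Return the color band for a micro-averaged F-1 score (hypothetical helper)."""
    # Distance from the mean, expressed in standard deviations.
    distance = abs(score - mean) / std_dev
    if distance <= 1:
        return "GREEN"
    if distance <= 2:
        return "ORANGE"
    return "RED"


# Three micro-averaged F-1 scores taken from the results table.
for team, score in [("Szeged", 0.8908), ("MERLIN", 0.5768), ("CLaC", 0.1541)]:
    print(team, band(score))
```

Under this reading, places 1 through 41 fall in the green band, MERLIN in orange, and ILMA and CLaC in red.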
| PLACE | TEAM SHORT NAME | MICRO-AVERAGED F-1 | MACRO-AVERAGED F-1 | COST SENSITIVE |
|---|---|---|---|---|
| 1 | Szeged | 0.8908 | 0.7691 | 0.9180 |
| 2 | University at Albany | 0.8855 | 0.7291 | 0.9091 |
| 3 | University of Turku | 0.8769 | 0.7034 | 0.9126 |
| 4 | PENN | 0.8760 | 0.7210 | 0.9088 |
| 5 | LMCO-IS & S | 0.8719 | 0.7760 | 0.9009 |
| 6 | GMJ_JL | 0.8711 | 0.7334 | 0.8975 |
| 7 | SULTRG | 0.8676 | 0.7322 | 0.8998 |
| 8 | MANCS | 0.8594 | 0.6676 | 0.9049 |
| 9 | otters | 0.8509 | 0.6816 | 0.9010 |
| 10 | Aseervatham | 0.8498 | 0.6756 | 0.8775 |
| 11 | LHC/NLM | 0.8469 | 0.6916 | 0.8880 |
| 12 | ohsu_dmice | 0.8457 | 0.6542 | 0.8938 |
| 13 | Stockholm | 0.8392 | 0.6684 | 0.8699 |
| 14 | Fabrizio Sebastiani | 0.8375 | 0.6870 | 0.8754 |
| 15 | UOE | 0.8360 | 0.6802 | 0.8598 |
| 16 | Davide | 0.8284 | 0.6784 | 0.8804 |
| 17 | SIM | 0.8277 | 0.6959 | 0.8755 |
| 18 | Yasui Biostatistics Team | 0.8274 | 0.6825 | 0.8703 |
| 19 | SUNY-Buffalo | 0.8258 | 0.6276 | 0.8805 |
| 20 | I2R | 0.8180 | 0.6763 | 0.8449 |
| 21 | LLX | 0.8147 | 0.7343 | 0.8354 |
| 22 | BME-TMIT | 0.7992 | 0.6406 | 0.8567 |
| 23 | Watson | 0.7977 | 0.5488 | 0.8725 |
| 24 | ErasmusMC | 0.7969 | 0.6369 | 0.8424 |
| 25 | NJU-NLP Group | 0.7950 | 0.5846 | 0.8558 |
| 26 | bozyurt | 0.7916 | 0.5643 | 0.8686 |
| 27 | Dharmendra | 0.7900 | 0.5758 | 0.8618 |
| 28 | strappa | 0.7822 | 0.5662 | 0.8810 |
| 29 | UMN | 0.7741 | 0.5540 | 0.8525 |
| 30 | cl.naist.jp | 0.7657 | 0.5571 | 0.8710 |
| 31 | MITRE | 0.7455 | 0.5097 | 0.8574 |
| 32 | BAL | 0.7341 | 0.5936 | 0.8106 |
| 33 | Delbecque | 0.7246 | 0.5682 | 0.7960 |
| 34 | ravim | 0.7145 | 0.5522 | 0.7874 |
| 35 | MIRACLE | 0.7067 | 0.5109 | 0.8000 |
| 36 | Magne Rekdal | 0.6869 | 0.5238 | 0.7749 |
| 37 | cu_nlp | 0.6865 | 0.5052 | 0.7894 |
| 38 | CNRC | 0.6823 | 0.5089 | 0.8140 |
| 39 | ARAMAKI | 0.6786 | 0.4459 | 0.7881 |
| 40 | SINAI | 0.6719 | 0.5248 | 0.7590 |
| 41 | Hirukote | 0.6556 | 0.4770 | 0.7807 |
| 42 | MERLIN | 0.5768 | 0.3345 | 0.7373 |
| 43 | ILMA | 0.3905 | 0.3351 | 0.5099 |
| 44 | CLaC | 0.1541 | 0.1918 | 0.4545 |
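
The micro- and macro-averaged F-1 columns differ in how per-code counts are pooled. The sketch below follows the standard textbook definitions, assuming each document has a set of gold ICD-9-CM codes and a set of predicted codes. The function names and the toy data are hypothetical, and the challenge's official scorer (including its cost-sensitive measure, which is not reproduced here) may differ in details such as how rarely seen codes are handled.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F-1 from true positive, false positive and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, predicted):
    """gold, predicted: lists of code sets, one set per document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        for code in g & p:   # predicted and correct
            tp[code] += 1
        for code in p - g:   # predicted but not in the gold set
            fp[code] += 1
        for code in g - p:   # in the gold set but missed
            fn[code] += 1
    codes = set(tp) | set(fp) | set(fn)
    # Micro: pool the counts over all codes, then compute a single F-1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F-1 per code, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in codes) / len(codes) if codes else 0.0
    return micro, macro


# Toy data for illustration only (not drawn from the challenge corpus).
gold = [{"786.2"}, {"786.2", "780.6"}, {"599.0"}, {"786.2"}]
predicted = [{"786.2"}, {"786.2"}, {"593.70"}, {"786.2"}]
micro, macro = micro_macro_f1(gold, predicted)
print(f"micro={micro:.4f} macro={macro:.4f}")
```

Because frequent codes dominate the pooled counts, micro-averaging rewards accuracy on common codes, while macro-averaging weights every code equally. That is consistent with the macro-averaged column being lower than the micro-averaged column for nearly every team in the table.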