Cocoa is a dense annotator for biological text. It annotates:
- Macromolecules: cytochrome P-450
- Chemicals: N(omega)-nitro-L-arginine methyl ester (L-NAME)
- Protein/DNA parts, mutations: Lys-23, E30K
- Complexes: Axin signaling complex, spliceosome
- Molecules: chromophore, photon
- Molecular part: group, moiety
- Geometrical part: loop, cleft
- Categories: donor, acceptor
- *States: thermal stability, transition state, closed state
- Organisms: rat, sauropod, Ectothiorhodospira halophila
- Processes: protonation, cleavage
- Anatomical parts: hindlimb
- Locations: intracellular, dorsal
- Physiological terms: stress, disease, gestation
- Parameters: systemic arterial pressure, open probability
- Values: 1 mM
- Techniques: spectroscopy, patch clamping
- Procedures: resection, hysterectomy
- Food: e.g., bread
- Habitats: e.g., ocean, meadow
- Institutions: e.g., University of Rochester
- Profession(al)s: e.g., pulmonologists, acolytes
Cocoa provides annotations for nested entities (liver in the disease term liver cancer) as well as fine-grained subcategorization of anatomical entities, diseases and organisms through an extended annotation interface (details here).
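One way to picture nested annotations is as character-offset spans, where an inner span lies within an outer one. The sketch below is illustrative only: the `Span` class, the `nested_pairs` helper and the category labels are assumptions made for this example, not Cocoa's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical annotation record (not Cocoa's real schema)."""
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    category: str   # label chosen for illustration


def nested_pairs(spans):
    """Return (outer, inner) pairs where inner lies within a longer outer span."""
    pairs = []
    for outer in spans:
        for inner in spans:
            if inner is outer:
                continue
            contained = outer.start <= inner.start and inner.end <= outer.end
            shorter = (inner.end - inner.start) < (outer.end - outer.start)
            if contained and shorter:
                pairs.append((outer, inner))
    return pairs


text = "liver cancer"
spans = [
    Span(0, 12, "Physiological term"),  # "liver cancer" (disease term)
    Span(0, 5, "Anatomical part"),      # "liver"
]

pairs = nested_pairs(spans)
outer, inner = pairs[0]
print(text[outer.start:outer.end], "contains", text[inner.start:inner.end])
# prints: liver cancer contains liver
```

Keeping both spans, rather than only the longest match, is what lets a consumer retrieve either the disease term or the anatomical entity inside it.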
The bad parts:
- Proteins and DNA are not distinguished
- Certain entities are tagged in the same document as both molecules and proteins (this is a bug)
- Many adjectival processes ("photobleached") are not yet tagged as states
- Generally, the tagging of states is quite incomplete
- The boundary between molecular parts and geometrical parts is evolving, and possibly is inaccurate for some entities
- Determiners and generalized quantifiers are not tagged.
- Authors' names in inline references and company/institution names are often tagged as proteins/chemicals
- Quotes are not properly handled.
- Entities separated by a slash ("/") are sometimes handled improperly.
- Physiological processes, conditions and disease symptoms are all lumped together. This is partly handled in the extended annotation interface.
- Cellular parts, organs, tissues and anatomical structures are not differentiated. Again partly handled in the extended annotation interface.
- Foods are not disambiguated from organisms.
- and many many more ...
The ugly parts:
- Coverage of general biological phenomena such as evolution is poor.
- Undefined acronyms are often tagged arbitrarily (but see the WebApi for a hack).
Cocoa has been validated against parts of the following corpora: the Cellfinder corpus, the Protein Residue Relations Silver Corpus, the Protein Residue Full Text Corpus, the Organism Tagger annotated corpora, the GENIA term corpus, the Adverse effects corpus and 100 abstracts for Kir channels annotated for quantitative values (publicly available soon).
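The text does not say how the validation was scored. A common convention for comparing an annotator against a gold-standard corpus is exact-match precision, recall and F1 over (start, end, category) triples. The sketch below assumes that convention; the `score` function, the offsets and the labels are made up for illustration and are not taken from any of the corpora above.

```python
def score(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, category) triples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard and predicted annotations for one document.
gold = {(0, 5, "organism"), (10, 22, "protein"), (30, 34, "value")}
predicted = {
    (0, 5, "organism"),
    (10, 22, "chemical"),   # right boundaries, wrong category: counts as an error
    (30, 34, "value"),
    (40, 45, "process"),    # spurious annotation
}

precision, recall, f1 = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: a span with the right boundaries but the wrong category is counted as both a false positive and a false negative. Relaxed schemes that credit overlapping spans are also in use and would yield higher scores on the same data.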
Evaluation of protein/protein_part/cellular component/cell/disease annotations from Cocoa against the Colorado Richly Annotated Full Text Corpus (CRAFT), the Anatomical Entity Mention (AnEM) corpus and the Arizona Disease corpus