Cocoa: Compact cover annotator for biological noun phrases

Cocoa is a dense annotator for biological text. It annotates:
  • Macromolecules: cytochrome P-450
  • Chemicals: N(omega)-nitro-L-arginine methyl ester (L-NAME)
  • Protein/DNA parts, mutations: Lys-23, E30K
  • Complexes: Axin signaling complex, spliceosome
  • Molecules: chromophore, photon
  • Molecular part: group, moiety
  • Geometrical part: loop, cleft
  • Categories: donor, acceptor
  • *States: thermal stability, transition state, closed state
  • Organisms: rat, sauropod, Ectothiorhodospira halophila
  • Processes: protonation, cleavage
  • Anatomical parts: hindlimb
  • Locations: intracellular, dorsal
  • Physiological terms: stress, disease, gestation
  • Parameters: systemic arterial pressure, open probability
  • Values: 1 mM
  • Techniques: spectroscopy, patch clamping
  • Procedures: resection, hysterectomy
  • Food: bread
  • Habitats: ocean, meadow
  • Institutions: University of Rochester
  • Profession(al)s: pulmonologists, acolytes
Cocoa provides annotations for nested entities (for example, liver in the disease term liver cancer), as well as fine-grained subcategorization of anatomical entities, diseases and organisms through an extended annotation interface (details here).
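
To make the idea of nested entities concrete, here is a minimal sketch of how overlapping annotations could be represented as character-offset spans. The `Span` class, the `nested_in` helper and the label strings are hypothetical and chosen for illustration only; they are not Cocoa's actual output format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A hypothetical annotation: character offsets plus a category label."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # assumed label names, not Cocoa's real tag set


def nested_in(inner: Span, outer: Span) -> bool:
    """Return True if `inner` lies inside `outer` and is not the same range."""
    contained = outer.start <= inner.start and inner.end <= outer.end
    same_range = (inner.start, inner.end) == (outer.start, outer.end)
    return contained and not same_range


text = "liver cancer"
disease = Span(0, 12, "Physiological term")  # the whole disease term
organ = Span(0, 5, "Anatomical part")        # "liver", nested inside it

print(text[organ.start:organ.end], "is nested:", nested_in(organ, disease))
```

Keeping both spans, rather than only the longest one, is what lets a consumer ask either for diseases or for the anatomical entities mentioned inside them.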

The bad parts:
  • Proteins and DNA are not distinguished
  • Certain entities are tagged in the same document as both molecules and proteins (this is a bug)
  • Many adjectival processes ("photobleached") are not yet tagged as states
  • Generally, the tagging of states is quite incomplete
  • The boundary between molecular parts and geometrical parts is still evolving and may be inaccurate for some entities
  • Determiners and generalized quantifiers are not tagged.
  • Authors' names in inline references and company/institution names are often tagged as proteins/chemicals
  • Quotes are not properly handled.
  • Entities separated by a slash ("/") are sometimes handled improperly.
  • Physiological processes, conditions and disease symptoms are all lumped together. This is partly handled in the extended annotation interface.
  • Cellular parts, organs, tissues and anatomical structures are not differentiated. Again partly handled in the extended annotation interface.
  • Foods are not disambiguated from organisms.
  • and many many more ...
The ugly parts:
  • Coverage of general biological phenomena such as evolution is poor.
  • Undefined acronyms are often tagged arbitrarily (but see the WebAPI for a hack).


Cocoa has been validated against parts of the following corpora: the Cellfinder corpus, the SCAI corpus, the Nagel corpus, the Protein Residue Relations Silver Corpus, the Protein Residue Full Text Corpus, the Organism Tagger annotated corpora, the GENIA term corpus, the iProLink corpora, the Bioinfer corpus, the AnEM corpus, the Adverse effects corpus, and 100 abstracts for Kir channels annotated for quantitative values (publicly available soon).
Evaluation of protein/protein_part/cellular component/cell/disease annotations from Cocoa against the Colorado Richly Annotated Full Text Corpus (CRAFT), the Anatomical Entity Mention (AnEM) corpus and the Arizona Disease corpus is here.
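
For readers unfamiliar with how an annotator is usually scored against a gold-standard corpus, the following is a minimal sketch of exact-match precision, recall and F1 over annotation spans. It assumes annotations are compared as (start, end, label) tuples; the function name `prf` and the sample spans are made up for illustration, and the linked evaluation may use a different matching criterion.

```python
def prf(gold: set, predicted: set) -> tuple:
    """Exact-match precision, recall and F1 between two sets of spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans: (start offset, end offset, label)
gold = {(0, 12, "disease"), (20, 25, "protein")}
predicted = {(0, 12, "disease"), (30, 35, "protein")}

precision, recall, f1 = prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice; evaluations of nested or boundary-sensitive entities often also report relaxed (overlap-based) matching.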

©2012 - NPjoint