Automatic Curation of the Catalog of GWAS

Project PI: Chun-Nan Hsu

Members: Ai He (USC), Shitij Bhargava, Gordon Lin, Priyanka Ganapathi, Suvir Jain

Sponsor: NHGRI/NIH 5U01HG006894

Project Period: 09/24/2012 – 06/30/2015

The Catalog of GWAS is an important resource containing published associations between SNPs and phenotypes identified by Genome-Wide Association Studies (GWAS), a well-defined study approach.

However, curation of the catalog is currently performed by expert curators. Although this ensures quality, new GWAS publications appear faster than any human curation team can handle. This project aims to solve this problem by applying information extraction techniques from text mining.

Useful links

Data sets

  • Original PDFs of the articles (1,382 PDFs, 600+ MB in total even when compressed)
  • Sampled PDFs of the articles (81 PDFs for gene and disease mention detection)

Detection on sampled data sets

  • Gene mention detection ("gold standards" not yet validated)
    • BioCreative 2 gold standard: bc2GNandGMgold_Subs.tar.gz http://sourceforge.net/projects/biocreative/
    • Perl script for evaluation https://github.com/bioinformatics-ua/gimli/blob/master/resources/evaluation/bc2gm/alt_eval.perl
  • Disease mention detection (Same as the above)
  • Supporting Resources
    • A collective list of diseases and traits (306,478 entries in total)
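
As a rough illustration of how detected gene or disease mentions might be scored against a gold standard, the sketch below computes precision, recall, and F1 using exact span matching. It is not the linked Perl evaluation script, and it may differ from it (for example, it does not credit alternative mention boundaries). The `Mention` layout (document id, start offset, end offset), the function name `evaluate_mentions`, and the sample data are all hypothetical and chosen only for this example.

```python
from typing import Set, Tuple

# Hypothetical mention layout: (document id, start offset, end offset).
Mention = Tuple[str, int, int]


def evaluate_mentions(gold: Set[Mention],
                      predicted: Set[Mention]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span matching."""
    # A prediction counts as correct only if it matches a gold mention exactly.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example data, for illustration only.
gold = {("doc1", 0, 5), ("doc1", 10, 15), ("doc2", 3, 8)}
predicted = {("doc1", 0, 5), ("doc2", 3, 8), ("doc2", 20, 25), ("doc3", 1, 4)}

precision, recall, f1 = evaluate_mentions(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching is the strictest choice and tends to understate performance when mention boundaries are ambiguous; a more lenient scheme would need an explicit rule for partial overlaps.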

Meeting dates 2014

  • 11:00am Every Wednesday