Monday, May 4, 2015

Looking for CpG site enrichment

I had a seemingly easy bioinformatics task - find out if a given set of genomics region are enriched for some genomic feature - like CTCF binding (in comparison to a background set).

The below is from fumbling through a bunch of help pages online.. the first page was this one that lists a bunch of tools (I ended up using bedtools fisher)
  1. dump out genomic locations for target and background set (I was using CpG sites in gene bodies on the 450k array).
    1. sort the bed files using ./sortbed -i from bedtools. 
  2. create a genome file and sort it: "
    1. mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg19.chromInfo"  > hg19.genome 
      • thanks to Samad's post
    2. then sort that file with plain sort
  3. download the tract of interest from usc genome table browser in bed format (from biostars)
    1. http://genome-test.cse.ucsc.edu/cgi-bin/hgTables?command=start
    2. select bed format
    3. sort the bed files using ./sortbed -i from bedtools again
  4. run 
    1. bedtools fisher -a target -b download_tract -g hg19.sorted.genome
    2. bedtools fisher -a background -b download_tract -g hg19.sorted.genome
    3. use the output to compute a p-value that takes into account the background results (using hypergeometric test)
It would be nice if there was an app that would do this for me, and iterate a bunch of tracts. There is tools like GREAT but those seem to be gene centric.

This was in part inspired by this work on distal SNP-CpG relations - Lemire et al. describe there methods in the "SNP and CpG site annotations" section.

This Galaxy based tool - genomic hyperbrowser might be an easier way of doing things.

Update: the Forge tool seems to do what I want but for DNAse I sites only