Thursday, June 21, 2012

Plotting differentially methylated bases on an ideogram

Ideogram is a popular way to show genomic features in relation to chromosome sizes and chromosomal location in bulk. Recently, we have shown differentially methylated CpGs on a chromosomal ideogram in our PLoS Genetics paper. Although it doesn't always make sense to show differentially methylated bases on an ideogram, I'm describing a function below that does a similar job.
The function can be used to plot ideograms for differentially methylated CpGs. The function requires methylKit, GenomicRanges and ggbio packages. GenomicRanges and ggbio are available from bioconductor

First, we need to get chromosome lengths that are needed to define chromosome boundaries. I used BSgenome.Hsapiens.UCSC.hg18 package from Bioconductor to get the chromosome lengths of hg18 assembly.
library(BSgenome)
library("BSgenome.Hsapiens.UCSC.hg18")
chr.len = seqlengths(Hsapiens)  # get chromosome lengths
# remove X,Y,M and random chromosomes
chr.len = chr.len[grep("_|M|X|Y", names(chr.len), invert = T)] 
Now, we need to download a methlyDiff object to be plotted. This can be any methylDiff object from methylKit.

download.file("http://methylkit.googlecode.com/files/myDiff.rda", 
    destfile = "myDiff.rda")
load("myDiff.rda")
Next, we plot the ideogram using the chromosome length and methylDiff object. “difference” is the percent methylation difference threshold and “qvalue” is the qvalue threshold, both used to define differentially methylated bases in the function. The differentially methylated bases satisfying “difference” and “qvalue” thresholds are plotted in a color coded manner in the ideogram.
ideoDMC(myDiff, chrom.length = chr.len, difference = 25, qvalue = 0.01, 
    circos = FALSE, title = "test", hyper.col = "magenta", hypo.col = "green")

We can also plot a circular ideogram similar to circos plots using the same function.
ideoDMC(myDiff, chrom.length = chr.len, difference = 25, qvalue = 0.01, 
    circos = TRUE, title = "test", hyper.col = "magenta", hypo.col = "green")

30 comments:

  1. This is very nice. Cannot get the second plot to work though"
    Error in .local(data, ...) : label must be one of column names


    > sessionInfo()
    R version 2.15.0 (2012-03-30)
    Platform: i386-pc-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
    [4] LC_NUMERIC=C LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] BSgenome.Hsapiens.UCSC.hg18_1.3.17 BSgenome_1.24.0 Biostrings_2.24.1
    [4] ggbio_1.4.6 ggplot2_0.9.1 GenomicRanges_1.8.7
    [7] IRanges_1.14.4 BiocGenerics_0.2.0 methylKit_0.5.2

    loaded via a namespace (and not attached):
    [1] AnnotationDbi_1.18.1 Biobase_2.16.0 biomaRt_2.12.0 biovizBase_1.4.2
    [5] bitops_1.0-4.1 cluster_1.14.2 colorspace_1.1-1 data.table_1.8.0
    [9] DBI_0.2-5 dichromat_1.2-4 digest_0.5.2 GenomicFeatures_1.8.2
    [13] grid_2.15.0 gridExtra_0.9 Hmisc_3.9-3 labeling_0.1
    [17] lattice_0.20-6 MASS_7.3-19 Matrix_1.0-7 memoise_0.1
    [21] munsell_0.3 parallel_2.15.0 plyr_1.7.1 proto_0.3-9.2
    [25] RColorBrewer_1.0-5 RCurl_1.91-1.1 reshape2_1.2.1 Rsamtools_1.8.5
    [29] RSQLite_0.11.1 rtracklayer_1.16.2 scales_0.2.1 snpStats_1.6.0
    [33] splines_2.15.0 stats4_2.15.0 stringr_0.6 survival_2.36-14
    [37] tools_2.15.0 VariantAnnotation_1.2.9 XML_3.9-4.1 zlibbioc_1.2.0

    ReplyDelete
  2. Replies
    1. Works a treat! Thanks for taking the time to make this example available, great worked examples like this are extremely helpful and benefit many people, thanks.

      Delete
  3. Thanks for this
    My data is not RRBS so I haven't been able to make use of methlykit but this is a great graphic

    ReplyDelete
    Replies
    1. You should be able to produce a similar plot for arrays, you need to change the functions slightly.

      Delete
    2. It is actually WGSBS data I am working with. The package & your PLOS Genetics paper are good stuff

      Delete
    3. I see, I think you should be able to use the methylKit with WGSBS it will be slower since there are more bases covered. For the ideogram, I don't know how the plotting packages will handle large number of points, that might be slow as well.

      Delete
  4. This is a very nice tool, congratulations!
    How can I use it for 450k bead array data?
    Thanks!

    ReplyDelete
  5. Thanks! methylKit won't work with beadarray data out of the box, but there are other R packages designed to work with array methylation data. But for the specific purpose of making ideograms, you can change the function above, to use GRanges objects instead of methylKit objects and plot methylation percentages or difference for the probes. You need to organize probe locations and scores you want to plot as a GRanges object (from GenomicRanges package) and change the code above to plot methylation information from arrays

    ReplyDelete
    Replies
    1. Thank you l2n, I will try this.
      I've been working with other packages for bead array methylation, but they don't offer many possibilities of visualization. However, I can get a GRanges object after differential methylation. How can I convert this to a methylDiff object to continue my analyses on Methylkit?
      Thanks for your time!

      Delete
    2. Hi,
      You have to create a data frame with looking like the following structure and column names(ignore the row numbers):
      id chr start end strand pvalue qvalue
      1 chr21.10011833 chr21 10011833 10011833 + 2.453779e-04 8.543092e-04
      2 chr21.10011841 chr21 10011841 10011841 + 1.909584e-14 6.049801e-13
      3 chr21.10011855 chr21 10011855 10011855 + 3.292371e-10 4.579307e-09
      4 chr21.10011858 chr21 10011858 10011858 + 7.585884e-01 5.921730e-01
      5 chr21.10011861 chr21 10011861 10011861 + 8.445194e-09 8.162676e-08
      6 chr21.10011872 chr21 10011872 10011872 + 3.821215e-04 1.238123e-03
      meth.diff
      1 10.590278
      2 46.670505
      3 22.641509
      4 2.040816
      5 41.197462
      6 16.476642


      then you can create a methylDiff object as follows (assuming my.df contains the data frame structured as shown above):

      methylDiff.obj=new("methylDiff",my.df,sample.ids=c("test1","control1"),assembly="hg18",context="CpG", treatment=c(1,0),destranded=FALSE,resolution="base")

      Delete
  6. Thanks for the codes to visualize such great plots.
    My question is about the data structure above (for 450k).
    Is the meth.diff column the average of all case-control differences?
    Or can I make each paired differences separately and plot it? (For example, 20 matched pairs makes 20 difference data points in the plot)

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. you can try the change the "size" argument inside the function, right now, size=1, you can try to make it size=0.5. You can also try to increase the cutoffs to see if you get better separation of points with more stringent cutoffs.

      Are you using the same dataset as in the example, or different?

      Delete
    2. altuna, thanks a lot for your prompt reply. I was using the same dataset, and the same cutoffs as your example. I will try changing size to 0.5 and see if I can get similar result.
      In addition, the figure in your Plos Genetics paper shows nice banding patterns on the chromosomes. Is it easy to add this feature to the plots?

      Delete
    3. There might be something changed with the ggbio package, I will check and if there is a change I will put an update here when I have time. ggbio should also be able to draw chromosome bands but I haven't tried that feature with scores yet.

      Delete
  8. Hi l2n
    Great function. Thank you for sharing.
    I'm trying to generate a methyl.diff object from a dataframe (as suggested above):

    id <- c('chr21.10011833','chr21.10011841','chr21.10011855','chr21.10011858')
    chr <- c('chr21','chr21','chr21','chr21')
    start <- c('10011833','10011841','10011855','10011858')
    end <- c('10011834','10011842','10011856','10011859')
    strand <- c('+','+','+','-')
    pvalue <- c('2.453779e-04','1.909584e-14','3.292371e-10','7.585884e-01')
    qvalue <- c('8.543092e-04','6.049801e-13','4.579307e-09','5.921730e-01')
    meth.diff <- c('10.590278','46.670505','22.641509','2.040816')

    my.df<-data.frame(id, chr, start, end, strand, pvalue, qvalue, meth.diff)
    my.df

    methylDiff.obj=new("methylDiff",my.df,sample.ids=c("test1","control1"),
    assembly="hg18",context="CpG", treatment=c(1,0),
    destranded=FALSE,resolution="base")

    methylDiff.obj

    ideoDMC(methylDiff.obj, chrom.length = chr.len, difference = 25, qvalue = 0.01,
    circos = FALSE, title = "test", hyper.col = "magenta", hypo.col = "green")

    but I get some errors when trying to plot:

    Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
    solving row 1: range cannot be determined from the supplied arguments (too many NAs)
    In addition: Warning messages:
    1: In Ops.factor(.Object$qvalue, qvalue) : < not meaningful for factors
    2: In Ops.factor((.Object$meth.diff), -1 * difference) :
    < not meaningful for factors
    3: In Ops.factor(.Object$qvalue, qvalue) : < not meaningful for factors
    4: In Ops.factor((.Object$meth.diff), difference) :
    > not meaningful for factors

    Have you seen this before? Any suggestions are greatly appreciated.
    Thanks

    ReplyDelete
    Replies
    1. remove the single quotes around the numbers in the start and end vectors, that should work

      Delete
    2. Hi altuna

      Thanks for the tip; works great.
      However, while the linear version works well (circos=FALSE), the circular version (circos=TRUE) misplaces the location of values in chromosomes. For instance, the example code above places the
      values in chr22 (they should be displayed in chr21).
      Any thoughts?

      Thank you

      Delete
    3. Hi Enrique,
      I couldn't have a look at this yet. If you figured out what is wrong please share. I will post an update when I have time to re-run the code.

      Delete
    4. This comment has been removed by the author.

      Delete
  9. Maybe, this can help you Enrique,

    Try adding this lines under g.per object definition:

    > seqlengths(g.per) = (chrom.length)

    (Do the same for g.po object)

    Good luck,

    ReplyDelete
    Replies
    1. Thank you, I updated the code based on your and Shannon's suggestions

      Delete
  10. I found that in updating GenomicRanges (v1.14.3), I ran into a problem that I believe is due to the depreciation of keepSeqlevels. When I had chromosomes that did not have any points to be plotted, this was not being recognized. I made the following changes to the code above to make the seqlengths/values consistent between all objects and figured I'd share:

    #original code
    g.per = as(hyper, "GRanges")
    g.per = keepSeqlevels(g.per, names(chrom.length))

    #my changes
    g.per = as(hyper, "GRanges")
    seqlevels(g.per, force=TRUE) <- seqlevels(myIdeo)
    seqlengths(g.per)=(chrom.length)

    [did the same for g.po]

    I'm sure there's a better way (feel free to share!), but this is working for me currently.

    ReplyDelete
    Replies
    1. Thank you, I updated the code based on your suggestions

      Delete
  11. Great post! Thank you so much for sharing..

    For those who want to learn R Programming, here is a great new course on youtube for beginners and Data Science aspirants. The content is great and the videos are short and crisp. New ones are getting added, so I suggest to subscribe.

    https://www.youtube.com/watch?v=BGWVASxyow8&list=PLFAYD0dt5xCzTQHDhMPZwBoaAXWeVhZzg&index=19

    ReplyDelete
  12. I cannot understand the code... Is ideoDMC a function created by ourself? I just cannot find this function...

    ReplyDelete