Why bother with ggplot2 extension for GWAS summary statistic data visualisations?

This vignette lays out the motivation why having a ggplot2 extension is needed. But it also discusses existing alternatives and ways to create a ggplot2-like Q-Q plot and Manhattanplot.

GWAS data visualisations

There are two main data visualisations1 that are done after the GWAS is run2:

  • Q-Q plot: checks the deviation of the P-value distribution from the uniform distribution.

  • Manhattan plot: shows the P-value or summary statistic distribution along the chomosomal position.

QQplot

Manhattan Plot.png
By M. Kamran Ikram et al - Ikram MK et al (2010) Four Novel Loci (19q13, 6q24, 12q24, and 5q14) Influence the Microcirculation In Vivo. PLoS Genet. 2010 Oct 28;6(10):e1001184. doi:10.1371/journal.pgen.1001184.g001, CC BY 2.5, Link

Alternatives

Both plotting types can be applied in R with various functions. Just to name a few, here are some links which have also served as inspiration:

An often used package is qqman, that has a function for Q-Q plots qqman::qq and Manhattan plots qqman::manhattan.

And there are various ggplot2 wrappers (ggbio, ggman), but not actual extensions.

Reasons

The problem is two fold:

  1. GWAS summary statistics usually contains around 1 mio of rows, so plotting becomes a lengthy process.
  2. All existing packages use the base plotting functions, making it difficult to apply any functionalities that ggplot2 can (e.g. facetting, colouring).

The idea is therefore to create a ggplot2 extension that would facilitate all this: 1) fast plotting (through hexagons and filtering) and 2) building ggplot2 geoms and stats.

By writing a ggplot2 extension, we can inherit lots of the default ggplot2 functionalities and shorten the input.

Ways of making a ggplot2-like Q-Q plot plot

Let’s say we have GWAS summary statistics for a number of SNPs. We call this data gwas.summarystats: for a number of SNPs listed by row, we know the SNP identifier (SNP) and the P-value (P).

SNP     P
rs3342  1e-2
rs83    1e-2
...     ...

For illustration purposes, we will use the summary statistics dataset GWAS.utils::giant.

1. qqman::qq

qqman::qq is straightforward to use: just pass on the P-value column.

qqman::qq(giant$P)

2. ggplot2::geom_qq

There is a geom_qq in ggplot2 that implements quantile-quantile plots. However, this is only useful, if we can transform the axes with -log10(). How to create custom made ggplot2 scales is explained here.


library(ggplot2)
theme_set(theme_bw())

## source: 
(gg <- ggplot(data = giant) + 
  geom_qq(aes(sample = P), distribution = stats::qunif)) +
    theme(aspect.ratio = 1)

3. ggplot2::geom_point


N <- nrow(giant)
expected <- (1:N) / N - 1 / (2 * N)
giant <- giant %>% dplyr::arrange(P) %>% dplyr::mutate(P_expected = expected)

ggplot(data = giant) + 
  geom_point(aes(-log10(P_expected), -log10(P))) + 
  geom_abline(intercept = 0, slope = 1) + 
  theme(aspect.ratio = 1) + 
  xlab(expression(Expected ~ ~-log[10](italic(p)))) + ## from qqman::qq
  ylab(expression(Observed ~ ~-log[10](italic(p)))) ## from qqman::qq




## or with the scale_y_continous feature
ggplot(data = giant) +
geom_point(aes(P_expected, P)) + 
  scale_y_continuous(trans = mlog10_trans) + 
  scale_x_continuous(trans = mlog10_trans) + 
  geom_abline(intercept = 0, slope = 1) + 
  theme(aspect.ratio = 1) #+ ## square shaped

4. ggGWAS::geom_gwas_qq

ggplot(data = giant) + geom_gwas_qq(aes(y = P))

Here is a whishlist of what a geom_gwas_qq function should be able to do:

  • include correct labels (expected as x-label and observed as y-label)
  • make sure color, group, facetting all works
  • allow for the raster version (for faster plotting), hexbins (also for faster plotting) and Pvalue thresholding (removing the high Pvalue SNPs from the plot)
  • if time: implement genomic inflation factor representation
  • while we are at it: plotting box should be squared (theme(aspect.ratio = 1))

Ways of making a ggplot2-like Manhattan plot

1. qqman::manhattan

qqman::manhattan(giant, chr = "CHR" , bp = "POS", p = "P")

2. ggplot2::geom_point


## computing new x axis
giant <- giant %>% dplyr::arrange(CHR, POS) %>% dplyr::mutate(tmp = 1, cumsum.tmp = cumsum(tmp))

## calculating x axis location for chromosome label
med.dat <- giant %>% dplyr::group_by(CHR) %>% dplyr::summarise(median.x = median(cumsum.tmp))

ggplot(data = giant) + 
  geom_point(aes(x = cumsum.tmp, y = -log10(P), color = CHR %%2 == 0)) + ## create alternate coloring
  geom_hline(yintercept = -log10(5e-8)) + ## add horizontal line
  scale_x_continuous(breaks = med.dat$median.x, labels = med.dat$CHR) + ## add new x labels 
  guides(colour=FALSE) +  ## remove legend
  xlab("Chromosome") + 
  ylab(expression(-log[10](italic(p)))) + ## y label from qqman::qq
  scale_color_manual(values = c(gray(0.5), gray(0))) ## instead of colors, go for gray

3. ggGWAS::geom_gwas_manhattan

ggplot(data = giant) + geom_gwas_manhattan(aes(pos = POS, chr = CHR, y = -log10(P)))

Here is a whishlist of what a geom_gwas_manhattan function should be able to do:

  • x axis spacing with space between chromosome and spaced as with position
  • include correct y axis labels
  • make sure color, group, facetting all works
  • allow for the raster version (for faster plotting) and Pvalue thresholding (removing the high Pvalue SNPs from the plot)
  • implement coloring (two alternating colors)

Misc

Testing

How to test plots?

One option is, to compare ggplot2 object data. In the example below, we are comparing two ggplot2 outputs, one created with qplot and one with ggplot.

gg1 <- qplot(Sepal.Length, Petal.Length, data = iris)
gg2 <- ggplot(data = iris) + geom_point(aes(Sepal.Length, Petal.Length))
identical(gg1$data, gg2$data)

We can apply this to our package by creating the qqplot and manhattanplots manually by hand, and then comparing the to the function outputs.

Another option is to use https://github.com/lionel-/vdiffr


  1. There are also other tools used, such as locuszoom plots, but these are harder to implement because they need access to external information, and we will skip these for now.

  2. We will not discuss their usefulness here.