Locuszoom plot of GWAS summary statistics

using the biomaRt R-package

In the world of genome-wide association studies (GWAS), we often get a list of genetic markers (SNPs) that seem for some reason relevant for a particular outcome. At the same time, we have little knowledge about these genetic variants that come in cryptic combinations of characters and numbers.

For example:

  • We might get asked how frequent the SNP rs1421085 is in a range of populations.
  • We need to extract all known SNPs that are within 1 Mb of rs1421085.
  • A genomic region turns out to be highly relevant for some disease, and we want to know all genes contained in that genomic region.

Unless you sit on lots of genetic data, the list of SNPs come from a summarised form. Minimally, these summary statistic datasets contain the SNP identifier (SNPNAME), the effect size (beta) and the standard error (se). Sometimes the position (POS) and chromosome (CHR) is provided instead of the SNP identifier, and sometimes both are available. Then there is usually other information coming from the study data; for example, the allele frequency in the study this is shown on the LHS in the table below. However, hardly ever do the datasets contain annotation, such as the gene where the SNP resides, the phenotypes it is associated with or the minor allele frequency (MAF) in a specific population this is illustrated in the green coloured part in the table below.

GWAS summary statistics dataset
Annotation
CHR POS SNPNAME REF ALT beta se Gene Linked phenotype Global MAF
7 75163169 rs1167827 A G 0.028 0.0032 HIP1 BMI 0.4720450
12 122781897 rs11057405 G A -0.021 0.0037 CLIP1 BMI 0.0327476
10 114758349 rs7903146 C T -0.031 0.0023 TCF7L2 BMI 0.2278350
1 49589847 rs657452 A G -0.015 0.0025 AGBL4 BMI 0.4532750
1 49589847 rs657452 A G -0.015 0.0025 AC099788.1 BMI 0.4532750

This makes sense. Genetic data from cohorts will be used for years and won’t change, but the annotation will change over the years. Take the chromosomal position of SNPs that changes with every new human genome reference assembly.

To get our hands on annotation, we need to consult external databases that are - lucky us! - public. But first, let us define what we want to see at the end of the blog post.

Goal

For this blog post, we keep it simple and only focus on the genes contained in a genomic region, but this can be easily extended to any other annotation available in databases.

In the illustration below1, we want to know the gene starting and end positions around one particular gene region to create an informative locuszoom plot.

Drawing of a table.
In this case, we want to visualise the GWAS P-value (-log10(P-value)) of a genomic region (with each point representing a SNP), with the corresponding genes at the bottom. This is similar to a plot done with LocusZoom, a tool that takes summary statistics as input and outputs a pretty graph for a desired genomic region, including gene annotation, LD information and more.

Getting it done

There are currently more than 10 Mio SNPs known, and knowing their functions and genes by heart would equal to some superpower. Which is why we 2 public databases, such as dbSNP and ensembl3.

What we like even more, is, to make our analyses reproducible and automate annotation and lookups.

In this blog post, I will show how to zoom into GWAS results and annotate the plot based on the information about that genomic region using R.

Here is the plan:

  1. Get summary statistics
  2. Extract annotation
  3. Combine summary statistics and annotation

Before tackling the first item, we want to have all R-packages installed & ready to use:

## Packages needed to run code
## ----------------------------
# install.packages("dplyr")     ## data manipulation
# install.packages("data.table") ## read data, fast!
# install.packages("forcats")   ## dealing with factors
# install.packages("ggplot2")   ## dataviz
# install.packages("magrittr")  ## piping
# install.packages("metafolio")  ## colorpalette
# install.packages("skimr")     ## summarising data
# install.packages("qqman")     ## Manhattan plot
# install.packages("patchwork") ## assembling plots

# source("https://bioconductor.org/biocLite.R")
# biocLite("biomaRt")           ## annotation

## Optional packages for Rmd
## --------------------------
# install.packages("kableExtra") ## making pretty tables
# install.packages("devtools")
# devtools::install_github("hadley/emo")        ## emojis
# devtools::install_github("ropenscilabs/icon") ## icons

1. Get summary statistics

First, we need some GWAS summary statistics.

There are lots of resources for publicly available GWAS summary statistics.

We will look at BMI, because it can be accessed easily4 and because it is relatively small for a genomic dataset. The data is from the Genetic Investigation of ANthropometric Traits (GIANT) consortium. You can download the dataset5 here or load it directly into R.

## Data Source URL
url <- "https://portals.broadinstitute.org/collaboration/giant/images/2/21/BMI_All_ancestry.fmt.gzip"

#url <- "jenger.riken.jp/1analysisresult_qtl_download/All_2017_BMI_BBJ_autosome.txt.gz"

## Import BMI summary statistics dat.bmi <- read_tsv(file = url) ##
## taking too long, let's use fread instead.

dat.bmi <- data.table::fread(url, verbose = FALSE)

I added verbose = FALSE because it will complain that there is an unexpected character in column 1, which appears to be numerical. This is because chromosome X will only appear towards the end of the dataset.

Next, we rename some columns to something more conventional.

## Rename some columns
dat.bmi <- dat.bmi %>% rename(SNP = SNPNAME, P = Pvalue)

Now, let’s look at the data with the skimr package.

skimr::skim(dat.bmi)
## Skim summary statistics
##  n obs: 246328 
##  n variables: 10 
## 
## ── Variable type:character ──────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n min max empty n_unique
##       ALT       0   246328 246328   1   1     0        4
##       CHR       0   246328 246328   1   2     0       23
##  ExAC_MAF       0   246328 246328   1 325     0    51423
##      GMAF       0   246328 246328   1  17     0     9964
##       REF       0   246328 246328   1   1     0        4
##       SNP       0   246328 246328   1  11     0   243206
## 
## ── Variable type:integer ────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n    mean      sd    p0     p25     p50
##       POS       0   246328 246328 7.6e+07 5.7e+07 11885 3.2e+07 6.1e+07
##      p75    p100     hist
##  1.1e+08 2.5e+08 ▇▇▅▅▃▂▁▁
## 
## ── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────
##  variable missing complete      n      mean     sd        p0    p25
##      beta     977   245351 246328   0.00051   0.13 -3.2      -0.032
##         P     977   245351 246328   0.48      0.3   8.6e-269  0.21 
##        se       0   246328 246328 Inf       NaN     0.002     0.023
##       p50   p75  p100     hist
##  -0.00017 0.031   3.2 ▁▁▁▇▇▁▁▁
##   0.47    0.73    1   ▇▇▆▆▆▆▆▆
##   0.059   0.11  Inf   ▇▁▁▁▁▁▁▁

The reference allele (REF), alternative allele (ALT), SNP identifier (SNP) and chromosome (CHR) are characters. There are four unique values for ALT and REF: A, C, G, T, and 23 unique values for CHR - seems about right. The two columns with the minor allele frequencies measured in GIANT and ExAC datasets (GMAF, ExAC_MAF) are characters too because the allele is concatenated. The chromosomal position (POS) is an integer. Then there is the actual association of each SNP with BMI (beta, se, P).

This dataset has 246328 rows and 10 columns. What if we want to visualise this all at once? In particular, we are interested if there are ANY associations between the genetic markers and BMI. The mini P distribution in the skimr output does not reveal much.

Visualising associations

Visualising all associations at once can be done with a Manhattan plot, where the x-axis represents chromosomeCHR and the chromosomal position POS, and the y-axis the -log10(P)-value. Let’s use the R-package qqman6 for that.

qqman::manhattan(dat.bmi %>% mutate(CHR = as.numeric(as.character(fct_recode(CHR, "23" = "X")))) %>%   filter(-log10(P)>1), chr="CHR", bp="POS", snp="SNP", p="P", suggestiveline =FALSE, genomewideline = FALSE, chrlabs = c(1:22, "X"), cex = 0.4)

We can spot immediately, that there are loads of SNPs with P-values smaller than \(10^{-100}\) (sample size was around 700K).

Of course, there are many other solutions to spot real associations, but this is not the point of this blog post ;-).

Identify genomic region with lowest P-value

Now that we know that there are lots of genetic markers associated with BMI, we want to look at a specific genomic region and figure out what genes it contains. For illustrative purposes, we pick the genomic region with the lowest P-value.

dat.bmi.sel <- dat.bmi %>% slice(which.min(P))
dat.bmi.sel
##   CHR      POS REF ALT       SNP     GMAF ExAC_MAF  beta     se        P
## 1  16 53800954   T   C rs1421085 C:0.2286        - 0.078 0.0022 8.6e-269

SNP identifier rs1421085 that has the lowest P-value (\(P = 8.6\times 10^{-269}\)).

Now we can visualise the summary statistics of that genomic region (\(\pm 500 \cdot 10^{3}\)).

range <- 5e+05
sel.chr <- dat.bmi.sel$CHR
sel.pos <- dat.bmi.sel$POS

dat.bmi.sel.region <- dat.bmi %>% filter(CHR == sel.chr, between(POS, sel.pos - 
      range, sel.pos + range))
p1 <- ggplot(data = dat.bmi.sel.region) + 
  geom_point(aes(POS, -log10(P)), shape = 1) + 
  labs(title = "Locuszoomplot for BMI GWAS", subtitle = paste("Summary   statistics for chromosome", sel.chr, "from", format((sel.pos - range), big.mark = "'"), "to", format((sel.pos + range), big.mark = "'"), "bp"), caption = paste("Data source:", url))
print(p1)

Next, we want to know if rs1421085 is part of a gene, and if yes, which one.

2. Extracting annotation

biomaRt

Thankfully, there is an R-package called biomaRt that can do this for us.

I had only come across this package a few weeks ago, so I apologise in advance that I won’t make full use of all the features that the package offers.

biomaRt should not be confused with biomartr. biomartr is an rOpenSci package by Hajk-Georg Drost. I needed this tweet to realise that these were two separate yet related R-packages.

So far, I did not fully grasp the benefits of biomartr compared to biomaRt for annotation of human data. But I will look more into it.

I also got a tip from Marianna Foos to check out rsnps, an R-package dealing with SNP annotation.

Learning resources

I used mainly two sources to learn what I wanted to do:

Using biomaRt

First, we need to load the biomaRt package from Bioconductor.

library(biomaRt) 

Next, we specify which database to use. You can use the functions listMart(), listEnsembl() and listDatasets() to select from the right biomart and dataset. We will need to extract SNPs, hence biomart = "snp".

snp.ensembl <- useEnsembl(biomart = "snp", dataset = "hsapiens_snp")

class(snp.ensembl)
## [1] "Mart"
## attr(,"package")
## [1] "biomaRt"
# other ways of selecting the mart snp.mart <- useMart(biomart =
# 'ENSEMBL_MART_SNP', dataset='hsapiens_snp') 
# gene.mart <- useMart('ensembl', dataset='hsapiens_gene_ensembl')

Last, we extract the gene id to which our SNP belongs using the function getBM(). Along with that, we also extract other information, like the minor allele frequency minor_allele_freq. To check which attributes and filters are available, run listAttributes(snp.ensembl) and listFilters(snp.ensembl).

out.bm <- getBM(
  attributes = c("ensembl_gene_stable_id", 
                 "refsnp_id", 
                 "chr_name", 
                 "chrom_start", 
                 "chrom_end", 
                 "minor_allele", 
                 "minor_allele_freq"),
                # "ensembl_transcript_stable_id",
                # "consequence_type_tv"),
  filters = "snp_filter", 
  values = "rs1421085",#dat.bmi.sel$SNP, 
  mart = snp.ensembl)

This gives us - as chosen in attributes - the gene identifier, the SNP identifier, the chromosome name, chromosomal position (start + end), minor allele and minor allele frequency. The output in out.bm corresponds to this webpage entry on the ensembl webpage.

out.bm
##   ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end
## 1        ENSG00000140718 rs1421085       16    53767042  53767042
##   minor_allele minor_allele_freq
## 1            C          0.228634

Quick d-tour: assembly GRCh37 or GRCh38?

Before getting the gene names, we want to check if the positions in the dataset are from assembly GRCh37 or GRCh38. This is a handy thing because often only SNP identifiers are reported. Or SNP identifiers are reported, but with positions on a different assembly.

ifelse(sel.pos == out.bm$chrom_start, 
       "\u2713: same assembley (GRCh38)", 
       "\u2717: not the same assembley")
## [1] "✗: not the same assembley"

The position is not matching, because the databases that we are looking at is based on the most recent human assembly GRCh38, but the BMI summary statistics dataset is based on human assembly GRCh37. The command listEnsemblArchives() will list you the URLs needed to get access to an archived assembly. So let’s pull out the archived GRCh37 version with the argument host = 'http://grch37.ensembl.org'.

snp.ensembl.grch37 <- useMart(host='http://grch37.ensembl.org', 
                     biomart='ENSEMBL_MART_SNP', 
                     dataset='hsapiens_snp')

out.bm.grch37 <- getBM(
  attributes = c('ensembl_gene_stable_id', 'refsnp_id', 'chr_name', 'chrom_start', 'chrom_end', 'minor_allele', 'minor_allele_freq'),
  filters = 'snp_filter',
  values = dat.bmi.sel$SNP,
  mart = snp.ensembl.grch37
)
out.bm.grch37
##   ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end
## 1        ENSG00000140718 rs1421085       16    53800954  53800954
##   minor_allele minor_allele_freq
## 1            C          0.228634

Let’s check again.

ifelse(sel.pos == out.bm.grch37$chrom_start,
       "\u2713: same assembley (grch37)", 
       "\u2717: not the same assembley")
## [1] "✓: same assembley (grch37)"

Flavia Hodel pointed out that adding the argument GRCh = 37 works too! That’s a handy argument (but limited to GRCh versions 37 and 38).

snp.ensembl.grch37.alt <- useEnsembl(biomart = "snp", dataset = "hsapiens_snp", GRCh = 37)


out.bm.grch37 <- getBM(
  attributes = c('ensembl_gene_stable_id', 'refsnp_id', 'chr_name', 'chrom_start', 'chrom_end', 'minor_allele', 'minor_allele_freq'),
  filters = 'snp_filter',
  values = dat.bmi.sel$SNP,
  mart = snp.ensembl.grch37.alt
)
out.bm.grch37
##   ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end
## 1        ENSG00000140718 rs1421085       16    53800954  53800954
##   minor_allele minor_allele_freq
## 1            C          0.228634
ifelse(sel.pos == out.bm.grch37$chrom_start,
       "\u2713: same assembley (grch37)", 
       "\u2717: not the same assembley")
## [1] "✓: same assembley (grch37)"

Extracting gene name for one SNP

Next, we want to get the gene name where rs1421085 falls into.

Let’s check which attributes contain the string gene.

listAttributes(snp.ensembl) %>% slice(str_which(name, "gene")) 
##                     name                    description page
## 1        associated_gene Associated gene with phenotype  snp
## 2 ensembl_gene_stable_id                 Gene stable ID  snp

Now use ensembl_gene_stable_id and associated_gene as additional attributes.

## extract gene
## ----------
out.bm.snp2gene <- getBM(
  attributes = c('refsnp_id', 'allele', 'chrom_start', 'chr_name', 'ensembl_gene_stable_id'), 
      filters = c('snp_filter'), 
      values = dat.bmi.sel$SNP, 
      mart = snp.ensembl)
out.bm.snp2gene
##   refsnp_id allele chrom_start chr_name ensembl_gene_stable_id
## 1 rs1421085    T/C    53767042       16        ENSG00000140718
## Attribute `associated_gene` is `Associated gene with phenotype`.

## Extract string
## ----------
gene.ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", GRCh = 37) # we will need an additional mart for genes
## because we are using positions from GRCh = 37 in a next query, we need to pass that information on.
out.bm.gene <- getBM(attributes = c('external_gene_name'), 
      filters = c('ensembl_gene_id'),
      values = unique(out.bm.snp2gene$ensembl_gene_stable_id), 
      mart = gene.ensembl)
out.bm.gene
##   external_gene_name
## 1                FTO

The gene that contains SNP rs1421085 is called FTO.

Extracting gene names for genomic region

So now we know that rs1421085 is part of FTO. But where does FTO start and end? And what are the genes nearby? For this purpose, we want to visualise the summary statistics of the full genomic region (\(\pm 250 \cdot 10^3\) Mb). We just recycle the previous code, but instead of providing a SNP-id, we provide the chromosome, the start and the end position.

out.bm.genes.region <- getBM(
  attributes = c('start_position','end_position','ensembl_gene_id','external_gene_name', 'gene_biotype'), 
      filters = c('chromosome_name','start','end'), 
      values = list(sel.chr, sel.pos - range, sel.pos + range), 
      mart = gene.ensembl)
head(out.bm.genes.region)
##   start_position end_position ensembl_gene_id external_gene_name
## 1       53088945     53363062 ENSG00000177200               CHD9
## 2       53332136     53333704 ENSG00000261056       RP11-454F8.2
## 3       53368474     53368576 ENSG00000238645             snoU13
## 4       53371365     53371483 ENSG00000202193          RNA5SP427
## 5       53395931     53397590 ENSG00000259962       RP11-44F14.4
## 6       53398894     53406995 ENSG00000260078       RP11-44F14.1
##     gene_biotype
## 1 protein_coding
## 2     pseudogene
## 3         snoRNA
## 4           rRNA
## 5     pseudogene
## 6     pseudogene

We can plot out.bm.genes.region with a line range plot, where each horizontal line represents one gene.

## rank gene names according to start position
out.bm.genes.region <- out.bm.genes.region %>% mutate(external_gene_name = fct_reorder(external_gene_name, 
      start_position, .desc = TRUE))

## plot
ggplot(data = out.bm.genes.region) + 
      geom_linerange(aes(x = external_gene_name, ymin = start_position, ymax = end_position)) + 
  coord_flip() + 
  ylab("")

Let’s try to make that pretty. We can group the genes by gene_biotype and colour them accordingly. And we move the protein-coding genes to the top row and colour it black.

## define plot range for x-axis
plot.range <- c(min(sel.pos - range, out.bm.genes.region$start_position), 
      max(sel.pos + range, out.bm.genes.region$end_position))

## rank gene_biotype label
out.bm.genes.region <- out.bm.genes.region %>% mutate(gene_biotype_fac = fct_relevel(as.factor(gene_biotype), 
      "protein_coding"), external_gene_name = fct_reorder2(external_gene_name, 
      start_position, gene_biotype_fac, .desc = TRUE))


## plot
p2 <- ggplot(data = out.bm.genes.region) + 
  geom_linerange(aes(x = external_gene_name, ymin = start_position, ymax = end_position, colour = gene_biotype_fac, group = gene_biotype_fac)) +
  coord_flip() + ylab("") +
  ylim(plot.range) + 
  geom_text(aes(x = external_gene_name, y = start_position, label = external_gene_name, colour = gene_biotype_fac), fontface = 2, alpha = I(0.7), hjust = "right", size= 2.5) + 
  labs(title = "", subtitle = paste0("Genes"), caption = paste0("Data source: ", gene.ensembl@host, " + Data set: ", gene.ensembl@dataset), color = "Gene Biotype") +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(), 
        strip.text.y = element_text(angle = 0),
        legend.position="bottom", 
        panel.grid.major.y = element_blank()) + 
  expand_limits(y=c(-1, 1)) +
  scale_color_manual(values = c("black", metafolio::gg_color_hue(nlevels(out.bm.genes.region$gene_biotype_fac)-1)))
  ## hack to have 11 colors, but probably merging some gene biotype groups could make sense too. 
  ## consider colorblindr::palette_OkabeIto_black
print(p2)

Some short genes are starting with AC and LINCO. We can check what they are and/or consult a biologist.

3. Combining summary statistics and annotation

Now we are ready to combine plot p1 and p2.

library(patchwork)

p1b <- p1 + xlab("") + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + xlim(plot.range)

p1b + p2 + plot_layout(ncol = 1, heights = c(6, 6))

Having both plots combined, we can spot that there are six protein-coding genes in the region. But the P-value peak is located in gene FTO.

If this were work done for a real project, this would now be the time to get back to domain experts, discuss the results and figure out what other annotation they need8.

Wouldn’t it be nice…

  • … to polish that plot even more? For starters, the colour code is not ideal…
  • … to add information about the correlation between SNPs, like in LocusZoom plots?
  • … or have an interactive plot9: hovering over the SNP points would light up the corresponding genes or give some other information, like the allele frequency.

Totally 😉 Seems like great material for future blog posts!

Source

  • The R Markdown file is here.
  • Some more biomaRt code snippets are in a gist.

If you know another R-package to solve a similar problem or have feedback, you can comment below 👇.


  1. Thanks to Maëlle Salmon you get a real plot here instead of a blurry hand drawing 😉 She gave lots of feedback on my first version - much appreciated!

  2. Awesome icons in R? checkout these instructions by Mitchell O’Hara-Wild on the rOpenSci webpage.

  3. Thanks to my colleagues with biology background for explaining me the differences between these databases and what a biomart is!

  4. Initially, I wanted to use data from two recent studies by the Psychiatric Genomics Consortium (PGC) on schizophrenia (SCZ) and bipoloar disorder (BD), as well as Major depressive disorder (MDD). Like most consortia, PGC provides summary statistics that can be downloaded. However, before downloading anything, the user needs to acknowledge and agree to a list of conditions - which I think is an excellent approach! - therefore we cannot directly load it into R.

  5. Data source: Yengo et al. (2018).

  6. A guide to crafting Manhattan plots by Yan Holtz for the R graph gallery

  7. I recently listened to a podcast by Saskia Freytag and NJ Tierney where they talk about the differences between CRAN and Bioconductor (there are many!). The podcast is called Credibly Curious.

  8. +1 for reproducibility!

  9. Liza Darrous pointed out the interactive Manhattan R function manhanttanly.

Avatar
Sina Rüeger
(Genomic) Data Scientist

Related

comments powered by Disqus