LD Part 1

Access precomputed LD

R-packages used:

library(httr)
library(jsonlite)
library(xml2)
library(ggplot2)
theme_set(theme_bw())
library(tibble)
library(tidyr)
library(magrittr)
library(glue)
library(purrr)
library(rsnps)
library(data.table)
library(janitor)
library(stringr)
library(rsnps)

The squared correlation between genetic markers is one way to estimate linkage disequilibrium (LD). LD has to be computed all the time - either for as an input for statistical methods or to summarise results.

However, accessing LD estimations quickly, for a specific population and in an automated way (e.g. with R) is suprisingly difficult.

In this blog post I am exploring how to do this efficiently.

Goal

At the end of this blog post, we want to know the genetic correlation between two or more markers in a specific human population, so that we can populate the locuszoom plot from the previous blog post with coloured dots.

For simplicity, I will use the terms correlation, squared correlation, r, r2 and LD interchangeably.

Two approaches

In principle, there are two ways of doing accessing LD:

  1. Download (or access) the genetic data from which you want to estimate your correlations + calculate the correlations using some efficient approach.
  2. Access precomputed LD estimations.
Approach Advantages Downsides Useful when…
(1) Local computation of LD LD matrix can be quickly updated to new reference panels Requires large computation and storage space (e.g. 1000 Genomes is >100 few GB large). i) LD for a large set of SNPs is needed ii) LD from non-standard reference panel is needed.
(2) Access precomputed LD not need for large computation and storage space. limited to certain small sets of markers, limited to possibly outdated reference panels. LD for a small set of SNPs is needed

For now, I will focus on approach (2), and then explore approach (1) in a future blog post.

Spoiler: Using approach (2) does not get you far. It took me quite a while to gather all the solutions that are listed below, and yet there is not one perfect/ideal solution.

Our toy data

We will recycle the data from the previous blog post, where the focus was on extracting annotation using the package biomaRt. In this blog post, we will complete that locuszoom plot by adding the LD information.

## Data Source URL
url <- "https://portals.broadinstitute.org/collaboration/giant/images/2/21/BMI_All_ancestry.fmt.gzip"

## Import BMI summary statistics dat.bmi <- read_tsv(file = url) ##
## taking too long, let's use fread instead.

dat.bmi <- data.table::fread(url, verbose = FALSE)

## Rename some columns
dat.bmi <- dat.bmi %>% rename(SNP = SNPNAME, P = Pvalue)

## Extract region
dat.bmi.sel <- dat.bmi %>% slice(which.min(P))
dat.bmi.sel

## range region
range <- 5e+05
sel.chr <- dat.bmi.sel$CHR
sel.pos <- dat.bmi.sel$POS

data <- dat.bmi %>%
  filter(CHR == sel.chr, between(POS, sel.pos -
    range, sel.pos + range))

head(data)

(snp <- dat.bmi.sel$SNP)

What we are interested in is the LD between our top SNP rs1421085 and all other 82 SNPs nearby.

This dataset has positions on build GRCh37, while most databases are on build GRCh38 by now.

sm <- rsnps::ncbi_snp_summary(snp) %>% separate(chrpos, c("chr", "pos"))
sel.pos == as.numeric(sm$pos)
## [1] FALSE

Let’s quickly repeat what our primary goal is:

Extract the correlation between SNPs

  • without downloading any data,
  • fairly quick and
  • in R.

1B. A solution that almost works: ensembl.org

The REST API of Ensembl can do a lot (see options here). For example access precomputed LD. The webpage even provides R code to do so, which is from where I copied some snippets below.

To access the rest API at ensembl, we need the following three packages loaded.

library(httr)
library(jsonlite)
library(xml2)

What reference panels/population can we choose from?

Currently, the largest and hence most popular public reference panel is 1000 Genomes reference panel (1KG). The 26 populations of roughly 100 individuals each can be grouped into five super populations: African (AFR), American (AMR), European (EUR), South Asian (SAS), East Asian (EAS).

We can ask the ENSEMBL API from what populations reference panels are available. This will return us a data frame.

server <- "https://rest.ensembl.org"
ext <- "/info/variation/populations/homo_sapiens?filter=LD"

r <- GET(paste(server, ext, sep = ""), content_type("application/json"))

stop_for_status(r)

head(fromJSON(toJSON(content(r))))
##                                                  description size
## 1                              African Caribbean in Barbados   96
## 2                           African Ancestry in Southwest US   61
## 3                                      Bengali in Bangladesh   86
## 4                        Chinese Dai in Xishuangbanna, China   93
## 5 Utah residents with Northern and Western European ancestry   99
## 6                               Han Chinese in Bejing, China  103
##                      name
## 1 1000GENOMES:phase_3:ACB
## 2 1000GENOMES:phase_3:ASW
## 3 1000GENOMES:phase_3:BEB
## 4 1000GENOMES:phase_3:CDX
## 5 1000GENOMES:phase_3:CEU
## 6 1000GENOMES:phase_3:CHB

name stands for the population identifier. size refers to the number of individuals in the reference panel. Note that these are all populations with around 100 individuals (the correlation estimation will have an error that scales with the sample size). There are also the five super population available (although not listed here), simply replace the last three characters in name by EUR, AFR, AMR, EAS, SAS.

(From this blog post.)

We want the LD information, so that we can add this info to the locuszoom plot. But how do we know which population to pick? One way is to read up what kind of individuals were present. In our case - mostly Europeans (EUR). But we could also build some pooled LD matrix of different populations.

Now that we know which reference panel we want to use, we can use the different rest APIs.

Access LD between a SNP and its region

This API is described here.

The default window size is 500 kb. There are also thresholds for r2 (e.g. if you want to filter all SNPs with an r2 > 0.8).

The only input required is the SNP rsid, marked with {snp}.

snp
## [1] "rs1421085"
server <- "https://rest.ensembl.org"
ext <- glue::glue("/ld/human/{snp}/1000GENOMES:phase_3:EUR?")
## Window size in kb. The maximum allowed value for the window size is 500 kb. LD is computed for the given variant and all variants that are located within the specified window.

r <- GET(paste(server, ext, sep = ""), content_type("application/json"))
stop_for_status(r)
LD.SNP.region <- as_tibble(fromJSON(toJSON(content(r)))) %>%
  unnest() %>%
  mutate(r2 = as.numeric(r2))
## Warning: `cols` is now required.
## Please use `cols = c(variation1, d_prime, r2, population_name, variation2)`
head(LD.SNP.region)
## # A tibble: 6 x 5
##   variation1 d_prime      r2 population_name         variation2
##   <chr>      <chr>     <dbl> <chr>                   <chr>     
## 1 rs1421085  0.327013 0.0834 1000GENOMES:phase_3:EUR rs8043738 
## 2 rs1421085  0.992084 0.409  1000GENOMES:phase_3:EUR rs2042031 
## 3 rs1421085  0.846280 0.647  1000GENOMES:phase_3:EUR rs11642841
## 4 rs1421085  1.000000 0.957  1000GENOMES:phase_3:EUR rs9940128 
## 5 rs1421085  0.999948 0.0552 1000GENOMES:phase_3:EUR rs73612011
## 6 rs1421085  0.907868 0.648  1000GENOMES:phase_3:EUR rs8057044

As a result, LD.snp.region contains the r2 of our top SNP with all SNPs that were +/- 500 kb away.

What if we want the correlation between all SNPs?

Access LD matrix

For this, we need the rest API here.

We can calculate the LD matrix of a full region, max 1 Mb wide. For fast computation, we limit it to +/- 50 kb.

## Query region. A maximum of 1Mb is allowed.

ext <- glue::glue("/ld/human/region/{sel.chr}:{sel.pos - range/20}..{sel.pos + range/20}/1000GENOMES:phase_3:EUR?")

r <- GET(paste(server, ext, sep = ""), content_type("application/json"))
stop_for_status(r)
LD.matrix.region <- as_tibble(fromJSON(toJSON(content(r)))) %>%
  unnest() %>%
  mutate(r2 = as.numeric(r2))
## Warning: `cols` is now required.
## Please use `cols = c(population_name, r2, variation2, variation1, d_prime)`
head(LD.matrix.region)
## # A tibble: 6 x 5
##   population_name            r2 variation2  variation1 d_prime 
##   <chr>                   <dbl> <chr>       <chr>      <chr>   
## 1 1000GENOMES:phase_3:EUR 0.761 rs9933509   rs8057044  0.999998
## 2 1000GENOMES:phase_3:EUR 1     rs150763868 rs72803680 1.000000
## 3 1000GENOMES:phase_3:EUR 0.106 rs7206122   rs62033401 0.999940
## 4 1000GENOMES:phase_3:EUR 0.731 rs9935401   rs8057044  0.999997
## 5 1000GENOMES:phase_3:EUR 0.106 rs141816793 rs62033399 0.999943
## 6 1000GENOMES:phase_3:EUR 0.996 rs9941349   rs9931900  1.000000

Access LD between a SNP and many other SNPs

The third and last option is to pass on a set of SNP rs ids, and access the LD among these. Implemented in the ENSEMBL API is only the LD between two SNPs, so we will have to extend this to many SNPs.

extract_ld <- function(SNP.id2 = NULL, SNP.id1 = NULL, POP = NULL) {
  ext <- glue::glue("/ld/human/pairwise/{SNP.id1}/{SNP.id2}/") ## filter POP further down

  server <- "https://rest.ensembl.org"

  r <- GET(paste(server, ext, sep = ""), content_type("application/json"))
  stop_for_status(r)

  out <- as_tibble(fromJSON(toJSON(content(r)))) %>%
    unnest() %>%
    filter(stringr::str_detect(population_name, POP))

  return(out)
}


## see futher down why intersect here
other.snps <- intersect(LD.SNP.region$variation2, data$SNP)

## cacluate LD for all other.snps SNPs
LD.matrix.snps <- purrr::map_df(other.snps, extract_ld, snp, "EUR") %>%
  mutate(r2 = as.numeric(r2)) %>%
  bind_rows() %>%
  unnest()
## Warning: `cols` is now required.
## Please use `cols = c(variation2, d_prime, variation1, r2, population_name)`

## Warning: `cols` is now required.
## Please use `cols = c(variation2, d_prime, variation1, r2, population_name)`

## Warning: `cols` is now required.
## Please use `cols = c(variation2, d_prime, variation1, r2, population_name)`
## Warning: `cols` is now required.
## Please use `cols = c(d_prime, variation1, variation2, r2, population_name)`
## Warning: `cols` is now required.
## Please use `cols = c(variation2, variation1, d_prime, population_name, r2)`
## Warning: `cols` is now required.
## Please use `cols = c(variation1, d_prime, variation2, population_name, r2)`
## Warning: `cols` is now required.
## Please use `cols = c(r2, population_name, d_prime, variation1, variation2)`
## Warning: `cols` is now required.
## Please use `cols = c(population_name, r2, variation1, d_prime, variation2)`
## Warning: `cols` is now required.
## Please use `cols = c(variation2, d_prime, variation1, r2, population_name)`
## Warning: `cols` is now required.
## Please use `cols = c(d_prime, variation1, variation2, r2, population_name)`
## Warning: `cols` is now required.
## Please use `cols = c()`
LD.matrix.snps
## # A tibble: 0 x 5
## # … with 5 variables: variation2 <chr>, d_prime <chr>, variation1 <chr>,
## #   r2 <dbl>, population_name <chr>

Calculate the LD matrix (LD.matrix.region) or the LD between SNP pairs (LD.matrix.snps) takes a lot of time!

Coloured locuszoom plot

For the locuszoom plot we need only the correlation between the top SNP and all other SNPs. So we join the object LD.SNP.region to data.

data_ensembl <- data %>%
  full_join(LD.SNP.region, by = c("SNP" = "variation2"))
ggplot(data = data_ensembl) +
  geom_point(aes(POS, -log10(P), color = r2), shape = 16) +
  labs(
    title = "Locuszoom plot for BMI GWAS",
    subtitle = paste("Summary statistics for chromosome", sel.chr, "from", format((sel.pos - range), big.mark = "'"), "to", format((sel.pos + range), big.mark = "'"), "bp"),
    caption = paste("Data source:", url)
  ) +
  geom_point(
    data = data_ensembl %>% filter(SNP == "rs1421085"),
    aes(POS, -log10(P)), color = "black", shape = 16
  ) +
  scale_color_distiller("R2 (ensembl)", type = "div", palette = "Spectral", limits = c(0, 1))
## Warning: Removed 205 rows containing missing values (geom_point).

2. Solutions that work half-through

SNPsnap

API provided by sph.umich

3. A solution that does not work

Conclusion

The ldlink API with the LDproxy module seems the most perfect solution for now.

This will probably change with changing technology and larger reference panels.

Session Info

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] janitor_1.2.0      data.table_1.12.2  rsnps_0.3.2.9121  
##  [4] glue_1.3.1.9000    magrittr_1.5       xml2_1.2.0        
##  [7] jsonlite_1.6       httr_1.4.0         forcats_0.4.0.9000
## [10] stringr_1.4.0      dplyr_0.8.3.9000   purrr_0.3.2       
## [13] readr_1.3.1        tidyr_0.8.3.9000   tibble_2.1.3      
## [16] ggplot2_3.2.0.9000 tidyverse_1.2.1   
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5 xfun_0.8         haven_2.1.1      lattice_0.20-38 
##  [5] colorspace_1.4-1 vctrs_0.2.0.9000 generics_0.0.2   htmltools_0.3.6 
##  [9] yaml_2.2.0       utf8_1.1.4       XML_3.98-1.20    rlang_0.4.0     
## [13] pillar_1.4.2     httpcode_0.2.0   withr_2.1.2      modelr_0.1.4    
## [17] readxl_1.3.1     plyr_1.8.4       munsell_0.5.0    blogdown_0.13   
## [21] gtable_0.3.0     cellranger_1.1.0 rvest_0.3.4      evaluate_0.14   
## [25] knitr_1.23       curl_3.3         fansi_0.4.0      triebeard_0.3.0 
## [29] urltools_1.7.3   broom_0.5.2      Rcpp_1.0.1       scales_1.0.0    
## [33] backports_1.1.4  hms_0.4.2        digest_0.6.20    stringi_1.4.3   
## [37] bookdown_0.11    grid_3.5.1       cli_1.1.0        tools_3.5.1     
## [41] lazyeval_0.2.2   crul_0.8.0       crayon_1.3.4     pkgconfig_2.0.2 
## [45] zeallot_0.1.0    ellipsis_0.2.0.1 lubridate_1.7.4  assertthat_0.2.1
## [49] rmarkdown_1.13   rstudioapi_0.10  R6_2.4.0         icon_0.1.0      
## [53] nlme_3.1-140     compiler_3.5.1
Avatar
Sina Rüeger
(Genomic) Data Scientist

Related

comments powered by Disqus