QTL_BSA_Crop_Varieties

Author: Michael Hall
Date: 4/13/2022

Before we begin I like to reveal what my machine specifications are just in case there might be a compatibility issue:

What open source opeating system are you running? Ubuntu 18.04, Code name Bionic, it must be a Tuesday

QTLSorghum

QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis.

QTLseqr is still under development and is offered with out any guarantee.

For more detailed instructions please read the vignettehere

For updates read the NEWS.md

Installation

You can install QTLseqr from github with:

# install devtools first to download packages from github
install.packages("devtools")

# use devtools to install QTLseqr
devtools::install_github("PBGLMichaelHall/QTLseqr")

Package Dependencies

Note: Apart from regular package dependencies, there are some Bioconductor tools that we use as well, as such you will be prompted to install support for Bioconductor, if you haven’t already. QTLseqr makes use of C++ to make some tasks significantly faster (like counting SNPs). Because of this, in order to install QTLseqr from github you will be required to install some compiling tools (Rtools and Xcode, for Windows and Mac, respectively).

Citation

If you use QTLseqr in published research, please cite:

Mansfeld B.N. and Grumet R, QTLseqr: An R package for bulk segregant analysis with next-generation sequencing The Plant Genome doi:10.3835/plantgenome2018.01.0006

We also recommend citing the paper for the corresponding method you work with.

QTL-seq method:

Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi, H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. Plant J, 74: 174–183. doi:10.1111/tpj.12105

G prime method:

Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing. PLOS Computational Biology 7(11): e1002255. doi.org/10.1371/journal.pcbi.1002255

Abstract

Next Generation Sequencing Bulk Segregant Analysis (NGS-BSA) is efficient in detecting quantitative trait loci (QTL). Despite the popularity of NGS-BSA and the R statistical platform, no R packages are currently available for NGS-BSA. We present QTLseqr, an R package for NGS-BSA that identifies QTL using two statistical approaches: QTL-seq and G’. These approaches use a simulation method and a tricube smoothed G statistic, respectively, to identify and assess statistical significance of QTL. QTLseqr, can import and filter SNP data, calculate SNP distributions, relative allele frequencies, G’ values, and log10(p-values), enabling identification and plotting of QTL.

Examples:

Load/install libraries

devtools::install_github(“PBGLMichaelHall/QTLseqr”,force = TRUE)
install.packages(“vcfR”)
install.packages(“tidyr”)
install.packages(“ggplot2”)

library(QTLseqr)
library(vcfR)
library(tidyr)
library(ggplot2)

**Methods**

Set the Working Directory

setwd("/home/michael/Desktop/QTLseqr/extdata")

Pre-Filtering Rules

Vcf file must only contain bialleleic variants. (filter upstream, e.g., with bcftools view -v snps -m2 -M2), also the QTLseqR functions will only take    SNPS, ie, length of REF and ALT== 1

Importing Data

importFromVCF

df <- importFromVCF(file = "freebayes_D2.filtered.vcf", highBulk = "D2_F2_tt", lowBulk =    "D2_F2_TT", filname = "Hall")

importFromGATK

An offical Github GATK Genomic Analysis Toolkit repository can be found here to download https://github.com/broadinstitute/gatk

However, we want to clone the repository and make a build:

git clone https://github.com/broadinstitute/gatk

**Navigate to find gradlew and type the command:**

gradlew bundle

**To verify it is working invoke python interpreter:**

python gatk --help

python gatk --list

*Base Calling:*
*Copy Number Variant Discovery:*
*Coverage Analyis:*
*Diagnostics and Quality Control:*
*Example Tools:*
*Genotyping Arrays Manipulation:*
*Intervals Manipulation:*
*Metagenomics:*
*Methalation-Specific Tools:*
*Other:*
*Read Data Manipulation:*
*Reference:*
*Short Variant Discovery:*
*Structural Variant Discovery:*
*Variant Evaluation and Refinement:*
*Variant Filtering:*
*Variant Manipulation:*

# We are most concerned with Variant Evaluation and Refinement

**To produce the input file Hall.table, run the following command:**

python gatk VariantsToTable --variant freebayes_D2.filtered.vcf --fields CHROM --fields POS --fields REF --fields ALT --genotyp-fields AD --genotype-fields DP --genotype-fields GQ --genotype-fields PL --output Hall.table

Input Fields ImportFromVCF

**Define High bulk and Low bulk sample names as an input object and define parser generated file name. The file name is generated from ImportFromVCF function.**

HighBulk <- "D2_F2_tt"
LowBulk <- "D2_F2_TT"
file <- "Hall.csv"

**Choose and define which chromosomes/contigs will be included in the analysis. The chromosome/contg names are reverse compatible with VCF names.**

Chroms <- c("Chr01","Chr02","Chr03","Chr04","Chr05","Chr06","Chr07","Chr08","Chr09","Chr10")

importFromTable

df <-
  importFromTable(
    file = file,
    highBulk = HighBulk,
    lowBulk = LowBulk,
    chromList = Chroms
  )

Inspect Header

Input Fields ImportFromGATK

**Define Objects High bulk, Low bulk and file given there proper names.**

HighBulk <- "D2_F2_tt"
LowBulk <- "D2_F2_TT"
file <- "Hall.table"

**Choose which chromosomes/contigs will be included in the analysis.**

Chroms <- c("Chr01","Chr02","Chr03","Chr04","Chr05","Chr06","Chr07","Chr08","Chr09","Chr10")

importFromTable

df <-
  importFromGATK(
    file = file,
    highBulk = HighBulk,
    lowBulk = LowBulk,
    chromList = Chroms
  )

Histograms

**Make histograms associated with filtering arguments. Such as Minimum Depth, Maximum Depth, Reference Allele Frequency, Minimum Sample Depth, and Genotype Quality.

ggplot(data =df) + geom_histogram(aes(x = DP.LOW + DP.HIGH)) + xlim(0,400)

ggsave(filename = "Depth_Histogram.png",plot=last_plot())

ggplot(data = df) + geom_histogram(aes(x = REF_FRQ))
ggsave(filename = "Ref_Freq_Histogram.png",plot = last_plot())

filterSNPs

**Filter SNPs:**
df_filt <- filterSNPs( SNPset = df,
refAlleleFreq = 0.20, minTotalDepth = 100, maxTotalDepth = 400,
minSampleDepth = 40,
minGQ = 0 )

runGprimeAnalysis_MH

**Run G' analysis:**

df_filt<-runGprimeAnalysis_MH(
  SNPset = df_filt,
  windowSize = 5000000,
  outlierFilter = "deltaSNP",
  filterThreshold = 0.1)

plotGprimeDist_MH

**The plot reveals a skewed G Prime statistic with a really small variance. Perhaps it is due to relatively High Coverage with respect to Bulk Sample Sizes and not a lot of variants called.**

**In addition, Hampels outlier filter in the second argument can also be changed to "deltaSNP".**

plotGprimeDist(SNPset = df_filt, outlierFilter = "Hampel",filterThreshold = 0.1, binwidth = 0.5)

**We can see raw data before and after our filtering step**

plotGprimeDist_MH(SNPset = df_filt, outlierFilter = "deltaSNP",filterThreshold = 0.1,binwidth=0.5)

runQTLseqAnalysis_MH

**Run QTLseq analysis:**


df_filt2 <- runQTLseqAnalysis_MH(
  SNPset = df_filt,
  windowSize = 5000000,
  popStruc = "F2",
  bulkSize = c(45, 38),
  replications = 10000,
  intervals = c(95, 99)
)

Plot G Statistic Distribution as a Histogram

hist(df_filt2$G,breaks = 950,xlim = c(0,10),xlab = "G Distribution",main = "Histogram of G Values")

plotQTLStats

nSNPs

**Plot Snps as a function of chromosome and position values**


plotQTLStats(SNPset = df_filt2, var = "nSNPs")
ggsave(filename = "nSNPs.png",plot = last_plot())

Gprime

**Using QTLStats funciton plot Gprime Statistic with False Discovery Rate Threhshold as a third argument boolean operator as TRUE. The q value is used as FDR threshold null value is 0.05%.**


plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01)
ggsave(filename = "GPrime.png",plot = last_plot())

deltaSNP

**Again using plotQTLStats change second argument varaible to deltaSNP and plot.**

plotQTLStats(SNPset = df_filt2, var = "deltaSNP", plotIntervals  = TRUE)
ggsave(filename = "DeltaSNPInterval.png",plot = last_plot())

negLog10Pval

**Finally with plotQTLStats plot negLog10Pval.**

plotQTLStats(SNPset = df_filt2, var = "negLog10Pval",plotThreshold = TRUE,q=0.01)
ggsave(filename = "negLog10Pval.png",plot = last_plot())

Gprime Subset

**Add subset argument to focus on particular chromosomes one, three, four, and six.**
**The reason is due to signficant QTL regions**


plotQTLStats(SNPset = df_filt2, var = "Gprime",plotThreshold = TRUE,q=0.01,subset = c("Chr01","Chr03","Chr04","Chr06"))

rMVP Package

SNP Densities

install.packages("rMVP")
library(rMVP)
sample<-"Semi_Dwarfism_in_Sorghum"
pathtosample <- "/home/michael/Desktop/QTLseqr/extdata/subset_freebayes_D2.filtered.vcf.gz"
out<- paste0("mvp.",sample,".vcf")
memo<-paste0(sample)
dffile<-paste0("mvp.",sample,".vcf.geno.map")

message("Making MVP data S1")
MVP.Data(fileVCF=pathtosample,
      #filePhe="Phenotype.txt",
      fileKin=FALSE,
      filePC=FALSE,
      out=out)

message("Reading MVP Data S1")
df <- read.table(file = dffile, header=TRUE)
message("Making SNP Density Plots")
MVP.Report.Density(df[,c(1:3)], bin.size = 5000000, col = c("blue", "yellow", "red"), memo = memo, file.type = "jpg", dpi=300)

Export summary CSV

QTLTable(SNPset = df_filt, alpha = 0.01, export = TRUE, fileName = "my_BSA_QTL.csv")

Preview the Summary QTL

Theory

Contigency Table

Obs_Allel_Freq

**Use the function to plot allele frequencies per chromosome.**
**Second argument size specifes size of scalar factor on nSNPs and if you have a relatively small SNP set .001 is a good startin point otherwise set to 1**


Obs_Allele_Freq(SNPSet = df_filt, size = .001)

Obs_Allele_Freq2

**Use the function to investigate chromosomal region of interest**

Obs_Allele_Freq2(SNPSet = df_filt, ChromosomeValue = "Chr04", threshold = .90)

Total Coverage and Expected Allelic Frequencies

E(n1) = E(n2) = E(n3) = E(n4) = C/2

**Read in the csv file from High bulk tt**

tt<-read.table(file = "D2_F2_tt.csv",header = TRUE,sep = ",")

**Calculate average Coverage per SNP site**

mean(tt$DP)

**Find REalized frequencies**

p1_STAR <- sum(tt$AD_ALT.) / sum(tt$DP)

**Read in the csv file from Low Bulk TT.**

TT<-read.table(file ="D2_F2_TT.csv",header = TRUE,sep=",")

**Calculate average Coverage per SNP sit**

mean(TT$DP)

**Find Realized frequencies**

p2_STAR <- sum(TT$AD_ALT.) / sum(TT$DP)

**Take the average of the Averages**

C <-(mean(tt$DP)+mean(TT$DP))/2

C<-round(C,0)
**Find Coverage Value**
C
110

E(n1) = E(n2) = E(n3) = E(n4) = C/2 = 55

p2 >> p1 QTL is present

Theory and Analytical Framework of Sampling from BSA

Binomial Sampling

High Bulk

par(mfrow=c(1,1)) Define Ranges of Success success <- 0:90

The Difference between realized and Expected Frequencies

ns : Sample Size taken from Low Bulk

2(ns)p1_star ~ Binomial(2(ns),p1)

p1 Expected Frequencies

Expected Frequencies:

E(n1) = E(n2) = E(n3) = E(n4) = C/2 = 110

We prefer for accuracy and a powerful G Prime Test to have ns >> C >> 1

However, it is not true in this case.

plot(success, dbinom(success, size = 90, prob = .50), type = “h”,main=”Binomial Sampling from Diploid Orgainism from High Bulk”,xlab=”2(ns)(p1_STAR)”,ylab=”Density”)

Low Bulk

**ns : Sample Size from High Bulk**
**2(ns)p2_star ~ Binomial(2(ns),p2)**
**p2 Expected Frequencies**
success <- 0:76
plot(success, dbinom(success, size = 76, prob = 0.5), type = "h",main="Binomial Sampling from Diploid Organism from Low Bulk",xlab="2(n2)(p2_STAR)",ylab="Density")

Conditional Distribution of n1 given realized average frequency

par(mfrow=c(1,1))
#Define Ranges of Success (Allele Frequencies High and Low)
success <- 0:100
#n1|p1_star ~ Poisson(lambda)
plot(success, dpois(success, lambda = C*(1-p1_STAR)), type = 'h',main="n1|p1_STAR ~ Poisson(C[1-p1_STAR])",xlab="n1|(n3/n1+n3)",ylab="Prob")

Observed n1

hist(TT$AD_REF., probability = TRUE,main="Histogram of Actually Realized n1 Values",xlab="n1")

Conditional Distribution of n2 given realized average frequency

#n2|p2_star ~ Poisson(lambda)
plot(success, dpois(success, lambda = C*(1-p2_STAR)), type='h', main="n2|p2_STAR ~ Poisson(C[[1-p2_STAR])",xlab="n2|(n4/n2+n4)",ylab="Prob")

Observed n2

hist(tt$AD_REF., probability = TRUE, main = "Histogram of Actually Realized n2 Values",xlab="n2")

Conditional Distribution of n3 given realized average frequency

#n3|p1_star ~ Poisson(lambda)
plot(success, dpois(success, lambda = C*p1_STAR),type='h',main="n3|p1_STAR ~ Poisson(C[1-p1_STAR])",xlab="n3|(n3/n1+n3)",ylab="Prob")

Observed n3

hist(TT$AD_ALT., probability = TRUE, main="Histogram of Acutally Realized n3 Values",xlab="n3")

Conditional Distribution of n4 given realized average frequency

#n4|p2_star ~ Poisson(lambda)
plot(success, dpois(success, lambda = C*p2_STAR), type = 'h',main="n4|p2_STAR ~ Poisson(C[1-p2_STAR])",xlab="n4|n4/(n2+n4)",ylab="Prob")

Observed n4

hist(tt$AD_ALT., probability = TRUE, main="Histogram of Acutally Realized n4 Values",xlab="n4")

An interdependentaly observed relationship between G and Gprime