QTL_Rice_Cold_Tolerance
- Author
Michael Hall
- Date
4/13/2022
Before we begin I like to reveal what my machine specifications are just in case there might be a compatibility issue:
What open source opeating system are you running? Ubuntu 18.04, Code name Bionic, it must be a Tuesday
QTL-Rice-Cold-Tolerance
QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis.
QTLseqr is still under development and is offered with out any guarantee.
For more detailed instructions please read the vignettehere
For updates read the NEWS.md
Installation
You can install QTLseqr from github with:
# install devtools first to download packages from github
install.packages("devtools")
# use devtools to install QTLseqr
devtools::install_github("PBGLMichaelHall/QTLseqr")
Package Dependencies
Note: Apart from regular package dependencies, there are some Bioconductor tools that we use as well, as such you will be prompted to install support for Bioconductor, if you haven’t already. QTLseqr makes use of C++ to make some tasks significantly faster (like counting SNPs). Because of this, in order to install QTLseqr from github you will be required to install some compiling tools (Rtools and Xcode, for Windows and Mac, respectively).
Citation
If you use QTLseqr in published research, please cite:
Mansfeld B.N. and Grumet R, QTLseqr: An R package for bulk segregant analysis with next-generation sequencing The Plant Genome doi:10.3835/plantgenome2018.01.0006
We also recommend citing the paper for the corresponding method you work with.
QTL-seq method:
Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi, H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. Plant J, 74: 174–183. doi:10.1111/tpj.12105
G prime method:
Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing. PLOS Computational Biology 7(11): e1002255. doi.org/10.1371/journal.pcbi.1002255
Abstract
Next Generation Sequencing Bulk Segregant Analysis (NGS-BSA) is efficient in detecting quantitative trait loci (QTL). Despite the popularity of NGS-BSA and the R statistical platform, no R packages are currently available for NGS-BSA. We present QTLseqr, an R package for NGS-BSA that identifies QTL using two statistical approaches: QTL-seq and G’. These approaches use a simulation method and a tricube smoothed G statistic, respectively, to identify and assess statistical significance of QTL. QTLseqr, can import and filter SNP data, calculate SNP distributions, relative allele frequencies, G’ values, and log10(p-values), enabling identification and plotting of QTL.
Examples:
Load/install libraries
install.packages(“vcfR”)
install.packages(“tidyr”)
install.packages(“ggplot2”)
install.packages("data.table")
devtools::install_github(“PBGLMichaelHall/QTLseqr”,force = TRUE)
library(data.table)
library(QTLseqr)
library(vcfR)
library(tidyr)
library(ggplot2)
library(dplyr)
**Methods**
Set the Working Directory
setwd("/home/michael/Desktop/RiceCold2")
Pre-Filtering Rules
Vcf file can contain bialleleic variants before parsing, however, out of a principal investigators preference, the user can (filter upstream, e.g., with bcftools view -m2 -M2), also the QTLseqR functions will only call SNPS, so filter out **INDELS** with the following command line.
The Lonely Parser
Calling my Parser QTLParser_1_MH This method requires 4 arguments, a vcf, highBulk, lowBulk, and filename. Proceeding this Call you must invoke importFromTable before Filtering.
df <- QTLParser_1_MH(vcf = "wGQ-Filt-freebayes~bwa~IRGSP-1.0~both-segregant_bulks~filtered-default.vcf", highBulk = "ET-pool-385", lowBulk = "ES-pool-430", filename = "Hall.csv")
Import Data
Method 1 (Biased due to parser configuration)
Calling importFromTable on Hall.csv file This method requires 5 inputs to 5 arguments, file, highBulk, lowBulk, chromList and sep.
importFromTable
Chroms <- c("NC_029256.1","NC_029257.1","NC_029258.1","NC_029259.1","NC_029260.1","NC_029261.1","NC_029262.1","NC_029263.1","NC_029264.1","NC_029265.1","NC_029266.1","NC_029267.1")
df <- importFromTable(file = "Hall.csv", highBulk = "ET-pool-385", lowBulk = "ES-pool-430", chromList = Chroms, sep = ",")
**Method 2 (Most convienent)**
Calling **importFromVCF**
This method requires 5 arguments, a vcf **file**, **highBulk**, **lowBulk**, **chromList**, **filename**, and **filter.**
**The filtering argument is a Boolean accepting only TRUE or FALSE. If TRUE then it filters out all SNPs that did not "PASS" in that INFO field.**
**If it is FALSE then there is no filter applied at all.**
importFromVCF
Chroms <- c("NC_029256.1","NC_029257.1","NC_029258.1","NC_029259.1","NC_029260.1","NC_029261.1","NC_029262.1","NC_029263.1","NC_029264.1","NC_029265.1","NC_029266.1","NC_029267.1")
df <- importFromVCF(file = "wGQ-Filt-freebayes~bwa~IRGSP-1.0~both-segregant_bulks~filtered-default.vcf",highBulk = "ET-pool-385",lowBulk = "ES-pool-430",chromList = Chroms,filename = "Hall",filter = FALSE)
GATK
Method 3 (Best in my opinion)
Calling importFromGATK This method requires 4 arguments, a vcf file, highBulk, lowBulk, and chromlist. If you do not have the software on your machine, first visit this website. https://gatk.broadinstitute.org/hc/en-us/articles/360036194592-Getting-started-with-GATK4 Go to section 4 and click the first from left to right here hyperlink
Navigate to the folder containg gatk executable python script
Call VariantsToTable sub executable program with all appropriate flags
This should produce a file called Hall.table
Chroms <- c("NC_029256.1","NC_029257.1","NC_029258.1","NC_029259.1","NC_029260.1","NC_029261.1","NC_029262.1","NC_029263.1","NC_029264.1","NC_029265.1","NC_029266.1","NC_029267.1")
df <- importFromGATK(file = "Hall.table", highBulk = "ET-pool-385", lowBulk = "ES-pool-430", chromlist = Chroms)
**Method 1 is the most biased and therefore cuts out more SNPs than Methods 2 & 3 which produce nearly identical SNP sets.**
Input Fields
#Set High bulk and Low bulk sample names and parser generated file name
#The file name is generated from the QTLParser_1_MH function in line 119
HighBulk <- "ET-pool-385"
LowBulk <- "ES-pool-430"
file <- "Hall.csv"
#Choose which chromosomes/contigs will be included in the analysis,
Chroms <- c("NC_029256.1","NC_029257.1","NC_029258.1","NC_029259.1","NC_029260.1","NC_029261.1","NC_029262.1","NC_029263.1","NC_029264.1","NC_029265.1","NC_029266.1","NC_029267.1")
importFromTable
df <-
importFromTable(
file = file,
highBulk = HighBulk,
lowBulk = LowBulk,
chromList = Chroms
)
Histograms
#plot histograms associated with filtering arguments such as mamximum and minumum Total Depths and reference Allele Frequency to determine cut off values
ggplot(data =df) + geom_histogram(aes(x = DP.LOW + DP.HIGH)) + xlim(0,400)
ggsave(filename = "Depth_Histogram.png",plot=last_plot())
ggplot(data = df) + geom_histogram(aes(x = REF_FRQ))
ggsave(filename = "Ref_Freq_Histogram.png",plot = last_plot())
filterSNPs
#Filter SNPs based on some criteria
df_filt <- filterSNPs( SNPset = df,
refAlleleFreq = 0.20, minTotalDepth = 100, maxTotalDepth = 400,
minSampleDepth = 40,
# minGQ = 0 )
runGprimeAnalysis_MH
#Run G' analysis
df_filt<-runGprimeAnalysis(
SNPset = df_filt,
windowSize = 1e6,
outlierFilter = "deltaSNP",
filterThreshold = 0.1)
plotGprimeDist_MH
#The plot reveals a skewed G Prime statistic with a really small variance. Perhaps it is due to the small number of variants called.
#In addition, Hampels outlier filter in the second argument, can also be changed to "deltaSNP"
plotGprimeDist(SNPset = df_filt, outlierFilter = "Hampel")
#We can see raw data before and after our filtering step
plotGprimeDist(SNPset = df_filt, outlierFilter = "deltaSNP",filterThreshold = 0.1)
runQTLseqAnalysis_MH
#Run QTLseq analysis
df_filt <- runQTLseqAnalysis(
SNPset = df_filt,
windowSize = 1e6,
popStruc = "F2",
bulkSize = c(430, 385),
replications = 10000,
intervals = c(95, 99)
)
Plot G Statistic Distribution as a Histogram
hist(df_filt$G,breaks = 950,xlim = c(0,10),xlab = "G Distribution",main = "Histogram of G Values")
plotQTLStats
nSNPs
#Plot Snps as a function of chromosome and position values
plotQTLStats(SNPset = df_filt, var = "nSNPs")
ggsave(filename = "nSNPs.png",plot = last_plot())
Gprime
#Using QTLStats funciton plot Gprime Statistic with False Discovery Rate Threhshold as a third argument boolean operator as TRUE. The q value is used as FDR threshold null value is 0.05%.
plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01)
ggsave(filename = "GPrime.png",plot = last_plot())
deltaSNP
#Again using plotQTLStats change second argument varaible to deltaSNP and plot.
plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE)
ggsave(filename = "DeltaSNPInterval.png",plot = last_plot())
negLog10Pval
#Finally with plotQTLStats plot negLog10Pval
plotQTLStats(SNPset = df_filt, var = "negLog10Pval",plotThreshold = TRUE,q=0.15)
ggsave(filename = "negLog10Pval.png",plot = last_plot())
Gprime Subset
#Add subset argument to focus on particular chromosomes one, three, four, and six.
#The reason is due to signficant QTL regions
plotQTLStats(SNPset = df_filt, var = "Gprime",plotThreshold = TRUE,q=0.05,subset = c("NC_029256.1","NC_029258.1","NC_029259.1","NC_029261.1"))
rMVP Package
SNP Densities
#install.packages("rMVP")
library(rMVP)
sample<-"Semi_Dwarfism_in_Sorghum"
pathtosample <- "/home/michael/Desktop/QTLseqr/extdata/subset_freebayes_D2.filtered.vcf.gz"
out<- paste0("mvp.",sample,".vcf")
memo<-paste0(sample)
dffile<-paste0("mvp.",sample,".vcf.geno.map")
message("Making MVP data S1")
MVP.Data(fileVCF=pathtosample,
#filePhe="Phenotype.txt",
fileKin=FALSE,
filePC=FALSE,
out=out)
message("Reading MVP Data S1")
df <- read.table(file = dffile, header=TRUE)
message("Making SNP Density Plots")
MVP.Report.Density(df[,c(1:3)], bin.size = 1000000, col = c("blue", "yellow", "red"), memo = memo, file.type = "jpg", dpi=300)
Export summary CSV
QTLTable(SNPset = df_filt, alpha = 0.01, export = TRUE, fileName = "my_BSA_QTL.csv")
Preview the Summary QTL
Theory
Contigency Table
Obs_Allel_Freq
#Use the function to plot allele frequencies per chromosome
#Second argument size specifes size of scalar factor on nSNPs and if you have a relatively small SNP set .001 is a good startin point otherwise set to 1
Obs_Allele_Freq(SNPSet = df_filt, size = 1)
Obs_Allele_Freq2
#Use the function to plot allele frequencies per chromosome
#Second argument size specifes size of scalar factor on nSNPs and if you have a relatively small SNP set .001 is a good startin point otherwise set to 1
##Use the function to investigate chromosomal region of interest
Obs_Allele_Freq2(SNPSet = df_filt, ChromosomeValue = "NC_029263.1", threshold = .85)
Total Coverage and Expected Allelic Frequencies
#Assuming average sequencing coverage (C) expected values for n1,n2,n3,n4
E(n1) = E(n2) = E(n3) = E(n4) = C/2 = 35
# Read in the csv file from High bulk tt
tt<-read.table(file = "ET-pool-385.csv",header = TRUE,sep = ",")
# Calculate average Coverage per SNP site
mean(tt$DP)
# Find REalized frequencies
p1_STAR <- sum(tt$AD_ALT.) / sum(tt$DP)
# Read in the csv file from Low Bulk TT
TT<-read.table(file ="ES-pool-430.csv",header = TRUE,sep=",")
# Calculate average Coverage per SNP sit
mean(TT$DP)
# Find Realized frequencies
p2_STAR <- sum(TT$AD_ALT.) / sum(TT$DP)
# Take the average of the Averages
C <-(mean(tt$DP)+mean(TT$DP))/2
C<-round(C,0)
#Average Coverage
70
C/2 = 35
p2 >> p1 QTL is present
However, ns >> C >> 1 is TRUE
Chromosomal Sampling Theory and an Analytical Framework with respect to Bulk Segregant Analysis
Binomial Sampling
Low Bulk
setwd("/home/michael/Desktop/QTLseqr/extdata")
# Theory and Analytical Framework of Sampling from BSA
par(mfrow=c(1,1))
# Define Ranges of Success
# Sample Size from High Bulk sn = 385
success <- 0:770
# The Difference between realized and Expected Frequencies
# ns : Sample Size taken from Low Bulk
# 2(ns)p1_star ~ Binomial(2(ns),p1)
# p1 Expected Frequencies
# Expected Frequencies:
# E(n1) = E(n2) = E(n3) = E(n4) = C/2 = 110
# We prefer for accuracy to have ns >> C >> 1
plot(success, dbinom(success, size = 770, prob = .50), type = "h",main="Binomial Sampling from Diploid Orgainism from Low Bulk",xlab="2(ns)( p1_STAR)",ylab="Density")
High Bulk
# ns : Sample Size from High Bulk
# 2(ns)p2_star ~ Binomial(2(ns),p2)
# p2 Expected Frequencies
success <- 0:860
plot(success, dbinom(success, size = 860, prob = 0.5), type = "h",main="Binomial Sampling from Diploid Organism from High Bulk",xlab="2(n2)(p2_STAR)",ylab="Density")
Conditional Distribution of n1 given realized average frequency
par(mfrow=c(1,1))
#Define Ranges of Success (Allele Frequencies High and Low)
success <- 0:100
#n1|p1_star ~ Poisson(lambda)
plot(success, dpois(success, lambda = C*(1-p1_STAR)), type = 'h',main="n1|p1_STAR ~ Poisson(C[1-p1_STAR])",xlab="n1|(n3/n1+n3)",ylab="Prob")
Observed n1
# Filter outliers
TT <- TT %>% filter(AD_REF. <= 500)
hist(TT$AD_REF., probability = FALSE,main="Histogram of Actually Realized n1 Values",xlab="n1",breaks = "Sturges")
Conditional Distribution of n2 given realized average frequency
#n2|p2_star ~ Poisson(lambda)
plot(success, dpois(success, lambda = C*(1-p2_STAR)), type='h', main="n2|p2_STAR ~ Poisson(C[[1-p2_STAR])",xlab="n2|(n4/n2+n4)",ylab="Prob")
Observed n2
tt <- tt %>% filter(AD_REF. <= 500)
hist(tt$AD_REF., probability = TRUE, main = "Histogram of Actually Realized n2 Values",xlab="n2")
Conditional Distribution of n3 given realized average frequency
#n3|p1_star ~ Poisson(lambda)
plot(success, dpois(success, lambda = C*p1_STAR),type='h',main="n3|p1_STAR ~ Poisson(C[1-p1_STAR])",xlab="n3|(n3/n1+n3)",ylab="Prob")
Observed n3
TT <- TT %>% filter(AD_ALT. <= 300)
hist(TT$AD_ALT., probability = TRUE, main="Histogram of Acutally Realized n3 Values",xlab="n3")
Conditional Distribution of n4 given realized average frequency
#n4|p2_star ~ Poisson(lambda)
plot(success, dpois(success, lambda = C*p2_STAR), type = 'h',main="n4|p2_STAR ~ Poisson(C[1-p2_STAR])",xlab="n4|n4/(n2+n4)",ylab="Prob")
Observed n4
hist(tt$AD_ALT., probability = TRUE, main="Histogram of Acutally Realized n4 Values",xlab="n4")
An interdependentaly observed relationship between G and Gprime