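Introduction to GWAS
A genome-wide association study (GWAS) genotypes many individuals for genetic variants across the whole genome, then tests each genotype statistically, at the population level, for association with an observable trait (the phenotype). Variants that may influence the trait are selected on the basis of the test statistic or P value.
Resequencing-based GWAS
Large-sample whole-genome resequencing is used to genotype important animal and plant germplasm resources across the whole genome. The genotypes are then tested against the phenotypes of interest in a GWAS to find associated SNP loci and locate trait-related genes, providing a scientific basis for subsequent animal and plant breeding.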
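Analysis steps
1. Quality control of sequencing data
2. Alignment to the reference genome
3. Genome-wide variant detection
4. Genetic and evolutionary analysis
5. Linkage disequilibrium analysis
6. Genome-wide association analysis: a) SNP-trait association analysis b) functional annotation of candidate genes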
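The SNP-trait association test in step 6(a) can be illustrated with a minimal sketch. It assumes a quantitative trait, an additive model and no covariates, and regresses the phenotype on allele dosage (0, 1 or 2) for a single SNP. The function name `snp_association` and the toy data are hypothetical, and the P value uses a normal approximation that is only reasonable for large samples. Dedicated tools additionally adjust for covariates, population structure and relatedness.

```python
import math
from statistics import NormalDist, mean


def snp_association(dosages, phenotypes):
    """Simple linear regression of a quantitative phenotype on allele dosage.

    Returns (beta, standard_error, p_value) for one SNP.
    """
    n = len(dosages)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matched genotype and phenotype vectors with n >= 3")
    g_mean, y_mean = mean(dosages), mean(phenotypes)
    sxx = sum((g - g_mean) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((g - g_mean) * (y - y_mean) for g, y in zip(dosages, phenotypes))
    beta = sxy / sxx                      # per-allele effect estimate
    intercept = y_mean - beta * g_mean
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of beta
    if se == 0:
        raise ValueError("perfect fit: standard error is zero")
    z = beta / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided, normal approximation
    return beta, se, p_value


# Toy data (made up for illustration): ten individuals, one SNP.
toy_dosages = [0, 0, 0, 1, 1, 1, 2, 2, 2, 1]
toy_phenotypes = [1.0, 1.2, 0.9, 1.6, 1.4, 1.5, 2.1, 1.9, 2.0, 1.5]
print(snp_association(toy_dosages, toy_phenotypes))
```

In a real GWAS this test is repeated for every variant, and the resulting P values are compared against a genome-wide significance threshold.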
Selecting study populations
GWAS often require very large sample sizes to identify reproducible genome-wide significant associations. The required sample size can be estimated with power calculations in software tools such as CaTS (ref.14) or GPC (ref.15). When the trait of interest is dichotomous, the study design compares cases and controls; when the trait is quantitative, it is measured across the whole study sample. In addition, one can choose between population-based and family-based designs.
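To give a feel for what such a power calculation involves, here is a minimal sketch for a single SNP and a standardized quantitative trait. It is not the algorithm used by CaTS or GPC. It assumes an additive model, Hardy-Weinberg equilibrium, unrelated individuals and a normal approximation to the test statistic; the function names and the example effect size are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def variance_explained(maf, beta):
    """Fraction of trait variance explained by one additive SNP.

    beta is the per-allele effect in phenotypic standard deviations.
    """
    return 2 * maf * (1 - maf) * beta ** 2


def gwas_power(n, maf, beta, alpha=5e-8):
    """Approximate power of a two-sided test at significance level alpha."""
    q2 = variance_explained(maf, beta)
    if not 0 <= q2 < 1:
        raise ValueError("variance explained must be in [0, 1)")
    ncp = math.sqrt(n * q2 / (1 - q2))          # expected Z statistic
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(ncp - z_crit) + _STD_NORMAL.cdf(-ncp - z_crit)


def required_sample_size(maf, beta, target_power=0.8, alpha=5e-8):
    """Smallest n reaching the target power (doubling, then bisection)."""
    if variance_explained(maf, beta) <= 0:
        raise ValueError("effect must be non-zero")
    low, high = 1, 2
    while gwas_power(high, maf, beta, alpha) < target_power:
        low, high = high, high * 2
    while high - low > 1:
        mid = (low + high) // 2
        if gwas_power(mid, maf, beta, alpha) >= target_power:
            high = mid
        else:
            low = mid
    return high


# Illustrative values only: MAF 0.3, effect of 0.05 standard deviations per allele.
for n in (1_000, 10_000, 100_000):
    print(n, round(gwas_power(n, maf=0.3, beta=0.05), 4))
print("n for 80% power:", required_sample_size(maf=0.3, beta=0.05))
```

The sketch shows why small effects demand very large samples: power depends on the product of sample size and variance explained, so halving the effect size roughly quadruples the sample size needed.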
Different ethnicities can be included in the same study, provided that population substructure is accounted for to avoid false-positive results. Individual cohorts with detailed clinical measures may not reach the required sample size; in these cases, ‘proxy’ phenotypes that are easier to measure and for which more data are available can be used (for example, educational attainment as a proxy for intelligence, or depressive symptoms as a proxy for a clinical diagnosis of depression) (ref.19).
Genotyping
Individuals are typically genotyped using microarrays, which capture common variants, or next-generation sequencing methods such as whole-exome sequencing (WES) or whole-genome sequencing (WGS), which also capture rare variants. Microarray-based genotyping is the most commonly used method for GWAS because next-generation sequencing is currently more expensive. However, the choice of genotyping platform depends on many factors and tends to be guided by the purpose of the GWAS; for example, in a consortium-led GWAS, it is usually wise to have all individual cohorts genotyped on the same platform. Ideally, WGS, which determines nearly every genotype of a full genome, is preferred over WES and microarrays, and it is expected to become the method of choice over the next couple of years as low-cost WGS technology becomes more widely available.
From: Genome-wide association studies
Software | Use |
---|---|
Quality control | |
PLINK/PLINK2 (ref.20) | Can be used for many key steps in quality control, including filtering of low-quality SNPs (based on deviation from Hardy–Weinberg equilibrium, genotyping call rate and minor allele frequency) and low-quality individuals (based on sex check, genotyping call rate, sample call rate, heterozygosity and relatedness checks) |
RICOPILI (ref.23) | Quality control of raw genetic data and summary statistics used for input in meta-analyses |
SMARTPCA | Principal component analysis of raw genotyping data; provides individual-level principal components that can be used to correct for population stratification |
FlashPCA (ref.255) | Similar to SMARTPCA; faster and more scalable with increasing sample sizes |
Imputation | |
IMPUTE2 (refs256,257) | Imputation of missing genotypes against an existing reference panel matched for ancestry; tends to use more memory than other imputation tools |
BEAGLE (ref.258) | Imputation of missing genotypes against an existing reference panel matched for ancestry |
MACH/Minimac (ref.259) | Imputation of missing genotypes against an existing reference panel matched for ancestry; Minimac includes pre-phasing, which reduces imputation time |
Association | |
PLINK/PLINK2 (ref.20) | Most widely known tool for conducting genetic associations |
SNPTEST (ref.260) | Genetic association testing; works well with IMPUTE2 |
GEMMA (ref.55) | Genetic association testing based on linear mixed models |
SAIGE (ref.35) | Genetic association for binary phenotypes; analyses very large samples (N > 100,000) |
BOLT-LMM (ref.261) | Genetic association testing based on the BOLT-LMM algorithm for mixed model association testing and the BOLT-REML algorithm for variance components analysis (partitioning of SNP-based heritability and estimation of genetic correlations) |
REGENIE (ref.56) | Genetic association testing; analyses very large samples (N > 100,000); can assess multiple phenotypes at once; fast and memory efficient |
BGENIE (ref.76) | Genetic association for continuous phenotypes; analyses very large samples (N > 100,000); custom-made for the UK Biobank BGENv1.2 file format |
fastGWA (ref.37) | Mixed-model genetic association analysis |
Statistical fine-mapping | |
CAVIAR (ref.127) | Estimates the probability of each variant in a locus to be causal based on the observed pattern of P values and the level of linkage disequilibrium; allows for an arbitrary number of causal variants |
PAINTOR (ref.95) | Statistical fine-mapping using GWAS summary statistics and functional genomic data to prioritize likely causal variants |
SuSIE (ref.96) | Statistical fine-mapping using GWAS summary statistics and linkage disequilibrium information from a reference panel; based on a Bayesian modification of a forward selection model |
FINEMAP (ref.94) | Statistical fine-mapping using GWAS summary statistics as input; calculates effect sizes and heritability owing to likely causal SNPs |
Meta-analysis | |
GWAMA (ref.262) | Fixed and random effects meta-analysis; allows the specification of different genetic models |
METAL (ref.39) | Weighted meta-analysis using GWAS summary statistics as input |
Variant annotation | |
VEP (ref.115) | Functional annotation of genetic variants with their effect on genes, transcripts and protein sequence as well as regulatory regions |
ANNOVAR (ref.114) | Functional annotation of genetic variants with their effect on genes, transcripts and protein sequence as well as regulatory regions |
FUMA (ref.88) | Functional annotation of genetic variants with their effect on genes, transcripts and protein sequence as well as regulatory regions; includes chromatin interaction information and integrates and visualizes all output |
Enrichment or gene-set analysis | |
MAGMA (ref.136) | Gene-based and gene-set analysis using competitive testing with a regression framework; allows testing of custom gene sets and includes options for conditional and interaction testing between gene sets |
DEPICT (ref.137) | Systematic prioritization of genes and assessment of enriched pathways using predicted gene functions |
LDSC (ref.174) | Partitioned SNP-based heritability analyses showing enrichment in sets of functionally related SNPs |
QTL analysis | |
QTLTools (ref.263) | Molecular QTL discovery and analysis; uses raw genomic (sequence) data as input |
Genetic correlations | |
LDSC (ref.174) | Assessment of genetic correlation between phenotypes using summary statistics as input; has various other functions, including partitioned SNP-based heritability and assessment of selection bias |
GCTA (ref.173) | Assessment of genetic correlation between phenotypes using raw genotypic data as input |
SumHer (ref.264) | Assessment of genetic correlation between phenotypes using summary statistics as input; has various other functions, including partitioned SNP-based heritability and assessment of selection bias |
superGNOVA (ref.183) | Assessment of local genetic correlations using GWAS summary statistics |
ρ-HESS (ref.184) | Assessment of local SNP-based heritability and genetic correlations using GWAS summary statistics |
LAVA (ref.185) | Assessment of local multivariate genetic correlations using GWAS summary statistics |
GenomicSEM (ref.265) | Assessment of multivariate genetic correlations based on GWAS summary statistics |
Causality | |
Mendelian randomization (ref.266) | Assessment of causal relation between traits based on genetic overlap, using GWAS summary statistics as input |
PRS analysis | |
PRScs (ref.146) | Estimation of posterior effect sizes of SNPs using a Bayesian shrinkage approach |
LDPred (ref.151)/LDPred-2 (ref.150) | Estimation of posterior effect sizes of SNPs using a Bayesian shrinkage approach |
SBayesR (ref.147) | Estimation of posterior effect sizes of SNPs using a Bayesian shrinkage approach |
PRSice (ref.144) | PRS analysis using a P value thresholding and clumping approach |
TWAS | |
FUSION (ref.125) | Performing TWAS by predicting functional/molecular phenotypes based on reference data; uses GWAS summary statistics as input |
PrediXcan (ref.126) | Prioritizing likely causal genes based on transcription data; uses GWAS summary statistics as input |
SMR | Testing whether SNP-trait associations are mediated by gene expression levels using a Mendelian randomization approach |
GWAMA, genome-wide association meta-analysis; GWAS, genome-wide association studies; PRS, polygenic risk score; QTL, quantitative trait locus; SNP, single-nucleotide polymorphism; TWAS, transcriptome-wide association studies.
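The per-SNP quality control filters listed in the table (genotyping call rate, minor allele frequency and deviation from Hardy–Weinberg equilibrium) can be sketched in a few lines. This is an illustration of the concepts, not PLINK's implementation: it uses a chi-square approximation for the Hardy–Weinberg test, whereas dedicated tools may use exact tests, and the function names and default thresholds are hypothetical choices that should be set per study.

```python
import math


def snp_qc_metrics(genotypes):
    """Return (call_rate, maf, hwe_p) for one SNP.

    genotypes: per-individual counts of one allele (0, 1 or 2); None = missing.
    hwe_p comes from a 1-df chi-square goodness-of-fit test.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return 0.0, 0.0, 1.0
    call_rate = n / len(genotypes)
    observed = [called.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)   # frequency of the counted allele
    maf = min(p, 1 - p)
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    hwe_p = math.erfc(math.sqrt(chi2 / 2))          # upper tail, chi-square with 1 df
    return call_rate, maf, hwe_p


def passes_snp_qc(genotypes, min_call_rate=0.98, min_maf=0.01, min_hwe_p=1e-6):
    """Apply illustrative thresholds (assumed defaults, not a recommendation)."""
    call_rate, maf, hwe_p = snp_qc_metrics(genotypes)
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p


# Toy SNPs (made up): one in Hardy-Weinberg proportions, one with no heterozygotes.
balanced_snp = [0] * 49 + [1] * 42 + [2] * 9
no_het_snp = [0] * 50 + [2] * 50
print(passes_snp_qc(balanced_snp), passes_snp_qc(no_het_snp))
```

A SNP with a strong heterozygote deficit, as in the second toy example, is a typical sign of genotyping error or population substructure and would be filtered out.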
Accounting for false discovery
Testing millions of associations between individual genetic variants and a phenotype of interest requires a stringent multiple-testing threshold to avoid false positives. The International HapMap Project and other studies have shown that there are approximately 1 million independent common genetic variants across the human genome, resulting in a Bonferroni-corrected threshold of P < 5 × 10⁻⁸ (that is, a family-wise error rate of 0.05 divided by 10⁶ independent tests) (ref.38). The appropriate threshold may vary with the population. For example, a more stringent threshold may be needed for populations with larger effective population sizes, or if the minor allele frequency thresholds for inclusion in a GWAS are lowered as sample sizes increase: low minor allele frequency variants are typically not in linkage disequilibrium with common variants and therefore add to the multiple-testing burden. Complex traits such as height, schizophrenia or type 2 diabetes tend to be highly polygenic, meaning that many genetic variants with small effects contribute to the phenotype. In these cases, winner’s curse is common: effect sizes of variants that only just pass the discovery threshold tend to be overestimated in the initial GWAS.
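Both ideas in this section can be checked numerically. The sketch below computes the Bonferroni threshold and then simulates winner's curse by drawing many noisy effect estimates around one true effect and averaging only those that reach genome-wide significance. The function names, the true effect, the standard error and the number of simulated studies are all hypothetical values chosen for illustration.

```python
import random
from statistics import NormalDist, mean


def bonferroni_threshold(alpha=0.05, n_independent_tests=1_000_000):
    """Per-test P value threshold under Bonferroni correction."""
    return alpha / n_independent_tests


def simulate_winners_curse(true_beta, se, n_studies=20_000, alpha=5e-8, seed=1):
    """Return (mean significant estimate, number of significant studies).

    Each simulated study observes true_beta plus normal noise with standard
    deviation se; a study is 'significant' if |estimate / se| exceeds the
    two-sided critical value for alpha.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    estimates = (rng.gauss(true_beta, se) for _ in range(n_studies))
    significant = [b for b in estimates if abs(b) / se > z_crit]
    if not significant:
        return None, 0
    return mean(significant), len(significant)


print("threshold:", bonferroni_threshold())
mean_sig, n_sig = simulate_winners_curse(true_beta=0.05, se=0.01)
print("true effect 0.05; mean significant estimate:", mean_sig, "from", n_sig, "studies")
```

Because only estimates that happen to land above the threshold are kept, their average exceeds the true effect, which is why replication cohorts often report smaller effects than the discovery GWAS.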
Genome-wide association meta-analysis
To increase sample size, GWAS are typically carried out within a consortium, such as the Psychiatric Genomics Consortium, the Genetic Investigation of Anthropometric Traits (GIANT) consortium or the Global Lipids Genetics Consortium, in which data from multiple cohorts are analysed together using tools such as METAL (ref.39), N-GWAMA or MA-GWAMA (ref.40) and quality control pipelines such as those implemented in RICOPILI (ref.23) or EasyQC (ref.41). For a detailed description of the quality control procedures specific to GWAMA, see ref.42.
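The core arithmetic of combining cohorts can be sketched as a fixed-effects, inverse-variance weighted meta-analysis of one SNP. This is a generic textbook scheme shown under stated assumptions (independent cohorts, a shared true effect, the same effect allele in every cohort); it is not a reproduction of any specific tool's code, and the function name and the cohort values are hypothetical.

```python
import math
from statistics import NormalDist


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects meta-analysis of per-cohort effect estimates for one SNP.

    Returns (pooled_beta, pooled_se, p_value). Each cohort is weighted by
    1 / se^2, so more precise cohorts contribute more.
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    if any(se <= 0 for se in standard_errors):
        raise ValueError("standard errors must be positive")
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    z = pooled_beta / pooled_se
    p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided, normal approximation
    return pooled_beta, pooled_se, p_value


# Three made-up cohorts reporting the same SNP.
cohort_betas = [0.12, 0.08, 0.10]
cohort_ses = [0.04, 0.02, 0.03]
print(inverse_variance_meta(cohort_betas, cohort_ses))
```

The pooled standard error is smaller than that of any single cohort, which is the statistical reason consortia can detect associations that individual cohorts cannot.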
References:
https://www.nature.com/articles/s43586-021-00056-9
https://chinacohort.bjmu.edu.cn/about/