NOTE (22 May 2024): When using TOPMed Imputation Server v2.0.0 with reference panel version r3, use an imputation quality threshold of 0.98 to identify high-quality SNPs.
NOTE (24 April 2024): The TOPMed Imputation Server v1.6.6 (Minimac v4-1.0.2 for imputation, Eagle v2.4 for phasing, r2 for reference panel) was used for testing TSIM and generating all results in the paper (see below). Since completing our analyses, the TOPMed Imputation Server has been updated to v2.0.0 (Minimac v4.1.6 for imputation, Eagle v2.4 for phasing, r3 for reference panel). We've found that the r2 filter for high-quality SNPs needs to be recalibrated for this updated reference panel and are currently working on that analysis.
- python (3.9+)
  - cyvcf2
  - native libraries
    - sys
    - logging
    - datetime
    - argparse
    - gzip
- bcftools
- tabix
conda env create -f environment.yml
conda activate tsim
If you get an error building wheel for cyvcf2, you can install the package manually with pip after activating the conda environment.
conda activate tsim
pip install cyvcf2
tsim has 4 subcommands. Use the -h flag of tsim.py to list them, or pass -h after a subcommand to see its options.
tsim.py -h
tsim.py rsq -h      # recalculate Rsq based on selected samples
tsim.py qc -h       # apply Rsq, ER2, MAF, and HWE filters to imputed variants
tsim.py overlap -h  # find the intersection of 2 variant lists
tsim.py merge -h    # merge 2 VCFs based on a variant list
NOTE: tsim was developed using output from the TOPMed Imputation Server v1.6.6 (Minimac4 for imputation, Eagle v2.4 for phasing, r2 for reference panel). We are aware that there were some recent changes in output format and are working on updating these scripts accordingly.
Before running tsim.py, QC and impute your cohorts separately.
Rsq (or R2) does not generally need to be recalculated. However, it is a sample-based calculation, so if you are working with a subset of the samples included in the imputation results, recalculating Rsq gives more accurate values for identifying high-quality SNPs.
- Input file (-v): imputed VCF
- Output file (-o): TSV file containing variant ID, alternative allele frequency (AAF), recalculated rsq (RSQ), original rsq (RSQ_TOPMED), and empirical rsq (ER2)
python tsim.py rsq -v a.vcf.gz -o a.recalc_rsq.tsv -s a.samples.txt
python tsim.py rsq -v b.vcf.gz -o b.recalc_rsq.tsv -s b.samples.txt
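To illustrate why Rsq depends on which samples are included, here is a minimal sketch that computes an Rsq-style estimate from per-haplotype alternate-allele dosages. It assumes the commonly described Minimac-style definition (observed variance of the haplotype dosages divided by p(1-p)); it is not tsim's own code, and the function name `estimate_rsq` and the dosage values are hypothetical.

```python
from typing import Sequence


def estimate_rsq(hap_dosages: Sequence[float]) -> float:
    """Rsq-style estimate from per-haplotype alternate-allele dosages (0..1).

    Assumes the Minimac-style definition: observed variance of the
    haplotype dosages divided by the variance expected under perfect
    imputation, p * (1 - p), where p is the mean dosage.
    """
    n = len(hap_dosages)
    if n == 0:
        raise ValueError("need at least one haplotype dosage")
    p = sum(hap_dosages) / n
    expected_var = p * (1.0 - p)
    if expected_var == 0.0:
        # Monomorphic in this sample set: Rsq is undefined; report 0.0 here.
        return 0.0
    observed_var = sum((d - p) ** 2 for d in hap_dosages) / n
    return observed_var / expected_var


# Hypothetical dosages for one variant: all imputed samples vs. a subset.
full_cohort = [0.9, 0.1, 0.8, 0.2, 0.0, 0.0, 0.1, 0.0]
subset = full_cohort[:4]
print(round(estimate_rsq(full_cohort), 3), round(estimate_rsq(subset), 3))
```

The two printed values differ because both the estimated allele frequency and the dosage variance change with the sample set, which is the motivation for recalculating Rsq on the samples you actually analyze.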
- Input files (-m, -r): TSVs (can be gzipped) containing variant ID, allele frequency and rsq
  - Optional input file (--hwe): *.hwe from PLINK's --hardy option
  - The files containing allele frequencies (-m) and rsqs (-r) can be different. If they are the same, specify the same file for both -m and -r. Both flags are required.
- Output file (-o): text file containing variant IDs passing QC
- Specify chromosome using -c or --chrom
Recommended HWE command:
plink --vcf <vcf> --allow-no-sex --hardy --mpheno 4 --out <output> --pheno <fam_file> --update-sex <fam_file> 3
Default options:
- QC filters
  - Rsq filter (-rf): >=0.99
  - MAF filter (-mf): >=0.01
  - ER2 filter (-ef): >=0.90
  - HWE filter (-hf): >=1e-6
- column numbers (based on output of rsq command)
  - variant ID (-mvc, -rvc): 1
  - Rsq (-rc): 3
  - MAF (-mc): 2
  - ER2 (-ec): 5
If you are working with a control-only cohort and want to apply the HWE filter, use the --nocases flag.
python tsim.py qc -r a.recalc_rsq.tsv -m a.recalc_rsq.tsv -o a.variant_qc.txt --chrom 22 --hwe a.hardy.hwe
python tsim.py qc -r b.recalc_rsq.tsv -m b.recalc_rsq.tsv -o b.variant_qc.txt -c 22 --hwe b.hardy.hwe
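The threshold logic behind the qc defaults can be sketched roughly as follows. This is an illustration, not tsim's implementation: the function name `passes_qc`, the example rows, and the variant ID format are hypothetical, the HWE filter is left out, and two behaviours are assumptions (folding the allele frequency into a MAF, and not filtering on ER2 when it is missing).

```python
import csv
import io
import math


def passes_qc(row, rsq_min=0.99, maf_min=0.01, er2_min=0.90):
    """Return True if one TSV row passes the Rsq, MAF and ER2 thresholds.

    Column positions follow the documented defaults (1-based):
    variant ID = 1, allele frequency = 2, Rsq = 3, ER2 = 5.
    """
    aaf = float(row[1])
    maf = min(aaf, 1.0 - aaf)  # assumption: fold allele frequency into MAF
    rsq = float(row[2])
    if rsq < rsq_min or maf < maf_min:
        return False
    try:
        er2 = float(row[4])
    except ValueError:
        # Assumption: a non-numeric ER2 (e.g. "NA") means no ER2 is
        # available, so the variant is not filtered on ER2.
        return True
    if math.isnan(er2):
        return True
    return er2 >= er2_min


# Hypothetical rows in the column order of the rsq output (no header line).
demo_tsv = (
    "chr22:100:A:G\t0.25\t0.995\t0.99\t0.95\n"
    "chr22:200:C:T\t0.002\t0.999\t0.99\tNA\n"
    "chr22:300:G:A\t0.40\t0.97\t0.98\tNA\n"
    "chr22:400:T:C\t0.98\t0.999\t0.99\tNA\n"
    "chr22:500:A:C\t0.30\t0.995\t0.99\t0.50\n"
)
rows = list(csv.reader(io.StringIO(demo_tsv), delimiter="\t"))
passing = [row[0] for row in rows if passes_qc(row)]
print(passing)
```

Keeping the thresholds as keyword arguments mirrors the way -rf, -mf, and -ef override the defaults on the command line.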
This command assumes that variants have a consistent naming scheme across all cohorts.
- Input file (-l): text file containing list of file paths to high-quality SNP lists (i.e., output of qc command)
- Output file (-o): text file containing list of variants that are shared between all high-quality SNP lists
- Specify chromosome using -c or --chrom
### to create input file
ls *.variant_qc.txt > l.filelist.txt
###
python tsim.py overlap -l l.filelist.txt -o l.overlap.txt -c 22
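Conceptually, the overlap step is a set intersection over the per-cohort variant lists. The sketch below shows that idea under the assumption that each list holds one variant ID per line; the helper names and variant IDs are hypothetical, and this is not tsim's own code.

```python
import os
import tempfile


def read_variant_ids(path):
    """Read one variant ID per line, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def overlap(list_paths):
    """Return the sorted variant IDs present in every listed file."""
    sets = [read_variant_ids(path) for path in list_paths]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Hypothetical QC output for two cohorts, written to a temporary directory.
contents = {
    "a.variant_qc.txt": ["chr22:100:A:G", "chr22:200:C:T", "chr22:300:G:A"],
    "b.variant_qc.txt": ["chr22:300:G:A", "chr22:100:A:G", "chr22:400:T:C"],
}
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, ids in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as handle:
            handle.write("\n".join(ids) + "\n")
        paths.append(path)
    shared = overlap(paths)
print(shared)
```

Because IDs are compared as exact strings, a cohort that names the same variant differently (for example, without the chr prefix) would contribute no matches, which is why a consistent naming scheme matters.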
- Input file (-l): CSV file containing paths to VCFs to merge, SNP lists to merge on, and samples to include for each file (column 1 = VCF files, column 2 = SNP lists, column 3 = sample lists). A sample list can include all samples to be merged; it does not have to be cohort-specific.
- Output file (-o): merged VCFs
- Specify chromosome using -c or --chrom
- For high-quality SNPs, use flag --snpsonly
### to create input file
echo "a.vcf.gz,l.overlap.txt,a.samples.txt" > l.mergelist.txt
echo "b.vcf.gz,l.overlap.txt,b.samples.txt" >> l.mergelist.txt
###
python tsim.py merge -l l.mergelist.txt -o merged.vcf.gz -c 22 --snpsonly
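With more than two cohorts, the merge list can be generated instead of typed by hand. The sketch below builds the same three-column CSV as the echo commands; the function name `build_merge_list` is hypothetical and the file names are the ones used in the example.

```python
import csv
import io


def build_merge_list(cohorts, snp_list):
    """Return CSV text with one row per cohort: VCF, SNP list, sample list.

    `cohorts` is a list of (vcf_path, sample_list_path) pairs; the same
    SNP list is used for every cohort.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for vcf_path, sample_path in cohorts:
        writer.writerow([vcf_path, snp_list, sample_path])
    return buffer.getvalue()


cohorts = [("a.vcf.gz", "a.samples.txt"), ("b.vcf.gz", "b.samples.txt")]
merge_list_text = build_merge_list(cohorts, "l.overlap.txt")
print(merge_list_text, end="")
# To use it with tsim.py merge, write the text to l.mergelist.txt.
```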
- Impute the merged VCFs.
The rsq and qc functions may also be used after the second stage of imputation.
Anya Greenberg, Kaylia Reynolds, Michelle T McNulty, Matthew G. Sampson, Hyun Min Kang, Dongwon Lee. "Accurate cross-platform GWAS analysis via two-stage imputation." https://www.medrxiv.org/content/10.1101/2024.04.19.24306081v1