Northwestern University logo

Genomic Data

Feedback


Release summary
  • Strains: 913
  • WGS strains: 773
  • Isotypes: 403
  • Genome: WS276

Release Notes

The 20200815 release includes genotypes from whole-genome sequences and reduced representation (RAD) sequencing. Genotypes are compared for concordance, and strains that are 99.95% identical to each other are grouped into isotypes. One strain within each isotype is the reference strain for that isotype. All isotype reference strains are available on CeNDR.

Datasets


Dataset Description Download
Strain Data Includes strain, isotype, location information, and more. CelegansStrainData.tsv
Strain Issues This link contains all strain issues for this release
Alignment Data This link contains all alignment data as BAM or BAI files.
Soft-Filtered Variants The soft-filtered VCF includes all variants and annotations called by the GATK pipeline. The QC status of each variant (INFO field=FILTER) and genotype (Format Field=FT) is specified by a VCF Field. All Strains
WI.20200815.soft-filter.vcf.gz
WI.20200815.soft-filter.vcf.gz.tbi

Isotypes
WI.20200815.soft-filter.isotype.vcf.gz
WI.20200815.soft-filter.isotype.vcf.gz.tbi
Hard-Filtered Variants The hard-filtered VCF includes only high-quality variants after all variants and genotypes with a failed QC status are removed. To obtain vcf for a single or a subset of strains, use bcftools view --samples All Strains
WI.20200815.hard-filter.vcf.gz
WI.20200815.hard-filter.vcf.gz.tbi

Isotypes
WI.20200815.hard-filter.isotype.vcf.gz
WI.20200815.hard-filter.isotype.vcf.gz.tbi
Imputed Variants The imputed VCF includes all the variants from the hard-filtered Isotype VCF, but all missing genotypes have been imputed using Beagle v5.1. Isotypes
WI.20200815.impute.isotype.vcf.gz
WI.20200815.impute.isotype.vcf.gz.tbi
Reference Genome FASTA (WS276) The reference genome build from Wormbase used for alignment and annotation. c_elegans.PRJNA13758.WS276.genomic.fa.gz
Transposon Calls We have performed transposon calling for a subset of isotypes as part of Laricchia et al. tes_cender.bed
Tree Tree generated using neighbour-joining algorithm as implemented in QuickTree in Newick and PDF format. All Strains
WI.20200815.hard-filter.min4.tree
WI.20200815.hard-filter.min4.tree.pdf

Isotype
WI.20200815.hard-filter.isotype.min4.tree
WI.20200815.hard-filter.isotype.min4.tree.pdf
Haplotypes Haplotypes for isotypes were calculated and plotted as described in Lee et al. haplotype.png
haplotype.pdf
Sweep Haplotypes The most frequent haplotype that covers at least 30% of the chromosome and is found on chromosome centers was determined and classified as a selective sweep. For more details, see Andersen et al. and Lee et al.. The plot shows red (swept), gray (non-swept), and white (not classified) regions. sweep.pdf
sweep_summary.tsv
Hyper-Divergent Regions The hyper-divergent regions are characterized by higher-than-average density of small variants and large genomic spans where short sequence reads fail to align to the N2 reference genome. They were identified as described in Lee et al. divergent_regions_strain.bed
divergent_regions_strain.bed.gz
Download BAMs Script You can batch download individual strain BAMs using this script. download_bams.sh

Methods

This tab links to the nextflow pipelines used to process wild isolate sequence data.

FASTQ QC and Trimming

andersenlab/trim-fq-nf -- (Latest f0b63e)

Adapters and low quality sequences were trimmed off of raw reads using fastp (0.20.0) and default parameters. Reads shorter than 20 bp after trimming were discarded.

Alignment

andersenlab/alignment-nf -- (892b37)

Trimmed reads were aligned to C. elegans reference genome (project PRJNA13758 version WS276 from the Wormbase) using bwa mem BWA (0.7.17). Libraries of the same strain were merged together and indexed by sambamba (0.7.0). Duplicates were flagged with Picard (2.21.3).

Strains with less than 14x coverage were not included in the alignment report and subsequent analyses.

Variant Calling

andersenlab/wi-gatk -- (c9dd1e)

Variants for each strain were called using gatk HaplotypeCaller. After the initial variant calling, variants were combined and then recalled jointly using gatk GenomicsDBImport and gatk GenotypeGVCFs GATK (4.1.4.0).

The variants were further processed and filtered with custom-written scripts for heterozygous SNV polarization, GATK (4.1.4.0), and bcftools (1.10).

Warning
Heterozygous polarization and filtering thresholds were optimized for single nucleotide variants (SNVs).

Additionally, insertion or deletion (indel) variants less than 50 bp are more reliably called than indel variants greater than this size. In general, indel variants should be considered less reliable than SNVs.

Site-level filtering and annotation

  1. Heterozygous SNV polarization: Because C. elegans is a selfing species, heterozygous SNV sites are most likely errors. Biallelic heterozygous SNVs were converted to homozygous REF or ALT if we had sufficient evidence for conversion. Only biallelic SNVs that are not on mitochondria DNA were included in this step. Specifically, the SNV was converted if the normalized Phred-scaled likelihoods (PL) met the following criteria (a smaller PL means more confidence). Any heterozygous SNVs that did not meet these criteria were left unchanged.

    • If PL-ALT/PL-REF <= 0.5 and PL-ALT <= 200, convert to homozygous ALT
    • If PL-REF/PL-ALT <= 0.5 and PL-REF <= 200, convert to homozygous REF
  2. Soft filtering: Low quality sites were flagged but not modified or removed.

    For the site-level soft filter, variant sites that meet the following conditions were flagged as PASS. These stats were computed across all samples for each site.

    • Variant quality (QUAL) > 30 (this filter is very lenient, only three sites failed)
    • Variant quality normalized by read depth (QD) > 20
    • Strand bias of ALT calls: strand odds ratio (SOR) < 5
    • Strand bias of ALT calls: Fisherstrand (FS) < 100
    • Fraction of samples with missing genotype < 95%
    • Fraction of samples with heterozygous genotype after heterozygous site polarization < 10%

    For the sample-level soft filter, genotypes that meet the following filters were flagged as PASS for each site in each sample:

    • Read depth (DP) > 5
    • Site is not heterozygous
  3. SnpEff Annotation: The predicted impact of each variant site was annotated with SnpEff (4.3.1t).

  4. For the hard-filtered VCF, low quality sites were modified or removed using the following criteria.

    • For the site-level hard filter, variant sites not flagged as PASS were removed.
    • For the sample-level hard filter, genotypes not flagged as PASS were converted to missing (./.), with the exception that heterozygous sites on mitochondria where kept unchanged.

    After the steps above, sites that are invariant (0/0 or 1/1 across all samples, not counting missing ./.) were removed.

  5. Variant impacts were then annotated using bcftools csq, which takes into consideration nearby variants and annotates variant impacts based on haplotypes.

Determination of filter thresholds

We re-examined our filter thresholds for this release. A variant simulation pipeline was used as part of this process:

Please see the filter optimization report for further details.

Isotype Assignment

andersenlab/concordance-nf -- (ae3d80)

Isotype groups contain strains that are likely identical to each other and were sampled from the same isolation locations. For any phenotypic assay, only the isotype reference strain needs to be scored. Users interested in individual strain genotypes can use the strain-level data.

Strains were grouped into isotypes using the following steps:

  1. Using all high quality variants (the hard-filtered VCF) and bcftools gtcheck, concordance for each pair of strains was calculated as a fraction of shared variants over the total variants in each pair.

  2. Strain pairs with concordance > 0.9995 were grouped into the same isotype group. The threshold 0.9995 was determined by:

    • Examining the distribution of concordance scores.
    • Capturing similarity between strains to minimize the number of strains that get assigned to multiple isotype groups.
    • Agreement with the isotype groups in previous releases.
  3. The following issues, which were rare, were resolved on a case-by-case basis:

    • If one strain was assigned to multiple isotypes.
    • If one isotype from previous releases matches to multiple new isotype groups.
    • If one new isotype group contains strains from multiple isotypes from previous releases.

When issues arose, the pairwise concordance between all strains within an isotype were examined manually. Strains and isotypes may be re-assigned with the goal that strains within the same isotype group should have high concordance with each other, and strains from different isotype groups should have lower concordance.

Tree Generation

Trees were generated by converting the hard-filtered VCF to Phylip format using vcf2phylip (030b8d). Then, the Phylip format was converted to Stockholm format using Bioconvert (0.3.0), which was then used to construct a tree with QuickTree (2.5) using default settings. The trees were plotted with FigTree (1.4.4) rooting on the most diverse strain XZ1516.

Imputation

Imputation was done using Beagle (5.1) with the following parameters: window=5 overlap=2 (except for chromosome X, which used window=3 overlap=1 to run with available memory) impute=true ne=100000 imp-segment=0.5 imp-step=0.01 cluster=0.0005.