Filtering Commands

pVACseq currently offers four filters: a binding filter, a coverage filter, a transcript support level filter, and a top score filter.

These filters are always run automatically as part of the pVACseq pipeline using default cutoffs.

All filters can also be run manually on the filtered.tsv file to narrow the results down further, or they can be run on the all_epitopes.tsv file to apply different filtering thresholds.
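As a rough sketch of a manual run, the snippet below assembles a `pvacseq binding_filter` command line from the options documented further down. The helper `build_filter_command` and the output file name `rebinned.tsv` are hypothetical and used only for illustration. The command is printed rather than executed.

```python
import shlex


def build_filter_command(filter_name, input_file, output_file, options=()):
    """Assemble a pvacseq filter command (hypothetical helper, not part of pVACseq).

    Options come first, followed by the two positional arguments,
    matching the usage strings shown in this section.
    """
    return ["pvacseq", filter_name, *options, input_file, output_file]


# Re-filter the unfiltered report with a stricter binding threshold and a
# minimum fold change of 1. "rebinned.tsv" is an illustrative file name.
command = build_filter_command(
    "binding_filter",
    "all_epitopes.tsv",
    "rebinned.tsv",
    options=["-b", "250", "-c", "1"],
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` if pVACseq is installed in the current environment.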

The binding filter is used to remove neoantigen candidates that do not meet desired peptide:MHC binding criteria. The coverage filter is used to remove variants that do not meet desired read count and VAF criteria (in normal DNA and tumor DNA/RNA). The transcript support level filter is used to remove variant annotations based on low quality transcript annotations. The top score filter is used to select the most promising peptide candidate for each variant. Multiple candidate peptides from a single somatic variant can be caused by multiple peptide lengths, registers, HLA alleles, and transcript annotations.

Further details on each of these filters are provided below.

Note

The default values for filtering thresholds are suggestions only. While they are based on review of the literature and consultation with our clinical and immunology colleagues, your specific use case will determine the appropriate values.

Binding Filter

usage: pvacseq binding_filter [-h] [-b BINDING_THRESHOLD]
                              [-p PERCENTILE_THRESHOLD]
                              [-c MINIMUM_FOLD_CHANGE] [-m {lowest,median}]
                              [--exclude-NAs] [-a]
                              input_file output_file

Filter variants processed by IEDB by binding score.

positional arguments:
  input_file            The final report .tsv file to filter.
  output_file           Output .tsv file containing list of filtered epitopes
                        based on binding affinity.

optional arguments:
  -h, --help            show this help message and exit
  -b BINDING_THRESHOLD, --binding-threshold BINDING_THRESHOLD
                        Report only epitopes where the mutant allele has ic50
                        binding scores below this value. (default: 500)
  -p PERCENTILE_THRESHOLD, --percentile-threshold PERCENTILE_THRESHOLD
                        Report only epitopes where the mutant allele has a
                        percentile rank below this value. (default: None)
  -c MINIMUM_FOLD_CHANGE, --minimum-fold-change MINIMUM_FOLD_CHANGE
                        Minimum fold change between mutant binding score and
                        wild-type score. The default is 0, which filters no
                        results, but 1 is often a sensible option (requiring
                        that binding is better to the MT than WT). (default:
                        0)
  -m {lowest,median}, --top-score-metric {lowest,median}
                        The ic50 scoring metric to use when filtering epitopes
                        by binding-threshold or minimum fold change. lowest:
                        Use the Best MT Score and corresponding Fold Change
                        (i.e. use the lowest MT ic50 binding score and
                        corresponding fold change of all chosen prediction
                        methods). median: Use the Median MT Score and Median
                        Fold Change (i.e. use the median MT ic50 binding score
                        and fold change of all chosen prediction methods).
                        (default: median)
  --exclude-NAs         Exclude NA values from the filtered output. (default:
                        False)
  -a, --allele-specific-binding-thresholds
                        Use allele-specific binding thresholds. To print the
                        allele-specific binding thresholds run `pvacseq
                        allele_specific_cutoffs`. If an allele does not have a
                        special threshold value, the `--binding-threshold`
                        value will be used. (default: False)

The binding filter removes variants that don’t pass the chosen binding threshold. The user can choose whether to apply this filter to the lowest or the median binding affinity score by setting the --top-score-metric flag. The lowest binding affinity score is recorded in the Best MT Score column and represents the lowest ic50 score of all prediction algorithms that were picked during the previous pVACseq run. The median binding affinity score is recorded in the Median MT Score column and corresponds to the median ic50 score of all prediction algorithms used to create the report. By default, the binding filter runs on the median binding affinity.

The binding filter also offers the option to filter on the Fold Change columns, which contain the ratio of the WT score to the MT score. Because a lower ic50 score indicates stronger binding, a fold change above 1 means the mutant peptide is predicted to bind better than the corresponding wild-type peptide. This option can be activated by setting the --minimum-fold-change threshold. If the --top-score-metric option is set to lowest, the Corresponding Fold Change column will be used (Corresponding WT Score/Best MT Score). If the --top-score-metric option is set to median, the Median Fold Change column will be used (Median WT Score/Median MT Score).

By default, entries with NA values will be included in the output. This behavior can be turned off by using the --exclude-NAs flag.
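The logic described above can be sketched in a few lines of Python. This is an illustration of the filtering rules as documented here, not pVACseq's actual implementation. The column names come from the text above; the function names, the treatment of NA cells per criterion, and the exact boundary behaviour (strictly below the threshold, fold change greater than or equal to the minimum) are assumptions.

```python
# Score and fold change columns for each --top-score-metric choice.
SCORE_COLUMNS = {
    "median": ("Median MT Score", "Median Fold Change"),
    "lowest": ("Best MT Score", "Corresponding Fold Change"),
}


def parse_value(text):
    """Return a float, or None when the cell is empty or NA."""
    if text is None or text.strip() in ("", "NA"):
        return None
    return float(text)


def passes_binding_filter(row, binding_threshold=500.0, minimum_fold_change=0.0,
                          top_score_metric="median", exclude_nas=False):
    """Decide whether one report row survives the binding filter (sketch)."""
    score_col, fold_col = SCORE_COLUMNS[top_score_metric]
    checks = [
        # Assumed boundary: ic50 must be strictly below the threshold.
        (parse_value(row.get(score_col)), lambda v: v < binding_threshold),
        # Assumed boundary: fold change (WT/MT) must reach the minimum.
        (parse_value(row.get(fold_col)), lambda v: v >= minimum_fold_change),
    ]
    for value, test in checks:
        if value is None:
            # NA cells are kept unless exclusion was requested.
            if exclude_nas:
                return False
            continue
        if not test(value):
            return False
    return True


# Made-up rows for illustration only.
strong = {"Median MT Score": "120.5", "Median Fold Change": "3.2"}
weak = {"Median MT Score": "850.0", "Median Fold Change": "1.5"}
no_wt = {"Median MT Score": "300.0", "Median Fold Change": "NA"}
worse_than_wt = {"Median MT Score": "200.0", "Median Fold Change": "0.5"}
```

With the defaults, `strong`, `no_wt`, and `worse_than_wt` are kept and `weak` is removed. Setting `minimum_fold_change=1` additionally removes `worse_than_wt`, and `exclude_nas=True` removes `no_wt`.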

Coverage Filter

usage: pvacseq coverage_filter [-h] [--normal-cov NORMAL_COV]
                               [--tdna-cov TDNA_COV] [--trna-cov TRNA_COV]
                               [--normal-vaf NORMAL_VAF] [--tdna-vaf TDNA_VAF]
                               [--trna-vaf TRNA_VAF] [--expn-val EXPN_VAL]
                               [--exclude-NAs]
                               input_file output_file

Filter variants processed by IEDB by coverage, vaf, and gene expression

positional arguments:
  input_file            The final report .tsv file to filter
  output_file           Output .tsv file containing list of filtered epitopes
                        based on coverage and expression values

optional arguments:
  -h, --help            show this help message and exit
  --normal-cov NORMAL_COV
                        Normal Coverage Cutoff. Sites above this cutoff will
                        be considered. (default: 5)
  --tdna-cov TDNA_COV   Tumor DNA Coverage Cutoff. Sites above this cutoff
                        will be considered. (default: 10)
  --trna-cov TRNA_COV   Tumor RNA Coverage Cutoff. Sites above this cutoff
                        will be considered. (default: 10)
  --normal-vaf NORMAL_VAF
                        Normal VAF Cutoff. Sites BELOW this cutoff in normal
                        will be considered. (default: 0.02)
  --tdna-vaf TDNA_VAF   Tumor DNA VAF Cutoff. Sites above this cutoff will be
                        considered. (default: 0.25)
  --trna-vaf TRNA_VAF   Tumor RNA VAF Cutoff. Sites above this cutoff will be
                        considered. (default: 0.25)
  --expn-val EXPN_VAL   Gene and Transcript Expression cutoff. Sites above
                        this cutoff will be considered. (default: 1.0)
  --exclude-NAs         Exclude NA values from the filtered output. (default:
                        False)

If the input VCF contains readcount and/or expression annotations, then the coverage filter can be run again on the filtered.tsv report file to narrow down the results even further. You can also run this filter again on the all_epitopes.tsv report file to apply different cutoffs.

The general goals of these filters are to limit variants for neoepitope prediction to those with good read support and/or remove possible sub-clonal variants. In some cases the input VCF may have already been filtered in this fashion. This filter also allows for removal of variants that do not have sufficient evidence of RNA expression.

For details on how to prepare input VCFs that contain all of these annotations, refer to the Input File Preparation section.

By default, entries with NA values will be included in the output. This behavior can be turned off by using the --exclude-NAs flag.
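A minimal sketch of how the coverage cutoffs combine is shown below. It is not pVACseq's own code. The default cutoffs are taken from the help text above, but the report column names (for example `Normal Depth` and `Tumor DNA VAF`) and the inclusive comparisons are assumptions made for this illustration.

```python
# (column name, comparison, default cutoff). Column names are assumed;
# cutoffs mirror the defaults listed in the help text above.
CRITERIA = [
    ("Normal Depth", ">=", 5),
    ("Tumor DNA Depth", ">=", 10),
    ("Tumor RNA Depth", ">=", 10),
    ("Normal VAF", "<=", 0.02),   # the only cutoff that keeps values BELOW it
    ("Tumor DNA VAF", ">=", 0.25),
    ("Tumor RNA VAF", ">=", 0.25),
    ("Gene Expression", ">=", 1.0),
    ("Transcript Expression", ">=", 1.0),
]


def passes_coverage_filter(row, criteria=CRITERIA, exclude_nas=False):
    """Return True if a report row meets every coverage criterion (sketch)."""
    for column, comparison, cutoff in criteria:
        raw = row.get(column, "NA")
        if raw in ("", "NA"):
            # Missing annotations are kept unless exclusion was requested.
            if exclude_nas:
                return False
            continue
        value = float(raw)
        ok = value >= cutoff if comparison == ">=" else value <= cutoff
        if not ok:
            return False
    return True


# Made-up rows for illustration only.
well_supported = {
    "Normal Depth": "30", "Tumor DNA Depth": "45", "Tumor RNA Depth": "20",
    "Normal VAF": "0.0", "Tumor DNA VAF": "0.4", "Tumor RNA VAF": "0.5",
    "Gene Expression": "12.3", "Transcript Expression": "8.1",
}
subclonal = dict(well_supported, **{"Tumor DNA VAF": "0.1"})
no_rna = dict(well_supported, **{"Tumor RNA Depth": "NA", "Tumor RNA VAF": "NA"})
```

Here `subclonal` is removed because its tumor DNA VAF falls under 0.25, while `no_rna` is kept by default and removed only when `exclude_nas=True`.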

Transcript Support Level Filter

usage: pvacseq transcript_support_level_filter [-h]
                                               [--maximum-transcript-support-level {1,2,3,4,5}]
                                               [--exclude-NAs]
                                               input_file output_file

Filter variants processed by IEDB by transcript support level

positional arguments:
  input_file            The all_epitopes.tsv or filtered.tsv pVACseq report
                        file to filter.
  output_file           Output .tsv file containting list of of filtered
                        epitopes based on transcript support level.

optional arguments:
  -h, --help            show this help message and exit
  --maximum-transcript-support-level {1,2,3,4,5}
                        The threshold to use for filtering epitopes on the
                        transcript support level. Keep all epitopes with a
                        transcript support level <= to this cutoff. (default:
                        1)
  --exclude-NAs         Exclude NA values from the filtered output. (default:
                        False)

This filter is used to eliminate variant annotations based on poorly supported transcripts. By default, only transcripts with a transcript support level (TSL) of 1 are kept. This threshold can be raised using the --maximum-transcript-support-level parameter, which keeps all transcripts with a TSL less than or equal to the given value.

By default, entries with NA values will be included in the output. This behavior can be turned off by using the --exclude-NAs flag.
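The rule can be sketched as follows. This is an illustration rather than pVACseq's implementation; the column name `Transcript Support Level` and the function name are assumptions.

```python
def passes_tsl_filter(row, maximum_tsl=1, exclude_nas=False):
    """Keep rows whose transcript support level is <= maximum_tsl (sketch).

    The column name "Transcript Support Level" is assumed here.
    """
    raw = row.get("Transcript Support Level", "NA")
    if raw in ("", "NA"):
        # Rows without a TSL annotation are kept unless exclusion was requested.
        return not exclude_nas
    return int(float(raw)) <= maximum_tsl


# Made-up rows for illustration only.
rows = [
    {"Transcript": "T1", "Transcript Support Level": "1"},
    {"Transcript": "T2", "Transcript Support Level": "3"},
    {"Transcript": "T3", "Transcript Support Level": "NA"},
]
kept_default = [r["Transcript"] for r in rows if passes_tsl_filter(r)]
kept_relaxed = [r["Transcript"] for r in rows if passes_tsl_filter(r, maximum_tsl=3)]
```

Under the default cutoff only `T1` and the unannotated `T3` remain; raising the cutoff to 3 also keeps `T2`.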

Top Score Filter

usage: pvacseq top_score_filter [-h] [-m {lowest,median}]
                                input_file output_file

Pick the best neoepitope for each variant

positional arguments:
  input_file            The final report .tsv file to filter.
  output_file           Output .tsv file containing only the list of the top
                        epitope per variant.

optional arguments:
  -h, --help            show this help message and exit
  -m {lowest,median}, --top-score-metric {lowest,median}
                        The ic50 scoring metric to use for filtering. lowest:
                        Use the best MT Score (i.e. the lowest MT ic50 binding
                        score of all chosen prediction methods). median: Use
                        the median MT Score (i.e. the median MT ic50 binding
                        score of all chosen prediction methods). (default:
                        median)

This filter picks the top epitope for a variant. Epitopes with the same Chromosome - Start - Stop - Reference - Variant are identified as coming from the same variant.

To account for different splice sites among the transcripts of a variant that would lead to different peptides, this filter also takes into account the different transcripts returned by VEP and will return the top epitope for each transcript if the top epitopes are non-identical. If the top epitopes for all transcripts of a variant are identical, the epitope for the transcript with the highest expression is returned. If expression information is not available, the epitope for the transcript with the lowest Ensembl ID is returned.

By default, the --top-score-metric option is set to median, which applies this filter to the Median MT Score column and picks the epitope with the lowest median mutant ic50 score for each variant. If the --top-score-metric option is set to lowest, the Best MT Score column is used instead.
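The per-variant selection can be sketched as below. This is a simplified illustration, not pVACseq's implementation: it groups rows by the five variant columns named above and keeps the row with the lowest score, and it deliberately omits the per-transcript handling described earlier. The `MT Epitope Seq` column name and all row values are assumptions made for the example.

```python
VARIANT_KEY = ("Chromosome", "Start", "Stop", "Reference", "Variant")


def top_epitope_per_variant(rows, top_score_metric="median"):
    """Keep the lowest-scoring row for each variant (simplified sketch)."""
    score_col = "Median MT Score" if top_score_metric == "median" else "Best MT Score"
    best = {}
    for row in rows:
        key = tuple(row[column] for column in VARIANT_KEY)
        score = float(row[score_col])
        # A lower ic50 score means stronger predicted binding.
        if key not in best or score < float(best[key][score_col]):
            best[key] = row
    return list(best.values())


def make_row(start, epitope, median, lowest):
    """Build a made-up report row for illustration."""
    return {
        "Chromosome": "chr1", "Start": start, "Stop": start,
        "Reference": "A", "Variant": "T",
        "MT Epitope Seq": epitope,  # assumed column name
        "Median MT Score": median, "Best MT Score": lowest,
    }


rows = [
    make_row("100", "KLPEPCPST", "100.0", "50.0"),
    make_row("100", "LPEPCPSTT", "150.0", "20.0"),
    make_row("200", "AAAAAAAAA", "400.0", "300.0"),
]
by_median = [r["MT Epitope Seq"] for r in top_epitope_per_variant(rows)]
by_lowest = [r["MT Epitope Seq"] for r in top_epitope_per_variant(rows, "lowest")]
```

The two metrics can disagree: for the first variant, the median metric selects `KLPEPCPST` while the lowest metric selects `LPEPCPSTT`.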

It is important to note that there are several reasons why a particular variant can lead to multiple peptides with different predicted binding affinities. The following can result in multiple peptides and/or binding predictions for a single variant:

1. Different epitope lengths: specifying multiple epitope lengths results in similar but non-identical epitope sequences for each variant (e.g. KLPEPCPS, KLPEPCPST, KLPEPCPSTT, KLPEPCPSTTP).
2. Different registers: pVACseq will test epitopes where the mutation is in every position (e.g. EPCPSTTP, PEPCPSTT, LPEPCPST, KLPEPCPS, …).
3. Different transcripts: in some cases the peptide sequence surrounding a variant will depend on the reference transcript sequence, particularly if there are alternative splice sites near the variant position.
4. Different HLA alleles: the HLA allele that produces the best predicted binding affinity is chosen.
5. A homozygous somatic variant with heterozygous proximal variants nearby may produce multiple different peptides.
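To make the first two points concrete, the sketch below enumerates every window of a given length that contains a mutated residue. It is a generic sliding-window illustration, not pVACseq code. The sequence is the longest example peptide from the list above, and the mutation position (index 4) is a hypothetical choice, since the text does not say which residue is mutated.

```python
def windows_containing(sequence, position, length):
    """Return all substrings of the given length that include `position`.

    `position` is a 0-based index into `sequence`.
    """
    first = max(0, position - length + 1)
    last = min(position, len(sequence) - length)
    return [sequence[start:start + length] for start in range(first, last + 1)]


sequence = "KLPEPCPSTTP"   # longest example peptide from the list above
mutation_index = 4         # hypothetical position of the mutated residue

# Registers: every 8-mer that still contains the mutated residue.
eight_mers = windows_containing(sequence, mutation_index, 8)

# Lengths and registers combined: candidates for epitope lengths 8 to 11.
candidates = {
    length: windows_containing(sequence, mutation_index, length)
    for length in range(8, 12)
}
total = sum(len(peptides) for peptides in candidates.values())
```

Under these assumptions `eight_mers` contains KLPEPCPS, LPEPCPST, PEPCPSTT, and EPCPSTTP, and lengths 8 through 11 yield 10 candidate peptides for this single variant, before HLA alleles and transcripts multiply the count further.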

The significance of choosing a single representative peptide can depend on your experimental or clinical aims. For example, if you are planning to use short peptide sequences exactly as they were assessed for binding affinity in pVACseq (e.g. specific 9-mers for in vitro experimental validation or perhaps a dendritic cell vaccine delivery approach) then the selection of a specific peptide from the possibilities caused by different lengths, registers, etc. is very important. In some cases you may wish to consider more criteria beyond which of these candidates has the best predicted binding affinity and gets chosen by the Top Score Filter.

On the other hand, if you plan to use synthetic long peptides (SLPs) or encode your candidates in a DNA vector, you will likely include flanking amino acids. The longer sequence will then often contain many of the short peptides that correspond to slightly different lengths or registers. In this scenario, pVACseq’s choice of a single candidate peptide by the Top Score Filter is less critical, because the other good candidates are likely to be included anyway.

One important exception to this is the rare case where the same variant leads to different peptides in different transcripts (due to different splice site usage). If multiple transcripts are expressed and lead to distinct peptides, you may want to include all of them in your final list of candidates. The top score filter supports this case, as described above. This assumes that you did not start with only a single transcript model for each gene (e.g. by using the --pick option in VEP) and, if you are requiring transcripts with TSL=1, that multiple qualifying transcripts lead to different peptide sequences at the site of the variant. This will be fairly rare. Even though most genes have alternative transcripts, these often differ only subtly in open reading frame and overall protein sequence, and only differences within the window that would influence a neoantigen candidate are consequential here.