Filtering Commands
pVACseq currently offers four filters: a binding filter, a coverage filter, a transcript filter, and a top score filter.
These filters are always run automatically as part of the pVACseq pipeline using default cutoffs.
All filters can also be run manually on the filtered.tsv file to narrow the results down further, or they can be run on the all_epitopes.tsv file to apply different filtering thresholds.
The binding filter is used to remove neoantigen candidates that do not meet desired peptide:MHC binding criteria. The coverage filter is used to remove variants that do not meet desired read count and VAF criteria (in normal DNA and tumor DNA/RNA). The transcript filter is used to remove variant annotations based on low quality transcript annotations. The top score filter is used to select the most promising peptide candidate for each variant. Multiple candidate peptides from a single somatic variant can be caused by multiple peptide lengths, registers, HLA alleles, and transcript annotations.
Further details on each of these filters are provided below.
Note
The default values for filtering thresholds are suggestions only. While they are based on review of the literature and consultation with our clinical and immunology colleagues, your specific use case will determine the appropriate values.
Binding Filter
usage: pvacseq binding_filter [-h] [-b BINDING_THRESHOLD]
[-p PERCENTILE_THRESHOLD]
[--percentile-threshold-strategy {conservative,exploratory}]
[-c MINIMUM_FOLD_CHANGE] [-m {lowest,median}]
[--exclude-NAs] [-a]
input_file output_file
Filter variants processed by IEDB by binding score.
positional arguments:
input_file The all_epitopes.tsv or filtered.tsv pVACseq report
file to filter.
output_file Output .tsv file containing list of filtered epitopes
based on binding affinity.
optional arguments:
-h, --help show this help message and exit
-b BINDING_THRESHOLD, --binding-threshold BINDING_THRESHOLD
Report only epitopes where the mutant allele has ic50
binding scores below this value. (default: 500)
-p PERCENTILE_THRESHOLD, --percentile-threshold PERCENTILE_THRESHOLD
Report only epitopes where the mutant allele has a
percentile rank below this value. (default: None)
--percentile-threshold-strategy {conservative,exploratory}
Specify the candidate inclusion strategy. The
'conservative' option requires a candidate to pass
BOTH the binding threshold and percentile threshold
(default). The 'exploratory' option requires a
candidate to pass EITHER the binding threshold or the
percentile threshold. (default: conservative)
-c MINIMUM_FOLD_CHANGE, --minimum-fold-change MINIMUM_FOLD_CHANGE
Minimum fold change between mutant binding score and
wild-type score. The default is 0, which filters no
results, but 1 is often a sensible option (requiring
that binding is better to the MT than WT). (default:
0)
-m {lowest,median}, --top-score-metric {lowest,median}
The ic50 scoring metric to use when filtering epitopes
by binding-threshold or minimum fold change. lowest:
Use the Best MT IC50 Score, Corresponding Fold Change,
and Best MT Percentile (i.e. use the lowest MT ic50
binding score, corresponding fold change of all chosen
prediction methods, and lowest MT percentile). median:
Use the Median MT IC50 Score, Median Fold Change, and
Median MT Percentile (i.e. use the median MT ic50
binding score, fold change, and MT percentile of all
chosen prediction methods). (default: median)
--exclude-NAs Exclude NA values from the filtered output. (default:
False)
-a, --allele-specific-binding-thresholds
Use allele-specific binding thresholds. To print the
allele-specific binding thresholds run `pvacseq
allele_specific_cutoffs`. If an allele does not have a
special threshold value, the `--binding-threshold`
value will be used. (default: False)
The binding filter removes variants that don’t pass the chosen binding threshold.
The user can choose whether to apply this filter to the lowest or the median binding
affinity score by setting the --top-score-metric flag. The lowest binding
affinity score is recorded in the Best MT IC50 Score column and represents the lowest
ic50 score of all prediction algorithms that were picked during the previous pVACseq run.
The median binding affinity score is recorded in the Median MT IC50 Score column and
corresponds to the median ic50 score of all prediction algorithms used to create the report.
By default, the binding filter runs on the median binding affinity.
An additional --top-score-metric2 flag allows the user to choose whether to use IC50 or
Percentile scores. By default, IC50 is used.
When the --allele-specific-binding-thresholds flag is set, binding cutoffs specific to each
prediction’s HLA allele are used instead of the value set via the --binding-threshold parameter.
For HLA alleles where no allele-specific binding threshold is available, the
--binding-threshold value is used as a fallback. The alleles that have an allele-specific
threshold, as well as the values of those thresholds, can be printed by executing
the pvacseq allele_specific_cutoffs command.
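The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration of the rule, not pVACseq's implementation: the function name, the allele name, and the cutoff value in `EXAMPLE_ALLELE_CUTOFFS` are made up, and the real values should be taken from `pvacseq allele_specific_cutoffs`.

```python
from typing import Mapping

# Made-up placeholder values for illustration only; the real cutoffs are
# printed by `pvacseq allele_specific_cutoffs`.
EXAMPLE_ALLELE_CUTOFFS = {"EXAMPLE-ALLELE-1": 250.0}


def effective_binding_threshold(
    allele: str,
    binding_threshold: float = 500.0,
    allele_specific: bool = False,
    cutoffs: Mapping[str, float] = EXAMPLE_ALLELE_CUTOFFS,
) -> float:
    """Return the IC50 cutoff to apply to a prediction for `allele`."""
    if allele_specific and allele in cutoffs:
        return cutoffs[allele]
    # Flag not set, or no allele-specific value: use --binding-threshold.
    return binding_threshold
```

With the flag off, every allele is compared against the single `--binding-threshold` value; with it on, only alleles missing from the cutoff table fall back to that value.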
The binding filter also offers the option to filter on Fold Change columns, which contain
the ratio of the WT score to the MT score. This option can be activated by setting the
--minimum-fold-change threshold (for example, a value of 1 requires that the mutant peptide
is a better binder than the corresponding wild type peptide). If the --top-score-metric option is set to lowest,
the Corresponding Fold Change column will be used (Corresponding WT IC50 Score/Best MT IC50 Score).
If the --top-score-metric option is set to median, the Median Fold Change column
will be used (Median WT IC50 Score/Median MT IC50 Score).
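As a rough illustration of the fold change check, the following Python sketch computes WT IC50 / MT IC50 and compares it to a minimum. The function names are hypothetical, and two details are assumptions rather than documented behavior: the comparison is treated as inclusive, and missing scores are treated as NA values that are kept unless `exclude_nas` is set (mirroring --exclude-NAs).

```python
from typing import Optional


def fold_change(wt_ic50: Optional[float], mt_ic50: Optional[float]) -> Optional[float]:
    """WT IC50 divided by MT IC50; None stands in for an NA value."""
    if wt_ic50 is None or mt_ic50 is None or mt_ic50 == 0:
        return None
    return wt_ic50 / mt_ic50


def passes_minimum_fold_change(
    fc: Optional[float],
    minimum_fold_change: float = 0.0,
    exclude_nas: bool = False,
) -> bool:
    """Keep entries whose fold change reaches the minimum (inclusive: assumption)."""
    if fc is None:
        # NA entries are kept by default and dropped with --exclude-NAs.
        return not exclude_nas
    return fc >= minimum_fold_change
```

For example, a WT IC50 of 1200 and an MT IC50 of 300 give a fold change of 4.0, meaning the mutant peptide is predicted to bind four times more strongly than the wild type.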
In addition to being able to filter on the IC50 score columns, the binding
filter also offers the ability to filter on the percentile score using the
--percentile-threshold parameter. When the --top-score-metric is set
to lowest, this threshold is applied to the Best MT Percentile column. When
it is set to median, the threshold is applied to the Median MT
Percentile column.
When the --percentile-threshold flag is set, the candidate inclusion strategy can be
specified by using the --percentile-threshold-strategy parameter. The parameter has two
options: conservative (default) and exploratory. The ‘conservative’ option requires a candidate
to pass BOTH the binding threshold and percentile threshold, while the ‘exploratory’ option requires
a candidate to pass EITHER the binding threshold or percentile threshold.
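The two strategies amount to a logical AND versus a logical OR. A minimal Python sketch, with a hypothetical function name and assuming strict "below" comparisons as worded in the help text:

```python
from typing import Optional


def passes_binding_filter(
    ic50: float,
    percentile: float,
    binding_threshold: float = 500.0,
    percentile_threshold: Optional[float] = None,
    strategy: str = "conservative",
) -> bool:
    """Combine the IC50 and percentile checks according to the strategy."""
    passes_ic50 = ic50 < binding_threshold
    if percentile_threshold is None:
        # No percentile threshold set: only the IC50 check applies.
        return passes_ic50
    passes_percentile = percentile < percentile_threshold
    if strategy == "conservative":
        return passes_ic50 and passes_percentile
    if strategy == "exploratory":
        return passes_ic50 or passes_percentile
    raise ValueError(f"unknown strategy: {strategy!r}")
```

A candidate with an IC50 of 300 and a percentile of 5.0 is rejected under a percentile threshold of 2.0 with the conservative strategy, but kept with the exploratory one.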
By default, entries with NA values will be included in the output. This
behavior can be turned off by using the --exclude-NAs flag.
Coverage Filter
usage: pvacseq coverage_filter [-h] [--normal-cov NORMAL_COV]
[--tdna-cov TDNA_COV] [--trna-cov TRNA_COV]
[--normal-vaf NORMAL_VAF] [--tdna-vaf TDNA_VAF]
[--trna-vaf TRNA_VAF] [--expn-val EXPN_VAL]
[--exclude-NAs]
input_file output_file
Filter variants processed by IEDB by coverage, vaf, and gene expression
positional arguments:
input_file The all_epitopes.tsv or filtered.tsv pVACseq report
file to filter.
output_file Output .tsv file containing list of filtered epitopes
based on coverage and expression values
optional arguments:
-h, --help show this help message and exit
--normal-cov NORMAL_COV
Normal Coverage Cutoff. Sites above this cutoff will
be considered. (default: 5)
--tdna-cov TDNA_COV Tumor DNA Coverage Cutoff. Sites above this cutoff
will be considered. (default: 10)
--trna-cov TRNA_COV Tumor RNA Coverage Cutoff. Sites above this cutoff
will be considered. (default: 10)
--normal-vaf NORMAL_VAF
Normal VAF Cutoff in decimal format. Sites BELOW this
cutoff in normal will be considered. (default: 0.02)
--tdna-vaf TDNA_VAF Tumor DNA VAF Cutoff in decimal format. Sites above
this cutoff will be considered. (default: 0.25)
--trna-vaf TRNA_VAF Tumor RNA VAF Cutoff in decimal format. Sites above
this cutoff will be considered. (default: 0.25)
--expn-val EXPN_VAL Gene and Transcript Expression cutoff. Sites above
this cutoff will be considered. (default: 1.0)
--exclude-NAs Exclude NA values from the filtered output. (default:
False)
If the input VCF contains readcount and/or expression annotations, then the coverage filter can be run again on the filtered.tsv report file to narrow down the results even further. You can also run this filter again on the all_epitopes.tsv report file to apply different cutoffs.
The general goals of these filters are to limit variants for neoepitope prediction to those with good read support and/or remove possible sub-clonal variants. In some cases the input VCF may have already been filtered in this fashion. This filter also allows for removal of variants that do not have sufficient evidence of RNA expression.
For details on how to prepare input VCFs that contain all of these annotations, refer to the Input File Preparation section.
By default, entries with NA values will be included in the output. This
behavior can be turned off by using the --exclude-NAs flag.
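To make the direction of each cutoff concrete, here is a hedged Python sketch of the checks using the default values from the help text. The dictionary keys are hypothetical field names chosen to mirror the flags (they are not the report's column names), and treating values exactly at a cutoff as passing is an assumption.

```python
from typing import Mapping, Optional

# Defaults copied from the help text above.
DEFAULT_CUTOFFS = {
    "normal_cov": 5, "tdna_cov": 10, "trna_cov": 10,
    "normal_vaf": 0.02, "tdna_vaf": 0.25, "trna_vaf": 0.25,
    "expn_val": 1.0,
}

# (hypothetical entry field, cutoff name, whether the cutoff is a minimum)
CHECKS = [
    ("normal_cov", "normal_cov", True),
    ("tdna_cov", "tdna_cov", True),
    ("trna_cov", "trna_cov", True),
    ("normal_vaf", "normal_vaf", False),  # the only "below the cutoff" check
    ("tdna_vaf", "tdna_vaf", True),
    ("trna_vaf", "trna_vaf", True),
    ("gene_expn", "expn_val", True),
    ("transcript_expn", "expn_val", True),
]


def passes_coverage_filter(
    entry: Mapping[str, Optional[float]],
    exclude_nas: bool = False,
    **overrides: float,
) -> bool:
    """Return True if every available value satisfies its cutoff."""
    cutoffs = {**DEFAULT_CUTOFFS, **overrides}
    for field, name, is_minimum in CHECKS:
        value = entry.get(field)
        if value is None:
            # NA values pass unless --exclude-NAs is requested.
            if exclude_nas:
                return False
            continue
        if is_minimum and value < cutoffs[name]:
            return False
        if not is_minimum and value > cutoffs[name]:
            return False
    return True
```

Note that the normal VAF check runs in the opposite direction from the others: a high VAF in the normal sample suggests a germline variant or contamination, so such sites are removed.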
Transcript Filter
usage: pvacseq transcript_filter [-h]
[--transcript-prioritization-strategy TRANSCRIPT_PRIORITIZATION_STRATEGY]
[--maximum-transcript-support-level {1,2,3,4,5}]
input_file output_file
Filter variant transcripts processed by IEDB.
positional arguments:
input_file The all_epitopes.tsv or filtered.tsv report file to
filter.
output_file Output .tsv file containing list of filtered epitopes
based on the variant transcript.
optional arguments:
-h, --help show this help message and exit
--transcript-prioritization-strategy TRANSCRIPT_PRIORITIZATION_STRATEGY
Specify the criteria to consider when filtering
transcripts of the neoantigen candidates. 'canonical'
will select candidates resulting from variants on a
Ensembl canonical transcript. 'mane_select' will
select candidates resulting from variants on a MANE
select transcript. 'tsl' will select candidates where
the transcript support level (TSL) matches the
--maximum-transcript-support-level cutoff. When
selecting more than one criteria, a transcript meeting
EITHER of the selected criteria will be selected.
(default: ['canonical', 'mane_select', 'tsl'])
--maximum-transcript-support-level {1,2,3,4,5}
The threshold to use for filtering epitopes on the
Ensembl transcript support level (TSL). Keep all
epitopes with a transcript support level <= to this
cutoff. (default: 1)
This filter is used to eliminate variant annotations based on poorly-supported transcripts. This is assessed
based on whether the transcript is the MANE Select transcript, whether it is
the canonical transcript, or whether the transcript support level (TSL) meets the
--maximum-transcript-support-level cutoff. The
--transcript-prioritization-strategy parameter controls which of these three
criteria are considered. A neoantigen candidate passes this filter if its
transcript passes at least one of the specified criteria.
Transcripts with a TSL of Not Supported will pass the TSL criterion. These values occur if VEP was run
without the --tsl flag or if data is aligned to GRCh37 or older.
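The "at least one criterion" rule can be sketched as follows. The function and its arguments are hypothetical and only illustrate the logic described above; they are not part of the pVACseq API.

```python
from typing import Sequence, Union


def passes_transcript_filter(
    canonical: bool,
    mane_select: bool,
    tsl: Union[int, str],
    strategy: Sequence[str] = ("canonical", "mane_select", "tsl"),
    maximum_tsl: int = 1,
) -> bool:
    """A transcript passes if it meets ANY of the selected criteria."""
    if tsl == "Not Supported":
        # Not Supported passes the TSL criterion, as described above.
        tsl_ok = True
    else:
        tsl_ok = isinstance(tsl, int) and tsl <= maximum_tsl
    checks = {"canonical": canonical, "mane_select": mane_select, "tsl": tsl_ok}
    return any(checks[criterion] for criterion in strategy)
```

For example, a canonical transcript with a TSL of 5 passes under the default strategy, but fails if the strategy is restricted to `tsl` alone.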
Top Score Filter
usage: pvacseq top_score_filter [-h] [-m {lowest,median}]
[-m2 {ic50,percentile}]
[--transcript-prioritization-strategy TRANSCRIPT_PRIORITIZATION_STRATEGY]
[--maximum-transcript-support-level {1,2,3,4,5}]
[-b BINDING_THRESHOLD]
[--allele-specific-binding-thresholds]
[--allele-specific-anchors]
[--anchor-contribution-threshold ANCHOR_CONTRIBUTION_THRESHOLD]
input_file output_file
Pick the best neoepitope for each variant
positional arguments:
input_file The final report .tsv file to filter.
output_file Output .tsv file containing only the list of the top
epitope per variant.
optional arguments:
-h, --help show this help message and exit
-m {lowest,median}, --top-score-metric {lowest,median}
The ic50 scoring metric to use for filtering. lowest:
Use the best MT Score (i.e. the lowest MT ic50 binding
score of all chosen prediction methods). median: Use
the median MT Score (i.e. the median MT ic50 binding
score of all chosen prediction methods). (default:
median)
-m2 {ic50,percentile}, --top-score-metric2 {ic50,percentile}
Whether to use median/best IC50 or to use median/best
percentile score when determining the top scoring
peptide. This parameter is also used to influence the
primary sorting criteria for the variants in the
output report. (default: ic50)
--transcript-prioritization-strategy TRANSCRIPT_PRIORITIZATION_STRATEGY
Specify the criteria to consider when filtering
transcripts of the neoantigen candidates. 'canonical'
will select candidates resulting from variants on a
Ensembl canonical transcript. 'mane_select' will
select candidates resulting from variants on a MANE
select transcript. 'tsl' will select candidates where
the transcript support level (TSL) matches the
--maximum-transcript-support-level cutoff. When
selecting more than one criteria, a transcript meeting
EITHER of the selected criteria will be selected.
(default: ['canonical', 'mane_select', 'tsl'])
--maximum-transcript-support-level {1,2,3,4,5}
When determining the top peptide, only consider those
entries that meet this threshold for the Ensembl
transcript support level (TSL). Transcript support
level needs to be <= this cutoff to be considered.
(default: 1)
-b BINDING_THRESHOLD, --binding-threshold BINDING_THRESHOLD
When determining the top peptide, only peptides
passing the anchor criteria are considered. This
criteria is failed if all mutated amino acids of a
peptide (Pos) are at an anchor position and the WT
peptide has good binding (IC50 WT <
binding_threshold). (default: 500)
--allele-specific-binding-thresholds
When determining the top peptide and evaluating the
anchor criteria, use allele-specific binding
thresholds. If an allele does not have a special
threshold value, the `--binding-threshold` value will
be used. (default: False)
--allele-specific-anchors
When determining the top peptide and evaluating the
anchor criteria, use allele-specific anchor positions.
This option is available for 8, 9, 10, and 11mers and
only for HLA-A, B, and C alleles. If this option is
not enabled or as a fallback for unsupported lengths
and alleles, the default positions of 1, 2, epitope
length - 1, and epitope length are used. Please see
https://doi.org/10.1101/2020.12.08.416271 for more
details. (default: False)
--anchor-contribution-threshold ANCHOR_CONTRIBUTION_THRESHOLD
For determining the top peptide and evaluating the
anchor criteria using allele-specific anchors, each
position is assigned a score based on how binding is
influenced by mutations. From these scores, the
relative contribution of each position to the overall
binding is calculated. Starting with the highest
relative contribution, positions whose scores together
account for the selected contribution threshold are
assigned as anchor locations. As a result, a higher
threshold leads to the inclusion of more positions to
be considered anchors. (default: 0.8)
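One way to read the --anchor-contribution-threshold description above is as a cumulative cutoff over positions ranked by their contribution. The Python sketch below follows that reading; it is an interpretation of the help text, not the tool's code, and the function name and example scores are made up.

```python
from typing import List, Sequence


def anchor_positions(scores: Sequence[float], contribution_threshold: float = 0.8) -> List[int]:
    """Return 1-based positions that together reach the contribution threshold."""
    total = sum(scores)
    if total <= 0:
        return []
    # Rank positions from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    anchors: List[int] = []
    cumulative = 0.0
    for index in ranked:
        if cumulative >= contribution_threshold * total:
            break
        anchors.append(index + 1)
        cumulative += scores[index]
    return sorted(anchors)
```

With made-up per-position scores of `[1, 5, 0, 0, 0, 0, 0, 0, 4]` for a 9-mer, a threshold of 0.8 selects positions 2 and 9, while raising it to 0.95 also pulls in position 1, which matches the statement that a higher threshold yields more anchor positions.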
This filter picks the top epitope for a variant. Epitopes with the same Chromosome - Start - Stop - Reference - Variant are identified as coming from the same variant.
In order to account for different splice sites among the transcripts of a variant that would lead to different peptides, this filter also takes into account the different transcripts returned by VEP and bins the ones resulting in the same set of epitopes together into a transcript set. For each transcript set the filter will return the top epitope similar to how the Best Peptide is determined in the aggregated report:
Pick all entries with a variant transcript that have a protein_coding Biotype.
Of the remaining entries, pick the entries that pass at least one of the transcript criteria selected in the --transcript-prioritization-strategy, taking into consideration the --maximum-transcript-support-level if tsl is one of the selected criteria.
Of the remaining entries, pick the entries with no Problematic Positions.
Of the remaining entries, pick the ones passing the Anchor Criteria (see details below).
Of the remaining entries, pick the one with the lowest median/best MT IC50 score, lowest Transcript Support Level, and longest transcript.
By default the --top-score-metric option is set to median which will apply this
filter to the Median MT IC50 Score column. If the --top-score-metric
option is set to lowest, the Best MT IC50 Score column is used
instead.
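The selection steps for one transcript set can be sketched as a chain of filters followed by a sort. Everything in this Python example is hypothetical: the dictionary keys are stand-ins for the report columns, and the fallback in `_keep` (retaining the previous set when a step would remove every entry) is an assumption that the text above does not state.

```python
from typing import Any, Callable, Dict, List

Entry = Dict[str, Any]


def _keep(entries: List[Entry], predicate: Callable[[Entry], bool]) -> List[Entry]:
    """Apply one selection step; keep the previous set if nothing passes (assumption)."""
    kept = [entry for entry in entries if predicate(entry)]
    return kept or entries


def top_entry(entries: List[Entry]) -> Entry:
    """Pick the top entry of a transcript set following the steps above."""
    if not entries:
        raise ValueError("no entries to choose from")
    remaining = _keep(entries, lambda e: e["biotype"] == "protein_coding")
    remaining = _keep(remaining, lambda e: e["passes_transcript_criteria"])
    remaining = _keep(remaining, lambda e: not e["problematic_positions"])
    remaining = _keep(remaining, lambda e: e["passes_anchor_criteria"])
    # Lowest MT IC50 first, then lowest TSL, then longest transcript.
    return min(remaining, key=lambda e: (e["mt_ic50"], e["tsl"], -e["transcript_length"]))
```

Here `mt_ic50` would hold either the median or the best MT IC50 score, depending on --top-score-metric.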
Anchor Criteria
An entry fails the anchor criteria if all of its mutated amino acids (Mutation
Position) are at an anchor position and the WT peptide has good binding
(Best/Median WT IC50 Score < binding_threshold).
When the --allele-specific-binding-thresholds flag is set, binding cutoffs specific to each
prediction’s HLA allele are used instead of the value set via the --binding-threshold parameter.
For HLA alleles where no allele-specific binding threshold is available, the
--binding-threshold value is used as a fallback. The alleles that have an allele-specific
threshold, as well as the values of those thresholds, can be printed by executing
the pvacseq allele_specific_cutoffs command.
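The anchor criteria can be illustrated with a short Python sketch. It uses the default anchor positions named in the help text (1, 2, epitope length - 1, and epitope length); the function names are hypothetical, and treating a missing WT score as "not a good binder" is an assumption.

```python
from typing import Optional, Sequence, Set


def default_anchor_positions(epitope_length: int) -> Set[int]:
    """Default anchors: positions 1, 2, length - 1, and length."""
    return {1, 2, epitope_length - 1, epitope_length}


def passes_anchor_criteria(
    mutation_positions: Sequence[int],
    epitope_length: int,
    wt_ic50: Optional[float],
    binding_threshold: float = 500.0,
    anchors: Optional[Set[int]] = None,
) -> bool:
    """Fail only if ALL mutated positions are anchors AND the WT peptide binds well."""
    if anchors is None:
        anchors = default_anchor_positions(epitope_length)
    all_at_anchor = bool(mutation_positions) and all(p in anchors for p in mutation_positions)
    wt_binds_well = wt_ic50 is not None and wt_ic50 < binding_threshold
    return not (all_at_anchor and wt_binds_well)
```

For a 9-mer mutated only at position 2, the entry fails when the WT IC50 is 100 nM but passes when the WT IC50 is 5000 nM, because in the second case the WT peptide is not predicted to bind well.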
Additional Considerations
It is important to note that there are several reasons why a particular variant can lead to multiple peptides with different predicted binding affinities. The following can result in multiple peptides and/or binding predictions for a single variant:
Different epitope lengths: specifying multiple epitope lengths results in similar but non-identical epitope sequences for each variant (e.g. KLPEPCPS, KLPEPCPST, KLPEPCPSTT, KLPEPCPSTTP).
Different registers: pVACseq will test epitopes where the mutation is in every position (e.g. EPCPSTTP, PEPCPSTT, LPEPCPST, KLPEPCPS, …).
Different transcripts: in some cases the peptide sequence surrounding a variant will depend on the reference transcript sequence, particularly if there are alternative splice sites near the variant position.
Different HLA alleles: the HLA allele that produces the best predicted binding affinity is chosen.
A homozygous somatic variant with heterozygous proximal variants nearby may produce multiple different peptides.
The significance of choosing a single representative peptide can depend on your experimental or clinical aims. For example, if you are planning to use short peptide sequences exactly as they were assessed for binding affinity in pVACseq (e.g. specific 9-mers for in vitro experimental validation or perhaps a dendritic cell vaccine delivery approach) then the selection of a specific peptide from the possibilities caused by different lengths, registers, etc. is very important. In some cases you may wish to consider more criteria beyond which of these candidates has the best predicted binding affinity and gets chosen by the Top Score Filter.
On the other hand, if you plan to use synthetic long peptides (SLPs) or encode your candidates in a DNA vector, you will likely include flanking amino acids. The longer sequence then often contains many of the short peptides that differ only slightly in length or register. In this scenario, pVACseq’s choice of a single candidate peptide by the Top Score Filter is less critical, because the other good candidates are likely to be included anyway.
One important exception to this is the rare case where the same variant leads to different peptides in different transcripts (due to different splice site usage).
If multiple transcripts are expressed and
lead to distinct peptides, you may want to include both in your final list of candidates.
The top score filter supports this case, as described above.
This assumes you did not start with only a single transcript
model for each gene (e.g. using the --pick option in VEP) and that, if you require transcripts with TSL=1,
multiple qualifying transcripts lead to different peptide sequences at the site of the variant. This will be fairly rare.
Even though most genes have alternative transcripts, they often have only subtle differences in open reading frame and overall
protein sequence, and only differences within the window that would influence a neoantigen candidate are consequential here.
Aggregate Report Filter
usage: pvacseq aggregate_report_filter [-h] [--include-tiers INCLUDE_TIERS]
input_file output_file
input_metrics_file output_metrics_file
Filter an aggregate report and its metrics.json file based on the variant
Tier.
positional arguments:
input_file The aggregated.tsv report file to filter.
output_file Output aggregated.tsv file containing list of filtered
aggregate report entries based on the selected variant
Tier.
input_metrics_file The metrics.json file accompanying the input
aggregated report file.
output_metrics_file Filtered metrics.json file only retaining those
entries from variants in the selected variant Tier.
optional arguments:
-h, --help show this help message and exit
--include-tiers INCLUDE_TIERS
Specify a comma-separated list of tiers for which to
retain aggregate report and metrics.json variant data.
(default: ['Pass'])
This command filters the aggregate report and its corresponding metrics.json
file to only those variants matching the specified --include-tiers (default:
Pass).
This filter may be useful for high tumor mutation burden cases when the metrics.json file size exceeds the pVACview upload file size limit or slows down the pVACview application.
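Conceptually, the command keeps the report rows in the selected tiers and drops the metrics of all other variants. The Python sketch below illustrates that idea only; the `ID` and `Tier` keys and the assumption that the metrics data is a mapping keyed by variant ID are hypothetical and may not match the actual file layouts.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def filter_aggregate(
    rows: List[Row],
    metrics: Dict[str, Any],
    include_tiers: Sequence[str] = ("Pass",),
) -> Tuple[List[Row], Dict[str, Any]]:
    """Keep rows in the selected tiers and only the metrics for those variants."""
    kept_rows = [row for row in rows if row["Tier"] in include_tiers]
    kept_ids = {row["ID"] for row in kept_rows}
    kept_metrics = {key: value for key, value in metrics.items() if key in kept_ids}
    return kept_rows, kept_metrics
```

Because the metrics are filtered together with the report, the two outputs stay consistent with each other.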