Optional Downstream Analysis Tools
Generate Protein Fasta
usage: pvacsplice generate_protein_fasta [-h] [--input-tsv INPUT_TSV]
[--pass-only] [--biotypes BIOTYPES]
[-j JUNCTION_SCORE]
[-v VARIANT_DISTANCE]
[--anchor-types [{A,D,NDA,DA,N} [{A,D,NDA,DA,N} ...]]]
[--aggregate-report-evaluation AGGREGATE_REPORT_EVALUATION]
[-s SAMPLE_NAME]
input_file flanking_sequence_length
output_file annotated_vcf ref_fasta
gtf_file
Generate an annotated fasta file from a RegTools junctions output TSV file
with protein sequences of mutations
positional arguments:
input_file RegTools junctions output TSV file
flanking_sequence_length
Number of amino acids to add on each side of the
splice site when creating the FASTA.
output_file The output fasta file.
annotated_vcf A VEP-annotated single- or multi-sample VCF containing
genotype and transcript information. The VCF may be
gzipped (requires tabix index).
ref_fasta A reference FASTA file. Note: this input should be the
same as the RegTools fasta input.
gtf_file A reference GTF file. Note: this input should be the
same as the RegTools gtf input.
optional arguments:
-h, --help show this help message and exit
--input-tsv INPUT_TSV
A pVACsplice all_epitopes, filtered, or aggregated TSV
file with epitopes to use for subsetting the inputs to
peptides of interest. Only the peptide sequences for
the epitopes in the TSV will be used when creating the
FASTA. (default: None)
--pass-only Only process VCF entries with a PASS status. (default:
False)
--biotypes BIOTYPES A list of biotypes to use for pre-filtering
transcripts for processing in the pipeline. (default:
['protein_coding'])
-j JUNCTION_SCORE, --junction-score JUNCTION_SCORE
Junction Coverage Cutoff. Only sites above this read
depth cutoff will be considered. (default: 10)
-v VARIANT_DISTANCE, --variant-distance VARIANT_DISTANCE
Regulatory variants can lie inside or outside of a
splicing junction. Maximum distance window (upstream
and downstream) for a variant outside the junction.
(default: 100)
--anchor-types [{A,D,NDA,DA,N} [{A,D,NDA,DA,N} ...]]
The anchor types of junctions to use. Multiple anchors
can be specified using a comma-separated list. Choices:
A, D, NDA, DA, N (default: ['A', 'D', 'NDA'])
--aggregate-report-evaluation AGGREGATE_REPORT_EVALUATION
When running with an aggregate report input TSV, only
include variants with this evaluation. Valid values
for this field are Accept, Reject, Pending, and
Review. Specify multiple values as a comma-separated
list to include multiple evaluation states. (default:
Accept)
-s SAMPLE_NAME, --sample-name SAMPLE_NAME
The name of the sample being processed. Required when
processing a multi-sample VCF and must be a sample ID
in the input VCF #CHROM header line. (default: None)
This tool will extract protein sequences surrounding splice sites predicted by RegTools. One use case for this tool is to help select long peptides that contain short neoepitope candidates. For example, if pVACsplice was run to predict nonamers (9-mers) that are good binders, the user may wish to select long peptide (e.g. 24-mer) sequences that contain the nonamer for synthesis or encoding in a DNA vector. The splice site junction will be centered in the protein sequence returned (if possible).
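The idea of centering the junction with a fixed number of flanking amino acids can be sketched in Python as follows. This is only an illustration under stated assumptions, not pVACsplice's implementation: the function name `centered_window`, the 0-based `junction_pos` convention, and the toy protein string are all hypothetical. When the junction lies too close to either end of the protein, the window is clipped and the junction is no longer centered, which is one plausible reading of "if possible".

```python
def centered_window(protein: str, junction_pos: int, flank: int) -> str:
    """Return up to `flank` residues on each side of a junction.

    Assumption for this sketch: `junction_pos` is the 0-based index of the
    first residue downstream of the junction.
    """
    if flank < 0:
        raise ValueError("flank must be non-negative")
    if not 0 <= junction_pos <= len(protein):
        raise ValueError("junction_pos is outside the protein")
    # Clip the window at the protein ends instead of padding.
    start = max(0, junction_pos - flank)
    end = min(len(protein), junction_pos + flank)
    return protein[start:end]


# Toy 20-residue sequence (hypothetical, not a real protein).
protein = "ACDEFGHIKLMNPQRSTVWY"

# Junction in the middle: 4 residues on each side, junction centered.
window = centered_window(protein, junction_pos=10, flank=4)

# Junction near the N-terminus: only 2 upstream residues are available.
clipped = centered_window(protein, junction_pos=2, flank=4)

# A short candidate spanning the junction is contained in the long peptide.
contains_candidate = "KLMN" in window
```

With a flanking length of 12, the same approach would yield a 24-mer whenever enough sequence is available on both sides.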
The output may be limited to PASS variants only by setting the --pass-only flag. Additionally, variants can be limited to specific transcript biotypes using the --biotypes parameter, which is set to only include protein_coding transcripts by default.
The output can be further limited to only certain variants by providing a pVACsplice report file to the --input-tsv argument. Only the peptide sequences for the epitopes in the TSV will be used when creating the FASTA. If this argument is an aggregated TSV file, use the --aggregate-report-evaluation parameter to only include peptide sequences for epitopes matching the chosen Evaluation(s).
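As a rough sketch of how the options described above combine on the command line, the following Python helper assembles an argument list for `pvacsplice generate_protein_fasta`. The flags and positional order are taken from the usage text above; the helper name `build_generate_protein_fasta_argv`, all file names, and the sample name `TUMOR` are hypothetical placeholders. The helper only builds the command and does not run it.

```python
import shlex


def build_generate_protein_fasta_argv(
    input_file, flanking_sequence_length, output_file,
    annotated_vcf, ref_fasta, gtf_file, *,
    input_tsv=None, pass_only=False, evaluations=None, sample_name=None,
):
    """Assemble an argv list following the documented usage line."""
    argv = ["pvacsplice", "generate_protein_fasta"]
    if input_tsv is not None:
        argv += ["--input-tsv", input_tsv]
    if pass_only:
        argv.append("--pass-only")
    if evaluations:
        # Multiple evaluation states are given as a comma-separated list.
        argv += ["--aggregate-report-evaluation", ",".join(evaluations)]
    if sample_name is not None:
        argv += ["-s", sample_name]
    # Positional arguments, in the order shown in the usage text.
    argv += [input_file, str(flanking_sequence_length), output_file,
             annotated_vcf, ref_fasta, gtf_file]
    return argv


# All paths and the sample name below are placeholders.
argv = build_generate_protein_fasta_argv(
    "junctions.tsv", 12, "peptides.fa",
    "annotated.vcf.gz", "ref.fa", "genes.gtf",
    input_tsv="sample.aggregated.tsv",
    pass_only=True,
    evaluations=["Accept", "Review"],
    sample_name="TUMOR",
)
command = shlex.join(argv)
print(command)
```

The resulting list could be passed to `subprocess.run(argv, check=True)` on a machine where pVACsplice is installed.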
Generate Aggregated Report
usage: pvacsplice generate_aggregated_report [-h]
[--tumor-purity TUMOR_PURITY]
[-b BINDING_THRESHOLD]
[--allele-specific-binding-thresholds]
[--percentile-threshold PERCENTILE_THRESHOLD]
[--aggregate-inclusion-binding-threshold AGGREGATE_INCLUSION_BINDING_THRESHOLD]
[-m {lowest,median}]
[--trna-vaf TRNA_VAF]
[--trna-cov TRNA_COV]
[--expn-val EXPN_VAL]
[--maximum-transcript-support-level {1,2,3,4,5}]
input_file output_file
Generate an aggregated report from a pVACsplice .all_epitopes.tsv report file.
positional arguments:
input_file A pVACsplice .all_epitopes.tsv report file
output_file The file path to write the aggregated report tsv to
optional arguments:
-h, --help show this help message and exit
--tumor-purity TUMOR_PURITY
Value between 0 and 1 indicating the fraction of tumor
cells in the tumor sample. Information is used during
aggregate report creation for a simple estimation of
whether variants are subclonal or clonal based on VAF.
If not provided, purity is estimated directly from the
VAFs. (default: None)
-b BINDING_THRESHOLD, --binding-threshold BINDING_THRESHOLD
Tier epitopes in the "Pass" tier when the mutant
allele has ic50 binding scores below this value.
(default: 500)
--allele-specific-binding-thresholds
Use allele-specific binding thresholds. To print the
allele-specific binding thresholds run `pvacseq
allele_specific_cutoffs`. If an allele does not have a
special threshold value, the `--binding-threshold`
value will be used. (default: False)
--percentile-threshold PERCENTILE_THRESHOLD
When set, tier epitopes in the "Pass" tier when the
mutant allele has percentile scores below this value
and in the "Relaxed" tier when the mutant allele has
percentile scores below double this value. (default:
None)
--aggregate-inclusion-binding-threshold AGGREGATE_INCLUSION_BINDING_THRESHOLD
Threshold for including epitopes when creating the
aggregate report (default: 5000)
-m {lowest,median}, --top-score-metric {lowest,median}
The ic50 scoring metric to use when filtering epitopes
by binding-threshold or minimum fold change. lowest:
Use the best MT Score and Corresponding Fold Change
(i.e. the lowest MT ic50 binding score and
corresponding fold change of all chosen prediction
methods). median: Use the median MT Score and Median
Fold Change (i.e. the median MT ic50 binding score and
fold change of all chosen prediction methods).
(default: median)
--trna-vaf TRNA_VAF Tumor RNA VAF Cutoff. Used to calculate the allele
expression cutoff for tiering. (default: 0.25)
--trna-cov TRNA_COV Tumor RNA Coverage Cutoff. Used as a cutoff for
tiering. (default: 10)
--expn-val EXPN_VAL Gene and Expression cutoff. Used to calculate the
allele expression cutoff for tiering. (default: 1.0)
--maximum-transcript-support-level {1,2,3,4,5}
The threshold to use for filtering epitopes on the
Ensembl transcript support level (TSL). Transcript
support level needs to be <= this cutoff to be
included in most tiers. (default: 1)
This tool produces an aggregated version of the all_epitopes TSV. It finds the best-scoring epitope for each splice site variant, and outputs additional binding affinity, expression, and coverage information for that epitope. It also gives information about the total number of well-scoring epitopes for each variant, the number of transcripts covered by those epitopes, as well as the HLA alleles that those epitopes are well-binding to. Lastly, the report will bin variants into tiers that offer suggestions as to the suitability of variants for use in vaccines. For a full definition of these tiers, see the pVACsplice output file documentation.
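To make the difference between the two `--top-score-metric` choices concrete, here is a minimal Python sketch. It only illustrates the "lowest" versus "median" summary of mutant (MT) ic50 scores described in the help text and the "below this value" comparison against `--binding-threshold`. The function names, the prediction method names, and the scores are hypothetical, and the actual tiering in pVACsplice takes further criteria into account.

```python
from statistics import median


def top_mt_score(ic50_by_method: dict, metric: str = "median") -> float:
    """Summarize MT ic50 scores across prediction methods.

    "lowest" picks the best (smallest) ic50; "median" takes the median.
    """
    if not ic50_by_method:
        raise ValueError("at least one prediction is required")
    scores = list(ic50_by_method.values())
    if metric == "lowest":
        return min(scores)
    if metric == "median":
        return median(scores)
    raise ValueError(f"unknown metric: {metric!r}")


def below_binding_threshold(score: float, binding_threshold: float = 500.0) -> bool:
    """True when the ic50 score is below the threshold (default 500)."""
    return score < binding_threshold


# Hypothetical predictions for one epitope from three methods.
predictions = {"MethodA": 120.0, "MethodB": 650.0, "MethodC": 900.0}

lowest = top_mt_score(predictions, "lowest")
med = top_mt_score(predictions, "median")

# The same epitope can pass under one metric and fail under the other.
passes_lowest = below_binding_threshold(lowest)
passes_median = below_binding_threshold(med)
```

In this made-up case the epitope is below the default threshold under `lowest` but not under `median`, which is why the choice of metric matters when comparing reports.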
Calculate Reference Proteome Similarity
usage: pvacsplice calculate_reference_proteome_similarity [-h]
[--match-length MATCH_LENGTH]
[--species SPECIES]
[--blastp-path BLASTP_PATH]
[--blastp-db {refseq_select_prot,refseq_protein}]
[--peptide-fasta PEPTIDE_FASTA]
[-t N_THREADS]
input_file
input_fasta
output_file
Identify which epitopes in a pVACseq|pVACfuse|pVACbind report file have
matches in the reference proteome using either BLASTp or by checking directly
against a reference proteome FASTA.
positional arguments:
input_file Input filtered, all_epitopes, or aggregated report
file with predicted epitopes.
input_fasta For pVACbind, the original input FASTA file. For
pVACseq, pVACfuse, and pVACsplice a FASTA file with
mutant peptide sequences for each variant isoform.
This file can be found in the same directory as the
input filtered.tsv/all_epitopes.tsv file. Can also be
generated by running `pvacseq|pvacfuse|pvacsplice
generate_protein_fasta`.
output_file Output TSV filename of report file with epitopes with
reference matches marked.
optional arguments:
-h, --help show this help message and exit
--match-length MATCH_LENGTH
The minimum number of consecutive amino acids that
need to match. (default: 8)
--species SPECIES The species of the data in the input file. (default:
human)
--blastp-path BLASTP_PATH
Blastp installation path. (default: None)
--blastp-db {refseq_select_prot,refseq_protein}
The blastp database to use. (default:
refseq_select_prot)
--peptide-fasta PEPTIDE_FASTA
A reference peptide FASTA file to use for finding
reference matches instead of blastp. (default: None)
-t N_THREADS, --n-threads N_THREADS
Number of threads to use for parallelizing BLAST
calls. (default: 1)
This tool will find matches of the epitope candidates in the reference proteome and return the results in an output TSV & reference_match file pair. It requires the input of a pVACsplice run’s fasta file in order to look up the larger peptide sequence the epitope was derived from. Any substring of that peptide sequence that matches against the reference proteome and is at least as long as the specified match length will be considered a hit. This tool also requires the user to provide a filtered.tsv, all_epitopes.tsv or aggregated.tsv pVACsplice report file as an input, and any candidates in this input file will be searched for.
This tool may be run with BLASTp using either the refseq_select_prot or refseq_protein database. By default, BLASTp is run through the BLAST API, but users may install BLASTp independently. Alternatively, users may provide a reference proteome fasta file and this tool will string match on the entries of this fasta file directly. This approach is recommended because it is significantly faster than BLASTp. Reference proteome fasta files may be downloaded from Ensembl. For example, the latest reference proteome fasta for human can be downloaded from this link.
For more details on the generated reference_match file, see the pVACsplice output file documentation.
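The direct string-matching approach described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the tool's own code: `parse_fasta`, `reference_hits`, the toy reference entry `REF1`, and the peptide are all hypothetical. The sketch relies on the fact that any matching substring at least `match_length` long must contain a matching substring of exactly `match_length`, so checking fixed-length windows is enough to detect a hit.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {record_name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def reference_hits(peptide: str, reference: dict, match_length: int = 8) -> list:
    """List (substring, record_name) pairs found in the reference."""
    hits = []
    # Slide a window of exactly match_length over the peptide.
    for i in range(len(peptide) - match_length + 1):
        kmer = peptide[i:i + match_length]
        for name, seq in reference.items():
            if kmer in seq:
                hits.append((kmer, name))
    return hits


# Toy reference proteome with a single made-up entry.
fasta_text = ">REF1 example\nMKTAYIAKQRQ\nISFVKSHFSRQ\n"
reference = parse_fasta(fasta_text)

# Made-up peptide sharing 8 consecutive residues with REF1.
hits = reference_hits("GGAYIAKQRQWW", reference, match_length=8)

# A peptide shorter than match_length cannot produce a hit.
no_hits = reference_hits("AYIAKQ", reference, match_length=8)
```

For a real proteome, indexing the reference by fixed-length substrings in a set would avoid rescanning every sequence for each window.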
NetChop Predict Cleavage Sites
usage: pvacsplice net_chop [-h] [--method {cterm,20s}] [--threshold THRESHOLD]
input_file input_fasta output_file
Predict cleavage sites for neoepitopes.
positional arguments:
input_file Input filtered file with predicted epitopes.
input_fasta The required fasta file.
output_file Output tsv filename for putative neoepitopes.
optional arguments:
-h, --help show this help message and exit
--method {cterm,20s} NetChop prediction method to use ("cterm" for C term
3.0, "20s" for 20S 3.0). (default: cterm)
--threshold THRESHOLD
NetChop prediction threshold. (default: 0.5)
This tool uses NetChop to predict cleavage sites for neoepitopes from a pVACsplice run’s filtered/all_epitopes TSV. Its output adds three columns to the TSV: Best Cleavage Position, Best Cleavage Score, and a Cleavage Sites list. Typically this step is done in the pVACsplice run pipeline for the filtered output TSV when specified. This tool provides a way to run it manually on pVACsplice’s generated filtered/all_epitopes TSV files so that you can add this information when it is not present.
You can view more information about these columns for pVACsplice in the output file documentation.
NetMHCStab Predict Stability
usage: pvacsplice netmhc_stab [-h] [-m {lowest,median}] input_file output_file
Add stability predictions to predicted neoepitopes.
positional arguments:
input_file Input filtered file with predicted epitopes.
output_file Output TSV filename for putative neoepitopes.
optional arguments:
-h, --help show this help message and exit
-m {lowest,median}, --top-score-metric {lowest,median}
The ic50 scoring metric to use when sorting epitopes.
lowest: Use the best MT Score and Corresponding Fold
Change (i.e. the lowest MT ic50 binding score and
corresponding fold change of all chosen prediction
methods). median: Use the median MT Score and Median
Fold Change (i.e. the median MT ic50 binding score and
fold change of all chosen prediction methods).
(default: median)
This tool uses NetMHCstabpan to add stability predictions for neoepitopes from a pVACsplice run’s filtered/all_epitopes TSV. Its output adds four columns to the TSV: Predicted Stability, Half Life, Stability Rank, and NetMHCStab Allele. Typically this step is done in the pVACsplice run pipeline for the filtered output TSV when specified. This tool provides a way to run it manually on pVACsplice’s generated filtered/all_epitopes TSV files so that you can add this information when it is not present.
You can view more information about these columns for pVACsplice in the output file documentation.
Identify Problematic Amino Acids
usage: pvacsplice identify_problematic_amino_acids [-h]
[--filter-type {soft,hard}]
input_file output_file
problematic_amino_acids
Mark problematic amino acid positions in each epitope or filter entries that have problematic amino acids.
positional arguments:
input_file Input filtered or all_epitopes file with predicted epitopes.
output_file Output .tsv file with identification of problematic amino acids or hard-filtered to remove epitopes with problematic amino acids.
problematic_amino_acids
A list of amino acids to consider as problematic. Each entry can be specified in the following format:
`amino_acid(s)`: One or more one-letter amino acid codes. Any occurrence of this amino acid string,
regardless of the position in the epitope, is problematic. When specifying more than
one amino acid, they will need to occur together in the specified order.
`amino_acid:position`: A one letter amino acid code, followed by a colon separator, followed by a positive
integer position (one-based). The occurrence of this amino acid at the position
specified is problematic. E.g., G:2 would check for a Glycine at the second position
of the epitope. The N-terminus is defined as position 1.
`amino_acid:-position`: A one letter amino acid code, followed by a colon separator, followed by a negative
integer position. The occurrence of this amino acid at the specified position from
the end of the epitope is problematic. E.g., G:-3 would check for a Glycine at the
third position from the end of the epitope. The C-terminus is defined as position -1.
optional arguments:
-h, --help show this help message and exit
--filter-type {soft,hard}, -f {soft,hard}
Set the type of filtering done. Choosing `soft` will add a new column "Problematic Positions" that lists positions in the epitope with problematic amino acids. Choosing `hard` will remove epitope entries with problematic amino acids.
This tool is used to identify positions in an epitope with an amino acid that is problematic for downstream processing, e.g. vaccine manufacturing. Since this can differ from case to case, this tool requires the user to specify which amino acid(s) to consider problematic. This can be specified in one of three formats:
`amino_acid(s)`: One or more one-letter amino acid codes. Any occurrence of this amino acid string, regardless of the position in the epitope, is problematic. When specifying more than one amino acid, they will need to occur together in the specified order.
`amino_acid:position`: A one letter amino acid code, followed by a colon separator, followed by a positive integer position (one-based). The occurrence of this amino acid at the position specified is problematic. E.g., G:2 would check for a Glycine at the second position of the epitope. The N-terminus is defined as position 1.
`amino_acid:-position`: A one letter amino acid code, followed by a colon separator, followed by a negative integer position. The occurrence of this amino acid at the specified position from the end of the epitope is problematic. E.g., G:-3 would check for a Glycine at the third position from the end of the epitope. The C-terminus is defined as position -1.
You may specify any number of these problematic amino acid(s), in any combination, by providing them as a comma-separated list.
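The three formats and the comma-separated combination can be illustrated with a short Python sketch. It is a simplified interpretation of the rules as documented here, not the tool's implementation: the function names `parse_rule` and `problematic_positions`, the example epitope, and the choice to report one-based positions are assumptions made for illustration.

```python
def parse_rule(rule: str):
    """Split a rule into (amino_acids, position); position is None
    when the rule applies anywhere in the epitope."""
    if ":" in rule:
        aa, pos = rule.split(":", 1)
        pos = int(pos)
        if pos == 0:
            raise ValueError("position must be a non-zero integer")
        return aa, pos
    return rule, None


def problematic_positions(epitope: str, rules: str) -> list:
    """Return sorted one-based positions matched by any rule."""
    positions = set()
    for rule in rules.split(","):
        aa, pos = parse_rule(rule.strip())
        if pos is None:
            # Match the amino acid string anywhere, in the given order.
            start = epitope.find(aa)
            while start != -1:
                positions.update(range(start + 1, start + len(aa) + 1))
                start = epitope.find(aa, start + 1)
        else:
            # Positive positions count from the N-terminus (1-based),
            # negative positions from the C-terminus (-1 is the last).
            index = pos - 1 if pos > 0 else len(epitope) + pos
            if 0 <= index < len(epitope) and epitope[index] == aa:
                positions.add(index + 1)
    return sorted(positions)


# Hypothetical 9-mer and a mixed rule list.
epitope = "CGAGKLVCM"
flagged = problematic_positions(epitope, "C,G:2,M:-1,KL")

# G:-3 looks at the third residue from the end (V here), so no match.
not_flagged = problematic_positions(epitope, "G:-3")
```

In this example `C` matches at positions 1 and 8, `G:2` at position 2, `KL` at positions 5 and 6, and `M:-1` at position 9. A soft filter would report such positions, while a hard filter would drop any epitope with a non-empty result.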
This tool may be used with any filtered.tsv or all_epitopes.tsv pVACsplice report file.