Optional Downstream Analysis Tools

Generate Protein Fasta

usage: pvacseq generate_protein_fasta [-h] [--input-tsv INPUT_TSV]
                                      [-p PHASED_PROXIMAL_VARIANTS_VCF]
                                      [--pass-only] [--biotypes BIOTYPES]
                                      [--allow-incomplete-transcripts]
                                      [--mutant-only]
                                      [--aggregate-report-evaluation AGGREGATE_REPORT_EVALUATION]
                                      [-d DOWNSTREAM_SEQUENCE_LENGTH]
                                      [-s SAMPLE_NAME]
                                      input_vcf flanking_sequence_length
                                      output_file

Generate an annotated fasta file from a VCF with protein sequences of
mutations and matching wildtypes

positional arguments:
  input_vcf             A VEP-annotated single- or multi-sample VCF containing
                        genotype, transcript, Wildtype protein sequence, and
                        Frameshift protein sequence information. The VCF may be
                        gzipped (requires tabix index).
  flanking_sequence_length
                        Number of amino acids to add on each side of the
                        mutation when creating the FASTA.
  output_file           The output fasta file.

optional arguments:
  -h, --help            show this help message and exit
  --input-tsv INPUT_TSV
                        A pVACseq all_epitopes, filtered, or aggregated TSV
                        file with epitopes to use for subsetting the input VCF
                        to peptides of interest. Only the peptide sequences
                        for the epitopes in the TSV will be used when creating
                        the FASTA. When running with an aggregated TSV, the
                        sequences will be further narrowed down to only
                        include variants with the selected --aggregate-report-
                        evaluation. (default: None)
  -p PHASED_PROXIMAL_VARIANTS_VCF, --phased-proximal-variants-vcf PHASED_PROXIMAL_VARIANTS_VCF
                        A VCF with phased proximal variant information to
                        incorporate into the predicted fasta sequences. Must
                        be gzipped and tabix indexed. (default: None)
  --pass-only           Only process VCF entries with a PASS status. (default:
                        False)
  --biotypes BIOTYPES   A list of biotypes to use for pre-filtering
                        transcripts for processing in the pipeline. (default:
                        ['protein_coding'])
  --allow-incomplete-transcripts
                        By default, transcripts annotated with incomplete CDS
                        (i.e., 'cds_start_NF' or 'cds_end_NF' flags in the VEP
                        CSQ field) are excluded from analysis, as they often
                        produce invalid protein sequences. Use this flag to
                        allow candidates from such transcripts. Only peptides
                        that do not contain 'X' will be included. These
                        candidates will be deprioritized relative to those
                        from transcripts without incomplete CDS flags.
                        (default: False)
  --mutant-only         Only output mutant peptide sequences (default: False)
  --aggregate-report-evaluation AGGREGATE_REPORT_EVALUATION
                        When running with an aggregate report input TSV, only
                        include variants with this evaluation. Valid values
                        for this field are Accept, Reject, Pending, and
                        Review. Specify multiple values as a comma-separated
                        list to include multiple evaluation states. (default:
                        Accept)
  -d DOWNSTREAM_SEQUENCE_LENGTH, --downstream-sequence-length DOWNSTREAM_SEQUENCE_LENGTH
                        Cap to limit the downstream sequence length for
                        frameshifts when creating the fasta file. Use 'full'
                        to include the full downstream sequence. (default:
                        1000)
  -s SAMPLE_NAME, --sample-name SAMPLE_NAME
                        The name of the sample being processed. Required when
                        processing a multi-sample VCF and must be a sample ID
                        in the input VCF #CHROM header line. (default: None)

This tool extracts protein sequences surrounding supported protein-altering variants in an input VCF file. One use case is selecting long peptides that contain short neoepitope candidates. For example, if pvacseq was run to predict nonamers (9-mers) that are good binders, the user may wish to select long peptide (e.g. 24-mer) sequences that contain the nonamer for synthesis or for encoding in a DNA vector. The protein sequence extracted will correspond to the transcript sequence used in the annotated VCF. The alteration in the VCF (e.g. a somatic missense SNV) will be centered in the protein sequence returned (if possible). If the variant is near the beginning or end of the CDS, it will be as close to center as possible while returning the desired protein sequence length. If the variant causes a frameshift, the downstream protein sequence will be returned up to the limit set by --downstream-sequence-length (default: 1000; use 'full' to include the full downstream sequence). The flanking_sequence_length positional parameter controls how many amino acids will be included on either side of the mutation.
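The centering behavior described above can be illustrated with a minimal Python sketch. It is not the pVACtools implementation: the function name `flanking_window` is hypothetical, and it assumes a single-residue substitution at a known 0-based position in the protein sequence.

```python
def flanking_window(protein: str, mutation_index: int, flanking_length: int) -> str:
    """Return up to 2 * flanking_length + 1 residues around a mutated residue.

    Hypothetical helper for illustration only; mutation_index is 0-based.
    """
    if not 0 <= mutation_index < len(protein):
        raise ValueError("mutation_index is outside the protein sequence")
    window = 2 * flanking_length + 1
    # Ideal start: the mutated residue sits in the middle of the window.
    start = mutation_index - flanking_length
    # Near either end of the sequence, shift the window so it stays inside
    # the protein while keeping the requested length where possible.
    start = max(0, min(start, len(protein) - window))
    return protein[start:start + window]


protein = "MKTAYIAKQRQISFVKSHFS"
print(flanking_window(protein, 10, 3))  # mutation in the middle of the protein
print(flanking_window(protein, 1, 3))   # mutation near the start
```

When the mutation is close to the start or end, the window keeps its length but the mutated residue is no longer exactly in the center, which mirrors the "as close to center as possible" wording above.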

To incorporate proximal variants in the final sequence, use the --phased-proximal-variants-vcf argument. Please see the Creating a phased VCF of proximal variants section of the documentation on how to create this VCF.

The output may be limited to PASS variants only by setting the --pass-only flag and to mutant sequences by setting the --mutant-only flag. Additionally, variants can be limited to specific transcript biotypes using the --biotypes parameter, which is set to only include protein_coding transcripts by default.

The output can be further limited to only certain variants by providing a pVACseq report file to the --input-tsv argument. Only the peptide sequences for the epitopes in the TSV will be used when creating the FASTA. If this argument is an aggregated TSV file, use the --aggregate-report-evaluation parameter to only include peptide sequences for epitopes matching the chosen Evaluation(s). This is useful when creating a peptide fasta for vaccine ordering after using pVACview to select vaccine candidates and exporting the results to a new TSV.
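As a rough illustration of what subsetting by Evaluation means, the sketch below filters the rows of an aggregated TSV by a comma-separated list of evaluation states. It only mirrors the described behavior and is not how pVACtools implements it; the helper name `select_by_evaluation` and the `ID` column in the sample data are placeholders.

```python
import csv
import io


def select_by_evaluation(tsv_text: str, evaluations: str = "Accept") -> list:
    """Keep rows whose Evaluation column is in a comma-separated list.

    Hypothetical helper; the default mirrors --aggregate-report-evaluation.
    """
    wanted = {value.strip() for value in evaluations.split(",") if value.strip()}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["Evaluation"] in wanted]


# Tiny made-up aggregated report with a placeholder ID column.
report = "ID\tEvaluation\nv1\tAccept\nv2\tReject\nv3\tReview\n"
for row in select_by_evaluation(report, "Accept,Review"):
    print(row["ID"], row["Evaluation"])
```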

Generate Aggregated Report

usage: pvacseq generate_aggregated_report [-h] [--tumor-purity TUMOR_PURITY]
                                          [-b BINDING_THRESHOLD]
                                          [--allele-specific-binding-thresholds]
                                          [--binding-percentile-threshold BINDING_PERCENTILE_THRESHOLD]
                                          [--immunogenicity-percentile-threshold IMMUNOGENICITY_PERCENTILE_THRESHOLD]
                                          [--presentation-percentile-threshold PRESENTATION_PERCENTILE_THRESHOLD]
                                          [--percentile-threshold-strategy {conservative,exploratory}]
                                          [--aggregate-inclusion-binding-threshold AGGREGATE_INCLUSION_BINDING_THRESHOLD]
                                          [--aggregate-inclusion-count-limit AGGREGATE_INCLUSION_COUNT_LIMIT]
                                          [-m {lowest,median}]
                                          [-m2 TOP_SCORE_METRIC2]
                                          [--trna-vaf TRNA_VAF]
                                          [--trna-cov TRNA_COV]
                                          [--expn-val EXPN_VAL]
                                          [--transcript-prioritization-strategy TRANSCRIPT_PRIORITIZATION_STRATEGY]
                                          [--maximum-transcript-support-level {1,2,3,4,5}]
                                          [--allele-specific-anchors]
                                          [--anchor-contribution-threshold ANCHOR_CONTRIBUTION_THRESHOLD]
                                          input_file output_file

Generate an aggregated report from a pVACseq .all_epitopes.tsv report file.

positional arguments:
  input_file            A pVACseq .all_epitopes.tsv report file
  output_file           The file path to write the aggregated report tsv to

optional arguments:
  -h, --help            show this help message and exit
  --tumor-purity TUMOR_PURITY
                        Value between 0 and 1 indicating the fraction of tumor
                        cells in the tumor sample. Information is used during
                        aggregate report creation for a simple estimation of
                        whether variants are subclonal or clonal based on VAF.
                        If not provided, purity is estimated directly from the
                        VAFs. (default: None)
  -b BINDING_THRESHOLD, --binding-threshold BINDING_THRESHOLD
                        Tier epitopes in the "Pass" tier when the mutant
                        allele has ic50 binding scores below this value.
                        (default: 500)
  --allele-specific-binding-thresholds
                        Use allele-specific binding thresholds. To print the
                        allele-specific binding thresholds run `pvacseq
                        allele_specific_cutoffs`. If an allele does not have a
                        special threshold value, the `--binding-threshold`
                        value will be used. (default: False)
  --binding-percentile-threshold BINDING_PERCENTILE_THRESHOLD
                        Tier epitopes in the "Pass" tier when the mutant
                        allele has a binding percentile below this value.
                        (default: 2.0)
  --immunogenicity-percentile-threshold IMMUNOGENICITY_PERCENTILE_THRESHOLD
                        Tier epitopes in the "Pass" tier when the mutant
                        allele has an immunogenicity percentile below this
                        value. (default: 2.0)
  --presentation-percentile-threshold PRESENTATION_PERCENTILE_THRESHOLD
                        Tier epitopes in the "Pass" tier when the mutant
                        allele has a presentation percentile below this value.
                        (default: 2.0)
  --percentile-threshold-strategy {conservative,exploratory}
                        Specify the candidate inclusion strategy. The
                        'conservative' option requires a candidate to pass the
                        binding threshold and all percentile thresholds
                        (default). The 'exploratory' option requires a
                        candidate to pass at the binding threshold or one of
                        the percentile thresholds. (default: conservative)
  --aggregate-inclusion-binding-threshold AGGREGATE_INCLUSION_BINDING_THRESHOLD
                        Binding threshold for including epitopes when creating
                        the aggregate report (default: 5000)
  --aggregate-inclusion-count-limit AGGREGATE_INCLUSION_COUNT_LIMIT
                        Limit neoantigen candidates included in the aggregate
                        report to only the best n candidates per variant. This
                        ensures performance when loading results into
                        pVACview, e.g. for frameshifts with potentially
                        hundreds of predictions. (default: 15)
  -m {lowest,median}, --top-score-metric {lowest,median}
                        The ic50 scoring metric to use when filtering epitopes
                        by binding-threshold or minimum fold change. lowest:
                        Use the best MT Score and Corresponding Fold Change
                        (i.e. the lowest MT ic50 binding score and
                        corresponding fold change of all chosen prediction
                        methods). median: Use the median MT Score and Median
                        Fold Change (i.e. the median MT ic50 binding score and
                        fold change of all chosen prediction methods).
                        (default: median)
  -m2 TOP_SCORE_METRIC2, --top-score-metric2 TOP_SCORE_METRIC2
                        Which metrics to consider when selecting the best
                        peptide and when sorting candidates within a tier.
                        Each specified metric will be ranked and the sum of
                        these ranks will be used. Available options are 'ic50',
                        'combined_percentile', 'binding_percentile',
                        'immunogenicity_percentile', and
                        'presentation_percentile'. Whether the lowest or median
                        is considered for each metric is controlled by the
                        --top-score-metric parameter. (default: ['ic50',
                        'combined_percentile'])
  --trna-vaf TRNA_VAF   Tumor RNA VAF Cutoff. Used to calculate the allele
                        expression cutoff for tiering. (default: 0.25)
  --trna-cov TRNA_COV   Tumor RNA Coverage Cutoff. Used as a cutoff for
                        tiering. (default: 10)
  --expn-val EXPN_VAL   Gene and Expression cutoff. Used to calculate the
                        allele expression cutoff for tiering. (default: 1.0)
  --transcript-prioritization-strategy TRANSCRIPT_PRIORITIZATION_STRATEGY
                        Specify the criteria to consider when prioritizing or
                        filtering transcripts of the neoantigen candidates
                        during aggregate report creation or TSL filtering.
                        'canonical' will prioritize/select candidates
                        resulting from variants on a Ensembl canonical
                        transcript. 'mane_select' will prioritize/select
                        candidates resulting from variants on a MANE select
                        transcript. 'tsl' will prioritize/select candidates
                        where the transcript support level (TSL) matches the
                        --maximum-transcript-support-level. When selecting
                        more than one criteria, a transcript meeting EITHER of
                        the selected criteria will be prioritized/selected.
                        (default: ['canonical', 'mane_select', 'tsl'])
  --maximum-transcript-support-level {1,2,3,4,5}
                        The threshold to use for filtering epitopes on the
                        Ensembl transcript support level (TSL). Transcript
                        support level needs to be <= this cutoff to be
                        included in most tiers. (default: 1)
  --allele-specific-anchors
                        Use allele-specific anchor positions when tiering
                        epitopes in the aggregate report. This option is
                        available for 8, 9, 10, and 11mers and only for HLA-A,
                        B, and C alleles. If this option is not enabled or as
                        a fallback for unsupported lengths and alleles, the
                        default positions of 1, 2, epitope length - 1, and
                        epitope length are used. Please see
                        https://doi.org/10.1101/2020.12.08.416271 for more
                        details. (default: False)
  --anchor-contribution-threshold ANCHOR_CONTRIBUTION_THRESHOLD
                        For determining allele-specific anchors, each position
                        is assigned a score based on how binding is influenced
                        by mutations. From these scores, the relative
                        contribution of each position to the overall binding
                        is calculated. Starting with the highest relative
                        contribution, positions whose scores together account
                        for the selected contribution threshold are assigned
                        as anchor locations. As a result, a higher threshold
                        leads to the inclusion of more positions to be
                        considered anchors. (default: 0.8)

This tool produces an aggregated version of the all_epitopes TSV. It finds the best-scoring epitope for each variant (ranked on the metrics chosen with --top-score-metric2, by default ic50 and combined_percentile) and outputs additional binding affinity, expression, and coverage information for that epitope. It also reports the total number of well-scoring epitopes for each variant, the number of transcripts covered by those epitopes, and the HLA alleles to which those epitopes bind well. Lastly, the report bins variants into tiers that suggest how suitable each variant is for use in vaccines. For a full definition of these tiers, see the pVACseq output file documentation.
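The two --percentile-threshold-strategy options listed in the usage above can be read as an AND versus an OR over the individual threshold checks. The sketch below illustrates that reading only; the function name, the dictionary keys, and the treatment of every value as a plain number (no missing values) are assumptions, not the tool's code.

```python
# Defaults taken from the option descriptions above.
DEFAULT_PERCENTILE_THRESHOLDS = {
    "binding": 2.0,
    "immunogenicity": 2.0,
    "presentation": 2.0,
}


def passes_thresholds(ic50, percentiles, binding_threshold=500.0,
                      percentile_thresholds=None, strategy="conservative"):
    """Combine the ic50 check and the percentile checks (hypothetical helper)."""
    thresholds = percentile_thresholds or DEFAULT_PERCENTILE_THRESHOLDS
    binding_ok = ic50 < binding_threshold
    percentile_checks = [percentiles[name] < cutoff
                         for name, cutoff in thresholds.items()]
    if strategy == "conservative":
        # Must pass the binding threshold AND every percentile threshold.
        return binding_ok and all(percentile_checks)
    if strategy == "exploratory":
        # Passing the binding threshold OR any one percentile threshold suffices.
        return binding_ok or any(percentile_checks)
    raise ValueError(f"unknown strategy: {strategy}")


candidate = {"binding": 1.0, "immunogenicity": 5.0, "presentation": 1.0}
print(passes_thresholds(300.0, candidate, strategy="conservative"))
print(passes_thresholds(300.0, candidate, strategy="exploratory"))
```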

Add Evaluation Predictions Using a Pre-Trained Machine Learning Model

usage: pvacseq add_ml_predictions [-h] [--artifacts-path ARTIFACTS_PATH]
                                  [--output-dir OUTPUT_DIR]
                                  [--ml-threshold-accept ML_THRESHOLD_ACCEPT]
                                  [--ml-threshold-reject ML_THRESHOLD_REJECT]
                                  class1_aggregated class1_all_epitopes
                                  class2_aggregated sample_name

Add ML-based neoantigen evaluation predictions to existing pVACseq output files.

positional arguments:
  class1_aggregated     Path to the MHC Class I aggregated epitopes TSV.
  class1_all_epitopes   Path to the MHC Class I all epitopes TSV.
  class2_aggregated     Path to the MHC Class II aggregated epitopes TSV.
  sample_name           Sample name prefix to use for the output files.

optional arguments:
  -h, --help            show this help message and exit
  --artifacts-path ARTIFACTS_PATH
                        Optional path to a directory containing ML model artifacts. Defaults to the package-provided artifacts.
  --output-dir OUTPUT_DIR
                        Directory where the ML prediction TSV should be written. Defaults to the directory containing the Class I aggregated file.
  --ml-threshold-accept ML_THRESHOLD_ACCEPT
                        Prediction threshold for Accept predictions (default: 0.55).
  --ml-threshold-reject ML_THRESHOLD_REJECT
                        Prediction threshold for Reject predictions (default: 0.30).

This tool adds machine learning (ML)-based neoantigen prioritization predictions to existing pVACseq output files. It uses a trained random forest model to predict whether neoantigen candidates should be evaluated as “Accept”, “Reject”, or “Pending” based on a comprehensive set of features derived from binding affinity predictions, expression data, and variant characteristics.

This tool requires that you have already generated both MHC Class I and Class II aggregated reports using the generate_aggregated_report command or by running the pVACseq pipeline (pvacseq run). It takes as input the Class I aggregated TSV, Class I all epitopes TSV, and Class II aggregated TSV files from a pVACseq run. The tool merges these files, performs data cleaning and imputation, and applies the ML model to generate evaluation predictions for each variant.

Note that the built-in ML model was trained with most of the features listed under Features. It is STRONGLY recommended to use the all option for the prediction_algorithms parameter when running the pVACseq pipeline for the best predictions.

By default, the output file is written to the same directory as the Class I aggregated file (use --output-dir to write it elsewhere) as <sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv. The output file contains all columns from the original Class I aggregated file with some changes:

`Evaluation`	The ML-predicted evaluation status: “Accept”, “Reject”, or “Pending”, based on the prediction probability score.
`ML Prediction (score)`	A formatted output combining the model-predicted evaluation with the prediction probability score (e.g., “Accept (0.72)”). It shows “NA” for variants where the model could not make a prediction, which may be due to a candidate having Class I algorithm predictions but not Class II algorithm predictions, causing the Class I and Class II aggregated reports to have different numbers of rows.

The --ml-threshold-accept parameter controls the probability threshold for Accept predictions (default: 0.55). Variants with prediction probabilities >= this threshold are evaluated as “Accept”. The --ml-threshold-reject parameter controls the probability threshold for Reject predictions (default: 0.30). Variants with prediction probabilities <= this threshold are evaluated as “Reject”. Everything in between is set to “Pending” for manual review. The --artifacts-path parameter allows you to specify a custom directory containing ML model artifacts. By default the tool uses the model artifacts included with the pvactools package.
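The threshold rules above amount to a three-way split on the prediction probability. A minimal sketch, assuming a probability between 0 and 1 (the function names are hypothetical, and the two-decimal formatting is inferred from the “Accept (0.72)” example rather than confirmed):

```python
def classify(probability: float, accept_threshold: float = 0.55,
             reject_threshold: float = 0.30) -> str:
    """Map a prediction probability to an evaluation label."""
    if probability >= accept_threshold:
        return "Accept"
    if probability <= reject_threshold:
        return "Reject"
    # Everything strictly between the two thresholds is left for manual review.
    return "Pending"


def format_prediction(probability: float) -> str:
    """Combine label and score; two decimals is an assumption."""
    return f"{classify(probability)} ({probability:.2f})"


for p in (0.72, 0.55, 0.40, 0.30):
    print(format_prediction(p))
```

Raising the accept threshold or lowering the reject threshold widens the “Pending” band, which leaves more candidates for manual review.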

Calculate Reference Proteome Similarity

usage: pvacseq calculate_reference_proteome_similarity [-h]
                                                       [--match-length MATCH_LENGTH]
                                                       [--species SPECIES]
                                                       [--blastp-path BLASTP_PATH]
                                                       [--blastp-db {refseq_select_prot,refseq_protein}]
                                                       [--peptide-fasta PEPTIDE_FASTA]
                                                       [-t N_THREADS]
                                                       [-m AGGREGATE_METRICS_FILE]
                                                       input_file input_fasta
                                                       output_file

Identify which epitopes in a pVACseq|pVACfuse|pVACbind report file have
matches in the reference proteome using either BLASTp or by checking directly
against a reference proteome FASTA.

positional arguments:
  input_file            Input filtered, all_epitopes, or aggregated report
                        file with predicted epitopes.
  input_fasta           For pVACbind, the original input FASTA file. For
                        pVACseq, pVACfuse, and pVACsplice a FASTA file with
                        mutant peptide sequences for each variant isoform. For
                        pVACseq and pVACfuse, this file can be found in the
                        same directory as the input
                        filtered.tsv/all_epitopes.tsv file. For pVACsplice,
                        this file can be found in the main output directory.
                        Can also be generated by running
                        `pvacseq|pvacfuse|pvacsplice generate_protein_fasta`.
  output_file           Output TSV filename of report file with epitopes with
                        reference matches marked.

optional arguments:
  -h, --help            show this help message and exit
  --match-length MATCH_LENGTH
                        The minimum number of consecutive amino acids that
                        need to match. (default: 8)
  --species SPECIES     The species of the data in the input file. (default:
                        human)
  --blastp-path BLASTP_PATH
                        Blastp installation path. (default: None)
  --blastp-db {refseq_select_prot,refseq_protein}
                        The blastp database to use. (default:
                        refseq_select_prot)
  --peptide-fasta PEPTIDE_FASTA
                        A reference peptide FASTA file to use for finding
                        reference matches instead of blastp. (default: None)
  -t N_THREADS, --n-threads N_THREADS
                        Number of threads to use for parallelizing BLAST
                        calls. (default: 1)
  -m AGGREGATE_METRICS_FILE, --aggregate-metrics-file AGGREGATE_METRICS_FILE
                        When running with the aggregate report as an input
                        tsv, optionally provide the metrics.json file to
                        update with detailed reference match data for display
                        in pVACview. (default: None)

This tool finds matches of the epitope candidates in the reference proteome and returns the results in an output TSV and reference_match file pair. It requires a pVACseq run’s fasta file in order to look up the larger peptide sequence the epitope was derived from. Any substring of that peptide sequence that matches the reference proteome and is at least as long as the specified match length will be considered a hit. The tool also requires a filtered.tsv, all_epitopes.tsv, or aggregated.tsv pVACseq report file as input; every candidate in this file will be searched for.

This tool may be run with BLASTp using either the refseq_select_prot or refseq_protein database. By default, BLASTp is run through the BLAST API, but users may independently install BLASTp and point the tool to it with --blastp-path. Alternatively, users may provide a reference proteome fasta file, and this tool will string match on the entries of this fasta file directly. This approach is recommended because it is significantly faster than BLASTp. Reference proteome fasta files may be downloaded from Ensembl. For example, the latest reference proteome fasta for human can be downloaded from this link.
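The direct string-matching approach can be pictured as sliding a window of the match length along the peptide and looking each window up in the reference proteins. The sketch below is an illustration under that assumption, not the tool's code; `find_reference_matches` and the in-memory dictionary standing in for a parsed reference FASTA are hypothetical.

```python
def find_reference_matches(peptide: str, reference_proteome: dict,
                           match_length: int = 8) -> list:
    """Return (substring, protein_name) pairs found in the reference.

    Checking windows of exactly match_length is enough: any longer match
    necessarily contains a matching window of that length.
    """
    hits = []
    for start in range(len(peptide) - match_length + 1):
        kmer = peptide[start:start + match_length]
        for name, sequence in reference_proteome.items():
            if kmer in sequence:
                hits.append((kmer, name))
    return hits


# Made-up reference entry standing in for a parsed proteome FASTA.
reference = {"protA": "MKTAYIAKQRQISFVKSHFSRQ"}
print(find_reference_matches("GGAKQRQISFGG", reference))
```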

For more details on the generated reference_match file, see the pVACseq output file documentation.

NetChop Predict Cleavage Sites

usage: pvacseq net_chop [-h] [--method {cterm,20s}] [--threshold THRESHOLD]
                        input_file input_fasta output_file

Predict cleavage sites for neoepitopes.

positional arguments:
  input_file            Input filtered file with predicted epitopes.
  input_fasta           The required fasta file.
  output_file           Output tsv filename for putative neoepitopes.

optional arguments:
  -h, --help            show this help message and exit
  --method {cterm,20s}  NetChop prediction method to use ("cterm" for C term
                        3.0, "20s" for 20S 3.0). (default: cterm)
  --threshold THRESHOLD
                        NetChop prediction threshold. (default: 0.5)

This tool uses NetChop to predict cleavage sites for neoepitopes from a pVACseq run’s filtered/all_epitopes TSV. It adds three columns to the output TSV: Best Cleavage Position, Best Cleavage Score, and a Cleavage Sites list. Typically this step is done in the pVACseq run pipeline for the filtered output TSV when specified. This tool provides a way to run it manually on pVACseq’s generated filtered/all_epitopes TSV files to add this information when it is not already present.

You can view more information about these columns for pVACseq in the output file documentation.

NetMHCStab Predict Stability

usage: pvacseq netmhc_stab [-h] [-m {lowest,median}] [-m2 TOP_SCORE_METRIC2]
                           input_file output_file

Add stability predictions to predicted neoepitopes.

positional arguments:
  input_file            Input filtered file with predicted epitopes.
  output_file           Output TSV filename for putative neoepitopes.

optional arguments:
  -h, --help            show this help message and exit
  -m {lowest,median}, --top-score-metric {lowest,median}
                        The ic50 scoring metric to use when sorting epitopes.
                        lowest: Use the best MT Score and Corresponding Fold
                        Change (i.e. the lowest MT ic50 binding score and
                        corresponding fold change of all chosen prediction
                        methods). median: Use the median MT Score and Median
                        Fold Change (i.e. the median MT ic50 binding score and
                        fold change of all chosen prediction methods).
                        (default: median)
  -m2 TOP_SCORE_METRIC2, --top-score-metric2 TOP_SCORE_METRIC2
                        Which metrics to consider when sorting the results.
                        All listed metrics will be rank scored and the sum of
                        those rank scores will be used. Available options are
                        'ic50', 'combined_percentile', 'binding_percentile',
                        'immunogenicity_percentile', and
                        'presentation_percentile'. Whether the lowest or median
                        is considered for each metric is controlled by the
                        --top-score-metric parameter. (default: ['ic50',
                        'combined_percentile'])

This tool uses NetMHCstabpan to add stability predictions for neoepitopes from a pVACseq run’s filtered/all_epitopes TSV. It adds four columns to the output TSV: Predicted Stability, Half Life, Stability Rank, and NetMHCStab Allele. Typically this step is done in the pVACseq run pipeline for the filtered output TSV when specified. This tool provides a way to run it manually on pVACseq’s generated filtered/all_epitopes TSV files to add this information when it is not already present.

You can view more information about these columns for pVACseq in the output file documentation.
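The -m2/--top-score-metric2 option in the usage above sorts results by ranking each listed metric and summing the ranks. The sketch below shows one way to read that description. It assumes lower values are better for every metric and breaks ties by input order, neither of which is confirmed by the help text; the function name and the sample records are made up.

```python
def rank_sum_order(candidates: list, metrics=("ic50", "combined_percentile")) -> list:
    """Sort candidates by the sum of their per-metric ranks (1 = best)."""
    totals = [0] * len(candidates)
    for metric in metrics:
        # Rank candidates on this metric, lowest value first.
        ordered = sorted(range(len(candidates)),
                         key=lambda i: candidates[i][metric])
        for rank, index in enumerate(ordered, start=1):
            totals[index] += rank
    best_first = sorted(range(len(candidates)), key=lambda i: totals[i])
    return [candidates[i] for i in best_first]


# Made-up candidates for illustration.
candidates = [
    {"id": "a", "ic50": 100.0, "combined_percentile": 3.0},
    {"id": "b", "ic50": 50.0, "combined_percentile": 1.0},
    {"id": "c", "ic50": 200.0, "combined_percentile": 0.5},
]
print([c["id"] for c in rank_sum_order(candidates)])
```

Summing ranks rather than raw values keeps metrics on different scales (nM ic50 values versus percentiles) from dominating one another.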

Identify Problematic Amino Acids

usage: pvacseq identify_problematic_amino_acids [-h]
                                                [--filter-type {soft,hard}]
                                                input_file output_file
                                                problematic_amino_acids

Mark problematic amino acid positions in each epitope or filter entries that have problematic amino acids.

positional arguments:
  input_file            Input filtered, all_epitopes, or aggregated file with predicted epitopes.
  output_file           Output .tsv file with identification of problematic amino acids or hard-filtered to remove epitopes with problematic amino acids.
  problematic_amino_acids
                        A list of amino acids to consider as problematic. Each entry can be specified in the following format:
                        `amino_acid(s)`: One or more one-letter amino acid codes. Any occurrence of this amino acid string,
                                         regardless of the position in the epitope, is problematic. When specifying more than
                                         one amino acid, they will need to occur together in the specified order.
                        `amino_acid:position`: A one letter amino acid code, followed by a colon separator, followed by a positive
                                               integer position (one-based). The occurrence of this amino acid at the position
                                               specified is problematic. E.g., G:2 would check for a Glycine at the second position
                                               of the epitope. The N-terminus is defined as position 1.
                        `amino_acid:-position`: A one letter amino acid code, followed by a colon separator, followed by a negative
                                                integer position. The occurrence of this amino acid at the specified position from
                                                the end of the epitope is problematic. E.g., G:-3 would check for a Glycine at the
                                                third position from the end of the epitope. The C-terminus is defined as position -1.

optional arguments:
  -h, --help            show this help message and exit
  --filter-type {soft,hard}, -f {soft,hard}
                        Set the type of filtering done. Choosing `soft` will add a new column "Problematic Positions" (for filtered or all_epitopes input files) or "Prob Pos" (for aggregated input files) that lists positions in the epitope with problematic amino acids. Choosing `hard` will remove epitope entries with problematic amino acids.

This tool is used to identify positions in an epitope with an amino acid that is problematic for downstream processing, e.g. vaccine manufacturing. Since this can differ from case to case, this tool requires the user to specify which amino acid(s) to consider problematic. This can be specified in one of three formats:

`amino_acid(s)`	One or more one-letter amino acid codes. Any occurrence of this amino acid string, regardless of the position in the epitope, is problematic. When specifying more than one amino acid, they will need to occur together in the specified order.
`amino_acid:position`	A one-letter amino acid code, followed by a colon separator, followed by a positive integer position (one-based). The occurrence of this amino acid at the position specified is problematic. E.g., G:2 would check for a Glycine at the second position of the epitope. The N-terminus is defined as position 1.
`amino_acid:-position`	A one-letter amino acid code, followed by a colon separator, followed by a negative integer position. The occurrence of this amino acid at the specified position from the end of the epitope is problematic. E.g., G:-3 would check for a Glycine at the third position from the end of the epitope. The C-terminus is defined as position -1.

You may specify any number of these problematic amino acid(s), in any combination, by providing them as a comma-separated list.
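
The three formats and the comma-separated list can be illustrated with a short Python sketch. This is not the pVACseq implementation: `parse_spec` and `problematic_positions` are hypothetical helper names, and reporting every residue of a multi-letter match as a problematic position is an assumption made for this example.

```python
def parse_spec(spec):
    """Split one entry such as 'C', 'DP', 'G:2' or 'G:-3' into (residues, position)."""
    if ":" in spec:
        residue, position = spec.split(":")
        return residue, int(position)
    return spec, None


def problematic_positions(epitope, spec_list):
    """Return the sorted one-based positions in `epitope` matched by any entry."""
    hits = set()
    for spec in spec_list.split(","):
        residues, position = parse_spec(spec.strip())
        if position is None:
            # Position-free entry: every occurrence of the string counts.
            start = epitope.find(residues)
            while start != -1:
                # Assumption: all residues of a multi-letter match are reported.
                hits.update(range(start + 1, start + len(residues) + 1))
                start = epitope.find(residues, start + 1)
        else:
            # Positive positions count from the N-terminus (1 is the first residue);
            # negative positions count from the C-terminus (-1 is the last residue).
            index = position - 1 if position > 0 else len(epitope) + position
            if 0 <= index < len(epitope) and epitope[index] == residues:
                hits.add(index + 1)
    return sorted(hits)


print(problematic_positions("CGKLDPAGC", "C,G:2,G:-2,DP"))  # [1, 2, 5, 6, 8, 9]
```

Under these assumptions, a `soft` filter corresponds to reporting the returned positions, while a `hard` filter corresponds to dropping any epitope for which the list is not empty.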

This tool may be used with any filtered.tsv or all_epitopes.tsv pVACseq report file.

Mark Genes of Interest

usage: pvacseq mark_genes_of_interest [-h]
                                      [--genes-of-interest-file GENES_OF_INTEREST_FILE]
                                      input_file output_file

Mark predictions resulting from variants on a genes of interest list.

positional arguments:
  input_file            Input filtered or all_epitopes file with predicted epitopes.
  output_file           Output .tsv file with identification of predictions resulting from variants on the genes of interest list.

optional arguments:
  -h, --help            show this help message and exit
  --genes-of-interest-file GENES_OF_INTEREST_FILE
                        A genes of interest file. Predictions resulting from variants on genes in this list will be marked in the output file. The file should be formatted to have each gene on a separate line without a header line. If no file is specified, the Cancer Gene Census list of high-confidence genes is used as the default.
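
The expected layout of the genes of interest file (one gene per line, no header line) and the marking step can be sketched in Python as follows. This is only an illustration of the described behavior, not the tool's code: the helper names, the "Gene Name" input column, the "Gene of Interest" output column, and the example rows are all assumptions.

```python
import csv
import io


def load_genes(handle):
    """Read one gene symbol per line (no header), skipping blank lines."""
    return {line.strip() for line in handle if line.strip()}


def mark_rows(report_handle, genes, gene_column="Gene Name",
              flag_column="Gene of Interest"):
    """Add a flag column to each row of a tab-separated report.

    Both column names are assumptions for this sketch.
    """
    reader = csv.DictReader(report_handle, delimiter="\t")
    rows = []
    for row in reader:
        row[flag_column] = "True" if row[gene_column] in genes else "False"
        rows.append(row)
    return rows


# In-memory stand-ins for the genes of interest file and the input report.
genes = load_genes(io.StringIO("TP53\nKRAS\n\n"))
report = io.StringIO(
    "Gene Name\tMT Epitope Seq\n"
    "TP53\tKLDPAGCAV\n"
    "TTN\tGILGFVFTL\n"
)
marked = mark_rows(report, genes)
```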

Update Tiers

usage: pvacseq update_tiers [-h] [-b BINDING_THRESHOLD]
                            [--allele-specific-binding-thresholds]
                            [--percentile-threshold PERCENTILE_THRESHOLD]
                            [--binding-percentile-threshold BINDING_PERCENTILE_THRESHOLD]
                            [--immunogenicity-percentile-threshold IMMUNOGENICITY_PERCENTILE_THRESHOLD]
                            [--presentation-percentile-threshold PRESENTATION_PERCENTILE_THRESHOLD]
                            [--percentile-threshold-strategy {conservative,exploratory}]
                            [-m2 TOP_SCORE_METRIC2] [--trna-vaf TRNA_VAF]
                            [--trna-cov TRNA_COV] [--expn-val EXPN_VAL]
                            [--transcript-prioritization-strategy TRANSCRIPT_PRIORITIZATION_STRATEGY]
                            [--maximum-transcript-support-level {1,2,3,4,5}]
                            [--allele-specific-anchors]
                            [--anchor-contribution-threshold ANCHOR_CONTRIBUTION_THRESHOLD]
                            input_file metrics_file vaf_clonal

Update tiers in an aggregated report in order to, for example, use different
thresholds or account for problematic position or reference match information
generated after the initial pipeline run.

positional arguments:
  input_file            Input aggregated file with tiers to update. This file
                        will be overwritten with the output.
  metrics_file          metrics.json file corresponding to the input
                        aggregated file. This file will be overwritten to
                        update tiering parameters used by this command.
  vaf_clonal            The RNA VAF threshold to determine whether a candidate
                        is considered clonal. Any candidates with RNA VAF <
                        vaf_clonal/2 will be considered subclonal.

optional arguments:
  -h, --help            show this help message and exit
  -b BINDING_THRESHOLD, --binding-threshold BINDING_THRESHOLD
                        IC50 binding threshold to consider when evaluating the
                        binding criteria. Candidates where the mutant allele
                        has IC50 binding scores below this value will be
                        considered good binders. (default: 500)
  --allele-specific-binding-thresholds
                        Use allele-specific binding thresholds when evaluating
                        the binding criteria for tiering. To print the allele-
                        specific binding thresholds run `pvacseq
                        allele_specific_cutoffs`. If an allele does not have a
                        special threshold value, the `--binding-threshold`
                        value will be used. (default: False)
  --percentile-threshold PERCENTILE_THRESHOLD
                        Account for the IC50 percentile rank when evaluating
                        the binding criteria for tiering. A candidate's
                        percentile rank must be below this value. (default:
                        None)
  --binding-percentile-threshold BINDING_PERCENTILE_THRESHOLD
                        Tier epitopes in the "Pass" tier when the mutant
                        allele has a binding percentile below this value.
                        (default: 2.0)
  --immunogenicity-percentile-threshold IMMUNOGENICITY_PERCENTILE_THRESHOLD
                        Tier epitopes in the "Pass" tier when the mutant
                        allele has an immunogenicity percentile below this
                        value. (default: 2.0)
  --presentation-percentile-threshold PRESENTATION_PERCENTILE_THRESHOLD
                        Tier epitopes in the "Pass" tier when the mutant
                        allele has a presentation percentile below this value.
                        (default: 2.0)
  --percentile-threshold-strategy {conservative,exploratory}
                        Specify the candidate inclusion strategy. The
                        'conservative' option requires a candidate to pass the
                        binding threshold and all percentile thresholds
                        (default). The 'exploratory' option requires a
                        candidate to pass EITHER the binding threshold or one
                        of the percentile thresholds. (default: conservative)
  -m2 TOP_SCORE_METRIC2, --top-score-metric2 TOP_SCORE_METRIC2
                        Which metrics to consider when sorting candidates
                        within a tier. Each specified metric will be ranked
                        and the sum of these ranks will be used for
                        sorting. Available options are 'ic50',
                        'combined_percentile', 'binding_percentile',
                        'immunogenicity_percentile', and
                        'presentation_percentile'. Whether the lowest or median
                        is considered for each metric is controlled by the
                        --top-score-metric parameter. (default: ['ic50',
                        'combined_percentile'])
  --trna-vaf TRNA_VAF   Tumor RNA VAF Cutoff in decimal format to consider
                        when evaluating the expression criteria. Only sites
                        above this cutoff will be considered. (default: 0.25)
  --trna-cov TRNA_COV   Tumor RNA Coverage Cutoff to consider when evaluating
                        the expression criteria. Only sites above this read
                        depth cutoff will be considered. (default: 10)
  --expn-val EXPN_VAL   Gene and Transcript Expression cutoff. Sites above
                        this cutoff will be considered. (default: 1.0)
  --transcript-prioritization-strategy TRANSCRIPT_PRIORITIZATION_STRATEGY
                        Specify the criteria to consider when evaluating
                        transcripts of the neoantigen candidates. 'canonical'
                        will consider a candidate to come from a good
                        transcript if the transcript is an Ensembl canonical
                        transcript. 'mane_select' will consider a candidate to
                        come from a good transcript if the transcript is a
                        MANE select transcript. 'tsl' will consider a
                        candidate to come from a good transcript if the
                        transcript's support level (TSL) passes the
                        --maximum-transcript-support-level. When selecting
                        more than one criterion, a transcript meeting ANY of
                        the selected criteria will be prioritized/selected.
                        (default: ['canonical', 'mane_select', 'tsl'])
  --maximum-transcript-support-level {1,2,3,4,5}
                        The threshold to use for filtering epitopes on the
                        Ensembl transcript support level (TSL). Keep all
                        epitopes with a transcript support level <= this
                        cutoff. (default: 1)
  --allele-specific-anchors
                        Use allele-specific anchor positions when evaluating
                        the anchor criteria for tiering epitopes in the
                        aggregate report. This option is available for 8, 9,
                        10, and 11mers and only for HLA-A, B, and C alleles.
                        If this option is not enabled or as a fallback for
                        unsupported lengths and alleles, the default positions
                        of 1, 2, epitope length - 1, and epitope length are
                        used. Please see
                        https://doi.org/10.1101/2020.12.08.416271 for more
                        details. (default: False)
  --anchor-contribution-threshold ANCHOR_CONTRIBUTION_THRESHOLD
                        For determining allele-specific anchors, each position
                        is assigned a score based on how binding is influenced
                        by mutations. From these scores, the relative
                        contribution of each position to the overall binding
                        is calculated. Starting with the highest relative
                        contribution, positions whose scores together account
                        for the selected contribution threshold are assigned
                        as anchor locations. As a result, a higher threshold
                        leads to the inclusion of more positions to be
                        considered anchors. (default: 0.8)
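
Two of the options above describe small decision procedures that may be easier to follow as code. The Python sketch below is an interpretation of the help text only, not the pVACseq source. The function names, the dictionary-based inputs, the choice of which percentiles are checked, and the decision to include the position that crosses the contribution threshold are all assumptions.

```python
def passes_thresholds(ic50, percentiles, binding_threshold=500.0,
                      percentile_thresholds=None, strategy="conservative"):
    """Combine the IC50 check and the percentile checks.

    `percentiles` and `percentile_thresholds` are dicts keyed by metric name.
    """
    if percentile_thresholds is None:
        percentile_thresholds = {
            "binding": 2.0, "immunogenicity": 2.0, "presentation": 2.0,
        }
    checks = [ic50 < binding_threshold]
    checks += [percentiles[name] < cutoff
               for name, cutoff in percentile_thresholds.items()]
    if strategy == "conservative":
        return all(checks)   # every criterion must pass
    if strategy == "exploratory":
        return any(checks)   # a single passing criterion is enough
    raise ValueError(f"unknown strategy: {strategy}")


def anchor_positions(position_scores, contribution_threshold=0.8):
    """Pick anchors by accumulating relative contributions, largest first."""
    total = sum(position_scores.values())
    ranked = sorted(position_scores.items(),
                    key=lambda item: item[1], reverse=True)
    anchors, cumulative = [], 0.0
    for position, score in ranked:
        if cumulative >= contribution_threshold:
            break
        # Assumption: the position that crosses the threshold is included.
        anchors.append(position)
        cumulative += score / total
    return sorted(anchors)


# Hypothetical per-position scores for a 9-mer (values sum to 100).
scores = {1: 10, 2: 45, 3: 5, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 30}
print(anchor_positions(scores, 0.7))  # [2, 9]
print(anchor_positions(scores, 0.8))  # [1, 2, 9]
```

With these made-up scores, raising the threshold from 0.7 to 0.8 adds position 1 to the anchors, which matches the statement that a higher threshold leads to more positions being treated as anchors.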

Create Peptide Ordering Form

usage: pvacseq create_peptide_ordering_form [-h] [-o OUTPUT_PATH]
                                            [-p PHASED_PROXIMAL_VARIANTS_VCF]
                                            [--external-vcf EXTERNAL_VCF]
                                            [--pass-only]
                                            [--biotypes BIOTYPES]
                                            [--allow-incomplete-transcripts]
                                            [-d DOWNSTREAM_SEQUENCE_LENGTH]
                                            [--aggregate-report-evaluation AGGREGATE_REPORT_EVALUATION]
                                            [--classI-IC50 CLASSI_IC50]
                                            [--classI-percent CLASSI_PERCENT]
                                            [--classII-IC50 CLASSII_IC50]
                                            [--classII-percent CLASSII_PERCENT]
                                            [--prob-pos PROB_POS]
                                            input_vcf flanking_sequence_length
                                            classI_aggregated_tsv
                                            classII_aggregated_tsv
                                            output_file_prefix sample_name

Generate peptide ordering files (FASTA, annotated ordering Excel spreadsheet,
and review template Excel spreadsheet) to streamline preparation of peptides
for synthesis and review.

positional arguments:
  input_vcf             A VEP-annotated single- or multi-sample VCF containing
                        genotype, transcript, Wildtype protein sequence, and
                        Frameshift protein sequence information. The VCF may
                        be gzipped (requires tabix index). This VCF will be
                        used to extract peptide sequences for processable
                        variants with 25 flanking amino acids on either side
                        of the mutation. These sequences will be included in
                        the peptide ordering spreadsheet.
  flanking_sequence_length
                        Number of amino acids to add on each side of the
                        mutation when creating the FASTA.
  classI_aggregated_tsv
                        The path to the classI all_epitopes.aggregated.tsv
                        file with the Evaluation column filled in to mark
                        candidates to process as 'Accept'. Only candidates
                        marked as Accept in this file will be included in the
                        ordering spreadsheet. This file is commonly created by
                        importing the aggregated class I report from pVACseq
                        into pVACview, investigating candidates, selecting
                        appropriate evaluations, and exporting the results in
                        TSV format.
  classII_aggregated_tsv
                        The path to the classII all_epitopes.aggregated.tsv
  output_file_prefix    The prefix for the output files' names
  sample_name           The name of the sample being processed. Must be a
                        sample ID in the input VCF #CHROM header line.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_PATH, --output-path OUTPUT_PATH
                        The path where the output will be generated. A
                        directory will be created if not specified. (default:
                        None)
  -p PHASED_PROXIMAL_VARIANTS_VCF, --phased-proximal-variants-vcf PHASED_PROXIMAL_VARIANTS_VCF
                        A VCF with phased proximal variant information to
                        incorporate into the predicted fasta sequences
                        generated from the input_vcf. Must be gzipped and
                        tabix indexed. (default: None)
  --external-vcf EXTERNAL_VCF
                        A VCF file from an external provider to check variants
                        against. Any variant with a PASS filter or no other
                        filter applied will be marked as called in the
                        "Variant Called in External VCF" column of the updated
                        aggregated report
                        "<sample_name>.Annotated.Neoantigen_Candidates.xlsx"
                        (default: None)
  --pass-only           Only process VCF entries with a PASS status. (default:
                        False)
  --biotypes BIOTYPES   A list of biotypes to use for pre-filtering
                        transcripts when generating peptide sequences from the
                        input_vcf. (default: ['protein_coding'])
  --allow-incomplete-transcripts
                        By default, transcripts annotated with incomplete CDS
                        (i.e., 'cds_start_NF' or 'cds_end_NF' flags in the VEP
                        CSQ field) are excluded from analysis, as they often
                        produce invalid protein sequences. Use this flag to
                        allow candidates from such transcripts. Only peptides
                        that do not contain 'X' will be included. These
                        candidates will be deprioritized relative to those
                        from transcripts without incomplete CDS flags.
                        (default: False)
  -d DOWNSTREAM_SEQUENCE_LENGTH, --downstream-sequence-length DOWNSTREAM_SEQUENCE_LENGTH
                        Cap to limit the downstream sequence length for
                        frameshifts when creating the fasta file. Use 'full'
                        to include the full downstream sequence. (default:
                        1000)
  --aggregate-report-evaluation AGGREGATE_REPORT_EVALUATION
                        Only include variants where the Evaluation column in
                        the classI_aggregated_tsv matches this evaluation.
                        Valid values for this field are Accept, Reject,
                        Pending, and Review. Specify multiple values as a
                        comma-separated list to include multiple evaluation
                        states. (default: Accept)
  --classI-IC50 CLASSI_IC50
                        Bold the Best Peptide from the classI_aggregated_tsv
                        file in the 'CANDIDATE NEOANTIGEN AMINO ACID SEQUENCE
                        WITH FLANKING RESIDUES' column of the ordering
                        spreadsheet only if the IC50 score is less than this
                        cutoff or the --classI-percent cutoff is met.
                        (default: 1000)
  --classI-percent CLASSI_PERCENT
                        Color the Best Peptide from the classI_aggregated_tsv
                        file in the 'CANDIDATE NEOANTIGEN AMINO ACID SEQUENCE
                        WITH FLANKING RESIDUES' column of the ordering
                        spreadsheet only if this percentile cutoff is met or
                        the IC50 score is below the specified --classI-IC50
                        maximum. (default: 2)
  --classII-IC50 CLASSII_IC50
                        Bold the Best Peptide from the classII_aggregated_tsv
                        file in the 'CANDIDATE NEOANTIGEN AMINO ACID SEQUENCE
                        WITH FLANKING RESIDUES' column of the ordering
                        spreadsheet only if the IC50 score is less than this
                        cutoff or the --classII-percent cutoff is met.
                        (default: 500)
  --classII-percent CLASSII_PERCENT
                        Bold the Best Peptide from the classII_aggregated_tsv
                        file in the 'CANDIDATE NEOANTIGEN AMINO ACID SEQUENCE
                        WITH FLANKING RESIDUES' column of the ordering
                        spreadsheet only if this percentile cutoff is met or
                        the IC50 score is below the specified --classII-IC50
                        maximum. (default: 2)
  --prob-pos PROB_POS   Comma-separated list of problematic positions to make
                        large in the ordering spreadsheet. (default: [])

This tool generates a comprehensive peptide ordering package from pVACseq results in a single step. It streamlines the preparation of long peptides for synthesis by combining protein sequence extraction, manufacturability assessment, peptide annotation, and visualization into one workflow. The output includes peptide FASTA files, manufacturability reports, and color-coded Excel summaries that highlight binding strength, sequence properties, and variant context.

This command replaces the need to run the generate_protein_fasta, generate_reviews_files, and color_peptides51mer scripts separately. The output includes the following files:

`<output_file>_<sample_name>.fa`	Contains the generated peptides in FASTA format for peptide synthesis.
`<output_file>_<sample_name>.manufacturability.tsv`	Manufacturability assessments for the peptides, including metrics such as cysteine content, hydrophobicity, and sequence complexity.
`<output_file>_<sample_name>.Colored_Peptides.xlsx`	A color-coded Excel file summarizing peptides, annotations, manufacturability metrics, and peptide positions, ready for ordering.
`<output_file>_<sample_name>.Annotated.Neoantigen_Candidates.xlsx`	A spreadsheet intended for downstream manual review of selected variants, including visualization in tools such as IGV.
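
To check that a run produced all four files, the naming pattern in the table can be expanded with a few lines of Python. This sketch assumes that `<output_file>` in the table corresponds to the output_file_prefix argument and that the files are written directly into the --output-path directory; `expected_outputs` is a hypothetical helper, and the prefix, sample name, and directory in the example are made up.

```python
from pathlib import Path

# Suffixes taken from the output file table.
OUTPUT_SUFFIXES = [
    ".fa",
    ".manufacturability.tsv",
    ".Colored_Peptides.xlsx",
    ".Annotated.Neoantigen_Candidates.xlsx",
]


def expected_outputs(output_path, output_file_prefix, sample_name):
    """Build the expected output paths for one run."""
    stem = f"{output_file_prefix}_{sample_name}"
    return [Path(output_path) / (stem + suffix) for suffix in OUTPUT_SUFFIXES]


paths = expected_outputs("ordering_form", "peptides", "TUMOR_01")
# Files that are not present on disk after the run.
missing = [path.name for path in paths if not path.exists()]
```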

Several options are available for tailoring the output. The flanking_sequence_length determines the number of flanking amino acids around the mutation of interest when creating the peptide sequence for the ordering spreadsheet. The --biotypes option can be used to pre-filter transcripts when generating peptide sequences from the input VCF, limiting the analysis to specific transcript biotypes (default: protein_coding). The --pass-only flag can be used to narrow down the input VCF to PASS variants only. These last two options should match the options selected for the original pVACseq run so that variants match correctly between the peptide sequences created by this tool and the variants in the classI_aggregated_tsv report.
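
The effect of flanking_sequence_length can be pictured with a simplified Python sketch. It covers only a single-residue substitution and assumes the window is truncated at the ends of the protein; frameshifts and indels are handled differently by the tool and are not modeled here. `peptide_window` is a hypothetical helper and the sequence is made up.

```python
def peptide_window(protein_sequence, mutation_position, flanking_sequence_length):
    """Return the mutated residue plus up to N flanking residues per side.

    `mutation_position` is one-based; the window is cut short at the
    protein ends when fewer than N residues are available.
    """
    index = mutation_position - 1
    start = max(0, index - flanking_sequence_length)
    end = min(len(protein_sequence), index + flanking_sequence_length + 1)
    return protein_sequence[start:end]


print(peptide_window("MKTAYIAKQR", 5, 2))  # TAYIA
print(peptide_window("MKTAYIAKQR", 1, 2))  # MKT
```

Under this simplification, a flanking length of 25 yields a 51-residue peptide whenever the mutation is at least 25 residues away from both ends of the protein.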

Additionally, the --aggregate-report-evaluation parameter can be used to restrict the output reports to candidates with specific evaluation states in the classI_aggregated_tsv (Accept, Reject, Pending, or Review; multiple values may be provided as a comma-separated list).
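
As a rough illustration of how a comma-separated evaluation list narrows the aggregated report, consider the Python sketch below. It is not the tool's code: the helper names are hypothetical, and apart from the Evaluation column the example columns and rows are invented.

```python
import csv
import io

VALID_EVALUATIONS = {"Accept", "Reject", "Pending", "Review"}


def parse_evaluations(value="Accept"):
    """Turn a comma-separated string into a validated set of states."""
    states = [state.strip() for state in value.split(",") if state.strip()]
    unknown = [state for state in states if state not in VALID_EVALUATIONS]
    if unknown:
        raise ValueError(f"invalid evaluation value(s): {unknown}")
    return set(states)


def select_rows(tsv_handle, value="Accept"):
    """Keep rows whose Evaluation column matches one of the requested states."""
    wanted = parse_evaluations(value)
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return [row for row in reader if row["Evaluation"] in wanted]


# Invented three-row stand-in for an aggregated class I report.
aggregated = (
    "ID\tGene\tEvaluation\n"
    "var1\tTP53\tAccept\n"
    "var2\tKRAS\tReview\n"
    "var3\tTTN\tReject\n"
)
selected = select_rows(io.StringIO(aggregated), "Accept,Review")
```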

For custom peptide prioritization thresholds, the IC50 and percentile cutoffs for class I and II can be adjusted using the appropriate flags.
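
The highlighting rule described for the --classI-IC50, --classI-percent, --classII-IC50, and --classII-percent options (either cutoff being met is sufficient) can be summarized in a small Python sketch. The function name and the dictionary layout are hypothetical, and treating "the percentile cutoff is met" as a strict less-than comparison is an assumption.

```python
# Defaults taken from the option descriptions above.
DEFAULT_CUTOFFS = {
    "classI": {"ic50": 1000.0, "percent": 2.0},
    "classII": {"ic50": 500.0, "percent": 2.0},
}


def highlight_best_peptide(ic50, percentile, mhc_class="classI",
                           cutoffs=DEFAULT_CUTOFFS):
    """Return True when either the IC50 or the percentile cutoff is met."""
    limits = cutoffs[mhc_class]
    # Assumption: both comparisons are strict.
    return ic50 < limits["ic50"] or percentile < limits["percent"]


# Tighter class I cutoffs, as might be passed via --classI-IC50 and --classI-percent.
custom = {"classI": {"ic50": 500.0, "percent": 1.0},
          "classII": DEFAULT_CUTOFFS["classII"]}
```

With the defaults, a best peptide with an IC50 of 750 and a percentile of 5 would be highlighted for class I but not for class II. With the tighter `custom` cutoffs it would not be highlighted for class I either.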
