Optional Downstream Analysis Tools

Generate Protein Fasta

usage: pvacfuse generate_protein_fasta [-h] [--input-tsv INPUT_TSV]
                                       [-d DOWNSTREAM_SEQUENCE_LENGTH]
                                       input flanking_sequence_length
                                       output_file

Generate an annotated fasta file from AGFusion or Arriba output.

positional arguments:
  input                 An AGFusion output directory or Arriba fusion.tsv
                        output file.
  flanking_sequence_length
                        Number of amino acids to add on each side of the
                        mutation when creating the FASTA.
  output_file           The output fasta file.

optional arguments:
  -h, --help            show this help message and exit
  --input-tsv INPUT_TSV
                        A pVACfuse all_epitopes or filtered TSV file with
                        epitopes to use for subsetting the input file to
                        peptides of interest. Only the peptide sequences for
                        the epitopes in the TSV will be used when creating the
                        FASTA. (default: None)
  -d DOWNSTREAM_SEQUENCE_LENGTH, --downstream-sequence-length DOWNSTREAM_SEQUENCE_LENGTH
                        Cap to limit the downstream sequence length for
                        frameshift fusion when creating the fasta file. Use
                        'full' to include the full downstream sequence.
                        (default: 1000)

This tool extracts the protein sequence surrounding each fusion variant by parsing Arriba or AGFusion output. One use case is selecting long peptides that contain short neoepitope candidates: for example, if pVACfuse was run to predict nonamers (9-mers) that are good binders, the user may wish to select long peptide (e.g. 24-mer) sequences that contain the nonamer for synthesis or encoding in a DNA vector. The fusion position will be centered in the protein sequence returned (if possible). If the fusion causes a frameshift, the downstream protein sequence is returned up to the limit set by --downstream-sequence-length (default: 1000; use 'full' to include the full downstream sequence). The flanking_sequence_length positional parameter controls how many amino acids are included on either side of the fusion position.

The output can be limited to certain variants by providing a pVACfuse all_epitopes.tsv or filtered.tsv report file to the --input-tsv argument. Only the peptide sequences for the epitopes in the TSV will be used when creating the FASTA.

Generate Aggregated Report

usage: pvacfuse generate_aggregated_report [-h] [-b BINDING_THRESHOLD]
                                           [--allele-specific-binding-thresholds]
                                           [--percentile-threshold PERCENTILE_THRESHOLD]
                                           [--aggregate-inclusion-binding-threshold AGGREGATE_INCLUSION_BINDING_THRESHOLD]
                                           [-m {lowest,median}]
                                           [--read-support READ_SUPPORT]
                                           [--expn-val EXPN_VAL]
                                           input_file output_file

Generate an aggregated report from a pVACfuse .all_epitopes.tsv report file.

positional arguments:
  input_file            A pVACfuse .all_epitopes.tsv report file
  output_file           The file path to write the aggregated report tsv to

optional arguments:
  -h, --help            show this help message and exit
  -b BINDING_THRESHOLD, --binding-threshold BINDING_THRESHOLD
                        Tier epitopes in the "Pass" tier when the mutant
                        allele has ic50 binding scores below this value and in
                        the "Relaxed" tier when the mutant allele has ic50
                        binding scores below double this value. (default: 500)
  --allele-specific-binding-thresholds
                        Use allele-specific binding thresholds. To print the
                        allele-specific binding thresholds run `pvacfuse
                        allele_specific_cutoffs`. If an allele does not have a
                        special threshold value, the `--binding-threshold`
                        value will be used. (default: False)
  --percentile-threshold PERCENTILE_THRESHOLD
                        When set, tier epitopes in the "Pass" tier when the
                        mutant allele has percentile scores below this value
                        and in the "Relaxed" tier when the mutant allele has
                        percentile scores below double this value. (default:
                        None)
  --aggregate-inclusion-binding-threshold AGGREGATE_INCLUSION_BINDING_THRESHOLD
                        Threshold for including epitopes when creating the
                        aggregate report (default: 5000)
  -m {lowest,median}, --top-score-metric {lowest,median}
                        The ic50 scoring metric to use when filtering epitopes
                        by binding-threshold or minimum fold change. lowest:
                        Use the best MT Score and Corresponding Fold Change
                        (i.e. the lowest MT ic50 binding score and
                        corresponding fold change of all chosen prediction
                        methods). median: Use the median MT Score and Median
                        Fold Change (i.e. the median MT ic50 binding score and
                        fold change of all chosen prediction methods).
                        (default: median)
  --read-support READ_SUPPORT
                        Read Support Cutoff. When failing this cutoff, sites
                        will be binned in a "LowReadSupport" tier. (default:
                        5)
  --expn-val EXPN_VAL   Expression Cutoff. Expression is measured as FFPM
                        (fusion fragments per million total reads). When
                        failing this cutoff sites will be binned in the
                        "LowExpr" tier. (default: 0.1)

This tool produces an aggregated version of the all_epitopes TSV. It finds the best-scoring epitope (the one with the lowest ic50 binding score) for each variant and outputs additional information for that epitope. It also reports the total number of well-scoring epitopes for each variant, as well as the HLA alleles those epitopes bind well to. For a full overview of the output, see the pVACfuse output file documentation.
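As a rough illustration of the two ideas above (picking the lowest-ic50 epitope per variant and tiering it against the binding threshold), consider the following sketch. It is not the pVACfuse implementation: the row keys (`variant`, `epitope`, `ic50`), the function names, and the example values are made up for illustration. The handling of values exactly at a threshold is also an assumption, and the real report takes further criteria into account (read support, expression, percentile).

```python
def binding_tier(ic50, binding_threshold=500.0):
    """Tier by ic50 alone: "Pass" below the threshold, "Relaxed" below double it."""
    if ic50 < binding_threshold:
        return "Pass"
    if ic50 < 2 * binding_threshold:
        return "Relaxed"
    return None  # other tiers are outside the scope of this sketch


def best_epitopes(rows, inclusion_threshold=5000.0):
    """Keep, for each variant, the row with the lowest ic50 under the inclusion threshold."""
    best = {}
    for row in rows:
        if row["ic50"] >= inclusion_threshold:
            continue  # assumed: too weak to be included in the aggregate at all
        current = best.get(row["variant"])
        if current is None or row["ic50"] < current["ic50"]:
            best[row["variant"]] = row
    return best


# Made-up example rows; real all_epitopes.tsv files have many more columns.
rows = [
    {"variant": "fusion_1", "epitope": "KLDETGNSL", "ic50": 120.0},
    {"variant": "fusion_1", "epitope": "LDETGNSLK", "ic50": 750.0},
    {"variant": "fusion_2", "epitope": "AVGSYIQEL", "ic50": 6200.0},
]
summary = {
    variant: (row["epitope"], binding_tier(row["ic50"]))
    for variant, row in best_epitopes(rows).items()
}
```

In this toy input, `fusion_2` drops out entirely because its only epitope exceeds the inclusion threshold, while `fusion_1` is represented by its strongest binder.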

Calculate Reference Proteome Similarity

usage: pvacfuse calculate_reference_proteome_similarity [-h]
                                                        [--match-length MATCH_LENGTH]
                                                        [--species SPECIES]
                                                        [--blastp-path BLASTP_PATH]
                                                        [--blastp-db {refseq_select_prot,refseq_protein}]
                                                        [--peptide-fasta PEPTIDE_FASTA]
                                                        [-t N_THREADS]
                                                        input_file input_fasta
                                                        output_file

Identify which epitopes in a pVACseq|pVACfuse|pVACbind report file have
matches in the reference proteome using either BLASTp or by checking directly
against a reference proteome FASTA.

positional arguments:
  input_file            Input filtered, all_epitopes, or aggregated report
                        file with predicted epitopes.
  input_fasta           For pVACbind, the original input FASTA file. For
                        pVACseq and pVACfuse a FASTA file with mutant peptide
                        sequences for each variant isoform. This file can be
                        found in the same directory as the input
                        filtered.tsv/all_epitopes.tsv file. Can also be
                        generated by running `pvacseq|pvacfuse
                        generate_protein_fasta`.
  output_file           Output TSV filename of report file with epitopes with
                        reference matches marked.

optional arguments:
  -h, --help            show this help message and exit
  --match-length MATCH_LENGTH
                        The minimum number of consecutive amino acids that
                        need to match. (default: 8)
  --species SPECIES     The species of the data in the input file. (default:
                        human)
  --blastp-path BLASTP_PATH
                        Blastp installation path. (default: None)
  --blastp-db {refseq_select_prot,refseq_protein}
                        The blastp database to use. (default:
                        refseq_select_prot)
  --peptide-fasta PEPTIDE_FASTA
                        A reference peptide FASTA file to use for finding
                        reference matches instead of blastp. (default: None)
  -t N_THREADS, --n-threads N_THREADS
                        Number of threads to use for parallelizing BLAST
                        calls. (default: 1)

This tool finds matches of the epitope candidates in the reference proteome and returns the results as a pair of files: an output TSV and a reference_match file. It requires the fasta file from a pVACfuse run in order to look up the larger peptide sequence each epitope was derived from. Any substring of that peptide sequence that matches the reference proteome and is at least as long as the specified match length is considered a hit. The tool also requires a filtered.tsv, all_epitopes.tsv, or aggregated.tsv pVACfuse report file as input; every candidate in this file is searched for.

This tool may be run with BLASTp using either the refseq_select_prot or the refseq_protein database. By default, BLASTp is run through the BLAST API, but users may instead install BLASTp independently and point to it with --blastp-path. Alternatively, users may provide a reference proteome fasta file with --peptide-fasta, and the tool will string match against the entries of this fasta file directly. This approach is recommended because it is significantly faster than BLASTp. Reference proteome fasta files, including the latest one for human, may be downloaded from Ensembl.
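The direct string-matching idea can be sketched as follows. This is a simplified illustration rather than the tool's actual code: the function name `reference_matches` and the in-memory dictionary of reference proteins are hypothetical, and the sequences are made up. Because any matching substring longer than the match length contains a matching substring of exactly that length, it is enough to test every window of exactly `match_length` residues.

```python
def reference_matches(peptide, reference_proteins, match_length=8):
    """Return (substring, protein name) pairs for every window of `match_length`
    residues in `peptide` that occurs verbatim in a reference protein."""
    hits = []
    # Slide a window of exactly match_length residues along the peptide.
    for start in range(len(peptide) - match_length + 1):
        window = peptide[start:start + match_length]
        for name, sequence in reference_proteins.items():
            if window in sequence:
                hits.append((window, name))
    return hits


# Made-up reference entry and query peptide, for illustration only.
reference = {"protA": "MKTAYIAKQRQISFVKSHFSRQ"}
hits = reference_matches("GGGAYIAKQRQGGG", reference, match_length=8)
```

A peptide shorter than the match length yields no windows and therefore no hits.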

For more details on the generated reference_match file, see the pVACfuse output file documentation.

NetChop Predict Cleavage Sites

usage: pvacfuse net_chop [-h] [--method {cterm,20s}] [--threshold THRESHOLD]
                         input_file input_fasta output_file

Predict cleavage sites for neoepitopes.

positional arguments:
  input_file            Input filtered file with predicted epitopes.
  input_fasta           The required fasta file.
  output_file           Output tsv filename for putative neoepitopes.

optional arguments:
  -h, --help            show this help message and exit
  --method {cterm,20s}  NetChop prediction method to use ("cterm" for C term
                        3.0, "20s" for 20S 3.0). (default: cterm)
  --threshold THRESHOLD
                        NetChop prediction threshold. (default: 0.5)

This tool uses NetChop to predict cleavage sites for neoepitopes from a pVACfuse run’s filtered/all_epitopes TSV. It adds three columns to the output TSV: Best Cleavage Position, Best Cleavage Score, and a Cleavage Sites list. Typically, this step is run as part of the pVACfuse pipeline on the filtered output TSV when requested. This tool provides a way to run it manually on a filtered/all_epitopes TSV generated by pVACfuse, so that the information can be added when it is not already present. These columns are described further in the pVACfuse output file documentation.

NetMHCStab Predict Stability

usage: pvacfuse netmhc_stab [-h] [-m {lowest,median}] input_file output_file

Add stability predictions to predicted neoepitopes.

positional arguments:
  input_file            Input filtered file with predicted epitopes.
  output_file           Output TSV filename for putative neoepitopes.

optional arguments:
  -h, --help            show this help message and exit
  -m {lowest,median}, --top-score-metric {lowest,median}
                        The ic50 scoring metric to use when sorting epitopes.
                        lowest: Use the best MT Score and Corresponding Fold
                        Change (i.e. the lowest MT ic50 binding score and
                        corresponding fold change of all chosen prediction
                        methods). median: Use the median MT Score and Median
                        Fold Change (i.e. the median MT ic50 binding score and
                        fold change of all chosen prediction methods).
                        (default: median)

This tool uses NetMHCstabpan to add stability predictions for neoepitopes from a pVACfuse run’s filtered/all_epitopes TSV. It adds four columns to the output TSV: Predicted Stability, Half Life, Stability Rank, and NetMHCStab Allele. Typically, this step is run as part of the pVACfuse pipeline on the filtered output TSV when requested. This tool provides a way to run it manually on a filtered/all_epitopes TSV generated by pVACfuse, so that the information can be added when it is not already present. These columns are described further in the pVACfuse output file documentation.

Identify Problematic Amino Acids

usage: pvacfuse identify_problematic_amino_acids [-h]
                                                 [--filter-type {soft,hard}]
                                                 input_file output_file
                                                 problematic_amino_acids

Mark problematic amino acid positions in each epitope or filter entries that have problematic amino acids.

positional arguments:
  input_file            Input filtered or all_epitopes file with predicted epitopes.
  output_file           Output .tsv file with identification of problematic amino acids or hard-filtered to remove epitopes with problematic amino acids.
  problematic_amino_acids
                        A list of amino acids to consider as problematic. Each entry can be specified in the following format:
                        `amino_acid(s)`: One or more one-letter amino acid codes. Any occurrence of this amino acid string,
                                         regardless of the position in the epitope, is problematic. When specifying more than
                                         one amino acid, they will need to occur together in the specified order.
                        `amino_acid:position`: A one letter amino acid code, followed by a colon separator, followed by a positive
                                               integer position (one-based). The occurrence of this amino acid at the position
                                               specified is problematic. E.g., G:2 would check for a Glycine at the second position
                                               of the epitope. The N-terminus is defined as position 1.
                        `amino_acid:-position`: A one letter amino acid code, followed by a colon separator, followed by a negative
                                                integer position. The occurrence of this amino acid at the specified position from
                                                the end of the epitope is problematic. E.g., G:-3 would check for a Glycine at the
                                                third position from the end of the epitope. The C-terminus is defined as position -1.

optional arguments:
  -h, --help            show this help message and exit
  --filter-type {soft,hard}, -f {soft,hard}
                        Set the type of filtering done. Choosing `soft` will add a new column "Problematic Positions" that lists positions in the epitope with problematic amino acids. Choosing `hard` will remove epitope entries with problematic amino acids.

This tool is used to identify positions in an epitope with an amino acid that is problematic for downstream processing, e.g. vaccine manufacturing. Since this can differ from case to case, this tool requires the user to specify which amino acid(s) to consider problematic. This can be specified in one of three formats:

amino_acid(s)

One or more one-letter amino acid codes. Any occurrence of this amino acid string, regardless of the position in the epitope, is problematic. When specifying more than one amino acid, they will need to occur together in the specified order.

amino_acid:position

A one-letter amino acid code, followed by a colon separator, followed by a positive integer position (one-based). The occurrence of this amino acid at the position specified is problematic. E.g., G:2 would check for a Glycine at the second position of the epitope. The N-terminus is defined as position 1.

amino_acid:-position

A one-letter amino acid code, followed by a colon separator, followed by a negative integer position. The occurrence of this amino acid at the specified position from the end of the epitope is problematic. E.g., G:-3 would check for a Glycine at the third position from the end of the epitope. The C-terminus is defined as position -1.

You may specify any number of these problematic amino acid(s), in any combination, by providing them as a comma-separated list.

This tool may be used with any filtered.tsv or all_epitopes.tsv pVACfuse report file.
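To make the three specification formats concrete, here is a minimal Python sketch of how such a comma-separated list could be parsed and applied to an epitope. It is an illustration under stated assumptions, not the pVACfuse implementation: the function names are hypothetical, and reporting every residue of a multi-letter match as a problematic position is an assumption about the output, not documented behavior.

```python
def parse_problematic(spec):
    """Parse e.g. "C,G:2,P:-1,DP" into (amino_acids, position) rules.

    A position of None means the amino acid string is problematic anywhere.
    """
    rules = []
    for entry in spec.split(","):
        if ":" in entry:
            amino_acid, position = entry.split(":")
            rules.append((amino_acid, int(position)))
        else:
            rules.append((entry, None))
    return rules


def problematic_positions(epitope, rules):
    """Return the sorted one-based positions in `epitope` that match any rule."""
    positions = set()
    for amino_acids, position in rules:
        if position is None:
            # Position-independent rule: find every occurrence of the string.
            start = epitope.find(amino_acids)
            while start != -1:
                positions.update(range(start + 1, start + len(amino_acids) + 1))
                start = epitope.find(amino_acids, start + 1)
        else:
            # Positive positions count from the N-terminus (1 = first residue),
            # negative positions from the C-terminus (-1 = last residue).
            index = position - 1 if position > 0 else len(epitope) + position
            if 0 <= index < len(epitope) and epitope[index] == amino_acids:
                positions.add(index + 1)
    return sorted(positions)


rules = parse_problematic("C,G:2,P:-1,DP")
flagged = problematic_positions("AGCDPKLMP", rules)
clean = problematic_positions("AAAAAAAAA", rules)
```

Under this sketch, a soft filter would report the flagged positions in a new column, while a hard filter would drop any epitope for which the list is non-empty.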