Reconstructing germline sequences

Example data

A small example data set, produced by the UMI-barcoded MiSeq workflow described in the pRESTO documentation, is available for download here:

Change-O Example Files

The following examples use the HD13M_db-pass.tsv AIRR Rearrangement file provided in the example bundle, which has already undergone the IMGT/IgBLAST parsing and filtering operations.

Adding germline sequences to the database

The CreateGermlines.py tool reconstructs the germline V(D)J sequence, from which the Ig lineage and mutations can be inferred. In addition to the alignment information parsed by MakeDb.py to generate the initial database, CreateGermlines.py requires the set of germline sequences that were used for the alignment; these are passed to the -r argument. The V-segment reference sequences must be IMGT-gapped. Because the D-segment call for B cell receptor alignments is often low confidence, the default germline format (-g dmask) places Ns in the N/P and D-segments of the junction region rather than using the D-segment assigned during reference alignment. This can be changed to generate a complete germline (-g full) or a V-segment only germline (-g vonly). The command below adds the germline sequence to the germline_alignment_d_mask column of the output database:

CreateGermlines.py -d HD13M_db-pass.tsv -g dmask \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
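To make the difference between the three -g formats concrete, the following is a toy Python sketch, not the tool's implementation. It assumes the segments have already been cut to length and passed in as plain strings, and it assumes N/P regions are written as Ns in every format because they have no germline template. The function name build_germline and all of its parameters are hypothetical.

```python
# Simplified illustration only: the real tool works from alignment
# coordinates and IMGT-gapped references, not pre-cut segment strings.

def build_germline(v_seg, np1_len, d_seg, np2_len, j_seg, mode="dmask"):
    """Assemble a toy germline from already-trimmed segment strings.

    All names here are hypothetical. np1_len and np2_len are the lengths
    of the N/P regions on either side of the D segment.
    """
    if mode == "full":
        # Keep the assigned D segment. N/P regions are untemplated, so
        # this sketch assumes they are written as Ns.
        middle = "N" * np1_len + d_seg + "N" * np2_len
        return v_seg + middle + j_seg
    if mode == "dmask":
        # Mask the N/P regions and the D segment alike.
        middle = "N" * (np1_len + len(d_seg) + np2_len)
        return v_seg + middle + j_seg
    if mode == "vonly":
        # Only the V segment is retained.
        return v_seg
    raise ValueError(f"unknown germline mode: {mode}")


if __name__ == "__main__":
    for mode in ("full", "dmask", "vonly"):
        print(mode, build_germline("AAA", 2, "CC", 1, "GGG", mode))
```

In this sketch the dmask and full results have the same length and differ only in whether the D-segment positions carry the assigned D sequence or Ns.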

Alternatively, if you have run the clonal assignment task prior to invoking CreateGermlines.py, then adding the --cloned argument is recommended, as this will generate a single germline of consensus length for each clone:

CreateGermlines.py -d HD13M_db-pass_clone-pass.tsv -g dmask --cloned \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
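One way to sanity-check a --cloned run is to confirm that every clone ended up with exactly one germline. The sketch below is a minimal example under two assumptions: that the output table carries the AIRR clone_id column alongside germline_alignment_d_mask, and that the table has been read into a list of dictionaries. The helper names are hypothetical and are not part of Change-O.

```python
import csv
from collections import defaultdict


def read_tsv(path):
    """Read a tab-delimited table into a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def germlines_per_clone(rows, clone_field="clone_id",
                        germline_field="germline_alignment_d_mask"):
    """Map each clone identifier to the set of distinct germlines seen."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[clone_field]].add(row[germline_field])
    return dict(seen)


def clones_with_multiple_germlines(rows, **fields):
    """Return the sorted clone identifiers that have more than one germline."""
    per_clone = germlines_per_clone(rows, **fields)
    return sorted(clone for clone, germs in per_clone.items() if len(germs) > 1)
```

Calling clones_with_multiple_germlines(read_tsv(path)) on the table written by the command above, where path is whatever output file your run produced, should return an empty list if each clone shares a single germline.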

Important

The germline set passed to -r must contain the complete set of germlines used by the reference alignment software (IMGT/HighV-QUEST or IgBLAST). If an allele called by the aligner is missing from the reference set, records assigned that allele will not be successfully processed. Additionally, the V-segment reference set must contain IMGT-gapped sequences to properly reconstruct germlines, even if the reference alignment was performed on ungapped sequences.
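A quick pre-flight check for missing alleles can be sketched in Python as follows. This is an illustration under stated assumptions rather than part of Change-O: it assumes IMGT-style FASTA headers in which the allele name is the second "|"-delimited field (falling back to the first token of a plain header), and it assumes the v_call, d_call and j_call columns hold comma-separated allele names. All function names are hypothetical.

```python
def reference_alleles(lines):
    """Collect allele names from the header lines of a FASTA file.

    `lines` is any iterable of strings, such as an open file handle.
    Assumes headers like '>M99641|IGHV1-18*01|Homo sapiens|...'.
    """
    alleles = set()
    for line in lines:
        if not line.startswith(">"):
            continue
        header = line[1:].strip()
        if not header:
            continue
        fields = header.split("|")
        # IMGT-style header: allele is field 2. Plain header: first token.
        name = fields[1] if len(fields) > 1 else header.split()[0]
        alleles.add(name.strip())
    return alleles


def called_alleles(rows, call_fields=("v_call", "d_call", "j_call")):
    """Collect every allele named in the call columns of the table rows."""
    called = set()
    for row in rows:
        for field in call_fields:
            for allele in (row.get(field) or "").split(","):
                if allele.strip():
                    called.add(allele.strip())
    return called


def missing_alleles(rows, alleles):
    """Return the sorted called alleles that are absent from the reference."""
    return sorted(called_alleles(rows) - alleles)
```

To use it, build one set by passing each open reference FASTA file to reference_alleles and taking the union, read the database rows with csv.DictReader using a tab delimiter, and inspect the list returned by missing_alleles before running CreateGermlines.py.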

Note

While MakeDb.py provides the ihmm subcommand to parse alignment output generated by iHMMuneAlign, that output does not contain enough information for CreateGermlines.py to reconstruct germline sequences in every case.

Warning

CreateGermlines.py only supports reconstruction from nucleotide sequences. Amino acid sequences are not currently supported. Ensure that the sequence field (e.g., sequence_alignment or SEQUENCE_IMGT) contains nucleotide data, not amino acid data (e.g., sequence_alignment_aa or SEQUENCE_AA).
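If you are unsure which kind of data a column holds, a rough heuristic can flag likely amino acid content before running the tool. The Python sketch below is an assumption-laden illustration and not a Change-O feature: it treats IUPAC nucleotide codes plus the gap characters "." and "-" as valid, and requires that most residues be A, C, G, T or N. The function names, the min_fraction threshold, and the use of sequence_id as the identifier column are illustrative choices.

```python
# IUPAC nucleotide codes plus the gap characters seen in gapped alignments.
NUCLEOTIDE_CHARS = set("ACGTUNRYSWKMBDHV.-")


def looks_like_nucleotides(sequence, min_fraction=0.9):
    """Heuristic test for nucleotide data.

    Returns False for empty input, for any character outside the IUPAC
    nucleotide alphabet, or when fewer than `min_fraction` of the
    non-gap residues are A, C, G, T or N.
    """
    residues = [c for c in sequence.upper() if c not in ".-"]
    if not residues:
        return False
    if not set(residues) <= NUCLEOTIDE_CHARS:
        return False
    core = sum(c in "ACGTN" for c in residues)
    return core / len(residues) >= min_fraction


def non_nucleotide_rows(rows, field="sequence_alignment", id_field="sequence_id"):
    """Return identifiers of rows whose sequence field fails the heuristic."""
    return [row[id_field] for row in rows
            if not looks_like_nucleotides(row.get(field) or "")]
```

Most amino acid sequences contain letters such as E, F, I, L, P or Q that are not nucleotide codes, so they fail the alphabet test; very short or unusual sequences may still be misclassified.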