Reconstructing germline sequences
Example data
We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. The files can be downloaded from here:
The following examples use the HD13M_db-pass.tsv AIRR Rearrangement file provided in
the example bundle, which has already undergone the IMGT/IgBLAST parsing and filtering
operations.
Adding germline sequences to the database
The CreateGermlines.py tool is used to reconstruct the germline V(D)J sequence,
from which the Ig lineage and mutations can be inferred. In addition to the alignment
information parsed by MakeDb.py to generate the initial database, CreateGermlines.py
also requires the set of germline sequences that were used for the alignment, which is
passed to the -r argument. In the case of V-segment germlines, the reference sequences
must be IMGT-gapped. Because the D-segment call for B cell receptor alignments is often
low confidence, the default germline format (-g dmask) places Ns in the N/P regions and
D-segment of the junction region rather than using the D-segment assigned during
reference alignment. This can be changed to generate a complete germline (-g full) or a
V-segment only germline (-g vonly).
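The difference between a complete germline and a D-masked germline can be illustrated with a toy Python sketch. This is not how CreateGermlines.py is implemented: the function name `mask_between` and the coordinates `v_end` and `j_start` are hypothetical, whereas the real tool derives the segment boundaries from the alignment columns in the database.

```python
def mask_between(germline: str, v_end: int, j_start: int) -> str:
    """Replace germline[v_end:j_start] with Ns, keeping the length unchanged.

    v_end and j_start are hypothetical 0-based coordinates marking where the
    V-segment ends and the J-segment begins within the germline string.
    """
    if not 0 <= v_end <= j_start <= len(germline):
        raise ValueError("expected 0 <= v_end <= j_start <= len(germline)")
    return germline[:v_end] + "N" * (j_start - v_end) + germline[j_start:]


# Toy "complete" germline: 6 nt of V, 4 nt of N/P + D, 6 nt of J.
full_germline = "CAGGTG" + "ACGT" + "TTTGAC"
masked_germline = mask_between(full_germline, v_end=6, j_start=10)
print(full_germline)    # CAGGTGACGTTTTGAC
print(masked_germline)  # CAGGTGNNNNTTTGAC
```

Masking rather than deleting the region keeps the germline the same length as the complete one, so positions in the V and J segments line up in both forms.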
The command below adds the germline sequence to the germline_alignment_d_mask column of
the output database:
CreateGermlines.py -d HD13M_db-pass.tsv -g dmask \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
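After the command finishes, a quick sanity check is to count how many records received a germline. The sketch below uses only the Python standard library. The helper name `germline_summary` is hypothetical, and the output file name in the usage note is an assumption about the default naming, so adjust it to whatever file your run produced.

```python
import csv
import io


def germline_summary(handle, column="germline_alignment_d_mask"):
    """Return (total_rows, rows_with_germline) for a tab-delimited database."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in the database header")
    total = filled = 0
    for row in reader:
        total += 1
        if row[column]:  # empty string or missing value means no germline
            filled += 1
    return total, filled


# In-memory demonstration with two made-up records.
demo = io.StringIO(
    "sequence_id\tgermline_alignment_d_mask\n"
    "seq1\tCAGNNNGAC\n"
    "seq2\t\n"
)
print(germline_summary(demo))
```

To run it on real output, open the file and pass the handle, for example `with open("HD13M_db-pass_germ-pass.tsv", newline="") as handle: print(germline_summary(handle))`, where the file name is assumed rather than taken from this page.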
Alternatively, if you have run the clonal assignment task prior to invoking
CreateGermlines.py, then adding the --cloned argument is recommended, as this will
generate a single germline of consensus length for each clone:
CreateGermlines.py -d HD13M_db-pass_clone-pass.tsv -g dmask --cloned \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
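One way to confirm that a --cloned run produced a single germline per clone is to group the output by clone and count distinct germline strings. The following is a minimal sketch, assuming the output contains the clone_id and germline_alignment_d_mask columns. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def clones_with_multiple_germlines(
    handle, clone_col="clone_id", germ_col="germline_alignment_d_mask"
):
    """Return {clone_id: n_distinct_germlines} for clones with more than one."""
    germlines = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        germlines[row[clone_col]].add(row[germ_col])
    return {clone: len(seqs) for clone, seqs in germlines.items() if len(seqs) > 1}


# In-memory demonstration: clone 1 is consistent, clone 2 is not.
demo = io.StringIO(
    "sequence_id\tclone_id\tgermline_alignment_d_mask\n"
    "seq1\t1\tCAGNNNGAC\n"
    "seq2\t1\tCAGNNNGAC\n"
    "seq3\t2\tCAGNNNGAC\n"
    "seq4\t2\tCAGNNNNNNGAC\n"
)
print(clones_with_multiple_germlines(demo))
```

An empty dictionary means every clone shares one germline. Any entries point to clones worth inspecting, for example output produced without --cloned.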
Important
The germline set passed to -r must contain the complete set of germlines used by the
reference alignment software (IMGT/HighV-QUEST or IgBLAST). If alleles called by the
aligner are missing from the reference set, the sequences assigned to those alleles will
not be processed successfully. Additionally, the V-segment reference set must contain
IMGT-gapped sequences to properly reconstruct germlines, even if the reference alignment
was performed on ungapped sequences.
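Missing alleles can be detected before running CreateGermlines.py by comparing the allele calls in the database against the FASTA headers of the reference set. The sketch below rests on two assumptions: that IMGT-style headers carry the allele name as the second `|`-delimited field (with a fallback to the first whitespace-delimited token for plain headers), and that multiple calls in a v_call, d_call or j_call field are comma-separated. Both function names are hypothetical.

```python
import csv
import io


def fasta_alleles(lines):
    """Collect allele names from FASTA header lines."""
    alleles = set()
    for line in lines:
        if not line.startswith(">"):
            continue
        header = line[1:].strip()
        if not header:
            continue
        fields = header.split("|")
        # Assumed IMGT layout: accession|allele|species|...
        alleles.add(fields[1] if len(fields) > 1 else header.split()[0])
    return alleles


def missing_alleles(db_handle, reference, columns=("v_call", "d_call", "j_call")):
    """Return allele calls in the database that are absent from the reference."""
    missing = set()
    for row in csv.DictReader(db_handle, delimiter="\t"):
        for column in columns:
            for call in (row.get(column) or "").split(","):
                call = call.strip()
                if call and call not in reference:
                    missing.add(call)
    return missing


# In-memory demonstration with made-up reference and database content.
reference = fasta_alleles([
    ">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|",
    "caggttcagctg",
    ">IGHJ4*02",
    "actactttgac",
])
demo_db = io.StringIO(
    "sequence_id\tv_call\td_call\tj_call\n"
    "seq1\tIGHV1-18*01,IGHV1-69*01\t\tIGHJ4*02\n"
)
print(missing_alleles(demo_db, reference))
```

For real data, build `reference` by reading each FASTA file passed to -r and adding its alleles to one set, then pass an open handle on the database file as `db_handle`. When checking a genotyped database, supply the genotyped column name in `columns`.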
Note
MakeDb.py provides the ihmm subcommand to parse alignment output generated by iHMMuneAlign, but that output does not contain enough information for CreateGermlines.py to reconstruct germline sequences in all cases.
See also
The TIgGER R package provides tools for identifying novel polymorphisms and building a
personalized germline database. To use the germline corrections provided by TIgGER,
replace the V-segment germline file with the one generated by genotypeFasta
(-r IGHV_genotype.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta) and specify the
genotyped V-segment column (--vf v_call_genotyped):
CreateGermlines.py -d genotyped.tsv -g dmask --vf v_call_genotyped \
-r IGHV_genotype.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta