Clonal clustering methods¶
The DefineClones tool provides several approaches for assigning Ig sequences to clonal groups.
Clustering by V-gene, J-gene and junction length¶
All methods provided by the bygroup subcommand of DefineClones
first partition sequences based on common IGHV gene, IGHJ gene, and
junction region length. These groups are then further subdivided into
clonally related groups based on the following distance metrics on the
junction region. The specified distance metric (--model) is then used to
perform hierarchical clustering under the specified linkage (--link).
Clonal groups are defined by trimming the resulting dendrogram at the
specified threshold.
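The two-stage procedure above can be sketched in plain Python. This is an illustration only, not the DefineClones implementation: the record layout, the function names, and the choice to normalize the Hamming distance by junction length are all assumptions. It uses single linkage, for which cutting the dendrogram at a threshold is equivalent to taking the connected components of the graph whose edges have distance at or below that threshold.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_bygroup(records, threshold):
    """Partition by (V gene, J gene, junction length), then cluster each group.

    records: list of (seq_id, v_gene, j_gene, junction) tuples (hypothetical
    layout; junctions are assumed to be non-empty).
    threshold: maximum length-normalized Hamming distance for linking two
    sequences (normalizing by length is an assumption of this sketch).
    Returns a sorted list of clones, each a sorted list of sequence ids.
    """
    # Stage 1: partition on V gene, J gene, and junction length.
    groups = defaultdict(list)
    for seq_id, v_gene, j_gene, junction in records:
        groups[(v_gene, j_gene, len(junction))].append((seq_id, junction))

    clones = []
    for members in groups.values():
        # Stage 2: single-linkage clustering via union-find.
        parent = {seq_id: seq_id for seq_id, _ in members}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x

        for (id_a, junc_a), (id_b, junc_b) in combinations(members, 2):
            if hamming(junc_a, junc_b) / len(junc_a) <= threshold:
                parent[find(id_a)] = find(id_b)

        by_root = defaultdict(list)
        for seq_id, _ in members:
            by_root[find(seq_id)].append(seq_id)
        clones.extend(sorted(ids) for ids in by_root.values())
    return sorted(clones)


# Toy records, invented for illustration.
records = [
    ("s1", "IGHV1-2", "IGHJ4", "TGTGCAAGA"),
    ("s2", "IGHV1-2", "IGHJ4", "TGTGCAAGG"),   # 1 mismatch from s1
    ("s3", "IGHV1-2", "IGHJ4", "TGTCCTTGG"),   # same partition, but distant
    ("s4", "IGHV3-23", "IGHJ4", "TGTGCAAGA"),  # different V gene
]
print(cluster_bygroup(records, threshold=0.2))
# [['s1', 's2'], ['s3'], ['s4']]
```

Note that s4 has the same junction as s1 but is never compared with it, because the V-gene partition separates them before any distance is computed.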
Amino acid model¶
The aa distance model is the Hamming distance
between junction amino acid sequences.
Hamming distance model¶
The ham distance model is the Hamming
distance between junction nucleotide sequences.
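To make the difference between the two models concrete, the sketch below computes both distances for one pair of junctions. The helper names and the example junctions are invented for illustration; the codon table is the standard genetic code. A synonymous nucleotide change counts under the nucleotide-level distance but not under the amino acid distance.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(junction):
    """Translate an in-frame nucleotide junction; unknown codons become 'X'."""
    usable = len(junction) - len(junction) % 3  # ignore a trailing partial codon
    codons = (junction[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def hamming(seq_a, seq_b):
    """Count mismatched positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(seq_a, seq_b))


junction_a = "TGTGCAAGA"  # translates to CAR
junction_b = "TGTGCGAGA"  # also CAR: GCA and GCG both encode alanine

nt_distance = hamming(junction_a, junction_b)
aa_distance = hamming(translate(junction_a), translate(junction_b))
print(nt_distance, aa_distance)  # 1 0
```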
Human and mouse 1-mer models¶
mk_rs5nf distance models are single
nucleotide distance matrices derived from averaging and symmetrizing the human 5-mer
targeting model in [YVanderHeidenU+13] and the mouse 5-mer targeting model in
[CDiNiroVanderHeiden+16]. They are broadly similar to a transition/transversion model.
Human 1-mer substitution matrix:
Mouse 1-mer substitution matrix:
Human and mouse 5-mer models¶
mk_rs5nf distance models are based on
the human 5-mer targeting model in [YVanderHeidenU+13] and mouse 5-mer
targeting model in [CDiNiroVanderHeiden+16], respectively. The targeting
matrix $T$ has 5-mers across the columns and the nucleotide to
which the center base of the 5-mer mutates as the rows. The value for a
given nucleotide, 5-mer pair $T[i, j]$ is the product of the
likelihood of that 5-mer to be mutated, $mut(j)$, and the
likelihood of the center base mutating to the given nucleotide,
$sub(j \rightarrow i)$. This matrix of probabilities is converted
into a distance matrix $D$ via the following steps:
- $D = -\log_{10}(T)$
- $D$ is then divided by the mean of values in $D$
- All distances in $D$ that are infinite (probability of zero), distances on the diagonal (no change), and NA distances are set to 0.
Since the distance matrix $D$ is not symmetric, the symmetry option
can be specified to calculate either the average (avg) or minimum (min)
of $D(j \rightarrow i)$ and $D(i \rightarrow j)$.
The distances defined by $D$ for each nucleotide difference are
summed for all 5-mers in the junction to yield the distance between the
two junction sequences.
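The conversion from targeting probabilities to distances can be sketched as follows. This is a minimal illustration, not the DefineClones code. It assumes that the conversion starts with a negative log10 transform (which is what makes a zero probability an infinite distance), that the mean is taken over finite values only, and that the reverse direction for a mismatch is read from the 5-mer centered on the other sequence's base. The dictionary layout, the function names, and all probabilities are hypothetical toy values.

```python
import math


def to_distance(targeting):
    """Convert {(target_nucleotide, fivemer): probability} into distances."""
    # Negative log10 transform; a probability of zero becomes infinite.
    raw = {}
    for key, p in targeting.items():
        raw[key] = math.inf if p == 0 else -math.log10(p)

    # Divide by the mean (assumption: only finite values enter the mean).
    finite = [d for d in raw.values() if math.isfinite(d)]
    mean = sum(finite) / len(finite)

    # Infinite, NA (NaN), and diagonal (no change) entries are set to 0.
    dist = {}
    for (nuc, kmer), d in raw.items():
        on_diagonal = kmer[2] == nuc  # center base unchanged
        dist[(nuc, kmer)] = 0.0 if on_diagonal or not math.isfinite(d) else d / mean
    return dist


def position_distance(dist, kmer_a, kmer_b, sym="avg"):
    """Distance for one junction position, given the 5-mers centered on it."""
    base_a, base_b = kmer_a[2], kmer_b[2]
    if base_a == base_b:
        return 0.0
    d_ab = dist[(base_b, kmer_a)]  # center of kmer_a mutating to base_b
    d_ba = dist[(base_a, kmer_b)]  # center of kmer_b mutating to base_a
    return (d_ab + d_ba) / 2 if sym == "avg" else min(d_ab, d_ba)


# Toy targeting values for two 5-mers that differ at the center base.
toy_targeting = {
    ("A", "AACGT"): 0.001, ("C", "AACGT"): 0.0, ("G", "AACGT"): 0.0, ("T", "AACGT"): 0.01,
    ("A", "AATGT"): 0.001, ("C", "AATGT"): 0.1, ("G", "AATGT"): 0.001, ("T", "AATGT"): 0.0,
}
toy_dist = to_distance(toy_targeting)
avg_d = position_distance(toy_dist, "AACGT", "AATGT", sym="avg")
min_d = position_distance(toy_dist, "AACGT", "AATGT", sym="min")
print(round(avg_d, 3), round(min_d, 3))
```

With these toy values the finite distances before scaling are 2, 3, 1, 3 and 3 (mean 2.4), so the C to T direction scales to about 0.833 and the T to C direction to about 0.417. The distance between two junctions would be the sum of `position_distance` over every position at which they differ.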
Clustering by the full sequence¶
The chen2010 and ademokun2011 methods provided by DefineClones cluster sequences based on the full-length sequence, with penalties imposed for V-gene and/or J-gene mismatches.
Chen et al, 2010 method¶
The chen2010 method of DefineClones is taken directly from [CCWGaeta10], with additional flexibility in selecting the threshold for determining clonally related groups. The distance metric is a normalized edit distance ($NED_{VJ}$) calculated as:
$$NED_{VJ} = \frac{LD + S_V + S_J}{L}$$
where $LD$ is the un-normalized Levenshtein distance, $S_V$ is the mismatch penalty for the V-gene (0 if same gene, 1 if allele differs, 3 if gene differs, and 5 if family differs), $S_J$ is the mismatch penalty for the J-gene (0 if same gene, 1 if allele differs, 3 if gene differs), and $L$ is the CDR3 alignment length. Given this distance metric, sequences are clustered using hierarchical clustering with average linkage. The resulting dendrogram is trimmed at the specified threshold.
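As a worked illustration of this metric, the sketch below assumes the normalized distance is the Levenshtein distance plus the V-gene and J-gene penalties, divided by the CDR3 alignment length. The function names are hypothetical, gene calls are assumed to follow IMGT-style naming (for example `IGHV1-2*01`), and using the longer CDR3 length as a stand-in for the alignment length is a simplification of this sketch, not something stated in the source.

```python
def levenshtein(a, b):
    """Un-normalized Levenshtein (edit) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def v_penalty(call_a, call_b):
    """0 if identical, 1 if allele differs, 3 if gene differs, 5 if family differs."""
    gene_a, gene_b = call_a.split("*")[0], call_b.split("*")[0]
    if gene_a.split("-")[0] != gene_b.split("-")[0]:
        return 5
    if gene_a != gene_b:
        return 3
    return 1 if call_a != call_b else 0


def j_penalty(call_a, call_b):
    """0 if identical, 1 if allele differs, 3 if gene differs."""
    if call_a.split("*")[0] != call_b.split("*")[0]:
        return 3
    return 1 if call_a != call_b else 0


def ned_vj(cdr3_a, cdr3_b, v_a, v_b, j_a, j_b, align_len=None):
    """Normalized edit distance with V and J mismatch penalties."""
    if align_len is None:
        # Assumption: approximate the alignment length by the longer CDR3.
        align_len = max(len(cdr3_a), len(cdr3_b))
    total = levenshtein(cdr3_a, cdr3_b) + v_penalty(v_a, v_b) + j_penalty(j_a, j_b)
    return total / align_len


# One CDR3 mismatch plus a V allele difference over an alignment of length 9.
d = ned_vj("TGTGCAAGA", "TGTGCAAGG",
           "IGHV1-2*01", "IGHV1-2*02",
           "IGHJ4*02", "IGHJ4*02")
print(round(d, 3))  # 0.222
```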
Ademokun et al, 2011 method¶
The ademokun2011 method of DefineClones is taken directly from [AWM+11], with additional flexibility in selecting the threshold for determining clonally related groups. The distance metric is a minimum edit distance normalized to the length of the shorter sequence, allowing up to a maximum of 1 in 5 (or a total of 10) mismatches or indels. Distance is set to 1 for sequences with more than the maximum number of mismatches or sequences with different V-gene families. This metric is then used to perform complete-linkage hierarchical clustering. The resulting dendrogram is trimmed at the specified threshold.
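One possible reading of this metric is sketched below. The interpretation of the cap is an assumption of the sketch: the number of allowed edits is taken as one fifth of the shorter sequence length, capped at 10. The function names and example sequences are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ademokun_distance(seq_a, seq_b, v_family_a, v_family_b):
    """Edit distance normalized to the shorter sequence, saturating at 1."""
    if v_family_a != v_family_b:
        return 1.0  # different V-gene families are maximally distant
    short_len = min(len(seq_a), len(seq_b))
    # Assumption: "1 in 5 (or a total of 10)" read as min(length // 5, 10).
    max_edits = min(short_len // 5, 10)
    edits = levenshtein(seq_a, seq_b)
    if edits > max_edits:
        return 1.0
    return edits / short_len


base = "ACGT" * 5                       # length 20, so up to 4 edits allowed
near = "ACGA" + "ACGT" * 3 + "ACGA"     # 2 substitutions relative to base
far = "T" * 20                          # far more than 4 edits from base

print(ademokun_distance(base, near, "IGHV1", "IGHV1"))  # 0.1
print(ademokun_distance(base, far, "IGHV1", "IGHV1"))   # 1.0
print(ademokun_distance(base, near, "IGHV1", "IGHV3"))  # 1.0
```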
|[AWM+11]||Alexander Ademokun, Yu-Chang Wu, Victoria Martin, Rajive Mitra, Ulrich Sack, Helen Baxendale, David Kipling, and Deborah K Dunn-Walters. Vaccination-induced changes in human B-cell repertoire and pneumococcal IgM and IgA antibody at different ages. Aging cell, 10(6):922–30, December 2011. URL: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3264704&tool=pmcentrez&rendertype=abstract, doi:10.1111/j.1474-9726.2011.00732.x.|
|[CCWGaeta10]||Zhiliang Chen, Andrew M Collins, Yan Wang, and Bruno A Gaëta. Clustering-based identification of clonally-related immunoglobulin gene sequence sets. Immunome research, 6 Suppl 1(Suppl 1):S4, January 2010. URL: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2946782&tool=pmcentrez&rendertype=abstract, doi:10.1186/1745-7580-6-S1-S4.|
|[CDiNiroVanderHeiden+16]||Ang Cui, Roberto Di Niro, Jason A. Vander Heiden, Adrian W. Briggs, Kris Adams, Tamara Gilbert, Kevin C. O’Connor, Francois Vigneault, Mark J. Shlomchik, and Steven H. Kleinstein. A Model of Somatic Hypermutation Targeting in Mice Based on High-Throughput Ig Sequencing Data. The Journal of Immunology, 197(9):3566–3574, November 2016. URL: http://www.jimmunol.org/content/197/9/3566.abstract http://www.jimmunol.org/lookup/doi/10.4049/jimmunol.1502263, doi:10.4049/jimmunol.1502263.|
|[YVanderHeidenU+13]||Gur Yaari, Jason A Vander Heiden, Mohamed Uduman, Daniel Gadala-Maria, Namita Gupta, Joel N H Stern, Kevin C O’Connor, David A Hafler, Uri Laserson, Francois Vigneault, and Steven H Kleinstein. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Frontiers in immunology, 4:358, January 2013. URL: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3828525&tool=pmcentrez&rendertype=abstract, doi:10.3389/fimmu.2013.00358.|