Clonal clustering methods

The DefineClones tool provides multiple approaches for assigning Ig sequences to clonal groups.

Clustering by V-gene, J-gene and junction length

All methods provided by the bygroup subcommand of DefineClones first partition sequences based on common IGHV gene, IGHJ gene, and junction region length. These groups are then further subdivided into clonally related groups using one of the following distance metrics on the junction region. The specified distance metric (--model) is used to perform hierarchical clustering with the specified linkage (--link). Clonal groups are defined by trimming the resulting dendrogram at the specified threshold (--dist).
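The two-stage procedure described above can be sketched in a few lines. This is a minimal illustration, not the DefineClones implementation: the record fields (`id`, `v_gene`, `j_gene`, `junction`) and function names are hypothetical, the distance is assumed to be a Hamming distance normalized by junction length, and single linkage is used only because cutting a single-linkage dendrogram at a threshold is equivalent to taking connected components of the graph that links pairs at or below that threshold.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def partition(records):
    """Group records that share V gene, J gene, and junction length."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["v_gene"], rec["j_gene"], len(rec["junction"]))
        groups[key].append(rec)
    return groups


def single_linkage(records, threshold):
    """Cut a single-linkage dendrogram at `threshold` (union-find version)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        a, b = records[i]["junction"], records[j]["junction"]
        # Assumption: distance is normalized by junction length.
        if hamming(a, b) / max(len(a), 1) <= threshold:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i, rec in enumerate(records):
        clusters[find(i)].append(rec["id"])
    return sorted(clusters.values())


def define_clones(records, threshold):
    """Partition first, then cluster within each partition."""
    clones = []
    for group in partition(records).values():
        clones.extend(single_linkage(group, threshold))
    return sorted(clones)


records = [
    {"id": "s1", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTGCAAGATGG"},
    {"id": "s2", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTGCAAGTTGG"},
    {"id": "s3", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTACCCCCTGG"},
    {"id": "s4", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "junction": "TGTGCAAGATGG"},
]
print(define_clones(records, threshold=0.1))
```

In this toy input, s4 has the same junction as s1 but a different V gene, so the partitioning step keeps the two apart before any distance is computed.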

Amino acid model

The aa distance model is the Hamming distance between junction amino acid sequences.

Hamming distance model

The ham distance model is the Hamming distance between junction nucleotide sequences.
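The difference between the aa and ham models can be illustrated with a short sketch. The function names are hypothetical, the distances are raw mismatch counts (any normalization the tool applies is not shown), and the translation assumes the standard genetic code with incomplete trailing codons dropped and unrecognized codons rendered as X.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def translate(nt):
    """Translate a nucleotide junction; unknown codons become 'X'."""
    usable = len(nt) - len(nt) % 3  # drop an incomplete trailing codon
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def aa_distance(nt_a, nt_b):
    """Hamming distance between the translated junctions."""
    return hamming(translate(nt_a), translate(nt_b))


a = "TGTGCAAGATGG"  # C A R W
b = "TGTGCAAGGTGG"  # C A R W (AGA -> AGG is synonymous)
print(hamming(a, b), aa_distance(a, b))
```

The two junctions differ by one nucleotide but encode the same amino acids, so the nucleotide distance is 1 while the amino acid distance is 0.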

Human and mouse 1-mer models

The hh_s1f and mk_rs1nf distance models are single-nucleotide distance matrices derived by averaging and symmetrizing the human 5-mer targeting model in [YVanderHeidenU+13] and the mouse 5-mer targeting model in [CDiNiroVanderHeiden+16], respectively. They are broadly similar to a transition/transversion model.

Human 1-mer substitution matrix:

Nucleotide  A     C     G     T     N
A           0     1.21  0.64  1.16  0
C           1.21  0     1.16  0.64  0
G           0.64  1.16  0     1.21  0
T           1.16  0.64  1.21  0     0
N           0     0     0     0     0

Mouse 1-mer substitution matrix:

Nucleotide  A     C     G     T     N
A           0     1.51  0.32  1.17  0
C           1.51  0     1.17  0.32  0
G           0.32  1.17  0     1.51  0
T           1.17  0.32  1.51  0     0
N           0     0     0     0     0
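As an illustration of how such a matrix might be applied, the sketch below scores two equal-length junctions with the human 1-mer values listed above. It assumes the per-position substitution distances are simply summed, which the text does not state for the 1-mer models; the function name is hypothetical.

```python
NUCLEOTIDES = "ACGTN"

# Human 1-mer substitution matrix, rows and columns in A, C, G, T, N order.
HUMAN_ROWS = [
    [0, 1.21, 0.64, 1.16, 0],
    [1.21, 0, 1.16, 0.64, 0],
    [0.64, 1.16, 0, 1.21, 0],
    [1.16, 0.64, 1.21, 0, 0],
    [0, 0, 0, 0, 0],
]
HUMAN_1MER = {
    (row_nuc, col_nuc): HUMAN_ROWS[i][j]
    for i, row_nuc in enumerate(NUCLEOTIDES)
    for j, col_nuc in enumerate(NUCLEOTIDES)
}


def substitution_distance(seq_a, seq_b, matrix=HUMAN_1MER):
    """Sum matrix entries over aligned positions (assumed scoring rule)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(x, y)] for x, y in zip(seq_a, seq_b))


print(substitution_distance("ACGT", "GCGT"))  # one A/G transition
print(substitution_distance("ACGT", "CCGT"))  # one A/C transversion
```

Under this scoring, transitions (A/G, C/T) cost 0.64 while transversions cost 1.16 or 1.21, and any comparison involving N contributes nothing.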

Human and mouse 5-mer models

The hh_s5f and mk_rs5nf distance models are based on the human 5-mer targeting model in [YVanderHeidenU+13] and the mouse 5-mer targeting model in [CDiNiroVanderHeiden+16], respectively. The targeting matrix T has 5-mers across the columns and, as the rows, the nucleotide to which the center base of the 5-mer mutates. The value for a given nucleotide, 5-mer pair T[i,j] is the product of the likelihood of that 5-mer being mutated, mut(j), and the likelihood of the center base mutating to the given nucleotide, sub(j\rightarrow i). This matrix of probabilities is converted into a distance matrix D via the following steps:

  1. D = -log10(T)
  2. D is then divided by the mean of values in D
  3. All distances in D that are infinite (probability of zero), distances on the diagonal (no change), and NA distances are set to 0.
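The three steps above can be sketched on a toy targeting table. This is an illustration under stated assumptions, not the tool's code: the table is stored as a hypothetical nested dictionary keyed by 5-mer and then by target nucleotide, NA is represented by `None` or NaN, the probabilities are invented, and the mean in step 2 is assumed to be taken over finite entries only (an infinite entry would otherwise make the mean infinite).

```python
import math

NUCLEOTIDES = "ACGT"


def targeting_to_distance(targeting):
    """Convert targeting probabilities into normalized distances."""
    # Step 1: D = -log10(T); zero probability -> inf, missing -> NaN.
    raw = {}
    for kmer, row in targeting.items():
        raw[kmer] = {}
        for nuc in NUCLEOTIDES:
            p = row.get(nuc)
            if p is None or math.isnan(p):
                raw[kmer][nuc] = math.nan
            elif p == 0:
                raw[kmer][nuc] = math.inf
            else:
                raw[kmer][nuc] = -math.log10(p)

    # Step 2: divide by the mean (assumption: finite entries only).
    finite = [d for row in raw.values() for d in row.values() if math.isfinite(d)]
    mean = sum(finite) / len(finite)

    # Step 3: infinite, diagonal (no change), and NA entries become 0.
    dist = {}
    for kmer, row in raw.items():
        center = kmer[len(kmer) // 2]
        dist[kmer] = {
            nuc: 0.0 if nuc == center or not math.isfinite(d) else d / mean
            for nuc, d in row.items()
        }
    return dist


# Toy probabilities, invented for illustration only.
toy_targeting = {
    "AAAAA": {"A": None, "C": 0.1, "G": 0.01, "T": 0.0},
    "AACAA": {"A": 0.001, "C": None, "G": 0.1, "T": 0.01},
}
toy_distance = targeting_to_distance(toy_targeting)
print(toy_distance["AAAAA"])
```

Because of the negative logarithm, likelier mutations receive smaller distances, and dividing by the mean puts the average finite distance at 1.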

Since the distance matrix D is not symmetric, the --sym argument can be specified to calculate either the average (avg) or minimum (min) of D(j\rightarrow i) and D(i\rightarrow j). The distances defined by D for each nucleotide difference are summed for all 5-mers in the junction to yield the distance between the two junction sequences.
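One possible reading of the symmetrization and summation is sketched below. Several details are assumptions rather than documented behavior: sequence ends are padded with N so that every position has a 5-mer context, the two directions are taken to be "5-mer in the first sequence mutating to the base in the second" and the reverse, and the distance lookup is passed in as a hypothetical callable `dist(kmer, nuc)`. The toy lookup used here is invented purely to make the example runnable.

```python
def five_mer_distance(seq_a, seq_b, dist, sym="avg"):
    """Sum symmetrized 5-mer distances over mismatched positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    if sym not in ("avg", "min"):
        raise ValueError("sym must be 'avg' or 'min'")

    # Assumption: pad with N so edge positions still have a 5-mer context.
    pad_a = "NN" + seq_a + "NN"
    pad_b = "NN" + seq_b + "NN"

    total = 0.0
    for pos, (x, y) in enumerate(zip(seq_a, seq_b)):
        if x == y:
            continue
        # 5-mer centered on this position in each padded sequence.
        forward = dist(pad_a[pos:pos + 5], y)  # context in A, mutating to B's base
        reverse = dist(pad_b[pos:pos + 5], x)  # context in B, mutating to A's base
        total += (forward + reverse) / 2 if sym == "avg" else min(forward, reverse)
    return total


def toy_dist(kmer, nuc):
    """Invented lookup: cheaper when the center base is a purine."""
    return 0.5 if kmer[2] in "AG" else 1.5


print(five_mer_distance("AAAAAAA", "AAACAAA", toy_dist, sym="avg"))
print(five_mer_distance("AAAAAAA", "AAACAAA", toy_dist, sym="min"))
```

With a single mismatch, the forward lookup sees the 5-mer AAAAA and the reverse lookup sees AACAA, so the avg and min options return different totals whenever the two directions disagree.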

Clustering by the full sequence

The chen2010 and ademokun2011 methods provided by DefineClones cluster sequences based on the full length sequence, by imposing penalties for V-gene and/or J-gene mismatches.

Chen et al, 2010 method

The chen2010 method of DefineClones is directly from [CCWGaeta10]. The distance metric is a normalized edit distance (NED_VJ) calculated as:

NED\_VJ = \frac{LD+S_V+S_J}{L}

where LD is the unnormalized Levenshtein distance, S_V is the mismatch penalty for the V-gene (0 if same gene, 1 if allele differs, 3 if gene differs, and 5 if family differs), S_J is the mismatch penalty for the J-gene (0 if same gene, 1 if allele differs, 3 if gene differs), and L is the CDR3 alignment length. Given this distance metric, sequences are clustered using hierarchical clustering with average linkage. The resulting dendrogram is trimmed at 0.32.
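A worked sketch of the NED_VJ formula follows. It is not the chen2010 implementation: LD is assumed to be computed on the CDR3 nucleotide sequences, L is approximated by the longer of the two CDR3 lengths as a stand-in for the alignment length, and gene calls are assumed to follow the IMGT-style pattern family-gene*allele (for example IGHV1-2*01). All function names are hypothetical.

```python
def levenshtein(a, b):
    """Unnormalized edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def split_call(call):
    """Split 'IGHV1-2*01' into family 'IGHV1', gene 'IGHV1-2', allele '01'."""
    gene, _, allele = call.partition("*")
    family = gene.split("-")[0]
    return family, gene, allele


def v_penalty(v_a, v_b):
    """S_V: 5 if family differs, 3 if gene differs, 1 if allele differs, else 0."""
    fam_a, gene_a, allele_a = split_call(v_a)
    fam_b, gene_b, allele_b = split_call(v_b)
    if fam_a != fam_b:
        return 5
    if gene_a != gene_b:
        return 3
    return 1 if allele_a != allele_b else 0


def j_penalty(j_a, j_b):
    """S_J: 3 if gene differs, 1 if allele differs, else 0."""
    _, gene_a, allele_a = split_call(j_a)
    _, gene_b, allele_b = split_call(j_b)
    if gene_a != gene_b:
        return 3
    return 1 if allele_a != allele_b else 0


def ned_vj(cdr3_a, cdr3_b, v_a, v_b, j_a, j_b):
    """NED_VJ = (LD + S_V + S_J) / L, with L approximated by the longer CDR3."""
    length = max(len(cdr3_a), len(cdr3_b))  # assumption: stand-in for alignment length
    ld = levenshtein(cdr3_a, cdr3_b)
    return (ld + v_penalty(v_a, v_b) + j_penalty(j_a, j_b)) / length


same_gene = ned_vj("TGTGCAAGATGG", "TGTGCAAGGTGG",
                   "IGHV1-2*01", "IGHV1-2*02", "IGHJ4*02", "IGHJ4*02")
diff_family = ned_vj("TGTGCAAGATGG", "TGTGCAAGGTGG",
                     "IGHV1-2*01", "IGHV3-23*01", "IGHJ4*02", "IGHJ4*02")
print(round(same_gene, 3), round(diff_family, 3))
```

In this example one CDR3 mismatch plus an allele difference gives (1 + 1 + 0) / 12, below the 0.32 trimming threshold, whereas the same CDR3 pair with V genes from different families gives (1 + 5 + 0) / 12 = 0.5, above it.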

Ademokun et al, 2011 method

The ademokun2011 method of DefineClones is taken directly from [AWM+11]. The distance metric is a minimum edit distance normalized to the length of the shorter sequence up to a maximum of 1 in 5 (or a total of 10) mismatches or indels. Distance is set to 1 for sequences with more than the maximum number of mismatches or sequences with different V-gene families. This metric is then used for complete-linkage hierarchical clustering. The resulting dendrogram is trimmed at 0.25.

[AWM+11] Alexander Ademokun, Yu-Chang Wu, Victoria Martin, Rajive Mitra, Ulrich Sack, Helen Baxendale, David Kipling, and Deborah K Dunn-Walters. Vaccination-induced changes in human B-cell repertoire and pneumococcal IgM and IgA antibody at different ages. Aging cell, 10(6):922–30, December 2011. doi:10.1111/j.1474-9726.2011.00732.x.
[CCWGaeta10] Zhiliang Chen, Andrew M Collins, Yan Wang, and Bruno A Gaëta. Clustering-based identification of clonally-related immunoglobulin gene sequence sets. Immunome research, 6 Suppl 1(Suppl 1):S4, January 2010. doi:10.1186/1745-7580-6-S1-S4.
[CDiNiroVanderHeiden+16] Ang Cui, Roberto Di Niro, Jason A. Vander Heiden, Adrian W. Briggs, Kris Adams, Tamara Gilbert, Kevin C. O’Connor, Francois Vigneault, Mark J. Shlomchik, and Steven H. Kleinstein. A Model of Somatic Hypermutation Targeting in Mice Based on High-Throughput Ig Sequencing Data. The Journal of Immunology, 197(9):3566–3574, November 2016. doi:10.4049/jimmunol.1502263.
[YVanderHeidenU+13] Gur Yaari, Jason A Vander Heiden, Mohamed Uduman, Daniel Gadala-Maria, Namita Gupta, Joel N H Stern, Kevin C O’Connor, David A Hafler, Uri Laserson, Francois Vigneault, and Steven H Kleinstein. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Frontiers in immunology, 4:358, January 2013. doi:10.3389/fimmu.2013.00358.