Clustering sequences into clonal groups
Determining a clustering threshold
Before running DefineClones, it is important to determine an
appropriate threshold for trimming the hierarchical clustering into B cell
clones. The distToNearest
function in the SHazaM R package calculates
the distance between each sequence in the data and its nearest neighbor. The
resulting distribution is bimodal, with the first mode representing sequences
with clonal relatives in the dataset and the second mode representing singletons.
The ideal threshold for separating clonal groups is the value that separates
the two modes of this distribution; it can be found using the findThreshold
function in the SHazaM R package. The distToNearest
function allows selection of all parameters that are available in DefineClones.
Using the length normalization parameter ensures that mutations are weighted equally
regardless of junction sequence length. The distance to nearest neighbor distribution
for the example data is shown below. The threshold is 0.16, indicated
by the red dotted line.
Download the R Script to generate
the distance to nearest neighbor distribution.
For additional details see the distToNearest documentation.
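The idea behind the distance to nearest neighbor calculation can be sketched in a few lines of Python. This is a minimal illustration and not the SHazaM implementation: the function names and the toy junction sequences are hypothetical, only sequences of equal length are compared, and the grouping by V-gene and J-gene calls is omitted.

```python
def normalized_hamming(a, b):
    """Hamming distance between two equal-length sequences, divided by length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def dist_to_nearest(seqs):
    """Return, for each sequence, the distance to its closest other sequence.

    Sequences with no other sequence of the same length get None.
    """
    nearest = []
    for i, s in enumerate(seqs):
        dists = [
            normalized_hamming(s, t)
            for j, t in enumerate(seqs)
            if j != i and len(t) == len(s)
        ]
        nearest.append(min(dists) if dists else None)
    return nearest


# Hypothetical 12-nucleotide junctions: three close relatives and one outlier.
junctions = [
    "TGTGCGAGAGAT",
    "TGTGCGAGAGAA",
    "TGTGCAAGAGAT",
    "TGTACCTTTGGG",
]

distances = dist_to_nearest(junctions)
for seq, d in zip(junctions, distances):
    print(seq, round(d, 3))
```

In this toy input the first three sequences have a nearest neighbor one mismatch away, while the last sequence is far from all others. On real data, values like these form the two modes of the distribution described above.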
There are several parameter choices when grouping Ig sequences into B cell
clones. The argument --act set
accounts for ambiguous V-gene and J-gene calls when grouping similar sequences. The
distance metric --model ham is nucleotide Hamming distance. Because the
ham distance model is symmetric, the
--sym min argument can be left as default.
Because the threshold was generated using length normalized distances, the
--norm len argument is selected with the threshold --dist 0.16:
DefineClones.py bygroup -d S43_db-pass_parse-select.tab --act set --model ham \
    --sym min --norm len --dist 0.16
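To illustrate what cutting a hierarchical clustering at a distance threshold means, the sketch below groups toy junctions using length-normalized Hamming distance and a cutoff of 0.16. It assumes single-linkage clustering, where two sequences fall in the same clone if a chain of pairwise distances at or below the threshold connects them. The function names and sequences are hypothetical, and this is not the DefineClones implementation.

```python
def normalized_hamming(a, b):
    """Hamming distance between two equal-length sequences, divided by length."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def cluster_by_threshold(seqs, threshold):
    """Single-linkage grouping: link any pair within the threshold (union-find)."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            same_length = len(seqs[i]) == len(seqs[j])
            if same_length and normalized_hamming(seqs[i], seqs[j]) <= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive clone identifiers starting at 1.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(seqs))]


# Hypothetical junctions: the second and third differ from each other by 2/12,
# which is above 0.16, but both are within 1/12 of the first.
junctions = [
    "TGTGCGAGAGAT",
    "TGTGCGAGAGAA",
    "TGTGCAAGAGAT",
    "TGTACCTTTGGG",
]

clones = cluster_by_threshold(junctions, threshold=0.16)
print(clones)
```

Under the single-linkage assumption the first three sequences end up in one clone even though two of them are slightly further apart than the threshold, because both are linked through the first sequence. The outlier forms its own clone.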