Clustering sequences into clonal groups
Example data
We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. The files can be downloaded from here:
The following examples use the HD13M_db-pass.tsv database file provided in the example bundle, which has already undergone the IMGT/IgBLAST parsing and filtering operations.
Determining a clustering threshold
Before running DefineClones.py, it is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones. The distToNearest function in the SHazaM R package calculates the distance between each sequence in the data and its nearest neighbor. The resulting distribution should be bimodal, with the first mode representing sequences with clonal relatives in the dataset and the second mode representing singletons. The ideal threshold for separating clonal groups is the value that separates the two modes of this distribution, and it can be found using the findThreshold function in the SHazaM R package. The distToNearest function allows selection of all parameters that are available in DefineClones.py. Using the length normalization parameter ensures that mutations are weighted equally regardless of junction sequence length. The distance to nearest neighbor distribution for the example data is shown below. The threshold is approximately 0.16, indicated by the red dotted line.
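As a rough illustration of the quantity being plotted, the following Python sketch computes a length-normalized Hamming distance from each junction to its nearest neighbor. It is not the SHazaM implementation: the function names are hypothetical, only junctions of equal length are compared, identical sequences are skipped, and the grouping by V gene and J gene that a real analysis would apply is left out.

```python
def normalized_hamming(a, b):
    """Hamming distance divided by sequence length (length normalization)."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def dist_to_nearest(junctions):
    """Return, for each junction, the distance to its closest non-identical
    neighbor of the same length, or None if no such neighbor exists.

    Hypothetical helper for illustration only.
    """
    nearest = []
    for i, query in enumerate(junctions):
        distances = [
            normalized_hamming(query, other)
            for j, other in enumerate(junctions)
            if j != i and len(other) == len(query)
        ]
        # Assumption of this sketch: identical sequences (distance 0) are ignored.
        nonzero = [d for d in distances if d > 0]
        nearest.append(min(nonzero) if nonzero else None)
    return nearest


# Toy junctions (made up for this example, not taken from the example data).
junctions = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTAC"]
print(dist_to_nearest(junctions))
```

In this toy input the first two junctions differ at one of ten positions, so each has a small nearest-neighbor distance, while the third is far from everything else. A histogram of such values over a full repertoire is what produces the bimodal distribution described above.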
See also
For additional details see the vignette on tuning clonal assignment thresholds.
Assigning clones
There are several parameter choices when grouping Ig sequences into B cell clones. The argument --act set accounts for ambiguous V gene and J gene calls when grouping similar sequences. The distance metric --model ham is nucleotide Hamming distance. Because the threshold was generated using length normalized distances, the --norm len argument is selected with the previously determined threshold --dist 0.16:
DefineClones.py -d HD13M_db-pass.tsv --act set --model ham \
--norm len --dist 0.16
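To make the role of the distance threshold concrete, here is a minimal Python sketch of threshold-based clustering on length-normalized Hamming distance. It is an illustration under stated assumptions, not the DefineClones.py implementation: the function names are hypothetical, single linkage is assumed, pairs at or below the threshold are linked, and partitioning by V gene and J gene is omitted.

```python
def normalized_hamming(a, b):
    """Hamming distance divided by sequence length (length normalization)."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_clones(junctions, threshold):
    """Assign a clone label to each junction (hypothetical helper).

    Two junctions of equal length are linked when their normalized distance
    is at or below the threshold; linked junctions end up in the same clone
    even if they are only connected through intermediates (single linkage).
    """
    parent = list(range(len(junctions)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(junctions)):
        for j in range(i + 1, len(junctions)):
            a, b = junctions[i], junctions[j]
            if len(a) == len(b) and normalized_hamming(a, b) <= threshold:
                parent[find(j)] = find(i)

    # Relabel cluster roots as consecutive clone numbers starting at 1.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(junctions))]


# Toy junctions (made up for this example).
junctions = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGTTA", "TTTTGGGGCC"]
print(single_linkage_clones(junctions, 0.16))
```

Under single linkage, the first and third toy junctions fall into the same clone even though their direct distance (0.2) exceeds the threshold, because each is within 0.1 of the second junction.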
Note
Because T cells don’t undergo somatic hypermutation (SHM), non-zero nucleotide distances suggest that sequences originate from different ancestors. To identify TCR clones, use --dist 0, or a very low distance value to allow for sequencing error.
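With a distance of zero, clonal grouping reduces to exact matching. The Python sketch below shows that idea by grouping records that share the same V gene call, J gene call, and junction sequence. The field names v_call, j_call, and junction are assumed here for illustration, as is the function name; this is not how DefineClones.py is implemented internally.

```python
from collections import defaultdict


def exact_match_clones(records):
    """Group record indices by identical (V call, J call, junction).

    Hypothetical helper: mirrors the idea of a zero distance threshold,
    where any nucleotide difference places sequences in separate clones.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        key = (record["v_call"], record["j_call"], record["junction"])
        groups[key].append(index)
    return list(groups.values())


# Toy TCR records (made up for this example).
records = [
    {"v_call": "TRBV7-9", "j_call": "TRBJ2-1", "junction": "TGTGCCAGCAGC"},
    {"v_call": "TRBV7-9", "j_call": "TRBJ2-1", "junction": "TGTGCCAGCAGT"},
    {"v_call": "TRBV7-9", "j_call": "TRBJ2-1", "junction": "TGTGCCAGCAGC"},
]
print(exact_match_clones(records))
```

A small non-zero threshold would instead merge the second record with the others, which is the trade-off the note describes when allowing for sequencing error.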