Clustering sequences into clonal groups

Example Data

We have hosted a small example dataset resulting from the Roche 454 example workflow described in the pRESTO documentation. The files can be downloaded from here:

Change-O Example Files

The following examples use the S43_db-pass_parse-select.tab database file provided in the example bundle, which has already undergone the IMGT/IgBLAST parsing and filtering operations.

Determining a clustering threshold

Before running DefineClones, it is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones. The distToNearest function in the SHazaM R package calculates the distance between each sequence in the data and its nearest neighbor. The resulting distribution is bimodal: the first mode represents sequences with clonal relatives in the dataset, and the second mode represents singletons. The ideal threshold for separating clonal groups is the value that separates the two modes of this distribution; it can be found using the findThreshold function in the SHazaM R package.

The distToNearest function allows selection of all parameters that are available in DefineClones. Using the length normalization parameter ensures that mutations are weighted equally regardless of junction sequence length. The distance to nearest neighbor distribution for the example data is shown below. The threshold is 0.16, indicated by the red dotted line.

../_images/DistToNearest.svg

Download the R Script to generate the distance to nearest neighbor distribution.

See also

For additional details see the distToNearest documentation.
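
To make the quantity behind this distribution concrete, the following Python sketch computes a length-normalized Hamming distance from each junction to its nearest neighbor. It is an illustration only, not the SHazaM implementation: the function names normalized_hamming and dist_to_nearest and the toy junction strings are hypothetical, and the sketch assumes all sequences have already been grouped so that they share the same junction length. The real distToNearest function takes further parameters that are not modeled here.

```python
from typing import List, Optional


def normalized_hamming(a: str, b: str) -> float:
    """Number of mismatched positions divided by the sequence length."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def dist_to_nearest(junctions: List[str]) -> List[Optional[float]]:
    """For each junction, the smallest normalized distance to any other junction.

    Returns None for a junction that has no other sequence to compare against.
    Hypothetical simplification: every junction is assumed to have the same length.
    """
    nearest: List[Optional[float]] = []
    for i, query in enumerate(junctions):
        others = [
            normalized_hamming(query, target)
            for j, target in enumerate(junctions)
            if j != i
        ]
        nearest.append(min(others) if others else None)
    return nearest


# Toy junctions (made up for illustration, not taken from the example data).
toy_junctions = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
print(dist_to_nearest(toy_junctions))
```

In this toy input the first two junctions differ at a single position out of ten, so they fall below a threshold of 0.16, while the third junction is far from both and behaves like a singleton.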

Assigning clones

Several parameters must be chosen when grouping Ig sequences into B cell clones. The argument --act set accounts for ambiguous V-gene and J-gene calls when grouping similar sequences. The distance metric --model ham is nucleotide Hamming distance. Because the ham distance model is symmetric, the --sym min argument can be left as default. Because the threshold was generated using length-normalized distances, the --norm len argument is selected with the resultant threshold --dist 0.16:

DefineClones.py bygroup -d S43_db-pass_parse-select.tab --act set --model ham \
--sym min --norm len --dist 0.16
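
As a rough illustration of what cutting a clustering at a distance threshold means, the Python sketch below links any two junctions whose length-normalized Hamming distance is at or below the threshold and reports the connected groups as clones. This is a sketch under stated assumptions, not the DefineClones implementation: the function name assign_clones and the toy junctions are hypothetical, single-linkage grouping and the "at or below" comparison are assumptions, and the V-gene and J-gene grouping controlled by --act is not modeled.

```python
from typing import Dict, List


def assign_clones(junctions: List[str], threshold: float) -> List[int]:
    """Group junctions by single linkage on length-normalized Hamming distance.

    Hypothetical helper: junctions of different lengths are never linked,
    and clone identifiers are numbered from 1 in order of first appearance.
    """
    n = len(junctions)
    parent = list(range(n))  # union-find forest, one node per junction

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            a, b = junctions[i], junctions[j]
            if len(a) != len(b) or not a:
                continue
            distance = sum(x != y for x, y in zip(a, b)) / len(a)
            if distance <= threshold:  # assumption: inclusive comparison
                parent[find(i)] = find(j)

    # Relabel union-find roots as consecutive clone identifiers.
    labels: Dict[int, int] = {}
    clones: List[int] = []
    for i in range(n):
        root = find(i)
        if root not in labels:
            labels[root] = len(labels) + 1
        clones.append(labels[root])
    return clones


# Toy junctions (made up for illustration, not taken from the example data).
toy_junctions = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "ACGTACGTTT"]
print(assign_clones(toy_junctions, threshold=0.16))  # prints [1, 1, 2, 1]
```

The first and fourth toy junctions are 0.2 apart, above the threshold, yet they end up in the same clone because each is within 0.1 of the second junction. This chaining is a property of single-linkage grouping.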