Attention

As of v1.0.0 the default file format will be the Adaptive Immune Receptor Repertoire (AIRR) Community Rearrangement standard. The legacy Change-O format will still be supported through the --format changeo argument. See the Release Notes for more details.


Change-O - Repertoire clonal assignment toolkit

Change-O is a collection of tools for processing the output of V(D)J alignment tools, assigning clonal clusters to immunoglobulin (Ig) sequences, and reconstructing germline sequences.

Dramatic improvements in high-throughput sequencing technologies now enable large-scale characterization of Ig repertoires, defined as the collection of trans-membrane antigen-receptor proteins located on the surface of B cells and T cells. Change-O is a suite of utilities to facilitate advanced analysis of Ig and TCR sequences following germline segment assignment. Change-O handles output from IMGT/HighV-QUEST and IgBLAST, and provides a wide variety of clustering methods for assigning clonal groups to Ig sequences. Record sorting, grouping, and various database manipulation operations are also included.

Overview

Change-O performs analyses of lymphocyte receptor sequences following alignment against the germline reference. It includes tools for standardizing the output of alignment software, clonal assignment, germline reconstruction, and basic database manipulation. Change-O was designed to be simple to use, but it does require some familiarity with commandline applications. To maximize flexibility, Change-O employs a simple tab-delimited database format with standardized column names, allowing easy use of Change-O output with external environments and interoperability with the Alakazam, SHazaM, and TIgGER R packages. A brief description of each tool is shown in the table below.

Tool                Subcommand   Description
------------------  -----------  --------------------------------------------------------------
AlignRecords.py                  Performs multiple alignment of sequences in a database
                    across       Aligns sequence columns within groups and across rows
                    block        Aligns sequence groups across both columns and rows
                    within       Aligns sequence fields within rows
AssignGenes.py      igblast      Runs IgBLAST on a fasta file
ConvertDb.py                     Converts tab delimited database files
                    airr         Converts input to an AIRR TSV file
                    baseline     Creates a special BASELINe formatted fasta file from a database
                    changeo      Converts input into a Change-O TSV file
                    fasta        Creates a fasta file from database records
                    genbank      Creates fasta and feature table files for input to tbl2asn
CreateGermlines.py               Reconstructs germline sequences from alignment data
DefineClones.py                  Assigns clones by V gene, J gene and junction distance
MakeDb.py                        Creates standardized databases from germline alignment results
                    igblast      Parses IgBLAST output and adds IMGT-gapping to the V-segment
                    ihmm         Parses iHMMune-Align output
                    imgt         Parses IMGT/HighV-QUEST output
ParseDb.py                       Parses annotations in tab delimited database files
                    add          Adds fields to the database
                    delete       Deletes specific records
                    drop         Deletes entire fields
                    index        Adds a numeric index field
                    merge        Merges files
                    rename       Renames fields
                    select       Selects specific records
                    sort         Sorts records by a field
                    split        Splits database files by field values
                    update       Updates field and value pairs

Download

The latest stable release of Change-O may be downloaded from PyPI or Bitbucket.

Development versions and source code are available on Bitbucket.

Installation

The simplest way to install the latest stable release of Change-O is via pip:

> pip3 install changeo --user

The current development build can be installed using pip and git in similar fashion:

> pip3 install git+https://bitbucket.org/kleinstein/changeo@master --user

If you currently have a development version installed, then you will likely need to add the arguments --upgrade --no-deps --force-reinstall to the pip3 command.

Requirements

The minimum dependencies for installation are:

Some tools wrap external applications that are not required for installation. Those tools require minimum versions of:

Linux

  1. The simplest way to install all Python dependencies is to install the full SciPy stack using the instructions, then install Biopython according to its instructions.

  2. Install presto 0.6.2 or greater.

  3. Download the Change-O bundle and run:

    > pip3 install changeo-x.y.z.tar.gz --user
    

Mac OS X

  1. Install Xcode. Available from the Apple store or developer downloads.

  2. Older versions of Mac OS X will require you to install XQuartz 2.7.5. Available from the XQuartz project.

  3. Install Homebrew following the installation and post-installation instructions.

  4. Install Python 3.4.0+ and set the path to the python3 executable:

    > brew install python3
    > echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile
    
  5. Exit and reopen the terminal application so the PATH setting takes effect.

  6. You may, or may not, need to install gfortran (required for SciPy). Try without first, as this can take an hour to install and is not needed on newer releases. If you do need gfortran to install SciPy, you can install it using Homebrew:

    > brew install gfortran
    

    If the above fails, run this instead:

    > brew install --env=std gfortran
    
  7. Install NumPy, SciPy, pandas and Biopython using the Python package manager:

    > pip3 install numpy scipy pandas biopython
    
  8. Install presto 0.6.2 or greater.

  9. Download the Change-O bundle, open a terminal window, change directories to the download folder, and run:

    > pip3 install changeo-x.y.z.tar.gz
    

Windows

  1. Install Python 3.4.0+ from Python, selecting both the options ‘pip’ and ‘Add python.exe to Path’.

  2. Install NumPy, SciPy, pandas and Biopython using the packages available from the Unofficial Windows binary collection.

  3. Install presto 0.6.2 or greater.

  4. Download the Change-O bundle, open a Command Prompt, change directories to the download folder, and run:

    > pip install changeo-x.y.z.tar.gz
    
  5. For a default installation of Python 3.4, the Change-O scripts will be installed into C:\Python34\Scripts and should be directly executable from the Command Prompt. If this is not the case, then follow step 6 below.

  6. Add both the C:\Python34 and C:\Python34\Scripts directories to your %Path%. On both Windows 7 and Windows 10, the %Path% setting is located under Control Panel -> System and Security -> System -> Advanced System Settings -> Environment variables -> System variables -> Path.

  7. If you have trouble with the .py file associations, try adding .PY to your PATHEXT environment variable. Also, try opening a Command Prompt as Administrator and run:

    > assoc .py=Python.File
    > ftype Python.File="C:\Python34\python.exe" "%1" %*
    

Data Standards

All Change-O tools support both the legacy Change-O standard and the new Adaptive Immune Receptor Repertoire (AIRR) standard developed by the AIRR Community (AIRR-C).

AIRR-C Format

As of v1.0.0, the default file format is the AIRR-C format as described by the Rearrangement Schema (v1.2). The AIRR-C Rearrangement format is a tab-delimited file format (.tsv) that defines the required and optional annotations for rearranged adaptive immune receptor sequences.

To learn more about this format, the valid field names and their expected values, visit the AIRR-C Rearrangement Schema documentation site.

An API for the input and output of the AIRR-C format is provided in the AIRR Python package. Wrappers for this package are provided in the API as changeo.IO.AIRRReader and changeo.IO.AIRRWriter.

Change-O Format

The legacy Change-O standard is a tab-delimited file format (.tab) with a set of predefined column names. The standardized column names used by the Change-O format are shown in the table below. Most tools do not require every column. The columns required by and added by each individual tool are described in the commandline usage documentation. If a column contains multiple entries, such as ambiguous V gene assignments, these nested entries are delimited by commas. The ordering of the columns does not matter.

An API for the input and output of the Change-O format is provided in changeo.IO.ChangeoReader and changeo.IO.ChangeoWriter respectively.
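The properties described above (tab-delimited rows, lookup by column name, comma-delimited nested entries) can be illustrated with a minimal sketch that uses only the Python standard library. It does not use changeo.IO; the helper names read_changeo and split_calls and the example records are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical two-record Change-O table; the values are illustrative only.
EXAMPLE_TAB = (
    "SEQUENCE_ID\tV_CALL\tJ_CALL\tJUNCTION_LENGTH\n"
    "seq1\tIGHV1-2*01,IGHV1-2*02\tIGHJ4*02\t48\n"
    "seq2\tIGHV3-23*01\tIGHJ6*02\t51\n"
)


def read_changeo(handle):
    """Yield one dict per record, keyed by column name."""
    # Columns are looked up by name, so their order in the file is irrelevant.
    for row in csv.DictReader(handle, delimiter="\t"):
        yield row


def split_calls(value):
    """Split a comma-delimited nested entry into a list of assignments."""
    return [item for item in value.split(",") if item]


records = list(read_changeo(io.StringIO(EXAMPLE_TAB)))
for rec in records:
    # Ambiguous V gene assignments appear as several entries in one column.
    print(rec["SEQUENCE_ID"], split_calls(rec["V_CALL"]))
```

Because records are keyed by column name, the same code works regardless of the order in which the columns appear in the file.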

Change-O Field          AIRR Field          Type     Description
----------------------  ------------------  -------  ----------------------------------------

Standard Annotations

SEQUENCE_ID             sequence_id         string   Unique sequence identifier
SEQUENCE_INPUT          sequence            string   Input nucleotide sequence
SEQUENCE_VDJ                                string   V(D)J nucleotide sequence
SEQUENCE_IMGT           sequence_alignment  string   IMGT-numbered V(D)J nucleotide sequence
FUNCTIONAL              productive          logical  T: V(D)J sequence is predicted to be productive
IN_FRAME                vj_in_frame         logical  T: junction region nucleotide sequence is in-frame
STOP                    stop_codon          logical  T: stop codon is present in V(D)J nucleotide sequence
MUTATED_INVARIANT                           logical  T: invariant amino acids properly encoded by V(D)J sequence
INDELS                                      logical  T: V(D)J nucleotide sequence contains insertions and/or deletions
LOCUS                   locus               string   Locus of the receptor
V_CALL                  v_call              string   V allele assignment(s)
D_CALL                  d_call              string   D allele assignment(s)
J_CALL                  j_call              string   J allele assignment(s)
C_CALL                  c_call              string   C-region assignment
V_SEQ_START             v_sequence_start    integer  Position of first V nucleotide in SEQUENCE_INPUT
V_SEQ_LENGTH                                integer  Number of V nucleotides in SEQUENCE_INPUT
V_GERM_START_IMGT       v_germline_start    integer  Position of V_SEQ_START in IMGT-numbered germline V(D)J sequence
V_GERM_LENGTH_IMGT                          integer  Length of the IMGT-numbered germline V alignment
NP1_LENGTH              np1_length          integer  Number of nucleotides between V and D segments
D_SEQ_START             d_sequence_start    integer  Position of first D nucleotide in SEQUENCE_INPUT
D_SEQ_LENGTH                                integer  Number of D nucleotides in SEQUENCE_INPUT
D_GERM_START            d_germline_start    integer  Position of D_SEQ_START in germline V(D)J nucleotide sequence
D_GERM_LENGTH                               integer  Length of the germline D alignment
NP2_LENGTH              np2_length          integer  Number of nucleotides between D and J segments
J_SEQ_START             j_sequence_start    integer  Position of first J nucleotide in SEQUENCE_INPUT
J_SEQ_LENGTH            j_sequence_end      integer  Number of J nucleotides in SEQUENCE_INPUT
J_GERM_START            j_germline_start    integer  Position of J_SEQ_START in germline V(D)J nucleotide sequence
J_GERM_LENGTH                               integer  Length of the germline J alignment
JUNCTION_LENGTH         junction_length     integer  Number of junction nucleotides in SEQUENCE_VDJ
JUNCTION                junction            string   Junction region nucleotide sequence
CELL                    cell_id             string   Cell identifier
CLONE                   clone_id            string   Clonal grouping identifier

Region Annotations

FWR1_IMGT               fwr1                string   IMGT-numbered FWR1 nucleotide sequence
FWR2_IMGT               fwr2                string   IMGT-numbered FWR2 nucleotide sequence
FWR3_IMGT               fwr3                string   IMGT-numbered FWR3 nucleotide sequence
FWR4_IMGT               fwr4                string   IMGT-numbered FWR4 nucleotide sequence
CDR1_IMGT               cdr1                string   IMGT-numbered CDR1 nucleotide sequence
CDR2_IMGT               cdr2                string   IMGT-numbered CDR2 nucleotide sequence
CDR3_IMGT               cdr3                string   IMGT-numbered CDR3 nucleotide sequence
N1_LENGTH               n1_length           integer  Untemplated nucleotides 5’ of the D segment
N2_LENGTH               n2_length           integer  Untemplated nucleotides 3’ of the D segment
P3V_LENGTH              p3v_length          integer  Palindromic nucleotides 3’ of the V segment
P5D_LENGTH              p5d_length          integer  Palindromic nucleotides 5’ of the D segment
P3D_LENGTH              p3d_length          integer  Palindromic nucleotides 3’ of the D segment
P5J_LENGTH              p5j_length          integer  Palindromic nucleotides 5’ of the J segment
D_FRAME                                     integer  D segment reading frame

Germline Annotations

GERMLINE_VDJ                                string   Full unaligned germline V(D)J nucleotide sequence
GERMLINE_VDJ_V_REGION                       string   Unaligned germline V segment nucleotide sequence
GERMLINE_VDJ_D_MASK                         string   Unaligned germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions
GERMLINE_IMGT           germline_alignment  string   Full IMGT-numbered germline V(D)J nucleotide sequence
GERMLINE_IMGT_V_REGION                      string   IMGT-numbered germline V segment nucleotide sequence
GERMLINE_IMGT_D_MASK                        string   IMGT-numbered germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions
GERMLINE_V_CALL                             string   Clonal consensus germline V assignment
GERMLINE_D_CALL                             string   Clonal consensus germline D assignment
GERMLINE_J_CALL                             string   Clonal consensus germline J assignment
GERMLINE_REGIONS                            string   String showing germline segment positions encoded as V, D, J, N, and P characters

Alignment Annotations

V_SCORE                 v_score             float    Alignment score for the V
V_IDENTITY              v_identity          float    Alignment identity for the V
V_EVALUE                v_support           float    E-value for the alignment of the V
V_CIGAR                 v_cigar             string   CIGAR string for the alignment of the V
D_SCORE                 d_score             float    Alignment score for the D
D_IDENTITY              d_identity          float    Alignment identity for the D
D_EVALUE                d_support           float    E-value for the alignment of the D
D_CIGAR                 d_cigar             string   CIGAR string for the alignment of the D
J_SCORE                 j_score             float    Alignment score for the J
J_IDENTITY              j_identity          float    Alignment identity for the J
J_EVALUE                j_support           float    E-value for the alignment of the J
J_CIGAR                 j_cigar             string   CIGAR string for the alignment of the J
VDJ_SCORE                                   float    Alignment score for the V(D)J

TIgGER Annotations

V_CALL_GENOTYPED                            string   Adjusted V allele assignment(s) following TIgGER genotype inference

Preprocessing Annotations

PRCONS                                      string   pRESTO UMI consensus primer
PRIMER                                      string   pRESTO primers list
CONSCOUNT               consensus_count     integer  Number of reads contributing to the UMI consensus sequence
DUPCOUNT                duplicate_count     integer  Copy number of the sequence
UMICOUNT                                    integer  UMI count for the sequence
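The field correspondence above can be treated as a simple lookup. The following is a minimal sketch, using only the Python standard library, that renames the header of a tab-delimited table from Change-O names to AIRR names. It is not the ConvertDb.py implementation, and it only renames columns: a real conversion may also need to transform values, which this sketch does not attempt. The mapping is a subset of the table, and the function names and example record are hypothetical.

```python
import csv
import io

# Subset of the Change-O -> AIRR name pairs listed in the table above.
CHANGEO_TO_AIRR = {
    "SEQUENCE_ID": "sequence_id",
    "SEQUENCE_INPUT": "sequence",
    "FUNCTIONAL": "productive",
    "V_CALL": "v_call",
    "J_CALL": "j_call",
    "JUNCTION": "junction",
    "CLONE": "clone_id",
}


def rename_header(fieldnames, mapping=CHANGEO_TO_AIRR):
    """Map Change-O column names to AIRR names, leaving unknown names unchanged."""
    return [mapping.get(name, name) for name in fieldnames]


def rename_columns(in_handle, out_handle):
    """Copy a tab-delimited table, rewriting only the header row."""
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(rename_header(next(reader)))
    writer.writerows(reader)


# Hypothetical one-record input; PRCONS has no AIRR counterpart in the table.
source = io.StringIO("SEQUENCE_ID\tV_CALL\tPRCONS\nseq1\tIGHV1-2*01\tIGM\n")
target = io.StringIO()
rename_columns(source, target)
print(target.getvalue())
```

Columns without an AIRR counterpart in the mapping, such as PRCONS here, are passed through under their original names.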

Release Notes

Version 1.2.0: October 29, 2021

  • Updated dependencies to presto >= v0.7.0.

AssignGenes:

  • Fixed reporting of IgBLAST output counts when specifying --format airr.

BuildTrees:

  • Added support for specifying fixed omega and hotness parameters at the commandline.

CreateGermlines:

  • Will now use the first allele in the reference database when duplicate allele names are provided. Only appears to affect mouse BCR light chains and TCR alleles in the IMGT database when the same allele name differs by strain.

MakeDb:

  • Added support for changes in how IMGT/HighV-QUEST v1.8.4 handles special characters in sequence identifiers.

  • Fixed the imgt subcommand incorrectly allowing execution without specifying the IMGT/HighV-QUEST output file at the commandline.

ParseDb:

  • Added reporting of output file sizes to the console log of the split subcommand.

Version 1.1.0: June 21, 2021

  • Fixed gene parsing for IMGT temporary designation nomenclature.

  • Updated dependencies to biopython >= v1.77, airr >= v1.3.1, PyYAML>=5.1.

MakeDb:

  • Added the --imgt-id-len argument to accommodate changes introduced in how IMGT/HighV-QUEST truncates sequence identifiers as of v1.8.3 (May 7, 2021). The header lines in the fasta files are now truncated to 49 characters. In IMGT/HighV-QUEST versions older than v1.8.3, they were truncated to 50 characters. The default value of --imgt-id-len is 49. Users should specify --imgt-id-len 50 to analyze IMGT results generated with IMGT/HighV-QUEST versions older than v1.8.3.

  • Added the --infer-junction argument to MakeDb igblast, to enable the inference of the junction sequence when not reported by IgBLAST. Should be used with data from IgBLAST v1.6.0 or older, before IgBLAST added the IMGT-CDR3 inference.
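The effect of the identifier length described for --imgt-id-len can be sketched in a few lines of Python. This is an illustration of truncating identifiers to 49 or 50 characters and grouping full identifiers by their truncated form; it is not MakeDb's code, and the function names and example identifier are hypothetical.

```python
def truncate_id(seq_id, id_len=49):
    """Return the identifier cut to at most id_len characters."""
    return seq_id[:id_len]


def index_by_truncated_id(seq_ids, id_len=49):
    """Group full identifiers by their truncated form."""
    index = {}
    for seq_id in seq_ids:
        index.setdefault(truncate_id(seq_id, id_len), []).append(seq_id)
    return index


# Hypothetical identifier that is longer than either truncation length.
long_id = "SAMPLE01|" + "A" * 50
print(len(truncate_id(long_id)))      # length under the v1.8.3 and later behavior
print(len(truncate_id(long_id, 50)))  # length under the older behavior
```

Identifiers that differ only after the cutoff collapse to the same truncated key, which is why the length has to match the IMGT/HighV-QUEST version that produced the results.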

Version 1.0.2: January 18, 2021

AlignRecords:

  • Fixed a bug that caused the program to exit when encountering missing sequence data. It will now fail the row or group with missing data and continue.

MakeDb:

  • Added support for IgBLAST v1.17.0.

ParseDb:

  • Added a relevant error message when an input field is missing from the data.

Version 1.0.1: October 13, 2020

  • Updated to support Biopython v1.78.

  • Increased the biopython dependency to v1.71.

  • Increased the presto dependency to 0.6.2.

Version 1.0.0: May 6, 2020

  • The default output in all tools is now the AIRR Rearrangement standard (--format airr). Support for the legacy Change-O data standard is still provided through the --format changeo argument to the tools.

  • License changed to AGPL-3.

AssignGenes:

  • Added the igblast-aa subcommand to run igblastp on amino acid input.

BuildTrees:

  • Adjusted RECORDS to indicate all sequences in input file. INITIAL_FILTER now shows sequence count after initial min_seq filtering.

  • Added option to skip codon masking: --nmask.

  • Mask :, ,, ), and ( in IDs and metadata with -.

  • Can obtain germline from GERMLINE_IMGT if GERMLINE_IMGT_D_MASK not specified.

  • Can reconstruct intermediate sequences with IgPhyML using --asr.

ConvertDb:

  • Fixed a bug in the airr subcommand that caused the junction_length field to be deleted from the output.

  • Fixed a bug in the genbank subcommand that caused the junction CDS to be missing from the ASN output.

CreateGermlines:

  • Added the --cf argument to allow specification of the clone field.

MakeDb:

  • Added the igblast-aa subcommand to parse the output of igblastp.

  • Changed the log entry FUNCTIONAL to PRODUCTIVE and removed the IMGT_PASS log entry in favor of an informative ERROR entry when sequences fail the junction region validation.

  • Added the --regions argument to the igblast and igblast-aa subcommands to allow specification of the IMGT CDR/FWR region boundaries. Currently, the supported specifications are default (human, mouse) and rhesus-igl.

Version 0.4.6: July 19, 2019

BuildTrees:

  • Added capability of running IgPhyML on outputted data (--igphyml) and support for passing IgPhyML arguments through BuildTrees.

  • Added the --clean argument to force deletion of all intermediate files after IgPhyML execution.

  • Added the --format argument to allow specification of input and output in either the Change-O standard (changeo) or AIRR Rearrangement standard (airr).

CreateGermlines:

  • Fixed a bug causing incorrect reporting of the germline format in the console log.

ConvertDb:

  • Removed requirement for the NP1_LENGTH and NP2_LENGTH fields from the genbank subcommand.

DefineClones:

  • Fixed a biopython warning arising when applying --model aa to junction sequences that are not a multiple of three. The junction will now be padded with an appropriate number of Ns (usually resulting in a translation to X).
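The padding step described above can be sketched as follows. This is a minimal illustration, not the DefineClones implementation; padding at the 3' end is an assumption (the note only says that Ns are added), and the function name pad_junction is hypothetical.

```python
def pad_junction(junction, pad_char="N"):
    """Pad a nucleotide junction to a multiple of three.

    Assumption: padding is appended at the 3' end of the sequence.
    """
    remainder = len(junction) % 3
    if remainder == 0:
        return junction
    return junction + pad_char * (3 - remainder)


# Seven nucleotides are padded with two Ns to reach nine (three codons).
print(pad_junction("TGTGCGA"))
# A junction that is already a multiple of three is returned unchanged.
print(pad_junction("TGTGCG"))
```

A codon that contains padding characters cannot be resolved to a single amino acid, which is consistent with the note that the padded positions usually translate to X.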

MakeDb:

  • Added the --10x argument to all subcommands to support merging of Cell Ranger annotation data, such as UMI count and C-region assignment, with the output of the supported alignment tools.

  • Added inference of the receptor locus from the alignment data to all subcommands, which is output in the LOCUS field.

  • Combined the extended field arguments of all subcommands (--scores, --regions, --cdr3, and --junction) into a single --extended argument.

  • Removed parsing of old IgBLAST v1.5 CDR3 fields (CDR3_IGBLAST, CDR3_IGBLAST_AA).

Version 0.4.5: January 9, 2019

  • Slightly changed version number display in commandline help.

BuildTrees:

  • Fixed a bug that caused a malformed lineages.tsv output file.

CreateGermlines:

  • Fixed a bug in the CreateGermlines log output causing incorrect missing D gene or J gene error messages.

DefineClones:

  • Fixed a bug that caused a missing junction column to cluster sequences together.

MakeDb:

  • Fixed a bug that caused failed germline reconstructions to be recorded as None, rather than an empty string, in the GERMLINE_IMGT column.

Version 0.4.4: October 27, 2018

  • Fixed a bug causing the values of _start fields to be off by one from the v1.2 AIRR Schema requirement when specifying --format airr.

Version 0.4.3: October 19, 2018

  • Updated airr library requirement to v1.2.1 to fix empty V(D)J start coordinate values when specifying --format airr to tools.

  • Changed pRESTO dependency to v0.5.10.

BuildTrees:

  • New tool.

  • Converts tab-delimited database files into input for IgPhyML

CreateGermlines:

  • Now verifies that all files/folders passed to the -r argument exist.

Version 0.4.2: September 6, 2018

  • Updated support for the AIRR Rearrangement schema to v1.2 and added the associated airr library dependency.

AssignGenes:

  • New tool.

  • Provides a simple IgBLAST wrapper as the igblast subcommand.

ConvertDb:

  • The genbank subcommand will perform a check for some of the required columns in the input file and exit if they are not found.

  • Changed the behavior of the -y argument in the genbank subcommand. This argument is now restricted to sample features only, but allows for the inclusion of any BioSample attribute.

CreateGermlines:

  • Will now perform a naive verification that the reference sequences provided to the -r argument are IMGT-gapped. A warning will be issued to standard error if the reference sequences fail the check.

  • Will perform a check for some of the required columns in the input file and exit if they are not found.

MakeDb:

  • Changed the output of SEQUENCE_VDJ from the igblast subcommand to retain insertions in the query sequence rather than delete them as is done in the SEQUENCE_IMGT field.

  • Will now perform a naive verification that the reference sequences provided to the -r argument are IMGT-gapped. A warning will be issued to standard error if the reference sequences fail the check.

Version 0.4.1: July 16, 2018

  • Fixed installation incompatibility with pip 10.

  • Fixed duplicate newline issue on Windows.

  • All tools will no longer create empty pass or fail files if there are no records meeting the appropriate criteria for output.

  • Most tools now allow explicit specification of the output file name via the optional -o argument.

  • Added support for the AIRR standard TSV via the --format airr argument to all relevant tools.

  • Replaced V, D and J BTOP columns with CIGAR columns in data standard.

  • Numerous API changes and internal structural changes to commandline tools.

AlignRecords:

  • Fixed a bug arising when space characters are present in the sequence identifiers.

ConvertDb:

  • New tool.

  • Includes the airr and changeo subcommands to convert between AIRR and Change-O formatted TSV files.

  • The genbank subcommand creates MiAIRR compliant files for submission to GenBank/TLS.

  • Contains the baseline and fasta subcommands previously in ParseDb.

CreateGermlines:

  • Changed character used to pad clonal consensus sequences from . to N.

  • Changed tie resolution in clonal consensus from random V/J gene to alphabetical by sequence identifier.

  • Added the --df and --jf arguments for specifying D and J fields, respectively.

  • Added an initial sorting step when specifying --cloned so that clonally ordered input is no longer required.

DefineClones:

  • Removed the chen2010 and ademokun2011 methods and made the previous bygroup subcommand the default behavior.

  • Renamed the --f argument to --gf for consistency with other tools.

  • Added the arguments --vf and --jf to allow specification of V and J call fields, respectively.

MakeDb:

  • Renamed --noparse argument to --asis-id.

  • Added the --asis-calls argument to the igblast subcommand to allow use with non-standard gene names.

  • Added the GERMLINE_IMGT column to the default output.

  • Changed junction inference in igblast subcommand to use IgBLAST’s CDR3 assignment for IgBLAST versions greater than or equal to 1.7.0.

  • Added a verification that the SEQUENCE_IMGT and JUNCTION fields are in agreement for records to pass.

  • Changed the behavior of the igblast subcommand’s translation of the junction sequence to truncate junctions that are not multiples of 3, rather than pad to a multiple of 3 (removes trailing X character).

  • The igblast subcommand will now fail records missing the required optional fields subject seq, query seq and BTOP, rather than abort.

  • Fixed bug causing parsing of IgBLAST <= 1.4 output to fail.

ParseDb:

  • Added the merge subcommand which will combine TSV files.

  • All field arguments are now case sensitive to provide support for both the Change-O and AIRR data standards.

Version 0.3.12: February 16, 2018

MakeDb:

  • Fixed a bug wherein specifying multiple simultaneous inputs would cause duplication of parsed pRESTO fields to appear in the second and higher output files.

Version 0.3.11: February 6, 2018

MakeDb:

  • Fixed junction inference for the igblast subcommand when the J region is truncated.

Version 0.3.10: February 6, 2018

Fixed incorrect progress bars resulting from files containing empty lines.

DefineClones:

  • Fixed several bugs in the chen2010 and ademokun2011 methods that caused them to either fail or incorrectly cluster all sequences into a single clone.

  • Added informative message for out of memory error in chen2010 and ademokun2011 methods.

Version 0.3.9: October 17, 2017

DefineClones:

  • Fixed a bug causing DefineClones to fail when all sequences are removed from a group due to missing characters.

Version 0.3.8: October 5, 2017

AlignRecords:

  • Resurrected AlignRecords, which performs multiple alignment of sequence fields.

  • Added new subcommands across (multiple aligns within columns), within (multiple aligns columns within each row), and block (multiple aligns across both columns and rows).

CreateGermlines:

  • Fixed a bug causing CreateGermlines to incorrectly fail records when using the argument --vf V_CALL_GENOTYPED.

DefineClones:

  • Added the --maxmiss argument to the bygroup subcommand of DefineClones, which sets exclusion criteria for junction sequences with ambiguous and missing characters. By default, bygroup will now fail all sequences with any missing characters in the junction (--maxmiss 0).
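The exclusion rule described for --maxmiss can be sketched as a simple count-and-compare check. This is an illustration only, not the DefineClones implementation. It assumes that any character other than A, C, G, or T counts as ambiguous or missing, which the release note does not spell out, and the function names are hypothetical.

```python
# Assumption: every character outside this set counts as ambiguous or missing.
UNAMBIGUOUS = frozenset("ACGT")


def count_missing(junction):
    """Count junction characters that are not unambiguous nucleotides."""
    return sum(1 for char in junction.upper() if char not in UNAMBIGUOUS)


def passes_maxmiss(junction, maxmiss=0):
    """Return True if the junction has no more than maxmiss such characters."""
    return count_missing(junction) <= maxmiss


# With the default threshold of 0, a single N is enough to fail the sequence.
print(passes_maxmiss("TGTGCGAGA"))
print(passes_maxmiss("TGTNCGAGA"))
print(passes_maxmiss("TGTNCGAGA", maxmiss=1))
```

Raising the threshold keeps more sequences at the cost of allowing uninformative positions into the distance calculation.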

Version 0.3.7: June 30, 2017

MakeDb:

  • Fixed an incompatibility with IgBLAST v1.7.0.

CreateGermlines:

  • Fixed an error that occurs when using the --cloned argument with an input file containing duplicate values in SEQUENCE_ID that caused some records to be discarded.

Version 0.3.6: June 13, 2017

  • Fixed an overflow error on Windows that caused tools to fatally exit.

  • All tools will now print detailed help if no arguments are provided.

Version 0.3.5: May 12, 2017

Fixed a bug wherein .tsv was not being recognized as a valid extension.

MakeDb:

  • Added the --cdr3 argument to the igblast subcommand to extract the CDR3 nucleotide and amino acid sequence defined by IgBLAST.

  • Updated the IMGT/HighV-QUEST parser to handle recent column name changes.

  • Fixed a bug in the igblast parser wherein some sequence identifiers were not being processed correctly.

DefineClones:

  • Changed the way X characters are handled in the amino acid Hamming distance model to count as a match against any character.
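The wildcard behavior described above can be sketched as a Hamming distance in which X never contributes a mismatch. This is a minimal illustration rather than the DefineClones distance code; the function name aa_hamming and the example sequences are hypothetical.

```python
def aa_hamming(seq1, seq2, wildcard="X"):
    """Count mismatched positions, treating the wildcard as matching anything."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq1.upper(), seq2.upper())
        # A position is a mismatch only if the residues differ and neither is X.
        if a != b and wildcard not in (a, b)
    )


# X at the fourth position matches D, so these two sequences have distance 0.
print(aa_hamming("CARXY", "CARDY"))
# The A/T difference at the second position still counts.
print(aa_hamming("CARXY", "CTRDY"))
```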

Version 0.3.4: February 14, 2017

License changed to Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

CreateGermlines:

  • Added GERMLINE_V_CALL, GERMLINE_D_CALL and GERMLINE_J_CALL columns to the output when the --cloned argument is specified. These columns contain the consensus annotations when clonal groups contain ambiguous gene assignments.

  • Fixed the error message for an invalid repo (-r) argument.

DefineClones:

  • Deprecated the m1n and hs1f distance models, renamed them to m1n_compat and hs1f_compat, and replaced them with mk_rs1nf and hh_s1f, respectively.

  • Renamed the hs5f distance model to hh_s5f.

  • Added the mouse specific distance model mk_rs5nf from Cui et al, 2016.

MakeDb:

  • Added compatibility for IgBLAST v1.6.

  • Added the flag --partial, which tells MakeDb to pass incomplete alignment results.

  • Added missing console log entries for the ihmm subcommand.

  • IMGT/HighV-QUEST, IgBLAST and iHMMune-Align parsers have been cleaned up, better documented and moved into the iterable classes changeo.Parsers.IMGTReader, changeo.Parsers.IgBLASTReader, and changeo.Parsers.IHMMuneReader, respectively.

  • Corrected behavior of D_FRAME annotation from the --junction argument to the imgt subcommand such that it now reports no value when no value is reported by IMGT, rather than reporting the reading frame as 0 in these cases.

  • Fixed parsing of IN_FRAME, STOP, D_SEQ_START and D_SEQ_LENGTH fields from iHMMune-Align output.

  • Removed extraneous score fields from each parser.

  • Fixed the error message for an invalid repo (-r) argument.

Version 0.3.3: August 8, 2016

Increased csv.field_size_limit in changeo.IO, ParseDb and DefineClones to be able to handle files with larger number of UMIs in one field.

Renamed the fields N1_LENGTH to NP1_LENGTH and N2_LENGTH to NP2_LENGTH.

CreateGermlines:

  • Added differentiation of the N and P regions in the REGION log field if the N/P region info is present in the input file (e.g., from the --junction argument to MakeDb-imgt). If the additional N/P region columns are not present, then both N and P regions will be denoted by N, as in previous versions.

  • Added the option ‘regions’ to the -g argument to add the GERMLINE_REGIONS field to the output, which represents the germline positions as V, D, J, N and P characters. This is equivalent to the REGION log entry.

DefineClones:

  • Significantly improved performance of the --act set grouping method in the bygroup subcommand.

MakeDb:

  • Fixed a bug producing D_SEQ_START and J_SEQ_START relative to SEQUENCE_VDJ when they should be relative to SEQUENCE_INPUT.

  • Added the argument --junction to the imgt subcommand to parse additional junction information fields, including N/P region lengths and the D-segment reading frame. This provides the following additional output fields: D_FRAME, N1_LENGTH, N2_LENGTH, P3V_LENGTH, P5D_LENGTH, P3D_LENGTH, P5J_LENGTH.

  • The fields N1_LENGTH and N2_LENGTH have been renamed to accommodate adding additional output from IMGT under the --junction flag. The new names are NP1_LENGTH and NP2_LENGTH.

  • Fixed a bug that caused the IN_FRAME, MUTATED_INVARIANT and STOP fields to be parsed incorrectly from IMGT data.

  • Output from iHMMuneAlign can now be parsed via the ihmm subcommand. Note, there is insufficient information returned by iHMMuneAlign to reliably reconstruct germline sequences from the output using CreateGermlines.

ParseDb:

  • Renamed the clip subcommand to baseline.

Version 0.3.2: March 8, 2016

Fixed a bug with installation on Windows due to old file paths lingering in changeo.egg-info/SOURCES.txt.

Updated license from CC BY-NC-SA 3.0 to CC BY-NC-SA 4.0.

CreateGermlines:

  • Fixed a bug producing incorrect values in the SEQUENCE field on the log file.

MakeDb:

  • Updated the igblast subcommand to correctly parse records with indels. IgBLAST must now be run with the argument -outfmt "7 std qseq sseq btop".

  • Changed the names of the FWR and CDR output columns added with --regions to <region>_IMGT.

  • Added V_BTOP and J_BTOP output when the --scores flag is specified to the igblast subcommand.

Version 0.3.1: December 18, 2015

MakeDb:

  • Fixed bug wherein the imgt subcommand was not properly recognizing an extracted folder as input to the -i argument.

Version 0.3.0: December 4, 2015

Conversion to a proper Python package which uses pip and setuptools for installation.

The package now requires Python 3.4. Python 2.7 is no longer supported.

The required dependency versions have been bumped to numpy 1.9, scipy 0.14, pandas 0.16 and biopython 1.65.

DbCore:

  • Divided DbCore functionality into the separate modules: Defaults, Distance, IO, Multiprocessing and Receptor.

IgCore:

  • Remove IgCore in favor of dependency on pRESTO >= 0.5.0.

AnalyzeAa:

  • This tool was removed. This functionality has been migrated to the alakazam R package.

DefineClones:

  • Added --sf flag to specify sequence field to be used to calculate distance between sequences.

  • Fixed a bug wherein sequences with missing data in grouping columns were being assigned into a single group and clustered. Sequences with missing grouping variables will now be failed.

  • Fixed bug where sequences with “None” junctions were grouped together.

GapRecords:

  • This tool was removed in favor of adding IMGT gapping support to the igblast subcommand of MakeDb.

MakeDb:

  • Updated IgBLAST parser to create an IMGT gapped sequence and infer the junction region as defined by IMGT.

  • Added the --regions flag which adds extra columns containing FWR and CDR regions as defined by IMGT.

  • Added support to the imgt subcommand for the new IMGT/HighV-QUEST compression scheme (.txz files).

Version 0.2.5: August 25, 2015

CreateGermlines:

  • Removed default ‘-r’ repository and added informative error messages when invalid germline repositories are provided.

  • Updated ‘-r’ flag to take list of folders and/or fasta files with germlines.

Version 0.2.4: August 19, 2015

MakeDb:

  • Fixed a bug wherein N1 and N2 region indexing was off by one nucleotide for the igblast subcommand (leading to incorrect SEQUENCE_VDJ values).

ParseDb:

  • Fixed a bug wherein specifying the -f argument to the index subcommand would cause an error.

Version 0.2.3: July 22, 2015

DefineClones:

  • Fixed a typo in the default normalization setting of the bygroup subcommand, which was being interpreted as ‘none’ rather than ‘len’.

  • Changed the ‘hs5f’ model of the bygroup subcommand to be centered -log10 of the targeting probability.

  • Added the --sym argument to the bygroup subcommand which determines how asymmetric distances are handled.

Version 0.2.2: July 8, 2015

CreateGermlines:

  • Germline creation now works for IgBLAST output parsed with MakeDb. The argument --sf SEQUENCE_VDJ must be provided to generate germlines from IgBLAST output. The same reference database used for the IgBLAST alignment must be specified with the -r flag.

  • Fixed a bug with determination of N1 and N2 region positions.

MakeDb:

  • Combined the -z and -f flags of the imgt subcommand into a single flag, -i, which autodetects the input type.

  • Added requirement that IgBLAST input be generated using the -outfmt "7 std qseq" argument to igblastn.

  • Modified SEQUENCE_VDJ output from IgBLAST parser to include gaps inserted during alignment.

  • Added correction for IgBLAST alignments where V/D, D/J or V/J segments are assigned overlapping positions.

  • Corrected N1_LENGTH and N2_LENGTH calculation from IgBLAST output.

  • Added the --scores flag which adds extra columns containing alignment scores from IMGT and IgBLAST output.

Version 0.2.1: June 18, 2015

DefineClones:

  • Removed mouse 3-mer model, ‘m3n’.

Version 0.2.0: June 17, 2015

Initial public prerelease.

Output files were added to the usage documentation of all scripts.

General code cleanup.

DbCore:

  • Updated loading of database files to convert column names to uppercase.

AnalyzeAa:

  • Fixed a bug where junctions less than one codon long would lead to a division by zero error.

  • Added --failed flag to create database with records that fail analysis.

  • Added --sf flag to specify sequence field to be analyzed.

CreateGermlines:

  • Fixed a bug where germline sequences could not be created for light chains.

DefineClones:

  • Added a human 1-mer model, ‘hs1f’, which uses the substitution rates from Yaari et al, 2013.

  • Changed default model to ‘hs1f’ and default normalization to length for bygroup subcommand.

  • Added --link argument which allows for specification of single, complete, or average linkage during clonal clustering (default single).

GapRecords:

  • Fixed a bug wherein non-standard sequence fields could not be aligned.

MakeDb:

  • Fixed bug where the allele ‘TRGVA*01’ was not recognized as a valid allele.

ParseDb:

  • Added rename subcommand to ParseDb which renames fields.

Version 0.2.0.beta-2015-05-31: May 31, 2015

Minor changes to a few output file names and log field entries.

ParseDb:

  • Added index subcommand to ParseDb which adds a numeric index field.

Version 0.2.0.beta-2015-05-05: May 05, 2015

Prerelease for review.

Commandline Usage

AlignRecords.py

Multiple aligns sequence fields

usage: AlignRecords.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
align-pass

database with multiple aligned sequences.

align-fail

database with records failing alignment.

required fields:

sequence_id, v_call, j_call

<field>

user specified sequence fields to align.

output fields:

<field>_align

AlignRecords.py across

usage: AlignRecords.py across [--version] [-h] -d DB_FILES [DB_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--log LOG_FILE] [--failed]
                              [--format {airr,changeo}] [--nproc NPROC] --sf
                              SEQ_FIELDS [SEQ_FIELDS ...]
                              [--gf GROUP_FIELDS [GROUP_FIELDS ...]]
                              [--calls {v,d,j} [{v,d,j} ...]]
                              [--mode {allele,gene}] [--act {first}]
                              [--exec MUSCLE_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

--sf <seq_fields>

The sequence fields to multiple align within each group.

--gf <group_fields>

Additional (not allele call) fields to use for grouping.

--calls {v,d,j}

Segment calls (allele assignments) to use for grouping.

--mode {allele,gene}

Specifies whether to use the V(D)J allele or gene when an allele call field (--calls) is specified.

--act {first}

Specifies how to handle multiple values within default allele call fields. Currently, only “first” is supported.

--exec <muscle_exec>

The location of the MUSCLE executable
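The --calls, --mode and --act options above decide which value of a segment call is used to group records. The following is a minimal sketch of that reduction, not Change-O's own parser. It assumes IMGT-style names such as IGHV1-69*01 and comma-separated ambiguous calls; the function name call_key is hypothetical.

```python
def call_key(call, mode="gene", act="first"):
    """Reduce a V(D)J call string to a grouping key.

    Assumption: calls use IMGT-style names (gene*allele) and ambiguous
    calls are separated by commas. Illustration only.
    """
    if act != "first":
        raise ValueError("only the 'first' action is illustrated here")
    # Keep the first of possibly several comma-separated calls.
    first = call.split(",")[0].strip()
    # In gene mode, drop the allele suffix that follows '*'.
    return first.split("*")[0] if mode == "gene" else first


example = "IGHV1-69*01,IGHV1-69*02"
gene_key = call_key(example, mode="gene")      # gene-level key
allele_key = call_key(example, mode="allele")  # allele-level key
```

With --mode gene, records that differ only in allele share a key and fall into the same alignment group; with --mode allele they are kept apart.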

AlignRecords.py block

usage: AlignRecords.py block [--version] [-h] -d DB_FILES [DB_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                             [--format {airr,changeo}] [--nproc NPROC] --sf
                             SEQ_FIELDS [SEQ_FIELDS ...]
                             [--gf GROUP_FIELDS [GROUP_FIELDS ...]]
                             [--calls {v,d,j} [{v,d,j} ...]]
                             [--mode {allele,gene}] [--act {first}]
                             [--exec MUSCLE_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

--sf <seq_fields>

The sequence fields to multiple align within each group.

--gf <group_fields>

Additional (not allele call) fields to use for grouping.

--calls {v,d,j}

Segment calls (allele assignments) to use for grouping.

--mode {allele,gene}

Specifies whether to use the V(D)J allele or gene when an allele call field (--calls) is specified.

--act {first}

Specifies how to handle multiple values within default allele call fields. Currently, only “first” is supported.

--exec <muscle_exec>

The location of the MUSCLE executable

AlignRecords.py within

usage: AlignRecords.py within [--version] [-h] -d DB_FILES [DB_FILES ...]
                              [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--log LOG_FILE] [--failed]
                              [--format {airr,changeo}] [--nproc NPROC] --sf
                              SEQ_FIELDS [SEQ_FIELDS ...] [--exec MUSCLE_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

--sf <seq_fields>

The sequence fields to multiple align within each record.

--exec <muscle_exec>

The location of the MUSCLE executable

AssignGenes.py

Assign V(D)J gene annotations

usage: AssignGenes.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
igblast

Reference alignment results from IgBLAST.

AssignGenes.py igblast

Executes igblastn.

usage: AssignGenes.py igblast [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
                              [--outdir OUT_DIR] [--outname OUT_NAME]
                              [--nproc NPROC] -s SEQ_FILES [SEQ_FILES ...] -b
                              IGDATA
                              [--organism {human,mouse,rabbit,rat,rhesus_monkey}]
                              [--loci {ig,tr}] [--vdb VDB] [--ddb DDB]
                              [--jdb JDB] [--format {blast,airr}]
                              [--exec IGBLAST_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-s <seq_files>

A list of FASTA files containing sequences to process.

-b <igdata>

IgBLAST database directory (IGDATA).

--organism {human,mouse,rabbit,rat,rhesus_monkey}

Organism name.

--loci {ig,tr}

The receptor type.

--vdb <vdb>

Name of the custom V reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_v will be used.

--ddb <ddb>

Name of the custom D reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_d will be used.

--jdb <jdb>

Name of the custom J reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_j will be used.

--format {blast,airr}

Specify the output format. Specifying “blast” will result in the IgBLAST “-outfmt 7 std qseq sseq btop” output format. Specifying “airr” will output the AIRR TSV format provided by the IgBLAST argument “-outfmt 19”.

--exec <igblast_exec>

Path to the igblastn executable.
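The --format choice above corresponds to an igblastn -outfmt value. A small lookup makes the correspondence explicit; this is a sketch based only on the description of --format, and the names OUTFMT and outfmt_args are hypothetical rather than part of Change-O.

```python
# Mapping taken from the --format description:
#   blast -> -outfmt "7 std qseq sseq btop"
#   airr  -> -outfmt 19
OUTFMT = {
    "blast": "7 std qseq sseq btop",
    "airr": "19",
}


def outfmt_args(fmt):
    """Return the igblastn arguments implied by a --format choice."""
    if fmt not in OUTFMT:
        raise ValueError(f"unsupported format: {fmt!r}")
    # The value is passed as one argument, so the spaces need no extra quoting
    # when the list is handed to a process launcher.
    return ["-outfmt", OUTFMT[fmt]]


blast_args = outfmt_args("blast")
airr_args = outfmt_args("airr")
```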

AssignGenes.py igblast-aa

Executes igblastp.

usage: AssignGenes.py igblast-aa [--version] [-h]
                                 [-o OUT_FILES [OUT_FILES ...]]
                                 [--outdir OUT_DIR] [--outname OUT_NAME]
                                 [--nproc NPROC] -s SEQ_FILES [SEQ_FILES ...]
                                 -b IGDATA
                                 [--organism {human,mouse,rabbit,rat,rhesus_monkey}]
                                 [--loci {ig,tr}] [--vdb VDB]
                                 [--exec IGBLAST_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

-s <seq_files>

A list of FASTA files containing sequences to process.

-b <igdata>

IgBLAST database directory (IGDATA).

--organism {human,mouse,rabbit,rat,rhesus_monkey}

Organism name.

--loci {ig,tr}

The receptor type.

--vdb <vdb>

Name of the custom V reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_aa_<organism>_<loci>_v will be used.

--exec <igblast_exec>

Path to the igblastp executable.
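When --vdb, --ddb or --jdb are omitted, the default database names follow the patterns imgt_<organism>_<loci>_<segment> for igblast and imgt_aa_<organism>_<loci>_v for igblast-aa. The sketch below only reproduces that naming pattern; the helper default_db is hypothetical and does not check that the database exists.

```python
def default_db(organism, loci, segment, amino_acid=False):
    """Build the default reference database name described for AssignGenes.py.

    segment is one of 'v', 'd' or 'j'; amino_acid=True gives the igblast-aa
    form, which the documentation lists only for the V segment.
    """
    prefix = "imgt_aa" if amino_acid else "imgt"
    return f"{prefix}_{organism}_{loci}_{segment}"


nt_name = default_db("human", "ig", "v")
aa_name = default_db("mouse", "tr", "v", amino_acid=True)
```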

BuildTrees.py

Converts TSV files into IgPhyML input files

usage: BuildTrees.py [--version] [-h] -d DB_FILES [DB_FILES ...]
                     [--outdir OUT_DIR] [--outname OUT_NAME] [--log LOG_FILE]
                     [--failed] [--format {airr,changeo}] [--collapse]
                     [--ncdr3] [--nmask] [--md META_DATA [META_DATA ...]]
                     [--clones TARGET_CLONES [TARGET_CLONES ...]]
                     [--minseq MIN_SEQ] [--sample SAMPLE_DEPTH]
                     [--append APPEND [APPEND ...]] [--igphyml]
                     [--nproc NPROC] [--clean {none,all}]
                     [--optimize {n,r,l,lr,tl,tlr}] [--omega OMEGA] [-t KAPPA]
                     [--motifs MOTIFS] [--hotness HOTNESS]
                     [--oformat {tab,txt}] [--nohlp] [--asr ASR]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

--collapse

If specified, collapse identical sequences before exporting to fasta.

--ncdr3

If specified, remove CDR3 from all sequences.

--nmask

If specified, do not attempt to mask split codons.

--md <meta_data>

List of fields containing metadata to include in output fasta file sequence headers.

--clones <target_clones>

List of clone IDs to output, if specified.

--minseq <min_seq>

Minimum number of data sequences. Any clones with fewer than the specified number of sequences will be excluded.

--sample <sample_depth>

Depth of reads to be subsampled (before deduplication).

--append <append>

List of columns to append to sequence ID to ensure uniqueness.

--igphyml

Run IgPhyML on output?

--nproc <nproc>

Number of threads to parallelize IgPhyML across.

--clean {none,all}

Delete intermediate files? none: leave all intermediate files; all: delete all intermediate files.

--optimize {n,r,l,lr,tl,tlr}

Optimize combination of topology (t), branch lengths (l), and parameters (r), or nothing (n), for IgPhyML.

--omega <omega>

Omega parameters to estimate for FWR,CDR respectively: e = estimate, ce = estimate + confidence interval, or numeric value

-t <kappa>

Kappa parameters to estimate: e = estimate, ce = estimate + confidence interval, or numeric value

--motifs <motifs>

Motifs for which to estimate mutability.

--hotness <hotness>

Mutability parameters to estimate: e = estimate, ce = estimate + confidence interval, or numeric value

--oformat {tab,txt}

IgPhyML output format.

--nohlp

Don’t run HLP model?

--asr <asr>

Ancestral sequence reconstruction interval (0-1).
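The --omega, -t and --hotness options share one value syntax: e (estimate), ce (estimate with a confidence interval), or a numeric value. The sketch below classifies such values. It is an illustration, not BuildTrees.py code; it assumes that the FWR and CDR entries for --omega are given as a comma-separated pair, and the function names are hypothetical.

```python
def parse_setting(token):
    """Classify one value: 'e', 'ce', or a fixed number."""
    token = token.strip()
    if token == "e":
        return {"estimate": True, "interval": False, "value": None}
    if token == "ce":
        return {"estimate": True, "interval": True, "value": None}
    # Anything else must be numeric; float() raises ValueError otherwise.
    return {"estimate": False, "interval": False, "value": float(token)}


def parse_omega(text):
    """Split an omega setting into its FWR and CDR parts.

    Assumption: the two entries are separated by a comma, FWR first.
    """
    fwr, cdr = text.split(",")
    return {"fwr": parse_setting(fwr), "cdr": parse_setting(cdr)}


omega = parse_omega("e,ce")
kappa = parse_setting("0.4")
```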

output files:
<folder>

folder containing fasta and partition files for each clone.

lineages

successfully processed records.

lineages-fail

database records that failed processing.

igphyml-pass

parameter estimates and lineage trees from running IgPhyML, if specified

required fields:

sequence_id, sequence, sequence_alignment, germline_alignment_d_mask or germline_alignment, v_call, j_call, clone_id, v_sequence_start
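Before running BuildTrees.py it can help to confirm that the input TSV carries the required fields listed above, bearing in mind that either germline_alignment_d_mask or germline_alignment satisfies the germline requirement. The following is a minimal sketch of such a check; the function name and the sample header are hypothetical.

```python
import csv
import io

# Required fields from the list above, excluding the germline alternatives.
REQUIRED = [
    "sequence_id", "sequence", "sequence_alignment",
    "v_call", "j_call", "clone_id", "v_sequence_start",
]
# Either one of these satisfies the germline requirement.
GERMLINE_ALTERNATIVES = ["germline_alignment_d_mask", "germline_alignment"]


def missing_fields(handle):
    """Return the required fields absent from a tab-delimited header row."""
    header = next(csv.reader(handle, delimiter="\t"))
    missing = [name for name in REQUIRED if name not in header]
    if not any(name in header for name in GERMLINE_ALTERNATIVES):
        missing.append(" or ".join(GERMLINE_ALTERNATIVES))
    return missing


# Hypothetical header that lacks clone_id.
sample = io.StringIO(
    "sequence_id\tsequence\tsequence_alignment\tgermline_alignment\t"
    "v_call\tj_call\tv_sequence_start\n"
)
result = missing_fields(sample)
```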

ConvertDb.py

Parses tab delimited database files

usage: ConvertDb.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
airr

AIRR formatted database files.

changeo

Change-O formatted database files.

sequences

FASTA formatted sequences output from the subcommands fasta and baseline.

genbank

feature tables and fasta files containing MiAIRR compliant input for tbl2asn.

required fields:

sequence_id, sequence, sequence_alignment, junction, v_call, d_call, j_call, v_germline_start, v_germline_end, v_sequence_start, v_sequence_end, d_sequence_start, d_sequence_end, j_sequence_start, j_sequence_end

optional fields:

germline_alignment, c_call, clone_id

ConvertDb.py airr

Converts input to an AIRR TSV file.

usage: ConvertDb.py airr [--version] [-h] -d DB_FILES [DB_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

ConvertDb.py baseline

Creates a BASELINe fasta file from database records.

usage: ConvertDb.py baseline [--version] [-h] -d DB_FILES [DB_FILES ...]
                             [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                             [--outname OUT_NAME] [--if ID_FIELD]
                             [--sf SEQ_FIELD] [--gf GERM_FIELD]
                             [--cf CLUSTER_FIELD]
                             [--mf META_FIELDS [META_FIELDS ...]]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--if <id_field>

The name of the field containing identifiers

--sf <seq_field>

The name of the field containing reads

--gf <germ_field>

The name of the field containing germline sequences

--cf <cluster_field>

The name of the field containing sorted clone IDs

--mf <meta_fields>

List of annotation fields to add to the sequence description

ConvertDb.py changeo

Converts input into a Change-O TSV file.

usage: ConvertDb.py changeo [--version] [-h] -d DB_FILES [DB_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

ConvertDb.py fasta

Creates a fasta file from database records.

usage: ConvertDb.py fasta [--version] [-h] -d DB_FILES [DB_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--if ID_FIELD]
                          [--sf SEQ_FIELD]
                          [--mf META_FIELDS [META_FIELDS ...]]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--if <id_field>

The name of the field containing identifiers

--sf <seq_field>

The name of the field containing sequences

--mf <meta_fields>

List of annotation fields to add to the sequence description

ConvertDb.py genbank

Creates files for GenBank/TLS submissions.

usage: ConvertDb.py genbank [--version] [-h] -d DB_FILES [DB_FILES ...]
                            [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                            [--outname OUT_NAME] [--format {airr,changeo}]
                            [--mol MOLECULE] [--product PRODUCT]
                            [--db DB_XREF] [--inf INFERENCE]
                            [--organism ORGANISM] [--sex SEX]
                            [--isolate ISOLATE] [--tissue TISSUE]
                            [--cell-type CELL_TYPE] [-y YAML_CONFIG]
                            [--label LABEL] [--cf C_FIELD] [--nf COUNT_FIELD]
                            [--if INDEX_FIELD] [--allow-stop] [--asis-id]
                            [--asis-calls] [--allele-delim ALLELE_DELIM]
                            [--asn] [--sbt ASN_TEMPLATE] [--exec TBL2ASN_EXEC]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

--mol <molecule>

The source molecule type. Usually one of “mRNA” or “genomic DNA”.

--product <product>

The product name, such as “immunoglobulin heavy chain”.

--db <db_xref>

Name of the reference database used for alignment. Usually “IMGT/GENE-DB”.

--inf <inference>

Name and version of the inference tool used for reference alignment in the form tool:version.

--organism <organism>

The scientific name of the organism.

--sex <sex>

If specified, adds the given sex annotation to the fasta headers.

--isolate <isolate>

If specified, adds the given isolate annotation (sample label) to the fasta headers.

--tissue <tissue>

If specified, adds the given tissue-type annotation to the fasta headers.

--cell-type <cell_type>

If specified, adds the given cell-type annotation to the fasta headers.

-y <yaml_config>

A yaml file specifying sample features (BioSample attributes) in the form ‘variable: value’. If specified, any features provided in the yaml file will override those provided at the commandline. Note, this config file applies to sample features only and cannot be used for required source features such as the --product or --mol argument.

--label <label>

If specified, add a field name to the sequence identifier. Sequence identifiers will be output in the form <label>=<id>.

--cf <c_field>

Field containing the C region call. If unspecified, the C region gene call will be excluded from the feature table.

--nf <count_field>

If specified, use the provided column to add the AIRR_READ_COUNT note to the feature table.

--if <index_field>

If specified, use the provided column to add the AIRR_CELL_INDEX note to the feature table.

--allow-stop

If specified, retain records in the output with stop codons in the junction region. In such records the CDS will be removed and replaced with a similar misc_feature in the feature table.

--asis-id

If specified, use the existing sequence identifier for the output identifier. By default, only the row number will be used as the identifier to avoid the 50 character limit.

--asis-calls

Specify to prevent alleles from being parsed using the IMGT nomenclature. Note, this requires the gene assignments to be exact matches to valid records in the reference database specified by the --db argument.

--allele-delim <allele_delim>

The delimiter to use for splitting the gene name from the allele number. Note, this only applies when specifying --asis-calls. By default, this argument will be ignored and allele numbers extracted under the expectation of IMGT nomenclature consistency.

--asn

If specified, run tbl2asn to generate the .sqn submission file after making the .fsa and .tbl files.

--sbt <asn_template>

If provided along with --asn, use the specified file for the template file argument to tbl2asn.

--exec <tbl2asn_exec>

The name or location of the tbl2asn executable.
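Several genbank arguments take values containing spaces or punctuation, so quoting matters when the call is assembled by a script. The sketch below builds a command line from the example values given in the option descriptions and lets shlex.join handle the quoting. The input file name and the IgBLAST version are placeholders, and nothing is executed.

```python
import shlex

# Argument values are the examples quoted in the option descriptions above.
args = [
    "ConvertDb.py", "genbank",
    "-d", "sample_db-pass.tsv",                 # hypothetical input file
    "--mol", "mRNA",
    "--product", "immunoglobulin heavy chain",  # contains spaces
    "--db", "IMGT/GENE-DB",
    "--inf", "IgBLAST:1.14.0",                  # tool:version, placeholder version
]

# shlex.join quotes only the arguments that need it (Python 3.8 or later).
command = shlex.join(args)
```

Passing the args list directly to subprocess.run avoids shell quoting altogether; the joined string is mainly useful for logging or for pasting into a terminal.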

CreateGermlines.py

Reconstructs germline sequences from alignment data

usage: CreateGermlines.py [--version] [-h] -d DB_FILES [DB_FILES ...]
                          [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                          [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                          [--format {airr,changeo}] -r REFERENCES
                          [REFERENCES ...]
                          [-g {full,dmask,vonly,regions} [{full,dmask,vonly,regions} ...]]
                          [--cloned] [--sf SEQ_FIELD] [--vf V_FIELD]
                          [--df D_FIELD] [--jf J_FIELD] [--cf CLONE_FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Specify to change the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

-r <references>

List of folders and/or fasta files (with .fasta, .fna or .fa extension) with germline sequences. When using the default Change-O sequence and coordinate fields, these reference sequences must contain IMGT-numbering spacers (gaps) in the V segment. Alternative numbering schemes, or no numbering, may work for alternative sequence and coordinate definitions that define a valid alignment, but a warning will be issued.

-g {full,dmask,vonly,regions}

Specify the type(s) of germlines to include: full germline, germline with D segment masked, or germline for V segment only.

--cloned

Specify to create only one germline per clone. Note, if allele calls are ambiguous within a clonal group, this will place the germline call used for the entire clone within the germline_v_call, germline_d_call and germline_j_call fields.

--sf <seq_field>

Field containing the aligned sequence. Defaults to sequence_alignment (airr) or SEQUENCE_IMGT (changeo).

--vf <v_field>

Field containing the germline V segment call. Defaults to v_call (airr) or V_CALL (changeo).

--df <d_field>

Field containing the germline D segment call. Defaults to d_call (airr) or D_CALL (changeo).

--jf <j_field>

Field containing the germline J segment call. Defaults to j_call (airr) or J_CALL (changeo).

--cf <clone_field>

Field containing clone identifiers. Ignored if --cloned is not also specified. Defaults to clone_id (airr) or CLONE (changeo).
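The -r argument accepts a mix of folders and fasta files with a .fasta, .fna or .fa extension. The sketch below shows one way such a list could be expanded into individual files. It is not the CreateGermlines.py implementation: the function name is hypothetical, and the choices to search folders non-recursively and to match extensions case-sensitively are assumptions.

```python
import tempfile
from pathlib import Path

# Extensions named in the description of -r.
EXTENSIONS = {".fasta", ".fna", ".fa"}


def collect_references(paths):
    """Expand folders and files into a list of reference fasta files."""
    files = []
    for path in map(Path, paths):
        if path.is_dir():
            # Assumption: only the top level of each folder is searched.
            files.extend(sorted(
                f for f in path.iterdir()
                if f.is_file() and f.suffix in EXTENSIONS
            ))
        elif path.is_file() and path.suffix in EXTENSIONS:
            files.append(path)
    return files


# Demonstration with placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "example_v.fasta").write_text(">example_v\nACGT\n")
    (root / "notes.txt").write_text("not a reference\n")
    found = [f.name for f in collect_references([root])]
```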

output files:
germ-pass

database with assigned germline sequences.

germ-fail

database with records failing germline assignment.

required fields:

sequence_id, sequence_alignment, v_call, d_call, j_call, v_sequence_start, v_sequence_end, v_germline_start, v_germline_end, d_sequence_start, d_sequence_end, d_germline_start, d_germline_end, j_sequence_start, j_sequence_end, j_germline_start, j_germline_end, np1_length, np2_length

optional fields:

n1_length, n2_length, p3v_length, p5d_length, p3d_length, p5j_length, clone_id

output fields:

germline_v_call, germline_d_call, germline_j_call, germline_alignment, germline_alignment_d_mask, germline_alignment_v_region, germline_regions

DefineClones.py

Assign Ig sequences into clones

usage: DefineClones.py [--version] [-h] -d DB_FILES [DB_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] [--log LOG_FILE] [--failed]
                       [--format {airr,changeo}] [--nproc NPROC]
                       [--sf SEQ_FIELD] [--vf V_FIELD] [--jf J_FIELD]
                       [--gf GROUP_FIELDS [GROUP_FIELDS ...]]
                       [--mode {allele,gene}] [--act {first,set}]
                       [--model {ham,aa,hh_s1f,hh_s5f,mk_rs1nf,mk_rs5nf,hs1f_compat,m1n_compat}]
                       [--dist DISTANCE] [--norm {len,mut,none}]
                       [--sym {avg,min}] [--link {single,average,complete}]
                       [--maxmiss MAX_MISSING]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified, create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

--nproc <nproc>

The number of simultaneous computational processes to execute (CPU cores to utilize).

--sf <seq_field>

Field to be used to calculate distance between records. Defaults to junction (airr) or JUNCTION (changeo).

--vf <v_field>

Field containing the germline V segment call. Defaults to v_call (airr) or V_CALL (changeo).

--jf <j_field>

Field containing the germline J segment call. Defaults to j_call (airr) or J_CALL (changeo).

--gf <group_fields>

Additional fields to use for grouping clones aside from V, J and junction length.

--mode {allele,gene}

Specifies whether to use the V(D)J allele or gene for initial grouping.

--act {first,set}

Specifies how to handle multiple V(D)J assignments for initial grouping. The “first” action will use only the first gene listed. The “set” action will use all gene assignments and construct a larger gene grouping composed of any sequences sharing an assignment or linked to another sequence by a common assignment (similar to single-linkage).
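The "set" action described above behaves like single-linkage grouping over shared gene assignments. The following is a minimal sketch of that idea in plain Python, not the DefineClones.py implementation; the function name group_by_shared_calls and the example records are hypothetical.

```python
from collections import defaultdict

def group_by_shared_calls(records):
    """Group record ids that share any gene call, directly or transitively.

    records: dict mapping a record id to the set of gene names assigned to it.
    Returns a list of sets of record ids.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Walk up to the root of the group, shortening the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # gene name -> first record id that carried it
    for rid, genes in records.items():
        for gene in genes:
            if gene in first_seen:
                # Merge this record's group with the group that has the gene
                parent[find(rid)] = find(first_seen[gene])
            else:
                first_seen[gene] = rid

    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return list(groups.values())

# Hypothetical records with ambiguous V gene assignments
records = {
    "seq1": {"IGHV1-2"},
    "seq2": {"IGHV1-2", "IGHV1-3"},
    "seq3": {"IGHV1-3"},
    "seq4": {"IGHV3-23"},
}
groups = group_by_shared_calls(records)
```

Here seq1 and seq3 share no gene directly, but both are linked through seq2, so all three fall into one group. Under the "first" action each record would contribute only its first listed gene.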

--model {ham,aa,hh_s1f,hh_s5f,mk_rs1nf,mk_rs5nf,hs1f_compat,m1n_compat}

Specifies which substitution model to use for calculating distance between sequences. The “ham” model is nucleotide Hamming distance and “aa” is amino acid Hamming distance. The “hh_s1f” and “hh_s5f” models are human specific single nucleotide and 5-mer content models, respectively, from Yaari et al, 2013. The “mk_rs1nf” and “mk_rs5nf” models are mouse specific single nucleotide and 5-mer content models, respectively, from Cui et al, 2016. The “m1n_compat” and “hs1f_compat” models are deprecated models provided for backwards compatibility with the “m1n” and “hs1f” models in Change-O v0.3.3 and SHazaM v0.1.4. Both 5-mer models should be considered experimental.

--dist <distance>

The distance threshold for clonal grouping.

--norm {len,mut,none}

Specifies how to normalize distances. One of none (do not normalize), len (normalize by length), or mut (normalize by number of mutations between sequences).

--sym {avg,min}

Specifies how to combine asymmetric distances. One of avg (average of A->B and B->A) or min (minimum of A->B and B->A).
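As a rough illustration of the --norm and --sym options, the sketch below computes a nucleotide Hamming distance, normalizes it by sequence length, and combines two directional distances. It is an assumption-laden example rather than the tool's code: the helper names are hypothetical, and the mut normalization is omitted because its exact definition is not given here.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def normalize(dist, seq_length, norm="len"):
    """Normalize a raw distance: 'none' leaves it as is, 'len' divides by length."""
    if norm == "none":
        return float(dist)
    if norm == "len":
        if seq_length == 0:
            raise ValueError("cannot normalize by a length of zero")
        return dist / seq_length
    raise ValueError("this sketch only covers 'len' and 'none'")

def symmetrize(d_ab, d_ba, sym="avg"):
    """Combine the A->B and B->A distances of an asymmetric model."""
    if sym == "avg":
        return (d_ab + d_ba) / 2
    if sym == "min":
        return min(d_ab, d_ba)
    raise ValueError("sym must be 'avg' or 'min'")

# Two hypothetical junctions differing at one of nine positions
junction_a = "TGTGCAAGA"
junction_b = "TGTGCGAGA"
d = normalize(hamming(junction_a, junction_b), len(junction_a), norm="len")
```

With length normalization, a threshold passed to --dist is read as a fraction of mismatched positions rather than an absolute count. Hamming distance is already symmetric, so symmetrize only matters for models where the A->B and B->A distances differ.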

--link {single,average,complete}

Type of linkage to use for hierarchical clustering.

--maxmiss <max_missing>

The maximum number of non-ACGT characters (gaps or Ns) to permit in the junction sequence before excluding the record from clonal assignment. Note, under single linkage non-informative positions can create artifactual links between unrelated sequences. Use with caution.
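The --maxmiss filter can be pictured as a simple count of non-ACGT characters in the junction. This is a minimal sketch under that reading; the function names are hypothetical, and the default of 0 used here is an assumption for the example, not a documented default.

```python
def count_missing(junction):
    """Count characters other than A, C, G or T (for example gaps or Ns)."""
    return sum(ch not in "ACGT" for ch in junction.upper())

def passes_maxmiss(junction, max_missing=0):
    """Return True if the junction has at most max_missing non-ACGT characters.

    max_missing=0 is an illustrative default chosen for this sketch.
    """
    return count_missing(junction) <= max_missing

# Hypothetical junction with three Ns and one gap
junction = "TGTNNNAC-G"
n_missing = count_missing(junction)
```

A record such as the one above would be excluded at max_missing=3 but retained at max_missing=4.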

output files:
clone-pass

database with assigned clonal group numbers.

clone-fail

database with records failing clonal grouping.

required fields:

sequence_id, v_call, j_call, junction

output fields:

clone_id
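For scripted pipelines, the arguments listed above can be assembled into a command line from Python. The sketch below only builds the argument list and uses flags shown in the usage statement. The input file name, the threshold value and the chosen option values are placeholders for illustration, not recommended settings or the tool's defaults.

```python
import shlex

def define_clones_command(db_file, distance, model="ham", norm="len",
                          mode="gene", act="set"):
    """Build an argument list for DefineClones.py.

    The keyword defaults are choices made for this example only.
    """
    return [
        "DefineClones.py",
        "-d", db_file,
        "--format", "airr",
        "--mode", mode,
        "--act", act,
        "--model", model,
        "--norm", norm,
        "--dist", str(distance),
    ]

# "sample_db-pass.tsv" and 0.15 are hypothetical values
cmd = define_clones_command("sample_db-pass.tsv", 0.15)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once DefineClones.py is on the PATH.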

MakeDb.py

Create tab-delimited database file to store sequence alignment information

usage: MakeDb.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
db-pass

database of alignment records with functionality information, V and J calls, and a junction region.

db-fail

database with records that fail due to no productivity information, no V gene assignment, no J gene assignment, or no junction region.

universal output fields:

sequence_id, sequence, sequence_alignment, germline_alignment, rev_comp, productive, stop_codon, vj_in_frame, locus, v_call, d_call, j_call, junction, junction_length, junction_aa, v_sequence_start, v_sequence_end, v_germline_start, v_germline_end, d_sequence_start, d_sequence_end, d_germline_start, d_germline_end, j_sequence_start, j_sequence_end, j_germline_start, j_germline_end, np1_length, np2_length, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2, cdr3

imgt specific output fields:

n1_length, n2_length, p3v_length, p5d_length, p3d_length, p5j_length, d_frame, v_score, v_identity, d_score, d_identity, j_score, j_identity

igblast specific output fields:

v_score, v_identity, v_support, v_cigar, d_score, d_identity, d_support, d_cigar, j_score, j_identity, j_support, j_cigar

ihmm specific output fields:

vdj_score

10X specific output fields:

cell_id, c_call, consensus_count, umi_count, v_call_10x, d_call_10x, j_call_10x, junction_10x, junction_10x_aa

MakeDb.py igblast

Process igblastn output.

usage: MakeDb.py igblast [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
                         [--outdir OUT_DIR] [--outname OUT_NAME]
                         [--log LOG_FILE] [--failed] [--format {airr,changeo}]
                         -i ALIGNER_FILES [ALIGNER_FILES ...] -r REPO
                         [REPO ...] -s SEQ_FILES [SEQ_FILES ...]
                         [--10x CELLRANGER_FILE [CELLRANGER_FILE ...]]
                         [--asis-id] [--asis-calls] [--partial] [--extended]
                         [--regions {default,rhesus-igl}] [--infer-junction]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified, create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

-i <aligner_files>

IgBLAST output files in format 7 with query sequence (igblastn argument '-outfmt "7 std qseq sseq btop"').

-r <repo>

List of folders and/or fasta files containing the same germline set used in the IgBLAST alignment. These reference sequences must contain IMGT-numbering spacers (gaps) in the V segment.

-s <seq_files>

List of input FASTA files (with .fasta, .fna or .fa extension) containing sequences.

--10x <cellranger_file>

Table file containing 10X annotations (with .csv or .tsv extension).

--asis-id

Specify to prevent input sequence headers from being parsed to add new columns to database. Parsing of sequence headers requires headers to be in the pRESTO annotation format, so this should be specified when sequence headers are incompatible with the pRESTO annotation scheme. Note, unrecognized header formats will default to this behavior.

--asis-calls

Specify to prevent gene calls from being parsed into standard allele names in both the IgBLAST output and reference database. Note, this requires the sequence identifiers in the reference sequence set and the IgBLAST database to be exact string matches.

--partial

If specified, include incomplete V(D)J alignments in the pass file instead of the fail file. An incomplete alignment is defined as a record for which a valid IMGT-gapped sequence cannot be built or that is missing a V gene assignment, J gene assignment, junction region, or productivity call.

--extended

Specify to include additional aligner specific fields in the output. Adds <vdj>_score, <vdj>_identity, <vdj>_support, <vdj>_cigar, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2 and cdr3.

--regions {default,rhesus-igl}

IMGT CDR and FWR boundary definition to use.

--infer-junction

Infer the junction sequence. For use with IgBLAST v1.6.0 or older, prior to the addition of IMGT-CDR3 inference.

MakeDb.py igblast-aa

Process igblastp output.

usage: MakeDb.py igblast-aa [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
                            [--outdir OUT_DIR] [--outname OUT_NAME]
                            [--log LOG_FILE] [--failed]
                            [--format {airr,changeo}] -i ALIGNER_FILES
                            [ALIGNER_FILES ...] -r REPO [REPO ...] -s
                            SEQ_FILES [SEQ_FILES ...]
                            [--10x CELLRANGER_FILE [CELLRANGER_FILE ...]]
                            [--asis-id] [--asis-calls] [--extended]
                            [--regions {default,rhesus-igl}]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified, create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

-i <aligner_files>

IgBLAST output files in format 7 with query sequence (igblastp argument '-outfmt "7 std qseq sseq btop"').

-r <repo>

List of folders and/or fasta files containing the same germline set used in the IgBLAST alignment. These reference sequences must contain IMGT-numbering spacers (gaps) in the V segment.

-s <seq_files>

List of input FASTA files (with .fasta, .fna or .fa extension) containing sequences.

--10x <cellranger_file>

Table file containing 10X annotations (with .csv or .tsv extension).

--asis-id

Specify to prevent input sequence headers from being parsed to add new columns to database. Parsing of sequence headers requires headers to be in the pRESTO annotation format, so this should be specified when sequence headers are incompatible with the pRESTO annotation scheme. Note, unrecognized header formats will default to this behavior.

--asis-calls

Specify to prevent gene calls from being parsed into standard allele names in both the IgBLAST output and reference database. Note, this requires the sequence identifiers in the reference sequence set and the IgBLAST database to be exact string matches.

--extended

Specify to include additional aligner specific fields in the output. Adds v_score, v_identity, v_support, v_cigar, fwr1, fwr2, fwr3, cdr1 and cdr2.

--regions {default,rhesus-igl}

IMGT CDR and FWR boundary definition to use.

MakeDb.py ihmm

Process iHMMune-Align output.

usage: MakeDb.py ihmm [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
                      [--outdir OUT_DIR] [--outname OUT_NAME] [--log LOG_FILE]
                      [--failed] [--format {airr,changeo}] -i ALIGNER_FILES
                      [ALIGNER_FILES ...] -r REPO [REPO ...] -s SEQ_FILES
                      [SEQ_FILES ...]
                      [--10x CELLRANGER_FILE [CELLRANGER_FILE ...]]
                      [--asis-id] [--partial] [--extended]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified, create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

-i <aligner_files>

iHMMune-Align output file.

-r <repo>

List of folders and/or FASTA files containing the set of germline sequences used by iHMMune-Align. These reference sequences must contain IMGT-numbering spacers (gaps) in the V segment.

-s <seq_files>

List of input FASTA files (with .fasta, .fna or .fa extension) containing sequences.

--10x <cellranger_file>

Table file containing 10X annotations (with .csv or .tsv extension).

--asis-id

Specify to prevent input sequence headers from being parsed to add new columns to database. Parsing of sequence headers requires headers to be in the pRESTO annotation format, so this should be specified when sequence headers are incompatible with the pRESTO annotation scheme. Note, unrecognized header formats will default to this behavior.

--partial

If specified, include incomplete V(D)J alignments in the pass file instead of the fail file. An incomplete alignment is defined as a record for which a valid IMGT-gapped sequence cannot be built or that is missing a V gene assignment, J gene assignment, junction region, or productivity call.

--extended

Specify to include additional aligner specific fields in the output. Adds the path score of the iHMMune-Align hidden Markov model as vdj_score; adds fwr1, fwr2, fwr3, fwr4, cdr1, cdr2 and cdr3.

MakeDb.py imgt

Process IMGT/HighV-QUEST output (does not work with V-QUEST).

usage: MakeDb.py imgt [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
                      [--outdir OUT_DIR] [--outname OUT_NAME] [--log LOG_FILE]
                      [--failed] [--format {airr,changeo}] -i ALIGNER_FILES
                      [ALIGNER_FILES ...] [-s [SEQ_FILES [SEQ_FILES ...]]]
                      [-r REPO [REPO ...]]
                      [--10x CELLRANGER_FILE [CELLRANGER_FILE ...]]
                      [--asis-id] [--partial] [--extended]
                      [--imgt-id-len IMGT_ID_LEN]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

--log <log_file>

Specify to write verbose logging to a file. May not be specified with multiple input files.

--failed

If specified, create files containing records that fail processing.

--format {airr,changeo}

Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.

-i <aligner_files>

Either zipped IMGT output files (.zip or .txz) or a folder containing unzipped IMGT output files (which must include 1_Summary, 2_IMGT-gapped, 3_Nt-sequences, and 6_Junction).

-s <seq_files>

List of FASTA files (with .fasta, .fna or .fa extension) that were submitted to IMGT/HighV-QUEST. If unspecified, sequence identifiers truncated by IMGT/HighV-QUEST will not be corrected.

-r <repo>

List of folders and/or fasta files containing the germline sequence set used by IMGT/HighV-QUEST. These reference sequences must contain IMGT-numbering spacers (gaps) in the V segment. If unspecified, the germline sequence reconstruction will not be included in the output.

--10x <cellranger_file>

Table file containing 10X annotations (with .csv or .tsv extension).

--asis-id

Specify to prevent input sequence headers from being parsed to add new columns to database. Parsing of sequence headers requires headers to be in the pRESTO annotation format, so this should be specified when sequence headers are incompatible with the pRESTO annotation scheme. Note, unrecognized header formats will default to this behavior.

--partial

If specified, include incomplete V(D)J alignments in the pass file instead of the fail file. An incomplete alignment is defined as a record that is missing a V gene assignment, J gene assignment, junction region, or productivity call.

--extended

Specify to include additional aligner specific fields in the output. Adds <vdj>_score, <vdj>_identity, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2, cdr3, n1_length, n2_length, p3v_length, p5d_length, p3d_length, p5j_length and d_frame.

--imgt-id-len <imgt_id_len>

The maximum character length of sequence identifiers reported by IMGT/HighV-QUEST. Specify 50 if the IMGT files (-i) were generated with an IMGT/HighV-QUEST version older than 1.8.3 (May 7, 2021).

ParseDb.py

Parses tab delimited database files

usage: ParseDb.py [--version] [-h]  ...
--version

show program’s version number and exit

-h, --help

show this help message and exit

output files:
sequences

FASTA formatted sequences output from the subcommands fasta and clip.

<field>-<value>

database files partitioned by annotation <field> and <value>.

parse-<command>

output of the database modification functions where <command> is one of the subcommands add, index, drop, delete, rename, select, sort or update.

required fields:

sequence_id

ParseDb.py add

Adds field and value pairs.

usage: ParseDb.py add [--version] [-h] -d DB_FILES [DB_FILES ...]
                      [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                      [--outname OUT_NAME] -f FIELDS [FIELDS ...] -u VALUES
                      [VALUES ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <fields>

The names of the fields to add.

-u <values>

The value to assign to all rows for each field.

ParseDb.py delete

Deletes specific records.

usage: ParseDb.py delete [--version] [-h] -d DB_FILES [DB_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] -f FIELDS [FIELDS ...]
                         [-u VALUES [VALUES ...]] [--logic {any,all}]
                         [--regex]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <fields>

The names of the fields to check for deletion criteria.

-u <values>

The values defining which records to delete. A value may appear in any of the fields specified with -f.

--logic {any,all}

Defines whether a value may appear in any field (any) or whether it must appear in all fields (all).

--regex

If specified, treat values as regular expressions and allow partial string matches.
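The interaction of -f, -u, --logic and --regex can be sketched as a small predicate over records. This is an illustrative reading of the descriptions above, not the ParseDb.py source; the function names and example records are hypothetical.

```python
import re

def value_matches(cell, value, regex=False):
    """Exact string match by default; partial regular expression match if regex."""
    if regex:
        return re.search(value, cell) is not None
    return cell == value

def record_matches(record, fields, values, logic="any", regex=False):
    """Check whether the values appear in any or in all of the given fields."""
    hits = [
        any(value_matches(record.get(field, ""), value, regex) for value in values)
        for field in fields
    ]
    return any(hits) if logic == "any" else all(hits)

def delete_records(records, fields, values, logic="any", regex=False):
    """Keep only the records that do not meet the deletion criteria."""
    return [r for r in records
            if not record_matches(r, fields, values, logic, regex)]

# Hypothetical database rows
records = [
    {"sequence_id": "a", "productive": "T", "v_call": "IGHV1-2*01"},
    {"sequence_id": "b", "productive": "F", "v_call": "IGHV3-23*01"},
    {"sequence_id": "c", "productive": "T", "v_call": "IGHV3-30*02"},
]
kept_exact = delete_records(records, ["productive"], ["F"])
kept_regex = delete_records(records, ["v_call"], ["IGHV3"], regex=True)
```

The select subcommand would use the same predicate with the filter inverted, keeping the matching records instead of discarding them.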

ParseDb.py drop

Deletes entire fields.

usage: ParseDb.py drop [--version] [-h] -d DB_FILES [DB_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] -f FIELDS [FIELDS ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <fields>

The names of the fields to delete from the database.

ParseDb.py index

Adds a numeric index field.

usage: ParseDb.py index [--version] [-h] -d DB_FILES [DB_FILES ...]
                        [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                        [--outname OUT_NAME] [-f FIELD]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <field>

The name of the index field to add to the database.

ParseDb.py merge

Merges files.

usage: ParseDb.py merge [--version] [-h] -d DB_FILES [DB_FILES ...]
                        [--outdir OUT_DIR] [--outname OUT_NAME] [-o OUT_FILE]
                        [--drop]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-o <out_file>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir or --outname arguments.

--drop

If specified, drop fields that do not exist in all input files. Otherwise, include all columns in all files and fill missing data with empty strings.
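The effect of --drop on the merged columns can be illustrated with the standard csv module. The sketch below is an approximation of the described behavior, with hypothetical helper names and in-memory example tables standing in for database files.

```python
import csv
import io

def read_tsv(text):
    """Parse tab-delimited text into (field names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return reader.fieldnames, list(reader)

def merge_tables(tables, drop=False):
    """Concatenate tables, keeping shared fields only (drop) or all fields."""
    all_fields = []
    for fields, _ in tables:
        for field in fields:
            if field not in all_fields:
                all_fields.append(field)
    if drop:
        # Keep only fields present in every input table
        out_fields = [f for f in all_fields
                      if all(f in fields for fields, _ in tables)]
    else:
        out_fields = all_fields
    # Missing values are filled with empty strings
    merged = [{f: row.get(f, "") for f in out_fields}
              for _, rows in tables for row in rows]
    return out_fields, merged

# Two hypothetical inputs; only the first has a "sample" column
table_a = read_tsv("sequence_id\tv_call\tsample\ns1\tIGHV1-2*01\tA\n")
table_b = read_tsv("sequence_id\tv_call\ns2\tIGHV3-23*01\n")

fields_all, rows_all = merge_tables([table_a, table_b])
fields_drop, rows_drop = merge_tables([table_a, table_b], drop=True)
```

Without drop, the second record receives an empty sample value; with drop, the sample column is removed from both records.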

ParseDb.py rename

Renames fields.

usage: ParseDb.py rename [--version] [-h] -d DB_FILES [DB_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] -f FIELDS [FIELDS ...] -k NAMES
                         [NAMES ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <fields>

List of fields to rename.

-k <names>

List of new names for each field.

ParseDb.py select

Selects specific records.

usage: ParseDb.py select [--version] [-h] -d DB_FILES [DB_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] -f FIELDS [FIELDS ...] -u VALUES
                         [VALUES ...] [--logic {any,all}] [--regex]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <fields>

The names of the fields to check for selection criteria.

-u <values>

The values defining which records to select. A value may appear in any of the fields specified with -f.

--logic {any,all}

Defines whether a value may appear in any field (any) or whether it must appear in all fields (all).

--regex

If specified, treat values as regular expressions and allow partial string matches.

ParseDb.py sort

Sorts records by field values.

usage: ParseDb.py sort [--version] [-h] -d DB_FILES [DB_FILES ...]
                       [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                       [--outname OUT_NAME] -f FIELD [--num] [--descend]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <field>

The annotation field by which to sort records.

--num

Specify to define the sort column as numeric rather than textual.

--descend

If specified, sort records in descending, rather than ascending, order by values in the target field.
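The difference between textual and numeric sorting matters for count-like fields, as the following sketch shows. The function name and records are hypothetical, and the example stands in for the --num and --descend options rather than reproducing the tool's code.

```python
def sort_records(records, field, numeric=False, descend=False):
    """Sort row dicts by one field, as text or as numbers."""
    if numeric:
        key = lambda row: float(row[field])
    else:
        key = lambda row: row[field]
    return sorted(records, key=key, reverse=descend)

# Values read from a tab-delimited file arrive as strings
records = [
    {"sequence_id": "a", "junction_length": "10"},
    {"sequence_id": "b", "junction_length": "9"},
    {"sequence_id": "c", "junction_length": "100"},
]
as_text = [r["sequence_id"] for r in sort_records(records, "junction_length")]
as_number = [r["sequence_id"]
             for r in sort_records(records, "junction_length", numeric=True)]
```

Sorted as text, "10" and "100" come before "9"; sorted numerically, the order becomes 9, 10, 100.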

ParseDb.py split

Splits database files by field values.

usage: ParseDb.py split [--version] [-h] -d DB_FILES [DB_FILES ...]
                        [--outdir OUT_DIR] [--outname OUT_NAME] -f FIELD
                        [--num NUM_SPLIT]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <field>

Annotation field by which to split database files.

--num <num_split>

Specify to define the field as numeric and group records by whether they are less than or at least (greater than or equal to) the specified value.
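The numeric split described above partitions records around a threshold. A minimal sketch of that partition follows; the function name and the group labels "under" and "atleast" are chosen for this example and may not match the output file naming of ParseDb.py.

```python
def split_numeric(records, field, threshold):
    """Partition row dicts into values below the threshold and values at or above it."""
    groups = {"under": [], "atleast": []}
    for row in records:
        label = "under" if float(row[field]) < threshold else "atleast"
        groups[label].append(row)
    return groups

# Hypothetical records with a numeric count field stored as text
records = [
    {"sequence_id": "a", "consensus_count": "1"},
    {"sequence_id": "b", "consensus_count": "2"},
    {"sequence_id": "c", "consensus_count": "5"},
]
groups = split_numeric(records, "consensus_count", 2)
```

A record whose value equals the threshold lands in the "at least" group, matching the greater-than-or-equal wording of the option.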

ParseDb.py update

Updates field and value pairs.

usage: ParseDb.py update [--version] [-h] -d DB_FILES [DB_FILES ...]
                         [-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
                         [--outname OUT_NAME] -f FIELD -u VALUES [VALUES ...]
                         -t UPDATES [UPDATES ...]
--version

show program’s version number and exit

-h, --help

show this help message and exit

-d <db_files>

A list of tab delimited database files.

-o <out_files>

Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).

--outdir <out_dir>

Changes the output directory to the location specified. The input file directory is used if this is not specified.

--outname <out_name>

Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.

-f <field>

The name of the field to update.

-u <values>

The values that will be replaced.

-t <updates>

The new value to assign to each selected row.
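One way to picture the update subcommand is as a lookup table from old values to new ones. The sketch below assumes that the -u values and -t updates are paired by position, which the usage statement suggests but does not state outright; the function name and example data are hypothetical.

```python
def update_field(records, field, values, updates):
    """Replace each listed value in one field with its paired update."""
    if len(values) != len(updates):
        raise ValueError("values and updates must have the same length")
    mapping = dict(zip(values, updates))
    # Rows whose value is not listed are returned unchanged
    return [{**row, field: mapping.get(row[field], row[field])}
            for row in records]

# Hypothetical relabeling of a sample annotation
records = [
    {"sequence_id": "a", "sample": "S1"},
    {"sequence_id": "b", "sample": "S2"},
    {"sequence_id": "c", "sample": "S3"},
]
updated = update_field(records, "sample", ["S1", "S2"], ["control", "treated"])
```

The input rows are left untouched and a new list is returned, so the original values remain available for comparison.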

API

changeo.Alignment

Alignment manipulation

class changeo.Alignment.RegionDefinition(junction_length, amino_acid=False, definition='default')

Bases: object

FWR and CDR region boundary definitions

getRegions(seq)

Return IMGT defined FWR and CDR regions

Parameters

seq – IMGT-gapped sequence.

Returns

dictionary of FWR and CDR sequences.

Return type

dict

changeo.Alignment.alignmentPositions(alignment)

Extracts start position and length from an alignment

Parameters

alignment – tuples of (operation, length) for each alignment operation.

Returns

query (q) and reference (r) start (0-based) and length information with keys {q_start, q_length, r_start, r_length}.

Return type

dict
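To make the returned keys concrete, here is a standalone sketch that derives start and length values from (operation, length) tuples using common CIGAR conventions. It is not the changeo.Alignment.alignmentPositions source. It assumes that leading S operations offset the query, leading N operations offset the reference, and that any other clipping or skipping is ignored.

```python
def alignment_positions(alignment):
    """Derive 0-based starts and aligned lengths from (operation, length) tuples."""
    result = {"q_start": 0, "q_length": 0, "r_start": 0, "r_length": 0}
    started = False  # becomes True at the first aligned or gapped position
    for op, length in alignment:
        if not started and op == "S":
            # Leading soft clip: unaligned query bases before the alignment
            result["q_start"] += length
        elif not started and op == "N":
            # Leading skip: reference bases before the alignment
            result["r_start"] += length
        elif op in ("M", "=", "X"):
            started = True
            result["q_length"] += length
            result["r_length"] += length
        elif op == "I":
            started = True
            result["q_length"] += length
        elif op == "D":
            started = True
            result["r_length"] += length
    return result

# Hypothetical alignment: 3 clipped bases, 10 skipped reference bases,
# then matches with one insertion and one deletion
positions = alignment_positions(
    [("S", 3), ("N", 10), ("=", 20), ("I", 2), ("=", 5), ("D", 1)]
)
```

Insertions add to the query length only and deletions to the reference length only, which is why the two lengths can differ.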

changeo.Alignment.decodeBTOP(btop)

Parse a BTOP string into a list of tuples in CIGAR annotation.

Parameters

btop – BTOP string.

Returns

tuples of (operation, length) for each operation in the BTOP string using CIGAR annotation.

Return type

list

changeo.Alignment.decodeCIGAR(cigar)

Parse a CIGAR string into a list of tuples.

Parameters

cigar – CIGAR string.

Returns

tuples of (operation, length) for each operation in the CIGAR string.

Return type

list

changeo.Alignment.encodeCIGAR(alignment)

Encodes a list of tuple with alignment information into a CIGAR string.

Parameters

alignment – tuples of (type, length) for each alignment operation.

Returns

CIGAR string.

Return type

str
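
Encoding is the inverse operation. A minimal sketch, with the hypothetical name encode_cigar, simply concatenates each length with its operation code:

```python
def encode_cigar(alignment):
    """Join (operation, length) tuples into a CIGAR string."""
    return "".join(f"{length}{op}" for op, length in alignment)


cigar = encode_cigar([("S", 3), ("=", 10), ("X", 1), ("I", 2), ("=", 4)])
print(cigar)
```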

changeo.Alignment.gapV(seq, v_germ_start, v_germ_length, v_call, references, asis_calls=False)

Constructs IMGT-gapped V segment sequences.

Parameters
  • seq (str) – V(D)J sequence alignment (SEQUENCE_VDJ).

  • v_germ_start (int) – start position of the V segment alignment in the germline (V_GERM_START_VDJ, 1-based).

  • v_germ_length (int) – length of the V segment alignment against the germline (V_GERM_LENGTH_VDJ, 1-based).

  • v_call (str) – V segment allele assignment (V_CALL).

  • references (dict) – dictionary of IMGT-gapped reference sequences.

  • asis_calls (bool) – if True do not parse v_call for allele names and just split by comma.

Returns

dictionary containing IMGT-gapped query sequences and germline positions.

Return type

dict

Raises

KeyError – raised if the v_call is not found in the reference dictionary.

changeo.Alignment.getRegions(seq, junction_length)

Identify FWR and CDR regions by IMGT definition.

Parameters
  • seq – IMGT-gapped sequence.

  • junction_length – length of the junction region in nucleotides.

Returns

dictionary of FWR and CDR sequences.

Return type

dict
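
The region split can be sketched with fixed slices over an IMGT-gapped nucleotide sequence. The boundaries below follow the IMGT unique numbering (FWR1 codons 1-26, CDR1 27-38, FWR2 39-55, CDR2 56-65, FWR3 66-104), and CDR3 is taken as the junction minus its two flanking codons. Whether the library's default definition matches these exact boundaries is an assumption, and the names used here are hypothetical.

```python
# Nucleotide boundaries (0-based, end-exclusive) derived from IMGT unique numbering.
IMGT_REGIONS = {
    "fwr1": (0, 78),
    "cdr1": (78, 114),
    "fwr2": (114, 165),
    "cdr2": (165, 195),
    "fwr3": (195, 312),
}


def get_regions(seq, junction_length):
    """Slice an IMGT-gapped nucleotide sequence into FWR and CDR regions."""
    regions = {name: seq[start:end] for name, (start, end) in IMGT_REGIONS.items()}
    # The junction spans codons 104 to 118, so CDR3 (105 to 117) is 6 nt shorter.
    cdr3_end = 312 + max(junction_length - 6, 0)
    regions["cdr3"] = seq[312:cdr3_end]
    regions["fwr4"] = seq[cdr3_end:]
    return regions


# Synthetic sequence with one letter per region so the slices are easy to check.
seq = "A" * 78 + "C" * 36 + "G" * 51 + "T" * 30 + "A" * 117 + "C" * 30 + "G" * 33
regions = get_regions(seq, junction_length=36)
print({name: len(region) for name, region in regions.items()})
```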

changeo.Alignment.inferJunction(seq, j_germ_start, j_germ_length, j_call, references, asis_calls=False, regions='default')

Identify junction region by IMGT definition.

Parameters
  • seq (str) – IMGT-gapped V(D)J sequence alignment (SEQUENCE_IMGT).

  • j_germ_start (int) – start position of the J segment alignment in the germline (J_GERM_START, 1-based).

  • j_germ_length (int) – length of the J segment alignment against the germline (J_GERM_LENGTH).

  • j_call (str) – J segment allele assignment (J_CALL).

  • references (dict) – dictionary of IMGT-gapped reference sequences.

  • asis_calls (bool) – if True do not parse j_call for allele names and just split by comma.

  • regions (str) – name of the IMGT FWR/CDR region definitions to use.

Returns

dictionary containing junction sequence, translation and length.

Return type

dict

changeo.Alignment.padAlignment(alignment, q_start, r_start)

Pads the start of an alignment based on query and reference positions.

Parameters
  • alignment – tuples of (operation, length) for each alignment operation.

  • q_start – query (input) start position (0-based)

  • r_start – reference (subject) start position (0-based)

Returns

updated list of tuples of (operation, length) for the alignment.

Return type

list
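
One way to picture the padding is shown below. The sketch assumes that the query offset is expressed as a leading soft clip (S) and the reference offset as a leading skipped region (N), in that order; the library may order or merge these differently. The name pad_alignment is hypothetical.

```python
def pad_alignment(alignment, q_start, r_start):
    """Prepend soft-clip (S) and skip (N) operations for the start offsets."""
    body = list(alignment)
    clip, skip = q_start, r_start
    # Fold any existing leading S/N operations into the new totals.
    while body and body[0][0] in ("S", "N"):
        op, length = body.pop(0)
        if op == "S":
            clip += length
        else:
            skip += length
    prefix = []
    if clip > 0:
        prefix.append(("S", clip))
    if skip > 0:
        prefix.append(("N", skip))
    return prefix + body


print(pad_alignment([("=", 10)], q_start=3, r_start=5))
```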

changeo.Applications

Application wrappers

changeo.Applications.getIgBLASTVersion(exec='igblastn')

Gets the version of the IgBLAST executable

Parameters

exec (str) – the name or path to the igblastn executable.

Returns

version number.

Return type

str

changeo.Applications.runASN(fasta, template=None, exec='tbl2asn')

Executes tbl2asn to generate Sequin files

Parameters
  • fasta (str) – fsa file name.

  • template (str) – sbt file name.

  • exec (str) – the name or path to the tbl2asn executable.

Returns

tbl2asn console output.

Return type

str

changeo.Applications.runIgBLASTN(fasta, igdata, loci='ig', organism='human', vdb=None, ddb=None, jdb=None, output=None, format='legacy', threads=1, exec='igblastn')

Runs igblastn on a sequence file

Parameters
  • fasta (str) – fasta file containing sequences.

  • igdata (str) – path to the IgBLAST database directory (IGDATA environment).

  • loci (str) – receptor type; one of ‘ig’ or ‘tr’.

  • organism (str) – species name.

  • vdb (str) – name of a custom V reference in the database folder to use.

  • ddb (str) – name of a custom D reference in the database folder to use.

  • jdb (str) – name of a custom J reference in the database folder to use.

  • output (str) – output file name. If None, automatically generate from the fasta file name.

  • format (str) – output format. One of ‘blast’ or ‘airr’.

  • threads (int) – number of threads for igblastn.

  • exec (str) – the name or path to the igblastn executable.

Returns

IgBLAST console output.

Return type

str

changeo.Applications.runIgBLASTP(fasta, igdata, loci='ig', organism='human', vdb=None, output=None, threads=1, exec='igblastp')

Runs igblastp on a sequence file

Parameters
  • fasta (str) – fasta file containing sequences.

  • igdata (str) – path to the IgBLAST database directory (IGDATA environment).

  • loci (str) – receptor type; one of ‘ig’ or ‘tr’.

  • organism (str) – species name.

  • vdb (str) – name of a custom V reference in the database folder to use.

  • output (str) – output file name. If None, automatically generate from the fasta file name.

  • threads (int) – number of threads for igblastp.

  • exec (str) – the name or path to the igblastp executable.

Returns

IgBLAST console output.

Return type

str

changeo.Applications.runIgPhyML(rep_file, rep_dir, model='HLP17', motifs='FCH', threads=1, exec='igphyml')

Run IgPhyML

Parameters
  • rep_file (str) – repertoire tsv file.

  • rep_dir (str) – directory containing input fasta files.

  • model (str) – model to use.

  • motifs (str) – motifs argument.

  • threads – number of threads.

  • exec – the path to the IgPhyML executable.

Returns

name of the output tree file.

Return type

str

changeo.Commandline

Commandline interface

class changeo.Commandline.CommonHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)

Bases: argparse.RawDescriptionHelpFormatter, argparse.ArgumentDefaultsHelpFormatter

Custom argparse.HelpFormatter

changeo.Commandline.checkArgs(parser)

Checks that arguments have been provided and prints help if they have not.

Parameters

parser – An argparse.ArgumentParser defining the commandline arguments.

Returns

True if arguments are present. Prints help and exits if not.

Return type

boolean

changeo.Commandline.getCommonArgParser(db_in=True, db_out=True, out_file=True, failed=True, log=True, format=True, multiproc=False, add_help=True)

Defines an ArgumentParser object with common Change-O arguments

Parameters
  • db_in (bool) – if True include tab delimited database input arguments.

  • db_out (bool) – if True include explicit output file name argument.

  • out_file (bool) – if True add explicit output file name arguments.

  • failed (bool) – if True include arguments for output of failed results.

  • log (bool) – if True include log arguments.

  • format (bool) – if True include input and output type arguments.

  • multiproc (bool) – if True include multiprocessing arguments.

Returns

an argument parser.

Return type

argparse.ArgumentParser
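
The idea of a shared parser assembled from boolean switches can be sketched with plain argparse. Only the flags documented above (-d, -o, --outdir, --outname, --failed, --format) are used. The function name common_parser and the exact dest names are assumptions for illustration, not the library's definitions.

```python
import argparse


def common_parser(db_in=True, failed=True, format=True):
    """Build a parser holding a subset of the shared arguments."""
    parser = argparse.ArgumentParser(add_help=True)
    if db_in:
        parser.add_argument("-d", nargs="+", dest="db_files", required=True,
                            help="A list of tab delimited database files.")
    parser.add_argument("-o", nargs="+", dest="out_files", default=None,
                        help="Explicit output file name(s).")
    parser.add_argument("--outdir", dest="out_dir", default=None,
                        help="Output directory.")
    parser.add_argument("--outname", dest="out_name", default=None,
                        help="Prefix of the output file.")
    if failed:
        parser.add_argument("--failed", action="store_true", dest="failed",
                            help="Also write records that fail processing.")
    if format:
        parser.add_argument("--format", dest="format", default="airr",
                            choices=("airr", "changeo"),
                            help="File format.")
    return parser


args = common_parser().parse_args(["-d", "a.tsv", "b.tsv", "--format", "changeo"])
print(vars(args))
```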

changeo.Commandline.parseCommonArgs(args, in_arg=None, in_types=None, in_list=False)

Checks common arguments from getCommonArgParser and transforms output options to a dictionary

Parameters
  • args – Argument Namespace defined by ArgumentParser.parse_args.

  • in_arg – String defining a non-standard input file argument to verify; by default ‘db_files’ and ‘seq_files’ are supported in that order.

  • in_types – List of types (file extensions as strings) to allow for files in in_arg; if None do not check type.

  • in_list – if True allow multiple input files with the out_name and log arguments.

Returns

Dictionary copy of args with output arguments embedded in the dictionary out_args

Return type

dict

changeo.Commandline.setDefaultFields(args, defaults, format='airr')

Sets default field arguments by format

Parameters
  • args (dict) – parsed argument dictionary.

  • defaults (dict) – default variables to set with keys as argument variables and values as AIRR field names.

  • format (str) – one of ‘changeo’ or ‘airr’ which defines the file format.

Returns

modified input args.

Return type

dict

changeo.Distance

Distance calculations

changeo.Distance.calcDistances(sequences, n, dist_mat, sym='avg', norm=None)

Calculate pairwise distances between input sequences

Parameters
  • sequences – List of sequences for which to calculate pairwise distances

  • n – Length of n-mers to be used in calculating distance

  • dist_mat – pandas.DataFrame of mutation distances

  • norm – Normalization method. One of None, ‘len’, or ‘mut’.

  • sym – Symmetry method; one of ‘avg’ or ‘min’.

Returns

numpy matrix of pairwise distances between input sequences

Return type

ndarray

changeo.Distance.formClusters(dists, link, distance)

Form clusters based on hierarchical clustering of input distance matrix with linkage type and cutoff distance

Parameters
  • dists – numpy matrix of distances

  • link – Linkage type for hierarchical clustering

  • distance – Distance at which to cut into clusters

Returns

List of cluster assignments

Return type

list

changeo.Distance.getAADistMatrix(mat=None, mask_dist=0, gap_dist=0)

Generates an amino acid distance matrix

Parameters
  • mat – Input distance matrix to extend to full alphabet; if unspecified, creates Hamming distance matrix that incorporates IUPAC equivalencies

  • mask_dist – Score for all matches against an X character

  • gap_dist – Score for all matches against a gap (-, .) character

Returns

pandas.DataFrame of distances

Return type

DataFrame

changeo.Distance.getDNADistMatrix(mat=None, mask_dist=0, gap_dist=0)

Generates a DNA distance matrix

Parameters
  • mat – Input distance matrix to extend to full alphabet; if unspecified, creates Hamming distance matrix that incorporates IUPAC equivalencies

  • mask_dist – Distance for all matches against an N character

  • gap_dist – Distance for all matches against a gap (-, .) character

Returns

pandas.DataFrame of distances

Return type

DataFrame
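
The role of mask_dist and gap_dist can be sketched with a plain dictionary in place of a pandas.DataFrame. This simplified version covers only A, C, G, T, N and the two gap characters, so the other IUPAC ambiguity codes are omitted, and gaps take precedence over N by assumption. All names are hypothetical.

```python
from itertools import product


def dna_dist_matrix(mask_dist=0, gap_dist=0):
    """Build a {(char, char): distance} lookup for a reduced DNA alphabet."""
    dist = {}
    for a, b in product("ACGTN-.", repeat=2):
        if a in "-." or b in "-.":
            dist[(a, b)] = gap_dist      # any comparison against a gap
        elif a == "N" or b == "N":
            dist[(a, b)] = mask_dist     # any comparison against a masked base
        else:
            dist[(a, b)] = 0 if a == b else 1
    return dist


def hamming(seq1, seq2, dist):
    """Sum per-position distances between two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have equal length")
    return sum(dist[(a, b)] for a, b in zip(seq1, seq2))


default_dist = dna_dist_matrix()
print(hamming("ACGT", "ACGA", default_dist))
```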

changeo.Distance.getNmers(sequences, n)

Breaks input sequences down into n-mers

Parameters
  • sequences – List of sequences to be broken into n-mers

  • n – Length of n-mers to return

Returns

Dictionary mapping sequence to a list of n-mers

Return type

dict
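
A sliding-window sketch of the n-mer breakdown is shown below. It emits only windows that fit entirely inside the sequence; whether the library pads the sequence ends so that every position receives an n-mer is not stated here, so the absence of padding is an assumption. The name get_nmers is hypothetical.

```python
def get_nmers(sequences, n):
    """Map each sequence to the list of its overlapping n-mers."""
    return {
        seq: [seq[i:i + n] for i in range(len(seq) - n + 1)]
        for seq in sequences
    }


nmers = get_nmers(["ACGTA", "TT"], 3)
print(nmers)
```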

changeo.Distance.zip_equal(*iterables)

Zips iterables and raises exception if different lengths

Parameters

iterables – pointer to iterables to zip together

Returns

A generator of tuples with combined elements from the iterables

Return type

iter
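
A length-checking zip can be sketched with itertools.zip_longest and a sentinel fill value. The exception type raised by the library is not documented above, so ValueError is an assumption, as is the name zip_equal_sketch.

```python
from itertools import zip_longest


def zip_equal_sketch(*iterables):
    """Yield tuples like zip, but raise if the iterables differ in length."""
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        # A sentinel in the tuple means one iterable ran out before the others.
        if any(item is sentinel for item in combo):
            raise ValueError("Iterables have different lengths")
        yield combo


pairs = list(zip_equal_sketch("ab", [1, 2]))
print(pairs)

try:
    list(zip_equal_sketch("abc", [1, 2]))
    raised = False
except ValueError:
    raised = True
```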

changeo.Gene

Gene annotations

changeo.Gene.buildClonalGermline(receptors, references, seq_field='sequence_imgt', v_field='v_call', d_field='d_call', j_field='j_call', amino_acid=False)

Determine consensus clone sequence and create germline for clone

Parameters
  • receptors (changeo.Receptor.Receptor) – list of Receptor objects

  • references (dict) – dictionary of IMGT gapped germline sequences

  • seq_field (str) – Receptor attribute in which to look for sequence

  • v_field (str) – Receptor attribute in which to look for V call

  • d_field (str) – Receptor attribute in which to look for D call

  • j_field (str) – Receptor attribute in which to look for J call

  • amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.

Returns

log dictionary, dictionary of {germline_type: germline_sequence}, dictionary of consensus {segment: gene call}

Return type

tuple

changeo.Gene.buildGermline(receptor, references, seq_field='sequence_imgt', v_field='v_call', d_field='d_call', j_field='j_call', amino_acid=False)

Join gapped germline sequences aligned with sample sequences

Parameters
  • receptor (changeo.Receptor.Receptor) – Receptor object.

  • references (dict) – dictionary of IMGT gapped germline sequences.

  • seq_field (str) – Receptor attribute in which to look for sequence.

  • v_field (str) – Receptor attribute in which to look for V call.

  • d_field (str) – Receptor attribute in which to look for D call.

  • j_field (str) – Receptor attribute in which to look for J call.

  • amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.

Returns

log dictionary, dictionary of {germline_type: germline_sequence}, dictionary of {segment: gene call}

Return type

tuple

changeo.Gene.getAllele(gene, action='first')

Extract allele from gene call string

Parameters
  • gene (str) – string with gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first allele call when action is ‘first’; tuple of allele calls for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getAlleleNumber(gene, action='first')

Extract allele number from gene call string

Parameters
  • gene (str) – string with gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first allele number call when action is ‘first’; tuple of allele numbers for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getCGene(gene, action='first')

Extract C-region gene from gene call string

Parameters
  • gene (str) – string with C-region gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first C-region gene call when action is ‘first’; tuple of gene calls for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getDAllele(gene, action='first')

Extract D allele gene from gene call string

Parameters
  • gene (str) – string with D gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first D allele call when action is ‘first’; tuple of D allele calls for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getDGermline(receptor, references, d_field='d_call', amino_acid=False)

Extract D allele and germline sequence

Parameters
  • receptor (changeo.Receptor.Receptor) – Receptor object

  • references (dict) – dictionary of germline sequences

  • d_field (str) – Receptor attribute containing the D allele assignment

  • amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.

Returns

D allele name, D segment germline sequence

Return type

tuple

changeo.Gene.getFamily(gene, action='first')

Extract family from gene call string

Parameters
  • gene (str) – string with gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first family call when action is ‘first’; tuple of family calls for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getGene(gene, action='first')

Extract gene from gene call string

Parameters
  • gene (str) – string with gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first gene call when action is ‘first’; tuple of gene calls for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getJAllele(gene, action='first')

Extract J allele gene from gene call string

Parameters
  • gene (str) – string with J gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first J allele call when action is ‘first’; tuple of J allele calls for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getJGermline(receptor, references, j_field='j_call', amino_acid=False)

Extract J allele and germline sequence

Parameters
  • receptor (changeo.Receptor.Receptor) – Receptor object

  • references (dict) – dictionary of germline sequences

  • j_field (str) – Receptor attribute containing the J allele assignment

  • amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.

Returns

J allele name, J segment germline sequence

Return type

tuple

changeo.Gene.getLocus(gene, action='first')

Extract locus from gene call string

Parameters
  • gene (str) – string with gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first locus call when action is ‘first’; tuple of locus calls for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getVAllele(gene, action='first')

Extract V allele gene from gene call string

Parameters
  • gene (str) – string with V gene calls

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the first V allele call when action is ‘first’; tuple of V allele calls for ‘set’ or ‘list’ actions.

Return type

str

changeo.Gene.getVGermline(receptor, references, v_field='v_call', amino_acid=False)

Extract V allele and germline sequence

Parameters
  • receptor (changeo.Receptor.Receptor) – Receptor object

  • references (dict) – dictionary of germline sequences

  • v_field (str) – Receptor attribute containing the V allele assignment

  • amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.

Returns

V allele name, V segment germline sequence

Return type

tuple

changeo.Gene.parseGeneCall(gene, regex, action='first')

Extract alleles from strings

Parameters
  • gene (str) – string with gene calls

  • regex (re.Pattern) – compiled regular expression for allele match

  • action (str) – action to perform for multiple alleles; one of (‘first’, ‘set’, ‘list’).

Returns

String of the allele when action is ‘first’; tuple of allele calls for ‘set’ or ‘list’ actions.

Return type

str
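
The three actions can be illustrated with a small regular-expression sketch. The patterns below are simplified stand-ins for IMGT-style names such as IGHV1-69*01 and are not the expressions used by the library. Returning None when nothing matches and sorting the ‘set’ result are likewise assumptions.

```python
import re

# Hypothetical, simplified patterns for IMGT-style allele and gene names.
ALLELE_REGEX = re.compile(r"(?:IG[HKL]|TR[ABDG])[VDJ][A-Z0-9\-/]*\*[0-9]+")
GENE_REGEX = re.compile(r"(?:IG[HKL]|TR[ABDG])[VDJ][A-Z0-9\-/]*")


def parse_gene_call(gene, regex, action="first"):
    """Extract matches of `regex` from a gene call string."""
    matches = [m.group(0) for m in regex.finditer(gene)] if gene else []
    if not matches:
        return None
    if action == "first":
        return matches[0]
    if action == "set":
        return tuple(sorted(set(matches)))   # unique calls
    if action == "list":
        return tuple(matches)                # all calls, duplicates kept
    raise ValueError("action must be one of 'first', 'set', or 'list'")


call = "IGHV1-69*01,IGHV1-69*02,IGHV1-69*01"
print(parse_gene_call(call, ALLELE_REGEX, action="first"))
print(parse_gene_call(call, ALLELE_REGEX, action="set"))
print(parse_gene_call(call, GENE_REGEX, action="set"))
```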

changeo.Gene.stitchRegions(receptor, v_seq, d_seq, j_seq, amino_acid=False)

Assemble full length region encoding

Parameters
  • receptor (changeo.Receptor.Receptor) – Receptor object

  • v_seq (str) – V segment germline sequence as a string

  • d_seq (str) – D segment germline sequence as a string

  • j_seq (str) – J segment germline sequence as a string

  • amino_acid (bool) – if True use amino acid positional fields, otherwise use nucleotide fields.

Returns

string defining germline regions

Return type

str

changeo.Gene.stitchVDJ(receptor, v_seq, d_seq, j_seq, amino_acid=False)

Assemble full length germline sequence

Parameters
  • receptor (changeo.Receptor.Receptor) – Receptor object

  • v_seq (str) – V segment sequence as a string

  • d_seq (str) – D segment sequence as a string

  • j_seq (str) – J segment sequence as a string

  • amino_acid (bool) – if True use X for N/P regions and amino acid positional fields, otherwise use N and nucleotide fields.

Returns

full germline sequence

Return type

str
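
The assembly step can be pictured as concatenating the three segments with placeholder characters for the N/P regions. In this sketch the N/P lengths are passed in directly, whereas the library reads positional fields from the Receptor object and may also trim or pad the segments. The name stitch_vdj and its signature are hypothetical.

```python
def stitch_vdj(v_seq, d_seq, j_seq, np1_length, np2_length, amino_acid=False):
    """Join V, D, and J germline segments with masked N/P regions."""
    # N/P nucleotides have no germline template, so they are masked.
    pad = "X" if amino_acid else "N"
    return v_seq + pad * np1_length + d_seq + pad * np2_length + j_seq


germline = stitch_vdj("AAA", "CC", "GG", np1_length=2, np2_length=1)
print(germline)
```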

changeo.IO

File I/O and parsers

class changeo.IO.AIRRReader(handle)

Bases: changeo.IO.TSVReader

An iterator to read and parse AIRR formatted data.

class changeo.IO.AIRRWriter(handle, fields=['sequence_id', 'sequence', 'sequence_alignment', 'germline_alignment', 'rev_comp', 'productive', 'stop_codon', 'vj_in_frame', 'locus', 'v_call', 'd_call', 'j_call', 'junction', 'junction_length', 'junction_aa', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end'])

Bases: changeo.IO.TSVWriter

Writes AIRR formatted data.

writeReceptor(records)

Writes a row from a Receptor object

Parameters

records – a changeo.Receptor.Receptor object to write or an iterable of such objects.

Returns

None

class changeo.IO.ChangeoReader(handle)

Bases: changeo.IO.TSVReader

An iterator to read and parse Change-O formatted data.

class changeo.IO.ChangeoWriter(handle, fields=['SEQUENCE_ID', 'SEQUENCE_INPUT', 'FUNCTIONAL', 'IN_FRAME', 'STOP', 'MUTATED_INVARIANT', 'INDELS', 'LOCUS', 'V_CALL', 'D_CALL', 'J_CALL', 'SEQUENCE_VDJ', 'SEQUENCE_IMGT', 'V_SEQ_START', 'V_SEQ_LENGTH', 'V_GERM_START_VDJ', 'V_GERM_LENGTH_VDJ', 'V_GERM_START_IMGT', 'V_GERM_LENGTH_IMGT', 'NP1_LENGTH', 'D_SEQ_START', 'D_SEQ_LENGTH', 'D_GERM_START', 'D_GERM_LENGTH', 'NP2_LENGTH', 'J_SEQ_START', 'J_SEQ_LENGTH', 'J_GERM_START', 'J_GERM_LENGTH', 'JUNCTION', 'JUNCTION_LENGTH', 'GERMLINE_IMGT'], header=True)

Bases: changeo.IO.TSVWriter

Writes Change-O formatted data.

writeReceptor(records)

Writes a row from a Receptor object

Parameters

records – a changeo.Receptor.Receptor object to write or an iterable of such objects.

Returns

None

class changeo.IO.IHMMuneReader(ihmmune, sequences, references, receptor=True)

Bases: object

An iterator to read and parse iHMMune-Align output files.

__iter__()

Iterator initializer.

Returns

changeo.IO.IHMMuneReader

__next__()

Next method.

Returns

parsed iHMMune-Align result as a Receptor (receptor=True) or dictionary (receptor=False).

Return type

changeo.Receptor.Receptor

static customFields(scores=False, regions=False, cell=False, schema=None)

Returns non-standard Receptor attributes defined by the parser

Parameters
  • scores – if True include alignment scoring fields.

  • regions – if True include IMGT-gapped CDR and FWR region fields.

  • schema – schema class to pass field through for conversion. If None, return changeo.Receptor.Receptor attribute names.

Returns

list of field names.

Return type

list

ihmmune_fields = ['SEQUENCE_ID', 'V_CALL', 'D_CALL', 'J_CALL', 'V_SEQ', 'NP1_SEQ', 'D_SEQ', 'NP2_SEQ', 'J_SEQ', 'V_MUT', 'D_MUT', 'J_MUT', 'NX_COUNT', 'J_INFRAME', 'V_SEQ_START', 'STOP_COUNT', 'D_PROB', 'HMM_SCORE', 'RC', 'COMMON_MUT', 'COMMON_NX_COUNT', 'V_SEQ_START', 'V_SEQ_LENGTH', 'A_SCORE']

parseRecord(record)

Parses a single row from an iHMMune-Align file.

Parameters

record – dictionary containing one row of iHMMune-Align file.

Returns

database entry for the row.

Return type

dict

class changeo.IO.IMGTReader(summary, gapped, ntseq, junction, receptor=True)

Bases: object

An iterator to read and parse IMGT output files.

__iter__()

Iterator initializer.

Returns

changeo.IO.IMGTReader

__next__()

Next method.

Returns

parsed IMGT/HighV-QUEST result as a Receptor (receptor=True) or dictionary (receptor=False).

Return type

changeo.Receptor.Receptor

static customFields(scores=False, regions=False, junction=False, schema=None)

Returns non-standard fields defined by the parser

Parameters
  • scores – if True include alignment scoring fields.

  • regions – if True include IMGT-gapped CDR and FWR region fields.

  • junction – if True include detailed junction annotation fields.

  • schema – schema class to pass field through for conversion. If None, return changeo.Receptor.Receptor attribute names.

Returns

list of field names.

Return type

list

parseRecord(summary, gapped, ntseq, junction)

Parses a single row from each IMGT file.

Parameters
  • summary – dictionary containing one row of the ‘1_Summary’ file.

  • gapped – dictionary containing one row of the ‘2_IMGT-gapped-nt-sequences’ file.

  • ntseq – dictionary containing one row of the ‘3_Nt-sequences’ file.

  • junction – dictionary containing one row of the ‘6_Junction’ file.

Returns

database entry for the row.

Return type

dict

class changeo.IO.IgBLASTReader(igblast, sequences, references, asis_calls=False, regions='default', receptor=True, infer_junction=False)

Bases: object

An iterator to read and parse IgBLAST output files

__iter__()

Iterator initializer.

Returns

changeo.IO.IgBLASTReader

__next__()

Next method.

Returns

parsed IgBLAST result as a Receptor (receptor=True) or dictionary (receptor=False).

Return type

changeo.Receptor.Receptor

static customFields(schema=None)

Returns non-standard fields defined by the parser

Parameters

schema – schema class to pass field through for conversion. If None, return changeo.Receptor.Receptor attribute names.

Returns

list of field names.

Return type

list

parseBlock(block)

Parses an IgBLAST result into separate sections

Parameters

block (iter) – an iterator from itertools.groupby containing a single IgBLAST result.

Returns

a parsed results block with the keys ‘query’ (sequence identifier as a string), ‘summary’ (dictionary of the alignment summary), ‘subregion’ (dictionary of IgBLAST CDR3 sequences), and ‘hits’ (VDJ hit table as a list of dictionaries). Returns None if the block has no data that can be parsed.

Return type

dict

parseSections(sections)

Parses IgBLAST sections into a db dictionary

Parameters

sections – dictionary of parsed sections from parseBlock.

Returns

db entries.

Return type

dict

class changeo.IO.IgBLASTReaderAA(igblast, sequences, references, asis_calls=False, regions='default', receptor=True, infer_junction=False)

Bases: changeo.IO.IgBLASTReader

An iterator to read and parse IgBLAST amino acid alignment output files

static customFields(schema=None)

Returns non-standard fields defined by the parser

Parameters

schema – schema class to pass field through for conversion. If None, return changeo.Receptor.Receptor attribute names.

Returns

list of field names.

Return type

list

parseSections(sections)

Parses IgBLAST sections into a db dictionary

Parameters

sections – dictionary of parsed sections from parseBlock.

Returns

db entries.

Return type

dict

class changeo.IO.TSVReader(handle)

Bases: object

Simple csv.DictReader wrapper to read format agnostic TSV files.

reader

reader object.

Type

iter

fields

field names.

Type

list

__iter__()

Iterator initializer

Returns

changeo.IO.TSVReader

__next__()

Next method

Returns

row as a dictionary of field:value pairs.

Return type

dict

class changeo.IO.TSVWriter(handle, fields, header=True)

Bases: object

Simple csv.DictWriter wrapper to write format agnostic TSV files.

writeDict(records)

Writes a row from a dictionary

Parameters

records – dictionary of row data or an iterable of such objects.

Returns

None

writeHeader()

Writes the header

Returns

None
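
The reader and writer wrappers can be pictured with the standard csv module and its tab dialect. The classes below are a minimal sketch with hypothetical names; behaviour such as ignoring keys that are not listed in fields is an assumption and may not match the library.

```python
import csv
import io


class TSVReaderSketch:
    """Iterate over rows of a tab-delimited file as dictionaries."""

    def __init__(self, handle):
        self.reader = csv.DictReader(handle, dialect="excel-tab")
        self.fields = self.reader.fieldnames

    def __iter__(self):
        return self

    def __next__(self):
        return dict(next(self.reader))


class TSVWriterSketch:
    """Write dictionaries as rows of a tab-delimited file."""

    def __init__(self, handle, fields, header=True):
        self.writer = csv.DictWriter(handle, fieldnames=fields, dialect="excel-tab",
                                     extrasaction="ignore", lineterminator="\n")
        if header:
            self.writer.writeheader()

    def writeDict(self, records):
        # Accept either a single row or an iterable of rows.
        if isinstance(records, dict):
            self.writer.writerow(records)
        else:
            self.writer.writerows(records)


# Round trip through an in-memory buffer.
buffer = io.StringIO()
writer = TSVWriterSketch(buffer, fields=["sequence_id", "v_call"])
writer.writeDict({"sequence_id": "s1", "v_call": "IGHV1-69*01"})
writer.writeDict([{"sequence_id": "s2", "v_call": "IGHV3-23*01", "extra": "dropped"}])

reader = TSVReaderSketch(io.StringIO(buffer.getvalue()))
rows = list(reader)
print(reader.fields, rows)
```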

changeo.IO.checkFields(attributes, header, schema=<class 'changeo.Receptor.AIRRSchema'>)

Checks that a file header contains a required set of Receptor attributes

Parameters
  • attributes (list) – list of Receptor attributes to check for.

  • header (list) – list of field names in the file header.

  • schema (object) – schema object to convert field names to Receptor attributes.

Returns

True if all attributes mapping fields are found.

Return type

bool

Raises

LookupError

changeo.IO.countDbFile(file)

Counts the records in database files

Parameters

file – tab-delimited database file.

Returns

count of records in the database file.

Return type

int

changeo.IO.extractIMGT(imgt_output)

Extract necessary files from IMGT/HighV-QUEST results.

Parameters

imgt_output – zipped file or unzipped folder output by IMGT/HighV-QUEST.

Returns

(temporary directory handle, dictionary with names of extracted IMGT files).

Return type

tuple

changeo.IO.getDbFields(file, add=None, exclude=None, reader=<class 'changeo.IO.TSVReader'>)

Get field names from a db file

Parameters
  • file – db file to pull base fields from.

  • add – fields to append to the field set.

  • exclude – fields to exclude from the field set.

  • reader – reader class.

Returns

list of field names

Return type

list

changeo.IO.getFormatOperators(format)

Simple wrapper for fetching the set of operator classes for a data format

Parameters

format (str) – name of the data format.

Returns

a tuple with the reader class, writer class, and schema definition class.

Return type

tuple

changeo.IO.getOutputHandle(file, out_label=None, out_dir=None, out_name=None, out_type=None)

Opens an output file handle

Parameters
  • file – filename to base output file name on.

  • out_label – text to be inserted before the file extension; if None do not add a label.

  • out_type – the file extension of the output file; if None use input file extension.

  • out_dir – the output directory; if None use directory of input file

  • out_name – the short filename to use for the output file; if None use input file short name.

Returns

File handle

Return type

file

changeo.IO.getOutputName(file, out_label=None, out_dir=None, out_name=None, out_type=None)

Creates an output filename from an existing filename

Parameters
  • file – filename to base output file name on.

  • out_label – text to be inserted before the file extension; if None do not add a label.

  • out_type – the file extension of the output file; if None use input file extension.

  • out_dir – the output directory; if None use directory of input file

  • out_name – the short filename to use for the output file; if None use input file short name.

Returns

file name.

Return type

str
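
How the arguments combine into a file name can be sketched with os.path. Joining the base name and label with an underscore is an assumption of this sketch, and the function name get_output_name is hypothetical.

```python
import os


def get_output_name(file, out_label=None, out_dir=None, out_name=None, out_type=None):
    """Derive an output file name from an input file name."""
    directory, filename = os.path.split(file)
    base, ext = os.path.splitext(filename)

    # Each argument overrides the corresponding part of the input name.
    directory = out_dir if out_dir is not None else directory
    base = out_name if out_name is not None else base
    ext = out_type if out_type is not None else ext.lstrip(".")

    # Assumption: the label is appended to the base name with an underscore.
    if out_label is not None:
        base = f"{base}_{out_label}"
    return os.path.join(directory, f"{base}.{ext}")


name = get_output_name(os.path.join("data", "sample.tsv"), out_label="clone-pass")
print(name)
```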

changeo.IO.readGermlines(references, asis=False, warn=False)

Parses germline repositories

Parameters
  • references (list) – list of strings specifying directories and/or files from which to read germline records.

  • asis (bool) – if True use sequence ID as record name and do not parse headers for allele names.

  • warn (bool) – print warning messages to standard error if True.

Returns

Dictionary of germlines in the form {allele: sequence}.

Return type

dict
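
The {allele: sequence} result can be sketched with a small FASTA parser. Unlike the function above, this version reads from a single open handle rather than a list of directories or files. The allele pattern, the upper-casing of sequences, and the name read_fasta_germlines are assumptions made for illustration.

```python
import io
import re

# Hypothetical, simplified pattern for IMGT-style allele names in FASTA headers.
ALLELE = re.compile(r"(?:IG[HKL]|TR[ABDG])[VDJ][A-Z0-9\-/]*\*[0-9]+")


def read_fasta_germlines(handle, asis=False):
    """Parse FASTA records into a dictionary of {allele: sequence}."""
    germlines = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            match = ALLELE.search(header)
            # With asis=True, or when no allele is found, keep the full header.
            name = header if asis or match is None else match.group(0)
            germlines[name] = ""
        elif name is not None:
            germlines[name] += line.upper()
    return germlines


fasta = (
    ">M99641|IGHV1-18*01|Homo sapiens|F\n"
    "caggtt\n"
    "cagctg\n"
    ">X86355|IGHJ4*02|Homo sapiens|F\n"
    "actac\n"
)
germlines = read_fasta_germlines(io.StringIO(fasta))
print(germlines)
```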

changeo.IO.splitName(file)

Extract the extension from a file name

Parameters

file (str) – file name.

Returns

tuple of the file directory, basename and extension.

Return type

tuple

changeo.IO.yamlDict(file)

Returns a dictionary from a yaml file

Parameters

file (str) – simple yaml file with rows in the form ‘argument: value’.

Returns

dictionary of key:value pairs in the file.

Return type

dict

changeo.Multiprocessing

Multiprocessing

class changeo.Multiprocessing.DbData(key, records)

Bases: object

A class defining data objects for worker processes

id

result identifier

data

list of data records

valid

True if preprocessing was successful and data should be processed

class changeo.Multiprocessing.DbResult(key, records)

Bases: object

A class defining result objects for collector processes

id

result identifier

data

list of original data records

results

list of processed records

data_pass

list of records that pass filtering for workers that split data before processing

data_fail

list of records that failed filtering for workers that split data before processing

valid

True if processing was successful and results should be written

log

OrderedDict of log items

property data_count

changeo.Multiprocessing.collectDbQueue(alive, result_queue, collect_queue, db_file, label, fields, writer=<class 'changeo.IO.AIRRWriter'>, out_file=None, out_args={'failed': False, 'log_file': None, 'out_dir': None, 'out_name': None, 'out_type': 'tsv'})

Pulls from results queue, assembles results and manages log and file IO

Parameters
  • alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns.

  • result_queue – multiprocessing.Queue holding worker results.

  • collect_queue – multiprocessing.Queue to store collector return values.

  • db_file – database file name.

  • label – task label used to tag the output files.

  • fields – list of output fields.

  • writer – writer class.

  • out_file – output file name. Automatically generated from the input file if None.

  • out_args – common output argument dictionary from parseCommonArgs.

Returns

Adds a dictionary with key value pairs to collect_queue containing ‘log’ defining a log object along with the ‘pass’ and ‘fail’ output file names.

Return type

None

changeo.Multiprocessing.feedDbQueue(alive, data_queue, db_file, reader=<class 'changeo.IO.AIRRReader'>, group_func=None, group_args={})

Feeds the data queue with Ig records

Parameters
  • alive – multiprocessing.Value boolean controlling whether processing continues; if False, exit the process

  • data_queue – multiprocessing.Queue to hold data for processing

  • db_file – database file

  • reader – database reader class

  • group_func – function to use for grouping records

  • group_args – dictionary of arguments to pass to group_func

Returns

None

changeo.Multiprocessing.processDbQueue(alive, data_queue, result_queue, process_func, process_args={}, filter_func=None, filter_args={})

Pulls from data queue, performs calculations, and feeds results queue

Parameters
  • alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns

  • data_queue – multiprocessing.Queue holding data to process

  • result_queue – multiprocessing.Queue to hold processed results

  • process_func – function to use for processing sequences

  • process_args – dictionary of arguments to pass to process_func

  • filter_func – function to use for filtering sequences before processing

  • filter_args – dictionary of arguments to pass to filter_func

Returns

None
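
The three functions above describe a feeder, worker, and collector arrangement connected by queues. The following is a minimal sketch of that pattern only, not of Change-O's implementation: it substitutes threads and queue.Queue for multiprocessing, uses a sentinel object instead of the alive flag, and every name in it (feed, process, collect, run_pipeline, has_junction) is hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of a stream (assumption: not how changeo signals it)

def feed(records, data_queue, n_workers):
    """Feeder: push every record, then one sentinel per worker."""
    for rec in records:
        data_queue.put(rec)
    for _ in range(n_workers):
        data_queue.put(SENTINEL)

def process(data_queue, result_queue, process_func):
    """Worker: pull records, apply process_func, push results."""
    while True:
        rec = data_queue.get()
        if rec is SENTINEL:
            result_queue.put(SENTINEL)
            break
        result_queue.put(process_func(rec))

def collect(result_queue, n_workers):
    """Collector: sort results into pass and fail until all workers finish."""
    finished, passed, failed = 0, [], []
    while finished < n_workers:
        item = result_queue.get()
        if item is SENTINEL:
            finished += 1
        elif item["valid"]:
            passed.append(item["id"])
        else:
            failed.append(item["id"])
    return {"pass": sorted(passed), "fail": sorted(failed)}

def run_pipeline(records, process_func, n_workers=2):
    data_queue, result_queue = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=process, args=(data_queue, result_queue, process_func))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    feed(records, data_queue, n_workers)
    summary = collect(result_queue, n_workers)
    for w in workers:
        w.join()
    return summary

def has_junction(rec):
    """Example process_func: a record passes if it has a non-empty junction."""
    return {"id": rec["sequence_id"], "valid": bool(rec.get("junction"))}
```

Each worker emits its sentinel only after its last result, so the collector can stop once it has counted one sentinel per worker.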

changeo.Receptor

Receptor data structure

class changeo.Receptor.AIRRSchema

Bases: object

AIRR format to Receptor mappings

fields = ['sequence_id', 'sequence', 'sequence_alignment', 'germline_alignment', 'sequence_aa', 'sequence_aa_alignment', 'germline_aa_alignment', 'rev_comp', 'productive', 'stop_codon', 'vj_in_frame', 'v_frameshift', 'locus', 'v_call', 'd_call', 'j_call', 'junction', 'junction_start', 'junction_end', 'junction_length', 'junction_aa', 'junction_aa_length', 'np1_length', 'np2_length', 'np1_aa_length', 'np2_aa_length', 'v_sequence_start', 'v_sequence_end', 'v_sequence_length', 'v_germline_start', 'v_germline_end', 'v_germline_length', 'v_sequence_aa_start', 'v_sequence_aa_end', 'v_sequence_aa_length', 'v_germline_aa_start', 'v_germline_aa_end', 'v_germline_aa_length', 'd_sequence_start', 'd_sequence_end', 'd_sequence_length', 'd_germline_start', 'd_germline_end', 'd_germline_length', 'd_sequence_aa_start', 'd_sequence_aa_end', 'd_sequence_aa_length', 'd_germline_aa_start', 'd_germline_aa_end', 'd_germline_aa_length', 'j_sequence_start', 'j_sequence_end', 'j_sequence_length', 'j_germline_start', 'j_germline_end', 'j_germline_length', 'j_sequence_aa_start', 'j_sequence_aa_end', 'j_sequence_aa_length', 'j_germline_aa_start', 'j_germline_aa_end', 'j_germline_aa_length', 'c_call', 'germline_alignment_d_mask', 'v_score', 'v_identity', 'v_support', 'v_cigar', 'd_score', 'd_identity', 'd_support', 'd_cigar', 'j_score', 'j_identity', 'j_support', 'j_cigar', 'vdj_score', 'cdr1', 'cdr2', 'cdr3', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_start', 'cdr1_end', 'cdr2_start', 'cdr2_end', 'cdr3_start', 'cdr3_end', 'fwr1_start', 'fwr1_end', 'fwr2_start', 'fwr2_end', 'fwr3_start', 'fwr3_end', 'fwr4_start', 'fwr4_end', 'n1_length', 'n2_length', 'p3v_length', 'p5d_length', 'p3d_length', 'p5j_length', 'd_frame', 'cdr3_igblast', 'cdr3_igblast_aa', 'duplicate_count', 'consensus_count', 'umi_count', 'clone_id', 'cell_id']
static fromReceptor(field)

Returns an AIRR column name from a Receptor attribute name

Parameters

field – Receptor attribute name.

Returns

AIRR column name.

Return type

str

out_type = 'tsv'
required = ['sequence_id', 'sequence', 'sequence_alignment', 'germline_alignment', 'rev_comp', 'productive', 'stop_codon', 'vj_in_frame', 'locus', 'v_call', 'd_call', 'j_call', 'junction', 'junction_length', 'junction_aa', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end']
static toReceptor(field)

Returns a Receptor attribute name from an AIRR column name

Parameters

field – AIRR column name.

Returns

Receptor attribute name.

Return type

str

class changeo.Receptor.AIRRSchemaAA

Bases: changeo.Receptor.AIRRSchema

AIRR format to Receptor amino acid mappings

required = ['sequence_id', 'sequence', 'sequence_alignment', 'germline_alignment', 'sequence_aa', 'sequence_aa_alignment', 'germline_aa_alignment', 'rev_comp', 'productive', 'stop_codon', 'locus', 'v_call', 'd_call', 'j_call', 'junction', 'junction_length', 'junction_aa', 'v_sequence_aa_start', 'v_sequence_aa_end', 'v_germline_aa_start', 'v_germline_aa_end']
class changeo.Receptor.ChangeoSchema

Bases: object

Change-O to Receptor mappings

fields = ['SEQUENCE_ID', 'SEQUENCE_INPUT', 'SEQUENCE_AA_INPUT', 'FUNCTIONAL', 'IN_FRAME', 'STOP', 'MUTATED_INVARIANT', 'INDELS', 'V_FRAMESHIFT', 'LOCUS', 'V_CALL', 'D_CALL', 'J_CALL', 'SEQUENCE_VDJ', 'SEQUENCE_IMGT', 'SEQUENCE_AA_VDJ', 'SEQUENCE_AA_IMGT', 'V_SEQ_START', 'V_SEQ_LENGTH', 'V_GERM_START_VDJ', 'V_GERM_LENGTH_VDJ', 'V_GERM_START_IMGT', 'V_GERM_LENGTH_IMGT', 'V_SEQ_AA_START', 'V_SEQ_AA_LENGTH', 'V_GERM_AA_START_VDJ', 'V_GERM_AA_LENGTH_VDJ', 'V_GERM_AA_START_IMGT', 'V_GERM_AA_LENGTH_IMGT', 'NP1_LENGTH', 'NP1_AA_LENGTH', 'D_SEQ_START', 'D_SEQ_LENGTH', 'D_GERM_START', 'D_GERM_LENGTH', 'D_SEQ_AA_START', 'D_SEQ_AA_LENGTH', 'D_GERM_AA_START', 'D_GERM_AA_LENGTH', 'NP2_LENGTH', 'NP2_AA_LENGTH', 'J_SEQ_START', 'J_SEQ_LENGTH', 'J_GERM_START', 'J_GERM_LENGTH', 'J_SEQ_AA_START', 'J_SEQ_AA_LENGTH', 'J_GERM_AA_START', 'J_GERM_AA_LENGTH', 'JUNCTION', 'JUNCTION_LENGTH', 'GERMLINE_IMGT', 'GERMLINE_AA_IMGT', 'JUNCTION_START', 'V_SCORE', 'V_IDENTITY', 'V_EVALUE', 'V_BTOP', 'V_CIGAR', 'D_SCORE', 'D_IDENTITY', 'D_EVALUE', 'D_BTOP', 'D_CIGAR', 'J_SCORE', 'J_IDENTITY', 'J_EVALUE', 'J_BTOP', 'J_CIGAR', 'VDJ_SCORE', 'FWR1_IMGT', 'FWR2_IMGT', 'FWR3_IMGT', 'FWR4_IMGT', 'CDR1_IMGT', 'CDR2_IMGT', 'CDR3_IMGT', 'FWR1_AA_IMGT', 'FWR2_AA_IMGT', 'FWR3_AA_IMGT', 'FWR4_AA_IMGT', 'CDR1_AA_IMGT', 'CDR2_AA_IMGT', 'CDR3_AA_IMGT', 'N1_LENGTH', 'N2_LENGTH', 'P3V_LENGTH', 'P5D_LENGTH', 'P3D_LENGTH', 'P5J_LENGTH', 'D_FRAME', 'C_CALL', 'CDR3_IGBLAST', 'CDR3_IGBLAST_AA', 'CONSCOUNT', 'DUPCOUNT', 'UMICOUNT', 'CLONE', 'CELL']
static fromReceptor(field)

Returns a Change-O column name from a Receptor attribute name

Parameters

field – Receptor attribute name.

Returns

Change-O column name.

Return type

str

out_type = 'tab'
required = ['SEQUENCE_ID', 'SEQUENCE_INPUT', 'FUNCTIONAL', 'IN_FRAME', 'STOP', 'MUTATED_INVARIANT', 'INDELS', 'LOCUS', 'V_CALL', 'D_CALL', 'J_CALL', 'SEQUENCE_VDJ', 'SEQUENCE_IMGT', 'V_SEQ_START', 'V_SEQ_LENGTH', 'V_GERM_START_VDJ', 'V_GERM_LENGTH_VDJ', 'V_GERM_START_IMGT', 'V_GERM_LENGTH_IMGT', 'NP1_LENGTH', 'D_SEQ_START', 'D_SEQ_LENGTH', 'D_GERM_START', 'D_GERM_LENGTH', 'NP2_LENGTH', 'J_SEQ_START', 'J_SEQ_LENGTH', 'J_GERM_START', 'J_GERM_LENGTH', 'JUNCTION', 'JUNCTION_LENGTH', 'GERMLINE_IMGT']
static toReceptor(field)

Returns a Receptor attribute name from a Change-O column name

Parameters

field – Change-O column name.

Returns

Receptor attribute name.

Return type

str
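
The schema classes translate between file column names and Receptor attribute names. A minimal sketch of that idea follows, using plain dictionaries; the four column pairings shown are assumptions chosen for illustration and the helper names are hypothetical, so consult the schema classes themselves for the real mappings.

```python
# Assumed subset of column-to-attribute pairings, for illustration only.
AIRR_TO_RECEPTOR = {
    "sequence_id": "sequence_id",
    "sequence": "sequence_input",
    "productive": "functional",
    "clone_id": "clone",
}
CHANGEO_TO_RECEPTOR = {
    "SEQUENCE_ID": "sequence_id",
    "SEQUENCE_INPUT": "sequence_input",
    "FUNCTIONAL": "functional",
    "CLONE": "clone",
}

def invert(mapping):
    """Build the reverse lookup (attribute name -> column name)."""
    return {attr: column for column, attr in mapping.items()}

def airr_to_changeo(column):
    """Translate an AIRR column to a Change-O column via the shared attribute name."""
    attribute = AIRR_TO_RECEPTOR[column]
    return invert(CHANGEO_TO_RECEPTOR)[attribute]

def changeo_to_airr(column):
    """Translate a Change-O column to an AIRR column via the shared attribute name."""
    attribute = CHANGEO_TO_RECEPTOR[column]
    return invert(AIRR_TO_RECEPTOR)[attribute]
```

Routing both formats through one set of attribute names means each format needs only a single mapping table rather than one table per format pair.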

class changeo.Receptor.ChangeoSchemaAA

Bases: changeo.Receptor.ChangeoSchema

Change-O to Receptor amino acid mappings

required = ['SEQUENCE_ID', 'SEQUENCE_AA_INPUT', 'STOP', 'INDELS', 'LOCUS', 'V_CALL', 'SEQUENCE_AA_VDJ', 'SEQUENCE_AA_IMGT', 'V_SEQ_AA_START', 'V_SEQ_AA_LENGTH', 'V_GERM_AA_START_VDJ', 'V_GERM_AA_LENGTH_VDJ', 'V_GERM_AA_START_IMGT', 'V_GERM_AA_LENGTH_IMGT', 'GERMLINE_AA_IMGT']
class changeo.Receptor.Receptor(data)

Bases: object

A class defining a V(D)J sequence and its annotations

property d_germ_aa_end

Position of the last amino acid in the D germline amino acid alignment

property d_germ_end

Position of the last nucleotide in the D germline sequence alignment

property d_seq_aa_end

Position of the last D amino acid in the input amino acid sequence

property d_seq_end

Position of the last D nucleotide in the input sequence

getAIRR(field, seq=False)

Get an attribute from an AIRR field name

Parameters
  • field – AIRR column name as a string

  • seq – if True return the attribute as a Seq object

Returns

Value in the AIRR field. Returns None if the field cannot be found.

getAlleleCalls(calls, action='first')

Get multiple allele calls

Parameters
  • calls – iterable of calls to get; one or more of (‘v’,’d’,’j’)

  • action – One of (‘first’, ‘set’)

Returns

List of requested calls in order

Return type

list

getAlleleNumbers(calls, action='first')

Get multiple allele numeric identifiers

Parameters
  • calls – iterable of calls to get; one or more of (‘v’,’d’,’j’)

  • action – One of (‘first’, ‘set’)

Returns

List of requested calls in order

Return type

list

getChangeo(field, seq=False)

Get an attribute from a Change-O field name

Parameters
  • field – Change-O column name as a string

  • seq – if True return the attribute as a Seq object

Returns

Value in the Change-O field. Returns None if the field cannot be found.

getDAllele(action='first', field=None)

D segment allele getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the D call. Use d_call attribute if None.

Returns

String of the allele when action is ‘first’; tuple of allele calls for ‘set’ or ‘list’ actions.

Return type

str

getDAlleleNumber(action='first', field=None)

D segment allele number getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the D call. Use d_call attribute if None.

Returns

String of the allele number when action is ‘first’; tuple of allele numbers for ‘set’ or ‘list’ actions.

Return type

str

getDFamily(action='first', field=None)

D segment family getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the D call. Use d_call attribute if None.

Returns

String of the family when action is ‘first’; tuple of family calls for ‘set’ or ‘list’ actions.

Return type

str

getDGene(action='first', field=None)

D segment gene getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the D call. Use d_call attribute if None.

Returns

String of the gene when action is ‘first’; tuple of gene calls for ‘set’ or ‘list’ actions.

Return type

str

getFamilyCalls(calls, action='first')

Get multiple family calls

Parameters
  • calls – iterable of calls to get; one or more of (‘v’,’d’,’j’)

  • action – One of (‘first’, ‘set’)

Returns

List of requested calls in order

Return type

list

getField(field)

Get an attribute or annotation value

Parameters

field – attribute name as a string

Returns

Value in the attribute. Returns None if the attribute cannot be found.

getGeneCalls(calls, action='first')

Get multiple gene calls

Parameters
  • calls – iterable of calls to get; one or more of (‘v’,’d’,’j’)

  • action – One of (‘first’, ‘set’)

Returns

List of requested calls in order

Return type

list

getJAllele(action='first', field=None)

J segment allele getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the J call. Use j_call attribute if None.

Returns

String of the allele when action is ‘first’; tuple of allele calls for ‘set’ or ‘list’ actions.

Return type

str

getJAlleleNumber(action='first', field=None)

J segment allele number getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the J call. Use j_call attribute if None.

Returns

String of the allele number when action is ‘first’; tuple of allele numbers for ‘set’ or ‘list’ actions.

Return type

str

getJFamily(action='first', field=None)

J segment family getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the J call. Use j_call attribute if None.

Returns

String of the family when action is ‘first’; tuple of family calls for ‘set’ or ‘list’ actions.

Return type

str

getJGene(action='first', field=None)

J segment gene getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the J call. Use j_call attribute if None.

Returns

String of the gene when action is ‘first’; tuple of gene calls for ‘set’ or ‘list’ actions.

Return type

str

getSeq(field)

Get an attribute value converted to a Seq object

Parameters

field – variable name as a string

Returns

Value in the field as a Seq object

Return type

Bio.Seq.Seq

getVAllele(action='first', field=None)

V segment allele getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the V call. Use v_call attribute if None.

Returns

String of the allele when action is ‘first’; tuple of allele calls for ‘set’ or ‘list’ actions.

Return type

str

getVAlleleNumber(action='first', field=None)

V segment allele number getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the V call. Use v_call attribute if None.

Returns

String of the allele number when action is ‘first’; tuple of allele numbers for ‘set’ or ‘list’ actions.

Return type

str

getVFamily(action='first', field=None)

V segment family getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the V call. Use v_call attribute if None.

Returns

String of the family when action is ‘first’; tuple of family calls for ‘set’ or ‘list’ actions.

Return type

str

getVGene(action='first', field=None)

V segment gene getter

Parameters
  • action – One of ‘first’, ‘set’ or ‘list’

  • field – attribute or annotation name containing the V call. Use v_call attribute if None.

Returns

String of the gene when action is ‘first’; tuple of gene calls for ‘set’ or ‘list’ actions.

Return type

str
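
The allele, gene, and family getters above differ only in how much of each call they keep and in how the action argument shapes the result. The sketch below illustrates that behavior under stated assumptions: the regular expression is a simplified, hypothetical pattern for IMGT-style names, parse_calls is not a Change-O function, and ‘set’ here keeps unique values in order of first appearance, which may differ from the library's ordering.

```python
import re

# Hypothetical, simplified pattern for allele names such as IGHV1-69*01.
ALLELE_RE = re.compile(r"(?:IG[HKL]|TR[ABDG])[VDJ][A-Z0-9]*(?:[-/][A-Z0-9]+)*\*\d+")

def parse_calls(call, level="allele", action="first"):
    """Extract allele, gene, or family names from a comma-delimited call string."""
    alleles = ALLELE_RE.findall(call or "")
    if level == "allele":
        values = alleles
    elif level == "gene":
        # Drop the allele suffix after '*'
        values = [a.split("*")[0] for a in alleles]
    elif level == "family":
        # Drop the allele suffix, then everything after the first '-'
        values = [a.split("*")[0].split("-")[0] for a in alleles]
    else:
        raise ValueError("level must be 'allele', 'gene' or 'family'")

    if not values:
        return None
    if action == "first":
        return values[0]
    if action == "set":
        return tuple(dict.fromkeys(values))  # unique, first-seen order
    if action == "list":
        return tuple(values)
    raise ValueError("action must be 'first', 'set' or 'list'")

v_call = "IGHV1-69*01,IGHV1-69D*01,IGHV1-2*02"
first_allele = parse_calls(v_call)                    # 'IGHV1-69*01'
unique_families = parse_calls(v_call, "family", "set")  # ('IGHV1',)
```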

property j_germ_aa_end

Position of the last amino acid in the J germline amino acid alignment

property j_germ_end

Position of the last nucleotide in the J germline sequence alignment

property j_seq_aa_end

Position of the last J amino acid in the input amino acid sequence

property j_seq_end

Position of the last J nucleotide in the input sequence

property junction_end

Position of the last junction nucleotide in the input sequence

setDict(data, parse=False)

Adds or updates multiple attributes and annotations

Parameters
  • data – a dictionary of annotations to add or update.

  • parse – if True pass values through string parsing functions for known fields.

Returns

updates attribute values and the annotations attribute.

Return type

None

setField(field, value, parse=False)

Set an attribute or annotation value

Parameters
  • field – attribute name as a string

  • value – value to assign

  • parse – if True pass values through string parsing functions for known fields.

Returns

None. Updates attribute or annotation.
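
The parse argument routes string values through a type conversion chosen by field name. A minimal sketch of that dispatch is shown below; the converter bodies, the accepted logical strings (‘T’, ‘F’, ‘TRUE’, ‘FALSE’), and the set_field helper are assumptions for illustration rather than the library's exact behavior.

```python
def integer(v, deparse=False):
    """Parse a string to int, or deparse an int back to a string."""
    if deparse:
        return "" if v is None else str(v)
    try:
        return int(v)
    except (TypeError, ValueError):
        return None

def logical(v, deparse=False):
    """Parse 'T'/'F' style strings to bool, or deparse a bool back to 'T'/'F'."""
    if deparse:
        return {True: "T", False: "F"}.get(v, "")
    return {"T": True, "TRUE": True, "F": False, "FALSE": False}.get(str(v).upper())

# Field name -> converter name, mirroring the shape of a parsers table.
PARSERS = {"junction_length": "integer", "functional": "logical"}
CONVERTERS = {"integer": integer, "logical": logical}

def set_field(record, field, value, parse=False):
    """Store value in record, converting it first when parse is True and the field is known."""
    if parse and field in PARSERS:
        value = CONVERTERS[PARSERS[field]](value)
    record[field] = value

record = {}
set_field(record, "junction_length", "48", parse=True)  # stored as int 48
set_field(record, "functional", "T", parse=True)        # stored as True
set_field(record, "sample", "S1", parse=True)           # unknown field, stored unchanged
```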

toDict()

Convert the namespace to a dictionary

Returns

member fields with values converted to appropriate strings

Return type

dict

property v_germ_aa_end_imgt

Position of the last amino acid in the IMGT-gapped V germline amino acid alignment

property v_germ_aa_end_vdj

Position of the last amino acid in the ungapped V germline amino acid alignment

property v_germ_end_imgt

Position of the last nucleotide in the IMGT-gapped V germline sequence alignment

property v_germ_end_vdj

Position of the last nucleotide in the ungapped V germline sequence alignment

property v_seq_aa_end

Position of the last V amino acid in the input amino acid sequence

property v_seq_end

Position of the last V nucleotide in the input sequence

class changeo.Receptor.ReceptorData

Bases: object

A class containing type conversion methods for Receptor data attributes

sequence_id

unique sequence identifier.

Type

str

rev_comp

whether the alignment is relative to the reverse complement of the input sequence.

Type

bool

functional

whether sample V(D)J sequence is predicted to be functional.

Type

bool

in_frame

whether junction region is in-frame.

Type

bool

stop

whether a stop codon is present in the V(D)J sequence.

Type

bool

mutated_invariant

whether the conserved amino acids are mutated in the V(D)J sequence.

Type

bool

indels

whether the V(D)J nucleotide sequence contains insertions and/or deletions.

Type

bool

v_frameshift

whether the V segment contains a frameshift

Type

bool

sequence_input

input nucleotide sequence.

Type

Bio.Seq.Seq

sequence_vdj

Aligned V(D)J nucleotide sequence without IMGT-gaps.

Type

Bio.Seq.Seq

sequence_imgt

IMGT-gapped V(D)J nucleotide sequence.

Type

Bio.Seq.Seq

sequence_aa_input

input amino acid sequence.

Type

Bio.Seq.Seq

sequence_aa_vdj

Aligned V(D)J amino acid sequence without IMGT-gaps.

Type

Bio.Seq.Seq

sequence_aa_imgt

IMGT-gapped V(D)J amino acid sequence.

Type

Bio.Seq.Seq

junction

ungapped junction region nucleotide sequence.

Type

Bio.Seq.Seq

junction_aa

ungapped junction region amino acid sequence.

Type

Bio.Seq.Seq

junction_start

start position of the junction in the input nucleotide sequence.

Type

int

junction_length

length of the junction in nucleotides.

Type

int

germline_vdj

full ungapped germline V(D)J nucleotide sequence.

Type

Bio.Seq.Seq

germline_vdj_d_mask

ungapped germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions.

Type

Bio.Seq.Seq

germline_imgt

full IMGT-gapped germline V(D)J nucleotide sequence.

Type

Bio.Seq.Seq

germline_imgt_d_mask

IMGT-gapped germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions.

Type

Bio.Seq.Seq

germline_aa_vdj

full ungapped germline V(D)J amino acid sequence.

Type

Bio.Seq.Seq

germline_aa_imgt

full IMGT-gapped germline V(D)J amino acid sequence.

Type

Bio.Seq.Seq

v_call

V allele assignment(s).

Type

str

d_call

D allele assignment(s).

Type

str

j_call

J allele assignment(s).

Type

str

c_call

C region assignment.

Type

str

v_seq_start

position of the first V nucleotide in the input sequence (1-based).

Type

int

v_seq_length

number of V nucleotides in the input sequence.

Type

int

v_germ_start_imgt

position of the first V nucleotide in IMGT-gapped V germline sequence alignment (1-based).

Type

int

v_germ_length_imgt

length of the IMGT numbered germline V alignment.

Type

int

v_germ_start_vdj

position of the first nucleotide in ungapped V germline sequence alignment (1-based).

Type

int

v_germ_length_vdj

length of the ungapped germline V alignment.

Type

int

v_seq_aa_start

position of the first V amino acid in the amino acid input sequence (1-based).

Type

int

v_seq_aa_length

number of V amino acids in the amino acid input sequence.

Type

int

v_germ_aa_start_imgt

position of the first V amino acid in IMGT-gapped V germline amino acid alignment (1-based).

Type

int

v_germ_aa_length_imgt

length of the IMGT numbered germline V amino acid alignment.

Type

int

v_germ_aa_start_vdj

position of the first amino acid in ungapped V germline amino acid alignment (1-based).

Type

int

v_germ_aa_length_vdj

length of the ungapped germline V amino acid alignment.

Type

int

np1_start

position of the first untemplated nucleotide between the V and D segments in the input sequence (1-based).

Type

int

np1_length

number of untemplated nucleotides between the V and D segments.

Type

int

np1_aa_start

position of the first untemplated amino acid between the V and D segments in the input amino acid sequence (1-based).

Type

int

np1_aa_length

number of untemplated amino acids between the V and D segments.

Type

int

d_seq_start

position of the first D nucleotide in the input sequence (1-based).

Type

int

d_seq_length

number of D nucleotides in the input sequence.

Type

int

d_germ_start

position of the first nucleotide in D germline sequence alignment (1-based).

Type

int

d_germ_length

length of the germline D alignment.

Type

int

d_seq_aa_start

position of the first D amino acid in the input amino acid sequence (1-based).

Type

int

d_seq_aa_length

number of D amino acids in the input amino acid sequence.

Type

int

d_germ_aa_start

position of the first amino acid in D germline amino acid alignment (1-based).

Type

int

d_germ_aa_length

length of the germline D amino acid alignment.

Type

int

np2_start

position of the first untemplated nucleotide between the D and J segments in the input sequence (1-based).

Type

int

np2_length

number of untemplated nucleotides between the D and J segments.

Type

int

np2_aa_start

position of the first untemplated amino acid between the D and J segments in the input amino acid sequence (1-based).

Type

int

np2_aa_length

number of untemplated amino acids between the D and J segments.

Type

int

j_seq_start

position of the first J nucleotide in the input sequence (1-based).

Type

int

j_seq_length

number of J nucleotides in the input sequence.

Type

int

j_germ_start

position of the first nucleotide in J germline sequence alignment (1-based).

Type

int

j_germ_length

length of the germline J alignment.

Type

int

j_seq_aa_start

position of the first J amino acid in the input amino acid sequence (1-based).

Type

int

j_seq_aa_length

number of J amino acids in the input amino acid sequence.

Type

int

j_germ_aa_start

position of the first amino acid in J germline amino acid alignment (1-based).

Type

int

j_germ_aa_length

length of the germline J amino acid alignment.

Type

int

v_score

alignment score for the V.

Type

float

v_identity

alignment identity for the V.

Type

float

v_evalue

E-value for the alignment of the V.

Type

float

v_btop

BTOP for the alignment of the V.

Type

str

v_cigar

CIGAR for the alignment of the V.

Type

str

d_score

alignment score for the D.

Type

float

d_identity

alignment identity for the D.

Type

float

d_evalue

E-value for the alignment of the D.

Type

float

d_btop

BTOP for the alignment of the D.

Type

str

d_cigar

CIGAR for the alignment of the D.

Type

str

j_score

alignment score for the J.

Type

float

j_identity

alignment identity for the J.

Type

float

j_evalue

E-value for the alignment of the J.

Type

float

j_btop

BTOP for the alignment of the J.

Type

str

j_cigar

CIGAR for the alignment of the J.

Type

str

vdj_score

alignment score for the V(D)J.

Type

float

fwr1_imgt

IMGT-gapped FWR1 nucleotide sequence.

Type

Bio.Seq.Seq

fwr2_imgt

IMGT-gapped FWR2 nucleotide sequence.

Type

Bio.Seq.Seq

fwr3_imgt

IMGT-gapped FWR3 nucleotide sequence.

Type

Bio.Seq.Seq

fwr4_imgt

IMGT-gapped FWR4 nucleotide sequence.

Type

Bio.Seq.Seq

cdr1_imgt

IMGT-gapped CDR1 nucleotide sequence.

Type

Bio.Seq.Seq

cdr2_imgt

IMGT-gapped CDR2 nucleotide sequence.

Type

Bio.Seq.Seq

cdr3_imgt

IMGT-gapped CDR3 nucleotide sequence.

Type

Bio.Seq.Seq

cdr3_igblast

CDR3 nucleotide sequence assigned by IgBLAST.

Type

Bio.Seq.Seq

fwr1_aa_imgt

IMGT-gapped FWR1 amino acid sequence.

Type

Bio.Seq.Seq

fwr2_aa_imgt

IMGT-gapped FWR2 amino acid sequence.

Type

Bio.Seq.Seq

fwr3_aa_imgt

IMGT-gapped FWR3 amino acid sequence.

Type

Bio.Seq.Seq

fwr4_aa_imgt

IMGT-gapped FWR4 amino acid sequence.

Type

Bio.Seq.Seq

cdr1_aa_imgt

IMGT-gapped CDR1 amino acid sequence.

Type

Bio.Seq.Seq

cdr2_aa_imgt

IMGT-gapped CDR2 amino acid sequence.

Type

Bio.Seq.Seq

cdr3_aa_imgt

IMGT-gapped CDR3 amino acid sequence.

Type

Bio.Seq.Seq

cdr3_igblast_aa

CDR3 amino acid sequence assigned by IgBLAST.

Type

Bio.Seq.Seq

n1_length

N nucleotides 5’ of the D segment.

Type

int

n2_length

N nucleotides 3’ of the D segment.

Type

int

p3v_length

palindromic nucleotides 3’ of the V segment.

Type

int

p5d_length

palindromic nucleotides 5’ of the D segment.

Type

int

p3d_length

palindromic nucleotides 3’ of the D segment.

Type

int

p5j_length

palindromic nucleotides 5’ of the J segment.

Type

int

d_frame

D segment reading frame.

Type

int

conscount

number of reads contributing to the UMI consensus sequence.

Type

int

dupcount

copy number of the sequence.

Type

int

umicount

number of UMIs representing the sequence.

Type

int

clone

clonal cluster identifier.

Type

str

cell

origin cell identifier.

Type

str

annotations

dictionary containing all unknown fields.

Type

dict

static aminoacid(v, deparse=False)
static double(v, deparse=False)
end_fields = {'cdr1_end': ('cdr1_start', 'cdr1_length'), 'cdr2_end': ('cdr2_start', 'cdr2_length'), 'cdr3_end': ('cdr3_start', 'cdr3_length'), 'd_germ_aa_end': ('d_germ_aa_start', 'd_germ_aa_length'), 'd_germ_end': ('d_germ_start', 'd_germ_length'), 'd_seq_aa_end': ('d_seq_aa_start', 'd_seq_aa_length'), 'd_seq_end': ('d_seq_start', 'd_seq_length'), 'fwr1_end': ('fwr1_start', 'fwr1_length'), 'fwr2_end': ('fwr2_start', 'fwr2_length'), 'fwr3_end': ('fwr3_start', 'fwr3_length'), 'fwr4_end': ('fwr4_start', 'fwr4_length'), 'j_germ_aa_end': ('j_germ_aa_start', 'j_germ_aa_length'), 'j_germ_end': ('j_germ_start', 'j_germ_length'), 'j_seq_aa_end': ('j_seq_aa_start', 'j_seq_aa_length'), 'j_seq_end': ('j_seq_start', 'j_seq_length'), 'junction_end': ('junction_start', 'junction_length'), 'v_alignment_aa_end': ('v_alignment_aa_start', 'v_alignment_aa_length'), 'v_alignment_end': ('v_alignment_start', 'v_alignment_length'), 'v_germ_aa_end_imgt': ('v_germ_aa_start_imgt', 'v_germ_aa_length_imgt'), 'v_germ_aa_end_vdj': ('v_germ_aa_start_vdj', 'v_germ_aa_length_vdj'), 'v_germ_end_imgt': ('v_germ_start_imgt', 'v_germ_length_imgt'), 'v_germ_end_vdj': ('v_germ_start_vdj', 'v_germ_length_vdj'), 'v_seq_aa_end': ('v_seq_aa_start', 'v_seq_aa_length'), 'v_seq_end': ('v_seq_start', 'v_seq_length')}
static identity(v, deparse=False)
static integer(v, deparse=False)
length_fields = {'cdr1_length': ('cdr1_start', 'cdr1_end'), 'cdr2_length': ('cdr2_start', 'cdr2_end'), 'cdr3_length': ('cdr3_start', 'cdr3_end'), 'd_germ_aa_length': ('d_germ_aa_start', 'd_germ_aa_end'), 'd_germ_length': ('d_germ_start', 'd_germ_end'), 'd_seq_aa_length': ('d_seq_aa_start', 'd_seq_aa_end'), 'd_seq_length': ('d_seq_start', 'd_seq_end'), 'fwr1_length': ('fwr1_start', 'fwr1_end'), 'fwr2_length': ('fwr2_start', 'fwr2_end'), 'fwr3_length': ('fwr3_start', 'fwr3_end'), 'fwr4_length': ('fwr4_start', 'fwr4_end'), 'j_germ_aa_length': ('j_germ_aa_start', 'j_germ_aa_end'), 'j_germ_length': ('j_germ_start', 'j_germ_end'), 'j_seq_aa_length': ('j_seq_aa_start', 'j_seq_aa_end'), 'j_seq_length': ('j_seq_start', 'j_seq_end'), 'junction_length': ('junction_start', 'junction_end'), 'v_alignment_aa_length': ('v_alignment_aa_start', 'v_alignment_aa_end'), 'v_alignment_length': ('v_alignment_start', 'v_alignment_end'), 'v_germ_aa_length_imgt': ('v_germ_aa_start_imgt', 'v_germ_aa_end_imgt'), 'v_germ_aa_length_vdj': ('v_germ_aa_start_vdj', 'v_germ_aa_end_vdj'), 'v_germ_length_imgt': ('v_germ_start_imgt', 'v_germ_end_imgt'), 'v_germ_length_vdj': ('v_germ_start_vdj', 'v_germ_end_vdj'), 'v_seq_aa_length': ('v_seq_aa_start', 'v_seq_aa_end'), 'v_seq_length': ('v_seq_start', 'v_seq_end')}
static logical(v, deparse=False)
static nucleotide(v, deparse=False)
parsers = {'c_call': 'identity', 'cdr1_aa_imgt': 'aminoacid', 'cdr1_imgt': 'nucleotide', 'cdr2_aa_imgt': 'aminoacid', 'cdr2_imgt': 'nucleotide', 'cdr3_aa_imgt': 'aminoacid', 'cdr3_igblast': 'nucleotide', 'cdr3_igblast_aa': 'aminoacid', 'cdr3_imgt': 'nucleotide', 'cell': 'identity', 'clone': 'identity', 'conscount': 'integer', 'd_btop': 'identity', 'd_call': 'identity', 'd_cigar': 'identity', 'd_evalue': 'double', 'd_frame': 'integer', 'd_germ_aa_length': 'integer', 'd_germ_aa_start': 'integer', 'd_germ_length': 'integer', 'd_germ_start': 'integer', 'd_identity': 'double', 'd_score': 'double', 'd_seq_aa_length': 'integer', 'd_seq_aa_start': 'integer', 'd_seq_length': 'integer', 'd_seq_start': 'integer', 'dupcount': 'integer', 'functional': 'logical', 'fwr1_aa_imgt': 'aminoacid', 'fwr1_imgt': 'nucleotide', 'fwr2_aa_imgt': 'aminoacid', 'fwr2_imgt': 'nucleotide', 'fwr3_aa_imgt': 'aminoacid', 'fwr3_imgt': 'nucleotide', 'fwr4_aa_imgt': 'aminoacid', 'fwr4_imgt': 'nucleotide', 'germline_aa_imgt': 'aminoacid', 'germline_aa_vdj': 'aminoacid', 'germline_imgt': 'nucleotide', 'germline_imgt_d_mask': 'nucleotide', 'germline_vdj': 'nucleotide', 'germline_vdj_d_mask': 'nucleotide', 'in_frame': 'logical', 'indels': 'logical', 'j_btop': 'identity', 'j_call': 'identity', 'j_cigar': 'identity', 'j_evalue': 'double', 'j_germ_aa_length': 'integer', 'j_germ_aa_start': 'integer', 'j_germ_length': 'integer', 'j_germ_start': 'integer', 'j_identity': 'double', 'j_score': 'double', 'j_seq_aa_length': 'integer', 'j_seq_aa_start': 'integer', 'j_seq_length': 'integer', 'j_seq_start': 'integer', 'junction': 'nucleotide', 'junction_aa': 'aminoacid', 'junction_length': 'integer', 'junction_start': 'integer', 'locus': 'identity', 'mutated_invariant': 'logical', 'n1_length': 'integer', 'n2_length': 'integer', 'np1_aa_length': 'integer', 'np1_aa_start': 'integer', 'np1_length': 'integer', 'np1_start': 'integer', 'np2_aa_length': 'integer', 'np2_aa_start': 'integer', 'np2_length': 'integer', 
'np2_start': 'integer', 'p3d_length': 'integer', 'p3v_length': 'integer', 'p5d_length': 'integer', 'p5j_length': 'integer', 'rev_comp': 'logical', 'sequence_aa_imgt': 'aminoacid', 'sequence_aa_input': 'aminoacid', 'sequence_aa_vdj': 'aminoacid', 'sequence_id': 'identity', 'sequence_imgt': 'nucleotide', 'sequence_input': 'nucleotide', 'sequence_vdj': 'nucleotide', 'stop': 'logical', 'umicount': 'integer', 'v_btop': 'identity', 'v_call': 'identity', 'v_cigar': 'identity', 'v_evalue': 'double', 'v_frameshift': 'logical', 'v_germ_aa_length_imgt': 'integer', 'v_germ_aa_length_vdj': 'integer', 'v_germ_aa_start_imgt': 'integer', 'v_germ_aa_start_vdj': 'integer', 'v_germ_length_imgt': 'integer', 'v_germ_length_vdj': 'integer', 'v_germ_start_imgt': 'integer', 'v_germ_start_vdj': 'integer', 'v_identity': 'double', 'v_score': 'double', 'v_seq_aa_length': 'integer', 'v_seq_aa_start': 'integer', 'v_seq_length': 'integer', 'v_seq_start': 'integer', 'vdj_score': 'double'}
start_fields = {'cdr1_start': ('cdr1_length', 'cdr1_end'), 'cdr2_start': ('cdr2_length', 'cdr2_end'), 'cdr3_start': ('cdr3_length', 'cdr3_end'), 'd_germ_aa_start': ('d_germ_aa_length', 'd_germ_aa_end'), 'd_germ_start': ('d_germ_length', 'd_germ_end'), 'd_seq_aa_start': ('d_seq_aa_length', 'd_seq_aa_end'), 'd_seq_start': ('d_seq_length', 'd_seq_end'), 'fwr1_start': ('fwr1_length', 'fwr1_end'), 'fwr2_start': ('fwr2_length', 'fwr2_end'), 'fwr3_start': ('fwr3_length', 'fwr3_end'), 'fwr4_start': ('fwr4_length', 'fwr4_end'), 'j_germ_aa_start': ('j_germ_aa_length', 'j_germ_aa_end'), 'j_germ_start': ('j_germ_length', 'j_germ_end'), 'j_seq_aa_start': ('j_seq_aa_length', 'j_seq_aa_end'), 'j_seq_start': ('j_seq_length', 'j_seq_end'), 'junction_start': ('junction_length', 'junction_end'), 'v_alignment_aa_start': ('v_alignment_aa_length', 'v_alignment_aa_end'), 'v_alignment_start': ('v_alignment_length', 'v_alignment_end'), 'v_germ_aa_start_imgt': ('v_germ_aa_length_imgt', 'v_germ_aa_end_imgt'), 'v_germ_aa_start_vdj': ('v_germ_aa_length_vdj', 'v_germ_aa_end_vdj'), 'v_germ_start_imgt': ('v_germ_length_imgt', 'v_germ_end_imgt'), 'v_germ_start_vdj': ('v_germ_length_vdj', 'v_germ_end_vdj'), 'v_seq_aa_start': ('v_seq_aa_length', 'v_seq_aa_end'), 'v_seq_start': ('v_seq_length', 'v_seq_end')}
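
The start_fields, length_fields, and end_fields tables each name the two fields from which the third can be derived. The sketch below shows the arithmetic this implies, assuming 1-based inclusive coordinates so that end = start + length - 1; the text above does not confirm that Change-O computes it exactly this way, and derive_end and derive_length are hypothetical helpers operating on a plain dictionary.

```python
# Two entries copied from the tables above; the rest follow the same shape.
END_FIELDS = {
    "v_seq_end": ("v_seq_start", "v_seq_length"),
    "junction_end": ("junction_start", "junction_length"),
}
LENGTH_FIELDS = {
    "v_seq_length": ("v_seq_start", "v_seq_end"),
    "junction_length": ("junction_start", "junction_end"),
}

def derive_end(record, end_field):
    """End position from start and length (assumed 1-based, inclusive)."""
    start_field, length_field = END_FIELDS[end_field]
    start, length = record.get(start_field), record.get(length_field)
    if start is None or length is None:
        return None
    return start + length - 1

def derive_length(record, length_field):
    """Length from start and end (assumed 1-based, inclusive)."""
    start_field, end_field = LENGTH_FIELDS[length_field]
    start, end = record.get(start_field), record.get(end_field)
    if start is None or end is None:
        return None
    return end - start + 1
```

Under this convention a segment that starts at position 1 with length 296 ends at position 296, and a missing start or length yields None rather than an error.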

Using IgBLAST

Example data

We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. In addition to the example FASTA files, we have included the standalone IgBLAST results. The files can be downloaded from here:

Change-O Example Files

Configuring IgBLAST

A collection of scripts for setting up the standalone IgBLAST database from the IMGT reference sequences is available on the Immcantation repository. To use these scripts, copy all the tools in the /scripts folder to a location in your PATH. At a minimum, you’ll need the following scripts:

  1. fetch_igblastdb.sh

  2. fetch_imgtdb.sh

  3. clean_imgtdb.py

  4. imgt2igblast.sh

Download and configure the IgBLAST and IMGT reference databases as follows, adjusting the version number to taste:

# Download and extract IgBLAST
VERSION="1.17.0"
wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/${VERSION}/ncbi-igblast-${VERSION}-x64-linux.tar.gz
tar -zxf ncbi-igblast-${VERSION}-x64-linux.tar.gz
cp ncbi-igblast-${VERSION}/bin/* ~/bin
# Download reference databases and setup IGDATA directory
fetch_igblastdb.sh -o ~/share/igblast
cp -r ncbi-igblast-${VERSION}/internal_data ~/share/igblast
cp -r ncbi-igblast-${VERSION}/optional_file ~/share/igblast
# Build IgBLAST database from IMGT reference sequences
fetch_imgtdb.sh -o ~/share/germlines/imgt
imgt2igblast.sh -i ~/share/germlines/imgt -o ~/share/igblast

Note

Several Immcantation tools require the observed V(D)J sequence (sequence_alignment) and associated germline fields (germline_alignment or germline_alignment_d_mask) to have gaps inserted to conform to the IMGT numbering scheme. Thus, when a tool such as MakeDb.py or CreateGermlines.py requires a reference sequence set as input, it will require the IMGT-gapped reference set: that is, the reference sequences downloaded using the fetch_imgtdb.sh script, or downloaded manually from the IMGT reference directory, rather than the final ungapped reference set required by IgBLAST.
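To make the gapped/ungapped distinction concrete, the short Python sketch below strips gap characters from a gapped sequence to give the ungapped form. It assumes gaps are written as `.` characters; the function name and the toy sequence are hypothetical and are not part of Change-O.

```python
def ungap_imgt(sequence, gap_chars="."):
    """Return the sequence with gap characters removed."""
    return "".join(base for base in sequence if base not in gap_chars)

# Toy gapped fragment (hypothetical, not a real allele)
gapped = "CAGGTG...CAGCTG......GTG"
ungapped = ungap_imgt(gapped)
print(gapped)
print(ungapped)
```

The gapped form is what MakeDb.py and CreateGermlines.py expect as a reference, while the ungapped form corresponds to what the IgBLAST database is built from.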

See also

The provided scripts download only the mouse and human IMGT reference databases. See the IgBLAST documentation for instructions on how to build the database in a more general case. Shown below is an example of how to perform the same steps as the Immcantation scripts using a separately downloaded IMGT reference set and the scripts provided by IgBLAST. You must have all of the associated commands in your PATH and the appropriate directories created:

# V segment database
edit_imgt_file.pl IMGT_Human_IGHV.fasta > ~/share/igblast/fasta/imgt_human_ig_v.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_v.fasta \
    -out ~/share/igblast/database/imgt_human_ig_v
# D segment database
edit_imgt_file.pl IMGT_Human_IGHD.fasta > ~/share/igblast/fasta/imgt_human_ig_d.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_d.fasta \
    -out ~/share/igblast/database/imgt_human_ig_d
# J segment database
edit_imgt_file.pl IMGT_Human_IGHJ.fasta > ~/share/igblast/fasta/imgt_human_ig_j.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_j.fasta \
    -out ~/share/igblast/database/imgt_human_ig_j

Once these databases are built for each segment they can be referenced when running IgBLAST.

Running IgBLAST

Change-O provides a simple wrapper script to run IgBLAST with the required options as the igblast subcommand of AssignGenes.py. This wrapper can be run as follows using the database built using the Immcantation scripts:

AssignGenes.py igblast -s HD13M.fasta -b ~/share/igblast \
    --organism human --loci ig --format blast

The optional --format blast argument defines the output format of IgBLAST. The default, blast, is the blocked tabular output provided by specifying the -outfmt '7 std qseq sseq btop' argument to IgBLAST. Specifying --format airr will output a tab-delimited file compliant with the AIRR Rearrangement schema defined by the AIRR Community. AIRR format support requires IgBLAST v1.9.0 or higher.

The -b ~/share/igblast argument specifies the path containing the database, internal_data, and optional_file directories required by IgBLAST. This option sets the IGDATA environment variable that controls where IgBLAST looks for internal database files. See the IgBLAST documentation for more details regarding the IGDATA environment variable.
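Before running the wrapper, it can be useful to confirm that the directory passed to -b has the expected layout. The following is a minimal Python sketch, assuming only the three subdirectory names mentioned above; the helper name is hypothetical and the check is not something AssignGenes.py is documented to perform.

```python
from pathlib import Path

# Subdirectories named in the text as required by IgBLAST
REQUIRED_SUBDIRS = ("database", "internal_data", "optional_file")

def missing_igdata_dirs(igdata):
    """Return the required subdirectories that are absent under igdata."""
    root = Path(igdata).expanduser()
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]

missing = missing_igdata_dirs("~/share/igblast")
print("missing:", ", ".join(missing) if missing else "none")
```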

See also

The AssignGenes.py IgBLAST wrapper provides limited functionality. For more control, IgBLAST should be run directly. The only strict requirement for compatibility with Change-O is that the output must either be an AIRR tab-delimited file (-outfmt 19) or a blast-style tabular output with the optional query sequence, subject sequence and BTOP fields (-outfmt '7 std qseq sseq btop'). An example of how to run IgBLAST directly is shown below:

export IGDATA=~/share/igblast
igblastn \
    -germline_db_V ~/share/igblast/database/imgt_human_ig_v \
    -germline_db_D ~/share/igblast/database/imgt_human_ig_d \
    -germline_db_J ~/share/igblast/database/imgt_human_ig_j \
    -auxiliary_data ~/share/igblast/optional_file/human_gl.aux \
    -domain_system imgt -ig_seqtype Ig -organism human \
    -outfmt '7 std qseq sseq btop' \
    -query HD13M.fasta \
    -out HD13M.fmt7

Processing the output of IgBLAST

Standalone IgBLAST blast-style tabular output is parsed by the igblast subcommand of MakeDb.py to generate the standardized tab-delimited database file on which all subsequent Change-O modules operate. In addition to the IgBLAST output (-i HD13M.fmt7), both the FASTA files input to IgBLAST (-s HD13M.fasta) and the IMGT-gapped reference sequences (-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta) must be provided to MakeDb.py:

MakeDb.py igblast -i HD13M.fmt7 -s HD13M.fasta \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta \
    --extended

The optional --extended argument adds extra columns to the output database containing IMGT-gapped CDR/FWR regions and alignment metrics.

Warning

The reference sequences you provide to MakeDb.py must contain IMGT-gapped V segment references, and these references must be the same sequences used to build the IgBLAST reference database. If your IgBLAST germlines are not IMGT-gapped and/or they are not identical to those provided to MakeDb.py, then sequences which were assigned missing germlines will fail the parsing operation and the junction (CDR3) sequences will not be correct.

Parsing IMGT output

Example data

We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. In addition to the example FASTA files, we have included the IMGT/HighV-QUEST results. The files can be downloaded from here:

Change-O Example Files

Reducing file size for submission to IMGT/HighV-QUEST

IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accommodate this limit, you can use the count subcommand of the pRESTO tool SplitSeq to divide your files into smaller pieces:

SplitSeq.py count -s file.fastq -n 500000 --fasta

The -n 500000 argument sets the maximum number of sequences in each file and the --fasta argument tells the tool to output a FASTA, rather than FASTQ, formatted file suitable for upload to IMGT/HighV-QUEST.
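The splitting logic itself is simple to picture. The Python sketch below chunks an ordered stream of records into parts holding at most a fixed number of records; it illustrates the idea only and is not how SplitSeq.py is implemented. The function name and toy identifiers are hypothetical.

```python
from itertools import islice

def chunk_records(records, max_count):
    """Yield lists of at most max_count records, preserving input order."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, max_count))
        if not chunk:
            return
        yield chunk

# Toy example: 7 sequence identifiers split into parts of at most 3
parts = list(chunk_records([f"seq{i}" for i in range(7)], 3))
print([len(part) for part in parts])
```

With a limit of 500000 in place of 3, every part would stay within the upload limit and only the last part would be smaller.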

See also

For additional details, see the corresponding example in the pRESTO documentation.

Processing the output of IMGT/HighV-QUEST

The output from IMGT/HighV-QUEST may be parsed via the imgt subcommand of MakeDb.py to generate the standardized tab-delimited database file on which all subsequent Change-O modules operate. Processing the IMGT output requires either the compressed output file (.zip or .txz) or an uncompressed folder containing the 1_Summary, 2_IMGT-gapped, 3_Nt-sequences and 6_Junction files (-i HD13M.txz). Additionally, it is recommended that you provide the FASTA file that was submitted to HighV-QUEST (-s HD13M.fasta), as this will allow MakeDb.py to correct the changes HighV-QUEST makes to the sequence identifier and add additional columns corresponding to any annotations generated by pRESTO:

MakeDb.py imgt -i HD13M.txz -s HD13M.fasta --extended

The optional --extended argument adds extra columns to the output database containing IMGT-gapped CDR/FWR regions and alignment metrics.

Merging processed IMGT/HighV-QUEST output

If you previously split files for submission to IMGT/HighV-QUEST, you can run each partition through MakeDb.py individually and merge the resulting output files using the merge subcommand of ParseDb.py:

MakeDb.py imgt -i part1.txz -s part1.fasta -o part1.tsv
MakeDb.py imgt -i part2.txz -s part2.fasta -o part2.tsv
ParseDb.py merge -d part1.tsv part2.tsv -o merged.tsv
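Conceptually, merging stacks the rows of the partitions under one header. The Python sketch below shows this with the standard csv module, taking the union of columns and leaving missing values empty. That union behaviour is an assumption of the sketch, not a statement about how ParseDb.py merge treats differing columns, and the toy tables are hypothetical.

```python
import csv
import io

def merge_tables(tables):
    """Concatenate rows from several TSV tables, taking the union of columns."""
    fields, rows = [], []
    for table in tables:
        reader = csv.DictReader(table, delimiter="\t")
        for name in reader.fieldnames or []:
            if name not in fields:
                fields.append(name)
        rows.extend(reader)
    return fields, rows

# Toy partitions held in memory; real use would pass open file handles
part1 = io.StringIO("sequence_id\tv_call\nA\tIGHV1-2*01\n")
part2 = io.StringIO("sequence_id\tv_call\tc_call\nB\tIGHV3-23*01\tIGHM\n")
fields, rows = merge_tables([part1, part2])

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                        restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue(), end="")
```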

Parsing 10X Genomics V(D)J data

Example data

10X Genomics provides an example data set of Ig V(D)J processed by the Cell Ranger pipeline, which is available for download from their Single Cell Immune Profiling support site.

Converting 10X V(D)J data into the AIRR Community standardized format

To process 10X V(D)J data, a combination of AssignGenes.py and MakeDb.py can be used to generate a TSV file compliant with the AIRR Community Rearrangement schema that incorporates annotation information provided by the Cell Ranger pipeline. The --10x filtered_contig_annotations.csv argument specifies the path of the contig annotations file generated by cellranger vdj, which can be found in the outs directory.

Generate AIRR Rearrangement data from the 10X V(D)J FASTA files using the steps below:

AssignGenes.py igblast -s filtered_contig.fasta -b ~/share/igblast \
   --organism human --loci ig --format blast
MakeDb.py igblast -i filtered_contig_igblast.fmt7 -s filtered_contig.fasta \
   -r IMGT_Human_*.fasta --10x filtered_contig_annotations.csv --extended

all_contig.fasta can be exchanged for filtered_contig.fasta, and all_contig_annotations.csv can be exchanged for filtered_contig_annotations.csv.

Warning

The resulting table overwrites the V, D and J gene assignments generated by Cell Ranger and uses those generated by IgBLAST or IMGT/HighV-QUEST instead.

See also

To process mouse data and/or TCR data, alter the --organism and --loci arguments to AssignGenes.py accordingly (e.g., --organism mouse, --loci tcr) and use the appropriate V, D and J IMGT reference databases (e.g., IMGT_Mouse_TR*.fasta).

See the IgBLAST usage guide for further details regarding the setup and use of IgBLAST with Change-O.

Identifying clones from B cells in AIRR formatted 10X V(D)J data

Splitting into separate light and heavy chain files

To group B cells into clones from AIRR Rearrangement data, the output from MakeDb.py must be parsed into a light chain file and a heavy chain file:

ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IGH" \
        --logic all --regex --outname heavy
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IG[LK]" \
        --logic all --regex --outname light

Assign clonal groups to the heavy chain data

The heavy chain file must then be clonally clustered separately. See Clustering sequences into clonal groups for how to use DefineClones.py to assign clonal cluster annotations to the IGH file.

Correct clonal groups based on light chain data

DefineClones.py currently does not support light chain cloning. However, cloning can be performed after heavy chain cloning using light_cluster.py provided on the Immcantation Bitbucket repository in the scripts directory:

light_cluster.py -d heavy_select-pass_clone-pass.tsv -e light_select-pass.tsv \
        -o 10X_clone-pass.tsv

Here, heavy_select-pass_clone-pass.tsv refers to the cloned heavy chain AIRR Rearrangement file, light_select-pass.tsv refers to the light chain file, and 10X_clone-pass.tsv is the resulting output file.

The algorithm will (1) remove cells associated with more than one heavy chain and (2) correct heavy chain clone definitions based on an analysis of the light chain partners associated with the heavy chain clone.

Note

By default, light_cluster.py expects the AIRR Rearrangement columns:

  • v_call

  • j_call

  • junction_length

  • umi_count

  • cell_id

  • clone_id

To process legacy Change-O formatted data add the --format changeo argument:

light_cluster.py -d heavy_select-pass_clone-pass.tab -e light_select-pass.tab \
    -o 10X_clone-pass.tab --format changeo

Which expects the following Change-O columns:

  • V_CALL

  • J_CALL

  • JUNCTION_LENGTH

  • UMICOUNT

  • CELL

  • CLONE

Filtering records

The ParseDb.py tool provides a basic set of operations for manipulating Change-O database files from the commandline, including removing or updating rows and columns.

Removing non-productive sequences

After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:

ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive

The first line above uses the select subcommand to output a single file labeled parse-select containing only records with the value of T (-u T) in the productive column (-f productive).

Alternatively, the second line above uses the split subcommand to output multiple files with each file containing records with one of the values found in the productive column (-f productive). This will generate two files labeled productive-T and productive-F.

Removing disagreements between the C-region primers and the reference alignment

If you have data that includes both heavy and light chains in the same library, the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not always agree with the isotype assignments from the C-region primers. In these cases, you can filter out such reads with the select subcommand of ParseDb.py. An example function call using an imaginary file db.tsv is provided below:

1
2
3
4
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
    --logic all --regex --outname heavy
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
    --logic all --regex --outname light

These commands will require that all of the v_call, j_call and c_call fields (-f v_call j_call c_call and --logic all) contain the string IGH (lines 1-2) or one of IGK or IGL (lines 3-4). The --regex argument allows for partial matching and interpretation of regular expressions. The output from these two commands is two files, one containing only heavy chains (heavy_parse-select.tsv) and one containing only light chains (light_parse-select.tsv).
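The filtering rule can be sketched in a few lines of Python. The function below keeps a row only when every listed field matches the regular expression, mimicking --logic all with --regex. It uses re.search for partial matching, which is an assumption about the matching semantics rather than a description of ParseDb.py internals, and the toy rows are hypothetical.

```python
import re

def select_rows(rows, fields, pattern, logic="all"):
    """Keep rows where all (or any) of the fields match the regular expression."""
    regex = re.compile(pattern)
    combine = all if logic == "all" else any
    return [row for row in rows
            if combine(regex.search(row.get(field, "")) for field in fields)]

rows = [
    {"sequence_id": "s1", "v_call": "IGHV3-23*01", "j_call": "IGHJ4*02", "c_call": "IGHM"},
    {"sequence_id": "s2", "v_call": "IGKV1-5*03", "j_call": "IGKJ1*01", "c_call": "IGKC"},
    # Heavy chain V and J calls with a light chain C call: a disagreement
    {"sequence_id": "s3", "v_call": "IGHV1-2*02", "j_call": "IGHJ6*02", "c_call": "IGKC"},
]
fields = ["v_call", "j_call", "c_call"]
heavy = select_rows(rows, fields, "IGH")
light = select_rows(rows, fields, "IG[LK]")
print([row["sequence_id"] for row in heavy])
print([row["sequence_id"] for row in light])
```

The third record appears in neither output, which is the intended effect: reads whose alignment and primer annotations disagree are dropped.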

Exporting records to FASTA files

You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.

Standard FASTA

The fasta subcommand allows you to export sequences and annotations to FASTA formatted files in the pRESTO annotation scheme:

ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
    --sf sequence_alignment --mf v_call duplicate_count

Where the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_alignment, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count.
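As an illustration of the resulting layout, the Python sketch below formats one database row as a FASTA record whose header follows the pRESTO pattern of an identifier followed by `|`-delimited `FIELD=value` annotations. Using the column names unchanged as annotation names is an assumption of this sketch; ConvertDb.py may rename or recase them. The row values are hypothetical.

```python
def presto_record(row, id_field, seq_field, meta_fields):
    """Format one row as a FASTA record with a pRESTO-style header."""
    header = [row[id_field]]
    header += [f"{name}={row[name]}" for name in meta_fields]
    return ">" + "|".join(header) + "\n" + row[seq_field] + "\n"

row = {"sequence_id": "s1", "sequence_alignment": "CAGGTGCAG",
       "v_call": "IGHV3-23*01", "duplicate_count": "4"}
record = presto_record(row, "sequence_id", "sequence_alignment",
                       ["v_call", "duplicate_count"])
print(record, end="")
```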

BASELINe FASTA

The baseline subcommand generates a FASTA derivative format required by the BASELINe web tool. Generating these files is similar to building standard FASTA files, but requires a few more options. An example function call using an imaginary file db.tsv is provided below:

ConvertDb.py baseline -d db.tsv --if sequence_id \
    --sf sequence_alignment --mf v_call duplicate_count \
    --cf clone_id --gf germline_alignment_d_mask

The additional arguments required by the baseline subcommand include the clonal grouping (--cf clone_id) and germline sequence (--gf germline_alignment_d_mask) columns added by the DefineClones and CreateGermlines tasks, respectively.

Note

The baseline subcommand requires the clone_id column to be sorted. DefineClones.py generates a sorted clone_id column by default. However, if you altered the order of the clone_id column at some point, you can re-sort the clonal assignments using the sort subcommand of ParseDb.py. An example function call using an imaginary file db.tsv is provided below:

ParseDb.py sort -d db.tsv -f clone_id

Which will sort records by the value in the clone_id column (-f clone_id).

Clustering sequences into clonal groups

Example data

We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. The files can be downloaded from here:

Change-O Example Files

The following examples use the HD13M_db-pass.tsv database file provided in the example bundle, which has already undergone the IMGT/IgBLAST parsing and filtering operations.

Determining a clustering threshold

Before running DefineClones.py, it is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones. The distToNearest function in the SHazaM R package calculates the distance between each sequence in the data and its nearest-neighbor. The resulting distribution should be bimodal, with the first mode representing sequences with clonal relatives in the dataset and the second mode representing singletons. The ideal threshold for separating clonal groups is the value that separates the two modes of this distribution and can be found using the findThreshold function in the SHazaM R package. The distToNearest function allows selection of all parameters that are available in DefineClones.py. Using the length normalization parameter ensures that mutations are weighted equally regardless of junction sequence length. The distance to nearest-neighbor distribution for the example data is shown below. The threshold is approximately 0.16 - indicated by the red dotted line.

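[Figure: distance to nearest-neighbor distribution for the example data, with the threshold marked by a red dotted line (_images/cloning_threshold.svg)]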
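The quantity being plotted can be sketched in plain Python. The example below computes, for each junction, the length-normalized Hamming distance to its nearest non-identical neighbor of the same length. It is a simplified stand-in for distToNearest, not its implementation: it ignores V and J gene grouping, and the toy junctions are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def dist_to_nearest(junctions):
    """Length-normalized Hamming distance from each junction to its nearest
    non-identical neighbor of the same length (None if there is none)."""
    result = []
    for i, seq in enumerate(junctions):
        dists = [hamming(seq, other) / len(seq)
                 for j, other in enumerate(junctions)
                 if i != j and len(other) == len(seq) and other != seq]
        result.append(min(dists) if dists else None)
    return result

junctions = ["TGTGCGAGAGAT", "TGTGCGAGAGAA", "TGTGCGTTTCCC", "TGTGCAAAA"]
print(dist_to_nearest(junctions))
```

Dividing by the junction length is what makes a single threshold such as 0.16 comparable across junctions of different lengths.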

See also

For additional details see the vignette on tuning clonal assignment thresholds.

Assigning clones

There are several parameter choices when grouping Ig sequences into B cell clones. The argument --act set accounts for ambiguous V gene and J gene calls when grouping similar sequences. The distance metric --model ham is nucleotide Hamming distance. Because the threshold was generated using length normalized distances, the --norm len argument is selected with the previously determined threshold --dist 0.16:

DefineClones.py -d HD13M_db-pass.tsv --act set --model ham \
    --norm len --dist 0.16

Reconstructing germline sequences

Example data

We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. The files can be downloaded from here:

Change-O Example Files

The following examples use the HD13M_db-pass.tsv AIRR Rearrangement file provided in the example bundle, which has already undergone the IMGT/IgBLAST parsing and filtering operations.

Adding germline sequences to the database

The CreateGermlines.py tool is used to reconstruct the germline V(D)J sequence, from which the Ig lineage and mutations can be inferred. In addition to the alignment information parsed by MakeDb.py to generate the initial database, CreateGermlines.py also requires the set of germline sequences that were used for the alignment passed to the -r argument. In the case of V-segment germlines, the reference sequences must be IMGT-gapped. Because the D-segment call for B cell receptor alignments is often low confidence, the default germline format (-g dmask) places Ns in the N/P and D-segments of the junction region rather than using the D-segment assigned during reference alignment; this can be modified to generate a complete germline (-g full) or a V-segment only germline (-g vonly) if you wish. The command below adds the germline sequence to the germline_alignment_d_mask column of the output database:

CreateGermlines.py -d HD13M_db-pass.tsv -g dmask \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta

Alternatively, if you have run the clonal assignment task prior to invoking CreateGermlines.py, then adding the --cloned argument is recommended, as this will generate a single germline of consensus length for each clone:

CreateGermlines.py -d HD13M_db-pass_clone-pass.tsv -g dmask --cloned \
    -r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
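The dmask germline format described above can be pictured with a small Python sketch: the V and J germline segments are kept, and the N/P and D regions between them are replaced with Ns. This is a simplified illustration under the assumption that the region lengths are already known; it omits IMGT gaps and any trimming, the function name is hypothetical, and it is not the CreateGermlines.py implementation.

```python
def dmask_germline(v_germline, j_germline, np1_length, d_length, np2_length):
    """Join V and J germline segments with the N/P and D regions masked as Ns."""
    masked = "N" * (np1_length + d_length + np2_length)
    return v_germline + masked + j_germline

# Toy segments and region lengths (hypothetical values)
germline = dmask_germline("CAGGTGCAG", "TGGGGCCAG", np1_length=2, d_length=5, np2_length=3)
print(germline)
```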

Important

The germline set passed to -r must contain the complete set of germlines used by the reference alignment software (IMGT/HighV-QUEST or IgBLAST). If alleles called by the aligner are missing from the reference set, they will not be successfully processed. Additionally, the V-segment reference set must contain IMGT-gapped sequences to properly reconstruct germlines, even if the reference alignment was performed on ungapped sequences.

Note

While MakeDb.py provides the ihmm subcommand to parse alignment output generated by iHMMuneAlign, there is insufficient information to successfully reconstruct germline sequences for all cases using CreateGermlines.py.

See also

The TIgGER R package provides tools for identifying novel polymorphisms and building a personalized germline database. To use the germline corrections provided by TIgGER you would replace the V-segment germline file with the one generated by genotypeFasta (-r IGHV_genotype.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta) and specify the genotyped V-segment column (--vf v_call_genotyped):

CreateGermlines.py -d genotyped.tsv -g dmask --vf v_call_genotyped \
    -r IGHV_genotype.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta

IgPhyML lineage tree analysis

IgPhyML is a program designed to build phylogenetic trees and test evolutionary hypotheses regarding B cell affinity maturation.

The biology of B cell somatic hypermutation (SHM) violates important assumptions in most standard phylogenetic substitution models; further, while most phylogenetics programs are designed to analyze single lineages, B cell repertoires typically contain thousands of lineages. IgPhyML addresses both of these issues by implementing substitution models that correct for the context-sensitive nature of SHM, and combines information from multiple lineages to give more precisely estimated repertoire-wide model parameter estimates.

An in-depth description of IgPhyML installation and usage can be found at the IgPhyML website.

Quick start

Once installed, IgPhyML can be run through BuildTrees by specifying the --igphyml option. IgPhyML is easiest to run through the Immcantation Docker image. If this is not possible, these instructions require Change-O 0.4.6 or higher, Alakazam 0.3.0 or higher, and IgPhyML to be installed, with the executable in your PATH variable.

The following commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what’s going on or make sure what you’re doing makes sense, please check out the IgPhyML website.

Build trees and estimate model parameters

Download the IgPhyML repository, move to the examples folder, and run BuildTrees:

# Clone IgPhyML repository to get example files
git clone https://bitbucket.org/kleinstein/igphyml

# Move to examples directory
cd igphyml/examples

# Run BuildTrees
BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse \
    --sample 3000 --igphyml --clean all --nproc 1

This command processes an AIRR-formatted dataset of BCR sequences that have been clonally clustered with germlines reconstructed. It then quickly builds trees using the GY94 model and, using these fixed topologies, estimates HLP19 model parameters. This can be sped up by increasing the --nproc option. Subsampling using the --sample option isn’t strictly necessary, but IgPhyML will run slowly when applied to large datasets. Here, the --collapse flag is used to collapse identical sequences. This is highly recommended because identical sequences slow down calculations without affecting likelihood values in IgPhyML.

Visualize results

The output file of the above command can be read using the readIgphyml function of Alakazam. After opening an R session in the examples subfolder, enter the following commands. Note that when using the Docker container, you’ll need to run dev.off() after plotting the tree to create a pdf plot in the examples directory:

library(alakazam)
library(igraph)

db = readIgphyml("ex_igphyml-pass.tab")

# Plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)

# Show HLP19 parameters
print(t(db$param[1,]))
CLONE         "REPERTOIRE"
NSEQ          "4"
NSITE         "107"
TREE_LENGTH   "0.286"
LHOOD         "-290.7928"
KAPPA_MLE     "2.266"
OMEGA_FWR_MLE "0.5284"
OMEGA_CDR_MLE "2.3324"
WRC_2_MLE     "4.8019"
GYW_0_MLE     "3.4464"
WA_1_MLE      "5.972"
TW_0_MLE      "0.8131"
SYC_2_MLE     "-0.99"
GRS_0_MLE     "0.2583"

Lineage tree of example clone.

To visualize a larger dataset with bigger trees, and bifurcating tree topologies, again open an R session in the examples directory:

library(alakazam)
library(ape)

db = readIgphyml("sample1_igphyml-pass.tab",format="phylo")

# Plot largest lineage tree
plot(ladderize(db$trees[[1]]),cex=0.7,no.margin=TRUE)

Phylo-formatted lineage tree of a larger B cell clone.

Generating MiAIRR compliant GenBank/TLS submissions

MiAIRR

The MiAIRR standard (minimal information about adaptive immune receptor repertoires) is a minimal reporting standard for experiments using sequencing-based technologies to study adaptive immune receptors (T and B cell receptors). The current version (1.0) of the standard was published in Rubelt et al, 2017 and accepted by the general assembly at the annual AIRR Community meeting in December 2017.

MiAIRR recommends submission of raw read data to the Sequence Read Archive (SRA) and submission of processed and annotated data to the Targeted Locus Study (TLS) section of GenBank.

This example will cover generation of files for submission to TLS starting from Change-O formatted data. For complete details of the required and optional elements of the TLS submission see the AIRR Standards documentation site.

Special attention should be paid to the REQUIRED elements. Note that GenBank expects there to be a CDS element that corresponds to the JUNCTION. If submitting single-cell heavy:light paired BCR data, GenBank expects separate files for the heavy, the kappa, and the lambda chains. Note that even though the kappa and the lambda chain sequences should be in separate files, their misc_feature comments should both read immunoglobulin light chain variable region, per AIRR standard requirements. In addition, every effort should be made to make sure that the values of the attributes for GenBank submission match those of the BioSample attributes. In particular, if the BioSample specifies a strain value (e.g. for mouse data), then a strain attribute MUST be included when preparing GenBank submission, and that value MUST match the BioSample value.

Example data

We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. The files can be downloaded from here:

Change-O Example Files

The following examples use the HD13M_db-pass.tsv database file and HD13M_template.sbt file provided in the example bundle. The database file has already undergone the IgBLAST annotation, parsing, and filtering operations.

Generating files for submission

Requirements

Important

C region annotations must use official gene symbols (IGHM, IGHG, etc) so that they are properly recognized by remote databases. If your annotations are not of this form, then they must be updated prior to generating the GenBank/TLS submission files. The following example shows how to use the update subcommand of ParseDb.py to rename the values in the c_call column. The files provided for this example already have correctly annotated c_call information, so the following is a hypothetical example (db.tsv) with existing annotations of the form IgM, IgG, etc:

ParseDb.py update -d db.tsv -f c_call \
    -u IgA IgD IgE IgG IgM \
    -t IGHA IGHD IGHE IGHG IGHM

Creating ASN files

ASN submission files are generated using the genbank subcommand of ConvertDb.py as follows:

ConvertDb.py genbank -d HD13M_db-pass.tsv \
    --product "immunoglobulin heavy chain" \
    --db "IMGT/GENE-DB" \
    --inf "IgBLAST:1.14.0" \
    --organism "Homo sapiens" \
    --tissue "Peripheral blood" \
    --cell-type "B cell" \
    --isolate HD13M \
    --cf c_call \
    --nf duplicate_count \
    --asis-id \
    --asn \
    --sbt HD13M_template.sbt \
    --outdir HD13M_TLS

The resulting output in the HD13M_TLS folder will include a number of files. The Sequin file HD13M_db-pass_genbank.sqn is the file that will be used for submission and the GenBank record file HD13M_db-pass_genbank.gbf is similar to what the submission will look like once it has been accepted by GenBank.

The command above manually specifies several required and optional annotations. Alternatively, sample information (organism, sex, isolate, tissue_type, cell_type) can be specified in a separate yaml file and provided via the -y argument. Additional harmonized BioSample attributes, which are not covered by the existing commandline arguments, may be provided in the yaml file. Note, the yaml file adds only sample features, so it cannot be used to specify source features (--product, --mol, --inf and --db arguments), parsing arguments, or run parameters (--label, --exec, etc). Features specified in the yaml file will override equivalent features specified through the corresponding commandline arguments.

Note

The example shown above automatically runs tbl2asn, because the --asn argument was specified. ConvertDb.py can be run without running tbl2asn, which will generate only the feature table (HD13M_db-pass_genbank.tbl) and fasta (HD13M_db-pass_genbank.fsa) files required to run tbl2asn manually via the command:

tbl2asn -p . -a s -V vb -t HD13M_template.sbt

Important

When running tbl2asn using the --asn argument to ConvertDb.py there is no internal validation that the records passing the filters in ConvertDb.py also pass the filters in tbl2asn. As such, it is recommended that the number of sequences in the output .sqn file be verified against the number of sequences in the .tbl and .fsa output files. From the command line, this can be achieved via:

grep -c iupacna *.sqn
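The same comparison can be scripted. The Python sketch below counts lines containing `iupacna` (as the grep command above does) and compares that with the number of FASTA header lines. The in-memory lists are hypothetical, heavily abbreviated stand-ins for the .sqn and .fsa files; in practice you would iterate over the opened files instead.

```python
def count_lines_with(lines, token):
    """Count lines containing token, mirroring grep -c."""
    return sum(1 for line in lines if token in line)

def count_fasta_records(lines):
    """Count FASTA records by their '>' header lines."""
    return sum(1 for line in lines if line.startswith(">"))

# Toy stand-ins for file contents (hypothetical)
sqn_lines = ['seq-data iupacna "CAGGTG"', "other content", 'seq-data iupacna "GAGGTG"']
fsa_lines = [">s1", "CAGGTG", ">s2", "GAGGTG"]

n_sqn = count_lines_with(sqn_lines, "iupacna")
n_fsa = count_fasta_records(fsa_lines)
print("counts agree" if n_sqn == n_fsa else f"mismatch: {n_sqn} vs {n_fsa}")
```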

Warning

There is a known issue with the --asn argument. In some environments, for reasons that are presently unknown, tbl2asn may fail to recognize the input fasta file and report an error stating Unable to read any FASTA records. Running tbl2asn manually should resolve the issue.

Submitting to GenBank/TLS using SequinMacroSend

After generating the .sqn files, you can submit them as MiAIRR compliant GenBank/TLS records using GenBank’s SequinMacroSend service.

When submitting, simply add the keyword AIRR to the subject line in the submission system and it will be routed accordingly.

Warning

Currently, the SequinMacroSend system cannot accept files over 512MB in size. For submissions over the size limit, you must split them into smaller files and note in the submission comments that they are a part of a split submission. Note, the .sqn files used for submission are usually about 30 times the size of the original tab-delimited Change-O file. See the split subcommand of ParseDb.py for one approach to logically dividing large submissions.
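A rough estimate of how many pieces a submission needs follows directly from these two numbers. The Python sketch below is back-of-the-envelope arithmetic only: the factor of 30 is the approximate ratio quoted above, and treating 512MB as 512 * 1024**2 bytes is an assumption.

```python
import math

SQN_EXPANSION = 30             # approximate .sqn size relative to the TSV (from the text)
LIMIT_BYTES = 512 * 1024**2    # 512MB limit, assuming 1MB = 1024**2 bytes

def parts_needed(tsv_bytes, expansion=SQN_EXPANSION, limit=LIMIT_BYTES):
    """Estimate how many pieces keep each .sqn file under the size limit."""
    return max(1, math.ceil(tsv_bytes * expansion / limit))

# A 100MB Change-O file expands to roughly 3000MB of .sqn output
print(parts_needed(100 * 1024**2))
```

Because the expansion factor is only approximate, it is safer to split into somewhat more parts than this estimate suggests.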

Clonal clustering methods

The DefineClones.py tool provides multiple different approaches to assigning Ig sequences into clonal groups.

Clustering by V-gene, J-gene and junction length

All methods provided by DefineClones.py first partition sequences based on common IGHV gene, IGHJ gene, and junction region length. These groups are then further subdivided into clonally related groups using one of the distance metrics on the junction region described below. The specified distance metric (--model) is used to perform hierarchical clustering under the specified linkage (--link). Clonal groups are defined by trimming the resulting dendrogram at the specified threshold (--dist).
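The two-stage procedure can be sketched in Python. The example below partitions records by V gene, J gene and junction length, then joins junctions within each partition whose length-normalized Hamming distance is at or below the threshold. Joining in this way is equivalent to cutting a single-linkage dendrogram at that height; the choice of single linkage, the inclusive comparison, and all names here are assumptions of the sketch rather than a description of the DefineClones.py code.

```python
from collections import defaultdict

def hamming_norm(a, b):
    """Hamming distance divided by sequence length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def define_clones(records, threshold):
    """Assign clone ids: partition by (V gene, J gene, junction length), then
    join junctions whose normalized distance is <= threshold (single linkage)."""
    groups = defaultdict(list)
    for index, (v_gene, j_gene, junction) in enumerate(records):
        groups[(v_gene, j_gene, len(junction))].append(index)

    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for members in groups.values():
        for pos, a in enumerate(members):
            for b in members[pos + 1:]:
                if hamming_norm(records[a][2], records[b][2]) <= threshold:
                    parent[find(a)] = find(b)

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(records))]

records = [
    ("IGHV3-23", "IGHJ4", "TGTGCGAGAGAT"),
    ("IGHV3-23", "IGHJ4", "TGTGCGAGAGAA"),
    ("IGHV3-23", "IGHJ4", "TGTGCGTTTCCC"),
    ("IGHV1-2", "IGHJ4", "TGTGCGAGAGAT"),
]
print(define_clones(records, threshold=0.16))
```

The first two records differ at one of twelve positions and share a clone; the fourth has an identical junction to the first but a different V gene, so it is never compared with it.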

Amino acid model

The aa distance model is the Hamming distance between junction amino acid sequences.

Hamming distance model

The ham distance model is the Hamming distance between junction nucleotide sequences.

Human and mouse 1-mer models

The hh_s1f and mk_rs1nf distance models are single nucleotide distance matrices derived from averaging and symmetrizing the human 5-mer targeting model in [YVanderHeidenU+13] and the mouse 5-mer targeting model in [CDiNiroVanderHeiden+16]. They are broadly similar to a transition/transversion model.

Human 1-mer substitution matrix:

Nucleotide  A     C     G     T     N
A           0     1.21  0.64  1.16  0
C           1.21  0     1.16  0.64  0
G           0.64  1.16  0     1.21  0
T           1.16  0.64  1.21  0     0
N           0     0     0     0     0

Mouse 1-mer substitution matrix:

Nucleotide  A     C     G     T     N
A           0     1.51  0.32  1.17  0
C           1.51  0     1.17  0.32  0
G           0.32  1.17  0     1.51  0
T           1.17  0.32  1.51  0     0
N           0     0     0     0     0
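To show how such a matrix turns into a junction distance, here is a small Python sketch. The matrix values are copied from the human and mouse 1-mer tables; summing the per-position values without any normalization is an assumption of this sketch, and the function and variable names are hypothetical.

```python
NUCLEOTIDES = "ACGTN"


def build_matrix(rows):
    """Turn a list of rows (in A, C, G, T, N order) into a nested lookup dict."""
    return {
        a: dict(zip(NUCLEOTIDES, row))
        for a, row in zip(NUCLEOTIDES, rows)
    }


HUMAN_1MER = build_matrix([
    [0, 1.21, 0.64, 1.16, 0],
    [1.21, 0, 1.16, 0.64, 0],
    [0.64, 1.16, 0, 1.21, 0],
    [1.16, 0.64, 1.21, 0, 0],
    [0, 0, 0, 0, 0],
])

MOUSE_1MER = build_matrix([
    [0, 1.51, 0.32, 1.17, 0],
    [1.51, 0, 1.17, 0.32, 0],
    [0.32, 1.17, 0, 1.51, 0],
    [1.17, 0.32, 1.51, 0, 0],
    [0, 0, 0, 0, 0],
])


def matrix_distance(seq1, seq2, matrix):
    """Sum the substitution distances over aligned positions of two junctions."""
    if len(seq1) != len(seq2):
        raise ValueError("junctions must have equal length")
    return sum(matrix[a][b] for a, b in zip(seq1, seq2))


transition = matrix_distance("TGTGCA", "TGTACA", HUMAN_1MER)    # one G/A difference
transversion = matrix_distance("TGTGCA", "TGTCCA", HUMAN_1MER)  # one G/C difference
masked = matrix_distance("TGTGCA", "TGTNCA", HUMAN_1MER)        # N contributes 0
print(transition, transversion, masked)
```

The transition (G/A) costs less than the transversion (G/C), which reflects the transition/transversion-like structure of both matrices, and positions involving N add nothing to the distance.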

Human and mouse 5-mer models

The hh_s5f and mk_rs5nf distance models are based on the human 5-mer targeting model in [YVanderHeidenU+13] and the mouse 5-mer targeting model in [CDiNiroVanderHeiden+16], respectively. The targeting matrix T has 5-mers across the columns and the nucleotide to which the center base of the 5-mer mutates as the rows. The value T[i,j] for a given nucleotide i and 5-mer j is the product of the likelihood of that 5-mer being mutated, mut(j), and the likelihood of the center base mutating to the given nucleotide, sub(j\rightarrow i). This matrix of probabilities is converted into a distance matrix D via the following steps:

  1. D = -log10(T)

  2. D is then divided by the mean of values in D

  3. All distances in D that are infinite (probability of zero), distances on the diagonal (no change), and NA distances are set to 0.

Since the distance matrix D is not symmetric, the --sym argument can be specified to calculate either the average (avg) or minimum (min) of D(j\rightarrow i) and D(i\rightarrow j). The distances defined by D for each nucleotide difference are summed for all 5-mers in the junction to yield the distance between the two junction sequences.
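The three conversion steps and the two --sym options can be sketched in Python as follows. This is an illustration rather than Change-O's implementation: the toy targeting values are invented, the function names are hypothetical, and taking the mean over finite values only is an assumption, since the text does not say how infinite or NA entries enter the mean.

```python
import math


def targeting_to_distance(T):
    """Convert targeting probabilities to distances.

    T maps (target_nucleotide, fivemer) -> probability, with None standing in for NA.
    """
    # Step 1: D = -log10(T). A probability of zero gives an infinite distance.
    D = {}
    for key, p in T.items():
        if p is None:
            D[key] = math.nan
        elif p == 0:
            D[key] = math.inf
        else:
            D[key] = -math.log10(p)

    # Step 2: divide by the mean of D (assumption: finite values only).
    finite = [d for d in D.values() if math.isfinite(d)]
    mean = sum(finite) / len(finite)

    # Step 3: infinite, NA, and no-change (diagonal) entries are set to 0.
    result = {}
    for (nuc, fivemer), d in D.items():
        no_change = fivemer[2] == nuc
        if no_change or not math.isfinite(d):
            result[(nuc, fivemer)] = 0.0
        else:
            result[(nuc, fivemer)] = d / mean
    return result


def symmetrize(d_forward, d_reverse, sym="avg"):
    """Combine the two directional distances for one nucleotide difference."""
    if sym == "avg":
        return (d_forward + d_reverse) / 2
    if sym == "min":
        return min(d_forward, d_reverse)
    raise ValueError("sym must be 'avg' or 'min'")


# Invented toy values for a single 5-mer whose center base is C.
toy_T = {
    ("A", "AACAA"): 0.1,
    ("C", "AACAA"): None,   # no change, treated as NA here
    ("G", "AACAA"): 0.01,
    ("T", "AACAA"): 0.0,    # probability of zero
}
toy_D = targeting_to_distance(toy_T)
print({k: round(v, 3) for k, v in toy_D.items()})
```

A full junction distance would then sum the symmetrized values over every position where the two junctions differ, using the 5-mer centered on that position.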

[CDiNiroVanderHeiden+16]

Ang Cui, Roberto Di Niro, Jason A. Vander Heiden, Adrian W. Briggs, Kris Adams, Tamara Gilbert, Kevin C. O’Connor, Francois Vigneault, Mark J. Shlomchik, and Steven H. Kleinstein. A Model of Somatic Hypermutation Targeting in Mice Based on High-Throughput Ig Sequencing Data. The Journal of Immunology, 197(9):3566–3574, November 2016. URL: http://www.jimmunol.org/content/197/9/3566.abstract, http://www.jimmunol.org/lookup/doi/10.4049/jimmunol.1502263, doi:10.4049/jimmunol.1502263.

[YVanderHeidenU+13]

Gur Yaari, Jason A. Vander Heiden, Mohamed Uduman, Daniel Gadala-Maria, Namita Gupta, Joel N. H. Stern, Kevin C. O’Connor, David A. Hafler, Uri Laserson, Francois Vigneault, and Steven H. Kleinstein. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Frontiers in Immunology, 4:358, January 2013. URL: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3828525&tool=pmcentrez&rendertype=abstract, doi:10.3389/fimmu.2013.00358.

Reconstruction of germline sequences from alignment data

The CreateGermlines.py tool takes the individual segment alignment information for each sequence and reconstructs a full length germline sequence from the V(D)J reference sequences.

To reconstruct the germline, CreateGermlines.py trims the V(D)J germline segments and N/P regions by alignment length and concatenates them. It places Ns in the untemplated N/P regions and optionally masks the D segment with Ns (-g dmask). CreateGermlines.py also detects and corrects cases where the alignment tool assigned the same part of the input sequence to two different regions (e.g., assigning the same nucleotides to both N/P and J).
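The trim-and-concatenate idea can be sketched in Python as below. This is a simplified illustration, not the CreateGermlines.py code: the function and parameter names are hypothetical, the reference sequences are invented, and the sketch ignores alignment gaps and the overlap correction described above.

```python
def build_germline(v_ref, d_ref, j_ref, v_len, np1_len,
                   d_start, d_len, np2_len, j_start, j_len, d_mask=False):
    """Assemble a germline from reference segments and alignment lengths.

    All arguments are hypothetical stand-ins for values an aligner would report:
    the number of aligned V bases, the lengths of the two N/P regions, and the
    start offset and aligned length within the D and J reference sequences.
    """
    v_seg = v_ref[:v_len]
    d_seg = d_ref[d_start:d_start + d_len]
    j_seg = j_ref[j_start:j_start + j_len]
    if d_mask:
        # Replace the D segment with Ns of the same length.
        d_seg = "N" * len(d_seg)
    # Untemplated N/P regions are always represented by Ns.
    return v_seg + "N" * np1_len + d_seg + "N" * np2_len + j_seg


full = build_germline("AAAACCCC", "GGGTTT", "TTGACTAC",
                      v_len=6, np1_len=2, d_start=1, d_len=3,
                      np2_len=1, j_start=2, j_len=4)
masked = build_germline("AAAACCCC", "GGGTTT", "TTGACTAC",
                        v_len=6, np1_len=2, d_start=1, d_len=3,
                        np2_len=1, j_start=2, j_len=4, d_mask=True)
print(full)
print(masked)
```

With the D segment masked, the whole stretch between the V and J segments consists of Ns, while the overall germline length is unchanged.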

At the end of the germline reconstruction process, each sequence has been assigned a germline specific to the sequence.

When the --cloned flag is specified, the process is the same except that it is clone-specific and creates one germline per clone. CreateGermlines.py first selects a single V allele and a single J allele to use as the germline from all the annotations assigned within each clone. The selection is made by simple majority rule over all the allele calls in the clone. After the germline reconstruction process, all sequences belonging to the same clone have been assigned the same germline.
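The majority-rule selection can be sketched in Python with a counter per clone. The field names clone_id, v_call, and j_call are assumptions of this sketch, each call is treated as a single allele string, and ties go to the allele seen first, since the text does not specify how ties are resolved.

```python
from collections import Counter, defaultdict


def most_common_call(calls):
    """Return the most frequent allele call; ties go to the first one seen."""
    return Counter(calls).most_common(1)[0][0]


def clone_germline_alleles(records):
    """Pick one V and one J allele per clone by majority rule."""
    by_clone = defaultdict(list)
    for rec in records:
        by_clone[rec["clone_id"]].append(rec)
    return {
        clone: (
            most_common_call([r["v_call"] for r in members]),
            most_common_call([r["j_call"] for r in members]),
        )
        for clone, members in by_clone.items()
    }


example_records = [
    {"clone_id": 1, "v_call": "IGHV1-2*02", "j_call": "IGHJ4*02"},
    {"clone_id": 1, "v_call": "IGHV1-2*02", "j_call": "IGHJ4*02"},
    {"clone_id": 1, "v_call": "IGHV1-2*04", "j_call": "IGHJ4*02"},
    {"clone_id": 2, "v_call": "IGHV3-23*01", "j_call": "IGHJ6*02"},
]
selected = clone_germline_alleles(example_records)
print(selected)
```

Every sequence in a clone would then be given the germline built from that clone's selected V and J alleles.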

Contact

If you have questions you can email the Immcantation Group.

If you’ve discovered a bug or have a feature request, you can create an issue on Bitbucket using the Issue Tracker.

Citation

To cite Change-O in publications please use:

Gupta NT*, Vander Heiden JA*, Uduman M, Gadala-Maria D, Yaari G, Kleinstein SH. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 2015; doi: 10.1093/bioinformatics/btv359

License

This work is licensed under the GNU Affero General Public License Version 3 (AGPL-3).