Attention
As of v1.0.0 the default file format will be the
Adaptive Immune Receptor Repertoire (AIRR) Community Rearrangement standard.
The legacy Change-O format will still be supported through the --format changeo
argument. See the Release Notes for more details.
Change-O - Repertoire clonal assignment toolkit¶
Change-O is a collection of tools for processing the output of V(D)J alignment tools, assigning clonal clusters to immunoglobulin (Ig) sequences, and reconstructing germline sequences.
Dramatic improvements in high-throughput sequencing technologies now enable large-scale characterization of Ig repertoires, defined as the collection of trans-membrane antigen-receptor proteins located on the surface of B cells and T cells. Change-O is a suite of utilities to facilitate advanced analysis of Ig and TCR sequences following germline segment assignment. Change-O handles output from IMGT/HighV-QUEST and IgBLAST, and provides a wide variety of clustering methods for assigning clonal groups to Ig sequences. Record sorting, grouping, and various database manipulation operations are also included.
Overview¶
Change-O performs analyses of lymphocyte receptor sequences following alignment against the germline reference. It includes tools for standardizing the output of alignment software, clonal assignment, germline reconstruction, and basic database manipulation. Change-O was designed to be simple to use, but it does require some familiarity with commandline applications. To maximize flexibility, Change-O employs a simple tab-delimited database format with standardized column names, allowing easy use of Change-O output with external environments and interoperability with the Alakazam, SHazaM, and TIgGER R packages. A brief description of each tool is shown in the table below.
Tool | Subcommand | Description |
---|---|---|
AlignRecords | | Multiple aligns sequences in a database |
AlignRecords | across | Aligns sequence columns within groups and across rows |
AlignRecords | block | Aligns sequence groups across both columns and rows |
AlignRecords | within | Aligns sequence fields within rows |
AssignGenes | igblast | Runs IgBLAST on nucleotide sequences |
AssignGenes | igblast-aa | Runs IgBLAST on amino acid sequences |
BuildTrees | | Generates IgPhyML input files |
ConvertDb | | Converts tab delimited database files |
ConvertDb | airr | Converts input to an AIRR TSV file |
ConvertDb | baseline | Creates a special BASELINe formatted fasta file from a database |
ConvertDb | changeo | Converts input into a Change-O TSV file |
ConvertDb | fasta | Creates a fasta file from database records |
ConvertDb | genbank | Creates fasta and feature table files for input to tbl2asn |
CreateGermlines | | Reconstructs germline sequences from alignment data |
DefineClones | | Assigns clones by V gene, J gene and junction distance |
MakeDb | | Creates standardized databases from germline alignment results |
MakeDb | igblast | Parses IgBLAST nucleotide alignments |
MakeDb | igblast-aa | Parses IgBLAST amino acid alignments |
MakeDb | ihmm | Parses iHMMune-Align output |
MakeDb | imgt | Parses IMGT/HighV-QUEST output |
ParseDb | | Parses annotations in tab delimited database files |
ParseDb | add | Adds fields to the database |
ParseDb | delete | Deletes specific records |
ParseDb | drop | Deletes entire fields |
ParseDb | index | Adds a numeric index field |
ParseDb | merge | Merges files |
ParseDb | rename | Renames fields |
ParseDb | select | Selects specific records |
ParseDb | sort | Sorts records by a field |
ParseDb | split | Splits database files by field values |
ParseDb | update | Updates field and value pairs |
Download¶
The latest stable release of Change-O may be downloaded from PyPI or Bitbucket.
Development versions and source code are available on Bitbucket.
Installation¶
The simplest way to install the latest stable release of Change-O is via pip:
> pip3 install changeo --user
The current development build can be installed using pip and git in similar fashion:
> pip3 install git+https://bitbucket.org/kleinstein/changeo@master --user
If you currently have a development version installed, then you will likely
need to add the arguments --upgrade --no-deps --force-reinstall to the pip3 command.
Requirements¶
The minimum dependencies for installation are:
Some tools wrap external applications that are not required for installation. Those tools require minimum versions of:
AlignRecords requires MUSCLE 3.8
ConvertDb-genbank requires tbl2asn
AssignGenes requires IgBLAST 1.6, but version 1.11 or higher is strongly recommended.
BuildTrees requires IgPhyML 1.0.5
Linux¶
The simplest way to install all Python dependencies is to install the full SciPy stack using the instructions, then install Biopython according to its instructions.
Install presto 0.6.2 or greater.
Download the Change-O bundle and run:
> pip3 install changeo-x.y.z.tar.gz --user
Mac OS X¶
Install Xcode. Available from the Apple store or developer downloads.
Older versions of Mac OS X will require you to install XQuartz 2.7.5, available from the XQuartz project.
Install Homebrew following the installation and post-installation instructions.
Install Python 3.4.0+ and set the path to the python3 executable:
> brew install python3
> echo 'export PATH=/usr/local/bin:$PATH' >> ~/.profile
Exit and reopen the terminal application so the PATH setting takes effect.
You may, or may not, need to install gfortran (required for SciPy). Try without first, as this can take an hour to install and is not needed on newer releases. If you do need gfortran to install SciPy, you can install it using Homebrew:
> brew install gfortran
If the above fails run this instead:
> brew install --env=std gfortran
Install NumPy, SciPy, pandas and Biopython using the Python package manager:
> pip3 install numpy scipy pandas biopython
Install presto 0.6.2 or greater.
Download the Change-O bundle, open a terminal window, change directories to the download folder, and run:
> pip3 install changeo-x.y.z.tar.gz
Windows¶
Install Python 3.4.0+ from Python, selecting both the options ‘pip’ and ‘Add python.exe to Path’.
Install NumPy, SciPy, pandas and Biopython using the packages available from the Unofficial Windows binary collection.
Install presto 0.6.2 or greater.
Download the Change-O bundle, open a Command Prompt, change directories to the download folder, and run:
> pip install changeo-x.y.z.tar.gz
For a default installation of Python 3.4, the Change-O scripts will be installed into C:\Python34\Scripts and should be directly executable from the Command Prompt. If this is not the case, then add the directories to your path as described in the next step.
Add both the C:\Python34 and C:\Python34\Scripts directories to your %Path%. On both Windows 7 and Windows 10, the %Path% setting is located under Control Panel -> System and Security -> System -> Advanced System Settings -> Environment variables -> System variables -> Path.
If you have trouble with the .py file associations, try adding .PY to your PATHEXT environment variable. Also, try opening a Command Prompt as Administrator and running:
> assoc .py=Python.File
> ftype Python.File="C:\Python34\python.exe" "%1" %*
Data Standards¶
All Change-O tools support both the legacy Change-O standard and the new Adaptive Immune Receptor Repertoire (AIRR) standard developed by the AIRR Community (AIRR-C).
AIRR-C Format¶
As of v1.0.0, the default file format is the AIRR-C format as described by the
Rearrangement Schema (v1.2). The AIRR-C Rearrangement format is a tab-delimited
file format (.tsv) that defines the required and optional annotations for
rearranged adaptive immune receptor sequences.
To learn more about this format, the valid field names and their expected values, visit the AIRR-C Rearrangement Schema documentation site.
An API for the input and output of the AIRR-C format is provided in the
AIRR Python package.
Wrappers for this package are provided in the API as changeo.IO.AIRRReader and changeo.IO.AIRRWriter.
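Because the AIRR-C Rearrangement format is plain tab-delimited text, its layout can be illustrated with the Python standard library alone. The following is a minimal sketch, not the AIRR Python package or the changeo.IO wrappers: the in-memory file contents, the helper name read_rearrangements, and the assumption that logical values are written as T/F are all illustrative.

```python
import csv
import io

# Hypothetical two-record AIRR-C Rearrangement file, held in memory for illustration.
AIRR_TSV = (
    "sequence_id\tv_call\tj_call\tproductive\tjunction_length\n"
    "seq1\tIGHV1-2*01\tIGHJ4*02\tT\t48\n"
    "seq2\tIGHV3-23*01,IGHV3-23*04\tIGHJ6*01\tF\t39\n"
)

def read_rearrangements(handle):
    """Yield one dict per record from a tab-delimited AIRR-C file."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Assumption: logical fields are encoded as T/F; anything else becomes None.
        row["productive"] = {"T": True, "F": False}.get(row["productive"])
        # Empty strings are treated as missing values.
        length = row["junction_length"]
        row["junction_length"] = int(length) if length else None
        yield row

records = list(read_rearrangements(io.StringIO(AIRR_TSV)))
productive_ids = [r["sequence_id"] for r in records if r["productive"]]
print(productive_ids)
```

For real data, the AIRR Python package or the wrappers named above are the better choice, since they also handle schema validation and the remaining field types.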
Change-O Format¶
The legacy Change-O standard is a tab-delimited file format (.tab) with a set
of predefined column names. The standardized column names used by the Change-O format
are shown in the table below. Most tools do not require every column. The columns
required by and added by each individual tool are described in the
commandline usage documentation. If a column contains multiple
entries, such as ambiguous V gene assignments, these nested entries are delimited
by commas. The ordering of the columns does not matter.
An API for the input and output of the Change-O format is provided in
changeo.IO.ChangeoReader and changeo.IO.ChangeoWriter, respectively.
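As a rough illustration of the two conventions described above (column order does not matter, and nested entries are comma-delimited), the sketch below reads a small Change-O style table with the standard csv module. The sample records and the helper split_calls are hypothetical, and this is not how changeo.IO.ChangeoReader is implemented.

```python
import csv
import io

# Hypothetical Change-O formatted records; the columns are deliberately out of the usual order.
CHANGEO_TAB = (
    "V_CALL\tSEQUENCE_ID\tJUNCTION_LENGTH\n"
    "IGHV3-23*01,IGHV3-23*04\tseq1\t39\n"
    "IGHV1-2*01\tseq2\t48\n"
)

def split_calls(value):
    """Split a comma-delimited field into a list, dropping empty entries."""
    return [entry for entry in value.split(",") if entry]

# DictReader looks fields up by column name, so the column order is irrelevant.
reader = csv.DictReader(io.StringIO(CHANGEO_TAB), delimiter="\t")
v_calls = {row["SEQUENCE_ID"]: split_calls(row["V_CALL"]) for row in reader}
print(v_calls)
```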
Change-O Field | AIRR Field | Type | Description |
---|---|---|---|
Standard Annotations | | | |
SEQUENCE_ID | sequence_id | string | Unique sequence identifier |
SEQUENCE_INPUT | sequence | string | Input nucleotide sequence |
SEQUENCE_VDJ | | string | V(D)J nucleotide sequence |
SEQUENCE_IMGT | sequence_alignment | string | IMGT-numbered V(D)J nucleotide sequence |
FUNCTIONAL | productive | logical | T: V(D)J sequence is predicted to be productive |
IN_FRAME | vj_in_frame | logical | T: junction region nucleotide sequence is in-frame |
STOP | stop_codon | logical | T: stop codon is present in V(D)J nucleotide sequence |
MUTATED_INVARIANT | | logical | T: invariant amino acids properly encoded by V(D)J sequence |
INDELS | | logical | T: V(D)J nucleotide sequence contains insertions and/or deletions |
LOCUS | locus | string | Locus of the receptor |
V_CALL | v_call | string | V allele assignment(s) |
D_CALL | d_call | string | D allele assignment(s) |
J_CALL | j_call | string | J allele assignment(s) |
C_CALL | c_call | string | C-region assignment |
V_SEQ_START | v_sequence_start | integer | Position of first V nucleotide in SEQUENCE_INPUT |
V_SEQ_LENGTH | | integer | Number of V nucleotides in SEQUENCE_INPUT |
V_GERM_START_IMGT | v_germline_start | integer | Position of V_SEQ_START in IMGT-numbered germline V(D)J sequence |
V_GERM_LENGTH_IMGT | | integer | Length of the IMGT-numbered germline V alignment |
NP1_LENGTH | np1_length | integer | Number of nucleotides between V and D segments |
D_SEQ_START | d_sequence_start | integer | Position of first D nucleotide in SEQUENCE_INPUT |
D_SEQ_LENGTH | | integer | Number of D nucleotides in SEQUENCE_INPUT |
D_GERM_START | d_germline_start | integer | Position of D_SEQ_START in germline V(D)J nucleotide sequence |
D_GERM_LENGTH | | integer | Length of the germline D alignment |
NP2_LENGTH | np2_length | integer | Number of nucleotides between D and J segments |
J_SEQ_START | j_sequence_start | integer | Position of first J nucleotide in SEQUENCE_INPUT |
J_SEQ_LENGTH | j_sequence_end | integer | Number of J nucleotides in SEQUENCE_INPUT |
J_GERM_START | j_germline_start | integer | Position of J_SEQ_START in germline V(D)J nucleotide sequence |
J_GERM_LENGTH | | integer | Length of the germline J alignment |
JUNCTION_LENGTH | junction_length | integer | Number of junction nucleotides in SEQUENCE_VDJ |
JUNCTION | junction | string | Junction region nucleotide sequence |
CELL | cell_id | string | Cell identifier |
CLONE | clone_id | string | Clonal grouping identifier |
Region Annotations | | | |
FWR1_IMGT | fwr1 | string | IMGT-numbered FWR1 nucleotide sequence |
FWR2_IMGT | fwr2 | string | IMGT-numbered FWR2 nucleotide sequence |
FWR3_IMGT | fwr3 | string | IMGT-numbered FWR3 nucleotide sequence |
FWR4_IMGT | fwr4 | string | IMGT-numbered FWR4 nucleotide sequence |
CDR1_IMGT | cdr1 | string | IMGT-numbered CDR1 nucleotide sequence |
CDR2_IMGT | cdr2 | string | IMGT-numbered CDR2 nucleotide sequence |
CDR3_IMGT | cdr3 | string | IMGT-numbered CDR3 nucleotide sequence |
N1_LENGTH | n1_length | integer | Untemplated nucleotides 5’ of the D segment |
N2_LENGTH | n2_length | integer | Untemplated nucleotides 3’ of the D segment |
P3V_LENGTH | p3v_length | integer | Palindromic nucleotides 3’ of the V segment |
P5D_LENGTH | p5d_length | integer | Palindromic nucleotides 5’ of the D segment |
P3D_LENGTH | p3d_length | integer | Palindromic nucleotides 3’ of the D segment |
P5J_LENGTH | p5j_length | integer | Palindromic nucleotides 5’ of the J segment |
D_FRAME | | integer | D segment reading frame |
Germline Annotations | | | |
GERMLINE_VDJ | | string | Full unaligned germline V(D)J nucleotide sequence |
GERMLINE_VDJ_V_REGION | | string | Unaligned germline V segment nucleotide sequence |
GERMLINE_VDJ_D_MASK | | string | Unaligned germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions |
GERMLINE_IMGT | germline_alignment | string | Full IMGT-numbered germline V(D)J nucleotide sequence |
GERMLINE_IMGT_V_REGION | | string | IMGT-numbered germline V segment nucleotide sequence |
GERMLINE_IMGT_D_MASK | | string | IMGT-numbered germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions |
GERMLINE_V_CALL | | string | Clonal consensus germline V assignment |
GERMLINE_D_CALL | | string | Clonal consensus germline D assignment |
GERMLINE_J_CALL | | string | Clonal consensus germline J assignment |
GERMLINE_REGIONS | | string | String showing germline segment positions encoded as V, D, J, N, and P characters |
Alignment Annotations | | | |
V_SCORE | v_score | float | Alignment score for the V |
V_IDENTITY | v_identity | float | Alignment identity for the V |
V_EVALUE | v_support | float | E-value for the alignment of the V |
V_CIGAR | v_cigar | string | CIGAR string for the alignment of the V |
D_SCORE | d_score | float | Alignment score for the D |
D_IDENTITY | d_identity | float | Alignment identity for the D |
D_EVALUE | d_support | float | E-value for the alignment of the D |
D_CIGAR | d_cigar | string | CIGAR string for the alignment of the D |
J_SCORE | j_score | float | Alignment score for the J |
J_IDENTITY | j_identity | float | Alignment identity for the J |
J_EVALUE | j_support | float | E-value for the alignment of the J |
J_CIGAR | j_cigar | string | CIGAR string for the alignment of the J |
VDJ_SCORE | | float | Alignment score for the V(D)J |
TIgGER Annotations | | | |
V_CALL_GENOTYPED | | string | Adjusted V allele assignment(s) following TIgGER genotype inference |
Preprocessing Annotations | | | |
PRCONS | | string | pRESTO UMI consensus primer |
PRIMER | | string | pRESTO primers list |
CONSCOUNT | consensus_count | integer | Number of reads contributing to the UMI consensus sequence |
DUPCOUNT | duplicate_count | integer | Copy number of the sequence |
UMICOUNT | | integer | UMI count for the sequence |
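The name pairs in the table can be treated as a simple lookup. The sketch below renames the keys of a single record using a subset of those pairs; the dictionary name CHANGEO_TO_AIRR, the helper rename_fields, and the sample record are hypothetical. It only renames headers, so it should not be read as a substitute for the airr subcommand of ConvertDb, which is the supported conversion route and may also need to adjust values such as coordinates.

```python
# Subset of the Change-O to AIRR field name pairs listed in the table above.
CHANGEO_TO_AIRR = {
    "SEQUENCE_ID": "sequence_id",
    "SEQUENCE_INPUT": "sequence",
    "FUNCTIONAL": "productive",
    "V_CALL": "v_call",
    "D_CALL": "d_call",
    "J_CALL": "j_call",
    "JUNCTION": "junction",
    "JUNCTION_LENGTH": "junction_length",
    "CLONE": "clone_id",
    "DUPCOUNT": "duplicate_count",
}

def rename_fields(record, mapping=CHANGEO_TO_AIRR):
    """Return a copy of record with known Change-O keys renamed; other keys are kept as-is."""
    return {mapping.get(key, key): value for key, value in record.items()}

# Hypothetical record; PRCONS has no AIRR counterpart in the table, so it is left unchanged.
row = {"SEQUENCE_ID": "seq1", "V_CALL": "IGHV1-2*01", "CLONE": "7", "PRCONS": "IGHM"}
renamed = rename_fields(row)
print(renamed)
```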
Release Notes¶
Version 1.3.0: December 11, 2022¶
Various updates to internals and error messages.
AssignGenes:
Added support for .fastq files. If a .fastq file is input, then a corresponding .fasta file will be created in the output directory.
Added support for C region alignment calls provided by IgBLAST v1.18+.
MakeDb:
Added support for C region alignment calls provided by IgBLAST v1.18+.
Version 1.2.0: October 29, 2021¶
Updated dependencies to presto >= v0.7.0.
AssignGenes:
Fixed reporting of IgBLAST output counts when specifying --format airr.
BuildTrees:
Added support for specifying fixed omega and hotness parameters at the commandline.
CreateGermlines:
Will now use the first allele in the reference database when duplicate allele names are provided. Only appears to affect mouse BCR light chains and TCR alleles in the IMGT database when the same allele name differs by strain.
MakeDb:
Added support for changes in how IMGT/HighV-QUEST v1.8.4 handles special characters in sequence identifiers.
Fixed the imgt subcommand incorrectly allowing execution without specifying the IMGT/HighV-QUEST output file at the commandline.
ParseDb:
Added reporting of output file sizes to the console log of the split subcommand.
Version 1.1.0: June 21, 2021¶
Fixed gene parsing for IMGT temporary designation nomenclature.
Updated dependencies to biopython >= v1.77, airr >= v1.3.1, PyYAML>=5.1.
MakeDb:
Added the --imgt-id-len argument to accommodate changes introduced in how IMGT/HighV-QUEST truncates sequence identifiers as of v1.8.3 (May 7, 2021). The header lines in the fasta files are now truncated to 49 characters. In IMGT/HighV-QUEST versions older than v1.8.3, they were truncated to 50 characters. The default value of --imgt-id-len is 49. Users should specify --imgt-id-len 50 to analyze IMGT results generated with IMGT/HighV-QUEST versions older than v1.8.3.
Added the --infer-junction argument to MakeDb igblast, to enable the inference of the junction sequence when not reported by IgBLAST. Should be used with data from IgBLAST v1.6.0 or older, which predate IgBLAST's IMGT-CDR3 inference.
Version 1.0.2: January 18, 2021¶
AlignRecords:
Fixed a bug that caused the program to exit when encountering missing sequence data. It will now fail the row or group with missing data and continue.
MakeDb:
Added support for IgBLAST v1.17.0.
ParseDb:
Added a relevant error message when an input field is missing from the data.
Version 1.0.1: October 13, 2020¶
Updated to support Biopython v1.78.
Increased the biopython dependency to v1.71.
Increased the presto dependency to 0.6.2.
Version 1.0.0: May 6, 2020¶
The default output in all tools is now the AIRR Rearrangement standard (--format airr). Support for the legacy Change-O data standard is still provided through the --format changeo argument to the tools.
License changed to AGPL-3.
AssignGenes:
Added the igblast-aa subcommand to run igblastp on amino acid input.
BuildTrees:
Adjusted RECORDS to indicate all sequences in input file. INITIAL_FILTER now shows sequence count after initial min_seq filtering.
Added option to skip codon masking: --nmask.
Mask the characters : , ) and ( in IDs and metadata with -.
Can obtain germline from GERMLINE_IMGT if GERMLINE_IMGT_D_MASK not specified.
Can reconstruct intermediate sequences with IgPhyML using --asr.
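The character masking described in the BuildTrees notes above amounts to a one-line substitution. The following is only a sketch of the idea, not the BuildTrees source; the function name mask_reserved and the sample identifier are hypothetical, and the suggestion that these characters are masked because they are structural in Newick tree strings is an assumption.

```python
import re

# The four characters named in the release note: colon, comma, and both parentheses.
# Assumption: they are masked because they carry structural meaning in Newick tree strings.
RESERVED = re.compile(r"[:,()]")

def mask_reserved(text):
    """Replace each reserved character with a hyphen."""
    return RESERVED.sub("-", text)

masked = mask_reserved("sample(1):seq,42")
print(masked)
```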
ConvertDb:
Fixed a bug in the airr subcommand that caused the junction_length field to be deleted from the output.
Fixed a bug in the genbank subcommand that caused the junction CDS to be missing from the ASN output.
CreateGermlines:
Added the --cf argument to allow specification of the clone field.
MakeDb:
Added the igblast-aa subcommand to parse the output of igblastp.
Changed the log entry FUNCTIONAL to PRODUCTIVE and removed the IMGT_PASS log entry in favor of an informative ERROR entry when sequences fail the junction region validation.
Added the --regions argument to the igblast and igblast-aa subcommands to allow specification of the IMGT CDR/FWR region boundaries. Currently, the supported specifications are default (human, mouse) and rhesus-igl.
Version 0.4.6: July 19, 2019¶
BuildTrees:
Added capability of running IgPhyML on outputted data (--igphyml) and support for passing IgPhyML arguments through BuildTrees.
Added the --clean argument to force deletion of all intermediate files after IgPhyML execution.
Added the --format argument to allow specification of input and output in either the Change-O standard (changeo) or AIRR Rearrangement standard (airr).
CreateGermlines:
Fixed a bug causing incorrect reporting of the germline format in the console log.
ConvertDb:
Removed requirement for the NP1_LENGTH and NP2_LENGTH fields from the genbank subcommand.
DefineClones:
Fixed a biopython warning arising when applying --model aa to junction sequences that are not a multiple of three. The junction will now be padded with an appropriate number of Ns (usually resulting in a translation to X).
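The padding behavior in the DefineClones note can be sketched in a few lines. This is an illustration under stated assumptions rather than the DefineClones code: the function name pad_to_codons is hypothetical, and padding at the 3' end is an assumption, since the note does not say where the Ns are placed.

```python
def pad_to_codons(junction, pad_char="N"):
    """Pad a nucleotide sequence so its length is a multiple of three.

    Assumption: padding is appended at the 3' end.
    """
    remainder = len(junction) % 3
    if remainder:
        junction += pad_char * (3 - remainder)
    return junction

# An 11-nucleotide junction gains one N; a complete 6-nucleotide one is unchanged.
print(pad_to_codons("TGTGCGAGAGA"))
print(pad_to_codons("TGTGCG"))
```

A codon containing N generally cannot be resolved to a single amino acid, which is why the padded positions usually translate to X.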
MakeDb:
Added the --10x argument to all subcommands to support merging of Cell Ranger annotation data, such as UMI count and C-region assignment, with the output of the supported alignment tools.
Added inference of the receptor locus from the alignment data to all subcommands, which is output in the LOCUS field.
Combined the extended field arguments of all subcommands (--scores, --regions, --cdr3, and --junction) into a single --extended argument.
Removed parsing of old IgBLAST v1.5 CDR3 fields (CDR3_IGBLAST, CDR3_IGBLAST_AA).
Version 0.4.5: January 9, 2019¶
Slightly changed version number display in commandline help.
BuildTrees:
Fixed a bug that caused malformed lineages.tsv output file.
CreateGermlines:
Fixed a bug in the CreateGermlines log output causing incorrect missing D gene or J gene error messages.
DefineClones:
Fixed a bug that caused a missing junction column to cluster sequences together.
MakeDb:
Fixed a bug that caused failed germline reconstructions to be recorded as None, rather than an empty string, in the GERMLINE_IMGT column.
Version 0.4.4: October 27, 2018¶
Fixed a bug causing the values of _start fields to be off by one from the v1.2 AIRR Schema requirement when specifying --format airr.
Version 0.4.3: October 19, 2018¶
Updated airr library requirement to v1.2.1 to fix empty V(D)J start coordinate values when specifying --format airr to tools.
Changed pRESTO dependency to v0.5.10.
BuildTrees:
New tool.
Converts tab-delimited database files into input for IgPhyML.
CreateGermlines:
Now verifies that all files/folders passed to the -r argument exist.
Version 0.4.2: September 6, 2018¶
Updated support for the AIRR Rearrangement schema to v1.2 and added the associated airr library dependency.
AssignGenes:
New tool.
Provides a simple IgBLAST wrapper as the igblast subcommand.
ConvertDb:
The genbank subcommand will perform a check for some of the required columns in the input file and exit if they are not found.
Changed the behavior of the -y argument in the genbank subcommand. This argument is now limited to sample features only, but allows for the inclusion of any BioSample attribute.
CreateGermlines:
Will now perform a naive verification that the reference sequences provided to the -r argument are IMGT-gapped. A warning will be issued to standard error if the reference sequences fail the check.
Will perform a check for some of the required columns in the input file and exit if they are not found.
MakeDb:
Changed the output of SEQUENCE_VDJ from the igblast subcommand to retain insertions in the query sequence rather than delete them as is done in the SEQUENCE_IMGT field.
Will now perform a naive verification that the reference sequences provided to the -r argument are IMGT-gapped. A warning will be issued to standard error if the reference sequences fail the check.
Version 0.4.1: July 16, 2018¶
Fixed installation incompatibility with pip 10.
Fixed duplicate newline issue on Windows.
All tools will no longer create empty pass or fail files if there are no records meeting the appropriate criteria for output.
Most tools now allow explicit specification of the output file name via the optional -o argument.
Added support for the AIRR standard TSV via the --format airr argument to all relevant tools.
Replaced V, D and J BTOP columns with CIGAR columns in data standard.
Numerous API changes and internal structural changes to commandline tools.
AlignRecords:
Fixed a bug arising when space characters are present in the sequence identifiers.
ConvertDb:
New tool.
Includes the airr and changeo subcommands to convert between AIRR and Change-O formatted TSV files.
The genbank subcommand creates MiAIRR compliant files for submission to GenBank/TLS.
Contains the baseline and fasta subcommands previously in ParseDb.
CreateGermlines:
Changed character used to pad clonal consensus sequences from . to N.
Changed tie resolution in clonal consensus from random V/J gene to alphabetical by sequence identifier.
Added --df and --jf arguments for specifying D and J fields, respectively.
Added an initial sorting step when specifying --cloned so that clonally ordered input is no longer required.
DefineClones:
Removed the chen2010 and ademokun2011 methods and made the previous bygroup subcommand the default behavior.
Renamed the --f argument to --gf for consistency with other tools.
Added the arguments --vf and --jf to allow specification of V and J call fields, respectively.
MakeDb:
Renamed --noparse argument to --asis-id.
Added --asis-calls argument to igblast subcommand to allow use with non-standard gene names.
Added the GERMLINE_IMGT column to the default output.
Changed junction inference in igblast subcommand to use IgBLAST’s CDR3 assignment for IgBLAST versions greater than or equal to 1.7.0.
Added a verification that the SEQUENCE_IMGT and JUNCTION fields are in agreement for records to pass.
Changed behavior of the igblast subcommand’s translation of the junction sequence to truncate junctions that are not multiples of 3, rather than pad to a multiple of 3 (removes trailing X character).
The igblast subcommand will now fail records missing the required optional fields subject seq, query seq and BTOP, rather than abort.
Fixed bug causing parsing of IgBLAST <= 1.4 output to fail.
ParseDb:
Added the merge subcommand which will combine TSV files.
All field arguments are now case sensitive to provide support for both the Change-O and AIRR data standards.
Version 0.3.12: February 16, 2018¶
MakeDb:
Fixed a bug wherein specifying multiple simultaneous inputs would cause duplication of parsed pRESTO fields to appear in the second and higher output files.
Version 0.3.11: February 6, 2018¶
MakeDb:
Fixed junction inference for igblast subcommand when J region is truncated.
Version 0.3.10: February 6, 2018¶
Fixed incorrect progress bars resulting from files containing empty lines.
DefineClones:
Fixed several bugs in the chen2010 and ademokun2011 methods that caused them to either fail or incorrectly cluster all sequences into a single clone.
Added informative message for out of memory error in chen2010 and ademokun2011 methods.
Version 0.3.9: October 17, 2017¶
DefineClones:
Fixed a bug causing DefineClones to fail when all sequences are removed from a group due to missing characters.
Version 0.3.8: October 5, 2017¶
AlignRecords:
Resurrected AlignRecords which performs multiple alignment of sequence fields.
Added new subcommands across (multiple aligns within columns), within (multiple aligns columns within each row), and block (multiple aligns across both columns and rows).
CreateGermlines:
Fixed a bug causing CreateGermlines to incorrectly fail records when using the argument --vf V_CALL_GENOTYPED.
DefineClones:
Added the --maxmiss argument to the bygroup subcommand of DefineClones which sets exclusion criteria for junction sequences with ambiguous and missing characters. By default, bygroup will now fail all sequences with any missing characters in the junction (--maxmiss 0).
Version 0.3.7: June 30, 2017¶
MakeDb:
Fixed an incompatibility with IgBLAST v1.7.0.
CreateGermlines:
Fixed an error that occurs when using the --cloned argument with an input file containing duplicate values in SEQUENCE_ID that caused some records to be discarded.
Version 0.3.6: June 13, 2017¶
Fixed an overflow error on Windows that caused tools to fatally exit.
All tools will now print detailed help if no arguments are provided.
Version 0.3.5: May 12, 2017¶
Fixed a bug wherein .tsv was not being recognized as a valid extension.
MakeDb:
Added the --cdr3 argument to the igblast subcommand to extract the CDR3 nucleotide and amino acid sequence defined by IgBLAST.
Updated the IMGT/HighV-QUEST parser to handle recent column name changes.
Fixed a bug in the igblast parser wherein some sequence identifiers were not being processed correctly.
DefineClones:
Changed the way X characters are handled in the amino acid Hamming distance model to count as a match against any character.
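The X handling described for the amino acid Hamming distance model can be illustrated as follows. This is a minimal sketch of the idea, not the DefineClones distance code; the function name aa_hamming and the example sequences are hypothetical.

```python
def aa_hamming(seq1, seq2, wildcard="X"):
    """Count mismatched positions, treating the wildcard as a match against any character."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq1, seq2)
        if a != b and wildcard not in (a, b)
    )

# R/K differ and are counted; X/Y is not counted because X matches anything.
distance = aa_hamming("CARDXW", "CAKDYW")
print(distance)
```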
Version 0.3.4: February 14, 2017¶
License changed to Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
CreateGermlines:
Added GERMLINE_V_CALL, GERMLINE_D_CALL and GERMLINE_J_CALL columns to the output when the --cloned argument is specified. These columns contain the consensus annotations when clonal groups contain ambiguous gene assignments.
Fixed the error message for an invalid repo (-r) argument.
DefineClones:
Deprecated m1n and hs1f distance models, renamed them to m1n_compat and hs1f_compat, and replaced them with mk_rs1nf and hh_s1f, respectively.
Renamed the hs5f distance model to hh_s5f.
Added the mouse specific distance model mk_rs5nf from Cui et al, 2016.
MakeDb:
Added compatibility for IgBLAST v1.6.
Added the flag --partial which tells MakeDb to pass records with incomplete alignment results.
Added missing console log entries for the ihmm subcommand.
IMGT/HighV-QUEST, IgBLAST and iHMMune-Align parsers have been cleaned up, better documented and moved into the iterable classes changeo.Parsers.IMGTReader, changeo.Parsers.IgBLASTReader, and changeo.Parsers.IHMMuneReader, respectively.
Corrected behavior of D_FRAME annotation from the --junction argument to the imgt subcommand such that it now reports no value when no value is reported by IMGT, rather than reporting the reading frame as 0 in these cases.
Fixed parsing of IN_FRAME, STOP, D_SEQ_START and D_SEQ_LENGTH fields from iHMMune-Align output.
Removed extraneous score fields from each parser.
Fixed the error message for an invalid repo (-r) argument.
Version 0.3.3: August 8, 2016¶
Increased csv.field_size_limit in changeo.IO, ParseDb and DefineClones to be able to handle files with a larger number of UMIs in one field.
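For readers who hit the same limit in their own scripts, the standard library call involved is csv.field_size_limit. The sketch below shows one common way to raise it; the helper name raise_field_size_limit is hypothetical, and the halving loop is an assumption about how to stay within the platform's integer range, not a description of what Change-O itself does.

```python
import csv
import sys

def raise_field_size_limit():
    """Raise the csv module's per-field size limit as far as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms reject values that do not fit in a C long, so back off.
            limit //= 2

new_limit = raise_field_size_limit()
print(new_limit)
```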
Renamed the fields N1_LENGTH to NP1_LENGTH and N2_LENGTH to NP2_LENGTH.
CreateGermlines:
Added differentiation of the N and P regions in the REGION log field if the N/P region info is present in the input file (e.g., from the --junction argument to MakeDb-imgt). If the additional N/P region columns are not present, then both N and P regions will be denoted by N, as in previous versions.
Added the option ‘regions’ to the -g argument to add the GERMLINE_REGIONS field to the output, which represents the germline positions as V, D, J, N and P characters. This is equivalent to the REGION log entry.
DefineClones:
Significantly improved performance of the --act set grouping method in the bygroup subcommand.
MakeDb:
Fixed a bug producing D_SEQ_START and J_SEQ_START relative to SEQUENCE_VDJ when they should be relative to SEQUENCE_INPUT.
Added the argument --junction to the imgt subcommand to parse additional junction information fields, including N/P region lengths and the D-segment reading frame. This provides the following additional output fields: D_FRAME, N1_LENGTH, N2_LENGTH, P3V_LENGTH, P5D_LENGTH, P3D_LENGTH, P5J_LENGTH.
The fields N1_LENGTH and N2_LENGTH have been renamed to accommodate adding additional output from IMGT under the --junction flag. The new names are NP1_LENGTH and NP2_LENGTH.
Fixed a bug that caused the IN_FRAME, MUTATED_INVARIANT and STOP fields to be parsed incorrectly from IMGT data.
Output from iHMMune-Align can now be parsed via the ihmm subcommand. Note, there is insufficient information returned by iHMMune-Align to reliably reconstruct germline sequences from the output using CreateGermlines.
ParseDb:
Renamed the clip subcommand to baseline.
Version 0.3.2: March 8, 2016¶
Fixed a bug with installation on Windows due to old file paths lingering in changeo.egg-info/SOURCES.txt.
Updated license from CC BY-NC-SA 3.0 to CC BY-NC-SA 4.0.
CreateGermlines:
Fixed a bug producing incorrect values in the SEQUENCE field in the log file.
MakeDb:
Updated igblast subcommand to correctly parse records with indels. Now igblast must be run with the argument -outfmt "7 std qseq sseq btop".
Changed the names of the FWR and CDR output columns added with --regions to <region>_IMGT.
Added V_BTOP and J_BTOP output when the --scores flag is specified to the igblast subcommand.
Version 0.3.1: December 18, 2015¶
MakeDb:
Fixed bug wherein the imgt subcommand was not properly recognizing an extracted folder as input to the -i argument.
Version 0.3.0: December 4, 2015¶
Conversion to a proper Python package which uses pip and setuptools for installation.
The package now requires Python 3.4. Python 2.7 is no longer supported.
The required dependency versions have been bumped to numpy 1.9, scipy 0.14, pandas 0.16 and biopython 1.65.
DbCore:
Divided DbCore functionality into the separate modules: Defaults, Distance, IO, Multiprocessing and Receptor.
IgCore:
Remove IgCore in favor of dependency on pRESTO >= 0.5.0.
AnalyzeAa:
This tool was removed. This functionality has been migrated to the alakazam R package.
DefineClones:
Added
--sf
flag to specify sequence field to be used to calculate distance between sequences.Fixed bug in wherein sequences with missing data in grouping columns were being assigned into a single group and clustered. Sequences with missing grouping variables will now be failed.
Fixed bug where sequences with “None” junctions were grouped together.
GapRecords:
This tool was removed in favor of adding IMGT gapping support to igblast subcommand of MakeDb.
MakeDb:
Updated IgBLAST parser to create an IMGT gapped sequence and infer the junction region as defined by IMGT.
Added the --regions flag which adds extra columns containing FWR and CDR regions as defined by IMGT.
Added support to imgt subcommand for the new IMGT/HighV-QUEST compression scheme (.txz files).
Version 0.2.5: August 25, 2015¶
CreateGermlines:
Removed default ‘-r’ repository and added informative error messages when invalid germline repositories are provided.
Updated ‘-r’ flag to take list of folders and/or fasta files with germlines.
Version 0.2.4: August 19, 2015¶
MakeDb:
Fixed a bug wherein N1 and N2 region indexing was off by one nucleotide for the igblast subcommand (leading to incorrect SEQUENCE_VDJ values).
ParseDb:
Fixed a bug wherein specifying the -f argument to the index subcommand would cause an error.
Version 0.2.3: July 22, 2015¶
DefineClones:
Fixed a typo in the default normalization setting of the bygroup subcommand, which was being interpreted as ‘none’ rather than ‘len’.
Changed the ‘hs5f’ model of the bygroup subcommand to be centered -log10 of the targeting probability.
Added the --sym argument to the bygroup subcommand which determines how asymmetric distances are handled.
Version 0.2.2: July 8, 2015¶
CreateGermlines:
Germline creation now works for IgBLAST output parsed with MakeDb. The argument --sf SEQUENCE_VDJ must be provided to generate germlines from IgBLAST output. The same reference database used for the IgBLAST alignment must be specified with the -r flag.
Fixed a bug with determination of N1 and N2 region positions.
MakeDb:
Combined the -z and -f flags of the imgt subcommand into a single flag, -i, which autodetects the input type.
Added requirement that IgBLAST input be generated using the -outfmt "7 std qseq" argument to igblastn.
Modified SEQUENCE_VDJ output from IgBLAST parser to include gaps inserted during alignment.
Added correction for IgBLAST alignments where V/D, D/J or V/J segments are assigned overlapping positions.
Corrected N1_LENGTH and N2_LENGTH calculation from IgBLAST output.
Added the --scores flag which adds extra columns containing alignment scores from IMGT and IgBLAST output.
Version 0.2.0: June 17, 2015¶
Initial public prerelease.
Output files were added to the usage documentation of all scripts.
General code cleanup.
DbCore:
Updated loading of database files to convert column names to uppercase.
AnalyzeAa:
Fixed a bug where junctions less than one codon long would lead to a division by zero error.
Added --failed flag to create database with records that fail analysis.
Added --sf flag to specify sequence field to be analyzed.
CreateGermlines:
Fixed a bug where germline sequences could not be created for light chains.
DefineClones:
Added a human 1-mer model, ‘hs1f’, which uses the substitution rates from Yaari et al, 2013.
Changed default model to ‘hs1f’ and default normalization to length for bygroup subcommand.
Added --link argument which allows for specification of single, complete, or average linkage during clonal clustering (default single).
GapRecords:
Fixed a bug wherein non-standard sequence fields could not be aligned.
MakeDb:
Fixed bug where the allele ‘TRGVA*01’ was not recognized as a valid allele.
ParseDb:
Added rename subcommand to ParseDb which renames fields.
Version 0.2.0.beta-2015-05-31: May 31, 2015¶
Minor changes to a few output file names and log field entries.
ParseDb:
Added index subcommand to ParseDb which adds a numeric index field.
Version 0.2.0.beta-2015-05-05: May 05, 2015¶
Prerelease for review.
Commandline Usage¶
AlignRecords.py¶
Multiple aligns sequence fields
usage: AlignRecords.py [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- align-pass
database with multiple aligned sequences.
- align-fail
database with records failing alignment.
- required fields:
sequence_id, v_call, j_call
- <field>
user specified sequence fields to align.
- output fields:
<field>_align
AlignRecords.py across¶
usage: AlignRecords.py across [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE] [--failed]
[--format {airr,changeo}] [--nproc NPROC] --sf
SEQ_FIELDS [SEQ_FIELDS ...]
[--gf GROUP_FIELDS [GROUP_FIELDS ...]]
[--calls {v,d,j} [{v,d,j} ...]]
[--mode {allele,gene}] [--act {first}]
[--exec MUSCLE_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilize).
-
--sf
<seq_fields>
¶ The sequence fields to multiple align within each group.
-
--gf
<group_fields>
¶ Additional (not allele call) fields to use for grouping.
-
--calls
{v,d,j}
¶ Segment calls (allele assignments) to use for grouping.
-
--mode
{allele,gene}
¶ Specifies whether to use the V(D)J allele or gene when an allele call field (--calls) is specified.
-
--act
{first}
¶ Specifies how to handle multiple values within default allele call fields. Currently, only “first” is supported.
-
--exec
<muscle_exec>
¶ The location of the MUSCLE executable
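As a rough illustration of how --calls, --mode, --act and --gf might combine into a grouping key, the sketch below takes the first call from each requested field and optionally reduces it to the gene name. The helper names, the comma-delimited call format, the `*` gene/allele separator and the default of V and J calls are assumptions made for this example; this is not the tool's actual implementation.

```python
def first_call(call_field, mode="allele"):
    """Return the first call of a comma-delimited field (the "first" action).

    Assumes IMGT-style names such as "IGHV3-23*01", where "*" separates
    the gene name from the allele number.
    """
    first = call_field.split(",")[0].strip()
    if mode == "gene":
        return first.split("*")[0]
    return first


def group_key(record, calls=("v", "j"), mode="gene", group_fields=()):
    """Build a hypothetical grouping key from segment calls and extra fields."""
    key = [first_call(record[f"{segment}_call"], mode) for segment in calls]
    key.extend(record[name] for name in group_fields)
    return tuple(key)


# Example record using AIRR-style column names; "sample" is a made-up field.
record = {"v_call": "IGHV3-23*01,IGHV3-23D*01", "j_call": "IGHJ4*02", "sample": "S1"}
print(group_key(record))
print(group_key(record, mode="allele", group_fields=("sample",)))
```

Records sharing the same key would fall into the same alignment group under these assumptions.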
AlignRecords.py block¶
usage: AlignRecords.py block [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--format {airr,changeo}] [--nproc NPROC] --sf
SEQ_FIELDS [SEQ_FIELDS ...]
[--gf GROUP_FIELDS [GROUP_FIELDS ...]]
[--calls {v,d,j} [{v,d,j} ...]]
[--mode {allele,gene}] [--act {first}]
[--exec MUSCLE_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilize).
-
--sf
<seq_fields>
¶ The sequence fields to multiple align within each group.
-
--gf
<group_fields>
¶ Additional (not allele call) fields to use for grouping.
-
--calls
{v,d,j}
¶ Segment calls (allele assignments) to use for grouping.
-
--mode
{allele,gene}
¶ Specifies whether to use the V(D)J allele or gene when an allele call field (--calls) is specified.
-
--act
{first}
¶ Specifies how to handle multiple values within default allele call fields. Currently, only “first” is supported.
-
--exec
<muscle_exec>
¶ The location of the MUSCLE executable
AlignRecords.py within¶
usage: AlignRecords.py within [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE] [--failed]
[--format {airr,changeo}] [--nproc NPROC] --sf
SEQ_FIELDS [SEQ_FIELDS ...] [--exec MUSCLE_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilize).
-
--sf
<seq_fields>
¶ The sequence fields to multiple align within each record.
-
--exec
<muscle_exec>
¶ The location of the MUSCLE executable
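The output arguments shared by the AlignRecords.py subcommands carry a few documented restrictions: -o cannot be combined with --failed, --outdir or --outname, and --outname and --log may not be used with multiple input files. The following sketch checks a planned invocation against those rules before anything is run. The function name and its parameters are hypothetical and only mirror the documented flags.

```python
def check_output_args(db_files, out_files=None, outdir=None, outname=None,
                      log_file=None, failed=False):
    """Return a list of conflicts among the documented output arguments.

    Hypothetical helper: it encodes only the restrictions stated in the
    argument descriptions and does not call any Change-O code.
    """
    problems = []
    if out_files is not None and (failed or outdir is not None or outname is not None):
        problems.append("-o cannot be combined with --failed, --outdir or --outname")
    if len(db_files) > 1:
        if outname is not None:
            problems.append("--outname may not be specified with multiple input files")
        if log_file is not None:
            problems.append("--log may not be specified with multiple input files")
    return problems


# File names below are placeholders.
print(check_output_args(["a.tsv", "b.tsv"], outname="aligned"))
print(check_output_args(["a.tsv"], outdir="results", outname="aligned"))
```

An empty list means the combination does not violate any of the restrictions listed above.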
AssignGenes.py¶
Assign V(D)J gene annotations
usage: AssignGenes.py [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- igblast
Reference alignment results from IgBLAST.
AssignGenes.py igblast¶
Executes igblastn.
usage: AssignGenes.py igblast [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--nproc NPROC] -s SEQ_FILES [SEQ_FILES ...] -b
IGDATA
[--organism {human,mouse,rabbit,rat,rhesus_monkey}]
[--loci {ig,tr}] [--vdb VDB] [--ddb DDB]
[--jdb JDB] [--cdb CDB] [--format {blast,airr}]
[--exec IGBLAST_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilize).
-
-s
<seq_files>
¶ A list of FASTA files containing sequences to process.
-
-b
<igdata>
¶ IgBLAST database directory (IGDATA).
-
--organism
{human,mouse,rabbit,rat,rhesus_monkey}
¶ Organism name.
-
--loci
{ig,tr}
¶ The receptor type.
-
--vdb
<vdb>
¶ Name of the custom V reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_v will be used.
-
--ddb
<ddb>
¶ Name of the custom D reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_d will be used.
-
--jdb
<jdb>
¶ Name of the custom J reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_j will be used.
-
--cdb
<cdb>
¶ Name of the custom C reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_<organism>_<loci>_c will be used. Note, this argument will be ignored for IgBLAST versions below 1.18.0.
-
--format
{blast,airr}
¶ Specify the output format. Specifying “blast” will result in the IgBLAST “-outfmt 7 std qseq sseq btop” output format. Specifying “airr” will output the AIRR TSV format provided by the IgBLAST argument “-outfmt 19”.
-
--exec
<igblast_exec>
¶ Path to the igblastn executable.
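The default reference names described for --vdb, --ddb, --jdb and --cdb follow the pattern imgt_<organism>_<loci>_<segment>. A minimal sketch of that naming rule is shown below. The function name and the keyword-override mechanism are assumptions for illustration; only the name pattern itself comes from the argument descriptions.

```python
def reference_names(organism, loci, **custom):
    """Return the V, D, J and C reference names for an IgBLAST run.

    Defaults follow the documented form imgt_<organism>_<loci>_<segment>.
    Keyword arguments (v=, d=, j=, c=) stand in for custom names that
    would be passed with --vdb, --ddb, --jdb and --cdb.
    """
    names = {seg: f"imgt_{organism}_{loci}_{seg}" for seg in ("v", "d", "j", "c")}
    for seg, name in custom.items():
        if seg not in names:
            raise ValueError(f"unknown segment: {seg}")
        if name is not None:
            names[seg] = name
    return names


print(reference_names("human", "ig"))
# "my_custom_v" is a placeholder name, not a database shipped with any tool.
print(reference_names("mouse", "tr", v="my_custom_v"))
```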
AssignGenes.py igblast-aa¶
Executes igblastp.
usage: AssignGenes.py igblast-aa [--version] [-h]
[-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--nproc NPROC] -s SEQ_FILES [SEQ_FILES ...]
-b IGDATA
[--organism {human,mouse,rabbit,rat,rhesus_monkey}]
[--loci {ig,tr}] [--vdb VDB]
[--exec IGBLAST_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilize).
-
-s
<seq_files>
¶ A list of FASTA files containing sequences to process.
-
-b
<igdata>
¶ IgBLAST database directory (IGDATA).
-
--organism
{human,mouse,rabbit,rat,rhesus_monkey}
¶ Organism name.
-
--loci
{ig,tr}
¶ The receptor type.
-
--vdb
<vdb>
¶ Name of the custom V reference in the IgBLAST database folder. If not specified, then a default database name with the form imgt_aa_<organism>_<loci>_v will be used.
-
--exec
<igblast_exec>
¶ Path to the igblastp executable.
BuildTrees.py¶
Converts TSV files into IgPhyML input files
usage: BuildTrees.py [--version] [-h] -d DB_FILES [DB_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [--log LOG_FILE]
[--failed] [--format {airr,changeo}] [--collapse]
[--ncdr3] [--nmask] [--md META_DATA [META_DATA ...]]
[--clones TARGET_CLONES [TARGET_CLONES ...]]
[--minseq MIN_SEQ] [--sample SAMPLE_DEPTH]
[--append APPEND [APPEND ...]] [--igphyml]
[--nproc NPROC] [--clean {none,all}]
[--optimize {n,r,l,lr,tl,tlr}] [--omega OMEGA] [-t KAPPA]
[--motifs MOTIFS] [--hotness HOTNESS]
[--oformat {tab,txt}] [--nohlp] [--asr ASR]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
--collapse
¶
If specified, collapse identical sequences before exporting to fasta.
-
--ncdr3
¶
If specified, remove CDR3 from all sequences.
-
--nmask
¶
If specified, do not attempt to mask split codons.
-
--md
<meta_data>
¶ List of fields containing metadata to include in the sequence headers of the output fasta file.
-
--clones
<target_clones>
¶ List of clone IDs to output, if specified.
-
--minseq
<min_seq>
¶ Minimum number of data sequences. Any clones with fewer than the specified number of sequences will be excluded.
-
--sample
<sample_depth>
¶ Depth of reads to be subsampled (before deduplication).
-
--append
<append>
¶ List of columns to append to sequence ID to ensure uniqueness.
-
--igphyml
¶
Run IgPhyML on output?
-
--nproc
<nproc>
¶ Number of threads to parallelize IgPhyML across.
-
--clean
{none,all}
¶ Delete intermediate files? none: leave all intermediate files; all: delete all intermediate files.
-
--optimize
{n,r,l,lr,tl,tlr}
¶ Optimize combination of topology (t), branch lengths (l) and parameters (r), or nothing (n), for IgPhyML.
-
--omega
<omega>
¶ Omega parameters to estimate for FWR,CDR respectively: e = estimate, ce = estimate + confidence interval, or numeric value
-
-t
<kappa>
¶ Kappa parameters to estimate: e = estimate, ce = estimate + confidence interval, or numeric value
-
--motifs
<motifs>
¶ Which motifs to estimate mutability for.
-
--hotness
<hotness>
¶ Mutability parameters to estimate: e = estimate, ce = estimate + confidence interval, or numeric value
-
--oformat
{tab,txt}
¶ IgPhyML output format.
-
--nohlp
¶
Don’t run HLP model?
-
--asr
<asr>
¶ Ancestral sequence reconstruction interval (0-1).
- output files:
- <folder>
folder containing fasta and partition files for each clone.
- lineages
successfully processed records.
- lineages-fail
database records that failed processing.
- igphyml-pass
parameter estimates and lineage trees from running IgPhyML, if specified
- required fields:
sequence_id, sequence, sequence_alignment, germline_alignment_d_mask or germline_alignment, v_call, j_call, clone_id, v_sequence_start
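Before running BuildTrees.py it can be useful to confirm that an input file carries the required columns, including at least one of the two germline alignment fields. The sketch below reads only the header row of a tab-delimited file. The function and constant names are hypothetical, and the check is an illustration rather than the validation performed by the tool itself.

```python
import csv
import io

# Field names taken from the "required fields" list above.
REQUIRED_FIELDS = ("sequence_id", "sequence", "sequence_alignment",
                   "v_call", "j_call", "clone_id", "v_sequence_start")
# Either one of these satisfies the germline requirement.
GERMLINE_FIELDS = ("germline_alignment_d_mask", "germline_alignment")


def missing_fields(handle):
    """Return the required fields absent from the header of a TSV handle."""
    header = set(next(csv.reader(handle, delimiter="\t"), []))
    missing = [name for name in REQUIRED_FIELDS if name not in header]
    if not any(name in header for name in GERMLINE_FIELDS):
        missing.append(" or ".join(GERMLINE_FIELDS))
    return missing


# In-memory example; a real call would pass an open file instead.
example = io.StringIO("\t".join(REQUIRED_FIELDS + ("germline_alignment",)) + "\n")
print(missing_fields(example))
```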
ConvertDb.py¶
Parses tab delimited database files
usage: ConvertDb.py [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- airr
AIRR formatted database files.
- changeo
Change-O formatted database files.
- sequences
FASTA formatted sequences output from the subcommands fasta and baseline.
- genbank
feature tables and fasta files containing MiAIRR compliant input for tbl2asn.
- required fields:
sequence_id, sequence, sequence_alignment, junction, v_call, d_call, j_call, v_germline_start, v_germline_end, v_sequence_start, v_sequence_end, d_sequence_start, d_sequence_end, j_sequence_start, j_sequence_end
- optional fields:
germline_alignment, c_call, clone_id
ConvertDb.py airr¶
Converts input to an AIRR TSV file.
usage: ConvertDb.py airr [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
ConvertDb.py baseline¶
Creates a BASELINe fasta file from database records.
usage: ConvertDb.py baseline [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--if ID_FIELD]
[--sf SEQ_FIELD] [--gf GERM_FIELD]
[--cf CLUSTER_FIELD]
[--mf META_FIELDS [META_FIELDS ...]]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--if
<id_field>
¶ The name of the field containing identifiers
-
--sf
<seq_field>
¶ The name of the field containing reads
-
--gf
<germ_field>
¶ The name of the field containing germline sequences
-
--cf
<cluster_field>
¶ The name of the field containing sorted clone IDs
-
--mf
<meta_fields>
¶ List of annotation fields to add to the sequence description
ConvertDb.py changeo¶
Converts input into a Change-O TSV file.
usage: ConvertDb.py changeo [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
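To give a feel for what converting between the two formats involves at the column level, the sketch below renames header columns using only the AIRR and Change-O name pairs that appear as defaults elsewhere in this usage reference. It is a deliberately partial illustration: the mapping tables and function name are made up for this example, and a real conversion handles many more fields than a header rename.

```python
# Name pairs quoted from the argument defaults in this reference
# (for example v_call (airr) or V_CALL (changeo)). The list is incomplete.
AIRR_TO_CHANGEO = {
    "sequence_alignment": "SEQUENCE_IMGT",
    "junction": "JUNCTION",
    "v_call": "V_CALL",
    "d_call": "D_CALL",
    "j_call": "J_CALL",
    "clone_id": "CLONE",
}
CHANGEO_TO_AIRR = {changeo: airr for airr, changeo in AIRR_TO_CHANGEO.items()}


def rename_columns(header, mapping):
    """Rename known columns and leave unrecognized names untouched."""
    return [mapping.get(name, name) for name in header]


print(rename_columns(["sequence_id", "v_call", "clone_id"], AIRR_TO_CHANGEO))
print(rename_columns(["V_CALL", "JUNCTION"], CHANGEO_TO_AIRR))
```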
ConvertDb.py fasta¶
Creates a fasta file from database records.
usage: ConvertDb.py fasta [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--if ID_FIELD]
[--sf SEQ_FIELD]
[--mf META_FIELDS [META_FIELDS ...]]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--if
<id_field>
¶ The name of the field containing identifiers
-
--sf
<seq_field>
¶ The name of the field containing sequences
-
--mf
<meta_fields>
¶ List of annotation fields to add to the sequence description
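The --if, --sf and --mf arguments together determine how a database row becomes a FASTA record. A minimal sketch of that idea follows. The function name, the default field names, and the `|field=value` layout of the description line are assumptions made for illustration and may differ from what the tool writes.

```python
def to_fasta(row, id_field="sequence_id", seq_field="sequence", meta_fields=()):
    """Format one database row (a dict) as a FASTA record.

    Assumed header layout: >ID|FIELD=VALUE|FIELD=VALUE
    """
    description = str(row[id_field])
    for name in meta_fields:
        description += f"|{name}={row[name]}"
    return f">{description}\n{row[seq_field]}\n"


# Example row with placeholder values.
row = {"sequence_id": "seq1", "sequence": "ACGT", "c_call": "IGHM"}
print(to_fasta(row, meta_fields=["c_call"]), end="")
```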
ConvertDb.py genbank¶
Creates files for GenBank/TLS submissions.
usage: ConvertDb.py genbank [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--format {airr,changeo}]
[--mol MOLECULE] [--product PRODUCT]
[--db DB_XREF] [--inf INFERENCE]
[--organism ORGANISM] [--sex SEX]
[--isolate ISOLATE] [--tissue TISSUE]
[--cell-type CELL_TYPE] [-y YAML_CONFIG]
[--label LABEL] [--cf C_FIELD] [--nf COUNT_FIELD]
[--if INDEX_FIELD] [--allow-stop] [--asis-id]
[--asis-calls] [--allele-delim ALLELE_DELIM]
[--asn] [--sbt ASN_TEMPLATE] [--exec TBL2ASN_EXEC]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
--mol
<molecule>
¶ The source molecule type. Usually one of “mRNA” or “genomic DNA”.
-
--product
<product>
¶ The product name, such as “immunoglobulin heavy chain”.
-
--db
<db_xref>
¶ Name of the reference database used for alignment. Usually “IMGT/GENE-DB”.
-
--inf
<inference>
¶ Name and version of the inference tool used for reference alignment in the form tool:version.
-
--organism
<organism>
¶ The scientific name of the organism.
-
--sex
<sex>
¶ If specified, adds the given sex annotation to the fasta headers.
-
--isolate
<isolate>
¶ If specified, adds the given isolate annotation (sample label) to the fasta headers.
-
--tissue
<tissue>
¶ If specified, adds the given tissue-type annotation to the fasta headers.
-
--cell-type
<cell_type>
¶ If specified, adds the given cell-type annotation to the fasta headers.
-
-y
<yaml_config>
¶ A yaml file specifying sample features (BioSample attributes) in the form ‘variable: value’. If specified, any features provided in the yaml file will override those provided at the commandline. Note, this config file applies to sample features only and cannot be used for required source features such as the --product or --mol argument.
-
--label
<label>
¶ If specified, add a field name to the sequence identifier. Sequence identifiers will be output in the form <label>=<id>.
-
--cf
<c_field>
¶ Field containing the C region call. If unspecified, the C region gene call will be excluded from the feature table.
-
--nf
<count_field>
¶ If specified, use the provided column to add the AIRR_READ_COUNT note to the feature table.
-
--if
<index_field>
¶ If specified, use the provided column to add the AIRR_CELL_INDEX note to the feature table.
-
--allow-stop
¶
If specified, retain records in the output with stop codons in the junction region. In such records the CDS will be removed and replaced with a similar misc_feature in the feature table.
-
--asis-id
¶
If specified, use the existing sequence identifier for the output identifier. By default, only the row number will be used as the identifier to avoid the 50 character limit.
-
--asis-calls
¶
¶ Specify to prevent alleles from being parsed using the IMGT nomenclature. Note, this requires the gene assignments to be exact matches to valid records in the reference database specified by the --db argument.
-
--allele-delim
<allele_delim>
¶ The delimiter to use for splitting the gene name from the allele number. Note, this only applies when specifying --asis-calls. By default, this argument will be ignored and allele numbers extracted under the expectation of IMGT nomenclature consistency.
-
--asn
¶
If specified, run tbl2asn to generate the .sqn submission file after making the .fsa and .tbl files.
-
--sbt
<asn_template>
¶ If provided along with --asn, use the specified file for the template file argument to tbl2asn.
-
--exec
<tbl2asn_exec>
¶ The name or location of the tbl2asn executable.
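The --label and --asis-id descriptions imply a simple rule for the output sequence identifier: the row number is used unless the existing identifier is kept, and a label, when given, is prepended in the form <label>=<id>. The sketch below expresses that rule. The function names are hypothetical, and applying the label to the row-number identifier as well is an assumption, since the descriptions do not say how the two options interact.

```python
def output_identifier(row_number, sequence_id, label=None, asis_id=False):
    """Build an identifier following the documented --label/--asis-id rules."""
    identifier = sequence_id if asis_id else str(row_number)
    if label is not None:
        # Documented output form: <label>=<id>
        identifier = f"{label}={identifier}"
    return identifier


def exceeds_limit(identifier, limit=50):
    """Check an identifier against the 50 character limit mentioned above."""
    return len(identifier) > limit


# "read_0001" and "seq" are placeholder values.
print(output_identifier(7, "read_0001"))
print(output_identifier(7, "read_0001", label="seq", asis_id=True))
```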
CreateGermlines.py¶
Reconstructs germline sequences from alignment data
usage: CreateGermlines.py [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--format {airr,changeo}] -r REFERENCES
[REFERENCES ...]
[-g {full,dmask,vonly,regions} [{full,dmask,vonly,regions} ...]]
[--cloned] [--sf SEQ_FIELD] [--vf V_FIELD]
[--df D_FIELD] [--jf J_FIELD] [--cf CLONE_FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
-r
<references>
¶ List of folders and/or fasta files (with .fasta, .fna or .fa extension) with germline sequences. When using the default Change-O sequence and coordinate fields, these reference sequences must contain IMGT-numbering spacers (gaps) in the V segment. Alternative numbering schemes, or no numbering, may work for alternative sequence and coordinate definitions that define a valid alignment, but a warning will be issued.
-
-g
{full,dmask,vonly,regions}
¶ Specify the type(s) of germlines to include: full germline (full), germline with the D segment masked (dmask), or germline for the V segment only (vonly).
-
--cloned
¶
Specify to create only one germline per clone. Note, if allele calls are ambiguous within a clonal group, this will place the germline call used for the entire clone within the germline_v_call, germline_d_call and germline_j_call fields.
-
--sf
<seq_field>
¶ Field containing the aligned sequence. Defaults to sequence_alignment (airr) or SEQUENCE_IMGT (changeo).
-
--vf
<v_field>
¶ Field containing the germline V segment call. Defaults to v_call (airr) or V_CALL (changeo).
-
--df
<d_field>
¶ Field containing the germline D segment call. Defaults to d_call (airr) or D_CALL (changeo).
-
--jf
<j_field>
¶ Field containing the germline J segment call. Defaults to j_call (airr) or J_CALL (changeo).
-
--cf
<clone_field>
¶ Field containing clone identifiers. Ignored if --cloned is not also specified. Defaults to clone_id (airr) or CLONE (changeo).
- output files:
- germ-pass
database with assigned germline sequences.
- germ-fail
database with records failing germline assignment.
- required fields:
sequence_id, sequence_alignment, v_call, d_call, j_call, v_sequence_start, v_sequence_end, v_germline_start, v_germline_end, d_sequence_start, d_sequence_end, d_germline_start, d_germline_end, j_sequence_start, j_sequence_end, j_germline_start, j_germline_end, np1_length, np2_length
- optional fields:
n1_length, n2_length, p3v_length, p5d_length, p3d_length, p5j_length, clone_id
- output fields:
germline_v_call, germline_d_call, germline_j_call, germline_alignment, germline_alignment_d_mask, germline_alignment_v_region, germline_regions
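The difference between the full, D-masked and V-only germlines can be pictured with a toy stitching function. This is a simplified sketch under stated assumptions: the segments are taken as already trimmed to the aligned positions, the non-templated np1 and np2 stretches are filled with N, and the masked variant replaces everything between V and J with N. The function name is hypothetical and the tool's own reconstruction involves coordinate handling that is omitted here.

```python
def germline_variants(v_germ, d_germ, j_germ, np1_length, np2_length):
    """Return toy germlines keyed by the documented output field names.

    Assumptions: inputs are pre-trimmed germline segments, and N is used
    both for np1/np2 positions and for the masked D segment.
    """
    full = v_germ + "N" * np1_length + d_germ + "N" * np2_length + j_germ
    masked_middle = "N" * (np1_length + len(d_germ) + np2_length)
    return {
        "germline_alignment": full,
        "germline_alignment_d_mask": v_germ + masked_middle + j_germ,
        "germline_alignment_v_region": v_germ,
    }


# Short placeholder sequences, not real germline segments.
for name, sequence in germline_variants("AAA", "CC", "GGG", 1, 2).items():
    print(name, sequence)
```

Under these assumptions the full and D-masked germlines always have the same length, which keeps them position-compatible with each other.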
DefineClones.py¶
Assign Ig sequences into clones
usage: DefineClones.py [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [--log LOG_FILE] [--failed]
[--format {airr,changeo}] [--nproc NPROC]
[--sf SEQ_FIELD] [--vf V_FIELD] [--jf J_FIELD]
[--gf GROUP_FIELDS [GROUP_FIELDS ...]]
[--mode {allele,gene}] [--act {first,set}]
[--model {ham,aa,hh_s1f,hh_s5f,mk_rs1nf,mk_rs5nf,hs1f_compat,m1n_compat}]
[--dist DISTANCE] [--norm {len,mut,none}]
[--sym {avg,min}] [--link {single,average,complete}]
[--maxmiss MAX_MISSING]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
--nproc
<nproc>
¶ The number of simultaneous computational processes to execute (CPU cores to utilize).
-
--sf
<seq_field>
¶ Field to be used to calculate distance between records. Defaults to junction (airr) or JUNCTION (changeo).
-
--vf
<v_field>
¶ Field containing the germline V segment call. Defaults to v_call (airr) or V_CALL (changeo).
-
--jf
<j_field>
¶ Field containing the germline J segment call. Defaults to j_call (airr) or J_CALL (changeo).
-
--gf
<group_fields>
¶ Additional fields to use for grouping clones aside from V, J and junction length.
-
--mode
{allele,gene}
¶ Specifies whether to use the V(D)J allele or gene for initial grouping.
-
--act
{first,set}
¶ Specifies how to handle multiple V(D)J assignments for initial grouping. The “first” action will use only the first gene listed. The “set” action will use all gene assignments and construct a larger gene grouping composed of any sequences sharing an assignment or linked to another sequence by a common assignment (similar to single-linkage).
-
--model
{ham,aa,hh_s1f,hh_s5f,mk_rs1nf,mk_rs5nf,hs1f_compat,m1n_compat}
¶ Specifies which substitution model to use for calculating distance between sequences. The “ham” model is nucleotide Hamming distance and “aa” is amino acid Hamming distance. The “hh_s1f” and “hh_s5f” models are human specific single nucleotide and 5-mer content models, respectively, from Yaari et al, 2013. The “mk_rs1nf” and “mk_rs5nf” models are mouse specific single nucleotide and 5-mer content models, respectively, from Cui et al, 2016. The “m1n_compat” and “hs1f_compat” models are deprecated models provided for backwards compatibility with the “m1n” and “hs1f” models in Change-O v0.3.3 and SHazaM v0.1.4. Both 5-mer models should be considered experimental.
-
--dist
<distance>
¶ The distance threshold for clonal grouping.
-
--norm
{len,mut,none}
¶ Specifies how to normalize distances. One of none (do not normalize), len (normalize by length), or mut (normalize by number of mutations between sequences).
-
--sym
{avg,min}
¶ Specifies how to combine asymmetric distances. One of avg (average of A->B and B->A) or min (minimum of A->B and B->A).
-
--link
{single,average,complete}
¶ Type of linkage to use for hierarchical clustering.
-
--maxmiss
<max_missing>
¶ The maximum number of non-ACGT characters (gaps or Ns) to permit in the junction sequence before excluding the record from clonal assignment. Note, under single linkage non-informative positions can create artifactual links between unrelated sequences. Use with caution.
- output files:
- clone-pass
database with assigned clonal group numbers.
- clone-fail
database with records failing clonal grouping.
- required fields:
sequence_id, v_call, j_call, junction
- output fields:
clone_id
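The grouping and clustering options above can be illustrated with a small standalone sketch. It is not the DefineClones.py implementation: it assumes records are plain dictionaries, groups on the full v_call and j_call strings (no allele/gene parsing, no handling of ambiguous calls), uses the nucleotide Hamming distance normalized by junction length, and joins sequences by single linkage at or below the threshold. The function names `hamming`, `find` and `assign_clones`, the threshold value, and the example records are all hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find(parent, i):
    """Return the root of item i in a union-find forest (path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def assign_clones(records, dist=0.15):
    """Add a clone_id to each record (dict with v_call, j_call, junction)."""
    # Initial grouping: same V call, same J call, same junction length.
    groups = defaultdict(list)
    for rec in records:
        key = (rec["v_call"], rec["j_call"], len(rec["junction"]))
        groups[key].append(rec)

    next_id = 0
    for key in sorted(groups):
        members = groups[key]
        parent = list(range(len(members)))
        # Single linkage: join any pair within the length-normalized threshold.
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = members[i]["junction"], members[j]["junction"]
                if hamming(a, b) / max(len(a), 1) <= dist:
                    parent[find(parent, i)] = find(parent, j)
        # Number the connected components within this group.
        labels = {}
        for i, rec in enumerate(members):
            root = find(parent, i)
            if root not in labels:
                next_id += 1
                labels[root] = next_id
            rec["clone_id"] = labels[root]
    return records


# Hypothetical example records (assumption: not taken from any real data set).
example = [
    {"sequence_id": "s1", "v_call": "IGHV1-2*01", "j_call": "IGHJ4*02",
     "junction": "TGTGCGAGAGATTACTGG"},
    {"sequence_id": "s2", "v_call": "IGHV1-2*01", "j_call": "IGHJ4*02",
     "junction": "TGTGCGAGAGATTATTGG"},
    {"sequence_id": "s3", "v_call": "IGHV1-2*01", "j_call": "IGHJ4*02",
     "junction": "TGTGCAAAACCCGGGTGG"},
    {"sequence_id": "s4", "v_call": "IGHV3-23*01", "j_call": "IGHJ4*02",
     "junction": "TGTGCGAGAGATTACTGG"},
]
clones = {r["sequence_id"]: r["clone_id"] for r in assign_clones(example)}
print(clones)
```

In this sketch s1 and s2 differ at one of 18 junction positions and share a clone, s3 is too distant to join them, and s4 is kept apart by its different V call even though its junction is identical to s1.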
MakeDb.py¶
Create tab-delimited database file to store sequence alignment information
usage: MakeDb.py [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- db-pass
database of alignment records with functionality information, V and J calls, and a junction region.
- db-fail
database with records that fail due to no productivity information, no V gene assignment, no J gene assignment, or no junction region.
- universal output fields:
sequence_id, sequence, sequence_alignment, germline_alignment, rev_comp, productive, stop_codon, vj_in_frame, locus, v_call, d_call, j_call, c_call, junction, junction_length, junction_aa, v_sequence_start, v_sequence_end, v_germline_start, v_germline_end, d_sequence_start, d_sequence_end, d_germline_start, d_germline_end, j_sequence_start, j_sequence_end, j_germline_start, j_germline_end, np1_length, np2_length, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2, cdr3
- imgt specific output fields:
n1_length, n2_length, p3v_length, p5d_length, p3d_length, p5j_length, d_frame, v_score, v_identity, d_score, d_identity, j_score, j_identity
- igblast specific output fields:
v_score, v_identity, v_support, v_cigar, d_score, d_identity, d_support, d_cigar, j_score, j_identity, j_support, j_cigar
- ihmm specific output fields:
vdj_score
- 10x specific output fields:
cell_id, consensus_count, umi_count, v_call_10x, d_call_10x, j_call_10x, junction_10x, junction_10x_aa
MakeDb.py igblast¶
Process igblastn output.
usage: MakeDb.py igblast [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE] [--failed] [--format {airr,changeo}]
-i ALIGNER_FILES [ALIGNER_FILES ...] -r REPO
[REPO ...] -s SEQ_FILES [SEQ_FILES ...]
[--10x CELLRANGER_FILES [CELLRANGER_FILES ...]]
[--asis-id] [--asis-calls] [--extended]
[--regions {default,rhesus-igl}] [--infer-junction]
[--partial]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified, create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
-i
<aligner_files>
¶ IgBLAST output files in format 7 with query sequence (igblastn argument ‘-outfmt “7 std qseq sseq btop”’).
-
-r
<repo>
¶ List of folders and/or fasta files containing the same germline set used in the IgBLAST alignment. These reference sequences must contain IMGT-numbering spacers (gaps) in the V segment.
-
-s
<seq_files>
¶ List of input FASTA files (with .fasta, .fna or .fa extension), containing sequences.
-
--10x
<cellranger_files>
¶ Table file containing 10X annotations (with .csv or .tsv extension).
-
--asis-id
¶
Specify to prevent input sequence headers from being parsed to add new columns to database. Parsing of sequence headers requires headers to be in the pRESTO annotation format, so this should be specified when sequence headers are incompatible with the pRESTO annotation scheme. Note, unrecognized header formats will default to this behavior.
-
--asis-calls
¶
Specify to prevent gene calls from being parsed into standard allele names in both the IgBLAST output and reference database. Note, this requires the sequence identifiers in the reference sequence set and the IgBLAST database to be exact string matches.
-
--extended
¶
Specify to include additional aligner specific fields in the output. Adds <vdj>_score, <vdj>_identity, <vdj>_support, <vdj>_cigar, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2 and cdr3.
-
--regions
{default,rhesus-igl}
¶ IMGT CDR and FWR boundary definition to use.
-
--infer-junction
¶
Infer the junction sequence. For use with IgBLAST v1.6.0 or older, prior to the addition of IMGT-CDR3 inference.
-
--partial
¶
If specified, include incomplete V(D)J alignments in the pass file instead of the fail file. An incomplete alignment is defined as a record that is missing a V gene assignment, J gene assignment, junction region, or productivity call.
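As a hedged illustration of how the arguments above fit together, the following sketch assembles a MakeDb.py igblast command line as a Python list without running it. Only flags listed in the usage above are used; the helper name `makedb_igblast_command` and every file path are hypothetical placeholders.

```python
import shlex


def makedb_igblast_command(aligner_file, repo, seq_file,
                           fmt="airr", extended=False, partial=False):
    """Build (but do not run) an argument list for MakeDb.py igblast."""
    cmd = ["MakeDb.py", "igblast",
           "-i", aligner_file,   # IgBLAST output file
           "-r", *repo,          # germline folders and/or FASTA files
           "-s", seq_file,       # FASTA file that was aligned
           "--format", fmt]
    if extended:
        cmd.append("--extended")
    if partial:
        cmd.append("--partial")
    return cmd


# Hypothetical file names, for illustration only.
cmd = makedb_igblast_command(
    "sample.fmt7",
    ["germlines/IGHV.fasta", "germlines/IGHD.fasta", "germlines/IGHJ.fasta"],
    "sample.fasta",
    extended=True,
)
print(shlex.join(cmd))
```

A list built this way could be handed to `subprocess.run(cmd, check=True)` on a system where Change-O is installed and the files exist.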
MakeDb.py igblast-aa¶
Process igblastp output.
usage: MakeDb.py igblast-aa [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME]
[--log LOG_FILE] [--failed]
[--format {airr,changeo}] -i ALIGNER_FILES
[ALIGNER_FILES ...] -r REPO [REPO ...] -s
SEQ_FILES [SEQ_FILES ...]
[--10x CELLRANGER_FILES [CELLRANGER_FILES ...]]
[--asis-id] [--asis-calls] [--extended]
[--regions {default,rhesus-igl}]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified, create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
-i
<aligner_files>
¶ IgBLAST output files in format 7 with query sequence (igblastp argument ‘-outfmt “7 std qseq sseq btop”’).
-
-r
<repo>
¶ List of folders and/or fasta files containing the same germline set used in the IgBLAST alignment. These reference sequences must contain IMGT-numbering spacers (gaps) in the V segment.
-
-s
<seq_files>
¶ List of input FASTA files (with .fasta, .fna or .fa extension), containing sequences.
-
--10x
<cellranger_files>
¶ Table file containing 10X annotations (with .csv or .tsv extension).
-
--asis-id
¶
Specify to prevent input sequence headers from being parsed to add new columns to database. Parsing of sequence headers requires headers to be in the pRESTO annotation format, so this should be specified when sequence headers are incompatible with the pRESTO annotation scheme. Note, unrecognized header formats will default to this behavior.
-
--asis-calls
¶
Specify to prevent gene calls from being parsed into standard allele names in both the IgBLAST output and reference database. Note, this requires the sequence identifiers in the reference sequence set and the IgBLAST database to be exact string matches.
-
--extended
¶
Specify to include additional aligner specific fields in the output. Adds v_score, v_identity, v_support, v_cigar, fwr1, fwr2, fwr3, cdr1 and cdr2.
-
--regions
{default,rhesus-igl}
¶ IMGT CDR and FWR boundary definition to use.
MakeDb.py ihmm¶
Process iHMMune-Align output.
usage: MakeDb.py ihmm [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME] [--log LOG_FILE]
[--failed] [--format {airr,changeo}] -i ALIGNER_FILES
[ALIGNER_FILES ...] -r REPO [REPO ...] -s SEQ_FILES
[SEQ_FILES ...]
[--10x CELLRANGER_FILES [CELLRANGER_FILES ...]]
[--asis-id] [--extended] [--partial]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified, create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
-i
<aligner_files>
¶ iHMMune-Align output file.
-
-r
<repo>
¶ List of folders and/or FASTA files containing the set of germline sequences used by iHMMune-Align. These reference sequences must contain IMGT-numbering spacers (gaps) in the V segment.
-
-s
<seq_files>
¶ List of input FASTA files (with .fasta, .fna or .fa extension) containing sequences.
-
--10x
<cellranger_files>
¶ Table file containing 10X annotations (with .csv or .tsv extension).
-
--asis-id
¶
Specify to prevent input sequence headers from being parsed to add new columns to database. Parsing of sequence headers requires headers to be in the pRESTO annotation format, so this should be specified when sequence headers are incompatible with the pRESTO annotation scheme. Note, unrecognized header formats will default to this behavior.
-
--extended
¶
Specify to include additional aligner specific fields in the output. Adds the path score of the iHMMune-Align hidden Markov model as vdj_score; adds fwr1, fwr2, fwr3, fwr4, cdr1, cdr2 and cdr3.
-
--partial
¶
If specified, include incomplete V(D)J alignments in the pass file instead of the fail file. An incomplete alignment is defined as a record that is missing a V gene assignment, J gene assignment, junction region, or productivity call.
MakeDb.py imgt¶
Process IMGT/HighV-QUEST output (does not work with V-QUEST).
usage: MakeDb.py imgt [--version] [-h] [-o OUT_FILES [OUT_FILES ...]]
[--outdir OUT_DIR] [--outname OUT_NAME] [--log LOG_FILE]
[--failed] [--format {airr,changeo}] -i ALIGNER_FILES
[ALIGNER_FILES ...] [-s [SEQ_FILES [SEQ_FILES ...]]]
[-r REPO [REPO ...]]
[--10x CELLRANGER_FILES [CELLRANGER_FILES ...]]
[--extended] [--asis-id] [--imgt-id-len IMGT_ID_LEN]
[--partial]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
--log
<log_file>
¶ Specify to write verbose logging to a file. May not be specified with multiple input files.
-
--failed
¶
If specified, create files containing records that fail processing.
-
--format
{airr,changeo}
¶ Output format. Also specifies the input format for tools accepting tab delimited AIRR Rearrangement or Change-O files.
-
-i
<aligner_files>
¶ Either zipped IMGT output files (.zip or .txz) or a folder containing unzipped IMGT output files (which must include 1_Summary, 2_IMGT-gapped, 3_Nt-sequences, and 6_Junction).
-
-s
<seq_files>
¶ List of FASTA files (with .fasta, .fna or .fa extension) that were submitted to IMGT/HighV-QUEST. If unspecified, sequence identifiers truncated by IMGT/HighV-QUEST will not be corrected.
-
-r
<repo>
¶ List of folders and/or fasta files containing the germline sequence set used by IMGT/HighV-QUEST. These reference sequences must contain IMGT-numbering spacers (gaps) in the V segment. If unspecified, the germline sequence reconstruction will not be included in the output.
-
--10x
<cellranger_files>
¶ Table file containing 10X annotations (with .csv or .tsv extension).
-
--extended
¶
Specify to include additional aligner specific fields in the output. Adds <vdj>_score, <vdj>_identity, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2, cdr3, n1_length, n2_length, p3v_length, p5d_length, p3d_length, p5j_length and d_frame.
-
--asis-id
¶
Specify to prevent input sequence headers from being parsed to add new columns to database. Parsing of sequence headers requires headers to be in the pRESTO annotation format, so this should be specified when sequence headers are incompatible with the pRESTO annotation scheme. Note, unrecognized header formats will default to this behavior.
-
--imgt-id-len
<imgt_id_len>
¶ The maximum character length of sequence identifiers reported by IMGT/HighV-QUEST. Specify 50 if the IMGT files (-i) were generated with an IMGT/HighV-QUEST version older than 1.8.3 (May 7, 2021).
-
--partial
¶
If specified, include incomplete V(D)J alignments in the pass file instead of the fail file. An incomplete alignment is defined as a record that is missing a V gene assignment, J gene assignment, junction region, or productivity call.
ParseDb.py¶
Parses tab delimited database files
usage: ParseDb.py [--version] [-h] ...
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
- output files:
- sequences
FASTA formatted sequences output from the subcommands fasta and clip.
- <field>-<value>
database files partitioned by annotation <field> and <value>.
- parse-<command>
output of the database modification functions where <command> is one of the subcommands add, index, drop, delete, rename, select, sort or update.
- required fields:
sequence_id
ParseDb.py add¶
Adds field and value pairs.
usage: ParseDb.py add [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] -f FIELDS [FIELDS ...] -u VALUES
[VALUES ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<fields>
¶ The names of the fields to add.
-
-u
<values>
¶ The value to assign to all rows for each field.
ParseDb.py delete¶
Deletes specific records.
usage: ParseDb.py delete [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] -f FIELDS [FIELDS ...]
[-u VALUES [VALUES ...]] [--logic {any,all}]
[--regex]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<fields>
¶ The names of the fields to check for deletion criteria.
-
-u
<values>
¶ The values defining which records to delete. A value may appear in any of the fields specified with -f.
-
--logic
{any,all}
¶ Defines whether a value may appear in any field (any) or whether it must appear in all fields (all).
-
--regex
¶
If specified, treat values as regular expressions and allow partial string matches.
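The interaction of -f, -u, --logic and --regex can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the ParseDb.py source: the helper names `record_matches` and `delete_records` are hypothetical, rows are assumed to be dictionaries, and the exact matching rules of the real tool (for example, how missing fields are treated) are assumptions.

```python
import re


def record_matches(row, fields, values, logic="any", regex=False):
    """Return True if the row meets the criteria in the given fields."""
    def hit(cell):
        if regex:
            # Regular expressions allow partial string matches.
            return any(re.search(v, cell) for v in values)
        # Without regex, the whole cell must equal one of the values.
        return cell in values

    hits = [hit(row.get(f, "")) for f in fields]
    return all(hits) if logic == "all" else any(hits)


def delete_records(rows, fields, values, logic="any", regex=False):
    """Keep only the rows that do NOT match."""
    return [r for r in rows
            if not record_matches(r, fields, values, logic, regex)]


# Hypothetical rows, for illustration only.
rows = [
    {"sequence_id": "s1", "productive": "T", "v_call": "IGHV1-2*01"},
    {"sequence_id": "s2", "productive": "F", "v_call": "IGHV3-23*01"},
    {"sequence_id": "s3", "productive": "T", "v_call": "IGHV3-23*04"},
]

kept_exact = delete_records(rows, ["productive"], ["F"])
kept_regex = delete_records(rows, ["v_call"], ["IGHV3"], regex=True)
kept_plain = delete_records(rows, ["v_call"], ["IGHV3"])
print([r["sequence_id"] for r in kept_exact])
print([r["sequence_id"] for r in kept_regex])
print([r["sequence_id"] for r in kept_plain])
```

The last call removes nothing, because without --regex the value IGHV3 would have to equal the entire v_call entry under the assumptions of this sketch.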
ParseDb.py drop¶
Deletes entire fields.
usage: ParseDb.py drop [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] -f FIELDS [FIELDS ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<fields>
¶ The names of the fields to delete from the database.
ParseDb.py index¶
Adds a numeric index field.
usage: ParseDb.py index [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] [-f FIELD]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<field>
¶ The name of the index field to add to the database.
ParseDb.py merge¶
Merges files.
usage: ParseDb.py merge [--version] [-h] -d DB_FILES [DB_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] [-o OUT_FILE]
[--drop]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-o
<out_file>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir or --outname arguments.
-
--drop
¶
If specified, drop fields that do not exist in all input files. Otherwise, include all columns in all files and fill missing data with empty strings.
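The effect of --drop on the merged column set can be sketched as follows. This is an illustration under stated assumptions, not the ParseDb.py implementation: each input is represented as a hypothetical `(fields, rows)` pair, and the helper name `merge_tables` and the example data are made up.

```python
def merge_tables(tables, drop=False):
    """Merge (fields, rows) pairs; rows are dictionaries keyed by field."""
    # Union of all field names, in first-seen order.
    all_fields = []
    for fields, _ in tables:
        for f in fields:
            if f not in all_fields:
                all_fields.append(f)

    if drop:
        # Keep only the fields present in every input.
        out_fields = [f for f in all_fields
                      if all(f in fields for fields, _ in tables)]
    else:
        out_fields = all_fields

    # Missing data is filled with empty strings.
    merged = [{f: row.get(f, "") for f in out_fields}
              for _, rows in tables for row in rows]
    return out_fields, merged


# Hypothetical inputs, for illustration only.
table_a = (["sequence_id", "v_call", "clone_id"],
           [{"sequence_id": "s1", "v_call": "IGHV1-2*01", "clone_id": "1"}])
table_b = (["sequence_id", "v_call"],
           [{"sequence_id": "s2", "v_call": "IGHV3-23*01"}])

fields_all, rows_all = merge_tables([table_a, table_b])
fields_drop, rows_drop = merge_tables([table_a, table_b], drop=True)
print(fields_all, rows_all)
print(fields_drop, rows_drop)
```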
ParseDb.py rename¶
Renames fields.
usage: ParseDb.py rename [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] -f FIELDS [FIELDS ...] -k NAMES
[NAMES ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<fields>
¶ List of fields to rename.
-
-k
<names>
¶ List of new names for each field.
ParseDb.py select¶
Selects specific records.
usage: ParseDb.py select [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] -f FIELDS [FIELDS ...] -u VALUES
[VALUES ...] [--logic {any,all}] [--regex]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<fields>
¶ The names of the fields to check for selection criteria.
-
-u
<values>
¶ The values defining which records to select. A value may appear in any of the fields specified with -f.
-
--logic
{any,all}
¶ Defines whether a value may appear in any field (any) or whether it must appear in all fields (all).
-
--regex
¶
If specified, treat values as regular expressions and allow partial string matches.
ParseDb.py sort¶
Sorts records by field values.
usage: ParseDb.py sort [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] -f FIELD [--num] [--descend]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<field>
¶ The annotation field by which to sort records.
-
--num
¶
Specify to define the sort column as numeric rather than textual.
-
--descend
¶
If specified, sort records in descending, rather than ascending, order by values in the target field.
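The difference between textual and numeric sorting is easy to overlook, so here is a minimal sketch of what --num and --descend change. The helper name `sort_records` and the example rows are hypothetical, and the sketch assumes every value in a numeric column parses as a float.

```python
def sort_records(rows, field, num=False, descend=False):
    """Sort dictionaries by one field, as text or as numbers."""
    if num:
        key = lambda r: float(r[field])
    else:
        key = lambda r: r[field]
    return sorted(rows, key=key, reverse=descend)


# Hypothetical rows, for illustration only.
rows = [
    {"sequence_id": "s1", "junction_length": "48"},
    {"sequence_id": "s2", "junction_length": "9"},
    {"sequence_id": "s3", "junction_length": "100"},
]

as_text = [r["junction_length"] for r in sort_records(rows, "junction_length")]
as_num = [r["junction_length"] for r in sort_records(rows, "junction_length", num=True)]
as_num_desc = [r["junction_length"]
               for r in sort_records(rows, "junction_length", num=True, descend=True)]
print(as_text)      # textual order compares character by character
print(as_num)       # numeric order compares magnitudes
print(as_num_desc)
```

Sorted as text, "100" precedes "48" and "9"; sorted as numbers, 9 comes first.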
ParseDb.py split¶
Splits database files by field values
usage: ParseDb.py split [--version] [-h] -d DB_FILES [DB_FILES ...]
[--outdir OUT_DIR] [--outname OUT_NAME] -f FIELD
[--num NUM_SPLIT]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<field>
¶ Annotation field by which to split database files.
-
--num
<num_split>
¶ Specify to define the field as numeric and group records by whether they are less than or at least (greater than or equal to) the specified value.
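A rough sketch of the two splitting behaviors described above: grouping by each distinct value of a field, or, with a numeric threshold, grouping by whether the value is below it or at least equal to it. The helper name `split_records`, the group labels "under" and "atleast", and the example rows are assumptions made for illustration.

```python
from collections import defaultdict


def split_records(rows, field, num_split=None):
    """Partition rows by field value, or around a numeric threshold."""
    groups = defaultdict(list)
    for row in rows:
        if num_split is None:
            label = row[field]
        elif float(row[field]) < num_split:
            label = "under"
        else:
            label = "atleast"
        groups[label].append(row)
    return dict(groups)


# Hypothetical rows, for illustration only.
rows = [
    {"sequence_id": "s1", "locus": "IGH", "junction_length": "30"},
    {"sequence_id": "s2", "locus": "IGK", "junction_length": "45"},
    {"sequence_id": "s3", "locus": "IGH", "junction_length": "48"},
    {"sequence_id": "s4", "locus": "IGH", "junction_length": "60"},
]

by_locus = split_records(rows, "locus")
by_length = split_records(rows, "junction_length", num_split=48)
print({k: [r["sequence_id"] for r in v] for k, v in by_locus.items()})
print({k: [r["sequence_id"] for r in v] for k, v in by_length.items()})
```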
ParseDb.py update¶
Updates field and value pairs.
usage: ParseDb.py update [--version] [-h] -d DB_FILES [DB_FILES ...]
[-o OUT_FILES [OUT_FILES ...]] [--outdir OUT_DIR]
[--outname OUT_NAME] -f FIELD -u VALUES [VALUES ...]
-t UPDATES [UPDATES ...]
-
--version
¶
show program’s version number and exit
-
-h
,
--help
¶
show this help message and exit
-
-d
<db_files>
¶ A list of tab delimited database files.
-
-o
<out_files>
¶ Explicit output file name. Note, this argument cannot be used with the --failed, --outdir, or --outname arguments. If unspecified, then the output filename will be based on the input filename(s).
-
--outdir
<out_dir>
¶ Changes the output directory to the location specified. The input file directory is used if this is not specified.
-
--outname
<out_name>
¶ Changes the prefix of the successfully processed output file to the string specified. May not be specified with multiple input files.
-
-f
<field>
¶ The name of the field to update.
-
-u
<values>
¶ The values that will be replaced.
-
-t
<updates>
¶ The new value to assign to each selected row.
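The relationship between -u and -t can be pictured as a lookup table from old values to new ones. The sketch below assumes the two lists are paired by position, which the usage suggests but this section does not state outright; the helper name `update_field` and the example rows are hypothetical.

```python
def update_field(rows, field, values, updates):
    """Replace each listed value in one field with its paired update."""
    # Assumption: the i-th entry of values maps to the i-th entry of updates.
    mapping = dict(zip(values, updates))
    return [{**row, field: mapping.get(row[field], row[field])}
            for row in rows]


# Hypothetical rows, for illustration only.
rows = [
    {"sequence_id": "s1", "sample": "A"},
    {"sequence_id": "s2", "sample": "B"},
    {"sequence_id": "s3", "sample": "C"},
]

updated = update_field(rows, "sample", ["A", "B"], ["control", "treated"])
print([r["sample"] for r in updated])
```

Values that are not listed, such as "C" here, are left unchanged, and the input rows are not modified in place.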
API¶
changeo.Alignment¶
Alignment manipulation
-
class
changeo.Alignment.
RegionDefinition
(junction_length, amino_acid=False, definition='default')¶ Bases:
object
FWR and CDR region boundary definitions
-
changeo.Alignment.
alignmentPositions
(alignment)¶ Extracts start position and length from an alignment
- Parameters
alignment – tuples of (operation, length) for each alignment operation.
- Returns
- query (q) and reference (r) start (0-based) and length information with keys
{q_start, q_length, r_start, r_length}.
- Return type
-
changeo.Alignment.
decodeBTOP
(btop)¶ Parse a BTOP string into a list of tuples in CIGAR annotation.
- Parameters
btop – BTOP string.
- Returns
tuples of (operation, length) for each operation in the BTOP string using CIGAR annotation.
- Return type
-
changeo.Alignment.
decodeCIGAR
(cigar)¶ Parse a CIGAR string into a list of tuples.
- Parameters
cigar – CIGAR string.
- Returns
tuples of (operation, length) for each operation in the CIGAR string.
- Return type
-
changeo.Alignment.
encodeCIGAR
(alignment)¶ Encodes a list of tuples with alignment information into a CIGAR string.
- Parameters
alignment – tuples of (type, length) for each alignment operation.
- Returns
CIGAR string.
- Return type
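The (operation, length) tuple representation shared by decodeCIGAR and encodeCIGAR can be illustrated with a standalone round trip. This sketch does not call changeo and is not its source code; the function names `decode_cigar` and `encode_cigar` and the example string are hypothetical, and the pattern only recognizes operations written as uppercase letters or "=".

```python
import re


def decode_cigar(cigar):
    """Parse a CIGAR string into a list of (operation, length) tuples."""
    return [(op, int(n)) for n, op in re.findall(r"(\d+)([A-Z=])", cigar)]


def encode_cigar(alignment):
    """Join (operation, length) tuples back into a CIGAR string."""
    return "".join(f"{n}{op}" for op, n in alignment)


# Hypothetical alignment: 3 soft-clipped bases, matches, one mismatch, an insertion.
cigar = "3S10=1X5=2I4="
ops = decode_cigar(cigar)
print(ops)
print(encode_cigar(ops))
```

Keeping the operation first in each tuple matches the (operation, length) order used in the parameter descriptions above.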
-
changeo.Alignment.
gapV
(seq, v_germ_start, v_germ_length, v_call, references, asis_calls=False)¶ Constructs IMGT-gapped V segment sequences.
- Parameters
seq (str) – V(D)J sequence alignment (SEQUENCE_VDJ).
v_germ_start (int) – start position of the V segment alignment in the germline (V_GERM_START_VDJ, 1-based).
v_germ_length (int) – length of the V segment alignment against the germline (V_GERM_LENGTH_VDJ, 1-based).
v_call (str) – V segment allele assignment (V_CALL).
references (dict) – dictionary of IMGT-gapped reference sequences.
asis_calls (bool) – if True do not parse v_call for allele names and just split by comma.
- Returns
dictionary containing IMGT-gapped query sequences and germline positions.
- Return type
- Raises
KeyError – raised if the v_call is not found in the reference dictionary.
-
changeo.Alignment.
getRegions
(seq, junction_length)¶ Identify FWR and CDR regions by IMGT definition.
- Parameters
seq – IMGT-gapped sequence.
junction_length – length of the junction region in nucleotides.
- Returns
dictionary of FWR and CDR sequences.
- Return type
-
changeo.Alignment.
inferJunction
(seq, j_germ_start, j_germ_length, j_call, references, asis_calls=False, regions='default')¶ Identify junction region by IMGT definition.
- Parameters
seq (str) – IMGT-gapped V(D)J sequence alignment (SEQUENCE_IMGT).
j_germ_start (int) – start position of the J segment alignment in the germline (J_GERM_START, 1-based).
j_germ_length (int) – length of the J segment alignment against the germline (J_GERM_LENGTH).
j_call (str) – J segment allele assignment (J_CALL).
references (dict) – dictionary of IMGT-gapped reference sequences.
asis_calls (bool) – if True do not parse j_call for allele names and just split by comma.
regions (str) – name of the IMGT FWR/CDR region definitions to use.
- Returns
dictionary containing junction sequence, translation and length.
- Return type
-
changeo.Alignment.
padAlignment
(alignment, q_start, r_start)¶ Pads the start of an alignment based on query and reference positions.
- Parameters
alignment – tuples of (operation, length) for each alignment operation.
q_start – query (input) start position (0-based)
r_start – reference (subject) start position (0-based)
- Returns
updated list of tuples of (operation, length) for the alignment.
- Return type
changeo.Applications¶
Application wrappers
-
changeo.Applications.
getIgBLASTVersion
(exec='igblastn')¶ Gets the version of the IgBLAST executable
-
changeo.Applications.
runASN
(fasta, template=None, exec='tbl2asn')¶ Executes tbl2asn to generate Sequin files
-
changeo.Applications.
runIgBLASTN
(fasta, igdata, loci='ig', organism='human', vdb=None, ddb=None, jdb=None, cdb=None, output=None, format='legacy', threads=1, exec='igblastn')¶ Runs igblastn on a sequence file
- Parameters
fasta (str) – fasta file containing sequences.
igdata (str) – path to the IgBLAST database directory (IGDATA environment).
loci (str) – receptor type; one of ‘ig’ or ‘tr’.
organism (str) – species name.
vdb (str) – name of a custom V reference in the database folder to use.
ddb (str) – name of a custom D reference in the database folder to use.
jdb (str) – name of a custom J reference in the database folder to use.
cdb (str) – name of a custom C reference in the database folder to use.
output (str) – output file name. If None, automatically generate from the fasta file name.
format (str) – output format. One of ‘blast’ or ‘airr’.
threads (int) – number of threads for igblastn.
exec (str) – the name or path to the igblastn executable.
- Returns
IgBLAST console output.
- Return type
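A wrapper of this kind mostly assembles an igblastn command line. The sketch below only builds the argument list and does not run anything. The function name, the explicit database path arguments, and the choice of output format 19 (AIRR TSV in recent IgBLAST releases) are assumptions for illustration and may not match what runIgBLASTN does internally.

```python
def build_igblastn_command(fasta, vdb, ddb, jdb, auxiliary, output,
                           organism='human', seqtype='Ig', threads=1,
                           executable='igblastn'):
    """Assemble an igblastn argument list (hypothetical helper)."""
    return [executable,
            '-query', fasta,
            '-out', output,
            '-num_threads', str(threads),
            '-ig_seqtype', seqtype,
            '-organism', organism,
            '-auxiliary_data', auxiliary,
            '-germline_db_V', vdb,
            '-germline_db_D', ddb,
            '-germline_db_J', jdb,
            '-outfmt', '19',            # assumed: AIRR rearrangement TSV
            '-domain_system', 'imgt']


cmd = build_igblastn_command('reads.fasta', 'db/v', 'db/d', 'db/j',
                             'optional_file/human_gl.aux', 'reads.tsv',
                             threads=4)
# The list could then be passed to subprocess.check_output with the
# IGDATA environment variable set to the IgBLAST data directory.
```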
-
changeo.Applications.
runIgBLASTP
(fasta, igdata, loci='ig', organism='human', vdb=None, output=None, threads=1, exec='igblastp')¶ Runs igblastp on a sequence file
- Parameters
fasta (str) – fasta file containing sequences.
igdata (str) – path to the IgBLAST database directory (IGDATA environment).
loci (str) – receptor type; one of ‘ig’ or ‘tr’.
organism (str) – species name.
vdb (str) – name of a custom V reference in the database folder to use.
output (str) – output file name. If None, automatically generate from the fasta file name.
threads (int) – number of threads for igblastp.
exec (str) – the name or path to the igblastp executable.
- Returns
IgBLAST console output.
- Return type
-
changeo.Applications.
runIgPhyML
(rep_file, rep_dir, model='HLP17', motifs='FCH', threads=1, exec='igphyml')¶ Run IgPhyML
changeo.Commandline¶
Commandline interface
-
class
changeo.Commandline.
CommonHelpFormatter
(prog, indent_increment=2, max_help_position=24, width=None)¶ Bases:
argparse.RawDescriptionHelpFormatter
,argparse.ArgumentDefaultsHelpFormatter
Custom argparse.HelpFormatter
-
changeo.Commandline.
checkArgs
(parser)¶ Checks that arguments have been provided and prints help if they have not.
- Parameters
parser – An argparse.ArgumentParser defining the commandline arguments.
- Returns
True if arguments are present. Prints help and exits if not.
- Return type
boolean
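The check described above can be sketched with the standard argparse and sys modules. The name check_args_sketch and the explicit argv parameter are assumptions added so the example can be run without a real command line; the exit status is also an arbitrary choice.

```python
import argparse
import sys


def check_args_sketch(parser, argv=None):
    """Print help and exit when no arguments were supplied (illustrative)."""
    argv = sys.argv[1:] if argv is None else argv
    if len(argv) == 0:
        parser.print_help()
        sys.exit(1)
    return True


parser = argparse.ArgumentParser(prog='demo')
parser.add_argument('-d', dest='db_files', nargs='+')
ok = check_args_sketch(parser, ['-d', 'db.tsv'])
```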
-
changeo.Commandline.
getCommonArgParser
(db_in=True, db_out=True, out_file=True, failed=True, log=True, format=True, multiproc=False, add_help=True)¶ Defines an ArgumentParser object with common pRESTO arguments
- Parameters
db_in (bool) – if True include tab delimited database input arguments.
db_out (bool) – if True include explicit output file name argument.
out_file (bool) – if True add explicit output file name arguments.
failed (bool) – if True include arguments for output of failed results.
log (bool) – if True include log arguments.
format (bool) – if True include input and output type arguments.
multiproc (bool) – if True include multiprocessing arguments.
- Returns
an argument parser.
- Return type
-
changeo.Commandline.
parseCommonArgs
(args, in_arg=None, in_types=None, in_list=False)¶ Checks common arguments from getCommonArgParser and transforms output options to a dictionary
- Parameters
args – Argument Namespace defined by ArgumentParser.parse_args.
in_arg – String defining a non-standard input file argument to verify; by default ‘db_files’ and ‘seq_files’ are supported in that order.
in_types – List of types (file extensions as strings) to allow for files in in_arg; if None do not check type.
in_list – if True allow multiple input files with the out_name and log arguments.
- Returns
Dictionary copy of args with output arguments embedded in the dictionary out_args
- Return type
-
changeo.Commandline.
setDefaultFields
(args, defaults, format='airr')¶ Sets default field arguments by format
changeo.Distance¶
Distance calculations
-
changeo.Distance.
calcDistances
(sequences, n, dist_mat, sym='avg', norm=None)¶ Calculate pairwise distances between input sequences
- Parameters
sequences – List of sequences for which to calculate pairwise distances
n – Length of n-mers to be used in calculating distance
dist_mat – pandas.DataFrame of mutation distances
norm – Normalization method. One of None, ‘len’, or ‘mut’.
sym – Symmetry method; one of ‘avg’ or ‘min’.
- Returns
numpy matrix of pairwise distances between input sequences
- Return type
ndarray
-
changeo.Distance.
formClusters
(dists, link, distance)¶ Form clusters based on hierarchical clustering of input distance matrix with linkage type and cutoff distance
- Parameters
dists – numpy matrix of distances
link – Linkage type for hierarchical clustering
distance – Distance at which to cut into clusters
- Returns
List of cluster assignments
- Return type
-
changeo.Distance.
getAADistMatrix
(mat=None, mask_dist=0, gap_dist=0)¶ Generates an amino acid distance matrix
- Parameters
mat – Input distance matrix to extend to full alphabet; if unspecified, creates Hamming distance matrix that incorporates IUPAC equivalencies
mask_dist – Score for all matches against an X character
gap_dist – Score for all matches against a gap (-, .) character
- Returns
pandas.DataFrame of distances
- Return type
DataFrame
-
changeo.Distance.
getDNADistMatrix
(mat=None, mask_dist=0, gap_dist=0)¶ Generates a DNA distance matrix
- Parameters
mat – Input distance matrix to extend to full alphabet; if unspecified, creates Hamming distance matrix that incorporates IUPAC equivalencies
mask_dist – Distance for all matches against an N character
gap_dist – Distance for all matches against a gap (-, .) character
- Returns
pandas.DataFrame of distances
- Return type
DataFrame
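The idea of a Hamming matrix that honors IUPAC equivalencies can be sketched in pure Python. A plain dictionary keyed by character pairs stands in for the pandas.DataFrame, and the precedence of N over gap characters is an assumption. The name dna_dist_sketch is hypothetical.

```python
# Nucleotides represented by each IUPAC code (N handled separately)
IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
         'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
         'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG'}


def dna_dist_sketch(mask_dist=0, gap_dist=0):
    """Build {(char1, char2): distance} for the DNA alphabet (illustrative)."""
    chars = list(IUPAC) + ['N', '-', '.']
    dist = {}
    for a in chars:
        for b in chars:
            if a == 'N' or b == 'N':
                d = mask_dist            # assumed: N takes precedence over gaps
            elif a in '-.' or b in '-.':
                d = gap_dist
            elif set(IUPAC[a]) & set(IUPAC[b]):
                d = 0                    # codes share at least one nucleotide
            else:
                d = 1
            dist[(a, b)] = d
    return dist


dist = dna_dist_sketch()
gapped = dna_dist_sketch(gap_dist=1)
```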
-
changeo.Distance.
getNmers
(sequences, n)¶ Breaks input sequences down into n-mers
- Parameters
sequences – List of sequences to be broken into n-mers
n – Length of n-mers to return
- Returns
Dictionary mapping sequence to a list of n-mers
- Return type
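A minimal sliding-window sketch of the n-mer decomposition follows. It assumes no padding at the sequence ends, which may differ from the library's behavior, and the name get_nmers_sketch is hypothetical.

```python
def get_nmers_sketch(sequences, n):
    """Map each sequence to its overlapping n-mers (illustrative, unpadded)."""
    return {seq: [seq[i:i + n] for i in range(len(seq) - n + 1)]
            for seq in sequences}


nmers = get_nmers_sketch(['ACGTA'], 3)
```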
-
changeo.Distance.
zip_equal
(*iterables)¶ Zips iterables and raises exception if different lengths
- Parameters
iterables – pointer to iterables to zip together
- Returns
A generator of tuples with combined elements from the iterables
- Return type
iter
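One common way to get this behavior uses itertools.zip_longest with a private sentinel. This is a sketch of the technique; the name zip_equal_sketch and the choice of ValueError are assumptions.

```python
from itertools import zip_longest


def zip_equal_sketch(*iterables):
    """Zip iterables, raising ValueError if their lengths differ."""
    sentinel = object()
    for combo in zip_longest(*iterables, fillvalue=sentinel):
        # A sentinel in the tuple means one iterable ran out early
        if any(item is sentinel for item in combo):
            raise ValueError('Iterables have different lengths')
        yield combo


pairs = list(zip_equal_sketch('ACG', 'TGC'))
```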
changeo.Gene¶
Gene annotations
-
changeo.Gene.
buildClonalGermline
(receptors, references, seq_field='sequence_imgt', v_field='v_call', d_field='d_call', j_field='j_call', amino_acid=False)¶ Determine consensus clone sequence and create germline for clone
- Parameters
receptors (changeo.Receptor.Receptor) – list of Receptor objects
references (dict) – dictionary of IMGT gapped germline sequences
seq_field (str) – Receptor attribute in which to look for sequence
v_field (str) – Receptor attribute in which to look for V call
d_field (str) – Receptor attribute in which to look for D call
j_field (str) – Receptor attribute in which to look for J call
amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.
- Returns
- log dictionary, dictionary of {germline_type: germline_sequence},
dictionary of consensus {segment: gene call}
- Return type
-
changeo.Gene.
buildGermline
(receptor, references, seq_field='sequence_imgt', v_field='v_call', d_field='d_call', j_field='j_call', amino_acid=False)¶ Join gapped germline sequences aligned with sample sequences
- Parameters
receptor (changeo.Receptor.Receptor) – Receptor object.
references (dict) – dictionary of IMGT gapped germline sequences.
seq_field (str) – Receptor attribute in which to look for sequence.
v_field (str) – Receptor attribute in which to look for V call.
d_field (str) – Receptor attribute in which to look for D call.
j_field (str) – Receptor attribute in which to look for J call.
amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.
- Returns
log dictionary, dictionary of {germline_type: germline_sequence}, dictionary of {segment: gene call}
- Return type
-
changeo.Gene.
getAllele
(gene, action='first')¶ Extract allele from gene call string
-
changeo.Gene.
getAlleleNumber
(gene, action='first')¶ Extract allele number from gene call string
-
changeo.Gene.
getCAllele
(gene, action='first')¶ Extract C-region allele from gene call string
-
changeo.Gene.
getCGene
(gene, action='first')¶ Extract C-region gene from gene call string
-
changeo.Gene.
getDAllele
(gene, action='first')¶ Extract D allele from gene call string
-
changeo.Gene.
getDGermline
(receptor, references, d_field='d_call', amino_acid=False)¶ Extract D allele and germline sequence
- Parameters
receptor (changeo.Receptor.Receptor) – Receptor object
references (dict) – dictionary of germline sequences
d_field (str) – Receptor attribute containing the D allele assignment
amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.
- Returns
D allele name, D segment germline sequence
- Return type
-
changeo.Gene.
getFamily
(gene, action='first')¶ Extract family from gene call string
-
changeo.Gene.
getGene
(gene, action='first')¶ Extract gene from gene call string
-
changeo.Gene.
getJAllele
(gene, action='first')¶ Extract J allele from gene call string
-
changeo.Gene.
getJGermline
(receptor, references, j_field='j_call', amino_acid=False)¶ Extract J allele and germline sequence
- Parameters
receptor (changeo.Receptor.Receptor) – Receptor object
references (dict) – dictionary of germline sequences
j_field (str) – Receptor attribute containing the J allele assignment
amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.
- Returns
J allele name, J segment germline sequence
- Return type
-
changeo.Gene.
getLocus
(gene, action='first')¶ Extract locus from gene call string
-
changeo.Gene.
getVAllele
(gene, action='first')¶ Extract V allele from gene call string
-
changeo.Gene.
getVGermline
(receptor, references, v_field='v_call', amino_acid=False)¶ Extract V allele and germline sequence
- Parameters
receptor (changeo.Receptor.Receptor) – Receptor object
references (dict) – dictionary of germline sequences
v_field (str) – Receptor attribute containing the V allele assignment
amino_acid (bool) – if True then use the amino acid positional fields, otherwise use the nucleotide fields.
- Returns
V allele name, V segment germline sequence
- Return type
-
changeo.Gene.
parseGeneCall
(gene, regex, action='first')¶ Extract alleles from strings
- Parameters
- Returns
String of the allele when action is ‘first’; tuple: Tuple of allele calls for ‘set’ or ‘list’ actions.
- Return type
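The three actions can be sketched with a regular expression. The pattern below is a simplified, hypothetical allele pattern rather than the one the library uses, and sorting the result of the ‘set’ action is an assumption.

```python
import re

# Simplified allele pattern, e.g. IGHV1-69*01 (assumed for illustration)
ALLELE_REGEX = re.compile(r'(?:IG[HKL]|TR[ABDG])[VDJ][A-Z0-9\-/]*\*\d+')


def parse_gene_call_sketch(gene, regex, action='first'):
    """Return the first match, or a tuple of unique or all matches."""
    matches = regex.findall(gene) if gene else []
    if not matches:
        return None
    if action == 'first':
        return matches[0]
    if action == 'set':
        return tuple(sorted(set(matches)))
    if action == 'list':
        return tuple(matches)
    raise ValueError('Unknown action: %s' % action)


call = 'IGHV1-69*02,IGHV1-69*01,IGHV1-69*02'
first = parse_gene_call_sketch(call, ALLELE_REGEX, action='first')
unique = parse_gene_call_sketch(call, ALLELE_REGEX, action='set')
```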
-
changeo.Gene.
stitchRegions
(receptor, v_seq, d_seq, j_seq, amino_acid=False)¶ Assemble full length region encoding
- Parameters
receptor (changeo.Receptor.Receptor) – Receptor object
v_seq (str) – V segment germline sequence as a string
d_seq (str) – D segment germline sequence as a string
j_seq (str) – J segment germline sequence as a string
amino_acid (bool) – if True use amino acid positional fields, otherwise use nucleotide fields.
- Returns
string defining germline regions
- Return type
-
changeo.Gene.
stitchVDJ
(receptor, v_seq, d_seq, j_seq, amino_acid=False)¶ Assemble full length germline sequence
- Parameters
receptor (changeo.Receptor.Receptor) – Receptor object
v_seq (str) – V segment sequence as a string
d_seq (str) – D segment sequence as a string
j_seq (str) – J segment sequence as a string
amino_acid (bool) – if True use X for N/P regions and amino acid positional fields, otherwise use N and nucleotide fields.
- Returns
full germline sequence
- Return type
changeo.IO¶
File I/O and parsers
-
class
changeo.IO.
AIRRReader
(handle)¶ Bases:
changeo.IO.TSVReader
An iterator to read and parse AIRR formatted data.
-
class
changeo.IO.
AIRRWriter
(handle, fields=['sequence_id', 'sequence', 'sequence_alignment', 'germline_alignment', 'rev_comp', 'productive', 'stop_codon', 'vj_in_frame', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_length', 'junction_aa', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end'])¶ Bases:
changeo.IO.TSVWriter
Writes AIRR formatted data.
-
writeReceptor
(records)¶ Writes a row from a Receptor object
- Parameters
records – a changeo.Receptor.Receptor object to write or an iterable of such objects.
- Returns
None
-
-
class
changeo.IO.
ChangeoReader
(handle)¶ Bases:
changeo.IO.TSVReader
An iterator to read and parse Change-O formatted data.
-
class
changeo.IO.
ChangeoWriter
(handle, fields=['SEQUENCE_ID', 'SEQUENCE_INPUT', 'FUNCTIONAL', 'IN_FRAME', 'STOP', 'MUTATED_INVARIANT', 'INDELS', 'LOCUS', 'V_CALL', 'D_CALL', 'J_CALL', 'SEQUENCE_VDJ', 'SEQUENCE_IMGT', 'V_SEQ_START', 'V_SEQ_LENGTH', 'V_GERM_START_VDJ', 'V_GERM_LENGTH_VDJ', 'V_GERM_START_IMGT', 'V_GERM_LENGTH_IMGT', 'NP1_LENGTH', 'D_SEQ_START', 'D_SEQ_LENGTH', 'D_GERM_START', 'D_GERM_LENGTH', 'NP2_LENGTH', 'J_SEQ_START', 'J_SEQ_LENGTH', 'J_GERM_START', 'J_GERM_LENGTH', 'JUNCTION', 'JUNCTION_LENGTH', 'GERMLINE_IMGT'], header=True)¶ Bases:
changeo.IO.TSVWriter
Writes Change-O formatted data.
-
writeReceptor
(records)¶ Writes a row from a Receptor object
- Parameters
records – a changeo.Receptor.Receptor object to write or an iterable of such objects.
- Returns
None
-
-
class
changeo.IO.
IHMMuneReader
(ihmmune, sequences, references, receptor=True)¶ Bases:
object
An iterator to read and parse iHMMune-Align output files.
-
__iter__
()¶ Iterator initializer.
- Returns
changeo.IO.IHMMuneReader
-
__next__
()¶ Next method.
- Returns
parsed iHMMune-Align result as a Receptor (receptor=True) or dictionary (receptor=False).
- Return type
-
static
customFields
(scores=False, regions=False, cell=False, schema=None)¶ Returns non-standard Receptor attributes defined by the parser
- Parameters
scores – if True include alignment scoring fields.
regions – if True include IMGT-gapped CDR and FWR region fields.
schema – schema class to pass field through for conversion. If None, return changeo.Receptor.Receptor attribute names.
- Returns
list of field names.
- Return type
-
ihmmune_fields
= ['SEQUENCE_ID', 'V_CALL', 'D_CALL', 'J_CALL', 'V_SEQ', 'NP1_SEQ', 'D_SEQ', 'NP2_SEQ', 'J_SEQ', 'V_MUT', 'D_MUT', 'J_MUT', 'NX_COUNT', 'J_INFRAME', 'V_SEQ_START', 'STOP_COUNT', 'D_PROB', 'HMM_SCORE', 'RC', 'COMMON_MUT', 'COMMON_NX_COUNT', 'V_SEQ_START', 'V_SEQ_LENGTH', 'A_SCORE']¶
-
-
class
changeo.IO.
IMGTReader
(summary, gapped, ntseq, junction, receptor=True)¶ Bases:
object
An iterator to read and parse IMGT output files.
-
__iter__
()¶ Iterator initializer.
- Returns
changeo.IO.IMGTReader
-
__next__
()¶ Next method.
- Returns
parsed IMGT/HighV-QUEST result as a Receptor (receptor=True) or dictionary (receptor=False).
- Return type
-
static
customFields
(scores=False, regions=False, junction=False, schema=None)¶ Returns non-standard fields defined by the parser
- Parameters
scores – if True include alignment scoring fields.
regions – if True include IMGT-gapped CDR and FWR region fields.
junction – if True include detailed junction annotation fields.
schema – schema class to pass field through for conversion. If None, return changeo.Receptor.Receptor attribute names.
- Returns
list of field names.
- Return type
-
parseRecord
(summary, gapped, ntseq, junction)¶ Parses a single row from each IMGT file.
- Parameters
summary – dictionary containing one row of the ‘1_Summary’ file.
gapped – dictionary containing one row of the ‘2_IMGT-gapped-nt-sequences’ file.
ntseq – dictionary containing one row of the ‘3_Nt-sequences’ file.
junction – dictionary containing one row of the ‘6_Junction’ file.
- Returns
database entry for the row.
- Return type
-
-
class
changeo.IO.
IgBLASTReader
(igblast, sequences, references, asis_calls=False, regions='default', receptor=True, infer_junction=False)¶ Bases:
object
An iterator to read and parse IgBLAST output files
-
__iter__
()¶ Iterator initializer.
- Returns
changeo.IO.IgBLASTReader
-
__next__
()¶ Next method.
- Returns
parsed IgBLAST result as a Receptor (receptor=True) or dictionary (receptor=False).
- Return type
-
static
customFields
(schema=None)¶ Returns non-standard fields defined by the parser
- Parameters
schema – schema class to pass field through for conversion. If None, return changeo.Receptor.Receptor attribute names.
- Returns
list of field names.
- Return type
-
parseBlock
(block)¶ Parses an IgBLAST result into separate sections
- Parameters
block (iter) – an iterator from itertools.groupby containing a single IgBLAST result.
- Returns
- a parsed results block;
with the keys ‘query’ (sequence identifier as a string), ‘summary’ (dictionary of the alignment summary), ‘subregion’ (dictionary of IgBLAST CDR3 sequences), and ‘hits’ (VDJ hit table as a list of dictionaries). Returns None if the block has no data that can be parsed.
- Return type
-
-
class
changeo.IO.
IgBLASTReaderAA
(igblast, sequences, references, asis_calls=False, regions='default', receptor=True, infer_junction=False)¶ Bases:
changeo.IO.IgBLASTReader
An iterator to read and parse IgBLAST amino acid alignment output files
-
static
customFields
(schema=None)¶ Returns non-standard fields defined by the parser
- Parameters
schema – schema class to pass field through for conversion. If None, return changeo.Receptor.Receptor attribute names.
- Returns
list of field names.
- Return type
-
static
-
class
changeo.IO.
TSVReader
(handle)¶ Bases:
object
Simple csv.DictReader wrapper to read format agnostic TSV files.
-
reader
¶ reader object.
- Type
iter
-
__iter__
()¶ Iterator initializer
- Returns
changeo.IO.TSVReader
-
__next__
()¶ Next method
- Returns
row as a dictionary of field:value pairs.
- Return type
dict
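A csv.DictReader wrapper of this shape can be sketched as below. The class name TSVReaderSketch and the fields attribute are assumptions; only the standard csv module is used.

```python
import csv
import io


class TSVReaderSketch:
    """Iterate over rows of a tab-delimited handle as dictionaries."""

    def __init__(self, handle):
        self.handle = handle
        self.reader = csv.DictReader(self.handle, dialect='excel-tab')
        self.fields = self.reader.fieldnames

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the underlying reader ends the iteration
        return dict(next(self.reader))


handle = io.StringIO('sequence_id\tv_call\nseq1\tIGHV1-69*01\n')
rows = list(TSVReaderSketch(handle))
```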
-
-
class
changeo.IO.
TSVWriter
(handle, fields, header=True)¶ Bases:
object
Simple csv.DictWriter wrapper to write format agnostic TSV files.
-
writeDict
(records)¶ Writes a row from a dictionary
- Parameters
records – dictionary of row data or an iterable of such objects.
- Returns
None
-
writeHeader
()¶ Writes the header
- Returns
None
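The writer side can be sketched the same way with csv.DictWriter. The class name TSVWriterSketch, the decision to ignore keys outside the field list, and the fixed line terminator are assumptions made for illustration.

```python
import csv
import io


class TSVWriterSketch:
    """Write dictionaries to a tab-delimited handle (illustrative)."""

    def __init__(self, handle, fields, header=True):
        self.writer = csv.DictWriter(handle, fieldnames=fields,
                                     dialect='excel-tab',
                                     extrasaction='ignore',
                                     lineterminator='\n')
        if header:
            self.writeHeader()

    def writeHeader(self):
        self.writer.writeheader()

    def writeDict(self, records):
        # Accept either a single dictionary or an iterable of dictionaries
        if isinstance(records, dict):
            self.writer.writerow(records)
        else:
            self.writer.writerows(records)


buffer = io.StringIO()
writer = TSVWriterSketch(buffer, ['sequence_id', 'v_call'])
writer.writeDict({'sequence_id': 's1', 'v_call': 'IGHV1', 'extra': 'x'})
writer.writeDict([{'sequence_id': 's2', 'v_call': 'IGHV2'}])
```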
-
-
changeo.IO.
checkFields
(attributes, header, schema=<class 'changeo.Receptor.AIRRSchema'>)¶ Checks that a file header contains a required set of Receptor attributes
- Parameters
- Returns
True if all attributes mapping fields are found.
- Return type
- Raises
-
changeo.IO.
countDbFile
(file)¶ Counts the records in database files
- Parameters
file – tab-delimited database file.
- Returns
count of records in the database file.
- Return type
-
changeo.IO.
extractIMGT
(imgt_output)¶ Extract necessary files from IMGT/HighV-QUEST results.
- Parameters
imgt_output – zipped file or unzipped folder output by IMGT/HighV-QUEST.
- Returns
(temporary directory handle, dictionary with names of extracted IMGT files).
- Return type
-
changeo.IO.
getDbFields
(file, add=None, exclude=None, reader=<class 'changeo.IO.TSVReader'>)¶ Get field names from a db file
- Parameters
file – db file to pull base fields from.
add – fields to append to the field set.
exclude – fields to exclude from the field set.
reader – reader class.
- Returns
list of field names
- Return type
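The add and exclude handling can be sketched as follows. For a self-contained example the function takes an open handle rather than a file name, which is a departure from the documented signature; the name get_db_fields_sketch is hypothetical.

```python
import csv
import io


def get_db_fields_sketch(handle, add=None, exclude=None):
    """Read header fields, append new ones, then drop excluded ones."""
    fields = list(csv.DictReader(handle, dialect='excel-tab').fieldnames)
    if add is not None:
        add = [add] if isinstance(add, str) else add
        for name in add:
            if name not in fields:
                fields.append(name)
    if exclude is not None:
        exclude = [exclude] if isinstance(exclude, str) else exclude
        fields = [name for name in fields if name not in exclude]
    return fields


handle = io.StringIO('sequence_id\tv_call\tj_call\nseq1\tIGHV1\tIGHJ4\n')
fields = get_db_fields_sketch(handle, add=['clone_id', 'v_call'],
                              exclude='j_call')
```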
-
changeo.IO.
getFormatOperators
(format)¶ Simple wrapper for fetching the set of operator classes for a data format
-
changeo.IO.
getOutputHandle
(file, out_label=None, out_dir=None, out_name=None, out_type=None)¶ Opens an output file handle
- Parameters
file – filename to base output file name on.
out_label – text to be inserted before the file extension; if None do not add a label.
out_type – the file extension of the output file; if None use input file extension.
out_dir – the output directory; if None use directory of input file
out_name – the short filename to use for the output file; if None use input file short name.
- Returns
File handle
- Return type
file
-
changeo.IO.
getOutputName
(file, out_label=None, out_dir=None, out_name=None, out_type=None)¶ Creates an output filename from an existing filename
- Parameters
file – filename to base output file name on.
out_label – text to be inserted before the file extension; if None do not add a label.
out_type – the file extension of the output file; if None use input file extension.
out_dir – the output directory; if None use directory of input file
out_name – the short filename to use for the output file; if None use input file short name.
- Returns
file name.
- Return type
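How the parameters above combine into a file name can be sketched with os.path. Joining the label with an underscore is an assumption, and this sketch does not create the output directory. The name get_output_name_sketch is hypothetical.

```python
import os


def get_output_name_sketch(file, out_label=None, out_dir=None,
                           out_name=None, out_type=None):
    """Derive an output file name from an input file name (illustrative)."""
    directory, file_name = os.path.split(file)
    short_name, extension = os.path.splitext(file_name)

    # Fall back to the input file's components when arguments are None
    if out_dir is None:
        out_dir = directory
    if out_name is None:
        out_name = short_name
    if out_label is not None:
        out_name = '%s_%s' % (out_name, out_label)   # assumed separator
    if out_type is None:
        out_type = extension.lstrip('.')

    return os.path.join(out_dir, '%s.%s' % (out_name, out_type))


name = get_output_name_sketch(os.path.join('data', 'sample.tsv'),
                              out_label='clone-pass')
```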
-
changeo.IO.
readGermlines
(references, asis=False, warn=False)¶ Parses germline repositories
- Parameters
- Returns
Dictionary of germlines in the form {allele: sequence}.
- Return type
-
changeo.IO.
splitName
(file)¶ Extract the extension from a file name
changeo.Multiprocessing¶
Multiprocessing
-
class
changeo.Multiprocessing.
DbData
(key, records)¶ Bases:
object
A class defining data objects for worker processes
-
id
¶ result identifier
-
data
¶ list of data records
-
valid
¶ True if preprocessing was successful and data should be processed
-
-
class
changeo.Multiprocessing.
DbResult
(key, records)¶ Bases:
object
A class defining result objects for collector processes
-
id
¶ result identifier
-
data
¶ list of original data records
-
results
¶ list of processed records
-
data_pass
¶ list of records that pass filtering for workers that split data before processing
-
data_fail
¶ list of records that failed filtering for workers that split data before processing
-
valid
¶ True if processing was successful and results should be written
-
log
¶ OrderedDict of log items
-
property
data_count
¶
-
-
changeo.Multiprocessing.
collectDbQueue
(alive, result_queue, collect_queue, db_file, label, fields, writer=<class 'changeo.IO.AIRRWriter'>, out_file=None, out_args={'failed': False, 'log_file': None, 'out_dir': None, 'out_name': None, 'out_type': 'tsv'})¶ Pulls from results queue, assembles results and manages log and file IO
- Parameters
alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns.
result_queue – multiprocessing.Queue holding worker results.
collect_queue – multiprocessing.Queue to store collector return values.
db_file – database file name.
label – task label used to tag the output files.
fields – list of output fields.
writer – writer class.
out_file – output file name. Automatically generated from the input file if None.
out_args – common output argument dictionary from parseCommonArgs.
- Returns
- Adds a dictionary with key-value pairs to collect_queue containing
‘log’ defining a log object along with the ‘pass’ and ‘fail’ output file names.
- Return type
-
changeo.Multiprocessing.
feedDbQueue
(alive, data_queue, db_file, reader=<class 'changeo.IO.AIRRReader'>, group_func=None, group_args={})¶ Feeds the data queue with Ig records
- Parameters
alive – multiprocessing.Value boolean controlling whether processing continues; if False exit process
data_queue – multiprocessing.Queue to hold data for processing
db_file – database file
reader – database reader class
group_func – function to use for grouping records
group_args – dictionary of arguments to pass to group_func
- Returns
None
-
changeo.Multiprocessing.
processDbQueue
(alive, data_queue, result_queue, process_func, process_args={}, filter_func=None, filter_args={})¶ Pulls from data queue, performs calculations, and feeds results queue
- Parameters
alive – multiprocessing.Value boolean controlling whether processing continues; when False function returns
data_queue – multiprocessing.Queue holding data to process
result_queue – multiprocessing.Queue to hold processed results
process_func – function to use for processing sequences
process_args – dictionary of arguments to pass to process_func
filter_func – function to use for filtering sequences before processing
filter_args – dictionary of arguments to pass to filter_func
- Returns
None
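The feeder, worker, and collector roles described above form a queue pipeline. The sketch below shows the pattern with threads and queue.Queue as a lightweight stand-in for multiprocessing; it omits the alive flag, grouping, filtering, and file output, and all names are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def feed(data_queue, records, n_workers):
    """Put every record on the data queue, then one sentinel per worker."""
    for record in records:
        data_queue.put(record)
    for _ in range(n_workers):
        data_queue.put(_DONE)


def work(data_queue, result_queue, process_func):
    """Process records until a sentinel arrives, then pass it on."""
    while True:
        record = data_queue.get()
        if record is _DONE:
            result_queue.put(_DONE)
            break
        result_queue.put(process_func(record))


def collect(result_queue, n_workers):
    """Gather results until every worker has signalled completion."""
    results, finished = [], 0
    while finished < n_workers:
        item = result_queue.get()
        if item is _DONE:
            finished += 1
        else:
            results.append(item)
    return results


def run_pipeline(records, process_func, n_workers=2):
    data_queue, result_queue = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=work,
                                args=(data_queue, result_queue, process_func))
               for _ in range(n_workers)]
    for worker in workers:
        worker.start()
    feed(data_queue, records, n_workers)
    results = collect(result_queue, n_workers)
    for worker in workers:
        worker.join()
    return results


lengths = run_pipeline(['ACGT', 'AC', 'ACGTAC'], len)
```

Results arrive in completion order, so callers that need input order should carry an identifier with each record.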
changeo.Receptor¶
Receptor data structure
-
class
changeo.Receptor.
AIRRSchema
¶ Bases:
object
AIRR format to Receptor mappings
-
fields
= ['sequence_id', 'sequence', 'sequence_alignment', 'germline_alignment', 'sequence_aa', 'sequence_aa_alignment', 'germline_aa_alignment', 'rev_comp', 'productive', 'stop_codon', 'vj_in_frame', 'v_frameshift', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_start', 'junction_end', 'junction_length', 'junction_aa', 'junction_aa_length', 'np1_length', 'np2_length', 'np1_aa_length', 'np2_aa_length', 'v_sequence_start', 'v_sequence_end', 'v_sequence_length', 'v_germline_start', 'v_germline_end', 'v_germline_length', 'v_sequence_aa_start', 'v_sequence_aa_end', 'v_sequence_aa_length', 'v_germline_aa_start', 'v_germline_aa_end', 'v_germline_aa_length', 'd_sequence_start', 'd_sequence_end', 'd_sequence_length', 'd_germline_start', 'd_germline_end', 'd_germline_length', 'd_sequence_aa_start', 'd_sequence_aa_end', 'd_sequence_aa_length', 'd_germline_aa_start', 'd_germline_aa_end', 'd_germline_aa_length', 'j_sequence_start', 'j_sequence_end', 'j_sequence_length', 'j_germline_start', 'j_germline_end', 'j_germline_length', 'j_sequence_aa_start', 'j_sequence_aa_end', 'j_sequence_aa_length', 'j_germline_aa_start', 'j_germline_aa_end', 'j_germline_aa_length', 'germline_alignment_d_mask', 'v_score', 'v_identity', 'v_support', 'v_cigar', 'd_score', 'd_identity', 'd_support', 'd_cigar', 'j_score', 'j_identity', 'j_support', 'j_cigar', 'vdj_score', 'cdr1', 'cdr2', 'cdr3', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_start', 'cdr1_end', 'cdr2_start', 'cdr2_end', 'cdr3_start', 'cdr3_end', 'fwr1_start', 'fwr1_end', 'fwr2_start', 'fwr2_end', 'fwr3_start', 'fwr3_end', 'fwr4_start', 'fwr4_end', 'n1_length', 'n2_length', 'p3v_length', 'p5d_length', 'p3d_length', 'p5j_length', 'd_frame', 'cdr3_igblast', 'cdr3_igblast_aa', 'duplicate_count', 'consensus_count', 'umi_count', 'clone_id', 'cell_id']¶
-
static
fromReceptor
(field)¶ Returns an AIRR column name from a Receptor attribute name
- Parameters
field – Receptor attribute name.
- Returns
AIRR column name.
- Return type
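A schema class of this kind is essentially a two-way name mapping. The sketch below uses a handful of illustrative column pairs, and the lower-case fallback for unmapped names is an assumption. SchemaSketch is hypothetical, not the library's class.

```python
class SchemaSketch:
    """Two-way mapping between column names and attribute names."""

    # Column name -> Receptor attribute name (illustrative subset)
    _schema_map = {'sequence_id': 'sequence_id',
                   'sequence_alignment': 'sequence_imgt',
                   'productive': 'functional',
                   'clone_id': 'clone'}
    # Receptor attribute name -> column name
    _receptor_map = {v: k for k, v in _schema_map.items()}

    @staticmethod
    def toReceptor(field):
        return SchemaSketch._schema_map.get(field, field.lower())

    @staticmethod
    def fromReceptor(field):
        return SchemaSketch._receptor_map.get(field, field.lower())


column = SchemaSketch.fromReceptor('functional')
```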
-
out_type
= 'tsv'¶
-
required
= ['sequence_id', 'sequence', 'sequence_alignment', 'germline_alignment', 'rev_comp', 'productive', 'stop_codon', 'vj_in_frame', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_length', 'junction_aa', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end']¶
-
-
class
changeo.Receptor.
AIRRSchemaAA
¶ Bases:
changeo.Receptor.AIRRSchema
AIRR format to Receptor amino acid mappings
-
required
= ['sequence_id', 'sequence', 'sequence_alignment', 'germline_alignment', 'sequence_aa', 'sequence_aa_alignment', 'germline_aa_alignment', 'rev_comp', 'productive', 'stop_codon', 'locus', 'v_call', 'd_call', 'j_call', 'c_call', 'junction', 'junction_length', 'junction_aa', 'v_sequence_aa_start', 'v_sequence_aa_end', 'v_germline_aa_start', 'v_germline_aa_end']¶
-
-
class
changeo.Receptor.
ChangeoSchema
¶ Bases:
object
Change-O to Receptor mappings
-
fields
= ['SEQUENCE_ID', 'SEQUENCE_INPUT', 'SEQUENCE_AA_INPUT', 'FUNCTIONAL', 'IN_FRAME', 'STOP', 'MUTATED_INVARIANT', 'INDELS', 'V_FRAMESHIFT', 'LOCUS', 'V_CALL', 'D_CALL', 'J_CALL', 'C_CALL', 'SEQUENCE_VDJ', 'SEQUENCE_IMGT', 'SEQUENCE_AA_VDJ', 'SEQUENCE_AA_IMGT', 'V_SEQ_START', 'V_SEQ_LENGTH', 'V_GERM_START_VDJ', 'V_GERM_LENGTH_VDJ', 'V_GERM_START_IMGT', 'V_GERM_LENGTH_IMGT', 'V_SEQ_AA_START', 'V_SEQ_AA_LENGTH', 'V_GERM_AA_START_VDJ', 'V_GERM_AA_LENGTH_VDJ', 'V_GERM_AA_START_IMGT', 'V_GERM_AA_LENGTH_IMGT', 'NP1_LENGTH', 'NP1_AA_LENGTH', 'D_SEQ_START', 'D_SEQ_LENGTH', 'D_GERM_START', 'D_GERM_LENGTH', 'D_SEQ_AA_START', 'D_SEQ_AA_LENGTH', 'D_GERM_AA_START', 'D_GERM_AA_LENGTH', 'NP2_LENGTH', 'NP2_AA_LENGTH', 'J_SEQ_START', 'J_SEQ_LENGTH', 'J_GERM_START', 'J_GERM_LENGTH', 'J_SEQ_AA_START', 'J_SEQ_AA_LENGTH', 'J_GERM_AA_START', 'J_GERM_AA_LENGTH', 'JUNCTION', 'JUNCTION_LENGTH', 'GERMLINE_IMGT', 'GERMLINE_AA_IMGT', 'JUNCTION_START', 'V_SCORE', 'V_IDENTITY', 'V_EVALUE', 'V_BTOP', 'V_CIGAR', 'D_SCORE', 'D_IDENTITY', 'D_EVALUE', 'D_BTOP', 'D_CIGAR', 'J_SCORE', 'J_IDENTITY', 'J_EVALUE', 'J_BTOP', 'J_CIGAR', 'VDJ_SCORE', 'FWR1_IMGT', 'FWR2_IMGT', 'FWR3_IMGT', 'FWR4_IMGT', 'CDR1_IMGT', 'CDR2_IMGT', 'CDR3_IMGT', 'FWR1_AA_IMGT', 'FWR2_AA_IMGT', 'FWR3_AA_IMGT', 'FWR4_AA_IMGT', 'CDR1_AA_IMGT', 'CDR2_AA_IMGT', 'CDR3_AA_IMGT', 'N1_LENGTH', 'N2_LENGTH', 'P3V_LENGTH', 'P5D_LENGTH', 'P3D_LENGTH', 'P5J_LENGTH', 'D_FRAME', 'CDR3_IGBLAST', 'CDR3_IGBLAST_AA', 'CONSCOUNT', 'DUPCOUNT', 'UMICOUNT', 'CLONE', 'CELL']¶
-
static
fromReceptor
(field)¶ Returns a Change-O column name from a Receptor attribute name
- Parameters
field – Receptor attribute name.
- Returns
Change-O column name.
- Return type
-
out_type
= 'tab'¶
-
required
= ['SEQUENCE_ID', 'SEQUENCE_INPUT', 'FUNCTIONAL', 'IN_FRAME', 'STOP', 'MUTATED_INVARIANT', 'INDELS', 'LOCUS', 'V_CALL', 'D_CALL', 'J_CALL', 'SEQUENCE_VDJ', 'SEQUENCE_IMGT', 'V_SEQ_START', 'V_SEQ_LENGTH', 'V_GERM_START_VDJ', 'V_GERM_LENGTH_VDJ', 'V_GERM_START_IMGT', 'V_GERM_LENGTH_IMGT', 'NP1_LENGTH', 'D_SEQ_START', 'D_SEQ_LENGTH', 'D_GERM_START', 'D_GERM_LENGTH', 'NP2_LENGTH', 'J_SEQ_START', 'J_SEQ_LENGTH', 'J_GERM_START', 'J_GERM_LENGTH', 'JUNCTION', 'JUNCTION_LENGTH', 'GERMLINE_IMGT']¶
-
-
class
changeo.Receptor.
ChangeoSchemaAA
¶ Bases:
changeo.Receptor.ChangeoSchema
Change-O to Receptor amino acid mappings
-
required
= ['SEQUENCE_ID', 'SEQUENCE_AA_INPUT', 'STOP', 'INDELS', 'LOCUS', 'V_CALL', 'SEQUENCE_AA_VDJ', 'SEQUENCE_AA_IMGT', 'V_SEQ_AA_START', 'V_SEQ_AA_LENGTH', 'V_GERM_AA_START_VDJ', 'V_GERM_AA_LENGTH_VDJ', 'V_GERM_AA_START_IMGT', 'V_GERM_AA_LENGTH_IMGT', 'GERMLINE_AA_IMGT']¶
-
-
class
changeo.Receptor.
Receptor
(data)¶ Bases:
object
A class defining a V(D)J sequence and its annotations
-
property
d_germ_aa_end
¶ Position of the last amino acid in the D germline amino acid alignment
-
property
d_germ_end
¶ Position of the last nucleotide in the D germline sequence alignment
-
property
d_seq_aa_end
¶ Position of the last D amino acid in the input amino acid sequence
-
property
d_seq_end
¶ Position of the last D nucleotide in the input sequence
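End-position properties like these can be derived from a 1-based start and a length. The sketch below assumes that convention and returns None when either value is missing; the class name SegmentSketch is hypothetical.

```python
class SegmentSketch:
    """Holds a 1-based start and a length for the D segment (illustrative)."""

    def __init__(self, d_seq_start, d_seq_length):
        self.d_seq_start = d_seq_start
        self.d_seq_length = d_seq_length

    @property
    def d_seq_end(self):
        # Inclusive end position: start + length - 1
        if self.d_seq_start is None or self.d_seq_length is None:
            return None
        return self.d_seq_start + self.d_seq_length - 1


end = SegmentSketch(301, 12).d_seq_end
```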
-
getAIRR
(field, seq=False)¶ Get an attribute from an AIRR field name
- Parameters
field – AIRR column name as a string
seq – if True return the attribute as a Seq object
- Returns
Value in the AIRR field. Returns None if the field cannot be found.
-
getAlleleCalls
(calls, action='first')¶ Get multiple allele calls
- Parameters
calls – iterable of calls to get; one or more of (‘v’, ‘d’, ‘j’)
action – One of (‘first’, ‘set’)
- Returns
List of requested calls in order
- Return type
-
getAlleleNumbers
(calls, action='first')¶ Get multiple allele numeric identifiers
- Parameters
calls – iterable of calls to get; one or more of (‘v’, ‘d’, ‘j’)
action – One of (‘first’, ‘set’)
- Returns
List of requested calls in order
- Return type
-
getChangeo
(field, seq=False)¶ Get an attribute from a Change-O field name
- Parameters
field – Change-O column name as a string
seq – if True return the attribute as a Seq object
- Returns
Value in the Change-O field. Returns None if the field cannot be found.
-
getDAllele
(action='first', field=None)¶ D segment allele getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the D call. Use d_call attribute if None.
- Returns
String of the allele when action is ‘first’; tuple : Tuple of allele calls for ‘set’ or ‘list’ actions.
- Return type
-
getDAlleleNumber
(action='first', field=None)¶ D segment allele number getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the D call. Use d_call attribute if None.
- Returns
String of the allele when action is ‘first’; tuple : Tuple of allele numbers for ‘set’ or ‘list’ actions.
- Return type
-
getDFamily
(action='first', field=None)¶ D segment family getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the D call. Use d_call attribute if None.
- Returns
String of the allele when action is ‘first’; tuple : Tuple of allele calls for ‘set’ or ‘list’ actions.
- Return type
-
getDGene
(action='first', field=None)¶ D segment gene getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the D call. Use d_call attribute if None.
- Returns
String of the allele when action is ‘first’; tuple : Tuple of allele calls for ‘set’ or ‘list’ actions.
- Return type
-
getFamilyCalls
(calls, action='first')¶ Get multiple family calls
- Parameters
calls – iterable of calls to get; one or more of (‘v’,’d’,’j’)
action – One of (‘first’,’set’)
- Returns
List of requested calls in order
- Return type
-
getField
(field)¶ Get an attribute or annotation value
- Parameters
field – attribute name as a string
- Returns
Value in the attribute. Returns None if the attribute cannot be found.
-
getGeneCalls
(calls, action='first')¶ Get multiple gene calls
- Parameters
calls – iterable of calls to get; one or more of (‘v’,’d’,’j’)
action – One of (‘first’,’set’)
- Returns
List of requested calls in order
- Return type
-
getJAllele
(action='first', field=None)¶ J segment allele getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the J call. Use j_call attribute if None.
- Returns
String of the allele when action is ‘first’; tuple of allele calls for ‘set’ or ‘list’ actions.
- Return type
-
getJAlleleNumber
(action='first', field=None)¶ J segment allele number getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the J call. Use j_call attribute if None.
- Returns
String of the allele number when action is ‘first’; tuple of allele numbers for ‘set’ or ‘list’ actions.
- Return type
-
getJFamily
(action='first', field=None)¶ J segment family getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the J call. Use j_call attribute if None.
- Returns
String of the family when action is ‘first’; tuple of family calls for ‘set’ or ‘list’ actions.
- Return type
-
getJGene
(action='first', field=None)¶ J segment gene getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the J call. Use j_call attribute if None.
- Returns
String of the gene when action is ‘first’; tuple of gene calls for ‘set’ or ‘list’ actions.
- Return type
-
getSeq
(field)¶ Get an attribute value converted to a Seq object
- Parameters
field – variable name as a string
- Returns
Value in the field as a Seq object
- Return type
Bio.Seq.Seq
-
getVAllele
(action='first', field=None)¶ V segment allele getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the V call. Use v_call attribute if None.
- Returns
String of the allele when action is ‘first’; tuple of allele calls for ‘set’ or ‘list’ actions.
- Return type
-
getVAlleleNumber
(action='first', field=None)¶ V segment allele number getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the V call. Use v_call attribute if None.
- Returns
String of the allele number when action is ‘first’; tuple of allele numbers for ‘set’ or ‘list’ actions.
- Return type
-
getVFamily
(action='first', field=None)¶ V segment family getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the V call. Use v_call attribute if None.
- Returns
String of the family when action is ‘first’; tuple of family calls for ‘set’ or ‘list’ actions.
- Return type
-
getVGene
(action='first', field=None)¶ V segment gene getter
- Parameters
action – One of ‘first’, ‘set’ or ‘list’
field – attribute or annotation name containing the V call. Use v_call attribute if None.
- Returns
String of the gene when action is ‘first’; tuple of gene calls for ‘set’ or ‘list’ actions.
- Return type
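The difference between the ‘first’, ‘set’ and ‘list’ actions is easiest to see on a call string holding several comma-delimited assignments. The following is a minimal standalone sketch, not the Receptor implementation: the regular expression and the get_calls helper are hypothetical and only cover simple IMGT-style names.

```python
import re

# Hypothetical pattern for IMGT-style names such as "IGHV1-69*01".
# It is an illustration only, not the expression Change-O uses internally.
ALLELE = re.compile(r"(IG[HKL]|TR[ABDG])([VDJ])(\d+)[-/\w]*\*\d+")


def get_calls(call, level="allele", action="first"):
    """Extract allele, gene or family names from a comma-delimited call string."""
    found = []
    for match in ALLELE.finditer(call):
        allele = match.group(0)
        gene = allele.split("*")[0]
        family = match.group(1) + match.group(2) + match.group(3)
        found.append({"allele": allele, "gene": gene, "family": family}[level])
    if action == "first":
        # Only the first assignment, as a string (None when nothing matched)
        return found[0] if found else None
    if action == "set":
        # Unique values, keeping the order of first appearance
        return tuple(dict.fromkeys(found))
    if action == "list":
        # Every value, duplicates included
        return tuple(found)
    raise ValueError("action must be 'first', 'set' or 'list'")


call = "IGHV1-69*01,IGHV1-69*02"
first_allele = get_calls(call, "allele", "first")
gene_set = get_calls(call, "gene", "set")
gene_list = get_calls(call, "gene", "list")
```

Here both alleles belong to the same gene, so the ‘set’ action collapses them to a single gene name while the ‘list’ action keeps one entry per allele.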
-
property
j_germ_aa_end
¶ Position of the last amino acid in the J germline amino acid alignment
-
property
j_germ_end
¶ Position of the last nucleotide in the J germline sequence alignment
-
property
j_seq_aa_end
¶ Position of the last J amino acid in the input amino acid sequence
-
property
j_seq_end
¶ Position of the last J nucleotide in the input sequence
-
property
junction_end
¶ Position of the last junction nucleotide in the input sequence
-
setDict
(data, parse=False)¶ Adds or updates multiple attributes and annotations
- Parameters
data – a dictionary of annotations to add or update.
parse – if True pass values through string parsing functions for known fields.
- Returns
updates attribute values and the annotations attribute.
- Return type
-
setField
(field, value, parse=False)¶ Set an attribute or annotation value
- Parameters
field – attribute name as a string
value – value to assign
parse – if True pass values through string parsing functions for known fields.
- Returns
None. Updates attribute or annotation.
-
toDict
()¶ Convert the namespace to a dictionary
- Returns
member fields with values converted to appropriate strings
- Return type
-
property
v_germ_aa_end_imgt
¶ Position of the last amino acid in the IMGT-gapped V germline amino acid alignment
-
property
v_germ_aa_end_vdj
¶ Position of the last amino acid in the ungapped V germline amino acid alignment
-
property
v_germ_end_imgt
¶ Position of the last nucleotide in the IMGT-gapped V germline sequence alignment
-
property
v_germ_end_vdj
¶ Position of the last nucleotide in the ungapped V germline sequence alignment
-
property
v_seq_aa_end
¶ Position of the last V amino acid in the input amino acid sequence
-
property
v_seq_end
¶ Position of the last V nucleotide in the input sequence
-
class
changeo.Receptor.
ReceptorData
¶ Bases:
object
A class containing type conversion methods for Receptor data attributes
-
rev_comp
¶ whether the alignment is relative to the reverse complement of the input sequence.
- Type
-
sequence_input
¶ input nucleotide sequence.
- Type
Bio.Seq.Seq
-
sequence_vdj
¶ Aligned V(D)J nucleotide sequence without IMGT-gaps.
- Type
Bio.Seq.Seq
-
sequence_imgt
¶ IMGT-gapped V(D)J nucleotide sequence.
- Type
Bio.Seq.Seq
-
sequence_aa_input
¶ input amino acid sequence.
- Type
Bio.Seq.Seq
-
sequence_aa_vdj
¶ Aligned V(D)J amino acid sequence without IMGT-gaps.
- Type
Bio.Seq.Seq
-
sequence_aa_imgt
¶ IMGT-gapped V(D)J amino acid sequence.
- Type
Bio.Seq.Seq
-
junction
¶ ungapped junction region nucleotide sequence.
- Type
Bio.Seq.Seq
-
junction_aa
¶ ungapped junction region amino acid sequence.
- Type
Bio.Seq.Seq
-
germline_vdj
¶ full ungapped germline V(D)J nucleotide sequence.
- Type
Bio.Seq.Seq
-
germline_vdj_d_mask
¶ ungapped germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions.
- Type
Bio.Seq.Seq
-
germline_imgt
¶ full IMGT-gapped germline V(D)J nucleotide sequence.
- Type
Bio.Seq.Seq
-
germline_imgt_d_mask
¶ IMGT-gapped germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions.
- Type
Bio.Seq.Seq
-
germline_aa_vdj
¶ full ungapped germline V(D)J amino acid sequence.
- Type
Bio.Seq.Seq
-
germline_aa_imgt
¶ full IMGT-gapped germline V(D)J amino acid sequence.
- Type
Bio.Seq.Seq
-
v_germ_start_imgt
¶ position of the first V nucleotide in IMGT-gapped V germline sequence alignment (1-based).
- Type
-
v_germ_start_vdj
¶ position of the first nucleotide in ungapped V germline sequence alignment (1-based).
- Type
-
v_seq_aa_start
¶ position of the first V amino acid in the amino acid input sequence (1-based).
- Type
-
v_germ_aa_start_imgt
¶ position of the first V amino acid in IMGT-gapped V germline amino acid alignment (1-based).
- Type
-
v_germ_aa_start_vdj
¶ position of the first amino acid in ungapped V germline amino acid alignment (1-based).
- Type
-
np1_start
¶ position of the first untemplated nucleotide between the V and D segments in the input sequence (1-based).
- Type
-
np1_aa_start
¶ position of the first untemplated amino acid between the V and D segments in the input amino acid sequence (1-based).
- Type
-
d_seq_aa_start
¶ position of the first D amino acid in the input amino acid sequence (1-based).
- Type
-
d_germ_aa_start
¶ position of the first amino acid in D germline amino acid alignment (1-based).
- Type
-
np2_start
¶ position of the first untemplated nucleotide between the D and J segments in the input sequence (1-based).
- Type
-
np2_aa_start
¶ position of the first untemplated amino acid between the D and J segments in the input amino acid sequence (1-based).
- Type
-
j_seq_aa_start
¶ position of the first J amino acid in the input amino acid sequence (1-based).
- Type
-
j_germ_aa_start
¶ position of the first amino acid in J germline amino acid alignment (1-based).
- Type
-
fwr1_imgt
¶ IMGT-gapped FWR1 nucleotide sequence.
- Type
Bio.Seq.Seq
-
fwr2_imgt
¶ IMGT-gapped FWR2 nucleotide sequence.
- Type
Bio.Seq.Seq
-
fwr3_imgt
¶ IMGT-gapped FWR3 nucleotide sequence.
- Type
Bio.Seq.Seq
-
fwr4_imgt
¶ IMGT-gapped FWR4 nucleotide sequence.
- Type
Bio.Seq.Seq
-
cdr1_imgt
¶ IMGT-gapped CDR1 nucleotide sequence.
- Type
Bio.Seq.Seq
-
cdr2_imgt
¶ IMGT-gapped CDR2 nucleotide sequence.
- Type
Bio.Seq.Seq
-
cdr3_imgt
¶ IMGT-gapped CDR3 nucleotide sequence.
- Type
Bio.Seq.Seq
-
cdr3_igblast
¶ CDR3 nucleotide sequence assigned by IgBLAST.
- Type
Bio.Seq.Seq
-
fwr1_aa_imgt
¶ IMGT-gapped FWR1 amino acid sequence.
- Type
Bio.Seq.Seq
-
fwr2_aa_imgt
¶ IMGT-gapped FWR2 amino acid sequence.
- Type
Bio.Seq.Seq
-
fwr3_aa_imgt
¶ IMGT-gapped FWR3 amino acid sequence.
- Type
Bio.Seq.Seq
-
fwr4_aa_imgt
¶ IMGT-gapped FWR4 amino acid sequence.
- Type
Bio.Seq.Seq
-
cdr1_aa_imgt
¶ IMGT-gapped CDR1 amino acid sequence.
- Type
Bio.Seq.Seq
-
cdr2_aa_imgt
¶ IMGT-gapped CDR2 amino acid sequence.
- Type
Bio.Seq.Seq
-
cdr3_aa_imgt
¶ IMGT-gapped CDR3 amino acid sequence.
- Type
Bio.Seq.Seq
-
cdr3_igblast_aa
¶ CDR3 amino acid sequence assigned by IgBLAST.
- Type
Bio.Seq.Seq
-
static
aminoacid
(v, deparse=False)¶
-
static
double
(v, deparse=False)¶
-
end_fields
= {'cdr1_end': ('cdr1_start', 'cdr1_length'), 'cdr2_end': ('cdr2_start', 'cdr2_length'), 'cdr3_end': ('cdr3_start', 'cdr3_length'), 'd_germ_aa_end': ('d_germ_aa_start', 'd_germ_aa_length'), 'd_germ_end': ('d_germ_start', 'd_germ_length'), 'd_seq_aa_end': ('d_seq_aa_start', 'd_seq_aa_length'), 'd_seq_end': ('d_seq_start', 'd_seq_length'), 'fwr1_end': ('fwr1_start', 'fwr1_length'), 'fwr2_end': ('fwr2_start', 'fwr2_length'), 'fwr3_end': ('fwr3_start', 'fwr3_length'), 'fwr4_end': ('fwr4_start', 'fwr4_length'), 'j_germ_aa_end': ('j_germ_aa_start', 'j_germ_aa_length'), 'j_germ_end': ('j_germ_start', 'j_germ_length'), 'j_seq_aa_end': ('j_seq_aa_start', 'j_seq_aa_length'), 'j_seq_end': ('j_seq_start', 'j_seq_length'), 'junction_end': ('junction_start', 'junction_length'), 'v_alignment_aa_end': ('v_alignment_aa_start', 'v_alignment_aa_length'), 'v_alignment_end': ('v_alignment_start', 'v_alignment_length'), 'v_germ_aa_end_imgt': ('v_germ_aa_start_imgt', 'v_germ_aa_length_imgt'), 'v_germ_aa_end_vdj': ('v_germ_aa_start_vdj', 'v_germ_aa_length_vdj'), 'v_germ_end_imgt': ('v_germ_start_imgt', 'v_germ_length_imgt'), 'v_germ_end_vdj': ('v_germ_start_vdj', 'v_germ_length_vdj'), 'v_seq_aa_end': ('v_seq_aa_start', 'v_seq_aa_length'), 'v_seq_end': ('v_seq_start', 'v_seq_length')}¶
-
static
identity
(v, deparse=False)¶
-
static
integer
(v, deparse=False)¶
-
length_fields
= {'cdr1_length': ('cdr1_start', 'cdr1_end'), 'cdr2_length': ('cdr2_start', 'cdr2_end'), 'cdr3_length': ('cdr3_start', 'cdr3_end'), 'd_germ_aa_length': ('d_germ_aa_start', 'd_germ_aa_end'), 'd_germ_length': ('d_germ_start', 'd_germ_end'), 'd_seq_aa_length': ('d_seq_aa_start', 'd_seq_aa_end'), 'd_seq_length': ('d_seq_start', 'd_seq_end'), 'fwr1_length': ('fwr1_start', 'fwr1_end'), 'fwr2_length': ('fwr2_start', 'fwr2_end'), 'fwr3_length': ('fwr3_start', 'fwr3_end'), 'fwr4_length': ('fwr4_start', 'fwr4_end'), 'j_germ_aa_length': ('j_germ_aa_start', 'j_germ_aa_end'), 'j_germ_length': ('j_germ_start', 'j_germ_end'), 'j_seq_aa_length': ('j_seq_aa_start', 'j_seq_aa_end'), 'j_seq_length': ('j_seq_start', 'j_seq_end'), 'junction_length': ('junction_start', 'junction_end'), 'v_alignment_aa_length': ('v_alignment_aa_start', 'v_alignment_aa_end'), 'v_alignment_length': ('v_alignment_start', 'v_alignment_end'), 'v_germ_aa_length_imgt': ('v_germ_aa_start_imgt', 'v_germ_aa_end_imgt'), 'v_germ_aa_length_vdj': ('v_germ_aa_start_vdj', 'v_germ_aa_end_vdj'), 'v_germ_length_imgt': ('v_germ_start_imgt', 'v_germ_end_imgt'), 'v_germ_length_vdj': ('v_germ_start_vdj', 'v_germ_end_vdj'), 'v_seq_aa_length': ('v_seq_aa_start', 'v_seq_aa_end'), 'v_seq_length': ('v_seq_start', 'v_seq_end')}¶
-
static
logical
(v, deparse=False)¶
-
static
nucleotide
(v, deparse=False)¶
-
parsers
= {'c_call': 'identity', 'cdr1_aa_imgt': 'aminoacid', 'cdr1_imgt': 'nucleotide', 'cdr2_aa_imgt': 'aminoacid', 'cdr2_imgt': 'nucleotide', 'cdr3_aa_imgt': 'aminoacid', 'cdr3_igblast': 'nucleotide', 'cdr3_igblast_aa': 'aminoacid', 'cdr3_imgt': 'nucleotide', 'cell': 'identity', 'clone': 'identity', 'conscount': 'integer', 'd_btop': 'identity', 'd_call': 'identity', 'd_cigar': 'identity', 'd_evalue': 'double', 'd_frame': 'integer', 'd_germ_aa_length': 'integer', 'd_germ_aa_start': 'integer', 'd_germ_length': 'integer', 'd_germ_start': 'integer', 'd_identity': 'double', 'd_score': 'double', 'd_seq_aa_length': 'integer', 'd_seq_aa_start': 'integer', 'd_seq_length': 'integer', 'd_seq_start': 'integer', 'dupcount': 'integer', 'functional': 'logical', 'fwr1_aa_imgt': 'aminoacid', 'fwr1_imgt': 'nucleotide', 'fwr2_aa_imgt': 'aminoacid', 'fwr2_imgt': 'nucleotide', 'fwr3_aa_imgt': 'aminoacid', 'fwr3_imgt': 'nucleotide', 'fwr4_aa_imgt': 'aminoacid', 'fwr4_imgt': 'nucleotide', 'germline_aa_imgt': 'aminoacid', 'germline_aa_vdj': 'aminoacid', 'germline_imgt': 'nucleotide', 'germline_imgt_d_mask': 'nucleotide', 'germline_vdj': 'nucleotide', 'germline_vdj_d_mask': 'nucleotide', 'in_frame': 'logical', 'indels': 'logical', 'j_btop': 'identity', 'j_call': 'identity', 'j_cigar': 'identity', 'j_evalue': 'double', 'j_germ_aa_length': 'integer', 'j_germ_aa_start': 'integer', 'j_germ_length': 'integer', 'j_germ_start': 'integer', 'j_identity': 'double', 'j_score': 'double', 'j_seq_aa_length': 'integer', 'j_seq_aa_start': 'integer', 'j_seq_length': 'integer', 'j_seq_start': 'integer', 'junction': 'nucleotide', 'junction_aa': 'aminoacid', 'junction_length': 'integer', 'junction_start': 'integer', 'locus': 'identity', 'mutated_invariant': 'logical', 'n1_length': 'integer', 'n2_length': 'integer', 'np1_aa_length': 'integer', 'np1_aa_start': 'integer', 'np1_length': 'integer', 'np1_start': 'integer', 'np2_aa_length': 'integer', 'np2_aa_start': 'integer', 'np2_length': 'integer', 'np2_start': 
'integer', 'p3d_length': 'integer', 'p3v_length': 'integer', 'p5d_length': 'integer', 'p5j_length': 'integer', 'rev_comp': 'logical', 'sequence_aa_imgt': 'aminoacid', 'sequence_aa_input': 'aminoacid', 'sequence_aa_vdj': 'aminoacid', 'sequence_id': 'identity', 'sequence_imgt': 'nucleotide', 'sequence_input': 'nucleotide', 'sequence_vdj': 'nucleotide', 'stop': 'logical', 'umicount': 'integer', 'v_btop': 'identity', 'v_call': 'identity', 'v_cigar': 'identity', 'v_evalue': 'double', 'v_frameshift': 'logical', 'v_germ_aa_length_imgt': 'integer', 'v_germ_aa_length_vdj': 'integer', 'v_germ_aa_start_imgt': 'integer', 'v_germ_aa_start_vdj': 'integer', 'v_germ_length_imgt': 'integer', 'v_germ_length_vdj': 'integer', 'v_germ_start_imgt': 'integer', 'v_germ_start_vdj': 'integer', 'v_identity': 'double', 'v_score': 'double', 'v_seq_aa_length': 'integer', 'v_seq_aa_start': 'integer', 'v_seq_length': 'integer', 'v_seq_start': 'integer', 'vdj_score': 'double'}¶
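The parsers mapping pairs each field name with the name of a conversion method that accepts a value and a deparse flag. The sketch below illustrates that pattern with standalone functions; the conversion rules shown (for example ‘T’/‘F’ for booleans and empty strings for missing values) are assumptions made for illustration and may differ from what ReceptorData actually does.

```python
def identity(v, deparse=False):
    """Pass the value through unchanged in both directions."""
    return v


def logical(v, deparse=False):
    """Assumed convention: 'T'/'F' strings on disk, booleans in memory."""
    if deparse:
        if v is None:
            return ""
        return "T" if v else "F"
    return {"T": True, "F": False, "TRUE": True, "FALSE": False}.get(str(v).upper())


def integer(v, deparse=False):
    """Convert to int when parsing and back to a string when deparsing."""
    if deparse:
        return "" if v is None else str(v)
    try:
        return int(v)
    except (TypeError, ValueError):
        return None


# Field name -> converter name, mirroring the shape of the parsers dictionary
PARSERS = {"v_call": "identity", "functional": "logical", "junction_length": "integer"}
CONVERTERS = {"identity": identity, "logical": logical, "integer": integer}


def convert_row(row, deparse=False):
    """Apply the matching converter to every known field; others pass through."""
    return {key: CONVERTERS[PARSERS.get(key, "identity")](value, deparse=deparse)
            for key, value in row.items()}


raw = {"v_call": "IGHV1-69*01", "functional": "T", "junction_length": "48"}
parsed = convert_row(raw)
restored = convert_row(parsed, deparse=True)
```

Keeping parsing and deparsing in the same function means a record can be read, modified and written back with one lookup table.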
-
start_fields
= {'cdr1_start': ('cdr1_length', 'cdr1_end'), 'cdr2_start': ('cdr2_length', 'cdr2_end'), 'cdr3_start': ('cdr3_length', 'cdr3_end'), 'd_germ_aa_start': ('d_germ_aa_length', 'd_germ_aa_end'), 'd_germ_start': ('d_germ_length', 'd_germ_end'), 'd_seq_aa_start': ('d_seq_aa_length', 'd_seq_aa_end'), 'd_seq_start': ('d_seq_length', 'd_seq_end'), 'fwr1_start': ('fwr1_length', 'fwr1_end'), 'fwr2_start': ('fwr2_length', 'fwr2_end'), 'fwr3_start': ('fwr3_length', 'fwr3_end'), 'fwr4_start': ('fwr4_length', 'fwr4_end'), 'j_germ_aa_start': ('j_germ_aa_length', 'j_germ_aa_end'), 'j_germ_start': ('j_germ_length', 'j_germ_end'), 'j_seq_aa_start': ('j_seq_aa_length', 'j_seq_aa_end'), 'j_seq_start': ('j_seq_length', 'j_seq_end'), 'junction_start': ('junction_length', 'junction_end'), 'v_alignment_aa_start': ('v_alignment_aa_length', 'v_alignment_aa_end'), 'v_alignment_start': ('v_alignment_length', 'v_alignment_end'), 'v_germ_aa_start_imgt': ('v_germ_aa_length_imgt', 'v_germ_aa_end_imgt'), 'v_germ_aa_start_vdj': ('v_germ_aa_length_vdj', 'v_germ_aa_end_vdj'), 'v_germ_start_imgt': ('v_germ_length_imgt', 'v_germ_end_imgt'), 'v_germ_start_vdj': ('v_germ_length_vdj', 'v_germ_end_vdj'), 'v_seq_aa_start': ('v_seq_aa_length', 'v_seq_aa_end'), 'v_seq_start': ('v_seq_length', 'v_seq_end')}¶
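The start_fields, length_fields and end_fields mappings each name the two fields from which the third can be derived. A minimal sketch of that arithmetic follows, assuming 1-based inclusive coordinates as the “(1-based)” notes on the start attributes suggest; the function names are hypothetical.

```python
def calc_end(start, length):
    """Last position of a region given its 1-based start and its length."""
    try:
        return int(start) + int(length) - 1
    except (TypeError, ValueError):
        return None  # missing or unparseable input


def calc_length(start, end):
    """Number of positions covered by a 1-based inclusive interval."""
    try:
        return int(end) - int(start) + 1
    except (TypeError, ValueError):
        return None


def calc_start(length, end):
    """First position of a region given its length and 1-based end."""
    try:
        return int(end) - int(length) + 1
    except (TypeError, ValueError):
        return None


# A junction that starts at position 290 and spans 48 nucleotides
junction_end = calc_end(290, 48)
```

Under this convention a region of length 1 starts and ends at the same position, which is why the end is start + length - 1 rather than start + length.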
-
Using IgBLAST¶
Example data¶
We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. In addition to the example FASTA files, we have included the standalone IgBLAST results. The files can be downloaded from here:
Configuring IgBLAST¶
A collection of scripts for setting up the standalone IgBLAST database from the
IMGT reference sequences is available on the
Immcantation repository.
To use these scripts, copy all the tools in the /scripts
folder to a location
in your PATH
. At a minimum, you’ll need the following scripts:
fetch_igblastdb.sh
fetch_imgtdb.sh
clean_imgtdb.py
imgt2igblast.sh
Download and configure the IgBLAST and IMGT reference databases as follows, adjusting the version number to taste:
# Download and extract IgBLAST
VERSION="1.17.0"
wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/${VERSION}/ncbi-igblast-${VERSION}-x64-linux.tar.gz
tar -zxf ncbi-igblast-${VERSION}-x64-linux.tar.gz
cp ncbi-igblast-${VERSION}/bin/* ~/bin
# Download reference databases and setup IGDATA directory
fetch_igblastdb.sh -o ~/share/igblast
cp -r ncbi-igblast-${VERSION}/internal_data ~/share/igblast
cp -r ncbi-igblast-${VERSION}/optional_file ~/share/igblast
# Build IgBLAST database from IMGT reference sequences
fetch_imgtdb.sh -o ~/share/germlines/imgt
imgt2igblast.sh -i ~/share/germlines/imgt -o ~/share/igblast
Note
Several Immcantation tools require the observed V(D)J sequence
(sequence_alignment
) and associated germline fields (germline_alignment
or germline_alignment_d_mask
) to have gaps inserted to conform to the
IMGT numbering scheme. Thus, when a tool such as MakeDb.py or
CreateGermlines.py requires a reference sequence set as input,
it will require the IMGT-gapped reference set. That is, it needs
the reference sequences that were downloaded using the fetch_imgtdb.sh
script, or downloaded manually from the
IMGT reference directory,
rather than the final ungapped reference set required by IgBLAST.
See also
The provided scripts download only the mouse and human IMGT reference databases.
See the IgBLAST documentation for instructions
on how to build the database in a more general case. Shown below is an example of how
to perform the same steps as the Immcantation scripts using a separately
downloaded IMGT reference set and the scripts provided by IgBLAST. You must have all of
the associated commands in your PATH
and the appropriate directories created:
# V segment database
edit_imgt_file.pl IMGT_Human_IGHV.fasta > ~/share/igblast/fasta/imgt_human_ig_v.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_v.fasta \
    -out ~/share/igblast/database/imgt_human_ig_v
# D segment database
edit_imgt_file.pl IMGT_Human_IGHD.fasta > ~/share/igblast/fasta/imgt_human_ig_d.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_d.fasta \
    -out ~/share/igblast/database/imgt_human_ig_d
# J segment database
edit_imgt_file.pl IMGT_Human_IGHJ.fasta > ~/share/igblast/fasta/imgt_human_ig_j.fasta
makeblastdb -parse_seqids -dbtype nucl -in ~/share/igblast/fasta/imgt_human_ig_j.fasta \
    -out ~/share/igblast/database/imgt_human_ig_j
Once these databases are built for each segment they can be referenced when running IgBLAST.
Running IgBLAST¶
Change-O provides a simple wrapper script to run IgBLAST with the required options as the igblast subcommand of AssignGenes.py. This wrapper can be run as follows using the database built using the Immcantation scripts:
AssignGenes.py igblast -s HD13M.fasta -b ~/share/igblast \
--organism human --loci ig --format blast
The optional --format blast
argument
defines the output format of IgBLAST. The default, blast
, is the
blocked tabular output provided by specifying the -outfmt '7 std qseq sseq btop'
argument to IgBLAST. Specifying --format airr
will output a tab-delimited file compliant with the
AIRR Rearrangement schema
defined by the AIRR Community.
AIRR format support requires IgBLAST v1.9.0 or higher.
The -b ~/share/igblast
argument specifies the
path containing the database
, internal_data
, and optional_file
directories required by IgBLAST. This option sets the IGDATA
environment variable
that controls where IgBLAST looks for internal database files. See the
IgBLAST documentation for more details
regarding the IGDATA
environment variable.
See also
The AssignGenes.py IgBLAST wrapper provides limited functionality.
For more control, IgBLAST should be run directly. The only strict
requirement for compatibility with Change-O is that the output must
either be an AIRR tab-delimited file (-outfmt 19
) or a blast-style
tabular output with the optional query sequence, subject sequence and BTOP fields
(-outfmt '7 std qseq sseq btop'
). An example of how to run IgBLAST
directly is shown below:
export IGDATA=~/share/igblast
igblastn \
    -germline_db_V ~/share/igblast/database/imgt_human_ig_v \
    -germline_db_D ~/share/igblast/database/imgt_human_ig_d \
    -germline_db_J ~/share/igblast/database/imgt_human_ig_j \
    -auxiliary_data ~/share/igblast/optional_file/human_gl.aux \
    -domain_system imgt -ig_seqtype Ig -organism human \
    -outfmt '7 std qseq sseq btop' \
    -query HD13M.fasta \
    -out HD13M.fmt7
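When IgBLAST is driven from a script rather than an interactive shell, the same direct call can be assembled in Python. The sketch below only builds the argument list and environment, using the options and paths from the example above; the build_igblast_call helper and its defaults are hypothetical, and nothing is executed.

```python
import os


def build_igblast_call(query, out, igdata="~/share/igblast", organism="human"):
    """Return (command, environment) for a direct igblastn run."""
    igdata = os.path.expanduser(igdata)
    database = os.path.join(igdata, "database")
    # IGDATA tells IgBLAST where internal_data and optional_file live
    env = dict(os.environ, IGDATA=igdata)
    cmd = ["igblastn"]
    for segment in ("V", "D", "J"):
        cmd += [f"-germline_db_{segment}",
                os.path.join(database, f"imgt_{organism}_ig_{segment.lower()}")]
    cmd += ["-auxiliary_data",
            os.path.join(igdata, "optional_file", f"{organism}_gl.aux"),
            "-domain_system", "imgt", "-ig_seqtype", "Ig", "-organism", organism,
            # Blast-style tabular output with query, subject and BTOP fields
            "-outfmt", "7 std qseq sseq btop",
            "-query", query, "-out", out]
    return cmd, env


cmd, env = build_igblast_call("HD13M.fasta", "HD13M.fmt7")
```

The pair can then be handed to subprocess.run(cmd, env=env, check=True). Passing the arguments as a list avoids shell quoting problems with the multi-word -outfmt value.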
Processing the output of IgBLAST¶
Standalone IgBLAST blast-style tabular output is parsed by the igblast
subcommand of MakeDb.py to generate the standardized tab-delimited database file
on which all subsequent Change-O modules operate. In addition to the IgBLAST output
(-i HD13M.fmt7
), both the FASTA files input to IgBLAST
(-s HD13M.fasta
) and the IMGT-gapped reference sequences
(-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
)
must be provided to MakeDb.py:
MakeDb.py igblast -i HD13M.fmt7 -s HD13M.fasta \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta \
--extended
The optional --extended
argument adds extra
columns to the output database containing IMGT-gapped CDR/FWR regions and
alignment metrics.
Warning
The reference sequences you provide to MakeDb.py must contain IMGT-gapped V segment references, and these references must be the same sequences used to build the IgBLAST reference database. If your IgBLAST germlines are not IMGT-gapped and/or they are not identical to those provided to MakeDb.py, then sequences which were assigned missing germlines will fail the parsing operation and the junction (CDR3) sequences will not be correct.
Parsing IMGT output¶
Example data¶
We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. In addition to the example FASTA files, we have included the IMGT/HighV-QUEST results. The files can be downloaded from here:
Reducing file size for submission to IMGT/HighV-QUEST¶
IMGT/HighV-QUEST currently limits the size of uploaded files to 500,000 sequences. To accommodate this limit, you can use the count subcommand of the pRESTO tool SplitSeq to divide your files into smaller pieces:
SplitSeq.py count -s file.fastq -n 500000 --fasta
The -n 500000
argument sets the maximum number of sequences in each file and the
--fasta
argument tells the tool to output a FASTA, rather than FASTQ, formatted file
suitable for upload to IMGT/HighV-QUEST.
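The partitioning itself is simple to picture: records are read in order and a new file is started every n sequences. The following is a rough standalone sketch of that idea for FASTA input, not SplitSeq's implementation; read_fasta and chunk are hypothetical helpers.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def chunk(records, n):
    """Group records into consecutive lists of at most n items."""
    iterator = iter(records)
    while True:
        part = list(islice(iterator, n))
        if not part:
            return
        yield part


# Five toy records split with a limit of two sequences per part
fasta = [">s1", "ACGT", ">s2", "AC", "GT", ">s3", "TTTT", ">s4", "GGGG", ">s5", "CCCC"]
parts = list(chunk(read_fasta(fasta), 2))
sizes = [len(part) for part in parts]
```

Only the final part can hold fewer than n records, so a limit of 500,000 guarantees that every part is within the upload limit.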
See also
For additional details see the corresponding example in the pRESTO documentation
Processing the output of IMGT/HighV-QUEST¶
The output from IMGT/HighV-QUEST may be
parsed via the imgt subcommand of MakeDb.py to generate the standardized
tab-delimited database file on which all subsequent Change-O modules operate.
Processing the IMGT output requires either the compressed output file (.zip or .txz)
or an uncompressed folder containing the 1_Summary
, 2_IMGT-gapped
, 3_Nt-sequences
and
6_Junction
files (-i HD13M.txz
).
Additionally, it is recommended that you provide the FASTA file that was submitted to HighV-QUEST
(-s HD13M.fasta
), as this will allow MakeDb.py to correct the
changes HighV-QUEST makes to the sequence identifier and add additional columns corresponding to any
annotations generated by pRESTO:
MakeDb.py imgt -i HD13M.txz -s HD13M.fasta --extended
The optional --extended
argument adds extra
columns to the output database containing IMGT-gapped CDR/FWR regions and
alignment metrics.
Merging processed IMGT/HighV-QUEST output¶
If you previously split files for submission to IMGT/HighV-QUEST, you can run each partition through MakeDb.py individually and merge the resulting output files using the merge subcommand of ParseDb.py:
MakeDb.py imgt -i part1.txz -s part1.fasta -o part1.tsv
MakeDb.py imgt -i part2.txz -s part2.fasta -o part2.tsv
ParseDb.py merge -d part1.tsv part2.tsv -o merged.tsv
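Conceptually, merging stacks the rows of each partition under a common header. The sketch below shows one way to do that with the standard csv module, keeping the union of all columns and leaving missing values empty. It illustrates the idea only and makes no claim about how ParseDb.py handles mismatched columns; merge_tables is a hypothetical helper.

```python
import csv
import io


def merge_tables(handles):
    """Concatenate tab-delimited tables, using the union of their columns."""
    fields, rows = [], []
    for handle in handles:
        reader = csv.DictReader(handle, delimiter="\t")
        for name in reader.fieldnames or []:
            if name not in fields:
                fields.append(name)  # keep first-seen column order
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# In-memory stand-ins for part1.tsv and part2.tsv
part1 = io.StringIO("sequence_id\tv_call\nA\tIGHV1-69*01\n")
part2 = io.StringIO("sequence_id\tj_call\nB\tIGHJ4*02\n")
merged = merge_tables([part1, part2])
```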
Parsing 10X Genomics V(D)J data¶
New
We have an updated tutorial covering the processing of 10x Genomics VDJ data with Change-O and SCOPer. You can also follow the steps below to process 10x VDJ data using methods available in Change-O.
Example data¶
10X Genomics provides an example data set of Ig V(D)J processed by the Cell Ranger pipeline, which is available for download from their Single Cell Immune Profiling support site.
Converting 10X V(D)J data into the AIRR Community standardized format¶
To process 10X V(D)J data, a combination of AssignGenes.py and MakeDb.py
can be used to generate a TSV file compliant with the
AIRR Community Rearrangement
schema that incorporates annotation information provided by the Cell Ranger pipeline. The
--10x filtered_contig_annotations.csv
argument specifies the path of the contig annotations file generated by cellranger vdj
,
which can be found in the outs
directory.
Generate AIRR Rearrangement data from the 10X V(D)J FASTA files using the steps below:
AssignGenes.py igblast -s filtered_contig.fasta -b ~/share/igblast \
--organism human --loci ig --format blast
MakeDb.py igblast -i filtered_contig_igblast.fmt7 -s filtered_contig.fasta \
-r IMGT_Human_*.fasta --10x filtered_contig_annotations.csv --extended
all_contig.fasta
can be exchanged for filtered_contig.fasta
, and
all_contig_annotations.csv
can be exchanged for filtered_contig_annotations.csv
.
Warning
The resulting table overwrites the V, D and J gene assignments generated by Cell Ranger and uses those generated by IgBLAST or IMGT/HighV-QUEST instead.
See also
To process mouse data and/or TCR data alter the --organism
and --loci
arguments to AssignGenes.py accordingly
(e.g., --organism mouse
,
--loci tcr
) and use the appropriate V, D and J IMGT
reference databases (e.g., IMGT_Mouse_TR*.fasta
).
See the IgBLAST usage guide for further details regarding the setup and use of IgBLAST with Change-O.
Identifying clones from B cells in AIRR formatted 10X V(D)J data¶
Splitting into separate light and heavy chain files¶
To group B cells into clones from AIRR Rearrangement data, the output from MakeDb.py must be parsed into a light chain file and a heavy chain file:
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IGH" \
--logic all --regex --outname heavy
ParseDb.py select -d 10x_igblast_db-pass.tsv -f locus -u "IG[LK]" \
--logic all --regex --outname light
Assign clonal groups to the heavy chain data¶
The heavy chain file must then be clonally clustered separately. See Clustering sequences into clonal groups for how to use DefineClones.py to assign clonal cluster annotations to the IGH file.
Correct clonal groups based on light chain data¶
DefineClones.py currently does not support light chain cloning. However,
cloning can be performed after heavy chain cloning using
light_cluster.py
provided on the Immcantation Bitbucket repository
in the scripts
directory:
light_cluster.py -d heavy_select-pass_clone-pass.tsv -e light_select-pass.tsv \
-o 10X_clone-pass.tsv
Here, heavy_select-pass_clone-pass.tsv
refers to the cloned heavy chain
AIRR Rearrangement file, light_select-pass.tsv
refers to the light chain file,
and 10X_clone-pass.tsv
is the resulting output file.
The algorithm will (1) remove cells associated with more than one heavy chain and (2) correct heavy chain clone definitions based on an analysis of the light chain partners associated with the heavy chain clone.
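Step (1) can be pictured as a count of heavy chain records per cell. The sketch below covers that step only, on rows represented as dictionaries with a cell_id key; it is not the light_cluster.py code, and step (2) is omitted because its rules are not spelled out here. The drop_multi_heavy helper is hypothetical.

```python
from collections import Counter


def drop_multi_heavy(heavy_rows):
    """Keep only heavy chain records whose cell has exactly one heavy chain."""
    counts = Counter(row["cell_id"] for row in heavy_rows)
    return [row for row in heavy_rows if counts[row["cell_id"]] == 1]


heavy = [
    {"cell_id": "c1", "sequence_id": "h1", "clone_id": "1"},
    {"cell_id": "c2", "sequence_id": "h2", "clone_id": "1"},
    {"cell_id": "c2", "sequence_id": "h3", "clone_id": "2"},  # second heavy chain in c2
    {"cell_id": "c3", "sequence_id": "h4", "clone_id": "2"},
]
kept = drop_multi_heavy(heavy)
kept_cells = [row["cell_id"] for row in kept]
```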
Note
By default, light_cluster.py
expects the
AIRR Rearrangement columns:
v_call
j_call
junction_length
umi_count
cell_id
clone_id
To process legacy Change-O formatted data add the --format changeo
argument:
light_cluster.py -d heavy_select-pass_clone-pass.tab -e light_select-pass.tab \
-o 10X_clone-pass.tab --format changeo
Which expects the following Change-O columns:
V_CALL
J_CALL
JUNCTION_LENGTH
UMICOUNT
CELL
CLONE
Filtering records¶
The ParseDb.py tool provides a basic set of operations for manipulating Change-O database files from the commandline, including removing or updating rows and columns.
Removing non-productive sequences¶
After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:
ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive
The first line above uses the select subcommand to output a single file
labeled parse-select
containing only records with the value of T
(-u T
) in the productive
column
(-f productive
).
Alternatively, the second line above uses the split subcommand to output
multiple files with each file containing records with one of the values found in the
productive
column (-f productive
). This will
generate two files labeled productive-T
and productive-F
.
Removing disagreements between the C-region primers and the reference alignment¶
If you have data that includes both heavy and light chains in the same library,
the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not
always agree with the isotype assignments from the C-region primers. In these cases,
you can filter out such reads with the select subcommand of ParseDb.py.
An example function call using an imaginary file db.tsv
is provided below:
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
    --logic all --regex --outname heavy
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
    --logic all --regex --outname light
These commands require that all of the v_call, j_call and c_call
fields (-f v_call j_call c_call and --logic all) contain the string IGH
(first command) or one of IGK or IGL (second command). The --regex
argument allows for partial matching and interpretation of regular expressions. The
output of these two commands is two files, one containing only heavy chains
(heavy_parse-select.tsv) and one containing only light chains (light_parse-select.tsv).
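The effect of combining several fields with all-fields logic and regular expression matching can be sketched in a few lines. This is an illustration of the filtering rule described above, not ParseDb.py itself; the select helper and its logic parameter are hypothetical stand-ins for the command line options.

```python
import re


def select(rows, fields, pattern, logic="all"):
    """Keep rows where all (or any) of the given fields match the pattern."""
    regex = re.compile(pattern)
    combine = all if logic == "all" else any
    # re.search gives partial matching, so "IGH" matches "IGHV1-69*01"
    return [row for row in rows
            if combine(regex.search(row.get(field) or "") for field in fields)]


rows = [
    {"sequence_id": "s1", "v_call": "IGHV1-69*01", "j_call": "IGHJ4*02", "c_call": "IGHM"},
    {"sequence_id": "s2", "v_call": "IGKV1-5*01", "j_call": "IGKJ1*01", "c_call": "IGKC"},
    # V and J alignments disagree with the C-region assignment
    {"sequence_id": "s3", "v_call": "IGLV2-14*01", "j_call": "IGLJ2*01", "c_call": "IGHG1"},
]
fields = ["v_call", "j_call", "c_call"]
heavy_ids = [row["sequence_id"] for row in select(rows, fields, "IGH")]
light_ids = [row["sequence_id"] for row in select(rows, fields, "IG[LK]")]
```

The third record appears in neither result because its fields do not agree, which is exactly the kind of read this filter is meant to discard.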
Exporting records to FASTA files¶
You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.
Standard FASTA¶
The fasta subcommand allows you to export sequences and annotations to FASTA formatted files in the pRESTO annotation scheme:
ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
--sf sequence_alignment --mf v_call duplicate_count
Where the column containing the sequence identifier is specified by
--if sequence_id
, the nucleotide sequence column is
specified by --sf sequence_alignment
, and additional annotations
to be added to the sequence header are specified by
--mf v_call duplicate_count
.
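To make the header layout concrete, here is a minimal Python sketch of a pRESTO-style record, where the identifier is followed by `|NAME=value` pairs. The function name `to_presto_fasta` and the sample record are hypothetical, and the exact annotation names written by ConvertDb.py may differ from the column names used here.

```python
def to_presto_fasta(record, id_field, seq_field, meta_fields):
    """Format one record as a FASTA entry with a pRESTO-style header."""
    parts = [record[id_field]]
    # Each extra annotation is appended as NAME=value, separated by '|'
    parts += ["%s=%s" % (name, record[name]) for name in meta_fields]
    return ">" + "|".join(parts) + "\n" + record[seq_field]

# Hypothetical record for illustration only
record = {
    "sequence_id": "seq1",
    "sequence_alignment": "CAGGTGCAGCTG",
    "v_call": "IGHV1-2*02",
    "duplicate_count": "3",
}
entry = to_presto_fasta(record, "sequence_id", "sequence_alignment",
                        ["v_call", "duplicate_count"])
print(entry)
```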
BASELINe FASTA¶
The baseline subcommand generates a FASTA derivative format required by the
BASELINe web tool. Generating these
files is similar to building standard FASTA files, but requires a few more options.
An example function call using an imaginary file db.tsv
is provided below:
ConvertDb.py baseline -d db.tsv --if sequence_id \
--sf sequence_alignment --mf v_call duplicate_count \
--cf clone_id --gf germline_alignment_d_mask
The additional arguments required by the baseline subcommand include the
clonal grouping (--cf clone_id
) and germline sequence
(--gf germline_alignment_d_mask
) columns added by
the DefineClones and CreateGermlines tasks,
respectively.
Note
The baseline subcommand requires the clone_id
column to be sorted.
DefineClones.py generates a sorted clone_id
column by default. However, if
you have altered the order of the clone_id
column at some point,
then you can re-sort the clonal assignments using the sort
subcommand of ParseDb.py. An example function call using an imaginary
file db.tsv
is provided below:
ParseDb.py sort -d db.tsv -f clone_id
Which will sort records by the value in the clone_id
column
(-f clone_id
).
Clustering sequences into clonal groups¶
Example data¶
We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. The files can be downloaded from here:
The following examples use the HD13M_db-pass.tsv
database file provided in
the example bundle, which has already undergone the IMGT/IgBLAST
parsing and filtering operations.
Determining a clustering threshold¶
Before running DefineClones.py, it is important to determine an
appropriate threshold for trimming the hierarchical clustering into B cell
clones. The distToNearest
function in the SHazaM R package calculates
the distance between each sequence in the data and its nearest-neighbor. The
resulting distribution should be bimodal, with the first mode representing sequences
with clonal relatives in the dataset and the second mode representing singletons.
The ideal threshold for separating clonal groups is the value that separates
the two modes of this distribution and can be found using the
findThreshold
function in the SHazaM R package. The
distToNearest
function allows selection of all parameters that are available in DefineClones.py.
Using the length normalization parameter ensures that mutations are weighted equally
regardless of junction sequence length. The distance to nearest-neighbor distribution
for the example data is shown below. The threshold is approximately 0.16,
indicated by the red dotted line.
See also
For additional details see the vignette on tuning clonal assignment thresholds.
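The idea behind the distance-to-nearest distribution can be sketched in a few lines of Python. This is a simplified illustration, not the SHazaM implementation: it compares only junctions of equal length, ignores V and J gene grouping, and the function names and example junctions are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def dist_to_nearest(junctions):
    """Length-normalized Hamming distance from each junction to its nearest
    neighbor of the same length; None when no comparable junction exists.
    Assumes junctions are non-empty strings."""
    result = []
    for i, a in enumerate(junctions):
        dists = [hamming(a, b) / len(a)
                 for j, b in enumerate(junctions)
                 if i != j and len(a) == len(b)]
        result.append(min(dists) if dists else None)
    return result

# Hypothetical junctions for illustration only
junctions = ["TGTGCGAGA", "TGTGCGAGT", "TGTAAACCC", "TGTGCG"]
nearest = dist_to_nearest(junctions)
print(nearest)
```

Sequences with a close relative yield small values (the first mode), while sequences without one yield large values (the second mode).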
Assigning clones¶
There are several parameter choices when grouping Ig sequences into B cell
clones. The argument --act set
accounts for ambiguous V gene and J gene calls when grouping similar sequences. The
distance metric --model ham
is nucleotide Hamming distance. Because the threshold was generated using length
normalized distances, the --norm len
argument is
selected with the previously determined threshold --dist 0.16
:
DefineClones.py -d HD13M_db-pass.tsv --act set --model ham \
--norm len --dist 0.16
Note
Because T cells don’t undergo SHM, non-zero nucleotide distances suggest sequences originate from a different ancestor. To identify TCR clones, use --dist 0 or a very low distance value to allow for sequencing error.
Reconstructing germline sequences¶
Example data¶
We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. The files can be downloaded from here:
The following examples use the HD13M_db-pass.tsv
AIRR Rearrangement file provided in
the example bundle, which has already undergone the IMGT/IgBLAST
parsing and filtering operations.
Adding germline sequences to the database¶
The CreateGermlines.py tool is used to reconstruct the germline V(D)J sequence,
from which the Ig lineage and mutations can be inferred. In addition to the alignment
information parsed by MakeDb.py to generate the initial database, CreateGermlines.py
also requires the set of germline sequences that were used for the alignment
passed to the -r
argument. In the case of V-segment germlines,
the reference sequences must be IMGT-gapped. Because the D-segment call for B cell receptor
alignments is often low confidence, the default
germline format (-g dmask
) places Ns in the N/P and D-segments
of the junction region rather than using the D-segment assigned during reference alignment;
this can be modified to generate a complete germline (-g full
)
or a V-segment only germline (-g vonly
) if you wish.
The command below adds the germline sequence to the germline_alignment_d_mask
column of
the output database:
CreateGermlines.py -d HD13M_db-pass.tsv -g dmask \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
Alternatively, if you have run the clonal assignment task prior to invoking
CreateGermlines.py, then adding the --cloned
argument is recommended, as this will generate a single germline of consensus length for each clone:
CreateGermlines.py -d HD13M_db-pass_clone-pass.tsv -g dmask --cloned \
-r IMGT_Human_IGHV.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
Important
The germline set passed to -r
must contain the
complete set of germlines used by the reference alignment software
(IMGT/HighV-QUEST or IgBLAST). If alleles called by the aligner are missing from the
reference set, they will not be successfully processed. Additionally, the V-segment
reference set must contain IMGT-gapped sequences to properly reconstruct germlines,
even if the reference alignment was performed on ungapped sequences.
Note
While MakeDb.py provides the ihmm subcommand to parse alignment output generated by iHMMuneAlign, there is insufficient information to successfully reconstruct germline sequences for all cases using CreateGermlines.py.
See also
The TIgGER R package provides tools for
identifying novel polymorphisms and building a personalized germline database. To
use the germline corrections provided by TIgGER
you would replace the V-segment germline file with the one generated by
genotypeFasta
(-r IGHV_genotype.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
) and
specify the genotyped V-segment column (--vf v_call_genotyped
):
CreateGermlines.py -d genotyped.tsv -g dmask --vf v_call_genotyped \
-r IGHV_genotype.fasta IMGT_Human_IGHD.fasta IMGT_Human_IGHJ.fasta
IgPhyML lineage tree analysis¶
IgPhyML is a program designed to build phylogenetic trees and test evolutionary hypotheses regarding B cell affinity maturation.
The biology of B cell somatic hypermutation (SHM) violates important assumptions in most standard phylogenetic substitution models; further, while most phylogenetics programs are designed to analyze single lineages, B cell repertoires typically contain thousands of lineages. IgPhyML addresses both of these issues by implementing substitution models that correct for the context-sensitive nature of SHM and by combining information from multiple lineages to give more precise repertoire-wide model parameter estimates.
An in-depth description of IgPhyML installation and usage can be found at the IgPhyML website.
Quick start¶
Once installed, IgPhyML can be run through
BuildTrees
by specifying the --igphyml
option. IgPhyML is easiest to run through the
Immcantation Docker image.
If this is not possible, these instructions require Change-O 0.4.6 or higher, Alakazam 0.3.0 or higher,
and IgPhyML to be installed, with the executable in your PATH
variable.
The following commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what’s going on or make sure what you’re doing makes sense, please check out the IgPhyML website.
Build trees and estimate model parameters¶
Download the IgPhyML repository, move to the examples
folder, and run
BuildTrees:
# Clone IgPhyML repository to get example files
git clone https://bitbucket.org/kleinstein/igphyml
# Move to examples directory
cd igphyml/examples
# Run BuildTrees
BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse \
--sample 3000 --igphyml --clean all --nproc 1
This command processes an AIRR-formatted dataset of BCR sequences that have been
clonally clustered
with germlines reconstructed.
It then quickly builds trees using the GY94 model and, using these
fixed topologies, estimates HLP19 model parameters. This can be sped up by
increasing the --nproc
option. Subsampling using the --sample
option isn’t
strictly necessary, but IgPhyML will run slowly when applied to large datasets.
Here, the --collapse
flag is used to collapse identical sequences. This is
highly recommended because identical sequences slow down calculations without
affecting likelihood values in IgPhyML.
Visualize results¶
The output file of the above command can be read using the
readIgphyml
function of
Alakazam.
After opening an R
session in the examples
subfolder, enter the following commands. Note that
when using the Docker container, you’ll need to run dev.off()
after
plotting the tree to create a pdf plot in the examples
directory:
library(alakazam)
library(igraph)
db = readIgphyml("ex_igphyml-pass.tab")
# Plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)
# Show HLP19 parameters
print(t(db$param[1,]))
CLONE "REPERTOIRE"
NSEQ "4"
NSITE "107"
TREE_LENGTH "0.286"
LHOOD "-290.7928"
KAPPA_MLE "2.266"
OMEGA_FWR_MLE "0.5284"
OMEGA_CDR_MLE "2.3324"
WRC_2_MLE "4.8019"
GYW_0_MLE "3.4464"
WA_1_MLE "5.972"
TW_0_MLE "0.8131"
SYC_2_MLE "-0.99"
GRS_0_MLE "0.2583"

Lineage tree of example clone.¶
To visualize a larger dataset with bigger trees, and bifurcating tree topologies,
again open an R
session in the examples
directory:
library(alakazam)
library(ape)
db = readIgphyml("sample1_igphyml-pass.tab",format="phylo")
# Plot largest lineage tree
plot(ladderize(db$trees[[1]]),cex=0.7,no.margin=TRUE)

Phylo-formatted lineage tree of a larger B cell clone.¶
Generating MiAIRR compliant GenBank/TLS submissions¶
MiAIRR¶
The MiAIRR standard (minimal information about adaptive immune receptor repertoires) is a minimal reporting standard for experiments using sequencing-based technologies to study adaptive immune receptors (T and B cell receptors). The current version (1.0) of the standard was published in Rubelt et al, 2017 and accepted by the general assembly at the annual AIRR Community meeting in December 2017.
MiAIRR recommends submission of raw read data to the Sequence Read Archive (SRA) and submission of processed and annotated data to the Targeted Locus Study (TLS) section of GenBank.
This example will cover generation of files for submission to TLS starting from Change-O formatted data. For complete details of the required and optional elements of the TLS submission see the AIRR Standards documentation site.
Special attention should be paid to the
REQUIRED elements.
Note that GenBank expects there to be a CDS
element that corresponds to the JUNCTION
. If submitting
single-cell heavy:light paired BCR data, GenBank expects separate files for the heavy, the kappa, and the
lambda chains. Note that even though the kappa and the lambda chain sequences should be in separate files,
their misc_feature
comments should both read immunoglobulin light chain variable region
, per AIRR
standard requirements. In addition, every effort should be made to make sure that the values of the attributes
for GenBank submission match those of the BioSample attributes. In particular, if the BioSample specifies
a strain
value (e.g. for mouse data), then a strain
attribute MUST be included when preparing GenBank
submission, and that value MUST match the BioSample value.
Example data¶
We have hosted a small example data set resulting from the UMI barcoded MiSeq workflow described in the pRESTO documentation. The files can be downloaded from here:
The following examples use the HD13M_db-pass.tsv
database file and
HD13M_template.sbt
file provided in the example bundle, which has already undergone
the IgBLAST annotation, parsing, and filtering
operations.
Generating files for submission¶
Requirements¶
An annotated data set in either the Change-O or the AIRR Data Representation formats. Records must have valid V, J and junction region annotations to be suitable for submission.
tbl2asn installed and in your PATH.
A GenBank submission template file (.sbt), generated using the NCBI Template Generator.
Important
C region annotations must use official gene symbols (IGHM, IGHG, etc) so that they are properly
recognized by remote databases. If your annotations are not of this form, then they must be updated
prior to generating the GenBank/TLS submission files. The following example shows how to use the
update subcommand of ParseDb.py to rename the values in the c_call
column.
The files provided for this example already have correctly annotated c_call
information, so
the following is a hypothetical example (db.tsv
) with existing annotations of the form IgM, IgG, etc:
ParseDb.py update -d db.tsv -f c_call \
-u IgA IgD IgE IgG IgM \
-t IGHA IGHD IGHE IGHG IGHM
Creating ASN files¶
ASN submission files are generated using the genbank subcommand of ConvertDb.py as follows:
ConvertDb.py genbank -d HD13M_db-pass.tsv \
--product "immunoglobulin heavy chain" \
--db "IMGT/GENE-DB" \
--inf "IgBLAST:1.14.0" \
--organism "Homo sapiens" \
--tissue "Peripheral blood" \
--cell-type "B cell" \
--isolate HD13M \
--cf c_call \
--nf duplicate_count \
--asis-id \
--asn \
--sbt HD13M_template.sbt \
--outdir HD13M_TLS
The resulting output in the HD13M_TLS
folder will include a number of files.
The Sequin file HD13M_db-pass_genbank.sqn
is the file that will be used for submission
and the GenBank record file HD13M_db-pass_genbank.gbf
is similar to what the submission
will look like once it has been accepted by GenBank.
The command above manually specifies several required and optional annotations.
Alternatively, sample information (organism
, sex
, isolate
, tissue_type
,
cell_type
) can be specified in a separate yaml file and provided via the
-y
argument. Additional harmonized
BioSample attributes,
which are not covered by the existing commandline arguments, may be provided
in the yaml file. Note, the yaml file adds only sample features, so it cannot be used
to specify source features (--product
, --mol
, --inf
and --db
arguments), parsing
arguments, or run parameters (--label, --exec
, etc). Features specified in the yaml
file will override equivalent features specified through the corresponding commandline arguments.
Note
The example shown above automatically runs tbl2asn, because the
--asn
argument was specified. ConvertDb.py
can be run without running tbl2asn, which will generate only the
feature table (HD13M_db-pass_genbank.tbl
) and fasta (HD13M_db-pass_genbank.fsa
) files
required to run tbl2asn manually via the command:
tbl2asn -p . -a s -V vb -t HD13M_template.sbt
Important
When running tbl2asn using the --asn
argument to ConvertDb.py there is no internal validation that the records
passing the filters in ConvertDb.py also pass the filters in tbl2asn.
As such, it is recommended that the number of sequences in the output .sqn
file be verified against the number of sequences in the .tbl
and .fsa
output files. From the command line, this can be achieved via:
grep -c iupacna *.sqn
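The same check can be scripted. The following Python sketch counts FASTA header lines and lines containing the `iupacna` marker (mirroring `grep -c`) and compares the two. The helper names are hypothetical, and the strings below are toy stand-ins rather than real tbl2asn output.

```python
def count_fasta_records(fasta_text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

def count_marker_lines(text, marker="iupacna"):
    """Count lines containing the marker, mirroring `grep -c`."""
    return sum(1 for line in text.splitlines() if marker in line)

# Toy stand-ins for file contents; not real tbl2asn output
fsa_text = ">seq1\nACGT\n>seq2\nGGCC\n"
sqn_text = 'seq-data iupacna "ACGT"\nseq-data iupacna "GGCC"\n'

counts_match = count_fasta_records(fsa_text) == count_marker_lines(sqn_text)
print(counts_match)
```

In practice the two strings would be read from the .fsa and .sqn files in the output directory.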
Warning
There is a known issue with the --asn
argument.
In some environments, for reasons that are presently unknown, tbl2asn
may fail to recognize the input fasta file and report an error stating
Unable to read any FASTA records. Running tbl2asn manually should
resolve the issue.
Submitting to GenBank/TLS using SequinMacroSend¶
After generating the .sqn
files, you can submit them as MiAIRR compliant
GenBank/TLS records using GenBank’s
SequinMacroSend service.
When submitting, simply add the keyword AIRR
to the subject line in the
submission system and it will be routed accordingly.
Warning
Currently, the SequinMacroSend system cannot accept files over 512MB in size.
For submissions over the size limit, you must split them into smaller files
and note in the submission comments that they are a part of a split submission.
Note, the .sqn
files used for submission are usually about 30 times the size
of the original tab-delimited Change-O file. See the split subcommand
of ParseDb.py for one approach to logically dividing large submissions.
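As a rough planning aid, the size figures above can be turned into an estimate of how many parts a submission may need. This Python sketch is back-of-the-envelope arithmetic only: it assumes 512MB means 512 * 2^20 bytes and treats the roughly 30-fold expansion as exact, and the function name `estimate_parts` is hypothetical.

```python
import math

LIMIT_BYTES = 512 * 1024 ** 2  # assumption: 512MB interpreted as 512 * 2^20 bytes
EXPANSION = 30                 # approximate .sqn size relative to the tab-delimited file

def estimate_parts(tsv_bytes, expansion=EXPANSION, limit=LIMIT_BYTES):
    """Estimate how many files a submission must be split into."""
    return max(1, math.ceil(tsv_bytes * expansion / limit))

small = estimate_parts(10 * 1024 ** 2)   # 10 MiB input, about 300 MiB as .sqn
large = estimate_parts(100 * 1024 ** 2)  # 100 MiB input, about 3000 MiB as .sqn
print(small, large)
```

Because the expansion factor is only approximate, it is safer to round up and verify the actual .sqn sizes after generation.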
Clonal clustering methods¶
The DefineClones.py tool provides multiple different approaches to assigning Ig sequences into clonal groups.
Clustering by V-gene, J-gene and junction length¶
All methods provided by DefineClones.py first partition sequences based on
common IGHV gene, IGHJ gene, and junction region length. These groups are then
further subdivided into clonally related groups based on the following distance
metrics on the junction region. The specified distance metric
(--model
) is then
used to perform hierarchical clustering under the specified linkage
(--link
). Clonal groups are
defined by trimming the resulting dendrogram at the specified threshold
(--dist
).
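The overall procedure can be illustrated with a short Python sketch. It partitions records by V gene, J gene and junction length, then joins junctions whose length-normalized Hamming distance is at or below the threshold. The sketch assumes single linkage, for which cutting the dendrogram at a threshold is equivalent to finding connected components; it is not the DefineClones.py implementation, whether the threshold boundary is inclusive there is not confirmed here, and all names and records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_clones(records, threshold):
    """records: list of (v_gene, j_gene, junction) tuples with non-empty
    junctions. Returns one integer clone label per record."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # First partition by V gene, J gene and junction length
    groups = defaultdict(list)
    for idx, (v, j, junction) in enumerate(records):
        groups[(v, j, len(junction))].append(idx)

    # Then link pairs within each partition that fall under the threshold
    for members in groups.values():
        for a, b in combinations(members, 2):
            ja, jb = records[a][2], records[b][2]
            if hamming(ja, jb) / len(ja) <= threshold:
                parent[find(a)] = find(b)

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(records))]

# Hypothetical records for illustration only
records = [
    ("IGHV1-2", "IGHJ4", "TGTGCGAGA"),
    ("IGHV1-2", "IGHJ4", "TGTGCGAGT"),
    ("IGHV1-2", "IGHJ4", "TGTAAACCC"),
    ("IGHV3-23", "IGHJ4", "TGTGCGAGA"),
]
clones = assign_clones(records, threshold=0.16)
print(clones)
```

The first two records differ at one of nine positions and share a clone; the third is too distant, and the fourth is kept apart because its V gene differs.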
Amino acid model¶
The aa
distance model is the Hamming distance
between junction amino acid sequences.
Hamming distance model¶
The ham
distance model is the Hamming
distance between junction nucleotide sequences.
Human and mouse 1-mer models¶
The hh_s1f
and
mk_rs1nf
distance models are single
nucleotide distance matrices derived from averaging and symmetrizing the human 5-mer
targeting model in [YVanderHeidenU+13] and the mouse 5-mer targeting model in
[CDiNiroVanderHeiden+16]. They are broadly similar to a transition/transversion model.
Human 1-mer substitution matrix:
Nucleotide | A | C | G | T | N |
---|---|---|---|---|---|
A | 0 | 1.21 | 0.64 | 1.16 | 0 |
C | 1.21 | 0 | 1.16 | 0.64 | 0 |
G | 0.64 | 1.16 | 0 | 1.21 | 0 |
T | 1.16 | 0.64 | 1.21 | 0 | 0 |
N | 0 | 0 | 0 | 0 | 0 |
Mouse 1-mer substitution matrix:
Nucleotide | A | C | G | T | N |
---|---|---|---|---|---|
A | 0 | 1.51 | 0.32 | 1.17 | 0 |
C | 1.51 | 0 | 1.17 | 0.32 | 0 |
G | 0.32 | 1.17 | 0 | 1.51 | 0 |
T | 1.17 | 0.32 | 1.51 | 0 | 0 |
N | 0 | 0 | 0 | 0 | 0 |
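To show how such a matrix is applied, the Python sketch below sums per-position costs from the human 1-mer substitution matrix above for two equal-length junctions. It is an illustration only: length normalization is omitted, the input is assumed to contain only A, C, G, T and N, and the names `HH_S1F`, `sub_cost` and `junction_distance` are hypothetical.

```python
# Upper triangle of the human 1-mer matrix; the matrix is symmetric
HH_S1F = {
    ("A", "C"): 1.21, ("A", "G"): 0.64, ("A", "T"): 1.16,
    ("C", "G"): 1.16, ("C", "T"): 0.64, ("G", "T"): 1.21,
}

def sub_cost(x, y, matrix):
    """Cost of observing x against y; identical bases and N cost 0."""
    if x == y or "N" in (x, y):
        return 0.0
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]

def junction_distance(a, b, matrix=HH_S1F):
    """Sum of per-position costs between two equal-length junctions."""
    if len(a) != len(b):
        raise ValueError("junctions must have the same length")
    return sum(sub_cost(x, y, matrix) for x, y in zip(a, b))

d_transition = junction_distance("TGTGCA", "TGTACN")  # one G/A difference, one N
d_transversions = junction_distance("ACGT", "CATG")   # four 1.21-cost differences
print(d_transition, d_transversions)
```

Transitions (A/G and C/T) carry the lowest cost in both tables, which is why these matrices resemble a transition/transversion model.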
Human and mouse 5-mer models¶
The hh_s5f
and
mk_rs5nf
distance models are based on
the human 5-mer targeting model in [YVanderHeidenU+13] and mouse 5-mer
targeting model in [CDiNiroVanderHeiden+16], respectively. The targeting
matrix $T$ has 5-mers across the columns and the nucleotide to
which the center base of the 5-mer mutates as the rows. The value for a
given nucleotide, 5-mer pair $T_{i,j}$ is the product of the
likelihood of that 5-mer to be mutated, $mut(j)$, and the
likelihood of the center base mutating to the given nucleotide, $sub(j \rightarrow i)$.
This matrix of probabilities is converted into a distance matrix $D$
via the following steps:
1. $D = -\log_{10}(T)$
2. $D$ is then divided by the mean of values in $D$.
3. All distances in $D$ that are infinite (probability of zero), distances on the diagonal (no change), and NA distances are set to 0.
Since the distance matrix $D$ is not symmetric, the --sym argument
can be specified to calculate either the average (avg) or minimum (min)
of $D(j \rightarrow i)$ and $D(i \rightarrow j)$.
The distances defined by $D$ for each nucleotide difference are
summed for all 5-mers in the junction to yield the distance between the
two junction sequences.
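The conversion from probabilities to distances can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the Change-O code: the transform is taken to be a negative log10, the mean is computed over finite values only, missing (NA) entries are represented as `None`, and the function names and the example matrix are hypothetical.

```python
import math

def targeting_to_distance(T):
    """T maps (fivemer, nucleotide) -> probability (0 and None allowed).
    Returns a dict of distances with the same keys."""
    # Negative log10 of each probability; zero probability becomes infinite
    D = {}
    for key, p in T.items():
        if p is None:
            D[key] = None
        elif p == 0:
            D[key] = math.inf
        else:
            D[key] = -math.log10(p)

    # Divide by the mean (assumption: mean over finite values only;
    # at least one finite value is required)
    finite = [d for d in D.values() if d is not None and math.isfinite(d)]
    mean = sum(finite) / len(finite)

    # Infinite, no-change (center base unchanged) and missing entries become 0
    out = {}
    for (fivemer, nuc), d in D.items():
        no_change = fivemer[2] == nuc
        if d is None or not math.isfinite(d) or no_change:
            out[(fivemer, nuc)] = 0.0
        else:
            out[(fivemer, nuc)] = d / mean
    return out

def symmetrize(d_forward, d_reverse, mode="avg"):
    """Combine the two directional distances by average or minimum."""
    if mode == "avg":
        return (d_forward + d_reverse) / 2
    return min(d_forward, d_reverse)

# Hypothetical probabilities for a single 5-mer, for illustration only
T = {
    ("AACAA", "A"): 0.01,
    ("AACAA", "C"): None,   # no change: center base is already C
    ("AACAA", "G"): 0.001,
    ("AACAA", "T"): 0.0,    # never observed
}
D = targeting_to_distance(T)
print(D)
```

Dividing by the mean centers the typical single-mutation distance near 1, which keeps thresholds comparable to those used with the Hamming model.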
- CDiNiroVanderHeiden+16
Ang Cui, Roberto Di Niro, Jason A. Vander Heiden, Adrian W. Briggs, Kris Adams, Tamara Gilbert, Kevin C. O’Connor, Francois Vigneault, Mark J. Shlomchik, and Steven H. Kleinstein. A Model of Somatic Hypermutation Targeting in Mice Based on High-Throughput Ig Sequencing Data. The Journal of Immunology, 197(9):3566–3574, Nov 2016. URL: http://www.jimmunol.org/content/197/9/3566.abstract http://www.jimmunol.org/lookup/doi/10.4049/jimmunol.1502263, doi:10.4049/jimmunol.1502263.
- YVanderHeidenU+13
Gur Yaari, Jason A Vander Heiden, Mohamed Uduman, Daniel Gadala-Maria, Namita Gupta, Joel N H Stern, Kevin C O’Connor, David A Hafler, Uri Laserson, Francois Vigneault, and Steven H Kleinstein. Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Frontiers in Immunology, 4:358, January 2013. URL: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3828525&tool=pmcentrez&rendertype=abstract, doi:10.3389/fimmu.2013.00358.
Reconstruction of germline sequences from alignment data¶
The CreateGermlines.py tool takes the individual segment alignment information for each sequence and reconstructs a full length germline sequence from the V(D)J reference sequences.
To reconstruct the germline, CreateGermlines.py trims V(D)J germline segments
and N/P regions by alignment length and concatenates them together. It puts Ns
in the untemplated N/P regions and optionally masks the D with Ns
(-g dmask
). CreateGermlines.py also looks for and
corrects cases where the alignment tool assigned the same part of the input sequence
to two different regions (eg, assigning the same nucleotides to N/P and J).
At the end of the germline reconstruction process, each sequence has been assigned a germline specific to the sequence.
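A heavily simplified Python sketch of the D-masked assembly is shown below. It only illustrates trimming and concatenation with Ns in the N/P and D regions; it does not handle IMGT gaps or overlapping region assignments, and the function name, parameter names and example sequences are all hypothetical.

```python
def build_dmask_germline(v_germ, j_germ, v_len, np1_len, d_len, np2_len,
                         j_start, j_len):
    """Concatenate the trimmed V and J germline segments, placing Ns over
    the N/P regions and the D segment."""
    v_part = v_germ[:v_len]                     # V trimmed by alignment length
    masked = "N" * (np1_len + d_len + np2_len)  # N/P1 + D + N/P2 all masked
    j_part = j_germ[j_start:j_start + j_len]    # J trimmed at both ends
    return v_part + masked + j_part

# Hypothetical segments and lengths for illustration only
germline = build_dmask_germline(
    v_germ="CAGGTGCAGCTG", j_germ="ACTACTTTGACTAC",
    v_len=10, np1_len=2, d_len=5, np2_len=1, j_start=3, j_len=6,
)
print(germline)
```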
When the (--cloned
) flag is specified, the
process is the same except it is clone specific and results in the
creation of one germline per clone. CreateGermlines.py first selects a
single V and J allele to use as the germline from all the assigned
annotations in each clone. The selection is made by simple majority rule of all
the allele calls in the clone. After the germline reconstruction process, all
sequences belonging to the same clone have been assigned the same germline.
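The majority-rule selection can be sketched with a counter. This Python example is an illustration only: it treats each call as a single allele string, breaks ties by first occurrence (an assumption, since tie handling is not described above), and the function name and calls are hypothetical.

```python
from collections import Counter

def consensus_allele(calls):
    """Return the most frequent allele call in a clone.
    Ties are broken by first occurrence (an assumption)."""
    return Counter(calls).most_common(1)[0][0]

# Hypothetical V allele calls from one clone, for illustration only
v_calls = ["IGHV1-2*02", "IGHV1-2*02", "IGHV1-2*04"]
clone_v = consensus_allele(v_calls)
print(clone_v)
```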
Contact¶
If you have questions you can email the Immcantation Group.
If you’ve discovered a bug or have a feature request, you can create an issue on Bitbucket using the Issue Tracker.
Citation¶
To cite Change-O in publications please use:
Gupta NT*, Vander Heiden JA*, Uduman M, Gadala-Maria D, Yaari G, Kleinstein SH. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 2015; doi: 10.1093/bioinformatics/btv359
License¶
This work is licensed under the GNU Affero General Public License Version 3 (AGPL-3).