Data Standards¶
All Change-O tools supports both the legacy Change-O standard and the new Adaptive Immune Receptor Repertoire (AIRR) standard developed by the AIRR Community (AIRR-C).
AIRR-C Format¶
As of v1.0.0, the default file format is the AIRR-C format as described by the
Rearrangement Schema (v1.2). The AIRR-C Rearrangement format is a tab-delimited
file format (.tsv
) that defines the required and optional annotations for
rearranged adaptive immune receptor sequences.
To learn more about this format, the valid field names and their expected values, visit the AIRR-C Rearrangement Schema documentation site.
An API for the input and output of the AIRR-C format is provided in the
AIRR Python package.
Wrappers for this package are provided in the API as changeo.IO.AIRRReader
and changeo.IO.AIRRWriter
.
Change-O Format¶
The legacy Change-O standard is a tab-delimited file format (.tab
) with a set
of predefined column names. The standardized column names used by the Change-O format
are shown in the table below. Most tools do not require every column. The columns
required by and added by each individual tool are described in the
commandline usage documentation. If a column contains multiple
entries, such as ambiguous V gene assignments, these nested entries are delimited
by commas. The ordering of the columns does not matter.
An API for the input and output of the Change-O format is provided in
changeo.IO.ChangeoReader
and changeo.IO.ChangeoWriter
respectively.
Change-O Field |
AIRR Field |
Type |
Description |
---|---|---|---|
Standard Annotations |
|||
SEQUENCE_ID |
sequence_id |
string |
Unique sequence identifier |
SEQUENCE_INPUT |
sequence |
string |
Input nucleotide sequence |
SEQUENCE_VDJ |
string |
V(D)J nucleotide sequence |
|
SEQUENCE_IMGT |
sequence_alignment |
string |
IMGT-numbered V(D)J nucleotide sequence |
FUNCTIONAL |
productive |
logical |
T: V(D)J sequence is predicted to be productive |
IN_FRAME |
vj_in_frame |
logical |
T: junction region nucleotide sequence is in-frame |
STOP |
stop_codon |
logical |
T: stop codon is present in V(D)J nucleotide sequence |
MUTATED_INVARIANT |
logical |
T: invariant amino acids properly encoded by V(D)J sequence |
|
INDELS |
logical |
T: V(D)J nucleotide sequence contains insertions and/or deletions |
|
LOCUS |
locus |
string |
Locus of the receptor |
V_CALL |
v_call |
string |
V allele assignment(s) |
D_CALL |
d_call |
string |
D allele assignment(s) |
J_CALL |
j_call |
string |
J allele assignment(s) |
C_CALL |
c_call |
string |
C-region assignment |
V_SEQ_START |
v_sequence_start |
integer |
Position of first V nucleotide in SEQUENCE_INPUT |
V_SEQ_LENGTH |
integer |
Number of V nucleotides in SEQUENCE_INPUT |
|
V_GERM_START_IMGT |
v_germline_start |
integer |
Position of V_SEQ_START in IMGT-numbered germline V(D)J sequence |
V_GERM_LENGTH_IMGT |
integer |
Length of the IMGT numbered germline V alignment |
|
NP1_LENGTH |
np1_length |
integer |
Number of nucleotides between V and D segments |
D_SEQ_START |
d_sequence_start |
integer |
Position of first D nucleotide in SEQUENCE_INPUT |
D_SEQ_LENGTH |
integer |
Number of D nucleotides in SEQUENCE_INPUT |
|
D_GERM_START |
d_germline_start |
integer |
Position of D_SEQ_START in germline V(D)J nucleotide sequence |
D_GERM_LENGTH |
integer |
Length of the germline D alignment |
|
NP2_LENGTH |
np2_length |
integer |
Number of nucleotides between D and J segments |
J_SEQ_START |
j_sequence_start |
integer |
Position of first J nucleotide in SEQUENCE_INPUT |
J_SEQ_LENGTH |
j_sequence_end |
integer |
Number of J nucleotides in SEQUENCE_INPUT |
J_GERM_START |
j_germline_start |
integer |
Position of J_SEQ_START in germline V(D)J nucleotide sequence |
J_GERM_LENGTH |
integer |
Length of the germline J alignment |
|
JUNCTION_LENGTH |
junction_length |
integer |
Number of junction nucleotides in SEQUENCE_VDJ |
JUNCTION |
junction |
string |
Junction region nucletide sequence |
CELL |
cell_id |
string |
Cell identifier |
CLONE |
clone_id |
string |
Clonal grouping identifier |
Region Annotations |
|||
FWR1_IMGT |
fwr1 |
string |
IMGT-numbered FWR1 nucleotide sequence |
FWR2_IMGT |
fwr2 |
string |
IMGT-numbered FWR2 nucleotide sequence |
FWR3_IMGT |
fwr3 |
string |
IMGT-numbered FWR3 nucleotide sequence |
FWR4_IMGT |
fwr4 |
string |
IMGT-numbered FWR4 nucleotide sequence |
CDR1_IMGT |
cdr1 |
string |
IMGT-numbered CDR1 nucleotide sequence |
CDR2_IMGT |
cdr2 |
string |
IMGT-numbered CDR2 nucleotide sequence |
CDR3_IMGT |
cdr3 |
string |
IMGT-numbered CDR3 nucleotide sequence |
N1_LENGTH |
n1_length |
integer |
Untemplated nucleotides 5’ of the D segment |
N2_LENGTH |
n2_length |
integer |
Untemplated Nucleotides 3’ of the D segment |
P3V_LENGTH |
p3v_length |
integer |
Palindromic nucleotides 3’ of the V segment |
P5D_LENGTH |
p5d_length |
integer |
Palindromic nucleotides 5’ of the D segment |
P3D_LENGTH |
p3d_length |
integer |
Palindromic nucleotides 3’ of the D segment |
P5J_LENGTH |
p5j_length |
integer |
Palindromic nucleotides 5’ of the J segment |
D_FRAME |
integer |
D segment reading frame |
|
Germline Annotations |
|||
GERMLINE_VDJ |
string |
Full unaligned germline V(D)J nucleotide sequence |
|
GERMLINE_VDJ_V_REGION |
string |
Unaligned germline V segment nucleotide sequence |
|
GERMLINE_VDJ_D_MASK |
string |
Unaligned germline V(D)J nucleotides sequence with Ns masking the NP1-D-NP2 regions |
|
GERMLINE_IMGT |
germline_alignment |
string |
Full IMGT-numbered germline V(D)J nucleotide sequence |
GERMLINE_IMGT_V_REGION |
string |
IMGT-numbered germline V segment nucleotide sequence |
|
GERMLINE_IMGT_D_MASK |
string |
IMGT-numbered germline V(D)J nucleotide sequence with Ns masking the NP1-D-NP2 regions |
|
GERMLINE_V_CALL |
string |
Clonal consensus germline V assignment |
|
GERMLINE_D_CALL |
string |
Clonal consensus germline D assignment |
|
GERMLINE_J_CALL |
string |
Clonal consensus germline J assignment |
|
GERMLINE_REGIONS |
string |
String showing germline segments positions encoded as V, D, J, N, and P characters |
|
Alignment Annotations |
|||
V_SCORE |
v_score |
float |
Alignment score for the V |
V_IDENTITY |
v_identity |
float |
Alignment identity for the V |
V_EVALUE |
v_support |
float |
E-value for the alignment of the V |
V_CIGAR |
v_cigar |
string |
CIGAR string for the alignment of the V |
D_SCORE |
d_score |
float |
Alignment score for the D |
D_IDENTITY |
d_identity |
float |
Alignment identity for the D |
D_EVALUE |
d_support |
float |
E-value for the alignment of the D |
D_CIGAR |
d_cigar |
string |
CIGAR string for the alignment of the D |
J_SCORE |
j_score |
float |
Alignment score for the J |
J_IDENTITY |
j_identity |
float |
Alignment identity for the J |
J_EVALUE |
j_support |
float |
E-value for the alignment of the J |
J_CIGAR |
j_cigar |
string |
CIGAR string for the alignment of the J |
VDJ_SCORE |
float |
Alignment score for the V(D)J |
|
TIgGER Annotations |
|||
V_CALL_GENOTYPED |
string |
Adjusted V allele assignment(s) following TIgGER genotype inference |
|
Preprocessing Annotations |
|||
PRCONS |
string |
pRESTO UMI consensus primer |
|
PRIMER |
string |
pRESTO primers list |
|
CONSCOUNT |
consensus_count |
integer |
Number of reads contributing to the UMI consensus sequence |
DUPCOUNT |
duplicate_count |
integer |
Copy number of the sequence |
UMICOUNT |
integer |
UMI count for the sequence |