Filtering records
The ParseDb.py tool provides a basic set of operations for manipulating Change-O database files from the command line, including removing or updating rows and columns.
Removing non-productive sequences
After building a Change-O database from either IMGT/HighV-QUEST or IgBLAST output, you may wish to subset your data to only productive sequences. This can be done in one of two roughly equivalent ways using the ParseDb.py tool:
ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
ParseDb.py split -d HD13M_db-pass.tsv -f productive
The first line above uses the select subcommand to output a single file labeled parse-select containing only records with the value T (-u T) in the productive column (-f productive). Alternatively, the second line above uses the split subcommand to output multiple files, with each file containing records having one of the values found in the productive column (-f productive). This will generate two files labeled productive-T and productive-F.
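The behavior of the select and split subcommands can be illustrated with a short stand-alone sketch using only the Python standard library. The column values below are hypothetical toy data; this shows the filtering rule, not how ParseDb.py is implemented internally:

```python
import csv
import io
from collections import defaultdict

# A tiny stand-in for a tab-delimited Change-O database file.
db = "sequence_id\tproductive\nseq1\tT\nseq2\tF\nseq3\tT\n"

rows = list(csv.DictReader(io.StringIO(db), delimiter="\t"))

# select: keep only rows where productive == "T" (like -f productive -u T).
selected = [r for r in rows if r["productive"] == "T"]

# split: partition rows by each distinct value in the productive column,
# one output group per value (like -f productive).
parts = defaultdict(list)
for r in rows:
    parts[r["productive"]].append(r)

print([r["sequence_id"] for r in selected])  # ['seq1', 'seq3']
print(sorted(parts))                         # ['F', 'T']
```

The two distinct values found in the column (T and F) correspond to the two output files that split would write.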
Removing disagreements between the C-region primers and the reference alignment
If you have data that includes both heavy and light chains in the same library,
the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not
always agree with the isotype assignments from the C-region primers. In these cases,
you can filter out such reads with the select subcommand of ParseDb.py.
An example function call using an imaginary file db.tsv is provided below:
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
    --logic all --regex --outname heavy
ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
    --logic all --regex --outname light
These commands require that all of the v_call, j_call, and c_call fields (-f v_call j_call c_call and --logic all) contain the string IGH (first command) or one of IGK or IGL (second command). The --regex argument allows partial matching and interpretation of regular expressions. The output of these two commands is two files, one containing only heavy chains (heavy_parse-select.tsv) and one containing only light chains (light_parse-select.tsv).
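The matching rule above can be mimicked in a few lines of plain Python. The records below are hypothetical, and the function is an illustration of the --logic all --regex behavior, not of ParseDb.py itself:

```python
import re

# Hypothetical records with V-segment, J-segment, and C-region calls.
records = [
    {"sequence_id": "seq1", "v_call": "IGHV1-2*02", "j_call": "IGHJ4*02", "c_call": "IGHM"},
    {"sequence_id": "seq2", "v_call": "IGKV1-5*01", "j_call": "IGKJ1*01", "c_call": "IGKC"},
    {"sequence_id": "seq3", "v_call": "IGHV3-23*01", "j_call": "IGHJ6*02", "c_call": "IGKC"},
]

def all_fields_match(record, pattern, fields=("v_call", "j_call", "c_call")):
    """Require the regex to match in every listed field (--logic all --regex)."""
    return all(re.search(pattern, record[f]) for f in fields)

heavy = [r["sequence_id"] for r in records if all_fields_match(r, "IGH")]
light = [r["sequence_id"] for r in records if all_fields_match(r, "IG[LK]")]

print(heavy)  # ['seq1']
print(light)  # ['seq2']
```

Note that seq3 appears in neither output: its light chain C-region call (IGKC) disagrees with its heavy chain V and J calls, which is exactly the kind of discordant record this filter removes.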
Exporting records to FASTA files
You may want to use external tools, or tools from pRESTO, on your Change-O result files. The ConvertDb.py tool provides two options for exporting data from tab-delimited files to FASTA format.
Standard FASTA
The fasta subcommand allows you to export sequences and annotations to FASTA formatted files in the pRESTO annotation scheme:
ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
--sf sequence_alignment --mf v_call duplicate_count
Where the column containing the sequence identifier is specified by --if sequence_id, the nucleotide sequence column is specified by --sf sequence_alignment, and additional annotations to be added to the sequence header are specified by --mf v_call duplicate_count.
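The shape of the exported records can be sketched in stdlib Python. The toy database and the '|'-delimited FIELD=value header below are assumptions for illustration only; consult the ConvertDb.py output for the exact header layout:

```python
import csv
import io

# A toy tab-delimited database (columns abbreviated for illustration).
db = (
    "sequence_id\tsequence_alignment\tv_call\tduplicate_count\n"
    "seq1\tACGTACGT\tIGHV1-2*02\t3\n"
)

def to_fasta(rows, id_field, seq_field, meta_fields):
    """Build FASTA records with '|'-delimited FIELD=value annotations
    appended to the identifier (an assumed pRESTO-style header sketch)."""
    out = []
    for row in rows:
        header = "|".join([row[id_field]] +
                          [f"{f}={row[f]}" for f in meta_fields])
        out.append(f">{header}\n{row[seq_field]}")
    return "\n".join(out)

rows = csv.DictReader(io.StringIO(db), delimiter="\t")
fasta = to_fasta(rows, "sequence_id", "sequence_alignment",
                 ["v_call", "duplicate_count"])
print(fasta)
```

Here the --if, --sf, and --mf arguments of the real tool correspond to the id_field, seq_field, and meta_fields parameters of the sketch.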
BASELINe FASTA
The baseline subcommand generates a FASTA derivative format required by the BASELINe web tool. Generating these files is similar to building standard FASTA files, but requires a few more options. An example function call using an imaginary file db.tsv is provided below:
ConvertDb.py baseline -d db.tsv --if sequence_id \
--sf sequence_alignment --mf v_call duplicate_count \
--cf clone_id --gf germline_alignment_d_mask
The additional arguments required by the baseline subcommand include the clonal grouping (--cf clone_id) and germline sequence (--gf germline_alignment_d_mask) columns added by the DefineClones and CreateGermlines tasks, respectively.
Note
The baseline subcommand requires the clone_id column to be sorted. DefineClones.py generates a sorted clone_id column by default. However, if you altered the order of the clone_id column at some point, you can re-sort the clonal assignments using the sort subcommand of ParseDb.py. An example function call using an imaginary file db.tsv is provided below:
ParseDb.py sort -d db.tsv -f clone_id
This will sort records by the value in the clone_id column (-f clone_id).
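The effect of sorting on the clone_id column can be sketched with stdlib Python on hypothetical rows. A stable sort groups all records of each clone together while preserving the original order within a clone:

```python
import csv
import io

# Toy records whose clonal assignments are out of order.
db = (
    "sequence_id\tclone_id\n"
    "seq1\t2\n"
    "seq2\t1\n"
    "seq3\t2\n"
    "seq4\t1\n"
)

rows = list(csv.DictReader(io.StringIO(db), delimiter="\t"))

# Python's sort is stable, so records within each clone keep their
# relative order while clones become contiguous blocks.
rows.sort(key=lambda r: r["clone_id"])

print([(r["sequence_id"], r["clone_id"]) for r in rows])
# [('seq2', '1'), ('seq4', '1'), ('seq1', '2'), ('seq3', '2')]
```

After sorting, each clone occupies a contiguous block of rows, which is the property the baseline subcommand depends on.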