.. _Filtering:

Filtering records
================================================================================

The :ref:`ParseDb` tool provides a basic set of operations for manipulating
Change-O database files from the command line, including removing or updating
rows and columns.

.. _Filtering-Functional:

Removing non-productive sequences
--------------------------------------------------------------------------------

After building a Change-O database from either :ref:`IMGT/HighV-QUEST ` or
:ref:`IgBLAST ` output, you may wish to subset your data to only productive
sequences. This can be done in one of two roughly equivalent ways using the
:ref:`ParseDb` tool:

.. code-block:: none
    :linenos:

    ParseDb.py select -d HD13M_db-pass.tsv -f productive -u T
    ParseDb.py split -d HD13M_db-pass.tsv -f productive

The first line above uses the :program:`select` subcommand to output a single
file labeled ``parse-select`` containing only records with the value ``T``
(:option:`-u T `) in the ``productive`` column (:option:`-f productive `).

Alternatively, the second line above uses the :program:`split` subcommand to
output multiple files, one for each value found in the ``productive`` column
(:option:`-f productive `). This will generate two files labeled
``productive-T`` and ``productive-F``.

Removing disagreements between the C-region primers and the reference alignment
--------------------------------------------------------------------------------

If you have data that includes both heavy and light chains in the same library,
the V-segment and J-segment alignments from IMGT/HighV-QUEST or IgBLAST may not
always agree with the isotype assignments from the C-region primers. In these
cases, you can filter out such reads with the :program:`select` subcommand of
:ref:`ParseDb`. An example command using an imaginary file ``db.tsv`` is
provided below:

.. code-block:: none
    :linenos:

    ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IGH" \
        --logic all --regex --outname heavy
    ParseDb.py select -d db.tsv -f v_call j_call c_call -u "IG[LK]" \
        --logic all --regex --outname light

These commands require that all of the ``v_call``, ``j_call`` and ``c_call``
fields (:option:`-f v_call j_call c_call ` and :option:`--logic all `) contain
the string ``IGH`` (lines 1-2) or one of ``IGK`` or ``IGL`` (lines 3-4). The
:option:`--regex ` argument allows for partial matching and interpretation of
regular expressions. Records whose fields disagree, such as a heavy chain
V-segment call paired with a light chain C-region call, match neither pattern
and are excluded from both outputs. The two commands produce two files, one
containing only heavy chains (``heavy_parse-select.tsv``) and one containing
only light chains (``light_parse-select.tsv``).

Exporting records to FASTA files
--------------------------------------------------------------------------------

You may want to use external tools, or tools from `pRESTO `__, on your Change-O
result files. The :ref:`ConvertDb` tool provides two options for exporting data
from tab-delimited files to FASTA format.

Standard FASTA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :program:`fasta` subcommand allows you to export sequences and annotations
to FASTA formatted files in the `pRESTO annotation scheme `__::

    ConvertDb.py fasta -d HD13M_db-pass.tsv --if sequence_id \
        --sf sequence_alignment --mf v_call duplicate_count

Here, the column containing the sequence identifier is specified by
:option:`--if sequence_id `, the nucleotide sequence column is specified by
:option:`--sf sequence_alignment `, and additional annotations to be added to
the sequence header are specified by :option:`--mf v_call duplicate_count `.

BASELINe FASTA
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :program:`baseline` subcommand generates a FASTA derivative format required
by the `BASELINe `__ web tool. Generating these files is similar to building
standard FASTA files, but requires a few more options. An example command
using an imaginary file ``db.tsv`` is provided below::

    ConvertDb.py baseline -d db.tsv --if sequence_id \
        --sf sequence_alignment --mf v_call duplicate_count \
        --cf clone_id --gf germline_alignment_d_mask

The additional arguments required by the :program:`baseline` subcommand are the
clonal grouping (:option:`--cf clone_id `) and germline sequence
(:option:`--gf germline_alignment_d_mask `) columns added by the
:ref:`DefineClones ` and :ref:`CreateGermlines ` tasks, respectively.

.. note::

    The :program:`baseline` subcommand requires the ``clone_id`` column to be
    sorted. :ref:`DefineClones` generates a sorted ``clone_id`` column by
    default. However, if you have altered the order of the ``clone_id`` column
    at some point, you can re-sort the clonal assignments using the
    :program:`sort` subcommand of :ref:`ParseDb`. An example command using an
    imaginary file ``db.tsv`` is provided below::

        ParseDb.py sort -d db.tsv -f clone_id

    This sorts records by the value in the ``clone_id`` column
    (:option:`-f clone_id `).
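
To make the chain filtering from the C-region primer section more concrete, the
following is a minimal Python sketch of what selecting with ``--logic all`` and
``--regex`` amounts to, as described above: a record is kept only if every
listed field contains a match for the pattern. This is not Change-O's
implementation. The helper ``select_rows``, its ``logic="any"`` mode, and the
four example records are hypothetical and exist only for illustration.

```python
import csv
import io
import re


def select_rows(rows, fields, pattern, logic="all"):
    """Keep rows whose fields contain a match for a regular expression.

    Hypothetical helper, not part of Change-O. With logic="all" every
    field must match; logic="any" (an assumed alternative mode) keeps a
    row if at least one field matches.
    """
    regex = re.compile(pattern)
    combine = all if logic == "all" else any
    return [
        row for row in rows
        # re.search gives partial matching, so "IGH" matches "IGHV3-23*01"
        if combine(regex.search(row.get(field) or "") for field in fields)
    ]


# Made-up records in a tab-delimited layout; seq3 has a heavy chain
# V/J call but a light chain C-region call.
example = (
    "sequence_id\tv_call\tj_call\tc_call\n"
    "seq1\tIGHV3-23*01\tIGHJ4*02\tIGHM\n"
    "seq2\tIGKV1-5*03\tIGKJ1*01\tIGKC\n"
    "seq3\tIGHV1-69*01\tIGHJ6*02\tIGKC\n"
    "seq4\tIGLV2-14*01\tIGLJ2*01\tIGLC2\n"
)
rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))
fields = ["v_call", "j_call", "c_call"]

heavy = select_rows(rows, fields, "IGH")
light = select_rows(rows, fields, "IG[LK]")

print([row["sequence_id"] for row in heavy])
print([row["sequence_id"] for row in light])
```

In this sketch, ``seq3`` appears in neither list because its fields do not all
match either pattern, which is the kind of disagreement the two
:program:`select` commands are meant to remove.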