Classification
As with the training step, you can organize your BAM files in two distinct ways:
- Place all BAM files in the same directory.
- Make a flat text file, where each line points to a path of a BAM file.
This time, there are no separate categories, as the status of all BAM files is a priori unknown.
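When the BAM files live in several directories, the list file can be generated instead of written by hand. The following is a minimal sketch in Python; the helper name `write_bam_list` and the file name `bams.list` are assumptions made for illustration and are not part of rna_cd.

```python
from pathlib import Path


def write_bam_list(bam_dirs, list_path):
    """Write one absolute BAM path per line to list_path.

    Hypothetical helper for illustration; not part of rna_cd.
    Returns the list of paths that were written.
    """
    bam_paths = sorted(
        str(bam.resolve())
        for bam_dir in bam_dirs
        for bam in Path(bam_dir).glob("*.bam")  # matches *.bam only, not *.bam.bai
    )
    Path(list_path).write_text("".join(f"{path}\n" for path in bam_paths))
    return bam_paths
```

For example, `write_bam_list(["run1", "run2"], "bams.list")` would produce a file that can be passed to the list method.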
Note
Your BAM files must be indexed.
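A quick check for missing indexes before starting a long run might look like the sketch below. It assumes that an index sits next to its BAM file under one of the common naming conventions (`sample.bam.bai`, `sample.bai`, or `sample.bam.csi`); which of these rna_cd itself accepts is not stated here, and the helper name `missing_indexes` is hypothetical.

```python
from pathlib import Path


def missing_indexes(bam_paths):
    """Return the BAM paths that have no index file next to them.

    Hypothetical helper for illustration; not part of rna_cd.
    """
    missing = []
    for bam in map(Path, bam_paths):
        candidates = [
            bam.with_name(bam.name + ".bai"),  # sample.bam.bai
            bam.with_suffix(".bai"),           # sample.bai
            bam.with_name(bam.name + ".csi"),  # sample.bam.csi
        ]
        if not any(candidate.exists() for candidate in candidates):
            missing.append(str(bam))
    return missing
```

Any file reported by this check can be indexed with, for instance, `samtools index sample.bam`.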
Warning
As mentioned before, you must use the exact same contig and chunksize settings in this step as were used during the training step.
As with the training step, metric collection can run in multicore mode during classification.
Once you have prepared your BAM files and chosen your parameters, you will use the model you generated during the training step to classify your BAM files into contaminated (“positive”) and uncontaminated (“negative”) groups.
Additionally, there is an optional third category with the label “unknown”. This represents samples for which it is unclear which category they belong to. By default, samples are categorized as “unknown” when the class probability of the most likely category is below 0.75. You can override this threshold, but it must be higher than 0.5 and lower than 1.0.
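The threshold rule described above can be sketched as follows. This is an illustration of the rule only, not rna_cd's actual implementation; the function name `assign_category` and the dictionary input are assumptions.

```python
def assign_category(probabilities, unknown_threshold=0.75):
    """Pick the most likely label, or "unknown" if its probability is too low.

    `probabilities` maps labels ("pos", "neg") to class probabilities.
    Illustration only; rna_cd's internal implementation may differ.
    """
    if not 0.5 < unknown_threshold < 1.0:
        raise ValueError("unknown_threshold must be higher than 0.5 and lower than 1.0")
    # Select the label with the highest class probability.
    label, probability = max(probabilities.items(), key=lambda item: item[1])
    # Strictly below the threshold means the sample is reported as unknown.
    if probability < unknown_threshold:
        return "unknown", probability
    return label, probability
```

Under this sketch, a sample whose most likely class has probability exactly 0.75 keeps its label, because only values strictly below the threshold are reported as “unknown”.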
The classifications will be stored to disk in a three-column tab-delimited text file, with the following columns:
- Name of the BAM file that was classified.
- Assigned category (“pos” or “neg” for positive and negative classifications, respectively, and “unknown” for the unknown category).
- The probability of the assigned category.
For example, an output file could look like this:
filename predicted_class class_probability
a.bam neg 0.95
b.bam neg 0.88
c.bam pos 0.75
d.bam unknown 0.55
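Because the output is plain tab-delimited text, it is easy to post-process. The sketch below reads such a file and counts the samples per category. It assumes the file starts with the header line shown in the example above; the function names are hypothetical.

```python
import csv
from collections import Counter


def read_classifications(path):
    """Read the tab-delimited output into a list of dicts, one per BAM file.

    Assumes a header line with the columns filename, predicted_class
    and class_probability, as in the example output.
    """
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        row["class_probability"] = float(row["class_probability"])
    return rows


def count_by_class(rows):
    """Count how many samples were assigned to each category."""
    return Counter(row["predicted_class"] for row in rows)
```

Samples labelled “unknown” can then be picked out with, for instance, `[r["filename"] for r in rows if r["predicted_class"] == "unknown"]`.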
Examples
Directory method, chrM, chunksize = 100, cores = 3
rna_cd-classify -m model.json -d bams_dir -j 3 -c chrM \
--chunksize 100 -o classifications.out
List method, chrM, chunksize = 100, cores = 3
rna_cd-classify -m model.json -l bams.list -j 3 -c chrM \
--chunksize 100 -o classifications.out
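When classification is launched from a pipeline script, the same command lines can be assembled programmatically. The sketch below builds the argument list from the options documented on this page. The helper name `build_classify_command` is hypothetical, and the check that exactly one of the directory and list inputs is given is an assumption based on the two options being mutually exclusive.

```python
def build_classify_command(model, output, directory=None, list_items=None,
                           contig="chrM", chunksize=100, cores=1,
                           unknown_threshold=0.75):
    """Build an rna_cd-classify argument list (hypothetical helper)."""
    # Assumption: exactly one of the two input methods must be chosen.
    if (directory is None) == (list_items is None):
        raise ValueError("give exactly one of directory or list_items")
    if not 0.5 < unknown_threshold < 1.0:
        raise ValueError("unknown_threshold must be higher than 0.5 and lower than 1.0")
    command = [
        "rna_cd-classify",
        "-m", str(model),
        "-o", str(output),
        "-c", contig,
        "--chunksize", str(chunksize),
        "-j", str(cores),
        "-t", str(unknown_threshold),
    ]
    if directory is not None:
        command += ["-d", str(directory)]
    else:
        command += ["-l", str(list_items)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Remember that the contig and chunksize values must match those used during training.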
Usage
rna_cd-classify
rna_cd-classify [OPTIONS]
Options
--chunksize <chunksize>
    Chunksize in bases. Default = 100

-c, --contig <contig>
    Name of the mitochondrial contig in your BAM files. Default = chrM

-j, --cores <cores>
    Number of cores to use for processing of BAM files. Default = 1

-d, --directory <directory>
    Path to directory with BAM files to be tested. Mutually exclusive with --list-items

-l, --list-items <list_items>
    Path to file containing list of paths to BAM files to be tested. Mutually exclusive with --directory

-m, --model <model>
    Path to model. [required]

-o, --output <output>
    Path to output file containing classifications. [required]

-t, --unknown-threshold <unknown_threshold>
    Threshold on the probability of the most likely class, below which samples will be assigned as ‘unknown’. Default = 0.75