Training
To train the support vector machine, you need a set of contaminated (“positive”) and a set of uncontaminated (“negative”) BAM files. For each category, you can organize your BAM files in two distinct ways:
- Place all BAM files of the category in the same directory.
- Make a flat text file in which each line is the path to one BAM file.
Note
Your BAM files must be indexed.
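As a rough illustration of the list method, the sketch below writes one BAM path per line and reports any BAM file that has no index next to it. It assumes the index sits beside the BAM file as either sample.bam.bai or sample.bai; the helper names find_bams, has_index and write_bam_list are hypothetical and not part of rna_cd.

```python
from pathlib import Path


def find_bams(directory):
    """Return all *.bam files directly inside `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.bam"))


def has_index(bam):
    """True if an index sits next to the BAM (assumed naming: x.bam.bai or x.bai)."""
    bam = Path(bam)
    return Path(str(bam) + ".bai").exists() or bam.with_suffix(".bai").exists()


def write_bam_list(directory, list_path):
    """Write one BAM path per line; return the BAM files that lack an index."""
    bams = find_bams(directory)
    Path(list_path).write_text("".join(f"{bam}\n" for bam in bams))
    return [bam for bam in bams if not has_index(bam)]
```

Calling write_bam_list("positives_dir", "positives.list") would produce a list file suitable for the list method. Any file it returns still needs an index, which can be created with, for example, samtools index.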
Once you have this in place, you need to choose the contig in your BAM files that you want to collect metrics for, and the chunksize. rna_cd will split your contig of interest into chunks with a maximum size of chunksize, and collect a number of metrics for each chunk. When you later use the model for classification, you must use the same contig and chunksize as you used during the training step.
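To make the chunking concrete, here is a minimal sketch that splits a contig into consecutive windows of at most chunksize bases. It assumes plain fixed-width windows with a shorter final chunk; rna_cd's actual coordinate system and boundary handling may differ, and split_contig is a hypothetical name. The example length of 16569 is the human mitochondrial genome in GRCh38.

```python
def split_contig(contig_length, chunksize=100):
    """Yield (start, end) half-open, 0-based intervals covering the contig."""
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, contig_length, chunksize):
        # The last chunk is truncated at the end of the contig.
        yield start, min(start + chunksize, contig_length)


chunks = list(split_contig(16569, chunksize=100))
print(len(chunks))   # 166
print(chunks[-1])    # (16500, 16569)
```

A different contig or chunksize yields a different number of chunks, and therefore a different set of per-chunk metrics, which is presumably why the same values are required at classification time.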
In our hands, the mitochondrial contig, at a chunksize of 100 bases, provides enough information to train the model. The mitochondrial contig also has the benefit of being small, which makes the training step fast.
Training can run in multicore mode. When using multiple cores, multiple BAM files are processed simultaneously. This can drastically speed up metric collection for large numbers of BAM files.
Lastly, you have to set the number of folds for cross-validation. By default this is 3, but you may set it to any positive integer.
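For readers unfamiliar with the term, the following sketch shows what a fold is in plain k-fold cross-validation: the samples are divided into k parts, and each part serves once as the held-out test set while the rest is used for training. This is a generic illustration, not rna_cd's implementation, which may shuffle or stratify the samples; k_fold_indices is a hypothetical name. Note that this plain version needs at least two folds, since with a single fold no samples would remain for training.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Split sample indices 0..n_samples-1 into n_folds (train, test) pairs."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    # Distribute samples as evenly as possible; the first folds take the remainder.
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds
```

With 10 samples and the default of 3 folds, the test sets hold 4, 3 and 3 samples, and every sample is held out exactly once.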
The created model will be saved to disk as a JSON file. The JSON file contains the pickled model.
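How the pickled model is laid out inside the JSON file is not described here. One plausible way to embed a pickle in JSON is to base64-encode the pickled bytes into a string field, as sketched below. The key name pickled_model, the base64 encoding and both helper functions are assumptions for illustration only and may not match the file rna_cd writes.

```python
import base64
import json
import pickle


def model_to_json(model, path):
    """Pickle `model` and store it as a base64 string in a JSON file."""
    # "pickled_model" is a hypothetical key name, not rna_cd's documented layout.
    payload = {"pickled_model": base64.b64encode(pickle.dumps(model)).decode("ascii")}
    with open(path, "w") as handle:
        json.dump(payload, handle)


def model_from_json(path):
    """Read the JSON file back and unpickle the embedded model."""
    with open(path) as handle:
        payload = json.load(handle)
    return pickle.loads(base64.b64decode(payload["pickled_model"]))
```

Whatever the exact layout, a file that contains a pickle can execute arbitrary code when it is loaded, so only use model files from sources you trust.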
Optionally, you can save a plot of the top two principal components of the training samples to disk.
Examples
Directory method, chrM, chunksize = 100, cores = 3
rna_cd-train -c chrM -pd positives_dir -nd negatives_dir -j 3 \
--chunksize 100 -o model.json
List method, chrM, chunksize = 100, cores = 3
rna_cd-train -c chrM -pl positives.list -nl negatives.list -j 3 \
--chunksize 100 -o model.json
List method, chrM, chunksize = 100, cores = 3, with plot
rna_cd-train -c chrM -pl positives.list -nl negatives.list -j 3 \
--chunksize 100 -o model.json --plot-out pca.png
Usage
rna_cd-train
rna_cd-train [OPTIONS]
Options
--chunksize <chunksize>
    Chunksize in bases. Default = 100
-c, --contig <contig>
    Name of mitochondrial contig in your BAM files. Default = chrM
-pd, --positives-dir <positives_dir>
    Path to directory containing positive BAM files. Mutually exclusive with --positives-list
-nd, --negatives-dir <negatives_dir>
    Path to directory containing negative BAM files. Mutually exclusive with --negatives-list
-pl, --positives-list <positives_list>
    Path to file containing a list of paths to positive BAM files. Mutually exclusive with --positives-dir
-nl, --negatives-list <negatives_list>
    Path to file containing a list of paths to negative BAM files. Mutually exclusive with --negatives-dir
--cross-validations <cross_validations>
    Number of folds for cross validation run. Default = 3
--verbosity <verbosity>
    Verbosity value for cross validation step. Default = 1
-j, --cores <cores>
    Number of cores to use for processing of BAM files and cross validations. Default = 1
--plot-out <plot_out>
    Optional path to PCA plot.
-o, --model-out <model_out>
    Path where model will be stored. [required]