API documentation
bam_process
-
rna_cd.bam_process.chop_contig(size: int, chunksize: int) → Iterator[Tuple[int, int]]
For a contig of the given size, generate regions of at most chunksize bases. Coordinates are 0-based.
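The chunking described for chop_contig can be sketched as follows. This is a minimal illustration, not the package's implementation: the name `chop_contig_sketch` is hypothetical, and the half-open `(start, end)` convention is an assumption, since the documentation only states that indexing is 0-based.

```python
from typing import Iterator, Tuple


def chop_contig_sketch(size: int, chunksize: int) -> Iterator[Tuple[int, int]]:
    """Yield 0-based (start, end) regions of at most `chunksize` bases.

    Assumption: regions are half-open, so `end` is exclusive.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    for start in range(0, size, chunksize):
        # The last region is truncated at the end of the contig.
        yield start, min(start + chunksize, size)


regions = list(chop_contig_sketch(250, 100))
print(regions)
```

With a contig of 250 bases and a chunksize of 100, the final region is shorter than the others because it stops at the contig end.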
-
rna_cd.bam_process.softclip_bases(reader: pysam.libcalignmentfile.AlignmentFile, contig: str, region: Tuple[int, int]) → int
Calculate the number of soft-clipped bases in a region.
-
rna_cd.bam_process.coverage(reader: pysam.libcalignmentfile.AlignmentFile, contig: str, region: Tuple[int, int], method: Callable = <function mean>) → float
Calculate the coverage of a region, summarized by method (mean by default; median or another function can be passed instead).
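The role of the method argument can be illustrated without a BAM file. The sketch below assumes that per-base depths for a region are already available as a plain list; `summarize_depth` and the depth values are hypothetical, and the pysam reading step is deliberately left out.

```python
from statistics import mean, median
from typing import Callable, Sequence


def summarize_depth(depths: Sequence[int], method: Callable = mean) -> float:
    """Reduce per-base depths of one region to a single number using `method`."""
    return float(method(depths))


# Hypothetical per-base depths for one region; one position is an outlier.
depths = [10, 12, 11, 50, 9]
mean_cov = summarize_depth(depths)
median_cov = summarize_depth(depths, method=median)
print(mean_cov, median_cov)
```

Passing a different callable changes how sensitive the summary is to outliers: here the mean is pulled up by the single deep position, while the median is not.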
-
rna_cd.bam_process.process_bam(path: pathlib.Path, chunksize: int = 100, contig: str = 'chrM') → numpy.ndarray
Process a BAM file into an ndarray.
Returns: numpy ndarray of shape (n_features,)
-
rna_cd.bam_process.make_array_set(bam_files: List[pathlib.Path], labels: List[Any], chunksize: int = 100, contig: str = 'chrM', cores: int = 1) → Tuple[numpy.ndarray, numpy.ndarray]
Make a set of numpy arrays corresponding to data and labels, i.e. train/test X and train/test Y in scikit-learn parlance.
Parameters:
- bam_files – List of paths to BAM files.
- labels – List of labels, one per BAM file.
- cores – Number of cores to use for processing.
Returns: tuple of X and Y numpy arrays. X has shape (n_files, n_features). Y has shape (n_files,).
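The shapes returned by make_array_set can be pictured by stacking one feature vector per file. In this sketch `fake_features` is a hypothetical stand-in for the per-file processing step and returns random numbers; only the stacking and the resulting shapes are the point.

```python
import numpy as np


def fake_features(n_features: int, seed: int) -> np.ndarray:
    """Hypothetical stand-in for one file's feature vector, shape (n_features,)."""
    return np.random.default_rng(seed).random(n_features)


labels = ["positive", "positive", "negative", "negative"]
per_file = [fake_features(6, seed) for seed in range(len(labels))]

X = np.stack(per_file)  # one row per file: (n_files, n_features)
Y = np.array(labels)    # one label per file: (n_files,)
print(X.shape, Y.shape)
```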
cli
-
rna_cd.cli.directory_callback(ctx, param, value)
Click callback function for getting BAM/CRAM files from a directory.
-
rna_cd.cli.list_callback(ctx, param, value)
Click callback function for getting BAM/CRAM files from a list file.
-
rna_cd.cli.path_callback(ctx, param, value)
Generic str-to-path callback, for click.Path options that should return a pathlib.Path.
-
rna_cd.cli.unknown_threshold_callback(ctx, param, value)
Click callback function for the unknown threshold, which has to be between 0.5 and 1.0.
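All callbacks in this module share click's `(ctx, param, value)` signature. The sketch below shows how two such callbacks could look; the function names, option names, and the inclusive bounds are assumptions made for illustration, not the package's code.

```python
import pathlib

import click


def path_callback_sketch(ctx, param, value):
    """Convert the string that click passes in into a pathlib.Path."""
    if value is None:  # the option was not supplied
        return None
    return pathlib.Path(value)


def threshold_callback_sketch(ctx, param, value):
    """Reject thresholds outside [0.5, 1.0] (inclusive bounds are an assumption)."""
    if not 0.5 <= value <= 1.0:
        raise click.BadParameter("threshold must be between 0.5 and 1.0")
    return value


@click.command()
@click.option("--model", type=click.Path(), callback=path_callback_sketch)
@click.option("--unknown-threshold", type=click.FLOAT, default=0.75,
              callback=threshold_callback_sketch)
def demo(model, unknown_threshold):
    """Hypothetical command that only echoes the converted values."""
    click.echo(f"{model!r} {unknown_threshold}")
```

Raising `click.BadParameter` inside a callback lets click report the problem as a usage error tied to the offending option, instead of surfacing a traceback.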
models
-
class rna_cd.models.PredClass
An enumeration.
-
rna_cd.models.plot_pca(searcher: sklearn.model_selection._search.GridSearchCV, arr_X: numpy.ndarray, arr_Y: numpy.ndarray, img_out: pathlib.Path) → None
Plot the PCA with the training samples of the pipeline.
-
rna_cd.models.predict_labels_and_prob(model, bam_files: List[pathlib.Path], chunksize: int = 100, contig: str = 'chrM', cores: int = 1, unknown_threshold: float = 0.75) → List[rna_cd.models.Prediction]
Predict labels and probabilities for a list of BAM files.
Parameters: unknown_threshold – The probability threshold below which samples are considered to be ‘unknown’. Must be between 0.5 and 1.0.
Returns: List of Prediction instances.
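One way to read the unknown_threshold rule is sketched below. The enum members, the `classify` helper, and the use of the larger of the two class probabilities as the confidence are all assumptions for illustration; the documentation does not spell out the members of PredClass or the exact comparison.

```python
from enum import Enum


class SketchClass(Enum):
    """Hypothetical stand-in for the package's PredClass enumeration."""
    POSITIVE = "positive"
    NEGATIVE = "negative"
    UNKNOWN = "unknown"


def classify(prob_positive: float, unknown_threshold: float = 0.75) -> SketchClass:
    """Label a sample from its positive-class probability (binary case)."""
    if not 0.5 <= unknown_threshold <= 1.0:
        raise ValueError("unknown_threshold must be between 0.5 and 1.0")
    # Assumption: confidence is the probability of the more likely class.
    confidence = max(prob_positive, 1.0 - prob_positive)
    if confidence < unknown_threshold:
        return SketchClass.UNKNOWN
    return SketchClass.POSITIVE if prob_positive >= 0.5 else SketchClass.NEGATIVE


print(classify(0.9), classify(0.6), classify(0.1))
```

In a two-class problem the more likely class always has a probability of at least 0.5, which is consistent with 0.5 being the lowest meaningful threshold.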
-
rna_cd.models.train_svm_model(positive_bams: List[pathlib.Path], negative_bams: List[pathlib.Path], chunksize: int = 100, contig: str = 'chrM', cross_validations: int = 3, verbosity: int = 1, cores: int = 1, plot_out: Optional[pathlib.Path] = None) → sklearn.model_selection._search.GridSearchCV
Run SVM training on a list of positive BAM files (i.e. with contamination) and a list of negative BAM files (i.e. without contamination).
For all BAM files, features are collected over one contig. This contig is binned, and for each bin two different coverage metrics are collected, in addition to the softclip rate.
These features are then fed to a sklearn pipeline with three steps:
- A scaling step using StandardScaler.
- A dimensionality reduction step using PCA.
- A classification step using an SVM.
Hyperparameters are tuned using a grid search with cross-validation.
Optionally saves a plot of the top two PCA components with the training samples.
Parameters:
- positive_bams – List of BAM files with contamination.
- negative_bams – List of BAM files without contamination.
- chunksize – The size in bases of each chunk (bin).
- contig – The name of the contig.
- cross_validations – The number of cross-validation folds.
- verbosity – Verbosity parameter of sklearn. Increase to see more messages.
- cores – Number of cores to use for both metric collection and training.
- plot_out – Optional path for the PCA plot.
Returns: GridSearchCV object containing tuned pipeline.
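The three-step pipeline and grid search described for train_svm_model can be sketched with scikit-learn as below. The synthetic data, the step names, and the parameter grid are assumptions chosen for illustration; the grid actually searched by the package is not documented here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for binned coverage/softclip features:
# 30 "negative" and 30 "positive" samples with 12 features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 12)), rng.normal(3.0, 1.0, (30, 12))])
y = np.array([0] * 30 + [1] * 30)

pipeline = Pipeline([
    ("scale", StandardScaler()),  # scaling step
    ("pca", PCA()),               # dimensionality reduction step
    ("svm", SVC()),               # classification step
])

# Hypothetical grid; keys follow sklearn's "<step>__<parameter>" convention.
param_grid = {"pca__n_components": [2, 3], "svm__C": [0.1, 1.0, 10.0]}

searcher = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=1, verbose=0)
searcher.fit(X, y)
print(searcher.best_params_)
```

Here `cv`, `verbose`, and `n_jobs` play the roles that the cross_validations, verbosity, and cores parameters have in the signature above; the fitted `searcher` is a GridSearchCV object wrapping the tuned pipeline.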
utils
-
rna_cd.utils.dir_to_bam_list(path: pathlib.Path) → List[pathlib.Path]
Load a directory containing BAM or CRAM files.
-
rna_cd.utils.echo(msg: str)
Wrapper around click.secho that includes the datetime.
-
rna_cd.utils.load_list_file(path: pathlib.Path) → List[pathlib.Path]
Load a file containing a list of files.
-
rna_cd.utils.load_sklearn_object_from_disk(path: pathlib.Path) → Any
Load a JSON-serialized object from disk.
-
rna_cd.utils.save_sklearn_object_to_disk(obj: Any, path: pathlib.Path)
Save an object with some metadata to disk as serialized JSON.