API documentation

bam_process

rna_cd.bam_process.chop_contig(size: int, chunksize: int) → Iterator[Tuple[int, int]]

For a contig of the given size, generate regions that are at most chunksize long. Indexing is 0-based.
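
The chunking described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the name chop_contig_sketch is hypothetical, and treating the end coordinate as exclusive (half-open regions) is an assumption the documentation does not confirm.

```python
from typing import Iterator, Tuple


def chop_contig_sketch(size: int, chunksize: int) -> Iterator[Tuple[int, int]]:
    """Yield 0-based (start, end) regions of at most `chunksize` bases.

    Assumption: `end` is exclusive, which the source does not state.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    for start in range(0, size, chunksize):
        # The final region is truncated at the end of the contig.
        yield start, min(start + chunksize, size)


regions = list(chop_contig_sketch(size=10, chunksize=4))
print(regions)  # [(0, 4), (4, 8), (8, 10)]
```

Only the last region can be shorter than chunksize, so every base of the contig falls in exactly one region.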

rna_cd.bam_process.softclip_bases(reader: pysam.libcalignmentfile.AlignmentFile, contig: str, region: Tuple[int, int]) → int

Calculate the number of soft-clipped bases in a region
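
As a rough illustration of what counting soft-clipped bases involves, the sketch below sums the lengths of soft-clip operations over CIGAR tuples in the (operation, length) form that pysam exposes as AlignedSegment.cigartuples. It deliberately avoids opening a BAM file: count_softclip_sketch and the example reads are hypothetical, and how the real function fetches reads for a region is not documented here.

```python
from typing import Iterable, List, Tuple

# CIGAR operation code for a soft clip (S) in the SAM/BAM specification;
# pysam reports the same integer in `AlignedSegment.cigartuples`.
SOFT_CLIP = 4

CigarTuples = List[Tuple[int, int]]


def count_softclip_sketch(reads: Iterable[CigarTuples]) -> int:
    """Sum the lengths of all soft-clip operations over the given reads."""
    total = 0
    for cigartuples in reads:
        total += sum(length for op, length in cigartuples if op == SOFT_CLIP)
    return total


# Two hypothetical reads with CIGAR strings 5S95M and 100M.
example_reads = [[(4, 5), (0, 95)], [(0, 100)]]
print(count_softclip_sketch(example_reads))  # 5
```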

rna_cd.bam_process.coverage(reader: pysam.libcalignmentfile.AlignmentFile, contig: str, region: Tuple[int, int], method: Callable = <function mean>) → float

Calculate a summary of the coverage (mean, median, etc.) for a region, as determined by method
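
The role of the method argument can be pictured with the sketch below, which applies an arbitrary aggregation function to per-base depths. It is an assumption-laden illustration: coverage_sketch is a hypothetical name, the depths are a plain list rather than values read through pysam, the region is treated as 0-based and half-open, and the standard-library mean stands in for whichever mean function the package uses by default.

```python
from statistics import mean, median
from typing import Callable, Sequence, Tuple


def coverage_sketch(
    depths: Sequence[float],
    region: Tuple[int, int],
    method: Callable = mean,
) -> float:
    """Aggregate per-base depths inside `region` with `method`."""
    start, end = region
    window = depths[start:end]
    if not window:
        # Assumption: an empty region is reported as zero coverage.
        return 0.0
    return float(method(window))


depths = [0, 2, 4, 6, 8, 100]
mean_cov = coverage_sketch(depths, (0, 4))                   # mean of [0, 2, 4, 6]
median_cov = coverage_sketch(depths, (2, 6), method=median)  # median of [4, 6, 8, 100]
print(mean_cov, median_cov)  # 3.0 7.0
```

Passing the aggregation as a callable lets the same code produce several coverage metrics per bin, and a median is less sensitive to a single high-depth position than a mean.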

rna_cd.bam_process.process_bam(path: pathlib.Path, chunksize: int = 100, contig: str = 'chrM') → numpy.ndarray

Process a BAM file into an ndarray

Returns:

numpy ndarray of shape (n_features,)

rna_cd.bam_process.make_array_set(bam_files: List[pathlib.Path], labels: List[Any], chunksize: int = 100, contig: str = 'chrM', cores: int = 1) → Tuple[numpy.ndarray, numpy.ndarray]

Make a set of numpy arrays corresponding to data and labels, i.e. the train/test X and train/test Y arrays in scikit-learn parlance.

Parameters:
  • bam_files – List of paths to bam files
  • labels – list of labels.
  • cores – number of cores to use for processing
Returns:

tuple of X and Y numpy arrays. X has shape (n_files, n_features). Y has shape (n_files,).
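
The shapes described above can be illustrated with a small sketch that stacks one feature vector per file into X and pairs it with a label array Y. The names make_array_set_sketch and fake_features are hypothetical, the per-file featurizer is a stand-in for real BAM processing, and parallelism over cores is left out.

```python
from typing import Any, Callable, List, Tuple

import numpy as np


def make_array_set_sketch(
    items: List[str],
    labels: List[Any],
    featurize: Callable[[str], np.ndarray],
) -> Tuple[np.ndarray, np.ndarray]:
    """Stack one (n_features,) vector per item into X and wrap labels as Y."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    X = np.vstack([featurize(item) for item in items])  # (n_files, n_features)
    Y = np.asarray(labels)                              # (n_files,)
    return X, Y


def fake_features(name: str) -> np.ndarray:
    # Stand-in for per-file processing: three identical dummy features.
    return np.full(3, float(len(name)))


X, Y = make_array_set_sketch(["a.bam", "bb.bam"], [1, 0], fake_features)
print(X.shape, Y.shape)  # (2, 3) (2,)
```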

cli

rna_cd.cli.directory_callback(ctx, param, value)

Click callback function for getting bam/cram files from a directory.

rna_cd.cli.list_callback(ctx, param, value)

Click callback function for getting bam/cram files from a list file.

rna_cd.cli.path_callback(ctx, param, value)

Generic str to path callback. To be used for click.Path types that ought to return pathlib.Path

rna_cd.cli.unknown_threshold_callback(ctx, param, value)

Click callback function for threshold that has to be between 0.5 and 1.0
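
A range check of this kind might look like the sketch below. It mirrors the (ctx, param, value) signature of a click callback but does not import click, so it raises ValueError where a real callback would more conventionally raise click.BadParameter. The name threshold_callback_sketch is hypothetical, and treating both bounds as inclusive is an assumption.

```python
from typing import Any


def threshold_callback_sketch(ctx: Any, param: Any, value: float) -> float:
    """Return `value` unchanged if it lies in [0.5, 1.0], else raise."""
    if not 0.5 <= value <= 1.0:
        raise ValueError(
            f"threshold must be between 0.5 and 1.0, got {value}"
        )
    return value


# ctx and param are unused in this sketch, so None stands in for both.
print(threshold_callback_sketch(None, None, 0.75))  # 0.75
```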

models

class rna_cd.models.PredClass

An enumeration.

rna_cd.models.plot_pca(searcher: sklearn.model_selection._search.GridSearchCV, arr_X: numpy.ndarray, arr_Y: numpy.ndarray, img_out: pathlib.Path) → None

Plot PCA with training samples of pipeline.

rna_cd.models.predict_labels_and_prob(model, bam_files: List[pathlib.Path], chunksize: int = 100, contig: str = 'chrM', cores: int = 1, unknown_threshold: float = 0.75) → List[rna_cd.models.Prediction]

Predict labels and probabilities for a list of bam files.
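
One plausible reading of the unknown_threshold argument is sketched below: a sample is assigned a class only when the probability of that class reaches the threshold, and is labelled unknown otherwise. The enum members, the name classify_sketch, and the exact comparison rule are assumptions made for illustration; the members of PredClass and the fields of Prediction are not documented here.

```python
from enum import Enum


class PredClassSketch(Enum):
    # Hypothetical members; the real PredClass enumeration may differ.
    positive = "positive"
    negative = "negative"
    unknown = "unknown"


def classify_sketch(
    prob_positive: float, unknown_threshold: float = 0.75
) -> PredClassSketch:
    """Map the probability of the positive class to a label."""
    if prob_positive >= unknown_threshold:
        return PredClassSketch.positive
    if 1.0 - prob_positive >= unknown_threshold:
        return PredClassSketch.negative
    # Neither class is probable enough to be reported.
    return PredClassSketch.unknown


for p in (0.9, 0.6, 0.1):
    print(p, classify_sketch(p).name)
```

Under this reading, a threshold below 0.5 would allow both classes to pass the check for the same sample, which is consistent with the requirement that the value lie between 0.5 and 1.0.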

Parameters:
  • unknown_threshold – The probability threshold below which samples are considered to be ‘unknown’. Must be between 0.5 and 1.0
Returns:

list of Prediction objects

rna_cd.models.train_svm_model(positive_bams: List[pathlib.Path], negative_bams: List[pathlib.Path], chunksize: int = 100, contig: str = 'chrM', cross_validations: int = 3, verbosity: int = 1, cores: int = 1, plot_out: Optional[pathlib.Path] = None) → sklearn.model_selection._search.GridSearchCV

Run SVM training on a list of positive BAM files (i.e. with contamination) and a list of negative BAM files (i.e. without contamination).

For all BAM files, features are collected over one contig. This contig is binned, and for each bin two different coverage metrics are collected, in addition to the softclip rate.

These features are then fed to a sklearn pipeline with three steps:

  1. A scaling step using StandardScaler
  2. A dimensionality reduction step using PCA.
  3. A classification step using an SVM.

Hyperparameters are tuned using a grid search with cross-validation.

Optionally saves a plot of the top two PCA components with the training samples.
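
The three-step pipeline and grid search described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the package's code: build_search_sketch is a hypothetical name, the step names and the parameter grid are invented for the example, and the synthetic data stands in for features collected from BAM files.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_search_sketch(
    cross_validations: int = 3, cores: int = 1, verbosity: int = 0
) -> GridSearchCV:
    """Build a scale -> PCA -> SVM pipeline wrapped in a grid search."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),  # 1. scaling
        ("pca", PCA()),               # 2. dimensionality reduction
        ("svm", SVC()),               # 3. classification
    ])
    # Hypothetical grid; the values rna_cd actually searches are not given here.
    param_grid = {
        "pca__n_components": [2, 3],
        "svm__C": [0.1, 1.0, 10.0],
        "svm__kernel": ["linear", "rbf"],
    }
    return GridSearchCV(
        pipeline, param_grid, cv=cross_validations, n_jobs=cores, verbose=verbosity
    )


# Synthetic stand-in for the feature matrix: 20 "positive" and 20 "negative" samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (20, 6)), rng.normal(-2.0, 1.0, (20, 6))])
y = np.array([1] * 20 + [0] * 20)

searcher = build_search_sketch().fit(X, y)
print(searcher.best_params_)
```

Wrapping all three steps in one Pipeline means the scaler and PCA are refitted on each training fold, so the cross-validation scores are not influenced by the held-out samples.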

Parameters:
  • positive_bams – List of BAM files with contaminations
  • negative_bams – List of BAM files without contaminations.
  • chunksize – The size in bases for each chunk (bin)
  • contig – The name of the contig.
  • cross_validations – The number of cross-validations
  • verbosity – Verbosity parameter of sklearn. Increase to see more messages.
  • cores – Number of cores to use for both metric collection and training.
  • plot_out – Optional path for PCA plot.
Returns:

GridSearchCV object containing tuned pipeline.

utils

rna_cd.utils.dir_to_bam_list(path: pathlib.Path) → List[pathlib.Path]

Load a directory containing bam or cram files
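
A directory listing of this kind could be sketched with pathlib as shown below. The name dir_to_bam_list_sketch is hypothetical, and the choices made here (no recursion into subdirectories, matching on the .bam and .cram suffixes, sorted output) are assumptions rather than documented behaviour.

```python
import tempfile
from pathlib import Path
from typing import List


def dir_to_bam_list_sketch(path: Path) -> List[Path]:
    """Return the .bam and .cram files directly inside `path`, sorted."""
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix in {".bam", ".cram"}
    )


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.bam", "b.cram", "a.bam.bai", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in dir_to_bam_list_sketch(Path(tmp))]

print(found)  # ['a.bam', 'b.cram']
```

Matching on the final suffix keeps index files such as a.bam.bai out of the result.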

rna_cd.utils.echo(msg: str)

Wrapper around click.secho to include datetime

rna_cd.utils.load_list_file(path: pathlib.Path) → List[pathlib.Path]

Load a file containing a list of files
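
Reading such a list file might look like the sketch below. The layout assumed here, one path per line with blank lines ignored, is a guess at the format, and load_list_file_sketch is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import List


def load_list_file_sketch(path: Path) -> List[Path]:
    """Read one path per line, skipping blank lines."""
    with path.open() as handle:
        return [Path(line.strip()) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "samples.txt"
    list_file.write_text("a.bam\n\nb.cram\n")
    loaded = load_list_file_sketch(list_file)

print([str(p) for p in loaded])  # ['a.bam', 'b.cram']
```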

rna_cd.utils.load_sklearn_object_from_disk(path: pathlib.Path) → Any

Load a JSON-serialized object from disk

rna_cd.utils.save_sklearn_object_to_disk(obj: Any, path: pathlib.Path)

Save an object with some metadata to disk as serialized JSON