Package 'swissknife'

Title:	Handy code shared in the FMI CompBio group
Description:	A collection of useful R functions performing various tasks that might be re-usable and worth sharing.
Authors:	Michael Stadler [aut, cre], Charlotte Soneson [aut], Panagiotis Papasaikas [aut], Dania Machlab [aut], Fiona Ross [aut], Friedrich Miescher Institute for Biomedical Research [cph]
Maintainer:	Michael Stadler
License:	GPL (>= 3)
Version:	0.42
Built:	2025-02-05 05:59:27 UTC
Source:	https://github.com/fmicompbio/swissknife

Help Index

swissknife - handy code shared in the FMI CompBio group
Copy utility functions to package
Annotate regions.
Count frequency of differences between values in integer vectors.
Calculate phasograms (same strand alignment distances).
Get hex color
Estimate the nucleosome repeat length (NRL) from a phasogram.
Get regions tiling a genome.
Tabulate insert sizes from paired-end alignments in bam files.
Get mappable regions of a genome.
Assign labels to cells using known marker genes
Access example data
Calculate normalized expression of a gene set
Parse R and R package versions from session information
Create a bitmap-rendered plot.
Plot gene region
Plot annotated phasogram.
Plot Selected Variable Genes
Prepare GTF file for use with plotGeneRegion
Read sample tsv files from seqdata storage
Sample control elements that match a target distribution.
Select Variable Genes in Single Cell RNA-seq
Calculate gene-expression specificity scores.
Map numerical values to colors.
Aggregate different rows assigned to the same ID by calculating a weighted mean

swissknife - handy code shared in the FMI CompBio group

Description

swissknife is a collection of useful R functions performing various tasks that might be re-usable and worth sharing.

Author(s)

Maintainer: Michael Stadler

Authors:

Charlotte Soneson
Panagiotis Papasaikas
Dania Machlab
Fiona Ross

Other contributors:

Friedrich Miescher Institute for Biomedical Research [copyright holder]

Copy utility functions to package

Description

This function copies handy utility functions to a new script in a specified location. Currently, the script contains the following utility functions:

.assertScalar() - convenience function to check the validity of scalar variables.
.assertVector() - convenience function to check the validity of vector variables.

Usage

addUtilsFunctions(outFile = "R/utils.R", copyTests = TRUE)

Arguments

`outFile`	Character scalar, giving the path to which the script should be copied. The path is relative to the root of the active project. If a file with this name already exists, the function will ask for confirmation before overwriting it.
`copyTests`	Logical scalar, defining whether to copy unit tests for the utility functions to `tests/testthat/test-<name>.R`, where `<name>` is the base name of `outFile`. If the target package is not yet set up to use `testthat`, the function will also run `usethis::use_testthat()` to generate the required folder structure and add `testthat` to the list of suggested packages in the DESCRIPTION file.

Author(s)

Charlotte Soneson

Annotate regions.

Description

Annotate a GRanges object with sets of reference GRanges or GRangesList objects, with respect to overlaps and nearest neighbors.

Usage

annotateRegions(
  x,
  hasOverlap = list(),
  fracOverlap = list(),
  numOverlap = list(),
  nearest = list(),
  ignore.strand = TRUE
)

Arguments

`x`	The `GRanges` object to annotate.
`hasOverlap`	Named `list` with `GRanges` or `GRangesList` object(s). For each list element, a logical vector "X.hasOverlap" will be added to the `mcols` of the result, with `TRUE` for each region in `x` that overlaps any region in that element. "X" is obtained from `names(hasOverlap)`.
`fracOverlap`	Named `list` with `GRanges` or `GRangesList` object(s). For each list element, a numeric vector "X.fracOverlap" will be added to the `mcols` of the result, with a value between 0 and 1 giving the fraction of bases in a region of `x` that overlap any region in that element. "X" is obtained from `names(fracOverlap)`.
`numOverlap`	Named `list` with `GRanges` or `GRangesList` object(s). For each list element, two numeric vectors "X.numOverlapWithin" and "X.numOverlapAny" will be added to the `mcols` of the result, giving the number of ranges in that element that are fully contained within a region of `x`, or that overlap it in any way, respectively. "X" is obtained from `names(numOverlap)`.
`nearest`	Named `list` with `GRanges` or `GRangesList` object(s). For each list element, two vectors "X.nearestName" and "X.nearestDistance" will be added to the `mcols` of the result, giving the name and distance of the nearest range in that element for each region in `x`. "X" is obtained from `names(nearest)`, and the values of "X.nearestName" from `names(nearest$X)`. If multiple nearest ranges are at the same distance from a region, an arbitrary one is reported in "X.nearestName".
`ignore.strand`	Logical scalar passed to `findOverlaps` when searching for overlaps between `x` and reference regions.

Value

A GRanges similar to x, with annotations added to its metadata columns (mcols).
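
The meaning of the "X.fracOverlap" annotation can be illustrated without Bioconductor. The following is a minimal base R sketch, not the package's implementation: it assumes a single reference range per list element and plain start/end vectors, and the helper name frac_overlap_sketch is hypothetical.

```r
# Hypothetical helper: fraction of bases of each query region covered by ONE
# reference range (inclusive, 1-based coordinates as used by IRanges).
frac_overlap_sketch <- function(qstart, qend, rstart, rend) {
    # width of the intersection, or 0 if the ranges do not overlap
    ov <- pmax(0, pmin(qend, rend) - pmax(qstart, rstart) + 1)
    ov / (qend - qstart + 1)
}

# Same coordinates as `x` (1-10 and 12-21) and `blacklist` (20-24)
# in the Examples section of this help page
frac <- frac_overlap_sketch(qstart = c(1, 12), qend = c(10, 21),
                            rstart = 20, rend = 24)
print(frac)
```

With several reference ranges, overlapping ranges would first have to be merged so that no base is counted twice.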

Author(s)

Michael Stadler

Examples

library(GenomicRanges)

x <- GRanges("chr1", IRanges(c(1, 12), width = 10))
tss <- GRanges("chr1", IRanges(c(1, 10, 30), width = 1,
                               names = paste0("t", 1:3)))
blacklist <- GRanges("chr1", IRanges(20, width = 5))
annotateRegions(x, hasOverlap = list(Blacklist = blacklist),
                fracOverlap = list(Blacklist = blacklist),
                numOverlap = list(TSS = tss),
                nearest = list(TSS = tss))

Count frequency of differences between values in integer vectors.

Description

Given two ascendingly sorted integer vectors query and reference, calculate and count the differences between their elements that are greater than zero and less than maxd. The number of observed distances d is reported in cnt[d], and maxd corresponds to length(cnt). The function is called by calcPhasogram, which provides a higher-level, more convenient interface.

Usage

calcAndCountDist(query, reference, cnt)

Arguments

`query`	first `integer` vector.
`reference`	second `integer` vector. Distances are calculated from each element in `query` to each greater element in `reference`.
`cnt`	`NumericVector` to store the result in. The length of `cnt` defines the maximal distance that will be included in the analysis, and new counts will be added to the values of `cnt`.

Value

numeric vector cnt, where cnt[d] corresponds to the number of observed distances d.
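
The counting logic can be sketched in plain R. This is an illustration only, not the C++ code used by the package; the function name count_dist_sketch is hypothetical, and whether a distance exactly equal to length(cnt) is counted is an assumption here (this sketch includes it).

```r
# Hypothetical pure-R illustration of counting positive distances from each
# element of `query` to each greater element of `reference`.
count_dist_sketch <- function(query, reference, cnt) {
    maxd <- length(cnt)
    for (q in query) {
        d <- reference - q
        # keep positive distances that fit into cnt (boundary is an assumption)
        d <- d[d > 0 & d <= maxd]
        # add the new counts to the existing values of cnt
        cnt <- cnt + tabulate(d, nbins = maxd)
    }
    cnt
}

query <- c(1L, 5L, 10L)
reference <- c(3L, 6L, 12L)
res <- count_dist_sketch(query, reference, numeric(12))
which(res > 0)  # distances that were observed at least once
```

Because new counts are added to the values already in cnt, the result of one call can be passed on to the next, for example to accumulate counts over several chromosomes.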

Author(s)

Michael Stadler

Calculate phasograms (same strand alignment distances).

Description

Calculate the frequencies of same strand alignment distances, for example from MNase-seq data to estimate nucleosome repeat length. Distance calculations are implemented in C++ (calcAndCountDist) for efficiency.

Usage

calcPhasogram(fname, regions = NULL, rmdup = TRUE, dmax = 3000L)

Arguments

`fname`	`character` vector with one or several bam files. If multiple files are given, distance counts from all will be summed.
`regions`	`GRanges` object. Only alignments falling into these regions will be used. If `NULL` (the default), all alignments are used.
`rmdup`	`logical(1)` indicating if duplicates should be removed. If `TRUE` (the default), only one of several alignments starting at the same coordinate is used.
`dmax`	`numeric(1)` specifying the maximal distance between same strand alignments to count.

Value

integer vector with dmax elements, with the element at position d giving the observed number of alignment pairs at that distance.

Author(s)

Michael Stadler

References

Phasograms were originally described in Valouev et al., Nature 2011 (doi:10.1038/nature10002). The implementation here differs in two ways from the original algorithms:

It does not implement the removal of positions that have been seen fewer than n times (referred to as an n-pile subset in the paper).
It allows retaining only alignments that fall into selected genomic intervals (regions argument).

Examples

if (requireNamespace("GenomicAlignments", quietly = TRUE) &&
    requireNamespace("Rsamtools", quietly = TRUE)) {
    bamf <- system.file("extdata", "phasograms", "mnase_mm10.bam",
                        package = "swissknife")
    pg <- calcPhasogram(bamf)
    print(estimateNRL(pg, usePeaks = 1:4)[1:2])
    plotPhasogram(pg, usePeaks = 1:4, xlim = c(0,1000))
}

Get hex color

Description

The function returns a color in hex form given a valid name of a color in R.

Usage

col2hex(col, alpha = 255)

Arguments

`col`	a `character`, `integer` or vector of both types containing the names of the colors or colors as integers.
`alpha`	a numerical value in the range [0,1] or [0,255] that indicates the transparency of the color(s). If the given values are between 0 and 1, they are mapped to the range [0,255]. An alpha value of 1 is interpreted on the [0,1] scale and therefore gives a fully opaque color. The default is 255.

Value

a character or character vector with the hex colors.
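
For readers who want to see how such a conversion can be done with base R alone, here is a minimal sketch built on grDevices::col2rgb() and grDevices::rgb(). It is not the package's code: the function name col2hex_sketch is hypothetical, and the rule "rescale when all alpha values lie in [0,1]" is an assumption about how the two alpha ranges are told apart.

```r
# Hypothetical re-implementation sketch using only grDevices
col2hex_sketch <- function(col, alpha = 255) {
    # assumption: alpha values that all lie in [0,1] are on the 0-1 scale
    if (all(alpha >= 0 & alpha <= 1)) {
        alpha <- alpha * 255
    }
    m <- grDevices::col2rgb(col)  # 3 x length(col) matrix of red/green/blue
    unname(grDevices::rgb(m[1, ], m[2, ], m[3, ],
                          alpha = alpha, maxColorValue = 255))
}

opaque_red <- col2hex_sketch("red", 1)
clear_blue <- col2hex_sketch("blue", 0)
print(c(opaque_red, clear_blue))
```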

Author(s)

Dania Machlab

Examples

y <- rnorm(1000,0,1)
cols <- rep("red", length(y))
alpha <- seq(0,1,length.out=length(y))
hexcols <- col2hex(cols, alpha)
plot(1:length(y), y, bg=hexcols, pch=21)

y <- rnorm(1000,0,1)
cols <- rep("red", length(y))
alpha <- seq(0,255,length.out=length(y))
hexcols <- col2hex(cols, alpha)
plot(1:length(y), y, bg=hexcols, pch=21)

Estimate the nucleosome repeat length (NRL) from a phasogram.

Description

Estimate the nucleosome repeat length (NRL) from the frequencies of same-strand alignment distances (phasogram), e.g. generated by calcPhasogram. The NRL is obtained from the slope of a linear fit to the modes in the phasogram.

Usage

estimateNRL(
  x,
  mind = 140L,
  usePeaks = 1:8,
  span1 = 100/length(x),
  span2 = 1500/length(x)
)

Arguments

`x`	`numeric` vector giving the counts of alignment distances (typically the output of `calcPhasogram`).
`mind`	`integer(1)` specifying the minimal distance to be used for NRL estimation. The default value (140) ignores any distance too short to span at least a single nucleosome.
`usePeaks`	`integer` vector selecting the modes (peaks) in the phasogram used in NRL estimation.
`span1`	`numeric(1)` giving the smoothing parameter for de-trending loess fit (high pass filter).
`span2`	`numeric(1)` giving the smoothing parameter for de-noising loess fit (low pass filter).

Value

A list with elements:

nrl: the estimated nucleosome repeat length
nrl.CI95: the 95% confidence interval
xs: smoothed (de-trended) phasogram
loessfit: the de-noising fit to the de-trended phasogram
lmfit: the linear fit to the phasogram peaks
peaks: the peak locations
mind: minimal distance included in the fit
span1: smoothing parameter for de-trending loess fit
span2: smoothing parameter for de-noising loess fit
usePeaks: the peaks used in the fit
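
The final step, obtaining the NRL as the slope of a linear fit to the peak positions, can be sketched with stats::lm(). The peak positions below are made-up numbers for illustration, and the de-trending and de-noising loess fits that locate the peaks are not reproduced here.

```r
# Hypothetical peak positions (in bp) for phasogram modes 1 to 4
peak_index <- 1:4
peak_pos <- c(190, 381, 569, 760)

# Linear fit of peak position against peak number; the slope estimates the NRL
fit <- lm(peak_pos ~ peak_index)
nrl <- unname(coef(fit)["peak_index"])
ci <- confint(fit, "peak_index", level = 0.95)

print(nrl)
print(ci)
```

Using the slope rather than the position of the first peak means that a constant offset shared by all peaks ends up in the intercept and does not bias the estimate.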

Author(s)

Michael Stadler

Examples

  # see the help for calcPhasogram() for a full example

Get regions tiling a genome.

Description

Get sequential, potentially annotated regions of a fixed length (tiles) along the chromosomes of a genome.

Usage

getGenomicTiles(
  genome,
  tileWidth,
  hasOverlap = list(),
  fracOverlap = list(),
  numOverlap = list(),
  nearest = list(),
  addSeqComp = TRUE
)

Arguments

`genome`	The genome to work on. Either a `BSgenome` object, a `character` scalar with the name of an installed `BSgenome` or with a file path and name pointing to a fasta file with the genome sequence, or a named `numeric` vector giving the names and lengths of chromosomes.
`tileWidth`	`numeric` scalar with the tile length.
`hasOverlap`, `fracOverlap`, `numOverlap`, `nearest`	Named `list`s with `GRanges` or `GRangesList` object(s) used to annotate genomic tiles. See `annotateRegions` for details.
`addSeqComp`	`logical` scalar. If `TRUE` and primary sequence can be obtained from `genome`, also add sequence composition features for each tile to the annotations. Currently, the following features are included: percent of G+C bases ("percGC"), CpG observed-over-expected ratio ("CpGoe").

Details

The last tile in each chromosome is dropped if it would be shorter than tileWidth. Generated tiles are unstranded (*), and therefore overlap calculations and nearest-neighbor searches ignore the strand of annotations (ignore.strand=TRUE).
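
The tiling rule can be sketched in base R for the case where `genome` is a named vector of chromosome lengths. This is an illustration only: the helper tile_sketch is hypothetical and returns a plain data.frame rather than a GRanges object.

```r
# Hypothetical helper: fixed-width tiles per chromosome, dropping an
# incomplete last tile
tile_sketch <- function(chrlen, tileWidth) {
    do.call(rbind, lapply(names(chrlen), function(chr) {
        n <- chrlen[[chr]] %/% tileWidth           # number of complete tiles
        start <- (seq_len(n) - 1) * tileWidth + 1  # 1-based tile starts
        data.frame(seqnames = rep(chr, n),
                   start = start,
                   end = start + tileWidth - 1)
    }))
}

# Same chromosome lengths and tile width as in the Examples section
tiles <- tile_sketch(c(chr1 = 45, chr2 = 12), tileWidth = 10)
print(tiles)
```

In this sketch, chr1 yields four tiles ending at position 40 and chr2 a single tile; the remaining 5 and 2 bases are not covered by any tile.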

Value

A GRanges object with genome tiling regions. Optional tile annotations are contained in its metadata columns (mcols).

Author(s)

Michael Stadler

Examples

library(GenomicRanges)

tss <- GRanges("chr1", IRanges(c(1, 10, 30), width = 1,
                               names = paste0("t", 1:3)))
blacklist <- GRanges("chr1", IRanges(20, width = 5))
getGenomicTiles(c(chr1 = 45, chr2 = 12), tileWidth = 10,
                hasOverlap = list(Blacklist = blacklist),
                fracOverlap = list(Blacklist = blacklist),
                numOverlap = list(TSS = tss),
                nearest = list(TSS = tss))

Tabulate insert sizes from paired-end alignments in bam files.

Description

Read and tabulate the insert sizes from paired-end alignments contained in one or several bam files. By default, all properly aligned read pairs are included. Optionally, alignments can be restricted to those in a specific genomic region (regions argument) or the number of alignments read can be limited (nmax argument).

Usage

getInsertSizeDistFromBam(
  fname,
  regions = NULL,
  nmax = NA_integer_,
  isizemax = 800,
  exclude = c("chrM", "chrY", "chrX")
)

Arguments

`fname`	`character` vector with paths to one or several bam files. If multiple files are given, insert sizes from all will be pooled and tabulated together.
`regions`	`GRanges` object. Only alignments falling into these regions will be used. If `NULL` (the default), all alignments are used.
`nmax`	`numeric(1)` specifying the maximal number of alignments to read. If `NA` (the default), all alignments in `regions` (if `regions` is not `NULL`) or in the bam file will be used.
`isizemax`	`numeric(1)` specifying the maximal insert size to report. Larger insert sizes will be set to `isizemax`, and their number will be reported.
`exclude`	`character` vector with chromosome names to be excluded. Alignments on these chromosomes will be excluded. `exclude` will be ignored if `regions` is not `NULL`.

Value

integer vector with the number of insert sizes. The element at position i gives the observed number of alignment pairs with an insert size of i. The number of insert sizes greater than isizemax that were set to isizemax is reported in the attribute "ncapped".
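
The shape of this return value can be illustrated with a few made-up insert sizes. The sketch below only mimics the capping and tabulation described above in base R; it does not read bam files and is not the package's implementation.

```r
# Hypothetical insert sizes (bp); two of them exceed isizemax
isize <- c(50L, 120L, 120L, 300L, 950L, 1200L)
isizemax <- 800L

# number of values that will be capped
ncapped <- sum(isize > isizemax)

# cap at isizemax, then count how often each size 1..isizemax occurs
tab <- tabulate(pmin(isize, isizemax), nbins = isizemax)
attr(tab, "ncapped") <- ncapped

tab[c(120, 800)]
attr(tab, "ncapped")
```

Since capped values pile up in the last element, a spike at position isizemax in a plot usually reflects capping rather than a true mode in the distribution.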

Author(s)

Michael Stadler

Examples

if (requireNamespace("Rsamtools", quietly = TRUE)) {
    bamf <- system.file("extdata", "getInsertSizeDistFromBam", "atac_mm10.bam",
                        package = "swissknife")
    isize <- getInsertSizeDistFromBam(bamf)
    attr(isize, "ncapped")
    plot(isize, type = "l",
         xlab = "Insert size (bp)", ylab = "Number of fragments")
}

Get mappable regions of a genome.

Description

Given a k-mer length and the maximum number of allowed hits per k-mer, find all mappable regions in a genome.

Usage

getMappableRegions(
  genome,
  genomeIndex,
  kmerLength = 50,
  maxHits = 1,
  Ncpu = 2,
  quiet = TRUE
)

Arguments

`genome`	The genome sequence to work on. Either a `BSgenome` object, a `character` scalar with the name of an installed `BSgenome` or with a file path and name pointing to a fasta file with the genome sequence.
`genomeIndex`	`character` scalar with the path to the bowtie index and prefix to align against, in the form `</path/to/index>/<prefix>`, or the name of an installed `Rbowtie` index package created by the QuasR package for an installed `BSgenome` package.
`kmerLength`	`numeric` scalar specifying the k-mer length (width of overlapping windows in `genome`), usually set to the typical read length for which to get the mappable regions.
`maxHits`	`numeric` scalar specifying the maximum number of hits (matches) of a k-mer in the `genome` to be considered mappable.
`Ncpu`	`numeric` scalar specifying the number of CPU threads to use for alignments.
`quiet`	`logical` scalar indicating if progress information should be printed on the console.

Details

Sequences of all overlapping windows are extracted from the genome and aligned to the provided genome index using bowtie with parameters -f -v 0 -a -B 1 -m maxHits. If no more than maxHits hits are found, the window is defined as mappable.
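
The idea of overlapping windows and the maxHits criterion can be shown on a toy sequence. The sketch below is a simplification and not what the function does internally: it replaces the bowtie alignment with exact string matching on the forward strand only, so reverse-complement hits are ignored. The sequence and all variable names are made up.

```r
# Toy "genome" and k-mer length (both hypothetical)
genome <- "ACGTACGTTTGCA"
k <- 4
maxHits <- 1

# all overlapping windows of length k, one per start position
starts <- seq_len(nchar(genome) - k + 1)
kmers <- substring(genome, starts, starts + k - 1)

# number of exact occurrences of each window's sequence in the genome
hits <- as.vector(table(kmers)[kmers])

# a window start is mappable if its k-mer occurs at most maxHits times
mappable <- hits <= maxHits
data.frame(start = starts, kmer = kmers, hits = hits, mappable = mappable)
```

Here the k-mer ACGT occurs at start positions 1 and 5, so these two windows are not mappable with maxHits = 1, while all other windows are.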

Value

A GRanges object with mappable regions. All plus-strand sequences in genome of length kmerLength with their start (leftmost) position overlapping the GRanges object do not generate more than maxHits hits when aligned to the genome.

Author(s)

Michael Stadler

Examples

if (requireNamespace("Rbowtie", quietly = TRUE)) {
    library(Rbowtie)

    genomefile <- system.file("extdata", "getMappableRegions", "hg19sub.fa", package = "swissknife")
    indexdir <- tempfile()
    indexpre <- "index"
    indexname <- file.path(indexdir, indexpre)
    idx <- bowtie_build(genomefile, indexdir)

    mapgr <- getMappableRegions(genomefile, indexname, 50, quiet = FALSE)
    print(mapgr)
}

Assign labels to cells using known marker genes

Description

Given marker gene sets for cell types, identify cells with high expression of the marker genes (positive examples), then use these cells to create a reference transcriptome profile for each cell type and identify additional cells of each type using SingleR. These marker genes should be specifically expressed in a single cell type; for example, CD3, which is expressed by all T cell subtypes, would not be suitable for identifying specific T cell subtypes.

Usage

labelCells(
  sce,
  markergenes,
  fraction_topscoring = 0.01,
  expr_values = "logcounts",
  normGenesetExpressionParams = list(R = 200),
  aggregateReferenceParams = list(power = 0.5),
  SingleRParams = list(),
  BPPARAM = SerialParam()
)

Arguments

`sce`	`SingleCellExperiment` object.
`markergenes`	Named `list` of `character` vectors with the marker genes for each cell type. The marker genes must be a subset of `rownames(sce)`.
`fraction_topscoring`	`numeric` vector of length 1 or the same length as `markergenes` giving the fraction(s) of top scoring cells for each cell type to pick to create the reference transcriptome profile.
`expr_values`	Integer scalar or string indicating which assay of `sce` contains the expression values.
`normGenesetExpressionParams`	`list` with additional parameters for `normGenesetExpression`.
`aggregateReferenceParams`	`list` with additional parameters for `aggregateReference`.
`SingleRParams`	`list` with additional parameters for `SingleR`.
`BPPARAM`	An optional `BiocParallelParam` instance determining the parallel back-end to be used during evaluation.

Value

A list of three elements named cells, refs and labels. cells contains a list with the numerical indices of the top scoring cells for each cell type. refs contains the pseudo-bulk transcriptome profiles used as a reference for label assignment, as returned by aggregateReference. labels contains a DataFrame with the annotation statistics for each cell (one cell per row), generated by SingleR.

Author(s)

Michael Stadler

Examples

if (requireNamespace("SingleR", quietly = TRUE) &&
    requireNamespace("SingleCellExperiment", quietly = TRUE)) {
    
    # create SingleCellExperiment with cell-type specific genes
    library(SingleCellExperiment)
    n_types <- 3
    n_per_type <- 30
    n_cells <- n_types * n_per_type
    n_genes <- 500
    fraction_specific <- 0.1
    n_specific <- round(n_genes * fraction_specific)

    set.seed(42)
    mu <- ceiling(runif(n = n_genes, min = 0, max = 30))
    u <- do.call(rbind, lapply(mu, function(x) rpois(n_cells, lambda = x)))
    rownames(u) <- paste0("g", seq.int(nrow(u)))
    celltype.labels <- rep(paste0("t", seq.int(n_types)), each = n_per_type)
    celltype.genes <- split(sample(rownames(u), size = n_types * n_specific),
                            rep(paste0("t", seq.int(n_types)), each = n_specific))
    for (i in seq_along(celltype.genes)) {
        j <- celltype.genes[[i]]
        k <- celltype.labels == paste0("t", i)
        u[j, k] <- 2 * u[j, k]
    }
    v <- log2(u + 1)
    sce <- SingleCellExperiment(assays=list(counts=u, logcounts=v))

    # define marker genes (subset of true cell-type-specific genes)
    marker.genes <- lapply(celltype.genes, "[", 1:5)
    marker.genes

    # predict cell types
    res <- labelCells(sce, marker.genes,
                      fraction_topscoring = 0.1,
                      normGenesetExpressionParams = list(R = 50))

    # high-scoring cells used as references for each celltype
    res$cells

    # ... from these, pseudo-bulks were created:
    res$refs

    # ... and used to predict labels for all cells
    res$labels$pruned.labels

    # compare predicted to true cell types
    table(true = celltype.labels, predicted = res$labels$pruned.labels)
}
      
Access example data

Description

Make example data available, typically for use in teaching.

Usage

loadExampleData(name = "list", envir = globalenv(), verbose = TRUE)

Arguments

`name`	An optional character scalar specifying the data set(s) to be made available. The special name `"list"` (default) is used to print a data frame of available data sets with descriptions. The special name `"latest"` will select the latest data set(s) available.
`envir`	specifies the environment in which the data should be made available. By default, `envir = globalenv()`, which creates the example data objects in the user workspace. A possible alternative is, for example, `parent.frame()`, which is the environment in which `loadExampleData()` was called.
`verbose`	A logical scalar. If `TRUE`, report what is being selected and made available.

Value

A data.frame (invisibly) with one row for each dataset that was made available in the global environment.

Author(s)

Michael Stadler

Examples

loadExampleData()
loadExampleData("mycars")

Calculate normalized expression of a gene set

Description

Calculate normalized expression for a set of genes in each cell from a SingleCellExperiment, using random sets of similarly expressed genes as background to account for cell quality and sequencing depth.

Usage

normGenesetExpression(
  sce,
  genes,
  expr_values = "logcounts",
  subset.row = NULL,
  R = 200,
  nbins = 100,
  BPPARAM = SerialParam()
)

Arguments

`sce`	`SingleCellExperiment` object.
`genes`	`character` vector with the genes in the set. Must be a subset of `rownames(sce)`.
`expr_values`	Integer scalar or string indicating which assay of `sce` contains the expression values.
`subset.row`	Sample random genes only from these. If `NULL` (the default), the function will sample from all genes in `sce`. Alternatively, `subset.row` can be a logical, integer or character vector indicating the rows (genes) of `sce` to use for sampling. This allows for example to exclude highly variable genes from the sampling which are likely expressed only in certain cell types.
`R`	Integer scalar giving the number of random gene sets to sample for normalization.
`nbins`	Integer scalar, specifying the number of bins to group the average expression levels into before sampling (passed to `sampleControlElements`). Higher numbers of bins will increase the match to the target distribution(s), but may fail if there are few elements to sample from.
`BPPARAM`	An optional `BiocParallelParam` instance determining the parallel back-end to be used during evaluation.

Value

A numeric vector with normalized gene set scores for each cell in sce.

Author(s)

Michael Stadler

Examples

if (require(SingleCellExperiment)) {
    # get sce
    example(SingleCellExperiment, echo=FALSE)
    rownames(sce) <- paste0("g", seq.int(nrow(sce)))
    
    # calculate gene set expression scores
    markers <- c("g1", "g13", "g27")
    scores <- normGenesetExpression(sce, markers, R = 50)
    
    # compare expression of marker genes with scores
    plotdat <- cbind(scores, t(logcounts(sce)[markers, ]))
    cor(plotdat)
    pairs(plotdat)
}

Parse R and R package versions from session information

Description

The function parses the R version and R package versions from session information (created by sessionInfo(), tested with R 3.6) in files provided in infiles. Two types of files are currently supported:

Rout: Files containing R console output (created by R CMD BATCH script.R output.Rout).
md: Files containing markdown output created by rmarkdown::render('input.Rmd', clean = FALSE), which keeps the intermediate .md file.

Usage

parsePkgVersions(infiles)

Arguments

`infiles`	Character vector with text files (extension must be either .Rout or .md), containing session information to parse out.

Value

A list of lists with one element in the outer list for each R version, containing an inner list with elements files and packages.
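
To give an idea of what parsing session information involves, the following base R sketch extracts the R version and the "name_version" package tokens from a few lines typical of sessionInfo() output. The regular expressions and variable names are assumptions made for illustration; the package's own parsing may work differently.

```r
# A few hypothetical lines as they might appear in printed sessionInfo() output
lines <- c("R version 3.6.1 (2019-07-05)",
           "locale:",
           "[1] LC_COLLATE=en_US.UTF-8",
           "other attached packages:",
           "[1] ggplot2_3.2.1 dplyr_0.8.3")

# R version: the number following "R version"
rver <- sub("^R version ([0-9.]+).*$", "\\1",
            grep("^R version", lines, value = TRUE))

# candidate tokens: contents of lines starting with an index such as "[1]"
idx <- grep("^[[:space:]]*\\[[0-9]+\\]", lines, value = TRUE)
tok <- unlist(strsplit(sub("^[[:space:]]*\\[[0-9]+\\][[:space:]]*", "", idx),
                       "[[:space:]]+"))

# keep tokens of the form <package>_<version>, which drops locale entries
pkgs <- tok[grepl("^[A-Za-z][A-Za-z0-9.]*_[0-9][0-9.-]*$", tok)]

print(rver)
print(pkgs)
```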

Author(s)

Michael Stadler

Examples

f <- list.files(system.file("extdata", "parsePkgVersions",
                            package = "swissknife"),
                full.names = TRUE)
parsePkgVersions(f)

Create a bitmap-rendered plot.

Description

plotBitScatter is a wrapper around plot which renders the plot area as a bitmap (png), but keeps all other elements (axes, labels, etc.) as vector elements. This is especially useful for keeping the size of PDF files with scatter plots with many elements small, while retaining editability of axes.

Usage

plotBitScatter(
  x,
  y = NULL,
  ...,
  densCols = TRUE,
  colpal = c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red",
    "#7F0000"),
  xpixels = 1000,
  ypixels = NULL,
  pointsize = NULL
)

Arguments

`x`	`numeric` vector with x-coordinates of points, or a two-column matrix with x- and y-coordinates.
`y`	`numeric` vector with y-coordinates of points (same length as `x`). Can be `NULL`, in which case `x` must be a two-column matrix.
`...`	any further arguments to be passed to `plot`
`densCols`	`logical(1)`. If `TRUE` and `col` is not given as an additional argument, then the local density of points will be used as colors, using the palette spanned by the colors in `colpal`.
`colpal`	vector of colors defining the palette for automatic density-based coloring.
`xpixels`	the number of pixels in the x dimension used for rendering the plotting area.
`ypixels`	the number of pixels in the y dimension used for rendering the plotting area. If `NULL` (the default), it is calculated automatically as `xpixels * par('pin')[2] / par('pin')[1]`, so that the aspect ratio of the current plotting region is preserved. The automatic calculation may fail (e.g. when using `layout()`), because `par('pin')` can then return negative values. In that case, set `ypixels` manually using this argument.
`pointsize`	the size of points used for the png device when rendering the plot. If `NULL` (the default), it is calculated automatically as `12 / graphics::grconvertX(par("pin")[1], from = "inches", to = "device") * xpixels`. The automatic calculation may fail (e.g. when using `layout()`), because `par('pin')` can then return negative values. In that case, set `pointsize` manually using this argument.

Details

xpixels controls the resolution of the rendered plotting area. In order to keep circular plotting symbols circular (e.g. pch = 1), ypixels is automatically calculated using xpixels and the aspect ratio of the current plotting area. If the plotting device is rescaled after calling plotBitScatter, circular plotting symbols may become skewed.
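
The automatic choice of ypixels can be reproduced by hand, which is useful when it has to be set manually. The sketch below applies the formula given for the ypixels argument on a PDF device that writes no file. The page size and the fallback to a square bitmap are assumptions made for this illustration and are not part of plotBitScatter.

```r
# open a PDF device that does not write a file (7 x 5 inch page)
grDevices::pdf(file = NULL, width = 7, height = 5)

# width and height of the current plot region in inches
pin <- graphics::par("pin")

# same formula as the documented default for ypixels
xpixels <- 1000
ypixels <- xpixels * pin[2] / pin[1]

# hypothetical fallback: use a square bitmap if par("pin") is unusable
if (!is.finite(ypixels) || ypixels <= 0) ypixels <- xpixels

grDevices::dev.off()
ypixels
```

A value obtained this way could be passed on explicitly, for example as plotBitScatter(x, y, xpixels = xpixels, ypixels = ypixels).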

Value

NULL (invisibly)

Author(s)

Michael Stadler

Examples

x <- rnorm(1000)
y <- rnorm(1000)
par(mfrow=c(1,2))
plotBitScatter(x, y, main = "bitmap")
plot(x, y, main = "default")

Plot gene region

Description

Visualize the gene model for a gene of interest, or for all genes in a provided region, and/or show one or more coverage tracks based on bigwig file(s).

Usage

plotGeneRegion(
  gtf = "",
  granges = NULL,
  chr = "",
  start = NA_real_,
  end = NA_real_,
  showgene = "",
  bigwigFiles = "",
  bigwigCond = "",
  geneTrackTitle = "Genes",
  transcriptIdColumn = "transcript_id",
  geneIdColumn = "gene_id",
  geneSymbolColumn = "gene_name",
  lowerPadding = 0.15,
  upperPadding = 0.05,
  colorByStrand = FALSE,
  featureColors = c(plusmain = "#0E14D0", minusmain = "#D0350E", plusother = "#9E9BEB",
    minusother = "#DA907E"),
  condColors = NULL,
  scaleDataTracks = FALSE,
  plotTitle = NULL,
  ...
)

Arguments

`gtf`	Character scalar, path to gtf file (tested with Ensembl/Gencode files).
`granges`	GRanges object, typically generated from a GTF file using the `prepareGTF` function. This is an alternative to providing the path to the gtf file directly, and will take precedence over the `gtf` argument if provided.
`chr`	Character scalar, name of the chromosome to show.
`start`, `end`	Numeric scalars, start and end position of the region to show.
`showgene`	Character scalar, the gene ID/name to display. Will take precedence over positional range specification if provided.
`bigwigFiles`	Named character vector, paths to bigwig files.
`bigwigCond`	Named character vector, the grouping of the bigwig files (used for coloring of the coverage tracks).
`geneTrackTitle`	Character scalar, name of the gene track.
`transcriptIdColumn`	Character scalar, the column in the gtf file that contains the transcript ID. Passed to `prepareGTF`.
`geneIdColumn`	Character scalar, the column in the gtf file that contains the gene ID. Passed to `prepareGTF`.
`geneSymbolColumn`	Character scalar, the column in the gtf file that contains the gene symbol (if available). Set to `""` if not available (in which case the gene IDs will be used in its place). Passed to `prepareGTF`.
`lowerPadding`, `upperPadding`	Numeric scalars, setting the amount of padding in the lower and upper range of the plot, respectively. For example, a value of 0.05 will expand the range by 0.05 * (max coordinate - min coordinate) in the specified direction.
`colorByStrand`	Logical scalar, determining whether gene features are colored by the annotated strand.
`featureColors`	Named character vector of length 4, with elements `plusmain`, `minusmain`, `plusother`, `minusother`, giving the colors to use for the features if `colorByStrand` is TRUE.
`condColors`	Either NULL or a named character vector (with the same names as the unique values of `bigwigCond`), giving the colors to use for the coverage tracks if `bigwigCond` is provided.
`scaleDataTracks`	Logical scalar, indicating whether the data tracks should be scaled to have the same y-axis limits.
`plotTitle`	Character scalar, the title of the final plot. If `NULL` (the default), it will be automatically defined based on the displayed gene or region.
`...`	Additional arguments to be passed to `Gviz::plotTracks`.

Details

The gene annotation can be provided either as a path to a gtf file, or as a GRanges object (generated using the prepareGTF function to ensure compatibility). The region to display can be determined either by specifying a gene (ID or symbol) or by specifying a viewing range (chromosome, start and end positions).
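
The padding rule stated for lowerPadding and upperPadding can be checked by hand. The sketch below applies it to a made-up range using the default padding values. The variable names are hypothetical, and the code only illustrates the arithmetic, not how plotGeneRegion applies it internally.

```r
# made-up range (minimum and maximum coordinate)
minCoord <- 20000000
maxCoord <- 20005000

# default padding values of plotGeneRegion
lowerPadding <- 0.15
upperPadding <- 0.05

# each side is expanded by padding * (max coordinate - min coordinate)
width <- maxCoord - minCoord
paddedRange <- c(minCoord - lowerPadding * width,
                 maxCoord + upperPadding * width)
paddedRange
```

With a width of 5000, the lower end moves by 750 and the upper end by 250.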

Author(s)

Charlotte Soneson

Examples

if (requireNamespace("Gviz", quietly = TRUE)) {
    gtffile <- system.file("extdata/plotGeneRegion/mm10_ensembl98.gtf", 
                           package = "swissknife")
    plotGeneRegion(gtf = gtffile, 
                   showgene = "Tnfaip3")
               
    bwf <- system.file("extdata/plotGeneRegion/mnase_mm10.bw", 
                       package = "swissknife")
    names(bwf) <- "bwf1"
    plotGeneRegion(gtf = gtffile, 
                   bigwigFiles = bwf,
                   chr = "chr10", start = 20000000, end = 20005000)
    plotGeneRegion(bigwigFiles = bwf,
                   chr = "chr10", start = 20000000, end = 20005000)
               
    bwf2 <- c(bwf, bwf)
    names(bwf2) <- c("bwf1", "bwf2")
    bwc2 <- c("c1", "c2")
    names(bwc2) <- names(bwf2)
    plotGeneRegion(gtf = gtffile, bigwigFiles = bwf2, bigwigCond = bwc2, 
                   showgene = "Map3k5")
}

Plot annotated phasogram.

Description

Plot phasogram and annotate it with estimated nucleosome repeat length (NRL).

Usage

plotPhasogram(x, hide = TRUE, xlim = NULL, verbosePlot = FALSE, ...)

Arguments

`x`	`numeric` vector giving the counts of alignment distances (typically the output of `calcPhasogram`).
`hide`	If `TRUE` (the default), hide phasogram counts not used in the NRL estimate (`mind` parameter from `estimateNRL`).
`xlim`	`numeric(2)` with the x-axis (phase) limits in the first two plots (see Details). If `NULL` (the default), the full range defined by `x` and `hide` will be used.
`verbosePlot`	If `TRUE`, create three plots instead of just a single plot (see Details).
`...`	Additional arguments passed to `estimateNRL` to control NRL estimation.

Details

The function will visualize an annotated phasogram. For verbosePlot=FALSE (the default), it will create a single annotated plot. For verbosePlot=TRUE, it will create three plots (using par(mfrow=c(1,3))):

raw phase counts with de-trending and de-noising loess fits
residual phase counts with de-noising loess fit and detected peaks
linear fit to peaks and NRL estimation
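
The last of these plots can be pictured with a few lines of base R. The sketch assumes that the NRL is read off as the slope of a straight line fitted to peak position against peak number; the peak positions are made up, and the actual peak detection and fitting are done by estimateNRL.

```r
# hypothetical peak positions (in bases) detected in a phasogram
peakPos <- c(188, 381, 569, 762, 949)
peakIdx <- seq_along(peakPos)

# linear fit of peak position against peak number
fit <- stats::lm(peakPos ~ peakIdx)

# the slope is the average spacing between consecutive peaks
nrl <- unname(stats::coef(fit)["peakIdx"])
nrl   # about 190.3
```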

Value

The return value from the call to estimateNRL (invisibly).

Author(s)

Michael Stadler

Examples

  # see the help for calcPhasogram() for a full example

Plot Selected Variable Genes

Description

This function takes the output from selVarGenes and plots the genes that have been selected as highly variable across the cells. It plots the log2 coefficient of variation as a function of the log2 mean.

Usage

plotSelVarGenes(
  selVarGenes_list = NULL,
  xlab = "logMean",
  ylab = "logCV",
  main = "Selected Variable Genes",
  pch = 16,
  col = "#BEBEBE40",
  sel_col = "steelblue",
  colByBin = FALSE,
  asp = 1,
  ...
)

Arguments

`selVarGenes_list`	the output list from the `selVarGenes` function.
`xlab`	label for x-axis.
`ylab`	label for y-axis.
`main`	title for plot.
`pch`	point pch.
`col`	point color.
`sel_col`	point color of the selected variable genes.
`colByBin`	if TRUE, color the genes by the bin they've been assigned to.
`asp`	the y/x aspect ratio. Set to 1 when `colByBin` is TRUE.
`...`	additional parameters for the `plot` function.

Value

plot

Author(s)

Dania Machlab

Examples

if (requireNamespace("SingleCellExperiment", quietly = TRUE)) {
    # packages
    library(SingleCellExperiment)
   
    # create example count matrix
    # ... poisson distr per gene
    mu <- ceiling(runif(n = 2000, min = 0, max = 100))
    counts <- do.call(rbind, lapply(mu, function(x){rpois(1000, lambda = x)}))
    counts <- counts + 1
    # ... add signal to subset of genes (rows) and cells (columns)
    i <- sample(x = 1:nrow(counts), size = 500)
    j <- sample(x = 1:ncol(counts), size = 500)
    counts[i, j] <- counts[i, j] + sample(5:10, length(i), replace = TRUE)
 
    # create SCE
    sce <- SingleCellExperiment(list(counts = counts))

    # calculate sizeFactors
    libsizes <- colSums(counts)
    sizeFactors(sce) <- libsizes / mean(libsizes)

    # select variable genes
    varGenes <- selVarGenes(sce)
   
    # plot
    plotSelVarGenes(varGenes)
    plotSelVarGenes(varGenes, colByBin=TRUE)
}

Prepare GTF file for use with plotGeneRegion

Description

This function sets the names of the transcript and gene ID columns of the gtf file to "transcript" and "gene", removes version tags of the transcripts/genes and retains only the "exon" entries. The purpose is to make the file amenable to plotting with Gviz, using the plotGeneRegion function.
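
The three steps named above can be sketched on a small table. The data frame, its made-up IDs and the assumption that a version tag is a trailing dot followed by digits are for illustration only; prepareGTF itself works on the imported gtf file and returns a GRanges object.

```r
# toy annotation table standing in for an imported gtf file (made-up IDs)
ann <- data.frame(
    type = c("gene", "exon", "exon", "CDS"),
    transcript_id = c(NA, "ENSMUST00000000001.5", "ENSMUST00000000001.5",
                      "ENSMUST00000000001.5"),
    gene_id = rep("ENSMUSG00000000001.11", 4),
    stringsAsFactors = FALSE
)

# retain only the "exon" entries
ann <- ann[ann$type == "exon", ]

# rename the ID columns to "transcript" and "gene"
names(ann)[names(ann) == "transcript_id"] <- "transcript"
names(ann)[names(ann) == "gene_id"] <- "gene"

# strip trailing version tags such as ".5" (assumed tag format)
ann$transcript <- sub("\\.[0-9]+$", "", ann$transcript)
ann$gene <- sub("\\.[0-9]+$", "", ann$gene)
ann
```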

Usage

prepareGTF(
  gtf,
  transcriptIdColumn = "transcript_id",
  geneIdColumn = "gene_id",
  geneSymbolColumn = "gene_name"
)

Arguments

`gtf`	Character scalar, path to gtf file (tested with Ensembl/Gencode files).
`transcriptIdColumn`	Character scalar, the column in the gtf file that contains the transcript ID.
`geneIdColumn`	Character scalar, the column in the gtf file that contains the gene ID.
`geneSymbolColumn`	Character scalar, the column in the gtf file that contains the gene symbol (if available). Set to `""` if not available (in which case the gene IDs will be used in its place).

Author(s)

Charlotte Soneson

Examples

gtf <- prepareGTF(gtf = system.file("extdata/plotGeneRegion/mm10_ensembl98.gtf",
                                    package = "swissknife"))
                                    
Read sample tsv files from seqdata storage

Description

The function searches the provided seqdataDir for tsv files corresponding to the provided sampleIds and returns a data.frame containing the metadata for all these samples.

Usage

readSampleTsvs(
  seqdataDir = "/tungstenfs/groups/gbioinfo/seqdata",
  sampleIds,
  keepMulti = TRUE,
  ...
)

Arguments

`seqdataDir`	Character scalar, the path to the directory containing the tsv files.
`sampleIds`	Character vector with sample IDs, which will be matched against the file names in `seqdataDir`. The sample IDs should not contain the `.tsv` suffix.
`keepMulti`	Logical scalar, indicating whether to keep samples that match more than one tsv file. If `TRUE`, these samples are represented by multiple rows in the table. If `FALSE`, these samples are excluded. In any case, a warning will be generated, listing the samples with multiple matching files.
`...`	Additional arguments that will be passed to `list.files`, e.g. to make the search case-insensitive or search recursively.

Value

A data.frame with metadata for the provided sampleIds.
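
The effect of keepMulti can be pictured with the following sketch, which matches sample IDs against file names in a temporary directory. The file names and the matching rule (the file name contains the ID and ends in .tsv) are assumptions made for this illustration; readSampleTsvs may match differently, and it also reads the files and warns about multiple matches.

```r
# temporary directory with made-up, empty tsv files (illustration only)
d <- file.path(tempdir(), "seqdataSketch")
dir.create(d, showWarnings = FALSE)
invisible(file.create(file.path(d, c("s1_run1.tsv", "s1_run2.tsv",
                                     "s2_run1.tsv"))))

sampleIds <- c("s1", "s2", "s3")

# for each ID, list the files whose name contains the ID and ends in .tsv
hits <- lapply(sampleIds, function(id) {
    list.files(d, pattern = paste0(id, ".*\\.tsv$"))
})
names(hits) <- sampleIds
nHits <- lengths(hits)

# IDs that match more than one file
multi <- names(nHits)[nHits > 1]

# keepMulti = FALSE drops these IDs, keepMulti = TRUE keeps all matched IDs
keepMulti <- FALSE
kept <- if (keepMulti) names(nHits)[nHits >= 1] else names(nHits)[nHits == 1]

multi
kept
```

Here "s1" matches two files and is dropped, "s2" is kept, and "s3" has no matching file.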

Author(s)

Charlotte Soneson

Examples

if (requireNamespace("dplyr") && requireNamespace("tidyr")) {
    print(readSampleTsvs(seqdataDir = system.file("extdata/readSampleTsvs", 
                                                  package = "swissknife"), 
                         sampleIds = c("readSampleTsvsEx1",
                                       "readSampleTsvsEx2",
                                       "readSampleTsvsEx3")))
}

Sample control elements that match a target distribution.

Description

Randomly sample from a set of control (background) elements, such that the selected elements are distributed similarly to a given set of target (foreground) elements.

Usage

sampleControlElements(
  x,
  idxTarget,
  idxControl = NULL,
  nbins = 50,
  oversample = 1
)

Arguments

`x`	`numeric` vector (or list of `numeric` vectors). `idxTarget` and `idxControl` refer to the elements of `x`. If `x` is a list, all elements must have the same length.
`idxTarget`	`numeric` or `logical` vector specifying the elements in `x` that define the target distribution to be matched by the control elements.
`idxControl`	`numeric` or `logical` vector specifying the complete set of possible control elements in `x` (default: all that are not in `idxTarget`), from which a subset is to be sampled.
`nbins`	`numeric(1)` or `numeric(length(x))` if `x` is a list, specifying the number of bins to group the values of x into. Higher numbers of bins will increase the match to the target distribution(s), but may fail if there are few elements to sample from (will throw a warning).
`oversample`	The number of control elements to sample for each target element.

Value

numeric vector with round(length(idxTarget) * oversample) elements, specifying the index (positions) of the sampled control elements.
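
One way to picture the distribution matching is bin-wise sampling: group all values into bins, count the target elements per bin, and draw the same number of control elements from each bin. The sketch below does this for a single numeric vector with oversample = 1. It illustrates the idea under these assumptions and is not the implementation used by sampleControlElements.

```r
set.seed(1)
x <- c(runif(1000, min = 0, max = 10),
       rnorm(200, mean = 5, sd = 1))
idxTarget <- 1001:1200
idxControl <- 1:1000
nbins <- 10

# common bins spanning all values of x
breaks <- seq(min(x), max(x), length.out = nbins + 1)
bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# number of target elements in each bin
nTarget <- tabulate(bin[idxTarget], nbins = nbins)

# draw that many control elements from each bin (fewer if the bin runs out)
sampled <- unlist(lapply(seq_len(nbins), function(b) {
    pool <- idxControl[bin[idxControl] == b]
    n <- min(nTarget[b], length(pool))
    if (n == 0) return(integer(0))
    pool[sample.int(length(pool), n)]
}))

length(sampled)
```

More bins follow the target distribution more closely, but each bin then holds fewer control elements to draw from, which is the trade-off described for nbins.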

Author(s)

Michael Stadler

Examples

x <- c(runif(1000, min = 0, max = 10),
       rnorm(200, mean = 5, sd = 1))
s <- sampleControlElements(x, idxTarget = 1001:1200, idxControl = 1:1000)
par(mfrow=c(2,2))
h <- hist(x, breaks = 20, main = "all")
hist(x[1:1000], breaks = h$breaks, main = "all control")
hist(x[1001:1200], breaks = h$breaks, main = "target")
hist(x[s], breaks = h$breaks, main = "sampled control")

Select Variable Genes in Single Cell RNA-seq

Description

This function selects the most variable genes from a SingleCellExperiment object using the plot that displays the log2 coefficient of variation as a function of the log2 mean for all genes across all the cells.

Usage

selVarGenes(
  data = NULL,
  assay.type = "counts",
  logPseudo = 1,
  Nmads = 3,
  minCells = 5,
  minExpr = 1,
  exclTopExprFrac = 0.01,
  span = 0.2,
  control = stats::loess.control(surface = "direct"),
  nBins = 100,
  nBinsDense = ceiling(nrow(data)/4),
  ...
)

Arguments

`data`	`SingleCellExperiment` object or normalized count `matrix` containing the genes as rows and cells as columns.
`assay.type`	the type of assay to use if `data` is a `SingleCellExperiment`. It can be either 'counts' or 'logcounts'. The default is 'counts'.
`logPseudo`	pseudo-count to use when using the logcounts slot from the `SingleCellExperiment` to transform back to normalized raw count space.
`Nmads`	number of MADs beyond which genes are selected per bin.
`minCells`	keep genes with minimum expression in at least this number of cells.
`minExpr`	keep genes with expression greater than or equal to this in `minCells` cells in the normalized count matrix.
`exclTopExprFrac`	the fraction of top expressed genes that will be excluded from the loess fit (value between 0 and 1).
`span`	span parameter for `loess` function.
`control`	control parameters for `loess` function.
`nBins`	number of bins or groups to place the points (genes) into.
`nBinsDense`	number of bins or groups to place the points (genes) into when calculating more accurate distance values to the curve from the loess fit.
`...`	additional parameters for the `loess` function from the `stats` package.

Details

If data is a SingleCellExperiment object, the function calculates normalized counts by dividing the raw counts by the corresponding sizeFactors per cell; if data is a matrix, it is assumed to already contain normalized counts. Only genes with an expression greater than or equal to minExpr in at least minCells cells are kept. If assay.type is set to 'logcounts', that assay is transformed back to the normalized count space by calculating 2^logcounts(data) - 1, under the assumption that the logcounts data is in log2 form and was calculated with a pseudocount of 1.
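
The preprocessing in this paragraph can be sketched with a plain matrix. The simulated counts and variable names are assumptions made for this illustration; the size factors are computed as in the example further below.

```r
set.seed(42)
counts <- matrix(rpois(20 * 8, lambda = 5), nrow = 20,
                 dimnames = list(paste0("g", 1:20), paste0("c", 1:8)))

# size factors: library size relative to the mean library size
libsizes <- colSums(counts)
sf <- libsizes / mean(libsizes)

# normalized counts: divide each column (cell) by its size factor
normCounts <- sweep(counts, 2, sf, FUN = "/")

# keep genes with expression >= minExpr in at least minCells cells
minExpr <- 1
minCells <- 5
keep <- rowSums(normCounts >= minExpr) >= minCells
normCounts <- normCounts[keep, , drop = FALSE]

# log2 data with a pseudocount of 1 can be transformed back exactly
logc <- log2(normCounts + 1)
backTransformed <- 2^logc - 1
all.equal(backTransformed, normCounts)
```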

The most variable genes in the log2(coefficient of variation) versus log2(mean) plot are selected. A loess curve is fitted to this plot, and the distance (Euclidean by default) of each point to the curve is calculated in two steps.

In the first step, genes are assigned to bins by taking the minimum distance to the curve. By default we select 100 points on the loess fit and calculate the distances each gene has to all those points on the curve. Each gene is assigned to the point on the curve for which it has the shortest distance. In the second step, more accurate distances to the curve are calculated by using a higher number of points on the curve. Distances are calculated using the dist.matrix function.
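
The first step can be sketched in base R as follows. The values are simulated, only 20 curve points are used to keep the example small, and the distances are computed with outer() rather than dist.matrix, so this is an illustration of the idea and not the code used by selVarGenes.

```r
set.seed(7)
# made-up log2(mean) and log2(CV) values for 300 genes
logMean <- runif(300, min = 0, max = 8)
logCV <- 2 - 0.4 * logMean + rnorm(300, sd = 0.3)

# loess fit of logCV as a function of logMean
fit <- stats::loess(logCV ~ logMean, span = 0.2,
                    control = stats::loess.control(surface = "direct"))

# evenly spaced points on the fitted curve
nBins <- 20
gridX <- seq(min(logMean), max(logMean), length.out = nBins)
gridY <- stats::predict(fit, newdata = data.frame(logMean = gridX))

# Euclidean distance of every gene (rows) to every curve point (columns)
d <- sqrt(outer(logMean, gridX, "-")^2 + outer(logCV, gridY, "-")^2)

# each gene is assigned to its closest curve point, which defines its bin
assignedBin <- apply(d, 1, which.min)
minDist <- d[cbind(seq_along(assignedBin), assignedBin)]
table(assignedBin)
```

Repeating the distance calculation with a denser grid of curve points gives the more accurate distances of the second step.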

Finally, for each bin, the most variable genes are selected using the more accurate distance measures. The distances of genes that fall below the loess fit are given a negative sign, and the genes that are Nmads MADs away from the median are selected.
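
The per-bin selection can be pictured as below. The sketch assumes that a gene is selected when its signed distance exceeds the bin median by more than Nmads MADs, that is, only on the side above the curve; the bins and distances are simulated, and the exact rule used by selVarGenes may differ.

```r
set.seed(11)
nGenes <- 400
assignedBin <- sample(1:4, nGenes, replace = TRUE)

# signed distance to the fitted curve: negative below, positive above
distance <- rnorm(nGenes, sd = 0.2)
distance[1:5] <- 3    # a few clearly variable genes

# within each bin, select genes above median + Nmads * MAD
Nmads <- 3
selected <- unlist(lapply(split(seq_len(nGenes), assignedBin), function(idx) {
    dd <- distance[idx]
    cutoff <- stats::median(dd) + Nmads * stats::mad(dd)
    idx[dd > cutoff]
}), use.names = FALSE)
sort(selected)
```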

Value

a list of length 2:

varGenes: vector containing the names of the most variable genes.
geneInfo: data.frame with genes as rows and columns containing calculated measures:
- logMean: log2(mean) expression of genes across cells.
- logCV: log2(coefficient of variation) of genes across cells.
- pred_logCV: predicted log2(coefficient of variation) from loess fit.
- assigned_bin: bin each gene has been assigned to.
- distance: accurate distance measures. Points below the loess fit get a negative sign.

Author(s)

Dania Machlab

Examples

if (requireNamespace("wordspace", quietly = TRUE) &&
    requireNamespace("SingleCellExperiment", quietly = TRUE)) {
    # packages
    library(SingleCellExperiment)
   
    # create example count matrix
    # ... poisson distr per gene
    mu <- ceiling(runif(n = 2000, min = 0, max = 100))
    counts <- do.call(rbind, lapply(mu, function(x){rpois(1000, lambda = x)}))
    counts <- counts + 1
    # ... add signal to subset of genes (rows) and cells (columns)
    i <- sample(x = 1:nrow(counts), size = 500)
    j <- sample(x = 1:ncol(counts), size = 500)
    counts[i, j] <- counts[i, j] + sample(5:10, length(i), replace = TRUE)
   
    # create SCE 
    sce <- SingleCellExperiment(list(counts = counts))
   
    # calculate sizeFactors
    libsizes <- colSums(counts)
    sizeFactors(sce) <- libsizes / mean(libsizes)
 
    # select variable genes
    varGenes <- selVarGenes(sce, assay.type="counts")
   
    # plot
    plotSelVarGenes(varGenes, colByBin=TRUE)
    plotSelVarGenes(varGenes)
}

Calculate gene-expression specificity scores.

Description

Calculate expression specificity scores for genes that quantify specific expression of a gene in groups of samples (e.g. from different tissues).

Usage

specificityScore(
  x,
  method = c("tau", "TSI", "counts"),
  group = NULL,
  thresh = 0,
  expr_values = "logcounts",
  na.rm = FALSE
)

## S4 method for signature 'matrix'
specificityScore(
  x,
  method = c("tau", "TSI", "counts"),
  group = NULL,
  thresh = 0,
  expr_values = "logcounts",
  na.rm = FALSE
)

## S4 method for signature 'SummarizedExperiment'
specificityScore(
  x,
  method = c("tau", "TSI", "counts"),
  group = NULL,
  thresh = 0,
  expr_values = "logcounts",
  na.rm = FALSE
)

Arguments

`x`	Expression data, either a `matrix` with expression values for genes (rows) in each sample (columns), or a `SummarizedExperiment` or `SingleCellExperiment` object containing such expression data in one of the assays (selected by `expr_values`).
`method`	`character` scalar selecting the type of expression specificity score to be calculated. One of: "tau", "TSI", "counts". See "Details" for method-specific information.
`group`	`character` or `factor` of length `ncol(x)` that groups the measurements into clusters or tissues, for which expression specificity scores are to be calculated. If `NULL` (the default), each column of `x` is assumed to be its own group. If multiple columns belong to the same group, these columns are first aggregated (averaged) before score calculations.
`thresh`	`numeric` scalar defining the expression threshold. Values greater than this threshold are interpreted as expressed (used only for some of the methods, see "Details").
`expr_values`	Integer scalar or string indicating which assay of `x` contains the expression values, for `x` of type `SummarizedExperiment` or `SingleCellExperiment`. Ignored if `x` is a `matrix`.
`na.rm`	Logical scalar. If `TRUE`, `NA` values are excluded in the calculations.

Value

A numeric vector of length nrow(x) with gene scores.
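
For orientation, the sketch below computes the three scores by hand for two of the genes used in the examples, with each column treated as its own group. It uses the common definitions of tau and TSI from the tissue-specificity literature and assumes that "counts" is the number of groups with expression above thresh. Whether specificityScore matches these definitions in every detail is an assumption.

```r
x <- rbind(g2 = c(1, 0, 0, 0, 0),
           g3 = c(.6, .1, .1, .1, .1))

# tau: sum over groups of (1 - x / max(x)), divided by (number of groups - 1)
tau <- apply(x, 1, function(v) sum(1 - v / max(v)) / (length(v) - 1))

# TSI: fraction of the total expression contributed by the top group
tsi <- apply(x, 1, function(v) max(v) / sum(v))

# counts: number of groups with expression greater than the threshold
thresh <- 0.5
cnt <- rowSums(x > thresh)

rbind(tau = tau, TSI = tsi, counts = cnt)
```

A gene expressed in a single group (g2) reaches the maximum of 1 for both tau and TSI, while g3 gets tau = 5/6 and TSI = 0.6.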

Author(s)

Michael Stadler

References

For a review of tissue-specificity scores, see: "A benchmark of gene expression tissue-specificity metrics" Nadezda Kryuchkova-Mostacci and Marc Robinson-Rechavi Brief Bioinform. 2017 Mar; 18(2): 205–214. doi: 10.1093/bib/bbw008, PMCID: PMC5444245, PMID: 26891983

Examples

x <- rbind(g1 = runif(5),
           g2 = c(1, 0, 0, 0, 0),
           g3 = c(.6, .1, .1, .1, .1))
specificityScore(x)
specificityScore(x, method = "TSI")
specificityScore(x, method = "counts", thresh = 0.5)

Map numerical values to colors.

Description

valueToColor takes a numerical vector and maps each value to an R color string.

Usage

valueToColor(
  x,
  rng = range(x, na.rm = TRUE),
  col = c("#5E4FA2", "#3288BD", "#66C2A5", "#ABDDA4", "#E6F598", "#FFFFBF", "#FEE08B",
    "#FDAE61", "#F46D43", "#D53E4F", "#9E0142"),
  NA.col = "lightgray",
  alpha = NULL
)

Arguments

`x`	`numeric` vector with values to be mapped to colors.
`rng`	`numeric(2)` giving the range of values to be mapped to colors. By default, this will be the range of finite values in `x`.
`col`	vector with R colors defining the palette (must be a valid argument to `col2rgb`).
`NA.col`	Single R color to use for `NA` values in `x`.
`alpha`	`NULL` (default) or `numeric(1)` between 0 and 255, giving the alpha channel value for the colors (0 = fully transparent, 255 = fully opaque). `NULL` will use fully opaque colors (`alpha = 255`). `alpha` is ignored if `col` already contains colors with defined alpha values.

Details

The values in [rng[1], rng[2]] will be linearly mapped to the color palette defined by col. Any values in x less (greater) than rng[1] (rng[2]) will be assigned the same color as rng[1] (rng[2]).
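
The mapping described above amounts to clamping, rescaling and palette interpolation. The sketch below uses grDevices::colorRamp for the interpolation, with a made-up three-color palette and input values. It illustrates the described behavior and is not the code of valueToColor; in particular, the NA color is kept as a color name here.

```r
x <- c(-5, 0, 0.5, 1, 5, NA)
rng <- c(0, 1)
pal <- c("blue", "white", "red")

# clamp values to rng, then rescale to [0, 1]
xc <- pmin(pmax(x, rng[1]), rng[2])
frac <- (xc - rng[1]) / diff(rng)

# interpolate the palette; colorRamp() returns RGB values in 0..255
ramp <- grDevices::colorRamp(pal)
cols <- rep("lightgray", length(x))    # color used for NA values
ok <- !is.na(frac)
cols[ok] <- grDevices::rgb(ramp(frac[ok]), maxColorValue = 255)
cols
```

Values below rng[1] get the first palette color and values above rng[2] the last one, so -5 and 0 map to the same color, as do 1 and 5.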

Value

A character vector of the same length as x with R colors in hexadecimal string-encoded RGB format.

Author(s)

Michael Stadler

Examples

x <- rnorm(1000)
y <- rnorm(1000)
cols <- valueToColor(x + y)
plot(x, y, pch = 20, col = cols, main = "default")

Aggregate different rows assigned to the same ID by calculating a weighted mean

Description

First, row means are calculated to summarize across replicates identified by the groupCol in the colData. Then, the row means of different rows that are assigned to the same feature ID (given by the idCol in the rowData) are summarized by calculating a weighted mean. This weighted mean is the sum of the squared row means divided by the sum of the row means. If all row means are 0, they remain 0 in the output.

Usage

weightedMeanByID(
  SE,
  assay,
  idCol = "GENEID",
  groupCol = "group",
  log2Transformed = TRUE
)

Arguments

`SE`	a `SummarizedExperiment` object that contains an assay with values to be aggregated, a colData column that assigns samples to their group and a rowData column with IDs to indicate which rows to combine.
`assay`	the name of the assay in the SummarizedExperiment object that should be aggregated.
`idCol`	the column name in the rowData of the SummarizedExperiment indicating the feature ID.
`groupCol`	the column name in the colData of the SummarizedExperiment indicating which columns belong to the same group and should be averaged as replicates, before the weighted mean is calculated across rows.
`log2Transformed`	a `logical` indicating whether values in the assay are log2 transformed. If log2Transformed is TRUE, an exponential transformation will be applied before aggregating the values and another log transformation afterwards.

Value

The output is a data.frame with one column for each of the unique names in the groupCol and one row for each of the unique IDs in the idCol. The row and column names are the respective unique values. The entries represent the weighted means for each unique feature ID. If all the input values were NA, the aggregated value is also NA, while for all zero, the output remains zero. If log2Transformed is true the output will be log2 transformed again.
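
The weighted mean for a single feature ID and group can be written out in a few lines. The helper weightedMeanSketch is hypothetical and only mirrors the rules stated above. Dropping NA values when only some inputs are NA, and using plain 2^x and log2 for the log2Transformed case, are assumptions made for this illustration.

```r
# hypothetical helper: weighted mean of the row means m for one feature ID
weightedMeanSketch <- function(m) {
    if (all(is.na(m))) return(NA_real_)    # all NA stays NA
    s <- sum(m, na.rm = TRUE)              # assumption: partial NAs are dropped
    if (s == 0) return(0)                  # all zero stays zero
    sum(m^2, na.rm = TRUE) / s             # sum of squares / sum
}

weightedMeanSketch(c(2, 6))      # (4 + 36) / (2 + 6) = 5
weightedMeanSketch(c(0, 0))      # 0
weightedMeanSketch(c(NA, NA))    # NA

# log2-transformed input: back-transform, aggregate, transform again
logVals <- log2(c(2, 6))
log2(weightedMeanSketch(2^logVals))
```

Compared to the arithmetic mean (4 for the values 2 and 6), this weighted mean gives more weight to rows with larger values.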

Author(s)

Fiona Ross

Examples

set.seed(123)
meansRows <- sample(1:100, 10, replace = TRUE)
dat <- unlist(lapply(meansRows, function(m) {
    rnorm(n = 5, mean = m, sd = 0.1*m)
}))
ma <- matrix(dat, nrow = 10, ncol = 5, byrow = TRUE)
IDs <- data.frame(ID = sample(c("A", "B", "C", "D"), size = 10, replace = TRUE))
Groups <- data.frame(group = c("Y","Y", "Z", "Z", "Z"))
mockSE <- SummarizedExperiment::SummarizedExperiment(
                               assays = list(counts = ma),
                               rowData = IDs,
                               colData = Groups)
weightedMeanByID(mockSE, "counts", idCol = "ID", log2Transformed = FALSE)                                

