Title: | Handy code shared in the FMI CompBio group |
---|---|
Description: | A collection of useful R functions performing various tasks that might be re-usable and worth sharing. |
Authors: | Michael Stadler [aut, cre], Charlotte Soneson [aut], Panagiotis Papasaikas [aut], Dania Machlab [aut], Fiona Ross [aut], Friedrich Miescher Institute for Biomedical Research [cph] |
Maintainer: | Michael Stadler <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.42 |
Built: | 2024-11-02 21:18:18 UTC |
Source: | https://github.com/fmicompbio/swissknife |
swissknife is a collection of useful R functions performing various tasks that might be re-usable and worth sharing.
Maintainer: Michael Stadler [email protected]
Authors:
Charlotte Soneson [email protected]
Panagiotis Papasaikas [email protected]
Dania Machlab [email protected]
Fiona Ross [email protected]
Other contributors:
Friedrich Miescher Institute for Biomedical Research [copyright holder]
Useful links:
This function copies handy utility functions to a new script in a specified location. Currently, the script contains the following utility functions:
.assertScalar()
- convenience function to check the validity
of scalar variables.
.assertVector()
- convenience function to check the validity
of vector variables.
addUtilsFunctions(outFile = "R/utils.R", copyTests = TRUE)
addUtilsFunctions(outFile = "R/utils.R", copyTests = TRUE)
outFile |
Character scalar, giving the path to which the script should be copied. The path is relative to the root of the active project. If a file with this name already exists, the function will ask for confirmation before overwriting it. |
copyTests |
Logical scalar, defining whether to copy unit tests for
the utility functions to |
Charlotte Soneson
Annotate a GRanges
object with
sets of reference GRanges
or
GRangesList
objects, with
respect to overlaps and nearest neighbors.
annotateRegions( x, hasOverlap = list(), fracOverlap = list(), numOverlap = list(), nearest = list(), ignore.strand = TRUE )
annotateRegions( x, hasOverlap = list(), fracOverlap = list(), numOverlap = list(), nearest = list(), ignore.strand = TRUE )
x |
The |
hasOverlap |
Named |
fracOverlap |
Named |
numOverlap |
Named |
nearest |
Named |
ignore.strand |
Logical scalar passed to
|
A GRanges
similar to x
, with
annotations added to its metadata columns (mcols
).
Michael Stadler
getGenomicTiles
that uses this function,
findOverlaps
and
nearest
in package GenomicRanges used
internally.
library(GenomicRanges) x <- GRanges("chr1", IRanges(c(1, 12), width = 10)) tss <- GRanges("chr1", IRanges(c(1, 10, 30), width = 1, names = paste0("t", 1:3))) blacklist <- GRanges("chr1", IRanges(20, width = 5)) annotateRegions(x, hasOverlap = list(Blacklist = blacklist), fracOverlap = list(Blacklist = blacklist), numOverlap = list(TSS = tss), nearest = list(TSS = tss))
library(GenomicRanges) x <- GRanges("chr1", IRanges(c(1, 12), width = 10)) tss <- GRanges("chr1", IRanges(c(1, 10, 30), width = 1, names = paste0("t", 1:3))) blacklist <- GRanges("chr1", IRanges(20, width = 5)) annotateRegions(x, hasOverlap = list(Blacklist = blacklist), fracOverlap = list(Blacklist = blacklist), numOverlap = list(TSS = tss), nearest = list(TSS = tss))
Given two ascendingly sorted integer vectors query
and reference
, calculate
and count the differences between their elements that are greater than zero
and less than maxd
. The number of observed distances d
are reported in cnt[d]
,
and maxd
corresponds to the length(cnt)
. The function is
called by calcPhasogram
, which provides a higher level, more
conventient interface.
calcAndCountDist(query, reference, cnt)
calcAndCountDist(query, reference, cnt)
query |
first |
reference |
second |
cnt |
|
numeric
vector cnt
, where cnt[d]
correspond to
the number of observed distances d
.
Michael Stadler
Calculate the frequencies of same strand alignment distances,
for example from MNase-seq data to estimate nucleosome repeat length.
Distance calculations are implemented in C++ (calcAndCountDist
)
for efficiency.
calcPhasogram(fname, regions = NULL, rmdup = TRUE, dmax = 3000L)
calcPhasogram(fname, regions = NULL, rmdup = TRUE, dmax = 3000L)
fname |
|
regions |
|
rmdup |
|
dmax |
|
integer
vector with dmax
elements, with the element at
position d
giving the observed number of alignment pairs at that
distance.
Michael Stadler
Phasograms were originally described in Valouev et al., Nature 2011 (doi:10.1038/nature10002). The implementation here differs in two ways from the original algorithms:
It does not implement removing of positions that have been seen less
than n
times (referred to as a n
-pile subset in the paper).
It does allow to retain only alignments that fall into selected
genomic intervals (regions
argument).
estimateNRL
to estimate the nucleosome repeat length
from a phasogram, plotPhasogram
to visualize an annotated
phasogram, calcAndCountDist
for low-level distance counting.
if (requireNamespace("GenomicAlignments", quietly = TRUE) && requireNamespace("Rsamtools", quietly = TRUE)) { bamf <- system.file("extdata", "phasograms", "mnase_mm10.bam", package = "swissknife") pg <- calcPhasogram(bamf) print(estimateNRL(pg, usePeaks = 1:4)[1:2]) plotPhasogram(pg, usePeaks = 1:4, xlim = c(0,1000)) }
if (requireNamespace("GenomicAlignments", quietly = TRUE) && requireNamespace("Rsamtools", quietly = TRUE)) { bamf <- system.file("extdata", "phasograms", "mnase_mm10.bam", package = "swissknife") pg <- calcPhasogram(bamf) print(estimateNRL(pg, usePeaks = 1:4)[1:2]) plotPhasogram(pg, usePeaks = 1:4, xlim = c(0,1000)) }
The function returns a color in hex form given a valid name of a color in R.
col2hex(col, alpha = 255)
col2hex(col, alpha = 255)
col |
a |
alpha |
a numerical value in the range [0,1] or [0,255] that indicates the transparency of the color(s). If the given values are between 0 and 1, they are mapped to be between 0 and 255. An alpha value of 1 assumes the [0,1] range and provides maximum color. The default is set to 255. |
a character
or character
vector with the hex colors.
Dania Machlab
y <- rnorm(1000,0,1) cols <- rep("red", length(y)) alpha <- seq(0,1,length.out=length(y)) hexcols <- col2hex(cols, alpha) plot(1:length(y), y, bg=hexcols, pch=21) y <- rnorm(1000,0,1) cols <- rep("red", length(y)) alpha <- seq(0,255,length.out=length(y)) hexcols <- col2hex(cols, alpha) plot(1:length(y), y, bg=hexcols, pch=21)
y <- rnorm(1000,0,1) cols <- rep("red", length(y)) alpha <- seq(0,1,length.out=length(y)) hexcols <- col2hex(cols, alpha) plot(1:length(y), y, bg=hexcols, pch=21) y <- rnorm(1000,0,1) cols <- rep("red", length(y)) alpha <- seq(0,255,length.out=length(y)) hexcols <- col2hex(cols, alpha) plot(1:length(y), y, bg=hexcols, pch=21)
Estimate the nucleosome repeat length (NRL) from the frequencies
of same-strand alignment distances (phasogram), e.g. generated by
calcPhasogram
. The NRL is obtained from the slope of a linear
fit to the modes in the phasogram.
estimateNRL( x, mind = 140L, usePeaks = 1:8, span1 = 100/length(x), span2 = 1500/length(x) )
estimateNRL( x, mind = 140L, usePeaks = 1:8, span1 = 100/length(x), span2 = 1500/length(x) )
x |
|
mind |
|
usePeaks |
|
span1 |
|
span2 |
|
A list
with elements:
the estimated nucleosome repeat length
the 95% confidence interval
smoothed (de-trended) phasogram
the de-noising fit to the de-trended phasogram
the linear fit to the phasogram peaks
the peak locations
minimal distance included in the fit
smoothing parameter for de-trending loess fit
smoothing parameter for de-noising loess fit
the peaks used in the fit
Michael Stadler
calcPhasogram
to calculate the phasogram from
alignments, plotPhasogram
to visualize an annotated phasogram
# see the help for calcPhasogram() for a full example
# see the help for calcPhasogram() for a full example
Get sequential, potentially annotated regions of a fixed lengths (tiles) along chromosomes of a genome.
getGenomicTiles( genome, tileWidth, hasOverlap = list(), fracOverlap = list(), numOverlap = list(), nearest = list(), addSeqComp = TRUE )
getGenomicTiles( genome, tileWidth, hasOverlap = list(), fracOverlap = list(), numOverlap = list(), nearest = list(), addSeqComp = TRUE )
genome |
The genome to work on. Either a |
tileWidth |
|
hasOverlap , fracOverlap , numOverlap , nearest
|
Named |
addSeqComp |
|
The last tile in each chromosome is dropped if it would be shorter
than tileWidth
. Generated tiles are unstranded (*
) and
therefore overlaps or searching for nearest neighbors are ignoring
strands of annotations (ignore.strand=TRUE
).
A GRanges
object with genome tiling regions.
Optional tile annotations are contained in its metadata columns (mcols
).
Michael Stadler
tileGenome
and annotateRegions
used by getGenomicTiles
internally.
library(GenomicRanges) tss <- GRanges("chr1", IRanges(c(1, 10, 30), width = 1, names = paste0("t", 1:3))) blacklist <- GRanges("chr1", IRanges(20, width = 5)) getGenomicTiles(c(chr1 = 45, chr2 = 12), tileWidth = 10, hasOverlap = list(Blacklist = blacklist), fracOverlap = list(Blacklist = blacklist), numOverlap = list(TSS = tss), nearest = list(TSS = tss))
library(GenomicRanges) tss <- GRanges("chr1", IRanges(c(1, 10, 30), width = 1, names = paste0("t", 1:3))) blacklist <- GRanges("chr1", IRanges(20, width = 5)) getGenomicTiles(c(chr1 = 45, chr2 = 12), tileWidth = 10, hasOverlap = list(Blacklist = blacklist), fracOverlap = list(Blacklist = blacklist), numOverlap = list(TSS = tss), nearest = list(TSS = tss))
Read and tabulate the insert sizes from paired-end alignments
contained in one or several bam files. By default, all properly aligned
read pairs are included. Optionally, alignments can be restricted to
those in a specific genomic region (regions
argument) or the number
of alignments read can be limited (nmax
argument).
getInsertSizeDistFromBam( fname, regions = NULL, nmax = NA_integer_, isizemax = 800, exclude = c("chrM", "chrY", "chrX") )
getInsertSizeDistFromBam( fname, regions = NULL, nmax = NA_integer_, isizemax = 800, exclude = c("chrM", "chrY", "chrX") )
fname |
|
regions |
|
nmax |
|
isizemax |
|
exclude |
|
integer
vector with the number of insert sizes. The element at
position i
gives the observed number of alignment pairs with an
insert size of i
. The number of insert sizes greater than
isizemax
that were set to isizemax
are reported in the
attribute "ncapped"
.
Michael Stadler
scanBam
used to read alignments.
if (requireNamespace("Rsamtools", quietly = TRUE)) { bamf <- system.file("extdata", "getInsertSizeDistFromBam", "atac_mm10.bam", package = "swissknife") isize <- getInsertSizeDistFromBam(bamf) attr(isize, "ncapped") plot(isize, type = "l", xlab = "Insert size (bp)", ylab = "Number of fragments") }
if (requireNamespace("Rsamtools", quietly = TRUE)) { bamf <- system.file("extdata", "getInsertSizeDistFromBam", "atac_mm10.bam", package = "swissknife") isize <- getInsertSizeDistFromBam(bamf) attr(isize, "ncapped") plot(isize, type = "l", xlab = "Insert size (bp)", ylab = "Number of fragments") }
Given a k-mer length and the maximum number of allowed hits per k-mer, find all mappable regions in a genome.
getMappableRegions( genome, genomeIndex, kmerLength = 50, maxHits = 1, Ncpu = 2, quiet = TRUE )
getMappableRegions( genome, genomeIndex, kmerLength = 50, maxHits = 1, Ncpu = 2, quiet = TRUE )
genome |
The genome sequence to work on. Either a |
genomeIndex |
|
kmerLength |
|
maxHits |
|
Ncpu |
|
quiet |
|
Sequences of all overlapping windows are extracted from the genome
and aligned to the provided genome index using bowtie
with parameters -f -v 0 -a -B 1 -m maxHits
. If no more
than maxHits
hits are found, the window is defined mappable.
A GRanges
object with mappable regions.
All plus-strand sequences in genome
of length kmerLength
with their start (leftmost) position overlapping the GRanges
object
do not generate more than maxHits
hits when aligned to the genome.
Michael Stadler
bowtie
in package Rbowtie used by
getMappableRegions
to align reads to the genome;
bowtie_build
in package Rbowtie for
indexing a genome.
if (requireNamespace("Rbowtie", quietly = TRUE)) { library(Rbowtie) genomefile <- system.file("extdata", "getMappableRegions", "hg19sub.fa", package = "swissknife") indexdir <- tempfile() indexpre <- "index" indexname <- file.path(indexdir, indexpre) idx <- bowtie_build(genomefile, indexdir) mapgr <- getMappableRegions(genomefile, indexname, 50, quiet = FALSE) print(mapgr) }
if (requireNamespace("Rbowtie", quietly = TRUE)) { library(Rbowtie) genomefile <- system.file("extdata", "getMappableRegions", "hg19sub.fa", package = "swissknife") indexdir <- tempfile() indexpre <- "index" indexname <- file.path(indexdir, indexpre) idx <- bowtie_build(genomefile, indexdir) mapgr <- getMappableRegions(genomefile, indexname, 50, quiet = FALSE) print(mapgr) }
Given marker gene sets for cell types, identify cells with
high expression of the marker genes (positive examples), then use these
cells to create a reference transcriptome profile for each cell type and
identify additional cells of each type using SingleR
. These
marker genes should specifically expressed a single cell type, e.g.
CD3 which is expressed by all T cell subtypes would not be suitable
for specific T cell subtypes.
labelCells( sce, markergenes, fraction_topscoring = 0.01, expr_values = "logcounts", normGenesetExpressionParams = list(R = 200), aggregateReferenceParams = list(power = 0.5), SingleRParams = list(), BPPARAM = SerialParam() )
labelCells( sce, markergenes, fraction_topscoring = 0.01, expr_values = "logcounts", normGenesetExpressionParams = list(R = 200), aggregateReferenceParams = list(power = 0.5), SingleRParams = list(), BPPARAM = SerialParam() )
sce |
|
markergenes |
Named |
fraction_topscoring |
|
expr_values |
Integer scalar or string indicating which assay of
|
normGenesetExpressionParams |
|
aggregateReferenceParams |
|
SingleRParams |
|
BPPARAM |
An optional |
A list
of three elements named cells
, refs
and labels
.
cells
contains a list
with the numerical indices of the top
scoring cells for each cell type.
refs
contains the pseudo-bulk transcriptome profiles used as a
reference for label assignment, as returned by aggregateReference
.
labels
contains a DataFrame
with the
annotation statistics for each cell (one cell per row), generated by
SingleR
.
Michael Stadler
normGenesetExpression
used to calculate scores for
marker gene sets; aggregateReference
used to
create reference profiles; SingleR
used to assign
labels to cells.
if (requireNamespace("SingleR", quietly = TRUE) && requireNamespace("SingleCellExperiment", quietly = TRUE)) { # create SingleCellExperiment with cell-type specific genes library(SingleCellExperiment) n_types <- 3 n_per_type <- 30 n_cells <- n_types * n_per_type n_genes <- 500 fraction_specific <- 0.1 n_specific <- round(n_genes * fraction_specific) set.seed(42) mu <- ceiling(runif(n = n_genes, min = 0, max = 30)) u <- do.call(rbind, lapply(mu, function(x) rpois(n_cells, lambda = x))) rownames(u) <- paste0("g", seq.int(nrow(u))) celltype.labels <- rep(paste0("t", seq.int(n_types)), each = n_per_type) celltype.genes <- split(sample(rownames(u), size = n_types * n_specific), rep(paste0("t", seq.int(n_types)), each = n_specific)) for (i in seq_along(celltype.genes)) { j <- celltype.genes[[i]] k <- celltype.labels == paste0("t", i) u[j, k] <- 2 * u[j, k] } v <- log2(u + 1) sce <- SingleCellExperiment(assays=list(counts=u, logcounts=v)) # define marker genes (subset of true cell-type-specific genes) marker.genes <- lapply(celltype.genes, "[", 1:5) marker.genes # predict cell types res <- labelCells(sce, marker.genes, fraction_topscoring = 0.1, normGenesetExpressionParams = list(R = 50)) # high-scoring cells used as references for each celltype res$cells # ... from these, pseudo-bulks were created: res$refs # ... and used to predict labels for all cells res$labels$pruned.labels # compare predicted to true cell types table(true = celltype.labels, predicted = res$labels$pruned.labels) }
if (requireNamespace("SingleR", quietly = TRUE) && requireNamespace("SingleCellExperiment", quietly = TRUE)) { # create SingleCellExperiment with cell-type specific genes library(SingleCellExperiment) n_types <- 3 n_per_type <- 30 n_cells <- n_types * n_per_type n_genes <- 500 fraction_specific <- 0.1 n_specific <- round(n_genes * fraction_specific) set.seed(42) mu <- ceiling(runif(n = n_genes, min = 0, max = 30)) u <- do.call(rbind, lapply(mu, function(x) rpois(n_cells, lambda = x))) rownames(u) <- paste0("g", seq.int(nrow(u))) celltype.labels <- rep(paste0("t", seq.int(n_types)), each = n_per_type) celltype.genes <- split(sample(rownames(u), size = n_types * n_specific), rep(paste0("t", seq.int(n_types)), each = n_specific)) for (i in seq_along(celltype.genes)) { j <- celltype.genes[[i]] k <- celltype.labels == paste0("t", i) u[j, k] <- 2 * u[j, k] } v <- log2(u + 1) sce <- SingleCellExperiment(assays=list(counts=u, logcounts=v)) # define marker genes (subset of true cell-type-specific genes) marker.genes <- lapply(celltype.genes, "[", 1:5) marker.genes # predict cell types res <- labelCells(sce, marker.genes, fraction_topscoring = 0.1, normGenesetExpressionParams = list(R = 50)) # high-scoring cells used as references for each celltype res$cells # ... from these, pseudo-bulks were created: res$refs # ... and used to predict labels for all cells res$labels$pruned.labels # compare predicted to true cell types table(true = celltype.labels, predicted = res$labels$pruned.labels) }
Make example data available, typically for use in teaching.
loadExampleData(name = "list", envir = globalenv(), verbose = TRUE)
loadExampleData(name = "list", envir = globalenv(), verbose = TRUE)
name |
An optional character scalar specifying the data set(s) to
be made available. The special name |
envir |
specifies the environment in which the data should be made
available. By default, |
verbose |
A logical scalar. If |
A data.frame
(invisibly) with one row for each dataset
that was made available in the global environment.
Michael Stadler
loadExampleData() loadExampleData("mycars")
loadExampleData() loadExampleData("mycars")
Calculate normalized expression for a set of genes in each cell
from a SingleCellExperiment
, using random sets of similarly
expressed genes as background to account for cell quality and
sequencing depth.
normGenesetExpression( sce, genes, expr_values = "logcounts", subset.row = NULL, R = 200, nbins = 100, BPPARAM = SerialParam() )
normGenesetExpression( sce, genes, expr_values = "logcounts", subset.row = NULL, R = 200, nbins = 100, BPPARAM = SerialParam() )
sce |
|
genes |
|
expr_values |
Integer scalar or string indicating which assay of
|
subset.row |
Sample random genes only from these. If |
R |
Integer scalar giving the number of random gene sets to sample for normalization. |
nbins |
Integer scalar, specifying the number of bins to group the
average expression levels into before sampling (passed to
|
BPPARAM |
An optional |
A numeric
vector with normalized gene set scores for each
cell in sce
.
Michael Stadler
if (require(SingleCellExperiment)) { # get sce example(SingleCellExperiment, echo=FALSE) rownames(sce) <- paste0("g", seq.int(nrow(sce))) # calculate gene set expression scores markers <- c("g1", "g13", "g27") scores <- normGenesetExpression(sce, markers, R = 50) # compare expression of marker genes with scores plotdat <- cbind(scores, t(logcounts(sce)[markers, ])) cor(plotdat) pairs(plotdat) }
if (require(SingleCellExperiment)) { # get sce example(SingleCellExperiment, echo=FALSE) rownames(sce) <- paste0("g", seq.int(nrow(sce))) # calculate gene set expression scores markers <- c("g1", "g13", "g27") scores <- normGenesetExpression(sce, markers, R = 50) # compare expression of marker genes with scores plotdat <- cbind(scores, t(logcounts(sce)[markers, ])) cor(plotdat) pairs(plotdat) }
The function parses the R version and R package versions from
session information (created by sessionInfo()
, tested with R 3.6) in
files provided in infiles
. Two types of files are currently
supported:
Rout: Files containing R console output
(created by R CMD BATCH script.R output.Rout
md: Files
containing markdown output created by rmarkdown::render('input.Rmd',
clean = FALSE)
, which will keep the intermediate .md
file.
parsePkgVersions(infiles)
parsePkgVersions(infiles)
infiles |
Character vector with text files (extension must be either
|
A list
of list
s with one element in the outer list for
each R version, contianing an innter list with elements files
and
packages
.
Michael Stadler
f <- list.files(system.file("extdata", "parsePkgVersions", package = "swissknife"), full.names = TRUE) parsePkgVersions(f)
f <- list.files(system.file("extdata", "parsePkgVersions", package = "swissknife"), full.names = TRUE) parsePkgVersions(f)
plotBitScatter
is a wrapper around plot
which renders the
plot area as a bitmap (png), but keeps all other elements (axes, labels, etc.)
as vector elements. This is especially useful for keeping the size of PDF files
with scatter plots with many elements small, while retaining editability of axes.
plotBitScatter( x, y = NULL, ..., densCols = TRUE, colpal = c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"), xpixels = 1000, ypixels = NULL, pointsize = NULL )
plotBitScatter( x, y = NULL, ..., densCols = TRUE, colpal = c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"), xpixels = 1000, ypixels = NULL, pointsize = NULL )
x |
|
y |
|
... |
any further arguments to be passed to |
densCols |
|
colpal |
vector of colors defining the palette for automatic density-based coloring. |
xpixels |
the number of pixels in the x dimension used for rendering the plotting area. |
ypixels |
the number of pixels in the y dimension used for rendering the
plotting area. If |
pointsize |
the size of points used for the png device when rendering the plot.
If |
xpixels
controls the resolution of the rendered plotting area.
In order to keep circular plotting symbols circlular (e.g. pch = 1
),
ypixels
is automatically calculated using xpixels
and the
aspect ratio of the current plotting area. If the plotting device is
rescaled after calling plotBitScatter
, circular plotting symbols
may become skewed.
NULL
(invisibly)
Michael Stadler
x <- rnorm(1000) y <- rnorm(1000) par(mfrow=c(1,2)) plotBitScatter(x, y, main = "bitmap") plot(x, y, main = "default")
x <- rnorm(1000) y <- rnorm(1000) par(mfrow=c(1,2)) plotBitScatter(x, y, main = "bitmap") plot(x, y, main = "default")
Visualize the gene model for a gene of interest, or for all genes in a provided region, and/or show one or more coverage tracks based on bigwig file(s).
plotGeneRegion( gtf = "", granges = NULL, chr = "", start = NA_real_, end = NA_real_, showgene = "", bigwigFiles = "", bigwigCond = "", geneTrackTitle = "Genes", transcriptIdColumn = "transcript_id", geneIdColumn = "gene_id", geneSymbolColumn = "gene_name", lowerPadding = 0.15, upperPadding = 0.05, colorByStrand = FALSE, featureColors = c(plusmain = "#0E14D0", minusmain = "#D0350E", plusother = "#9E9BEB", minusother = "#DA907E"), condColors = NULL, scaleDataTracks = FALSE, plotTitle = NULL, ... )
plotGeneRegion( gtf = "", granges = NULL, chr = "", start = NA_real_, end = NA_real_, showgene = "", bigwigFiles = "", bigwigCond = "", geneTrackTitle = "Genes", transcriptIdColumn = "transcript_id", geneIdColumn = "gene_id", geneSymbolColumn = "gene_name", lowerPadding = 0.15, upperPadding = 0.05, colorByStrand = FALSE, featureColors = c(plusmain = "#0E14D0", minusmain = "#D0350E", plusother = "#9E9BEB", minusother = "#DA907E"), condColors = NULL, scaleDataTracks = FALSE, plotTitle = NULL, ... )
gtf |
Character scalar, path to gtf file (tested with Ensembl/Gencode files). |
granges |
GRanges object, typically generated from a GTF file using the
|
chr |
Character scalar, name of the chromosome to show. |
start , end
|
Numeric scalars, start and end position of the region to show. |
showgene |
Character scalar, the gene ID/name to display. Will take precedence over positional range specification if provided. |
bigwigFiles |
Named character vector, paths to bigwig files. |
bigwigCond |
Named character vector, the grouping of the bigwig files (used for coloring of the coverage tracks). |
geneTrackTitle |
Character scalar, name of the gene track. |
transcriptIdColumn |
Character scalar, the column in the gtf file that
contains the transcript ID. Passed to |
geneIdColumn |
Character scalar, the column in the gtf file that
contains the gene ID. Passed to |
geneSymbolColumn |
Character scalar, the column in the gtf file that
contains the gene symbol (if available). Set to |
lowerPadding , upperPadding
|
Numeric scalars, setting the amount of padding in the lower and upper range of the plot, respectively. For example, a value of 0.05 will expand the range by 0.05 * (max coordinate - min coordinate) in the specified direction. |
colorByStrand |
Logical scalar, determining whether gene features are colored by the annotated strand. |
featureColors |
Named character vector of length 4, with elements
|
condColors |
Either NULL or a named character vector (with the same
names as the unique values of |
scaleDataTracks |
Logical scalar, indicating whether the data tracks should be scaled to have the same y-axis limits. |
plotTitle |
Character scalar, the title of the final plot. If
|
... |
Additional arguments to be passed to |
The gene annotation can be provided either as a path to a gtf file, or as a
GRanges object (generated using the prepareGTF
function to ensure
compatibility). The region to display can be determined either by
specifying a gene (ID or symbol) or by specifying a viewing range
(chromosome, start and end positions).
Charlotte Soneson
if (requireNamespace("Gviz", quietly = TRUE)) { gtffile <- system.file("extdata/plotGeneRegion/mm10_ensembl98.gtf", package = "swissknife") plotGeneRegion(gtf = gtffile, showgene = "Tnfaip3") bwf <- system.file("extdata/plotGeneRegion/mnase_mm10.bw", package = "swissknife") names(bwf) <- "bwf1" plotGeneRegion(gtf = gtffile, bigwigFiles = bwf, chr = "chr10", start = 20000000, end = 20005000) plotGeneRegion(bigwigFiles = bwf, chr = "chr10", start = 20000000, end = 20005000) bwf2 <- c(bwf, bwf) names(bwf2) <- c("bwf1", "bwf2") bwc2 <- c("c1", "c2") names(bwc2) <- names(bwf2) plotGeneRegion(gtf = gtffile, bigwigFiles = bwf2, bigwigCond = bwc2, showgene = "Map3k5") }
if (requireNamespace("Gviz", quietly = TRUE)) { gtffile <- system.file("extdata/plotGeneRegion/mm10_ensembl98.gtf", package = "swissknife") plotGeneRegion(gtf = gtffile, showgene = "Tnfaip3") bwf <- system.file("extdata/plotGeneRegion/mnase_mm10.bw", package = "swissknife") names(bwf) <- "bwf1" plotGeneRegion(gtf = gtffile, bigwigFiles = bwf, chr = "chr10", start = 20000000, end = 20005000) plotGeneRegion(bigwigFiles = bwf, chr = "chr10", start = 20000000, end = 20005000) bwf2 <- c(bwf, bwf) names(bwf2) <- c("bwf1", "bwf2") bwc2 <- c("c1", "c2") names(bwc2) <- names(bwf2) plotGeneRegion(gtf = gtffile, bigwigFiles = bwf2, bigwigCond = bwc2, showgene = "Map3k5") }
Plot phasogram and annotate it with estimated nucleosome repeat length (NRL).
plotPhasogram(x, hide = TRUE, xlim = NULL, verbosePlot = FALSE, ...)
plotPhasogram(x, hide = TRUE, xlim = NULL, verbosePlot = FALSE, ...)
x |
|
hide |
If |
xlim |
|
verbosePlot |
If |
... |
Additional arguments passed to |
The function will visualize an annotated phasogram. For
verbosePlot=FALSE
(the default), it will create a single annotated
plot. For verbosePlot=TRUE
, it will create three plots (using
par(mfrow=c(1,3))
):
raw phase counts with de-trending and de-noising loess fits
residual phase counts with de-noising loess fit and detected peaks
linear fit to peaks and NRL estimation
The return value from the call to estimateNRL
(invisibly).
Michael Stadler
calcPhasogram
to calculate the phasogram from
alignments, estimateNRL
to estimate nucleosome repeat length
# see the help for calcPhasogram() for a full example
# see the help for calcPhasogram() for a full example
This function take the output from selVarGenes
and plots the genes that have been
selected to be highly variable across the cells. It plot the log2 coefficient of variation as
a function of the log mean.
plotSelVarGenes( selVarGenes_list = NULL, xlab = "logMean", ylab = "logCV", main = "Selected Variable Genes", pch = 16, col = "#BEBEBE40", sel_col = "steelblue", colByBin = FALSE, asp = 1, ... )
plotSelVarGenes( selVarGenes_list = NULL, xlab = "logMean", ylab = "logCV", main = "Selected Variable Genes", pch = 16, col = "#BEBEBE40", sel_col = "steelblue", colByBin = FALSE, asp = 1, ... )
selVarGenes_list |
the output list from the |
xlab |
label for x-axis. |
ylab |
label for y-axis. |
main |
title for plot. |
pch |
point pch. |
col |
point color. |
sel_col |
point color of the selected variable genes. |
colByBin |
if TRUE, color the genes by the bin they've been assigned to. |
asp |
the y/x aspect ratio. Set to 1 when |
... |
additional parameters for the |
plot
Dania Machlab
if (requireNamespace("SingleCellExperiment", quietly = TRUE)) { # packages library(SingleCellExperiment) # create example count matrix # ... poisson distr per gene mu <- ceiling(runif(n = 2000, min = 0, max = 100)) counts <- do.call(rbind, lapply(mu, function(x){rpois(1000, lambda = x)})) counts <- counts + 1 # ... add signal to subset of genes (rows) and cells (columns) i <- sample(x = 1:nrow(counts), size = 500) j <- sample(x = 1:ncol(counts), size = 500) counts[i, j] <- counts[i, j] + sample(5:10, length(i), replace = TRUE) # create SCE sce <- SingleCellExperiment(list(counts = counts)) # calculate sizeFactors libsizes <- colSums(counts) sizeFactors(sce) <- libsizes / mean(libsizes) # select variable genes varGenes <- selVarGenes(sce) # plot plotSelVarGenes(varGenes) plotSelVarGenes(varGenes, colByBin=TRUE) }
if (requireNamespace("SingleCellExperiment", quietly = TRUE)) { # packages library(SingleCellExperiment) # create example count matrix # ... poisson distr per gene mu <- ceiling(runif(n = 2000, min = 0, max = 100)) counts <- do.call(rbind, lapply(mu, function(x){rpois(1000, lambda = x)})) counts <- counts + 1 # ... add signal to subset of genes (rows) and cells (columns) i <- sample(x = 1:nrow(counts), size = 500) j <- sample(x = 1:ncol(counts), size = 500) counts[i, j] <- counts[i, j] + sample(5:10, length(i), replace = TRUE) # create SCE sce <- SingleCellExperiment(list(counts = counts)) # calculate sizeFactors libsizes <- colSums(counts) sizeFactors(sce) <- libsizes / mean(libsizes) # select variable genes varGenes <- selVarGenes(sce) # plot plotSelVarGenes(varGenes) plotSelVarGenes(varGenes, colByBin=TRUE) }
This function sets the names of the transcript and gene ID columns of the
gtf file to "transcript" and "gene", removes version tags of the
transcripts/genes and retains only the "exon" entries. The purpose is to
make the file amenable to plotting with Gviz, using the
plotGeneRegion
function.
prepareGTF( gtf, transcriptIdColumn = "transcript_id", geneIdColumn = "gene_id", geneSymbolColumn = "gene_name" )
prepareGTF( gtf, transcriptIdColumn = "transcript_id", geneIdColumn = "gene_id", geneSymbolColumn = "gene_name" )
gtf |
Character scalar, path to gtf file (tested with Ensembl/Gencode files). |
transcriptIdColumn |
Character scalar, the column in the gtf file that contains the transcript ID. |
geneIdColumn |
Character scalar, the column in the gtf file that contains the gene ID. |
geneSymbolColumn |
Character scalar, the column in the gtf file that
contains the gene symbol (if available). Set to |
Charlotte Soneson
gtf <- prepareGTF(gtf = system.file("extdata/plotGeneRegion/mm10_ensembl98.gtf", package = "swissknife"))
gtf <- prepareGTF(gtf = system.file("extdata/plotGeneRegion/mm10_ensembl98.gtf", package = "swissknife"))
The function searches the provided seqdataDir
for tsv
files corresponding to the provided sampleIds
and returns a
data.frame
containing the metadata for all these samples.
readSampleTsvs( seqdataDir = "/tungstenfs/groups/gbioinfo/seqdata", sampleIds, keepMulti = TRUE, ... )
readSampleTsvs( seqdataDir = "/tungstenfs/groups/gbioinfo/seqdata", sampleIds, keepMulti = TRUE, ... )
seqdataDir |
Character scalar, the path to the directory containing the tsv files. |
sampleIds |
Character vector with sample IDs, which will be matched
against the file names in |
keepMulti |
Logical scalar, indicating whether to keep samples that
match more than one tsv file. If |
... |
Additional arguments that will be passed to |
A data.frame
with metadata for the provided sampleIds
.
Charlotte Soneson
if (requireNamespace("dplyr") && requireNamespace("tidyr")) { print(readSampleTsvs(seqdataDir = system.file("extdata/readSampleTsvs", package = "swissknife"), sampleIds = c("readSampleTsvsEx1", "readSampleTsvsEx2", "readSampleTsvsEx3"))) }
if (requireNamespace("dplyr") && requireNamespace("tidyr")) { print(readSampleTsvs(seqdataDir = system.file("extdata/readSampleTsvs", package = "swissknife"), sampleIds = c("readSampleTsvsEx1", "readSampleTsvsEx2", "readSampleTsvsEx3"))) }
Randomly sample from a set of control (background) elements, such that the selected elements are similarly distributed as a given set of target (foreground) elements.
sampleControlElements( x, idxTarget, idxControl = NULL, nbins = 50, oversample = 1 )
sampleControlElements( x, idxTarget, idxControl = NULL, nbins = 50, oversample = 1 )
x |
|
idxTarget |
|
idxControl |
|
nbins |
|
oversample |
The number of control elements to sample for each target element. |
numeric
vector with round(length(idxTarget) * oversample)
elements, specifying the index (positions) of the sampled control elements.
Michael Stadler
x <- c(runif(1000, min = 0, max = 10), rnorm(200, mean = 5, sd = 1)) s <- sampleControlElements(x, idxTarget = 1001:1200, idxControl = 1:1000) par(mfrow=c(2,2)) h <- hist(x, breaks = 20, main = "all") hist(x[1:1000], breaks = h$breaks, main = "all control") hist(x[1001:1200], breaks = h$breaks, main = "target") hist(x[s], breaks = h$breaks, main = "sampled control")
x <- c(runif(1000, min = 0, max = 10), rnorm(200, mean = 5, sd = 1)) s <- sampleControlElements(x, idxTarget = 1001:1200, idxControl = 1:1000) par(mfrow=c(2,2)) h <- hist(x, breaks = 20, main = "all") hist(x[1:1000], breaks = h$breaks, main = "all control") hist(x[1001:1200], breaks = h$breaks, main = "target") hist(x[s], breaks = h$breaks, main = "sampled control")
This function selects the most variable genes from a SingleCellExperiment
object
using the plot that displays the log2 coefficient of variation as a function of the log2 mean for
all genes across all the cells.
selVarGenes( data = NULL, assay.type = "counts", logPseudo = 1, Nmads = 3, minCells = 5, minExpr = 1, exclTopExprFrac = 0.01, span = 0.2, control = stats::loess.control(surface = "direct"), nBins = 100, nBinsDense = ceiling(nrow(data)/4), ... )
selVarGenes( data = NULL, assay.type = "counts", logPseudo = 1, Nmads = 3, minCells = 5, minExpr = 1, exclTopExprFrac = 0.01, span = 0.2, control = stats::loess.control(surface = "direct"), nBins = 100, nBinsDense = ceiling(nrow(data)/4), ... )
data |
|
assay.type |
the type of assay to use if |
logPseudo |
pseudo-count to use when using the logcounts slot from the |
Nmads |
number of MADs beyond which genes are selected per bin. |
minCells |
keep genes with minimum expression in at least this number of cells. |
minExpr |
keep genes with expression greater than or equal to this in |
exclTopExprFrac |
the fraction of top expressed genes that will be excluded from the loess fit (value between 0 and 1). |
span |
span parameter for |
control |
control parameters for |
nBins |
number of bins or groups to place the points(genes) into. |
nBinsDense |
number of bins or groups to use to place the points(genes) into when calculating more accurate distance values to the curve from the loess fit. |
... |
additional parameters for the |
The function takes in a SingleCellExperiment
object and calculates the normalized
counts by dividing the raw counts by the corresponding sizeFactors per cell, or a matrix of already normalized counts.
Only genes that have an expression greater than or equal to minExpr
in at least minCells
cells will be
kept. If assay.type
is set to 'logcounts', that assay is transformed back to the raw normalized count space
by performing 2^logcounts(data) - 1, under the assumption the logcounts data is in log2 form and had a pseudocount of 1.
The genes that vary most on the log2(coefficient of variation) vs log2(mean) plot of genes will be selected. A loess fit is done on this plot and the distance (euclidean by default) each point has to the curve is calculated in two steps.
In the first step, genes are assigned to bins by taking the minimum distance to the curve. By default
we select 100 points on the loess fit and calculate the distances each gene has to all those points on
the curve. Each gene is assigned to the point on the curve for which it has the shortest distance. In
the second step, more accurate distances to the curve are calculated by using a higher number of
points on the curve. Distances are calculated using the dist.matrix
function.
Finally, for each bin, the most variable genes are selected using the more accurate distance
measures. Genes that fall below the loess fit are assigned a negative sign and the genes that
are Nmads
MADs away from the median are selected.
a list
of length 2:
varGenes: vector
containing the names of the most variable genes.
geneInfo: data.frame
with genes as rows and columns containing
calculated measures:
logMean: log2(mean) expression of genes across cells.
logCV: log2(coefficient of variation) of genes across cells.
pred_logCV: predicted log2(coefficient of variation) from loess fit.
assigned_bin: bin each gene has been assigned to.
distance: accurate distance measuses. Points below the loess fit get a negative sign.
Dania Machlab
if (requireNamespace("wordspace", quietly = TRUE) && requireNamespace("SingleCellExperiment", quietly = TRUE)) { # packages library(SingleCellExperiment) # create example count matrix # ... poisson distr per gene mu <- ceiling(runif(n = 2000, min = 0, max = 100)) counts <- do.call(rbind, lapply(mu, function(x){rpois(1000, lambda = x)})) counts <- counts + 1 # ... add signal to subset of genes (rows) and cells (columns) i <- sample(x = 1:nrow(counts), size = 500) j <- sample(x = 1:ncol(counts), size = 500) counts[i, j] <- counts[i, j] + sample(5:10, length(i), replace = TRUE) # create SCE sce <- SingleCellExperiment(list(counts = counts)) # calculate sizeFactors libsizes <- colSums(counts) sizeFactors(sce) <- libsizes / mean(libsizes) # select variable genes varGenes <- selVarGenes(sce, assay.type="counts") # plot plotSelVarGenes(varGenes, colByBin=TRUE) plotSelVarGenes(varGenes) }
if (requireNamespace("wordspace", quietly = TRUE) && requireNamespace("SingleCellExperiment", quietly = TRUE)) { # packages library(SingleCellExperiment) # create example count matrix # ... poisson distr per gene mu <- ceiling(runif(n = 2000, min = 0, max = 100)) counts <- do.call(rbind, lapply(mu, function(x){rpois(1000, lambda = x)})) counts <- counts + 1 # ... add signal to subset of genes (rows) and cells (columns) i <- sample(x = 1:nrow(counts), size = 500) j <- sample(x = 1:ncol(counts), size = 500) counts[i, j] <- counts[i, j] + sample(5:10, length(i), replace = TRUE) # create SCE sce <- SingleCellExperiment(list(counts = counts)) # calculate sizeFactors libsizes <- colSums(counts) sizeFactors(sce) <- libsizes / mean(libsizes) # select variable genes varGenes <- selVarGenes(sce, assay.type="counts") # plot plotSelVarGenes(varGenes, colByBin=TRUE) plotSelVarGenes(varGenes) }
Calculate expression specificity scores for genes that quantify specific expression of a gene in groups of samples (e.g. from different tissues).
specificityScore( x, method = c("tau", "TSI", "counts"), group = NULL, thresh = 0, expr_values = "logcounts", na.rm = FALSE ) ## S4 method for signature 'matrix' specificityScore( x, method = c("tau", "TSI", "counts"), group = NULL, thresh = 0, expr_values = "logcounts", na.rm = FALSE ) ## S4 method for signature 'SummarizedExperiment' specificityScore( x, method = c("tau", "TSI", "counts"), group = NULL, thresh = 0, expr_values = "logcounts", na.rm = FALSE )
specificityScore( x, method = c("tau", "TSI", "counts"), group = NULL, thresh = 0, expr_values = "logcounts", na.rm = FALSE ) ## S4 method for signature 'matrix' specificityScore( x, method = c("tau", "TSI", "counts"), group = NULL, thresh = 0, expr_values = "logcounts", na.rm = FALSE ) ## S4 method for signature 'SummarizedExperiment' specificityScore( x, method = c("tau", "TSI", "counts"), group = NULL, thresh = 0, expr_values = "logcounts", na.rm = FALSE )
x |
Expression data, either a |
method |
|
group |
|
thresh |
|
expr_values |
Integer scalar or string indicating which assay of
|
na.rm |
Logical scalar. If |
A numeric
vector of length nrow(x)
with gene scores.
Michael Stadler
For a review of tissue-specificity scores, see: "A benchmark of gene expression tissue-specificity metrics" Nadezda Kryuchkova-Mostacci and Marc Robinson-Rechavi Brief Bioinform. 2017 Mar; 18(2): 205–214. doi: 10.1093/bib/bbw008, PMCID: PMC5444245, PMID: 26891983
x <- rbind(g1 = runif(5), g2 = c(1, 0, 0, 0, 0), g3 = c(.6, .1, .1, .1, .1)) specificityScore(x) specificityScore(x, method = "TSI") specificityScore(x, method = "counts", thresh = 0.5)
x <- rbind(g1 = runif(5), g2 = c(1, 0, 0, 0, 0), g3 = c(.6, .1, .1, .1, .1)) specificityScore(x) specificityScore(x, method = "TSI") specificityScore(x, method = "counts", thresh = 0.5)
valueToColor
takes a numerical vector and maps each value
to an R color string.
valueToColor( x, rng = range(x, na.rm = TRUE), col = c("#5E4FA2", "#3288BD", "#66C2A5", "#ABDDA4", "#E6F598", "#FFFFBF", "#FEE08B", "#FDAE61", "#F46D43", "#D53E4F", "#9E0142"), NA.col = "lightgray", alpha = NULL )
valueToColor( x, rng = range(x, na.rm = TRUE), col = c("#5E4FA2", "#3288BD", "#66C2A5", "#ABDDA4", "#E6F598", "#FFFFBF", "#FEE08B", "#FDAE61", "#F46D43", "#D53E4F", "#9E0142"), NA.col = "lightgray", alpha = NULL )
x |
|
rng |
|
col |
vector with R colors defining the palette (must be a valid argument
to |
NA.col |
Single R color to use for |
alpha |
|
The values in [rng[1], rng[2]]
will be linearly mapped to the
color palette defined by col
. Any values in x
less (greater)
than rng[1]
(rng[2]
) will be assigned the same color as
rng[1]
(rng[2]
).
A character
vector of the same length of x
with
R colors in hexadecimal string-encoded RGB format.
Michael Stadler
colorRamp
and rgb
for the functions called by valueToColor
.
x <- rnorm(1000) y <- rnorm(1000) cols <- valueToColor(x + y) plot(x, y, pch = 20, col = cols, main = "default")
x <- rnorm(1000) y <- rnorm(1000) cols <- valueToColor(x + y) plot(x, y, pch = 20, col = cols, main = "default")
First row means are calculated to summarize across replicates identified by the groupCol in the colData. Then different row means that are assigned to the same feature ID given by the idCol in the rowData are summarized by calculating a weighted mean. This weighted mean is the sum of the squared row means divided by the sum of the row means. If all row means are 0, they remain 0 in the output.
weightedMeanByID( SE, assay, idCol = "GENEID", groupCol = "group", log2Transformed = TRUE )
weightedMeanByID( SE, assay, idCol = "GENEID", groupCol = "group", log2Transformed = TRUE )
SE |
a |
assay |
the name of the assay in the SummarizedExperiment object that should be aggregated. |
idCol |
the column name in the rowData of the SummarizedExperiment indicating the feature ID. |
groupCol |
the column name in the colData of the SummarizedExperiment indicating which columns belong to the same group and should be averaged as replicates, before the weighted mean is calculated across rows. |
log2Transformed |
a |
The output is a data.frame
with one column for each of the
unique names in the groupCol and one row for each of the unique IDs in the
idCol. The row and column names are the respective unique values. The entries
represent the weighted means for each unique feature ID. If all the input
values were NA, the aggregated value is also NA, while for all zero, the
output remains zero. If log2Transformed is true the output will be log2
transformed again.
Fiona Ross
set.seed(123) meansRows <- sample(1:100, 10, replace = TRUE) dat <- unlist(lapply(meansRows, function(m) { rnorm(n = 5, mean = m, sd = 0.1*m) })) ma <- matrix(dat, nrow = 10, ncol = 5, byrow = TRUE) IDs <- data.frame(ID = sample(c("A", "B", "C", "D"), size = 10, replace = TRUE)) Groups <- data.frame(group = c("Y","Y", "Z", "Z", "Z")) mockSE <- SummarizedExperiment::SummarizedExperiment( assays = list(counts = ma), rowData = IDs, colData = Groups) weightedMeanByID(mockSE, "counts", idCol = "ID", log2Transformed = FALSE)
set.seed(123) meansRows <- sample(1:100, 10, replace = TRUE) dat <- unlist(lapply(meansRows, function(m) { rnorm(n = 5, mean = m, sd = 0.1*m) })) ma <- matrix(dat, nrow = 10, ncol = 5, byrow = TRUE) IDs <- data.frame(ID = sample(c("A", "B", "C", "D"), size = 10, replace = TRUE)) Groups <- data.frame(group = c("Y","Y", "Z", "Z", "Z")) mockSE <- SummarizedExperiment::SummarizedExperiment( assays = list(counts = ma), rowData = IDs, colData = Groups) weightedMeanByID(mockSE, "counts", idCol = "ID", log2Transformed = FALSE)