In this vignette, we show how to prepare reference files for estimating abundances of spliced and unspliced abundances with alignment-free methods (e.g., Salmon, alevin or kallisto). Such abundances are used as input, e.g., for estimation of RNA velocity in single-cell data.
The first step is to generate a GRangesList
object
containing the genomic ranges for the features of interest (spliced
transcripts, and either unspliced transcripts or intron sequences). This
is done using the getFeatureRanges()
function, based on a
reference GTF file. Here, we exemplify this with a small subset of a GTF
file from Gencode
release 28. We extract genomic ranges for spliced transcript and
introns, where the latter are defined for each transcript separately
(using the same terminology as in the BUSpaRse
package). For each intron, a flanking sequence of 50 nt is added on each
side to accommodate reads mapping across an exon/intron boundary.
gtf <- system.file("extdata/gencode.v28.annotation.sub.gtf",
package = "eisaR")
grl <- getFeatureRanges(
gtf = gtf,
featureType = c("spliced", "intron"),
intronType = "separate",
flankLength = 50L,
joinOverlappingIntrons = FALSE,
verbose = TRUE
)
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ...
#> Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for
#> features of type stop_codon. This information was ignored.
#> OK
#> 'select()' returned 1:1 mapping between keys and columns
#> Extracting spliced transcript features
#> Extracting introns using the separate approach
The output from getFeatureRanges()
is a
GRangesList
object, with all the features of interest (both
spliced transcripts and introns):
grl
#> GRangesList object of length 895:
#> $ENST00000456328.2
#> GRanges object with 3 ranges and 5 metadata columns:
#> seqnames ranges strand | exon_id exon_rank
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr1 11869-12227 + | ENSE00002234944.1 1
#> [2] chr1 12613-12721 + | ENSE00003582793.1 2
#> [3] chr1 13221-14409 + | ENSE00002312635.1 3
#> transcript_id gene_id type
#> <character> <character> <character>
#> [1] ENST00000456328.2 ENSG00000223972.5 exon
#> [2] ENST00000456328.2 ENSG00000223972.5 exon
#> [3] ENST00000456328.2 ENSG00000223972.5 exon
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> $ENST00000450305.2
#> GRanges object with 6 ranges and 5 metadata columns:
#> seqnames ranges strand | exon_id exon_rank
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr1 12010-12057 + | ENSE00001948541.1 1
#> [2] chr1 12179-12227 + | ENSE00001671638.2 2
#> [3] chr1 12613-12697 + | ENSE00001758273.2 3
#> [4] chr1 12975-13052 + | ENSE00001799933.2 4
#> [5] chr1 13221-13374 + | ENSE00001746346.2 5
#> [6] chr1 13453-13670 + | ENSE00001863096.1 6
#> transcript_id gene_id type
#> <character> <character> <character>
#> [1] ENST00000450305.2 ENSG00000223972.5 exon
#> [2] ENST00000450305.2 ENSG00000223972.5 exon
#> [3] ENST00000450305.2 ENSG00000223972.5 exon
#> [4] ENST00000450305.2 ENSG00000223972.5 exon
#> [5] ENST00000450305.2 ENSG00000223972.5 exon
#> [6] ENST00000450305.2 ENSG00000223972.5 exon
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> $ENST00000473358.1
#> GRanges object with 3 ranges and 5 metadata columns:
#> seqnames ranges strand | exon_id exon_rank
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] chr1 29554-30039 + | ENSE00001947070.1 1
#> [2] chr1 30564-30667 + | ENSE00001922571.1 2
#> [3] chr1 30976-31097 + | ENSE00001827679.1 3
#> transcript_id gene_id type
#> <character> <character> <character>
#> [1] ENST00000473358.1 ENSG00000243485.5 exon
#> [2] ENST00000473358.1 ENSG00000243485.5 exon
#> [3] ENST00000473358.1 ENSG00000243485.5 exon
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
#>
#> ...
#> <892 more elements>
The metadata
slot of the GRangesList
object
contains a list of the feature IDs for each type of feature:
lapply(S4Vectors::metadata(grl)$featurelist, head)
#> $spliced
#> [1] "ENST00000456328.2" "ENST00000450305.2" "ENST00000473358.1"
#> [4] "ENST00000469289.1" "ENST00000607096.1" "ENST00000606857.1"
#>
#> $intron
#> [1] "ENST00000456328.2-I" "ENST00000456328.2-I1"
#> [3] "ENST00000450305.2-I" "ENST00000450305.2-I1"
#> [5] "ENST00000450305.2-I2" "ENST00000450305.2-I3"
As we can see, the intron IDs are identified by a -I
suffix. Each feature is further annotated to a gene ID. For the intronic
features, the corresponding gene ID also bears the -I
suffix appended to the original gene ID. Having separate gene IDs for
spliced transcripts and introns allows, for example, joint
quantification of spliced and unspliced versions of a gene with alevin.
Adding a suffix rather than defining a completely new gene ID allows us
to easily match corresponding spliced and unspliced feature IDs. To
further simplify the latter, the metadata of the
GRangesList
object returned by
getFeatureRanges()
contains data.frame
s
matching the corresponding gene IDs (as well as transcript IDs, if
unspliced transcripts are extracted):
head(S4Vectors::metadata(grl)$corrgene)
#> spliced intron
#> 1 ENSG00000223972.5 ENSG00000223972.5-I
#> 2 ENSG00000243485.5 ENSG00000243485.5-I
#> 3 ENSG00000284332.1 ENSG00000284332.1-I
#> 4 ENSG00000268020.3 ENSG00000268020.3-I
#> 5 ENSG00000240361.2 ENSG00000240361.2-I
#> 6 ENSG00000186092.6 ENSG00000186092.6-I
Once the genomic ranges of the features of interest are extracted, we
can obtain the nucleotide sequences using the
extractTranscriptSeqs()
function from the GenomicFeatures
package. In addition to the ranges, this requires the genome sequence.
This can be obtained, for example, from the corresponding BSgenome
package, or from an external FASTA file.
suppressPackageStartupMessages({
library(BSgenome)
library(BSgenome.Hsapiens.UCSC.hg38)
})
seqs <- GenomicFeatures::extractTranscriptSeqs(
x = BSgenome.Hsapiens.UCSC.hg38,
transcripts = grl
)
seqs
#> DNAStringSet object of length 895:
#> width seq names
#> [1] 1657 GTTAACTTGCCGTCAGC...ACACTGTTGGTTTCTG ENST00000456328.2
#> [2] 632 GTGTCTGACTTCCAGCA...ACAGGGGAATCCCGAA ENST00000450305.2
#> [3] 712 GTGCACACGGCTCCCAT...TGAGGGATAAATGTAT ENST00000473358.1
#> [4] 535 TCATCAGTCCAAAGTCC...GTATGATTTTAAAGTC ENST00000469289.1
#> [5] 138 GGATGCCCAGCTAGTTT...AGAATTAAGCATTTTA ENST00000607096.1
#> ... ... ...
#> [891] 178 GCGCCAGCCGGACCCCA...CGCGTATTAACGAGAG ENST00000304952.1...
#> [892] 193 GGAGATGACCGTGAGAC...GTACCGCGCCGGCTTC ENST00000481869.1-I
#> [893] 193 GGAGATGACCGTGAGAC...GTACCGCGCCGGCTTC ENST00000484667.2-I
#> [894] 352 GCGCCAGCCGGACCCCA...TCCTGGAGATGACCGT ENST00000484667.2-I1
#> [895] 826 ACCTCCGCCTGCCAGGC...AGAGATTGTGGTGAGC ENST00000458555.1-I
The resulting DNAStringSet
can be written to a FASTA
file and used to generate an index for alignment-free methods such as
Salmon and kallisto.
In addition to extracting feature sequences, we can also export a GTF file with the full set of features. This is useful, for example, in order to generate a linked transcriptome for later import of estimated abundances with tximeta.
In the exported GTF file, each entry of grl
will
correspond to a “transcript” feature, and each individual range
corresponds to an “exon” feature. In addition, each gene is represented
as a “gene” feature.
Finally, we can export a data.frame
(or a tab-separated
test file) providing a conversion table between “transcript” and “gene”
identifiers. This is needed to aggregate the transcript-level abundance
estimates from alignment-free methods such as Salmon and
kallisto to the gene level.
df <- getTx2Gene(grl)
head(df)
#> transcript_id gene_id
#> ENST00000456328.2 ENST00000456328.2 ENSG00000223972.5
#> ENST00000450305.2 ENST00000450305.2 ENSG00000223972.5
#> ENST00000473358.1 ENST00000473358.1 ENSG00000243485.5
#> ENST00000469289.1 ENST00000469289.1 ENSG00000243485.5
#> ENST00000607096.1 ENST00000607096.1 ENSG00000284332.1
#> ENST00000606857.1 ENST00000606857.1 ENSG00000268020.3
tail(df)
#> transcript_id gene_id
#> ENST00000304952.10-I1 ENST00000304952.10-I1 ENSG00000188290.10-I
#> ENST00000304952.10-I2 ENST00000304952.10-I2 ENSG00000188290.10-I
#> ENST00000481869.1-I ENST00000481869.1-I ENSG00000188290.10-I
#> ENST00000484667.2-I ENST00000484667.2-I ENSG00000188290.10-I
#> ENST00000484667.2-I1 ENST00000484667.2-I1 ENSG00000188290.10-I
#> ENST00000458555.1-I ENST00000458555.1-I ENSG00000224969.1-I
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils
#> [7] datasets methods base
#>
#> other attached packages:
#> [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [2] BSgenome_1.73.1
#> [3] BiocIO_1.15.2
#> [4] Biostrings_2.73.2
#> [5] XVector_0.45.0
#> [6] edgeR_4.3.21
#> [7] limma_3.61.12
#> [8] QuasR_1.45.2
#> [9] Rbowtie_1.45.0
#> [10] rtracklayer_1.65.0
#> [11] GenomicFeatures_1.57.1
#> [12] AnnotationDbi_1.67.0
#> [13] Biobase_2.65.1
#> [14] GenomicRanges_1.57.2
#> [15] GenomeInfoDb_1.41.2
#> [16] IRanges_2.39.2
#> [17] S4Vectors_0.43.2
#> [18] BiocGenerics_0.51.3
#> [19] eisaR_1.19.0
#> [20] BiocStyle_2.33.1
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] deldir_2.0-4 httr2_1.0.5
#> [5] biomaRt_2.61.3 rlang_1.1.4
#> [7] magrittr_2.0.3 Rhisat2_1.21.0
#> [9] matrixStats_1.4.1 compiler_4.4.1
#> [11] RSQLite_2.3.7 png_0.1-8
#> [13] vctrs_0.6.5 txdbmaker_1.1.2
#> [15] stringr_1.5.1 pwalign_1.1.0
#> [17] pkgconfig_2.0.3 crayon_1.5.3
#> [19] fastmap_1.2.0 dbplyr_2.5.0
#> [21] utf8_1.2.4 Rsamtools_2.21.2
#> [23] rmarkdown_2.28 UCSC.utils_1.1.0
#> [25] bit_4.5.0 xfun_0.48
#> [27] zlibbioc_1.51.2 cachem_1.1.0
#> [29] jsonlite_1.8.9 progress_1.2.3
#> [31] blob_1.2.4 highr_0.11
#> [33] DelayedArray_0.31.14 BiocParallel_1.39.0
#> [35] jpeg_0.1-10 prettyunits_1.2.0
#> [37] R6_2.5.1 VariantAnnotation_1.51.2
#> [39] bslib_0.8.0 stringi_1.8.4
#> [41] RColorBrewer_1.1-3 jquerylib_0.1.4
#> [43] Rcpp_1.0.13 SummarizedExperiment_1.35.5
#> [45] knitr_1.48 SGSeq_1.39.0
#> [47] igraph_2.1.1 tidyselect_1.2.1
#> [49] Matrix_1.7-1 abind_1.4-8
#> [51] yaml_2.3.10 codetools_0.2-20
#> [53] RUnit_0.4.33 hwriter_1.3.2.1
#> [55] curl_5.2.3 tibble_3.2.1
#> [57] lattice_0.22-6 ShortRead_1.63.2
#> [59] KEGGREST_1.45.1 evaluate_1.0.1
#> [61] BiocFileCache_2.13.2 xml2_1.3.6
#> [63] filelock_1.0.3 pillar_1.9.0
#> [65] BiocManager_1.30.25 MatrixGenerics_1.17.1
#> [67] generics_0.1.3 RCurl_1.98-1.16
#> [69] hms_1.1.3 GenomicFiles_1.41.0
#> [71] glue_1.8.0 maketools_1.3.1
#> [73] tools_4.4.1 interp_1.1-6
#> [75] sys_3.4.3 locfit_1.5-9.10
#> [77] GenomicAlignments_1.41.0 buildtools_1.0.0
#> [79] XML_3.99-0.17 grid_4.4.1
#> [81] latticeExtra_0.6-30 GenomeInfoDbData_1.2.13
#> [83] restfulr_0.0.15 cli_3.6.3
#> [85] rappdirs_0.3.3 fansi_1.0.6
#> [87] S4Arrays_1.5.11 dplyr_1.1.4
#> [89] sass_0.4.9 digest_0.6.37
#> [91] SparseArray_1.5.45 rjson_0.2.23
#> [93] memoise_2.0.1 htmltools_0.5.8.1
#> [95] lifecycle_1.0.4 httr_1.4.7
#> [97] statmod_1.5.0 bit64_4.5.2