What is RNA-Seq?

The scope of next-generation sequencing (NGS) and RNA sequencing (RNA-seq) is rapidly evolving and expanding.

What is RNA-Seq?

The central dogma of molecular biology has portrayed RNA as the intermediate molecule between DNA and protein; however, this view is becoming increasingly outdated. It understates what RNA does, and the growing field of RNA-seq seeks to elucidate the many roles of RNA in the modulation of cellular processes, not only as an intermediate but in its other cellular functions as well. RNA-seq gives researchers a window into the RNA environment of a cell during different physiological or pathological states, or at different stages of development, so that cellular responses to these changes can be determined. Because RNA-seq is built on high-throughput NGS, it provides both qualitative and quantitative information about the different RNA species present in a given sample.

There are many different types of RNA-seq. Direct RNA-seq sequences the RNA in a cell directly, which avoids the bias introduced by complementary DNA (cDNA) synthesis, polymerase chain reaction (PCR), or adaptor ligation. However, RNA is an unstable molecule, so most RNA-seq workflows begin with conversion of RNA into cDNA. Total RNA-seq sequences all RNAs present in a sample. 3’ mRNA-seq produces a summary of the gene expression within a cell. Small RNA-seq involves a size selection step during RNA isolation and looks at small RNA species such as miRNAs and cell-free RNA. Single-cell RNA-seq provides an expression profile at the single-cell level to avoid potential biases from sequencing mixed groups of cells. Transcriptomics looks at the mRNA species within a sample. Ribosome footprinting examines which RNA molecules are being actively translated at the time of RNA isolation. Other, less common RNA-seq technologies are also available. This variety exists because RNA-seq can be adapted to answer many different types of research questions.

Why use RNA-Seq?

Prior to RNA-seq, the best technology for detecting gene expression was the microarray. A microarray consists of thousands of defined spots on a slide, each containing a known sequence that fluoresces when sample molecules bind to it. RNA-seq is a more versatile and robust technology. It is not limited to known genomic sequences: because it does not rely on specific probes, non-model or novel organisms can be sequenced without a reference genome. RNA-seq can detect novel transcripts, alternative splice variants, single nucleotide polymorphisms (SNPs), insertions/deletions, and other RNA variations.

The lack of gene-specific probes and primers also reduces the bias of an RNA-seq run compared to probe-reliant microarrays. RNA-seq is also becoming increasingly inexpensive as the technology continues to develop. In addition, RNA-seq has less background signal than microarrays because reads can be mapped to specific regions of the genome. RNA-seq can determine RNA expression levels more accurately than microarrays, which report relative quantities rather than the absolute quantities that are possible with RNA-seq. Absolute quantification allows RNA-seq results to be compared between experiments, whereas the relative quantification of microarrays makes such comparisons difficult.

As previously mentioned, there is a wide variety of RNA-seq techniques to answer different questions. Some common applications for RNA-seq include differential gene expression analysis, novel gene identification, and splice variant analysis. To accommodate the variety of applications, RNA-seq workflows can differ significantly, but all of them share three main steps: library preparation, sequencing, and analysis.

Library Preparation

RNA Extraction

The very first step in the workflow of RNA-seq is to extract the RNA of interest from your samples. There are many methods of RNA extraction, so choose your favorite method that results in high-quality, DNA-free RNA.

cDNA Synthesis

After the RNA has been extracted, it is reverse transcribed into first-strand cDNA. This forms a duplex of RNA and its complementary DNA molecule. On a molecular level, DNA is more stable than RNA, thus DNA is often the preferred molecule for sequencing workflows.


Now that we have cDNA, the sample can undergo an optional selection step. Selection consists of either enrichment of target molecules or depletion of overly abundant molecules, and it is important for downstream efficiency. Since cellular RNA is 80-95% ribosomal RNA (rRNA), it is critical to remove as much rRNA as possible from the total RNA sample. Overly abundant transcripts, such as rRNA and globin, will otherwise take up the vast majority of reads during a sequencing run, which wastes money, reagents, and read depth. There are three main methods of obtaining high numbers of sequencing reads from targets of interest and removing transcripts of low importance: target enrichment, probe-based depletion, and enzymatic depletion.
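To put the 80-95% figure in perspective, here is a minimal arithmetic sketch. It assumes reads are spent in proportion to abundance, and the depth of 20 million reads per sample is a hypothetical number chosen only for illustration.

```python
def informative_reads(total_reads: int, rrna_fraction: float) -> int:
    """Reads left for non-rRNA transcripts, assuming reads are spent
    in proportion to each RNA's share of the sample (a simplification)."""
    if not 0.0 <= rrna_fraction <= 1.0:
        raise ValueError("rrna_fraction must be between 0 and 1")
    return round(total_reads * (1.0 - rrna_fraction))


total = 20_000_000  # hypothetical sequencing depth for one sample
for frac in (0.80, 0.95):
    left = informative_reads(total, frac)
    print(f"rRNA at {frac:.0%}: {left:,} of {total:,} reads go to other transcripts")
```

Under these assumptions, only a small minority of the reads in an undepleted library would describe the transcripts most experiments care about.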

How does target enrichment work?

One method to increase the number of sequencing reads for transcripts of interest is to enrich the sample for them. A popular method of target enrichment is mRNA selection, often done with oligo d(T) magnetic beads. Chains of deoxythymidine, known as oligo d(T), are covalently bound to magnetic beads and are complementary to the poly(A) tail on the 3’ end of mature mRNAs. See the figure below for a step-by-step pictorial. Step (1) the beads with their oligos are added to a total RNA sample. Step (2) the mRNA transiently binds to the oligo d(T) chains attached to the beads. Step (3) the beads are retained by collecting them against the side of the tube with a magnet while the rest of the sample is washed away. Step (4) the mRNA is eluted from the beads, leaving a purified mRNA extract to be processed into libraries. This process can bias the pool of transcripts in the sample. If transcripts are very long or the samples are handled roughly, the transcripts can shear, leading to overrepresentation of the 3’ end of the mRNAs. This method also fails to enrich other potential RNAs of interest, such as microRNAs (miRNA) and those long non-coding RNAs (lncRNA) that lack a poly(A) tail.
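The selection logic described above can be mimicked in a few lines of code. This is only a toy model: the sequences are made up, and the `min_len` threshold of 20 adenines is a hypothetical cutoff standing in for "long enough to be held by the beads".

```python
import re


def has_poly_a_tail(rna: str, min_len: int = 20) -> bool:
    """True if the sequence ends (3' end) with at least min_len A's."""
    return re.search(rf"A{{{min_len},}}$", rna.upper()) is not None


def oligo_dt_select(sample: dict, min_len: int = 20) -> dict:
    """Keep only the RNAs that a poly(A)-binding bead would retain."""
    return {name: seq for name, seq in sample.items()
            if has_poly_a_tail(seq, min_len)}


# Made-up sequences for illustration only.
sample = {
    "intact_mRNA": "AUGGCCGGCUAA" + "A" * 30,
    "sheared_5prime_piece": "AUGGCCGGC",        # lost its 3' end and tail
    "miRNA": "UGAGGUAGUAGGUUGUAUAGUU",          # no poly(A) tail
}
print(sorted(oligo_dt_select(sample)))
```

The sheared 5’ fragment and the miRNA are both left behind, which mirrors the 3’ bias and the loss of non-polyadenylated RNAs discussed above.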

Figure Showing Poly DT Beads

How does probe-based rRNA depletion work?

Probe-based rRNA depletion relies on DNA probes that are complementary to rRNA sequences and bound to magnetic beads. Step (1) the bead-bound probes are added to the total RNA sample. Step (2) the complementary rRNA transcripts transiently bind to the probes on the magnetic beads. Step (3) the beads, with the rRNA still attached, are separated from the rest of the sample with a magnet. Step (4) the supernatant, not the beads, is transferred to a new, clean tube. This remaining solution is now depleted of rRNA and can move forward to downstream processing. A major problem with probe-based rRNA depletion is that it is organism specific.

rRNA sequences vary between organisms, so each organism requires its own panel of probes to effectively deplete rRNA from the sample. Probe panels are commercially available for a variety of model organisms. Non-model organisms complicate the process, however, because they require development of their own unique probe panels. Another factor to consider when using a bead-based hybridization procedure (whether mRNA enrichment or rRNA depletion) is that the hybridization reaction involves a lengthy incubation, sometimes overnight. Probe-based rRNA depletion is commercially available in kits such as the Illumina Ribo-Zero rRNA Depletion Kit, with probe panels for human, mouse, rat, and bacterial rRNA.
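The organism-specificity problem can be illustrated with a toy model in which a DNA probe captures an RNA only if the RNA contains the probe's exact reverse complement. Real hybridization tolerates some mismatches, so exact matching is a deliberate simplification, and all sequences and names below are invented for the example.

```python
# DNA probe base -> complementary RNA base
DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def probe_target_site(probe_dna: str) -> str:
    """RNA sequence (5'->3') perfectly complementary to a DNA probe."""
    return probe_dna.upper().translate(DNA_TO_RNA_COMPLEMENT)[::-1]


def deplete(rnas: dict, probes: list) -> dict:
    """Return the 'supernatant': RNAs that no bead-bound probe captured."""
    sites = [probe_target_site(p) for p in probes]
    return {name: seq for name, seq in rnas.items()
            if not any(site in seq for site in sites)}


probes = ["TTGGCCAAGT"]  # hypothetical probe designed against species A rRNA
rnas = {
    "rRNA_species_A": "GGACUUGGCCAAUC",  # contains the probe's target site
    "rRNA_species_B": "GGACUUGACCAAUC",  # one base differs, so it escapes
    "mRNA": "AUGGCUAGCUAA",
}
print(sorted(deplete(rnas, probes)))
```

In this sketch the species B rRNA stays in the supernatant alongside the mRNA, which is the situation a dedicated probe panel for species B would be needed to fix.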

Figure showing TreSEQ rRNA

How does enzymatic rRNA depletion work?

Enzymatic rRNA depletion does not require the use of probes. Instead, it exploits the kinetics of hybridization between RNA and its cDNA.

Step (1) the cDNA-RNA hybrids are denatured to single strands.
Step (2) highly abundant constructs, such as rRNAs, are more likely to encounter and hybridize to their matching cDNAs in the reaction mixture. When they hybridize, they form rRNA-cDNA duplexes.
Step (3) the formation of duplexes allows the enzyme to bind and then degrade the cDNA from the duplex, leaving only the rRNA. As the reaction moves forward, the high concentration of rRNA constructs relative to their cDNA partners drives the reaction even further until rRNA and other abundant constructs are effectively depleted from the sample.
Step (4) after enzymatic depletion of highly abundant molecules, the remaining constructs are the target RNA molecules. Because this reaction relies on molecular kinetics, the higher the input, the faster the reaction proceeds, so the required incubation time is inversely related to the input amount.
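The kinetic argument in the steps above can be sketched with a simple model. This is an assumption made for illustration, not the kit's documented chemistry: an RNA and its cDNA start at the same concentration `c0` and re-pair with second-order kinetics, so the fraction hybridized after time `t` is `k*c0*t / (1 + k*c0*t)`. The rate constant `k` and all units are hypothetical.

```python
def fraction_rehybridized(c0: float, k: float, t: float) -> float:
    """Fraction of a species back in duplex form after time t.

    Assumes second-order re-pairing of two strands at equal starting
    concentration c0 with a hypothetical rate constant k.
    """
    x = k * c0 * t
    return x / (1.0 + x)


def time_to_fraction(c0: float, k: float, f: float) -> float:
    """Time needed for a fraction f (0 < f < 1) to re-pair, same model."""
    return f / ((1.0 - f) * k * c0)


k = 0.01  # hypothetical rate constant, arbitrary units
for name, c0 in (("abundant rRNA", 100.0), ("rare mRNA", 1.0)):
    print(f"{name}: {fraction_rehybridized(c0, k, t=1.0):.1%} duplexed at t=1")

# Doubling the input halves the time needed to reach the same fraction.
print(time_to_fraction(100.0, k, 0.9), time_to_fraction(200.0, k, 0.9))
```

In this model the abundant species is mostly in duplex form, and therefore exposed to the enzyme, long before the rare species is, and the time to reach a given level of depletion scales with 1/input.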

This probe-free approach is beneficial because no organism-specific panels need to be purchased separately; one kit can do them all! This universality is especially beneficial for projects involving non-model organisms, which previously required the development of their own organism-specific probe panels. Another benefit of the enzymatic approach is reduced depletion bias. Probe-based approaches only deplete constructs that bind to probes, whereas the enzymatic approach depletes the most abundant constructs first and most efficiently. Enzymatic depletion is commercially available in the Zymo-Seq RiboFree Total RNA Library Kit.

Figure showing ZymoSEQ RiboFree

Adaptor Ligation and Indexing

Once our cDNA has been synthesized and our transcripts of interest are no longer crowded out by rRNAs and other overly abundant constructs, it is time for adaptors to be ligated onto the cDNA. Adaptors are short, synthetic oligonucleotides that are attached to the ends of cDNA strands. They serve two main functions: binding the fragments to the flow cell and providing priming sites for sequencing. First, the adaptor sequences are complementary to oligos that are covalently bound to the sequencing flow cell. The flow cell is a glass slide with lanes coated in a lawn of two different types of oligos, each complementary to one of our adaptor sequences. This allows our fragments to hybridize to the flow cell for sequencing. Second, adaptors serve as priming sites for the polymerases used in sequencing.

After adaptors are ligated to the cDNA molecules, many library preparations undergo a process of indexing. Indexing involves PCR amplification of the molecules while adding a unique sequence, often termed a “barcode”, to the transcripts. This barcode allows the sample each read came from to be identified after samples are pooled and sequenced. Pooling involves mixing numerous different samples together at known concentrations so they can be added to the flow cell and sequenced simultaneously. Pooling samples is often done to save time and money. After adaptor ligation and indexing, samples are ready for sequencing!

Step (1) adaptor ligation and indexing begin with the addition of the synthetic oligonucleotides to our target cDNA molecules. Step (2) an adaptor carrying the unique barcode is ligated to the cDNA target. Illumina adaptors, designated P5 and P7, are commonly used. Whether P5 or P7 is added to the 5’ or the 3’ end of the cDNA depends on the particular library preparation kit. Step (3) the other adaptor is added to the other end of the cDNA molecule. Step (4) the cDNA and its new adaptors are PCR amplified to increase the concentration of the newly formed libraries. These amplified libraries are then quantified to determine their concentration. The concentrations of the libraries are then normalized to ensure the libraries are sequenced evenly and that no one library is overrepresented during the sequencing process.
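The normalization step mentioned above comes down to simple arithmetic. As a rough sketch, assuming each library has been quantified in nM (1 nM equals 1 fmol per µL) and that we want every library to contribute the same number of molecules to the pool, the volumes could be computed like this. The library names, concentrations, and the 20 fmol target are hypothetical.

```python
def pooling_volumes(conc_nM: dict, fmol_each: float) -> dict:
    """Volume in µL of each library that contains fmol_each femtomoles.

    Uses the identity 1 nM == 1 fmol/µL, so volume = amount / concentration.
    """
    for name, conc in conc_nM.items():
        if conc <= 0:
            raise ValueError(f"library {name!r} has no measurable concentration")
    return {name: fmol_each / conc for name, conc in conc_nM.items()}


# Hypothetical quantification results for three indexed libraries.
libraries = {"sample_1": 10.0, "sample_2": 4.0, "sample_3": 8.0}
for name, vol in pooling_volumes(libraries, fmol_each=20.0).items():
    print(f"{name}: add {vol:.2f} µL to the pool")
```

The most dilute library contributes the largest volume, so that each sample ends up equally represented on the flow cell.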

Figure showing ZymoSEQ RiboFree

Sequencing By Synthesis

There are several different sequencing technologies, such as Sanger sequencing and higher-throughput options such as pyrosequencing, Ion Torrent, and nanopore sequencing. We will focus on Illumina’s sequencing by synthesis technology, as it remains the most popular sequencing method. Sequencing by synthesis has two parts: cluster generation, followed by the actual sequencing process.

How does cluster generation work?

Our samples are now each indexed, meaning they have a unique barcode tag that allows us to identify the samples after multiple samples are pooled together. The pooled samples are added to a flow cell in the sequencer. Step (1) the adaptorized transcripts can hybridize to the complementary oligos of the lawn so they are bound to the flow cell. The flow cell oligo serves as the primer for a polymerase to create a complement of the hybridized fragment.

Then the double-stranded molecule is denatured, and the original template is washed away, leaving only the newly synthesized strand that is bound directly to the flow cell. Step (2) the strand now folds over, the adaptor region hybridizes to the other kind of oligo on the flow cell, and a polymerase uses that oligo as a primer to create a complementary strand again. Step (3) now there is a double-stranded bridge of complementary strands. Step (4) the bridge is then denatured, resulting in 2 single-stranded copies of the transcript, both bound to the flow cell. Step (5) this process, called bridge amplification, is repeated many times, generating many copies of each molecule on the flow cell. These groups of copies are the clusters.
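Because each round of bridge amplification turns one bound strand into two, cluster size grows by doubling. The sketch below assumes ideal, 100% efficient doubling, which real flow cells will not achieve, and the target of 1,000 copies is a hypothetical number used only to show the arithmetic.

```python
def copies_after(cycles: int) -> int:
    """Strands in a cluster after a number of ideal doubling cycles,
    starting from a single bound strand."""
    return 2 ** cycles


def cycles_to_reach(target_copies: int) -> int:
    """Smallest number of ideal cycles giving at least target_copies."""
    cycles = 0
    while copies_after(cycles) < target_copies:
        cycles += 1
    return cycles


print(copies_after(10))        # strands after ten ideal cycles
print(cycles_to_reach(1000))   # cycles needed for a hypothetical 1,000 copies
```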

How does sequencing by synthesis work?

Now that we have generated clusters, the reverse strands are cleaved and washed away. This leaves forward strands to begin sequencing. The 3’ ends are blocked to prevent unwanted priming. Sequencing begins with extension of the first sequencing primer to produce read 1, or the forward read. This read “reads” the sequence of the original fragment in the 5’ to 3’ direction. Step (1) fluorescently tagged complementary nucleotides are added to the chain one base at a time based on the sequence of the template. Each nucleotide is tagged with a different color fluorescent signal. Each nucleotide is also a reversible terminator, meaning that after it is incorporated into the chain, another cannot be added until the terminator is removed. Step (2) after the nucleotide is added to the chain, a light source excites the clusters and a fluorescent signal is emitted and read by the sequencing machine.

The emission wavelength allows the computer to determine which base was added to the chain; this determination is a base call. The intensity of the signal produced determines the confidence score for the accuracy of the base call. Step (3) after making the call, the reversible terminator is cleaved, and the chain is ready for the addition of the next nucleotide. This process of incorporating one nucleotide at a time and reading the signals is repeated, and the number of cycles determines the length of the read. All identical strands in a cluster are read simultaneously. Clusters are sequenced in a massively parallel process, meaning that millions of reads are generated at once, as opposed to processing single amplicons one at a time as with Sanger sequencing.
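The cycle-by-cycle logic described above can be caricatured in code. This toy base caller is a sketch under strong assumptions: one intensity value per base per cycle, the brightest channel wins, and a hypothetical `min_ratio` decides when the signal is too ambiguous to call. Real base callers correct for effects that are ignored here.

```python
BASES = "ACGT"


def call_base(intensities, min_ratio: float = 1.5) -> str:
    """Call one base from four channel intensities given in A, C, G, T order.

    Returns "N" when the brightest channel is not at least min_ratio times
    brighter than the runner-up (min_ratio is a made-up threshold).
    """
    ranked = sorted(zip(intensities, BASES), reverse=True)
    (top, base), (second, _) = ranked[0], ranked[1]
    if top <= 0 or top < min_ratio * second:
        return "N"
    return base


def call_read(cycles) -> str:
    """One base call per cycle, so read length equals the number of cycles."""
    return "".join(call_base(c) for c in cycles)


# Hypothetical intensities for one cluster over four cycles.
cycles = [
    (900, 40, 30, 20),   # clear A
    (35, 50, 880, 25),   # clear G
    (300, 290, 20, 10),  # A and C too close to call
    (10, 20, 30, 950),   # clear T
]
print(call_read(cycles))
```

Four cycles give a four-base read, and the ambiguous third cycle is reported as "N" rather than guessed.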

After completing read 1, the read product is washed away. Now, an index read primer is hybridized to the template and the index is read in the same fashion as the first read. This allows reads to be sorted to particular samples. After the first index is read, the read product is washed away and the 3’ end of the template is deprotected so that the template can fold over and bind to the second oligo of the flow cell again. The second index is read like the first one. After the second index is read, a polymerase extends the oligo, once again forming a double-stranded bridge. The strands are then linearized, and the forward strand is cleaved and washed away.

Once again, the 3’ ends of the reverse template are blocked. The second sequencing primer is added, and read 2, or the reverse read, is generated through the cyclical addition of fluorescently tagged nucleotides, just like the first read. This entire process generates millions of reads representing all the fragments in the flow cell. Now the reads generated by the sequencer are ready to be analyzed.
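The relationship between a fragment and its two reads can be shown with a short sketch. It assumes an idealized fragment with adaptors already stripped and error-free sequencing; the fragment sequence and read length are made up. Read 1 comes from one end of the fragment, and read 2 comes from the opposite end on the complementary strand.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_len: int):
    """Idealized read 1 (forward) and read 2 (reverse) for one fragment.

    read_len plays the role of the number of sequencing cycles.
    """
    read1 = fragment[:read_len]
    read2 = reverse_complement(fragment)[:read_len]
    return read1, read2


fragment = "ATGCCGTAAG"  # hypothetical cDNA insert
read1, read2 = paired_end_reads(fragment, read_len=4)
print(read1, read2)
```

Reverse-complementing read 2 recovers the last bases of the original fragment, which is how the pair together brackets the insert.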

Figure showing ZymoSEQ RiboFree

Analysis Basics

Now that the samples have been sequenced, it is time to make sense of the massive amounts of raw data produced by the sequencing run. The raw data is output by the machine as a FASTQ or QSEQ file. These are plain text files that represent the data from the sequencing run using alphabetical, numerical, and punctuation characters. The sequence is reported as single-character representations of the four nucleotides A, T, C, or G. If the sequencer is unsure about a particular base, it will call the nucleotide as “N”. Each base call is given a quality score, and the scores for a read are stored together as a quality string. The quality score indicates how likely it is that the sequencer made the correct base call: it encodes the estimated probability that the call is wrong, and is also referred to as a Phred quality score. Phred scores are numerical values given to every base determination in a sequencing run. Poor quality reads are removed or trimmed and are not used in the alignment process.
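For readers who want to see what a quality string looks like in practice, here is a small sketch. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the call is wrong. The sketch assumes the common Phred+33 text encoding, in which each character's ASCII code minus 33 is the score; the `min_q` cutoff of 20 and the simple 3’ trimming rule are illustrative choices, not a recommendation.

```python
def phred_scores(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def trim_3prime(seq: str, qual: str, min_q: int = 20):
    """Drop trailing bases whose score is below min_q (a simple toy rule)."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


seq, qual = "ACGTN", "IIII#"  # made-up read and quality string
print(phred_scores(qual))
print(error_probability(20), error_probability(30))
print(trim_3prime(seq, qual))
```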

Now that our sequenced base calls have been quality checked, we can begin the bioinformatics process of alignment. First, sequences from pooled sample libraries are separated based on the unique barcodes introduced during the indexing stage of library preparation. For each sample, reads with similar or exactly matching stretches of base calls are locally clustered. Then the forward and reverse reads (read 1 and read 2 described above) are paired, creating contiguous sequences. These contiguous sequences (otherwise known as contigs) are aligned to the reference genome to verify identification. Ambiguous alignments can be resolved with this paired-end sequencing information. The information from this generation of contigs and alignment to the reference genome can now be used for analysis, including identification of SNPs or insertion-deletions (indels), read counting for absolute quantification, and phylogenetic or metagenomic analysis.
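The first step described above, separating pooled reads by barcode, can be sketched as follows. This is a simplified illustration that assumes each read arrives with its index sequence and that barcodes must match exactly; the barcodes, sample names, and read names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample: dict) -> dict:
    """Sort (index_sequence, read) pairs into per-sample bins.

    Reads whose index matches no known barcode go to "undetermined".
    Exact matching only, which is a simplification.
    """
    bins = defaultdict(list)
    for index_seq, read in reads:
        sample = barcode_to_sample.get(index_seq, "undetermined")
        bins[sample].append(read)
    return dict(bins)


barcodes = {"ACGT": "sample_1", "TTAG": "sample_2"}  # hypothetical indexes
reads = [
    ("ACGT", "read_a"),
    ("TTAG", "read_b"),
    ("ACGT", "read_c"),
    ("GGGG", "read_d"),  # index not assigned to any sample
]
for sample, sample_reads in demultiplex(reads, barcodes).items():
    print(sample, len(sample_reads))
```

Counting the reads in each bin is also a quick check on whether the pooled libraries were sequenced evenly.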

The field of NGS is growing rapidly as the technologies that allow for high-throughput sample processing grow more accessible. NGS is now both cost effective and time efficient. RNA sequencing is one of the powerful new tools emerging from the field. RNA-seq is quickly expanding our knowledge of cellular processes and will continue to do so as the technology becomes even more widely used and available. Going from sample to sequence will continue to be streamlined, though it will still consist of three parts: preparation for sequencing through the addition of adaptors to constructs of interest, the sequencing process itself, and finally quality checks and analysis of the data generated in the sequencing run. This tried and true process will continue to provide insight into the specifics of splice variations, differential gene expression, phylogenetics, novel gene identification, transcriptomics, and much more.
