How can RNA splice junctions be detected?

First, I will explain how this will be done in the near future. Then, I will explain how it is done today. My explanation is ordered this way because the methods of the future will be much simpler than those of today.

You know from the central dogma of molecular biology that DNA is generally transcribed into RNA, which is then translated into biologically functional proteins. After the DNA is faithfully transcribed into RNA, molecular machinery already active in the cell identifies specific RNA molecules (based on their sequences) for subsequent editing. Very often (in eukaryotes), segments of the transcripts are excised in this process. The excised segments are called introns, while the retained segments are called exons (think "EXpressed").

In the near future (I think maybe late 2014), experimental biologists will routinely be able to observe gigantic lists of "mature" (post-edit) RNA transcripts generated in cells. These RNA transcripts can then be aligned to the DNA. After alignment, you would notice that some of the RNA reads do not align perfectly: the mature RNA looks like the original DNA, except that it may be missing a number of stretches, which show up as gaps in the alignment. These gaps are inferred to be the result of splicing. What is present in the RNA are the exons. The excised bases, which show up as gaps, are the introns.
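As a minimal sketch of reading introns off such a gapped alignment, assume the aligner reports each alignment as a 1-based leftmost position plus a SAM-style CIGAR string, in which the `N` operation marks a skipped stretch of the reference. The function name `introns_from_cigar` and the example read are hypothetical and serve only as illustration.

```python
import re

# Each CIGAR element is a length followed by a one-letter operation.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference (DNA) sequence.
REF_CONSUMING = set("MDN=X")


def introns_from_cigar(pos, cigar):
    """Return inferred introns as (start, end) pairs, 1-based inclusive.

    pos   -- 1-based leftmost reference position of the alignment
    cigar -- SAM-style CIGAR string, e.g. "50M1000N50M"
    """
    introns = []
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            # A skipped reference region: the gap described above.
            introns.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return introns


# A read with 50 aligned bases, a 1000-base gap, then 50 more aligned bases.
print(introns_from_cigar(101, "50M1000N50M"))  # [(151, 1150)]
```

The aligned blocks on either side of each reported gap are the exons; the gap itself is the inferred intron.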

As of early 2012, however, the method I just described cannot be applied directly in a high-throughput manner because of technological limitations on how we observe RNA: the state of the art in high-throughput sequencing does not produce reads long enough to cover entire transcripts. Experiments that sequence RNA with these short reads are called RNA-seq experiments, and RNA splice junctions are routinely detected starting from RNA-seq data. This is tricky because it is non-trivial to figure out which reads go with which genes; it is especially tricky because a single gene can produce multiple splice variants (different combinations of excision and non-excision at different sites within the same gene).

In a simple explanation of RNA-seq, the experiment produces linked pairs of reads of length 150+ (as of 2012; the number is growing as the technology improves). You produce millions of these read pairs from your sample. Next, you map these sequences to positions on the chromosomes via alignment. Ultimately, by examining the coverage of the observed RNA sequences on the original DNA chromosomes, you are able to infer the introns and splice junctions. The details of these junction detection methods are actually somewhat complicated, but you can find detailed explanations in the papers that introduce the methods.
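To make the coverage idea concrete, here is a deliberately simplified sketch, not the method used by any published tool. It assumes each aligned read is given as a 0-based, end-exclusive `(start, end)` interval on a single chromosome, and it reports stretches with zero coverage that have covered sequence on both sides. The function names and toy reads are hypothetical. Real methods are considerably more involved, since a zero-coverage stretch may also be unexpressed sequence between genes.

```python
def coverage(length, alignments):
    """Per-base read depth over a region of `length` bases.

    alignments -- iterable of (start, end) pairs, 0-based, end-exclusive
    """
    depth = [0] * length
    for start, end in alignments:
        for i in range(max(start, 0), min(end, length)):
            depth[i] += 1
    return depth


def uncovered_gaps(depth):
    """Return (start, end) runs of zero depth that have coverage on both sides."""
    gaps = []
    gap_start = None
    seen_coverage = False
    for i, d in enumerate(depth):
        if d == 0:
            # Only open a gap once some covered base has been seen.
            if seen_coverage and gap_start is None:
                gap_start = i
        else:
            if gap_start is not None:
                gaps.append((gap_start, i))
                gap_start = None
            seen_coverage = True
    return gaps


# Toy reads covering two blocks (candidate exons) separated by a gap.
reads = [(0, 4), (2, 6), (12, 16), (14, 18)]
depth = coverage(20, reads)
print(uncovered_gaps(depth))  # [(6, 12)]
```

The trailing uncovered bases are not reported because nothing covered follows them. Each reported gap is only a candidate intron.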

Cufflinks is a very popular software package that converts such reads into gene expression levels. To do this, the splice junctions must first be identified and cataloged, which is done by the TopHat software (part of the same suite as Cufflinks). The supplementary materials section of their paper provides a detailed statistical treatment of how you go from a giant list of short RNA-seq reads to gene expression levels and splice sites.

http://cufflinks.cbcb.umd.edu/ho…

Many "read assemblers", such as Cufflinks, offer two modes: an annotation-guided mode and a de novo mode. Their paper (see link above) gives further details on the de novo mode.

33 Replies to “How can RNA splice junctions be detected?”

  1. RNA-seq (sequencing of RNA) gives you reads that come from both spliced and unspliced transcripts (conceptually think "mRNA"). The trick is being able to separate the two categories.

    Reads from the middle of an exon (not junction reads) will often map back to the genome with few or no mismatches.

    One Read (non-jxn)      5'----------3'
    Transcript:              ||
    Chromosome DNA: ================================

    But when you map a read that comes from a splice junction (jxn), half of the read comes from an upstream exon and the other half from a downstream exon. When that read maps back to the genome, it will map with few or no mismatches to two separate places:

    One Read (jxn)               5'-----------/                      \---------3'
    Transcript:              ||
    Chromosome DNA: ================================

    The splicing signals "GU" and "AG" (GT and AG in the genomic DNA) should appear where the read splits: GU at the start of the gap and AG at its end.

    In-house software we designed makes sure the GU/AG signals are present and that there are at least a couple of reads that "split the junction" the same way.

    Hope that helps
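The two checks described in the reply above (splice signals present, and at least a couple of reads splitting the junction the same way) can be sketched as follows. This is not the commenter's in-house software, only an illustration under stated assumptions: each split read has already been reduced to the `(intron_start, intron_end)` interval it implies (0-based, end-exclusive), only the forward strand is considered, and the signal "GU" is checked as "GT" because the comparison is against genomic DNA. All names and the toy genome are hypothetical.

```python
from collections import Counter


def has_canonical_motif(genome, intron_start, intron_end):
    """Check for GT at the intron's first two bases and AG at its last two.

    Coordinates are 0-based, end-exclusive, on the genomic (DNA) strand,
    where the RNA signal "GU" reads as "GT".
    """
    intron = genome[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def supported_junctions(genome, split_reads, min_reads=2):
    """Keep junctions with canonical motifs and at least `min_reads` split reads.

    split_reads -- iterable of (intron_start, intron_end), one per split read
    """
    counts = Counter(split_reads)
    return sorted(
        junction
        for junction, n in counts.items()
        if n >= min_reads and has_canonical_motif(genome, *junction)
    )


# Toy genome: exon "ACGTAC", intron "GTAAACCAG", exon "TTGCA".
genome = "ACGTAC" + "GTAAACCAG" + "TTGCA"
# (6, 15) is seen three times; (2, 15) only once; (6, 14) lacks the AG signal.
reads = [(6, 15), (6, 15), (6, 15), (2, 15), (6, 14), (6, 14)]
print(supported_junctions(genome, reads))  # [(6, 15)]
```

Requiring several reads to agree on the same split guards against a single misaligned read creating a false junction; the default `min_reads=2` is an arbitrary stand-in for "at least a couple".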
