BioC++ core-0.7.0
The Modern C++ libraries for Bioinformatics.
 
Nucleotide

Provides the different DNA and RNA alphabet types.

Classes

class  bio::alphabet::dna15
 The 15 letter DNA alphabet, containing all IUPAC symbols minus the gap.

class  bio::alphabet::dna16sam
 A 16 letter DNA alphabet, containing all IUPAC symbols minus the gap and plus an equality sign ('=').

class  bio::alphabet::dna4
 The four letter DNA alphabet of A,C,G,T.

class  bio::alphabet::dna5
 The five letter DNA alphabet of A,C,G,T and the unknown character N.

class  bio::alphabet::nucleotide_base< derived_type, size >
 A CRTP-base that refines bio::alphabet::base and is used by the nucleotides.

class  bio::alphabet::rna15
 The 15 letter RNA alphabet, containing all IUPAC symbols minus the gap.

class  bio::alphabet::rna4
 The four letter RNA alphabet of A,C,G,U.

class  bio::alphabet::rna5
 The five letter RNA alphabet of A,C,G,U and the unknown character N.
 

Concepts

concept  bio::alphabet::nucleotide
 A concept that indicates whether an alphabet represents nucleotides.
 

Function objects (Nucleotide)

constexpr auto bio::alphabet::complement
 Return the complement of a nucleotide object.
 

Detailed Description

Provides the different DNA and RNA alphabet types.

Introduction

Nucleotide sequences are at the core of most bioinformatic data processing and while it is possible to represent them in a regular std::string, it makes sense to have specialised data structures in most cases. This sub-module offers multiple nucleotide alphabets that can be used with regular containers and ranges.

The columns dna15, dna5, dna4, rna15, rna5 and rna4 refer to the types bio::alphabet::dna15, bio::alphabet::dna5, bio::alphabet::dna4, bio::alphabet::rna15, bio::alphabet::rna5 and bio::alphabet::rna4.

Letter  Description       dna15  dna5  dna4  rna15  rna5  rna4
A       Adenine           A      A     A     A      A     A
C       Cytosine          C      C     C     C      C     C
G       Guanine           G      G     G     G      G     G
T       Thymine (DNA)     T      T     T     U      U     U
U       Uracil (RNA)      T      T     T     U      U     U
M       A or C            M      N     A     M      N     A
R       A or G            R      N     A     R      N     A
W       A or T            W      N     A     W      N     A
Y       C or T            Y      N     C     Y      N     C
S       C or G            S      N     C     S      N     C
K       G or T            K      N     G     K      N     G
V       A or C or G       V      N     A     V      N     A
H       A or C or T       H      N     A     H      N     A
D       A or G or T       D      N     A     D      N     A
B       C or G or T       B      N     C     B      N     C
N       A or C or G or T  N      N     A     N      N     A
Size                      15     5     4     15     5     4

Keep in mind that while we think of "the nucleotide alphabet" as consisting of four bases, more characters are defined, with different levels of ambiguity. Depending on your application, it makes sense either to preserve this ambiguity or to discard it to save space and/or optimise computations. BioC++ offers six distinct nucleotide alphabet types to accommodate this. A seventh alphabet, bio::alphabet::dna16sam, implements the alphabet used in SAM/BAM/CRAM.

The specialised RNA alphabets are provided for convenience; however, the DNA alphabets can also handle being assigned a 'U' character. See below for the details.

Which alphabet to choose?

  1. in most cases, take bio::alphabet::dna15 (includes all IUPAC characters)
  2. if you are memory constrained and sequence data is actually the main memory consumer, use bio::alphabet::dna5
  3. if you use specialised algorithms that profit from a 2-bit representation, use bio::alphabet::dna4
  4. if you are doing only RNA input/output, use the respective bio::alphabet::rna* type
  5. to actually save space from using smaller alphabets, you need a compressed container (e.g. bio::ranges::bitcompressed_vector)
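The last point can be illustrated with a minimal sketch of 2-bit packing. It is not the implementation of bio::ranges::bitcompressed_vector; the class packed_dna4 and its members are hypothetical names used only to show why a four-letter alphabet needs a compressed container before it saves any memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch
{
// Hypothetical helper, not part of BioC++: stores ranks 0..3 using 2 bits each,
// so four bases share one byte.
class packed_dna4
{
public:
    void push_back(std::uint8_t rank)
    {
        if (size_ % 4 == 0) // start a new byte every four bases
            bytes_.push_back(0);
        bytes_.back() |= static_cast<std::uint8_t>((rank & 0x3u) << (2 * (size_ % 4)));
        ++size_;
    }

    // Returns the rank (0..3) stored at position i.
    std::uint8_t operator[](std::size_t i) const
    {
        return static_cast<std::uint8_t>((bytes_[i / 4] >> (2 * (i % 4))) & 0x3u);
    }

    std::size_t size() const { return size_; }
    std::size_t bytes_used() const { return bytes_.size(); }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t               size_ = 0;
};
} // namespace sketch
```

A plain std::vector of one-byte letters would need one byte per base no matter which alphabet is used; the packed variant above needs one byte per four bases.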

Printing and conversion to char

As with all alphabets in BioC++, none of the nucleotide alphabets can be directly converted to char or printed with iostreams. You need to explicitly call bio::alphabet::to_char to convert to char or use the {fmt}-library which automatically converts.

T and U are represented by the same rank and you cannot differentiate between them. The only difference between e.g. bio::alphabet::dna4 and bio::alphabet::rna4 is the output when calling to_char().
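The following sketch illustrates this idea without using the library itself: two lookup tables share the same ranks and differ only in the character printed for the last rank. The rank order A, C, G, T/U and all names in namespace sketch are assumptions made for illustration.

```cpp
#include <array>
#include <cstdint>

namespace sketch
{
// Hypothetical rank-to-character tables; rank 3 is the shared T/U rank.
constexpr std::array<char, 4> dna4_chars{'A', 'C', 'G', 'T'};
constexpr std::array<char, 4> rna4_chars{'A', 'C', 'G', 'U'};

constexpr char dna4_to_char(std::uint8_t rank) { return dna4_chars[rank]; }
constexpr char rna4_to_char(std::uint8_t rank) { return rna4_chars[rank]; }

// Same rank, different output character.
static_assert(dna4_to_char(3) == 'T', "DNA view prints T");
static_assert(rna4_to_char(3) == 'U', "RNA view prints U");
} // namespace sketch
```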

Assignment and conversions between nucleotide types

When assigning from char or converting from a larger nucleotide alphabet to a smaller one, loss of information can occur since some bases are not available in the smaller alphabet. When converting to bio::alphabet::dna5 or bio::alphabet::rna5, non-canonical bases (letters other than A, C, G, T, U) are converted to 'N' to preserve ambiguity at that position, while for bio::alphabet::dna4 and bio::alphabet::rna4 they are converted to the first of the possibilities they represent (because there is no letter 'N' to represent ambiguity). The table at the top shows which conversions take place: wherever an entry differs from the input letter, that letter is converted.

char values that are none of the IUPAC symbols, e.g. 'P', are always converted to the equivalent of assigning 'N', i.e. they result in 'A' for bio::alphabet::dna4 and bio::alphabet::rna4, and in 'N' for the other alphabets.
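The conversion rules above can be written down as two small lookup functions. This is a sketch derived from the table at the top, not the library's code; the function names are hypothetical, only upper-case input is handled, and the result is given as the character the converted letter would print in the DNA alphabets.

```cpp
namespace sketch
{
// Hypothetical: character obtained after assigning c to a five-letter DNA alphabet.
constexpr char to_dna5_char(char c)
{
    switch (c)
    {
        case 'A': case 'C': case 'G': case 'T': return c;
        case 'U': return 'T';
        default:  return 'N'; // ambiguous IUPAC letters and non-IUPAC characters
    }
}

// Hypothetical: character obtained after assigning c to a four-letter DNA alphabet.
constexpr char to_dna4_char(char c)
{
    switch (c)
    {
        case 'A': case 'C': case 'G': case 'T': return c;
        case 'U': return 'T';
        case 'Y': case 'S': case 'B': return 'C'; // first possibility is C
        case 'K': return 'G';                     // first possibility is G
        default:  return 'A'; // M, R, W, V, H, D, N and non-IUPAC characters
    }
}

static_assert(to_dna5_char('M') == 'N', "ambiguity is kept as N");
static_assert(to_dna4_char('K') == 'G', "K (G or T) becomes G");
static_assert(to_dna4_char('P') == 'A', "non-IUPAC behaves like N, which becomes A");
} // namespace sketch
```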

Literals

To avoid writing dna4{}.assign_char('C') every time, you may instead use the literal 'C'_dna4. All nucleotide types defined here have character literals and also string literals which return a vector of the respective type.
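For readers unfamiliar with the mechanism, the sketch below shows how character and string literals of this kind can be built from C++ user-defined literal operators. The type base4 and the suffix _base4 are hypothetical stand-ins; the BioC++ literals may be implemented differently.

```cpp
#include <cstddef>
#include <vector>

namespace sketch
{
// Hypothetical letter type that simply stores its character.
struct base4
{
    char value;
};

constexpr bool operator==(base4 a, base4 b) { return a.value == b.value; }

inline namespace literals
{
// Character literal: 'C'_base4 yields a single letter.
constexpr base4 operator""_base4(char c) noexcept { return base4{c}; }

// String literal: "ACGT"_base4 yields a vector of letters.
inline std::vector<base4> operator""_base4(char const * s, std::size_t n)
{
    std::vector<base4> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(base4{s[i]});
    return out;
}
} // namespace literals

static_assert('C'_base4 == base4{'C'}, "character literal builds one letter");
} // namespace sketch
```

Placing the operators in an inline namespace lets users write a using-directive for the literals alone, without importing the rest of the enclosing namespace.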

Concept

The nucleotide submodule defines bio::alphabet::nucleotide which encompasses all the alphabets defined in the submodule and refines bio::alphabet::alphabet. The only additional requirement is that their values can be complemented, see below.

Complement

Letter  Description       Complement
A       Adenine           T
C       Cytosine          G
G       Guanine           C
T       Thymine (DNA)     A
U       Uracil (RNA)      A
M       A or C            K
R       A or G            Y
W       A or T            W
Y       C or T            R
S       C or G            S
K       G or T            M
V       A or C or G       B
H       A or C or T       D
D       A or G or T       H
B       C or G or T       V
N       A or C or G or T  N

In the typical structure of DNA molecules (or double-stranded RNA), each nucleotide has a complement that it pairs with. To generate the complement value of a nucleotide letter, call bio::alphabet::complement() on it.

For the ambiguous letters, the complement is the (possibly also ambiguous) letter that represents the set of the individual complements. For example, M (A or C) is complemented to K (T or G).
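One way to see why the table has this shape is to treat every IUPAC letter as a set over the four bases and complement each member of the set. The sketch below does this with a 4-bit mask. It is an illustration only: the names are hypothetical, the library may well use a precomputed table instead, and the output uses DNA letters (T rather than U).

```cpp
#include <cstdint>

namespace sketch
{
// Bit set over the four bases: bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T.
constexpr std::uint8_t to_mask(char c)
{
    switch (c)
    {
        case 'A': return 0b0001;
        case 'C': return 0b0010;
        case 'G': return 0b0100;
        case 'T': case 'U': return 0b1000;
        case 'M': return 0b0011; // A or C
        case 'R': return 0b0101; // A or G
        case 'W': return 0b1001; // A or T
        case 'Y': return 0b1010; // C or T
        case 'S': return 0b0110; // C or G
        case 'K': return 0b1100; // G or T
        case 'V': return 0b0111; // A or C or G
        case 'H': return 0b1011; // A or C or T
        case 'D': return 0b1101; // A or G or T
        case 'B': return 0b1110; // C or G or T
        default:  return 0b1111; // N (also used for unknown characters)
    }
}

// Letter for each mask value 1..15; index 0 (empty set) never occurs here.
constexpr char mask_letters[17] = "NACMGRSVTWYHKDBN";

constexpr char from_mask(std::uint8_t m) { return mask_letters[m & 0xF]; }

// Complementing every member swaps A with T and C with G, i.e. reverses the four bits.
constexpr std::uint8_t complement_mask(std::uint8_t m)
{
    return static_cast<std::uint8_t>(((m & 0b0001) << 3) | ((m & 0b0010) << 1) |
                                     ((m & 0b0100) >> 1) | ((m & 0b1000) >> 3));
}

constexpr char complement_char(char c) { return from_mask(complement_mask(to_mask(c))); }

static_assert(complement_char('M') == 'K', "A or C pairs with T or G");
static_assert(complement_char('W') == 'W', "A or T is its own complement");
} // namespace sketch
```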

Variable Documentation

◆ complement

constexpr auto bio::alphabet::complement
inline constexpr

Return the complement of a nucleotide object.

Template Parameters
alph_type  Type of the argument.
Parameters
alph  The nucleotide object for which you want to receive the complement.
Returns
The complement character of alph, e.g. 'C' for 'G'.

This is a function object. Invoke it with the parameter(s) specified above.

It is defined for all nucleotide alphabets in BioC++.

Example

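// Header paths assumed from the library layout.
#include <bio/alphabet/nucleotide/concept.hpp> // bio::alphabet::complement
#include <bio/alphabet/nucleotide/rna5.hpp>    // bio::alphabet::rna5 and its literals

using namespace bio::alphabet::literals;

int main()
{
    auto r1 = 'A'_rna5.complement();               // calls member function rna5::complement(); r1 == 'U'_rna5
    auto r2 = bio::alphabet::complement('A'_rna5); // calls the bio::alphabet::complement function object on the rna5 object; r2 == 'U'_rna5
}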

Customisation point

This is a customisation point (see Customisation). If you don't want to create your own alphabet, everything below is irrelevant to you!

This object acts as a wrapper and looks for an implementation with the following signature:

constexpr alph_type tag_invoke(bio::alphabet::custom::complement, alph_type const alph) noexcept
{}

Functions are found via ADL and considered only if they are marked noexcept (constexpr is not required, but recommended). The returned type must be (implicitly) convertible to alph_type (usually it should be the same type).

To specify the behaviour for your own alphabet type, simply provide the above function as a friend or free function.
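The dispatch pattern described above can be sketched in standard C++ without the library. Everything in namespace sketch is a hypothetical stand-in (complement_tag for bio::alphabet::custom::complement, complement_fn for the wrapper object), and my_lib::my_base is an invented two-letter alphabet. Note one deliberate simplification: the sketch rejects a non-noexcept overload with a static_assert, whereas the text above says such overloads are simply not considered.

```cpp
namespace sketch
{
namespace custom
{
struct complement_tag {}; // hypothetical stand-in for the customisation tag
} // namespace custom

// Hypothetical wrapper: forwards to a tag_invoke() overload found via ADL.
struct complement_fn
{
    template <typename alph_type>
    constexpr alph_type operator()(alph_type const alph) const noexcept
    {
        static_assert(noexcept(tag_invoke(custom::complement_tag{}, alph)),
                      "the tag_invoke() overload must be marked noexcept");
        return tag_invoke(custom::complement_tag{}, alph);
    }
};

inline constexpr complement_fn complement{};
} // namespace sketch

namespace my_lib
{
// Invented alphabet with two letters that are each other's complement.
struct my_base
{
    bool is_first;

    // Provided as a friend so that ADL finds it through the my_base argument.
    friend constexpr my_base tag_invoke(sketch::custom::complement_tag, my_base const b) noexcept
    {
        return my_base{!b.is_first};
    }
};
} // namespace my_lib

static_assert(!sketch::complement(my_lib::my_base{true}).is_first,
              "the wrapper dispatches to the friend overload");
```

Because the overload is a friend of my_base, it does not pollute the enclosing namespace and is only reachable through the wrapper or an unqualified call with a my_base argument.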