BioC++ core-0.7.0
The Modern C++ libraries for Bioinformatics.
 
Alphabet

Modules

 Aminoacid
 Provides the amino acid alphabets and functionality for translation from nucleotide.
 
 CIGAR
 Provides (semi-)alphabets for representing elements in CIGAR strings.
 
 Composite
 Provides templates for combining existing alphabets into new alphabet types.
 
 Custom
 Provides customisation tags and alphabet adaptations of standard char and uint types.
 
 Gap
 Provides the gap alphabet and functionality to make an alphabet a gapped alphabet.
 
 Mask
 Provides the mask alphabet and functionality for creating masked composites.
 
 Nucleotide
 Provides the different DNA and RNA alphabet types.
 
 Quality
 Provides the various quality score types.
 

Namespaces

namespace  bio::alphabet
 The alphabet module's namespace.
 
namespace  bio::alphabet::custom
 A namespace for third party and standard library specialisations of BioC++ customisation points.
 
namespace  bio::alphabet::literals
 An inline namespace for alphabet literals. It exists to safely allow using namespace.
 

Classes

class  bio::alphabet::base< derived_type, size, char_t >
 A CRTP-base that makes defining a custom alphabet easier. More...
 
class  bio::alphabet::base< derived_type, 1ul, char_t >
 Specialisation of bio::alphabet::base for alphabets of size 1. More...
 
struct  std::hash< alphabet_t >
 Struct for hashing a character. More...
 
class  bio::alphabet::proxy_base< derived_type, alphabet_type >
 A CRTP-base that eases the definition of proxy types returned in place of regular alphabets. More...
 

Concepts

concept  bio::alphabet::semialphabet
 The basis for bio::alphabet::alphabet, but requires only rank interface (not char).
 
concept  bio::alphabet::writable_semialphabet
 A refinement of bio::alphabet::semialphabet that adds assignability.
 
concept  bio::alphabet::alphabet
 The generic alphabet concept that covers most data types used in ranges.
 
concept  bio::alphabet::writable_alphabet
 Refines bio::alphabet::alphabet and adds assignability.
 

Typedefs

template<typename alphabet_type >
using bio::alphabet::char_t = decltype(bio::alphabet::to_char(std::declval< alphabet_type const >()))
 The char_type of the alphabet; defined as the return type of bio::alphabet::to_char.
 
template<typename semi_alphabet_type >
using bio::alphabet::rank_t = decltype(bio::alphabet::to_rank(std::declval< semi_alphabet_type >()))
 The rank_type of the semi-alphabet; defined as the return type of bio::alphabet::to_rank.
 

Variables

template<typename alph_t , typename wrap_t = meta::default_initialisable_wrap_t<alph_t>>
constexpr auto bio::alphabet::size
 A type trait that holds the size of a (semi-)alphabet.
 

Function objects

constexpr auto bio::alphabet::to_rank
 Return the rank representation of a (semi-)alphabet object.
 
constexpr auto bio::alphabet::assign_rank_to
 Assign a rank to an alphabet object.
 
constexpr auto bio::alphabet::to_char
 Return the char representation of an alphabet object.
 
constexpr auto bio::alphabet::assign_char_to
 Assign a char to an alphabet object.
 
template<typename alph_t >
constexpr auto bio::alphabet::char_is_valid_for
 Returns whether a character is in the valid set of a bio::alphabet::alphabet (usually implies a bijective mapping to an alphabet value).
 
constexpr auto bio::alphabet::assign_char_strictly_to
 Assign a character to an alphabet object, throw if the character is not valid.
 

Detailed Description

Introduction

Alphabets are a core component in BioC++. They enable us to represent the smallest unit of biological sequence data, e.g. a nucleotide or an amino acid.

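In theory, these could just be represented as a char and this is how many people perceive them, but it makes sense to use a smaller, stricter and well-defined alphabet in almost all cases. The sections below give the main reasons: smaller alphabets can be stored more compactly, and a dedicated type removes the ambiguity between the character and the rank representation.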

In BioC++ there are alphabet types for typical sequence alphabets like DNA and amino acid, but also for qualities, RNA structures and alignment gaps. In addition there are templates for combining alphabet types into new alphabets, and wrappers for existing data types like the canonical char.

In addition to concrete alphabet types, BioC++ provides multiple concepts that describe groups of alphabets by their properties and can be used to constrain templates so that they only work with certain alphabet types. See the Tutorial on Concepts for a gentle introduction to the topic.

The alphabet concepts

alphabet size

All alphabets in BioC++ have a fixed size. It can be queried via the bio::alphabet::size type trait and optionally also the alphabet_size static member of the alphabet (see below for "members VS free/global functions").

In some areas we provide alphabet types with different sizes for the same purpose, e.g. bio::alphabet::dna4 ('A', 'C', 'G', 'T'), bio::alphabet::dna5 (plus 'N') and bio::alphabet::dna15 (plus ambiguous characters defined by IUPAC). By convention most of our alphabets carry their size in their name (bio::alphabet::dna4 has size 4, and so on).

A main reason for choosing a smaller alphabet over a bigger one is the possibility of optimising for space efficiency. Note, however, that a single letter by itself can never be smaller than a byte for architectural reasons. Actual space improvements are realised via secondary structures, e.g. when using a bio::ranges::bitcompressed_vector<bio::alphabet::dna4> instead of std::vector<bio::alphabet::dna4>. Also the single letter quality composite bio::alphabet::qualified<bio::alphabet::dna4, bio::alphabet::phred42> fits into one byte, because the product of the alphabet sizes (4 * 42) is smaller than 256; whereas the same composite with bio::alphabet::dna15 requires two bytes per letter (15 * 42 > 256).
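The arithmetic behind these storage claims can be checked with a small sketch. The helpers bytes_for_ranks and bits_for_ranks are hypothetical names introduced only for this illustration; they are not part of BioC++ and say nothing about how the library computes its layouts.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of BioC++): number of whole bytes needed to
// store one value out of `alphabet_size` distinct ranks.
constexpr std::size_t bytes_for_ranks(std::uint64_t alphabet_size) noexcept
{
    std::size_t   bytes    = 1;
    std::uint64_t capacity = 256; // distinct values that fit into one byte
    while (bytes < 8 && capacity < alphabet_size)
    {
        capacity *= 256;
        ++bytes;
    }
    return bytes;
}

// Hypothetical helper: number of bits a bit-packed container needs per letter.
constexpr std::size_t bits_for_ranks(std::uint64_t alphabet_size) noexcept
{
    std::size_t bits = 0;
    while (bits < 64 && (std::uint64_t{1} << bits) < alphabet_size)
        ++bits;
    return bits;
}

static_assert(bytes_for_ranks(4 * 42) == 1);  // 168 combinations fit into one byte
static_assert(bytes_for_ranks(15 * 42) == 2); // 630 combinations need two bytes
static_assert(bits_for_ranks(4) == 2);        // four ranks fit into two bits
```

The last assertion hints at why a bit-compressed container over a four-letter alphabet can be smaller than a std::vector, which spends at least one byte per element.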

Assigning and retrieving values

As mentioned above, we typically think of alphabets in their character representation, but we also require them in "rank representation" as programmers. In C and C++ it is quite difficult to cleanly differentiate between these, because the char type is considered an integral type and can be used to index an array (e.g. my_array['A'] translates to my_array[65]). Moreover the sign of char is implementation defined and on many platforms the smallest integer types int8_t and uint8_t are literally the same types as signed char and unsigned char respectively.
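The blurring between characters and small integers can be reproduced with the standard library alone. The following sketch assumes an ASCII execution character set and a platform where std::uint8_t is an alias of unsigned char; the function names are made up for this illustration.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Streams the value unchanged. Where std::uint8_t is unsigned char,
// the stream treats it as a character, not as a number.
inline std::string stream_raw(std::uint8_t value)
{
    std::ostringstream out;
    out << value;
    return out.str();
}

// Streams the value after widening it, which forces numeric output.
inline std::string stream_as_number(std::uint8_t value)
{
    std::ostringstream out;
    out << static_cast<unsigned>(value);
    return out.str();
}

// A character literal silently works as an array index (65 for 'A' in ASCII).
inline int lookup_by_char()
{
    int table[128] = {};
    table[65] = 7;
    return table['A'];
}
```

Under the stated assumptions, stream_raw(65) yields the text "A" while stream_as_number(65) yields "65", although both receive the same value.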

This leads to ambiguity when assigning and retrieving values:

int main()
{
    // does not work:
    // bio::alphabet::dna4 my_letter{0};   // we want to set the default, an A
    // bio::alphabet::dna4 my_letter{'A'}; // we also want to set an A, but we are setting value 65
    // std::cout << my_letter;             // you expect 'A', but how would you access the number?
}

To solve this problem, alphabets in BioC++ define two interfaces:

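  1. a rank based interface with
       - bio::alphabet::to_rank to produce the rank representation;
       - bio::alphabet::assign_rank_to to assign from the rank representation;
  2. a character based interface with
       - bio::alphabet::to_char to produce the character representation;
       - bio::alphabet::assign_char_to to assign from the character representation;
       - bio::alphabet::char_is_valid_for to check whether a character is valid for the alphabet;
       - bio::alphabet::assign_char_strictly_to to assign a character and throw if it is not valid.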

To prevent the aforementioned ambiguity, you can neither assign from the rank or char representation via operator=, nor cast the alphabet to either of its representation forms. You need to use the interfaces explicitly.

For efficiency, the representation saved internally is normally the rank representation, and the character representation is generated via conversion tables. This is, however, not required as long as both interfaces are provided and all functions operate in constant time.
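A minimal sketch of this design is shown below. The type toy_table::dna4_like and its tables are hypothetical stand-ins written for this page; they are not the BioC++ implementation. The sketch stores only the rank and derives the character through lookup tables, so every operation runs in constant time. Mapping unknown characters to rank 0 is a choice made for the sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

namespace toy_table
{
// Builds the char -> rank table once at compile time.
// Sketch choice: every character other than A/C/G/T (either case) maps to rank 0.
constexpr std::array<std::uint8_t, 256> make_char_to_rank() noexcept
{
    std::array<std::uint8_t, 256> table{};
    table['C'] = 1; table['c'] = 1;
    table['G'] = 2; table['g'] = 2;
    table['T'] = 3; table['t'] = 3;
    return table;
}

inline constexpr std::array<char, 4>           rank_to_char{'A', 'C', 'G', 'T'};
inline constexpr std::array<std::uint8_t, 256> char_to_rank = make_char_to_rank();

// Hypothetical four-letter alphabet that stores the rank internally.
class dna4_like
{
public:
    constexpr std::uint8_t to_rank() const noexcept { return rank_; }
    constexpr char to_char() const noexcept { return rank_to_char[rank_]; }

    constexpr dna4_like & assign_rank(std::uint8_t rank) noexcept
    {
        rank_ = rank; // like the text says for too-large ranks: no checking here
        return *this;
    }

    constexpr dna4_like & assign_char(char chr) noexcept
    {
        rank_ = char_to_rank[static_cast<unsigned char>(chr)];
        return *this;
    }

private:
    std::uint8_t rank_ = 0;
};

static_assert(sizeof(dna4_like) == 1); // a single letter occupies one byte
} // namespace toy_table
```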

The same applies to printing: a letter has to be converted explicitly before it is streamed, although we provide overloads for the {fmt} library in <bio/alphabet/fmt.hpp>.

Here is an example of explicit assignment of a rank and char, and how it can be printed via std::cout and {fmt}:

#include <iostream>                         // for std::cout
#include <bio/alphabet/fmt.hpp>             // for fmt::print
#include <bio/alphabet/nucleotide/dna4.hpp> // for bio::alphabet::dna4 (header path assumed)
int main()
{
    bio::alphabet::dna4 my_letter{};                          // the letter that is assigned to below
    bio::alphabet::assign_rank_to(0, my_letter);              // assign an A via rank interface
    bio::alphabet::assign_char_to('A', my_letter);            // assign an A via char interface
    std::cout << bio::alphabet::to_char(my_letter);           // prints 'A'
    std::cout << (unsigned)bio::alphabet::to_rank(my_letter); // prints 0
    // we have to add the cast here, because uint8_t is also treated as a char type by default :(
    // Using the format library:
    fmt::print("{}", bio::alphabet::to_char(my_letter)); // prints 'A'
    fmt::print("{}", my_letter);                         // prints 'A' (calls to_char() automatically!)
    fmt::print("{}", bio::alphabet::to_rank(my_letter)); // prints 0 (casts uint8_t to unsigned automatically!)
}

To reduce the burden of calling bio::alphabet::assign_char_to often, most alphabets in BioC++ provide custom literals for the alphabet and sequences over the alphabet:

#include <vector> // for std::vector
int main()
{
    using namespace bio::alphabet::literals;
    bio::alphabet::dna4 my_letter = 'A'_dna4;              // identical to assign_char_to('A', my_letter);
    std::vector<bio::alphabet::dna4> my_seq = "ACGT"_dna4; // identical to calling assign_char_to for each element
}

Note, however, that literals are not required by the concept.
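For readers who want to offer literals for their own alphabet, the general C++ mechanism looks roughly as follows. The type toy_literals::letter and the suffix _toy are hypothetical; the sketch only demonstrates user-defined literal operators placed in an inline namespace and does not reproduce how BioC++ defines its literals.

```cpp
#include <cstddef>
#include <vector>

namespace toy_literals
{
struct letter
{
    char value; // a real alphabet would store a rank; a char keeps the sketch short
};

inline namespace literals
{
// Character literal: 'A'_toy
constexpr letter operator""_toy(char chr) noexcept
{
    return letter{chr};
}

// String literal: "ACGT"_toy creates one letter per character.
inline std::vector<letter> operator""_toy(char const * str, std::size_t len)
{
    std::vector<letter> sequence;
    sequence.reserve(len);
    for (std::size_t i = 0; i < len; ++i)
        sequence.push_back(letter{str[i]});
    return sequence;
}
} // namespace literals
} // namespace toy_literals
```

Because the operators live in an inline namespace, a using-directive for toy_literals::literals makes only the literals visible and leaves the rest of the enclosing namespace untouched.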

Different concepts

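All types that have valid implementations of the functions/functors described above model the concept bio::alphabet::writable_alphabet. This is the strongest (i.e. most refined) general case concept. There are more refined concepts for specific biological applications (like bio::alphabet::nucleotide), and there are less refined concepts that only model part of an alphabet:

  - bio::alphabet::semialphabet and bio::alphabet::writable_semialphabet require only the rank interface, not the char interface;
  - bio::alphabet::alphabet (without "writable") requires the rank and char representations to be readable, but not assignable.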

Typically you will use bio::alphabet::alphabet in "read-only" situations (e.g. const parameters) and bio::alphabet::writable_alphabet whenever the values might be changed. Semi-alphabets are less useful in application code.

                                          semialphabet  writable_semialphabet  alphabet  writable_alphabet  Aux
bio::alphabet::size                       x             x                      x         x
bio::alphabet::to_rank()                  x             x                      x         x
bio::alphabet::rank_t                     x             x                      x         x                  x
bio::alphabet::assign_rank_to()                         x                                x
bio::alphabet::to_char()                                                       x         x
bio::alphabet::char_t                                                          x         x                  x
bio::alphabet::assign_char_to()                                                          x
bio::alphabet::char_is_valid_for()                                                       x
bio::alphabet::assign_char_strictly_to()                                                 x                  x

(x = available for types that model the concept; Aux = auxiliary entity)

The above table shows all alphabet concepts and related functions and type traits. The entities marked as "auxiliary" provide shortcuts to the other "essential" entities. This difference is only relevant if you want to create your own alphabet (you do not need to provide an implementation for the "auxiliary" entities, they are provided automatically).

Members VS free/global functions

The alphabet concept (like most concepts in BioC++) looks for free/global functions, i.e. you need to be able to call bio::alphabet::to_rank(my_letter). However, most alphabets also provide a member function, i.e. my_letter.to_rank(). The same is true for the type trait bio::alphabet::size vs the static data member alphabet_size.

Members are provided for convenience and if you are an application developer who works with a single concrete alphabet type you are fine with using the member functions. If you, however, implement a generic function that accepts different alphabet types, you need to use the free function / type trait interface, because it is the only interface guaranteed to exist (member functions are not required/enforced by the concept).
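The difference matters as soon as code is generic. In the sketch below, the two letter types and the function rank_sum are hypothetical and unrelated to the BioC++ sources: one type offers a member function plus a forwarding free function, the other offers only the free function. The generic function compiles for both because it relies on the free function alone.

```cpp
#include <cstddef>
#include <vector>

namespace toy_generic
{
// Offers a member function and, for generic code, a free function forwarding to it.
struct with_member
{
    unsigned rank = 0;
    constexpr unsigned to_rank() const noexcept { return rank; }
};
constexpr unsigned to_rank(with_member letter) noexcept { return letter.to_rank(); }

// Offers only the free function.
struct free_only
{
    unsigned rank = 0;
};
constexpr unsigned to_rank(free_only letter) noexcept { return letter.rank; }

// Generic code relies on the free function only, so it accepts both types.
template <typename letter_t>
unsigned rank_sum(std::vector<letter_t> const & sequence)
{
    unsigned sum = 0;
    for (letter_t const & letter : sequence)
        sum += to_rank(letter); // unqualified call; letter.to_rank() would fail for free_only
    return sum;
}
} // namespace toy_generic
```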

Containers over alphabets

In BioC++ it is recommended you use the STL container classes like std::vector for storing sequence data, but you can use other class templates if they satisfy the respective bio::ranges::detail::container concept, e.g. std::deque or folly::fbvector or even Qt::QVector.

std::basic_string is also supported, however, we recommend against using it, because it is not safe (and not useful) to call certain members like .c_str() if our alphabets are used as value type.

We provide specialised containers with certain properties in the Ranges module.

A container over a bio::alphabet::alphabet automatically models the bio::alphabet::sequence concept.

Variable Documentation

◆ assign_char_strictly_to

constexpr auto bio::alphabet::assign_char_strictly_to
inline constexpr

Assign a character to an alphabet object, throw if the character is not valid.

Template Parameters
    alph_type  Type of the target object.
Parameters
    chr   The character being assigned; must be of the bio::alphabet::char_t of the target object.
    alph  The target object; its type must model bio::alphabet::alphabet.
Returns
    Reference to alph if alph was given as lvalue, otherwise a copy.
Exceptions
    bio::alphabet::invalid_char_assignment  If bio::alphabet::char_is_valid_for<decltype(alph)>(chr) == false.

This is a function object. Invoke it with the parameters specified above.

Note that this is not a customisation point and it cannot be "overloaded". It simply invokes bio::alphabet::char_is_valid_for and bio::alphabet::assign_char_to.
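The composition described here can be sketched with stand-in functions. Everything in namespace toy_strict is hypothetical: the letter type, the validity rule, and the use of std::invalid_argument in place of the library's exception type. Only the control flow (validate first, then delegate to the lenient assignment) mirrors the description above.

```cpp
#include <stdexcept>

namespace toy_strict
{
struct dna4_like
{
    char value = 'A';
};

// Stand-in validity check: only upper-case A, C, G, T are accepted in this sketch.
constexpr bool char_is_valid(char chr) noexcept
{
    return chr == 'A' || chr == 'C' || chr == 'G' || chr == 'T';
}

// Stand-in lenient assignment: unknown characters are converted to 'A'.
constexpr dna4_like & assign_char_to(char chr, dna4_like & letter) noexcept
{
    letter.value = char_is_valid(chr) ? chr : 'A';
    return letter;
}

// Strict variant: validate first, then delegate to the lenient assignment.
inline dna4_like & assign_char_strictly_to(char chr, dna4_like & letter)
{
    if (!char_is_valid(chr))
        throw std::invalid_argument{"character is not valid for this alphabet"};
    return assign_char_to(chr, letter);
}

// Demonstration helper: reports whether the strict assignment throws for `chr`.
inline bool strict_assignment_throws(char chr)
{
    dna4_like letter;
    try
    {
        assign_char_strictly_to(chr, letter);
    }
    catch (std::invalid_argument const &)
    {
        return true;
    }
    return false;
}
} // namespace toy_strict
```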

Example

int main()
{
    char c = '!';
    bio::alphabet::assign_char_strictly_to('?', c); // checks char_is_valid_for<char>('?'), then calls assign_char_to('?', c)
    bio::alphabet::dna5 d{};
    bio::alphabet::assign_char_strictly_to('A', d); // 'A' is valid for dna5; assigned via the .assign_char('A') member
    // also works for temporaries:
    bio::alphabet::dna5 d2 = bio::alphabet::assign_char_strictly_to('A', bio::alphabet::dna5{});
}

◆ assign_char_to

constexpr auto bio::alphabet::assign_char_to
inline constexpr

Assign a char to an alphabet object.

Template Parameters
    alph_type  Type of the target object.
Parameters
    chr   The char being assigned; must be of the bio::alphabet::char_t of the target object.
    alph  The target object.
Returns
    Reference to alph if alph was given as lvalue, otherwise a copy.

This is a function object. Invoke it with the parameter(s) specified above.

It is defined for all (semi-)alphabets in BioC++.

Example

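int main()
{
    char c = '!';
    bio::alphabet::assign_char_to('?', c); // calls bio::alphabet::custom::tag_invoke(bio::alphabet::custom::assign_char_to, '?', c)
    bio::alphabet::dna5 d{};
    bio::alphabet::assign_char_to('A', d); // calls .assign_char('A') member
    // also works for temporaries:
    bio::alphabet::dna5 d2 = bio::alphabet::assign_char_to('A', bio::alphabet::dna5{});
    // invalid/unknown characters are converted:
    bio::alphabet::dna5 d3 = bio::alphabet::assign_char_to('!', bio::alphabet::dna5{}); // converted to the unknown character 'N'
}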

Customisation point

This is a customisation point (see Customisation). If you don't want to create your own alphabet, everything below is irrelevant to you!

This object acts as a wrapper and looks for an implementation with the following signature:

constexpr alph_type & tag_invoke(bio::alphabet::custom::assign_char_to, char_type const chr, alph_type & alph) noexcept

Functions are found via ADL and considered only if they are marked noexcept (constexpr is not required, but recommended) and if the returned type is exactly alph_type &.

To specify the behaviour for your own alphabet type, simply provide the above function as a friend or free function.

Note that temporaries of alph_type are handled by this function object and do not require an additional overload.
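The dispatch mechanism can be illustrated with a self-contained sketch of the tag_invoke pattern. All names in toy_cpo and toy_user are hypothetical and the code is not taken from BioC++. In particular, the sketch reports a wrong signature with static_assert, whereas the text above says that unsuitable implementations are simply not considered.

```cpp
#include <type_traits>
#include <utility>

namespace toy_cpo
{
// Tag type that identifies the customisation point.
struct assign_char_to_tag {};

// The function object that users call; it forwards to a tag_invoke() found via ADL.
struct assign_char_to_fn
{
    // lvalue target: returns a reference to the target
    template <typename alph_t>
    constexpr alph_t & operator()(char chr, alph_t & alph) const noexcept
    {
        static_assert(noexcept(tag_invoke(assign_char_to_tag{}, chr, alph)),
                      "the implementation must be marked noexcept");
        static_assert(std::is_same_v<decltype(tag_invoke(assign_char_to_tag{}, chr, alph)), alph_t &>,
                      "the implementation must return exactly alph_t &");
        return tag_invoke(assign_char_to_tag{}, chr, alph);
    }

    // temporary target: assigns to the temporary and returns it by value,
    // so implementers do not need a second overload
    template <typename alph_t, typename = std::enable_if_t<!std::is_lvalue_reference_v<alph_t>>>
    constexpr alph_t operator()(char chr, alph_t && alph) const noexcept
    {
        (*this)(chr, alph); // `alph` is an lvalue inside this function
        return std::move(alph);
    }
};

inline constexpr assign_char_to_fn assign_char_to{};
} // namespace toy_cpo

namespace toy_user
{
class letter
{
public:
    constexpr char get() const noexcept { return value_; }

private:
    char value_ = 'A';

    // Implementation provided as a friend: noexcept, returns exactly letter &.
    friend constexpr letter & tag_invoke(toy_cpo::assign_char_to_tag, char chr, letter & alph) noexcept
    {
        alph.value_ = chr;
        return alph;
    }
};
} // namespace toy_user
```

A call such as toy_cpo::assign_char_to('C', my_letter) then reaches the friend function through argument-dependent lookup, without the caller naming namespace toy_user.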

◆ assign_rank_to

constexpr auto bio::alphabet::assign_rank_to
inline constexpr

Assign a rank to an alphabet object.

Template Parameters
    alph_type  Type of the target object.
Parameters
    rank  The rank being assigned; must be of the bio::alphabet::rank_t of the target object.
    alph  The target object.
Returns
    Reference to alph if alph was given as lvalue, otherwise a copy.

This is a function object. Invoke it with the parameter(s) specified above.

It is defined for all (semi-)alphabets in BioC++.

Example

int main()
{
char c = '!';
bio::alphabet::assign_rank_to(66, c); // calls bio::alphabet::custom::tag_invoke(bio::alphabet::custom::assign_rank_to, 66, c); == 'B'
bio::alphabet::assign_rank_to(2, d); // calls .assign_rank(2) member; == 'G'_dna5
// also works for temporaries:
// too-large ranks are undefined behaviour:
// bio::alphabet::dna5 d3 = bio::alphabet::assign_rank_to(50, bio::alphabet::dna5{});
}

Customisation point

This is a customisation point (see Customisation). If you don't want to create your own alphabet, everything below is irrelevant to you!

This object acts as a wrapper and looks for an implementation with the following signature:

constexpr alph_type & tag_invoke(bio::alphabet::custom::assign_rank_to, rank_type const rank, alph_type & alph) noexcept

Implementations are found via ADL and considered only if they are marked noexcept (constexpr is not required, but recommended) and if the returned type is exactly alph_type &.

To specify the behaviour for your own alphabet type, simply provide the above function as a friend or free function.

Note that temporaries of alph_type are handled by this function object and do not require an additional overload.

◆ char_is_valid_for

template<typename alph_t >
constexpr auto bio::alphabet::char_is_valid_for
inline constexpr

Returns whether a character is in the valid set of a bio::alphabet::alphabet (usually implies a bijective mapping to an alphabet value).

Template Parameters
    alph_type  The alphabet type being queried.
Parameters
    chr   The character being checked; must be convertible to bio::alphabet::char_t<alph_type>.
    alph  The target object; its type must model bio::alphabet::alphabet.
Returns
    true or false.

This is a function object. Invoke it with the parameter(s) specified above.

It is defined for all (semi-)alphabets in BioC++.

Example

int main()
{
    bool b = bio::alphabet::char_is_valid_for<char>('A');
    // calls bio::alphabet::custom::tag_invoke(bio::alphabet::custom::char_is_valid_for, 'A', char{}); always true
    bool c = bio::alphabet::char_is_valid_for<bio::alphabet::dna5>('A');
    // calls dna5::char_is_valid('A') member; == true
    // for some alphabets, characters that are not uniquely mappable are still valid:
    bool d = bio::alphabet::char_is_valid_for<bio::alphabet::dna5>('a');
    // lower case also true
}

Default Behaviour

In contrast to the other alphabet related customisation points, it is optional to provide an implementation of this one for most¹ alphabets, because a default implementation exists.

The default behaviour is that all characters that are "preserved" when assigning to an object are valid, i.e. to_char(assign_char_to(chr, alph_t{})) == chr.

This means that e.g. assigning 'A' to bio::alphabet::dna4 would be valid, but 'a' would not be, because bio::alphabet::to_char() always produces upper-case for bio::alphabet::dna4. For this reason, many alphabets have a specialised validity-check that also accepts lower-case letters as valid.

¹ All alphabets where the type is std::is_nothrow_default_constructible.
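The round-trip rule can be made concrete with a small sketch. The type toy_valid::dna4_like and its two free functions are hypothetical stand-ins; only the expression inside char_is_valid_by_round_trip follows the default behaviour described above.

```cpp
namespace toy_valid
{
struct dna4_like
{
    char value = 'A';
};

constexpr char to_char(dna4_like letter) noexcept
{
    return letter.value;
}

// Lenient assignment: accepts both cases, stores upper case, converts anything else to 'A'.
constexpr dna4_like assign_char_to(char chr, dna4_like letter) noexcept
{
    switch (chr)
    {
        case 'A': case 'a': letter.value = 'A'; break;
        case 'C': case 'c': letter.value = 'C'; break;
        case 'G': case 'g': letter.value = 'G'; break;
        case 'T': case 't': letter.value = 'T'; break;
        default:            letter.value = 'A'; break;
    }
    return letter;
}

// Default rule from the text: a character is valid if it survives the round trip.
template <typename alph_t>
constexpr bool char_is_valid_by_round_trip(char chr) noexcept
{
    return to_char(assign_char_to(chr, alph_t{})) == chr;
}

static_assert(char_is_valid_by_round_trip<dna4_like>('A'));  // preserved
static_assert(!char_is_valid_by_round_trip<dna4_like>('a')); // comes back as 'A'
} // namespace toy_valid
```

The second assertion shows why an alphabet that wants to treat lower-case input as valid has to supply its own validity check instead of relying on the default.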

Customisation point

This is a customisation point (see Customisation). If you don't want to create your own alphabet, everything below is irrelevant to you!

This object acts as a wrapper and looks for an implementation with the following signature:

constexpr bool tag_invoke(bio::alphabet::custom::char_is_valid_for, char_type const chr, alph_type) noexcept

If no implementation is found, it behaves as specified above.

Implementations are found via ADL and considered only if they are marked noexcept (constexpr is not required, but recommended) and if the returned type is exactly bool.

To specify the behaviour for your own alphabet type, simply provide the above function as a friend or free function.

Note that the value of the alph_type argument is irrelevant, only the type is needed.

Note that if the alphabet type with cvref removed is not std::is_nothrow_default_constructible, this function object will instead look for:

constexpr bool tag_invoke(bio::alphabet::custom::char_is_valid_for, char_type const chr, std::type_identity<alph_type>) noexcept

i.e. the type will be wrapped in std::type_identity so it can still be passed as a tag. In that case the default behaviour defined above does not work, and you are required to provide such an implementation.

◆ size

template<typename alph_t , typename wrap_t = meta::default_initialisable_wrap_t<alph_t>>
constexpr auto bio::alphabet::size
inline constexpr

A type trait that holds the size of a (semi-)alphabet.

Template Parameters
    alph_type  The (semi-)alphabet type being queried.

This is a variable template. Instantiate it with an alphabet type.

It is defined for all (semi-)alphabets in BioC++.

Example

int main()
{
    // calls bio::alphabet::custom::tag_invoke(bio::alphabet::custom::size, char{}); r2 == 256
    auto r2 = bio::alphabet::size<char>;
    // calls bio::alphabet::base's friend tag_invoke() which returns dna5::alphabet_size == 5
    auto r3 = bio::alphabet::size<bio::alphabet::dna5>;
}

Customisation point

This is a customisation point (see Customisation). If you don't want to create your own alphabet, everything below is irrelevant to you!

This object acts as a wrapper and looks for an implementation with the following signature:

consteval size_t tag_invoke(bio::alphabet::custom::size, alph_type) noexcept

Implementations are found via ADL and considered only if they are marked noexcept, if they return a std::integral type and if they can be evaluated at compile-time (consteval is recommended, but constexpr is possible, too).

To specify the behaviour for your own alphabet type, simply provide the above function as a friend or free function.

Note that if the alphabet type with cvref removed is not std::is_nothrow_default_constructible at compile-time, this function object will instead look for:

consteval size_t tag_invoke(bio::alphabet::custom::size, std::type_identity<alph_type>) noexcept

i.e. the type will be wrapped in std::type_identity so it can still be passed as a tag.
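The wrapping trick can be sketched in C++17 as follows. The names toy_size::alphabet_size, toy_size::type_identity (a stand-in for the C++20 std::type_identity) and the two user types are hypothetical. The sketch uses constexpr functions, where the text above recommends consteval, so that it also compiles without C++20 support.

```cpp
#include <cstddef>
#include <type_traits>

namespace toy_size
{
struct size_tag {};

// Stand-in for std::type_identity (C++20): carries a type without constructing it.
template <typename T>
struct type_identity
{
    using type = T;
};

// Pass T itself if it can be default-constructed without throwing, otherwise wrap it.
template <typename T>
using wrap_t = std::conditional_t<std::is_nothrow_default_constructible_v<T>, T, type_identity<T>>;

// The variable template asks an ADL-found tag_invoke() for the size.
template <typename T>
inline constexpr std::size_t alphabet_size = tag_invoke(size_tag{}, wrap_t<T>{});
} // namespace toy_size

namespace toy_size_user
{
// Default-constructible type: receives itself as the tag argument.
struct plain {};
constexpr std::size_t tag_invoke(toy_size::size_tag, plain) noexcept { return 4; }

// Not default-constructible: receives the wrapped type instead.
struct needs_arg
{
    explicit needs_arg(int) {}
};
constexpr std::size_t tag_invoke(toy_size::size_tag, toy_size::type_identity<needs_arg>) noexcept { return 7; }
} // namespace toy_size_user

static_assert(toy_size::alphabet_size<toy_size_user::plain> == 4);
static_assert(toy_size::alphabet_size<toy_size_user::needs_arg> == 7);
```

In both cases only the type of the last argument matters; no object of needs_arg is ever created.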

◆ to_char

constexpr auto bio::alphabet::to_char
inline constexpr

Return the char representation of an alphabet object.

Template Parameters
    alph_type  Type of the argument.
Parameters
    alph  The alphabet object.
Returns
    The char representation.

This is a function object. Invoke it with the parameter(s) specified above.

It is defined for all alphabets in BioC++.

Example

using namespace bio::alphabet::literals;
int main()
{
    auto r2 = bio::alphabet::to_char('A');      // calls bio::alphabet::custom::tag_invoke(bio::alphabet::custom::to_char, 'A'); r2 == 'A'
    auto r3 = bio::alphabet::to_char('A'_dna5); // calls .to_char() member; r3 == 'A'
}

Customisation point

This is a customisation point (see Customisation). If you don't want to create your own alphabet, everything below is irrelevant to you!

This object acts as a wrapper and looks for an implementation with the following signature:

constexpr char_type tag_invoke(bio::alphabet::custom::to_char, alph_type const alph) noexcept

Implementations are found via ADL and considered only if they are marked noexcept (constexpr is not required, but recommended) and if the returned type models bio::meta::builtin_character.

To specify the behaviour for your own alphabet type, simply provide the above function as a friend or free function.

◆ to_rank

constexpr auto bio::alphabet::to_rank
inline constexpr

Return the rank representation of a (semi-)alphabet object.

Template Parameters
    alph_type  Type of the argument.
Parameters
    alph  The (semi-)alphabet object.
Returns
    The rank representation; an integral type.

This is a function object. Invoke it with the parameter(s) specified above.

It is defined for all (semi-)alphabets in BioC++.

Example

using namespace bio::alphabet::literals;
int main()
{
    auto r2 = bio::alphabet::to_rank('A');      // calls bio::alphabet::custom::tag_invoke(bio::alphabet::custom::to_rank, 'A'); r2 == 65
    auto r3 = bio::alphabet::to_rank('A'_dna5); // calls .to_rank() member; r3 == 0
}

Customisation point

This is a customisation point (see Customisation). If you don't want to create your own alphabet, everything below is irrelevant to you!

This object acts as a wrapper and looks for an implementation with the following signature:

constexpr rank_type tag_invoke(bio::alphabet::custom::to_rank, alph_type const alph) noexcept

Implementations are found via ADL and considered only if they are marked noexcept (constexpr is not required, but recommended) and if the returned type models std::integral.

To specify the behaviour for your own alphabet type, simply provide the above function as a friend or free function.