BioC++ core-0.7.0
The Modern C++ libraries for Bioinformatics.
 
Loading...
Searching...
No Matches
How to write your own alphabet

This HowTo documents how to write a custom alphabet that can be used with the algorithms and data structures in BioC++.

DifficultyModerate
Duration45 min
Prerequisite tutorialsC++ Concepts, Alphabets in BioC++
Recommended readingAlphabet

Motivation

In the alphabet tutorial you have learned that alphabets are a core component in BioC++ and represent the smallest unit of biological sequence data. We introduced the common alphabets for nucleotide, amino acid and gap as well as structure and quality annotation. However, in your BioC++ application you may want to implement a custom alphabet in order to work efficiently with BioC++'s algorithms. To achieve this, your custom alphabet must meet certain requirements, which are defined in concepts.

For detailed information on the alphabet concepts please read the Alphabet page. In the following sections we demonstrate how to write an alphabet that models them.

A brief summary of the concepts used in this HowTo:

Step by step: Create your own alphabet

In the alphabet tutorial we have calculated the GC content of a nucleotide sequence. Guanine and Cytosine are complementary nucleobases, which pair in a DNA molecule by building 3 hydrogen bonds. Adenine and Thymine pair with only 2 hydrogen bonds. As a consequence, we denote Guanine and Cytosine as strong (S) and Adenine and Thymine as weak (W) nucleobases. In this section we want to implement a bio::alphabet::alphabet that consists of the characters S and W to represent strong and weak nucleobases.

Let's start with a simple struct that only holds the alphabet's numerical representation, namely the rank value. It is not specified how the value of an alphabet is stored internally, however alphabets typically store the rank as a member variable.

#include <bio/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
};
Core alphabet concept and free function/type trait wrappers.
Note
The type of the member variable is typically the smallest unsigned integer type that can hold the maximum rank. In our case we have only two values so bool would be sufficient. However bool is not actually smaller than uint8_t and does not always behave like an unsigned integral type so we use uint8_t here. You only need a different type if your alphabet has an alphabet size >=255.

If you want BioC++'s algorithms to accept it as an alphabet, you need to make sure that your type satisfies the requirements of the bio::alphabet::alphabet concept. A quick check can reveal that this is not the case:

static_assert(bio::alphabet::alphabet<dna2> == false); // NOT an alphabet
The generic alphabet concept that covers most data types used in ranges.
Definition: concept.hpp:643

Prerequisites

A look at the documentation of bio::alphabet::alphabet will reveal that it is actually a refinement of other concepts, more precisely bio::alphabet::semialphabet which in turn refines std::copy_constructible and std::totally_ordered. Let's check those:

static_assert(std::copy_constructible<dna2>); // ok
static_assert(std::totally_ordered<dna2> == false); // NO comparison operators
static_assert(bio::alphabet::semialphabet<dna2> == false); // NOT a semialphabet
static_assert(bio::alphabet::alphabet<dna2> == false); // NOT an alphabet
The basis for bio::alphabet::alphabet, but requires only rank interface (not char).
Definition: concept.hpp:562

You should see that your type models only the std::copy_constructible concept. Let's have a look at the documentation for std::totally_ordered. Can you guess what it describes?

It describes the requirements for types that are comparable via ==, !=, <, <=, > and >=.

Excercise

Make all these operators available on dna2. HINT: Since C++20, you don't actually need to implement every operator individually.
Solution

#include <bio/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
// Comparison operators
friend auto operator<=>(dna2 const & lhs, dna2 const & rhs) noexcept = default;
};
static_assert(std::totally_ordered<dna2>); // ok

Semialphabet

Let's move on to the more interesting concepts. bio::alphabet::semialphabet constitutes the rank interface that we introduced in the alphabet tutorial. Have a look at the API reference again. Beyond the conceptional requirements, it also requires that bio::alphabet::size and bio::alphabet::to_rank can be called on your alphabet.

Have a look at the respective API reference and also the documentation on customisation points.

You can add member functions for you type, and then make tag_invoke() wrappers, but for simplicity, we will put the functionality directly into the tag_invoke() overloads here:

#include <cassert>
#include <bio/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
/* Comparison operators */
constexpr friend auto operator<=>(dna2 const & lhs, dna2 const & rhs) noexcept = default;
/* Semialphabet */
consteval friend size_t tag_invoke(bio::alphabet::custom::size,
dna2) noexcept
{
return 2;
}
constexpr friend uint8_t tag_invoke(bio::alphabet::custom::to_rank,
dna2 const & d) noexcept
{
return d.rank;
}
uint8_t rk,
dna2 & d) noexcept
{
assert(rk < 2);
d.rank = rk;
return d;
}
};
consteval auto tag_invoke(size, char_type const) noexcept
The number of values the char type can take (e.g. 256 for char).
Definition: char.hpp:43
Customisation tag for bio::alphabet::assign_rank_to.#.
Definition: tag.hpp:29
CPO tag definition for bio::alphabet::size.
Definition: tag.hpp:45
Customisation tag for bio::alphabet::to_rank.
Definition: tag.hpp:25

As you can see from the static_assert, our dna2 alphabet now models bio::alphabet::semialphabet and bio::alphabet::writable_semialphabet:

static_assert(bio::alphabet::semialphabet<dna2>); // ok
A refinement of bio::alphabet::semialphabet that adds assignability.
Definition: concept.hpp:608

You can also try out the customisation points directly (they appear like free functions). Here is an example:

int main ()
{
dna2 chr{};
bio::alphabet::assign_rank_to(1, chr); // chr is assigned rank 1
uint8_t rnk = bio::alphabet::to_rank(chr); // query rank value => 1
}
constexpr auto to_rank
Return the rank representation of a (semi-)alphabet object.
Definition: concept.hpp:70
constexpr auto assign_rank_to
Assign a rank to an alphabet object.
Definition: concept.hpp:138

Alphabet

Now that you have a feeling for concepts, have a look at bio::alphabet::alphabet and bio::alphabet::writable_alphabet and make your type also model these concepts.

Excercise

  1. Implement the tag_invoke() function for "to_char" which returns the char 'S' or 'W' depending on the rank value.
  2. Implement the tag_invoke() function for "assign_char".
  3. There is a function object that tells us which characters are valid for a given alphabet: bio::alphabet::char_is_valid_for. By default, all characters are "valid" that are preserved when being assigned from and then be converted back. But in some cases you want to change the default behaviour, e.g. declaring lower-case letters to be valid, as well.
    Optional task: Implement the tag_invoke() function for "char_is_valid" to also accept 's' and 'w' as valid characters.
Solution
#include <cassert>
#include <bio/alphabet/concept.hpp> // alphabet concept checks
struct dna2
{
uint8_t rank{};
/* Comparison operators */
constexpr friend auto operator<=>(dna2 const & lhs, dna2 const & rhs) noexcept = default;
/* Semialphabet */
consteval friend size_t tag_invoke(bio::alphabet::custom::size,
dna2) noexcept
{
return 2;
}
constexpr friend uint8_t tag_invoke(bio::alphabet::custom::to_rank,
dna2 const & d) noexcept
{
return d.rank;
}
uint8_t rk,
dna2 & d) noexcept
{
assert(rk < 2);
d.rank = rk;
return d;
}
/* Alphabet */
dna2 const & d) noexcept
{
// map 0 => 'S' and 1 => 'W'
return d.rank == 0 ? 'S' : 'W';
}
char ch,
dna2 & d) noexcept
{
switch (ch)
{
case 'W': case 'w':
d.rank = 1; // allow assignment from uppercase and lowercase
break;
default:
d.rank = 0; // unknown characters are mapped to 0 (=> 'S')
}
return d;
}
// Optional: make lower-case letters valid
char ch,
dna2) noexcept
{
switch (ch)
{
case 'W': case 'w': case 'S': case 's': return true;
default: return false;
}
}
};
Customisation tag for bio::alphabet::assign_char_to.
Definition: tag.hpp:37
Customisation tag for bio::alphabet::assign_char_to.
Definition: tag.hpp:41
Customisation tag for bio::alphabet::to_char.
Definition: tag.hpp:33

At this point the bio::alphabet::alphabet concept should be modelled successfully and even bio::alphabet::writable_alphabet is fine because we implemented assign_char and char_is_valid.

static_assert(bio::alphabet::alphabet<dna2>); // ok
Refines bio::alphabet::alphabet and adds assignability.
Definition: concept.hpp:688

Shortcut: alphabet base template

Often it is not required to implement the entire class yourself, instead you can derive from bio::alphabet::base. The base class defines both members and tag_invoke() overloads for you––if you provide certain conversion tables. Read the documentation of bio::alphabet::base for details.

#include <array> // std::array
#include <bio/alphabet/base.hpp> // base
#include <bio/alphabet/concept.hpp> // alphabet concept checks
// derive from base
struct dna2 : public bio::alphabet::base<dna2, 2>
{
private:
// map 0 => 'S' and 1 => 'W'
static constexpr std::array<char_type, alphabet_size> rank_to_char {'S', 'W'};
static constexpr std::array<rank_type, 256> char_to_rank =
// initialise with an immediately evaluated lambda expression:
[] () constexpr
{
std::array<rank_type, 256> ret{}; // initialise all values with 0 (=> 'S')
ret['W'] = 1; // only 'W' and 'w' result in rank 1
ret['w'] = 1;
return ret;
} ();
// make the base class a friend so it can access the tables:
};
// check the concepts
static_assert(bio::alphabet::alphabet<dna2>); // ok
Provides bio::alphabet::base.
A CRTP-base that makes defining a custom alphabet easier.
Definition: base.hpp:55

Further examples

Implementation as enum class

This is an example of a minimal custom alphabet that provides implementations for all necessary customisation points.

As an enum class the values already have an order and therefore the class models std::totally_ordered without defining the (in)equality and comparison operators. Opposed to the examples above, we use free functions to implement the functionality (because we can't add friends to an enum).

#include <cstddef> // for size_t
#include <bio/alphabet/concept.hpp> // for bio::alphabet::alphabet
namespace my_namespace
{
enum class my_alph
{
ZERO,
ONE,
TWO
};
consteval size_t tag_invoke(bio::alphabet::custom::size, my_alph const &) noexcept
{
return 3;
}
constexpr size_t tag_invoke(bio::alphabet::custom::to_rank, my_alph const a) noexcept
{
return static_cast<size_t>(a);
}
constexpr my_alph & tag_invoke(bio::alphabet::custom::assign_rank_to, size_t const r, my_alph & a) noexcept
{
switch (r)
{
case 0:
a = my_alph::ZERO;
return a;
case 1:
a = my_alph::ONE;
return a;
default:
a = my_alph::TWO;
return a;
}
}
constexpr char tag_invoke(bio::alphabet::custom::to_char, my_alph const a) noexcept
{
switch (a)
{
case my_alph::ZERO:
return '0';
case my_alph::ONE:
return '1';
default:
return '2';
}
}
constexpr my_alph & tag_invoke(bio::alphabet::custom::assign_char_to, char const c, my_alph & a) noexcept
{
switch (c)
{
case '0':
a = my_alph::ZERO;
return a;
case '1':
a = my_alph::ONE;
return a;
default:
a = my_alph::TWO;
return a;
}
}
} // namespace my_namespace

Adaptation of a third party type

This example is similar to the previous one, but assuming that you cannot add anything to the namespace of the type that you wish to adapt. In that case, you need to add the tag_invoke() implementations to namespace bio::alphabet::cpo so that they are found by our customisation points.

#include <cstddef> // for size_t
#include <bio/alphabet/concept.hpp> // for bio::alphabet::alphabet
// this is from some other library:
namespace third_party_ns
{
enum class third_party_type
{
ZERO,
ONE,
TWO
};
} // namespace third_party_ns
// ------------------------------------------------------------------------------------
// in this case, overloads for tag_invoke are defined within bio::*::custom
{
consteval size_t tag_invoke(size, third_party_ns::third_party_type) noexcept
{
return 3;
}
constexpr size_t tag_invoke(to_rank, third_party_ns::third_party_type const a) noexcept
{
return static_cast<size_t>(a);
}
constexpr third_party_ns::third_party_type & tag_invoke(assign_rank_to,
size_t const r,
third_party_ns::third_party_type & a) noexcept
{
switch (r)
{
case 0:
a = third_party_ns::third_party_type::ZERO;
return a;
case 1:
a = third_party_ns::third_party_type::ONE;
return a;
default:
a = third_party_ns::third_party_type::TWO;
return a;
}
}
constexpr char tag_invoke(to_char, third_party_ns::third_party_type const a) noexcept
{
switch (a)
{
case third_party_ns::third_party_type::ZERO:
return '0';
case third_party_ns::third_party_type::ONE:
return '1';
default:
return '2';
}
}
constexpr third_party_ns::third_party_type & tag_invoke(assign_char_to,
char const c,
third_party_ns::third_party_type & a) noexcept
{
switch (c)
{
case '0':
a = third_party_ns::third_party_type::ZERO;
return a;
case '1':
a = third_party_ns::third_party_type::ONE;
return a;
default:
a = third_party_ns::third_party_type::TWO;
return a;
}
}
} // namespace bio::alphabet::custom
A namespace for third party and standard library specialisations of BioC++ customisation points.
Definition: concept.hpp:43

Implementation of a non-default-constructible class

This is an example of a custom alphabet that is not default-constructible and that has a non-default overload for bio::alphabet::char_is_valid_for.

Please note that for the overloads of bio::alphabet::size and bio::alphabet::char_is_valid_for our alphabet type has to be wrapped into std::type_identity<> to be recognised by the customisation point objects, because the type does not model std::is_nothrow_default_constructible after we have deleted the default constructor.

With the overload of bio::alphabet::char_is_valid_for, we allow assignment to the underlying rank type from '1', 't' and 'T' for value 1 as well as from '0', 'f' and 'F' for value 0.

#include <cstddef> // for size_t
#include <type_traits> // for std::type_identity
#include <bio/alphabet/concept.hpp> // for bio::alphabet::alphabet
namespace my_namespace
{
class my_alph
{
public:
uint8_t rank;
my_alph() = delete;
constexpr my_alph(my_alph const &) = default;
constexpr my_alph & operator=(my_alph const &) = default;
constexpr my_alph(uint8_t rank) : rank{rank} {}
constexpr friend auto operator<=>(my_alph lhs, my_alph rhs) = default;
consteval friend size_t tag_invoke(bio::alphabet::custom::size, std::type_identity<my_alph>) noexcept { return 2; }
constexpr friend uint8_t tag_invoke(bio::alphabet::custom::to_rank, my_alph const a) noexcept { return a.rank; }
constexpr friend my_alph & tag_invoke(bio::alphabet::custom::assign_rank_to, uint8_t const r, my_alph & a) noexcept
{
a.rank = r;
return a;
}
constexpr friend char tag_invoke(bio::alphabet::custom::to_char, my_alph const a) noexcept
{
if (a.rank)
return '1';
else
return '0';
}
constexpr friend my_alph & tag_invoke(bio::alphabet::custom::assign_char_to, char const c, my_alph & a) noexcept
{
switch (c)
{
case '0':
case 'F':
case 'f':
a.rank = 0;
return a;
default:
a.rank = 1;
return a;
}
}
char const c,
{
switch (c)
{
case '0':
case 'F':
case 'f':
case '1':
case 'T':
case 't':
return true;
default:
return false;
}
}
};
} // namespace my_namespace
static_assert(bio::alphabet::size<my_namespace::my_alph> == 2);
static_assert(bio::alphabet::char_is_valid_for<my_namespace::my_alph>('T'));
static_assert(!bio::alphabet::char_is_valid_for<my_namespace::my_alph>('!'));
Note
You should really make your alphabet types no-throw-default-constructible if you can!