This HowTo documents how to write a custom alphabet that can be used with the algorithms and data structures in BioC++.
Motivation
In the alphabet tutorial you have learned that alphabets are a core component in BioC++ and represent the smallest unit of biological sequence data. We introduced the common alphabets for nucleotide, amino acid and gap as well as structure and quality annotation. However, in your BioC++ application you may want to implement a custom alphabet in order to work efficiently with BioC++'s algorithms. To achieve this, your custom alphabet must meet certain requirements, which are defined in concepts.
For detailed information on the alphabet concepts please read the Alphabet page. In the following sections we demonstrate how to write an alphabet that models them.
A brief summary of the concepts used in this HowTo:
Step by step: Create your own alphabet
In the alphabet tutorial we have calculated the GC content of a nucleotide sequence. Guanine and Cytosine are complementary nucleobases, which pair in a DNA molecule by building 3 hydrogen bonds. Adenine and Thymine pair with only 2 hydrogen bonds. As a consequence, we denote Guanine and Cytosine as strong (S) and Adenine and Thymine as weak (W) nucleobases. In this section we want to implement a bio::alphabet::alphabet that consists of the characters S
and W
to represent strong and weak nucleobases.
Let's start with a simple struct that only holds the alphabet's numerical representation, namely the rank value. It is not specified how the value of an alphabet is stored internally, however alphabets typically store the rank as a member variable.
struct dna2
{
uint8_t rank{};
};
Core alphabet concept and free function/type trait wrappers.
- Note
- The type of the member variable is typically the smallest unsigned integer type that can hold the maximum rank. In our case we have only two values so
bool
would be sufficient. However bool
is not actually smaller than uint8_t
and does not always behave like an unsigned integral type so we use uint8_t
here. You only need a different type if your alphabet has an alphabet size >=255.
If you want BioC++'s algorithms to accept it as an alphabet, you need to make sure that your type satisfies the requirements of the bio::alphabet::alphabet concept. A quick check can reveal that this is not the case:
The generic alphabet concept that covers most data types used in ranges.
Definition: concept.hpp:643
Prerequisites
A look at the documentation of bio::alphabet::alphabet will reveal that it is actually a refinement of other concepts, more precisely bio::alphabet::semialphabet which in turn refines std::copy_constructible and std::totally_ordered. Let's check those:
static_assert(std::copy_constructible<dna2>);
static_assert(std::totally_ordered<dna2> == false);
The basis for bio::alphabet::alphabet, but requires only rank interface (not char).
Definition: concept.hpp:562
You should see that your type models only the std::copy_constructible concept. Let's have a look at the documentation for std::totally_ordered. Can you guess what it describes?
It describes the requirements for types that are comparable via ==
, !=
, <
, <=
, >
and >=
.
Excercise
Make all these operators available on dna2
. HINT: Since C++20, you don't actually need to implement every operator individually.
Solution
struct dna2
{
uint8_t rank{};
friend auto operator<=>(dna2 const & lhs, dna2 const & rhs) noexcept = default;
};
static_assert(std::totally_ordered<dna2>);
Semialphabet
Let's move on to the more interesting concepts. bio::alphabet::semialphabet constitutes the rank interface that we introduced in the alphabet tutorial. Have a look at the API reference again. Beyond the conceptional requirements, it also requires that bio::alphabet::size and bio::alphabet::to_rank can be called on your alphabet.
Have a look at the respective API reference and also the documentation on customisation points.
You can add member functions for you type, and then make tag_invoke()
wrappers, but for simplicity, we will put the functionality directly into the tag_invoke()
overloads here:
struct dna2
{
uint8_t rank{};
constexpr friend auto operator<=>(dna2 const & lhs, dna2 const & rhs) noexcept = default;
dna2) noexcept
{
return 2;
}
dna2 const & d) noexcept
{
return d.rank;
}
uint8_t rk,
dna2 & d) noexcept
{
assert(rk < 2);
d.rank = rk;
return d;
}
};
consteval auto tag_invoke(size, char_type const) noexcept
The number of values the char type can take (e.g. 256 for char).
Definition: char.hpp:43
Customisation tag for bio::alphabet::assign_rank_to.#.
Definition: tag.hpp:29
CPO tag definition for bio::alphabet::size.
Definition: tag.hpp:45
Customisation tag for bio::alphabet::to_rank.
Definition: tag.hpp:25
As you can see from the static_assert
, our dna2 alphabet now models bio::alphabet::semialphabet and bio::alphabet::writable_semialphabet:
A refinement of bio::alphabet::semialphabet that adds assignability.
Definition: concept.hpp:608
You can also try out the customisation points directly (they appear like free functions). Here is an example:
int main ()
{
dna2 chr{};
}
constexpr auto to_rank
Return the rank representation of a (semi-)alphabet object.
Definition: concept.hpp:70
constexpr auto assign_rank_to
Assign a rank to an alphabet object.
Definition: concept.hpp:138
Alphabet
Now that you have a feeling for concepts, have a look at bio::alphabet::alphabet and bio::alphabet::writable_alphabet and make your type also model these concepts.
Excercise
- Implement the tag_invoke() function for "to_char" which returns the char 'S' or 'W' depending on the rank value.
- Implement the tag_invoke() function for "assign_char".
- There is a function object that tells us which characters are valid for a given alphabet: bio::alphabet::char_is_valid_for. By default, all characters are "valid" that are preserved when being assigned from and then be converted back. But in some cases you want to change the default behaviour, e.g. declaring lower-case letters to be valid, as well.
Optional task: Implement the tag_invoke() function for "char_is_valid" to also accept 's' and 'w' as valid characters.
Solution
struct dna2
{
uint8_t rank{};
constexpr friend auto operator<=>(dna2 const & lhs, dna2 const & rhs) noexcept = default;
dna2) noexcept
{
return 2;
}
dna2 const & d) noexcept
{
return d.rank;
}
uint8_t rk,
dna2 & d) noexcept
{
assert(rk < 2);
d.rank = rk;
return d;
}
dna2 const & d) noexcept
{
return d.rank == 0 ? 'S' : 'W';
}
char ch,
dna2 & d) noexcept
{
switch (ch)
{
case 'W': case 'w':
d.rank = 1;
break;
default:
d.rank = 0;
}
return d;
}
char ch,
dna2) noexcept
{
switch (ch)
{
case 'W': case 'w': case 'S': case 's': return true;
default: return false;
}
}
};
Customisation tag for bio::alphabet::assign_char_to.
Definition: tag.hpp:37
Customisation tag for bio::alphabet::assign_char_to.
Definition: tag.hpp:41
Customisation tag for bio::alphabet::to_char.
Definition: tag.hpp:33
At this point the bio::alphabet::alphabet concept should be modelled successfully and even bio::alphabet::writable_alphabet is fine because we implemented assign_char
and char_is_valid
.
Refines bio::alphabet::alphabet and adds assignability.
Definition: concept.hpp:688
Shortcut: alphabet base template
Often it is not required to implement the entire class yourself, instead you can derive from bio::alphabet::base. The base class defines both members and tag_invoke() overloads for you––if you provide certain conversion tables. Read the documentation of bio::alphabet::base for details.
{
private:
[] () constexpr
{
ret['W'] = 1;
ret['w'] = 1;
return ret;
} ();
};
Provides bio::alphabet::base.
A CRTP-base that makes defining a custom alphabet easier.
Definition: base.hpp:55
Further examples
Implementation as enum class
This is an example of a minimal custom alphabet that provides implementations for all necessary customisation points.
As an enum class the values already have an order and therefore the class models std::totally_ordered without defining the (in)equality and comparison operators. Opposed to the examples above, we use free functions to implement the functionality (because we can't add friends to an enum).
namespace my_namespace
{
enum class my_alph
{
ZERO,
ONE,
TWO
};
{
return 3;
}
{
return static_cast<size_t>(a);
}
{
switch (r)
{
case 0:
a = my_alph::ZERO;
return a;
case 1:
a = my_alph::ONE;
return a;
default:
a = my_alph::TWO;
return a;
}
}
{
switch (a)
{
case my_alph::ZERO:
return '0';
case my_alph::ONE:
return '1';
default:
return '2';
}
}
{
switch (c)
{
case '0':
a = my_alph::ZERO;
return a;
case '1':
a = my_alph::ONE;
return a;
default:
a = my_alph::TWO;
return a;
}
}
}
Adaptation of a third party type
This example is similar to the previous one, but assuming that you cannot add anything to the namespace of the type that you wish to adapt. In that case, you need to add the tag_invoke() implementations to namespace bio::alphabet::cpo so that they are found by our customisation points.
namespace third_party_ns
{
enum class third_party_type
{
ZERO,
ONE,
TWO
};
}
{
consteval size_t tag_invoke(size, third_party_ns::third_party_type)
noexcept
{
return 3;
}
constexpr size_t tag_invoke(to_rank, third_party_ns::third_party_type
const a)
noexcept
{
return static_cast<size_t>(a);
}
constexpr third_party_ns::third_party_type &
tag_invoke(assign_rank_to,
size_t const r,
third_party_ns::third_party_type & a) noexcept
{
switch (r)
{
case 0:
a = third_party_ns::third_party_type::ZERO;
return a;
case 1:
a = third_party_ns::third_party_type::ONE;
return a;
default:
a = third_party_ns::third_party_type::TWO;
return a;
}
}
constexpr char tag_invoke(to_char, third_party_ns::third_party_type
const a)
noexcept
{
switch (a)
{
case third_party_ns::third_party_type::ZERO:
return '0';
case third_party_ns::third_party_type::ONE:
return '1';
default:
return '2';
}
}
constexpr third_party_ns::third_party_type &
tag_invoke(assign_char_to,
char const c,
third_party_ns::third_party_type & a) noexcept
{
switch (c)
{
case '0':
a = third_party_ns::third_party_type::ZERO;
return a;
case '1':
a = third_party_ns::third_party_type::ONE;
return a;
default:
a = third_party_ns::third_party_type::TWO;
return a;
}
}
}
A namespace for third party and standard library specialisations of BioC++ customisation points.
Definition: concept.hpp:43
Implementation of a non-default-constructible class
This is an example of a custom alphabet that is not default-constructible and that has a non-default overload for bio::alphabet::char_is_valid_for.
Please note that for the overloads of bio::alphabet::size and bio::alphabet::char_is_valid_for our alphabet type has to be wrapped into std::type_identity<>
to be recognised by the customisation point objects, because the type does not model std::is_nothrow_default_constructible after we have deleted the default constructor.
With the overload of bio::alphabet::char_is_valid_for, we allow assignment to the underlying rank type from '1', 't' and 'T' for value 1 as well as from '0', 'f' and 'F' for value 0.
namespace my_namespace
{
class my_alph
{
public:
uint8_t rank;
my_alph() = delete;
constexpr my_alph(my_alph const &) = default;
constexpr my_alph & operator=(my_alph const &) = default;
constexpr my_alph(uint8_t rank) : rank{rank} {}
constexpr friend auto operator<=>(my_alph lhs, my_alph rhs) = default;
{
a.rank = r;
return a;
}
{
if (a.rank)
return '1';
else
return '0';
}
{
switch (c)
{
case '0':
case 'F':
case 'f':
a.rank = 0;
return a;
default:
a.rank = 1;
return a;
}
}
char const c,
{
switch (c)
{
case '0':
case 'F':
case 'f':
case '1':
case 'T':
case 't':
return true;
default:
return false;
}
}
};
}
static_assert(bio::alphabet::size<my_namespace::my_alph> == 2);
static_assert(bio::alphabet::char_is_valid_for<my_namespace::my_alph>('T'));
static_assert(!bio::alphabet::char_is_valid_for<my_namespace::my_alph>('!'));
- Note
- You should really make your alphabet types no-throw-default-constructible if you can!