Let's document our regular expressions!

In bioinformatics we use quite a lot of regular expressions to find known motifs in biological sequences (proteins or DNA/RNA sequences). We don't have a common repository for these motifs as regular expressions, no single source of truth. Some families of motifs have dedicated databases; others only have decades-old papers describing the specificity in words. Some are not even represented by regular expressions but rather by hidden Markov models (HMM) and position-specific scoring matrices (PSSM).

No one knows who came up with them, and no one knows where they come from. I wouldn't be surprised to learn that most regular expressions are passed from one bioinformatician to another like an oral tradition. I could find a regexp in a script on GitHub without knowing how the motif was found or whom to cite when using it in a piece of research.

Another thing we do not track is the version of a motif (surely, as research moves on, motifs need to be re-evaluated and changed) or which regular expression engines it works with. Some regexps were written for engines with special features (like look-ahead and look-behind 👀) that other engines do not support. For my work on protein cleavage sites, these features make my life 10x easier, but I cannot always use them and have to fall back on different, more basic regular expressions.
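To make the engine problem concrete, here is a small Python sketch. The rule it uses (cut after F, Y or W unless the next residue is P) is a simplified assumption for illustration, not the exact expressions I use, and the function names are made up for this example. The first version relies on look-behind and look-ahead; the second is the kind of fallback you end up writing for engines without them, where part of the logic has to move out of the regexp and into code.

```python
import re

# Assumed, simplified rule for illustration only:
# cut after F, Y or W unless the next residue is P.

# Version 1: zero-width match using look-behind and look-ahead.
LOOKAROUND = re.compile(r"(?<=[FYW])(?!P)")

def cut_sites_lookaround(seq):
    """Return cut positions (index of the residue right after the cut)."""
    # Ignore a match at the very end of the sequence: nothing left to cut off.
    return [m.start() for m in LOOKAROUND.finditer(seq) if m.start() < len(seq)]

# Version 2: fallback for engines without lookaround.
# The regexp only finds the residue; the "not before P" check is done in code.
BASIC = re.compile(r"[FYW]")

def cut_sites_basic(seq):
    """Same result as cut_sites_lookaround, without lookaround features."""
    sites = []
    for m in BASIC.finditer(seq):
        end = m.end()
        if end < len(seq) and seq[end] != "P":
            sites.append(end)
    return sites

peptide = "AFPKWLYFA"
print(cut_sites_lookaround(peptide))
print(cut_sites_basic(peptide))
```

Note that a naive fallback such as `[FYW][^P]` would consume the following residue and miss adjacent sites (the cut between Y and F in the example), which is exactly the kind of detail that gets lost when a regexp travels without documentation.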

I recently learned about the YARA rules used by infosec people. Whole files contain collections of rules (including string searches with regexps) to detect and classify malware. Each rule comes with a set of metadata.

I tried to make this YARA rule for the specificity motif of chymotrypsin, an enzyme that digests proteins. YARA rules themselves are not suitable for the kind of motif search we usually do in bioinformatics, but I think it would be neat to have some kind of text format inspired by them.

YARA rule describing a chymotrypsin cleavage specificity motif. The rule is named, and the metadata section contains a reference to Expasy, where I took the description of the motifs to make 4 regular expressions.

My go at using the YARA format to describe the regular expression for the specificity motif of chymotrypsin.
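For readers who prefer code to screenshots, the same idea can be sketched as a small Python structure. This is not the YARA rule above and not a proposed standard: the class names, the field names (`description`, `source`, `version`, `requires`) and the simplified chymotrypsin patterns are all assumptions made up for illustration. The point is only that each pattern travels with its metadata and with the engine features it needs.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    regex: str
    requires: tuple = ()  # engine features needed, e.g. ("lookaround",)

@dataclass(frozen=True)
class MotifRule:
    name: str
    meta: dict       # free-form metadata, in the spirit of a YARA meta section
    patterns: tuple  # one or more Pattern objects

    def usable(self, engine_features):
        """Return the patterns whose required features the engine offers."""
        available = set(engine_features)
        return [p for p in self.patterns if set(p.requires) <= available]

# Hypothetical record; the values are placeholders, not curated data.
rule = MotifRule(
    name="chymotrypsin_simplified",
    meta={
        "description": "Cuts after F, Y or W unless followed by P (simplified)",
        "source": "placeholder: cite the database or paper here",
        "version": "0.1",
    },
    patterns=(
        Pattern(r"(?<=[FYW])(?!P)", requires=("lookaround",)),
        Pattern(r"[FYW]"),  # fallback: the next-residue check happens in code
    ),
)

# An engine without lookaround only gets the basic pattern.
for p in rule.usable([]):
    re.compile(p.regex)  # sanity check that the pattern compiles
    print(rule.name, rule.meta["version"], p.regex)
```

With something like this, a script found in the wild would at least tell you where the motif came from, which version it is, and which engines can run it.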

PS: please don't be weird, don't be an XML file with 200 lines of ontology and namespaces for just a few small regular expressions.