A Modular NFA Architecture for Regular Expression Matching

Hao Wang,  Shi Pu,  Gabe Knezek,  Jyh-Charn Liu
TAMU


Abstract

We propose a non-deterministic finite automata (NFA) based architecture for regexp scanners on FPGA, called CES: the Character Class with Constraint Repetition (CCR) based regExp Scanner. CES is designed to realize a new MIN-MAX counting algorithm, which can solve both the character class ambiguity problem and the overlapped matching problem. CES also supports non-regular Perl grammars such as zero-width pattern and back-reference. We propose a CCR-syntax tree and its parsing scheme to map a Perl or POSIX regexp rule to a CES topology. The interconnection patterns, and operational parameters of CCR modules (CCRM), which are the building blocks of CES, can be easily configured by regular memory writes when regexp rules change, without re-synthesis of low-level logic For implementation, character classes of CCRs are stored in Block RAMs. The MIN-MAX algorithm uses two counters MIN and MAX to resolve the character class ambiguity problem. Two checkpoint counters are employed to implement overlapped matching detection. CES topologies optimized for different types of rules can run in different Partial Reconfigurable Regions (PRR), and can be swapped on the fly by a PRR controller. We developed a tool chain to automate (1) parsing of Perl and POSIX regexps to CCR representations, (2) mapping of CCRMs to a CES, and (3) downloading the CES implementation to a Virtex 5 LX110T device. This device can host up to 3000 CCRMs (approximately 200 Snort regexp rules depending on the lengths of rules), and run at an estimated throughput of 1.996 Gbps in simulation, and 863 Mbps between a PC and the V5 board in real tests. The Snort and SpamAssassin rule sets can be parsed and mapped in milliseconds. Once a base CES architecture is synthesized, it need not be re-synthesized to implement different rules. As a result, the physical reconfiguration of a CES on the V5 LX110T chip can be done in less than a second.