Preliminary Program

Design space exploration of throughput-optimized arrays from recurrence abstractions

Arpith Jacob, Jeremy Buhler, Roger Chamberlain
Washington University in St. Louis

Abstract

Many compute-bound software applications have seen order-of-magnitude speedups using application-specific accelerators built on specialized architectures such as field-programmable gate arrays. These architectures are particularly good at implementing systems of recurrence equations realized as systolic arrays. To efficiently find good realizations of an algorithm for a given hardware platform, we pursue high-level tools that can search the space of possible parallel array designs to optimize various design criteria. Most existing approaches produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences. For these applications, overall throughput, rather than latency per input, is the most important measure of performance.
In this work we introduce a new design space exploration procedure to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. We show that the throughput of an array is dependent on the maximum number of lattice points executed by any processor in the array, which is determined solely by the array's projection vector. We describe a bounded search process to find throughput-optimized projection vectors, discovering a range of array designs that are optimal for inputs of various sizes.
We have written a tool in C++ that accepts recurrence descriptions and produces throughput-optimized array mappings. We have applied our technique to the banded Smith-Waterman and Nussinov RNA folding algorithms, and present novel arrays that are 4-33x and 2-14x faster, respectively, than the currently used latency optimized array. We recount our recently published results that accelerate the folding of 22.6 million sequences to identify microRNAs. We used runtime reconfiguration on a Xilinx Virtex-4-LX100-12 FPGA to switch among a library of throughput-optimized Nussinov designs, achieving a 48% speedup over the single, latency-space optimal array.