This paper compares the speed, area, and latency of shift-and-add arithmetic implemented in fine-grained FPGA resources and in a proposed coarse-grained embedded block for FPGAs. It first optimizes the mapping of various shift-and-add architectures onto the fine-grained resources of a commercial FPGA to determine which provides the best area, delay, and latency for various word lengths. It then proposes a new coarse-grained block that supports 16-, 32-, and 64-bit shift-and-add arithmetic, and finally compares the coarse-grained implementations with the best fine-grained implementations. Our results show that the coarse-grained implementations are 15 to 47 times smaller and 5 to 18 times faster, depending on the implementation.