Vision Transformers (ViTs) have emerged as a promising solution to enable efficient 3D Human Mesh Recovery (HMR) in augmented and virtual reality (AR/VR) applications. Despite many advancements in algorithm design, it remains a challenge to efficiently accelerate ViT-based HMR due to high computational complexity, substantial memory footprint, and compromised data locality. In this paper, we propose VITA, a hardware and algorithm co-design framework for ViT-based HMR with improved performance and energy efficiency. Specifically, on the algorithm side, we propose an average pooling model to replace conventional multi-head attention, which is further optimized with improved data locality. On the hardware side, we propose an accelerator architecture that can efficiently support various dataflows and computations demanded by pooling, normalization, and convolution operations. We evaluate the proposed VITA, and the evaluation result shows that the proposed VITA design can achieve 5.05x and 69.12x speedups on average over the state-of-the-art GPUs and CPUs on HMR tasks.
@inproceedings{tian2024vita,
title={VITA: ViT Acceleration for Efficient 3D Human Mesh Recovery via Hardware-Algorithm Co-Design},
author={Tian, Shilin and Szafranski, Chase and Zheng, Ce and Yao, Fan and Louri, Ahmed and Chen, Chen and Zheng, Hao},
booktitle={61st ACM/IEEE Design Automation Conference (DAC)},
year={2024},
organization={ACM/IEEE}
}