VITA Project

VITA: ViT Acceleration for Efficient 3D Human Mesh Recovery via Hardware-Algorithm Co-Design

61st ACM/IEEE Design Automation Conference (DAC), San Francisco, 2024
*University of Central Florida
⁺The George Washington University

Abstract

Vision Transformers (ViTs) have emerged as a promising solution for efficient 3D Human Mesh Recovery (HMR) in augmented and virtual reality (AR/VR) applications. Despite many advances in algorithm design, accelerating ViT-based HMR efficiently remains a challenge because of its high computational complexity, substantial memory footprint, and compromised data locality. In this paper, we propose VITA, a hardware and algorithm co-design framework for ViT-based HMR with improved performance and energy efficiency. On the algorithm side, we propose an average pooling model that replaces conventional multi-head attention and is further optimized for data locality. On the hardware side, we propose an accelerator architecture that efficiently supports the various dataflows and computations demanded by pooling, normalization, and convolution operations. Our evaluation shows that VITA achieves average speedups of 5.05x over state-of-the-art GPUs and 69.12x over state-of-the-art CPUs on HMR tasks.
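To make the algorithm-side idea concrete, the sketch below shows one way average pooling can stand in for attention as a token mixer: each token is replaced by the mean of its spatial neighbours, so the cost grows linearly with the number of tokens for a fixed window and no query, key, or value projections are needed. This is only an illustration, not the VITA model itself. The function name `avg_pool_token_mixer`, the 3x3 window, the stride of 1, and the border handling are all assumptions made for this example.

```python
from typing import List

# Token grid laid out as [height][width][channels].
Grid = List[List[List[float]]]


def avg_pool_token_mixer(tokens: Grid, window: int = 3) -> Grid:
    """Mix tokens by averaging each token with its spatial neighbours.

    Assumptions (not taken from the paper): a square window, stride 1,
    and border tokens averaged over in-bounds neighbours only.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    height, width = len(tokens), len(tokens[0])
    channels = len(tokens[0][0])
    radius = window // 2
    mixed: Grid = []
    for row in range(height):
        out_row = []
        for col in range(width):
            sums = [0.0] * channels
            count = 0
            # Visit only the neighbours that fall inside the grid.
            for r in range(max(0, row - radius), min(height, row + radius + 1)):
                for c in range(max(0, col - radius), min(width, col + radius + 1)):
                    for ch in range(channels):
                        sums[ch] += tokens[r][c][ch]
                    count += 1
            out_row.append([s / count for s in sums])
        mixed.append(out_row)
    return mixed


# Usage: a 2x2 grid of single-channel tokens. With a 3x3 window every
# neighbourhood covers the whole grid, so each output is the global mean.
grid = [[[1.0], [2.0]], [[3.0], [4.0]]]
result = avg_pool_token_mixer(grid)
```

Because every output depends only on a small, contiguous neighbourhood, the memory accesses are regular and local, which is the kind of property that a pooling-oriented dataflow can exploit.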

BibTeX

@inproceedings{tian2024vita,
  title={VITA: ViT Acceleration for Efficient 3D Human Mesh Recovery via Hardware-Algorithm Co-Design},
  author={Tian, Shilin and Szafranski, Chase and Zheng, Ce and Yao, Fan and Louri, Ahmed and Chen, Chen and Zheng, Hao},
  booktitle={61st ACM/IEEE Design Automation Conference (DAC)},
  year={2024},
  organization={ACM/IEEE}
}

This is the original, unprocessed video, captured in the wild and featuring a single dancer. No post-production edits or enhancements were applied, so the raw movements of the performance are preserved.

This video presents the mesh visualization generated by the VITA model using the High-Resolution (HR) stream. Compared to the baseline model, VITA significantly enhances the quality of the reconstructed mesh, particularly in the fine details along the character’s fringe lines.

This video shows an intermediate step in mesh visualization, where the model isolates the character with a bounding box and marks key anatomical points with red circles.
