Meta Releases Sapiens2: Vision Transformers Pretrained on 1B Human Images
Meta has released Sapiens2 on HuggingFace: a suite of high-resolution vision transformers pretrained on 1 billion human images. The models support four human-centric perception tasks: pose estimation, body segmentation, depth and surface-normal estimation, and point maps. That pretraining scale makes Sapiens2 one of the largest publicly released human-centric vision pretraining efforts to date.
Why It Matters
Human-centric perception at this scale has direct applications in avatar generation, motion capture, AR/VR, robotics, and accessibility tools. The open-weights release on HuggingFace makes Sapiens2 immediately usable by researchers and developers, lowering the barrier to building on one of the highest-quality human-perception baselines available.