Point Transformer V3
Point Transformer V3: Simpler, Faster, Stronger
Abstract
This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.
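The "serialized neighbor mapping" mentioned in the abstract can be illustrated with a short sketch: points are voxelized, encoded onto a space-filling curve (a plain Morton/z-order code here; PTv3 also uses Hilbert and transposed orderings), sorted by that code, and split into fixed-size patches within which attention operates. The function names, voxel size, and patch size below are illustrative defaults, not the Pointcept implementation.

```python
import numpy as np

def morton_code(grid_xyz: np.ndarray, bits: int = 16) -> np.ndarray:
    """Interleave the bits of non-negative integer grid coords into a z-order key."""
    code = np.zeros(len(grid_xyz), dtype=np.uint64)
    g = grid_xyz.astype(np.uint64)
    for b in range(bits):
        for axis in range(3):
            code |= ((g[:, axis] >> np.uint64(b)) & np.uint64(1)) << np.uint64(3 * b + axis)
    return code

def serialize_into_patches(points: np.ndarray, voxel_size: float = 0.05,
                           patch_size: int = 1024):
    """Serialize points along a z-order curve and split them into fixed-size patches."""
    grid = np.floor(points / voxel_size).astype(np.int64)
    grid -= grid.min(axis=0)                         # shift to non-negative grid coords
    order = np.argsort(morton_code(grid))            # serialization order along the curve
    patch_id = np.empty(len(points), dtype=np.int64)
    patch_id[order] = np.arange(len(points)) // patch_size  # consecutive points share a patch
    return order, patch_id

# Points with the same patch_id are "serialized neighbors": attention runs only
# within a patch, which is how the receptive field can grow from 16 to 1024 points
# without a KNN search.
pts = np.random.rand(8192, 3).astype(np.float32)
order, patch_id = serialize_into_patches(pts)
```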
Comprehensive comparison of the PTv3 family and competing models (as of 2025)
| Model | Venue | Type | Params | ScanNet mIoU | NuScenes mIoU | Speed / Efficiency | Key innovation |
|---|---|---|---|---|---|---|---|
| PTv3 | CVPR'24 | Backbone | 46.1M | 77.5 (scratch) / 79.4 (pretrain) | 80.4 | Baseline | Serialize-and-patch, xCPE |
| PTv3 Extreme | 2024 | Competition entry | 46.1M+ | — | Waymo 1st place | — | TTA, ensembling, data augmentation |
| Sonata | CVPR'25 | 3D SSL pretraining | ~46M (Base) | 79.4+ (fine-tune) | — | Same as PTv3 | Encoder-only self-distillation |
| Concerto | NeurIPS'25 | 2D-3D SSL | 39M / 108M / 208M | 80.7 | — | Same as PTv3 | 2D-3D cross-modal joint embedding |
| Flash3D | CVPR'25 | Backbone | ~PTv3 | Above PTv3 | Above PTv3 | 2.25x faster, 2.4x less memory | PSH + FlashAttention integration |
| LitePT | 2025.12 | Backbone | 12.7M – 86M | On par with PTv3 | 82.2 (S) | 3.6x fewer params, 2x faster | Conv-Attn hybrid, Point-ROPE |
Summary
Accuracy SOTA
Concerto (NeurIPS'25) is the current best with 80.7% mIoU on ScanNet. It keeps PTv3 as the backbone but raises the quality of the representation itself through 2D-3D joint self-supervised learning.
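As a rough illustration of the 2D-3D joint-embedding idea (this is not Concerto's actual objective; every name and value below is hypothetical), a symmetric InfoNCE loss can align image features lifted onto points with the 3D backbone's features for the same points:

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy with a numerically stable log-softmax over rows."""
    m = logits.max(axis=1, keepdims=True)
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(targets)), targets].mean())

def cross_modal_info_nce(feat_3d: np.ndarray, feat_2d: np.ndarray,
                         temperature: float = 0.07) -> float:
    """Symmetric InfoNCE between row-aligned 3D point features and lifted 2D features."""
    a = feat_3d / np.linalg.norm(feat_3d, axis=1, keepdims=True)
    b = feat_2d / np.linalg.norm(feat_2d, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N); matched pairs sit on the diagonal
    targets = np.arange(len(a))
    return 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))

# Toy usage: N points with D-dim features from a 3D backbone (e.g. PTv3) and
# D-dim image features projected onto the same points.
feat_3d = np.random.randn(256, 64)
feat_2d = feat_3d + 0.1 * np.random.randn(256, 64)   # correlated "2D" features
loss = cross_modal_info_nce(feat_3d, feat_2d)
```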
Efficiency SOTA
LitePT cuts parameters by 3.6x and runs 2x faster than PTv3, while performing even better outdoors (NuScenes). Flash3D also delivers strong efficiency through GPU-architecture-level optimization.
Practical selection criteria
- Need real-time processing → LitePT or Flash3D (efficiency advantage)
- Need the highest accuracy → Concerto-pretrained PTv3 (accuracy advantage)
- General-purpose backbone → PTv3 (the most thoroughly validated ecosystem, Pointcept codebase)
Documentation
- [2312.10035] Point Transformer V3: Simpler, Faster, Stronger (https://arxiv.org/abs/2312.10035)
See also
- PointNet
- Computer Vision
- Point Cloud
- Deep Learning
- Transformer
- 3D Perception
- Flash3D Transformer - CVPR 2025
- LitePT
- PTv3 Extreme - 1st place, Waymo Challenge 2024
- Sonata - CVPR 2025 Highlight
- Concerto - NeurIPS 2025
- Pointcept - a point cloud perception research framework with unified support for PTv3, Sonata, Concerto, and more