
Point Transformer V3

Point Transformer V3: Simpler, Faster, Stronger

Abstract

This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.
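The "serialized neighbor mapping" the abstract refers to can be illustrated with a space-filling-curve ordering: quantize each point to a voxel grid, encode the voxel coordinates as a single code, and sort. Points that are adjacent in the sorted 1D sequence are then approximately adjacent in 3D, which replaces an explicit KNN search. A minimal sketch using a Z-order (Morton) code — the function name and parameters are illustrative, and PTv3 itself supports several curves, including Hilbert:

```python
import numpy as np

def morton_code_3d(coords, grid_size=0.05, bits=10):
    """Quantize 3D coordinates to a voxel grid and interleave the bits
    of (x, y, z) into one Morton (Z-order) code per point."""
    # Shift to non-negative voxel indices.
    grid = np.floor((coords - coords.min(axis=0)) / grid_size).astype(np.uint64)
    grid = np.clip(grid, 0, (1 << bits) - 1)
    codes = np.zeros(len(coords), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            # Bit b of each axis goes to position 3*b + axis of the code.
            codes |= ((grid[:, axis] >> np.uint64(b)) & np.uint64(1)) << np.uint64(3 * b + axis)
    return codes

# Sorting by the code places spatially nearby points at nearby positions
# in the 1D order, so "serialized neighbors" approximate true spatial
# neighbors without any KNN search.
points = np.random.rand(1000, 3)
order = np.argsort(morton_code_3d(points))
serialized = points[order]
```

The trade-off named in the abstract is exactly this: serialized neighbors are slightly less precise than KNN neighbors, but the ordering is computed once per cloud and scales to much larger attention windows.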

Comparison of the PTv3 family and competing models (as of 2025)

| Model | Venue | Type | Params | ScanNet mIoU | NuScenes mIoU | Speed / efficiency | Key innovation |
|---|---|---|---|---|---|---|---|
| PTv3 | CVPR'24 | Backbone | 46.1M | 77.5 (scratch) / 79.4 (pretrain) | 80.4 | baseline | Serialize-and-patch, xCPE |
| PTv3 Extreme | 2024 | Competition | 46.1M+ | — | — | — | TTA, ensembling, data augmentation (1st place on Waymo) |
| Sonata | CVPR'25 | 3D SSL pretraining | ~46M (Base) | 79.4+ (fine-tuned) | — | Same as PTv3 | Encoder-only self-distillation |
| Concerto | NeurIPS'25 | 2D-3D SSL | 39M / 108M / 208M | 80.7 | — | Same as PTv3 | 2D-3D cross-modal joint embedding |
| Flash3D | CVPR'25 | Backbone | PTv3-class | > PTv3 | > PTv3 | 2.25x faster, 2.4x less memory | PSH + FlashAttention integration |
| LitePT | 2025.12 | Backbone | 12.7M–86M | On par with PTv3 | 82.2 (S) | 3.6x fewer params, 2x faster | Conv-Attn hybrid, Point-ROPE |
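The "serialize-and-patch" entry for PTv3 in the table can be sketched in a few lines: sort points by their serialization code, then cut the 1D sequence into fixed-size groups and attend only within each group. This is what lets PTv3 grow the attention window from 16 KNN neighbors to 1024 serialized neighbors cheaply. A simplified sketch (the real model pads per batch, shuffles among several curve orders across layers, and applies this inside each attention block):

```python
import numpy as np

def patch_attention_groups(codes, patch_size=1024):
    """Split the serialized point order into fixed-size patches.
    Each returned row is one group of point indices that attend to
    each other; no attention crosses patch boundaries."""
    order = np.argsort(codes)               # serialize: sort by curve code
    pad = (-len(order)) % patch_size        # pad to a multiple of patch_size
    padded = np.concatenate([order, np.repeat(order[-1:], pad)])
    return padded.reshape(-1, patch_size)   # one row = one attention group

# 5000 points with patch_size=1024 -> 5 patches (last one padded by
# repeating the final index).
groups = patch_attention_groups(np.random.randint(0, 1 << 30, size=5000))
```

Because every patch has the same size, attention becomes a dense batched matmul over regular tensors, which is the main source of PTv3's speed and memory wins over neighborhood-based attention.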

Summary

Accuracy SOTA

Concerto (NeurIPS'25) currently leads with 80.7% mIoU on ScanNet. It keeps PTv3 as the backbone but raises representation quality itself through 2D-3D joint self-supervised learning.

Efficiency SOTA

LitePT cuts parameters by 3.6x and runs 2x faster than PTv3, while actually performing better outdoors (NuScenes). Flash3D also delivers strong efficiency through GPU-architecture-level optimization.

Practical selection criteria

  • Real-time processing needed → LitePT or Flash3D (efficiency advantage)
  • Highest accuracy needed → Concerto-pretrained PTv3 (accuracy advantage)
  • General-purpose backbone → PTv3 (most battle-tested ecosystem, Pointcept codebase)

Documentation

https://arxiv.org/abs/2312.10035
[2312.10035] Point Transformer V3 - Simpler, Faster, Stronger
