Multi-task Learning with Localization Ambiguity Suppression for Occupancy Prediction by 42 dot team


wandering developer 2023. 7. 9. 18:33

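The video is short, so I went ahead and translated it.

The last slide is the key part.

The gist is that they took a ResNet-50-based BEVDet4D as the baseline and pushed its performance up with the set of techniques listed below.

The material is difficult, and cramming all of it into five minutes made it even harder to follow.

Let's just take it as "these techniques raised performance by this much" and move on!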

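My personal impression: many companies presented at CVPR, and this talk gave me a strong sense that 42dot is still at an early stage, perhaps about three years behind. The reason is that other companies usually show evidence of having thought hard about deployment, or about the overall development environment, whereas 42dot talks only about the network. Using a transformer generally makes real-time processing difficult, which is a major concern in practice, yet the talk says nothing about it. In autonomous driving the network matters, but so does how and in what environment it was validated and developed, and none of that is covered. Typical examples would be auto-labeling, the GPU development environment, or a scenario simulation environment.

Anyway, it is a pity, but I hope they do well. Right now they seem to be the only Korean company doing this properly.....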

(1) CVPR23 E2EAD | Team 42dot, Technical Report - YouTube
https://www.youtube.com/watch?v=HyTojp5bSxA

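(00:03) Hello everyone, this is team 42dot and my name is Tang. In this presentation we would like to introduce our solution for the occupancy prediction challenge at the CVPR 2023 autonomous driving workshop. The title of my presentation is "Multi-task Learning with Localization Ambiguity Suppression for Occupancy Prediction". The task is to predict the occupancy state and semantic label of each voxel in 3D space, given multi-view images. The baseline model, BEVDet4D, consists of three main components: an image encoder to extract image features,

(00:50) a view transformer to transform 2D features into 3D features, and an occupancy encoder to produce occupancy predictions. We propose several improvements on the baseline. We propose multi-task learning to provide a supervision signal to the network components [the captions are unclear here]. We propose localization ambiguity suppression for occupancy prediction refinement. We extend atrous spatial pyramid pooling to 3D tasks for better occupancy prediction results. We address the data imbalance problem, and finally we employ other techniques for performance improvement.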

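(01:33) Perspective semantic supervision is applied to the image encoder. In particular, we take one feature level [the captions are unclear on which one] and feed it into two ResNet blocks followed by a predictor. Since image-level semantic annotations are not available, [the captions are unclear here]. We extend atrous spatial pyramid pooling to 3D tasks in order to learn 3D features [the captions are unclear, possibly "with different receptive fields"].

In other words, since the images have no semantic labels, they used the label information from the LiDAR.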

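(02:39) In addition, we introduce localization ambiguity suppression to refine the occupancy prediction. In a LiDAR system, the presence of an object can be determined directly from the points detected by the 3D sensor. In a camera system, we need to localize objects from 2D images, which leads to localization ambiguity for distant and small objects. We propose to suppress low-confidence positions using a statistical prior. We also address the data imbalance problem: as shown in the figure, the most common classes occur nearly 10,000 times more often than the rare classes.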

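I think I would have to read the paper cited above to really understand this. Taken literally, they say the data imbalance problem was addressed with a Dice loss and cross-entropy.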
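The talk only names cross-entropy and a Dice loss; it does not say how the two were combined. A minimal NumPy sketch, assuming a plain sum of a class-weighted cross-entropy and a soft Dice term. The function name `ce_dice_loss`, the optional class weights, and the equal weighting of the two terms are my assumptions, not the team's implementation:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_dice_loss(logits, labels, class_weights=None, eps=1e-6):
    """Cross-entropy plus soft Dice loss over N voxels and C classes.

    logits: (N, C) float array, labels: (N,) int array.
    class_weights: optional (C,) array; the weighting is an assumption of this sketch.
    """
    n, c = logits.shape
    probs = softmax(logits)
    onehot = np.eye(c)[labels]
    if class_weights is None:
        class_weights = np.ones(c)
    # Cross-entropy, weighted per voxel by the weight of its true class.
    w = class_weights[labels]
    ce = -(w * np.log(probs[np.arange(n), labels] + eps)).sum() / w.sum()
    # Soft Dice per class, averaged over classes so rare classes count equally.
    inter = (probs * onehot).sum(axis=0)
    denom = probs.sum(axis=0) + onehot.sum(axis=0)
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    return ce + dice
```

The Dice term is computed per class and then averaged, so a class with very few voxels carries the same weight as a class with many. That is the usual reason for pairing it with cross-entropy on imbalanced data.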

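To raise performance they (1) increased the input resolution, (2) used multiple frames over time, (3) enlarged the training data through data augmentation, (4) did labeling that incorporates confidence (pseudo-labeling), and (5) ensembled two models. That is about it.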
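The talk does not say how the two models were ensembled. One common approach, shown here purely as an assumed illustration, averages the per-voxel class probabilities and takes the argmax (`ensemble_predict` is a hypothetical name):

```python
import numpy as np

def ensemble_predict(prob_maps):
    """Average per-class probabilities from several models, then take the argmax.

    prob_maps: list of arrays of shape (X, Y, Z, C), one per model.
    Returns an (X, Y, Z) array of class indices.
    """
    mean_probs = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_probs.argmax(axis=-1)
```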

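(03:29) To address this problem we employ a cross-entropy loss (the captions garble its qualifier) and a Dice loss. We also employ other common techniques for performance improvement, such as high-resolution input, long-term temporal input, test-time augmentation, pseudo-labeling, and model ensembling. For our experiments we use ResNet-50 as our baseline [the captions are unclear on the training schedule], and we use 16 GPUs to train our model in about three days. Here are our experiment results. The baseline model achieved 33.5 mIoU.

Earlier they said they used 5 frames, but the experiments show 2 frames... I think it means the baseline uses 2 frames and the later variants use 5. Still, this point is a little confusing.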

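(04:21) Here we show the performance improvement from adding different components: the perspective semantic supervision loss, a customized architecture (ASPP-3D), high-resolution input, long-term temporal input, pseudo-labeling, custom augmentation, localization ambiguity suppression, a stronger backbone, and finally model ensembling. Our final model achieved 52.45 mIoU

(05:01) on the test set, and we took second place in the occupancy prediction challenge. This is the end of my presentation. More information is available in the technical report. Thank you.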
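For reference, mIoU is the per-class intersection-over-union averaged over classes. A simplified sketch on flat label arrays follows; the challenge's official evaluation may differ in details such as masking or how statistics are accumulated across samples, so this is only meant to show what the number measures:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both arrays; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Toy example with three classes and six voxels.
pred = np.array([0, 0, 1, 1, 2, 2])
target = np.array([0, 1, 1, 1, 2, 0])
print(mean_iou(pred, target, num_classes=3))
```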

Original transcript

Transcript:
(00:03) hello everyone this is team 42 Dot and my name is Tang in this presentation we would like to introduce our solution for occupancy prediction challenge in cvpr 2023 autonomous driving Workshop the title of my presentation is little multitask learning with localization ambiguity suppression for occupancy production the test is to predict occupancy state in Semitic label for a voxel into this space given multiview images a baseline model equals debuted 4D will consist of three main components and even in order to extract image features

(00:50) a view Transformer to transform 2D features into three features and an occupancy encoded to produce occupancy predictions we propose several improvements on the Baseline we propose very in-depth multitask learning for test energy and short part to supervision signal for a network component we propose localization image UT suppression for occupancy prediction refinement with extend actual special pyramid polling to 3D tasks for better occupancy prediction results we address data imbalance problem and finally we employ other techniques for

(01:33) performance Improvement perspective semantic supervision is applied into emit encoder in particular we use the private Appian future level and feed it into two resnet blocks followed by a predictor since the Image level semantic annotations are not available incorporation a blast contact with extend actual special payment pooling for three tasks to learn 3D features to different something with Enfield build

(02:39) in addition Bluetooth localization ambiguity suppression to Define occupancy prediction and line letter system the presence of object can be bred designed by detected particle from 3D sensor in camera system we need to localize 2D objects from images this leads to localization ambiguity in distance and small objects we propose to suppress low confident position using using statistical prior we also address the data imbalance problem as shown in the figure the partial common classes is nearly 10 000 times higher than than red glasses

(03:29) to address this problem we employ query cross entropy and dice us we also employ other common techniques for performance improvements such as high resolution input long-term temporal test time augmentation the pseudo labeling and moral ensemble for hours for our experiments with your rest net 15 as our Baseline and trade it would be 24 hour we we use uh 16 GPU i100 to trade our model in about three days here are our experiment results the Baseline model achieved 33.

(04:21) 5 miou here we show performance Improvement we're adding different components including perspective somatic supervision loss customized architecture aspb 3D I resolution input long contemporary labeling custom augmentation localization ambiguity suppression stronger background and finally is a model ensemble our final model achieved 52.

(05:01) 45 miou on test on display and we win second play in occupancy prediction challenge this is the end of my presentation more information is available on the technical report thank you

I am also attaching a translation of the paper mentioned in the talk.

BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection

https://arxiv.org/pdf/2203.17054.pdf

GitHub repository

https://github.com/HuangJunJie2017/BEVDet


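Abstract
Single-frame data contains limited information, which caps the performance of existing vision-based multi-camera 3D object detection paradigms. To fundamentally push the performance boundary in this area, a new paradigm called BEVDet4D is proposed. It lifts the scalable BEVDet paradigm from the spatial-only 3D working space into the spatial-temporal 4D working space. We upgrade the original BEVDet framework with a few simple modifications so that the feature from the previous frame is fused with the corresponding feature in the current frame. In this way, at negligible additional computing cost, BEVDet4D gains access to temporal cues by querying and comparing the two candidate features. Beyond this, we simplify the velocity prediction task by reducing it to predicting the positional offset between the two adjacent features. As a result, BEVDet4D, with robust generalization performance, reduces the velocity error by up to 62.9%. This makes vision-based methods comparable, for the first time, with those relying on LiDAR or radar in this respect. On the challenging nuScenes benchmark, we report a new record of 54.5% NDS with the high-performance configuration BEVDet4D-Base. At the same inference speed, this notably surpasses the previous leading method, BEVDet-Base, by +7.3% NDS. The source code is publicly available for further research [1].

[1] https://github.com/HuangJunJie2017/BEVDet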

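1. Introduction
Autonomous driving has recently drawn great attention in both the research and industry communities. Vision-based perception tasks in this setting include 3D object detection, BEV semantic segmentation, motion prediction, and so on. Most of them can be partly solved in the spatial-only 3D working space with a single frame of data. For time-relevant targets such as velocity, however, current vision-based paradigms that use only a single frame perform far worse than methods using sensors like LiDAR or radar. For example, the velocity error of BEVDet, a recent leading method in vision-based 3D object detection, is 3 times that of the LiDAR-based CenterPoint and 2 times that of the radar-based CenterFusion. To close this gap, we propose a new paradigm, BEVDet4D, and pioneer vision-based autonomous driving in the spatial-temporal 4D space.

As illustrated in Fig. 2, BEVDet4D makes the first attempt at accessing the rich information in the temporal domain. It simply extends the original BEVDet by retaining the intermediate BEV features of previous frames, and then fuses the retained feature with the corresponding feature of the current frame using only a spatial alignment operation and a concatenation operation. Most other details of the framework are left unchanged. This adds almost no computational cost at inference while giving the paradigm access to temporal cues by querying and comparing the two candidate features. Although the BEVDet4D framework is simple to construct, making it perform robustly is nontrivial. The spatial alignment operation and the learning targets must be carefully designed to work with the framework, so that velocity prediction is simplified and BEVDet4D generalizes well.

We conduct comprehensive experiments on the challenging nuScenes benchmark [2] to verify the feasibility of BEVDet4D and study its characteristics. Fig. 1 shows the trade-off between inference speed and performance for different paradigms. Without bells and whistles, the BEVDet4D-Tiny configuration reduces the velocity error by 62.9%, from 0.909 mAVE to 0.337 mAVE. The proposed paradigm also improves other indicators considerably: detection score (+2.6% mAP), orientation error (-12.0% mAOE), and attribute error (-25.1% mAAE). As a result, BEVDet4D-Tiny exceeds the baseline by +8.4% on the composite indicator NDS. The high-performance configuration, BEVDet4D-Base, scores 42.1% mAP and 54.5% NDS, surpassing all published results in vision-based 3D object detection [40, 41, 42, 22, 6, 18, 15]. Finally, BEVDet4D achieves this at a negligible cost in inference latency, which is meaningful for autonomous driving.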

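2. Related Work
2.1. Vision-based 3D object detection
Vision-based 3D object detection is a promising perception task for autonomous driving. Over the last few years, fueled by the KITTI [11] benchmark, monocular 3D object detection has developed rapidly [26, 23, 47, 53, 49, 31, 39, 38, 16]. However, the limited data and the single view have made it hard to develop more complicated tasks.

Recently, large-scale benchmarks [2, 35] with sufficient data and surround views have been proposed, offering new perspectives on paradigm development in 3D object detection. Based on these benchmarks, several multi-camera 3D object detection paradigms with competitive performance have been developed. For example, inspired by the success of FCOS [36] in 2D detection, FCOS3D [40] treats 3D object detection as a 2D object detection problem and performs perception only in the image view. Because the targets' attributes correlate strongly with image appearance, it predicts attributes well but is relatively poor at perceiving the targets' translation, velocity, and orientation. PGD [41] develops the FCOS3D paradigm further by identifying and resolving its outstanding shortcoming, the prediction of target depth. This gives a remarkable accuracy improvement over the baseline, but at the cost of more computation and additional inference latency. Following DETR [3], DETR3D [42] proposes detecting 3D objects in an attention pattern and reaches accuracy similar to FCOS3D. Although DETR3D needs only half the computational budget, its complex calculation pipeline slows its inference speed down to the same level as FCOS3D. PETR [22] improves this paradigm further by introducing 3D coordinate generation and position encoding, and it also uses strong data augmentation strategies, as BEVDet [15] does. Another concurrent work, Graph-DETR3D [6], extends DETR3D in two aspects. Analogous to the second stage of CenterPoint [46], Graph-DETR3D samples multiple points in 3D space instead of a single point when generating the features of the object queries. The other modification makes multi-scale training feasible for the DETR3D paradigm by dynamically adjusting the depth target according to the scaling factor. BEVDet [15] makes the first attempt at applying a strong data augmentation strategy in vision-based 3D object detection. Because BEVDet explicitly encodes features in the BEV (bird's-eye view) space, it is scalable in several respects, including multi-task learning, multi-sensor fusion, and temporal fusion. BEVDet4D is the temporal extension of BEVDet [15].

So far, few works have exploited temporal cues in vision-based 3D object detection. As a result, existing paradigms [40, 41, 42, 22, 15] predict time-relevant targets such as velocity relatively poorly compared with LiDAR-based [46] or radar-based [27] methods. To the best of our knowledge, [1] is the only pioneer in this direction. However, it predicts results from a single frame and uses a 3D Kalman filter to update them for temporal consistency. The temporal cues are exploited in the post-processing phase rather than in the end-to-end learning framework. In contrast, we make the first attempt at exploiting temporal cues within the end-to-end learning framework, BEVDet4D, which is elegant, powerful, and still scalable. BEVFormer [18] is a concurrent work of BEVDet4D. Like the corresponding methods [5, 8, 9] in the VID literature, BEVFormer focuses mainly on feature fusion in the spatial-temporal 4D working space using the attention mechanism [37]. BEVFormer's comparable velocity precision is achieved by fusing features from multiple adjacent frames (4 frames in total), which is analogous to most LiDAR-based methods [2, 46] that use points from multiple sweeps. This is fundamentally different from the proposed BEVDet4D, which uses only two adjacent frames and achieves higher velocity precision in a more elegant pattern.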

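2.2. Object detection in video
Video object detection, fueled mainly by the ImageNet VID dataset [33], is analogous to the common object detection task [19], which is performed and evaluated in the image-view space. The difference is that detecting objects in video can draw on temporal cues to improve detection accuracy. Methods in this area access temporal cues mainly through two kinds of media: the prediction results or the intermediate features. The former [45] is analogous to [1] in vision-based 3D object detection, which optimizes the prediction results in a tracking pattern. The latter reuses features from the previous frame through special architectures such as LSTM [14] for feature distillation [21, 20, 25], the attention mechanism [37] for feature querying [5, 8, 9], and optical flow [10] for feature alignment [52, 51].
Specific to autonomous driving, BEVDet4D is analogous in mechanism to the flow-based methods, but it obtains the spatial correlation from the ego-motion and performs feature aggregation in 3D space. In addition, BEVDet4D focuses mainly on predicting velocity targets, which is outside the scope of the common video object detection literature.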

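3. Methodology
3.1. Network structure
As illustrated in Fig. 2, the overall framework of BEVDet4D is built on the BEVDet [15] baseline, which consists of four kinds of modules: an image-view encoder, a view transformer, a BEV encoder, and a task-specific head. The implementation details of these modules are kept unchanged. To exploit temporal cues, BEVDet4D extends the baseline by retaining the BEV features generated by the view transformer in the previous frame.
The retained feature is then merged with the feature of the current frame. Before merging, an alignment operation is applied to simplify the learning targets, as detailed in the following subsection.
We apply a simple concatenation operation to merge the features in order to verify the BEVDet4D paradigm; more complicated fusion strategies are not explored. In addition, the feature generated by the view transformer is sparse, which is too coarse for the subsequent modules to exploit temporal cues. Therefore, an extra BEV encoder is applied to adjust the candidate features before the temporal fusion. In practice, the extra BEV encoder consists of two basic residual units [13], with the channel number set to that of the input feature.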
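The merge step described above is just a channel-wise concatenation of two feature maps of the same shape. A minimal NumPy sketch, assuming features laid out as (C, H, W) and a previous-frame feature that has already been aligned; `fuse_temporal` is a hypothetical name and this is not the repository's code:

```python
import numpy as np

def fuse_temporal(feat_curr, feat_prev_aligned):
    """Concatenate the current and the aligned previous BEV feature along channels."""
    if feat_curr.shape != feat_prev_aligned.shape:
        raise ValueError("BEV features must share the same (C, H, W) shape")
    # The result has 2*C channels; later modules compare the two halves.
    return np.concatenate([feat_curr, feat_prev_aligned], axis=0)
```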

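3.2. Simplifying the velocity learning task
Symbol definition. Following nuScenes [2], we denote the global coordinate system as $O_g - XYZ$, the ego coordinate system as $O_{e(T)} - XYZ$, and the target coordinate system as $O_{t(T)} - XYZ$. As shown in Fig. 3, we construct a virtual scene with a moving ego vehicle and two target vehicles. One of the targets is static in the global coordinate system ($O_s - XYZ$, painted green) and the other is moving ($O_m - XYZ$, painted blue). The objects in two adjacent frames (frame $T-1$ and frame $T$) are distinguished by different transparency. The position of an object is written as $P^{x}(t)$, where $x \in \{g, e(T), e(T-1)\}$ denotes the coordinate system in which the position is defined and $t \in \{T, T-1\}$ denotes the time at which the position is recorded. We use $T^{dst}_{src}$ to denote the transformation from the source coordinate system to the target coordinate system.

Instead of directly predicting the velocity of the targets, we predict the translation of the targets between the two adjacent frames. This removes the time factor, and the positional shift can be measured from the difference between the two BEV features alone. In addition, we want to learn a positional shift that is independent of the ego-motion. This also simplifies the learning task, because ego-motion makes the distribution of the targets' positional shifts more complicated.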

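For example, because of the ego-motion, an object that is static in the global coordinate system (the green box in Fig. 3) becomes a moving object in the ego coordinate system. More specifically, the receptive field of the BEV features is defined symmetrically around the ego. For the two features generated by the view transformer in the two adjacent frames, the receptive fields in the global coordinate system differ because of the ego-motion. For a static object, its positions in the global coordinate system in the two adjacent frames are denoted $P^{g}_{s}(T)$ and $P^{g}_{s}(T-1)$. The positional shift between the two features is given by Eq. 1.

According to Eq. 1, if we concatenate the two features directly, the learning target of the following modules (the positional shift of the target between the two features) depends on the ego-motion, that is, on $T^{e(T-1)}_{e(T)}$. To avoid this, we shift the target in the adjacent frame by $T^{e(T)}_{e(T-1)}$ to remove the ego-motion component.

According to Eq. 2, the learning target is set to the object's motion in the current frame's ego coordinate system, which is independent of the ego-motion.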
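A small numeric sketch of this argument in 2D, under assumptions of my own: a hypothetical scene in which the ego drives 1 m forward between the two frames, poses given as (x, y, yaw), and a frame interval `dt` of 0.5 s. It shows that a static target has a non-zero shift when the two ego-frame positions are subtracted directly, and a zero shift once the previous position is moved into the current ego frame:

```python
import numpy as np

def pose_matrix(x, y, yaw):
    """Homogeneous 2D transform from an ego frame to the global frame."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

def aligned_shift(p_prev_ego, p_curr_ego, pose_prev, pose_curr):
    """Shift of a target expressed in the current ego frame (ego-motion removed)."""
    # T^{e(T)}_{e(T-1)}: previous ego frame -> global -> current ego frame.
    prev_to_curr = np.linalg.inv(pose_curr) @ pose_prev
    p_prev_in_curr = prev_to_curr @ np.append(p_prev_ego, 1.0)
    return p_curr_ego - p_prev_in_curr[:2]

# Hypothetical scene: the ego drives 1 m forward between frames T-1 and T.
pose_prev = pose_matrix(0.0, 0.0, 0.0)
pose_curr = pose_matrix(1.0, 0.0, 0.0)

# A static target at global (10, 2), seen from each ego frame.
static_prev = np.array([10.0, 2.0])
static_curr = np.array([9.0, 2.0])
naive = static_curr - static_prev  # direct difference: depends on ego-motion
aligned = aligned_shift(static_prev, static_curr, pose_prev, pose_curr)

dt = 0.5  # assumed interval between the two frames, in seconds
velocity = aligned / dt  # velocity recovered from the predicted translation
```

Dividing the aligned shift by the frame interval recovers the velocity, which is why the network can be trained on translation alone.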

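In practice, the alignment operation in Eq. 2 is implemented as feature alignment. Given the candidate feature of the previous frame, $F(T-1, P^{e(T-1)})$, and that of the current frame, $F(T, P^{e(T)})$, the aligned feature is obtained by Eq. 3.

Along with Eq. 3, bilinear interpolation is applied, because $T^{e(T-1)}_{e(T)} P^{e(T)}$ may not be a valid position in the sparse feature $F(T-1, P^{e(T-1)})$. Interpolation is a suboptimal method that leads to some loss of precision. The magnitude of this loss is negatively correlated with the resolution of the BEV features. A more precise method is to adjust the coordinates of the point cloud generated by the lifting operation in the view transformer. However, that method is not used in this paper because it would break the precondition of the acceleration method proposed in the original BEVDet [15]. The magnitude of the precision loss is estimated quantitatively in the ablation study in Section 4.3.2.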
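A NumPy sketch of this kind of feature alignment, written from the description above rather than taken from the BEVDet code. The grid layout (x along columns, y along rows, grid centred on the ego), the `cell_size` parameter, and the zero fill outside the previous grid are all assumptions of this sketch:

```python
import numpy as np

def align_prev_feature(feat_prev, curr_to_prev, cell_size):
    """Resample the previous BEV feature onto the current ego frame's grid.

    feat_prev: (C, H, W) feature of frame T-1, grid centred on the ego vehicle.
    curr_to_prev: 3x3 homogeneous 2D transform from ego(T) to ego(T-1) coordinates.
    cell_size: metres per BEV cell (hypothetical parameter of this sketch).
    """
    c, h, w = feat_prev.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Metric coordinates of each cell centre in the current ego frame.
    x = (cols - (w - 1) / 2.0) * cell_size
    y = (rows - (h - 1) / 2.0) * cell_size
    pts = np.stack([x.ravel(), y.ravel(), np.ones(h * w)])
    # Where each current-frame cell lies in the previous ego frame.
    px, py, _ = curr_to_prev @ pts
    # Back to fractional grid indices of the previous feature map.
    fc = px / cell_size + (w - 1) / 2.0
    fr = py / cell_size + (h - 1) / 2.0
    c0 = np.floor(fc).astype(int)
    r0 = np.floor(fr).astype(int)
    wc, wr = fc - c0, fr - r0
    out = np.zeros((c, h * w))
    # Accumulate the four neighbours; samples outside the grid contribute zero.
    for dr, dc, wgt in [(0, 0, (1 - wr) * (1 - wc)), (0, 1, (1 - wr) * wc),
                        (1, 0, wr * (1 - wc)), (1, 1, wr * wc)]:
        r, cc = r0 + dr, c0 + dc
        valid = (r >= 0) & (r < h) & (cc >= 0) & (cc < w)
        out[:, valid] += feat_prev[:, r[valid], cc[valid]] * wgt[valid]
    return out.reshape(c, h, w)
```

When the ego-motion is not a whole number of cells, each output value is a weighted mix of four neighbouring cells, which is where the precision loss mentioned above comes from. A finer BEV grid makes that mixing less harmful.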

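Conclusion
We pioneer vision-based autonomous driving in the spatial-temporal 4D space by proposing BEVDet4D, which lifts the scalable BEVDet [15] from the spatial-only 3D working space into the spatial-temporal 4D working space. BEVDet4D retains the elegance of BEVDet while substantially improving performance in multi-camera 3D object detection, particularly in velocity prediction. Future work will focus on designing frameworks and paradigms that actively mine temporal cues.

The original English text is attached as well.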

Abstract
Single frame data contains finite information which limits the performance of the existing vision-based multi-camera 3D object detection paradigms. For fundamentally pushing the performance boundary in this area, a novel
paradigm dubbed BEVDet4D is proposed to lift the scalable BEVDet paradigm from the spatial-only 3D working
space into the spatial temporal 4D working space. We upgrade the naive BEVDet framework with a few modifications just for fusing the feature from the previous frame with the corresponding one in the current frame. In this
way, with negligible additional computing budget, we enable BEVDet4D to access the temporal cues by querying
and comparing the two candidate features. Beyond this, we simplify the task of velocity prediction by degenerating it
into the positional offset prediction in the two adjacent features. As a result, BEVDet4D with robust generalization
performance reduces the velocity error by up to 62.9%.
This makes the vision-based methods, for the first time, become comparable with those relying on LiDAR or radar in
this aspect. On challenge benchmark nuScenes, we report a new record of 54.5% NDS with the high-performance configuration dubbed BEVDet4D-Base. At the same inference
speed, this notably surpasses the previous leading method BEVDet-Base by +7.3% NDS.

The source code is publicly available for further research (https://github.com/HuangJunJie2017/BEVDet).

1. Introduction
Recently, autonomous driving has drawn great attention in both the research and the industry communities. The vision-based perception tasks in this scene include 3D object detection, BEV semantic segmentation, motion prediction, and so on. Most of them can be partly solved in the spatial-only 3D working space with a single frame of data. However, with respect to the time-relevant targets like velocity, current vision-based paradigms with merely a single
frame of data perform far poorer than those with sensors like LiDAR or radar. For example, the velocity error of the
recently leading method BEVDet [15] in the vision-based 3D object detection is 3 times that of the LiDAR-based
method CenterPoint [46] and 2 times that of the radar-based method CenterFusion [27]. To close this gap, we propose a
novel paradigm dubbed BEVDet4D in this paper and pioneer the exploitation of vision-based autonomous driving in
the spatial-temporal 4D space.
As illustrated in Fig. 2, BEVDet4D makes the first attempt at accessing the rich information in the temporal domain. It simply extends the naive BEVDet [15] by retaining the intermediate BEV features in the previous frames. Then
it fuses the retained feature with the corresponding one in the current frame just by a spatial alignment operation and
a concatenation operation. Other than that, we kept most other details of the framework unchanged. In this way, we
place just a negligible extra computational budget on the inference process while enabling the paradigm to access the
temporal cues by querying and comparing the two candidate features. Though simple in constructing the framework of BEVDet4D, it is nontrivial to build its robust performance. The spatial alignment operation and the learning targets should be carefully designed to cooperate with the elegant framework so that the velocity prediction task can be simplified and superior generalization performance can be achieved with BEVDet4D.
We conduct comprehensive experiments on the challenge benchmark nuScenes [2] to verify the feasibility of
BEVDet4D and study its characteristics. Fig. 1 illustrates the trade-off between inference speed and performance
of different paradigms. Without bells and whistles, the BEVDet4D-Tiny configuration reduces the velocity error
by 62.9% from 0.909 mAVE to 0.337 mAVE. Besides, the proposed paradigm also has significant improvement in the
other indicators like detection score (+2.6% mAP), orientation error (-12.0% mAOE), and attribute error (-25.1%
mAAE). As a result, BEVDet4D-Tiny exceeds the baseline by +8.4% on the composite indicator NDS. The high-performance configuration dubbed BEVDet4D-Base scores as high as 42.1% mAP and 54.5% NDS, which surpasses all published results in vision-based 3D object detection [40, 41, 42, 22, 6, 18, 15]. Last but not least, BEVDet4D achieves the aforementioned superiority just at a negligible cost in inference latency, which is meaningful in the scenario of autonomous driving.

2. Related Works
2.1. Vision-based 3D object detection
Vision-based 3D object detection is a promising perception task in autonomous driving. In the last few years, fueled by the KITTI [11] benchmark, monocular 3D object detection has witnessed rapid development [26, 23, 47, 53,
49, 31, 39, 38, 16]. However, the limited data and the single view disable it in developing more complicated tasks.
Recently, some large-scale benchmarks [2, 35] have been proposed with sufficient data and surrounding views, offering new perspectives toward the paradigm development
in the field of 3D object detection. Based on these benchmarks, some multi-camera 3D object detection paradigms
have been developed with competitive performance. For example, inspired by the success of FCOS [36] in 2D detection, FCOS3D [40] treats the 3D object detection problem as a 2D object detection problem and conducts perception just in image view. Benefitting from the strong spatial
correlation of the targets’ attribute with the image appearance, it works well in predicting this but is relatively poor
in perceiving the targets’ translation, velocity, and orientation. PGD [41] further develops the FCOS3D paradigm
by searching and resolving the outstanding shortcoming (i.e. the prediction of the targets’ depth). This offers a
remarkable accuracy improvement on the baseline but at the cost of more computational budget and additional inference latency. Following DETR [3], DETR3D [42] proposes to detect 3D objects in an attention pattern, which has similar accuracy as FCOS3D. Although DETR3D requires just half the computational budget, the complex calculation
pipeline slows down its inference speed to the same level as FCOS3D. PETR [22] further develops the performance
of this paradigm by introducing the 3D coordinate generation and position encoding. Besides, they also exploit the
strong data augmentation strategies just as BEVDet [15].
Another concurrent work dubbed Graph-DETR3D [6] also extends DETR3D in two aspects. Analogous to the
second stage in CenterPoint [46], Graph-DETR3D samples
multiple points in the 3D space instead of a single point
when generating the features of the object queries. Another
modification is making the multi-scale training become feasible for DETR3D paradigm by dynamically adjusting the
depth target according to the scaling factor. As a novel
paradigm, BEVDet [15] makes the first attempt at applying a strong data augmentation strategy in vision-based 3D
object detection. As BEVDet explicitly encodes features in
the BEV space, it is scalable in multiple aspects including
multi-tasks learning, multi-sensors fusion, and temporal fusion. BEVDet4D is the temporal extension of BEVDet [15].
So far, few works have exploited the temporal cues
in vision-based 3D object detection. Thus, the existing
paradigms [40, 41, 42, 22, 15] perform relatively poorly
in predicting the time-relevant targets like velocity than the
LiDAR-based [46] or radar-based [27] methods. To the best
of our knowledge, [1] is the only pioneer in this perspective. However, they predict the results based on a single frame and exploit the 3D Kalman filter to update the
results for the temporal consistency of results between image sequences. The temporal cues are exploited in the post-processing phase instead of the end-to-end learning framework. Differently, we make the first attempt at exploiting the temporal cues in the end-to-end learning framework
BEVDet4D, which is elegant, powerful, and still scalable.
BEVFormer [18] is a concurrent work of BEVDet4D. Analogous to those [5, 8, 9] in the VID literature, they mainly focus on the feature fusion in the spatial-temporal 4D working
space with the attention mechanism [37]. The comparable
velocity precision of BEVFormer is achieved by fusing features from multiple adjacent frames (i.e. 4 frames in total),
which is analogous to most LiDAR-based methods [2, 46]
with points from multiple sweeps. This is fundamentally
different from the proposed BEVDet4D, which uses merely
two adjacent frames and achieved a higher velocity precision in a more elegant pattern.

2.2. Object Detection in Video
Video object detection mainly fueled by the ImageNet VID dataset [33] is analogous to the well-known tasks of
common object detection [19] which performs and evaluates the object detection task in the image-view space. The
difference is that detecting objects in video can access the temporal cues for improving detection accuracy. The methods in this area access the temporal cues mainly according
to two kinds of mediums: the predicting results or the intermediate features. The former [45] is analogous to [1] in
vision-based 3D object detection, which optimizes the prediction results in a tracking pattern. The latter reutilizes
the features from the previous frame based on some special architectures like LSTM [14] for feature distillation
[21, 20, 25], attention mechanism [37] for feature querying [5, 8, 9], and optical flow [10] for feature alignment [52, 51].
Specific for the scene of autonomous driving, BEVDet4D is analogous to the flow-based methods in mechanism but accesses the spatial correlation according to the ego-motion
and conducts feature aggregation in the 3D space. Besides, BEVDet4D mainly focuses on the prediction of the velocity targets which is not in the scope of the common video object detection literature.

3. Methodology
3.1. Network Structure
As illustrated in Fig. 2, the overall framework of
BEVDet4D is built upon the BEVDet [15] baseline, which consists of four kinds of modules: an image-view encoder, a view transformer, a BEV encoder, and a task-specific head.
All implementation details of these modules are kept unchanged. To exploit the temporal cues, BEVDet4D extends the baseline by retaining the BEV features generated by the view transformer in the previous frame.
Then the retained feature is merged with the one in the current frame. Before that, an alignment operation is conducted to simplify the learning targets which will be detailed in the following subsection.
We apply a simple concatenation operation to merge the features for verifying the BEVDet4D paradigm. More complicated fusing strategies have not been exploited in this paper.
Besides, the feature generated by the view transformer is sparse, which is too coarse for the subsequent modules to exploit the temporal cues. Therefore, an extra BEV encoder is applied to adjust the candidate features before the temporal fusion. In practice, the extra BEV encoder consists of two naive residual units [13], whose channel number is set the same as the input feature.

3.2. Simplify the Velocity Learning Task
Symbol Definition. Following nuScenes [2], we denote the global coordinate system as $O_g - XYZ$, the ego coordinate system as $O_{e(T)} - XYZ$, and the target coordinate system as $O_{t(T)} - XYZ$. As illustrated in Fig. 3, we construct a virtual scene with a moving ego vehicle and two target vehicles. One of the targets is static in the global coordinate system (i.e., $O_s - XYZ$, painted green), while the other one is moving (i.e., $O_m - XYZ$, painted blue). The objects in two adjacent frames (i.e., frame $T-1$ and frame $T$) are distinguished by different transparency. The position of an object is formulated as $P^{x}(t)$, where $x \in \{g, e(T), e(T-1)\}$ denotes the coordinate system in which the position is defined and $t \in \{T, T-1\}$ denotes the time at which the position is recorded. We use $T^{dst}_{src}$ to denote the transformation from the source coordinate system into the target coordinate system.
Instead of directly predicting the velocity of the targets,
we tend to predict the translation of the targets in the two adjacent frames. In this way, the learning task can be simplified as the time factor is removed and the positional shifting
can be measured just according to the difference between
the two BEV features. Besides, we tend to learn the position shifting that is irrelevant to the ego-motion. In this way,
the learning task can also be simplified as the ego-motion will make the distribution of the targets’ positional shifting
more complicated.
For example, due to the ego-motion, a static object (i.e.,
the green box in Fig. 3) in the global coordinate system will
be changed into a moving object in the ego coordinate system. More specifically, the receptive field of the BEV features is symmetrically defined around the ego. Considering
the two features generated by the view transformer in the
two adjacent frames, their receptive fields in the global coordinate system are diverse due to the ego-motion. Given a
static object, its position in the global coordinate system is
denoted as $P^{g}_{s}(T)$ and $P^{g}_{s}(T-1)$ in the two adjacent frames.
The positional shifting in the two features should be formulated as:
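The equation itself is missing from this copy. A reconstruction that follows from the definitions above, writing the static target's position in coordinate system $x$ at time $t$ as $P^{x}_{s}(t)$ and using $T^{e(T-1)}_{g} = T^{e(T-1)}_{e(T)} T^{e(T)}_{g}$, is offered here as a sketch rather than a verbatim copy of the paper's Eq. 1:

```latex
\begin{aligned}
& P^{e(T)}_{s}(T) - P^{e(T-1)}_{s}(T-1) \\
&= T^{e(T)}_{g} P^{g}_{s}(T) - T^{e(T-1)}_{g} P^{g}_{s}(T-1) \\
&= T^{e(T)}_{g} P^{g}_{s}(T) - T^{e(T-1)}_{e(T)} T^{e(T)}_{g} P^{g}_{s}(T-1)
\end{aligned}
\tag{1}
```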

According to Eq. 1, if we directly concatenate the two features, the learning target (i.e., the positional shifting of the target in the two features) of the following modules is relevant to the ego-motion (i.e., $T^{e(T-1)}_{e(T)}$). To avoid this, we shift the target in the adjacent frame by $T^{e(T)}_{e(T-1)}$ to remove the fraction of ego-motion.
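Eq. 2 is also missing from this copy. Under the same notation, with $P^{x}_{s}(t)$ the static target's position in coordinate system $x$ at time $t$ and $T^{dst}_{src}$ the transform from the source to the target coordinate system, a reconstruction consistent with the surrounding text would be the following, where the step from the second to the third line uses $T^{e(T)}_{e(T-1)} T^{e(T-1)}_{e(T)} = I$:

```latex
\begin{aligned}
& P^{e(T)}_{s}(T) - T^{e(T)}_{e(T-1)} P^{e(T-1)}_{s}(T-1) \\
&= T^{e(T)}_{g} P^{g}_{s}(T) - T^{e(T)}_{e(T-1)} T^{e(T-1)}_{e(T)} T^{e(T)}_{g} P^{g}_{s}(T-1) \\
&= T^{e(T)}_{g} P^{g}_{s}(T) - T^{e(T)}_{g} P^{g}_{s}(T-1) \\
&= P^{e(T)}_{s}(T) - P^{e(T)}_{s}(T-1)
\end{aligned}
\tag{2}
```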

According to Eq. 2, the learning target is set as the object’s motion in the current frame’s ego coordinate system, which
is irrelevant to the ego-motion.

In practice, the alignment operation in Eq. 2 is achieved by feature alignment. Given the candidate features of the
previous frame $F(T-1, P^{e(T-1)})$ and the current frame $F(T, P^{e(T)})$, the aligned feature can be obtained by:
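Eq. 3 did not survive in this copy either. From the sentence that follows it in the text, which says the previous feature is sampled at the position obtained by transforming $P^{e(T)}$ from the current ego frame into the previous one, a consistent reconstruction would be the following, with $F'$ as my label for the aligned feature:

```latex
F'\big(T-1,\; P^{e(T)}\big) = F\big(T-1,\; T^{e(T-1)}_{e(T)} P^{e(T)}\big)
\tag{3}
```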

Along with Eq. 3, bilinear interpolation is applied as $T^{e(T-1)}_{e(T)} P^{e(T)}$ may not be a valid position in the sparse
feature of $F(T-1, P^{e(T-1)})$. The interpolation is a suboptimal method that will lead to precision degeneration.
The magnitude of the precision degeneration is negatively
correlated with the resolution of the BEV features. A more
precise method is to adjust the coordinates of the point
cloud generated by the lifting operation in the view transformer [30]. However, it is not adopted in this paper as it would destroy the precondition of the acceleration method
proposed in the naive BEVDet [15]. The magnitude of the precision degeneration will be quantitatively estimated in the ablation study in Section 4.3.2.

5. Conclusion
We pioneer the exploitation of vision-based autonomous driving in the spatial-temporal 4D space by proposing
BEVDet4D to lift the scalable BEVDet [15] from spatial-only 3D working space into spatial-temporal 4D working
space. BEVDet4D retains the elegance of BEVDet while substantially pushing the performance in multi-camera 3D
object detection, particularly in the velocity prediction aspect. Future works will focus on the design of framework
and paradigm for actively mining the temporal cues.