v3 Paper Translation: Advancing End-to-End 3D Detection and Tracking (feat. ChatGPT)

Uncategorized · 2024. 10. 6. 23:07
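Even I, who know nothing about computer vision, could run it and get good results,

so I looked at the model, but I couldn't understand any of it.

So I read the paper, but I still couldn't understand any of it.

As a start, I translated it so I could at least read it in Korean.

I still don't understand it.

I don't understand it, but I wrote it down anyway.....

After all, even a schoolhouse dog learns to recite poetry after ten years......

Reading the post below before reading the paper should help.

For reference, the model is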

img_backbone = "ResNet"

img_neck = "FPN"

depth_branch = "DenseDepthNet"

head = "attention-based ... model"

Question 1. The image and the DenseDepthNet output have different sizes; how is that handled?

Answer:

https://github.com/HorizonRobotics/Sparse4D/tree/main?tab=readme-ov-file

v3 Paper Translation: Advancing End-to-End 3D Detection and Tracking

I translated it using ChatGPT.

In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. Building on the Sparse4D framework, this paper explores the field in greater depth. We introduce two auxiliary training tasks, **Temporal Instance Denoising** and **Quality Estimation**, and propose **decoupled attention** as a structural improvement, which significantly improves detection performance. In addition, we extend the detector into a tracker with a simple scheme that assigns instance IDs during inference, further highlighting the advantages of query-based algorithms.

Extensive experiments on the nuScenes benchmark validate the proposed improvements. With ResNet50 as the backbone, mAP, NDS, and AMOTA improve by 3.0%, 2.2%, and 7.6%, reaching 46.9%, 56.1%, and 49.0%, respectively. Our best model achieves 71.9% NDS and 67.7% AMOTA on the nuScenes test set. Code will be released at https://github.com/linxuewu/Sparse4D.

Original text

In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance ID during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we witnessed enhancements of 3.0%, 2.2%, and 7.6% in mAP, NDS, and AMOTA, achieving 46.9%, 56.1%, and 49.0%, respectively. Our best model achieved 71.9% NDS and 67.7% AMOTA on the nuScenes test set. Code will be released at https://github.com/linxuewu/Sparse4D.

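1. Introduction

In temporal multi-view perception research, sparse-based algorithms have advanced considerably [41, 6, 5, 43, 26, 27], reaching performance comparable to dense BEV-based algorithms [21, 13, 11, 19, 18, 35, 44, 8] while offering several advantages:

1) Free view transform: sparse methods do not need to convert image space into 3D vector space.

2) Constant computational load: the computation in the detection head is constant regardless of perception distance or image resolution.

3) Easy end-to-end integration of downstream tasks.

In this study, we choose Sparse4Dv2 [26, 27] as the baseline to improve upon. The overall structure of the algorithm is illustrated in Figure 1: the image encoder transforms multi-view images into multi-scale feature maps, and the decoder blocks use these image features to refine instances and produce perception results.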

Figure 3: Visualizing Attention Weights in Instance Self-Attention: 1) The first row reveals attention weights in vanilla self-attention, where pedestrians in red circles show unintended correlations with the target vehicle (green box). 2) The second row displays attention weights in decoupled attention, effectively addressing the issue.

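First, we observe that sparse-based algorithms have more difficulty converging than dense-based ones, which ultimately hurts their final performance. This problem has been studied thoroughly in 2D detection [17, 48, 53] and is mainly attributed to one-to-one positive sample matching. This matching scheme is unstable in the early stages of training, and it yields fewer positive samples than one-to-many matching, which reduces the efficiency of decoder training.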
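To make the difference in positive sample counts concrete, here is a minimal sketch of the two matching styles on a toy cost matrix. The cost values, the `max_cost` cutoff, and both function names are hypothetical; real detectors use the Hungarian algorithm rather than brute force, and this is not the code used in Sparse4D.

```python
from itertools import permutations

def one_to_one_match(cost):
    """Brute-force minimum-cost one-to-one assignment.

    cost[g][q] is the cost of assigning ground truth g to query q.
    Returns a list whose entry g is the query index matched to ground truth g.
    Only practical for tiny examples.
    """
    num_gt, num_query = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for queries in permutations(range(num_query), num_gt):
        total = sum(cost[g][q] for g, q in enumerate(queries))
        if total < best_total:
            best, best_total = list(queries), total
    return best

def one_to_many_match(cost, max_cost):
    """Every query whose cost to a ground truth is below max_cost is positive."""
    return [[q for q, c in enumerate(row) if c < max_cost] for row in cost]

# 2 ground truths, 5 queries (hypothetical costs).
cost = [
    [0.2, 0.3, 0.9, 0.8, 0.25],
    [0.9, 0.8, 0.1, 0.15, 0.7],
]
one_to_one = one_to_one_match(cost)
one_to_many = one_to_many_match(cost, max_cost=0.5)
print(one_to_one)                         # [0, 2]: one positive per ground truth
print(sum(len(p) for p in one_to_many))   # 5 positives in total
```

With one-to-one matching only two of the five queries receive a positive label, which is the scarcity the paragraph above refers to.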

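Moreover, Sparse4D uses sparse feature sampling instead of global cross-attention, which further hampers encoder convergence because positive samples are scarce. Sparse4Dv2 [27] introduced dense depth supervision to partially mitigate the image encoder's convergence problem.

The main goal of this paper is to improve model performance by making decoder training more stable. To this end, we incorporate a denoising task as auxiliary supervision and extend the denoising techniques used in 2D single-frame detection to 3D temporal detection. This not only ensures stable positive sample matching but also greatly increases the number of positive samples.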
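The idea of building extra queries from perturbed ground truth can be sketched as follows. This is a simplification under my own assumptions: boxes are reduced to (x, y, z) centers, the noise ranges are made up, and positive or negative labels are fixed by construction instead of by the pre-matching step that the paper's Figure 4 describes. The function name `make_noisy_anchors` is hypothetical.

```python
import random

def make_noisy_anchors(gt_boxes, num_groups, pos_scale, neg_scale, seed=0):
    """Build denoising queries by perturbing ground-truth boxes.

    Each group holds, per ground truth, one lightly perturbed copy (positive,
    noise magnitude in [0, pos_scale]) and one heavily perturbed copy
    (negative, noise magnitude in [pos_scale, neg_scale]).
    Returns a list of (group_id, gt_index, is_positive, noisy_box) tuples.
    """
    rng = random.Random(seed)
    queries = []
    for group in range(num_groups):
        for gt_index, box in enumerate(gt_boxes):
            for is_positive in (True, False):
                low, high = (0.0, pos_scale) if is_positive else (pos_scale, neg_scale)
                noisy = [v + rng.choice((-1.0, 1.0)) * rng.uniform(low, high) for v in box]
                queries.append((group, gt_index, is_positive, noisy))
    return queries

gt_boxes = [[10.0, 2.0, 0.5], [-4.0, 7.5, 0.3]]  # toy (x, y, z) centers
queries = make_noisy_anchors(gt_boxes, num_groups=3, pos_scale=0.5, neg_scale=2.0)
num_positive = sum(1 for q in queries if q[2])
print(len(queries), num_positive)  # 12 queries, 6 of them positive
```

Two ground truths yield six positives here whose targets are known in advance, versus at most two positives from one-to-one matching alone.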

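Furthermore, we introduce a quality estimation task as auxiliary supervision. This makes the output confidence scores more reasonable and refines the ranking of detection results, which raises the evaluation metrics.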
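As far as I recall, the part of the paper not translated in this post (Section 3.2) defines two quality targets, a centerness based on the distance between predicted and ground-truth centers and a yawness based on the [sin, cos] encoding of the yaw angle. Treat the exact formulas below as my assumption rather than a quote from the paper.

```python
import math

def centerness(pred_xyz, gt_xyz):
    """exp(-L2 distance): 1.0 for a perfect center, decaying toward 0."""
    return math.exp(-math.dist(pred_xyz, gt_xyz))

def yawness(pred_yaw, gt_yaw):
    """Dot product of [sin, cos] vectors, which equals cos(pred_yaw - gt_yaw)."""
    return (math.sin(pred_yaw) * math.sin(gt_yaw)
            + math.cos(pred_yaw) * math.cos(gt_yaw))

# A prediction 1 m off in x with a 90-degree heading error.
c = centerness([11.0, 2.0, 0.5], [10.0, 2.0, 0.5])
y = yawness(math.pi / 2, 0.0)
print(round(c, 4), round(y, 4))
```

Both values shrink as the box gets worse, so a network trained to predict them has a signal for ranking detections beyond the classification score.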

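We also improve the structure of the instance self-attention and temporal cross-attention modules in Sparse4D by introducing a decoupled attention mechanism, designed to reduce interference between features when attention weights are computed. As Figure 3 shows, when the anchor embedding and the instance feature are added together as the input to the attention calculation, the resulting attention weights sometimes contain **outliers**. These weights fail to reflect the correlation among target features accurately, so the correct features cannot be aggregated.

Replacing addition with concatenation greatly reduces this erroneous behavior. The improvement resembles Conditional DETR [33], but the key difference is that we focus on attention among queries, whereas Conditional DETR concentrates on cross-attention between queries and image features. Our method also uses a different encoding scheme.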
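Why concatenation removes interference can be seen from the dot product alone. The sketch below ignores the learned projections and scaling of a real attention layer, and the vectors are made-up toy values. It only shows that adding feature and embedding introduces feature-embedding cross terms into the attention logit, while concatenating does not.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add_logit(feat_q, emb_q, feat_k, emb_k):
    """Attention logit when feature and anchor embedding are summed."""
    q = [f + e for f, e in zip(feat_q, emb_q)]
    k = [f + e for f, e in zip(feat_k, emb_k)]
    return dot(q, k)

def concat_logit(feat_q, emb_q, feat_k, emb_k):
    """Attention logit when feature and anchor embedding are concatenated."""
    return dot(feat_q + emb_q, feat_k + emb_k)  # '+' is list concatenation here

# Toy vectors: features are unrelated, anchor embeddings are unrelated.
feat_q, emb_q = [1.0, 0.0], [0.0, 2.0]
feat_k, emb_k = [0.0, 1.0], [3.0, 0.0]

added = add_logit(feat_q, emb_q, feat_k, emb_k)
concatenated = concat_logit(feat_q, emb_q, feat_k, emb_k)
cross_terms = dot(feat_q, emb_k) + dot(emb_q, feat_k)
print(added, concatenated, cross_terms)  # 5.0 0.0 5.0
```

In this toy case the whole logit under addition comes from cross terms between a feature and an unrelated anchor embedding, which is one way a spurious correlation like the one in Figure 3 could arise.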

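Finally, to advance the end-to-end capabilities of the perception system, we explore integrating the 3D multi-object tracking task into the Sparse4D framework so that object motion trajectories are output directly. Unlike tracking-by-detection methods, we eliminate the need for data association and filtering and integrate all tracking functionality into the detector. And unlike existing joint detection and tracking methods, our tracker requires no changes to the training process or loss functions; it does not need ground-truth IDs, yet achieves predefined instance-to-tracking regression.

Our tracking implementation integrates the detector and tracker as fully as possible, requiring no modification to the detector's training process and no additional fine-tuning.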
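The abstract only says that instance IDs are assigned during inference, so the following is a rough sketch of what such a scheme could look like, not the paper's implementation. The `Instance` class, the `assign_ids` function, and the 0.5 threshold are all hypothetical, and how instances are propagated to the next frame is left out.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

@dataclass
class Instance:
    score: float
    track_id: Optional[int] = None  # already set if carried over from an earlier frame

def assign_ids(instances, id_source, threshold):
    """Give a fresh ID to confident instances that do not have one yet."""
    tracked = []
    for inst in instances:
        if inst.score < threshold:
            continue  # not confident enough to be reported as a track
        if inst.track_id is None:
            inst.track_id = next(id_source)
        tracked.append(inst)
    return tracked

ids = count()
frame1 = [Instance(0.9), Instance(0.2), Instance(0.7)]
out1 = assign_ids(frame1, ids, threshold=0.5)

# Frame 2: the first instance is carried over with its ID; one new object appears.
frame2 = [Instance(0.8, track_id=out1[0].track_id), Instance(0.6)]
out2 = assign_ids(frame2, ids, threshold=0.5)
print([i.track_id for i in out1], [i.track_id for i in out2])  # [0, 1] [0, 2]
```

No association step appears anywhere: an ID simply travels with the instance, which is why nothing about the detector's training has to change.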

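Our contributions can be summarized as follows:

(1) We propose Sparse4D-v3, a strong 3D perception framework with three effective strategies: temporal instance denoising, quality estimation, and decoupled attention.

(2) We extend Sparse4D into an end-to-end tracking model.

(3) We demonstrate the effectiveness of our improvements on nuScenes, achieving state-of-the-art performance in both detection and tracking.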

1 Introduction

In the field of temporal multi-view perception research, sparse-based algorithms have seen significant advancements [41, 6, 5, 43, 26, 27], reaching perception performance comparable to dense-BEV-based algorithms [21, 13, 11, 19, 18, 35, 44, 8] while offering several advantages: 1) Free view transform. These sparse methods eliminate the need for converting image space to 3D vector space. 2) Constant computational load in detection head, which is irrelevant to perception distance and image resolution. 3) Easier implementation of integrating downstream tasks by end-to-end manner. In this study, we select the sparse-based algorithm Sparse4Dv2 [26, 27] as our baseline for implementing improvements. The overall structure of the algorithm is illustrated in Figure 1. The image encoder transforms multi-view images into multi-scale feature maps, while the decoder blocks leverage these image features to refine instances and generate perception outcomes.

To begin with, we observe that sparse-based algorithms encounter greater challenges in convergence compared to dense-based counterparts, ultimately impacting their final performance. This issue has been thoroughly investigated in the realm of 2D detection [17, 48, 53], and is primarily attributed to the use of a one-to-one positive sample matching. This matching approach is unstable during the initial stages of training, and also results in a limited number of positive samples compared to one-to-many matching, thus reducing the efficiency of decoder training. Moreover, Sparse4D utilizes sparse feature sampling instead of global cross-attention, which further hampers encoder convergence due to the scarce positive samples. In Sparse4Dv2 [27], dense depth supervision has been introduced to partially mitigate these convergence issues faced by the image encoder. This paper primarily aims at enhancing model performance by focusing on the stability of decoder training. We incorporate the denoising task as auxiliary supervision and extend denoising techniques from 2D single-frame detection to 3D temporal detection. It not only ensures stable positive sample matching but also significantly increases the quantity of positive samples. Moreover, we introduce the task of quality estimation as auxiliary supervision. This renders the output confidence scores more reasonable, refining the accuracy of detection result ranking and, resulting in higher evaluation metrics. Furthermore, we enhance the structure of the instance self-attention and temporal cross-attention modules in Sparse4D, introducing a decoupled attention mechanism designed to reduce feature interference during the calculation of attention weights. As depicted in Figure 3, when the anchor embedding and instance feature are added as the input for attention calculation, there are instances of outlier values in the resulting attention weights. 
This fails to accurately reflect the inter-correlation among target features, leading to an inability to aggregate the correct features. By replacing addition with concatenation, we significantly mitigate the occurrence of this incorrect phenomenon. This enhancement shares similarities with Conditional DETR [33]. However, the crucial difference lies in our emphasis on attention among queries, as opposed to Conditional DETR, which concentrates on cross-attention between queries and image features. Additionally, our approach involves a distinct encoding methodology.

Finally, to advance the end-to-end capabilities of the perception system, we explore the integration of 3D multi-object tracking task into the Sparse4D framework, enabling the direct output of object motion trajectories. Unlike tracking-by-detection methods, we eliminate the need for data association and filtering, integrating all tracking functionalities into the detector. Moreover, distinct from existing joint detection and tracking methods, our tracker requires no modification to the training process or loss functions. It does not necessitate providing ground truth IDs, yet achieves predefined instance-to-tracking regression. Our tracking implementation maximally integrates the detector and tracker, requiring no modifications to the training process of the detector and no additional fine-tuning. Our contributions can be summarized as follows: (1) We propose Sparse4D-v3, a potent 3D perception framework with three effective strategies: temporal instance denoising, quality estimation and decoupled attention. (2) We extend Sparse4D into an end-to-end tracking model. (3) We demonstrate the effectiveness of our improvements on nuScenes, achieving state-of-the-art performance in both detection and tracking tasks.

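2. Related Works
2.1 Improvements for End-to-End Detection
DETR [3] leverages the Transformer architecture [38] together with a one-to-one matching training approach to eliminate the need for **NMS (Non-Maximum Suppression)** and achieve end-to-end detection. DETR has led to a series of subsequent improvements.

Deformable DETR [51] changes global attention into local attention based on reference points, which greatly narrows the model's training search space and speeds up convergence. It also reduces the computational complexity of attention, making high-resolution inputs and multi-scale features usable within the DETR framework.

Conditional-DETR [33] introduces conditional cross-attention, separating content and spatial information in the query and computing attention weights independently through dot products, which accelerates model convergence. Building on Conditional-DETR, Anchor-DETR [42] explicitly initializes reference points to serve as anchors. DAB-DETR [28] additionally includes bounding box dimensions in the anchor initialization and the encoding of spatial queries.

Moreover, many methods aim to improve the convergence stability and detection performance of DETR from the perspective of training matching. DN-DETR [17] encodes ground truth with added noise as query input to the decoder and uses a denoising task for auxiliary supervision. Building on DN-DETR, DINO [48] introduces noisy negative samples and proposes Mixed Query Selection for query initialization, further improving the performance of the DETR framework.

Group-DETR [4] replicates queries into multiple groups during training to provide more training samples. Co-DETR [53] incorporates dense heads during training, which serves two purposes: it trains the backbone more comprehensively, and it improves decoder training by using the dense head output as queries.

DETR3D [41] applies deformable attention to multi-view 3D detection, achieving end-to-end 3D detection with spatial feature fusion. The PETR series [29, 30, 39] introduces 3D position encoding, uses global attention for direct multi-view feature fusion, and performs temporal optimization. The Sparse4D series [26, 27] improves on DETR3D in instance feature decoupling, multi-point feature sampling, and temporal fusion, resulting in better perception performance.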

2 Related Works

2.1 Improvements for End-to-End Detection

DETR [3] leverages the Transformer architecture [38], along with a one-to-one matching training approach, to eliminate the need for NMS and achieve end-to-end detection. DETR has led to a series of subsequent improvements. Deformable DETR [51] changes global attention into local attention based on reference points, significantly narrowing the model’s training search space and enhancing convergence speed. It also reduces the computational complexity of attention, facilitating the use of high-resolution inputs and multi-scale features within DETR’s framework. Conditional-DETR [33] introduces conditional cross-attention, separating content and spatial information in the query and independently calculating attention weights through dot products, thereby accelerating model convergence. Building upon Conditional-DETR, Anchor-DETR [42] explicitly initializes reference points, serving as anchors. DAB-DETR [28] further includes bounding box dimensions into the initialization of anchors and the encoding of spatial queries. Moreover, many methods aim to enhance the convergence stability and detection performance of DETR from the perspective of training matching. DN-DETR [17] encodes ground truth with added noise as query input to the decoder, employing a denoising task for auxiliary supervision. Building upon DN-DETR, DINO [48] introduces noisy negative samples and proposes the use of Mixed Query Selection for query initialization, further improving the performance of the DETR framework. Group-DETR [4] replicates queries into multiple groups during training, providing more training samples. Co-DETR [53] incorporates dense heads during training, serving a dual purpose. It enables more comprehensive training of the backbone and enhances the training of the decoder by using the dense head output as a query.

DETR3D [41] applies deformable attention to multi-view 3D detection, achieving end-to-end 3D detection with spatial feature fusion. The PETR series [29, 30, 39] introduce 3D position encoding, leveraging global attention for direct multi-view feature fusion and conducting temporal optimization. The Sparse4D series [26, 27] enhance DETR3D in aspects like instance feature decoupling, multipoint feature sampling, temporal fusion, resulting in enhanced perceptual performance.

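2.2 Multi-Object Tracking
Most multi-object tracking (MOT) methods use the tracking-by-detection framework. They rely on detector outputs to perform post-processing such as data association and trajectory filtering, which leads to a complex pipeline with many hyperparameters to tune. These approaches do not fully exploit the capabilities of neural networks. To integrate tracking directly into the detector, GCNet [25], TransTrack [37], and TrackFormer [32] use the DETR framework. They implement inter-frame transfer of detected targets based on track queries, greatly reducing reliance on post-processing.

MOTR [47] advances tracking to a fully end-to-end process. MOTRv3 [46] addresses the limitations of detection query training in MOTR, substantially improving tracking performance. MUTR3D [49] applies this query-based tracking framework to 3D multi-object tracking. These end-to-end tracking methods share some common characteristics:

(1) During training, they constrain matching based on tracking objectives, ensuring consistent ID matching for tracking queries and matching only new targets to detection queries.

(2) They use a high threshold to transmit temporal features, passing only high-confidence queries to the next frame.

Our approach differs from existing methods. We do not need to modify detector training or inference strategies, and we do not require ground truth for tracking IDs.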

2.2 Multi-Object Tracking

Most multi-object tracking (MOT) methods use the tracking-by-detection framework.

They rely on detector outputs to perform post-processing tasks like data association and trajectory filtering, leading to a complex pipeline with numerous hyperparameters that need tuning. These approaches do not fully leverage the capabilities of neural networks. To integrate the tracking functionality directly into the detector, GCNet [25], TransTrack [37] and TrackFormer [32] utilize the DETR framework. They implement inter-frame transfer of detected targets based on track queries, significantly reducing post-processing reliance. MOTR [47] advances tracking to a fully end-to-end process. MOTRv3 [46] addresses the limitations in detection query training of MOTR, resulting in a substantial improvement in tracking performance. MUTR3D [49] applies this query-based tracking framework to the field of 3D multi-object tracking.

These end-to-end tracking methods share some common characteristics:

(1) During training, they constrain matching based on tracking objectives, ensuring consistent ID matching for tracking queries and matching only new targets for detection queries.

(2) They use a high threshold to transmit temporal features, passing only high-confidence queries to the next frame. Our approach diverges from existing methods. We don’t need to modify detector training or inference strategies, nor do we require ground truth for tracking IDs.

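3. Methodology
The network structure and inference pipeline are shown in Figure 1 and are similar to those of Sparse4Dv2 [27]. In this section, we first introduce two auxiliary tasks: Temporal Instance Denoising (Section 3.1) and Quality Estimation (Section 3.2). We then present a simple improvement to the attention modules, decoupled attention (Section 3.3). Finally, we describe how to use Sparse4D to achieve 3D MOT (Section 3.4).

Figure 4: Illustration of Temporal Instance Denoising.
(a) During training, instances consist of two components: learnable instances and noisy instances. Noisy instances include both temporal and non-temporal elements. For noisy instances, a pre-matching method assigns positive and negative samples by matching anchors with ground truth, whereas for learnable instances, predictions are matched with ground truth. During testing, only the green blocks in the diagram are kept. (b) An attention mask prevents feature propagation between groups; gray indicates no attention between a query and a key, and green indicates the opposite.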
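The mask in panel (b) can be sketched as a block-diagonal boolean matrix. I am assuming here that the learnable instances and each group of noisy instances are fully isolated from one another, which is how I read the caption; the function name and the group sizes are hypothetical.

```python
def build_group_mask(num_learnable, group_sizes):
    """Return mask[q][k] = True where query q may attend to key k.

    Index 0 .. num_learnable-1 are the learnable instances (group 0);
    the noisy instances follow, one contiguous block per group.
    Attention is allowed only inside a block.
    """
    group_of = [0] * num_learnable
    for group_id, size in enumerate(group_sizes, start=1):
        group_of += [group_id] * size
    n = len(group_of)
    return [[group_of[q] == group_of[k] for k in range(n)] for q in range(n)]

# 2 learnable instances, then noisy groups of size 2 and 1.
mask = build_group_mask(2, [2, 1])
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

At test time only the learnable block would remain, matching the caption's note that only the green blocks are kept.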

From the paper, it looks like attention is applied based on the GT and the model is trained end to end, but..........

This is hard.......

More next time........