CVPR23 - Tesla :: Connecting the dots

Autonomous Driving 2023. 6. 29. 01:31

CVPR23 E2EAD | Phil Duan, Invited Talk - YouTube

https://youtu.be/OKDRsVXv49A?list=PL3N9otbGBVLc3jdm6yrPtCWdE7C8AsNAy

Transcript:
(00:00) Okay, so our next keynote speaker is Phil Duan from Tesla. Phil is a senior staff software engineer at Tesla, where he leads the Autopilot team for occupancy network, object detection, behavior prediction, auto labeling, and online perception. The title of his talk today is "Building Vision Foundation Models for Autonomous Driving". Let's welcome Phil. Let me figure out the hardest part of the presentation. Okay, all right.

(01:09) Thanks for the introduction. I'm very excited to be here with you guys to share some of the progress we have made on Tesla Autopilot over the past few months. For those of you who are... okay, let me actually hide this. Anyway, for those of you who are not familiar with Autopilot, this is a self-driving feature we are building for all the Teslas made in the world since about 2016. The features range from basic Autopilot, which has lane keeping and vehicle following, to what we call Navigate on Autopilot, which works on

(01:51) highways for lane changes, following the route, and taking forks, to our most advanced software package, called Full Self-Driving. This basically takes you from point A to point B with no human intervention at all. This is a video from one of our customers driving in San Francisco. As you can see, it is capable of navigating through a complicated city, taking turns, and seeing pedestrians and traffic lights. The uniqueness of our approach is that we are very famous for doing this with vision only, so all

(02:26) of this is done with just eight cameras around the car, which provide a 360-degree field of view. These cameras run at 36 frames per second, and their output is processed by 144 TOPS (tera operations per second) of compute in the car. We also do not use any additional sensors or HD maps whatsoever. This works anywhere in the world for basic Autopilot, and it works for Full Self-Driving across North America, including Vancouver. So if you go to a Tesla store today and buy a car, you can do this yourself.
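As a rough back-of-the-envelope check on those figures, the sketch below works out how many camera frames arrive per second and how much compute that leaves per frame. It assumes, purely for illustration, that the full 144 TOPS were available for processing camera frames, which the talk does not claim. The variable names are illustrative.

```python
# Figures quoted in the talk.
NUM_CAMERAS = 8
FRAMES_PER_SECOND = 36      # per camera
COMPUTE_TOPS = 144          # tera operations per second

# Total camera frames arriving every second across the whole car.
total_frames_per_second = NUM_CAMERAS * FRAMES_PER_SECOND

# Upper bound on compute per frame, ASSUMING all 144 TOPS went to the cameras.
tera_ops_per_frame = COMPUTE_TOPS / total_frames_per_second

# Time between consecutive frames from a single camera, in milliseconds.
frame_interval_ms = 1000 / FRAMES_PER_SECOND

print(total_frames_per_second, tera_ops_per_frame, round(frame_interval_ms, 1))
```

Under that assumption the numbers come to 288 frames per second in total, at most 0.5 tera operations per frame, and about 27.8 ms between frames from one camera.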

(03:04) Currently we have about 400,000 cars driving with the Full Self-Driving software, and that community has driven more than 250 million miles, so you can see there is a lot of data coming from the fleet. For all the other cars, Autopilot is always running in the background, even when a human is driving. We provide active safety features such as automatic emergency braking, so when the human makes a mistake, Autopilot will try to take over and protect their safety.

(03:39) Besides that, Autopilot also runs in the background to collect more data to help us build our neural networks. There are roughly three to four million cars running around today doing this data collection, and you can think of them as robots on wheels. Besides Autopilot, starting about two years ago we also began to make a humanoid Tesla Bot called Optimus. It has a humanoid form factor and a very impressive hardware platform, with 200-plus

(04:19) degrees of freedom of body control. It is built for general-purpose robotics applications. As you can see, the robot obviously has a completely different form factor from the cars, but we were able to leverage a lot of the lessons we learned building the cars, building the compute, and building the software to bootstrap this project. The bots are basically robots on legs, and as I said, the bot shares a lot with the car. One example is that it shares the same compute platform. So this is the

(04:52) FSD computer chip that we build in house, with 144 tera operations per second of compute. It runs in all the Teslas in the world and is also used to power the robots. Because we have the same compute platform, it is also fairly straightforward for us to build the same AI software stack. The Optimus and Autopilot software teams work very closely together: we share the same repo, we build the computer vision models together, and we try to leverage things so that we don't have to build two

(05:28) separate applications. This also forces us, whenever we build new software, to think about how to build something foundational that can be shared between the two applications. One such network I can mention is the occupancy network, which I think ties in pretty well with the challenge you have today in the workshop. We effectively have the same network running on both the bot and the car, with just a little bit of tweaking, where certain things are more relevant to cars and certain things are more

(05:58) relevant to the bots. Even though it is a completely different environment, outdoor versus indoor, we are able to build the network from exactly the same model with just a little bit of difference in the data. I think we showed this occupancy network for the first time at CVPR last year, and since then we have been very excited to see how the research community has reacted to it. We have seen all kinds of variations of it coming out, and there are also challenges with

(06:30) hundreds of people participating. I am definitely very excited to see where this goes next. For those of you who are not familiar with it, I'll do a quick recap. What the occupancy network does for us is take the video streams from all eight cameras and directly produce a single unified volumetric occupancy in vector space. It has video context, and for each point in 3D space we predict whether it is occupied or not. Because of the video context,

(07:05) we are able to reason about occlusions and predict things that are behind the visible region. The blue contour shown briefly earlier is the way we model volumetric visibility. The output also has a dynamic resolution: as you can see from the video, certain voxels are bigger than others. I will go into more detail later. For each of the voxels we also predict semantics, color-coded here, such as curbs, cars, and static road debris. Those semantics are usually defined

(07:42) for the classes most relevant to driving, and as you can see, for the Optimus robot we use a different set of semantics. The network also produces occupancy flow for motion. Since this network is very general, it does not make any assumptions about object shape or motion, so it is able to capture arbitrary motion, such as the swerving trailer here. These networks currently run in all Teslas in the world with the FSD computer, and they are incredibly efficient in memory and compute, running about every 10 milliseconds with our

(08:42) It also helps with latency, since we are able to remove the ISP from the loop. We don't do any tone mapping at all; all the images shown here are actually post-processed just for human eyes. We take these images and rectify them with the camera calibration, then feed the rectified images into a set of RegNets and BiFPNs to extract multi-level image features. Then we construct a set of 3D position queries, along with the image-space features as keys and values, and we feed

(09:21) them into an attention module. The output of the attention module is a set of high-dimensional spatial features. We align these spatial features with vehicle odometry to capture motion. This is simplified here, but there is a lot of work happening to make sure the vehicle odometry is done correctly, to recreate what happened in the car and align the features precisely. After we do this, we are able to create very rich spatial and temporal features, which then pass through a set of deconvolutions.
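The query/key/value step described here can be illustrated with a minimal sketch of single-head scaled dot-product attention. This is the generic textbook operation, not Tesla's implementation: how the 3D positions are embedded into query vectors, the feature sizes, and the toy numbers below are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: embedded 3D positions, one vector per position (hypothetical embedding).
    keys, values: image-space features, one vector per image location.
    Returns one output feature vector per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

# Toy data: two image features and two 3D position queries (all hypothetical).
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
queries = [[10.0, 0.0], [5.0, 5.0]]
spatial_features = cross_attention(queries, keys, values)
```

The first query matches the first key strongly, so its output is essentially the first value vector; the second query matches both keys equally, so its output is the average of the two values. Each 3D position thus gathers a weighted mix of image features.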

(09:58) The deconvolutions provide the final output volume, which includes occupancy, occupancy flow, semantics, and so forth. But this output is a fixed-size voxel grid, which is not precise enough for control. As you know, when things are close by you have to know their dimensions very precisely, and when they are farther away it probably does not matter that much. The way we solve this problem is that we also predict feature maps on a per-voxel basis, and then

(10:33) we feed those per-voxel feature maps into a set of MLPs, together with a set of 3D positional queries, so that at the output of the network we can query any 3D position around the car. Now that we know the model architecture, let's look at one more video. Here we have an articulated bus parked on the side of the street, illustrated by the red L here. As we get closer, the bus starts to move. As you can see, the front of the bus turns blue first, indicating that the model is predicting non-zero velocity, and as the bus keeps moving you can see

(11:13) the whole thing turn blue, and you can see a very nice L-shape capturing the motion and the position of the bus. This is a rather awkward test for traditional object detection, because you have to figure out the formulation: do I use one or two cuboids to represent this bus? For the occupancy network it does not matter. All we care about is what is visible and what its motion is, so our car is able to handle this bus and go around it correctly.
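The bus example can be sketched with a toy occupancy-and-flow field that is queried at arbitrary positions. This is an illustration only: a plain nearest-voxel lookup stands in for the learned MLP decoder described in the talk, the scene is reduced to 2D, and the voxel size, bus dimensions, and velocities are hypothetical.

```python
import math
from dataclasses import dataclass

VOXEL_SIZE = 1.0  # metres per voxel; hypothetical

@dataclass
class Voxel:
    occupied: bool
    flow: tuple  # (vx, vy) velocity in m/s

FREE = Voxel(occupied=False, flow=(0.0, 0.0))

# Toy scene: an 18 m articulated bus along the x axis, 2 m wide.
# The rear section (x < 9) is still parked; the front section has started to move.
grid = {}
for ix in range(18):
    for iy in range(2):
        flow = (0.0, 0.0) if ix < 9 else (1.5, 0.5)
        grid[(ix, iy)] = Voxel(occupied=True, flow=flow)

def query(x, y):
    """Look up the voxel containing the continuous position (x, y)."""
    key = (math.floor(x / VOXEL_SIZE), math.floor(y / VOXEL_SIZE))
    return grid.get(key, FREE)

def is_moving(voxel, threshold=0.1):
    """True if the voxel is occupied and its flow speed exceeds the threshold."""
    return voxel.occupied and math.hypot(*voxel.flow) > threshold

front = query(15.2, 0.5)   # a point on the front section
rear = query(3.7, 1.2)     # a point on the rear section
empty = query(30.0, 0.5)   # a point well past the bus
```

Nothing in this representation asks whether the bus is one box or two: the front voxels report motion, the rear voxels report none, and the free space around the bus is simply unoccupied.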

(11:49) Besides the voxel grid, we also predict a surface output. The surface has both 3D geometry and semantics. By semantics I mean we also capture the corner of the surface and the road markings on it, whether that is a stop line, lane lines, or any yield or stop markings painted on the ground. We found this to be incredibly helpful for driving in unusual environments such as San Francisco, which has a lot of hilly and curvy roads, where knowing what the surface looks like in 3D space removes the flat-world assumption, and we

(12:27) found that it incredibly improves our driving smoothness. The surface and the volume are actually not output separately. In fact, the volume output is aligned with the surface implicitly, so that when you go through a different surface area the voxels do not move up and down; they always align with the drivable surface. As we can see from this video, the car is going through a hill-crest area. You can see the 3D shape of the surface very nicely, and it aligns with the voxels as well. So we can use

(13:06) this for a lot of different things. For example, the planner can use this information to decide how fast we should be driving here. It also helps us reason about occlusions, because sometimes the occlusion is not really due to any obstacle; it is just because of the road surface, as the surface is going down. So we built this network, it has been running in the car for almost two years now, and it works very well. So we started asking what else we can build on top of

(13:38) this. Early this year we had this very fun project where we started to predict a distance field around the car on top of the occupancy network. We literally just put a small head on top of the network and do a regression task to get a distance field. We then show this distance field on the UI for customers as a manual-driving feature, for parking or maneuvering through very narrow areas. This is a video of one of our customers doing a parallel park, and you can see from the right side that

(14:14) this contour is generated by the distance field. Traditionally this is usually done with ultrasonics: most cars have ultrasonic sensors on the front and on the back and just try to measure distance directly. The first advantage here is that we have 360-degree coverage. Let me just play this video one more time. Second, it is able to capture obstacles at different heights, because we don't have the sensor limitation. For example, the curb, and

(14:46) ultrasonics are usually pretty bad at curb detection. Third, it covers an arbitrary range, so we no longer have a range limit from the ultrasonic sensor. It is able to capture a range as large as what we get from the occupancy network, and we just show a range that is comfortable for the driver to use. This vision-based park assist is just a small application, but while working on it we realized that this model actually has really rich features.
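To make the idea of a distance field concrete, here is a minimal sketch that computes one geometrically from a known 2D occupancy grid. In the talk the field is regressed directly by a small head on the network; computing it by brute force as below is an assumption made for illustration, and the function name, grid layout, and cell size are hypothetical.

```python
import math

def distance_field(occupied, width, height, cell_size=0.1):
    """Distance in metres from each cell centre to the nearest occupied cell.

    occupied: set of (x, y) integer cell indices that contain an obstacle.
    Returns a height x width list of lists; inf everywhere if nothing is occupied.
    """
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            nearest = min(
                (math.hypot(x - ox, y - oy) for ox, oy in occupied),
                default=math.inf,
            )
            row.append(nearest * cell_size)
        field.append(row)
    return field

# Toy bird's-eye grid: two obstacles (for example, curb corners) in the top row.
occupied = {(0, 0), (4, 0)}
field = distance_field(occupied, width=5, height=3, cell_size=0.5)
```

A contour like the one drawn in the parking UI would then be the set of cells whose distance falls below some chosen threshold. Unlike a handful of ultrasonic readings, the field is defined for every cell around the car.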

(15:19) The features are able to capture a lot of the things going on around us in the world, so we started to think: how can we leverage this and build a really strong world model? If you zoom out a little and look at this network, all the occupancy network really does is create this world model in the vector space of spatial features. This ties in with the previous talk: we really want to build a really nice 4D world model to understand the world, and once we do that, all the

(15:49) downstream tasks are actually just tiny heads, which is very easy: we can just plug one in, do some fine-tuning on it, and it usually works very well. I couldn't find a very nice picture to capture this, so I just put a Tesla logo there to represent the vector-space features. What we are really looking for in this direction is how to build this 4D-space world-model feature, and then we can use it to capture everything we want for driving, for humanoid robots, or for anything

(16:26) related to embodied AI. If we do this correctly, we will be able to use this feature to capture everything that is relevant for driving, whether it is the aforementioned volume or surface, which has all the information we need for understanding where to go, or objects, lanes, and traffic controls. Instead of trying to build different models to tackle each one separately, we really want to build one model that does everything together. Traditionally, we

(16:58) shared a lot of features in image space, which just extracts 2D information. That information is actually not quite enough for this kind of foundation model. Since then we have changed our approach to combine everything in vector space, so that we just need to build one model that does everything very well and then connect it to the different tasks for downstream applications. To train such a big model you obviously need tons of data, and we are very proud that we have built this very

(17:30) mature data engine over the past few years. In my opinion this is one of the biggest advantages we have. What we do here, if you start from the bottom, is that whenever we have a model, we deploy it to the car, and then we mine with the model across millions of Teslas on the road. We mine mostly for two things. First, we use heuristics to detect things that happen very rarely, so we can get that data back and feed it into our network. Second, we also run the model and compare it with what the human does, to understand

(18:01) what the inconsistency is: what did our model do wrong such that the human made a different decision? Once we have this event trigger, we pull the data back and automatically label it to generate this beautiful ground truth in vector space, and then we use it to train the model. We call this approach fleet learning: we learn everything from the fleet. Most of the work happening at Tesla Autopilot is actually just cranking this data engine and getting the fleet learning going.
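The two kinds of mining described above, rare-event heuristics and human-versus-model disagreement, can be sketched as a simple trigger function. Everything here is hypothetical: the talk does not say which signals, field names, or thresholds are used, so steering angle and a generic "rare object score" are stand-ins chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    timestamp: float
    model_steering: float     # what the model would have done (hypothetical signal)
    human_steering: float     # what the driver actually did
    rare_object_score: float  # heuristic score in [0, 1] (hypothetical signal)

def find_triggers(frames, disagreement=0.2, rare_threshold=0.9):
    """Return (timestamp, reason) pairs for frames worth sending back."""
    triggers = []
    for f in frames:
        if f.rare_object_score >= rare_threshold:
            # Heuristic says something unusual is in view.
            triggers.append((f.timestamp, "rare_event"))
        elif abs(f.model_steering - f.human_steering) > disagreement:
            # The model and the human chose noticeably different actions.
            triggers.append((f.timestamp, "human_model_disagreement"))
    return triggers

frames = [
    Frame(0.0, model_steering=0.00, human_steering=0.02, rare_object_score=0.10),
    Frame(0.5, model_steering=0.05, human_steering=0.40, rare_object_score=0.20),
    Frame(1.0, model_steering=0.10, human_steering=0.10, rare_object_score=0.95),
]
triggers = find_triggers(frames)
```

In this toy run the first frame is unremarkable, the second is flagged because the human steered very differently from the model, and the third is flagged by the rare-event heuristic. Flagged clips would then go on to the auto-labeling step the talk describes.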

(18:33) We have engineers looking at data all the time, and people need to spend more time on the data than on the model. I can give some examples of the data. We showed this video at AI Day last year. It shows how we leverage the whole fleet: we take all the vehicles that went through the same intersection and put them together to reconstruct the 3D surface and the lanes jointly, which lets us create ground-truth labels that are much better than if we were to do it with a single trip. You can see all the

(19:12) lanes are very consistent across different videos and different trips. Then we feed this data into our network to train it. The good thing about this is that our customers drive their Teslas in all kinds of different locations, and we get the data mostly from the places with a lot of Teslas. Guess what: that is also the area we care most about. So we can immediately feed this data back into training, train a network, and deploy it to the car, and the customer gets a much better model because of their own driving.
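Why several trips give better ground truth than one can be illustrated with a very small sketch. It assumes each trip yields a noisy estimate of the same lane line, already aligned and sampled at the same points along the road, and it fuses them with a per-point median. The real pipeline reconstructs 3D surfaces and lanes jointly; the alignment assumption, the median, and the numbers below are all stand-ins for illustration.

```python
from statistics import median

def fuse_trips(trips):
    """Fuse per-trip lane estimates by taking the median at each sample point.

    trips: list of trips, each a list of lateral offsets in metres,
           sampled at the same stations along the road (assumed aligned).
    """
    return [median(samples) for samples in zip(*trips)]

# Hypothetical true lane position and three noisy single-trip estimates.
truth = [0.00, 0.10, 0.30]
trips = [
    [0.05, 0.12, 0.28],
    [-0.04, 0.09, 0.33],
    [0.01, 0.60, 0.30],   # this trip has one bad sample (e.g. an occlusion)
]
fused = fuse_trips(trips)

def max_error(estimate):
    """Largest absolute deviation from the true lane position."""
    return max(abs(e - t) for e, t in zip(estimate, truth))
```

In this toy case the fused estimate is closer to the truth than any single trip, and the one bad sample in the third trip is rejected outright, which is the kind of consistency across trips the talk points out in the video.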

(19:44) This year we extended this same approach: besides lanes, we started to model everything. This is a 3D reconstruction from multiple trips of Teslas in the same area. All of this is generated from the eight-camera video streams, which we put together across different trips. You can see that it is able to capture all this very high-quality information, and we can use it to train all kinds of things to make sure the world is consistent. All this data is very

(20:20) rich, in the sense that we no longer have a separate dataset just for one branch. We just have a unified dataset that captures everything we need to train such a foundation model. Besides data quantity and quality, the other thing that really matters is data diversity. If you only have boring data, this thing will rarely work. This is also one of our biggest advantages, because we can get all this data back from the fleet. These are just some examples of random Tesla customers driving, so I'll

(20:54) just pause for a second so you can see what kind of crazy scenarios show up here. A lot of this is stuff you will rarely run into in real life, but once you aggregate over four million cars, it actually happens every single day. What we do is pull this data back, put it into the data engine directly, run the auto-labeling pipeline, and feed it into the dataset and the model training. Once you put all this together, you are able to build a very large, highly diverse, high-quality dataset.
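The point that rare events become daily events at fleet scale is simple arithmetic. The sketch below uses the "four million cars" figure from the talk together with a purely hypothetical event rate (one occurrence per car per hundred years of driving) to show the order of magnitude.

```python
FLEET_SIZE = 4_000_000           # "four million cars", as quoted in the talk
EVENTS_PER_CAR_PER_YEAR = 0.01   # HYPOTHETICAL: once per 100 years for a single car
DAYS_PER_YEAR = 365

# Expected number of such events seen across the whole fleet each day.
events_per_day = FLEET_SIZE * EVENTS_PER_CAR_PER_YEAR / DAYS_PER_YEAR

print(round(events_per_day))
```

With that assumed rate, a scenario a single driver might never see in a lifetime still turns up roughly a hundred times a day somewhere in the fleet.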

(21:26) We use this large, highly diversified, high-quality dataset to train the foundation model, and in our opinion this is the path forward to building a foundation model for autonomous driving and, eventually, for the whole of embodied AI. The work presented here was done by tens of very talented engineers at Tesla Autopilot. It is my privilege to present on their behalf, and if you are interested in this, please consider joining us. Thank you. So, any questions from others? Yeah, I know that Elon Musk claimed that

(22:13) we will have end-to-end systems with FSD 12, and I wanted to know how the occupancy network you mentioned can be changed, or even fit into this kind of end-to-end system. Yeah, that's a good question. So basically we are building this network... it's not really an occupancy network anymore, right? We are building this foundation model to capture this world model, and we think that will be essential to use for end-to-end driving. As for the end-to-end driving application

(22:44) itself, we are not ready to show what we have done so far yet, so I guess stay tuned. So it's probably mainly useful for building the properties? Yes, exactly. Hey, thanks for the nice talk. How do you compare attention, self-attention, or cross-attention based occupancy estimation with traditional multi-view stereo? I would say that I'm a big fan of Transformers. We have definitely tried the different approaches, and Transformers just work

(23:42) amazingly well, and they are also very resilient. I'm a big fan of learning-based approaches, where the information is extracted and expressed more implicitly, whereas with different cars, when you do multi-view geometry, it is usually a very steep uphill battle to get every tiny bit of geometry detail correct so that you can build everything correctly. This kind of thing just works much better, in my opinion, with the attention module, which also plays well with our strength of a

(24:19) huge amount of data: we learn from that data. [inaudible] Thank you. Hello, thank you for your wonderful talk. I have two questions. The first is: what do you think is the problem with the occupancy network? I have talked to many researchers, and they think the occupancy network's features are too dense and too heavy, and that every disadvantage of a segmentation model is shared by the occupancy model as well. So what do you think

(25:01) are the problems or the advantages of the occupancy network, [unclear], and what comes next after the occupancy network? [Phil] Yeah, so there are basically quite a lot of different representations for this, right? We use a kind of dense [voxel?] grid, and you can also use other representations, implicit ones, SDFs (signed distance functions), this kind of stuff. As I described [earlier?], we have this dynamic resolution thing, which is a way to help us make

(25:41) the network actually small enough to run in real time. We don't just naively say, okay, we need X-centimeter resolution, so we have to segment the entire world at that resolution, which is obviously not scalable. I would say the dynamic resolution thing we did here is one of the main reasons we're able to achieve high perception performance while running very fast and efficiently. A lot of the advantage we have also comes from our own neural net accelerator, which we build

(26:13) in house, so that we're able to accelerate exactly what we need. I kind of just said that in one sentence, but an incredible amount of work is happening in the background just to make this thing run fast enough in the car. [Audience] Thank you. Is there any idea [for what comes next]? [Phil] Yeah, so the next idea: I would say the occupancy is really just building these features for the [world?] model. Eventually, I don't think we'll explicitly be driving on top

(26:43) of that. These features give us this 4D understanding of the world, and I think a lot of the next steps will really be at the application level: how do we derive different applications, in a very lightweight way, to represent different scenarios? [Audience] Okay, thank you. [Audience] Hi Phil, thanks for the great talk. People are now very confident in foundation models, especially after [ChatGPT?] [unclear], so people are more and more confident in foundation models,

(27:20) especially [unclear]. So I'm curious how you can quantify the data distribution for a foundation model for the driving task. As you mentioned, there are many factors in driving: dynamic objects, static objects, the lane factors, and the different classes. And as you mentioned, when you get more data, you determine whether that data is [rare?] in the current data you are feeding the model. So I'm curious

(27:52) about how you can quantify whether the current data is good, or whether the data is valuable for retraining the model. [Phil] Yeah, that's a good question. I guess the data is basically used for whatever we need, right? I would say there are two sides to this. The first is that we have a very sophisticated [eval?] system to capture what is actually required for driving, because we're not building data for the sake of

(28:26) building it; we're building this dataset to address problems. So we really need to measure our progress across different scenarios, so that we know which one is working well and which one is not. Sorry, I lost my train of thought. On the flip side, [Music] there is this [scaling law?], which has proven to work very well in the language model community, right? So what we also look at is applying scaling laws

(29:02) to video models as well, so to vision models: how do we build a model, at what point is the data big enough, and is the data the [bottleneck?] or the model the [bottleneck?]? Obviously, because this runs on an embedded system in real time, you cannot just naively scale up the model. So usually what we do is figure out the best model size that runs in the car, and then we scale the data according to

(29:33) compute and according to different parameters to figure out the best combination. Doing that usually gives us a very good idea about the quantity and distribution [of data]. [Audience] Thanks for the answer. [Phil] Thank you. [Audience] Hi, thanks for the great talk. I have some questions about deploying transformers on the vehicle. We basically find transformers much more difficult to actually deploy. We're not just talking about academia here; we're actually talking about

(30:09) engineering. To build a product, you actually need to make everything run in real time on the vehicle, and on the hardware side I think transformer operators create a lot of difficulties, as I'm pretty sure you know. So I wonder if you could share some insight. Also, right now, at least for camera-based perception, a lot of the state-of-the-art research is still purely CNN-based. So when you are building the product at Tesla, do you find that there

(30:44) are cases or insights where a CNN simply cannot do as good a job as a transformer, so that you are willing to pay the cost of deploying transformers? Maybe generalization at scale, or maybe something else. That's my question, thank you. [Phil] Yeah, absolutely, deploying transformers is not very straightforward, I would say. This goes back to the previous question: we have an in-house neural accelerator, which gives us a lot of advantage, because when we

(31:18) build this chip, we don't have to worry about somebody else having to use it. We only build it for our own application, and obviously transformers are a big part of that. We have a very good deployment team that works very hard on how to optimize this for deployment. Initially it was quite challenging, but once you get it working, it runs incredibly fast on our computer. We also did a bunch of studies on CNN versus transformer, and for us, we're not, you know, [religious about it?];

(31:50) we don't really worry too much about the philosophical debate of CNN versus transformer. We just use whatever runs fast and also does the job. Transformers obviously scale very well with the number of parameters, but because ours is a real-time system, we're not going that route of just [scaling up]. When the parameter counts are lower, sometimes a certain CNN actually does better than a transformer, [ConvNeXt?] for example, and we definitely

(32:18) try to take advantage of both and put them together to build such a network. [Audience] Thank you. [Audience] Okay, thanks for the talk. I'd like to ask a little about the challenges of generating the occupancy map. I guess you considered things like night vision or low-light vision there, but I'm wondering whether you also considered adverse weather conditions and contamination on the sensor, or something like that, which would make it challenging to generate the [occupancy?]. [Phil] Absolutely,

(32:56) absolutely, yes. Those are definitely the most challenging scenarios for us. Our customers drive everywhere, in a lot of places and in different seasons, with very challenging scenarios. I think I have probably seen more adverse weather events than anybody in the room here, [unclear], and we have to deal with all of that. On one side, the fortunate thing is that we have tons of data from those events, and we're able to target those

(33:28) problems specifically and solve them one by one. On the other side, on the modeling side and the ground-truth generation side, I would say most of the work is actually spent on dealing with such scenarios. The normal sunny-day driving scenario is quite straightforward, and you should not need any engineer in the loop: you get a clip back, you run it, and it just works. Most of the time and effort we're spending is actually on all these [adversarial scenarios?].

(33:54) [Audience] So Tesla is still studying how to tackle, how to deal with, those weather conditions? I was still [wondering?] about that part, as a concern for FSD. [Phil] Yeah. [Audience] So now you are developing that, to handle those kinds of bad, challenging conditions? [Phil] Yeah. I mean, what we do is, for the ones we already handle, for example, you can drive FSD in snow, in [rain?], in [foggy?] days, and [unclear] it works just fine. And we have gradually moved to [unclear] tell

(34:33) you when [unclear]. If you drive [it?], you'll notice that sometimes it will tell you that it is degraded, meaning certain functionality is not working as well as expected. It is also very important to have a good understanding of what you are capable of doing and what you are not capable of doing, to ensure safety. And [in the long run?], obviously, we'll have to solve all these problems. [Audience] Hi. In your demo, it seems that you created the ground truth, and it seems

(35:05) that only the static objects are created. Do you create the ground truth for the moving objects? [Phil] Yes, absolutely. Moving objects are a big part of driving, so we have that too. [Audience] It seems that creating the ground truth for moving objects is difficult. Could you explain how the creation process differs? [Phil] Yeah. There's a bit of history here: when we started, we were only [modeling?] moving objects, because those are more relevant for driving. I think it's a fairly

(35:40) common approach across the community for how to model moving objects. We started by just using a cuboid, fitting every car into a box. But gradually we realized the problems, such as the articulated bus shown here, so we have to actually model the exact shape of the object. A lot of that is done through what we call the offline tracker: we take all the videos back from our cars and then process them offline to figure out what is going on.
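The cuboid limitation mentioned in this answer can be illustrated with a small top-down sketch. This is not Tesla's pipeline. It is a toy example under stated assumptions: the bus dimensions, the grid cell size, and the helper names `in_rect`, `bus_occupancy`, and `cuboid_fill_ratio` are all hypothetical, and the box is simply aligned with the front segment of the bus. It rasterises a two-segment articulated bus onto a grid and measures how much of a single enclosing box is actually occupied.

```python
import math

def in_rect(px, py, cx, cy, length, width, yaw):
    """True if point (px, py) lies inside a rectangle centred at (cx, cy) with heading yaw."""
    dx, dy = px - cx, py - cy
    # Rotate the point into the rectangle's own frame.
    lx = dx * math.cos(yaw) + dy * math.sin(yaw)
    ly = -dx * math.sin(yaw) + dy * math.cos(yaw)
    return abs(lx) <= length / 2 and abs(ly) <= width / 2

def bus_occupancy(bend_deg, cell=0.1):
    """Rasterise a two-segment bus (top-down view) into a set of occupied grid cells."""
    seg_len, width = 9.0, 2.5  # hypothetical dimensions in metres
    bend = math.radians(bend_deg)
    # Front segment extends forward from the hinge, which sits at the origin.
    front = (seg_len / 2, 0.0, seg_len, width, 0.0)
    # Rear segment extends backward from the hinge, rotated by the bend angle.
    rear = (-seg_len / 2 * math.cos(bend), -seg_len / 2 * math.sin(bend),
            seg_len, width, bend)
    occupied = set()
    n = int(12 / cell)  # grid covers roughly [-12 m, 12 m] on both axes
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            x, y = i * cell, j * cell
            if in_rect(x, y, *front) or in_rect(x, y, *rear):
                occupied.add((i, j))
    return occupied

def cuboid_fill_ratio(occupied):
    """Fraction of the enclosing box (aligned with the front segment) that is occupied."""
    xs = [i for i, _ in occupied]
    ys = [j for _, j in occupied]
    box_cells = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return len(occupied) / box_cells

print("straight bus fill ratio:", round(cuboid_fill_ratio(bus_occupancy(0)), 2))
print("bent bus fill ratio:   ", round(cuboid_fill_ratio(bus_occupancy(40)), 2))
```

In this toy setup the straight bus fills its box, while the bent bus leaves a large part of the box empty. A planner that trusted the box would treat that empty space as blocked, which is one way to read the motivation for modeling the exact shape.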

(36:11) The offline processing usually involves three different things. First, we use fairly traditional robotics algorithms to help process temporal information. Second, because future information is available offline, you're able to go forward and backward in time to create a ground truth that is much better than what you see in the car in real time. The third is that we also build large offline models to do this, so the models we use to run the offline ground-truth

(36:42) generation, especially for moving objects, are much bigger than the ones we run in the car. Because it runs offline, we're actually able to do that. So that's how we get the ground truth for moving objects. [Audience] Thank you. [Host] Due to time, maybe let's [stop here]. Thank you again.
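The second ingredient in the last answer, using future frames that are only available offline, can be sketched with a deliberately simple stand-in. The talk does not describe the actual offline tracker, so the code below swaps in a two-pass exponential smoother purely for illustration: the signal, the noise level, and the function names `ema`, `forward_backward`, and `mean_abs_error` are all assumptions. A causal filter, which is all a car can run in real time, lags behind a moving object, while a forward-then-backward pass over the same recorded track largely cancels that lag.

```python
import random

def ema(values, alpha):
    """Causal exponential moving average: each output uses only past and current samples."""
    out, state = [], values[0]
    for v in values:
        state = alpha * v + (1 - alpha) * state
        out.append(state)
    return out

def forward_backward(values, alpha):
    """Offline smoother: filter forward in time, then filter the result backward in time."""
    forward = ema(values, alpha)
    backward = ema(forward[::-1], alpha)
    return backward[::-1]

def mean_abs_error(estimate, truth):
    """Average absolute difference between an estimate and the true track."""
    return sum(abs(e - t) for e, t in zip(estimate, truth)) / len(truth)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
truth = [0.5 * t for t in range(200)]                # object moving at constant speed
measured = [x + rng.gauss(0.0, 1.0) for x in truth]  # noisy per-frame detections

online_err = mean_abs_error(ema(measured, 0.2), truth)
offline_err = mean_abs_error(forward_backward(measured, 0.2), truth)
print(f"online (causal) error:    {online_err:.2f}")
print(f"offline (two-pass) error: {offline_err:.2f}")
```

The offline estimate ends up closer to the true track because the backward pass sees "future" samples. The much larger offline models the speaker mentions are a separate ingredient that this sketch does not attempt to represent.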
