The ROAD Challenge: Event Detection for Situation Awareness in Autonomous Driving
Call for participation
https://sites.google.com/view/roadchallangeiccv2021/challenge
Aim of the Challenge
The accurate detection and anticipation of the actions performed by multiple road agents (pedestrians, vehicles, cyclists and so on) is crucial for enabling autonomous vehicles to make safe, reliable decisions. While the task of teaching an autonomous vehicle to drive can be tackled in a brute-force fashion via direct reinforcement learning, a sensible and attractive alternative is to first provide the vehicle with situation awareness capabilities, and then feed the resulting semantically meaningful representations of road scenarios (in terms of agents, events and scene configuration) to a suitable decision-making strategy. In the longer term, this also has the advantage of allowing the reasoning processes of road agents to be modelled in a theory-of-mind approach, inspired by the behaviour of the human mind in similar contexts.
Accordingly, the goal of this Challenge is to bring to the forefront of autonomous driving research the topic of situation awareness, understood as the ability to create semantically useful representations of dynamic road scenes centred on the notion of a road event.
The ROAD dataset
This concept is at the core of the new ROad event Awareness Dataset (ROAD) for Autonomous Driving:
https://github.com/gurkirt/road-dataset
ROAD is the first benchmark of its kind: a multi-label dataset designed to allow the community to investigate the use of semantically meaningful representations of dynamic road scenes to facilitate situation awareness and decision making. It contains 22 long-duration videos (approximately 8 minutes each) annotated in terms of “road events”, defined as triplets of Agent, Action and Location labels and represented as ‘tubes’, i.e., series of frame-wise bounding box detections.
ROAD is a large, high-quality benchmark comprising 122K labelled video frames and 560K detection bounding boxes associated with 1.7M labels.
The above GitHub repository contains all the necessary instructions to pre-process the 22 ROAD videos, unpack them to the correct directory structure and run the provided baseline model.
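To make this structure concrete, the following is a minimal, purely illustrative Python sketch of how a road-event annotation could be represented. The class names, field names and label strings are our own assumptions and do not reflect the dataset's actual annotation schema, which is documented in the repository above.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BoxAnnotation:
    """A single frame-level bounding box (x1, y1, x2, y2), in pixels."""
    frame_id: int
    box: Tuple[float, float, float, float]

@dataclass
class RoadEventTube:
    """A road event: an (Agent, Action, Location) triplet represented as a
    'tube' of temporally-linked, frame-wise bounding boxes."""
    agent: str      # e.g. "Pedestrian" (label names here are invented)
    action: str     # e.g. "Crossing"
    location: str   # e.g. "In vehicle lane"
    boxes: List[BoxAnnotation] = field(default_factory=list)

# An invented example: a pedestrian crossing the road over 30 frames.
event = RoadEventTube(
    agent="Pedestrian",
    action="Crossing",
    location="In vehicle lane",
    boxes=[BoxAnnotation(frame_id=t, box=(100 + 2 * t, 200, 140 + 2 * t, 300))
           for t in range(30)],
)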
Tasks and Challenges
ROAD allows one to validate detection tasks associated with any meaningful combination of the three base labels. For this Challenge we consider three video-level detection Tasks:
T1. Agent detection, in which the output is in the form of agent tubes collecting the bounding boxes associated with an active road agent in consecutive frames.
T2. Action detection, where the output is in the form of action tubes formed by bounding boxes around an action of interest in each video frame.
T3. Road event detection, where by road event we mean a triplet (Agent, Action, Location) as explained above, once again represented as a tube of frame-level detections.
Each Task thus consists in regressing whole series (‘tubes’) of temporally-linked bounding boxes associated with the relevant instances, together with their class label(s); a sketch of one possible way to link frame-level detections into tubes is given below.
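For intuition, frame-level detections are commonly stitched into tubes by greedy overlap-based linking. The following is a minimal sketch of one such procedure; the 0.3 linking threshold and the greedy strategy are illustrative assumptions, not the baseline's actual algorithm.

def box_iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_detections(per_frame_boxes, iou_thresh=0.3):
    """Greedily link frame-wise detections into tubes.

    per_frame_boxes: a list over frames, each entry a list of boxes.
    Returns a list of tubes, each a list of (frame_index, box) pairs.
    """
    tubes = []
    for t, boxes in enumerate(per_frame_boxes):
        unmatched = list(boxes)
        for tube in tubes:
            last_t, last_box = tube[-1]
            if last_t != t - 1 or not unmatched:
                continue  # tube already terminated, or no boxes left
            best = max(unmatched, key=lambda b: box_iou(last_box, b))
            if box_iou(last_box, best) >= iou_thresh:
                tube.append((t, best))
                unmatched.remove(best)
        tubes.extend([(t, b)] for b in unmatched)  # start new tubes
    return tubes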
Baseline
As a baseline for all three detection tasks we propose a simple yet effective 3D feature pyramid network with focal loss, an architecture we call 3D-RetinaNet:
http://arxiv.org/abs/2102.11585
The code is publicly available on GitHub:
https://github.com/gurkirt/3D-RetinaNet
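For reference, focal loss (Lin et al., 2017) down-weights well-classified examples so that training focuses on hard ones. Below is a minimal PyTorch sketch of the standard binary (sigmoid) variant, using the common defaults alpha = 0.25 and gamma = 2; the baseline's exact loss configuration may differ.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits: raw predictions; targets: 0/1 tensor of the same shape.
    In RetinaNet-style training the sum is typically normalised by the
    number of positive anchors.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()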
Timeframe
Challenge participants have 18 videos at their disposal for training and validation; the remaining 4 videos are reserved for testing the final performance of their models. This split applies to all three Tasks.
The timeframe for the Challenge is as follows:
· Training and validation fold release: April 30 2021
· Test fold release: July 20 2021
· Submission of results: August 10 2021
· Announcement of results: August 12 2021
· Challenge event @ workshop: October 10-17 2021
Evaluation
Performance in each Task is measured by video mean average precision (video-mAP). Given the challenging nature of the data, the Intersection over Union (IoU) detection threshold is set, in turn, to 0.1, 0.2 and 0.5 (signifying a 10%, 20% and 50% overlap between predicted and ground-truth bounding boxes within each tube). The final score for each Task will be the equally-weighted average of the performances at the three thresholds.
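For intuition, one common definition of the spatio-temporal IoU between a predicted tube and a ground-truth tube is the mean per-frame box IoU over the union of their frame spans, with frames covered by only one tube counting as zero overlap. A minimal sketch follows, reusing the box_iou helper from the linking sketch above; video_map_at is a hypothetical evaluation routine, named here only for illustration.

def tube_iou(pred, gt):
    """Spatio-temporal IoU between two tubes.

    pred, gt: dicts mapping frame_index -> (x1, y1, x2, y2).
    Mean per-frame IoU over the union of frame spans; frames covered
    by only one tube contribute zero.
    """
    frames = set(pred) | set(gt)
    if not frames:
        return 0.0
    overlaps = [box_iou(pred[t], gt[t]) if t in pred and t in gt else 0.0
                for t in frames]
    return sum(overlaps) / len(frames)

def final_score(video_map_at):
    """Equally-weighted average of video-mAP over the three thresholds.
    video_map_at is a hypothetical callable: threshold -> video-mAP."""
    return sum(video_map_at(th) for th in (0.1, 0.2, 0.5)) / 3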
In the first stage of the Challenge, participants will submit, for each Task, the predictions generated on the validation fold and receive the corresponding evaluation metric in return, in order to get a feel for how well their method(s) perform. In the second stage, they will submit the predictions generated on the test fold, which will be used for the final ranking.
A separate ranking will be produced for each of the Tasks.
Evaluation will take place on the EvalAI platform:
https://eval.ai/web/challenges/challenge-page/1059
For each Challenge stage and each Task, the number of submissions is capped at 50, with an additional limit of 5 submissions per day.
Detailed instructions about how to download the data and submit your predictions for evaluation at both validation and test time, for all three Tasks, are provided on the Challenge website.