Object Recognition in Video Dataset

Motion-based Segmentation and Recognition Dataset
(this is a draft version of this page)

Please cite:

(1)

Segmentation and Recognition Using Structure from Motion Point Clouds, ECCV 2008 (pdf)
Brostow, Shotton, Fauqueur, Cipolla (bibtex)

(2)

Semantic Object Classes in Video: A High-Definition Ground Truth Database (pdf)
Pattern Recognition Letters (to appear)
Brostow, Fauqueur, Cipolla (bibtex)

Description:

The Cambridge-driving Labeled Video Database (CamVid) is the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes.

The database addresses the need for experimental data to quantitatively evaluate emerging algorithms. While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile. The driving scenario increases the number and heterogeneity of the observed object classes.

Over ten minutes of high-quality 30Hz footage is provided, with corresponding semantically labeled images at 1Hz and, in part, at 15Hz. The CamVid Database offers four contributions that are relevant to object analysis researchers. First, the per-pixel semantic segmentation of over 700 images was specified manually, and was then inspected and confirmed by a second person for accuracy. Second, the high-quality, high-resolution color video images in the database represent valuable extended-duration digitized footage for those interested in driving scenarios or ego-motion. Third, we filmed calibration sequences for the camera color response and intrinsics, and computed a 3D camera pose for each frame in the sequences. Finally, in support of expanding this or other databases, we offer custom-made labeling software to assist users who wish to paint precise class labels for other images and videos. We evaluated the relevance of the database by measuring the performance of an algorithm from each of three distinct domains: multi-class object recognition, pedestrian detection, and label propagation.

Overview Video:

Avi, 30 Mb, xVid compressed (playback tips, or get the free Mac/Windows player)
or
Mpg, 11 Mb, mpeg-1 compressed (more compatible, but lower quality)

CamVid Database
(Only samples are shown here. For all the videos, see below.)

Original Video Sequences:

Link to FTP server with video files (very big!)
Link to codecs + utility for extracting frames from those big files
(read the inventory.txt)

Labeled Images
(701 so far)

Link to zip file with painted class labels for stills from the video sequences.
Txt file listing classes and label colors as RGB triples (sorted).
(Note: the corresponding raw input images, at 1Hz only and already extracted
from the respective videos, are temporarily available here (556Mb).)
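
The sketch below shows one way the class/color listing could be used to turn a painted label image back into class names. It assumes each line of the txt file has the form "R G B ClassName"; that layout, the sample triples, the function names, and the "Unlabeled" fallback are illustrative assumptions, so check them against the actual file before relying on them.

```python
def parse_label_colors(text):
    """Parse lines of the assumed form 'R G B ClassName' into {(r, g, b): name}."""
    table = {}
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or incomplete lines
        try:
            r, g, b = (int(v) for v in parts[:3])
        except ValueError:
            continue  # skip lines that do not start with three integers
        table[(r, g, b)] = parts[3].strip()
    return table


def label_pixels(pixels, table, unknown="Unlabeled"):
    """Map a 2D grid of (r, g, b) tuples to a 2D grid of class names.

    Colors missing from the table get the hypothetical 'unknown' name.
    """
    return [[table.get(tuple(p), unknown) for p in row] for row in pixels]


# Illustrative sample only; the real triples come from the txt file.
SAMPLE = """128 128 128 Sky
128 64 128 Road
64 0 128 Car
"""

colors = parse_label_colors(SAMPLE)

# A tiny 2x2 "label image" given as nested lists of RGB tuples.
image = [
    [(128, 128, 128), (128, 128, 128)],
    [(128, 64, 128), (64, 0, 128)],
]
labels = label_pixels(image, colors)
```

For full-size frames, the same lookup would normally be done on an array loaded with an image library rather than on nested lists.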

Camera extrinsics

Link to files and code (if link breaks someday, go here)
The line needed to get the projection matrix of a single camera is in MotBoostEvalOneFrame.m (see how LoadBoujou_2Dtrax_3dBans_Misc.m calls it):
curC = Cs( frameNum-offsetForFrameNums, 1:3);

Example camera pose trajectory, stored in Boujou Animation Format:
each line containing "AddDecompCameraKey" has a K and R matrix and t vector,
so that P = K * R * [I -t]
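
As a minimal sketch of the formula above, the following composes P from K, R and t and projects one 3D point. Here [I -t] is read as the 3x4 matrix formed by placing -t next to the 3x3 identity, so that x ~ K * R * (X - t). The intrinsics, pose, and point are made-up values for illustration, and the function names are not part of the provided code.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def projection_matrix(K, R, t):
    """Return the 3x4 matrix P = K * R * [I -t]."""
    I_minus_t = [
        [1.0, 0.0, 0.0, -t[0]],
        [0.0, 1.0, 0.0, -t[1]],
        [0.0, 0.0, 1.0, -t[2]],
    ]
    return matmul(matmul(K, R), I_minus_t)


def project(P, X):
    """Project a 3D point X = (x, y, z) to pixel coordinates using P."""
    Xh = [X[0], X[1], X[2], 1.0]  # homogeneous coordinates
    u, v, w = (sum(P[i][j] * Xh[j] for j in range(4)) for i in range(3))
    return (u / w, v / w)


# Hypothetical values, not taken from the calibration files.
K = [[1000.0, 0.0, 480.0],
     [0.0, 1000.0, 360.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = (1.0, 2.0, 0.0)

P = projection_matrix(K, R, t)
pixel = project(P, (1.0, 2.0, 10.0))  # point straight ahead of the camera
```

With this convention, a point directly in front of the camera position t lands on the principal point encoded in K.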

seq06R0

Description: 3030 frames at 30Hz == 1:41 min
Sample Frame
VideoFile in MXF format*

seq16E5

Description: 6120 frames at 30Hz == 3:24 min
Sample Frame
VideoFiles 1 and 2 in MXF format* (note: these are 2 halves of 1 zip file)

seq16E5_15Hz
(see also CamSeq01)

Description: 202 frames at 30Hz == 0:06 min
Sample Frame
VideoFiles 1 and 2 in MXF format* (note: same files as above, but use a different script)

seq05VD

Description: 5130 frames at 30Hz == 2:51 min
Sample Frame
VideoFile in MXF format*

seq01TP

Description: 3720 frames at 30Hz == 2:04 min
Sample Frame
VideoFile in MXF format*
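
The durations listed for the sequences follow directly from the frame counts and the 30Hz frame rate. The short sketch below reproduces that arithmetic and shows how a lower-rate subset of frame indices could be picked; starting the subsampling at frame 0 is an assumption, and the function names are illustrative.

```python
def duration_mmss(frames, fps):
    """Format the whole-second duration of a clip as m:ss."""
    total_seconds = frames // fps
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


def subsample_indices(frames, fps, target_hz):
    """Return every (fps // target_hz)-th frame index, starting at 0 (assumed)."""
    step = fps // target_hz
    return list(range(0, frames, step))


# Frame counts as listed for the original 30Hz sequences.
SEQUENCES = {
    "seq06R0": 3030,
    "seq16E5": 6120,
    "seq05VD": 5130,
    "seq01TP": 3720,
}

durations = {name: duration_mmss(n, 30) for name, n in SEQUENCES.items()}

# 1Hz subset of seq06R0 and 15Hz subset of the 202-frame seq16E5_15Hz clip.
one_hz = subsample_indices(3030, 30, 1)
fifteen_hz = subsample_indices(202, 30, 15)
```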

Listing of (RGB)-Class assignments (alphabetical)
Listing in color-order used by MSRC (with "XX")

Moving objects
Animal
Pedestrian
Child
Rolling cart/luggage/pram
Bicyclist
Motorcycle/scooter
Car (sedan/wagon)
SUV / pickup truck
Truck / bus
Train
Misc

Road
Road == drivable surface
Shoulder
Lane markings drivable
Non-Drivable

Ceiling
Sky
Tunnel
Archway

Fixed objects
Building
Wall
Tree
Vegetation misc.
Fence
Sidewalk
Parking block
Column/pole
Traffic cone
Bridge
Sign / symbol
Misc text
Traffic light
Other
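
For scripting, the grouping above can be held in a small mapping, which also makes it easy to confirm that the list covers the 32 semantic classes mentioned in the description. The group and class strings are copied from the list; the variable names are illustrative.

```python
# Class names grouped as in the listing above.
CLASS_GROUPS = {
    "Moving objects": [
        "Animal", "Pedestrian", "Child", "Rolling cart/luggage/pram",
        "Bicyclist", "Motorcycle/scooter", "Car (sedan/wagon)",
        "SUV / pickup truck", "Truck / bus", "Train", "Misc",
    ],
    "Road": [
        "Road", "Shoulder", "Lane markings drivable", "Non-Drivable",
    ],
    "Ceiling": [
        "Sky", "Tunnel", "Archway",
    ],
    "Fixed objects": [
        "Building", "Wall", "Tree", "Vegetation misc.", "Fence", "Sidewalk",
        "Parking block", "Column/pole", "Traffic cone", "Bridge",
        "Sign / symbol", "Misc text", "Traffic light", "Other",
    ],
}

# Flatten to (group, class) pairs and count the classes.
all_classes = [(group, name) for group, names in CLASS_GROUPS.items() for name in names]
num_classes = len(all_classes)

# Reverse lookup: which group does a class belong to?
group_of = {name: group for group, name in all_classes}
```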

Hand-Labeled Frames:

seq06R0

Description: 101 frames at 1Hz == 1:41 min
Sample Frame Preview Video

seq16E5

Description: 204 frames at 1Hz == 3:24 min
Sample Frame Preview Video

seq16E5_15Hz
(see also CamSeq01)

Description: 101 frames at 15Hz == 0:06 min
Sample Frame Preview Video

seq05VD

Description: 171 frames at 1Hz == 2:51 min
Sample Frame Preview Video

seq01TP

Description: 124 frames at 1Hz == 2:04 min
Sample Frame Preview Video

Paint-Stroke Logs of Manual Labeling:

Example log file, in which each of the user's mouse strokes was recorded, including:
the class label being applied, the size and type of brush or pre-segmentation used, the location of each click point and drag-path, and the duration of each stroke.
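
To illustrate how the recorded fields might be used once a log has been parsed, the sketch below holds one stroke per record and sums labeling time per class. The record layout, field names, and sample values are hypothetical; the actual log file syntax is not reproduced here and would need its own parser.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Stroke:
    """One recorded paint stroke (hypothetical in-memory layout)."""
    class_label: str          # class being applied
    tool: str                 # brush type or pre-segmentation used
    brush_size: int           # brush size in pixels
    duration_s: float         # duration of the stroke in seconds
    path: list = field(default_factory=list)  # click point and drag-path as (x, y)


def time_per_class(strokes):
    """Sum stroke durations for each class label."""
    totals = defaultdict(float)
    for stroke in strokes:
        totals[stroke.class_label] += stroke.duration_s
    return dict(totals)


# Made-up strokes for illustration only.
strokes = [
    Stroke("Road", "brush", 12, 1.5, [(10, 200), (40, 210)]),
    Stroke("Sky", "pre-segmentation", 0, 0.25, [(300, 20)]),
    Stroke("Road", "brush", 6, 0.5, [(50, 215), (55, 220)]),
]
totals = time_per_class(strokes)
```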

InteractLabeler Software:

InteractLabeler.zip for Windows (3.4Mb)
InteractLabeler Documentation
InteractLabeler instructions, as given to volunteers

*MXF format:

This format is like Avi or Quicktime in that it is a wrapper for multimedia files. In our case, only the video channel contains data, and it is in HD format. To decode, use this utility (link) along with the scripts provided.