Re-DID: Real-life Events Dyadic Interaction Dataset


  Abstract

This paper proposes a method to detect and localize dyadic human interactions in real videos. The idea stems from the significant difference between an action performed by a single subject and an interaction between two persons. In the former case, all the visual information is concentrated on the subject, while in the latter the action of one person is related to the interacting person's attitude, following an action/reaction principle. This kind of behavior is significant especially in natural, real scenarios, in which people move freely without being aware of being recorded. To highlight these features and provide researchers with a common ground for comparison, we have collected and annotated a new dataset, retrieving from YouTube 30 different videos of a specific type of interaction, namely urban fight situations. The proposed dataset is one of the most challenging annotated video collections concerning dyadic interactions, due to the intrinsic intra-class variability characterizing real fights. In addition, we provide an extensive experimental analysis on this dataset and demonstrate that the visual information extracted in the area associated with the interpersonal space plays a fundamental role in detecting fights.

  Data Description

All the videos in the dataset are retrieved from YouTube; 25 of them were recorded with car-mounted dash cams, while the remaining ones were taken with other devices, such as mobile phones. The length of the videos ranges from 0:20 to 4:02 (mm:ss), and the resolution has been normalized to 1280x720 for the sake of homogeneity. The dataset includes 73 different fight instances under different lighting (day, night) and weather conditions (sunny, rainy), different original video resolutions (native 1280x720, upsampled videos), different camera views (wide angle, fish-eye, zoomed view), and both moving and static scenes. The annotations provide the subjects' bounding boxes and IDs for each frame, the temporal window in which the interaction occurs, and the interpersonal-space bounding boxes (see the paper cited below), precomputed for the ground truth. Regarding the triggering and ending of an interaction, we followed a general annotation rule: the interaction starts with the first contact between the involved subjects and ends when a relevant distancing takes place.


  Annotations

Person's BBoxes (one record per subject per frame):

frame number | X top-left corner | Y top-left corner | width | height | object type | ID
6            | 153               | 9                 | 141   | 99     | person      | 1

Action's BBoxes (one record per interaction):

starting frame number | ending frame number | ID1 | action | ID2
525                   | 670                 | 0   | fight  | 2

Interpersonal space's BBoxes (one record per frame):

frame number | X top-left corner | Y top-left corner | width | height | ID1 | ID2
203          | 365               | 57                | 60    | 60     | 0   | 1
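
As an illustration, the annotation records above can be parsed with a few lines of Python. This is a minimal sketch, assuming the annotations are plain-text files with one whitespace-separated record per line; the file names in the usage comment are hypothetical.

def load_person_bboxes(path):
    """Parse person bounding boxes: one record per subject per frame."""
    records = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            frame, x, y, w, h, obj_type, obj_id = parts
            records.append({
                "frame": int(frame),
                "x": int(x), "y": int(y),
                "w": int(w), "h": int(h),
                "type": obj_type,
                "id": int(obj_id),
            })
    return records

def load_action_windows(path):
    """Parse action annotations: (start_frame, end_frame, ID1, action, ID2)."""
    windows = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            start, end, id1, action, id2 = parts
            windows.append((int(start), int(end), int(id1), action, int(id2)))
    return windows

# Hypothetical file names:
# persons = load_person_bboxes("video01_persons.txt")
# fights  = load_action_windows("video01_actions.txt")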

  Visual Features

Visual features are stored in binary format; each value is a single-precision float, and each trajectory is composed as follows:
- 10 [header (frame, mean_x, mean_y, var_x, var_y, length, scale, x_pos, y_pos, t_pos)]
- 30+30 [trajectory positions + normalized trajectory (x, y, 15 frames long)]
- 96 [HOG descriptor (8x2x2x3)]
- 108 [HOF descriptor (9x2x2x3)]
- 96 [MBH descriptor X (8x2x2x3)]
- 96 [MBH descriptor Y (8x2x2x3)]
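
The per-trajectory layout above sums to 10 + 30 + 30 + 96 + 108 + 96 + 96 = 466 single-precision values. As a sketch, the binary files can be read with NumPy, assuming consecutive 466-float records in little-endian order with no file header; the file name is hypothetical.

import numpy as np

RECORD_LEN = 10 + 30 + 30 + 96 + 108 + 96 + 96  # 466 floats per trajectory

# Assumption: little-endian single-precision floats, no file header.
data = np.fromfile("video01_features.bin", dtype="<f4")
assert data.size % RECORD_LEN == 0, "unexpected file size"
trajectories = data.reshape(-1, RECORD_LEN)

header = trajectories[:, :10]      # frame, mean_x, mean_y, var_x, var_y, ...
traj   = trajectories[:, 10:70]    # trajectory positions + normalized (x, y)
hog    = trajectories[:, 70:166]   # HOG descriptor (8x2x2x3)
hof    = trajectories[:, 166:274]  # HOF descriptor (9x2x2x3)
mbhx   = trajectories[:, 274:370]  # MBH descriptor X (8x2x2x3)
mbhy   = trajectories[:, 370:466]  # MBH descriptor Y (8x2x2x3)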

  Reference

You can download the paper here.
If you use this dataset, please cite:

@INPROCEEDINGS{rota2015real,
    author    = {Rota, P. and Conci, N. and Sebe, N. and Rehg, J. M.},
    title     = {Real-Life Violent Social Interaction Detection},
    booktitle = {Proceedings of the IEEE International Conference on Image Processing (ICIP)},
    year      = {2015}
}

  Download files

Original (1280x720):
- Videos
- Actions' BBoxes
- Person's BBoxes
- Interpersonal space BBoxes
- Visual Features

Reduced (340x360):
- Videos
- Actions' BBoxes
- Person's BBoxes
- Interpersonal space BBoxes
- Visual Features

  Contacts

For further details, please contact:

Paolo Rota

Nicola Conci

Nicu Sebe