Social Event Detection 2014 (SED 2014) dataset

Overview: This page makes available for research purposes the dataset, challenge definitions, ground truth challenge results and corresponding evaluation script that were created and used in the 2014 edition of the Social Event Detection (SED) task of the MediaEval benchmarking activity.
 
The Social Event Detection (SED) task of MediaEval 2014 required participants to a) cluster a collection of images so that the images in each cluster are associated with a distinct social event and b) given a collection of images / clusters corresponding to social events, retrieve those that match specific criteria. By social events, we mean events that are planned by people, attended by people, and illustrated by media captured by people. Each image in the dataset is accompanied by metadata typically found on the social web (including time-stamps, tags, and, for a small subset of the images, geotags).
 
For more information on the SED 2014 dataset, challenges and evaluation, please see the following publication. If you use the dataset for your research, please cite the paper:
 
G. Petkos, S. Papadopoulos, V. Mezaris. "Social Event Detection at MediaEval 2014: Challenges, Dataset and Evaluation", Proc. MediaEval 2014 Workshop, Barcelona, Spain, October 2014.
 
How to access:
The dataset has two parts. The first contains a set of 362,578 images that belong to 17,834 events. Its metadata file, in JSON format, can be obtained from here:
 

Metadata file for first part

 

To obtain the image files please use the URL listed for each image in the metadata file. The association of these images to the events is provided in the following file: 
 

Cluster mappings file for first part

 
This first part of the dataset is used for development in the first subtask, and for both development and testing in the second subtask. In particular, the data in the first part of the dataset can be used as an example clustering for developing solutions to the first subtask. Additionally, for development purposes in the second subtask we provide a list of example queries here:
 

Development queries file

 
The second part of the dataset is used for testing in the first subtask. The corresponding metadata file can be found here:
 

Metadata file for second part  (has exactly the same format as the first metadata file)

 

The test queries for the second subtask can be found in the following file:

 

Test queries file

 
Finally, we provide the ground truth for the first subtask here:
 

First subtask ground truth

 
And the ground truth for the second subtask here:
 

Second subtask ground truth

 

Evaluation:

A script is provided to easily compute the evaluation measures on your results. You can find it here:
 

Evaluation resources

 
The script requires Python 2.7. For instructions on how to run this script, please see the ReadMe file in the above archive. Please also note that your result files must have the formats below.

 

For the first subtask the file must be a plain ASCII file, in which each line represents the association of one image in the test set to a cluster. Each line must have the form "photoID-single tab-eventID". The photoIDs must be the same as those in the test set's metadata file, whereas the eventID can be any string that does not include a tab character. E.g. the following would be part of a valid file:
 

3587419916 7081

3750102971 7028

3740276442 7047

9230492703 122

1820390101 2080
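
The first-subtask format described above can be sketched in a few lines of Python 3. This is only an illustrative helper and is not part of the evaluation resources; the function name `format_clustering` and the sample mapping are hypothetical, and the helper assumes your clustering is held as a dictionary from photoID to eventID.

```python
def format_clustering(assignments):
    """Return result-file text: one 'photoID<TAB>eventID' line per image.

    `assignments` is a hypothetical dict mapping photoID -> eventID.
    """
    lines = []
    for photo_id, event_id in assignments.items():
        photo_id, event_id = str(photo_id), str(event_id)
        # A tab inside an ID would break the single-tab line format.
        if "\t" in photo_id or "\t" in event_id:
            raise ValueError("IDs must not contain a tab character")
        line = photo_id + "\t" + event_id
        # The file must be plain ASCII; this raises UnicodeEncodeError otherwise.
        line.encode("ascii")
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample using photoIDs from the example above.
    sample = {"3587419916": "7081", "3750102971": "7028"}
    print(format_clustering(sample), end="")
```

The returned text can be written to a file opened with `encoding="ascii"`. Whether the evaluation script accepts a trailing newline or other line endings is not stated here, so please check the ReadMe file in the archive.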

 

For the second subtask the file must be a plain ASCII file, in which each line lists the events retrieved for one of the 10 test queries. Each line must have the form "testQueryID - single tab - comma separated eventIDs". The testQueryIDs must be the same as those in the test queries file, whereas the eventIDs must match those used in the image-event association file. The following is part of a valid submission:

 

Test-1     279, 2467, 2039, 8945, 2029

Test-2    3270, 129, 909, 1010, 5643, 9129, 2198
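
As a minimal sketch of the second-subtask format above, the following Python 3 helper builds one line per test query. It is not part of the evaluation resources; the function name `format_retrieval` and the sample results are hypothetical. The default separator of a comma followed by a space mirrors the sample lines above, and is an assumption about what the evaluation script accepts.

```python
def format_retrieval(results, separator=", "):
    """Return result-file text: 'testQueryID<TAB>eventID, eventID, ...' per query.

    `results` is a hypothetical dict mapping testQueryID -> list of eventIDs.
    """
    lines = []
    for query_id, event_ids in results.items():
        joined = separator.join(str(event_id) for event_id in event_ids)
        line = str(query_id) + "\t" + joined
        # Exactly one tab is allowed: the one between the query ID and the events.
        if line.count("\t") != 1:
            raise ValueError("IDs must not contain a tab character")
        # The file must be plain ASCII; this raises UnicodeEncodeError otherwise.
        line.encode("ascii")
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample using the IDs from the example above.
    sample = {"Test-1": [279, 2467, 2039, 8945, 2029]}
    print(format_retrieval(sample), end="")
```

If the evaluation script turns out to expect commas without a following space, pass `separator=","`.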

 

Many thanks to Timo Reuter for providing the largest part of the script, as well as for providing the ReSEED dataset on which a large part of the dataset is based.
 
 
Copyright notice: The images distributed as part of the Social Event Detection 2014 (SED 2014) dataset were collected from Flickr, where they were posted by their respective owners under a Creative Commons license. The Creative Commons attribution licenses allow for image use as long as the photographer is credited for the original creation. Use may be granted under additional restrictions, but none of these preclude the use of the images for benchmarking purposes. While compiling the Social Event Detection 2014 (SED 2014) dataset, we collected only Creative Commons images, and also collected as much information as possible about the creators of each image. The creator information, the exact license type and other relevant information are included in the image license file, which is distributed together with the images. We would like to take this opportunity to express our gratitude to the image photographers for allowing us to use their pictures: we greatly appreciate this and gladly acknowledge your work. Your names and license details are listed in the image license file. Please let us know if you have special wishes on how you would like to be credited or have additional details that must be incorporated.