Table of Contents: Indonesian Dataset Expansion of Microsoft Research Video Description Corpus and Its Similarity Analysis

Main Author:	RAHUTOMO, FAISAL
Other Authors:	Rahutomo, Faisal, Hafidh Ayatullah, Ahmad
Format:	Dataset
Published:	Mendeley, 2018
Subjects:	Information Retrieval; Semantics; Natural Language Processing; Similarity Measure; Indonesian Language
Online Access:	https://data.mendeley.com/datasets/d7vx5cc92y

Table of Contents:

The Microsoft Research Video Description Corpus is an openly available dataset containing about 120K sentences. The sentences form a set of roughly parallel descriptions of more than 2,000 video snippets in 35 languages. Both paraphrase and bilingual relations are available, but the dataset contains no Indonesian descriptions. This dataset is an Indonesian expansion of the Microsoft Research Video Description Corpus. The collection consists of 43,753 description texts for 1,959 short videos, parallel with Microsoft’s dataset. To add further value, similarity metrics were computed over the texts. The metrics are cosine, Jaccard, Euclidean, and Manhattan, with average results of 0.22, 0.33, 2.38, and 6.08, respectively.
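
The four measures named in the description can be sketched in Python as follows. This is only an illustration under stated assumptions: the record does not say how the texts were tokenized or weighted, so the sketch assumes lowercase whitespace tokenization and raw term counts, and the function names and the two sample sentences are hypothetical. Values computed this way are not expected to reproduce the reported averages.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization (the record does not specify one).
    return text.lower().split()


def term_vectors(a, b):
    # Build raw term-count vectors for both texts over their shared vocabulary.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = sorted(set(ca) | set(cb))
    return [ca[t] for t in vocab], [cb[t] for t in vocab]


def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    va, vb = term_vectors(a, b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    # Jaccard similarity: shared distinct tokens over all distinct tokens.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def euclidean(a, b):
    # Euclidean distance between the two term-count vectors.
    va, vb = term_vectors(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


def manhattan(a, b):
    # Manhattan distance: sum of absolute count differences.
    va, vb = term_vectors(a, b)
    return sum(abs(x - y) for x, y in zip(va, vb))


# Hypothetical pair of descriptions of the same video (not taken from the dataset).
first = "seorang pria bermain gitar"
second = "seorang pria memainkan gitar"

print(cosine(first, second))
print(jaccard(first, second))
print(euclidean(first, second))
print(manhattan(first, second))
```

Cosine and Jaccard are similarities, where higher values mean closer texts, while Euclidean and Manhattan are distances, where lower values mean closer texts, so the four averages in the description are not on a common scale.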