DMOZ 2006 Dataset and its Wikification

Main Author: Lorenzetti, Carlos
Other Authors: Maguitman, Ana; Baggio, Cecilia
Format: Dataset
Published: Mendeley, 2019
Online Access: https://data.mendeley.com/datasets/9mpgz8z257
Table of Contents:
  • This dataset was retrieved with a crawler in 2006 from the Open Directory Project (ODP) (http://dmoz.org, https://en.wikipedia.org/wiki/DMOZ), which closed in 2017 and was reborn as Curlie (https://curlie.org/). Topics were selected from the third level of the ODP hierarchy, subject to constraints intended to ensure the quality of the dataset: each selected topic had to contain at least 100 URLs, and the language was restricted to English. For each topic, we collected all of its URLs as well as those in its subtopics. The retrieved HTML was parsed and cleaned to remove empty files, PDF and Flash files, and other unusable content. In total, more than 350,000 pages were collected from 448 topics. In 2018 the data was wikified.
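The topic selection described above can be illustrated with a minimal Python sketch. It is not the authors' crawler: the function name `select_topics`, the input mapping of category paths to URL lists, the choice to count levels from just below `Top`, and the use of the `World` branch to identify non-English listings are all assumptions made for illustration.

```python
from collections import defaultdict

MIN_URLS = 100   # minimum topic size stated in the description
TOPIC_LEVEL = 3  # "third level"; counting from just below "Top" is an assumption


def select_topics(urls_by_category, level=TOPIC_LEVEL, min_urls=MIN_URLS):
    """Pool URLs under their level-`level` ancestor and keep large topics.

    `urls_by_category` is a hypothetical input: it maps a category path such
    as "Top/Science/Biology/Botany/Ferns" to the URLs listed in that category.
    """
    pooled = defaultdict(set)
    for path, urls in urls_by_category.items():
        parts = path.split("/")
        if parts and parts[0] == "Top":
            parts = parts[1:]
        # Assumption: non-English listings sit under the "World" branch.
        if not parts or parts[0] == "World":
            continue
        if len(parts) < level:
            continue  # category is shallower than the selected level
        topic = "/".join(parts[:level])
        pooled[topic].update(urls)  # subtopic URLs count toward the topic
    # Keep only topics that reach the minimum size.
    return {t: sorted(u) for t, u in pooled.items() if len(u) >= min_urls}
```

Under these assumptions, a topic with 60 URLs of its own and 50 more in a subtopic would be kept with 110 URLs, while a topic with 99 URLs and no subtopics would be dropped.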