Ministry of Environment of Québec (2011-2014) web archive collection derivatives

Main Author: Ruest, Nick
Format: info dataset Journal
Language: eng
Published: 2020
Subjects:
Online Access: https://zenodo.org/record/3596786
Table of Contents:
  • Web archive derivatives of the Ministry of Environment of Québec (2011-2014) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup banq!

    These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains
      .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
      Produces a DataFrame with the following columns: domain, count

    Web Pages
      .webpages()
      Produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content

    Web Graph
      .webgraph()
      Produces a DataFrame with the following columns: crawl_date, src, dest, anchor

    Image Links
      .imageLinks()
      Produces a DataFrame with the following columns: src, image_url

    Binary Analysis
      • Audio
      • Images
      • PDFs
      • Presentation program files
      • Spreadsheets
      • Text files
      • Videos
      • Word processor files
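
The conversion to Pandas DataFrames mentioned above might look like the following minimal sketch. It assumes pandas is installed along with a Parquet engine such as pyarrow. The helper names `load_derivative` and `domain_counts` are hypothetical and are not part of the Archives Unleashed Toolkit. The domain count is only a rough local approximation of the Domains derivative, computed from the `url` column of the Web Pages derivative, and may not match the toolkit's own domain extraction exactly.

```python
from urllib.parse import urlparse

import pandas as pd


def load_derivative(path):
    """Read one Parquet derivative (a file or a directory of part files).

    Hypothetical helper; pandas needs a Parquet engine (pyarrow or
    fastparquet) to be installed for this call to work.
    """
    return pd.read_parquet(path)


def domain_counts(webpages):
    """Count pages per host name, most frequent first.

    Expects a DataFrame with a `url` column, as listed for the Web Pages
    derivative. Returns a DataFrame with `domain` and `count` columns.
    """
    # urlparse(...).hostname is None for strings without a host; map to "".
    hosts = webpages["url"].map(lambda u: urlparse(u).hostname or "")
    counts = hosts.value_counts()  # sorted in descending order by default
    return counts.rename_axis("domain").reset_index(name="count")
```

With a local copy of the Web Pages derivative (the path here is an assumption, not a file name taken from the dataset), this could be used as `domain_counts(load_derivative("webpages.parquet"))`.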