SSHOC D4.12 Guidelines for the integration of Audio Capture data in Survey Interviews
Main Authors: | Tom Emery, Ruud Luijkx, Giovanni Borghesan, Henk van den Heuvel |
---|---|
Format: | Report publication-deliverable Journal |
Language: | eng |
Published: | 2019 |
Subjects: | |
Online Access: |
https://zenodo.org/record/3631169 |
Table of Contents:
- This deliverable is the first associated with Task 4.4, Voice recorded interviews and audio analysis, in the Social Sciences and Humanities Open Cloud project (SSHOC). This task will collect audio data in the form of voice-recorded interviews from the Generations and Gender Survey. This audio data will then be processed and analysed by colleagues at CLARIN, who will contribute automatic speech recognition, speaker attribution, part-of-speech tagging, named entity labelling and other NLP tools. Survey methodologists and social scientists from the EVS and GGP will work together with data scientists and oral historians from CLARIN to develop a survey module specifically adapted to integrate audio recordings and their processing into the traditional data collection process. Thematically, they will focus on the qualitative assessment of value statements. Once the module is fielded, CLARIN will adapt existing auto-transcription tools to the specific needs of the audio survey data and make the transcribed files available for analysis. Data Archiving and Networked Services (NL) (DANS-KNAW) will oversee archiving and dissemination of the data, drawing on their significant experience with oral histories. In these guidelines, we set out the questionnaire and fieldwork principles for the implementation of the Audio Survey Modules. These will then be implemented in fieldwork in early 2021 and the data processed and analysed before the end of 2021 (Month 36 of the project). A specific questionnaire and guidelines are required because the aim of the project is unprecedented in several respects. Audio data is sometimes captured by surveys as part of data quality control, but the technical implementation and questionnaire content are rarely, if ever, designed to ensure that the digital language data generated from the interview is suitable for linguistic analysis.
What is unique about this task is that researchers from CLARIN were involved from the early stages of design to ensure that the substantive focus and technical implementation of the project would yield audio data and transcripts that produce meaningful results when analysed with the tools at CLARIN’s disposal. These guidelines proceed as follows. First, the various aims of the project are outlined from the perspective of survey infrastructures (GGP & EVS) and linguistic infrastructures (CLARIN). These help shape the technical requirements that are then laid out in section 4. Finally, the questions that we intend to field are laid out.