How to research data management practices: Semi-structured Interviews

To attain a representative view of the state of play of Research Data Management at TU Graz, potential candidates for interviews were selected based on two factors: their affiliation (faculty/department) and their position. The target was to interview a minimum of two people per faculty (regardless of department). This target was not met, in part due to recruiting issues, although we ended up interviewing more candidates overall than we had planned. Additionally, departmental structures do not necessarily reflect differences in data management; therefore, the interviews we were able to conduct provide a good overview of data management practices at TU Graz regardless of institutional affiliations. In total, 13 formal interviews were conducted with 18 respondents holding various positions at their departments/faculties. Three formal meetings held at the Faculties of Architecture, Electrical Engineering, and Mechanical Engineering were included in the analysis (minutes were taken during the meetings).

Interview partners were identified by manually scanning TUGonline by faculty, initially identifying one or two researchers per faculty, all of them holding a teaching qualification according to their TUGonline profiles (Professors and Associate Professors); it must be stressed that this strategy was not always successful. Additionally, deans of faculty were asked to name potential interviewees, but this strategy did not prove very effective. Researchers were contacted via email with a request for an interview on data practices and, should they decline, a request to name alternative candidates. Those who declined did so for one of two reasons: a (self-ascribed) lack of competence in the subject of data management, or a general refusal to give interviews. Fortunately, those in the first group shared the names of potential interviewees they considered a better fit.

For the interviews, we used a formalized list of interview questions along with possible follow-up questions to fall back on should the conversation come to a halt at any point. The questions were formulated in an open fashion in order to record as much variation in the answers between cases as possible. All interviews were professionally transcribed and coded using the software package RQDA. Coding was done by one researcher, with supervision and feedback from other researchers. The material was analysed with special attention to data practices (types of data used by disciplines, methods of data collection, storage, and analysis, and data sharing routines or the lack thereof). The interview questionnaire was designed to allow reconstruction of data handling practices in their wider institutional, disciplinary, and practical context. The semi-standardized interview questionnaire contained broad questions about data in the context of research, (typical) research aims, data management practices, roles and responsibilities, data storage and data sharing, and research culture more broadly (e.g. publication routines, reputation and credit). In keeping with the findings from initial gatekeeper contact, the interview questions refrained from using terms such as “research data management”, “data management”, and “policy”, and instead focused on understanding what researchers do with their data, a strategy which proved worthwhile.

Understanding the state-of-the-art: Standardized Survey

Based on preliminary interview findings as well as faculty visits and policy working group consultations, a survey was designed to gain quantitative insights into the way data intensity, data handling, and research styles play out with respect to RDM at TU Graz. The survey was hosted on LimeSurvey and sent out via email to all members of scientific staff at TU Graz in September 2019. In total, the survey was kept open for five weeks. An announcement was sent out one week before the survey opened, and two reminders followed: one after two weeks and the second one week before the end of the survey. Consultations were held with the responsible bodies at TU Graz to follow established protocols for surveys and to clarify issues of data protection. The survey was sent out to 1784 scientific staff members. 498 respondents started the survey, and of those, 259 completed the questionnaire. The completed questionnaires were included in the analysis. No incentives were offered to encourage participation. Survey respondents come from all seven faculties and from all academic positions. A more fine-grained analysis (e.g. at the departmental level) was impossible due to data-protection restrictions at TU Graz.
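The participation figures above can be turned into rates with a few lines of arithmetic. The following is a minimal sketch that uses only the three counts reported in the text; the variable names and the helper `pct` are illustrative, and the distinction between "start rate", "completion rate", and "response rate" is an assumption about how one might label these ratios, not terminology taken from the study.

```python
# Counts reported in the text above.
invited = 1784    # scientific staff members who received the survey
started = 498     # respondents who started the survey
completed = 259   # respondents who completed the questionnaire


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


start_rate = pct(started, invited)        # share of invitees who started
completion_rate = pct(completed, started)  # share of starters who finished
response_rate = pct(completed, invited)    # share of invitees who finished

print(f"Started:   {start_rate}% of invited")
print(f"Completed: {completion_rate}% of started")
print(f"Completed: {response_rate}% of invited")
```

On these counts, roughly 28% of invitees started the survey, about half of those completed it, and the completed questionnaires correspond to roughly 14.5% of all invitees.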

The survey consisted of 27 questions in five groups:

  1. Data Types
  2. Data Quantity
  3. Data Handling
  4. Obstacles to Research Data Management
  5. Demographics

The first group of questions concerned research outputs generated by researchers at TU Graz, the kinds of data formats typically used, as well as research outputs other than data (e.g. physical samples of any kind). The second group contained questions regarding typical data volumes per year and (average) storage space required. Data handling refers to practices of data sharing/handling as well as attitudes towards e.g. data reuse and repositories. Obstacles to Research Data Management refers to researchers’ experiences with RDM. Demographics contained three questions: faculty, position, and role of the respondent. These three items were designed to ensure that data protection was observed while still allowing for meaningful analysis.

The survey items were adapted, in part, from a survey on RDM knowledge and practices among ERC grant winners commissioned by the European Research Council and written by the Public Policy and Management Institute (PPMI), Digital Curation Centre (DCC), Georg-August-Universität Göttingen, and Science-Metrix (PPMI 2018). The survey went through several rounds of refinement (formulation of items, order of items, translation). The survey was developed in English and then translated into German after the items were finalized. One pretest was conducted in which volunteers from the policy working group were asked to complete and comment on the survey. For the pretest, the survey version was hosted on LimeSurvey. Ten responses were received, which contained valuable criticism and hints as to what should be amended. These suggestions were incorporated into the final survey design.

Variation in RDM Practices

The main outcome of our study of RDM practices is a multidimensional typology of data handling practices in research disciplines. As the analysis demonstrates, data sharing can be expected to vary along a minimum of three dimensions: 

  • Dimension 1: Data Intensity (high versus low amounts/complexity of data)
  • Dimension 2: Style of Data Handling (intensive care vs. discard)
  • Dimension 3: Reproducibility and Replicability

Intensity refers (a) to the amount of data that research produces, which varies greatly from research group to research group, and (b) to the complexity of the data (e.g. in terms of dimensionality). Both variables affect the time and resources needed to manage the data (e.g. defining metadata, uploading data sets to repositories, etc.). Style of data handling refers to the time devoted to data handling (i.e. whether data are kept for the long term or, at the other end of the spectrum, discarded after analysis). The third dimension refers to the value accorded to ensuring reproducibility, which varies greatly across research fields and with research aims. Each case (discipline) can be assigned a place in this three-dimensional space. The three dimensions of data practice explain the value accorded to data and hence the propensity of individual researchers to share their data under certain circumstances.

Key findings are summarized below:

  • Variation in Data Practices: Faculties, institutes, and research groups differ with respect to data amounts/complexity, data collection and analysis, and consequently the extent to which data are archived and which databases are used; respondents thus feel that RDM policies should be discipline-specific where needed but as general as possible
  • Disciplinary variation necessitates discipline-specific services in terms of e.g. data stewardship; many respondents pointed out that institutes need help with very specific tasks; very often, this is an effect of specialization; respondents desire support at the departmental level
  • Data Collection: The bulk of research data are collected by PhD candidates; turnover in PhD positions is thus a major continuity problem for data management; accordingly, there is a desire that more effort be put into proper RDM training for PhD candidates; in general, more time needs to be allowed/planned for to guarantee adequate data management
  • Metadata: In many fields, there is no consensus as to which data to share and how to develop metadata schemas; this is especially pertinent in disciplines where there is no culture of sharing research data; here, researchers said they need support, especially with regard to funder mandates (e.g. DMPs); where there are established metadata schemas, researchers want to be able to search repositories by metadata
  • Data Analysis: Some respondents use Docker images to organize their data analysis; this is considered highly desirable as a way to organize the entire research process (storage of data and scripts)
  • Technical Aspects of Data Management: Data loss is a concern among all respondents; as a consequence, data security measures and backups are desired across all faculties; these should be paid for by the university (which respondents recognize requires a cultural shift)
  • Opt-in: All these support structures are preferred as opt-in versions, free to use for those who need them and without introducing any additional (administrative) burden
  • Publishing Research/Archiving Data: The publishing process is rather similar across the faculties (from planning an article to selecting a journal to submission to uploading data). Accordingly, support structures could be bundled in one organizational unit which would help to free up researchers’ time
  • Data Security and Backups: Data loss is a big concern among researchers and research administrators; it can have two major causes: turnover in PhD positions, and inability (technical or on the part of data handlers) to secure data; here, backup options with adequate funding are highly desired
  • Data Sharing: The propensity to share data depends on which data are perceived as valuable and why. The value of data in turn depends on the effort that needs to be invested in data collection/data processing. This seems to be especially pertinent in the Life Sciences, where data sharing is a big priority (more so than in the other disciplines we studied). In general, the value ascribed to data in a given field can be explained by reference to three factors:
    • Intensity (amount and complexity of data)
    • Handling (resources put into data handling)
    • Reproducibility (research style: what is the aim of the field)
  • Only disciplines “scoring” high on all three dimensions can be expected to develop a culture of data sharing, and consequently, there is a lot of variation across TU Graz with respect to data management practices
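The typology can be read as a simple rule: a discipline is expected to develop a data sharing culture only if it is high on intensity, handling, and reproducibility at once. The sketch below illustrates that rule and nothing more. The study itself is qualitative, so the numeric 0 to 1 scale, the threshold of 0.5, the class and method names, and the two example fields are all hypothetical and are not taken from the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisciplineProfile:
    """Position of a discipline in the three-dimensional typology.

    Scores are on an assumed 0-1 scale (hypothetical; the study does not
    assign numeric values to disciplines).
    """
    name: str
    intensity: float        # amount and complexity of data
    handling: float         # resources put into data handling
    reproducibility: float  # value accorded to reproducibility

    def scores_high_on_all(self, threshold: float = 0.5) -> bool:
        """True only if every dimension reaches the (assumed) threshold."""
        lowest = min(self.intensity, self.handling, self.reproducibility)
        return lowest >= threshold


# Hypothetical example profiles, for illustration only.
profiles = [
    DisciplineProfile("Hypothetical field A", 0.9, 0.8, 0.7),
    DisciplineProfile("Hypothetical field B", 0.9, 0.2, 0.6),
]

likely_sharers = [p.name for p in profiles if p.scores_high_on_all()]
print(likely_sharers)
```

Taking the minimum across the dimensions mirrors the "high on all three" condition: a single low dimension, such as data that are discarded after analysis, is enough to keep a field out of the group expected to share.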