Archiving South African digital research data: How ready are we?

HOW TO CITE: Koopman MM, De Jager K. Archiving South African digital research data: How ready are we? S Afr J Sci. 2016;112(7/8), Art. #2015-0316, 7 pages. http://dx.doi.org/10.17159/sajs.2016/20150316

Digital data archiving and research data management have become increasingly important for institutions in South Africa, particularly after the announcement by the National Research Foundation, one of the principal South African academic research funders, recommending these actions for the research that they fund. A case study undertaken during the latter half of 2014, among the biological sciences researchers at a South African university, explored the state of data management and archiving at this institution and the readiness of researchers to engage with sharing their digital research data through repositories. It was found that while some researchers were already engaged with digital data archiving in repositories, neither researchers nor the university had implemented systematic research data management.


Introduction
A number of articles published in this journal are pertinent to the topic of digital data archiving 1,2, in particular the need for the preservation of long-term ecological data sets, which are crucial for understanding the management of the South African environment 3. Research data have not traditionally had a home in university libraries or university archives, and have instead remained the responsibility of research units and researchers, or, in some cases, have been archived in special collections associated with a particular research unit and its specialised focus. 4

Data are the currency of research, but analogue and digital research data generated within academia have largely been an invisible resource utilised within the research unit and shared with a select group of trusted colleagues, and consequently their management is poorly understood. Digital data may exist in various states. At one end of the spectrum are raw data, which probably contain errors, require verification and, without metadata, only have meaning within a research discipline. At the other end are analysed data with metadata that can be downloaded from a repository and understood more broadly across disciplines. Each research discipline produces unique data, which require a range of specialised metadata languages and ontologies as well as subject-focused management and archiving solutions. 5 The international focus on research data makes it important for South African researchers and policymakers to engage with the imperatives of ensuring that data are managed in a way that enables long-term security and accessibility. Data have commercial and intrinsic value, and in both cases it is important that they are archived for future use, particularly because re-collecting data is costly, in both time and money. 6

The international context
The Advanced Research Projects Agency Network (ARPANET) was established in 1969 specifically to enable researchers to share data between laboratories in geographically distant locations. 7 ARPANET was the template upon which the Internet was subsequently built. The ubiquity of the Internet was the cornerstone of the open access initiative 8 which raised the question of universal access to research, particularly publicly funded research. There are, however, fundamental underlying factors that have led to the current preoccupation with research data archiving:
• Global climate change research has alerted governments and researchers to the value of long-term ecological studies. 9
• Garnering funding has become an extremely competitive exercise and major funders want evidence that the research has not previously been undertaken, that the data collected will be preserved, and that the research will be open to scrutiny. 10
• Providing underlying data is regarded as a way to prevent fraud in research, as the findings in the publications are expected to have robust scientific data underlying the research. 11
• There is global awareness that digital records are in danger of being lost, or have already been lost because of inadequate management and preservation initiatives. 12

Concern about the accessibility of digital data is universal and a plethora of published articles on the topic can be identified in the literature. Numerous case studies have been published which report on surveys conducted among the researchers who generate the data to establish the fate of research data. 13-15 In each case the findings were similar: lack of institutional support for research data management, lack of suitable data repositories to archive data for the long term and no incentives or mandates in place to encourage systematic data archiving, resulting in researchers keeping their data within the research unit.
Compounding this situation are attitudes towards sharing data. On the one hand, there are defensible reasons for data sharing: On the other hand, there are the cautious and often negative attitudes of the researchers who produce the data and who are slow to archive or make their data available. 23 Ecological researchers do not have a tradition of sharing research data, other than with trusted colleagues and collaborators. In his interview for a Data Matters blog from Scientific Data, Gavin Simpson, a Canadian environmental scientist, succinctly presented the point of view of ecologists: 'If you've toiled in the field for years to collect data then you're not going to be very easily convinced to make the data available. It's not part of our culture'. 24 It would appear that the only way to resolve the concerns around archiving and sharing research data in a formal repository is to make data archiving mandatory, to formalise data management and to ensure that data generators benefit from sharing their data. Digital Object Identifiers (DOIs) for data enable data users to acknowledge data generators in the same way that the authors of articles and books are acknowledged.
Ensuring that data are available for long-term reuse, and that they can be acknowledged through DOIs, will enable data generators to use data citations, in addition to article citations, when preparing funding proposals for further research.
A number of mainstream academic journals have made data archiving mandatory: American Naturalist, Molecular Ecology, Nature, the Public Library of Science (PLoS) journals, Royal Society of London journals and Science, to name a few in the field of ecology. Funder mandates are seen as the most reliable method for making data management and archiving a part of the research data life cycle 25, that is, the process whereby a researcher plans and documents the various steps in data creation, processing and analysis as part of the research design. A data management plan includes the preservation of the data and a process whereby data can be shared and reused along with the detailed data description, or metadata, that must be archived with the data. A recent editorial in Nature 26 pertinent to open access publishing reveals that Research Councils UK, with oversight of seven public funders in the UK, has found that mandatory open access publishing continues to be problematic, with considerably less than 100% compliance. It is not surprising that archiving research data to make them openly accessible is in a far less developed state.
International initiatives that stand out in their response to digital data archiving initiatives include: Numerous international solutions can be used by researchers to archive their data and by policymakers and institutional managers as examples of best practice for a range of research disciplines. 31 The growth in digital data repositories has resulted in the establishment of an international, peer-reviewed process, 'The Data Seal of Approval', initiated at DANS, which enables institutions to evaluate the reliability of their repository. 32 A repository carrying the Data Seal of Approval is immediately recognisable to researchers and policymakers as a reliable source of data and a reliable site on which to deposit data.

A survey was undertaken to investigate the state of data management and archiving within the Department of Biological Sciences at the University of Cape Town (UCT) and the readiness of researchers to engage with sharing their digital research data in repositories. It will be seen from the results of the survey reported below that these repositories are among those utilised by the academic researchers who were surveyed.

Survey of data archiving expertise and initiatives
Researchers from the Department of Biological Sciences at UCT participated in an online multiple-choice survey, designed to be both interrogative and informative, about their data management and archiving initiatives. The survey was a variation of the computerised self-administered questionnaire 33: an anonymous web-based survey in which the respondents linked to an identified site and completed the questionnaire online without assistance. The survey was designed using Google Forms and consisted of 32 multiple-choice questions. The research was undertaken after ethical clearance from UCT (reference number UCTLIS201408-01).
Face-to-face interviews were conducted with a small group of research technicians and emeritus/retired researchers using the questions from the self-administered survey.
Out of an estimated target population of 318, a total of 163 researchers completed the survey. The survey was conducted over a 5-week period with weekly email reminders sent out to the target population.
To enable an understanding of the Department's researchers' data management issues and activities, the questions were divided into different categories:

Researcher funding streams
Biological research is generally expensive to fund, particularly marine and Antarctic research that requires ocean-going vessels that are not available through the university. Such research requires international collaboration and involvement in government initiatives which are publicly funded programmes. The proportion of public funding of the respondents' research is high: 73% of research is at least partially funded through public funds (Table 1). Such funding renders researchers accountable to the public to make their research openly available and to ensure that their data are available for future research. Information on co-funding through international collaboration was extracted by an examination of published research output during 2007, 2010 and 2014. This examination demonstrated that private and overseas co-funding matched public funding in 2007, and exceeded public funding in 2010 and 2014. The authors collaborating on one paper may have been co-funded by more than one party, resulting in more co-funding categories than total articles (Figure 1).

Publishing characteristics
Academic research findings were made available in the past through the publications of learned societies. When learned societies ceased their publications, the task of publishing findings was taken on by discipline-specific journals published by commercial or not-for-profit scholarly publishers. In both cases research was largely hidden from the general public.
The trend for researchers to make their publications openly accessible has grown because of funding and collaboration mandates and/or to ensure that the research receives the widest audience possible. During 2014, open-access articles amounted to 15% of the total published output of the Department of Biological Sciences (Figure 1). In order to comply with future public funding mandates, the percentage of articles, with accompanying underlying data, would be expected to at least match the percentage of public funding.
An investigation was undertaken into the publication output in scientific journals of researchers in the Department of Biological Sciences, in parallel with the survey, to establish how many articles were published with supplementary information (for example data, code, images or extended bibliographies) as a way of sharing other products relating to the published research. In 2007, only 6% of the published papers were accompanied by supplementary information. This percentage climbed to 14% in 2010 and jumped to 38% in 2014 (Figure 1).

Data characteristics
It was found that researchers in the Department of Biological Sciences have worked with a range of long-term data sets, ranging from over 10 years to over 50 years in extent. Past research has generated digital data in many different formats, which have been archived on various media such as zip drives and 8-, 5¼- or 3½-inch floppy disks. Many digital data sets were in proprietary formats such as Lotus, dBase, Quattro Pro and other Corel products, or early versions of Microsoft, creating problems for long-term data accessibility. Emeritus, retired and senior academics reported data lost because of incompatibility with contemporary computer hardware, operating systems or software programs. Other researchers had retained field or experimental notebooks which contained their raw data in analogue format. The majority (91%) of digital data formats generated by younger researchers were spreadsheets (in XML or CSV format). Improved management of digital data would prevent obsolescence and loss of data from this cohort of researchers.
Data were reused and shared within a controlled group of collaborating researchers. Very few researchers allowed open use of the data sets under their control, with the exception of the researchers in the Animal Demography Unit of the Department of Biological Sciences who managed large sets of 'citizen science' data that carried an open data mandate. Analogue data sets also existed within the department, but were largely invisible through lack of description and archiving. Many past students' data sets accompanied dissertations and remained as appendices in analogue theses, also lacking indexing and description. A number of retired research staff reported that their data either had been thrown away (n=3) or had nowhere to go (n=5) because of a lack of interest among colleagues and the institution. Early digital data had also been lost through lack of institutional support, foresight and responsible management. Instances of old digital data still in existence but inaccessible on contemporary computer platforms were common among senior academics. At the time of the investigation, there were no institutional plans in place to rescue these digital data sets.

Data ownership, intellectual property and copyright
Data ownership was found to be a key inhibitor to data sharing. Opinion about ownership varied: does the funder, the institution, the research unit, the supervisor, or the student own the data? Or is the owner a combination of all these potential data owners?  When asked if their data should be made available for future research, 88% of researchers responded positively. But there were caveats to this response, which are reported in Table 2. The raw data revealed cascading requirements for making data available. For example, respondents were willing to share data only after publication and only if the data generator was offered co-authorship, or, after publication and only on request so that the data generator could evaluate the researcher and project wishing to use the data. Being able to trust the person with whom data would be shared was an important consideration. The raw data demonstrated that the respondents who were prepared to share data through acknowledgement, inclusion of DOIs or through publishing under Creative Commons licences were those contributing to or utilising data sets such as 'citizen science' data that already had an open mandate.
In some cases, researchers indicated that there were copyright restrictions on the data they were using and that they were not permitted to share these data sets. In another case, a research group reported that their data had been misappropriated by another research group on campus, because no memorandum of understanding had been in place to specify agreed terms of data use.
Researchers were sharing data, but in the majority of cases this sharing was not through data repositories. Respondents to the survey could select multiple answers and reported sharing of data through the following methods:
• by email on request, 70%
• within published papers, 38%
• in the public domain, 17%
• through a collaborative initiative, 15%
• through a repository, 12%
• through the research unit's server, 3%

Until data ownership is resolved through funding or collaborative agreements, and until incentives such as the acknowledgement of data generators through the use of data DOIs are commonplace, data sharing will remain a contested issue.

Housekeeping routines and responsibilities
The survey interrogated data management and preservation activities such as storage, back-up routines and data migration routines. Questions on responsibility for data storage revealed a range of perceptions (Table 3) that focused data responsibility on the research unit rather than institutional IT departments, the university library or repositories. At the time of the survey, the institution took no responsibility for research data, although researchers could avail themselves of storage space on an IT server at a cost. Departmental IT personnel interviewed for the investigation (included in the Other/Technical category) indicated that they would give advice to researchers for the storage of data but that they were not responsible for researchers' data. Researchers and research units took responsibility for their data, as this was considered to be the status quo.

Researchers were diligent about back-up routines. The type and location of data back-ups is of interest as it demonstrates changing trends in data storage (Table 4). CD/DVDs are falling out of fashion and cloud storage is becoming more popular, although some researchers expressed reservations about data privacy on cloud storage. There appeared to be no consensus at the time of the survey, although external hard drives were the favoured medium, and keeping a back-up at home was the favoured location. One cannot predict how research data will be stored and backed up in the future, but the move to cloud storage as a more accessible format, which does not require the researcher to purchase or carry around additional hardware, appears to be gaining popularity. Institutional commitment to research data management through the provision of staging repositories for active research data could improve the security of research data.
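As an illustration of what a routine back-up check might look like in practice, the following minimal Python sketch compares checksums of the files in a working data folder against those in a back-up copy. It is not a procedure reported by the survey respondents; the function names (`sha256_of`, `build_manifest`, `find_mismatches`) and the folder layout are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its checksum."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def find_mismatches(original: dict[str, str], backup: dict[str, str]) -> list[str]:
    """List files that are missing from the back-up or whose content differs."""
    return sorted(
        name for name, checksum in original.items()
        if backup.get(name) != checksum
    )
```

A researcher could run `find_mismatches(build_manifest(working_dir), build_manifest(backup_dir))` after each back-up, whatever the medium (external drive, home copy or synchronised cloud folder); an empty list suggests the back-up matches the working copy at that moment.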

Long-term data potential, archiving and metadata
A range of data repositories was utilised by the researchers who were obliged to archive their data, either through collaborations, funding or publisher mandates, or through disciplinary mandates, such as for genome data. Repositories were also utilised by researchers to access data to use in their research. The repositories reportedly utilised by respondents are listed in Table 5. It was found that only 12% of the respondents had used a repository as a means of sharing data, although responses shown in Table 3 indicated that a higher percentage considered a repository to be the appropriate place for responsible data archiving. This apparent contradiction is understandable as routine data archiving in repositories was unknown to many researchers. As metadata are a fundamental component of data sharing, the survey was devised to include a question for which there was a range of mandatory metadata fields as possible answers; respondents could select multiple answers. The possible answers and percentage of respondents who gave each answer are shown in Table 6.
The question that elicited the responses in Table 6 was also intended to sensitise researchers to metadata fields that could be used to describe their research, as the assignment of metadata was a new concept for many researchers at the time. Maintaining detailed descriptions about their data through the use of metadata did not appear to be a routine activity and a number of researchers indicated that they did not assign metadata. The fields shown in Table 6 represent those required by the Ecological Metadata Language (EML) standard.
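To make the idea of assigning metadata more concrete, the sketch below shows one way a researcher might store a small descriptive record alongside a data file and check that no field has been left empty. It is only an illustration: the field names in `REQUIRED_FIELDS` are hypothetical and are not the Table 6 list, and the record is written as JSON rather than as EML, which is an XML-based standard.

```python
import json
from pathlib import Path

# Hypothetical field set chosen for illustration only; it is NOT the list of
# fields in Table 6 and it is not a complete EML record.
REQUIRED_FIELDS = ("title", "creator", "contact", "abstract", "date_collected")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or blank in the record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def write_sidecar(data_file: Path, record: dict) -> Path:
    """Write the record next to the data file as '<name>.metadata.json'."""
    missing = missing_fields(record)
    if missing:
        raise ValueError(f"metadata incomplete, missing: {', '.join(missing)}")
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar
```

Keeping the description in a separate file next to the data means the two can be deposited together, and a record like this could later be mapped to whichever metadata standard a chosen repository requires.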
Researchers were asked what they thought was the purpose of data curation (Table 7). Migrating data into formats that could be used by current software and operating systems received the lowest response. This was a neglected aspect of data management among senior academics that had resulted in data obsolescence instead of data remaining viable for long-term research.

Table 7: Purposes of data curation as reported by respondents

Purpose of data curation | % Respondents
Storing data for access and use | 83
Ensuring that data are secure and backed up and available | 79
Making sure data are available for future use | 77
Maintaining research data in the long term to enable reuse | 68
Ensuring that data are organised and indexed | 54
Migrating data to new platforms/software | 35

In order to build up long-term data sets for long-term ecological research such as land-use or climate change, data management will need to become an integrated part of the research life cycle.
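Migration, the least recognised purpose of curation among respondents, can be a small routine step. The following Python sketch rewrites a delimited text export as a UTF-8, comma-separated CSV file. The source encoding (`cp1252`) and delimiter (`;`) are assumptions chosen for the example, and the sketch does not read proprietary binary formats such as Lotus or dBase files, which would first need to be exported from software that can still open them.

```python
import csv
from pathlib import Path


def migrate_to_utf8_csv(
    source: Path,
    target: Path,
    source_encoding: str = "cp1252",  # assumed legacy encoding
    source_delimiter: str = ";",      # assumed legacy delimiter
) -> int:
    """Rewrite a delimited text file as UTF-8 CSV; return the rows written."""
    rows = 0
    with source.open("r", encoding=source_encoding, newline="") as src, \
            target.open("w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter=source_delimiter):
            writer.writerow(row)
            rows += 1
    return rows
```

Comparing the returned row count with the number of rows expected in the source file gives a quick sanity check that nothing was dropped during the conversion.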

Institutional engagement and data management education possibilities
The survey contained three questions posed in order to gauge the appetite of researchers for data management education. The questions and percentage responses can be seen in Table 8. Institutional managers have a role to play in ensuring that data management and curation are accounted for in research budgets. Whereas 88% of respondents indicated that their research should be made available for future research, only 18% budgeted for data management and data curation and only 26% had a data preservation plan.

Conclusion
The survey demonstrated that, even within the Department of Biological Sciences, research was varied and data collection and interpretation required a range of specialist skills, equipment and tools. Any discussion of metadata should include the standards and metadata languages appropriate for all types of research data in order for researchers to successfully describe their data for long-term preservation. The link between metadata and sharing has to be made in order for researchers to see the importance of comprehensive data descriptions, as without metadata, their data have no long-term value.
There had been no systematic interventions at UCT for supporting researchers with data management or data storage facilities, and an ad-hoc situation with varying success in the preservation of research data had been the status quo. Research data archiving for long-term preservation requires secure funding streams as well as training in research data management (RDM). Assistance with the development of RDM plans, soon to be required by South African research funders, is one of the ways in which the institution can assist researchers to apportion funding for data preservation.
Systematic RDM and archiving will only come about when proposed policies have been established in consultation with researchers. RDM education of the new cohort of researchers is a prerequisite for establishing systematic data archiving, and initiatives should be introduced at senior undergraduate level. Because RDM is a relatively new concept to South African researchers, support should also be offered to senior- and mid-level academic researchers so that they are sufficiently informed to ensure that student data are properly managed and archived. New research projects should include a data archiving and sharing plan as part of the overall project plan.
At the time of the survey, there was no strategy in place for the management or archiving of pre-digital or early digital research data, and some of these data were still in the hands of the retired and emeritus staff of the Department of Biological Sciences who were interviewed. Ensuring that long-term data sets are preserved is urgent and important, as it is not possible to recreate long-term ecological data because the impacts of human population expansion and resource usage change ecological systems over time.