What is Data Collection?
Data collection is the process of gathering, measuring, and analyzing accurate information for research using standard, validated techniques.
To collect data, we must first identify what information we need and how we will collect it. We can also evaluate a hypothesis based on collected data. In most cases, data collection is the primary and most important step for research. The approach to data collection is different for different fields of study, depending on the required information.
Research Data Management (RDM) is present in all phases of research and encompasses the collection, documentation, storage, and preservation of data used or generated during a research project. Data management helps researchers organize, locate, preserve, and reuse their data.
Additionally, good data management allows you to:
- Save time and make efficient use of available resources: you will be able to find, understand, and use data whenever you need it.
- Facilitate the reuse of the data you have generated or collected: correct management and documentation of data throughout its life cycle keeps it accurate, complete, authentic, and reliable, so that it can be understood and used by other people.
- Comply with the requirements of funding agencies: more and more agencies require data management plans and/or the deposit of data in repositories as conditions for research funding.
- Protect and preserve data: by managing and depositing data in appropriate repositories, you can safeguard it over time, protecting your investment of time and resources and allowing it to serve new research and discoveries in the future.
Research data is “all that material that serves to certify the results of the research that is carried out, that has been recorded during it and that has been recognized by the scientific community” (Torres-Salinas; Robinson-García; Cabezas-Clavijo, 2012), that is, it is any information collected, used or generated in experimentation, observation, measurement, simulation, calculation, analysis, interpretation, study or any other inquiry process that supports and justifies the scientific contributions that are disseminated in research publications.
Research data comes in any format and on any medium, for example:
- Numerical files, spreadsheets, tables, etc.
- Text documents in different versions
- Images, graphics, audio files, video, etc.
- Software code or records, databases, etc.
- Geospatial data, georeferenced information
Joint Statement on Research Data from STM, DataCite and Crossref
In 2012, DataCite and STM drafted an initial joint statement on linking and citing research data.
The signatories of this statement recommend the following as best practices in research data sharing:
- When publishing their results, researchers deposit the related research data and results in a trusted data repository that assigns persistent identifiers (DOIs when available). Researchers link to research data using persistent identifiers.
- When using research data created by others, researchers provide attribution by citing the data sets in the references section using persistent identifiers.
- Data repositories facilitate the sharing of research results in a FAIR manner, including support for metadata quality and completeness.
- Editors establish appropriate data policies for journals, outlining how data will be shared along with the published article.
- Editors provide instructions for authors to include Data Citations with persistent identifiers in the references section of articles.
- Publishers include Data Citations and links to data in Data Availability Statements with persistent identifiers (DOIs when available) in the article metadata recorded in Crossref.
- In addition to Data Citations, Data Availability Statements (human and machine readable) are included in published articles where applicable.
- Repositories and publishers connect articles and data sets through persistent identifier connections in metadata and reference lists.
- Funders and research organizations provide researchers with guidance on open science practices, track compliance with open science policies where possible, and promote and incentivize researchers to openly share, cite, and link research data.
- Funders, policy-making institutions, publishers, and research organizations collaborate to align FAIR research data policies and guidelines.
- All stakeholders collaborate to develop tools, processes and incentives throughout the research cycle to facilitate the sharing of high-quality research data, making all steps in the process clear, easy and efficient for researchers through provision of support and guidance.
- Stakeholders responsible for research evaluation factor data sharing and data citation into their reward and recognition system structures.
The first phase of a research project involves designing and planning it. To do this, you must:
- Know the requirements and programs of the funding agencies
- Search for existing research data
- Prepare a Data Management Plan.
Other prior considerations:
- If your research involves working with humans, informed consent must be obtained.
- If you are involved in a collaborative research project with other academic institutions, industry partners or citizen science partners, you will need to ensure that your partners agree to the data sharing.
- Think about whether you are going to work with confidential personal or commercial data.
- Think about what systems or tools you will use to make data accessible and what people will need access to it.
During the project…
This is the phase of the project where the researcher organizes, documents, processes and stores the data.
In this phase you will need to:
- Update the Data Management Plan
- Organize and document data
- Process the data
- Store data for security and preservation
The description of the data must provide a context for its interpretation and use, since, unlike scientific publications, the data itself lacks this information. The goal is to make the data understandable and reusable.
The following information should be included:
- The context: history of the project, objectives and hypotheses.
- Origin of the data: if the data is generated within the project or if it is collected (in this case, indicate the source from which it was extracted).
- Collection methods, instruments used.
- Typology and format of data (observational, experimental, computational data, etc.)
- Description standards: what metadata standard to use.
- Structure of data files and relationships between files.
- Data validation, verification, cleaning and procedures carried out to ensure its quality.
- Changes made to the data over time since its original creation and identification of the different versions.
- Information about access, conditions of use or confidentiality.
- Names, labels and description of variables and values.
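The last item, the names, labels, and values of variables, is usually delivered as a data dictionary file alongside the data set. A minimal sketch in Python's standard library (the variable names, labels, and codes are invented for the example):

```python
import csv

# A data dictionary maps each short variable name to its full meaning
# and its value codes, including the agreed missing-value code.
dictionary = [
    {"variable": "subj_id", "label": "Subject identifier", "values": "integer"},
    {"variable": "sex", "label": "Sex of participant", "values": "1=female; 2=male"},
    {"variable": "score", "label": "Test score (points)", "values": "-99=missing"},
]

with open("data_dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["variable", "label", "values"])
    writer.writeheader()
    writer.writerows(dictionary)
```

Keeping the dictionary as a separate plain-text file means it can be deposited next to the data and read without special software.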
- README file template for a data set (Cornell University)
- README file template (Madroño consortium)
STRUCTURE OF A DATASET
The data must be clean and correctly structured and ordered:
A data set is structured if:
- Each variable forms a column
- Each observation forms a row
- Each cell contains a single value
Some recommendations:
- Structure the data in tidy (long) format, where each observation is a row, rather than in wide (horizontal) format.
- Use columns for variables; keep their names short (up to 8 characters) without spaces or special characters.
- Avoid encoding variables with free-text values; prefer numeric codes.
- Put a single value in each cell.
- If a value is not available, use an explicit missing-value code.
- Provide code tables that collect all the encodings and labels used.
- Use a data dictionary, or a separate list of the short variable names and their full meaning.
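The recommendations above can be illustrated with a small sketch using pandas (the variable names and codes are invented; -99 is assumed to be the documented missing-value code):

```python
import pandas as pd

# Tidy structure: each variable is a column, each observation a row,
# each cell holds a single value.
# "sex" is encoded numerically (1 = female, 2 = male), and -99 is the
# missing-value code documented in the data dictionary.
df = pd.DataFrame(
    {
        "subj_id": [1, 2, 3],
        "sex": [1, 2, 1],
        "score": [42.0, -99, 37.5],
    }
)

# Replace the missing-value code with a proper NA before analysis.
df["score"] = df["score"].replace(-99, pd.NA)
print(df)
```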
DATA TIDYING
Tidy data are the result of a process called data tidying, one of the important cleaning steps in data processing.
Tidy data sets have a structure that makes work easier; they are easy to manipulate, model, and visualize. "'Tidy' data sets are arranged in such a way that each variable is a column and each observation (or case) is a row" (Wikipedia).
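A common tidying step is reshaping a wide table, in which one variable is spread across several column headers, into long (tidy) form. A sketch with pandas (the column names and values are invented):

```python
import pandas as pd

# Wide (non-tidy): the "year" variable is spread across column headers.
wide = pd.DataFrame(
    {
        "country": ["ES", "PT"],
        "2020": [10, 7],
        "2021": [12, 9],
    }
)

# melt() gathers those columns into two tidy columns: year and value.
tidy = wide.melt(id_vars="country", var_name="year", value_name="value")
print(tidy)
```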
- Practical Guide to Publishing Tabular Data in CSV Files (content prepared by Carlos de la Fuente García, open data expert, within the framework of the Aporta Initiative of the Ministry of Economic Affairs and Digital Transformation)
- Data Quality Review Checklist
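Common conventions when publishing tabular data as CSV include UTF-8 encoding, a single header row, and comma delimiters. A minimal sketch with Python's standard library (the file name and fields are invented, and this does not reproduce the guide above):

```python
import csv

# Write a small tidy table to CSV: UTF-8, header row, comma delimiter.
rows = [
    {"subj_id": 1, "sex": 1, "score": 42.0},
    {"subj_id": 2, "sex": 2, "score": 37.5},
]

with open("scores.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["subj_id", "sex", "score"])
    writer.writeheader()
    writer.writerows(rows)
```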
For research data during the life of the project, the following is recommended:
- Store data in readable formats for the long term.
- Check the files from time to time.
- Clearly organize and label stored data so that it is easily findable and accessible.
- Take into consideration the physical degradation of optical and magnetic media in case it is necessary to copy or migrate data.
- Store data on different media, even for a short-term project, for example on the hard drive and on CD.
- Create digital versions of paper documentation, in PDF/A Format for long-term preservation and storage.
- Take into account support conservation factors, such as changes in temperature, relative humidity, light, etc.
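Periodic file checks ("fixity checks") are often implemented with checksums: record a hash for each file when it is stored, then recompute it later to detect silent corruption. A minimal sketch with Python's standard library (the file name and content are invented):

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Record checksums at deposit time...
data_file = Path("dataset.csv")
data_file.write_text("subj_id,score\n1,42.0\n", encoding="utf-8")
manifest = {data_file.name: sha256sum(data_file)}

# ...and verify them later to confirm the file is intact.
assert sha256sum(data_file) == manifest[data_file.name]
```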
Storage media are complementary to each other and some of the most common options are:
- Personal or project data storage (using USB drives, computer hard drives or network drives within the institution), recommended only for use in the course of research.
- Institutional repository (Dehesa or that of your university).
- National data archiving services.
- Cloud data storage.
- Repositories (RIO, Zenodo, disciplinary repositories).
The repository selected to store the data must guarantee its long-term preservation, and ease of recovery and access.
When choosing a repository, some aspects must be taken into account:
According to the OpenAIRE Research Data Management Briefing Paper, data should be deposited in a data repository according to the following order of preference:
- Consolidated thematic data repository for the discipline (findable via Re3data, for example): in the Social Sciences, the World Values Survey; in the History of Medicine, the Wellcome Library.
- Institutional data repository: Digital CSIC, Harvard Dataverse, or Dehesa (the repository of the University of Extremadura).
- Multidisciplinary data repository: Zenodo, Dryad, Dataverse, Figshare, Mendeley Data.
- Other data repositories.
- Together with the scientific publications (ODYSSEY).
REPOSITORY FINDERS
Repository search engines can also help you:
- Re3data: global registry of research data repositories from different academic disciplines, managed and maintained by DataCite.
- Fairsharing: search engine for standards, data repositories, and open access policies across all disciplines.
- Repository Finder: allows researchers to search for the most suitable repository in which to deposit data.
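Multidisciplinary repositories such as Zenodo also accept deposits programmatically through a REST API. The sketch below only builds the metadata payload; the endpoint and field names follow the public Zenodo API documentation, but check the current documentation (and use a real access token) before relying on this:

```python
import json

# Deposition endpoint of the Zenodo REST API. The usual workflow is:
#   1. POST an empty deposition to ZENODO_API (with your access token),
#   2. upload the data files to the new deposition,
#   3. PUT the metadata payload built below,
#   4. publish the deposition, which mints a DOI.
ZENODO_API = "https://zenodo.org/api/deposit/depositions"

def build_metadata(title, description, creators):
    """Build the metadata payload for a dataset deposition."""
    return {
        "metadata": {
            "title": title,
            "upload_type": "dataset",  # research data, not a publication
            "description": description,
            "creators": [{"name": n} for n in creators],  # "Family, Given"
        }
    }

payload = build_metadata(
    "Example survey dataset",
    "Anonymized survey responses collected during the project.",
    ["Doe, Jane"],
)
print(json.dumps(payload, indent=2))
```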
There may be exceptions to open dissemination, based on reasons of confidentiality, privacy, security, industrial exploitation, etc. (H2020, Work Programme, Annexes, L Conditions related to open access to research data).
There are some reasons why certain types of data cannot and/or should not be shared, either in whole or in part, for example:
- When the data constitutes or contains sensitive information. There may be national and even institutional regulations on data protection that need to be taken into account. In these cases, precautions must be taken to anonymize the data so that it can be accessed and reused without breaching the ethical use of the information.
- When the data is not the property of those who collected it, or when it is shared by more than one party, whether people or institutions. In these cases, you must have the necessary permissions from the owners to share and/or reuse the data.
- When the data has a financial value associated with its intellectual property, which makes it unwise to share it early. Before sharing, you must check whether these types of limits exist and, in each case, determine how long the restrictions will apply.
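For the first case, sensitive data, a common precaution is pseudonymization: direct identifiers are removed or replaced with irreversible codes before sharing. A minimal sketch with Python's standard library (the field names are invented, and real anonymization usually requires more than this, e.g. reviewing quasi-identifiers such as age plus location):

```python
import hashlib

def pseudonymize(record: dict, id_field: str, salt: str) -> dict:
    """Drop the direct identifier and replace it with a salted
    SHA-256 hash, so records can still be linked but not re-identified."""
    out = dict(record)
    raw = (salt + str(out.pop(id_field))).encode("utf-8")
    out["pseudo_id"] = hashlib.sha256(raw).hexdigest()[:12]
    return out

record = {"name": "Jane Doe", "age": 34, "score": 42.0}
shared = pseudonymize(record, "name", salt="project-secret")
print(shared)
```

The salt must be kept private by the project; anyone holding it could recompute the hashes from a list of names.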