Big Data

How to best handle data storage and archiving after the project is finished?

What is Data Collection?

Data collection is the procedure of collecting, measuring, and analyzing accurate insights for research using standard validated techniques.

To collect data, we must first identify what information we need and how we will collect it. We can also evaluate a hypothesis based on collected data. In most cases, data collection is the primary and most important step for research. The approach to data collection is different for different fields of study, depending on the required information.

Research Data Management (RDM) is present in all phases of research and encompasses the collection, documentation, storage and preservation of data used or generated during a research project. Data management helps researchers organize, locate, preserve and reuse their data.

Additionally, data management allows you to:

  • Save time and make efficient use of available resources: you will be able to find, understand and use data whenever you need it.
  • Facilitate the reuse of the data you have generated or collected: correct management and documentation of data throughout its life cycle keeps it accurate, complete, authentic and reliable, so that other people can understand and use it.
  • Comply with the requirements of funding agencies: more and more agencies require data management plans and/or the deposit of data in repositories as a condition of research funding.
  • Protect and preserve data: by managing and depositing data in appropriate repositories, you safeguard it over time, protecting your investment of time and resources and allowing it to serve new research and discoveries in the future.

Research data is “all that material that serves to certify the results of the research that is carried out, that has been recorded during it and that has been recognized by the scientific community” (Torres-Salinas; Robinson-García; Cabezas-Clavijo, 2012). That is, it is any information collected, used or generated in experimentation, observation, measurement, simulation, calculation, analysis, interpretation, study or any other inquiry process that supports and justifies the scientific contributions disseminated in research publications.

Research data can come in any format and on any medium.

Joint Statement on Research Data from STM, DataCite and Crossref

In 2012, DataCite and STM drafted an initial joint statement on linking and citing research data. 

The signatories of this statement recommend the following as best practices in research data sharing:

  1. When publishing their results, researchers deposit the related research data and results in a trusted data repository that assigns persistent identifiers (DOIs when available). Researchers link to research data using persistent identifiers.
  2. When using research data created by others, researchers provide attribution by citing the data sets in the references section using persistent identifiers.
  3. Data repositories facilitate the sharing of research results in a FAIR manner, including support for metadata quality and completeness.
  4. Publishers establish appropriate data policies for journals, outlining how data will be shared along with the published article.
  5. Publishers establish instructions for authors to include Data Citations with persistent identifiers in the references section of articles.
  6. Publishers include Data Citations and links to data in Data Availability Statements with persistent identifiers (DOIs when available) in the article metadata recorded in Crossref.
  7. In addition to Data Citations, Data Availability Statements (human and machine readable) are included in published articles where applicable.
  8. Repositories and publishers connect articles and data sets through persistent identifier connections in metadata and reference lists.
  9. Funders and research organizations provide researchers with guidance on open science practices, track compliance with open science policies where possible, and promote and incentivize researchers to openly share, cite, and link research data.
  10. Funders, policy-making institutions, publishers, and research organizations collaborate to align FAIR research data policies and guidelines.
  11. All stakeholders collaborate to develop tools, processes and incentives throughout the research cycle to facilitate the sharing of high-quality research data, making all steps in the process clear, easy and efficient for researchers through provision of support and guidance.
  12. Stakeholders responsible for research evaluation factor data sharing and data citation into their reward and recognition system structures.
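
As an illustration of points 1 and 2 above, the following minimal sketch builds a reference-list entry for a data set from its metadata and turns the DOI into a resolvable link. The record fields, the citation layout and the example values are assumptions made for illustration; follow the citation style required by your journal or repository.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal descriptive metadata for a deposited data set (hypothetical fields)."""
    creators: list[str]
    year: int
    title: str
    repository: str
    doi: str  # bare DOI without the resolver prefix


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a link through the doi.org resolver."""
    return f"https://doi.org/{doi}"


def format_data_citation(rec: DatasetRecord) -> str:
    """Build an entry of the form: Creators (Year). Title [Data set]. Repository. Link."""
    creators = "; ".join(rec.creators)
    return (
        f"{creators} ({rec.year}). {rec.title} [Data set]. "
        f"{rec.repository}. {doi_url(rec.doi)}"
    )


# Placeholder values for illustration only; this is not a real data set or DOI.
example = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    year=2024,
    title="Example survey responses",
    repository="Example Repository",
    doi="10.1234/example.5678",
)
citation = format_data_citation(example)
print(citation)
```

Keeping the DOI as a separate field, rather than only inside the formatted string, makes it easier to reuse the same identifier in a Data Availability Statement and in article metadata.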

The first phase of an investigation requires designing and planning your project. To do this, you must:

  • Know the requirements and programs of the funding agencies
  • Search for existing research data
  • Prepare a Data Management Plan.

Other prior considerations:

  • If your research involves working with humans, informed consent must be obtained.
  • If you are involved in a collaborative research project with other academic institutions, industry partners or citizen science partners, you will need to ensure that your partners agree to the data sharing.
  • Think about whether you are going to work with confidential personal or commercial data.
  • Think about what systems or tools you will use to make data accessible and which people will need access to it.

During the project…

This is the phase of the project where the researcher organizes, documents, processes and stores the data.

This requires you to:

  • Update the Data Management Plan
  • Organize and document data
  • Process the data
  • Store data for security and preservation
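
One common way to support the last point is to record a checksum for every stored file, so that later copies or backups can be checked for silent changes. The sketch below is a minimal illustration using only the Python standard library; the function names and the demo file are hypothetical, and it does not replace the integrity checks a repository may run itself.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file under data_dir (relative path) to its checksum."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """List files that were added, removed or modified since the manifest was built."""
    current = build_manifest(data_dir)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))


# Demo on a temporary directory with a made-up file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_file = root / "measurements.csv"
    data_file.write_text("id,value\n1,3.2\n", encoding="utf-8")

    manifest = build_manifest(root)           # store this alongside the data
    unchanged = changed_files(root, manifest)

    data_file.write_text("id,value\n1,9.9\n", encoding="utf-8")  # simulate a change
    modified = changed_files(root, manifest)

print(unchanged, modified)
```

In practice the manifest would be saved as a file next to the data and checked again after each copy or migration.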

The description of the data must provide context for its interpretation and use, since the data itself, unlike a scientific publication, lacks this information. The aim is that others can understand and reuse it.

The following information should be included:

  • The context: history of the project, objectives and hypotheses.
  • Origin of the data: if the data is generated within the project or if it is collected (in this case, indicate the source from which it was extracted).
  • Collection methods, instruments used.
  • Typology and format of data (observational, experimental, computational data, etc.)
  • Description standards: what metadata standard to use.
  • Structure of data files and relationships between files.
  • Data validation, verification, cleaning and procedures carried out to ensure its quality.
  • Changes made to the data over time since its original creation and identification of the different versions.
  • Information about access, conditions of use or confidentiality.
  • Names, labels and description of variables and values.
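
The checklist above can also be kept in machine-readable form next to the data. The following sketch is one possible layout, not a metadata standard: the class names, field names and example values are all assumptions chosen to mirror the list, and the helper simply reports which items are still empty.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class VariableDescription:
    name: str        # short column name as used in the data file
    label: str       # full meaning of the variable
    unit: str = ""   # unit of measurement, if any


@dataclass
class DatasetDescription:
    """One field per item of the documentation checklist (hypothetical layout)."""
    context: str
    origin: str
    collection_methods: str
    data_type: str
    metadata_standard: str
    file_structure: str
    quality_procedures: str
    version_history: str
    access_conditions: str
    variables: list[VariableDescription] = field(default_factory=list)


def missing_fields(description: DatasetDescription) -> list[str]:
    """Return the names of checklist items that have been left empty."""
    return [name for name, value in asdict(description).items() if value in ("", [])]


# Made-up example; the metadata standard has deliberately been left blank.
description = DatasetDescription(
    context="Pilot study on river temperature, 2023",
    origin="Generated within the project",
    collection_methods="Hourly readings from fixed probes",
    data_type="Observational, CSV files",
    metadata_standard="",
    file_structure="One CSV file per site; sites.csv lists the site codes",
    quality_procedures="Range checks and removal of duplicate readings",
    version_history="v1.0 initial release",
    access_conditions="Open, no confidentiality restrictions",
    variables=[VariableDescription(name="temp_c", label="Water temperature", unit="degrees Celsius")],
)

readme_json = json.dumps(asdict(description), indent=2)
print(missing_fields(description))
```

Running the check before deposit is a simple way to notice that, in this example, the metadata standard has not yet been recorded.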

STRUCTURE OF A DATASET

The data must be clean, correctly structured and ordered.

A data set is structured if:

  • Each variable forms a column
  • Each observation forms a row
  • Each cell holds a single measurement

Some recommendations:

  • Structure the data in tidy (long, vertical) format, i.e. each value is a row, rather than in non-tidy (wide, horizontal) format.
  • Use columns for variables, and keep their names to at most 8 characters, without spaces or special characters.
  • Avoid text values to encode variables; it is better to encode them with numbers.
  • Put a single value in each cell.
  • If a value is not available, enter a missing value code and provide the list of codes used.
  • Provide code tables that collect all the data encodings and denominations used.
  • Use a data dictionary or a separate list of the short variable names and their full meaning.
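
To show how short variable names, numeric codes, missing value codes and a data dictionary fit together, here is a small sketch using only the Python standard library. The variable names, the code table and the missing value code `-999` are invented for the example.

```python
import csv
import io

# Hypothetical missing value code; state the code you actually use in your documentation.
MISSING_CODE = "-999"

# Data dictionary: short variable name -> full meaning and, where relevant, its code table.
DATA_DICTIONARY = {
    "id": {"label": "Participant identifier", "codes": None},
    "sex": {"label": "Sex of participant", "codes": {"1": "female", "2": "male"}},
    "age_yrs": {"label": "Age in years", "codes": None},
}

# Coded data as it would be stored: one variable per column, one value per cell.
RAW_CSV = """id,sex,age_yrs
1,1,34
2,2,-999
3,-999,51
"""


def decode_row(row: dict[str, str]) -> dict[str, object]:
    """Replace numeric codes by their meaning and missing value codes by None."""
    decoded: dict[str, object] = {}
    for name, value in row.items():
        codes = DATA_DICTIONARY[name]["codes"]
        if value == MISSING_CODE:
            decoded[name] = None
        elif codes is not None:
            decoded[name] = codes[value]
        else:
            decoded[name] = value
    return decoded


rows = [decode_row(r) for r in csv.DictReader(io.StringIO(RAW_CSV))]
for r in rows:
    print(r)
```

Because the dictionary travels with the data, a reader who only has the coded file can still recover what each column and code means.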

DATA SORTING

Ordered data, or “tidy data”, is obtained through a process called “data tidying” or data ordering. It is one of the important cleaning steps in big data processing.

Ordered data sets have a structure that makes work easier: they are easy to manipulate, model and visualize. “Tidy data sets are arranged in such a way that each variable is a column and each observation (or case) is a row” (Wikipedia).
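
The following sketch illustrates the tidying step on a made-up table. In the wide version the years are spread across column headers; in the tidy version, site, year and count are each a column and every site-year observation is a row. The function and column names are assumptions for the example.

```python
def wide_to_long(rows, id_column, variable_name, value_name):
    """Turn every non-identifier column of a wide table into its own row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return long_rows


# Wide (non-tidy) table: the column headers 2021 and 2022 are values, not variables.
wide = [
    {"site": "A", "2021": 12, "2022": 15},
    {"site": "B", "2021": 7, "2022": 9},
]

tidy = wide_to_long(wide, id_column="site", variable_name="year", value_name="count")
for row in tidy:
    print(row)
```

Data analysis libraries usually offer an equivalent reshaping operation; the plain-Python version is used here only to make the transformation explicit.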

There may be exceptions to open dissemination, based on reasons of confidentiality, privacy, security, industrial exploitation, etc. (H2020, Work Programme, Annexes, L Conditions related to open access to research data).

There are some reasons why certain types of data cannot and/or should not be shared, either in whole or in part, for example:

  • When the data constitutes or contains sensitive information. There may be national and even institutional regulations on data protection that will need to be taken into account. In these cases, precautions must be taken to anonymize the data so that it can be accessed and reused without compromising the ethical use of the information.

  • When the data is not the property of those who collected it or when it is shared by more than one party, be they people or institutions. In these cases, you must have the necessary permissions from the owners to share and/or reuse the data.

  • When the data has a financial value associated with its intellectual property, which makes it unwise to share the data early. Before sharing it, you must verify whether these types of limits exist and, according to each case, determine the time that must pass before these restrictions cease to apply.
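
For the first case, a frequent first step before sharing is to drop direct identifiers and replace the participant key with a keyed hash. The sketch below is a minimal illustration under stated assumptions: the column names, records and secret key are invented, and this is pseudonymization rather than full anonymization, since the remaining variables may still allow re-identification and must be reviewed against the applicable regulations.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a value using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability


def prepare_for_sharing(records, direct_identifiers, key_column, secret_key):
    """Drop direct identifiers and replace the key column by its pseudonym."""
    shared = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k not in direct_identifiers}
        out[key_column] = pseudonymize(str(rec[key_column]), secret_key)
        shared.append(out)
    return shared


# Made-up records; the key must be kept private and never deposited with the data.
records = [
    {"pid": "P001", "name": "Jane Doe", "email": "jane@example.org", "score": 7},
    {"pid": "P001", "name": "Jane Doe", "email": "jane@example.org", "score": 9},
    {"pid": "P002", "name": "John Roe", "email": "john@example.org", "score": 5},
]
shared = prepare_for_sharing(
    records,
    direct_identifiers={"name", "email"},
    key_column="pid",
    secret_key=b"keep-this-key-private",
)
for row in shared:
    print(row)
```

Using the same key for the whole data set keeps repeated measurements of one participant linked, while the original identifier cannot be read from the shared file.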
