How to best handle data storage and archiving after the project is finished?


What is Data Collection?

Data collection is the procedure of gathering, measuring, and analyzing accurate insights for research using standard, validated techniques.

To collect data, we must first identify what information we need and how we will collect it. We can also evaluate a hypothesis based on collected data. In most cases, data collection is the primary and most important step for research. The approach to data collection is different for different fields of study, depending on the required information.

Research Data Management (RDM) is present in all phases of research and encompasses the collection, documentation, storage and preservation of data used or generated during a research project. Data management helps researchers organize, locate, preserve and reuse their data.

Additionally, data management allows you to:

  • Save time and make efficient use of available resources: you will be able to find, understand and use data whenever you need it.
  • Facilitate the reuse of the data you have generated or collected: correct management and documentation of data throughout its life cycle will keep it accurate, complete, authentic and reliable. These attributes allow it to be understood and used by other people.
  • Comply with the requirements of funding agencies: more and more agencies require the presentation of data management plans and/or the deposit of data in repositories as requirements for research funding.
  • Protect and preserve data: by managing and depositing data in appropriate repositories, you can safely safeguard it over time, protecting your investment of time and resources and allowing it to serve new research and discoveries in the future.

Research data is “all that material that serves to certify the results of the research that is carried out, that has been recorded during it and that has been recognized by the scientific community” (Torres-Salinas; Robinson-García; Cabezas-Clavijo, 2012); that is, it is any information collected, used or generated in experimentation, observation, measurement, simulation, calculation, analysis, interpretation, study or any other inquiry process that supports and justifies the scientific contributions that are disseminated in research publications.

They come in any format and on any medium, for example:

  • Numerical files, spreadsheets, tables, etc.
  • Text documents in different versions
  • Images, graphics, audio files, video, etc.
  • Software code or records, databases, etc.
  • Geospatial data, georeferenced information

Joint Statement on Research Data from STM, DataCite and Crossref

In 2012, DataCite and STM drafted an initial joint statement on linking and citing research data. 

The signatories of this statement recommend the following as best practices in research data sharing:

  1. When publishing their results, researchers deposit the related research data and results in a trusted data repository that assigns persistent identifiers (DOIs when available). Researchers link to research data using persistent identifiers.
  2. When using research data created by others, researchers provide attribution by citing the data sets in the references section using persistent identifiers (an example follows this list).
  3. Data repositories facilitate the sharing of research results in a FAIR manner, including support for metadata quality and completeness.
  4. Publishers establish appropriate data policies for journals, outlining how data will be shared along with the published article.
  5. Publishers establish instructions for authors to include Data Citations with persistent identifiers in the references section of articles.
  6. Publishers include Data Citations and links to data in Data Availability Statements with persistent identifiers (DOIs when available) in the article metadata recorded in Crossref.
  7. In addition to Data Citations, Data Availability Statements (human and machine readable) are included in published articles where applicable.
  8. Repositories and publishers connect articles and data sets through persistent identifier connections in metadata and reference lists.
  9. Funders and research organizations provide researchers with guidance on open science practices, track compliance with open science policies where possible, and promote and incentivize researchers to openly share, cite, and link research data.
  10. Funders, policy-making institutions, publishers, and research organizations collaborate to align FAIR research data policies and guidelines.
  11. All stakeholders collaborate to develop tools, processes and incentives throughout the research cycle to facilitate the sharing of high-quality research data, making all steps in the process clear, easy and efficient for researchers through provision of support and guidance.
  12. Stakeholders responsible for research evaluation factor data sharing and data citation into their reward and recognition system structures.
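
As an illustration of recommendation 2, a data citation with a persistent identifier placed in the references section might look like this (author, year, title, repository and DOI are all hypothetical placeholders):

  Pérez, M. (2021). Coastal water temperature measurements 2015-2020 [Data set]. Zenodo. https://doi.org/10.1234/zenodo.example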


The first phase of a research project requires designing and planning. To do this, you must:

  • Know the requirements and programs of the funding agencies
  • Search for existing research data
  • Prepare a Data Management Plan.

Other prior considerations:

  • If your research involves working with humans, informed consent must be obtained.
  • If you are involved in a collaborative research project with other academic institutions, industry partners or citizen science partners, you will need to ensure that your partners agree to the data sharing.
  • Think about whether you are going to work with confidential personal or commercial data.
  • Think about what systems or tools you will use to make data accessible and what people will need access to it.

During the project…

This is the phase of the project where the researcher organizes, documents, processes and stores the data.

The following is required:

  • Update the Data Management Plan
  • Organize and document data
  • Process the data
  • Store data for security and preservation

The description of the data must provide a context for its interpretation and use, since the data itself, unlike scientific publications, lacks this information. The aim is for the data to be understood and reused.

The following information should be included:

  • The context: history of the project, objectives and hypotheses.
  • Origin of the data: if the data is generated within the project or if it is collected (in this case, indicate the source from which it was extracted).
  • Collection methods, instruments used.
  • Typology and format of data (observational, experimental, computational data, etc.)
  • Description standards: which metadata standard to use (a minimal sketch follows this list).
  • Structure of data files and relationships between files.
  • Data validation, verification, cleaning and procedures carried out to ensure its quality.
  • Changes made to the data over time since its original creation and identification of the different versions.
  • Information about access, conditions of use or confidentiality.
  • Names, labels and description of variables and values.
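
As a minimal sketch of what such a description can look like in practice, the following Python dictionary loosely follows the required fields of the DataCite metadata schema; every value, including the DOI, is an illustrative placeholder:

# Minimal dataset description, loosely following DataCite's
# required fields; all values are hypothetical examples.
dataset_metadata = {
    "identifier": "10.1234/example-doi",      # persistent identifier (hypothetical)
    "creators": ["Pérez, M.", "García, L."],  # hypothetical researchers
    "title": "Coastal water temperature measurements 2015-2020",
    "publisher": "Example University Repository",
    "publicationYear": 2021,
    "resourceType": "Dataset",
    "version": "1.2",                         # identifies this revision of the data
    "rights": "CC BY 4.0",                    # conditions of access and use
    "description": "Hourly sea-surface temperatures from 12 coastal stations.",
}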


STRUCTURE OF A DATASET

The data must be clean and correctly structured and ordered:

A data set is structured if:

  • Each variable forms a column
  • Each observation forms a row
  • Each cell is a simple measurement

Some recommendations:

  • Structure the data in tidy (vertical) format, i.e. each observation is a row, rather than in non-tidy (horizontal) format.
  • Columns are used for variables, and their names can be up to 8 characters long, without spaces or special characters.
  • Avoid encoding variables with text values; it is better to encode them with numbers.
  • In each cell, a single value.
  • If a value is not available, use missing-value codes.
  • Provide code tables that collect all the data encodings and denominations used.
  • Use a data dictionary, or a separate list of these short variable names and their full meanings (see the sketch after this list).
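
The sketch below (in Python with pandas) illustrates these recommendations: short variable names, numeric coding, one value per cell, an explicit missing-value code, and an accompanying data dictionary. The survey variables, codes and the -9 missing-value code are all illustrative assumptions.

import pandas as pd

MISSING = -9  # missing-value code, documented in the data dictionary

data = pd.DataFrame({
    "subj_id": [1, 2, 3],        # subject identifier
    "sex":     [1, 2, MISSING],  # coded with numbers, not text
    "age_yr":  [34, 29, 41],     # one simple value per cell
})

# Data dictionary: short variable name -> full meaning and coding
data_dict = {
    "subj_id": "Subject identifier",
    "sex":     "Sex of subject (1 = male, 2 = female, -9 = missing)",
    "age_yr":  "Age in whole years (-9 = missing)",
}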

DATA TIDYING

Tidy data, or “ordered data”, is obtained from a process called “data tidying”, or data ordering. It is one of the important cleaning processes in big data processing.

Tidy data sets have a structure that makes work easier; they are easy to manipulate, model and visualize. “Tidy data sets are arranged in such a way that each variable is a column and each observation (or case) is a row” (Wikipedia).
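
As a minimal illustration of data tidying, the following pandas sketch reshapes a non-tidy (horizontal) table, with one column per year, into tidy (vertical) format, where each observation forms a row; the table contents are invented for the example:

import pandas as pd

# Non-tidy (wide) layout: the variable "year" is spread across columns
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [10, 20],
    "2020": [12, 24],
})

# Tidy (long) layout: each variable is a column, each observation a row
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
#   country  year  cases
# 0       A  2019     10
# 1       B  2019     20
# 2       A  2020     12
# 3       B  2020     24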

There may be exceptions to open dissemination, based on reasons of confidentiality, privacy, security, industrial exploitation, etc. (H2020, Work Programme, Annexes, L Conditions related to open access to research data).

There are some reasons why certain types of data cannot and/or should not be shared, either in whole or in part, for example:

  • When the data constitutes or contains sensitive information. There may be national and even institutional regulations on data protection that need to be taken into account. In these cases, precautions must be taken to anonymize the data and, in this way, make access and reuse possible without compromising the ethical use of the information (a minimal pseudonymization sketch follows this list).

  • When the data is not the property of those who collected it, or when it is shared by more than one party, be they people or institutions. In these cases, you must have the necessary permissions from the owners to share and/or reuse the data.

  • When the data has a financial value associated with its intellectual property, which makes it unwise to share it early. Before sharing the data, you must verify whether such limits exist and, in each case, determine the time that must pass before these restrictions cease to apply.
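
As a minimal pseudonymization sketch (not a complete anonymization procedure: quasi-identifiers such as birth dates can still re-identify people), the following Python example drops the direct identifier and replaces the record ID with a salted hash; the column names and the salt are illustrative assumptions:

import hashlib

import pandas as pd

SALT = "project-secret-salt"  # keep this out of the shared dataset

def pseudonymize(value: str) -> str:
    # Salted SHA-256 digest, truncated for readability
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

records = pd.DataFrame({
    "name":        ["Ada", "Grace"],  # direct identifier: must be removed
    "patient_id":  ["P001", "P002"],
    "measurement": [5.1, 4.8],
})

shared = records.drop(columns=["name"])
shared["patient_id"] = shared["patient_id"].map(pseudonymize)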

How will you best store and manage your collected data?


Collected data

Collected data is very important. Data collection is the process of collecting and measuring information about specific variables in an established system, which then allows relevant questions to be answered and results to be evaluated. Data collection is a component of research in all fields of study, including the physical and social sciences, humanities and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same. The goal of all data collection is to capture quality evidence that allows analysis to lead to the formulation of compelling and credible answers to the questions that have been posed.

What is meant by privacy?

The ‘right to privacy’ refers to being free from intrusions or disturbances in one’s private life or personal affairs. All research should outline strategies to protect the privacy of the subjects involved, as well as how the researcher will have access to the information.

The concepts of privacy and confidentiality are related but are not the same. Privacy refers to the individual or subject, while confidentiality refers to the actions of the researcher.

What does the management of stored information entail?

Manual data collection and analysis are time-consuming processes, so transforming data into insights is laborious and expensive without the support of automated tools.

The size and scope of the information analytics market is expanding at an increasing pace, from self-driving cars to security camera analytics and medical developments. In every industry, in every part of our lives, there is rapid change, and the speed at which transformations occur is increasing.

It is a constant evolution based on data. That information comes from all the new and old data collected, used to develop new types of knowledge.

The relevance that information management has acquired raises many questions about the requirements applicable to all data collected and information developed.

Data encryption

Data encryption is not a new concept: historically, we can point to the ciphers that Julius Caesar used to send his orders, or the famous Enigma machine that the Nazis used to encrypt communications in the Second World War.

Nowadays, data encryption is one of the most widely used security options for protecting personal and business data.

Data encryption works through mathematical algorithms that convert data into an unreadable form. Decrypting the encrypted data involves two keys: an internal key that only the person who encrypts the data knows, and an external key that the recipient of the data, or whoever is going to access it, must know.

Data encryption can be used to protect all types of documents, photos, videos, etc. It is a method that has many advantages for information security.
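
As a minimal sketch of an encrypt/decrypt round trip, the following Python example uses the third-party cryptography package (Fernet, a symmetric scheme with a single secret key; the two-key arrangement described above is closer to public-key cryptography, which the same package also supports):

from cryptography.fernet import Fernet

key = Fernet.generate_key()   # the secret key; store it securely
cipher = Fernet(key)

token = cipher.encrypt(b"confidential research data")
print(token)                  # unreadable without the key
print(cipher.decrypt(token))  # b'confidential research data'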

 


Advantages of data encryption

  • Useless data: if a storage device is lost or the data is stolen by a cybercriminal, encryption renders the data useless to anyone who does not have the permissions and the decryption key.
  • Improved reputation: companies that work with encrypted data offer both clients and suppliers a secure way to protect the confidentiality of their communications and data, displaying an image of professionalism and security.
  • Less exposure to sanctions: some companies or professionals are required by law to encrypt the data they handle, such as lawyers, data from police investigations, data containing information on acts of gender violence, etc. In short, all data that by its nature is sensitive to exposure requires mandatory encryption, and sanctions may follow if it is not encrypted.

Data storage 

There are many advantages associated with good management of stored information. Among the benefits of adequately covering the requirements of the data storage function and data management, the following two stand out:

  • Savings: the capacity of a server to store data is limited, so storing data without a structure, without a logical order and lacking guiding principles represents an increase in cost that could be avoided. On the contrary, when data storage responds to a plan and the decisions made are aligned with the business strategy, advantages are achieved that extend to all functions of the organization.
  • Increased productivity: when data has not been stored correctly, the system works more slowly. One strategy often used to avoid this is to divide data into active and inactive; the latter is kept compressed and in a different place, so that the system remains agile, but without the inactive data becoming completely unreachable, since it may sometimes be necessary to access it again (see the sketch after this list). Today, cloud services make it much easier to find the most appropriate data storage approach for each type of information.
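
A minimal sketch of that active/inactive split, assuming data lives in local directories and that 180 days without modification marks a file as inactive (both assumptions are illustrative):

import gzip
import shutil
import time
from pathlib import Path

ACTIVE_DIR = Path("data/active")    # hypothetical locations
ARCHIVE_DIR = Path("data/archive")
MAX_AGE_DAYS = 180                  # illustrative inactivity threshold

ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
cutoff = time.time() - MAX_AGE_DAYS * 86400

for f in ACTIVE_DIR.iterdir():
    if f.is_file() and f.stat().st_mtime < cutoff:
        # Compress the inactive file into the archive, then remove it
        # from the active store so the working system stays agile
        with f.open("rb") as src, gzip.open(ARCHIVE_DIR / (f.name + ".gz"), "wb") as dst:
            shutil.copyfileobj(src, dst)
        f.unlink()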

We must avoid each application deciding for itself how to save the data; to this end, the information management policy should be uniform for all applications and answer the following questions in each case (a minimal sketch follows this list):

  • How is the data stored?
  • When is the data saved?
  • What part of the data or information is collected?
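
A minimal sketch of such a uniform policy, expressed as a Python structure that every application would read instead of deciding on its own; all keys and values are illustrative assumptions:

STORAGE_POLICY = {
    # How the data is stored
    "how": {"format": "parquet", "encryption": "AES-256", "location": "shared-research-store"},
    # When the data is saved
    "when": {"trigger": "on_change", "backup": "daily"},
    # What part of the data or information is collected
    "what": {"include": ["raw_measurements", "processed_tables"], "exclude": ["temp_files"]},
}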

In short, a person in charge will be designated through Data Governance, which is in turn responsible for defining the standards and the way information is stored, since not all silos can be used.

This is how this function supports the common objective, through the procedures, planning, organization and control that are exercised transversally, always seeking to enhance the pragmatic side of the data.


Steps of data processing in research

Data processing in research has six steps. Let’s look at why they are an imperative component of research design.

  • Research data collection

Data collection is the main stage of the research process. This process can be carried out through various online and offline research techniques, and can be a mix of primary and secondary research methods.

The most used form of data collection is the research survey. However, with a mature market research platform, you can also collect qualitative data through focus groups, discussion modules, etc.

  • Research data preparation

The second step in research data management is data preparation: eliminating inconsistencies, removing bad or incomplete survey data, and cleaning the data to keep it consistent.

This step is essential, since insufficient data can make research studies completely useless and a waste of time and effort.

  • Research data entry

The next step is to enter the cleaned data into a digitally readable format consistent with organizational policies, research needs, etc. This step is essential as the data is entered into online systems that support research data management.

  • Research data processing

Once the data is entered into the systems, it is essential to process it to make sense of it. The information is processed based on needs, the  types of data  collected, the time available to process the data and many other factors. This is one of the most critical components of the research process. 

  • Research data output

This stage of processing research data is where it becomes knowledge. This stage allows business owners, stakeholders, and other staff to view data in the form of graphs, charts, reports, and other easy-to-consume formats. 

  • Storage of processed research data

The last stage of the data processing steps is storage. It is essential to keep data in a format that can be indexed and searched, and that creates a single source of truth. Knowledge management platforms are the most used for storing processed research data.
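
A minimal end-to-end sketch of the six steps above on a toy survey, using pandas; the fields, plausibility limits and file names are illustrative assumptions:

import pandas as pd

# 1. Collection: survey responses (toy data standing in for a real survey)
raw = pd.DataFrame({"age": [34, None, 29, 230], "score": [7, 8, 6, 9]})

# 2. Preparation: remove incomplete or implausible responses
clean = raw.dropna()
clean = clean[clean["age"].between(18, 99)]

# 3. Entry: persist the cleaned data in a machine-readable format
clean.to_csv("responses_clean.csv", index=False)

# 4. Processing: aggregate the data to make sense of it
summary = clean.agg({"age": "mean", "score": "mean"})

# 5. Output: present results in an easy-to-consume form
print(summary.round(1))

# 6. Storage: keep the processed output as an indexable single source of truth
summary.to_json("summary.json")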


Benefits of data processing in research

Data processing can make the difference between having actionable knowledge and not having it in the research process. Moreover, the processing of research data has some specific advantages and benefits:

  • Streamlined processing and management

When research data is processed, there is a high probability that this data will be used for multiple purposes now and in the future. Accurate data processing helps streamline the handling and management of research data.

  • Better decision making

With accurate data processing, the likelihood of making sense of data to arrive at faster and better decisions becomes possible. Thus, decisions are made based on data that tells stories rather than on a whim.

  • Democratization of knowledge

Data processing allows raw data to be converted into a format that works for multiple teams and personnel. Easy-to-consume data enables the democratization of knowledge.

  • Cost reduction and high return on investment

Data-backed decisions help brands and organizations act on evidence from credible sources. This helps reduce costs, as decisions are linked to data. The process also helps maintain a very high ROI on business decisions.

  • Easy to store, report and distribute

Processed data is easier to store and manage, since the raw data has been structured. This data can be consulted and accessed in the future, and can be called upon when necessary.

Examples of data processing in research 

Now that you know the nuances of data processing in research, let’s look at concrete examples that will help you understand its importance.

  • Example in a global SaaS brand

Software as a Service (SaaS) brands have a global footprint and an abundance of customers, often both B2B and B2C. Each brand and each customer has different problems that they hope to solve using the SaaS platform, and therefore different needs.

By conducting consumer research, the SaaS brand can understand their expectations, purchasing behaviors, and more. This also helps in profiling customers, aligning product or service improvements, and managing marketing spend based on the processed research data.

Other examples of this data processing include retail brands with a global footprint, with customers from various demographic groups, vehicle manufacturers and distributors with multiple dealerships, and more. Everyone who does market research needs to leverage data processing to make sense of it.  
