What are the ethical considerations in your data collection process?

Data collection

Data collection is the process of gathering and measuring information on variables of interest in an established, systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes. The data can be either qualitative or quantitative. The data collection component of research is common to all fields of study, including the physical and social sciences, the humanities, and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same.

The importance of ensuring accurate and appropriate data collection

Regardless of the field of study or preference for defining data (quantitative, qualitative), accurate data collection is essential to maintaining the integrity of research. Both the selection of appropriate data collection instruments (existing, modified, or newly developed) and clearly delineated instructions for their correct use reduce the likelihood of errors occurring.

Consequences of improperly collected data include:

  • inability to answer research questions accurately
  • inability to repeat and validate the study
  • distorted findings resulting in wasted resources
  • misleading other researchers to pursue fruitless avenues of investigation
  • compromising decisions for public policy
  • causing harm to human participants and animal subjects

While the degree of impact from faulty data collection may vary by discipline and the nature of investigation, there is the potential to cause disproportionate harm when these research results are used to support public policy recommendations.

Issues related to maintaining integrity of data collection:

The primary rationale for preserving data integrity is to support the detection of errors in the data collection process, whether they are made intentionally (deliberate falsifications) or not (systematic or random errors).

Most, Craddick, Crawford, Redican, Rhodes, Rukenbrod, and Laws (2003) describe ‘quality assurance’ and ‘quality control’ as two approaches that can preserve data integrity and ensure the scientific validity of study results. Each approach is implemented at different points in the research timeline (Whitney, Lind, Wahl, 1998):

  1. Quality assurance – activities that take place before data collection begins
  2. Quality control – activities that take place during and after data collection

Quality Assurance

Since quality assurance precedes data collection, its main focus is ‘prevention’ (i.e., forestalling problems with data collection). Prevention is the most cost-effective activity to ensure the integrity of data collection. This proactive measure is best demonstrated by the standardization of protocol developed in a comprehensive and detailed procedures manual for data collection. Poorly written manuals increase the risk of failing to identify problems and errors early in the research endeavor. These failures may be demonstrated in a number of ways:

  • Uncertainty about the timing, methods, and identity of the person(s) responsible for reviewing data
  • Partial listing of items to be collected
  • Vague description of data collection instruments to be used in lieu of rigorous step-by-step instructions on administering tests
  • Failure to identify specific content and strategies for training or retraining staff members responsible for data collection
  • Obscure instructions for using, making adjustments to, and calibrating data collection equipment (if appropriate)
  • No identified mechanism to document changes in procedures that may evolve over the course of the investigation

The first part of our data ethics series examines the importance of consent, confidentiality, and intent when gathering data.

When collecting personal information, the GDPR principles state that organizations must act transparently and with consent, collecting data only for the explicit purpose for which it’s needed. They also put strong legal protection on sensitive, identifying information, often referred to as personally identifiable information (PII).

When GDPR was introduced, organizations were quick to meet this legislation’s requirements, with the threat of serious financial repercussions if they failed to do so. But there’s still more they can do to serve consumers ethically and responsibly during data collection.

Before you begin collecting data, have you considered..?

1) Getting consent to collect information

Seeking consent is the most appropriate way to legally collect information, while giving customers genuine control over their data.

While consent isn’t always required (such as in cases of legitimate interest and/or legal obligation), the GDPR suggests that consent be given to collect data for an explicit and stated purpose. Even without consent there still needs to be clear and comprehensive information provided about how personal information is used.

Unfortunately, some companies also resort to manipulative user agreements to get the consent they need, but it is not always consent a participant is happy to give. The value of consent is diminished when it becomes a condition of service.

Ask yourself:

  • Do you have permission from users or participants to collect their data?
  • Have they been made aware that their involvement is voluntary?
  • Is it clear that participants are free to withdraw from any active data collection program at any point, without pressure or fear of retaliation?
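
The questions above can be turned into a simple check that runs before any data is stored. The following is a minimal sketch only: the ConsentRecord structure and its field names are hypothetical, chosen for illustration, and are not taken from the GDPR or from any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConsentRecord:
    # Hypothetical fields, for illustration only.
    participant_id: str
    purpose: str                         # the explicit purpose consent was given for
    consented: bool                      # True only if the participant opted in
    withdrawn_on: Optional[str] = None   # ISO date of withdrawal, if any


def may_collect(record: ConsentRecord, purpose: str) -> bool:
    """Return True only when consent is active and matches the stated purpose."""
    return (
        record.consented
        and record.withdrawn_on is None
        and record.purpose == purpose
    )


records = [
    ConsentRecord("p1", "product-survey", True),
    ConsentRecord("p2", "product-survey", True, withdrawn_on="2024-03-01"),
    ConsentRecord("p3", "marketing", True),   # consent given for another purpose
    ConsentRecord("p4", "product-survey", False),
]

allowed = [r.participant_id for r in records if may_collect(r, "product-survey")]
print(allowed)
```

Only the first participant passes here: the others withdrew, consented to a different purpose, or never opted in. Treating withdrawal as a first-class field makes it harder to keep collecting by accident after someone has opted out.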

2) Protecting users’ confidentiality and anonymity when collecting data

Customers will often opt in to data collection under the assumption that the information collected remains confidential and any published findings are anonymized. If you do need to break confidentiality at any point (or suspect that you will do in future) then make it clear at the start of the process.

Where possible, avoid collecting personally identifiable information (PII). Good practice might be to design your data collection methods in a way that they can’t be reverse engineered to reveal subjects. However, it is also possible to identify people from merging separate datasets with just a few personal pieces of information about them.

Ask yourself: 

  • Do you really need to collect PII at all?
  • If yes, have you taken steps to de-identify a dataset by removing all PII data before analyzing or sharing the insights?
  • Have you considered how different data points could be used in conjunction to reverse engineer identity or identifying characteristics?
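
The last two questions can be checked roughly in code. The sketch below assumes hypothetical column names (name, email, postcode, birth_year, gender). It first drops direct identifiers, then counts how many records share each combination of the remaining indirect identifiers. This is the idea behind k-anonymity, shown in simplified form rather than as a complete anonymization method.

```python
from collections import Counter

# Hypothetical column names, chosen for illustration.
PII_FIELDS = {"name", "email"}
QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def drop_pii(rows):
    """Return copies of the rows with direct identifiers removed."""
    return [{k: v for k, v in row.items() if k not in PII_FIELDS} for row in rows]


def smallest_group(rows, keys=QUASI_IDENTIFIERS):
    """Size of the smallest group of rows sharing the same quasi-identifier values.

    A result of 1 means at least one person is unique on these fields and
    could be re-identified by merging with another dataset.
    """
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())


rows = [
    {"name": "A", "email": "a@example.com", "postcode": "AB1",
     "birth_year": 1990, "gender": "F", "score": 7},
    {"name": "B", "email": "b@example.com", "postcode": "AB1",
     "birth_year": 1990, "gender": "F", "score": 4},
    {"name": "C", "email": "c@example.com", "postcode": "XY9",
     "birth_year": 1975, "gender": "M", "score": 9},
]

clean = drop_pii(rows)
print(sorted(clean[0]))
print(smallest_group(clean))
```

In this sample the third record is the only one with its combination of postcode, birth year, and gender, so removing names and emails alone would not protect that person. Coarsening the fields (for example, using age bands instead of birth years) is one common way to raise the smallest group size.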

3) What do you intend to do with the data you’re collecting?

While it can be hard to know the purpose or value of data in advance, the GDPR supports the practice of purpose limitation. This means organizations shouldn’t operate with the intention of gathering as much as they can, to be used for an undefined purpose at an undetermined point in the future. Additionally, under the GDPR’s storage limitation principle, personal data should be kept no longer than necessary for its purpose, so there will be some information you cannot retain beyond a set period (for example, 12 months).

Minimum viable collection is a strategy which relates to the issues of anonymity and intention. This method encourages organizations to only collect the data they absolutely need to ensure a result they want or a trend they aim to understand. This is sometimes referred to as the data minimization principle.

In practice it can be difficult to implement, as it’s not always possible to know every purpose in advance. Collecting responsibly means thinking critically about each data point you plan to collect and dropping any that you cannot tie to a stated purpose.
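
One way to put minimum viable collection into practice is to require a written purpose for every field and discard anything without one. The sketch below assumes a hypothetical FIELD_PURPOSES mapping and made-up survey fields; it illustrates the idea and is not a compliance tool.

```python
# Hypothetical mapping from each collected field to the purpose that justifies it.
FIELD_PURPOSES = {
    "age_band": "segment satisfaction results by age group",
    "satisfaction": "measure satisfaction with the service",
}


def minimize(submission: dict) -> dict:
    """Keep only fields that have a documented purpose; drop everything else."""
    return {k: v for k, v in submission.items() if k in FIELD_PURPOSES}


raw = {
    "age_band": "25-34",
    "satisfaction": 4,
    "home_address": "1 Example Street",  # no stated purpose, so not stored
    "device_id": "abc-123",              # no stated purpose, so not stored
}

stored = minimize(raw)
dropped = sorted(set(raw) - set(stored))
print(stored)
print(dropped)
```

Because the allowlist doubles as documentation, adding a new field forces someone to write down why it is needed before it can be stored.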

Quality Control

While quality control activities (detection/monitoring and action) occur during and after data collection, the details should be carefully documented in the procedures manual. A clearly defined communication structure is a necessary pre-condition for establishing monitoring systems. There should not be any uncertainty about the flow of information between principal investigators and staff members following the detection of errors in data collection. A poorly developed communication structure encourages lax monitoring and limits opportunities for detecting errors.

Detection or monitoring can take the form of direct staff observation during site visits, conference calls, or regular and frequent reviews of data reports to identify inconsistencies, extreme values, or invalid codes. While site visits may not be appropriate for all disciplines, failure to regularly audit records, whether quantitative or qualitative, will make it difficult for investigators to verify that data collection is proceeding according to the procedures established in the manual. In addition, if the structure of communication is not clearly delineated in the procedures manual, transmission of any change in procedures to staff members can be compromised.
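
A routine review of data reports can be partly automated. The sketch below assumes a hypothetical codebook (the field names, allowed codes, and age range are invented for illustration) and flags the three kinds of problems mentioned above: invalid codes, extreme values, and inconsistencies between fields.

```python
# Hypothetical codebook: allowed codes and plausible ranges for each field.
VALID_CODES = {"smoker": {0, 1, 9}}   # 9 = "not answered" in this sketch
VALID_RANGES = {"age": (18, 99)}


def review(records):
    """Return a list of (record_id, field, problem) tuples for follow-up."""
    issues = []
    for rec in records:
        for field, codes in VALID_CODES.items():
            if rec.get(field) not in codes:
                issues.append((rec["id"], field, "invalid code"))
        for field, (low, high) in VALID_RANGES.items():
            value = rec.get(field)
            if value is None or not low <= value <= high:
                issues.append((rec["id"], field, "extreme or missing value"))
        # Inconsistency: a non-smoker should not report cigarettes per day.
        if rec.get("smoker") == 0 and rec.get("cigs_per_day", 0) > 0:
            issues.append((rec["id"], "cigs_per_day", "inconsistent with smoker"))
    return issues


records = [
    {"id": 1, "age": 34, "smoker": 1, "cigs_per_day": 10},
    {"id": 2, "age": 240, "smoker": 0, "cigs_per_day": 0},
    {"id": 3, "age": 51, "smoker": 3, "cigs_per_day": 0},
    {"id": 4, "age": 27, "smoker": 0, "cigs_per_day": 5},
]

issues = review(records)
for issue in issues:
    print(issue)
```

A check like this only surfaces candidates for follow-up. Deciding whether a flagged value is a genuine error still requires going back to the source record or the staff member who collected it.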

Quality control also identifies the required responses, or ‘actions’, necessary to correct faulty data collection practices and minimize future occurrences. These actions are less likely to occur if data collection procedures are vaguely written and the necessary steps to minimize recurrence are not implemented through feedback and education (Knatterud et al., 1998).

Examples of data collection problems that require prompt action include:
  • errors in individual data items
  • systematic errors
  • violation of protocol
  • problems with individual staff or site performance
  • fraud or scientific misconduct

In the social/behavioral sciences, where primary data collection involves human subjects, researchers are taught to incorporate one or more secondary measures that can be used to verify the quality of the information being collected from the human subject. For example, a researcher conducting a survey might be interested in gaining better insight into the occurrence of risky behaviors among young adults, as well as the social conditions that increase the likelihood and frequency of these risky behaviors.

Data analytics can have a significant impact on society. Data experts must consider the social and ethical implications of their analysis, avoiding harm to people or communities. Social responsibility and the search for the common good must guide decisions in the use of data.

Ethics in data analysis is essential to ensure responsible and reliable use of information. Privacy, fairness, transparency, and social impact are critical considerations that data experts must take into account in their work. By addressing ethical challenges and adopting strong ethical principles, we can fully harness the potential of data analytics for the benefit of society as a whole. Ethical data analysis allows us to move towards a future where technology and responsibility combine to achieve meaningful and sustainable results.
