
What measures are best to address potential biases in the selection of your data sources?



What is a Data Source

In data analysis and business intelligence, a data source is the component that supplies raw data for analysis. It is a location or system that stores and manages data, and it can take many forms: traditional databases and spreadsheets, cloud-based platforms, APIs, and more.

Understanding the different types of data sources, along with their strengths and limitations, is crucial for making informed decisions and deriving actionable insights from data. In this article, we define what a data source is, examine data source types, and provide examples of how they can be used in different contexts.

In short, a data source is the physical or digital location where data is stored as a data table, data object, or another storage format. It is also where someone can access data for further use, such as analysis, processing, or visualization.

You typically deal with data sources when you need to transform your data. Suppose you run an eCommerce website on Shopify and want to analyze your sales to improve your store's performance. You decide to use Tableau for data processing. Because Tableau is a standalone tool, you must first fetch the data you need from Shopify. Shopify therefore acts as the data source for your subsequent data manipulation.

Bias, or systematic error, is the difference between what is actually being assessed and what is believed to be assessed (Casal & Mateu, 2003). Unlike random error, systematic error is not reduced by increasing the sample size (Department of Statistics, Universidad Carlos III de Madrid). Although biases matter greatly in the conduct of an investigation, no study is exempt from them; the essential thing is to know them in order to avoid, minimize, or correct them (Beaglehole et al., 2008).



The risk of bias is inherent to clinical research, where it is particularly frequent because such research works with variables that involve individual and population dimensions and are difficult to control. Biases also occur in the basic sciences, but there the experimental setting allows many or most of the variables to be controlled, so biases take on particular characteristics and are less complex to minimize.

From a statistical perspective, the value obtained when measuring a variable (XM) is made up of two parts: the true value (XV) and the measurement error (XE), so that XM = XV + XE. The measurement error is in turn composed of two parts: a random part and a systematic part, or bias, which can be a measurement, selection, or confounding bias (Dawson-Saunders et al., 1994).
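The decomposition above can be illustrated with a small simulation. This sketch is not taken from the source article: the true value, the size of the systematic error, and the spread of the random error are made-up numbers chosen only for illustration. It suggests why averaging more measurements shrinks the random component of the error but leaves the bias untouched.

```python
import random
import statistics

def simulate_measurements(n, true_value, bias, noise_sd, seed=0):
    """Return n simulated measurements XM = XV + XE.

    XE is split into a systematic part (bias, identical for every
    measurement) and a random part (Gaussian noise with mean 0).
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]

def mean_error(measurements, true_value):
    """Average deviation of the measurements from the true value."""
    return statistics.fmean(measurements) - true_value

TRUE_VALUE = 120.0  # hypothetical true value XV
NOISE_SD = 5.0      # hypothetical spread of the random error
BIAS = 3.0          # hypothetical systematic error

for n in (10, 1000, 100000):
    unbiased = simulate_measurements(n, TRUE_VALUE, 0.0, NOISE_SD)
    biased = simulate_measurements(n, TRUE_VALUE, BIAS, NOISE_SD)
    # The first error tends toward 0 as n grows; the second stays near BIAS.
    print(n, round(mean_error(unbiased, TRUE_VALUE), 3),
          round(mean_error(biased, TRUE_VALUE), 3))
```

With a large sample, the mean error of the unbiased series approaches zero, while the mean error of the biased series stays close to the assumed systematic error, however many measurements are added.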

This explanation clarifies the two fundamental characteristics of any measurement: accuracy (measurements close to the true value, that is, unbiased) and precision (repeated measurements of a phenomenon yield similar values) (Manterola, 2002).

The objective of this article is to describe the concepts needed to understand the importance of biases, the most frequent ones in clinical research, their association with the different types of research designs, and the strategies that allow them to be minimized and controlled.

A simple way to understand how bias can arise during research is to think about the three axes that dominate research: what will be observed or measured (the variable under study); who will observe or measure (the observer); and what will be used to observe or measure (the measuring instrument) (Tables II and III) (Beaglehole et al.).

1. From the variable(s) under study

Several sources of bias are associated with the variable under study, arising at the time of its observation, during the measurement of its magnitude, or in its subsequent classification (Manterola).

a) Periodicity: This corresponds to variability in the observation; that is, what is observed may follow an abnormal pattern over time, either because it is distributed uniformly over time or because it is concentrated in periods. Knowledge of this characteristic is essential for biological events with known cycles, such as the circadian rhythm, electroencephalographic waves, etc.

b) Observation conditions: Some events require special conditions in order to occur, such as particular environmental humidity and temperature, or respiratory and heart rates. These situations cannot be controlled and, if not adequately considered, can generate bias; this context is more typical of the basic sciences.

c) Nature of the measurement: Sometimes it is difficult to measure the magnitude or value of a variable, whether qualitative or quantitative. This may occur because the magnitude of the values is small (hormonal determinations) or because of the nature of the phenomenon under study (quality of life).

d) Errors in the classification of certain events: These may occur as a result of changes in the nomenclature used, which the researcher must take note of. Examples include neoplasm classification codes and the operational definition of obesity.

2. From the observer

The ability to observe an event of interest (EI) varies from one subject to another. Moreover, two individuals faced with the same stimulus may perceive it differently. Therefore, homogenizing the observation, by guaranteeing adequate conditions for its occurrence and an adequate observation methodology, helps minimize measurement errors.

Thus the error is inherent to the observer, independent of the measuring instrument used. This is why the different clinical research designs require strict conditions to homogenize the measurements made by different observers, such as using clear operational definitions or verifying that the subjects incorporated into the study meet these requirements.

3. From the measurement instrument(s)

Measuring biomedical phenomena with more than just the senses requires measurement instruments, which may in turn have technical limitations that prevent them from measuring exactly what is desired.

The limitations of measurement instruments apply both to “hard” devices and technology and to population exploration instruments such as surveys, questionnaires, scales, and others. For the latter, verification of their technical attributes is often neglected, even though they too are “measuring instruments”, since they have been designed to measure the occurrence of an EI. They must therefore be subject to the same considerations as any measuring instrument (Manterola).

These restrictions apply readily to diagnostic tests, in which there is always some probability of overdiagnosing subjects (false positives) or underdiagnosing them (false negatives), committing errors of a different nature in each case.

Frequently, it is necessary to design data collection instruments whose purpose, like that of diagnostic tests, is to separate the population according to the presence of some EI.

Thus, an instrument that lacks adequate sensitivity will identify a low proportion of the subjects with the EI (true positives). Conversely, screening instruments with low specificity will decrease the probability of correctly identifying subjects without the EI (true negatives).

For example, a questionnaire for a prevalence study of gastroesophageal reflux may include items that are inappropriate for detecting the problem in a certain group of subjects, altering its sensitivity. The same instrument, with an excessive number of items of little relevance to the problem, may lack adequate specificity to measure the EI.
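Sensitivity and specificity, as described above, can be computed from the four cells of a validation table. The following is a minimal sketch; the counts are hypothetical and do not come from any real questionnaire or from the source article.

```python
def sensitivity(true_pos, false_neg):
    """Proportion of subjects with the event of interest that are detected."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of subjects without the event of interest that are ruled out."""
    return true_neg / (true_neg + false_pos)

# Hypothetical validation counts for a screening questionnaire
counts = {"tp": 80, "fn": 20, "tn": 150, "fp": 50}

sens = sensitivity(counts["tp"], counts["fn"])  # 80 of 100 affected subjects
spec = specificity(counts["tn"], counts["fp"])  # 150 of 200 unaffected subjects
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# prints: sensitivity=0.80, specificity=0.75
```

In this invented example the instrument misses 20 of every 100 affected subjects (false negatives) and wrongly flags 50 of every 200 unaffected subjects (false positives).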


                     Cohort    Case-control    Cross-sectional    Ecological
Selection bias       Low       High            Medium             Not applicable
Recall bias          Low       High            High               Not applicable
Confounding bias     Low       Medium          Medium             High
Follow-up losses     High      Low             Not applicable     Not applicable
Time required        High      Medium          Medium             Low
Cost                 High      Medium          Medium             Low

Table III. Most common types of bias in observational studies.
MANTEROLA, C. & OTZEN, T. Biases in clinical research. Int. J. Morphol., 33(3):1156-1164, 2015.

Another way of classifying biases relates to the frequency with which they occur and the stage of the study in which they originate. In clinical research, the most frequent biases that affect the validity of a study can be classified into three categories: selection (generated during the selection or follow-up of the study population), information (originating during measurement processes in the study population), and confounding (occurring because the study groups cannot be compared).

1. Selection biases
This type of bias is particularly common in case-control studies, where events that occurred in the past can influence the probability of being selected for the study. It occurs when there is a systematic error in the procedures used to select the study subjects (Restrepo Sarmiento & Gómez-Restrepo, 2004). It therefore leads to an estimate of the effect that differs from the one obtainable for the target population.

It is due to systematic differences between the characteristics of the subjects selected for the study and those of the individuals who were not selected. Consider, for example, hospital cases and those excluded from them: subjects who die before arriving at the hospital because of the acute or more serious nature of their condition; subjects who are not sick enough to require admission to the hospital under study; or subjects excluded because of the cost of admission, the distance between the healthcare center and their home, etc.

Selection biases can occur in any type of study design; however, they occur most frequently in retrospective case series, case-control, cross-sectional, and survey studies. This type of bias prevents extrapolation of conclusions in studies carried out with volunteers drawn from a population without the EI. An example is the so-called Berkson bias, also called Berkson's fallacy or paradox, or admission or diagnostic bias, which is defined as the set of selective factors that lead to systematic differences in a case-control study with hospital cases.

It occurs when the combination of an exposure and the EI under study increases the risk of admission to a hospital, which leads to a systematically higher exposure rate among hospital cases than among controls. An example is the negative association observed between cancer and pulmonary tuberculosis, in which tuberculosis appeared to act as a protective factor against the development of cancer. This was explained by the low frequency of tuberculosis among those hospitalized for cancer, a fact that does not mean that the disease is less frequent among these subjects.
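How selecting subjects through hospital admission can create an association that does not exist in the population can be sketched with a simulation. This is a generic textbook-style illustration, not the analysis of the source article: the prevalences and admission probabilities are invented, and the two conditions are generated independently on purpose.

```python
import random

def odds_ratio(rows):
    """Odds ratio between two binary conditions from (cond_a, cond_b) pairs."""
    both = sum(1 for a, b in rows if a and b)
    only_a = sum(1 for a, b in rows if a and not b)
    only_b = sum(1 for a, b in rows if not a and b)
    neither = sum(1 for a, b in rows if not a and not b)
    return (both * neither) / (only_a * only_b)

def simulate_population(n=200_000, seed=1):
    """Return (population, hospitalized) lists of (cond_a, cond_b) pairs."""
    rng = random.Random(seed)
    population, hospitalized = [], []
    for _ in range(n):
        cond_a = rng.random() < 0.10  # hypothetical prevalence of condition A
        cond_b = rng.random() < 0.10  # condition B, independent of A
        # Hypothetical admission model: 5% baseline chance of admission,
        # and each condition independently adds a 50% chance.
        p_not_admitted = 0.95
        if cond_a:
            p_not_admitted *= 0.5
        if cond_b:
            p_not_admitted *= 0.5
        population.append((cond_a, cond_b))
        if rng.random() < 1.0 - p_not_admitted:
            hospitalized.append((cond_a, cond_b))
    return population, hospitalized

population, hospitalized = simulate_population()
print("odds ratio, whole population:", round(odds_ratio(population), 2))
print("odds ratio, hospitalized only:", round(odds_ratio(hospitalized), 2))
```

Under these assumptions the odds ratio in the whole population stays close to 1 (no association), while among hospitalized subjects it falls well below 1, so one condition appears to protect against the other purely because of how subjects entered the sample.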

Another subtype of selection bias is the so-called Neyman bias (prevalence or incidence bias), which occurs when the condition under study causes premature loss, through death, of the subjects affected by it. For example, suppose a group of 1000 subjects with high blood pressure (a risk factor for myocardial infarction) and 1000 non-hypertensive subjects are followed for 10 years, and a strong association exists between arterial hypertension and myocardial infarction. The association may nevertheless go undetected if subjects who die from myocardial infarction during follow-up are not incorporated in the analysis.
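The hypertension example can be made concrete with a small worked calculation. Only the two group sizes of 1000 come from the text; the numbers of infarctions and deaths below are invented, and the sketch further assumes that infarctions are more often fatal among hypertensive subjects.

```python
def risk_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)

GROUP_SIZE = 1000  # subjects per group, as in the example above

# Hypothetical 10-year counts (assumptions, not data)
mi_hypertensive, fatal_hypertensive = 200, 150
mi_normotensive, fatal_normotensive = 100, 20

# All subjects analysed, including those who died from the infarction
rr_all = risk_ratio(mi_hypertensive, GROUP_SIZE,
                    mi_normotensive, GROUP_SIZE)

# Only survivors analysed: fatal cases leave both numerator and denominator
rr_survivors = risk_ratio(
    mi_hypertensive - fatal_hypertensive, GROUP_SIZE - fatal_hypertensive,
    mi_normotensive - fatal_normotensive, GROUP_SIZE - fatal_normotensive,
)

print("risk ratio, all subjects:", round(rr_all, 2))
print("risk ratio, survivors only:", round(rr_survivors, 2))
```

With these invented counts the risk ratio is 2 when every subject is analysed but drops below 1 when the fatal cases are left out, which is the kind of distortion this bias describes.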

Another subtype of selection bias is the so-called non-response bias (self-selection or volunteer effect), which occurs when the degree of motivation of a subject who voluntarily participates in research differs significantly from that of other subjects, leading to over- or under-reporting.

Another that should be mentioned is membership (or belonging) bias, which occurs when the subjects under study include subgroups of individuals who share a particular attribute that is positively or negatively related to the variable under study. For example, the habits and lifestyles of surgeons may differ significantly from those of the general population, so that including a large number of such subjects in a study may produce findings conditioned by this factor.

Another is selection procedure bias, which occurs in some clinical trials (CT) in which the random assignment of subjects to the study groups is not respected (Manterola & Otzen, 2015).

Another type of selection bias is loss to follow-up bias, which can occur especially in cohort studies when subjects from one of the study cohorts are totally or partially lost (≥ 20%) and the pre-established follow-up cannot be completed, thus generating a relevant alteration in the results (Lazcano-Ponce et al., 2000; Manterola et al., 2013).
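The 20% figure mentioned above can be checked per cohort with a few lines of code. This is only a sketch: the cohort names and counts are hypothetical, and treating 20% as a fixed cut-off is a simplification of the text for illustration.

```python
def loss_to_follow_up(enrolled, completed):
    """Proportion of enrolled subjects who did not complete follow-up."""
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    return (enrolled - completed) / enrolled

def exceeds_threshold(enrolled, completed, threshold=0.20):
    """True when the proportion lost reaches the threshold (20% by default)."""
    return loss_to_follow_up(enrolled, completed) >= threshold

# Hypothetical cohorts: (subjects enrolled, subjects who completed follow-up)
cohorts = {"exposed": (500, 380), "unexposed": (500, 460)}

for name, (enrolled, completed) in cohorts.items():
    lost = loss_to_follow_up(enrolled, completed)
    flag = exceeds_threshold(enrolled, completed)
    print(f"{name}: {lost:.0%} lost, at or above threshold: {flag}")
```

Reporting the losses separately for each cohort matters because the bias described above arises when one cohort loses many more subjects than the other.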

2.  Measurement bias

This type of bias occurs when a defect in measuring the exposure or the outcome generates different information between the study groups being compared (precision). It is therefore due to errors made in obtaining the required information once the eligible subjects are part of the study sample (classification of subjects with and without the EI, or of exposed and non-exposed subjects).

In practice, it can appear as the incorrect classification of subjects, variables, or attributes into a category different from the one to which they should have been assigned. The probabilities of classification can be the same in all groups under study, called “non-differential incorrect classification” (the degree of misclassification)
