How to best address potential confounding variables in your data collection?


Data Collection

Data is a collection of facts, figures, objects, symbols, and events gathered from different sources. Organizations collect data with various data collection methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various points in time.

For instance, an organization must collect data on product demand, customer preferences, and competitors before launching a new product. If data is not collected beforehand, the newly launched product may fail for many reasons, such as low demand or a failure to meet customer needs.

Although data is a valuable asset for every organization, it does not serve any purpose until analyzed or processed to get the desired results.

Data collection methods are techniques and procedures used to gather information for research purposes. These methods can range from simple self-reported surveys to more complex experiments and can involve either quantitative or qualitative approaches to data gathering.

Some common data collection methods include surveys, interviews, observations, focus groups, experiments, and secondary data analysis. The data collected through these methods can then be analyzed and used to support or refute research hypotheses and draw conclusions about the study’s subject matter.

What is a confounding variable?

The concept of confounding is probably one of the most important in general epidemiology. First, because much of the work in this field consists precisely of trying to prevent confounding when designing research studies, or of controlling its effect when it appears in studies already carried out. Second, and specifically with regard to health professionals, because an adequate understanding of this phenomenon determines whether they can interpret, critically and correctly, the results of the many studies published in the scientific literature.

This review aims to explain, in a didactic manner and with the help of several examples, the concept of confounding, then to do the same with another important concept, effect modification (interaction), and finally to describe the differences between the two concepts.


Although some antecedents can be found in Francis Bacon, the first author to explicitly address the issue of confounding was the British philosopher and economist John Stuart Mill (1806-1873) [1]. When referring to the criteria necessary for establishing a causal relationship, Mill pointed out the need to ensure that no factor was present whose effects could be confused with those of the agent under study.

Before defining confounding, it is necessary to describe the counterfactual approach of biological models [2]. Given data on 15 newborns with neural tube defects (NTD) in a sample of 10,000 women with folic acid deficiency, we could ask whether the incidence of these malformations is due to the deficiency. The question is important because, if the answer is affirmative, we would have a simple solution to the problem: for example, the fortification of foods with folic acid.

To answer this question, it is necessary to compare this group of women with another group that had normal folic acid values. If the hypothesis that the deficiency increases the incidence of NTD in newborns is true, it would be logical to find a smaller number of newborns with these malformations, for example 5 cases with NTD, in another group of the same size. With these figures we would obtain a relative risk of 3 (15 cases divided by 5 cases, with equal group sizes), which would be interpreted as stating that folic acid deficiency multiplies by three the risk of a newborn being born with a neural tube defect.

However, we cannot rule out that this second group of women, with normal folic acid values, also has certain healthy characteristics, such as a better diet in general, a better genetic makeup, or a low prevalence of risk factors such as tobacco or alcohol. It would therefore be reasonable to conclude that the lower incidence of NTD in these women may be due to two different phenomena: normal folic acid values, but also healthier habits and conditions. In that case, the relative risk of 3 would be an overestimate of the harmful effect of low folic acid concentrations in pregnant women.

Intuitively, the ideal procedure for determining the effect of folic acid deficiency would be to compare the first 10,000 women, who had insufficient folic acid values, with themselves, but assuming that they had normal folic acid concentrations. In this case, the two groups would differ only in their exposure to folic acid, and the measure of association would truly be attributable to folic acid deficiency. However, this comparison group is not possible in practice: it is not “feasible” because it goes against the facts (each woman either has or does not have an adequate folic acid value), and for this reason it has been called a “counterfactual group”.

If there is an association between folic acid deficiency and NTDs, we would probably obtain a figure lower than 15 in these “counterfactual” women with normal concentrations of folic acid, but greater than the hypothetical figure of 5 presented previously: for example, a hypothetical 10 cases.
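The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration with the hypothetical figures from the text; the function name `relative_risk` is ours, and the comparison groups are assumed to contain 10,000 women each, which is what a relative risk of 3 implies.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    # Cross-multiplying avoids dividing two small floating-point risks.
    return (cases_exposed * n_unexposed) / (cases_unexposed * n_exposed)

N = 10_000  # assumed size of every group in this illustration

# Observed comparison: deficient women vs. a different group with normal values.
observed_rr = relative_risk(15, N, 5, N)

# Ideal comparison: the same women, had they had normal values (hypothetical 10 cases).
counterfactual_rr = relative_risk(15, N, 10, N)

print(observed_rr)        # 3.0
print(counterfactual_rr)  # 1.5
```

Under these assumed figures, the gap between 3.0 and 1.5 is the part of the observed association that would be attributable to the other differences between the two groups rather than to folic acid itself.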


In general terms, we speak of confounding when there are important differences between the crude estimates of an association and those adjusted for possible confounding factors. These differences can be assessed following various criteria, although there is a certain consensus on the importance of assessing how much the adjustment changes the magnitude of the measures of association. Thus, a factor can be considered a confounder when adjusting for it changes the estimate by at least 10% between the crude and adjusted values.
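As a rough sketch, the 10% rule could be checked as follows. The text does not say which estimate the change is measured against; this example assumes the adjusted estimate as the reference, and the function names and input values are hypothetical.

```python
def relative_change(crude, adjusted):
    """Change between crude and adjusted estimates, relative to the adjusted one (an assumption)."""
    return abs(crude - adjusted) / adjusted

def suggests_confounding(crude, adjusted, threshold=0.10):
    """True when adjustment moves the estimate by at least the threshold (10% by default)."""
    return relative_change(crude, adjusted) >= threshold

# Hypothetical relative risks before and after adjustment.
print(suggests_confounding(crude=3.0, adjusted=1.5))   # True: the estimate changes by 100%
print(suggests_confounding(crude=2.05, adjusted=2.0))  # False: the estimate changes by 2.5%
```

A check like this is only a screening aid; whether a variable is treated as a confounder also depends on the causal criteria discussed in the text.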

Before carrying out these comparisons, it is necessary to estimate the adjusted values. The classic method for obtaining adjusted measures of association is stratified analysis, which consists of recalculating the estimates within each stratum of the possibly confounding variable. It is easy to see that, when we want to assess several confounding factors simultaneously (e.g., age categorized into two groups, sex, and intake of a given food as a dichotomous variable), we quickly reach a situation where there are too few observations in each stratum to obtain a valid estimate (the variables just listed would already produce 8 strata).
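A minimal sketch of stratification, using invented counts, may make this concrete. All numbers and names below are hypothetical; the data were chosen so that the exposure has no effect within either stratum even though the pooled (crude) estimate suggests one.

```python
from itertools import product

def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk in the exposed divided by risk in the unexposed."""
    return (cases_exposed * n_unexposed) / (cases_unexposed * n_exposed)

# Hypothetical counts per stratum of the suspected confounder:
# (cases_exposed, n_exposed, cases_unexposed, n_unexposed)
STRATA = {
    "confounder absent": (10, 100, 40, 400),
    "confounder present": (120, 400, 30, 100),
}

# Crude estimate: ignore the strata and pool all counts.
pooled = [sum(column) for column in zip(*STRATA.values())]
crude_rr = relative_risk(*pooled)

# Stratum-specific estimates: recalculate within each stratum.
stratum_rr = {name: relative_risk(*counts) for name, counts in STRATA.items()}

print(round(crude_rr, 2))  # 1.86
print(stratum_rr)          # both strata give 1.0

# Strata multiply quickly: three dichotomous variables already give 2 * 2 * 2 cells.
factors = {"age_group": ["younger", "older"], "sex": ["female", "male"], "food_intake": ["no", "yes"]}
strata_combinations = list(product(*factors.values()))
print(len(strata_combinations))  # 8
```

With a fixed sample, each additional dichotomous variable halves the average number of observations per stratum, which is why this approach runs out of data quickly.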

A more efficient option for considering the confounding role of several variables simultaneously is multivariate analysis. Multivariate analysis is a complex procedure, carried out more or less automatically by statistical programs, that consists of obtaining, from an initially large number of variables, the set of variables (called independent variables, covariates, or predictor variables) that are most strongly associated with the outcome of interest (the dependent variable). This set of variables constitutes what we call the “multivariate statistical model.” From this model we obtain the measures of association of the variables that make it up, with the additional advantage that each of these estimates is adjusted for the other variables in the model.

Depending on the scale of the variable that quantifies the outcome, different types of multivariate models are used: multiple linear regression (quantitative outcome), logistic regression (dichotomous outcome), Cox regression (survival function as the outcome of interest), or Poisson regression (outcome in the form of rates).

The main advantage of multivariate analysis over stratified analysis is that multivariate models are more efficient. That is, for the same sample size, they yield more precise estimates and can accommodate a greater number of variables than would be admissible in a stratified analysis.
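To illustrate how a multivariate model produces an adjusted estimate, here is a small sketch of logistic regression fitted by Newton-Raphson in plain Python. It is not how a statistical program would be used in practice (a dedicated package would normally do the fitting); the data, the helper names `solve` and `fit_logistic`, and the fixed number of iterations are all assumptions made for this illustration.

```python
import math

# Hypothetical grouped data: (exposed, confounder, cases, total).
GROUPS = [
    (1, 0, 10, 100),
    (0, 0, 40, 400),
    (1, 1, 120, 400),
    (0, 1, 30, 100),
]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_logistic(rows, iterations=25):
    """rows: (covariate vector, cases, total). Returns maximum-likelihood coefficients."""
    k = len(rows[0][0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, cases, total in rows:
            eta = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for i in range(k):
                grad[i] += x[i] * (cases - total * p)
                for j in range(k):
                    info[i][j] += x[i] * x[j] * total * p * (1 - p)
        # Newton-Raphson update: beta + (information matrix)^-1 * gradient.
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    return beta

# Crude model: intercept + exposure. Adjusted model: intercept + exposure + confounder.
crude_beta = fit_logistic([([1, e], c, n) for e, z, c, n in GROUPS])
adjusted_beta = fit_logistic([([1, e, z], c, n) for e, z, c, n in GROUPS])

crude_or = math.exp(crude_beta[1])
adjusted_or = math.exp(adjusted_beta[1])
print(round(crude_or, 2), round(adjusted_or, 2))  # about 2.16 and 1.0
```

In this invented data set the exposure appears to double the odds of the outcome in the crude model, but the odds ratio falls to 1 once the confounder enters the model, which is the pattern of adjustment the text describes.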

When estimating these models to identify confounding variables, it is recommended to select any variable that meets the general criteria for confounders (criteria summarized in acyclic diagrams, also known as directed acyclic graphs), is responsible for a change of more than 10% between the crude measure of association (without the variable in the model) and the adjusted one (with the variable included), and presents a conservative level of significance (p value), approximately less than 0.20 [2].


Because confounding factors introduce, by definition, a bias in the measures of association, it is evident that an attempt should be made to prevent and control this effect before presenting the definitive results of an investigation. Confounding factors can be prevented in the design phase or eliminated in the analysis phase of an epidemiological study.

Why Are Confounding Variables Important?

A quantitative study can be an investment of significant time and money. You must have confidence that your study will be reliable, and its results will be valid. Studies should be constructed such that they could be repeated, with an expectation of the same result.

When a research study has low bias and a high level of repeatability and control, it has high internal validity — in other words, a study is internally valid if it does not bias a participant towards any specific answer or action.

If your research study has significant confounding variables, then the conclusions from that study may be wrong. Making decisions based on these misguided conclusions can result in significant loss of time and money for organizations.

Best Practices for Avoiding Confounding Variables

  • Use within-subject study designs when possible. Counterbalance or randomize the order in which participants are exposed to the different conditions in your study. For example, if participants are testing two designs, randomly decide which design each participant tests first. Within-subjects designs reduce sources of error, because each participant serves as their own comparison.
  • Randomly assign condition groups for between-subjects study designs. For example, randomly decide which design should be seen by a participant.
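
Both practices above can be sketched with Python's standard `random` module. The participant IDs, condition names, and function names are hypothetical, and dealing shuffled participants out in turn is just one simple way to keep group sizes balanced.

```python
import random
from collections import Counter
from itertools import permutations

def assign_between_subjects(participants, conditions, seed=None):
    """Randomly assign each participant to exactly one condition, with balanced group sizes."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Dealing the shuffled list out in turn keeps group sizes within one of each other.
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}

def assign_within_subject_orders(participants, conditions, seed=None):
    """Give every participant all conditions, counterbalancing which order each one gets."""
    rng = random.Random(seed)
    orders = list(permutations(conditions))
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: orders[i % len(orders)] for i, p in enumerate(shuffled)}

participants = [f"P{i:02d}" for i in range(1, 9)]  # hypothetical IDs P01..P08
conditions = ["design_A", "design_B"]

groups = assign_between_subjects(participants, conditions, seed=1)
orders = assign_within_subject_orders(participants, conditions, seed=1)

print(Counter(groups.values()))                          # 4 participants per design
print(Counter(order[0] for order in orders.values()))    # each design is tested first 4 times
```

Fixing the seed makes an assignment reproducible for documentation purposes; leaving it as `None` draws a fresh random assignment each run.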