What role does randomization best play in your data collection design?

directivos empresas inteligencia artificial

What role does randomization best play in your data collection design?

data collection


What is data collection?

Data collection is the process of gathering data for use in business decision-making, strategic planning, research and other purposes. It’s a crucial part of data analytics applications and research projects: Effective data collection provides the information that’s needed to answer questions, analyze business performance or other outcomes, and predict future trends, actions and scenarios.

IT systems regularly collect data on customers, employees, sales and other aspects of business operations when transactions are processed and data is entered. Companies also conduct surveys and track social media to get feedback from customers. Data scientists, other analysts and business users then collect relevant data to analyze from internal systems, plus external data sources if needed. The latter task is the first step in data preparation, which involves gathering data and preparing it for use in business intelligence (BI) and analytics applications.

An overview of randomization techniques: An unbiased assessment of outcome in clinical research

A good experiment or trial minimizes the variability of the evaluation and provides unbiased evaluation of the intervention by avoiding confounding from other factors, which are known and unknown.

Randomization ensures that each patient has an equal chance of receiving any of the treatments under study, generate comparable intervention groups, which are alike in all the important aspects except for the intervention each groups receives. It also provides a basis for the statistical methods used in analyzing the data. The basic benefits of randomization are as follows: it eliminates the selection bias, balances the groups with respect to many known and unknown confounding or prognostic variables, and forms the basis for statistical tests, a basis for an assumption of free statistical test of the equality of treatments. In general, a randomized experiment is an essential tool for testing the efficacy of the treatment.

In practice, randomization requires generating randomization schedules, which should be reproducible. Generation of a randomization schedule usually includes obtaining the random numbers and assigning random numbers to each subject or treatment conditions. Random numbers can be generated by computers or can come from random number tables found in the most statistical text books.

For simple experiments with small number of subjects, randomization can be performed easily by assigning the random numbers from random number tables to the treatment conditions. However, in the large sample size situation or if restricted randomization or stratified randomization to be performed for an experiment or if an unbalanced allocation ratio will be used, it is better to use the computer programming to do the randomization such as SAS, R environment etc.


Researchers in life science research demand randomization for several reasons. First, subjects in various groups should not differ in any systematic way. In a clinical research, if treatment groups are systematically different, research results will be biased. Suppose that subjects are assigned to control and treatment groups in a study examining the efficacy of a surgical intervention. If a greater proportion of older subjects are assigned to the treatment group, then the outcome of the surgical intervention may be influenced by this imbalance. The effects of the treatment would be indistinguishable from the influence of the imbalance of covariates, thereby requiring the researcher to control for the covariates in the analysis to obtain an unbiased result.

Second, proper randomization ensures no a priori knowledge of group assignment (i.e., allocation concealment). That is, researchers, subject or patients or participants, and others should not know to which group the subject will be assigned. Knowledge of group assignment creates a layer of potential selection bias that may taint the data. Schul and Grimes stated that trials with inadequate or unclear randomization tended to overestimate treatment effects up to 40% compared with those that used proper randomization. The outcome of the research can be negatively influenced by this inadequate randomization.

Statistical techniques such as analysis of covariance (ANCOVA), multivariate ANCOVA, or both, are often used to adjust for covariate imbalance in the analysis stage of the clinical research. However, the interpretation of this post adjustment approach is often difficult because imbalance of covariates frequently leads to unanticipated interaction effects, such as unequal slopes among subgroups of covariates.

One of the critical assumptions in ANCOVA is that the slopes of regression lines are the same for each group of covariates. The adjustment needed for each covariate group may vary, which is problematic because ANCOVA uses the average slope across the groups to adjust the outcome variable. Thus, the ideal way of balancing covariates among groups is to apply sound randomization in the design stage of a clinical research (before the adjustment procedure) instead of post data collection. In such instances, random assignment is necessary and guarantees validity for statistical tests of significance that are used to compare treatments.



Many procedures have been proposed for the random assignment of participants to treatment groups in clinical trials. In this article, common randomization techniques, including simple randomization, block randomization, stratified randomization, and covariate adaptive randomization, are reviewed. Each method is described along with its advantages and disadvantages. It is very important to select a method that will produce interpretable and valid results for your study. Use of online software to generate randomization code using block randomization procedure will be presented.

Simple randomization

Randomization based on a single sequence of random assignments is known as simple randomization.This technique maintains complete randomness of the assignment of a subject to a particular group. The most common and basic method of simple randomization is flipping a coin. For example, with two treatment groups (control versus treatment), the side of the coin (i.e., heads – control, tails – treatment) determines the assignment of each subject. Other methods include using a shuffled deck of cards (e.g., even – control, odd – treatment) or throwing a dice (e.g., below and equal to 3 – control, over 3 – treatment). A random number table found in a statistics book or computer-generated random numbers can also be used for simple randomization of subjects.

This randomization approach is simple and easy to implement in a clinical research. In large clinical research, simple randomization can be trusted to generate similar numbers of subjects among groups. However, randomization results could be problematic in relatively small sample size clinical research, resulting in an unequal number of participants among groups.


The block randomization method is designed to randomize subjects into groups that result in equal sample sizes. This method is used to ensure a balance in sample size across groups over time. Blocks are small and balanced with predetermined group assignments, which keeps the numbers of subjects in each group similar at all times. The block size is determined by the researcher and should be a multiple of the number of groups (i.e., with two treatment groups, block size of either 4, 6, or 8). Blocks are best used in smaller increments as researchers can more easily control balance.

After block size has been determined, all possible balanced combinations of assignment within the block (i.e., equal number for all groups within the block) must be calculated. Blocks are then randomly chosen to determine the patients’ assignment into the groups.

Although balance in sample size may be achieved with this method, groups may be generated that are rarely comparable in terms of certain covariates. For example, one group may have more participants with secondary diseases (e.g., diabetes, multiple sclerosis, cancer, hypertension, etc.) that could confound the data and may negatively influence the results of the clinical trial. Pocock and Simon stressed the importance of controlling for these covariates because of serious consequences to the interpretation of the results. Such an imbalance could introduce bias in the statistical analysis and reduce the power of the study. Hence, sample size and covariates must be balanced in clinical research.


Stratified randomization

The stratified randomization method addresses the need to control and balance the influence of covariates. This method can be used to achieve balance among groups in terms of subjects’ baseline characteristics (covariates). Specific covariates must be identified by the researcher who understands the potential influence each covariate has on the dependent variable. Stratified randomization is achieved by generating a separate block for each combination of covariates, and subjects are assigned to the appropriate block of covariates. After all subjects have been identified and assigned into blocks, simple randomization is performed within each block to assign subjects to one of the groups.