How to best address potential confounding variables in your data collection?



Data Collection

Data is a collection of facts, figures, objects, symbols, and events gathered from different sources. Organizations collect data with various data collection methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various points in time.

For instance, an organization must collect data on product demand, customer preferences, and competitors before launching a new product. If data is not collected beforehand, the organization’s newly launched product may fail for many reasons, such as low demand or an inability to meet customer needs.

Although data is a valuable asset for every organization, it does not serve any purpose until analyzed or processed to get the desired results.

Data collection methods are techniques and procedures used to gather information for research purposes. These methods can range from simple self-reported surveys to more complex experiments and can involve either quantitative or qualitative approaches to data gathering.

Some common data collection methods include surveys, interviews, observations, focus groups, experiments, and secondary data analysis. The data collected through these methods can then be analyzed and used to support or refute research hypotheses and draw conclusions about the study’s subject matter.

What is a confounding variable?

The concept of confounding is probably one of the most important in epidemiology. Firstly, because much of the work carried out in this field consists precisely of trying to prevent it when designing research studies, or of controlling its effect when it appears in the analysis. Secondly, and specifically for health professionals, an adequate understanding of this phenomenon determines whether they can interpret, critically and correctly, the results of the many studies published in the scientific literature.

This review aims to explain, in a didactic manner and with the help of several examples, the concept of confounding; it then does the same with another important concept, effect modification (interaction), and finally describes the differences between the two.

Concept

Although some antecedents can be found in Francis Bacon, the first author who explicitly addressed the issue of confounding was the British philosopher and economist John Stuart Mill (1806-1873). When referring to the criteria necessary for establishing a causal relationship, Mill pointed out the need to ensure that no factor was present whose effects could be confused with those of the agent one wanted to study.

Before defining the confounding phenomenon, it is necessary to describe the counterfactual approach used in causal models. Given data on 15 newborns with neural tube defects (NTD) in a sample of 10,000 women with folic acid deficiency, we could ask ourselves whether the incidence of malformations is due to folic acid deficiency. The question is important because, if the answer is affirmative, we would have a simple solution to the problem in our hands: for example, the fortification of foods with folic acid.

To answer this question, it is necessary to compare this group of women with another that had normal folic acid values. If the hypothesis that the deficiency increases the incidence of NTD in newborns is true, it would be logical to find a smaller number of newborns with these malformations in the second group: with, for example, 5 NTD cases among another 10,000 women, we would obtain a relative risk of 3, which would be interpreted as folic acid deficiency tripling the risk of a newborn having a neural tube defect.
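Using the counts from the folic acid example, the relative risk is simply the ratio of the two incidences; a minimal sketch in Python (all figures are the illustrative ones from the text):

```python
# Hypothetical figures from the folic acid example above.
cases_deficient = 15      # NTD cases among women with folic acid deficiency
total_deficient = 10_000  # women with folic acid deficiency
cases_normal = 5          # NTD cases among women with normal folic acid values
total_normal = 10_000     # women with normal folic acid values

risk_deficient = cases_deficient / total_deficient  # incidence in the exposed group
risk_normal = cases_normal / total_normal           # incidence in the comparison group

relative_risk = risk_deficient / risk_normal
print(relative_risk)  # 3.0
```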

However, we cannot rule out that this second group of women, with normal folic acid values, also presents certain healthy characteristics, such as a better diet in general, a better genetic makeup, or a low prevalence of risk factors such as tobacco or alcohol use. Therefore, it would be reasonable to conclude that the lower incidence of NTD in these women may be due to two different phenomena: to normal folic acid values, but also to healthier habits and conditions, in which case the relative risk of 3 would be an overestimation of the harmful effect of low folic acid concentrations in pregnant women.

Intuitively, it seems logical that the most perfect procedure to determine the effect of folic acid deficiency would consist of comparing the first 10,000 women who had insufficient folic acid values with themselves, assuming instead that they had normal folic acid concentrations. In this case, both groups would differ only in their exposure to folic acid, and the measure of association would really be attributable to folic acid deficiency. However, this comparison group is not possible in practice; it is not “feasible”, since it goes against the facts (each woman either has or does not have an adequate folic acid value), and for this reason it has been called a “counterfactual group”.

If there is an association between folic acid deficiency and NTDs, in these “counterfactual” women with normal concentrations of folic acid we would probably obtain a figure lower than 15, but greater than the hypothetical figure of 5 presented previously; we might obtain, for example, the hypothetical number of 10.


Identification

In general terms, we speak of confounding when there are important differences between the raw estimates of an association and those adjusted for possible confounding factors. These differences can be assessed following various criteria, although there is a certain consensus on the importance of assessing the effect that the adjustment has on the magnitude of the association measures. Thus, a factor can be considered a confounder when adjusting for it changes the magnitude of the association measure by at least 10% relative to the crude estimate.
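A minimal sketch of this rule of thumb in Python (the crude and adjusted values are assumed, purely for illustration):

```python
def confounding_change(crude: float, adjusted: float) -> float:
    """Relative change in the association measure after adjustment."""
    return abs(adjusted - crude) / crude

# Assumed example values: a crude relative risk of 3.0 that drops to 1.5
# once a suspected confounder is adjusted for.
change = confounding_change(crude=3.0, adjusted=1.5)
is_confounder = change >= 0.10  # the ~10% rule of thumb described above
```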

Before carrying out these comparisons, it is necessary to estimate the adjusted values. The most classic method for obtaining adjusted association measures is the one presented previously, which consists of recalculating new estimates within each stratum of the possibly confounding variable. It is easy to see that, when we want to assess several confounding factors simultaneously (e.g., age categorized into two groups, sex, and intake of a given food as a dichotomous variable), we quickly reach a situation where there are too few observations per stratum to obtain a valid estimate in each one (the variables just listed already produce 8 strata).
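One classic way to pool stratum-specific estimates into a single adjusted figure (not named explicitly in the text, but standard in stratified analysis) is the Mantel-Haenszel relative risk; a sketch with made-up strata:

```python
def mantel_haenszel_rr(strata: list[tuple[int, int, int, int]]) -> float:
    """Mantel-Haenszel relative risk pooled over strata.

    Each stratum is (exposed_cases, exposed_total, unexposed_cases, unexposed_total).
    """
    numerator = denominator = 0.0
    for a, n1, b, n0 in strata:
        n = n1 + n0
        numerator += a * n0 / n
        denominator += b * n1 / n
    return numerator / denominator

# Hypothetical stratification by smoking status (illustrative counts only):
strata = [
    (10, 5_000, 4, 6_000),  # smokers:     exposed cases/total, unexposed cases/total
    (5, 5_000, 1, 4_000),   # non-smokers: exposed cases/total, unexposed cases/total
]
rr_adjusted = mantel_haenszel_rr(strata)
```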

A more efficient option for considering the confounding role of several variables simultaneously is multivariate analysis. Multivariate analysis is a procedure carried out more or less automatically by statistical programs which consists of obtaining, from an initially large number of variables, the set of variables (called independent variables, covariates or predictor variables) that are most intensely associated with the outcome of interest (dependent variable). This set of variables constitutes what we call the “multivariate statistical model.” From this model we obtain the measures of association of the different variables that make it up, with the additional advantage that each of these estimates is adjusted for the other variables in the model.

Depending on the scale of the variable that quantifies the outcome, different types of multivariate models are used: multiple linear regression (quantitative outcome), logistic regression (dichotomous outcome) or multiple Cox regression (survival function as outcome of interest), Poisson (outcome in the form of rates).

The main advantage of multivariate analysis over stratified analysis is that multivariate models are more efficient. That is, given the same sample size used, more precise estimates are obtained and with a greater number of variables than would be admissible in a stratified analysis.

When estimating these models to identify confounding variables, it is recommended to choose any variable that, while meeting the general criteria for confounding variables (criteria often summarized in directed acyclic graphs), is responsible for changes of more than 10% between the crude association measure (without said variable in the model) and the adjusted one (with said variable included in the model), and that presents a conservative level of significance (p value), approximately less than 0.20.


Because confounding factors introduce, by definition, a bias in the measures of association, it is evident that their effect should be prevented and controlled before presenting the definitive results of an investigation. Confounding can be prevented in the design phase or eliminated in the analysis phase of an epidemiological study.

Why Are Confounding Variables Important?

A quantitative study can be an investment of significant time and money. You must have confidence that your study will be reliable, and its results will be valid. Studies should be constructed such that they could be repeated, with an expectation of the same result.

When a research study has low bias and a high level of repeatability and control, it has high internal validity — in other words, a study is internally valid if it does not bias a participant towards any specific answer or action.

If your research study has significant confounding variables, then the conclusions from that study may be wrong. Making decisions based on these misguided conclusions can result in significant loss of time and money for organizations.

Best Practices for Avoiding Confounding Variables

  • Use within-subject study designs when possible. Counterbalance or randomize the order in which participants are exposed to the different conditions in your study. For example, if participants are testing two designs, randomly decide which one each participant tests first. Within-subjects designs reduce sources of error and naturally counterbalance experimental conditions.
  • Randomly assign condition groups for between-subjects study designs. For example, randomly decide which design should be seen by a participant.
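A minimal sketch of both practices in Python, assuming two hypothetical designs: shuffling the order per participant (within-subjects) and randomly assigning a condition group (between-subjects):

```python
import random

designs = ["design_A", "design_B"]  # hypothetical conditions
rng = random.Random(42)             # seeded here only so the sketch is reproducible

# Within-subjects: randomize the order each participant sees the designs in.
orders = {pid: rng.sample(designs, k=len(designs)) for pid in range(6)}

# Between-subjects: randomly assign each participant to a single condition.
groups = {pid: rng.choice(designs) for pid in range(6)}
```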

What are the best steps to take to ensure a representative sample?


What is a representative sample?

A representative sample is a sample of appropriate size that has been selected by random procedures and whose observed characteristics correspond to those of the population from which it was drawn (Ras, 1980; Cochran, 1976; Scheaffer, Mendenhall and Ott, 1987). It is never possible to be certain of the degree of representativeness; rather, there is a reasonable probability of it.

Representativeness

Representativeness is a function of several factors: it depends not only on the randomness and size of the sample, but also on the sampling design, which is particular to each case, on the use of key auxiliary information, and on a useful, up-to-date sampling frame. The term representative is used as long as the sample faithfully reflects the variable under study, which has a probabilistic distribution in the population; the frequency distribution in the sample must mirror, or be very similar to, that of the population.

This highlights how complex the selection of a representative sample is. To do this, the following must be taken into account: the way the sample is selected, the estimators to be proposed and their precision, and the determination of a sample size that takes into account the accuracy or margin of error allowed, the confidence level of the estimate, and the variability of the variable on which the probabilistic inference is going to be carried out.
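For the sample-size determination mentioned above, the usual formula for estimating a proportion is n = z²·p(1−p)/e²; a sketch under the standard conservative assumption p = 0.5:

```python
import math

def sample_size(margin_of_error: float, z: float = 1.96, p: float = 0.5) -> int:
    """Minimum simple random sample size for a proportion (infinite population).

    z = 1.96 corresponds to ~95% confidence; p = 0.5 maximizes p * (1 - p),
    i.e. it is the most conservative assumption about variability.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

n = sample_size(0.05)  # 385 respondents for a 5% margin of error
```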

Likewise, attention must be paid to the available sampling frame and to the set of key auxiliary variables or covariates correlated with the variables of interest. These allow the sampling design to be improved through the formation of strata; the selection of direct estimators, such as the Horvitz-Thompson estimator, and indirect ones (ratio, regression and difference); the choice of a sample size appropriate to a given precision; the selection of samples with probabilities proportional to a measure of size (PPS); and the use of calibrated estimates, where sampling weights are adjusted according to non-response and the auxiliary information found, especially in complex samples.

Many times, when designing a probabilistic sample, concessions must be made, especially if the statistical population is asymmetric; there are even times when elements are included with probability one (1) of belonging to the sample and, if this is not done, the sample will not be representative enough.

A probabilistic sample approaches, in its structure, a greater degree of what is called representativeness when the distance between the sample estimate and the value of the population parameter becomes smaller; this is known as accuracy in statistical inference.

We are in the presence of a sufficiently representative sample when the selection process assigns, in advance, a known, non-zero probability of inclusion to each element (not necessarily equal for every element of the population) and, furthermore, when the sampling error is low, accuracy exists, and a random process is used in the selection.

The best way to define a sufficiently representative sample is one obtained with a probabilistic sampling strategy that allows the parameter to be estimated with accuracy, minimal bias, and the minimum standard error of the estimator of that parameter, or the minimum estimation error, which is a multiple of that standard error.

Importance of having a representative sample

Representative samples yield results, knowledge, and observations that can be relied upon as representative of the broader population being studied. Therefore, representative sampling is usually the best method for market research.

If we do not have representation, we will surely have data that will be of no use to us. Therefore, it is important that we guarantee that the characteristics that matter to us and need to be investigated are found in the sample that is going to be the object of study.

Let’s take into account that we will always be prone to sampling bias, because there will always be people who do not answer the survey because they are busy, or who answer it incompletely, so we will not be able to obtain all the data we require.

Regarding the  size of the sample , the larger it is, the more likely it is to be representative of the population. 

That a sample is representative gives us greater certainty that the people included are the ones we need, and it also reduces possible bias. Therefore, if we want to avoid inaccuracy in our surveys, we must have a representative and balanced sample.

How to obtain a representative sample?

There are established sampling methods for obtaining a representative sample that have been tested and verified over time through academic, scientific and market research.

The  most common types of sampling  are probability or random sampling and non-probability sampling.

Probability sampling

If we are going to use probabilistic or random sampling, we must make sure we have up-to-date information on the population from which we will draw the sample, and a sampling frame that covers it, to ensure representativeness.

The sample will be chosen at random, which guarantees that each member of the population will have the same probability of selection and inclusion in the sample group.
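A minimal sketch of this equal-probability selection in Python, using a hypothetical sampling frame:

```python
import random

# Hypothetical, up-to-date sampling frame listing the whole population.
population = [f"member_{i}" for i in range(10_000)]

rng = random.Random(7)  # seeded only so the sketch is reproducible
sample = rng.sample(population, k=400)  # every member is equally likely to be drawn
```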

Non-probability sampling

In  non-probabilistic sampling,  the aim is to have different types of people to ensure a more balanced representative sample. 

Knowing the demographic characteristics of our group will undoubtedly help to limit the profile of the desired sample and define the variables that interest us, such as gender, age, place of residence, etc. 

By knowing these criteria, before obtaining the information, we can have the control to create a representative sample that is useful to us.

We must avoid having a sample that does NOT reflect the target population; the ideal is to have data that is as accurate as possible for the success of our project.

Avoiding sampling errors

When a sample is not representative, we have a sampling error. If we want a representative sample of 100 employees from a workforce with similar numbers of men and women, we should choose a similar number of each. If, instead, the sample is biased towards one gender, we will have a sampling error.
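The gender-balance example above amounts to proportional allocation across strata; a sketch with hypothetical workforce numbers (note that naive rounding can make the allocations sum slightly off the target in general):

```python
def proportional_allocation(stratum_sizes: dict[str, int], n: int) -> dict[str, int]:
    """Split a sample of size n across strata in proportion to population share."""
    total = sum(stratum_sizes.values())
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}

# Hypothetical workforce of 600 men and 400 women, sample of 100 employees:
allocation = proportional_allocation({"men": 600, "women": 400}, n=100)
# {'men': 60, 'women': 40}
```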

Sample size is very important, but it does not guarantee that the population we need is accurately represented. More than size, representativeness is related to the sampling frame, that is, to the list from which the people who will take part in, for example, a survey are selected.

Therefore, we must ensure that people from our  target audience  are included in that list to say that it is a representative sample.

 

Are you using the best primary and secondary data, and why?


Primary and secondary data

To analyze the similarity and difference between primary and secondary data sources, it helps to know what a market study entails. As a Statista report explains, “a market study is an important business strategy that requires the collection of information about a target market for the company.”

That is, before launching a new line of products or services, growing your team or establishing a social media campaign, you will most likely need to conduct market research. Collecting primary and secondary data ensures that you can make informed decisions to save time and money.

On the one hand, primary data is the information collected directly from the source of interest; in this case, the potential client. Normally, when a company needs primary marketing data, it does so to determine the viability of a product or service and to analyze the buyer persona, market offers, investment risk, and other factors. Primary data collection methods include customer interviews, surveys, focus groups, etc.

On the other hand, unlike primary data,  secondary data is based on information researched by other companies, institutions or platforms . Secondary data sources are usually public. Some of these include newspapers, government websites, media agencies, etc. In order for it to serve the company’s purposes, the collection of secondary data has to go through an analysis by the marketing team that selects the most convenient information.

Primary data types

To begin to understand the characteristics of primary and secondary data, we must start from their types. That way,  data analysis in marketing  makes sense and can help you plan market research. In the case of primary data, there are two types that you must take into account.

Quantitative primary data

As you might guess from the name, the importance of quantitative primary data lies in the quantities and numbers. Those in charge of conducting primary data research focus on the mathematical data and not on the subjective signals of people’s behaviors or opinions .

The information from primary data allows us to understand a problem in the market and analyze its implication for the consumer. By obtaining this numerical information, a company’s marketing team and managers can make decisions based on clear and objective realities.


Qualitative primary data

Numbers or statistics are not necessary here. On the contrary, qualitative primary data offers information about the behaviors and emotions of potential customers of a product or service, gathered from audio, text, video or other formats for collecting opinions.

It is common for this type of data to be obtained from direct conversations with people of interest. Interviews are usually conducted with open-ended questions and other data collection techniques.

Primary data sources

In the process of obtaining primary and secondary data, locating the sources is the most important step. If you’re part of a marketing team with research underway, here’s a list of primary data sources to consider.

  • Surveys:  serve to gather direct information on the problems, needs and preferences of the potential client. In this way, survey information makes it possible to predict consumer behavior during the sales phases.
  • Questionnaires:  this format includes open and closed questions to determine the assessment that a potential client has of a brand, company or product. They can be made in different formats, such as phone calls, sending SMS, emails, etc.
  • Interviews:  These are conversations conducted by a marketing or market data research specialist. They tend to be long and deep conversations with people interested in a brand and product. Given their character, they offer quality information from verbal responses and even body language.
  • Website:  the web analytics of a company’s website are also a primary data source. From the SEO and user-experience analysis carried out by the corresponding areas, valuable information can be obtained for the content marketing strategy, the creation of new products and the customer profile.
  • Social networks:  brand accounts on Instagram, Facebook, Twitter and other social networks are sources of a lot of information for any marketing team. In them, existing customers and interested parties present their opinions, objections and ideas, which can be used to make decisions about advertising campaigns and organic content. That is, primary data in social media marketing has a lot of relevance for a company.


Secondary data types

If you are wondering how to do a complete market study, you also need to take into account the types of secondary data. These include information found in the company’s internal files and information that comes from outside.

Internal secondary data

Although this concept seems to reflect a similarity between primary and secondary data, the truth is that internal secondary data is information that is not collected directly from the ideal client or the specific audience at their current moment. These data are obtained from the internal and archived records of a company, such as reports from past advertising campaigns, accounts of former clients, accounting data, etc.


External secondary data

This information is commonly found outside of company records and files. It has been prepared and published by external sources for objectives that do not relate directly to commercial research objectives. That is, as mentioned before, secondary data comes from studies of competitors, publications of public organizations, etc.

Secondary data sources

Just as with primary data sources, secondary data also exists in different formats. The nature of each of these sources and the information collected must be analyzed to select what is useful and what is not.

  • Government studies and reports:  many governments around the world carry out in-depth studies on the behavior of their populations, demographics and the sociocultural changes that occur in them. This offers a lot of valuable secondary data for any business’s marketing research.
  • Private market research:  using the power of  data science , there are many companies that are dedicated to conducting user and consumer behavior studies. These usually sell this information to other companies that can leverage the value of this secondary data for their objectives.
  • Sales data:  the data obtained from sales processes is very important for any commercial organization. You can find useful secondary data in invoices, returned products, order documents, and delivery experience. 
  • Competitor platforms:  in a secondary data investigation you cannot forget the information found on competitor pages, applications and reports. In them, you can locate information about potential customers, unresolved pain points, and gaps in your value proposition.
  • Web search engines:  Search engines like Google also offer invaluable data. Not only because they list the competition’s organic results for their websites, but because they allow you to see an overview of the advertising spending that these other companies make in search of the same type of client.



How to best handle outliers or anomalous data points?


Outlier detection in data mining (DM) and knowledge discovery from data (KDD) is of great interest in areas that require decision support systems, such as finance, where DM can detect financial fraud or find errors produced by users. It is therefore essential to evaluate the veracity of the information through methods for detecting unusual behavior in the data.

This article proposes a method to detect values considered outliers in a database of nominal data. The method implements a global k-nearest-neighbors algorithm, a clustering algorithm called k-means, and a statistical method, chi-square. These techniques were applied to a database of clients who have requested financial credit. The experiment was performed on a data set of 1,180 tuples into which outliers were deliberately introduced. The results demonstrated that the proposed method is capable of detecting all the introduced outliers.

Detecting outliers represents a challenge for data mining techniques. Outliers, also called anomalous values, have different properties from the generality of the data: by the nature of their values, and therefore their behavior, they do not behave like the majority. Anomalous values are also susceptible to being introduced by malicious mechanisms (Atkinson, 1981).

Mandhare and Idate (2017) consider this type of data a threat and define it as irrelevant or malicious. Additionally, such data creates conflicts during the analysis process, resulting in unreliable and inconsistent information. However, although anomalous data are irrelevant for finding patterns in everyday data, they are useful as objects of study in cases where they make it possible to identify problems such as financial fraud arising from an uncontrolled process.

What is anomaly detection?

Anomaly detection examines specific data points and detects unusual occurrences that appear suspicious because they are different from established patterns of behavior. Anomaly detection is not new, but as the volume of data increases, manual tracking is no longer practical.

Why is anomaly detection important?

Anomaly detection is especially important in industries such as finance, retail, and cybersecurity, but all businesses should consider implementing an anomaly detection solution. Such a solution provides an automated means to detect harmful outliers and protects data. For example, banking is a sector that benefits from anomaly detection. Thanks to it, banks can identify fraudulent activity and inconsistent patterns, and protect data. 

Data is the lifeline of your business, and compromising it can put your operation at risk. Without anomaly detection, you could lose revenue and brand value that took years to cultivate, face security breaches, and lose confidential customer information. If this happens, you risk losing a level of customer trust that may be irretrievable.


The detection process using data mining techniques facilitates the search for anomalous values (Arce, Lima, Orellana, Ortega and Sellers, 2018). Several studies show that most of this type of data also originates from domains such as credit cards (Bansal, Gaur, & Singh, 2016), security systems (Khan, Pradhan, & Fatima, 2017), and electronic health information (Zhang & Wang, 2018).

The detection process is based on unsupervised algorithms (Onan, 2017) and follows two approaches: local and global (Monamo, Marivate, & Twala, 2017). Global approaches include techniques in which each anomaly is assigned a score relative to the whole data set. Local approaches, on the other hand, score anomalies with respect to their direct neighborhood, that is, the data points that are close in terms of similarity of characteristics.

According to the aforementioned concepts, the local approach detects outliers that are missed when a global approach is used, especially those in regions of variable density (Amer and Goldstein, 2012). Examples of such algorithms are those based on (i) clustering and (ii) nearest neighbors. Algorithms in the first category consider outliers to lie in sparse neighborhoods, far from their nearest neighbors, while those in the second category operate on clustered data (Onan, 2017).

There are several approaches to the detection of outliers. In this context, Hassanat, Abbadi, Altarawneh, and Alhasanat (2015) carried out a survey summarizing the different outlier-detection studies into three families: the statistics-based approach, the distance-based approach and the density-based approach. The authors present a discussion of outliers and conclude that the k-means algorithm is the most popular for clustering a data set.

Furthermore, in other studies (Dang et al., 2015; Ganji, 2012; Gu et al., 2017; Malini and Pushpa, 2017; Mandhare and Idate, 2017; Sumaiya Thaseen and Aswani Kumar, 2017; Yan et al., 2016), data mining techniques, statistical methods or both are used. For outlier detection, nearest neighbor (KNN) techniques have commonly been applied, along with others, to find unusual patterns in the behavior of the data or to improve process performance. One of these studies (2017) presents an efficient grid-based method for finding outlier patterns in large data sets.

Similarly, Yan et al. (2016) propose an outlier detection method combining KNN with data pruning, which takes successive samples of tuples and columns and applies a KNN algorithm to reduce dimensionality without losing relevant information.

Classification of significant columns

To classify the significant columns, the chi-square statistic was used. Chi-square is a non-parametric test used to determine whether a distribution of observed frequencies differs from the expected theoretical frequencies (Gol and Abur, 2015). The weight of each input column (the columns that determine the customer profile) is calculated in relation to the output column (credit amount). The higher the weight of an input column on a scale of zero to one, the more relevant it is considered.

That is, the closer the weight is to one, the more important the relationship with the output column. The statistic can only be applied to nominal columns and was selected as the method for defining relevance. Chi-square reports a significance level for associations or dependencies and was used as a hypothesis test on the weight, or importance, of each column with respect to the output column S. The resulting value is stored in a column called weight, which is reported at the end of the process together with the anomaly score.
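As a sketch of this weighting step, the chi-square statistic between two nominal columns can be computed from their contingency table and mapped to the zero-to-one scale described above. The normalization via Cramér's V and the example columns are assumptions here; the source does not state how the 0-1 weight is derived.

```python
from collections import Counter
from itertools import product

def chi_square_weight(input_col, output_col):
    """Chi-square association between two nominal columns, normalized
    to [0, 1] via Cramer's V (an assumption; the source only says the
    weight lies on a zero-to-one scale)."""
    n = len(input_col)
    obs = Counter(zip(input_col, output_col))      # observed cell counts
    row = Counter(input_col)                       # row marginals
    col = Counter(output_col)                      # column marginals
    chi2 = 0.0
    for a, b in product(row, col):
        expected = row[a] * col[b] / n
        chi2 += (obs.get((a, b), 0) - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return (chi2 / (n * k)) ** 0.5 if k > 0 else 0.0

# Hypothetical profile column vs. credit-amount column:
profession = ["emp", "emp", "self", "self", "emp", "self"]
credit     = ["low", "low", "high", "high", "low", "high"]
print(chi_square_weight(profession, credit))  # 1.0: perfectly associated
```

A weight near 1.0 marks the input column as highly relevant to the output column; near 0.0 it would be discarded.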

Nearest neighbor scoring

To obtain the values suspected of being anomalous, the KNN Global Anomaly Score is used. KNN is based on the k-nearest neighbor algorithm, which calculates the anomaly score of each data point relative to its neighborhood. Usually, outliers are far from their neighbors, or their neighborhood is sparse. The first case is known as global anomaly detection and is identified with KNN; the second refers to an approach based on local density.

By default, the score is the average of the distances to the k nearest neighbors (Amer and Goldstein, 2012). In k-nearest neighbor classification, the output column S of the nearest neighbors in the training data set is related to a new, unclassified data point in the prediction, which implies a linear decision boundary.
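A minimal sketch of this default scoring rule, assuming Euclidean distance and made-up sample points: each point's anomaly score is the average distance to its k nearest neighbors, so the isolated point receives the largest score.

```python
def knn_anomaly_scores(points, k=2):
    """Global KNN anomaly score: for each point, the average Euclidean
    distance to its k nearest neighbors (the default scoring described
    by Amer and Goldstein, 2012)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# Four clustered points and one far-away point (hypothetical data):
data = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
scores = knn_anomaly_scores(data, k=2)
print(max(range(len(data)), key=lambda i: scores[i]))  # 4: the outlier's index
```

In a real pipeline the scores would be stored alongside the chi-square weight column and reported at the end of the process.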

k values

To obtain a correct prediction, the value of k (the number of neighbors considered around the analyzed value) must be carefully configured. A high value of k yields a poor prediction, while low values tend to generate noise (Bhattacharyya, Jha, Tharakunnel, & Westland, 2011).

Frequently, the parameter k is chosen empirically and depends on each problem. Hassanat, Abbadi, Altarawneh, and Alhasanat (2014) propose testing different numbers of neighbors until reaching the one with the best precision. Their proposal starts with values from k=1 up to k equal to the square root of the number of tuples in the training data set. A common rule of thumb is to set k to the square root of the number of tuples in data set D.
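The empirical search can be sketched as follows, under the assumption of a simple majority-vote KNN classifier and a hypothetical hold-out test set; the sweep stops at the square root of the training size, as the rule of thumb suggests.

```python
import math

def knn_predict(train, labels, x, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    votes = [labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

def best_k(train, labels, test, test_labels):
    """Sweep k = 1 .. sqrt(len(train)) and keep the most accurate value,
    following the empirical search proposed by Hassanat et al. (2014)."""
    best, best_acc = 1, -1.0
    for k in range(1, int(math.sqrt(len(train))) + 1):
        acc = sum(knn_predict(train, labels, x, k) == y
                  for x, y in zip(test, test_labels)) / len(test)
        if acc > best_acc:
            best, best_acc = k, acc
    return best

# Hypothetical one-dimensional data with three well-separated classes:
train = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,), (2.0,), (2.1,), (2.2,)]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
test, test_labels = [(0.05,), (1.05,), (2.05,)], ["a", "b", "c"]
print(best_k(train, labels, test, test_labels))  # 1
```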

HOW CAN WE SOLVE THE PROBLEM OF OUTLIERS?

If we have confirmed that these outliers are not due to an error in constructing the database or in measuring the variable, eliminating them is not the solution. Removing or replacing such values can distort the inferences made from the data: it introduces bias, reduces the sample size, and can affect both the distribution and the variances.

Furthermore,  the treasure of our research lies in the variability of the data!

That is, variability (differences in the behavior of a phenomenon) must be explained, not eliminated. And if you still can’t explain it, you should at least be able to reduce the influence of these outliers on your data.

The best option is to down-weight these atypical observations using robust techniques.

Robust statistical methods are modern techniques that address these problems. They are similar to the classic methods but are less affected by the presence of outliers or by small deviations from the models' assumptions.

ALTERNATIVES TO THE MEAN

If we calculate the median (the central value of an ordered sample) for the second data set, we get a value of 14 (the same as for the first data set). This centrality statistic has not been disturbed by the presence of an extreme value and is therefore more robust.

Let’s look at other alternatives…

The trimmed mean (trimming) "discards" extreme values: it removes a fraction of the extreme data from the analysis (e.g., 20%) and calculates the mean of the remaining data. The trimmed mean in our case would be 13.67.

The winsorized mean replaces a percentage of the extreme values (e.g., 20%) with the nearest less extreme ones. In our case, the winsorized mean of the second sample would be 13.62.

We see that all of these robust estimates represent the sample better and are less affected by extreme data.
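Using a hypothetical sample with one extreme value (the data set behind the 13.67 and 13.62 figures is not shown in this excerpt), the three estimators can be sketched in a few lines:

```python
def trimmed_mean(data, fraction=0.2):
    """Drop fraction/2 of the values at each tail, then average the rest."""
    xs = sorted(data)
    cut = int(len(xs) * fraction / 2)
    xs = xs[cut:len(xs) - cut] if cut else xs
    return sum(xs) / len(xs)

def winsorized_mean(data, fraction=0.2):
    """Replace fraction/2 of each tail with the nearest kept value."""
    xs = sorted(data)
    cut = int(len(xs) * fraction / 2)
    if cut:
        xs = [xs[cut]] * cut + xs[cut:len(xs) - cut] + [xs[-cut - 1]] * cut
    return sum(xs) / len(xs)

sample = [11, 12, 13, 14, 14, 15, 15, 16, 17, 90]  # 90 is an outlier
print(sum(sample) / len(sample))  # 21.7: plain mean, pulled up by 90
print(trimmed_mean(sample))       # 14.5: robust to the extreme value
print(winsorized_mean(sample))    # 14.5: robust to the extreme value
```

Both robust estimates stay near the bulk of the data while the plain mean is dragged toward the outlier.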

What role does technology play in your best data collection process?


Data Collection

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Data collection is the process of gathering data for use in business decision-making, strategic planning, research and other purposes. It’s a crucial part of data analytics applications and research projects: Effective data collection provides the information that’s needed to answer questions, analyze business performance or other outcomes, and predict future trends, actions and scenarios.

In businesses, data collection happens on multiple levels. IT systems regularly collect data on customers, employees, sales and other aspects of business operations when transactions are processed and data is entered. Companies also conduct surveys and track social media to get feedback from customers. Data scientists, other analysts and business users then collect relevant data to analyze from internal systems, plus external data sources if needed. The latter task is the first step in data preparation, which involves gathering data and preparing it for use in business intelligence (BI) and analytics applications.

For research in science, medicine, higher education and other fields, data collection is often a more specialized process, in which researchers create and implement measures to collect specific sets of data. In both the business and research contexts, though, the collected data must be accurate to ensure that analytics findings and research results are valid.

What are information technologies?

Information technology is a process that uses a combination of means and methods for collecting, processing, and transmitting data to obtain new, higher-quality information about the state of an object, process, or phenomenon. The purpose of information technology is the production of information for people to analyze, so that decisions can be made based on it to perform an action.

Information technologies (IT)

The introduction of the personal computer into the information sphere and the use of telecommunications media have marked a new stage in the development of information technology. Modern IT is information technology with a "friendly" user interface, built on personal computers and telecommunication facilities. It rests on the following basic principles.


  1. Interactive (dialogue) mode of working with a computer.
  2. Integration with other software products.
  3. Flexibility in the process of changing data and task definitions.

Many types of computer programs serve as information technology tools: word processors, publishing systems, spreadsheets, database management systems, electronic calendars, and special-purpose information systems.

Characteristics of information technologies:

  • User operation in data manipulation mode (without programming). The user does not need to know and remember commands, but rather sees (output devices) and acts (input devices).
  • Transversal information support at all stages of information transmission, provided by an integrated database that offers a single way to enter, search, display, update, and protect information.
  • Paperless document processing, in which only the final version of a document is committed to paper; intermediate versions and the necessary data are delivered to the user through the PC display screen.
  • Interactive (dialogue) task-solving mode with a wide range of possibilities for the user.
  • Collective production of a document by a group of computers linked by means of communication.
  • Adaptive processing of the form and modes of presentation of information during problem-solving.

Types of information technologies

The main  types of information technology  include the following.

  • Information technology for data processing is designed to solve well-structured problems whose solution algorithms are well known and for which all the necessary input data exist. It is applied at the operational level, to low-skilled personnel, in order to automate routine, constantly repeated administrative operations.
  • Management information technology is intended to serve the information needs of all company employees involved in making administrative decisions. Here, information usually takes the form of regular or special management reports containing information about the past, present, and possible future of the company.
  • Automated office information technology is designed to complement the company's existing staff communication system. Office automation involves organizing and supporting communication processes both within the company and with the external environment, on the basis of computer networks and other modern means of transferring and working with information.
  • Information technology for decision support is designed to develop a management decision through an iterative process involving a decision support system (the computer link and the object of management) and a person (the management link, who sets the input data and evaluates the result).
  • Expert systems information technology is based on the use of artificial intelligence. Expert systems allow managers to receive expert advice on any problem about which knowledge has been accumulated in these systems.


Using modern technology is more economical than ever, and electronic tools now offer a cost-effective alternative to paper questionnaires for collecting high-quality data. To help you decide whether computer-assisted personal interviewing (CAPI) is for you, this blog reviews the potential benefits and challenges of using CAPI and shares the experience of a recent survey carried out in Guyana using the free Survey Solutions software.

Paper questionnaires: the traditional way to collect data

Conducting surveys of this magnitude with paper questionnaires can be costly in economic, administrative, and logistical terms while presenting a series of challenges: printing and transporting questionnaires to and from the field often carries a high cost, and corrections to questions can represent a significant burden in terms of cost and time. There is also the real risk that questionnaires will be lost in the field or damaged by weather or transportation before the data is systematized.

Even when all interviews have been conducted, responses must be manually entered into a digital file before the data can be analyzed. This process consumes a great deal of time and manual work and increases the margin of error. Data quality checks are limited, and errors are sometimes recognized only after the survey has ended, making them more difficult to correct.

However, there is an alternative to paper questionnaires: computer-assisted personal interview (CAPI). In recent years, CAPI has attracted more attention as it presents a more economical way to collect high-quality data.

CAPI: an increasingly popular tool

With the processing speeds of today's computers, the increasing global availability of Internet service, and the falling prices of mobile devices, CAPI has become increasingly attractive. With CAPI, the questionnaire is created using special software and downloaded directly to a mobile device (usually a smartphone or tablet), which the interviewer uses to administer and fill out the questionnaire. The information from these questionnaires is uploaded to a central server, where it can be accessed and reviewed remotely.

Depending on the size of the survey sample, purchasing tablets for electronic surveys becomes increasingly more affordable than printing paper questionnaires. The technical requirements for such devices are relatively low, and a large number of questionnaires can usually be saved on a device without danger of running out of storage. Additionally, once a questionnaire has been loaded onto a mobile device, it can be modified: if an error is detected in the early stages of the survey, it can easily be corrected without incurring additional printing costs.

How do you best standardize your data collection procedures?


Data Collection

Data collection is the process of collecting and analyzing information on relevant variables in a predetermined, methodical way so that one can respond to specific research questions, test hypotheses, and assess results. Data is a collection of facts, figures, objects, symbols, and events gathered from different sources. Organizations collect data with various data collection methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various points in time.


For instance, an organization must collect data on product demand, customer preferences, and competitors before launching a new product. If data is not collected beforehand, the organization’s newly launched product may fail for many reasons, such as less demand and inability to meet customer needs. 

Although data is a valuable asset for every organization, it does not serve any purpose until analyzed or processed to get the desired results.

In a society in which information and data are the key to any activity or business, it is very important to know how to standardize data in order to get the most value and performance out of it.

With globalization and the information systems we use in our daily lives, the amount of information and data at our disposal is immense. The problem is knowing how to manage such an amount of data: how to collect it, treat it, classify it, and apply it. In this sense, data standardization can be of great help.

Aware of the importance of knowing how to standardize data, at Ayuware, as experts in big data, we have prepared this post with relevant information about the advantages this procedure offers and all the benefits it provides, so that you know exactly what it is about.

What is standardizing data?

Data standardization is the data quality process of transforming data to fit a predefined and constrained set of values, relying on the power of uniformity to improve data efficiencies.

Data standardization, sometimes also known as normalization, is the process of adjusting or adapting certain characteristics so that the data conforms to a common type, model, or norm, with the aim of making its treatment, access, and use easier for the users who work with it.

The very concept of open data implies a search for standardization in the use of information in open format so that it can be used and reused by citizens.

Therefore, it could be said that standardization is what allows everyone to compare and consult data and always find the information they need, certain of the uniform way in which they will find it.

In statistics, normalization or standardization can have a wide range of meanings. In the simplest cases, standardizing indices involves adjusting values measured on different scales to a common scale.

On the other hand, in more complex cases, data normalization or standardization can refer to making more sophisticated adjustments where the objective is to obtain all those probability distributions that fit the determined values.

In short, it could be said that without standardization, and therefore without uniformity in naming conventions and in the correspondence between macros and parameters, only unreliable results can be obtained.
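The "common scale" idea in the simplest statistical case can be sketched as follows, with two hypothetical columns measured in different units:

```python
def z_score(values):
    """Standardize a column to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale a column to the common [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 50]              # years
income = [1500, 3000, 4500, 6000]    # euros per month
print(min_max(ages))                 # both columns now share the same
print(min_max(income))               # 0-1 scale, so they are comparable
```

After rescaling, values originally expressed in years and in euros can be compared or combined directly.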

Importance of data standardization

Before knowing how to standardize data, it is important that you know the  importance  of this process for the correct use and processing of information in an effective and functional manner.

Data standardization allows us  to ensure that we will have useful, easily linkable and usable data  at our disposal for any activity we require.

Data standardization not only  helps us organize sets of complex information , but it will also  facilitate its analysis  since it breaks down multiple dimensions and transforms the information into viable insights.

Committing to standardizing data is a way to ensure the uniformity necessary to keep analyses effective. As we know, information today is power, and therefore having standardized, optimized information is a great advantage both at the personal and the business level.

In this sense, data standardization has gone from being a good practice or something recommended, to being  a necessity  for all those who want  to get the most out of the  data  and information they have at their disposal.

Now that we are getting closer to the process of how to standardize data, it is important to keep in mind that, for it to be possible, institutions must work together to develop standards: for a standard to exist, it must be adopted by a large majority before it can be considered established.

Furthermore, data standards and normalization imply good practices such as monitoring and control, keeping in mind at all times that the information must be useful to anyone who wishes to use it. Thanks to these good practices, the data standardization process is effective and provides all the advantages mentioned above.

To standardize data correctly, it is necessary to pay attention to the planning done before collecting the information, and also to plan for feedback on that information once the data is published in an open and public manner.

Key moments in data standardization

To know how to standardize data, it is important to recognize that there are three key moments in the process, each of which requires attention for the process to be carried out satisfactorily:

  • Standardization at data capture: for the standardization procedure to be correct, the data must be collected optimally, following guidelines that will greatly facilitate the subsequent steps.
  • Standardization at data storage: once the information is collected, it is very important to pay attention to how the data is stored, keeping it in an orderly manner and providing every facility for its later retrieval and consultation.
  • Standardization in data presentation: this is a fundamental moment since, after searching for the data we need, it is very important that the results are displayed in a standardized way, so that we can use them in a useful and functional manner. If all the previous steps have been carried out correctly, this last moment should not present any problems.


What measures are in place to ensure the security of your data?


Data


A phrase attributed to the creator of the World Wide Web, Tim Berners-Lee, is that "data is precious and will last longer than the systems themselves."

The computer scientist was referring to the fact that information is highly coveted, one of the most valuable assets companies have, and so it must be protected. The loss of sensitive data can mean the bankruptcy of a company.

Faced with a growing number of threats, companies need to implement measures to protect their information. Before doing so, it is necessary to classify the data you have and the risks to which it is exposed: the price list of the products you sell is not the same as the sales figure you plan to achieve this year, or your customer database.

To talk about the cloud these days is to talk about the need for storage, flexibility, connectivity, and real-time decision making. Information is a constantly growing asset that must be managed by work teams, and platforms such as Claro Drive Negocio offer, in addition to storage space, collaboration tools to manage an organization's data.

With cloud storage, instead of saving data on their own computers or hard drives, the user saves it in a remote location that can be accessed through an internet connection. Several providers sell space on the network at different price ranges, but few offer true security and protection for that gold your company holds: data.

For context, more than a third of companies have consolidated flexible and scalable cloud models as an alternative for running their workloads and achieving digital transformation while reducing costs. Hosted information management services allow IT to maintain control, and administrators to monitor access and hierarchies by business unit.

Five key security measures

Below are five security recommendations to protect information in companies:

  1. Make backup copies. Replicating or keeping a copy of the information outside the company's facilities can save your operation in the event of an attack. Options include the cloud or data centers, so that the protected information is available at any time. It is also important to be able to configure the frequency of the backup, so that the most recent data is always backed up.
  2. Foster a culture of strong passwords. Kaspersky recommends that passwords be longer than eight characters and include uppercase letters, lowercase letters, numbers, and special characters. The vendor also suggests not including personal information or common words; using a different password for each service; changing them periodically; and never sharing them, writing them on paper, or storing them in the web browser. Every year, NordPass publishes a ranking of the 200 worst passwords used in the world. The worst four are "123456", "123456789", "picture1" and "password".
  3. Protect email. Now that most communication takes place through this medium, it is advisable to have anti-spam filters and message encryption systems to protect the privacy of the data. Spam filters help control the receipt of unsolicited emails, which may be infected with viruses that could compromise the security of company data.
  4. Use antivirus software. This tool should provide protection against security threats such as zero-day attacks, ransomware, and cryptojacking, and it must also be installed on the cell phones that contain company information.
  5. Control access to information. One way to minimize the risk, and the resulting impact of errors on data security, is to grant access to data according to each user's profile. Under the principle of least privilege, if a person does not have access to certain vital company information, they cannot put it at risk.
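The password rules in point 2 are easy to automate. A minimal sketch of such a policy check, assuming the character classes listed above (the function name and examples are illustrative):

```python
import string

def is_strong(password):
    """Check the Kaspersky-style rules quoted above: longer than eight
    characters, with uppercase, lowercase, digits and special characters."""
    return (len(password) > 8
            and any(c.islower() for c in password)
            and any(c.isupper() for c in password)
            and any(c.isdigit() for c in password)
            and any(c in string.punctuation for c in password))

print(is_strong("123456"))        # False: a perennial "worst password"
print(is_strong("mX4!tr0q-Zu"))   # True
```

A real deployment would also reject passwords found in breach lists, which a length-and-classes check alone cannot catch.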

In security, nothing is too much

In a synthetic way, the National Cybersecurity Institute of Spain, INCIBE, recommends the following “basic security measures”:

  • Keep systems updated and free of viruses and vulnerabilities
  • Raise awareness among employees about the correct use of corporate systems
  • Use secure networks to communicate with customers, encrypting information when necessary
  • Include customer information in annual risk analyses, perform regular backups, and verify your restore procedures
  • Implement correct authentication mechanisms, communicate passwords to clients securely and store them encrypted, ensuring that only they can recover and change them

The first time a company or business faces the decision to automate a process, it can be somewhat intimidating; however, taking the following points into account makes it a manageable task.

1.- Start with the easy processes

Many companies start considering automation because they have a large, inflexible process that they know takes up too much time and money, so they start with their most complex problem and work backwards. This strategy is generally expensive and time-consuming; what you should do instead is review your most basic processes and automate those first. For example, are you emailing a document around for revisions when you should be building an automated workflow? There are probably dozens, if not hundreds, of these simple processes that you can address and automate before taking on your "giant" process.

2.- Make sure your employees lose their fear of automation

Often, an employee who is not familiar with an automated process is afraid of it. Why? In general, they fear that automation will eliminate their position. That is why it is important to build a supportive culture around automation and help your employees understand that just because some of their work is now assisted by an automated process, it does not mean they are any less valuable.

How will you store and manage the best your collected data?


Collected data

Collected data is very important. Data collection is the process of collecting and measuring information about specific variables in an established system, which then allows relevant questions to be answered and results to be evaluated. Data collection is a component of research in all fields of study, including the physical and social sciences, the humanities, and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same. The goal of all data collection is to capture quality evidence that allows analysis to lead to the formulation of compelling and credible answers to the questions that have been posed.

What is meant by privacy?

The ‘right to privacy’ refers to being free from intrusions or disturbances in one’s private life or personal affairs. All research should outline strategies to protect the privacy of the subjects involved, as well as how the researcher will have access to the information.

The concepts of privacy and confidentiality are related but are not the same. Privacy refers to the individual or subject, while confidentiality refers to the actions of the researcher.

What does the management of stored information entail?

Manual data collection and analysis are time-consuming processes, so transforming data into insights is laborious and expensive without the support of automated tools.

The size and scope of the information analytics market is expanding at an increasing pace, from self-driving cars to security camera analytics and medical developments. In every industry, in every part of our lives, there is rapid change and  the speed at which transformations occur is increasing.

It is a constant  evolution that is based on data.  That information comes from all the new and old data collected, when it is used to  develop new types of knowledge.

The relevance that information management has acquired raises many questions about the requirements applicable to all data collected and information developed.

Data encryption

Data encryption is not a new concept: history offers the ciphers Julius Caesar used to send his orders, and the famous Enigma machine the Nazis used to encrypt communications in the Second World War.

Nowadays, data encryption is one of the most widely used security measures to protect personal and business data.

Data encryption works through mathematical algorithms that convert data into an unreadable form. Decrypting the data involves two keys: an internal key that only the person who encrypts the data knows, and an external key that the recipient of the data, or whoever is going to access it, must know.

Data encryption can be used to protect all types of documents, photos, videos, and so on. It is a method with many advantages for information security.
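The shared-key idea described above is easiest to see with a toy symmetric sketch. The XOR cipher below only illustrates how a key makes data unreadable and then recoverable; it is emphatically not a secure algorithm and must never be used to protect real data.

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR each byte with the repeating key.
    Applying it twice with the same key restores the original.
    Illustrative only; real systems use vetted algorithms like AES."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

secret = xor_cipher(b"account: 4421", b"k3y")
print(secret != b"account: 4421")   # True: unreadable without the key
print(xor_cipher(secret, b"k3y"))   # b'account: 4421': original recovered
```

The two-key scheme the text describes corresponds to asymmetric (public-key) cryptography, where the encryption and decryption keys differ.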

 


Advantages of data encryption

  • Useless data: if a storage device is lost or data is stolen by a cybercriminal, encryption renders that data useless to anyone who lacks the permissions and the decryption key.
  • Improved reputation: companies that work with encrypted data offer both clients and suppliers a secure way to protect the confidentiality of their communications and data, projecting an image of professionalism and security.
  • Less exposure to sanctions: some companies or professionals are required by law to encrypt the data they handle, for example lawyers, data from police investigations, or data containing information on acts of gender violence. In short, all data that by its nature is sensitive to exposure requires mandatory encryption, and sanctions may follow if it is not encrypted.

Data storage 

There are many advantages to managing stored information well. Among the benefits of adequately covering the requirements of the Data Storage function and data management, the following two stand out:

  • Savings: a server's capacity to store data is limited, so storing data without structure, logical order, or guiding principles represents an avoidable increase in cost. Conversely, when data storage follows a plan and decisions align with the business strategy, the advantages extend to every function of the organization.
  • Increased productivity: when data has not been stored correctly, the system runs more slowly. One strategy often used to avoid this is to divide data into active and inactive. The latter is kept compressed and in a separate location so that the system remains agile, without the data becoming completely unreachable, since it may occasionally need to be accessed again. Today, cloud services make it much easier to find the most appropriate storage approach for each type of information.
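
The active/inactive split described above can be sketched in a few lines. This is an illustrative example, not a real archiving system: the record shape, the 90-day cutoff, and the use of gzip are all assumptions chosen to show the idea of keeping cold data compressed yet still reachable.

```python
import gzip
import json
import time

CUTOFF = 90 * 24 * 3600  # assumption: untouched for 90 days means inactive
now = time.time()

# Hypothetical records, each with a last-access timestamp.
records = [
    {"id": 1, "last_access": now - 10},            # recently used -> active
    {"id": 2, "last_access": now - 200 * 86400},   # long untouched -> inactive
]

active = [r for r in records if now - r["last_access"] < CUTOFF]
inactive = [r for r in records if now - r["last_access"] >= CUTOFF]

# Inactive data is kept compressed, off the hot path, but not lost:
archive = gzip.compress(json.dumps(inactive).encode())

# It can still be restored the moment someone needs it again.
restored = json.loads(gzip.decompress(archive))
```

The system stays agile because the working set only contains `active`, while `archive` costs little space and can be rehydrated on demand.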

We must avoid letting each application decide how to save the data; to that end, the information management policy should be uniform across all applications and answer the following questions in each case:

  • How is the data stored?
  • When is the data saved?
  • What part of the data or information is collected?
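
One way to make such a policy uniform is to express the answers to those three questions as a single shared object that every application consumes. The sketch below is purely illustrative; the class and field names are assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoragePolicy:
    """Uniform answers to the three questions, shared by all
    applications instead of letting each one decide on its own.
    Illustrative sketch; field names are assumptions."""
    how: str    # how the data is stored
    when: str   # when the data is saved
    what: str   # what part of the data is collected

# A hypothetical policy defined once by Data Governance:
policy = StoragePolicy(
    how="encrypted JSON in the central store",
    when="at the end of each transaction",
    what="fields required for reporting",
)
```

Because the dataclass is frozen, no individual application can quietly override the answers, which is the point of a uniform policy.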

In short, a person in charge will be designated through Data Governance, which is in turn responsible for defining the standards and the way information is stored, since not every silo can be used.

This is how this function supports the common objective: through procedures, planning, organization, and control exercised transversally, always seeking to enhance the pragmatic side of the data.


Steps of data processing in research

Data processing in research has six steps. Let's look at why they are an essential component of research design.

  • Research data collection

Data collection is the main stage of the research process. It can be carried out through various online and offline research techniques and can mix primary and secondary research methods.

The most common form of data collection is the research survey. However, with a mature market research platform, you can also collect qualitative data through focus groups, discussion modules, and more.

  • Research data preparation

The second step in research data management is data preparation: eliminating inconsistencies, removing bad or incomplete survey responses, and cleaning the data to keep it consistent.

This step is essential, since bad data can render research studies completely useless and a waste of time and effort.
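
The cleaning step above can be sketched with plain Python. This is a minimal, hypothetical example: the survey row shape, the 1-5 score range, and the duplicate rule are all assumptions, chosen only to show the three typical checks (completeness, valid range, deduplication).

```python
# Hypothetical raw survey rows; None marks an unanswered question.
raw = [
    {"respondent": "r1", "age": 34, "score": 4},
    {"respondent": "r2", "age": None, "score": 5},  # incomplete -> dropped
    {"respondent": "r1", "age": 34, "score": 4},    # duplicate  -> dropped
    {"respondent": "r3", "age": 29, "score": 99},   # out of range -> dropped
]

def clean(rows, score_range=(1, 5)):
    """Keep only complete, in-range, first-seen responses."""
    seen, out = set(), []
    for row in rows:
        complete = all(v is not None for v in row.values())
        in_range = complete and score_range[0] <= row["score"] <= score_range[1]
        if complete and in_range and row["respondent"] not in seen:
            seen.add(row["respondent"])
            out.append(row)
    return out

cleaned = clean(raw)  # only r1's first complete, in-range answer survives
```

Real survey platforms apply far richer rules, but the principle is the same: bad rows are filtered out before any analysis happens.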

  • Research data entry

The next step is to enter the cleaned data in a digitally readable format consistent with organizational policies, research needs, and so on. This step is essential because the data is entered into the online systems that support research data management.

  • Research data processing

Once the data is entered into the systems, it is essential to process it to make sense of it. The information is processed based on needs, the types of data collected, the time available to process it, and many other factors. This is one of the most critical components of the research process.
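
"Processing" often means reducing many raw rows to a few summary numbers. The sketch below groups hypothetical survey responses by region and averages a satisfaction score; the field names and values are assumptions for illustration only.

```python
from statistics import mean

# Cleaned survey responses (hypothetical): satisfaction on a 1-5 scale.
responses = [
    {"region": "EMEA", "satisfaction": 4},
    {"region": "EMEA", "satisfaction": 5},
    {"region": "APAC", "satisfaction": 3},
]

# Group by region, then reduce: the basic "raw rows -> summary" step.
by_region = {}
for r in responses:
    by_region.setdefault(r["region"], []).append(r["satisfaction"])

summary = {region: mean(scores) for region, scores in by_region.items()}
# summary -> {"EMEA": 4.5, "APAC": 3}
```

The same group-then-aggregate pattern scales up to whatever statistics the research design calls for.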

  • Research data output

This is the stage at which processed research data becomes knowledge. It allows business owners, stakeholders, and other staff to view the data as graphs, charts, reports, and other easy-to-consume formats.

  • Storage of processed research

The last stage of data processing is storage. It is essential to keep data in a format that can be indexed and searched and that constitutes a single source of truth. Knowledge management platforms are the most commonly used for storing processed research data.
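
The indexed, searchable storage described above can be illustrated with Python's built-in SQLite module. This is a minimal sketch, not a knowledge management platform: the table, column names, and sample findings are all invented for the example.

```python
import sqlite3

# An in-memory database stands in for the knowledge store.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE findings (
    id INTEGER PRIMARY KEY,
    topic TEXT,
    insight TEXT)""")

# The index is what makes stored results searchable by topic.
con.execute("CREATE INDEX idx_topic ON findings(topic)")

con.executemany(
    "INSERT INTO findings (topic, insight) VALUES (?, ?)",
    [("pricing", "customers prefer annual plans"),
     ("pricing", "discounts reduce churn"),
     ("support", "response time matters most")],
)

# An indexed lookup over the processed findings:
rows = con.execute(
    "SELECT insight FROM findings WHERE topic = ?", ("pricing",)
).fetchall()
```

Swapping the in-memory connection for a file path (or a real platform) keeps the same idea: one indexed store acting as the single source of truth.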


Benefits of data processing in research

Data processing can make the difference between having actionable knowledge and not having it in the research process. Processing research data offers some specific advantages and benefits:

  • Streamlined processing and management

When research data is processed, there is a high probability that this data will be used for multiple purposes now and in the future. Accurate data processing helps streamline the handling and management of research data.

  • Better decision making

With accurate data processing, it becomes possible to make sense of data and arrive at faster, better decisions. Decisions are thus made based on data that tells a story rather than on a whim.

  • Democratization of knowledge

Data processing allows raw data to be converted into a format that works for multiple teams and personnel. Easy-to-consume data enables the democratization of knowledge.

  • Cost reduction and high return on investment

Data-backed decisions help brands and organizations make choices supported by evidence from credible sources. This helps reduce costs, since decisions are tied to data, and it also helps maintain a very high ROI on business decisions.

  • Easy to store, report and distribute

Processed data is easier to store and manage, since the raw data has been structured. It can be consulted and accessed in the future and called upon whenever necessary.

Examples of data processing in research 

Now that you know the nuances of data processing in research, let’s look at concrete examples that will help you understand its importance.

  • Example in a global SaaS brand

Software-as-a-Service (SaaS) brands have a global footprint and an abundance of customers, often both B2B and B2C. Each brand and each customer has different problems they hope to solve with the SaaS platform, and therefore different needs.

By conducting consumer research, the SaaS brand can understand customers' expectations, purchasing preferences and behaviors, and more. This also helps with profiling customers, aligning product or service improvements, managing marketing spend, and more, all based on the processed research data.

Other examples of this kind of data processing include retail brands with a global footprint and customers from various demographic groups, vehicle manufacturers and distributors with multiple dealerships, and more. Everyone who does market research needs to leverage data processing to make sense of the data.
