A Brief Background of Machine Learning and Its Best Facts


Did you know that machine learning is a branch of artificial intelligence that enables computers to learn from data, using statistical techniques, without being explicitly programmed? Many call it software 2.0. In recent years, it has evolved from a mere concept into a technology that significantly impacts our daily routines, from email spam filters to voice assistants, movie suggestions, and more.

What is your strategy for minimizing non-response bias?


BIAS

Have you ever looked at audience data and thought that it doesn’t seem completely real or accurate? It could be the result of bias in the data. Bias in the data generates results that are not fully representative of the audience you are researching. It can happen intentionally or unintentionally, and is something you should take into account in your planning and strategy.

Before we continue, you might want to read these two articles: how we use and enrich our data sources in Audiense, and data restrictions and how they work in the real world.

An example of data bias can be found in demographic and socioeconomic data. India’s population is made up of  52% men and 48% women . If we talk about social data, to begin with, Internet penetration in the population is  49% . And looking at India’s population in Facebook Insights, we see that the gender split is 76% men and 24% women! So what is the correct data?

This shows us that there is an imbalance between the proportion of men and women on social media and the proportion in the country as a whole. Simply put, not the entire adult population of the world is on social media, so the data we are working with will only be representative of the population that is present on social media.

If we want to go deeper, we must remember that people can create various social profiles, such as private accounts or fan pages, and this can differ depending on the online community you are analyzing.

If you’re to deliver an effective survey, you’ll need to identify what you want to measure, the audience you want to target and your choice of distribution method to reach that audience.

However, if after all this careful planning you find that your survey response rate is much lower than you expected, you need to be asking yourself, what could be the cause of this?

Well, one of the biggest factors could be nonresponse bias. Read on to find out more about this issue, its causes, why it can be problematic and ways to reduce nonresponse bias in your own surveys.

What is nonresponse bias?

Nonresponse bias occurs when survey participants are unwilling or unable to respond to a survey question or an entire survey. While the reasons for nonresponse can vary from person to person, when respondents refuse to participate it can be a major source of error in your survey data, which can harm its accuracy.

For a source of error to be considered a form of bias, it must be systematic in nature, and nonresponse bias is no exception to this rule.

So, if a survey method or design is created in a way that makes it more likely for certain groups of potential respondents to refuse to participate or be absent during a surveying period, then it has created a systematic bias.

Consider the following example where you’re asking respondents for sensitive information as part of a survey, which is looking to measure tax payment compliance.

In this scenario, citizens who do not properly follow tax laws are likely to be the most uncomfortable with filling out this type of survey and the most likely to refuse. Consequently, this will bias the data towards a more law-abiding net sample, rather than the original sample.

This nonresponse bias when requesting legally sensitive information has been shown to be even more extreme if the survey explicitly states that a government or another organisation of authority is collecting the data.

What causes nonresponse bias?

Besides requests for sensitive information, there are many more issues that can cause nonresponse bias.

Here are some of the key ones.

Poor survey design

From the length and presentation of your survey to how easy it is to understand and answer, there are many aspects of survey design that can cause respondents to drop out and fail to complete your survey.

Subsequently, you need to make sure your survey is as clear, concise and engaging as you can make it.

Be sure to follow and include some survey design best practices to make sure your next survey is as good as it can be.

Incorrect target audience

One of the first things you need to think about in your survey project is the audience you're targeting.

Make sure that audience is relevant to the survey you’re looking to send out.

For example, if you were issuing a survey to canvass views about a new flavour of dog food, you wouldn't want to accidentally include cat owners in your survey distribution list.

Failed deliveries

Unfortunately, when you send your surveys, some will always end up going directly into a spam folder. And if you've not set up your sending options in the right way, you may not even know that your survey wasn't received; it will simply be recorded as a nonresponse.

To help with this, some distribution options, such as email, offer open tracking, letting you know whether your email was opened, how many survey click-throughs you got, and who responded to your survey, so your records can be more accurate.

Refusals

There can be a lot going on in people’s lives, so, no matter how good your survey, there’s always likely to be some people who will just say ‘no’ to completing it.

It could be a bad day or time for them, or they may just not want to do it. However, bear in mind that just because they said “no” today, it doesn’t mean they won’t take one of your surveys another time.

Accidental omission

Sometimes some people will simply forget to complete your survey.

While it’s difficult to prevent this from happening, in most cases this will only affect a smaller number of your nonresponses.


Why is nonresponse bias a problem?

The problem with nonresponse bias is that it can lead to inconclusive results, which prevents your survey from meeting its objective, no matter what your survey’s goal.

For example, let’s say you wanted to gather data about a particular product feature, to find out whether or not it was still adding value to your product.

If an insufficient number of your sample completed your survey, you might not have sufficient data to make an informed decision on whether to keep the feature as it is, improve it, or go in another direction completely.

Survey data is only at its most informative and useful when you’re able to see the complete picture of something. So, limiting your nonresponse bias not only has an impact on your survey responses, but on your decision making too.

How to reduce nonresponse bias

Having got up to speed with nonresponse bias, its causes and why it's a problem, we're sure many of you will be keen to know how to keep it to an absolute minimum in your surveys.

Well, read on for some tips about how to reduce nonresponse bias.

Keep your surveys simple and concise

Short and simple is the key here.

In fact, studies show that longer surveys on average lose more than three times as many respondents compared to a survey that is less than five minutes long.

The problem with including too many survey questions is that your customer may not finish their responses, or may not want to begin your survey in the first place. Consider making your survey no more than five minutes long, with 10 questions at most.

Pre-test your survey

It is really important to ensure that your survey and your survey invites run smoothly on any medium or device your respondents might use. Respondents are more likely to ignore survey requests that have long loading times or whose questions don't fit the screen size they're viewing.

Consequently, it’s prudent to consider all the possible communications software and devices your survey may be run on and pre-test your surveys on each of these, to try and ensure it runs as smoothly on as many of these as you can.

Set participant expectations

To help minimise nonresponse bias, it’s good practice to communicate to your customer what they should expect from your survey, either through an earlier email or in your survey introduction message.

You need to outline your survey’s goal, approximate time it will take to complete and any details about anonymity or confidentiality that’s included in your survey.

Re-examine your survey timing and distribution methods

Your survey distribution timing and method can also make a difference to your volume of nonresponses.

From whether you're targeting an internal or external, B2B or B2C audience, there are lots of factors that can influence the best times to send your survey. It can be helpful to test out a range of different days and times and see what works best for you.

Whether you use email, SMS or a weblink, or have more success through social media or QR codes, different survey distribution methods can work better with different audience groups, just as with altering your timings.

Once again, it can also be helpful to test out different channels with your audience to see what generates the best response rate, while minimising your nonresponses.

Offer an incentive to complete your survey

Always try to communicate to respondents how they will benefit from taking your survey. It could be as simple as telling them how their feedback will be used and the pain points this will solve for them moving forward.

Alternatively, if you’re targeting a consumer audience, you might like to offer a monetary incentive for them to complete your survey.

For example, you might like to offer a discount on a future purchase they make with you, or an incentive for referring a friend.


Issue reminders

Busy customers can easily put your survey on their to-do list but then forget to complete it. So being able to send a few reminders can be really beneficial in boosting the number of completed responses you're able to gather.

Carefully make a note of when you send reminders and be mindful to space them out, so you don’t harass people on your contact lists, especially those who’ve already completed your survey.

Remember to close the feedback loop

Be sure to thank those that complete your survey, letting them know how much you appreciate their time and feedback. And depending on the nature of your survey, you may give a brief indication of what you hope to do with that information.

Ultimately, when a respondent feels that they have been heard and appreciated, then they’ll be more likely to complete another one of your surveys in the future.

Get better results when sending your surveys

We hope you found this blog interesting. Having provided an overview of nonresponse bias, its causes and ways to reduce it, we hope you will be able to incorporate some of this advice into your own surveys.

While the extra checks and tasks are likely to add a bit more time on to your survey project, the boost to your survey response numbers and the quality of your data should make up for this.

How to best address potential confounding variables in your data collection?


Data Collection

Data is a collection of facts, figures, objects, symbols, and events gathered from different sources. Organizations collect data with various data collection methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various points in time.

For instance, an organization must collect data on product demand, customer preferences, and competitors before launching a new product. If data is not collected beforehand, the organization’s newly launched product may fail for many reasons, such as less demand and inability to meet customer needs. 

Although data is a valuable asset for every organization, it does not serve any purpose until analyzed or processed to get the desired results.

Data collection methods are techniques and procedures used to gather information for research purposes. These methods can range from simple self-reported surveys to more complex experiments and can involve either quantitative or qualitative approaches to data gathering.

Some common data collection methods include surveys, interviews, observations, focus groups, experiments, and secondary data analysis. The data collected through these methods can then be analyzed and used to support or refute research hypotheses and draw conclusions about the study’s subject matter.

What is a confounding variable?

The concept of confounding is probably one of the most important in general epidemiology. Firstly, because much of the work carried out in this field consists precisely of trying to prevent it when designing research studies, or of controlling its effect when it appears in the research carried out. Secondly, and specifically with regard to health professionals, because an adequate understanding of this phenomenon determines whether they can interpret, critically and correctly, the results of the many studies published in the scientific literature.

This review aims to explain, in a didactic manner and with the help of several examples, the concept of confounding; then to do the same with another important concept, that of effect modification (interaction); and finally to describe the differences between the two concepts.

Concept

Although some antecedents can be found in Francis Bacon, the first author who explicitly addressed the issue of confounding was the British philosopher and economist John Stuart Mill (1806-1873). When referring to the criteria necessary for establishing a causal relationship, Mill pointed out the need to ensure that no factor was present whose effects could be confused with those of the agent one wanted to study.

Before defining the confounding phenomenon, it is necessary to describe the counterfactual approach of biological models. Given data showing 15 newborns with neural tube defects (NTD) in a sample of 10,000 women with folic acid deficiency, we could ask ourselves whether the incidence of malformations is due to folic acid deficiency. The question is important because, if the answer is affirmative, we would have a simple solution to the problem in our hands: for example, the fortification of foods with folic acid.

To answer this question, it is necessary to compare this group of women with another that has normal folic acid values. If the hypothesis that the deficiency increases the incidence of NTD in newborns is true, it would be logical to find a smaller number of newborns with these malformations, for example 5 cases with NTD, in a second sample of 10,000 women with normal values. Comparing the two groups, we would obtain a relative risk of 3, which would be interpreted as meaning that folic acid deficiency multiplies by three the risk of a newborn being born with a neural tube defect.
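To make the arithmetic behind that figure explicit, here is a minimal sketch in Python, using only the hypothetical counts from the example above (15 cases per 10,000 exposed women versus 5 per 10,000 unexposed):

```python
# Minimal sketch of the relative-risk arithmetic described above.
# The counts are the hypothetical figures from the example, not real data.
cases_exposed, n_exposed = 15, 10_000      # women with folic acid deficiency
cases_unexposed, n_unexposed = 5, 10_000   # women with normal folic acid values

risk_exposed = cases_exposed / n_exposed        # 0.0015
risk_unexposed = cases_unexposed / n_unexposed  # 0.0005
relative_risk = risk_exposed / risk_unexposed   # 3.0

print(f"Relative risk = {relative_risk:.1f}")
```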

However, we cannot rule out that this second group of women, with normal folic acid values, also has certain healthy characteristics, such as a better diet in general, a better genetic makeup, or a lower prevalence of risk factors such as tobacco or alcohol. Therefore, it would be reasonable to conclude that the lower incidence of NTD in these women may be due to two different phenomena: to normal values of folic acid, but also to healthier habits and conditions, in which case the relative risk of 3 would be an overestimation of the harmful effect of low folic acid concentrations in pregnant women.

Intuitively, it seems logical that the most perfect procedure to determine the effect of folic acid deficiency would consist of comparing the first 10,000 women, who had insufficient folic acid values, with themselves under the assumption that they had normal folic acid concentrations. In that case, the two groups would differ only in their exposure to folic acid, and the measure of association would really be attributable to folic acid deficiency. However, this comparison group is not possible in practice; it is not feasible, since it goes against the facts (each woman either has or does not have an adequate value of folic acid), and for this reason it has been called a "counterfactual group".

If there is an association between folic acid deficiency and NTDs, we would probably observe fewer than 15 cases in these "counterfactual" women with normal concentrations of folic acid, but more than the hypothetical figure of 5 presented previously; we might obtain, for example, the hypothetical number of 10.


Identification

In general terms, we speak of confounding when there are important differences between the crude estimates of an association and those adjusted for possible confounding factors. These differences can be assessed following various criteria, although there is a certain consensus on the importance of assessing the effect that the adjustment has on the magnitude of the association measures. Thus, a factor can be considered a confounder when adjusting for it is responsible for a change of at least 10% in the magnitude of the difference between the adjusted and crude estimates.

Before carrying out these comparisons, it is necessary to compute the adjusted values. The most classic method to obtain adjusted values of association measures is the one presented previously, which consists of recalculating new estimates within each stratum of the possible confounding variable. It is easy to see that, when we want to assess several confounding factors simultaneously (e.g., age categorized into two groups, sex, and intake of a given food as a dichotomous variable), we quickly run out of observations within the strata to reach a valid estimate in each one (we would obtain 8 strata with the variables just listed).

A more efficient option to account for the confounding role of several variables simultaneously is multivariate analysis. Multivariate analysis is a procedure carried out more or less automatically by statistical programs, which consists of obtaining, from an initially large number of variables, the subset of variables (called independent variables, covariates or predictor variables) most strongly associated with the outcome of interest (dependent variable). This set of variables constitutes what we call the "multivariate statistical model." From this model we obtain the measures of association of the different variables that make it up, with the additional advantage that each of these estimates is adjusted for the other variables in the model.

Depending on the scale of the variable that quantifies the outcome, different types of multivariate models are used: multiple linear regression (quantitative outcome), logistic regression (dichotomous outcome), Cox regression (survival outcome), or Poisson regression (outcomes expressed as rates).

The main advantage of multivariate analysis over stratified analysis is that multivariate models are more efficient. That is, given the same sample size used, more precise estimates are obtained and with a greater number of variables than would be admissible in a stratified analysis.

When estimating these models to identify confounding variables, it is recommended to choose any variable that, while meeting the general criteria for confounding variables (criteria summarized in acyclic diagrams), is responsible for a change of more than 10% between the crude measure of association (without that variable in the model) and the adjusted measure (with that variable included in the model), and that presents a conservative level of significance (p value), approximately less than 0.20.
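As a rough illustration of this crude-versus-adjusted comparison, here is a minimal sketch using simulated data and the statsmodels formula API. The variable names (ntd, folic_deficiency, smoking) and the simulated effect sizes are assumptions made up for the example, not values from any real study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: a binary outcome, a binary exposure, and a candidate confounder
# that is linked to both (all effect sizes are invented for illustration).
rng = np.random.default_rng(0)
n = 5_000
smoking = rng.integers(0, 2, n)
folic_deficiency = rng.binomial(1, 0.2 + 0.2 * smoking)
ntd = rng.binomial(1, 0.01 + 0.01 * folic_deficiency + 0.02 * smoking)
df = pd.DataFrame({"ntd": ntd, "folic_deficiency": folic_deficiency, "smoking": smoking})

# Crude model (exposure only) versus adjusted model (exposure + candidate confounder).
crude = smf.logit("ntd ~ folic_deficiency", data=df).fit(disp=0)
adjusted = smf.logit("ntd ~ folic_deficiency + smoking", data=df).fit(disp=0)

or_crude = np.exp(crude.params["folic_deficiency"])
or_adjusted = np.exp(adjusted.params["folic_deficiency"])
change_pct = abs(or_crude - or_adjusted) / or_adjusted * 100

print(f"Crude OR = {or_crude:.2f}, adjusted OR = {or_adjusted:.2f}, change = {change_pct:.1f}%")
# Following the 10% rule of thumb described above, 'smoking' would be flagged
# as a confounder when change_pct exceeds 10.
```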


Because confounding factors introduce, by definition, a bias in the measures of association, it is evident that an attempt should be made to prevent and control this effect before presenting the definitive results of an investigation. Confounding factors can be prevented in the design phase or eliminated in the analysis phase of an epidemiological study.

Why Are Confounding Variables Important?

A quantitative study can be an investment of significant time and money. You must have confidence that your study will be reliable, and its results will be valid. Studies should be constructed such that they could be repeated, with an expectation of the same result.

When a research study has low bias and a high level of repeatability and control, it has high internal validity — in other words, a study is internally valid if it does not bias a participant towards any specific answer or action.

If your research study has significant confounding variables, then the conclusions from that study may be wrong. Making decisions based on these misguided conclusions can result in significant loss of time and money for organizations.

Best Practices for Avoiding Confounding Variables

  • Use within-subjects study designs when possible. Counterbalance or randomize the order in which participants are exposed to the different conditions in your study. For example, if participants are testing two designs, randomly decide which one each participant sees first. Within-subjects designs reduce sources of error and naturally counterbalance experimental conditions.
  • Randomly assign condition groups for between-subjects study designs. For example, randomly decide which design is seen by each participant (a sketch of both approaches follows this list).
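Here is a minimal sketch of both practices in Python; the participant IDs and the two designs "A" and "B" are made up for illustration:

```python
import random

participants = [f"P{i:02d}" for i in range(1, 11)]  # hypothetical participant IDs
designs = ["A", "B"]                                 # two designs under test

# Within-subjects: randomize the order in which each participant sees both designs.
within_orders = {p: random.sample(designs, k=len(designs)) for p in participants}

# Between-subjects: shuffle participants, then split them evenly across conditions.
random.shuffle(participants)
between_groups = {p: designs[i % len(designs)] for i, p in enumerate(participants)}

print(within_orders)
print(between_groups)
```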

What are the best steps to take to ensure a representative sample?


What is a representative sample?

A representative sample is a sample of appropriate size, selected by random procedures, in which the observed characteristics correspond to those of the population from which it was drawn (Ras, 1980; Cochran, 1976; Scheaffer, Mendenhall and Ott, 1987). It is never possible to be certain of the degree of representativeness; rather, there is a reasonable probability of it.

Representativeness

Representativeness is a function of several factors: it depends not only on the randomness and size of the sample, but also on the sampling design, which is particular to each case, on the use of key auxiliary information, and on a useful and up-to-date sampling frame. The term representative is used as long as the sample faithfully reflects the variable under study, which has a probabilistic distribution in the population; the frequency distribution in the sample must mirror, or be very similar to, that of the population.

This highlights how complex the selection of a representative sample is. To do this, the following must be taken into account: the way the sample is selected, the estimators to be used and their precision, and the determination of the sample size, which takes into account the accuracy (or allowed margin of error), the level of confidence of the estimate, and the variability of the variable on which the probabilistic inference is to be carried out.

Likewise, attention must be paid to the available sampling frame and to the set of key auxiliary variables or covariates that are correlated with the variables of interest. These allow the sampling design to be improved through the formation of strata, the selection of direct estimators, such as the Horvitz-Thompson estimator, and indirect ones (ratio, regression and difference), the choice of a sample size appropriate to a given precision, the selection of samples with probabilities proportional to a measure of size (PPS), and the use of calibrated estimates, in which the sampling weights are adjusted for non-response and for the auxiliary information available, especially in complex samples.
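As a small illustration of the Horvitz-Thompson estimator mentioned above, here is a minimal Python sketch; the sample values and inclusion probabilities are invented for the example:

```python
# Horvitz-Thompson estimator of a population total: sum of y_i / pi_i over the
# sample, where pi_i is each sampled unit's known, non-zero inclusion probability.
# The numbers below are purely illustrative.
sample_values = [120.0, 95.0, 210.0, 64.0]   # y_i for the sampled units
inclusion_probs = [0.05, 0.05, 0.02, 0.10]   # pi_i fixed by the sampling design

ht_total = sum(y / pi for y, pi in zip(sample_values, inclusion_probs))
print(f"Horvitz-Thompson estimate of the population total: {ht_total:,.0f}")
```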

Many times, when designing a probabilistic sample, concessions must be made, especially if the statistical population is asymmetric; there are even times when elements with a probability of one (1) of belonging to the sample are used, and, if this is not done, the sample will not be representative enough.

A probabilistic sample approaches a greater degree of what is called representativeness when the distance between the sample estimate and the value of the population parameter becomes smaller; this is known as accuracy in statistical inference.

We can be in the presence of a sufficiently representative sample when the selection process assigns each element an inclusion probability in advance, when this probability is different from zero and known (though not necessarily equal for every element of the population), and, furthermore, when the sampling error is low, accuracy exists, and a random process is used in the selection.

The best way to define a sufficiently representative sample is one in which a probabilistic sampling strategy is used that allows the value of the parameter to be estimated with accuracy, minimum bias, and the minimum standard error of the estimator of that parameter, or the minimum estimation error, which is a multiple of that standard error.


Importance of having a representative sample

Representative samples yield results, knowledge, and observations that can be relied upon as representative of the broader population being studied. Therefore, representative sampling is usually the best method for market research.

If we do not have representation, we will surely have data that will be of no use to us. Therefore, it is important that we guarantee that the characteristics that matter to us and need to be investigated are found in the sample that is going to be the object of study.

Let's take into account that we will always be prone to sampling bias, because there will always be people who do not answer the survey because they are busy, or who answer it incompletely, meaning we will not be able to obtain all the data we require.

Regarding the  size of the sample , the larger it is, the more likely it is to be representative of the population. 

A representative sample gives us greater certainty that the people included are the ones we need, and it also reduces possible bias. Therefore, if we want to avoid inaccuracy in our surveys, we must have a representative and balanced sample.


How to obtain a representative sample?

There are established sampling methods for obtaining a representative sample that have been tested and verified over time through academic, scientific and market research.

The  most common types of sampling  are probability or random sampling and non-probability sampling.

Probability sampling

If we are going to use probabilistic or random sampling, we must make sure we have up-to-date information on the population from which we will draw the sample, and survey the majority of it to ensure representativeness.

The sample will be chosen at random, which guarantees that each member of the population will have the same probability of selection and inclusion in the sample group.
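A minimal sketch of simple random sampling in Python, assuming we already have a complete and up-to-date sampling frame (the member names here are placeholders):

```python
import random

# Hypothetical sampling frame: an up-to-date list of the population of interest.
population = [f"member_{i}" for i in range(1, 10_001)]

# Simple random sampling without replacement: every member has the same
# probability (n / N) of being selected.
sample = random.sample(population, k=400)
print(len(sample), sample[:5])
```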

Non-probability sampling

In  non-probabilistic sampling,  the aim is to have different types of people to ensure a more balanced representative sample. 

Knowing the demographic characteristics of our group will undoubtedly help to limit the profile of the desired sample and define the variables that interest us, such as gender, age, place of residence, etc. 

By knowing these criteria, before obtaining the information, we can have the control to create a representative sample that is useful to us.

We must avoid having a sample that does not reflect the target population; the ideal is to have data that is as accurate as possible for the success of our project.


Avoid making sampling errors

When a sample is not representative, we will have a sampling error. If we want a representative sample of 100 employees, for example, we should choose a similar number of men and women; a sample biased towards one gender introduces an error into the sample.
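A minimal sketch of drawing such a gender-balanced sample with pandas; the roster and column names are hypothetical:

```python
import pandas as pd

# Hypothetical employee roster with a gender column.
employees = pd.DataFrame({
    "employee_id": range(1, 1001),
    "gender": ["F", "M"] * 500,
})

# Draw 50 people from each gender so the 100-person sample is balanced.
balanced_sample = employees.groupby("gender").sample(n=50, random_state=42)
print(balanced_sample["gender"].value_counts())
```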

Sample size is very important, but it does not guarantee that the population we need is accurately represented. More than size, representativeness is related to the sampling frame, that is, to the list from which the people who will take part in a survey, for example, are selected.

Therefore, we must ensure that people from our  target audience  are included in that list to say that it is a representative sample.

 

Are you using the best primary and secondary data, and why?


Primary and secondary data

To analyze the similarities and differences between primary and secondary data sources, it helps to know what a market study entails. As a Statista report explains, "a market study is an important business strategy that requires the collection of information about a target market for the company."

That is, before launching a new line of products or services, growing your team or establishing a social media campaign, you will most likely need to conduct market research. Collecting primary and secondary data ensures that you can make informed decisions to save time and money.

On the one hand,  primary data is the information collected directly from the source of interest; in this case, the potential client . Normally, when a company needs primary marketing data, it does so to determine the viability of a product or service and analyze the buyer persona, market offers, investment risk, among other factors. Primary data collection methods include customer interviews, surveys,  focus groups , etc.

On the other hand, unlike primary data,  secondary data is based on information researched by other companies, institutions or platforms . Secondary data sources are usually public. Some of these include newspapers, government websites, media agencies, etc. In order for it to serve the company’s purposes, the collection of secondary data has to go through an analysis by the marketing team that selects the most convenient information.

Primary data types

To begin to understand the characteristics of primary and secondary data, we must start from their types. That way,  data analysis in marketing  makes sense and can help you plan market research. In the case of primary data, there are two types that you must take into account.

Quantitative primary data

As you might guess from the name, the importance of quantitative primary data lies in the quantities and numbers. Those in charge of conducting primary data research focus on the mathematical data and not on the subjective signals of people’s behaviors or opinions .

The information from primary data allows us to understand a problem in the market and analyze its implication for the consumer. By obtaining this numerical information, a company’s marketing team and managers can make decisions based on clear and objective realities.


Qualitative primary data

Numbers or statistics are not necessary here. On the contrary, qualitative primary data offers information about the behaviors and emotions of potential customers of a product or service, gathered from audio, text, video or other formats for collecting opinions.

It is common for this type of data to be obtained from direct conversations with people of interest. Interviews are usually conducted with open-ended questions and other data collection techniques.

Primary data sources

In the process of obtaining primary and secondary data, locating the sources is the most important step. If you’re part of a marketing team with research underway, here’s a list of primary data sources to consider.

  • Surveys:  serve to gather direct information on the problems, needs and preferences of the potential client. In this way, survey information makes it possible to predict consumer behavior during the sales phases.
  • Questionnaires:  this format includes open and closed questions to determine the assessment that a potential client has of a brand, company or product. They can be made in different formats, such as phone calls, sending SMS, emails, etc.
  • Interviews:  These are conversations conducted by a marketing or market data research specialist. They tend to be long and deep conversations with people interested in a brand and product. Given their character, they offer quality information from verbal responses and even body language.
  • Website:  the web analytics of a company's website are also a primary data source. From the SEO and user-experience analysis carried out by the corresponding teams, valuable information can be obtained for the content marketing strategy, the creation of new products and the customer profile.
  • Social networks:  brand accounts on Instagram, Facebook, Twitter and other social networks are sources of a lot of information for any marketing team. In them, existing customers and interested parties present their opinions, objections and ideas, which can be used to make decisions about advertising campaigns and organic content. That is, primary data in social media marketing has a lot of relevance for a company.


Secondary data types

If you are wondering how to do a complete market study, you also need to take into account the types of secondary data. These include information found in the company's internal files and information that comes from outside.

Internal secondary data

Although this concept seems to reflect a similarity between primary and secondary data, the truth is that internal secondary data is information that is not collected directly from the ideal client or the specific audience at their current moment. These data are obtained from the internal and archived records of a company, such as reports from past advertising campaigns, accounts of former clients, accounting data, etc.


External secondary data

This information is what is commonly found outside of company records and files. It has been prepared and published by external sources for objectives that do not relate directly to commercial research objectives. That is, as we mentioned before, secondary data comes from competitor studies, publications of public organizations, etc.

Secondary data sources

Just as with primary data sources, secondary data also exists in different formats. The nature of each of these sources and the information collected must be analyzed to select what is useful and what is not.

  • Government studies and reports:  many governments around the world carry out in-depth studies on the behavior of their populations, demographics and the sociocultural changes that occur in them. This offers a lot of valuable secondary data for any business’s marketing research.
  • Private market research:  using the power of  data science , there are many companies that are dedicated to conducting user and consumer behavior studies. These usually sell this information to other companies that can leverage the value of this secondary data for their objectives.
  • Sales data:  the data obtained from sales processes is very important for any commercial organization. You can find useful secondary data in invoices, returned products, order documents, and delivery experience. 
  • Competitor platforms:  in a secondary data investigation you cannot forget the information found on competitor pages, applications and reports. In them, you can locate information about potential customers, unresolved pain points, and gaps in your value proposition.
  • Web search engines:  Search engines like Google also offer invaluable data. Not only because they list the competition’s organic results for their websites, but because they allow you to see an overview of the advertising spending that these other companies make in search of the same type of client.



How to best handle outliers or anomalous data points?


DATA

Data is a collection of facts, figures, objects, symbols, and events gathered from different sources. Organizations collect data with various data collection methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various points in time.

Outlier detection in the fields of data mining (DM) and knowledge discovery from data (KDD) is of great interest in areas that require decision support systems, such as finance, where DM can detect fraud or find errors produced by users. It is therefore essential to evaluate the veracity of the information through methods for detecting unusual behavior in the data.

This article proposes a method to detect values that are considered outliers in a database of nominal data. The method implements a global k-nearest-neighbors algorithm, a clustering algorithm called k-means, and a statistical method, chi-square. These techniques were applied to a database of clients who have requested financial credit. The experiment was performed on a data set with 1,180 tuples, into which outliers were deliberately introduced. The results showed that the proposed method is capable of detecting all the introduced outliers.

Detecting outliers represents a challenge for data mining techniques. Outliers, also called anomalous values, have different properties from the general data: due to the nature of their values, and therefore their behavior, they do not behave like the majority of the data. Anomalous values are also susceptible to being introduced by malicious mechanisms (Atkinson, 1981).

Mandhare and Idate (2017)  consider this type of data to be a threat and define it as irrelevant or malicious. Additionally, this data creates conflicts during the analysis process, resulting in unreliable and inconsistent information. However, although anomalous data are irrelevant for finding patterns in everyday data, they are useful as an object of study in cases where, through them, it is possible to identify problems such as financial fraud through an uncontrolled process.

What is anomaly detection?

Anomaly detection examines specific data points and detects unusual occurrences that appear suspicious because they are different from established patterns of behavior. Anomaly detection is not new, but as the volume of data increases, manual tracking is no longer practical.

Why is anomaly detection important?

Anomaly detection is especially important in industries such as finance, retail, and cybersecurity, but all businesses should consider implementing an anomaly detection solution. Such a solution provides an automated means to detect harmful outliers and protects data. For example, banking is a sector that benefits from anomaly detection. Thanks to it, banks can identify fraudulent activity and inconsistent patterns, and protect data. 

Data is the lifeline of your business, and compromising it can put your operation at risk. Without anomaly detection, you could lose revenue and brand value that took years to cultivate. Your company could face security breaches and the loss of confidential customer information, and with it a level of customer trust that may be irretrievable.


The detection process using data mining techniques facilitates the search for anomalous values (Arce, Lima, Orellana, Ortega and Sellers, 2018). Several studies show that most of this type of data also originates from domains such as credit cards (Bansal, Gaur, & Singh, 2016), security systems (Khan, Pradhan, & Fatima, 2017), and electronic health information (Zhang & Wang, 2018).

The detection process relies on data mining tools based on unsupervised algorithms (Onan, 2017) and follows two approaches: local and global (Monamo, Marivate, & Twala, 2017). Global approaches include techniques in which each anomaly is assigned a score relative to the global data set. Local approaches, on the other hand, score anomalies with respect to their direct neighborhood, that is, the data points that are close in terms of the similarity of their characteristics.

According to the aforementioned concepts, the local approach detects outliers that are missed when a global approach is used, especially those in regions of variable density (Amer and Goldstein, 2012). Examples of such algorithms are those based on i) clustering and ii) nearest neighbors. Algorithms in the first category consider outliers to lie in sparse neighborhoods, far from their nearest neighbors, while those in the second category operate on grouped data (Onan, 2017).

There are several approaches related to the detection of outliers. In this context, Hassanat, Abbadi, Altarawneh, and Alhasanat (2015) carried out a survey summarizing the different outlier detection studies under three headings: the statistics-based approach, the distance-based approach, and the density-based approach. The authors present a discussion of outliers and conclude that the k-means algorithm is the most popular for clustering a data set.
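As a rough sketch of clustering-based outlier scoring with k-means, here is a small example on synthetic numeric data. The study described above works on encoded nominal credit data, so the data, cluster count and threshold here are assumptions for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic numeric data with a few injected outliers (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)), rng.normal(8, 1, size=(5, 2))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Score each point by its distance to the nearest cluster centre; the largest
# distances are the most likely outliers.
distances = kmeans.transform(X).min(axis=1)
outlier_idx = np.argsort(distances)[-5:]
print("Suspected outliers at rows:", sorted(outlier_idx.tolist()))
```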

Furthermore, other studies (Dang et al., 2015; Ganji, 2012; Gu et al., 2017; Malini and Pushpa, 2017; Mandhare and Idate, 2017; Sumaiya Thaseen and Aswani Kumar, 2017; Yan et al., 2016) use data mining techniques, statistical methods, or both. For outlier detection, nearest neighbor (KNN) techniques have commonly been applied alongside others to find unusual patterns in the behavior of the data or to improve process performance. One 2017 study presents an efficient grid-based method for finding outlier data patterns in large data sets.

Similarly, Yan et al. (2016) propose an outlier detection method with KNN and data pruning, which takes successive samples of tuples and columns, and applies a KNN algorithm to reduce dimensionality without losing relevant information.

Classification of significant columns

To classify the significant columns, the chi-square statistic was used. Chi-square is a non-parametric test used to determine whether a distribution of observed frequencies differs from the expected theoretical frequencies (Gol and Abur, 2015). The weight of each input column (the columns that determine the customer profile) is calculated in relation to the output column (credit amount). The higher the weight of an input column on a scale of zero to one, the more relevant it is considered.

That is, the closer the weight value is to one, the more important the relationship with the output column. The statistic can only be applied to nominal columns and was selected as the method for defining relevance. Chi-square reports a level of significance for the associations or dependencies and was used as a hypothesis test on the weight, or importance, of each column with respect to the output column S. The resulting value is stored in a column called weight, which, together with the anomaly score, is reported at the end of the process.
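A minimal sketch of scoring a nominal input column against the output column with chi-square, using scipy; the column names and categories are invented for the example:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical nominal data: an input column (customer profile attribute)
# and the output column S (credit amount category).
df = pd.DataFrame({
    "employment_type": ["salaried", "self-employed", "salaried", "unemployed"] * 50,
    "credit_amount":   ["low", "high", "high", "low"] * 50,
})

# Chi-square test of independence between the input column and the output column.
contingency = pd.crosstab(df["employment_type"], df["credit_amount"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
# A small p-value (a strong association) marks the column as relevant / higher weight.
```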

Nearest neighbor anomaly scoring

To obtain the values suspected of being anomalous, the k-NN global anomaly score is used. It is based on the k-nearest-neighbors algorithm, which calculates the anomaly score of each data point relative to its neighborhood. Usually, outliers are far from their neighbors or their neighborhood is sparse. The first case is known as global anomaly detection and is identified with KNN; the second refers to an approach based on local density.

The score is computed by default as the average of the distances to the nearest neighbors (Amer and Goldstein, 2012). In k-nearest-neighbor classification, the output column S of the nearest neighbor in the training dataset is assigned to a new, unclassified data point in the prediction, which implies a piecewise linear decision boundary.


To obtain a correct prediction, the value k (the number of neighbors considered around the analyzed value) must be carefully configured. A high value of k yields a poor prediction, while low values tend to generate noise (Bhattacharyya, Jha, Tharakunnel, & Westland, 2011).

Frequently, the parameter k is chosen empirically and depends on each problem. Hassanat, Abbadi, Altarawneh, and Alhasanat (2014) propose testing different numbers of neighbors until reaching the one with the best precision. Their proposal starts with values from k = 1 up to k equal to the square root of the number of tuples in the training dataset. The general rule of thumb is to set k to the square root of the number of tuples in dataset D.
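Here is a minimal sketch of a global k-NN anomaly score with scikit-learn, using the k close to the square root of n rule of thumb described above; the synthetic two-dimensional data stands in for an encoded dataset and is purely illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic numeric data standing in for the encoded dataset (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(196, 2)), rng.normal(6, 0.5, size=(4, 2))])

# Rule of thumb from the text: k close to the square root of the number of tuples.
k = int(np.sqrt(len(X)))

# Global k-NN anomaly score: average distance to the k nearest neighbours
# (the query point itself is returned as the first neighbour, so ask for k + 1).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
scores = distances[:, 1:].mean(axis=1)

top_outliers = np.argsort(scores)[-4:]
print("Highest anomaly scores at rows:", sorted(top_outliers.tolist()))
```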

HOW CAN WE SOLVE THE PROBLEM OF ATYPICAL DATA?

If we have confirmed that these outliers are not due to an error in constructing the database or in measuring the variable, eliminating them is not the solution. Removing or replacing them can distort the inferences made from the data, because doing so introduces bias, reduces the sample size, and can affect both the distribution and the variances.

Furthermore,  the treasure of our research lies in the variability of the data!

That is, variability (differences in the behavior of a phenomenon) must be explained, not eliminated. And if you still can’t explain it, you should at least be able to reduce the influence of these outliers on your data.

The best option is to down-weight these atypical observations using robust techniques.

Robust statistical methods are modern techniques that address these problems. They are similar to the classic ones but are less affected by the presence of outliers or small variations with respect to the models’ hypotheses.

ALTERNATIVES TO THE MEAN

If we calculate the median (the central value of an ordered sample) for the second data set, we get a value of 14 (the same as for the first data set). This centrality statistic has not been disturbed by the presence of an extreme value and is therefore more robust.

Let’s look at other alternatives…

The trimmed mean "discards" extreme values. That is, it removes a fraction of the extreme data from the analysis (e.g. 20%) and calculates the mean of the remaining data. The trimmed mean in our case would be 13.67.

The winsorized mean progressively replaces a percentage of the extreme values (e.g. 20%) with less extreme ones. In our case, the winsorized mean of the second sample would be 13.62.

We see that all of these robust estimates better represent the sample and are less affected by extreme data.
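The article's example data set is not reproduced here, so the following sketch applies the same robust estimators to made-up numbers using scipy:

```python
import numpy as np
from scipy.stats import trim_mean
from scipy.stats.mstats import winsorize

# Made-up small sample with one extreme value (the article's data is not shown).
data = np.array([11, 12, 13, 13, 14, 14, 15, 15, 16, 48])

print("Mean:           ", data.mean())          # pulled up by the outlier
print("Median:         ", np.median(data))      # robust centre
print("Trimmed mean:   ", trim_mean(data, 0.2)) # drops 20% from each tail
print("Winsorized mean:", winsorize(data, limits=[0.2, 0.2]).mean())  # caps extremes
```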

What role does technology play in your best data collection process?


Data Collection

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Data collection is the process of gathering data for use in business decision-making, strategic planning, research and other purposes. It’s a crucial part of data analytics applications and research projects: Effective data collection provides the information that’s needed to answer questions, analyze business performance or other outcomes, and predict future trends, actions and scenarios.

In businesses, data collection happens on multiple levels. IT systems regularly collect data on customers, employees, sales and other aspects of business operations when transactions are processed and data is entered. Companies also conduct surveys and track social media to get feedback from customers. Data scientists, other analysts and business users then collect relevant data to analyze from internal systems, plus external data sources if needed. The latter task is the first step in data preparation, which involves gathering data and preparing it for use in business intelligence (BI) and analytics applications.

For research in science, medicine, higher education and other fields, data collection is often a more specialized process, in which researchers create and implement measures to collect specific sets of data. In both the business and research contexts, though, the collected data must be accurate to ensure that analytics findings and research results are valid.

What are information technologies?

Information technology is   a process that uses a combination of means and methods of collecting, processing and transmitting data to obtain new quality information about the state of an object, process or phenomenon. The purpose of information technology is the  production of information  for analysis by people and making decisions based on it to perform an action.

Information technologies (IT)

The introduction of a personal computer in the information sphere and the application of telecommunications media have determined a new stage in the development of  information technology . Modern IT is an information technology with a “friendly” user interface using personal computers and telecommunication facilities. The new information technology is based on the following basic principles.


  1. Interactive (dialogue) mode of working with a computer.
  2. Integration with other software products.
  3. Flexibility in the process of changing data and task definitions.

As a set of  information technology tools , many types of computer programs are used: word processors, publishing systems, spreadsheets, database management systems, electronic calendars, functional purpose information systems.

Characteristics of information technologies:

  • User operation in  data manipulation mode  (without programming). The user does not have to know and remember commands, but rather sees (output devices) and acts (input devices).
  • Transversal information support  at all stages of information transmission is supported by an integrated database, which provides a unique way to enter, search, display, update and  protect information .
  • Paperless document processing , in which only the final version of the document is recorded on paper; intermediate versions and the necessary data are delivered to the user on the PC display screen.
  •  Interactive (dialogue) task solution mode  with a wide range of possibilities for the user.
  • Collective production of a document  on the basis of a group of computers linked by means of communication.
  • Adaptive processing  of the form and modes of presentation of information in the problem-solving process.

Types of information technologies

The main  types of information technology  include the following.

  • Information technology for data processing is   designed to solve well-structured problems, whose solution algorithms are well known and for which all necessary input data exist. This technology is applied to the performance level of low-skilled personnel in order to automate some routine and constantly repeated operations of administrative work.
  • Management information technology is   intended for the information service of all company employees, related to the acceptance of administrative decisions. In this case, the information is usually in the form of ordinary or special management reports and contains information about the past, present and possible future of the company.
  • Automated office information technology is   designed to complement the company’s existing staff communication system. Office automation assumes the organization and support of communication processes both within the company and with the external environment on the basis of computer networks and other modern means of transferring and working with information.
  • Information technology for decision support is   designed to develop a management decision that occurs as a result of an iterative process involving a decision support system (a computer link and the object of management) and a person (the management link, which sets input data and evaluates the result).
  • Expert systems information technology is   based on the use of  artificial intelligence . Expert systems allow managers to receive expert advice on any problem about which knowledge has been accumulated in these systems.


The use of modern technology is more economical than ever, and electronic tools now offer a cost-effective alternative to paper questionnaires for collecting high-quality data. To help you decide whether computer-assisted personal interviewing (CAPI) is for you, this blog reviews the potential benefits and challenges of using CAPI and shares a recent survey experience in Guyana, in which the free Survey Solutions software was used.

Paper questionnaires: the traditional way to collect data

Conducting surveys of this magnitude with paper questionnaires can be costly in economic, administrative and logistical terms, while presenting a series of challenges: printing and transporting questionnaires to and from the field is often associated with a high cost, and corrections to questions can represent a significant challenge in terms of cost and time. There is also the real risk that questionnaires will be lost in the field or damaged by weather or transportation before the data is systematized.


Even when all interviews have been conducted, responses must be manually entered into a digital file before the data can be analyzed. This process represents a lot of time and manual work and increases the margin of error. Data quality checks are limited, and errors are sometimes only recognized after the survey has ended, making them more difficult to correct.

However, there is an alternative to paper questionnaires: computer-assisted personal interview (CAPI). In recent years, CAPI has attracted more attention as it presents a more economical way to collect high-quality data.


CAPI: an increasingly popular tool

With the new processing speeds of today’s computers, the increasing global availability of Internet service and the falling prices of mobile devices, CAPI has become increasingly attractive. The CAPI tool creates the questionnaire using special software that can be downloaded directly to a mobile device (usually a smartphone or tablet), which the interviewer uses to administer and fill out the questionnaire. Information from these questionnaires is uploaded to a central server where it can be accessed and reviewed remotely.

Depending on the size of the survey sample, purchasing tablets to complete electronic surveys becomes increasingly affordable compared with printing paper questionnaires. The technical requirements for such devices are relatively low, and a large number of questionnaires can usually be saved on a device without danger of running out of storage. Additionally, once a questionnaire has been entered on a mobile device, it can be modified: if an error is detected in the early stages of the survey, it can easily be corrected without incurring additional printing costs.

How do you best standardize your data collection procedures?




Data Collection

Data collection is the process of collecting and analyzing information on relevant variables in a predetermined, methodical way so that one can respond to specific research questions, test hypotheses, and assess results. Data is a collection of facts, figures, objects, symbols, and events gathered from different sources. Organizations collect data with various data collection methods to make better decisions. Without data, it would be difficult for organizations to make appropriate decisions, so data is collected from different audiences at various points in time.

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, you need to work through each of these considerations in a structured way.

For instance, an organization must collect data on product demand, customer preferences, and competitors before launching a new product. If data is not collected beforehand, the organization's newly launched product may fail for many reasons, such as low demand or an inability to meet customer needs.

Although data is a valuable asset for every organization, it does not serve any purpose until analyzed or processed to get the desired results.

In a society in which information and data are the key to any activity or business, it is very important to know how to standardize data in order to get the most value and performance out of it.

With globalization and the information systems we use in our daily lives, the amount of information and data at our disposal is immense. The problem is knowing how to manage such a quantity of data: how to collect it, how to treat it, how to classify it and how to apply it. In this sense, data standardization can be of great help.

Aware of the importance of knowing how to standardize data, at Ayuware, as experts in big data, we have prepared this post with relevant information about the advantages and benefits this procedure offers, so that you know exactly what it involves.

What is standardizing data?

Data standardization is the data quality process of transforming data to fit a predefined and constrained set of values, relying on the power of uniformity to improve data efficiencies.

Data standardization, sometimes also known as normalization, is the process of adjusting or adapting certain characteristics so that the data conforms to a common type, model or norm, with the aim of making it easier to process, access and use for the people who work with it.

The very concept of open data implies a search for standardization in the use of information in open format so that it can be used and reused by citizens.

Therefore, it could be said that standardization is what allows everyone to compare and consult data and always find the information they need, confident that it will be presented in a uniform way.
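As a simple illustration of fitting data to a predefined, constrained set of values, the following sketch maps free-text country entries onto a small set of canonical codes. The raw values and the canonical set are assumptions chosen purely for the example.

```python
# Illustrative only: a predefined, constrained set of target values.
CANONICAL_COUNTRIES = {
    "india": "IN",
    "in": "IN",
    "republic of india": "IN",
    "united states": "US",
    "usa": "US",
    "u.s.": "US",
}

def standardize_country(raw: str) -> str:
    """Map a free-text country value onto the constrained set of canonical codes."""
    key = raw.strip().lower()
    if key not in CANONICAL_COUNTRIES:
        raise ValueError(f"Unrecognized country value: {raw!r}")
    return CANONICAL_COUNTRIES[key]

print(standardize_country("  USA "))   # -> "US"
print(standardize_country("India"))    # -> "IN"
```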

In statistics, normalization or standardization can have a wide range of meanings. In the simplest cases, standardizing indices means adjusting values measured on different scales so that they share a common scale.

In more complex cases, data normalization or standardization can refer to more sophisticated adjustments whose aim is to bring entire probability distributions of values into alignment.
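For the simpler, numeric case, the sketch below shows two common ways of putting values measured on different scales onto a common scale: z-score standardization and min-max rescaling. The sample values are invented for illustration.

```python
from statistics import mean, stdev

def z_score(values):
    """Standardize values to mean 0 and standard deviation 1 (a common scale)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def min_max(values):
    """Rescale values to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [23, 35, 41, 58, 62]                     # measured in years
incomes = [18000, 25000, 40000, 52000, 90000]   # measured in a currency unit

# After standardization, both variables live on comparable scales.
print(z_score(ages))
print(min_max(incomes))
```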

In short, it could be said that without standardization and, therefore, without uniformity in naming conventions and consistency between macros and parameters, only unreliable results can be obtained.

Importance of data standardization

Before looking at how to standardize data, it is worth understanding why this process matters for using and processing information effectively and functionally.

Data standardization allows us to ensure that we will have useful, easily linkable and usable data at our disposal for any activity we require.

Data standardization not only helps us organize sets of complex information, but also facilitates its analysis, since it breaks down multiple dimensions and transforms the information into actionable insights.

Committing to data standardization is a way to ensure the uniformity needed to keep your analysis effective. As we know, information today is power, and having standardized, optimized information is a great advantage at both a personal and a business level.


In this sense, data standardization has gone from being a good practice or a recommendation to being a necessity for anyone who wants to get the most out of the data and information at their disposal.

Now that we are getting closer to the process of standardizing data, it is important to keep in mind that institutions need to work together to develop standards: a standard only counts as implemented once it is used by a large majority.

Furthermore, data standards and normalization imply good practices such as monitoring, control and ensuring at all times that the information remains useful to anyone who wishes to use it. Thanks to these good practices, the process of standardizing data is effective and delivers all the advantages mentioned above.

To standardize data correctly, you need to pay attention to the planning that precedes data collection, and also to think about how the information will be fed back once the data is published in an open and public manner.

Key moments in data standardization

To understand how to standardize data, it is important to know that there are three key moments in the process, each of which requires attention if the process is to be carried out satisfactorily:

  • Standardization in data capture: For the data standardization procedure to be correct, the data must be collected optimally, following guidelines that will greatly facilitate the subsequent steps.
  • Standardization at the time of data storage: Once the information is collected, it is very important to pay attention to how the data is stored, keeping it in an orderly manner and making it as easy as possible to retrieve and consult later.
  • Standardization in the presentation of data: This is a fundamental moment since, after searching for the data we need, it is very important that it is displayed in a standardized way so that we can make useful, functional use of it (a minimal sketch covering these three moments follows this list). If all the previous steps have been carried out correctly, this last moment should not present any problems.
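The following minimal sketch, with field names and formats assumed purely for illustration, walks through the three moments: standardizing a record at capture, storing it in a single consistent format, and presenting every record in the same layout.

```python
from datetime import datetime
import json

def capture(raw: dict) -> dict:
    """Standardize at capture time: trim text and normalize the date to ISO format."""
    return {
        "name": raw["name"].strip().title(),
        "date": datetime.strptime(raw["date"], "%d/%m/%Y").date().isoformat(),
    }

def store(record: dict, path: str) -> None:
    """Store records in one consistent format: one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def present(path: str) -> None:
    """Present every stored record in the same, predictable layout."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            print(f"{rec['date']}  {rec['name']}")

store(capture({"name": "  ada lovelace ", "date": "10/12/1815"}), "records.jsonl")
present("records.jsonl")
```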
