Data annotation helps machines understand text, video, image or audio data.
One of the distinguishing features of artificial intelligence (AI) and machine learning technology is its ability to learn from every task it performs, whether the outcome is good or bad. It’s this constant evolution that sets AI apart from static, code-dependent software.
It is this ability that makes high-quality labeled data a critical factor in training representative, successful, and bias-free AI models.
Data annotation, or data labeling, is the process of labeling individual elements of training data, whether text, video, or images, to help a machine understand exactly what’s in that data. This labeled data is then applied during model training.
Data annotation also plays a role in the larger quality control process of data collection, as well-labeled datasets become ground-truth datasets: data that are considered the gold standard and used to measure the quality of other datasets.
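As a rough illustration of how a ground-truth dataset can be used to measure the quality of other labels, the sketch below compares a new batch of labels against gold labels and reports the share that agree. The item IDs, the labels, and the `label_accuracy` helper are hypothetical and exist only for this example.

```python
def label_accuracy(gold, candidate):
    """Fraction of shared items whose candidate label matches the gold label."""
    shared = gold.keys() & candidate.keys()
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(1 for item_id in shared if gold[item_id] == candidate[item_id])
    return matches / len(shared)


# Hypothetical ground-truth labels and labels from a new annotation batch.
gold_labels = {
    "img_001": "mountain",
    "img_002": "beach",
    "img_003": "city",
    "img_004": "mountain",
}
new_labels = {
    "img_001": "mountain",
    "img_002": "city",  # disagrees with the ground truth
    "img_003": "city",
    "img_004": "mountain",
}

accuracy = label_accuracy(gold_labels, new_labels)
print(f"Agreement with ground truth: {accuracy:.0%}")
```

Simple agreement is only one possible quality measure; real projects may prefer per-class metrics or chance-corrected agreement scores.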
Teaching through data
Of course, learning from labeled examples is a simplified picture of how AI learns. In practice, machine learning algorithms require large amounts of correctly labeled data to learn how to perform tasks, which can be a challenge. Companies must have the resources and time to collect and label data for their specific use cases, sometimes in arcane language or in unique and highly technical domains.
The following details the different types of data labeling, how the labeled data is used, and why humans will continue to be an integral part of the data labeling process in the future.
The importance of data labeling
Data annotation is especially important when considering the amount of unstructured data in the form of text, images, video, and audio that exists in the digital world. By most estimates, unstructured data accounts for 80-90% of all data.
Currently, most models are trained through supervised learning, which relies on data that humans have labeled well to create training examples.
Since data comes in many different forms, there are several types of data annotation for text-, image-, or video-based datasets. Below is a breakdown of these three data labeling types.
Written Language: Text Annotation
There is a vast amount of information in any given text dataset. Text annotations are used to segment data in a way that helps machines recognize individual elements within it. Types of text annotations include:
Named Entity Tags: Single and Multiple Entities
Named entity tagging, or named entity recognition, helps identify individual entities within a block of text, such as “person”, “sport”, or “country”.
This type of data labeling creates entity definitions so that ultimately a machine learning algorithm can reliably recognize that “St. Louis” is a city, “St. Patrick” is a historical figure, and “St. Lucia” is a tropical island in the Caribbean.
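One common way to store named entity tags is as character-offset spans over the original text. The sketch below is a minimal illustration under that assumption; the `EntitySpan` class, the `annotate` helper, the example sentence, and the label names are all hypothetical rather than part of any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int  # index of the first character of the entity
    end: int    # index one past the last character of the entity
    label: str  # entity type chosen by the annotator


def annotate(text, phrase, label):
    """Build a span for the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return EntitySpan(start, start + len(phrase), label)


text = "St. Patrick never visited St. Louis or St. Lucia."
spans = [
    annotate(text, "St. Patrick", "PERSON"),
    annotate(text, "St. Louis", "CITY"),
    annotate(text, "St. Lucia", "ISLAND"),
]

for span in spans:
    print(text[span.start:span.end], "->", span.label)
```

Storing offsets rather than copies of the phrase keeps each label tied to an exact position, which matters when the same string appears more than once in a document.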
Humans use language to express thought in unique and varied ways—sentences or phrases can’t always be taken at face value. It is necessary to read between the lines or consider the context to understand the sentiment behind a phrase, which is why sentiment labels are crucial for letting a machine decide whether a selected text is positive, negative or neutral.
In many cases, the sentiment of a sentence is clear. For example, “Super helpful experience with the customer support team!” is clearly positive. However, it can be more difficult to discern the true meaning when the intent is less than straightforward, or when sarcasm or other ambiguity is used. For example: “This place has great reviews, but I can’t say I agree!” This is where human annotation adds real value.
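To make the idea of sentiment labels concrete, the sketch below combines labels from several human annotators into a single label per sentence. Majority voting is an assumption made for this illustration, not a method described above, and the votes and the `majority_label` helper are hypothetical.

```python
from collections import Counter

SENTIMENTS = {"positive", "negative", "neutral"}


def majority_label(votes):
    """Return the most common label, or None when there is no clear winner."""
    unknown = set(votes) - SENTIMENTS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: send the item back for another review
    return ranked[0][0]


# Hypothetical votes from three annotators per sentence.
annotations = {
    "Super helpful experience with the customer support team!":
        ["positive", "positive", "positive"],
    "This place has great reviews, but I can't say I agree!":
        ["negative", "negative", "neutral"],
}

for sentence, votes in annotations.items():
    print(majority_label(votes), "<-", sentence)
```

Returning `None` on a tie flags ambiguous items for a further human pass instead of forcing a label.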
The intent or meaning of a word can vary widely depending on the context and the domain in which it is used. The domain-specific jargon used in technical conversations in the financial industry is very different from the slang used between two friends on social media. Semantic annotations provide the extra context machines need to truly understand the intent behind text.
Image annotation helps machines understand which elements are present in an image. This can be done by using image bounding boxes, where elements of the image are labeled with basic bounding boxes, or by more advanced object labeling.
Annotations in images can range from simple classifications (such as labeling the gender of a person in an image) to more complex details (such as labeling whether the weather in a scene is rainy or sunny).
Image classification is another way to label images, based on single or multi-level categories. An example in this case would be an image of a mountain classified under the “mountain” category.
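The sketch below shows one way an image annotation record might combine image-level categories with bounding boxes for individual elements. The `BoundingBox` class, the file name, the pixel coordinates, and the labels are hypothetical, and the corner-based `(x_min, y_min, x_max, y_max)` layout is only one of several conventions in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def __post_init__(self):
        # Reject boxes that are empty or have swapped corners.
        if self.x_min >= self.x_max or self.y_min >= self.y_max:
            raise ValueError("box must have positive width and height")

    @property
    def area(self):
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


# Hypothetical annotation for a single image, in pixel coordinates.
image_annotation = {
    "image": "hike.jpg",
    "categories": ["mountain"],  # image-level classification
    "boxes": [                   # element-level bounding boxes
        BoundingBox("person", 40, 120, 90, 260),
        BoundingBox("tent", 200, 180, 320, 250),
    ],
}

for box in image_annotation["boxes"]:
    print(f"{box.label}: area {box.area} px^2")
```

Validating each box when it is created catches a frequent class of labeling mistakes before the data reaches model training.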
Motion Detected: Video Annotation
Video annotation works similarly to image annotation. Using bounding boxes and other annotation methods, individual elements within a video frame are identified, classified, and even tracked across multiple frames. Examples include labeling all humans in CCTV footage as “customers” or helping self-driving cars identify objects on the road.
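Tracking an element across frames is often represented by giving each object a persistent ID that appears in every frame where the object is visible. The sketch below assumes that representation; the record layout, the track IDs, the coordinates, and the `group_by_track` helper are hypothetical.

```python
from collections import defaultdict

# Each record: (frame_index, track_id, label, (x_min, y_min, x_max, y_max)).
frame_annotations = [
    (0, 1, "customer", (10, 20, 50, 120)),
    (0, 2, "customer", (200, 30, 240, 130)),
    (1, 1, "customer", (14, 20, 54, 120)),
    (2, 1, "customer", (18, 21, 58, 121)),
    (2, 2, "customer", (196, 30, 236, 130)),
]


def group_by_track(annotations):
    """Collect the boxes of each tracked object, ordered by frame."""
    tracks = defaultdict(list)
    for frame, track_id, _label, box in sorted(annotations):
        tracks[track_id].append((frame, box))
    return dict(tracks)


tracks = group_by_track(frame_annotations)
for track_id, sightings in tracks.items():
    frames = [frame for frame, _ in sightings]
    print(f"track {track_id}: seen in frames {frames}")
```

In this made-up data, track 2 is missing from frame 1, the kind of gap an annotator would review to decide whether the person was occluded or simply missed.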
Man and machine
While some data labeling can now be automated, the human-in-the-loop paradigm of data labeling remains the default, and humans play an integral role in ensuring data is labeled correctly. Humans can provide context and a deeper understanding of intent, adding overall value to annotations.
In-house vs. Outsourced
Data labeling is essential, but it is also resource- and time-intensive. According to one report, data preparation and engineering tasks account for more than 80% of the time spent on most machine learning projects. Organizations often face a decision: perform data annotation in-house or outsource it?
Performing data labeling internally has some advantages. First, you retain control and visibility over the data collection process. Second, for very niche or technical models, subject matter experts with relevant knowledge may already be in-house.
However, outsourcing data labeling to third parties is an excellent solution to some of the biggest challenges of in-house data labeling—namely, time, resources, and quality. Third-party data annotation helps achieve the scale, speed, and quality needed to create effective training datasets while adhering to increasingly complex data privacy rules and requirements.