Best Artificial Intelligence Data Labeling (1): Text Labeling

Thanks to the rapid development of information technology in the new millennium and the convenience brought by big data, artificial intelligence has relied on big data to move quickly from theory to practical application and is gradually entering our lives. So how is the large volume of data that artificial intelligence depends on processed, and how is massive, unordered data turned into data that machines can understand? This article gives a brief introduction.

What is text annotation?

Text annotation is the process of labeling text with tags that describe its semantics, composition, context, purpose, emotion and other characteristics. With labeled training data, we can teach a machine to recognize the intention or emotion hidden in a text, so that it understands language in a more human way.

Types of text annotation

At present, data labeling in the industry mainly covers the following kinds of objects: text, sound, image and video (video is in most cases still converted into images for labeling). Text annotation itself includes the following types.

1. Semantic recognition

Semantic recognition means marking up text on a labeling platform. The same content, segmented differently or arranged in a different order, can express a completely different meaning. So for the computer to recognize a sentence clearly, the first step is to tell it which characters in each sentence form a word or phrase. This is word segmentation. Chinese is highly ambiguous in this respect, so accurate word segmentation is complicated and challenging.
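To make the ambiguity concrete, here is a minimal sketch of dictionary-based segmentation using forward maximum matching. This is a simple textbook technique chosen for illustration, not the method of any particular labeling platform, and the dictionary is a toy one made up for the example. The sentence 研究生命起源 means "study the origin of life", and 研究生 on its own means "graduate student".

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation against a dictionary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical, for illustration only).
vocab = {"研究", "研究生", "生命", "起源"}

segments = forward_max_match("研究生命起源", vocab)
print(segments)  # ['研究生', '命', '起源']
```

The greedy matcher grabs 研究生 first and leaves a stray 命, whereas the intended reading is 研究 / 生命 / 起源. Human-labeled segmentation data is what lets a model learn to resolve cases like this.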

2. Emotion recognition

Emotion recognition originally refers to AI automatically identifying an individual’s emotional state from physiological or non-physiological signals, and it is an important part of affective computing. Research on emotion recognition covers facial expression, voice, behavior, heart rate and text, from which the user’s emotional state can be judged.

3. Entity recognition

Entity recognition is an information extraction technique that obtains entity data such as person names and place names from text data.
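One common way to store entity labels is as token spans that are then converted to BIO tags (B = beginning, I = inside, O = outside). The sketch below assumes that convention; the sentence, the spans and the label names PER and LOC are made up for illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans into BIO tags; end is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical annotated sentence: a person name and a place name.
tokens = ["Li", "Ming", "flew", "to", "New", "York", "yesterday"]
spans = [(0, 2, "PER"), (4, 6, "LOC")]

bio_tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```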

4. Data cleaning

Data cleaning is the final procedure for discovering and correcting identifiable errors in data files, including checking data consistency and dealing with invalid and missing values. Data cleaning after input is generally done by a computer.
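As a rough sketch of what such automated cleaning can look like for labeled text, the function below normalizes whitespace and drops records with missing text, missing or invalid labels, and duplicate text. The record layout and the label names are assumptions made for this example.

```python
def clean_records(records, allowed_labels):
    """Normalize whitespace; drop empty, invalid-label and duplicate records."""
    seen = set()
    cleaned = []
    for record in records:
        text = " ".join((record.get("text") or "").split())
        label = record.get("label")
        if not text or label not in allowed_labels:
            continue  # missing text, or missing/invalid label
        if text in seen:
            continue  # same text already kept once
        seen.add(text)
        cleaned.append({"text": text, "label": label})
    return cleaned


# Hypothetical raw labeling output.
raw = [
    {"text": "  How do I  apply for a card? ", "label": "card_application"},
    {"text": "How do I apply for a card?", "label": "card_application"},
    {"text": "", "label": "card_application"},
    {"text": "What is my balance?", "label": None},
    {"text": "What is my balance?", "label": "balance_inquiry"},
]

cleaned = clean_records(raw, {"card_application", "balance_inquiry"})
print(cleaned)
```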


Text Annotation Application Areas

At present, text annotation is mostly encountered in industries such as customer service, public opinion, medical care and education. The application types generally include semantic recognition, emotion recognition, entity recognition, scene recognition, data cleaning and response recognition.

1. In the field of customer service

Annotation in the customer service industry mainly focuses on scene recognition and response recognition. Take an intelligent customer service robot as an example: when a user interacts with it, the robot quickly switches to the corresponding scene according to the content of the user’s inquiry, then lets the user choose a more finely subdivided response model to pin down the user’s actual scene, and finally gives the corresponding answer to the user’s specific question. The whole process is similar to filtering the user’s question in successive passes.

2. In the financial field

Due to the rapid development of computer and Internet technology, banking services are gradually moving toward networked, intelligent and personalized delivery. With intelligent customer service robots based on natural language understanding and speech recognition, banks can offer intelligent human-computer interaction to customers through online channels such as their official website, official account and Weibo, which can effectively reduce customer service costs and improve service quality.

3. In the medical field

With text annotation, a robot can be trained so that when you describe a condition to it, it tells you the relevant medical knowledge and provides services such as appointment registration and department navigation. Doctors can “speak” medical records in natural speech, intelligent guidance robots can answer patients’ questions and guide them to the right care, and intelligent medical image recognition systems help doctors read films for diseases such as cancer.


The optimization of medical vocabulary can greatly improve the work efficiency of doctors.

Through big data analysis of business data from outpatient services, hospitalization, examinations, nursing and other departments, the hospital’s administration, medical services and logistics support are optimized and reengineered. In-hospital guidance is a typical application of intelligent customer service robots in the medical field.

The journey of a sentence through the machine

In the initial stage of building such a response system, a large corpus of user inquiries has to be classified, and each user question has to be placed into the corresponding model (the same is true for other response robots), similar to this:

Classification of corpus (the actual classification is more detailed, here is just an example)

The data labeling in this step mainly marks the scene of each sentence and assigns the user’s questions to the corresponding scene. It requires annotators who are very familiar with the industry’s business logic tree, and it amounts to building the robot’s response knowledge base. When a user issues an instruction, the robot identifies which subdivided question fits it best and then returns the answer to that question to the user.
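The "best fit" lookup can be sketched as follows. The knowledge base, the scene names and the threshold value are hypothetical, and plain string similarity from Python's standard `difflib` module stands in for whatever model a real system would use to score the fit.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge base: scene -> list of (labeled question, answer).
KNOWLEDGE_BASE = {
    "bank_card": [
        ("how do i apply for a bank card",
         "Bring your ID to any branch to open a bank card."),
        ("how do i report a lost bank card",
         "Call the hotline to freeze the card."),
    ],
    "credit_card": [
        ("how do i apply for a credit card",
         "Submit the credit card application form online."),
    ],
}


def best_answer(query, threshold=0.6):
    """Return (scene, answer) for the best-fitting question, or (None, None)."""
    best = (0.0, None, None)
    for scene, pairs in KNOWLEDGE_BASE.items():
        for question, answer in pairs:
            score = SequenceMatcher(None, query.lower(), question).ratio()
            if score > best[0]:
                best = (score, scene, answer)
    score, scene, answer = best
    if score < threshold:
        return None, None  # nothing fits well enough to answer
    return scene, answer


print(best_answer("How do I apply for a bank card"))
print(best_answer("zzzz"))
```

The threshold (an assumed value here) is what decides whether the robot answers at all, which is why a query that sits close to two similar questions can end up in the wrong scene.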

There are two main ways of labeling: online platform labeling and offline form labeling. Which one is used depends on the company’s own situation. Take the offline labeling form of a company in the financial industry as an example:

Examples of text annotation

1. Example of customer service category labeling

Although a large organized corpus is used to list, as exhaustively as possible, the answer knowledge base for each scene and model, users phrase their questions differently and the context differs from scene to scene, so recognition by the machine is a matter of probability. Which question is finally recognized, and which answer is finally given, both depend on a threshold, so the recognition may be wrong.


When an error occurs, we call it a badcase. At this stage the annotator goes through the original chat data and marks whether the robot’s answer is correct. If it is not, the annotator records what kind of problem it is: a first-level classification error, a second-level classification error, or an answer that is not good enough to meet the user’s needs. For example, the user asks how to apply for a bank card and the robot replies with the credit card application process. This is a badcase: the robot put the question into the wrong category and gave the wrong answer.
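The annotator's decision can be written down as a small rule, sketched below. The category names and the two-level hierarchy are hypothetical. In particular, treating the bank card versus credit card mix-up as a second-level error is an assumption made for this example, since it depends on how the business logic tree is drawn.

```python
from collections import Counter


def classify_badcase(expected, predicted, answer_ok):
    """Verdict for one chat turn; expected/predicted are (level1, level2) pairs."""
    if expected[0] != predicted[0]:
        return "level1_error"
    if expected[1] != predicted[1]:
        return "level2_error"
    if not answer_ok:
        return "poor_answer"  # right category, but the answer did not help
    return "correct"


# Hypothetical reviewed turns: (expected category, robot's category, answer OK?).
turns = [
    (("cards", "bank_card_application"), ("cards", "credit_card_application"), False),
    (("cards", "bank_card_application"), ("loans", "loan_application"), False),
    (("cards", "bank_card_application"), ("cards", "bank_card_application"), False),
    (("cards", "bank_card_application"), ("cards", "bank_card_application"), True),
]

verdicts = [classify_badcase(*turn) for turn in turns]
summary = Counter(verdicts)
print(summary)
```

A tally like `summary` gives the team handling badcases a quick view of which kind of error dominates.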


The labeling in this step screens out the errors that occur and classifies them according to the business logic tree. After labeling, the colleagues responsible for handling badcases and the R&D colleagues jointly optimize the responses. (This is a long-term process that requires a stable team familiar with the business.)


Let me give another example, this time of natural language recognition. I will not cite ordinary natural language recognition, which extracts information such as time, place and people: there are already many labeling teams of that kind on the market, and the content of the labeling is relatively basic. Instead, I will take a natural language processing annotation project for the medical industry that I have worked on.


This is highly specialized annotation, and we recruited doctors and language teachers specifically to do it. The objects of annotation are fields extracted from medical records. The physical examination items and past history in a medical record follow templates, so they can be covered exhaustively with a small workload, and it is enough to identify the result in each replaceable slot directly. The chief complaint and the doctor’s description of the patient, however, are different every time.


So our labeling has two parts. First, we label the attribute of each word, that is, what kind of attribute the word has in this context (the same word can have different attributes in different contexts). Second, we label the function of each word in the sentence.

Here is an example, a chief complaint: low back pain for two years, accompanied by radiating pain in the left lower extremity for more than 10 days.

2. Examples of medical labeling

The purpose of this labeling is to let the machine recognize each word in the medical record. After a large amount of labeled data, the machine can recognize what attributes a word has and what role it plays in the sentence and in this context. It also learns to segment words and to identify which words are useful and which are not.
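To give a feel for what a two-layer annotation (attribute plus role in the sentence) might look like as data, here is a sketch for the chief complaint "low back pain for two years, accompanied by radiating pain in the left lower extremity for more than 10 days". The segmentation, the attribute names and the role names are all hypothetical and are not the scheme used in the project described above.

```python
from dataclasses import dataclass


@dataclass
class TokenAnnotation:
    text: str
    attribute: str  # what the word is in this context
    role: str       # what the word does in the sentence


# Hypothetical annotation of the chief complaint.
chief_complaint = [
    TokenAnnotation("low back", "body_part", "site_of_main_symptom"),
    TokenAnnotation("pain", "symptom", "main_symptom"),
    TokenAnnotation("for two years", "duration", "duration_of_main_symptom"),
    TokenAnnotation("accompanied by", "connective", "links_accompanying_symptom"),
    TokenAnnotation("radiating pain", "symptom", "accompanying_symptom"),
    TokenAnnotation("in the left lower extremity", "body_part",
                    "site_of_accompanying_symptom"),
    TokenAnnotation("for more than 10 days", "duration",
                    "duration_of_accompanying_symptom"),
]


def by_attribute(annotations, attribute):
    """Collect the text of every segment labeled with the given attribute."""
    return [a.text for a in annotations if a.attribute == attribute]


print(by_attribute(chief_complaint, "symptom"))
print(by_attribute(chief_complaint, "duration"))
```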

In the same way, labeling for natural language recognition in everyday conversation follows mostly similar principles, though the specific rules differ.