What is NLP text annotation? Among the basic knowledge that data labelers must understand, text annotation stands out as the most varied and most frequently requested type of labeling. Whether you work at an AI company writing requirements or at a professional text annotation company doing the labeling, you will run into the same problem: requirements set on the fly, semantics that keep changing, and so on, which gives everyone a headache. But have you ever wondered what the root cause is?
1. What is NLP annotation?
NLP (Natural Language Processing) is a branch of artificial intelligence used to analyze, understand, and generate natural language (such as Chinese, English, etc.) so that humans and computers can communicate effectively. The word "natural" means the language evolved naturally, distinguishing it from artificial languages such as C++, Java, and other deliberately designed languages.
2. NLP labeling process
A. Acquiring a corpus (that is, language materials, the basic units that make up a corpus.)
- Ready-made corpus
Books, documents, and other materials from daily life can be integrated and processed into a corpus for use.
- Corpus crawled from the web
The Internet generates a large amount of text every day (microblogs, forum posts, comments, and so on) that can be crawled and processed into a corpus. Its advantage is ease of acquisition: the text is already in electronic form and can be converted directly into the required format as it is crawled. The difficulty is that the way language is used on the Internet may differ from real life, so further processing is required.
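As a rough illustration (the URL and the choice of the requests and beautifulsoup4 packages are assumptions for this sketch, not part of any fixed pipeline), here is a minimal crawler that fetches one page and keeps its paragraph text:

```python
# Minimal crawling sketch: fetch one page and keep its paragraph text.
# Assumes the requests and beautifulsoup4 packages; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

def fetch_paragraphs(url: str) -> list[str]:
    """Download a page and return the non-empty <p> texts found on it."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")
            if p.get_text(strip=True)]

if __name__ == "__main__":
    for line in fetch_paragraphs("https://example.com/reviews"):
        print(line)
```

A real crawler would add deduplication, rate limiting, and robots.txt handling on top of this.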
- Manually collected corpus
Corpora with special needs can only be collected manually, such as children's dialogue, conversations from daily life, and so on. Applications in special scenarios basically require manual collection. Its characteristic is that the content of the corpus is standardized to the required scenario: people's actual conversations in daily life are collected under specific specifications. Demand for this type of collection is still high at the current stage.
B. Processing the corpus
Whether it is ready-made, crawled from the web, or collected manually, the corpus described above needs further processing before it can be used, so corpus processing often accounts for 50%-70% of the workload of a complete Chinese natural language processing project. Basically, the corpus goes through four processing steps: data cleaning, word segmentation, part-of-speech tagging, and stop-word removal. Most of these steps have to be done manually.
- Corpus cleaning
Corpus cleaning can generally be carried out along several dimensions, for example removing HTML remnants and URLs, deleting duplicates, and filtering out advertisements or other irrelevant text.
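As a simple sketch (the specific rules below are illustrative assumptions; every project defines its own), cleaning might strip HTML tags, URLs, and surplus whitespace, then drop empty lines and exact duplicates:

```python
import re

def clean(texts: list[str]) -> list[str]:
    """Strip HTML tags, URLs, and extra whitespace; drop empties and duplicates."""
    seen, out = set(), []
    for t in texts:
        t = re.sub(r"<[^>]+>", " ", t)       # remove HTML tags
        t = re.sub(r"https?://\S+", " ", t)  # remove URLs
        t = re.sub(r"\s+", " ", t).strip()   # collapse whitespace
        if t and t not in seen:
            seen.add(t)
            out.append(t)
    return out

print(clean(["<p>Great phone!</p>", "Great phone!", "Ships fast, see https://example.com"]))
```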
- Word segmentation
Word segmentation is a very important step in text analysis, and there are many segmentation methods, so we need to set the segmentation granularity, and how certain special words are split, in advance according to project requirements, in order to avoid ambiguity in later processing. This part can be combined with a word segmentation algorithm to speed up data labeling. There are many segmentation algorithms to choose from according to your needs, such as the forward maximum matching algorithm, the reverse maximum matching algorithm, the maximum N-gram score algorithm, the full segmentation algorithm, and the bidirectional maximum/minimum matching algorithm.
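As one possible illustration (jieba is a common open-source Chinese segmenter; any segmenter with a user dictionary works the same way), the sketch below shows how a project-specific term can be pinned in advance so it is not split:

```python
import jieba

sentence = "我住在和平饭店"           # "I am staying at the Peace Hotel"

# Default (precise-mode) segmentation: the hotel name is typically split in two.
print(jieba.lcut(sentence))           # e.g. ['我', '住', '在', '和平', '饭店']

# Register the name as a single token, per a project rule that keeps it whole.
jieba.add_word("和平饭店")
print(jieba.lcut(sentence))           # e.g. ['我', '住', '在', '和平饭店']
```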
- Part-of-speech tagging
Part-of-speech tagging means tagging each word with its part of speech, such as adjective, verb, or noun. It is not necessary for most processing; it is mainly needed for tasks such as sentiment analysis and knowledge reasoning. Compared with the other steps, however, it requires more specialized knowledge.
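As a brief sketch of what tagged output looks like, again using jieba for illustration (its posseg module pairs each token with a tag such as n for noun, v for verb, or a for adjective):

```python
import jieba.posseg as pseg

# Tag each token of "The service at this hotel is very good."
for word, flag in pseg.lcut("这家酒店的服务非常好"):
    print(word, flag)   # exact tags depend on the dictionary in use
```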
- Removing stop words
Stop words generally refer to words that contribute nothing to the text's features, such as punctuation marks, modal particles, and pronouns. Stop-word removal must be adapted to the scenario: some scenarios, for example, need modal particles in order to judge emotion.
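A minimal filtering sketch, with an intentionally tiny, illustrative stop-word list (whether modal particles belong in it depends on the scene, as noted above):

```python
# Illustrative stop-word list; real projects load a scene-specific file.
STOPWORDS = {"的", "了", "，", "。", "！", "也"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["服务", "非常", "好", "，", "价格", "也", "不", "贵", "。"]))
# -> ['服务', '非常', '好', '价格', '不', '贵']
```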
3. NLP text annotation method
With that in mind, it is easy to see why the first stage of obtaining the data is a headache: there are too many dimensions and types, and the comments on each product may differ as well. We therefore need to step back, analyze the task from a higher level, and extract commonalities and basic processing principles. We can consider this from three dimensions.
1. General principles: the basic principles that must be followed in this labeling process.
For example, the principle of simplicity (or minimum granularity) can be understood as using the smallest-granularity segmentation during word segmentation. Example: "Peace Hotel" could be kept whole as Peace Hotel or split into Peace/Hotel; under this principle we split it into Peace/Hotel.
2. Special definitions: how special cases are handled during labeling.
For example, certain proper nouns encountered during word segmentation are not split.
3. Labeling requirements: a description of the specific labeling process.
Within the labeling requirements, we again distinguish two angles.
a. The part-of-speech angle.
Here we ask what kinds of words need to be marked so that the labels best serve our needs. In this requirement we want to analyze the user's experience across the whole process of using the product, so what is involved, and what will the comments contain? First of all, sentiment is a category that must exist. Next, which feature words can express the customer's situation? This leads us to the core of the problem: feature words and sentiment words.
b. The event angle.
What is the event angle? This requirement covers many types of products, but whatever the product, the user goes through a whole chain of events before producing feedback. So which points along that chain can affect the user experience? Asked this way, the logic becomes clear: logistics, packaging, branding, and so on. You can then define the corresponding events according to the actual situation.
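To make the two angles concrete, here is a hypothetical annotation record for a single product comment; the field names are assumptions for illustration, not a standard schema:

```python
# One annotated comment, combining the part-of-speech and event angles.
record = {
    "text": "Delivery was fast, but the box arrived crushed.",
    "feature_words": ["delivery", "box"],       # part-of-speech angle
    "sentiment_words": ["fast", "crushed"],
    "overall_sentiment": "mixed",
    "events": ["logistics", "packaging"],       # event angle
}
print(record)
```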
The above uses a practical example to walk through the analysis of a text annotation scenario. Once this process has been sorted out, you will find that the results produced from the data converge toward a single, consistent answer. I hope that whether you are at an AI company formulating requirements or at a labeling company doing the labeling, you can use this as a reference.