
NLP Algorithm: Text Research Field and NLP Text Annotation Tool

Natural language processing (NLP) aims to enable computers to process, understand, and use human language, so that humans and computers can communicate effectively. It underlies research on information retrieval, sentiment analysis, text classification, intelligent question answering, summary extraction, text mining, public opinion analysis, knowledge graphs, and more, all of which must resolve ambiguity at the morphological, syntactic, and semantic levels. This article mainly introduces the open source annotation tools and annotation platforms that I have personally used while learning the related algorithms, for reference.


Fields of text research:

Knowledge graph:

Knowledge graph technology draws on many natural language processing techniques. For representing resource content, the options range from shallow text vector representations to syntactic and semantic structure representations. On the NLP side, it can use word segmentation and part-of-speech tagging, named entity recognition, syntactic and semantic structure analysis, coreference resolution, and so on. Information extraction and semantic integration are the core technical problems in knowledge graph construction.

Information extraction:

Information extraction refers to extracting specified types of information (such as entities, attributes, relationships, events, commodity records, etc.) from unstructured or semi-structured text (such as web pages, news, papers, microblogs, etc.). It is a comprehensive technology that converts unstructured text into structured information, using techniques such as redundancy elimination and conflict resolution. At present, its core research topics are named entity recognition (NER), relation extraction, event extraction, and information integration.

Text Mining:

Text mining refers to the process of obtaining high-quality structured information from unstructured or semi-structured text data. In other words, its purpose is to obtain useful knowledge or information from unprocessed text. Typical text mining tasks include text classification, text clustering, concept/entity extraction, sentiment analysis, and document summarization.

Sentiment analysis:

The goal of sentiment analysis research is to build effective methods, models, and systems for analyzing the sentiment that the input expresses toward some object, such as opinion tendencies, attitudes, subjective views, or emotions. The main tasks include constructing sentiment resources, analyzing the quality of sentiment information, sentiment classification, and sentiment information extraction. Newer tasks include emotion explanation, irony analysis, and stance analysis.

Summary extraction:

Summary extraction refers to automatically analyzing one or more given documents, extracting and condensing the key information, and outputting a short, readable summary (usually a few sentences or a few hundred words). The sentences in the summary may be taken directly from the original text or rewritten. Its main tasks are key point selection and summary synthesis.
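As a rough illustration of the "take sentences directly from the original text" case, here is a minimal sketch of a frequency-based extractive summarizer. The scoring heuristic (average word frequency per sentence), the function name `summarize`, and the English-only tokenization are all my own assumptions for the example. They do not come from any tool discussed in this article, and a real system would at least filter stop words.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Pick the sentences whose words are, on average, most frequent in the text."""
    # Naive sentence split on ., ! or ? followed by whitespace (English only).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Key point selection: keep the highest-scoring sentences.
    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Synthesis: emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```

Calling `summarize(document, max_sentences=3)` returns up to three sentences from `document`, unchanged and in their original order.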

Information retrieval:

Retrieval users, information resources, and retrieval systems are the three main links that make up the complete structure of knowledge acquisition and information transfer in an information retrieval setting. The factors that currently limit the efficiency of information acquisition also show up mainly in these links, namely: expression of the user's intent, quality measurement of information resources (especially web resources), matching and ranking of results, and evaluation of information retrieval.

Looking across these fields, it is not hard to see that NLP revolves around four types of task: classification, sequence labeling, text matching, and text generation.


NLP classification/clustering:


Summary of algorithms for NLP classification:

1. Rule-based method (often used in special scenarios where no rules have been set)

Generally speaking, rules are used in special scenarios and are hard to extend to others. Writing the rules requires some domain expertise, and the maintenance cost is high: whenever training shows that rules must change or be updated, they have to be summarized again by hand. Even so, in non-standard special scenarios it is still necessary to manually label a selected batch of data, at some scale, for model evaluation and training.
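To make the rule-based idea concrete, here is a minimal sketch of a keyword-rule classifier. The categories, the keywords, and the function name `classify_by_rules` are hypothetical and chosen only for illustration. Real rule sets are written by domain experts and are usually much larger, which is where the maintenance cost comes from.

```python
import re

# Hypothetical hand-written rules: category -> keywords that signal it.
RULES = {
    "sports": ["match", "league", "goal"],
    "finance": ["stock", "bond", "interest rate"],
}


def classify_by_rules(text, rules=RULES, default="unknown"):
    """Return the category whose keywords occur most often in the text."""
    lowered = text.lower()
    scores = {}
    for category, keywords in rules.items():
        # Count whole-word (or whole-phrase) occurrences of every keyword.
        scores[category] = sum(
            len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered))
            for kw in keywords
        )
    best = max(scores, key=scores.get)
    # Fall back to the default label when no rule fires at all.
    return best if scores[best] > 0 else default
```

Every new scenario needs its own `RULES` table, which is exactly why this approach is hard to carry over from one domain to another.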


2. Traditional machine learning text classification: the main approach is to compute word-level TF-IDF and n-gram features and feed them into a classifier for training. For text feature extraction, word-level features include TF-IDF, n-grams, topic words, and keywords; sentence-level features include sentence pattern and sentence length. At the semantic level, word vectors pre-trained with word2vec, GloVe, or ELMo can be combined into a vector for the whole text and used as its semantic features.


The advantages of traditional machine learning classification are fast training and easy interpretation. The main idea of TF-IDF is that if a word or phrase appears frequently in one article and rarely in others, it discriminates well between categories and is suitable for classification; TF-IDF = TF * IDF. The main idea of n-grams is to take every run of n adjacent words that can be found in the source text. The traditional approach therefore uses n-grams to extract text features, uses TF-IDF to adjust the n-gram feature weights, and then feeds the extracted features to a classifier such as logistic regression or an SVM for training.

However, these feature extraction methods suffer from data sparsity and dimension explosion, which are disastrous for the classifier and limit the generalization ability of the trained model. Some dimensionality reduction strategy is therefore usually needed: stop word filtering, low-frequency n-gram filtering, LDA, and so on.

fastText is a model for word representation and text classification that Facebook open sourced in 2016. It fits naturally with the traditional approach described here and is well suited to text classification. A tutorial on using the fastText project with the official tools is available, and the fastText model can also be implemented in Keras. For an end-to-end walkthrough, I recommend processing-codes-in-python
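The n-gram and TF-IDF = TF * IDF ideas described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `ngrams` and `tfidf` are my own, whitespace tokenization is an assumption that does not work for unsegmented Chinese text, and the unsmoothed IDF used here is just one common variant (libraries typically add smoothing).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All runs of n adjacent tokens, in order of appearance."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Weight each n-gram of each document by TF * IDF.

    TF  = count of the n-gram in the document / number of n-grams in it
    IDF = log(number of documents / number of documents containing it)
    """
    grams_per_doc = [ngrams(doc.lower().split(), n) for doc in docs]
    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter()
    for grams in grams_per_doc:
        doc_freq.update(set(grams))
    weights = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams)
        weights.append({
            g: (c / total) * math.log(len(docs) / doc_freq[g])
            for g, c in counts.items()
        })
    return weights
```

A term that occurs in every document gets an IDF of log(1) = 0 and so carries no weight, which matches the intuition that it cannot discriminate between categories. The resulting dictionaries are sparse and grow quickly with n, which is the sparsity and dimensionality problem mentioned above.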


That article explains text preprocessing (noise removal, vocabulary normalization, and object standardization, with the related code), shows how entities are parsed and extracted from grammar in syntactic analysis, analyzes the commonly used TF-IDF, n-gram, LDA, and word2vec models, and finishes with classification using TextBlob's naive Bayes classifier and an SVM.


3. Deep learning text classification: with deep learning methods, the model structure and the model parameters clearly play a crucial role in classification quality. Commonly used structures and components include CNN, RNN, GRU, attention mechanisms, and dropout.


For basic models such as the various combinations of CNN, RNN, and GRU, I recommend the original article; a Chinese translation is at /p/39054002. It compares traditional machine learning classifiers with deep learning text models, with and without a pre-trained embedding model, and includes the code for these models.

If the four classic RNN configurations (N vs N, N vs 1, 1 vs N, and N vs M) are unclear, there is an article that makes a good choice: almost all of its formulas and derivations are shown as diagrams, and it explains where each model applies and what conditions suit it.

Other components: dropout and the attention mechanism. Attention is usually applied within the classic Encoder-Decoder framework, which is the well-known N vs M model among RNNs; seq2seq is a typical Encoder-Decoder framework. For this model I strongly recommend the article on exploring the Seq2Seq model and the Attention mechanism. It explains the calculation steps and concepts clearly and, unusually, includes a small task with a detailed solution. Note that the code for the small task is not in the article text but hosted separately.
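For readers who want to see the attention calculation itself, here is a minimal sketch of a single decoder step in an Encoder-Decoder model, written in plain Python. It uses dot-product scoring, which is one common choice and not necessarily the variant used in the recommended article. The function names `softmax` and `dot_product_attention` are my own.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, encoder_states):
    """Return (weights, context) for one decoder step.

    query: the current decoder state, a list of floats
    encoder_states: one vector per source position, same length as query
    """
    # Score each source position by its dot product with the decoder state.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    # Context vector = attention-weighted sum of the encoder states.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(len(query))
    ]
    return weights, context
```

The decoder would combine `context` with its own state to predict the next output token, so each output position can focus on different parts of the input.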


Beyond model structure, parameter tuning matters too: the learning rate has to be set, and the other parameters have to be adjusted to the characteristics of the model and of the text being analyzed.


Data sets for NLP classification:

THUCNews Chinese text classification: a large data set of 740,000 news documents (2.19 GB)


Kesci platform short text classification:


Natural Language Processing Group, International Database Center, Department of Computer Information and Technology, Fudan University: I found only a Baidu Cloud link and have not downloaded it


The remaining algorithm models and data sets will be covered in part two. Next, let's talk about NLP text annotation tools.


NLP preprocessing, and entity labeling in particular, is of course supported by existing training packages. But after working in an area for a while, you sometimes find that no public data set exists for your particular direction, so some rule-based work and labeling your own training data are still indispensable.


Text annotation tools and annotation platforms:

1. Prodigy: the online demo looks quite good. The catch is that it is a paid tool and not cheap, and I could not find a Chinese version (after a flurry of effort, all I could do in the end was facepalm).


2. YEDDA: only supports Python 2.7. Its main attractions are that it imports .txt files directly and is open source. The annotator interface is used to annotate sentences, and the administrator interface provides functions such as comparing different annotators' results on the same file. For an open source tool it is very good, but the shortcut keys are troublesome to set up, there is no way to mark sentiment or classification categories, and only 7 label types are supported. It is a very good choice if you do not need those features. Download address:


3. BRAT: guides are available for installation, configuration, and sentiment-oriented annotation. You can define custom annotations for specialized domains, but it only supports Linux. It only produces annotation files with the .ann suffix, which then need to be converted.
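As an example of the conversion step, here is a minimal sketch that reads the text-bound entity lines of a BRAT .ann file and turns them into character-level BIO tags, a format often used for Chinese NER. It relies on the BRAT standoff layout `ID<TAB>TYPE START END<TAB>TEXT` for lines whose ID starts with `T`. The function names `parse_ann` and `to_char_bio` are hypothetical, and relations, events, attributes, notes, and discontinuous spans are deliberately ignored.

```python
def parse_ann(ann_text):
    """Collect text-bound ("T") annotations from BRAT standoff text.

    Each such line looks like: T1<TAB>Person 0 5<TAB>Alice
    Other annotation kinds are skipped, as are discontinuous spans
    (offsets containing ';').
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, type_and_span, surface = line.split("\t", 2)
        if ";" in type_and_span:
            continue
        label, start, end = type_and_span.split()
        entities.append({"id": ann_id, "label": label,
                         "start": int(start), "end": int(end),
                         "text": surface})
    return entities


def to_char_bio(text, entities):
    """Character-level BIO tags for the annotated source text."""
    tags = ["O"] * len(text)
    for ent in entities:
        tags[ent["start"]] = "B-" + ent["label"]
        for i in range(ent["start"] + 1, ent["end"]):
            tags[i] = "I-" + ent["label"]
    return tags
```

In practice you would read the .txt file and its matching .ann file from disk, call `parse_ann` on the annotation text, and pass the result together with the source text to `to_char_bio`.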


For other tools, I recommend two articles. The first introduces the annotation tools on the market, with screenshots and basic features, which saves time when choosing. The second compares the pros and cons of many tools, although I was only able to find two or three of the tools it covers.


The mainstream tools today are YEDDA and BRAT, each with its own advantages and disadvantages. This time, though, I had a lot of text, some of my colleagues did not have Linux, others did not want to install Python, and YEDDA's shortcut keys really are hard to set up. With no other option, I read http: // which recommends the NLP annotation platform JD Zhongzhi. I tried it, and entity annotation works well: you can set multi-level labels, it supports pre-annotation, and like YEDDA it supports export in various formats. It has no word segmentation, but it is very simple to operate and very cheap. URL:


Copyright statement: This article is an original article of CSDN blogger “liuxiangjunzzz”, which follows the CC 4.0 BY-SA copyright agreement. For reprinting, please attach the original source link and this statement.

Original link:
