Data labeling summary (1)
1. Under supervised learning, a large amount of (labeled) data is required.
2. Reasons for data noise:
Problems with data collection tools
Data entry or transmission errors
Technical limitations
3. After the data is imported, complete (data cleaning) and preprocessing to handle missing, inconsistent, and redundant information.
4. In drawing frame (bounding-box) labeling, the edge of each frame should fit closely against the (edge) of the object being labeled, and the attributes of each frame must be specified as well.
5. In cutting labeling, take special care that the outline of the label is tangent to the edge of the labeled object.
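
The tightness requirement in items 4 and 5 can be illustrated with a minimal sketch. It assumes the object is available as a list of (x, y) pixel coordinates, which is a hypothetical representation chosen for illustration; the function names are not from any particular labeling tool.

```python
def tight_box(pixels):
    """Return (x_min, y_min, x_max, y_max) touching the object's outermost pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def max_gap(box, pixels):
    """Largest distance, in pixels, between a labeled box edge and the tight box edge."""
    tight = tight_box(pixels)
    return max(abs(a - b) for a, b in zip(box, tight))


# Hypothetical object outline and a labeler's frame that is 2 px too wide on the left.
object_pixels = [(12, 30), (15, 28), (20, 35), (18, 40)]
labeled_box = (10, 28, 20, 40)

print(tight_box(object_pixels))             # (12, 28, 20, 40)
print(max_gap(labeled_box, object_pixels))  # 2
```

A quality inspector could compare `max_gap` against a tolerance agreed with the customer; the source does not state a specific tolerance.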
6. The text error rate refers to labeling errors in the transcribed speech content: if even one word is wrong, the whole utterance counts as wrong. It should generally be kept within (3%). The other error rate refers to errors in labeled items other than the speech content: if even one item is wrong, the utterance also counts as wrong. It should generally be kept within (5%).
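
As a rough sketch of how the two error rates in item 6 could be computed, the code below counts an utterance as wrong as soon as it has one wrong word or one wrong non-content item. The dictionary keys are hypothetical, and treating "within 3%" and "within 5%" as less-than-or-equal comparisons is an assumption.

```python
TEXT_ERROR_LIMIT = 0.03   # 3% limit stated in item 6
OTHER_ERROR_LIMIT = 0.05  # 5% limit stated in item 6


def error_rates(utterances):
    """Return (text_error_rate, other_error_rate) for a batch of utterances.

    Each utterance is a dict with hypothetical keys:
      wrong_words       - number of wrongly transcribed words
      wrong_other_items - number of wrong labels other than the content
    """
    total = len(utterances)
    if total == 0:
        return 0.0, 0.0
    text_wrong = sum(1 for u in utterances if u["wrong_words"] > 0)
    other_wrong = sum(1 for u in utterances if u["wrong_other_items"] > 0)
    return text_wrong / total, other_wrong / total


def batch_passes(utterances):
    """True when both rates stay within their limits (assumed inclusive)."""
    text_rate, other_rate = error_rates(utterances)
    return text_rate <= TEXT_ERROR_LIMIT and other_rate <= OTHER_ERROR_LIMIT


# Illustrative batch of 100 utterances (made-up data).
batch = [{"wrong_words": 0, "wrong_other_items": 0} for _ in range(100)]
batch[0]["wrong_words"] = 3   # several wrong words still count as one wrong utterance
batch[1]["wrong_words"] = 1
for i in range(2, 8):
    batch[i]["wrong_other_items"] = 1

print(error_rates(batch))
print(batch_passes(batch))
```

In this made-up batch the text error rate is 2% and the other error rate is 6%, so the batch fails on the second limit.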
7. Real-time inspection combines on-site inspection with roving inspection. It is generally scheduled while the data labeling task is (in progress), so that problems can be found and solved promptly.
8. Because computers in the (commercial office area, comprehensive office area, and data collection area) can connect to the Internet, these areas must not be located together with the data cleaning and data labeling areas, in order to keep the data in those areas secure.
9. Sampling inspection is an auxiliary inspection method in product manufacturing. In data labeling, to ensure accuracy, sampling inspections are layered on top of one another to form a (multiple sampling inspection) method. This method can supplement real-time inspection or full inspection and improve the accuracy of data labeling quality inspection.
10. Common noises include:
Voices of people other than the main speaker
Rain and animal sounds
Background music
Bicycle ticking and obvious electric current noise
11. Taking image labeling as an example, calculating the labor required for data labeling includes:
Calculating the man-hours for a single image
Calculating the number of people and the time needed for data labeling preprocessing
Calculating the number of people and the time needed for data labeling quality inspection
Calculating the number of data labelers and the time they need
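
The source lists the quantities to estimate but not a formula. A minimal sketch, assuming a simple model in which total work equals item count times time per item, and head count is that work divided by what one person can do before the deadline (all numbers below are made up):

```python
import math


def people_needed(num_items, seconds_per_item, hours_per_day, days):
    """Estimate head count for one stage (labeling, preprocessing, or quality inspection).

    Assumes every person works at the same speed; this is a simplification.
    """
    total_hours = num_items * seconds_per_item / 3600
    hours_per_person = hours_per_day * days
    return math.ceil(total_hours / hours_per_person)


# Hypothetical project: 100,000 images, 36 s of labeling each, 8 h/day, 10-day deadline.
print(people_needed(100_000, 36, 8, 10))  # 13
```

The same function could be applied separately to preprocessing and quality inspection by substituting the per-item time measured for each of those stages.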
12. To implement customer relationship management successfully, a data labeling factory needs to do the following:
Establish a business plan
Build a customer relationship management team
Manage customer information
Analyze customer relationship management
13. Data labeling is now applied across all walks of life. Different industries have produced their own data labeling requirements, which play a key role in the development of artificial intelligence. Examples include:
Transportation
Household
Medical care
14. Advantages of the multiple sampling inspection method:
It allocates the work focus of quality inspectors reasonably.
It effectively makes up for the omissions of other inspection methods.
It improves the accuracy of data labeling quality inspection.
15. Speech annotation: to meet the modeling and testing needs of (speech recognition, voiceprint recognition, speech synthesis) and similar tasks, the data must be tagged with speaker roles, environmental scenarios, multilingual tags, prosodic tags, noise tags, and so on.
16. Text annotation: through (word segmentation annotation, semantic judgment annotation, text translation annotation), sentiment annotation, pinyin annotation, polyphonic character annotation, number and symbol annotation, and so on, it can provide a highly accurate text corpus.
17. Data transformation converts data into a form suitable for data mining through methods such as smoothing, aggregation, data generalization, and normalization.
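
Of the transformation methods named in item 17, normalization is the easiest to show in a few lines. The sketch below uses min-max normalization, one common variant; the source does not say which variant it has in mind, so this choice is an assumption.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```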
18. Multiple linear regression involves more than two attributes and fits the data to a (multidimensional) surface.
19. In the database, records with the same attribute value can be regarded as (duplicate records).
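
A minimal sketch of the duplicate check in item 19, assuming records are plain dictionaries and that the caller chooses which attributes must match (the field names below are hypothetical):

```python
def find_duplicates(records, key_fields):
    """Return (first_index, duplicate_index) pairs for records whose key fields match."""
    seen = {}
    duplicates = []
    for index, record in enumerate(records):
        key = tuple(record[field] for field in key_fields)
        if key in seen:
            duplicates.append((seen[key], index))
        else:
            seen[key] = index
    return duplicates


rows = [
    {"name": "Li", "phone": "123"},
    {"name": "Wang", "phone": "456"},
    {"name": "Li", "phone": "123"},
]
print(find_duplicates(rows, ["name", "phone"]))  # [(0, 2)]
```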
20. Vehicles and license plates are very important labeling targets in the autonomous driving field. There are two main labeling methods: one is (drawing frame) labeling; the other is fine (cutting) labeling.
21. Lane line labeling is the comprehensive labeling of (road ground markings). It includes area labeling, classification labeling, and semantic labeling, and is used to train autonomous driving systems to drive according to lane rules.
22. Sign/signal labeling is the comprehensive labeling of suspended road signs and signals. It includes area labeling, classification labeling, and semantic labeling, and is used to train autonomous driving systems to drive according to (traffic rules).
23. Video tracking labeling is mainly used to train the (movement tracking ability) of autonomous driving systems for recognized targets, so that they can better identify targets while in motion.
24. Expression analysis is a kind of (classification) labeling, which generally needs to be carried out in conjunction with face labeling.
Data labeling summary (2)
1. By the entity that generates it, data can be subdivided into the following sources:
Data generated by a small number of enterprise applications, such as data in relational databases and data warehouses.
Data generated by a large number of people, such as Twitter, Weibo, messaging software, mobile communication data, e-commerce transaction logs, and comments on enterprise applications.
Data generated by a huge number of machines, such as application server logs, sensor data, image and video surveillance data, and QR code and barcode scanning data.
2. In supervised learning, the more (accurate) the input data samples and the (larger) their quantity, the (higher) the processing and operating efficiency. The scale and quality of data processing are directly related to the machine's intelligence. This is the meaning of the saying “there is as much intelligence as there is manual labor”.
3. After the data is imported, complete (data cleaning) and preprocessing to handle missing, inconsistent, and redundant information.
4. (Linear regression) involves finding the “best” straight line that fits two attributes (or variables) so that one attribute can be used to predict the other.
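
A minimal sketch of fitting the “best” straight line in item 4 by ordinary least squares, written in plain Python so that each step is visible. The sample data is made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)       # 2.0 1.0
print(slope * 5 + intercept)  # predicted y for x = 5: 11.0
```

Once the line is fitted, one attribute predicts the other by substituting into `slope * x + intercept`. The function assumes the x values are not all identical.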
5. In the customer service industry, text annotation is mainly focused on (scene recognition and response recognition).
6. The quality standard for semantic labeling is that the semantics of words or sentences are labeled correctly. During inspection, it is necessary to:
Check individual words or sentences
Examine the context and the situation in which they appear
Check the intonation of speech in speech data
7. After the labeler completes the first stage of the data labeling task, the quality inspector inspects the data labeled in that stage. If all of it is qualified, then in the second stage of real-time inspection the quality inspector only needs to inspect (50%) of the labeled data.
8. The evaluation process of the data labeling project is as follows:
Confirmation of acceptance standards, trial labeling, acceptance of the trial labeling, calculation of the labor required for data labeling, comprehensive evaluation of project cost, and comprehensive quotation.
9. To protect data security, computers in the (data cleaning area, data labeling area) may connect only to the LAN server, and copying data through external devices is prohibited.
10. The mainstream application fields of image annotation include:
Autonomous driving
Portrait recognition
Medical imaging
Mechanical imaging
11. Data cleaning includes the following methods:
Handling missing values
Handling noisy data
Handling duplicate data
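
The three cleaning methods above can each be sketched in a few lines. The specific techniques chosen here (mean imputation, smoothing by bin means, order-preserving de-duplication) are common textbook options and are assumptions; the source does not say which techniques it intends.

```python
def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def drop_duplicates(items):
    """Remove repeated items while keeping the first occurrence of each."""
    return list(dict.fromkeys(items))


print(fill_missing([1.0, None, 3.0]))                                 # [1.0, 2.0, 3.0]
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))     # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
print(drop_duplicates(["a", "b", "a", "c", "b"]))                     # ['a', 'b', 'c']
```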
12. Common noises include:
Voices of people other than the main speaker
Rain and animal sounds
Background music
Bicycle ticking and obvious electric current noise
13. Invalid speech includes the following types:
The speech is not Mandarin but a dialect, and the accent is so strong that it is hard to hear or understand.
The background noise is so loud that it interferes with recognizing the speech content.
The audio contains only filler sounds such as “um”, “ah”, and “uh”.
14. Behavior labeling is the area labeling and classification labeling of specific behaviors. It is mainly used in the monitoring of (dangerous behaviors), such as fighting, fainting, car accidents, suicide, and theft.
15. Pedestrian labeling marks pedestrians with frames and is mainly used for (statistics of the number of people entering and leaving). In shopping malls, supermarkets, city centers, stations, schools, factories, and other places where crowds are likely to be dense, these entry and exit counts are generally used to judge whether capacity is saturated, which can effectively prevent dangers caused by overcrowding.
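
A minimal sketch of the saturation check that entry and exit counts make possible. It assumes an upstream detector already emits a stream of "in" and "out" events; the event format and the capacity figure are hypothetical.

```python
def occupancy(events):
    """Running head count from a sequence of 'in' / 'out' events."""
    count = 0
    for event in events:
        if event == "in":
            count += 1
        elif event == "out":
            # Clamp at zero in case an entry was missed by the detector.
            count = max(0, count - 1)
    return count


def is_saturated(events, capacity):
    """True when the current head count has reached the venue's capacity."""
    return occupancy(events) >= capacity


events = ["in", "in", "out", "in"]
print(occupancy(events))        # 2
print(is_saturated(events, 2))  # True
```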
16. Expression analysis is a kind of (classification) labeling, which generally needs to be carried out in conjunction with face labeling.
17. Face labeling initially marked the face with a (frame) to train artificial intelligence to detect faces. Later, as face recognition algorithms developed, (point) labeling came into use to train artificial intelligence to perform face recognition.
18. Video tracking labeling is mainly used to train the (movement tracking ability) of autonomous driving systems for recognized targets, so that they can better identify targets while in motion.
19. 3D vehicle labeling is the (3D) labeling of vehicles in 2D images. It is mainly used to train autonomous driving systems to judge the volume of vehicles they pass or overtake.
20. (Vehicle polygon labeling) is the area labeling and classification labeling of vehicles. It is mainly used to identify vehicle types, such as vans, trucks, buses, and cars, in order to train autonomous driving systems to selectively follow other vehicles or change lanes when driving on the road.
21. The work of the data cleaning group is divided into (quality inspection) of the original data and cleaning of (sensitive private data).
22. Text annotation is a special type of annotation. Beyond basic frame annotation, it also requires (polyphone annotation), (semantic annotation), and so on, depending on the requirements.
23. Labeling (traffic lights), (vehicles), and (road signs such as viaducts) with frames in street-view images helps self-driving vehicles recognize road objects.