Data is the cornerstone of machine learning: without data, there is no model. Three types of data are commonly used in artificial intelligence: text, images, and speech. Data collection is the process of gathering raw data for a specific target domain and scenario. The collected data is mainly unstructured, such as images, text, audio, and video. This article introduces the sources and collection methods for three kinds of data: text, images (including graphics and tables), and speech.
Artificial intelligence data collection methods:
1. Text data collection:
Text data can be collected in different ways depending on the type of data involved. The main methods are sensor collection, web crawling, and manual input. For news, industry-specific Internet sources, and government open data, you can write web crawlers and configure the data sources so that the data is crawled in a targeted manner.
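As a rough illustration of the crawler approach, the sketch below downloads a page and keeps only its visible text, using nothing but the Python standard library. The names `TextExtractor`, `extract_text`, and `fetch`, as well as the User-Agent string, are hypothetical choices made for this example. A real crawler would also need to respect robots.txt, rate limits, and each site's terms of use.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch(url, timeout=10):
    """Download one page; the User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": "example-text-collector"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

In use, a configured list of source URLs would be passed one by one through `extract_text(fetch(url))`, and the resulting text stored as raw data for later cleaning.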
2. Image data collection:
When using image acquisition software, choose software that supports multiple resolutions and multiple image types. Match the file format to the image: large images call for formats suited to large files, while small images can use compact formats such as jpg. To ensure data quality, the labeling requirements for all images should be settled before collection begins.
In the process of image labeling, simple character strings or text are generally used as labels, and the labeling results are then passed to the acquisition software for processing. For small file formats, tags (such as text, color, and shape) can generally be added to describe each image. If other files need to be processed during collection, you can also use a compression tool to compress small files.
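One simple way to store string labels is a CSV file with one row per image. The sketch below assumes that layout. The column names `filename` and `label`, the helper names, and the sample file names in the usage note are all hypothetical, and the format a given acquisition tool expects may differ.

```python
import csv
import io


def write_labels(records, out):
    """Write (filename, label) pairs as CSV rows under a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["filename", "label"])
    for filename, label in records:
        writer.writerow([filename, label])


def read_labels(src):
    """Read the CSV back into a {filename: label} dictionary."""
    return {row["filename"]: row["label"] for row in csv.DictReader(src)}
```

For example, writing the pairs `("img_001.jpg", "cat")` and `("img_002.jpg", "red car")` to an `io.StringIO()` buffer and reading it back with `read_labels` returns a dictionary that maps each file name to its label.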
3. Voice data collection:
Speech data comes in many types; common ones are speech recognition data (ASR) and speech synthesis data (TTS). Scripted collection of speech recognition data typically covers voice commands, wake words, or a combination of the two. Participants are usually asked to read a predefined set of wake words or voice command sentences.
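The reading script for such a session can be prepared in advance. The sketch below builds a prompt list covering wake words alone, commands alone, and each wake word followed by each command. The function name `build_prompts`, the record fields, and the sample phrases in the usage note are hypothetical and only illustrate the idea.

```python
from itertools import product


def build_prompts(wake_words, commands):
    """Return prompts for wake words, commands, and their combinations."""
    prompts = []
    for wake_word in wake_words:
        prompts.append({"type": "wake_word", "text": wake_word})
    for command in commands:
        prompts.append({"type": "command", "text": command})
    # Combined prompts: every wake word paired with every command.
    for wake_word, command in product(wake_words, commands):
        prompts.append({"type": "combined", "text": f"{wake_word}, {command}"})
    return prompts
```

Calling `build_prompts(["Hello device"], ["turn on the light", "play music"])` yields five prompts: one wake word, two commands, and two combined sentences. Each speaker can then be recorded reading the same list.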