Build vs Buy: Text Analytics & NLP
Adding new natural language processing (NLP) capabilities to a product, application or other analytics infrastructure? Answer this question before anything else: Should I build my system from open source components or license a solution from an NLP provider?
For text analytics and NLP, “build or buy” means choosing between:
Building from open source
Licensing a basic cloud API
Customizing an NLP platform
Open source NLP models are better than ever, and NLP APIs are no longer simple. Today, you face a myriad of free tools to choose from, as well as a host of text analytics vendors competing for your attention.
This article will help you make an informed decision by explaining the options you face, what to expect if you choose to build, and how to choose an NLP provider if you decide to license an API or platform.
Summary and Important Steps to Take
If you take away nothing else from this article, let it be this: Building your own NLP system from open source may be a viable option, but it is rarely the best option.
Instead, look to license and configure an API or NLP platform based on four factors: document types, data volumes, analytics requirements, and other requirements.
For simple use cases, such as document-level sentiment or simple categorization, go with a text analytics API or other NLP tool.
For more detailed text analysis, such as multi-layer sentiment scoring, complex categorization, or custom entity recognition, or if you have complex documents, work with a customized NLP platform.
Building with open source NLP libraries
System costs: $0 (open source)
Technical costs: $81,000+ (hiring an NLP-skilled person and other engineers)
Timeline: Weeks
Capabilities: Limited without much extra work
Licensing an API or NLP platform
System costs: $10,000 (basic cloud analytics) to $40,000+ (configurable NLP platform)
Technical costs: None (your vendor does most of the work)
Timeline: Days
Capabilities: Both deep and wide, customized to your needs
What is the “Build or Buy” Question?
The build or buy question is the choice between building your own version of something and buying a prefab (“off the shelf”) solution from another company. People deal with build or buy questions every week: Cook dinner or order delivery? Manage my own investments or pay an adviser? Wash my clothes myself or take them to the cleaner?
One way to think about this is to compare building or repairing a car with buying a new one. Of course, it can be fun to build your own internal combustion engine from scratch. But you have other things to do, and the time you spend fixing up your car is time you could have spent buying groceries, taking your kids to kindergarten, or meeting friends.
Building from Open Source NLP: Fundamentals of Text Analytics
Building your own NLP system with open source libraries can be a viable option but is rarely the best option.
Why? Because while building a basic NLP tool is easy, building something useful is actually very difficult.
Like a car, any NLP system worth its salt involves a large number of complex moving parts. When you buy an off-the-shelf solution, most of them are taken care of by the vendor. But if you build a text analytics system from scratch, you are responsible for everything.
There are 7 basic text analytics functions, each of which plays an important role in deep natural language analysis.
This chart shows a simplified view of the layers of processing that an unstructured text document passes through to be converted into structured data at Lexalytics.
[Figure: Lexalytics NLP technology stack]
All of that depends on machine learning models, too. Language detection, Part of Speech tagging, named entity recognition and other functions all require machine learning models to achieve optimal accuracy. Each model must be trained on a data set that includes hundreds or thousands of hand-annotated documents.
And that’s only the tip of the iceberg. When thinking about building your own system, you need to understand the deceptive simplicity and hidden pitfalls of the basic elements of NLP.
Limitations of Basic NLP: Sentiment Analysis and Categorization
Rules-based sentiment analysis and query-driven categorization are very simple concepts. If you work from open source libraries while deliberately limiting the scope of your system, you may get a basic sentiment scorer or categorizer working quickly.
Similarly, three Big Tech companies (Google, Amazon, and Microsoft) offer cloud NLP services, including sentiment and content analysis. If your analysis needs are simple, if you do not need on-premise processing, and if you do not have many documents, these tools can be an effective and inexpensive choice.
Don’t jump the gun, though. Open source NLP is fine for simple use cases. But the cost-benefit analysis goes against it unless you already have an established data science program. Similarly, large cloud providers are good at handling low-volume use cases that involve one or two basic NLP features. But these tools offer limited analytics and limited tuning. If you need more complex analysis or customization, they simply will not support you.
This is important because basic text analytics comes with many pitfalls. For example, document-level sentiment scores can be misleading because they are stripped of context.
Consider a restaurant review that reads, “The linguine was great, but the room was too dark.”
Clearly, this expresses positive sentiment about the linguine and negative sentiment about the ambience of the room. But a document-level tool will average the two scores and report the review as neutral.
This example shows that document-level, rules-based sentiment analysis can be dangerous, especially if you are making important decisions based on the outcome. For more information, read this paper: Theme Extraction and Context Analysis.
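The averaging problem described above can be reproduced in a few lines. This is a minimal sketch, not how any particular product scores text: the two-word lexicon, the naive split on “, but ”, and the function names are all assumptions made for illustration, and the review is a stand-in with the same shape as the restaurant example.

```python
# Toy lexicon: hypothetical word scores, for illustration only.
LEXICON = {"great": 1.0, "dark": -1.0}


def score(text):
    """Average the lexicon scores of every known word in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def score_by_clause(text):
    """Score each clause separately, splitting naively on ', but '."""
    clauses = [c.strip() for c in text.split(", but ")]
    return {clause: score(clause) for clause in clauses}


review = "The food was great, but the room was too dark."

document_score = score(review)           # positive and negative cancel out
clause_scores = score_by_clause(review)  # each opinion keeps its own score

print(document_score)
print(clause_scores)
```

The document-level score hides both opinions, while the clause-level scores keep them apart. Real systems need far more than a string split to find what each opinion refers to, which is where the extra engineering effort goes.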
The point is, basic analytics are useful, but only up to a point. If you need deeper insights or custom analytics capabilities, you need advanced NLP.
The Cost of Building Advanced NLP with Machine Learning
Advanced NLP features such as entity-level and topic-level sentiment, complex categorization, and theme analysis are powerful tools for data analysis. But building these features requires you to combine NLP rules with custom machine learning models. And this adds substantial cost to your project.
According to Glassdoor, the average salary of a natural language processing engineer based in the United States is more than $80,000. Hiring a data scientist as well to train NLP machine learning models will put you into six figures, plus benefits and bonuses.
In addition to hiring NLP engineers and data scientists, you will need to collect, clean and annotate data to train your models. A single NLP categorization model, for example, requires at least 100 pieces of training content. Models for language-specific sentiment or Part of Speech tagging require thousands more. Each data set must be gathered, carefully cleaned, and painstakingly annotated by hand before it is fed to the model. If you do not have a good pipeline for processing data and for version control as your models grow over time, you will run into problems quickly.
In summary: to build your own NLP system that can deliver deep, useful insights, you are looking at $100,000 or more in talent and data. And that is on top of the thousands of hours your people will have to spend building, tuning and retraining the system.
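As a rough illustration of the arithmetic, the sketch below compares the lower-bound figures quoted in this article. It assumes, for simplicity, that each figure covers the same period and that nothing else is spent; the function name and the `other_costs` parameter are hypothetical, and real quotes and salaries will differ.

```python
def total_cost(system_cost, staff_cost, other_costs=0):
    """Sum the cost components for one option over the same period."""
    return system_cost + staff_cost + other_costs


# Lower-bound figures quoted in this article (each is marked "+" there).
build = total_cost(system_cost=0, staff_cost=81_000)
buy_basic_api = total_cost(system_cost=10_000, staff_cost=0)
buy_platform = total_cost(system_cost=40_000, staff_cost=0)

print(f"Build from open source: ${build:,}+")
print(f"License a basic API:    ${buy_basic_api:,}")
print(f"License a platform:     ${buy_platform:,}+")
print(f"Build minus platform:   ${build - buy_platform:,}")
```

Data collection and annotation would be passed in through `other_costs`, which only widens the gap on the build side.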
Choosing an NLP Solution: 4 Things to Consider
To repeat: Building your own NLP system with open source libraries may be a viable option but is rarely the best option.
In most cases, it is more cost-effective to license an API or NLP platform. Which approach and which vendor are best for you? Start by understanding your needs. Make your choice based on these four factors:
First, consider the type of documents you are working with. Simple survey responses? Try an open source NLP model or the text analytics capabilities of your survey tool. Unstructured content such as customer reviews, free-text surveys, and social media comments? You will need a dedicated NLP tool to handle them. Contracts, invoices, emails, medical files or other complex documents? Look for a fully customized NLP solution supported by an established vendor.
Next, work out how much data you process per month, on average. A few hundred or a few thousand documents can be handled easily with an open source model or a cloud API. Beyond that, open source runs into performance problems and cloud solutions become very expensive. If you have tens of thousands of documents or more, look for a scalable NLP tool with predictable pricing and a stable, scalable architecture.
Third, check your analytics requirements. Document-level sentiment, simple categorization, or common entity recognition can often be handled by open source models or even an application-specific solution such as a survey tool. If you want to understand why people feel the way they do, define custom entities or sort content into complex buckets, however, you will need an NLP platform with tuning and configuration tools.
Lastly, list your other requirements, such as private data storage, on-premise processing, low-latency data analysis, a high level of support, or specific services such as custom machine learning models.
Open source NLP models can process documents locally but leave you to fend for yourself with training. Cloud analytics providers may offer private storage, but you will never know where your data goes when you call their API. Meanwhile, Big Tech companies do not offer much in the way of services and training. After all, they are not in the NLP business, they are in the cloud business.
Only dedicated NLP companies such as Lexalytics combine all the technologies needed to meet your technical requirements with the expertise needed to help you meet your goals.
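The four factors can be read as a simple set of rules. The sketch below is one hypothetical way to encode them, not a vendor tool: the `Needs` fields, the category labels, and the 10,000-documents-per-month threshold (standing in for “tens of thousands”) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    document_type: str   # "simple", "unstructured", or "complex"
    docs_per_month: int  # average monthly document volume
    analytics: str       # "basic" or "advanced"
    needs_custom_models: bool = False


def recommend(needs: Needs, volume_threshold: int = 10_000) -> str:
    """Map the four factors to one of the three options in this article."""
    # Complex documents, advanced analytics, or custom models point to a
    # configurable platform backed by a vendor.
    if (
        needs.document_type == "complex"
        or needs.analytics == "advanced"
        or needs.needs_custom_models
    ):
        return "customized NLP platform"
    # Unstructured content or high volume points to a dedicated NLP tool.
    if (
        needs.document_type == "unstructured"
        or needs.docs_per_month >= volume_threshold
    ):
        return "dedicated NLP tool"
    # Simple documents, low volume, and basic analytics.
    return "open source model or cloud API"


print(recommend(Needs("simple", 500, "basic")))
print(recommend(Needs("unstructured", 2_000, "basic")))
print(recommend(Needs("simple", 500, "advanced")))
```

Treat the output as a starting point for a conversation with vendors rather than a decision, since requirements such as private storage or support levels do not reduce to a single rule.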
Summary: Build or Buy Natural Language Processing?
So, should you build or buy text analytics and NLP?
Building from open source
Benefits: free NLP libraries; complete control of the architecture; on-premise processing; develops in-house skills
Disadvantages: takes time to build; no support or services; requires technical expertise; limited analytics without much extra work
Licensing an API or NLP platform
Benefits: start analyzing immediately; more analytics capabilities; tune and configure as needed; someone else maintains the core tech; full support and services
Disadvantages: license costs; you still need to tune the engine; choosing the wrong provider leads to big headaches down the road
And perhaps most importantly: Choosing to buy, rather than build from scratch, frees you to focus on achieving your goals: improving product and customer experience, reducing employee turnover, managing compliance, automating business processes, or something else altogether.