There is a lot of excitement in the market about artificial intelligence (AI), machine learning (ML), and natural language processing (NLP). Although many of these technologies have been available for decades, advances in computing power and new algorithmic developments are making them more attractive to early-adopter companies. These organizations are embracing advanced analytics for a number of reasons, including improving operational efficiency, better understanding customer behavior, and gaining a competitive advantage.
Organizations today deal with huge amounts and a wide variety of data: customer calls, emails, tweets, data from mobile applications, and more. It takes a lot of effort and time to make this data useful. One of the core skills in extracting information from text data is natural language processing.
Another important trend is that more AI technology is aimed at users beyond data scientists (e.g., a broad range of business users and “citizen” data scientists). Analytics applications increasingly include built-in AI/ML algorithms intended to make it easier for business analysts and other users to find insights. These include natural-language-based search interfaces, automated suggestions, and automated model building.
Machine learning for NLP and text analytics involves a set of statistical techniques for identifying parts of speech, entities, sentiment, and other aspects of the text. In one approach, the techniques are captured in a model that is trained on labeled examples and then applied to other text; this is known as supervised machine learning. In the other, a set of algorithms works across large sets of unlabeled data to extract meaning; this is known as unsupervised machine learning. It’s important to understand the difference between supervised and unsupervised learning, and how you can get the best of both in one system.
Text data requires a special approach to machine learning. This is because text data can have hundreds of thousands of dimensions (words and phrases) but tends to be very sparse. For example, the English language has around 100,000 words in common use, but any given tweet contains only a few dozen of them. This differs from something like video content, which also has very high dimensionality but comes with so much data per example that it is not nearly as sparse.
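To make the sparsity point concrete, here is a minimal sketch that turns a few short messages into word-count vectors and measures how many entries are zero. The three sample messages and the helper names (`tokenize`, `to_vector`) are made up for illustration, and whitespace splitting is a deliberate simplification of real tokenization.

```python
from collections import Counter

# Hypothetical toy corpus standing in for a stream of short messages.
docs = [
    "the delivery was late and the box was damaged",
    "great service and fast delivery",
    "the app crashes when I open my account",
]

def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()

# One dimension per distinct word seen anywhere in the corpus.
vocab = sorted({word for doc in docs for word in tokenize(doc)})
index = {word: i for i, word in enumerate(vocab)}

def to_vector(text):
    # Bag-of-words vector: position i holds the count of vocab[i].
    vec = [0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        vec[index[word]] = count
    return vec

vectors = [to_vector(doc) for doc in docs]

# Sparsity = fraction of vector entries that are zero.
nonzero = sum(1 for vec in vectors for value in vec if value)
total = len(vectors) * len(vocab)
sparsity = 1 - nonzero / total

print(f"{len(vocab)} dimensions, {sparsity:.0%} of entries are zero")
```

Even in this tiny corpus most entries are zero. With a realistic vocabulary the vectors grow by several orders of magnitude while each message still contributes only a handful of nonzero counts, which is why text pipelines usually store such vectors in a sparse format.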
In supervised machine learning, a batch of text documents is tagged or annotated with examples of what the machine should look for and how it should interpret that aspect. These documents are used to “train” a statistical model, which is then given the un-tagged text to analyze.
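The train-then-apply workflow can be sketched with a small Naive Bayes classifier written from scratch. This is only one of many possible statistical models, chosen here because it fits in a few lines; the tagged examples, the labels, and the function names (`train`, `predict`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-tagged training set: (text, label) pairs.
tagged_docs = [
    ("I love this product it works great", "positive"),
    ("great support and fast shipping", "positive"),
    ("terrible experience the product broke", "negative"),
    ("slow shipping and rude support", "negative"),
]

def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()

def train(examples):
    # Count documents per label and word occurrences per label.
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(model, text):
    # Score each label by log prior plus summed log likelihoods,
    # using add-one (Laplace) smoothing for unseen label/word pairs.
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(tagged_docs)
prediction = predict(model, "great product fast shipping")
print(prediction)
```

The tagged pairs play the role of the annotated documents, `train` builds the statistical model from them, and `predict` applies that model to un-tagged text. A production system would use far more training data and a tested library implementation rather than this hand-rolled version.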