Text is a critical aspect of communication, and it is everywhere around us. With the growing adoption of the internet, text has become an integral part of day-to-day life. From social media conversations between friends and family to online reviews and complaints exchanged between consumers and brands, the data available for gaining insights has increased manyfold over the years.
Predictive Analytics Today defines text analytics as follows:
“Text analytics is the process of converting unstructured text data into meaningful data for analysis, to measure customer opinions, product reviews, feedback, to provide search facility, sentimental analysis and entity modeling to support fact based decision-making.”
What is text analysis all about?
Traditional data analysis is based on relational models in which data is stored in tables with predefined attributes (so-called structured data). However, only approximately 20% of the data available to enterprises is structured; the other 80% is unstructured free text. Most of the data we encounter online today is unstructured free text: over 40 million articles in Wikipedia (more than 5 million of them in English alone), 4.5 billion web pages, about 500 million tweets a day, and over 1.5 trillion Google queries a year. That is a colossal amount of data to process, and it is impossible for humans to handle alone. If machines take over the work of sorting through this data using text analysis models, the benefits for businesses will be huge. For instance:
- Regulatory compliance work involves complex, text-heavy documents, predominantly in the healthcare, pharmaceutical and financial sectors. Organizations in these sectors dedicate 10-15% of their workforce to such activities and end up spending considerable amounts of money that could have been used elsewhere.
- As part of HR analytics, text analytics can help capture the voice of employees, which can improve employee engagement and increase productivity.
- Spell checking, Keyword search, finding synonyms.
- Optimizing products and services based on social media monitoring which helps organizations understand the pulse of customers, their needs, wants and pain points.
- Creation of chatbots and voice recognition systems for improved customer experience.
Steps Involved in Text Analytics:
The following sections give a step-by-step guide to how an NLP engine allows users to analyze a broad array of data sources:
#STEP ONE: NORMALIZATION [Tokenization and Sentence Breaking]
This step separates the text into fragments for efficient information extraction. It involves identifying the markers for the ends of sentences, paragraphs and documents, and then breaking the stream of text into meaningful elements (words, phrases, and symbols).
1. Unstructured data have no pre-defined data type, i.e. the data can be strings, numbers or special characters.
Next, certain special characters and character sequences (like contractions and abbreviations) are removed and replaced with appropriate words. In this case, the engine recognizes that “doesn’t” equals “does not” and updates the output accordingly.
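As a rough illustration of this normalization step, the sketch below expands contractions, breaks text into sentences and then splits each sentence into tokens. It uses only Python's standard `re` module. The `CONTRACTIONS` table and all function names are assumptions made for this example, and the sentence-breaking rule is deliberately naive; a production engine would also handle abbreviations, quotes and similar cases.

```python
import re

# Hypothetical contraction table; a real engine would use a much larger lexicon.
CONTRACTIONS = {
    "doesn't": "does not",
    "can't": "cannot",
    "it's": "it is",
}

def expand_contractions(text: str) -> str:
    """Replace known contractions with their expanded (lowercase) forms."""
    text = text.replace("’", "'")  # treat curly and straight apostrophes alike
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def split_sentences(text: str) -> list[str]:
    """Break text after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens and single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def normalize(text: str) -> list[list[str]]:
    """Expand contractions, then return one token list per sentence."""
    expanded = expand_contractions(text)
    return [tokenize(s) for s in split_sentences(expanded)]
```

For example, `normalize("My smartphone doesn't charge. It is new!")` yields two token lists, the first of which contains the expanded tokens "does" and "not".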
#STEP TWO: PARTS OF SPEECH (PoS) TAGGING
This step detects the part of speech of every word, that is, whether a given token represents a proper noun, a verb, an adjective and so on.
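To make the idea concrete, here is a toy tagger that looks each token up in a small lexicon and falls back on suffix and capitalization rules. This is only a baseline sketch: the lexicon, the suffix rules and the coarse tag names (loosely modelled on the Universal POS tag set) are all assumptions for illustration, and real taggers are statistical models trained on annotated corpora.

```python
# Hypothetical mini-lexicon; real taggers learn these mappings from annotated corpora.
LEXICON = {
    "my": "PRON", "the": "DET", "is": "AUX", "does": "AUX", "not": "PART",
    "support": "VERB", "smartphone": "NOUN", "video": "NOUN", "android": "PROPN",
}

# Suffix fallbacks for words missing from the lexicon, checked in order.
SUFFIX_RULES = [("ing", "VERB"), ("ly", "ADV"), ("ous", "ADJ"), ("ful", "ADJ")]

def tag(tokens):
    """Return (token, tag) pairs using lexicon lookup with rule-based fallbacks."""
    tagged = []
    for tok in tokens:
        word = tok.lower()
        if word in LEXICON:
            pos = LEXICON[word]
        else:
            # First matching suffix rule wins; None if no rule applies.
            pos = next((t for suffix, t in SUFFIX_RULES if word.endswith(suffix)), None)
            if pos is None:
                # Unknown capitalized words are guessed to be proper nouns.
                pos = "PROPN" if tok[:1].isupper() else "NOUN"
        tagged.append((tok, pos))
    return tagged
```

Suffix rules of this kind are easily fooled (a gerund used as a noun would still be tagged as a verb), which is one reason practical systems rely on context-aware models instead.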
#STEP THREE: NAMED ENTITY RELATIONSHIP
The next step in the normalization process is undertaken by the Gazetteer module. It takes the basic attributes assigned by the morphology module and adds additional information using standard lexicons, custom lexicons, and custom rules. This is the stage in which the tool understands that “smart” and “phone” become “smartphone.”
Also, the unknown words module uses a variety of methods including suffix patterns and common misspellings to try to identify unknown words. In this case, “chagre” becomes “charge.” It also includes a dictionary of idiomatic expressions that in English includes introductory phrases, complex prepositions, and complex adverbs.
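The two behaviours described above, merging “smart” and “phone” and repairing “chagre”, can be sketched as follows. This is not how any particular Gazetteer or unknown-words module is implemented: the `COMPOUNDS` lexicon, the `VOCABULARY` list and the function names are hypothetical, and fuzzy matching via the standard library's `difflib.get_close_matches` stands in for the suffix-pattern and misspelling methods the text mentions.

```python
from difflib import get_close_matches

# Hypothetical custom lexicon: token pairs that should be merged into one term.
COMPOUNDS = {("smart", "phone"): "smartphone"}

# Hypothetical list of known words used to repair misspellings.
VOCABULARY = ["smartphone", "charge", "support", "video", "update", "my", "does", "not"]

def merge_compounds(tokens):
    """Merge adjacent token pairs that appear in the compound lexicon."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in COMPOUNDS:
            merged.append(COMPOUNDS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def correct_unknown(token):
    """Return the closest known word for an unknown token, or the token unchanged."""
    if token.lower() in VOCABULARY:
        return token
    matches = get_close_matches(token.lower(), VOCABULARY, n=1, cutoff=0.75)
    return matches[0] if matches else token
```

With these tables, `merge_compounds(["My", "smart", "phone"])` produces "smartphone" as a single token, and `correct_unknown("chagre")` returns "charge".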
#STEP FOUR: SYNTAX PARSING [SEMANTIC RECOGNITION]
Syntax parsing determines the structure of a sentence.
Understanding how words relate to each other is essential to establishing and extracting meaning from text. Special entities, such as dates, currencies, numbers, etc., are also recognized.
This becomes clear in the following example:
- My Smartphone does not support video conferencing until Android Update…
- Because my smartphone was supporting video conferencing, Android…
- Smartphone was not supporting video conferencing because Android…
In the first sentence, My Smartphone is negative, whereas Android is positive. In the second, Smartphone is still negative, but Android is now neutral. In the final example, both Smartphone and Android are negative.
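The recognition of special entities mentioned in this step (dates, currencies, numbers) can be approximated with regular expressions. The sketch below is a simplification under stated assumptions: it only knows ISO-style dates, three currency symbols and plain decimal numbers, and the pattern and function names are made up for this example. Real engines cover many more formats and locales.

```python
import re

# Alternatives are tried left to right, so dates and currency amounts are
# matched before the generic number pattern can claim their digits.
ENTITY_RE = re.compile(
    r"(?P<DATE>\b\d{4}-\d{2}-\d{2}\b)"
    r"|(?P<CURRENCY>[$€£]\d+(?:\.\d{1,2})?)"
    r"|(?P<NUMBER>\b\d+(?:\.\d+)?\b)"
)

def find_special_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in ENTITY_RE.finditer(text)]
```

Calling `find_special_entities("Paid $49.99 on 2021-03-15 for 2 chargers.")` reports one currency amount, one date and one number, in that order.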
#STEP FIVE: SENTENCE CHAINING
The last step in preparing unstructured text for detailed analysis is “sentence chaining”, sometimes also known as “sentence relation”. Here, individual sentences are linked using each sentence’s strength of association to an overall topic.
E.g. in the sentence “Smartphone is charging”, the words “Smartphone” and “charging” do not have a direct relationship. But “Smartphone” has a relationship with the verb “is”, and so does “charging”.
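One simple way to picture "strength of association to an overall topic" is to score each sentence by its word overlap with a set of topic terms and keep the sentences that score above a threshold. The text does not say which measure an actual engine uses, so the Jaccard overlap, the threshold value and the function names below are all assumptions chosen for illustration.

```python
def association(sentence_tokens, topic_terms):
    """Jaccard overlap between a sentence's words and the topic terms."""
    words = {t.lower() for t in sentence_tokens}
    topic = {t.lower() for t in topic_terms}
    union = words | topic
    return len(words & topic) / len(union) if union else 0.0

def chain_sentences(sentences, topic_terms, threshold=0.2):
    """Return indices of sentences associated with the topic, in document order."""
    return [
        i for i, sentence in enumerate(sentences)
        if association(sentence, topic_terms) >= threshold
    ]
```

Sentences whose indices are returned together can then be treated as one chain about the topic, while unrelated sentences are left out.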
Next Step – Modelling:
Various models can be deployed on the output of the text analytics steps above to help solve different NLP use cases. Below are a few of the popular models (not an exhaustive list):
- Long Short-Term Memory (LSTM): LSTM is a special kind of Recurrent Neural Network (RNN) that is capable of learning long-term dependencies by remembering information over long or variable periods of time. Each neuron (node) has a memory cell and three gates: INPUT, OUTPUT and FORGET. Each gate safeguards the information by stopping or allowing its flow.
- Input gate: Determines the amount of information from the previous layer that gets stored in the cell.
- Output gate: Determines how much of the next layer gets to know about the state of this cell.
- Forget gate: Appears to be an odd inclusion at first, but sometimes it’s good to forget. E.g. if the network is learning a book and a new chapter begins, it might be necessary for it to forget some characters from the last chapter.
LSTMs have been shown to be able to learn complex sequences, such as the writings of Leonardo da Vinci or composing ancient music. LSTM has also long been a default model for many sequence labelling tasks.
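The interplay of the three gates can be seen in a single time step of a one-unit LSTM cell, sketched here in plain Python. This is the commonly used textbook formulation reduced to scalars, not a trained model: the parameter layout `p` and the function names are assumptions for this example, and real implementations work on vectors and matrices.

```python
import math

def sigmoid(x):
    """Squash a value into (0, 1); used for the three gates."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell.

    p maps each of 'i', 'f', 'o', 'g' to a (w_x, w_h, b) tuple of scalars.
    Returns the new hidden state h and the new cell state c.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: how much new information is stored
    f = sigmoid(pre("f"))      # forget gate: how much of the old cell state is kept
    o = sigmoid(pre("o"))      # output gate: how much of the cell state is exposed
    g = math.tanh(pre("g"))    # candidate value computed from the current input
    c = f * c_prev + i * g     # updated memory cell
    h = o * math.tanh(c)       # hidden state passed on to the next step or layer
    return h, c
```

With a strongly positive forget-gate bias and a strongly negative input-gate bias, the cell state passes through almost unchanged, which is how an LSTM can carry information across many steps.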
- Transformer: The Transformer architecture serves as the core of almost all the recent major developments in NLP. It was introduced in 2017 by Google.
“Transformer neural networks apply a self-attention mechanism which directly detects relationships between all words in a sentence, regardless of their respective position”. It does so using a fixed-size context, i.e. the previous words. For example:
“She found the shells on the bank of the river.”
The model needs to understand that “bank” here refers to the shore and not a financial institution. The Transformer resolves this in a single step.
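The self-attention idea can be sketched in a few lines of plain Python. The example below computes scaled dot-product attention over a list of word vectors, using the vectors themselves as queries, keys and values. That is a deliberate simplification: real Transformers apply learned projection matrices and several attention heads, and the function names here are assumptions for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.

    X is a list of equally sized word vectors. Returns the attended vectors
    and the attention weight matrix (one row of weights per word).
    """
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Compare this word with every word in the sentence, whatever its position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all word vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, X)) for c in range(d)])
    return outputs, weights
```

Each row of the weight matrix shows how strongly one word attends to every other word, which is the mechanism that lets a word such as “bank” draw on “river” directly.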
The below animation depicts how Transformer works:
- Google’s BERT: BERT is pre-trained on a large corpus of unlabeled text, including the entire Wikipedia (that’s 2,500 million words!) and BookCorpus (800 million words).
BERT, short for Bidirectional Encoder Representations from Transformers, considers the context from both sides (left and right) of a word. This bi-directionality helps the NLP model gain a better understanding of the context in which the words were used.
BERT is considered to be the first unsupervised, deeply bi-directional system for pretraining NLP models. It was trained using just a plain text corpus.
Figure legend: E1 is the embedding representation, Trm is the intermediate representation (12 layers), and T1 is the final representation.
However, BERT performs quite poorly compared with humans when a sentence completion task requires world knowledge (common sense) that cannot be gleaned from the corpus.
- XLNet: XLNet is an auto-regressive language model that uses permutation language modeling and a two-stream self-attention architecture. Thanks to its auto-regressive formulation, it overcomes some limitations of BERT.
Shaded words are provided as input to the model while unshaded words are masked out.
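Permutation modeling can be pictured as a visibility mask: a random factorization order is sampled, and each word may only use the words that come earlier in that order, wherever they sit in the sentence. The sketch below builds such a mask. It is only an illustration of the idea, roughly matching the query-stream view of two-stream attention (the content stream would additionally see the word itself); the function names are assumptions, and no actual XLNet code is reproduced here.

```python
import random

def sample_order(seq_len, seed=None):
    """Sample a random factorization order over positions 0..seq_len-1."""
    order = list(range(seq_len))
    random.Random(seed).shuffle(order)
    return order

def visibility_mask(order):
    """Build the mask implied by a factorization order.

    mask[i][j] is True when position i may use position j as input, i.e. when
    j comes strictly earlier than i in the order. All other positions are
    masked out for i, including i itself.
    """
    rank = {pos: step for step, pos in enumerate(order)}
    n = len(order)
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
```

For the order `[2, 0, 3, 1]`, position 2 sees nothing, position 0 sees only position 2, and position 1 (last in the order) sees all three other positions.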
Apart from these, other models are extensively used as well, including ULMFiT (Universal Language Model Fine-Tuning), which is specialised for text classification tasks, and ELMo (Embeddings from Language Models), which finds its use in representing words as vectors and embeddings.
Limitations of Text Analytics:
- Natural language: Natural language suffers from ambiguity. One term can have several meanings and one phrase can be interpreted in various ways, so different meanings are obtained from the same text.
- Another limitation is that information extraction involves semantic analysis. As a result, only a limited part of the text, rather than the full text, is presented to users, while today there is a growing need for deeper text understanding.
Since the advent of management information systems in the 1960s, and then of BI in the ’80s and ’90s as a software category and field of practice, the emphasis has been on numerical data stored in relational databases. In those times, text in “unstructured” documents was hard to process. But with advancements in technology, text analytics has slowly but steadily overcome this challenge.
Websites have traditionally used text-based searches, which only find documents containing specific user-defined words or phrases. Now, by means of text analytics, one can find content based on meaning and context (rather than just by a specific word). Additionally, text analytics can be used to build large dossiers of information about specific people and events. For instance, large datasets based on data extracted from online news websites and social media platforms can be built to facilitate social network analysis or counter-intelligence. In effect, text analytics may act in a capacity similar to an intelligence analyst or research librarian, but with a more limited scope of analysis. Text analytics also plays a major role in email spam filters, as a way of determining the characteristics of messages that are likely to be advertisements or spam.