Lingua Franca: An overview of NLP and translation for social media analytics in CPG – Part I

Natural Language Processing (NLP) methodologies enable us to leverage ‘natural’ or ‘human’ language as a data source for Machine Learning and AI to gain insights, observe patterns and even build sophisticated algorithms like recommender systems. Tweets, blogs and other social media posts on the internet form a large repository of text data in circulation about various topics. This text can be broken down via NLP to offer insights about anything that is discussed at large. Needless to say, now more than ever, NLP plays a pivotal role in social media analytics and listening.

NLP is now being used across industries to mine data, discover insights and, in the CPG space, track patterns to determine and forecast trends. That said, while NLP can be used to uncover insights, there is no geographical promised land of data that provides marketers with information on global trends and buzzwords. This is especially hard to track in the Beauty and Personal Care industry. While unstructured data is available in many languages, most of it does not lend itself to traditional NLP methods. At this point, there is a need to handle data tweeted, posted or blogged in foreign languages as well.

In the first part of this two-part blog series on Natural Language Processing, we will examine the importance of NLP, its methodologies pertaining to text and its extension to foreign languages.

The importance of NLP illustrated in a use case

Natural Language Processing (NLP) is a pivotal part of text analytics. As we know, NLP methodologies specialize in dealing with ‘natural’ or human language, spoken or written on different media, and in drawing findings from it. We will discuss this in detail through an example of how LatentView Analytics leveraged NLP to discover ‘trends’ on social media for a global beauty brand. The legacy beauty brand wanted to understand trends in the beauty market by deciphering social media data to mine consumer conversations. They wanted to leverage analytics to spot upcoming trends before they became mainstream.

The LatentView Analytics team used text analytics to mine various beauty trends from social media channels such as Twitter and Instagram and track them for the client. All of the text came from the US, and the source text was restricted to English. The text was processed using various NLP models coupled with deep learning. The processes were then automated into a pipeline that worked on a monthly influx of text data from sources such as Twitter and YouTube.

Interestingly, the source data, although restricted to the United States, offered a lot of observations on Korean beauty and skincare as well. The client therefore required a parallel observation of trends from these countries in order to track the trends and their trajectory. In other words, mining text in Korean or Japanese could provide more insight into which beauty trends and products were being talked about in those countries, and looking at both sets of trends would also help ascertain whether these trends follow a trajectory of starting in Asia and then emerging in the US market.

The challenge therefore expanded from English Text to other languages, which meant dealing with huge volumes of data, breaking them into meaningful units, translating and preparing them for trend generation.

Linguistic drills: NLP methods for trend generation from English text

The trend generation system already in place involves a plethora of Natural Language Processing tasks.
Analyzing text to find trends involved more than matching keywords: it required spam identification, sentiment analysis, trend identification and several other tasks.

NLP methodologies included:

  • Stemming (reducing morphological variants of a word to a common root/stem)
  • Lemmatization (removing inflectional endings of words to return their lemma, or dictionary form)
  • Stop-word removal (removal of very frequent words that carry little meaning, e.g. articles and prepositions such as ‘the’ and ‘for’)
  • POS-tagging, or Part of Speech tagging, to identify the grammatical category of each unit of the text
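The cleaning steps above can be sketched in a few lines. This is a toy illustration: the names, the hand-rolled stop-word list and the crude suffix-stripping stemmer are stand-ins for what a real pipeline would do with NLTK's PorterStemmer, stopword corpus and `pos_tag`.

```python
# Toy sketch of the preprocessing steps above. The stop-word list and the
# suffix-stripping "stemmer" are deliberately simplistic stand-ins for
# NLTK's PorterStemmer and stopword corpus, kept here for self-containment.
STOP_WORDS = {"the", "for", "a", "an", "is", "are", "and", "of", "in"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for Porter's rules

def toy_stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Lower-case, tokenize on whitespace, drop stop words, then stem."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Glowing skin is the biggest trend for summer"))
# ['glow', 'skin', 'biggest', 'trend', 'summer']
```

In practice each stage (tokenization, stop-word filtering, stemming/lemmatization, POS tagging) would be a configurable step in the monthly pipeline rather than a single function.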

The text had to be cleaned and processed, and NLTK proved to be a very important package here. Word vectorization, and coupling our text processing with neural networks, helped us process a large volume of tweets and posts. All of it, of course, was written in English.

While Python packages have made text mining in English fairly simple, posts written in other languages, especially ones unknown to us (no matter how much we love Korean drama), are difficult to handle without a language expert intervening manually.

Breaking the text into linguistically meaningful units (in other words, tokenizing), filtering out useless or ‘spam’ text, incurring time and memory costs, and handling two or more alternate scripts are a few of the problems associated with the exercise.
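One concrete sub-problem, routing a post to the right tokenizer by script, can be sketched with plain Unicode code-point ranges. The ranges below cover the main Hangul and Japanese kana blocks; a production system would rely on a proper language-detection library instead.

```python
# Rough script detector: decide which tokenizer a post should go to by
# counting characters in known Unicode blocks. Only the main Hangul and
# Japanese kana blocks are listed; anything else falls through to "latin/other".
SCRIPT_RANGES = {
    "hangul":   [(0xAC00, 0xD7A3), (0x1100, 0x11FF)],  # syllables, jamo
    "japanese": [(0x3040, 0x309F), (0x30A0, 0x30FF)],  # hiragana, katakana
}

def detect_script(text: str) -> str:
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "latin/other"

print(detect_script("글로우 스킨"))  # hangul
print(detect_script("glow skin"))    # latin/other
```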

Need for Translation: preparing unstructured foreign-language data as meaningful units suited to the existing trend pipeline

Adding foreign-language data to the English pipeline was the next challenge at hand. The big question was whether the team could work with text data without being acquainted with the language. Several approaches were taken up in order to proceed without a language expert. Packages such as KoNLPy and Whoosh also made the task at hand easier than staring at a screen full of unknown characters.

Without a language expert, only a successful translation of significant samples would help analyze and understand the trends, compare them with those from the US and, in the future, model the process and build a translator. There are many neural machine translation options; however, the team decided to experiment with the Google Translate API first. Free versions of the Google Translate API proved to be insufficiently stable, and hence Yandex was the chosen alternative.
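A call to a hosted translation service boils down to one HTTP request per batch of posts. The sketch below builds such a request in the shape of the (now legacy) Yandex Translate v1.5 API; the endpoint and parameter names should be treated as illustrative assumptions, and `YOUR_KEY` is a placeholder, not a real credential.

```python
# Sketch of building one translation request. The endpoint and parameter
# names follow the legacy Yandex Translate v1.5 API and are illustrative;
# no network call is made here, only the request is assembled.
import urllib.parse

API_URL = "https://translate.yandex.net/api/v1.5/tr.json/translate"

def build_request(text: str, api_key: str, lang_pair: str = "ko-en"):
    """Return the URL and URL-encoded form body for one translation call."""
    params = {"key": api_key, "text": text, "lang": lang_pair}
    return API_URL, urllib.parse.urlencode(params)

url, body = build_request("글로우 스킨", api_key="YOUR_KEY")
print(url)
print(body)
```

In the actual pipeline this request would be posted (e.g. with `requests` or `urllib.request`), the JSON response parsed, and the translated text fed into the existing English trend-generation steps.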

Let’s look at what Google and Yandex have to offer and why Yandex scored higher on the scale.

  • Nature of Translation: Yandex and Google cannot be compared objectively; even within a single language, each gave different results for different sentences, with one outperforming the other depending on the sentence. Yet the Google Translate API is often quoted as the first choice for projects.
  • Number of Languages Supported:  Google Translate supports over 100 languages while Yandex supports 93 languages.
  • Pricing: Pricing became a crucial point given the urgent need for translation. Although Google offered a free API, it was insufficiently stable, and after much research we understood that the official version of the Google API is not free: there is a $300 credit, but one has to sign up with a payment method. The Google API charges $20 per million characters of text, while Yandex issues the API key free of charge, includes the first 10 million characters free, and applies a differential pricing model beyond that.

Pros and Cons: The positives and negatives of using any translator API largely depend on the use case, the language pairs, the purpose of translation and pricing constraints. The table below summarizes the various factors, listing the upside and downside of each option based on the use case:

| Translator | No. of Languages | Result Quality | Free Tier | Pricing |
|---|---|---|---|---|
| Google | 100+ | Good | No | $20 per million characters |
| Yandex Translate | 93 | Good | API key free of charge; first 10 million characters free | $15 for 0–50M characters, $12 for 50–100M, $10 for 100–200M, $8 for 200–500M, $6 for 500M–1B |
| Microsoft Translate API | 60+ | Good | Up to 1 million characters free | Tier-wise pricing |

Choosing an API among the various options available depends largely on several factors, including the language pair, the purpose, the level of granularity in the text that holds importance and the stage of the project.

Challenges and the way ahead

There are multiple hold-ups when it comes to dealing with foreign languages, one of which is the quality of the translation; another is the sheer volume of data, which makes it time-consuming to process and clean. There are several other problems pertaining to dealing with a foreign script, which will be explained in the second part of this series – Lingua Franca: A closer look at the treatment of Foreign Language for Social Media Analytics in Personal Care.

While the tools in the ever-growing text and NLP market keep reinventing themselves to address various technical challenges, the openness of the data science community, a whole world of packaged code, shared knowledge, some patience and a few cups of coffee help us steer through any data problem we encounter!

Decoding consumer conversation to predict trends

Natural Language Processing (NLP) is growing in stature and can be applied in a variety of situations that deal with text data. At a time when businesses are flooded with data, these techniques provide decisive insights that were not practical through manual means. LatentView Analytics, as an expert in harnessing unconventional sources of data, is working on cutting-edge NLP techniques to help global brands understand the pulse of the consumer. For more details, write in to:
