The digital revolution has brought an exponential rise in the number of digital devices we use every day. These devices in turn generate a huge amount of data that is invaluable for understanding today’s digitally savvy, born-in-the-cloud consumer.
While companies capture huge amounts of data from various sources across multiple touchpoints, one of the biggest storage challenges is this: how can companies store data effectively without keeping duplicates in different locations on the same servers, hard drives, tape libraries and so on? Duplication is one of the most ignored facets of data quality, and ignoring it can prove costly, because these duplicates are almost impossible for humans or conventional programs to identify and process.
In this blog, we will talk about data duplication at the record level in databases and how it negatively affects the business. If you work in database management or data warehousing, or are responsible for data backup or for migrating large volumes of data between on-premise servers and the cloud, you will be familiar with data duplication. Experts say that for many companies without data quality procedures in place, duplication rates are in the range of 10-30%.
Data duplication: the problem of plenty
For a simple perspective on this issue, consider a customer record that shows up multiple times in the database. That is, there are multiple records which identify a single real entity. Those two or more records might be an exact match or a partial match. They can arise because data is collected from varied sources such as contact forms, survey forms, user databases, newsletter contacts and sales leads, or because a customer intentionally creates another account with slightly different values. Duplication can also be caused by typos, differences in writing style, or differences in data types and schema representation. This might not look like a big challenge, but when the data is used for analysis or reporting in marketing, promotion or auditing, duplication affects the business negatively.
Firstly, it becomes very difficult to know how many unique customers really exist and to attribute actions to the right person. This results in inefficient analysis and unclear reporting. Secondly, it wastes marketing budget, because more money is spent on the same customer for ad targeting, emailing and so on. Thirdly, it can lower response rates and create a negative impression of how the business is performing. Finally, it threatens brand reputation and degrades the customer experience through repeated calls, emails and ads. Simply put, it is a loss of time, money and productivity.
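To make the exact versus partial distinction concrete, here is a minimal sketch. The records, the field names and the `normalize` helper are illustrative assumptions, not part of any particular system. It shows how differences in case, spacing and punctuation can hide a duplicate that becomes an exact match after basic normalization, while a typo-level variant still slips through.

```python
import re
import unicodedata


def normalize(value: str) -> str:
    """Lower-case, strip accents, drop punctuation and collapse whitespace."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def record_key(record: dict) -> tuple:
    """Hypothetical match key: normalized name plus trimmed, lower-cased email."""
    return (normalize(record["name"]), record["email"].strip().lower())


# Made-up customer records, as they might arrive from different forms.
records = [
    {"id": 1, "name": "John Smith", "email": "John.Smith@Example.com"},
    {"id": 2, "name": "john  smith.", "email": "john.smith@example.com "},
    {"id": 3, "name": "Jon Smith", "email": "jsmith@example.com"},
]

seen = {}              # match key -> id of the first record with that key
exact_duplicates = []  # (first id, duplicate id) pairs
for record in records:
    key = record_key(record)
    if key in seen:
        exact_duplicates.append((seen[key], record["id"]))
    else:
        seen[key] = record["id"]
```

Records 1 and 2 collapse to the same key, so they are caught here. Record 3 may well be the same person, but it differs by a typo and a different email, which is the partial-match case that needs fuzzier techniques.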
Leveraging AI to avoid duplication
So how can this problem of multiple records, across multiple channels, for a single entity be solved? Can a team of people solve it? Or can a conventional program built on a set of rules and logic solve it? The obvious answer is to deduplicate the data, a process often called data deduplication or record deduplication. This is where AI can make a significant difference in saving us time, effort and money. Of course, even before we introduce AI, human intervention is required to select, process and analyse the data points that can potentially identify the duplicates.
“Deduplication simply refers to finding the potential duplicates, exact or partial, and separating them from the unique ones. These are often copied, moved and referenced back to the original master record. Also, an extra manual verification is performed to check if there are any anomalies that are obvious to the human eye. Duplicates can also be deleted, retaining the original unique record and saving the extra space, time and effort required.”
There is no universal technique for deduplication. Techniques and methods have to be designed and implemented for the specific problem at hand. Data can be analysed at byte level, field by field, word by word and so on. The right choice depends on how strict or lenient you want the process to be, and on the type of data, platforms and sources you are handling. Keep in mind that being too strict can result in loss of data, and losing any unique data defeats the purpose. The process can also be goal oriented, and the solution will vary accordingly. You also need to decide how duplicates are handled once they are found. You might want to retain the latest one, or the first created one, or the one with no null values, or the one with the most actions/orders/purchases, or the one with the lowest interaction.
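The retention choices listed above can be written as interchangeable rules. The sketch below is only an illustration under assumptions: the field names (`created`, `phone`, `orders`) and the rule names are hypothetical, and a real pipeline would likely merge fields rather than simply pick one record.

```python
from datetime import date

# A made-up group of records already identified as the same customer.
duplicates = [
    {"id": 11, "created": date(2021, 3, 1), "phone": None, "orders": 4},
    {"id": 12, "created": date(2022, 7, 9), "phone": "555-0100", "orders": 1},
    {"id": 13, "created": date(2020, 1, 15), "phone": "555-0100", "orders": 9},
]


def null_count(record: dict) -> int:
    """Number of fields in the record that hold no value."""
    return sum(value is None for value in record.values())


# Each rule takes a group of duplicates and returns the record to keep.
RULES = {
    "latest": lambda group: max(group, key=lambda r: r["created"]),
    "first_created": lambda group: min(group, key=lambda r: r["created"]),
    "fewest_nulls": lambda group: min(group, key=null_count),
    "most_orders": lambda group: max(group, key=lambda r: r["orders"]),
}


def pick_survivor(group: list, rule: str) -> dict:
    """Return the record that the named rule would retain."""
    return RULES[rule](group)


survivor = pick_survivor(duplicates, "most_orders")  # keeps record 13
```

Keeping the rules in a table makes it easy to switch policy per business goal without touching the matching logic.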
Choosing a solution
Two of the most significant among the myriad techniques for data deduplication are:
1. Distance-based deduplication:
When the data grows to billions of records, labelling it manually is very difficult, in fact almost impossible, and without labels it is hard to train an AI model to perform this task. So, how would one decide whether records are duplicates without training data? A very simple solution is to compare two records for similar text, numbers or phonetics and calculate a distance between them using methods like Levenshtein Distance, Hamming Distance, Trigram Comparison, Gap Penalty etc. If the distance falls below a threshold value (or, equivalently, a similarity score exceeds it), the two records are said to be similar.
When we say the distance is calculated between two records, is it computed over the entire record, field by field, word by word or character by character? If a field is important enough that it should drive the similarity of records, then field by field evaluation is used in the calculation of distance. That settles the calculation strategy, but how will you decide on the threshold value without first evaluating the records? You’ll need to label, train and test the data for this, which defeats the purpose of this method. To overcome this issue, let’s look at something better.
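As an illustration of the measures named above, here is a small pure-Python sketch of Levenshtein distance, Hamming distance and a trigram comparison. The `max_distance=2` default and the space padding used for trigrams are assumptions chosen for the example, not recommended settings. Note the direction of the comparison: a smaller distance means the values are closer.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams: 1.0 identical, 0.0 disjoint."""
    def trigrams(text: str) -> set:
        padded = f"  {text.lower()} "  # padding is an assumption of this sketch
        return {padded[i:i + 3] for i in range(len(padded) - 2)}

    first, second = trigrams(a), trigrams(b)
    return len(first & second) / len(first | second)


def is_probable_duplicate(a: str, b: str, max_distance: int = 2) -> bool:
    """Flag two values as similar when their edit distance is small enough."""
    return levenshtein(a.lower(), b.lower()) <= max_distance


print(levenshtein("kitten", "sitting"))                  # 3
print(hamming("karolin", "kathrin"))                     # 3
print(is_probable_duplicate("Jon Smith", "John Smith"))  # True
```

Trigram comparison returns a similarity rather than a distance, so its threshold works the other way round: higher scores mean closer values.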
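A field by field evaluation can be sketched as a weighted average of per-field similarities. The example below uses `difflib.SequenceMatcher` from the Python standard library as the per-field measure. The field names, the sample records and the weights are assumptions made up for illustration, and any of the distance measures mentioned earlier could be swapped in.

```python
from difflib import SequenceMatcher

# Hypothetical weights: fields judged more important count for more.
FIELD_WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values, ignoring case and edge spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(first: dict, second: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total_weight = sum(weights.values())
    weighted = sum(
        weight * field_similarity(first[field], second[field])
        for field, weight in weights.items()
    )
    return weighted / total_weight


a = {"name": "John Smith", "email": "john.smith@example.com", "city": "Chennai"}
b = {"name": "Jon Smith", "email": "john.smith@example.com", "city": "chennai "}
c = {"name": "Priya Raman", "email": "priya.r@example.org", "city": "Mumbai"}

likely_same = record_similarity(a, b)       # close to 1
likely_different = record_similarity(a, c)  # noticeably lower
```

The scores separate the two cases, but the cut-off between them still has to be chosen, which is exactly the threshold problem described above.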
2. Active-learning method:
Active learning uses an active learner that starts with very limited labelled data and a very large pool of unlabelled data. Records can be selected randomly, and an initial set of n records is labelled by the user. Usually the labelling is binary, i.e., the user is shown two records and asked whether they are duplicates or not. When the active learner first starts classifying, it tends to be sure about a few predicted results but uncertain about the remaining huge pool of records. Human intervention is required to clear that confusion and provide information on the uncertain data. With each labelled record, the learner learns more about the data and starts classifying again.
Of course, each record can’t be compared with every other record to find a duplicate. Just imagine the magnitude of computations required: the number of pairwise comparisons grows quadratically with the number of records in the database, since n records yield n(n-1)/2 pairs. The point to understand here is that comparing every pair is unnecessary, because most records are not duplicates of each other. Techniques like clustering, binning and blocking group together only the records that are likely to be similar, and only records within those groups are compared.
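The loop described above can be sketched with a deliberately tiny learner. This is a simplified illustration under stated assumptions, not a production method: each candidate pair is reduced to a single similarity score, the learner is just a threshold on that score, the starting labels must include at least one duplicate and one non-duplicate, and the `ask_user` callback stands in for the human who answers "duplicate or not". The learner always asks about the pair it is least sure of, which is the pair closest to its current threshold.

```python
def fit_threshold(labelled: list) -> float:
    """Midpoint between the lowest-scoring duplicate and highest-scoring non-duplicate."""
    duplicate_scores = [score for score, is_dup in labelled if is_dup]
    distinct_scores = [score for score, is_dup in labelled if not is_dup]
    return (min(duplicate_scores) + max(distinct_scores)) / 2


def active_learn(scores, ask_user, seed_indices, budget):
    """Label the seed pairs, then query the most uncertain pair until the budget runs out."""
    labelled = [(scores[i], ask_user(i)) for i in seed_indices]
    pool = [i for i in range(len(scores)) if i not in seed_indices]
    for _ in range(min(budget, len(pool))):
        threshold = fit_threshold(labelled)
        # Uncertainty sampling: the pair nearest the boundary is the least certain.
        query = min(pool, key=lambda i: abs(scores[i] - threshold))
        pool.remove(query)
        labelled.append((scores[query], ask_user(query)))
    return fit_threshold(labelled), len(labelled)


# Made-up similarity scores for eight candidate pairs, with their true answers.
pair_scores = [0.05, 0.20, 0.35, 0.48, 0.55, 0.62, 0.80, 0.95]
true_labels = [False, False, False, False, False, True, True, True]

threshold, labels_used = active_learn(
    pair_scores,
    ask_user=lambda i: true_labels[i],  # simulated human answer
    seed_indices=[0, 7],
    budget=4,
)
predictions = [score >= threshold for score in pair_scores]
```

In this toy run the learner classifies all eight pairs correctly after six labels, because its questions concentrate around the boundary instead of being spread over pairs it is already sure about.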
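Blocking can be sketched in a few lines. The blocking key used here (first letter of the surname plus postcode) and the sample records are assumptions for illustration. A real key should be chosen so that true duplicates almost always land in the same block.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> tuple:
    """Hypothetical key: first letter of the surname plus the postcode."""
    return (record["surname"][:1].lower(), record["postcode"])


def candidate_pairs(records: list):
    """Yield only the id pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


records = [
    {"id": 1, "surname": "Smith", "postcode": "10001"},
    {"id": 2, "surname": "Smyth", "postcode": "10001"},
    {"id": 3, "surname": "Smith", "postcode": "10001"},
    {"id": 4, "surname": "Jones", "postcode": "94105"},
    {"id": 5, "surname": "Jonas", "postcode": "94105"},
    {"id": 6, "surname": "Patel", "postcode": "60601"},
]

all_pairs = len(records) * (len(records) - 1) // 2  # 15 comparisons without blocking
blocked_pairs = list(candidate_pairs(records))      # 4 comparisons with blocking
```

The trade-off is recall: two true duplicates with different keys, for example a mistyped postcode, are never compared, so some pipelines run several blocking passes with different keys.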
While we have touched upon two techniques, there are others, such as file-level, block-level and byte-level data deduplication, to name a few. Many of these techniques can be used on their own, or several can be combined to build an advanced solution.
Deduplication, like other storage technologies, can create substantial impact and pay great dividends if used in the right situation. However, it is important to remember that not all deduplication techniques are created equal, and some will perform better than others depending on the situation and challenge. Therefore, it is important to ask questions and to run multiple tests on your own data.
Remember, the solution is not always out-of-the-box. Often, a mix or sequence of techniques, designed and implemented for the purpose, is what solves the specific problem in question. Once the database is clean, it is important to maintain that quality, which requires strict data governance. You might also want to integrate the deduplication systems into your process, based on how frequently the data changes and how much of it there is.
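For contrast with record-level matching, block-level deduplication can be sketched as follows. This is a simplified illustration with assumed choices: fixed-size blocks, a 4-byte block size picked only to keep the example readable, and SHA-256 as the fingerprint. Each unique block is stored once, and the original data is described by a list of fingerprints.

```python
import hashlib


def dedup_blocks(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and store each unique block once."""
    store = {}   # fingerprint -> block contents
    recipe = []  # ordered fingerprints needed to rebuild the data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)
        recipe.append(fingerprint)
    return store, recipe


def restore(store: dict, recipe: list) -> bytes:
    """Rebuild the original data from the stored blocks."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


data = b"ABCDABCDABCDWXYZ"
store, recipe = dedup_blocks(data)
# Four blocks are referenced, but only two distinct blocks are stored.
```

File-level deduplication applies the same idea with one fingerprint per whole file.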
Accelerate your Data Engineering capabilities
At LatentView Analytics, we follow a business-focused approach to data engineering to align analytics and technology. Our workload-centric architectures are designed to meet the different needs of business stakeholders. To help unleash all levels of data analytics capabilities and turn them into a competitive advantage for your business, please get in touch with us at: firstname.lastname@example.org