May 28, 2020 By: Sandeep L. Joice
One of our recent RPA prospects had a problem of duplicate invoices being generated because of near-duplicate names. Near-duplicate names are the same names with spelling mistakes, for example: Sandeep Joice & Sandeep Joyce. The problem was identifying whether these two are the same person or different people.
This identification is a big challenge. Firstly, it is a human one: unless we have some form of identification, we cannot tell just from the name. Secondly, there is often no digital data to verify whether the person behind the name is the same or a different one.
To overcome this, we set out to find an algorithm that could help us with the identification. There is an algorithm called the Levenshtein Distance Algorithm – “The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965“.
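To make the definition concrete, here is a minimal sketch of the classic dynamic-programming computation of the distance in Python. It is only an illustration of the metric, not the code we shipped:

```python
def levenshtein(a, b):
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b; we roll the table one row at a time.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            substitute = previous[j - 1] + (ca != cb)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

print(levenshtein("Joice", "Joyce"))  # 1 -> a single substitution
```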
We learnt & understood this algorithm and how it works. With that knowledge, we translated the algorithm into code. The code worked perfectly: it takes two strings (First Name & Last Name) as inputs and gives the edit distance between them. The real problem was how to map this numeric distance to near-duplicate strings, because the distance grows much larger when the two strings are genuinely different. So, we decided to reduce the chance of wrongly flagging strings as different by using basic probability & a few string operations before falling back to Levenshtein’s algorithm. This helped us a lot, but we were still far from the accuracy rate we wanted, which is when we had the light-bulb moment. Why don’t we introduce our own intelligence here? We are in an era of Artificial Intelligence and Machine Learning, so why not dive deeper into it? This thought made us brainstorm and come up with a simple idea: build our own master data for duplicate names and use it as intelligence. Voila! We had a solution!
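One common way to map the raw distance to a near-duplicate decision is to normalise it by the length of the longer string and compare the resulting similarity against a threshold. The sketch below (reusing the `levenshtein` function above) is an illustrative assumption of that idea, not our exact implementation, and the 0.8 threshold is arbitrary:

```python
def is_near_duplicate(a, b, threshold=0.8):
    # Turn the edit distance into a 0..1 similarity score so the decision
    # no longer depends on how long the names happen to be.
    longest = max(len(a), len(b)) or 1
    similarity = 1 - levenshtein(a, b) / longest
    return similarity >= threshold

print(is_near_duplicate("Sandeep Joice", "Sandeep Joyce"))  # True  (similarity ~0.92)
print(is_near_duplicate("Sandeep Joice", "Rahul Sharma"))   # False
```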
In the end, we built a simple yet powerful solution for this problem, as outlined below (a rough code sketch follows the list):
- First, the bot checks for duplicate names in the Master Data file, which is built by the bot itself over time. If the wrong name is present there, it picks up the corresponding correct name.
- If it fails to find the name in the Master Data, the second step is to check for duplicate names based on string operations.
- If it fails with step 2, step 3 is to check duplicate names based on probability.
- If it fails with step 3, step 4 is to check duplicate names based on Levenshtein’s Distance Algorithm.
- If all the above fail, we enter both the strings (names) in the Master Data file, so that they can be reviewed by a human being later for identification & correction of the right name. This helps the bot the next time it checks for a duplicate name in the Master Data.
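Put together, the flow looks roughly like the sketch below, reusing the `is_near_duplicate` helper from earlier. The master-data lookup, string operations and probability check are simplified, hypothetical stand-ins for what the bot actually does:

```python
def resolve_name(name, known_names, master_data):
    """Return the corrected name if `name` is a recognised near-duplicate, else None.

    `master_data` maps previously reviewed wrong spellings to their correct names.
    """
    # Step 1: the Master Data file built by the bot over time.
    if master_data.get(name):
        return master_data[name]

    # Step 2: basic string operations (case-folding, stripping spaces).
    normalised = name.casefold().replace(" ", "")
    for candidate in known_names:
        if normalised == candidate.casefold().replace(" ", ""):
            return candidate

    # Step 3: a probability-style check on shared characters (purely illustrative).
    for candidate in known_names:
        overlap = len(set(normalised) & set(candidate.casefold()))
        if overlap / max(len(set(candidate.casefold())), 1) >= 0.9:
            return candidate

    # Step 4: the Levenshtein-based similarity check.
    for candidate in known_names:
        if is_near_duplicate(name, candidate):
            return candidate

    # Step 5: no match found; record the name for human review later.
    master_data.setdefault(name, "")  # empty value = awaiting correction
    return None
```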
So, the moral of the story is: not every problem has a quick solution; we just need to build one around it!