May 28, 2020 | Sandeep L. Joice | Business
One of our recent RPA prospects had a problem: duplicate invoices were being generated because of near-duplicate names. Near-duplicate names are the same name with spelling mistakes, for example Sandeep Joice and Sandeep Joyce. The challenge was to determine whether two such names belong to the same person or to different people.
This identification is a big challenge. Firstly, it is a human problem: unless we have some other form of identification, we cannot identify a person by name alone. Secondly, there was no digital data to verify whether the two names referred to the same person or to different people.
To overcome this, we looked for an algorithm that could help us identify such names. There is an algorithm called Levenshtein's Distance Algorithm: "The Levenshtein distance is a string metric for measuring difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965".
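To make the definition concrete, here is a minimal Python sketch of the standard dynamic-programming way to compute the Levenshtein distance. It is an illustration only, not the code from our project, and the function name `levenshtein` is our own choice for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of `a` to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Sandeep Joice", "Sandeep Joyce"))  # prints 1
```

The two names from the example differ by a single substitution ("i" to "y"), so the distance is 1.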
We studied the algorithm and how it works, then implemented it in code. Our code takes two strings (first name and last name) as inputs and returns the edit distance between them. The real problem was how to map this numeric distance to near-duplicate strings, because the distance climbs quickly when the two strings are genuinely different. So we decided to reduce the chance of wrongly flagging strings as different by applying basic probability and a few string operations before running Levenshtein's algorithm. This helped a lot, but we were still far from the accuracy we needed. That is when we had our light bulb moment. Why not introduce our own intelligence here? We are in an era of Artificial Intelligence and Machine Learning, so why not dive deeper into it? This thought led us to brainstorm and arrive at a simple idea: build our own master data of duplicate names and use it as the intelligence. Voila! We had a solution!
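The overall idea can be sketched roughly as follows in Python. This is a hedged illustration under our own assumptions, not the production solution: the `KNOWN_VARIANTS` table stands in for the master data, `normalize` stands in for the string operations, and the length-relative threshold `max_ratio=0.2` is a hypothetical way of mapping the raw distance to a yes/no answer.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


# Hypothetical master data: known misspelling -> canonical spelling.
# In practice this table would grow as duplicates are confirmed.
KNOWN_VARIANTS = {
    "sandeep joyce": "sandeep joice",
}


def is_near_duplicate(name_a: str, name_b: str, max_ratio: float = 0.2) -> bool:
    """Return True if the two names are likely the same name."""
    a = normalize(name_a)
    b = normalize(name_b)

    # Step 1: consult the master data first.
    a = KNOWN_VARIANTS.get(a, a)
    b = KNOWN_VARIANTS.get(b, b)
    if a == b:
        return True

    # Step 2: fall back to edit distance, scaled by the longer name so
    # that long names are not penalised more than short ones.
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest <= max_ratio


print(is_near_duplicate("Sandeep Joice", "  sandeep   JOYCE "))
print(is_near_duplicate("Sandeep Joice", "Rahul Sharma"))
```

Checking the master data before computing any distance means confirmed duplicates are matched exactly, and the threshold only has to handle names that have not been seen before.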
In the end, we built a simple yet powerful solution for this problem.
So, the moral of the story is that not every problem has a quick solution; we just need to build one around it!