Eljas Linna
Product
April 15, 2020 • 3 min read
The correct company information is found even with typos in the name and a changed address.
Let’s say you want to add a new company, “ABC Holdings”, in your customer database. Which version would you type in?
You probably have a standard format you always use. But is it the same as what your colleagues use?
Now we get to the problem: How do we know for sure whether or not this company has already been contacted by our colleagues?
Unfortunately, the common solution is to implement strict and rigorous guidelines for adding new entries. Normally these systems consist of endless drop-down menus and dozens of fields that require you to fill in information you had no idea existed.
A more pleasant solution is to implement smart search that finds possible duplicates when a new entry is being registered. A simple keyword search helps a lot already, but for a more reliable and robust search you’ll need to consider the relationships between data points and their different variations. That’s when you need machine learning. I’ll show you how to get it done with Aito and it’ll take less than ten minutes, I promise.
I picked up the data for this experiment from the USA public catalogue. It contains the basic information of 3745 American companies in a tab-separated .txt format. After wrestling with the file for a bit, I turned it into a nice and smooth .csv which you can get here.
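If you want to do a similar conversion on your own file, here is a minimal sketch using only Python's standard library. It is not the exact clean-up used for this dataset; the function name `tsv_to_csv` and the sample rows are made up for illustration, and a real file may need extra handling for quoting or encoding.

```python
import csv
import io


def tsv_to_csv(src, dst):
    """Copy tab-separated rows from src into comma-separated rows in dst."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst)
    for row in reader:
        # Trim stray whitespace around each field and skip blank lines.
        cleaned = [field.strip() for field in row]
        if any(cleaned):
            writer.writerow(cleaned)


# Illustrative in-memory input; with real files, open both with newline="".
sample = "ID\tName\tState\n1904\tABRAHAM & CO., INC. \tWA\n\n"
out = io.StringIO()
tsv_to_csv(io.StringIO(sample), out)
print(out.getvalue())
```

Note that the csv module quotes the company name automatically, because it contains a comma.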
ID | Name | Zip_Code | Street | Building | City | State | Number
---|---|---|---|---|---|---|---
1904 | ABRAHAM & CO., INC. | 829452 | 3724 47TH STREET | - | GIG HARBOR | WA | 98335
2303 | ROSPERA FINANCIAL SERVICES, INC. | 828164 | 5429 LBJ FREEWAY | SUITE 400 | DALLAS | TX | 75240
2554 | AEI SECURITIES, INC. | 816750 | 1300 WELLS FARGO PLACE | 30 SEVENTH STREET | ST. PAUL | MN | 55101-4901
... | ... | ... | ... | ... | ... | ... | ...
Pretty normal looking stuff. Before we get to try out the smart search, we need to upload the dataset into your Aito instance to serve as the learning data. By far the easiest way to do this is to use the Quick Upload feature at the super secret instance management page. Sign up here to get invited and get your own instance.
You can also use Aito Python SDK and CLI or go straight for our REST API to upload your data.
The first thing you should do after uploading is to have a quick look at your data to check for any errors or shenanigans. You can run the following commands in any cURL-friendly terminal, but using a REST client like Insomnia is way more convenient. Remember to replace the API URL and keys with your own. Here's the first cURL with our public instance:
curl -X POST \
  https://public-1.aito.app/api/v1/_query \
  -H 'content-type: application/json' \
  -H 'x-api-key: bvss2i2dIkaWUfBCdzEO89LpxUkwO3A24hYg8MBq' \
  -d '
  {
    "from": "company_info",
    "limit": 1
  }'
And the response below looks all good:
{
"Building": "SUITE 210",
"City": "CHERRY HILL",
"ID": 9319,
"Name": "BCG SECURITIES, INC.",
"Number": "08002",
"State": "NJ",
"Street": "51 HADDONFIELD ROAD",
"Zip_Code": 812680
}
Now onto the fun part!
Aito offers the _similarity API endpoint specifically designed for identifying similar entries in the database.
Let’s use the above company information and give it a small twist. We’ll leave out some of the data points and remove the “, Inc.” from the company name.
curl -X POST \
  https://public-1.aito.app/api/v1/_similarity \
  -H 'content-type: application/json' \
  -H 'x-api-key: bvss2i2dIkaWUfBCdzEO89LpxUkwO3A24hYg8MBq' \
  -d '
  {
    "from": "company_info",
    "similarity": {
      "Building": "SUITE 210",
      "City": "CHERRY HILL",
      "Name": "BCG SECURITIES",
      "State": "NJ",
      "Street": "51 HADDONFIELD ROAD"
    },
    "limit": 1
  }'
Ta-da! Aito returns exactly the company we were looking for. As you can see in the response below, Aito also gives it a “$score”, which indicates the strength of the match. We’ll see the score drop much lower when the queries get more difficult. This one was pretty easy.
{
"$score": 1226332.030858179,
"Building": "SUITE 210",
"City": "CHERRY HILL",
"ID": 9319,
"Name": "BCG SECURITIES, INC.",
"Number": "08002",
"State": "NJ",
"Street": "51 HADDONFIELD ROAD",
"Zip_Code": 812680
}
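If you would rather send the same query from a script than from cURL, the sketch below builds the request with Python's standard library. The URL, headers and query body are the ones from the cURL command above; the helper names (`build_similarity_query`, `build_request`, `send`) are made up for this example and are not part of the Aito Python SDK. Dropping empty fields before the query is my own assumption about how you'd treat a half-filled form.

```python
import json
import urllib.request

# Public demo instance and key, copied from the cURL command above.
API_URL = "https://public-1.aito.app/api/v1/_similarity"
API_KEY = "bvss2i2dIkaWUfBCdzEO89LpxUkwO3A24hYg8MBq"


def build_similarity_query(table, known_fields, limit=1):
    """Build the JSON body, leaving out fields the user did not fill in."""
    fields = {k: v for k, v in known_fields.items() if v not in (None, "")}
    return {"from": table, "similarity": fields, "limit": limit}


def build_request(query, url=API_URL, api_key=API_KEY):
    """Wrap the query in a POST request with the same headers as the cURL call."""
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"content-type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


query = build_similarity_query(
    "company_info",
    {
        "Building": "SUITE 210",
        "City": "CHERRY HILL",
        "Name": "BCG SECURITIES",
        "State": "NJ",
        "Street": "51 HADDONFIELD ROAD",
        "Number": "",  # left blank in the form, so it is dropped
    },
)
request = build_request(query)
# result = send(request)  # uncomment to actually call the API
```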
Now we’ll make things much more complex. What if the company moved to a completely different location and there’s a typo in the name?
curl -X POST \
  https://public-1.aito.app/api/v1/_similarity \
  -H 'content-type: application/json' \
  -H 'x-api-key: bvss2i2dIkaWUfBCdzEO89LpxUkwO3A24hYg8MBq' \
  -d '
  {
    "from": "company_info",
    "similarity": {
      "Building": "A 30",
      "City": "NEW YORK",
      "Name": "BCG SECURTY",
      "State": "NY",
      "Street": "92 HELM STREET"
    },
    "limit": 1
  }'
Aito still finds the right company. This time the score is significantly lower, as expected, but it’s multiple times larger than the next closest match. You can see more suggestions and their scores in the response by changing the “limit”: 1 in the query to a higher number.
{
"$score": 15.109117592387861,
"Building": "SUITE 210",
"City": "CHERRY HILL",
"ID": 9319,
"Name": "BCG SECURITIES, INC.",
"Number": "08002",
"State": "NJ",
"Street": "51 HADDONFIELD ROAD",
"Zip_Code": 812680
}
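One way to turn these scores into a yes/no duplicate warning is to compare the best match against the runner-up, since the absolute score varies so much between easy and hard queries. The sketch below is only an illustration of that idea: the function name `likely_duplicate`, the ratio threshold of 3 and the second hit in the sample data are all made up, and you'd want to tune the threshold on your own data. It expects a list of hit objects shaped like the responses above, fetched with a "limit" of at least 2.

```python
def likely_duplicate(hits, min_ratio=3.0):
    """Return the best hit if it clearly outscores the runner-up, else None.

    hits: list of dicts that each carry a "$score" key.
    min_ratio: hypothetical threshold, how many times larger the top
    score must be than the second-best score.
    """
    if len(hits) < 2:
        # Nothing to compare against; query with a limit of 2 or more.
        return None
    ranked = sorted(hits, key=lambda hit: hit["$score"], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    if top["$score"] >= min_ratio * runner_up["$score"]:
        return top
    return None


hits = [
    # Top hit: score and fields taken from the response above.
    {"$score": 15.109117592387861, "ID": 9319, "Name": "BCG SECURITIES, INC."},
    # Runner-up: invented for illustration, not real Aito output.
    {"$score": 4.0, "ID": 0, "Name": "EXAMPLE SECURITIES, INC."},
]

match = likely_duplicate(hits)
if match is not None:
    print("Possible duplicate of:", match["Name"])
```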
There are a lot more scenarios we could try and see how Aito responds. I encourage you to try it yourself. Copy any of the above queries to a REST client, change the values and see what happens to the score.
What you probably really care about is how this would work with your own data. There’s only one way to find out. Request access to Aito and you’ll swiftly get your very own instance to test with. And it’s completely free.
And by the way, I made a simple UiPath demo for you to play around with. You'll need to enable UiPath Web Activities in the Manage Packages console. Have fun!