Company Identification and Linkage at Credit2B

The quality of a data-focused product is defined by the quality of the underlying data. Maintaining high data quality is of utmost importance to us, as we acquire tens of millions of trade lines of customer receivables and huge amounts of corporate data on a daily basis. Because the data comes from a multitude of independent sources, it arrives “unclean”: the same company may be named differently by different sources, addresses may contain numerous variations or be misspelled or improperly formatted, company branches may appear to be independent entities, and multiple companies at the same address may appear to be part of the same entity. We use cutting-edge Artificial Intelligence and Natural Language Processing techniques to make sure the underlying data that powers our Credit2B reports is of the highest possible quality. In particular, we go to great lengths to identify companies correctly (i.e., we start with a high-quality name and address, then link related companies into “families” and split companies that are erroneously grouped together) so that the data can be associated with the right entities. We use several open source or proprietary credit-bureau-type databases to create a clean foundation (e.g., International Business Machines is the true legal name, not “IBM”, and we ensure its headquarters are accurately established using authoritative methods and sources for validation).
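To make the naming problem concrete, here is a minimal sketch of one common cleaning step: reducing differently written company names to a comparable key. It is an illustration only, not Credit2B's actual procedure; the function `normalize_name` and the suffix list `LEGAL_SUFFIXES` are hypothetical, and resolving an abbreviation such as “IBM” to its legal name would still require a reference database.

```python
import re

# Hypothetical, deliberately short list of legal-form suffixes (illustration only).
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation",
    "co", "company", "llc", "ltd", "limited",
}


def normalize_name(name: str) -> str:
    """Return a simplified key for comparing company names.

    Lowercases the name, replaces punctuation with spaces, and drops
    trailing legal-form suffixes. At least one token is always kept.
    """
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Three spellings that different sources might use for the same company.
variants = ["Acme, Inc.", "ACME Corporation", "Acme Co., Ltd."]
keys = {normalize_name(v) for v in variants}
```

In this sketch all three variants collapse to the same key, so they become candidates for linkage. A key match alone does not prove two records are the same entity; address and other features would still need to agree.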
 

 

Visualizing just 20 milliseconds of Credit2B’s Company Linkage algorithm.
Every node is a company and every edge represents a link between the two companies.

 

As an example, the image above animates the actual processing involved in identifying proper entities from a set of 28 records and separating them into 20 distinct groups. In the graph shown in the animation, each node represents a company and each edge corresponds to a similarity link. The process starts by assuming that all companies are connected, then breaks the connections between companies whose features identify them as separate entities according to criteria derived from various models. Random spot-checking of large numbers of company linkages and de-linkages by human experts shows 100% accuracy of our procedures.
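The break-up process described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production algorithm: the record layout, the function `link_groups`, and the rule `names_differ` are hypothetical stand-ins for the model-derived criteria. The sketch starts from a fully connected graph, removes every edge whose two records are judged to be separate entities, and reads the resulting groups off as connected components.

```python
from itertools import combinations


def link_groups(records, is_distinct):
    """Group record indices by starting fully connected and cutting edges.

    `is_distinct(a, b)` is a caller-supplied (hypothetical) rule that returns
    True when two records should NOT be linked.
    """
    n = len(records)
    # Step 1: assume every pair of records is linked.
    edges = set(combinations(range(n), 2))
    # Step 2: break links between records identified as separate entities.
    edges = {(i, j) for i, j in edges if not is_distinct(records[i], records[j])}

    # Step 3: the remaining connected components are the groups (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def names_differ(a, b):
    # Toy stand-in for the real criteria: different names mean different entities.
    return a["name"] != b["name"]


records = [
    {"name": "acme", "address": "1 Main St"},
    {"name": "acme", "address": "1 Main St."},
    {"name": "globex", "address": "1 Main St"},
    {"name": "acme", "address": "2 Oak Ave"},
]
groups = link_groups(records, names_differ)
```

With this toy rule, records 0, 1, and 3 stay linked and record 2 is split off even though it shares an address with record 0. Comparing every pair is quadratic in the number of records, which is fine for a small batch like the 28 records in the animation but would need a blocking step at larger scale.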
 
We are automating what would take a productive human being hours, and possibly days, of work, and we will continue to deploy AI techniques that give the machine the opportunity to learn and do.
 
By Chintan Trivedi and Irina Rabinovich