By: Philip Howard, Research Director - Data Management, Bloor Research
Published: 22nd March 2013
Copyright Bloor Research © 2013
An interesting discussion arose during the TDWI conference in London this week. The question was posed: could you use a graph database to do matching and de-duplication?
The answer must be yes. If Bill Clinton and William Clinton (this was the example posed during the session) have the same relationships they must surely be the same person, though given the nature of some of the ex-president's relationships it would perhaps be better to refer to following the edges of the graph rather than the relationships they represent. In fact, if you are using a graph database to look at terrorist or criminal networks this is precisely one of the things you would be doing as you want to understand which aliases equate to which real individuals.
First of all, I should say that I am not aware of any graph vendor packaging up special facilities to support matching and de-duplication, though I imagine that there are things they could do to make this process easier. However, the concept is quite cool. It would mean that you don't need to license such capabilities from the likes of Trillium or Informatica. Of course, there are other data cleansing requirements beyond matching, but matching does tend to be the bedrock for all such environments. So could a graph database be a real competitor?
Of course the big advantage is that there is no additional license fee.
What I don't really know is how performance would compare. Vendors in the data quality field are apt to claim that their matching engine can outperform anybody else's: something that is inherently impossible to prove one way or the other, because you can't compare match accuracy across platforms.
Nevertheless, my guess is that a graph database could seriously outperform a conventional matching engine. That's because graph databases have been explicitly designed to explore relationships, and that's precisely what you do when matching: you have two similar but non-identical names and they each have relationships with an address, a mother's name, a phone number, an email address and so on. Instead of searching through table after table, you just follow the edges of the graph.
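To make the idea concrete, here is a minimal sketch in plain Python rather than in any particular graph database. Everything in it is an assumption for illustration: the edge list, the attribute values, the function names and the 0.5 threshold are all invented, and a real graph engine would do the traversal for you. It simply treats each record as a node, follows its edges to attribute nodes, and scores pairs of records by how many neighbours they share.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edge list: (record, relationship type, attribute value).
# All attribute values are made up for illustration.
EDGES = [
    ("Bill Clinton", "address", "1 Example Street"),
    ("Bill Clinton", "phone", "555-0100"),
    ("Bill Clinton", "email", "wjc@example.com"),
    ("William Clinton", "address", "1 Example Street"),
    ("William Clinton", "phone", "555-0100"),
    ("William Clinton", "email", "wjc@example.com"),
    ("Bill Clifton", "address", "9 Sample Road"),
    ("Bill Clifton", "phone", "555-0199"),
]


def build_graph(edges):
    """Map each record to the set of attribute nodes it has an edge to."""
    graph = defaultdict(set)
    for record, relationship, value in edges:
        graph[record].add((relationship, value))
    return graph


def candidate_duplicates(edges, threshold=0.5):
    """Return (record_a, record_b, score) for records whose neighbours overlap.

    The score is the Jaccard overlap of the two neighbour sets. Only records
    that share at least one attribute node are compared, which mimics
    following edges instead of scanning every pair of records.
    """
    graph = build_graph(edges)

    # Reverse index: attribute node -> records that point at it.
    by_attribute = defaultdict(set)
    for record, neighbours in graph.items():
        for neighbour in neighbours:
            by_attribute[neighbour].add(record)

    # Any two records reachable from the same attribute node are candidates.
    pairs = set()
    for records in by_attribute.values():
        pairs.update(combinations(sorted(records), 2))

    results = []
    for a, b in sorted(pairs):
        shared = graph[a] & graph[b]
        combined = graph[a] | graph[b]
        score = len(shared) / len(combined)
        if score >= threshold:
            results.append((a, b, score))
    return results


if __name__ == "__main__":
    for a, b, score in candidate_duplicates(EDGES):
        print(f"{a} ~ {b}: {score:.2f}")
```

With this made-up data, the only pair reported is Bill Clinton and William Clinton, because every one of their edges leads to the same attribute node. Whether a threshold of 0.5 is sensible would depend entirely on the data in question.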
Will we see this become a practical reality? To be honest I don't know. The person who raised this issue in the first place said that his company couldn't use conventional data quality tools for matching (he didn't explain why) so maybe there is a problem out there that can be solved by using graphs. Certainly if you are going to be using a graph database anyway then it may make sense to look at it for this purpose also.
Posted: 26th March 2013 | By Peter Neubauer :
Yes, I think de-duplication of data via pattern matching and finding related data should be very doable in a graph database. See, for example, Neo4j's Cypher pattern matching language: http://docs.neo4j.org/chunked/snapshot/cypher-introduction.html
Posted: 27th March 2013 | By Manish Sood :
Matching in a graph datastore is definitely not the same as matching in a relational datastore. However, how you solve this problem depends on how well you understand the matching rules and constructs that need to be applied to the underlying data.
Graph pattern matching is one solution, but you also need to think about some of the conventional techniques. Noise word removal for various entities, synonyms, soundex, various distance algorithms, fuzzy matching and support for multiple rules are some of the variants.
At Reltio, we are creating applications for various verticals such as Life Sciences, Financial Services and Retail, where all the data (dimensional, relationships and transactional) is stored in a petabyte scale Reltio Graph datastore. Not only are we able to match and resolve various duplicates based on graph patterns but we also leverage common fuzzy algorithms to find matches.
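As a rough illustration of the conventional techniques listed in this comment (noise word removal, synonyms, soundex, a distance measure and more than one rule), here is a small Python sketch using only the standard library. It is not Reltio's method or any vendor's: the noise word list, the nickname table, the function names and the 0.85 threshold are all assumptions made up for the example, and `difflib.SequenceMatcher` is used as a stand-in for whichever distance algorithm you prefer.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables; real ones would be far larger.
NOISE_WORDS = {"mr", "mrs", "ms", "dr", "jr", "sr"}
SYNONYMS = {"bill": "william", "bob": "robert", "liz": "elizabeth"}

# Letter groups used by the classic American Soundex code.
_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code for a word ('' if no letters)."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    previous = _SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the previous code
    return (result + "000")[:4]


def normalise(name):
    """Lower-case, drop noise words and replace nicknames with full names."""
    tokens = re.findall(r"[a-z]+", name.lower())
    tokens = [t for t in tokens if t not in NOISE_WORDS]
    return [SYNONYMS.get(t, t) for t in tokens]


def similarity(a, b):
    """Similarity ratio (0.0 to 1.0) of the two normalised names."""
    return SequenceMatcher(None, " ".join(normalise(a)),
                           " ".join(normalise(b))).ratio()


def names_match(a, b, threshold=0.85):
    """Apply two rules in turn: exact match after normalisation, then fuzzy."""
    tokens_a, tokens_b = normalise(a), normalise(b)
    if not tokens_a or not tokens_b:
        return False
    if tokens_a == tokens_b:  # rule 1: identical once cleaned up
        return True
    # Rule 2: last tokens sound alike and the whole names are close.
    same_sound = soundex(tokens_a[-1]) == soundex(tokens_b[-1])
    return same_sound and similarity(a, b) >= threshold


if __name__ == "__main__":
    print(names_match("Mr. Bill Clinton", "William Clinton"))
    print(names_match("William Clinton", "Bill Clifton"))
```

A check like this could be run on name nodes before, or alongside, a comparison of graph patterns, so that neither the names nor the relationships have to carry the whole match decision on their own.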
Posted: 29th March 2013 | By Philip Rathle :
Philip, I agree with you that this is a very good use of graph database technology. I'd like to add that one major data quality vendor has already adopted graph databases for de-duplication and other related data quality concerns, as an embedded part of its own software solution.
See the link below for some more details:
Published by: IT Analysis Communications Ltd.