In most people’s minds data profiling is inextricably linked with data quality, data cleansing, data migration and data governance. However, in this article I want to point out that this is far from the whole picture: there are several additional scenarios in which data profiling may be used that are distinct, at least initially, from any considerations of data quality.
The first of these scenarios is in support of data privacy, protection and masking. If you have sensitive data (and that includes foul and blasphemous language as well as personally identifiable information) then you need some sort of mechanism for finding where this data is before you can mask or otherwise deal with it. Data profiling can do this for you by looking for patterns that, for example, match credit card numbers. However, this will typically mean that you need to be able to profile text fields, which not all data profiling tools can do, or do well.
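To make the credit card example concrete, here is a minimal sketch in Python of how a free-text field might be scanned. It is an illustration, not how any particular profiling tool works. The function names and the 13 to 19 digit pattern are assumptions made for this example. The Luhn checksum is the standard check used to weed out digit runs that merely look like card numbers.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Return digit strings in free text that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits


# Illustrative use on a hypothetical comments field
# (4111 1111 1111 1111 is a widely used test number).
print(find_card_numbers("Customer paid with 4111 1111 1111 1111 yesterday"))
```

A pattern match on its own produces many false positives in text fields (order references, phone numbers and so on), which is why the sketch pairs it with a checksum.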
Secondly, data profiling is needed to support test data management where you are generating synthetic test data. This data needs to match the structure of the production data. Data modelling can get you part of the way there, but it can only work by reverse engineering your database schema. Unfortunately, there are often lots of relationships in a database that are not explicitly defined in the schema and which therefore need to be inferred, and that is what data profiling can do for you.
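One common way of inferring undeclared relationships is to test whether the values in one column are contained in a unique column of another table. The Python sketch below shows the idea on in-memory data. The function name, the input layout and the min_coverage threshold are all assumptions for illustration. A real tool would sample or push the comparison down to the database rather than load whole columns.

```python
from itertools import permutations


def infer_relationships(tables, min_coverage=1.0):
    """Suggest (child, parent, coverage) column pairs that behave like foreign keys.

    tables is assumed to be {table_name: {column_name: list_of_values}}.
    """
    # Distinct non-null values per (table, column).
    columns = {
        (t, c): {v for v in values if v is not None}
        for t, cols in tables.items()
        for c, values in cols.items()
    }
    candidates = []
    for (child, child_vals), (parent, parent_vals) in permutations(columns.items(), 2):
        if child[0] == parent[0] or not child_vals:
            continue  # skip same-table pairs and empty columns
        parent_raw = [v for v in tables[parent[0]][parent[1]] if v is not None]
        if len(parent_vals) != len(parent_raw):
            continue  # the parent side must be unique to act as a key
        coverage = len(child_vals & parent_vals) / len(child_vals)
        if coverage >= min_coverage:
            candidates.append((child, parent, coverage))
    return candidates


# Hypothetical tables with no declared foreign key between them.
tables = {
    "customers": {"id": [1, 2, 3], "name": ["a", "b", "c"]},
    "orders": {"order_no": [10, 11, 12, 13], "cust": [1, 1, 3, 2]},
}
print(infer_relationships(tables))
```

Lowering min_coverage below 1.0 would also surface relationships that hold for most rows but are broken by orphaned values, at the cost of more coincidental matches.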
Finally, and perhaps most interesting, at least for large enterprises, is what might be called landscape analysis. Major organisations can have hundreds or thousands of databases. We know of some cases where they actually get into five figures. How does the chief data officer, if there is one, understand this landscape? How does he or she get a handle on the relationships that exist across these data sources? How do you implement any sort of governance across a data landscape that is this extreme? The answer, you guessed it, is to use data profiling.
Very few data profiling vendors (actually, we only know of one) target this market. You need extreme scale, which is not the same as performance: performance isn’t usually a focus, except that profiling should not impact the production systems being profiled. You also need some intelligent way to visualise the results: being able to look at the big picture, drill down into details and follow chains of relationships. Given that this is all about relationships, using a graph database for this purpose would seem one obvious solution or, at least, a graph engine that will run on top of whatever database is used to store the relevant metadata.
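As a rough illustration of what "following chains of relationships" means in graph terms, the Python sketch below stores discovered relationships as an undirected graph and uses a breadth-first search to find the shortest chain between two data sources. The node names and function names are hypothetical, and a graph database would do the equivalent with its own query language at far greater scale.

```python
from collections import defaultdict, deque


def build_graph(relationships):
    """Build an undirected adjacency map from (source, target) pairs."""
    graph = defaultdict(set)
    for a, b in relationships:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def relationship_chain(graph, start, goal):
    """Return the shortest chain of related nodes from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back through the predecessors to rebuild the chain.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical relationships discovered by profiling several databases.
graph = build_graph([
    ("crm.customers", "billing.accounts"),
    ("billing.accounts", "ledger.entries"),
    ("hr.staff", "payroll.runs"),
])
print(relationship_chain(graph, "crm.customers", "ledger.entries"))
```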
Anyway, the point of this brief blog is to suggest that users and vendors should think a bit more broadly about how and where data profiling can be used: it isn’t necessarily just about data quality.