Every organization generates tons of data these days. However, most of it is never analyzed and therefore never acted upon. Some of the biggest barriers to data-driven decisions in organizations are:
– How do we discover data across all the data products automatically?
– How do we catalog the data quickly and build a searchable metadata dataset at the organization level?
– How do we govern this carefully so that we implement all relevant regulatory and security best practices, e.g. least privilege?
Let’s take a detour. Imagine you are at an airport. You check in your baggage and notice something: the airport staff carefully stick an appropriate tag on each piece of your luggage. The tag types are limited, and they are usually color-coded, with logos and heavy fonts. In addition, you are careful to ‘label’ your luggage with your name, address and sometimes cute little stickers or special insignia that help you quickly spot your luggage later at pick-up time.
Using the same analogy, your organization can:
– Create organization-level tags. These are special words like ‘Marketing’, ‘GDPR’, ‘HIPAA’, ‘Financial’, ‘PII’, ‘PHI’, ‘PCI’.
– Educate and empower everyone in your organization to stamp one or more of these tags onto each of your datasets.
– Use these tags to discover, quickly, programmatically and periodically, all your ‘data-assets’ across all your systems, especially your datalake/datawarehouse/datalakehouse.
– After each discovery run, catalog all the discovered data-assets (with statistics if needed) neatly into one central metadata database.
– Spell out specific instructions on how each ‘tag’ should be handled, especially with respect to access (and updates if needed), through IAM policies that use the same tags you came up with carefully at the start.
– In addition, allow every data-product team to apply as many free-hand labels to their data-products as they like, for their own purposes or for responsibilities shared with other teams. You don’t need to control this strictly.
These steps ensure that your organization’s key assets, i.e. your data-assets a.k.a. data-products, are always neatly discovered and cataloged, and that their usage is governed per your regulatory needs.
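To make the steps concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: the names `DataAsset`, `discover`, `build_catalog`, `can_read`, the sample assets and the tag-to-role mapping are all hypothetical, and nothing here calls a real cloud or IAM API. In practice the tags would live in your platform's own tagging and policy features.

```python
from dataclasses import dataclass, field

# Organization-level tags: a small, controlled vocabulary (the words from the post).
ORG_TAGS = {"Marketing", "GDPR", "HIPAA", "Financial", "PII", "PHI", "PCI"}

# Hypothetical handling rules: which roles may read data carrying each tag.
TAG_READERS = {
    "Marketing": {"analyst", "marketer"},
    "Financial": {"analyst", "finance"},
    "GDPR": {"privacy_officer"},
    "PII": {"privacy_officer"},
    "PHI": {"clinician"},
    "HIPAA": {"clinician"},
    "PCI": {"finance"},
}


@dataclass
class DataAsset:
    name: str
    system: str                                  # e.g. "datalake" or "warehouse"
    tags: set = field(default_factory=set)       # controlled organization tags
    labels: dict = field(default_factory=dict)   # free-hand team labels, not validated


def discover(assets, wanted_tag):
    """Return every asset stamped with the given organization-level tag."""
    if wanted_tag not in ORG_TAGS:
        raise ValueError(f"unknown organization tag: {wanted_tag}")
    return [asset for asset in assets if wanted_tag in asset.tags]


def build_catalog(assets):
    """Build a central index: tag -> sorted list of asset names."""
    catalog = {tag: [] for tag in ORG_TAGS}
    for asset in assets:
        unknown = asset.tags - ORG_TAGS
        if unknown:
            raise ValueError(f"{asset.name} carries unknown tags: {sorted(unknown)}")
        for tag in asset.tags:
            catalog[tag].append(asset.name)
    return {tag: sorted(names) for tag, names in catalog.items()}


def can_read(role, asset):
    """Least privilege: the role must be allowed by every tag on the asset.

    Untagged assets are denied by default.
    """
    if not asset.tags:
        return False
    return all(role in TAG_READERS.get(tag, set()) for tag in asset.tags)


# Made-up sample assets, purely for illustration.
SAMPLE_ASSETS = [
    DataAsset("campaign_clicks", "warehouse", {"Marketing"}, {"team": "growth"}),
    DataAsset("customer_profiles", "datalake", {"PII", "GDPR"}),
    DataAsset("card_payments", "warehouse", {"PCI", "Financial"}),
]

if __name__ == "__main__":
    print([asset.name for asset in discover(SAMPLE_ASSETS, "PII")])
    print(build_catalog(SAMPLE_ASSETS)["Financial"])
    print(can_read("analyst", SAMPLE_ASSETS[2]))
```

Note the design choice in `can_read`: an asset with several tags is readable only by a role that every one of those tags allows, and an untagged asset is readable by nobody. That is one possible way to express least privilege with tags, not the only one.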
None of the above is new. In fact, Big Data ecosystems already had libraries and tools in place for doing this efficiently. For example:
– Tools like Atlas and Ranger in the Hadoop ecosystem.
– Tags, labels, service accounts, projects, BigQuery datasets, and BigQuery tables/views in Google Cloud.
In fact, every cloud has equivalent features in place. They leave enough room for you to use these features creatively and flexibly for your organizational needs, without explicitly solving your problems entirely. As an exception, a service like GCP Dataplex may come close to your needs.
PPS: Tags and labels can be utilized for many other purposes, e.g. FinOps, SRE, DevSecOps, etc.
This post was later published on LinkedIn here.