Data classification is an essential prerequisite to data protection, security and compliance. Firms need to know where their data is and the types of data they hold.
Organisations also need to classify data to ensure it has the right level of protection and is stored on the most suitable type of storage in terms of cost and access time.
Data classification checks for personally identifiable information (PII). It may also classify intellectual property or sensitive financial and strategy information. Also, data classification will provide basic information such as data format, when last accessed, access controls, etc. Finally, data classification will often form part of large-scale analytics work, such as in data lakes.
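As a rough illustration of the PII check described above, here is a minimal Python sketch that scans text with regular expressions and assigns a sensitivity label. The pattern set, the function names `classify_text` and `assign_level`, and the labels "restricted" and "internal" are all assumptions made for this example. Commercial tools rely on far broader rules, validation and, increasingly, machine learning.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
# Real PII detection needs many more rules plus validation (e.g. checksums).
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def classify_text(text):
    """Return a dict mapping each PII type found to its number of matches."""
    findings = {}
    for label, pattern in PII_PATTERNS.items():
        count = len(pattern.findall(text))
        if count:
            findings[label] = count
    return findings


def assign_level(findings):
    """Map scan findings to a (hypothetical) sensitivity label."""
    return "restricted" if findings else "internal"


sample = "Contact jane.doe@example.com, card 4111 1111 1111 1111"
findings = classify_text(sample)
print(findings, assign_level(findings))
```

A rule-based scan like this is easy to audit but will miss PII that does not fit a fixed pattern, such as names and addresses, which is one reason suppliers are adding ML-based detection.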
“The idea of a classification scheme is to be able to qualify the sensitivity or the importance of data to an organisation,” says David Adams, GRC security consultant at Prism Infosec. “Applying meaningful data classification allows an organisation to be able to understand its sensitive data and apply appropriate controls.”
Data classification and data management
Increasingly, organisations have invested in dedicated tools to classify datasets as they are ingested, as well as to scan stored data for sensitive information and to create data catalogues and business glossaries. These, in turn, help with security, data management and data quality. This tools-based approach is replacing the custom scripts that enterprises have often relied on for data discovery.
Suppliers have also turned to natural language-based systems to make data management easier for non-specialists, and to automation via machine learning and artificial intelligence (AI). This is in response to the growing volumes of data that organisations need to process, and the growth in unstructured data.
But it is also a response to compliance pressures. Automated systems are less prone to human error, and can be invaluable in tracking down incorrectly classified or inadequately protected datasets.
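To make the idea of tracking down incorrectly classified or inadequately protected datasets concrete, the sketch below audits a small in-memory catalogue. It is not based on any supplier's product. The `CatalogueEntry` fields, the four sensitivity levels and the rule that "confidential" data or above must be encrypted are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, ordered from lowest to highest.
LEVELS = ["public", "internal", "confidential", "restricted"]


@dataclass
class CatalogueEntry:
    name: str
    classification: str  # level recorded in the catalogue
    detected: str        # level suggested by an automated scan
    encrypted: bool      # whether the dataset is encrypted at rest


def audit(entries):
    """Return (dataset name, issue) pairs for entries that need review."""
    issues = []
    for entry in entries:
        recorded = LEVELS.index(entry.classification)
        detected = LEVELS.index(entry.detected)
        # The scan found more sensitive content than the catalogue records.
        if detected > recorded:
            issues.append((entry.name, "under-classified"))
        # Assumed policy: confidential data and above must be encrypted.
        if detected >= LEVELS.index("confidential") and not entry.encrypted:
            issues.append((entry.name, "inadequately protected"))
    return issues


entries = [
    CatalogueEntry("hr_records", "internal", "restricted", False),
    CatalogueEntry("press_releases", "public", "public", False),
]
for name, issue in audit(entries):
    print(f"{name}: {issue}")
```

The same comparison could be reversed to flag the "over-classification" problem, where the recorded level is higher than the scan suggests.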
Gartner points out that manual data classification is cumbersome and prone to inconsistencies. And the growth of data volumes, alongside greater use of unstructured data, is making it almost impossible to carry out the task manually.
But data classification is critical for IT strategy, governance and compliance, and also for a business’s risk tolerance. If an organisation lacks an accurate record of its data, it will not have an accurate view of its risk. This can leave critical data sources unprotected or, as Gartner warns, can result in “over-classification” of data and an unnecessary burden on the organisation.
Tools or platforms?
Data classification tools come as standalone – typically data cataloguing – products, or as part of broader data quality or data management toolsets. Also, they can form part of a business intelligence (BI) or enterprise software application.
Some suppliers, including Microsoft and SAP, provide data classification as a service. Also, there is a trend towards “serverless” offerings from other suppliers that remove the need for users to configure IT infrastructure. This is especially useful for cloud-based workloads, but is not restricted to them.
Most suppliers claim at least some machine learning (ML) or AI capabilities to automate the data classification process. Some also provide data classification as part of a broader data quality toolset.
Providers of data classification tools include business analytics suppliers, database and infrastructure companies, application software suppliers, cloud providers and niche specialists. There are also several open source options.
Unsurprisingly, IBM, Microsoft, Oracle and SAP all have a presence in the market.
IBM’s Watson Knowledge Catalog works with the vendor’s InfoSphere Information Governance Catalog for data discovery and governance. It has more than 30 connectors to other applications, uses a common business glossary, and was designed to use AI and ML.
Microsoft’s Purview Data Catalog also uses an enterprise data catalogue, and is part of the Purview data governance, compliance and risk management service Microsoft offers through its Azure cloud platform.
SAP offers document classification as a service through its cloud operations or as part of its AI business services. It also has an AI-powered Data Attribute Recommendation service to automatically classify master data.
Oracle offers its Cloud Infrastructure Data Catalog to provide a metadata management cloud service to build an inventory of assets and a business glossary. It includes AI technology as well as discovery capabilities.
Data management supplier Informatica offers its Enterprise Data Catalog tool. This is an ML-based tool that can scan data and classify it across local and cloud storage. It also works with BI tools and third-party metadata catalogues.
Analytics and BI company Qlik has built up its data classification tools in recent years, including via its acquisition of Podium, which added data preparation, quality and management tools. The data cataloguing part of Qlik’s Data Integration platform aims to work closely with its BI and analytics tools, but can also exchange data with other applications and catalogues.
Tableau takes a similar approach, putting its Catalog tool in its data management suite. This is an add-on to its analytics platform. The tool ingests information from Tableau datasets into its catalogue, and offers application programming interfaces (APIs) that can bring in data from other applications.
Google’s Cloud Data Catalog, despite its name, is a managed data discovery service that works across cloud and on-premise data stores. It integrates with Google’s identity and access management and data loss prevention tools, and is “serverless” so users do not have to configure infrastructure.
Amazon Web Services
AWS provides its data catalogue through Glue, a managed ETL (extract, transform and load) service. Glue Data Catalog works across a range of AWS services, including AWS Lake Formation, as well as with open source Apache Hive data warehouses.
Ataccama One is the supplier’s data management and governance platform, and features in Gartner’s Magic Quadrant for data quality solutions. Its Data Catalog module automates data discovery and change detection and works with databases, data lakes and file systems. The supplier’s emphasis is on data quality improvement.
Collibra is also rated by Gartner in its Magic Quadrant, and is a data intelligence cloud platform based around an ML-based data catalogue. The data catalogue has pre-built integration with business applications, BI and data stores. Collibra claims users can search data stores using the tool, without the need to learn SQL.
DataHub and Apache Atlas
DataHub originated at LinkedIn as a metadata search and discovery tool, and went open source in 2020. But perhaps the most widely supported open source tool is Apache Atlas, which offers data cataloguing, metadata management and data governance.