Agile Governance in the form of Automatic Discovery and Protection Is Needed to Create Confidence in a Big Data Environment

The arrival of Big Data is having a dramatic impact on many organizations in terms of deepening insight. However, it also affects the enterprise in ways that drive a need for data governance. Big Data introduces:

  • New sources of information
  • Data in motion as well as additional data at rest
  • Multiple analytical data stores in a more complex analytical environment (with some of these data stores possibly being in the cloud)
  • Big Data Platform specific storage, e.g. Hadoop Distributed File System (HDFS), HBase, Analytical RDBMS Columnar Data Store, or a NoSQL Graph database
  • New analytical workloads
  • Sandboxes for data scientists to conduct exploratory analytics
  • New tools and applications to access and analyse big data
  • More complex information management in a big data environment to:
    • Supply data to multiple analytical data stores
    • Move data between big data analytical systems and Data Warehouses
    • Move core data from core systems into a big data analytical environment to facilitate analysis in context, e.g. move core transaction data into a Graph database to do graph analysis on fraudulent payments

The data landscape is therefore becoming more complex. There are more data stores, and each big data analytical platform may have a different way of storing data, sometimes with no standards.

Despite this more complex environment, there is still a need to protect data in a Big Data environment, but doing that is made more difficult by the new data characteristics of volume, variety and velocity. Rich sets of structured and multi-structured data brought into a big data store for analysis may easily attract cyber criminals if sensitive data is included. Data sources like customer master data, location sensor data from smart phones, customer interaction data, on-line transaction data, e-commerce logs and web logs may all be brought into Hadoop for batch analysis. Security around this kind of big data may therefore be an issue. In such a vast sea of data we need technology to automatically discover and protect sensitive data.

It is also not just the data that needs to be protected. Access to Big Data also needs to be managed, whether that be data in analytical sandboxes, distributed file systems, NoSQL DBMSs or analytical databases. Knowing that information remains protected even in this environment is important. In MapReduce applications, low level programming APIs need to be monitored to control access to sensitive data. Access control is also needed to govern people using new analytical tools, so that only authorised users can access sensitive data in analytical data stores. Compliance may dictate that some data streams, files and file blocks holding sensitive data are also protected, e.g. by encrypting and masking sensitive data.
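To make the idea of automatic discovery and protection a little more concrete, here is a minimal sketch in Python. It is illustrative only: the pattern names, the `discover` and `mask` functions and the sample log line are all hypothetical, and the two regular expressions are deliberately simple assumptions. Real discovery tools use far richer rules (checksums, dictionaries, column profiling) and would run at scale inside the big data platform rather than on one string at a time.

```python
import re

# Hypothetical detection rules, for illustration only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card_number": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def discover(record):
    """Return the sensitive values found in a record, keyed by rule name."""
    found = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(record)
        if matches:
            found[label] = matches
    return found


def mask(record):
    """Replace every detected sensitive value with a placeholder tag."""
    for label, pattern in PATTERNS.items():
        record = pattern.sub("<" + label.upper() + ">", record)
    return record


if __name__ == "__main__":
    # Hypothetical e-commerce log line.
    line = "2013-09-01 user=jane.doe@example.com card=4111 1111 1111 1111 amount=25.00"
    print(discover(line))
    print(mask(line))
```

Masking with a placeholder is only one protection option. The same discovery step could equally feed encryption or tokenisation, depending on what compliance requires.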

In addition a new type of user has emerged – the data scientist. Data scientists are highly skilled power users who need a secure environment where they can explore un-modelled multi-structured data and/or conduct complex analyses on large amounts of structured data.

However, sandbox creation and access need to be controlled, as does data going into and coming out of these sandboxes.

If we can use software technology to automatically discover and protect big data, then confidence in Big Data will grow.

I shall be discussing Big Data governance in more detail at my upcoming Big Data Multi-Platform Analytics class in London on October 17-18. Please register here if you want to attend.

