26 August 2018
Big data refers to datasets whose size, volume, and structure are beyond the ability of traditional software tools and database systems to store, process, and analyze within reasonable timeframes.
The size of data has been increasing rapidly over the last decade. Big data originates from multiple sources, including sensors used to gather digital pictures and videos, climate information, posts to social media sites, purchase transaction records, and cell phone GPS signals. With cloud computing and the socialization of the Internet, petabytes of unstructured data are created online daily, and much of this information has intrinsic business value if it can be captured and analyzed.
For example, mobile communications companies collect data from cell towers; oil and gas companies collect data from refinery sensors and seismic exploration; electric power utilities collect data from power plants and distribution systems. Businesses collect large amounts of user-generated data from prospects and customers, including social security numbers, credit card numbers, and data on patterns of usage and buying habits. The influx of big data and the need to move this information throughout an organization have created a massive new target for hackers and other cybercriminals. This data, which was previously unusable by many organizations, is now highly valuable; it is subject to privacy laws and compliance regulations and must always be protected.
Hadoop and similar NoSQL data stores are used in many organizations, large and small, to collect, manage, and analyze large datasets. Even though these tools are popular among organizations of all sizes, they were not designed with comprehensive security in mind.
In a Hadoop-based ecosystem, there are many new ways to ingest and process data, whether through a push- or pull-based architecture. The Hadoop framework can handle these data for many different use cases, but managing petabytes of data in a single centralized cluster can be dangerous, as data is the most valuable asset of a company.
The question of Hadoop security is not just about securing the source data that is moved from enterprise systems into the Hadoop ecosystem, but also about securing the business insights and intelligence developed from those data. Any such insights in the hands of a competitor, an individual hacker, or any unauthorized personnel could be disastrous, as they could steal personal or corporate data and use it for unlawful purposes. That is why all these data must be fully secured.
Sensitive data stored in Hadoop or any big data framework is subject to privacy standards such as HIPAA and HITECH, as well as to security regulations and audits. In addition to bringing benefits to the enterprise, the Hadoop framework also introduces new dimensions to the cyber-attack landscape. At a time when attackers are constantly looking for systems to target, Hadoop has become a natural starting point, as all data is stored on top of HDFS.
Data security strategy is one of the most widely discussed topics among executives, business stakeholders, data scientists, and developers when working with data-based solutions at the enterprise level.
Among the many reasons for securing a big data cluster, below are some of the most important ones.
Sensitive data like credit card numbers, SSNs, and other corporate data needs to be protected at all times.
Certain countries and regions, such as the USA and the EU, have data protection policies like HIPAA, FISMA, and GDPR to protect sensitive data. The applicable compliance requirements differ based on the data types and the region in which a company conducts its business, and companies are legally required to follow them.
By securing the sensitive data, companies can safely allow different workloads to run on the sensitive datasets.
A complex and holistic approach is needed for data security across the entire big data Hadoop ecosystem. Below are some of the key considerations when designing security features for a Hadoop-based big data ecosystem.
A single point of authentication is needed, backed by an enterprise identity and access management system. Authentication is about verifying the identity of a user or service so that only legitimate users get access to the data and services of the Hadoop cluster. In large organizations, Hadoop is integrated with existing authentication systems such as the ones below.
Active Directory (AD): Use of Active Directory has many advantages for both the organization and the users. From the organization's perspective, reusing existing services reduces maintenance effort and cost. From the user's perspective, a Single Sign-On service simplifies access and increases security in the cluster, as password hashes do not get repeatedly transmitted over the wire.
Kerberos and LDAP: Kerberos provides Single Sign-On via a ticket-based authentication mechanism. The SPNEGO protocol, which is supported by all major browsers, extends Kerberos authentication to web applications and portals.
SAML (Security Assertion Markup Language)
OAuth (Open Authorization)
HTTP authentication: REST API based authentication, mainly used for JDBC connections.
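As a hedged illustration of the Kerberos option above, secure mode in core Hadoop is switched on through `core-site.xml`; the realm, principals, and keytab locations are deployment-specific and omitted here:

```xml
<!-- core-site.xml: enable Kerberos ("secure mode").
     Realm and per-daemon keytab settings are deployment-specific. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- default is "simple", i.e. no real authentication -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>     <!-- also enforce service-level authorization checks -->
</property>
```

With this in place, users obtain a ticket (e.g. via `kinit`) before issuing HDFS or YARN commands, so credentials are never sent in the clear.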
Role-based authorization with fine-grained access control needs to be set up to govern access to sensitive data.
Access to the data needs to be controlled based upon the availability of the processing capacity in the cluster.
Enterprises must deploy proper encryption and masking techniques on the data so that access to sensitive data is available to authorized personnel only.
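As a minimal sketch of masking (not a production scheme), a sensitive value can be replaced by a deterministic one-way token before it ever lands in the cluster, so analysts can still join and count on the token without seeing the raw value. The SSN below is an illustrative value:

```shell
# Replace an SSN (example value) with a one-way SHA-256 token.
# The same input always yields the same 64-hex-character token.
ssn="123-45-6789"                                    # illustrative value only
token=$(printf '%s' "$ssn" | sha256sum | cut -d' ' -f1)
echo "$token"
```

Because SSNs are drawn from a small, enumerable space, an unsalted hash like this can be brute-forced; real deployments use a keyed HMAC, tokenization service, or format-preserving encryption, which is why the dedicated tools discussed below exist.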
Internal or external leakage of data is a key business concern in any organization. It is a challenging task to secure sensitive and critical business data and personally identifiable information (PII) in a big data cluster, as data is stored in various formats after passing through different data pipelines.
There are two types of encryption techniques that can be applied to the data:
Data-in-transit encryption
Data-at-rest encryption
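As a sketch of data-in-transit protection, wire encryption for Hadoop RPC and the HDFS block-transfer protocol is controlled by stock configuration properties such as the following (values shown are the common "encrypt everything" choices, not the defaults):

```xml
<!-- core-site.xml: protect Hadoop RPC traffic -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value> <!-- "authentication" | "integrity" | "privacy" (encrypted) -->
</property>

<!-- hdfs-site.xml: encrypt the DataNode block-transfer protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```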
Implementing these techniques can be challenging, as much of the information is not file-based in nature but is instead handled through complex chains of message queues and message brokers. Applications in Hadoop may also use local temporary files that can contain sensitive information, which must be secured. The plain version of Hadoop provides encryption for data stored in HDFS, but it does not have a comprehensive cryptographic key management solution or any Hardware Security Module (HSM) integration.
To support data-at-rest encryption, the Hadoop distribution from Cloudera provides Cloudera Navigator Encrypt and Key Trustee Server, whereas Hortonworks provides the Ranger Key Management Service. MapR uses format-preserving encryption and masking techniques, maintaining the data format without replacing it with cryptic text, which supports faster analytical processing between applications.
Cryptographic protection of data at rest can be done at three levels.
Application level: It integrates with the current application by securing the data during ingestion, using an external key manager with cryptographic keys in an HSM to encrypt and decrypt the data.
HDFS level: This is transparent encryption in which content is encrypted on write and decrypted on read. It protects against file-system and OS level attacks.
Disk level: This is transparent encryption at a layer between the application and the file system. It provides process-based access control, which can secure metadata, logs, and configuration files.
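At the HDFS level, transparent encryption is organized around encryption zones. The following sketch assumes a running cluster with the Hadoop KMS configured; the key name `pii-key` and the path `/secure/pii` are hypothetical examples:

```shell
# Create an encryption key in the Hadoop KMS (key name is an example)
hadoop key create pii-key

# Create an empty HDFS directory and turn it into an encryption zone;
# files written under it are encrypted transparently on write.
hdfs dfs -mkdir -p /secure/pii
hdfs crypto -createZone -keyName pii-key -path /secure/pii

# Verify which encryption zones exist
hdfs crypto -listZones
```

These commands require a live cluster and appropriate HDFS and KMS admin privileges, so they are shown only as an illustration of the workflow.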
Core Hadoop does not provide any native safeguard against network-based attacks. These can include Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks, such as flooding a cluster with extra jobs or running jobs that consume large amounts of resources.
To protect a big data cluster from network-based attacks, an organization needs to:
Achieve system-level security by hardening the OS and the applications installed as part of the ecosystem.
Data centers should have a strict infrastructure and physical access security.
Security-Enhanced Linux (SELinux) is a Linux kernel security module that provides a mechanism for supporting access control policies such as Mandatory Access Control (MAC). It was developed by the NSA and adopted in the upstream Linux kernel. It can mitigate attacks such as command injection by enforcing, for example, that a library file has execute permission (x) but not write permission (w). An SELinux policy can also prevent another user or process from accessing one's home directory even if that user changes the settings on their home directory. Policies label files, grant permissions on them, and enforce MAC.
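A minimal sketch of inspecting and applying SELinux labels on a Hadoop data directory follows; the path `/data/hadoop` and the context type `var_t` are illustrative assumptions, and real deployments would pick a type matching their policy:

```shell
# Confirm SELinux is enforcing its policy (expect "Enforcing")
getenforce

# Inspect the security labels on a (hypothetical) Hadoop data directory
ls -Z /data/hadoop

# Persistently label the directory tree, then apply the context on disk
semanage fcontext -a -t var_t "/data/hadoop(/.*)?"
restorecon -Rv /data/hadoop
```

These commands are OS- and policy-dependent and require root, so they are shown only to illustrate the labeling workflow described above.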
Enterprises should maintain a proper audit trail indicating any changes to the data ecosystem and should provide audit reports for any data access and data processing that occurs within the ecosystem.
As part of following government regulations, companies are often required to keep an audit trail of logs related to cluster access and cluster configuration changes. Most Hadoop distributions, such as Cloudera, MapR, and Hortonworks, offer audit capabilities to ensure that platform administrator and user activities can be logged.
Logging for audits should, at a minimum, cover user access to data, administrative actions, and cluster configuration changes.
Good auditing practice in an organization makes it possible to identify sources of data, application and data errors, and security events. Most big data platform components allow for one form of logging or another, either to the local file system or to HDFS. The main challenge for auditing in the big data world is the distributed nature of big data components and the tight integration of distinct components with each other.
Good auditing practice on the Hadoop framework lets organizations capture metadata for data lineage, database changes, and security events. Some of the common tools for auditing are the Cloudera Navigator Audit Server and Apache Atlas (Hortonworks). Using these, organizations can automatically capture events from the filesystem, database, and authorization components and display these data through a user interface.
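On a stock Apache Hadoop install, the HDFS audit trail itself is driven by a dedicated log4j logger. The snippet below is a hedged sketch based on the default `log4j.properties`; the appender name `RFAAUDIT` follows the stock file, and the file path is configurable:

```properties
# Route NameNode audit events (who accessed which path, with what result)
# to a dedicated rolling file instead of the general daemon log.
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO,RFAAUDIT
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

Keeping audit events in their own file makes it straightforward to ship them to a SIEM or to HDFS for the kind of centralized analysis described above.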
Disaster Recovery (DR) enables business continuity for significant data center failures beyond what the high availability features of Hadoop components can cover.
Disaster recovery is commonly supported in the following ways.
Backup: Backup of data refers to cold storage of data that will not be used all the time.
Replication: Replication aims to provide a close resemblance of the production system by replicating data at a scheduled interval. Replication can also be used within the cluster to increase availability and reduce single points of failure.
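Cross-cluster replication for DR is often driven by `distcp`. In the sketch below, the NameNode addresses and paths are placeholders, and in practice the command runs on a schedule (for example from cron or a workflow engine):

```shell
# Copy /data/warehouse from the production cluster to the DR cluster,
# transferring only new or changed files and removing files deleted upstream.
hadoop distcp -update -delete \
  hdfs://prod-nn:8020/data/warehouse \
  hdfs://dr-nn:8020/data/warehouse
```

`distcp` runs as a MapReduce job, so large replication intervals trade cluster capacity during the copy against how far behind the DR copy is allowed to fall.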