2 June 2018
The amount of data in our world has been exploding. Companies capture trillions of bytes of information about their customers, suppliers, and operations, and millions of networked sensors are being embedded in the physical world in devices such as mobile phones and automobiles, sensing, creating, and communicating data.
Multimedia content and individuals using smartphones and social network sites will continue to fuel exponential growth. Big data, meaning large pools of data that can be captured, communicated, aggregated, stored, and analyzed, is now part of every sector and function of the global economy. As with other essential factors of production such as hard assets and human capital, much of modern economic activity, innovation, and growth simply couldn’t take place without data.
Digital data is now everywhere: in every sector, in every economy, and in every organization and user of digital technology. While this topic might once have concerned only a few data geeks, big data is now relevant for leaders across every sector, and consumers of products and services stand to benefit from its application. The ability to store, aggregate, and combine data and then use the results to perform deep analyses has become ever more accessible as trends such as Moore’s Law in computing, its equivalent in digital storage, and cloud computing continue to lower costs and other technology barriers.
“Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data—i.e., we don’t define big data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as big data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).
Big data is the problem of dealing with structured, semi-structured, and unstructured data sets so large that they cannot be processed using conventional relational database management systems. It involves challenges such as storage, search, analysis, and visualization of the data, as well as finding business trends, determining the quality of scientific research, combating crime, and other use cases whose results would be difficult to derive from smaller datasets.
Structured data: It has a predefined schema and represents data in a row and column format. Many examples of structured data exist; some of the important ones are Extensible Markup Language (XML) data, data warehousing data, database data, Enterprise Resource Planning (ERP) data, and customer relationship management (CRM) data.
Semi-structured data: It is a type of self-describing structured data that does not conform to the data types of relational data but contains tags or related information that separate it from unstructured data. Some examples of semi-structured data are the XML and JSON data formats.
Unstructured data: These are data types that do not have a predefined schema or data model. Without a formal predefined schema, traditional applications have a hard time reading and analyzing unstructured data. Some examples of unstructured data are video, audio, and binary files.
Big data can be categorized based on four properties: volume, variety, velocity, and veracity.
Volume: Data has grown exponentially in the last decade as the evolution of the web has brought more devices and users onto the internet. Disk storage capacity has increased from megabytes to terabytes and petabytes as enterprise-level applications started producing data in large volumes.
Variety: The explosion of data has caused a revolution in data formats. Traditional formats such as Excel sheets, database exports, Comma Separated Values (CSV) files, and Tab Separated Values (TSV) files can mostly be stored in a simple text file. Big data has no predefined data structure, so it can be in a structured, semi-structured, or unstructured format. Unlike data held in earlier storage media such as spreadsheets and databases, data currently comes in a variety of formats such as emails, photos, Portable Document Format (PDF) files, audio, video, and output from monitoring devices. Real-world problems involve data in a variety of formats, which poses a big challenge for technology companies.
Velocity: The explosion of social media platforms on the internet has made data grow far faster than it did from traditional sources. Within the last decade there has been a massive and continuous flow of big data from sources such as social media websites, mobile devices, businesses, machine data, sensor data, web servers, and human interaction. People are on their mobile devices all the time, posting their latest news to their social media profiles and leaving a huge electronic footprint. These electronic footprints are collected every second, at high speed and at petabyte scale.
Veracity: There is no guarantee that all the data produced and ingested into a big data platform is clean. Veracity deals with the biases, noise, and abnormalities that might arrive with the data. Cleaning the data is one of the biggest challenges that analysts and engineers face. As the velocity of data keeps increasing, the big data team must prevent the accumulation of dirty data in its systems.
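As a rough illustration of the veracity problem, the sketch below separates clean records from dirty ones before they would be stored. The record layout, the `temperature_c` field, and the plausible range of -50 to 60 are all assumptions made for this example, not part of any particular platform.

```python
import math

def is_clean(record):
    """Return True when a hypothetical sensor record passes basic checks."""
    value = record.get("temperature_c")
    # Reject missing or non-numeric values (noise introduced at the source).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    if math.isnan(value):
        return False
    # Reject abnormal readings outside an assumed plausible range.
    return -50.0 <= value <= 60.0

def split_clean_dirty(records):
    """Partition records into (clean, dirty) lists, preserving input order."""
    clean, dirty = [], []
    for record in records:
        (clean if is_clean(record) else dirty).append(record)
    return clean, dirty

# Made-up sample data for illustration only.
sample = [
    {"sensor": "a", "temperature_c": 21.5},
    {"sensor": "b", "temperature_c": None},
    {"sensor": "c", "temperature_c": 999.0},
    {"sensor": "d", "temperature_c": "n/a"},
    {"sensor": "e", "temperature_c": -3},
]
clean, dirty = split_clean_dirty(sample)
```

Keeping the rejected records in a separate list, rather than dropping them, makes it possible to inspect how much dirty data arrives and why.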
Apache Hadoop is an open source framework for processing large data sets across clusters of low-cost servers using the simple MapReduce programming model. It is designed to scale from one server to many servers, each of them offering local computation and storage. The Hadoop library is designed so that high availability is obtained without relying solely on the hardware: failures are detected and handled at the application layer.
Google was among the first organizations to deal with data at this massive scale when it decided to index the web to support its search queries. To solve this problem, Google built a framework for large-scale data processing using the map and reduce model from the functional programming paradigm. Based on the technological advances it made while solving this problem, Google released two academic papers in 2003 and 2004. After reading these papers, Doug Cutting started implementing an open source version of Google's MapReduce platform as an Apache project. Yahoo hired him in 2006, where he supported the Hadoop project.
Hadoop mainly consists of two components: the Hadoop Distributed File System (HDFS) and MapReduce.
HDFS is a distributed file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Its design is based on the Google File System (GFS), which was originally created and implemented by Google. HDFS is designed to handle large amounts of data while reducing the overall input/output operations on the network. Through data replication and fault tolerance, it also increases the scalability and availability of the cluster.
When input files are ingested into the Hadoop framework, they are divided into blocks of 64 MB or 128 MB, which are distributed across the nodes of the Hadoop cluster. The block size can be predefined in the cluster configuration file or passed as a custom parameter when submitting a MapReduce job. This storage strategy lets the Hadoop framework store files that are larger than the disk capacity of any single node. It enables HDFS to store data at terabyte to petabyte scale.
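The block arithmetic described above can be sketched in a few lines. This is a simplified model, not HDFS code: the function names and node names are hypothetical, and the round-robin placement is a toy stand-in for the replica placement policy that HDFS really uses.

```python
MB = 1024 * 1024

def split_into_blocks(file_size_bytes, block_size_bytes=128 * MB):
    """Return the sizes of the blocks a file of the given size would occupy."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last block holds only the leftover bytes, not a full block.
        sizes.append(remainder)
    return sizes

def assign_round_robin(num_blocks, nodes):
    """Toy placement: hand blocks to nodes in turn (an assumption, not HDFS policy)."""
    return {index: nodes[index % len(nodes)] for index in range(num_blocks)}

# A 300 MB file with 128 MB blocks: two full blocks plus one 44 MB block.
blocks = split_into_blocks(300 * MB)
placement = assign_round_robin(len(blocks), ["node1", "node2"])
```

Because each block can live on a different node, the size of a single file is limited by the capacity of the cluster rather than by the disk of any one machine.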
MapReduce is the parallel programming model that is used for processing large chunks of data. A MapReduce job splits the input dataset into independent chunks, which is what makes it possible to process data that cannot be stored on a single node. The job first executes the map tasks, which process the split input data in parallel. The framework then sorts the output of the map function and sends the result to the reduce tasks as their input.
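The flow of map, sort, and reduce can be imitated on a single machine with a word count, the usual introductory example. The sketch below is plain Python, not the Hadoop API; the function names are chosen for this illustration, and the map tasks run one after another here, whereas on a cluster they would run in parallel on different nodes.

```python
from collections import defaultdict

def map_words(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_and_sort(mapped_outputs):
    """Group values by key and order the keys, as happens between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())

def reduce_counts(key, values):
    """Reduce task: collapse all values for one key into a single total."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk is mapped independently; a cluster would do this in parallel.
    mapped = [map_words(chunk) for chunk in chunks]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped)

# Two chunks stand in for two independent input splits.
chunks = ["big data is big", "data is everywhere"]
result = word_count(chunks)
```

The map function never needs to see more than its own chunk, which is the property that lets the work be spread across many nodes.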
Gautam, N. “Analyzing Access Logs Data using Stream Based Architecture.” Masters, North Dakota State University, 2018. Available