Data and Big Data
To understand Hadoop and how it works, it is necessary to know what Big data is. As the internet grew with the advent of Web 2.0 and 3.0, the demand for data grew, and the amount of data collected and shared increased. Particularly with social media and IoT, there has been exponential growth in the data stored, transacted and accumulated. Big data is an umbrella term for such large volumes of data.
Big Data Analytics
Data is information: information that is used for planning, scheduling, forecasting, budgeting, risk mitigation, etc. The larger the data set, the more accurate the information, hence the importance of Big data. Big data helps an enterprise uncover information such as buying patterns, demographic preferences, trends and correlations. This in turn helps an enterprise discover new streams of revenue, provide better customer service, improve processes, etc. Tools that analyze such large data sets are known as Big data analytics applications.
Hadoop and Big Data
Hadoop is one such Big data analytics application. A quick look at its history:
Authors: Doug Cutting, Mike Cafarella
Developers: Apache Software Foundation
First Release: April 1st, 2006
Current Release: 3.3.0
Hadoop is a framework for the distributed processing of large data sets across clusters of computers using simple programming models. Whether the cluster is a single storage server or hundreds of machines spread across locations, every node offers local computation and storage. With this parallel processing happening across hundreds of distributed nodes, data processing is faster than in conventional databases. So rather than storing all the data in a single or limited number of data stores, Hadoop clusters multiple nodes, which together can process even petabytes of data quickly.
Hadoop Distributed File System — HDFS is a file system designed for fault tolerance and for deployment on low-end commodity hardware. Since Hadoop works with Big data, HDFS is tuned for large data files. A major plus point of HDFS is that it provides mechanisms for applications to perform computation on large data files where they are stored, instead of sending those large files to the application.
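Under the hood, HDFS achieves this by splitting each large file into fixed-size blocks (128 MB by default in recent releases) and spreading those blocks across nodes, so computation can run next to each block. A minimal Python sketch of that block layout, assuming the 128 MB default (this illustrates the idea only, not the real HDFS implementation):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the common HDFS default block size

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per HDFS-style block."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        # The last block may be shorter than the configured block size.
        length = min(block_size, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus a 44 MB tail block,
# and each block can be replicated and processed on a different node.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))    # 3
```

Because each block is an independent unit, Hadoop can schedule a task on whichever node already holds a given block, which is exactly the "move computation to the data" behaviour described above.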
Yet Another Resource Negotiator (YARN) — YARN is to applications what Houston is to NASA: the control center. YARN manages cluster nodes, job scheduling and task status, and it also does extensive monitoring and reporting.
MapReduce — Hadoop supports and uses replicated data, which brings a number of benefits, such as availability of the data at any time and fault tolerance. Processing all that distributed data efficiently is the job of MapReduce. The "Map" phase of a MapReduce job filters and sorts the input data set into independent chunks, which are processed by the map tasks in parallel. Once mapping is done, the "Reduce" phase steps in and performs an aggregate or summary function over the map outputs to produce the final result.
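The map, shuffle and reduce phases above can be sketched with the classic word-count example. This is a minimal in-process Python illustration of the model, not the Hadoop Java API; in a real job, the framework runs the map and reduce functions on many nodes:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: summarize one key's values -- here, a simple sum."""
    return (key, sum(values))

lines = ["big data is big", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])   # 3
```

Because every map call depends only on its own input chunk, and every reduce call only on one key's grouped values, both phases parallelize naturally across nodes.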
And finally, underpinning the above three modules and the other modules in Hadoop, the Hadoop Common libraries provide shared utilities that can be used by all modules.
Hadoop vs Conventional Database
Admittedly, Hadoop is not a database, so this comparison might feel like apples and oranges, but Hadoop, as we saw, is an ecosystem for distributed storage and processing, and thus has some similarities with a database.
Hadoop is ideally used for applications dealing with petabytes of data; RDBMS are a good fit for applications with gigabytes of data
RDBMS are mainly used for structured data; well-defined schemas and data definitions are mandatory for an RDBMS. Hadoop works well with both structured and unstructured data
Since Hadoop works mostly with semi-structured or unstructured data, it does not require SQL. Hadoop applications do support SQL in various lightweight forms, but the ecosystem primarily uses HQL (Hive Query Language), the SQL-like language of Apache Hive, whose table definitions live in the Hive Metastore. RDBMS use SQL
Hadoop is fully open source, whilst most RDBMS are commercially licensed
Data and data objects are stored as key-value pairs in Hadoop, whereas in an RDBMS, as the name suggests, data objects are relational
Hadoop is well suited for data types like audio, video and images; RDBMS go well with relational data such as that commonly found in OLTP systems
Hadoop is not a single library, tool or framework that handles Big data. It is essentially a platform, a suite, an ecosystem which provides tools, frameworks, libraries, models and objects to solve, analyse and glean insight from Big data. The four main modules of Hadoop are listed above. In addition, the main components of a Hadoop ecosystem are:
Data storage is a collection of commodity hardware machines, known as nodes and grouped into clusters in the Hadoop ecosystem
Data processing involves the computational techniques which process the data held in the storage layer; this layer is built for optimized, parallel computation
Data access tools help applications write queries and execute commands on the data sets returned by the data processing tools
Needless to say, the Hadoop ecosystem is constantly evolving and new tools keep getting added every day. Those listed above are some of the popular ones.
The Hadoop framework is written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
The MapReduce model can be implemented in any language for any application that uses Hadoop.
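One common way to do this is Hadoop Streaming, which treats the mapper and reducer as plain executables that read lines on stdin and write tab-separated key/value lines on stdout. A minimal streaming-style mapper sketch in Python (the sample input is illustrative):

```python
def streaming_mapper(lines):
    """Emit 'word<TAB>1' lines, the tab-separated key/value format
    that Hadoop Streaming passes between the mapper and reducer."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

# In a real streaming job this script would iterate sys.stdin:
#     for out in streaming_mapper(sys.stdin):
#         print(out)
# Here we feed a sample line instead, so the sketch runs standalone.
for out in streaming_mapper(["big data is big"]):
    print(out)
```

Such a script would typically be launched through the hadoop-streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py` (file names and paths here are assumptions, not a fixed recipe).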
Hadoop can be implemented on-premise or on the cloud.
More than half of the Fortune 50 companies use Hadoop.
Data is only going to increase. With more applications coming up every day, data growth is one-way traffic. In such an environment, with the proven benefits exhibited by Hadoop, enterprises can and should start thinking about the virtues of Big data for their organizations. With the ability to host Hadoop on the cloud, enterprises can get the best of both worlds: cloud storage and the power of Big data.