Draft:Big Data Analytics
Submission declined on 23 November 2018.
This submission reads more like an essay than an encyclopedia article. Submissions should summarise information in secondary, reliable sources and not contain opinions or original research. Please write about the topic from a neutral point of view in an encyclopedic manner.
Last edited by Anomalocaris.
Submission declined on 21 November 2018.
This submission's references do not show that the subject qualifies for a Wikipedia article—that is, they do not show significant coverage (not just passing mentions) about the subject in published, reliable, secondary sources that are independent of the subject. Before any resubmission, additional references meeting these criteria should be added (see technical help and learn about mistakes to avoid when addressing this issue). If no additional references exist, the subject is not suitable for Wikipedia.
- 1 Big Data Analytics
- 2 Apache Hadoop
- 3 References
- 4 External links
Big Data Analytics
What is Big Data
We are living in a data-driven world: data is being generated everywhere, and the important questions are how to store it and how to process it.
Data that exceeds the storage capacity and processing power of conventional systems is called Big Data.
Big Data Generators
Large Hadron Collider – produces about 1 PB of data every second.
NASA – about 1.73 GB every hour.
eBay – a 40 PB Hadoop cluster for search, consumer recommendations and merchandising.
Facebook – a 30 PB Hadoop cluster, 50 billion photos and 30 TB of logs every day.
Airlines – an aircraft has about 6,000 sensors and produces 2.5 TB of data every day.
Healthcare – in the modern world, health data is used for forecasting and much more. Keeping sensitive data private is also a burden, because the data cannot simply be opened to everyone when it contains confidential information.
Typical hardware in 1990 versus 2018:
| |1990|2018|
|Hard Disk Capacity|1 GB – 20 GB|1 TB|
|RAM Capacity|64 MB – 128 MB|16 GB|
|Read Speed|10 KB/s|100 MB/s|
Of all the data that exists today, roughly 90% was generated in the last couple of years; the remaining 10% is older.
Because we are generating huge amounts of data, we have to store it in a way that it is not lost.
Consider a Gmail account with a storage limit of 5 GB, opened in 2013:
By the end of 2014 – 500 MB used
2015 – 1 GB used
2016 – 3 GB used
2017 – 5 GB used, at which point Gmail sends an email asking the user to subscribe to a yearly plan to increase the storage capacity.
The data has grown beyond the free storage capacity, so on subscribing, Gmail stores the excess data on its servers.
In the same way, when we work with huge data such as 100 TB, we store it on a cluster of servers.
Suppose we want to process 2 TB out of that 100 TB, and the code we write to do it is only 100 KB. This is called the computation (the program that we apply to the data).
It is clearly easier to send the 100 KB of code to the server than to fetch the 2 TB of data. Even so, before Hadoop this was not possible: computation was processor-bound, the code could not be shipped to the data center, and the data had to be fetched instead.
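The cost difference between moving the data and moving the code can be sketched with some back-of-the-envelope arithmetic. The numbers come from the text above; the 100 MB/s throughput is the read speed from the hardware table, and the function name is purely illustrative:

```python
# Back-of-the-envelope arithmetic (numbers assumed from the text above):
# moving 2 TB of data versus moving 100 KB of code, at the 100 MB/s
# read speed quoted in the hardware table.

READ_SPEED_MB_S = 100  # assumed throughput, MB per second

def transfer_seconds(size_mb):
    """Seconds needed to move size_mb megabytes at READ_SPEED_MB_S."""
    return size_mb / READ_SPEED_MB_S

data_mb = 2 * 1024 * 1024  # 2 TB of input data, expressed in MB
code_mb = 100 / 1024       # 100 KB of code, expressed in MB

print(f"moving the data: {transfer_seconds(data_mb) / 3600:.1f} hours")
print(f"moving the code: {transfer_seconds(code_mb) * 1000:.2f} ms")
```

Moving the data takes several hours, while moving the code takes under a millisecond; this is why Hadoop ships the computation to where the data lives.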
4 V’s of Big Data
Volume (GB, TB, PB) – the amount of data keeps increasing, from GB to TB to PB.
Velocity (speed of fetching the data) – moving and processing the data is a time-consuming process.
Variety (unstructured, semi-structured and structured data)
Veracity – the trustworthiness of the data.
Unstructured data: for example videos, images, text and audio.
Semi-structured data: for example log files.
Structured data: cleaned data.
Big Data characteristic: the volume of data grows in the order structured data < semi-structured data < unstructured data.
Together these 4 V's are what we call Big Data. This definition was given by IBM.
Suppose I want to distribute newspapers to 10 cities.
If I employ one driver with one truck, he finishes the work by the end of the day.
If I instead employ 10 drivers, each with a truck, they complete the work in parallel in about one hour.
Splitting the work among more people takes less time. As data increases rapidly, a single machine's processing speed becomes the bottleneck; this is the central problem of Big Data.
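The analogy above can be sketched in a few lines. The city names and the `deliver` function are purely illustrative, and a thread pool stands in for the independent drivers:

```python
from concurrent.futures import ThreadPoolExecutor

# A toy sketch of the newspaper analogy (city names and the deliver
# function are illustrative, not part of Hadoop): the same work is done
# sequentially by one worker and in parallel by a pool of ten.

cities = [f"city-{i}" for i in range(1, 11)]

def deliver(city):
    # In a real system this would be an expensive, independent task.
    return f"delivered to {city}"

# One driver: the cities are handled one after another.
sequential = [deliver(c) for c in cities]

# Ten drivers: the same cities are handled concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(deliver, cities))

assert sequential == parallel  # same result, just finished sooner
```

Because each delivery is independent of the others, the work splits cleanly; this same independence is what lets Hadoop split data processing across nodes.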
History of Hadoop
Google did pioneering work when it had to cope with huge data: how and where to store it.
In 2003 Google published a description of GFS (the Google File System).
In 2004 it published a description of MapReduce, its processing technique.
Google only published the descriptions of GFS and MapReduce; it did not release the implementations.
Later, Yahoo! implemented and introduced the equivalents:
In 2006 – HDFS
In 2007 – MapReduce
Doug Cutting is the founder of Hadoop.
Where did the name Hadoop come from? When Doug Cutting was thinking about a name for his invention, his kid was playing with a toy elephant named Hadoop, so Cutting took the same name.
Hadoop is an open-source framework, overseen by the Apache Software Foundation, for storing huge datasets and processing them on a cluster of commodity hardware.
HDFS – Hadoop Distributed File System
It is a file system specially designed for storing huge datasets on a cluster of commodity hardware, with streaming access patterns (write once, read many times, without changing the contents of files).
Consider 100 GB of storage.
In a normal file system the default block size is 4 KB; if we store 2 KB of data in a block, the remaining 2 KB of that block stays empty and is wasted.
In HDFS the default block size is 64 MB, and it can be changed to 128 MB or 256 MB as per the requirement.
If we store 35 MB of data in HDFS, it fits into one block, and the remaining 29 MB of that block can be used for other data rather than being wasted.
This is the advantage of HDFS.
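How a file is divided into blocks can be sketched as follows. The block size is the HDFS default mentioned above; the function name is illustrative, not part of the HDFS API:

```python
# A small sketch (block size taken from the text above) of how a file
# is split into HDFS blocks: every block is full except possibly the
# last one, and unused space in the last block is not wasted on disk.

BLOCK_SIZE_MB = 64  # HDFS default block size

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file occupies."""
    sizes = [block_size_mb] * (file_size_mb // block_size_mb)
    remainder = file_size_mb % block_size_mb
    if remainder:
        sizes.append(remainder)  # last block holds only the leftover
    return sizes

print(split_into_blocks(35))   # [35] -> one block, only 35 MB on disk
print(split_into_blocks(200))  # [64, 64, 64, 8]
```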
HDFS has 5 Services
1. Name Node
2. Secondary Name Node
3. Job tracker
4. Data Node
5. Task Tracker
The top three are master services/daemons/nodes and the bottom two are slave services.
Master services can communicate with each other, and in the same way slave services can communicate with each other.
The Name Node is a master node and the Data Node is its corresponding slave node; the two can talk to each other. Likewise, the Job Tracker is a master and the Task Tracker is its corresponding slave.
1. Name Node: HDFS has only one Name Node, called the master node, which tracks the files, manages the file system and holds the metadata (not the data itself): the number of blocks, which Data Nodes hold the data, where the replicas are stored, and other details. Because there is only one Name Node, it is a single point of failure. It has a direct connection with the client.
2. Data Node: A Data Node stores the data as blocks. Also known as a slave node, it stores the actual data in HDFS and serves the client's read and write requests. Data Nodes are slave daemons. Every Data Node sends a heartbeat message to the Name Node every 3 seconds to convey that it is alive. When the Name Node does not receive a heartbeat from a Data Node for 2 minutes, it considers that Data Node dead and starts replicating its blocks onto other Data Nodes.
3. Secondary Name Node: This node only takes checkpoints of the file system metadata held by the Name Node, so it is also known as the checkpoint node. It is a helper node for the Name Node.
4. Job Tracker: The Job Tracker is used for processing the data. It receives MapReduce execution requests from the client and talks to the Name Node to learn the location of the data: the Job Tracker asks the Name Node for the data to be processed, and the Name Node responds with the metadata.
5. Task Tracker: The Task Tracker is the slave node of the Job Tracker; it takes tasks and receives code from the Job Tracker, and applies that code to the file. The process of applying the code to the data is known as the mapper.
Steps for Storing Data
The client sends a request to the Name Node asking where to store the data.
The Name Node records the details as metadata and responds with the Data Node numbers, so that the client can send the data to those Data Nodes.
As a backup, HDFS by default stores 3 replicas of every block to overcome data-loss problems.
Every Data Node sends a block report to the Name Node; this tells the Name Node where the data is located.
At short intervals, every Data Node also sends a heartbeat to the Name Node to say that it is working properly.
If a Data Node stops sending heartbeats, the Name Node considers it dead and replicates its data to another node.
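The heartbeat mechanism described above can be sketched as follows. This is not Hadoop code; the class and method names are illustrative, and the timeout is the roughly 2 minutes mentioned in the text:

```python
# A toy sketch (not Hadoop code) of the heartbeat mechanism described
# above: Data Nodes report in regularly, and the Name Node declares a
# node dead once it has been silent longer than the timeout.

HEARTBEAT_TIMEOUT_S = 120  # roughly the 2 minutes mentioned above

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}  # data node id -> time of last report

    def heartbeat(self, node_id, now):
        self.last_heartbeat[node_id] = now

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return [node for node, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT_S]

nn = NameNode()
nn.heartbeat("datanode-1", now=0)
nn.heartbeat("datanode-2", now=0)
nn.heartbeat("datanode-1", now=100)  # datanode-2 has gone silent

print(nn.dead_nodes(now=130))  # ['datanode-2']
```

Once a node appears in the dead list, a real Name Node would begin re-replicating that node's blocks elsewhere, as the steps above describe.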
Steps for Processing Data
Processing data is known as MapReduce, and the Job Tracker coordinates it.
The Job Tracker sends a request to the Name Node for the data to be processed; the Name Node responds with the metadata.
The Job Tracker then sends the code to the Task Trackers, and the data is divided into pieces known as input splits. Running the code on a split is known as a mapper.
However many input splits there are, that is how many mappers there are.
Every Task Tracker sends a heartbeat to the Job Tracker every 3 seconds.
If a Task Tracker stops sending heartbeats, the Job Tracker waits 2 minutes and then considers that Task Tracker dead.
In that case the Job Tracker sends the code to another Task Tracker that has the same data.
The Name Node and the Job Tracker are each also called a single point of failure.
Combining the mapper outputs is done by the reducer, which gives the output to the client.
MapReduce is used for processing the data.
As the name suggests, a Hadoop program performs two different tasks. The first is Map, which takes the data and converts it into a set of key/value pairs. The second is Reduce, which takes the output of Map as its input, combines it and sends the result to the client. The Reduce job is always performed after the Map job.
The process of running the code on the data is called the mapper; the number of mappers equals the number of splits of the data. The process of combining all the outputs is called the reducer; the number of reducers equals the number of outputs produced from the data. The reducer sends the output to the client.
Word Count with a Sample
The input data is split; with a total of 2 splits, there are two mappers.
Each mapper emits (key, value) pairs.
The input to the reducers, produced by the shuffle and sort phase, groups the values for each key:
(".", [1, 1])
("cat", [1, 1])
("fast", [1, 1])
("hat", [1, 1])
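The whole flow can be sketched on a single machine in Python. The input sentence here is illustrative (the sample's original input text is not shown above), but the phases mirror the description: map, shuffle and sort, then reduce:

```python
from collections import defaultdict

# A minimal single-machine sketch of the word-count flow described
# above. The input sentence is illustrative, not the sample's data.

splits = ["the cat in", "the hat"]  # 2 input splits -> 2 mappers

# Map phase: each mapper emits a (word, 1) pair per word in its split.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffle and sort: group the emitted values by key.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce phase: sum the grouped values for each word.
reduced = {word: sum(ones) for word, ones in grouped.items()}

print(sorted(grouped.items()))  # e.g. ('the', [1, 1]), as in the sample
print(reduced)                  # {'the': 2, 'cat': 1, 'in': 1, 'hat': 1}
```

In real Hadoop the mappers run on different Task Trackers and the shuffle happens over the network, but the data flow is the same.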
- EMC Education Services (5 January 2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Wiley. ISBN 978-1-118-87605-3.
- Nataraj Dasgupta (15 January 2018). Practical Big Data Analytics: Hands-on techniques to implement enterprise analytics and machine learning using Hadoop, Spark, NoSQL and R. Packt Publishing. ISBN 978-1-78355-440-9.