BDA - Big Data And Analytics Capsule/Notes
Topics covered:
- Top Amazing Facts.
- Three Characteristics of Big Data.
- Big Data Sources.
- What is Hadoop?
- What is Hadoop used for?
- Which companies are using Hadoop?
- What is HDFS?
- HDFS Architecture.
- What is Pig?
- What is Hive?
- What is MapReduce?
- Why Sqoop?
- What is Sqoop?
1. Top Amazing Facts
- Over 90% of all the information in the world was created in the past two years.
- Every minute we send 204+ million emails, generate 1.8+ million Facebook likes, send 278+ thousand tweets, and upload 200,000+ photos to Facebook.
- Google alone processes on average over 40 thousand search queries per second.
- Big data could be the next big thing in the IT world.
- The first organisations to embrace it were online startups. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the start.
- Walmart handles more than 1 million customer transactions every hour.
2. Three Characteristics of Big Data.
- Volume (data quantity)
- Velocity (data speed)
- Variety (data types)
3. Big Data Sources.
- Users
- Applications
- Systems
- Sensors
- Mobile devices, microphones, readers/scanners, science facilities, programs/software, social media, and cameras.
4. What is Hadoop?
- Hadoop is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
5. What is Hadoop used for?
- Searching,
- Log processing,
- Recommendation systems,
- Analytics,
- Video and image analysis,
- Data retention.
6. Which companies are using Hadoop?
- Amazon/A9, Facebook, Google, IBM, Blue Cloud, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo.
7. What is HDFS?
- A distributed file system that runs on large clusters of commodity machines.
7.1. HDFS Architecture (official image released by Apache Hadoop)
[Figure: HDFS architecture. Source: Apache Hadoop.]
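To make the idea concrete, here is a minimal Java sketch (assuming the standard hadoop-client library is on the classpath) that writes a small file to HDFS and reads it back through the FileSystem API; the NameNode address and file path below are placeholders, not values from these notes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a small file to HDFS and read it back.
public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's value.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the client streams data to DataNodes chosen by the NameNode.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: blocks are fetched from whichever DataNodes hold replicas.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```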
8. What is Pig?
- Pig is a data flow language and execution environment for exploring very large datasets.
- Pig runs on HDFS and MapReduce clusters.
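As a rough sketch of what a Pig data flow looks like, the snippet below drives a few Pig Latin statements from Java through the PigServer class (assuming the pig library is available); the file names and field layout are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Minimal sketch of running a Pig Latin data flow from Java via PigServer.
public class PigDataFlow {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE would run on the cluster; LOCAL is for testing.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each statement describes one step of the data flow.
        pig.registerQuery("logs = LOAD 'access.log' AS (user:chararray, url:chararray);");
        pig.registerQuery("byUser = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH byUser GENERATE group, COUNT(logs);");

        // Only now does Pig compile the flow into MapReduce jobs and run it.
        pig.store("counts", "user_counts");
    }
}
```

Each registerQuery call only describes a step of the flow; Pig compiles and runs the whole pipeline as MapReduce jobs when the result is stored.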
9. What is Hive?
- A distributed data warehouse. Hive manages data stored in HDFS (Hadoop Distributed File System) and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data.
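A minimal sketch of what "a query language based on SQL" looks like in practice: the Java snippet below submits a HiveQL query through Hive's JDBC driver (HiveServer2). The connection URL, table name, and credentials are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of querying Hive over JDBC (HiveServer2).
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL-like query into MapReduce jobs
             // that scan the table's files in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```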
10. What is MapReduce?
- MapReduce is a programming model for data processing.
- It was first introduced at Google.
- MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
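The standard word-count example makes the two phases concrete: the map function emits a (word, 1) pair for every word it sees, the framework groups the pairs by key, and the reduce function sums the counts for each word. A minimal sketch using Hadoop's Java MapReduce API:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: the map phase emits (word, 1) pairs, the reduce phase sums them.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();                     // add up all counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It takes the input and output HDFS paths as arguments, e.g. hadoop jar wordcount.jar WordCount /input /output (jar name and paths are hypothetical).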
11. Why Sqoop?
- SQL servers are already deployed abundantly worldwide.
- Nightly processing has been done on SQL servers for years.
- As Hadoop makes its way into the enterprise, there is a need to move part of the data from traditional SQL databases to Hadoop.
- Transferring data using scripts is inefficient and time-consuming.
- Traditional databases already have reporting, data visualization, and similar applications built around them in the enterprise.
- Bringing processed data from Hadoop back to those applications is also needed.
12. What is Sqoop?
- Sqoop is a tool designed to transfer data between Hadoop and relational databases.
- You can use it to import data from a relational database (RDB) such as MySQL, SQL Server, or Oracle into the Hadoop Distributed File System (HDFS).
- Transform the data in Hadoop with MapReduce or Hive.
- Export the data back into the RDB.
- Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop, and then load the data into HDFS. This process is called ETL, for Extract, Transform, and Load.
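As a rough illustration of the import step (not the only way to run Sqoop), the Java snippet below shells out to the sqoop command-line client with typical import flags; it assumes sqoop is installed and on the PATH, and the JDBC URL, credentials, table, and target directory are all placeholders.

```java
// Minimal sketch: drive a Sqoop import from Java by invoking the sqoop CLI.
public class SqoopImportExample {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/sales",   // hypothetical source database
            "--username", "etl_user",
            "--password", "secret",
            "--table", "orders",                        // table to extract
            "--target-dir", "/user/etl/orders");        // where the rows land in HDFS
        pb.inheritIO();                                  // show Sqoop's MapReduce progress
        int exitCode = pb.start().waitFor();
        System.out.println("sqoop import finished with exit code " + exitCode);
    }
}
```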
***Thanks for reading***
Source: O'Reilly Hadoop