Monday, September 16, 2019

Big Data / Hadoop (Hive) / NoSQL

Big Data :

Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process with conventional DBMS tools (such as SQL Server or Oracle).

The challenges include capture, storage, search, sharing, transfer, analysis, and visualization of data.

The three dimensions of Big Data:
  1. Volume   : high-volume data
  2. Velocity : high-speed data
  3. Variety  : different types of data (audio, video, sensor, clickstream, log files)
Handling Big Data:

Hadoop and NoSQL databases are specifically designed to handle big data.

Hadoop:

Hadoop is an open-source framework for distributed storage and processing that is specifically designed to handle big data. It enables the distributed processing of large data sets across clusters of commodity servers. It analyzes the data by applying specific algorithms, such as MapReduce, in parallel across the nodes.

Pillars of Hadoop:
  1. HDFS
  2. MapReduce
  3. YARN
HDFS (Hadoop Distributed File System): A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system.
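
To make the "one big file system" idea concrete, here is a minimal sketch of how a file could be cut into blocks and spread over several nodes. It is only an illustration: the function name `place_blocks`, the node names, and the round-robin placement are assumptions made for this example, and real HDFS uses a rack-aware placement policy and much larger blocks.

```python
def place_blocks(file_size, block_size, nodes, replication):
    """Split a file into fixed-size blocks and assign each block
    to `replication` distinct nodes (simple round-robin placement)."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive nodes, wrapping around the node list.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster: a 500 MB file, 128 MB blocks, 4 nodes, 3 copies of each block.
layout = place_blocks(
    file_size=500,
    block_size=128,
    nodes=["node1", "node2", "node3", "node4"],
    replication=3,
)
for block_id, replicas in layout.items():
    print(block_id, replicas)
```

Because every block lives on more than one node, losing a single node does not lose the file, which is where the fault tolerance of the cluster comes from.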

MapReduce: The idea behind MapReduce is that Hadoop first maps a large data set into intermediate key/value pairs and then performs a reduction on those pairs to produce a specific result.
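
The map-then-reduce idea can be sketched in plain Python with the classic word count. This is a single-process imitation, not Hadoop's Java API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


lines = ["big data needs big clusters", "hadoop handles big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

On a real cluster the map calls run on the nodes that hold each block of input, and the framework performs the shuffle over the network before the reducers run.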

YARN (Yet Another Resource Negotiator): It assigns cluster resources, mainly CPU and memory, to the applications running on a Hadoop cluster.

Hadoop Advantages:

  • Scalable
  • Cost Effective
  • Flexible
  • Fault Tolerant

Hive: Hive is a data warehouse (DWH) system designed to work on top of a Hadoop cluster. It targets developers with an SQL background who are not comfortable writing Java code for MapReduce programs.
Hive provides an SQL-like query language called HiveQL (HQL), which is automatically compiled into MapReduce jobs that run on the Hadoop cluster. This facilitates ad hoc queries and the analysis of large data sets stored in Hadoop.
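
As a rough picture of what "compiled into MapReduce jobs" means, the sketch below hand-translates one GROUP BY query into map, sort, and reduce steps. The `employees` rows and column names are hypothetical, and the Python is only an imitation of the phases, not the plan Hive actually generates.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical table rows: (employee, department, salary)
employees = [
    ("asha", "sales", 100),
    ("ben", "ops", 80),
    ("chen", "sales", 120),
    ("dina", "ops", 90),
]

# Query being imitated (HiveQL):
#   SELECT department, SUM(salary) FROM employees GROUP BY department;

# Map: project each row to a (group key, value) pair.
mapped = [(dept, salary) for _name, dept, salary in employees]

# Shuffle/sort: bring equal keys next to each other.
mapped.sort(key=itemgetter(0))

# Reduce: aggregate the values that share a key.
totals = {
    dept: sum(salary for _dept, salary in rows)
    for dept, rows in groupby(mapped, key=itemgetter(0))
}
print(totals)
```

The SQL developer writes only the one-line query; the map, sort, and reduce plumbing is what Hive takes off their hands.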

NoSQL (Not Only SQL):

A NoSQL database environment is a non-relational and largely distributed DBMS. NoSQL databases are sometimes referred to as cloud databases, non-relational databases, or big data databases. They offer a schema-less data model, horizontal scalability, and a distributed architecture.
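
A small sketch can show two of these properties together: records that do not share a fixed schema, and keys spread across several nodes by hashing. Everything here is an assumption for illustration (the `shard_for` and `get` helpers, the three in-memory "nodes", the sample records); real databases typically use more robust schemes such as consistent hashing, because simple modulo placement moves most keys when the node count changes.

```python
import hashlib


def shard_for(key, num_nodes):
    """Pick a node index by hashing the key (stable across runs,
    unlike Python's built-in hash() for strings)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


# Schema-less records: each document carries its own set of fields.
documents = {
    "user:1": {"name": "Asha", "email": "asha@example.com"},
    "user:2": {"name": "Ben", "tags": ["admin"], "last_login": "2019-09-16"},
    "user:3": {"name": "Chen"},
}

# Three hypothetical nodes, each modeled as an in-memory dict.
nodes = [dict() for _ in range(3)]
for key, doc in documents.items():
    nodes[shard_for(key, len(nodes))][key] = doc


def get(key):
    """Look a key up on the one node that is responsible for it."""
    return nodes[shard_for(key, len(nodes))].get(key)


print(get("user:2"))
```

Adding capacity means adding more nodes to the list rather than buying a bigger server, which is what horizontal scalability refers to.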

Types of NoSQL DB:

  • Key-value store (Ex. Redis, Riak)
  • Column store (Ex. Cassandra, HBase, Bigtable)
  • Document store (Ex. MongoDB)
  • Graph database (Ex. Neo4j)
Characteristics of NoSQL:
  • Non-relational (can give better performance)
  • Often open source (low cost)
  • Cluster friendly (scalable, no single point of failure)
  • Schema-less (flexible data model)
Why NoSQL is better than SQL:
  • More flexible data model
  • Better performance
  • Scalability
  • Lower cost than RDBMS
  • Continuous availability
[But for transaction-level applications, where data safety and security are more important, an RDBMS is still the winner.]

Top NoSQL databases: Cassandra / MongoDB / CouchDB / HBase / Cosmos DB

The CAP theorem is applied when configuring a NoSQL DB in a cluster.

C : Consistency / A : Availability / P : Partition Tolerance
Only two of these three can be selected when configuring the NoSQL DB in the cluster (CA/CP/AP), depending on the application's behavior and requirements.
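
The trade-off is easiest to see when the network between replicas breaks. The toy model below is an assumption-laden sketch (the `Replica` and `TinyCluster` classes and the two-replica setup are invented for this example): in CP mode the cluster rejects writes during a partition, while in AP mode it keeps accepting them and lets one replica go stale.

```python
class Replica:
    """Holds one copy of a single value."""

    def __init__(self):
        self.value = None


class TinyCluster:
    """Two replicas of one value; `mode` is 'CP' or 'AP'."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when a and b cannot talk

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent: refuse the write instead of diverging.
                raise RuntimeError("unavailable during partition")
            # AP: stay available, but only the reachable replica is updated.
            self.a.value = value
            return
        self.a.value = self.b.value = value

    def read_from_b(self):
        return self.b.value


# CP cluster: the write during the partition is rejected.
cp = TinyCluster("CP")
cp.write("v1")
cp.partitioned = True
try:
    cp.write("v2")
    cp_accepted = True
except RuntimeError:
    cp_accepted = False

# AP cluster: the write succeeds, but replica b still serves the old value.
ap = TinyCluster("AP")
ap.write("v1")
ap.partitioned = True
ap.write("v2")
stale_read = ap.read_from_b()

print(cp_accepted, stale_read)
```

Which behavior is acceptable depends on the application: a payment ledger would usually prefer the rejected write, while a product catalog can often tolerate a briefly stale read.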

Hadoop VS NoSQL:

Hadoop and NoSQL appear similar: both manage large and rapidly growing data sets, both can handle a variety of data formats, and both can use commodity hardware joined together as a cluster.

Hadoop: A distributed storage and processing framework that allows massively parallel computing and is suited to data analysis. It works through batch operations, which fit analytical computing tasks.

NoSQL: A distributed database infrastructure that can handle the heavy demands of big data. NoSQL is designed for real-time applications that need to query the data so users can drill down into it as it changes. It allows high performance.
