Manish - Database Specialist

Tuesday, November 19, 2019

Saturday, November 9, 2019

A good BI solution can improve/manage any business to increase the revenue and cut the unnecessary cost. With the help a good BI solution, Management/Business Analyst can see the business impacting stats with drill down details and Root cause (Positive & Negative) in a short time and take the immediate appropriate decision to improve the benefits and stop the losses.

Most of the top domain companies use the BI solution to improve/manage their business. BI Solution is actively used in Retails, Telecom, Finance & Insurance, Transport & Logistics, Stock Exchanges, KPO/BPO.

Machine Learning can play a best role with any BI solution just like butter on the bread that can be used for data forecasting, predictive analytics, Recommendation system.

A good BI solution can have the following features.

Dashboard
Trending
Drill Down Analysis
Reporting & Report Builder (Ad hoc Reporting)
Alert & Notification
What if Analysis
Recommendation Analytics
Predictive Analytics
Real time data Analytics
Customer Analytics

Tuesday, October 8, 2019

Tableau VS Power BI

Tableau and Power BI both are great BI tool that are used for Data visualization.

Tableau :

Tableau offers 5 main product:

Tableau Desktop
Tableau Server
Tableau online
Tableau Reader (Free)
Tableau Public (Free)

Tableau Desktop

This tool is common for everyone. With this tool we can create stories, dashboard, workbooks and visualization but can not publish and share for other users.

This tool will not secure visualizing data and this come into two versions:
1) Personal & 2) Professional.
Personal : can not connect with all listed Data Sources

Professional : can connect with all the listed Data Sources

Tableau Server

It is a secured internet portal where users can easily publish and share their Tableau workbooks by simply providing a link to a work book. This does not required Tableau desktop to consume the information it is just a browser.

Tableau Online

It is hosted in cloud version of SaaS. It is secure but limited for editing of workbook, dashboard etc. It is cloud based Tableau server solution

Tableau Public

It is a free tool offered by Tableau organization. It is mainly used for publishing interactive data in online which can be accessible by everyone. This in not secure one as anyone can access the data and download.

Tableau Reader

A free tool that allows you to open but not edit tableau workbook. You can only read what others have created.

Power BI :

Power BI come into 3 Elements

Desktop (Free)
Services
Mobile

Power BI Desktop

It a window desktop application that is used to only create the Report & Visualization but you can not publish and share those for other users.

Power BI Service

This is SaaS Application hosted in cloud you can create the Report & Visualization on Power BI Desktop and publish and share those into Power BI Service.

Power BI Mobile

Using this Power BI Mobile Service you can create the Report & Visualization for windows phone/ Tablets as well as IOS and Android devices

Tableau VS Power BI

Power BI is a Microsoft Tool and Table develop by Tableau organization
Power BI is easy to learn than Tableau
Tableau has more features and graphics than Power BI.
Power BI is more compatible to work with MS Share Point and Azure Services.
Tableau is more powerful to work for big data analytics.
Power BI service is a shared service hosted on cloud and tableau also have dedicated server (Tableau Server) product.

There are others good BI tool like (Qlikview, SAS, Looker & more ) available in the market that can be use for a BI application dependent on organization requirement/deal/cost.

Thursday, October 3, 2019

DW @ Type of Dimensions and Facts

In a DW, a Dim is a data element that categories each item in a data set and used for slicing and dicing the data in DW. The primary function of Dim is to provide the filtering, grouping and labeling. And a Fact table is a table that have the numerical data (Measures) that can be aggregated based on Dim data elements.

Type of Dims

Confirmed Dim: These are dims that have the same meaning for all the fact with which those are related (Ex. Date Dim that have the same meaning for all the facts)
Junk Dim: These are dims that have the miscellaneous data (like flag & indicator) that do not fit in the base Dim table (Some time we need to optimize a Dim that have some attributes that are important but not required so these attributes are moved in to a New Dim table that Dim table called Junk Dim)
Degenerated Dim: These are dims that are derived from the fact table and does not have its own Dim table (Sometime there are some attributes coming in fact table which are not fixed but we need to categories the data based on these attributes, so we create a Dim from Fact table for those attributes and that Dim table called Degenerated Dim)
Role-Playing Dim: A Dim that can play different roles in a fact table. (Ex. Date Dim can be used for the Order Date, Shipping Date, Invoice Date in a Order Line fact table)
Slow Changing Dim (SCD): A Dim is called SCD that can change over time. It may change immediately and sometimes rapidly. It has 3 type of data handling methods (Depend on project requirement):

Type-1) That will not maintain any history will replace the existing data.
Type-2) That will have only previous history
Type-3) That will maintain all the history along With Effective From

Rapid Changing Dim (RCD) : A SCD that will change rapidly is called Rapid Changing Dim.

Type of Facts:

Additive: These facts can be aggregated for all the Dims in fact table to make any information
Semi Additive: These facts can be aggregated for some of the Dims not for all Dim in fact table to make any information
Non-Additive: These facts cannot be aggregated for any of the Dims in fact table to make any information
Fact Less fact Table: These fact table does not have any Fact and these table are created only to get the occurrence of actions/events.

Saturday, September 21, 2019

MDM (Master Data Management) is not DWH

Both MDM and DWH have the same concept (Getting the data form different sources and store it in a separate Database) but purpose of both are different As DWH is to make it easier to perform analytics & BI on historical data from transnational system as well as MDM system whereas MDM :

MDM:

It is to defining a process of collecting the data from various sources and applying the standard rules and business processes to building a single view of data and finally distributing the golden version of data to various system. MDM is created for data auditing purpose.

Simply can say, MDM reconcile the data from various systems to create a single view of master data for auditing.

How MDM works

MDM creates a new version of data every time changes is made (in the source system) along with information about who is making the changes. You can track it back & look for delta between various version, when it was made & who did it. By having this level of auditing history, you organization is helpful to achieve complete regularly compliance and also to provide enterprise information management program.

MDM store data about entities like

People (Customer, Vendors, Employees, Patients)
Things (Product, Business Unit, Parts, Equipment)
Place (Location, Stores, GEO Area)

Simple MDM Example:

Let suppose an organization (ABC) have 4 business unit (A Company / A Hospital / A School / Retails store) and A people (M) is an employee of that company, a patient of that Hospital , a teacher in that school and a customer of that Retail store but company ABC don't know about M where this people is part of our Business Unit. So in that case ABC can have a MDM to capture all the details and actions done by M for tracking and audit purpose.

DWH / BI / Data Science

Business Intelligence (BI)

BI is a technology driven process to gathering, storing and analyzing the data in such a way to make a better business decision.

BI Solution help a company in making business decision faster, accurate and market oriented.

Data Warehouse / Data Mart (DW / DM)

To implement the BI solution, we need to design the data model in such a way that should be perfect for reporting ans analysis purpose. This type of databases are called DW/DM.

Data Mart is a subset DW that are design for a company's specific department like (Finance, HR, Marketing...) whereas DW are design for complete company point of view (to combine the data for all department of company) to get company level Reporting and Analysis.

There are two approach to store the data into DW/DM

Dimensional Approach (For Analysis Purpose)
Normalized Approach (For Reporting Purpose)

Dimensional Approach:

In this approach, the data is divided and stored into two form Dimension and Facts in the database.

The type of data modelling for Dimensional Approach is called Dimensional Modeling.

Facts : this is data the we need to analyze mathematically (numerical data that can be aggregated)
Dimension : this is the data that for which we need to analyze the fact data.

Ex. If we need to analyze the product sales on the basis of Date & Product Name so here Date & product name are Dim and Product Sales is the measure come from fact data.

Dimensional Modeling can be done in two form:

Star Schema
Snowflake Schema

Star Schema: In the start schema design, a single object (Fact Table) site in the middle and is rapidly connected to the other surrounding objects (dimension lookup table) like a star.

Snowflake Schema: The snowflake schema is a extension of star schema where each point of the star explodes into more points. Can say sometime we need to normalize the dimension to optimize the star schema by breaking the one dimension into multiple dimension that called snowflake schema.

Advantage : Retrieving the data from DW is very quickly.

Disadvantage : Adding the New data and updating the old data in DW is big challenge.

Normalized Approach:

In this approach, the data is stored in DW in 3NF to easily insert, update and delete but data accessing is slow because of multiples joins can be require to perform a query.

* Data Vault (DV) Approach: This is hybrid data model approach (Star Schema + 3NF)

There are some limitation in Dim & 3NF data model approach (3NF down the query performance and Dim model is not scalable). And nowadays companies focus on the EDW (Enterprise Data warehouse) where they can store the historical data with scalable data model that can

adapt the new data sources
absorb changes to existing source
adapt new business rules

So on that context DV approach is most valuable.

In this approach, the data is divided and stored into three type of table form:

Hub : That have the Business Key
Satellite : That contain the descriptive attribute for the Business Key
Link : That have the relationship between Business Key & descriptive attribute Dimension data

Business Intelligence VS Data Science: Data science is an evolution of BI.

BI can help to interpret the past data for reporting and Descriptive Analytics that means BI can help to tell what is happening and why is happening and cant be exactly tell what will happen in future.

Data Science use the Machine Learning Technic to analyze the past data (trends or patterns) to make future predictions more accurately. Data Science is more used for Predictive Analytics or Prescriptive Analytics.

Tuesday, September 17, 2019

Lambda Architecture for Big Data Application

There are three layers involved in Lambda Architecture

Batch Layer
Speed Layer
Servicing Layer

Lambda Architecture :

The Lambda Architecture is an accommodation of Batch Layer & Speed Layer, where data goes to both the layers redundantly to get processed. The combination of both layers is called Lambda Architecture.

Batch Layer : Where data come for batch processing.
Speed Layer : Where data come for real time processing.
Servicing Layer : Where data is stored after processing.

First data come at data center and then go to both layers.

At Batch layer where data is processed and stored at servicing layer in some form of NoSQL/Hadoop.

At Speed Layer, the data is processed in real time for different function/application

Use of Lambda Architecture :

This model is used where we need to show the real time data at real time application as well as in parallel we need to store the data for further analysis on historical data for any BI analysis application.

Monday, September 16, 2019

Big Data / Hadoop (Hive) / NoSQL

Big Data :

Big data is a blanket term for any collection of data sets so large and complex that is become difficult to process using normal DBMS tool (Like SQL Server / Oracle ....)

The challenge include capture, storage, search, sharing, transfer, analysis, and visualization of data.

Big data Three Dimension:

Volume : High Volume Data
Velocity : High Speed Data
Variety : Different type of Data (Audio/Video/Sensor/Click Stream/Log files)

Handle of Big Data:

Hadoop & NoSQL Database are Specially design to handle the big data.

Hadoop:

It is a open source distributed file system framework/tool that is specially design to handle the big data. It enables the distributed processing of large data sets across clusters of commodity servers .It works to interpret or parse the result of big data search through specific algorithm and methods.

Pillar of Hadoop:

HDFS
Map Reduce
Yarn

HDFS (Hadoop Distributed File System): It is a file system that spans all the nodes in a hadoop cluser of data storage. It links together file system on many local nodes to make them into one big file system.

Map Reduce: The idea behind the Map Reduce is that hadoop first map a large data sets and them perform a reduction on that contents for specific result.

YARN (Yet Another Resource Negotiator): It assign the CPU, Memory and storage to application running on hadoop cluster.

Hadoop Advantage:

Scalable
Cost Effective
Flexible
Fault Tolerance

Hive: Hive is a DWH system that is specially design to work on top of hadoop cluster for those developers who they are from SQL background that don't have understanding to write the JAVA code for map reduce program.
Hive is setup on top of Hadoop that have a SQL like query language called HQL (That auto create the MAP Reduce program to connect with hadoop cluster) to facilitates the ad hoc query and the analysis of large data set stored in hadoop.

NoSQL (Not Only SQL):

A NoSQL database environment is a non relational and largely distributed DBMS. NoSQL DB sometimes referred as Cloud DB, Non Relational DB, Big Data DB. It has a Schema less data model. horizontal scalability, distributed architecture.

Types of NoSQL DB:

Key value pair (Ex. Cassandra)
Column Store (Ex. HBase, Big Table)
Document Store (Ex. MongoDB)
Graph Database (Ex. Neo4J, Ployglot)

Characteristics of NoSQL:

Non Relational (Better Performance)
Open Source (Low Cost)
Cluster Friendly (Scalable and No Failure)
Schema Less (Flexible data model)

Why NoSQL is better than SQL:

More flexible Data model
Better Performance
Scalability
Low cost than RDBMS
Continues Availability

[But Transaction level application where data safety & Security is more important still RDBMS is winner]

Top NoSQL Database : Cassandra / MongoDB / CouchDB / HBase / CosmosDB

C.A.P. Theorem is implemented while configure the NoSQL DB in the clusters on the basis of Application behavior and requirement.

C : Consistency / A : Availability / P : Partition Tolerance

Out of these three Only two can be select while configuring the NoSQL DB in the cluster (CA/CP/AP) depend on Application behavior and requirement.

Hadoop VS NoSQL:

Hadoop and NoSQL appear to be similar both manage the large and rapidly growing the data sets, both can handle a variety of data format and both can leverage the commodity hardware together as a cluster.

Hadoop : is a distributed file system that allow for Massively parallel computing, and hadoop is suited for Data Analysis. The process behind it is batch operation suited for analytical computing task.

NoSQL: It is a distributed database infrastructure that can handle the heavy demand of big data. NoSQL is design the real time application that provide the ability to query the data so user can drill down into data as it change. It allow the high performance.

Sunday, September 15, 2019

Deep Learning behind AI

Deep Learning (DL) :

It is a branch of ML where algos learn independently from excessive amount of information similarly to peoples, there algos get smarter with experience by gathering and processing of more and more data.

How Deep Learning work:

The magic behind Deep learning network is discovering the pattern & structure behind vast amount of data. The computation model consists of multiple layers, called neural network , where data is processed.

Neural Network

We have three elements in the neural network

Input Layer : Which is the data the we want to analyze
Hidden Layer : At least two hidden layer which complete the computation with Deep learning algo (The real magic happen in this hidden layer)
Output Layer : Where we have calculated result

* Deep learning algo just do two tasks : 1) Correlation and 2) Reduction in the Hidden Layers

Implementation of Deep Learning:

Deep Learning program can be achieve by NLP (Natural Language Processing) that work on Neural Network concept (Just like a human brain work) And To implement the NLP we need some library or methods that are provided by three framework/library

Keras
Tensorflow
PyTourch

Keras : It is an open source Neural Network library written in Python. It is capable of running o top of R & Tensorflow framework. It has only high level method to perform only Deep Learning task (Not ML Task).

Tensorflow : It is an open source framework develop by google that have both low & high level methods to implement the ML & DL task.

Pytourch : It is an open source ML library for python based on tourch. It is use of NLP application.

Developed by Facebook AI research group.

Deep Learning is not AI:

Deep Learning is a branch of AI, AI extends way further As Deep learning can not think itself it can make decision based on data and instruct it was fed.

AI is supposed to the intimation of human consciousness and independent thinking process perform bu computer node.

Why Deep Learning is imp:

Deep learning can help to tackle the biggest problems of the words and can improve the human lives.

Saturday, September 14, 2019

MS Azure Service

MS Azure:

It is a Microsoft PaaS (Platform as a Service) application hosted on the cloud.

Some great MS Azure Services

Database (SQL Database / SQL Data warehouse / Cosmos DB [No SQL DB] )
IOT / Analytics (Azure Stream Analytics using IOT Hub, Event Hub & Notification Hub)
Machine Learning Studio
HD Insights (Hadoop – Cluster of nodes at Cloud)
Data Factory (SSIS at Cloud with no transformation)
Security (Active Directory)
Azure Storage

Azure Blob: A massively scalable object store for text and binary data.
Azure Files: Managed file shares for cloud or on-premises deployments.
Azure Queues: A messaging store for reliable messaging between application component
Azure Tables: A NoSQL store for schema less storage of structured data.
Azure Data Lake: to store any type of data structure and unstructured (Audio/Video/Image)

Machine Learning / R / Power BI

Machine Learning:

It is a program that learn itself with the historical data and decide the future result with applying the statistical function (Mean, Median, SD, VAR, Correlation etc) and that program can be implemented by language (Python/R)

Types of ML:

o Supervised (That is already trained to give the answer based on historical data)
o Unsupervised (That is not trained and worked on clustering method based on historical data)

Three type of problems can be solved by ML

o Classification
This is a Supervised learning problem and can be used for Forecasting the binary answer (Yes/No)
Best ML Algo : Decision Tree

o Regression
This is a Supervised learning problem and can be used for Forecasting for continues variable
Best ML Algo : Linear Regression

o Clustering
This is a Unsupervised learning problem and can be used for Recommendation
Best ML Algo : K-Means

Top ML Algorithms:

- Linear Regression
- Logistics Regression
- Decision Tree
- Random Forest
- KNN (K-Nearest Neighbor)
- K-Means
- NLP (Natural Language Processing)
[Deep Learning worked on Neural Network / Use for AI Application]

R:

R is open source/platform independent language that can be used for data analytics (statistical computing and graphics) and ML application.

Power BI:

It is a data presentation tool (With lots of graphics) for any BI solution application and that can connect with any data source and using that any Tabular/Graphics report/dashboard can be developed in minutes and share/integrated in any web application.

It has great feature (Streaming Data Set) that can be used for real time data analytics.

Data Modeling

Data Modeling (DM)

Data modeling is a process to design the data model for a database to organize the data for a better management and faster access.

Data Modeling Technic

Data Modeling Technic depend on the type of application behavior for which we need to design the data model.

DM for a Transnational Application where CUD operation frequently occur this is called Relational Database modeling (Normalization / Less indexing / No partitioning )
DM for a Reporting Application where Read operations frequently/heavily occur this is called Reporting DWH modeling (De-normalization / Good indexing / with partitioning on File group)
DM for a Analytics Application where we need to analyze the data in multi dimension view with fast access this is called Analytics DWH modeling (Dimension Modelling / Data Vault Modelling)
DM for a Big data Application where Data come in VVV this is called Nosql/Cloud Data Modeling (Key-Value store / Column store / Document store)

Type of DM

Conceptual DM : Where we design the DM with entity and its relationship
Logical DM : Where we design the DM with entity, attribute and attribute relationship
Physical DM : Where we create the physical database as per logical DM

Tuesday, December 11, 2018

Curr - Prev - Next Record Operation

DECLARE @T_SiteRoute TABLE (RouteOrder INT,RouteOrderRank int,DeviceGroupName NVARCHAR(50))

INSERT INTO @T_SiteRoute(RouteOrder, DeviceGroupName)
SELECT RouteOrder = 1, DeviceGroupName = '4G'
UNION All SELECT RouteOrder = 2, DeviceGroupName = '3G'
UNION All SELECT RouteOrder = 3, DeviceGroupName = 'CSR'
UNION All SELECT RouteOrder = 4, DeviceGroupName = 'LR1'
UNION All SELECT RouteOrder = 5, DeviceGroupName = 'LR2'
UNION All SELECT RouteOrder = 6, DeviceGroupName = 'LR3'
UNION All SELECT RouteOrder = 7, DeviceGroupName = 'LR4'
UNION All SELECT RouteOrder = 8, DeviceGroupName = 'MR1'
UNION All SELECT RouteOrder = 9, DeviceGroupName = 'MR2'
UNION All SELECT RouteOrder = 10, DeviceGroupName = 'MR3'
UNION All SELECT RouteOrder = 11, DeviceGroupName = 'HR1'
UNION All SELECT RouteOrder = 12, DeviceGroupName = 'HR2'
UNION All SELECT RouteOrder = 13, DeviceGroupName = 'EXFO'

UPDATE T1 SET T1.RouteOrderRank = T2.RouteOrderRank
FROM @T_SiteRoute T1
INNER JOIN (
SELECT T.RouteOrder ,
RouteOrderRank = ROW_NUMBER() OVER (ORDER BY T.RouteOrder) ,
T.DeviceGroupName
FROM @T_SiteRoute T
) T2 ON T1.RouteOrder = T2.RouteOrder

SELECT
Curr.RouteOrderRank
,Curr_DeviceGroupName = Curr.DeviceGroupName
,Nxt_DeviceGroupName = nxt.DeviceGroupName
FROM @T_SiteRoute Curr
LEFT OUTER JOIN @T_SiteRoute Nxt ON
Nxt.RouteOrderRank = (SELECT MIN(RouteOrderRank) FROM @T_SiteRoute WHERE RouteOrderRank > Curr.RouteOrderRank)

SELECT
Curr.RouteOrderRank
,Curr_DeviceGroupName = Curr.DeviceGroupName
,Prev_DeviceGroupName = Prev.DeviceGroupName
FROM @T_SiteRoute Curr
left OUTER JOIN @T_SiteRoute Prev ON
curr.RouteOrderRank = (SELECT MIN(RouteOrderRank) FROM @T_SiteRoute WHERE RouteOrderRank > Prev.RouteOrderRank)