7 July 2018
A data warehouse is a storage platform that contains historical data derived from transactional/relational databases. Data warehousing is the process of collecting, storing, and delivering decision-support data for some or all of an enterprise.
Because it includes data from different sources, it helps separate the analysis workload from the transaction workload. It is designed for query and analysis rather than transaction processing, so that a business can consolidate data from several sources.
In addition to being an RDBMS (Relational Database Management System), a data warehouse often includes an ETL (Extract, Transform, Load) solution, an OLAP (Online Analytical Processing) engine, and client analysis tools, and can manage the process of gathering data and delivering it to business clients.
A data warehouse consists of three tightly integrated components.
System tables: The system tables store metadata about the data in the data tables. Both the data and system tables are stored as files using the underlying operating system’s file system.
Metadata is collected to improve supportability of the environment and to orient users to the meaning of the data.
There are two types of metadata used in an Enterprise Data Warehouse:
Technical Metadata – data about how data is sourced, loaded, transformed, and managed.
Business Metadata – data that helps users understand the business context, meaning, and appropriate usage of the data.
The Integration Layer is the heart of the Integrated Data Warehouse. The Integration Layer contains the lowest possible granularity available from an authoritative source, in near Third Normal Form (3NF). The term ‘near 3NF’ is used because there may be requirements for slight denormalization of the base data. This is done on an exception basis. All 3NF tables will be defined using the Natural (or Business) keys of the data. Surrogate keys will not be used.
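A near-3NF base table keyed on natural keys can be sketched as follows. This is a minimal illustration using SQLite; all table and column names here are invented, not taken from any specific warehouse:

```python
import sqlite3

# Hypothetical near-3NF Integration Layer tables, each defined on its
# natural (business) key -- store number, item SKU -- with no surrogate keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (
    store_nbr   INTEGER PRIMARY KEY,   -- natural key
    store_name  TEXT NOT NULL
);
CREATE TABLE item (
    item_sku    TEXT PRIMARY KEY,      -- natural key
    item_desc   TEXT NOT NULL
);
CREATE TABLE store_item_day_sales (
    store_nbr   INTEGER NOT NULL REFERENCES store(store_nbr),
    item_sku    TEXT    NOT NULL REFERENCES item(item_sku),
    sales_date  TEXT    NOT NULL,
    sales_amt   REAL    NOT NULL,
    PRIMARY KEY (store_nbr, item_sku, sales_date)  -- composite natural key
);
""")
```

The composite primary key on the sales table is itself built from the natural keys of the related tables, which is what keeps surrogate keys out of this layer.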
Data in the higher layers of the architecture are derived from data in this layer. This layer includes all corporate data that has business value to more than one business area, meaning that it has corporate value. Contained in this layer is the ‘base’ business data.
The source of the data in this layer is a combination of the operational systems, base data, master data and possibly applications that are resident on the EDW (e.g., Marketing Applications, Supply Chain Applications, or other applications which create data on the EDW). Aggregation/summary tables that have broad business use could also be located here.
Guiding Principles for the Integration Layer are:
Enterprise Base Tables: Near-3NF representation of all business data. May also contain summary tables that are applicable across multiple business areas. An example of this could be Store/Item/Day sales that are created by summarizing all sales transactions.
This layer consists of Views that access the tables contained in the Integration Layer. Views are used to define a ‘virtual’ dimensional star schema model to hide the complexity associated with normalized data in the Integration Layer.
A semantic / data access layer provides ease of use for BI developers and ad hoc users. This layer presents data in a format that is easy to use and eliminates the most common joins of the physical tables. The Semantic / Data Access Layer structures provide users with a view to the data. All applications and users consume the data via views. Tableau is the current BI tool for reporting; SQL Assistant is used for ad hoc queries. Security will be managed per line of business through separate sets of semantic views.
Enterprise Views: One-to-one views on the base tables. These include views with write permissions for ETL and ELT applications as well as views that provide read access to base tables. There may also be views that define corporate metrics and logical structures that are used across business areas.
Business Intelligence Reporting Views: Used by the reporting front end and most ad hoc queries. May be a combination of Enterprise and Performance Layer access. May include views to create star schemas or dimensional models to simplify data usage.
Security Views: Used to limit access to any sensitive data based on access rights.
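The idea of a semantic view that hides the physical joins can be sketched in a few lines. This is a toy example in SQLite with invented table names; a real semantic layer would of course cover far more tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized base tables in the Integration Layer.
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT);
CREATE TABLE sale (
    sale_id INTEGER PRIMARY KEY,
    cust_id INTEGER REFERENCES customer(cust_id),
    amount  REAL
);

-- Semantic-layer view: pre-joins the base tables so BI tools and
-- ad hoc users never need to know the physical join.
CREATE VIEW v_sales AS
SELECT s.sale_id, c.cust_name, s.amount
FROM sale s JOIN customer c ON c.cust_id = s.cust_id;
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("INSERT INTO sale VALUES (10, 1, 99.5)")
print(conn.execute("SELECT cust_name, amount FROM v_sales").fetchall())
# → [('Acme', 99.5)]
```

A security view would follow the same pattern, with a WHERE clause (or column list) restricting what each line of business can see.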
If performance requirements dictate better response time than these normalized tables in the Integration Layer can provide, de-normalized versions of them can be created in the Performance Layer, either as physical tables or as other performance structures such as aggregate join indexes (AJIs).
Physical objects will only be created when a need is demonstrated, based on performance requirements and SLAs. Core metrics and key performance indicators that are used across business areas are best defined in this layer.
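A Performance Layer summary structure can be approximated as a physical table built from a base table. The sketch below materializes the aggregate by hand in SQLite with invented names; a Teradata AJI would maintain such a structure automatically:

```python
import sqlite3

# Hypothetical base table, plus a physical summary table materialized
# from it -- a stand-in for a Performance Layer aggregate.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (store_nbr INTEGER, sales_amt REAL);
INSERT INTO sales VALUES (1, 10.0), (1, 20.0), (2, 5.0);

CREATE TABLE sales_by_store AS
SELECT store_nbr, SUM(sales_amt) AS total_amt
FROM sales
GROUP BY store_nbr;
""")
print(conn.execute("SELECT * FROM sales_by_store ORDER BY store_nbr").fetchall())
# → [(1, 30.0), (2, 5.0)]
```

Queries that only need store-level totals can now read the small summary table instead of scanning the detail rows.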
It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star is the fact table, and the points are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas the dimension tables are de-normalized. Although the star schema is the simplest architecture, it is the most commonly used today and is recommended by Oracle. It is an important special case of the snowflake schema and is more effective at handling simpler queries.
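The shape described above can be sketched as DDL: one fact table in the middle, foreign keys radiating out to flat, de-normalized dimension tables. All names below are invented for illustration:

```python
import sqlite3

# Minimal star schema: fact_sales is the center; the three dimension
# tables are the points of the star.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, cal_date TEXT, cal_year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    qty_sold    INTEGER,
    sales_amt   REAL    -- measures live in the fact table
);
""")
```

Note that `dim_product` keeps its category columns inline rather than in a separate table; that de-normalization is what distinguishes a star from a snowflake.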
The snowflake schema represents a dimensional model which is also composed of a central fact table and a set of constituent dimension tables which are further normalized into sub-dimension tables. In a snowflake schema implementation, Warehouse Builder uses more than one table or view to store the dimension data. Separate database tables or views store data pertaining to each level in the dimension.
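The extra level of normalization can be shown on a single dimension. In this sketch (invented names), the category attributes of a product dimension are moved out into their own sub-dimension table:

```python
import sqlite3

# Snowflaked product dimension: category data lives in a normalized
# sub-dimension, costing one extra join compared with a star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_key  INTEGER REFERENCES dim_category(category_key)
);
""")
```

Each level of the dimension hierarchy gets its own table, which is exactly the "separate database tables or views store data pertaining to each level" behaviour described above.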
For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into several star schemas, each of which describes facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design because many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension tables are still large.
Ralph Kimball, the data warehousing guru, proposes three cases where snowflake implementation is not only acceptable but is also the key to a successful design:
Large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions
Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products
Multi-enterprise calendar dimensions because each organization has idiosyncratic fiscal periods, seasons, and holidays
A fact table typically has two types of columns: foreign keys to dimension tables, and measures, which contain numeric facts. A fact table can contain fact data at a detailed or an aggregated level.
A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a dimension has no hierarchies or levels, it is called a flat dimension or list. The primary keys of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help describe the dimensional value; they are normally descriptive, textual values. Dimension tables are generally smaller than fact tables.
Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
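A typical dimensional query sums a measure from the fact table and groups by a descriptive attribute of a dimension. A toy example in SQLite, with invented names and data:

```python
import sqlite3

# Tiny fact/dimension pair: the fact table holds the numeric measure,
# the dimension holds the descriptive, textual attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE fact_sales (
    region_key INTEGER REFERENCES dim_region(region_key),
    sales_amt  REAL
);
INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 25.0);
""")
for row in conn.execute("""
    SELECT r.region_name, SUM(f.sales_amt)
    FROM fact_sales f
    JOIN dim_region r ON r.region_key = f.region_key
    GROUP BY r.region_name
    ORDER BY r.region_name
"""):
    print(row)
# → ('North', 150.0) then ('South', 25.0)
```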
Data integration is a technique for moving data or otherwise making data available across data stores. The data integration process can include extraction, movement, validation, cleansing, transformation, standardization, and loading.
Extract Transform Load (ETL) In the ETL pattern of data integration, data is extracted from the data source and then transformed while in flight to a staging database. Data is then loaded into the data warehouse. ETL is strong for batch processing of bulk data.
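The three ETL steps can be sketched end to end in a few lines. This toy run uses SQLite in place of a real source system and warehouse, and every table and column name is invented:

```python
import sqlite3

# Source system with slightly dirty data (amounts stored as padded text).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, amount TEXT)")
source.execute("INSERT INTO orders VALUES (1, ' 19.99 '), (2, '5')")

# Target warehouse table with clean, typed columns.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")

# Extract: pull rows out of the source.
rows = source.execute("SELECT order_id, amount FROM orders").fetchall()
# Transform: cleanse and standardize while the data is in flight.
clean = [(oid, float(amt.strip())) for oid, amt in rows]
# Load: write the transformed rows into the warehouse.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", clean)

print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
# → [(1, 19.99), (2, 5.0)]
```

In an ELT pattern, the middle step would instead run as SQL inside the staging database after the raw rows had already been loaded.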
Extract Load Transform (ELT) In the ELT pattern of data integration, data is extracted from the data source and loaded to staging without transformation. After that, data is transformed within staging and then loaded to the data warehouse.
Change Data Capture (CDC) The CDC pattern of data integration is strong in event processing. Database logs that contain a record of database changes are replicated near real time at staging. This information is then transformed and loaded to the data warehouse. CDC is a great technique for supporting real-time data warehouses.
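The replay of a database change log can be sketched without any database at all. The log format and target structure below are invented; real CDC tools read the DBMS's own transaction log:

```python
# Toy change log, as it might be replicated to staging: each entry
# records one change (operation, key, and new values where relevant).
change_log = [
    {"op": "INSERT", "id": 1, "name": "Acme"},
    {"op": "UPDATE", "id": 1, "name": "Acme Corp"},
    {"op": "DELETE", "id": 1},
    {"op": "INSERT", "id": 2, "name": "Globex"},
]

warehouse = {}  # id -> row, standing in for the target table

# Replay the log in order, applying each change to the target.
for event in change_log:
    if event["op"] == "DELETE":
        warehouse.pop(event["id"], None)
    else:  # INSERT and UPDATE both upsert the row
        warehouse[event["id"]] = {"name": event["name"]}

print(warehouse)
# → {2: {'name': 'Globex'}}
```

Because only the changes move, rather than full table extracts, the same loop can run continuously and keep the warehouse near real time.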
[Data Warehousing](https://dzone.com/refcardz/data-warehousing)