Introduction to Enterprise Data Warehouse

7 July 2018

A data warehouse is a storage platform which contains historical data that is derived from transactional/relational database.As it includes data from different sources,it helps to seperate analysis workload from transaction workload.It is designed for query and analysis rather than transaction processing so that business can consolidate data from serveral sources.

In addition to being a RDBMS(Relational Database Management Systems), a data warehouse can often consists of an ETL(Extract Transform Load) solution, an OLAP(Online Analytical Processing) engine, client analysis tools, and can manage the process of gathering data and delivering it to business clients.

Data warehouse consists of three tightly integrated components.

  • Data tables The data tables store user data in a table.
  • System tables The system tables store metadata about the data in the data tables. Both the data and system tables are stored as files using the underlying operating system’s file system.

  • SQL query engine. The SQL query engine provides a SQL interface to store and analyze the data in the data tables.

ETL (Extract Transform Load)

ETL is the process of reading data from one or more sources, applying some transformation on the data, and writing it to another data source. Conceptually, it consists of three steps: extract, transform and load. These steps need not be sequential; an application does not have to extract all the data before moving it to the transform step.Once it has extracted a portion of the data, it may run the three steps in parallel.

The extract step involves reading data from one or more operational systems. The data source could be a database, API, or a file. The source database can be a relational database or a NoSQL data store. A file can be in CSV, JSON, XML, Parquet, ORC, Avro, Protocol Buffers, or some other format. The transform step involves cleaning and modifying the source data using some rules. For example, rows with invalid data may be dropped or columns with null values may be populated with some value.

Transformations may include, concatenating two columns, splitting a column into multiple columns, encoding a column, translating a column from one encoding to a different encoding, or applying some other transformation to make the data ready for the destination system.

The load step writes the data to a destination data source. The destination can be a database or file.

Generally, ETL is used for data warehousing. Data is collected from a number of different operational systems, cleansed, transformed and stored into a data warehouse. However, ETL is not unique to data warehousing. For example, it can be used to enable sharing of data between two disparate systems. It can be used to convert data from one format to another. Similarly, migrating data from a legacy system to a new system is an ETL process.

Metadata

Metadata is collected to improve supportability of the environment and to orient users to the meaning of the data.

There are two types of Metadata that is used in Enterprise Data Warehouse

  • Technical Metadata – data about how data is sourced, loaded, transformed, and managed.

  • Business Metadata – data that helps users understand the business context, meaning, and appropriate usage of the data.

Metadata benefits include the following areas:

  • Provides orientation to new team members on both the technical and business teams.
  • Promotes usage by providing context and content information.
  • Promotes a common understanding.
  • Provides documentation on agreements about usage.

The major types of Metadata collected and available include:

  • Source to Target Mappings – this metadata include the transformation rules that are used to convert and load operational data into the data warehouse.
  • Data Load Scheduling dependencies and SLAs – This document provides information on how the data is loaded and includes load dependencies and Service levels for when data is available.
  • Data Security & Usage Guidelines – This metadata is provided by data stewards that own the data and provide the requirements for data security and any usage guidelines for the data. The security requirements are implemented in the database and the usage guidelines are available for all users. Security Views will control access to data by LOB.
  • Logical Data Model – this metadata provides the core attributes, entities, and relationships and helps users understand the data that is available for use.
  • Physical data Model (PDM) – this metadata provides the physical constructs that are used to distribute and access the data and provides the physical table structures in the Integration Layer. This model shows the structures that help development teams produce performant code.
  • Views – the metadata provides information on what Views exist and helps the user find the View that meets their needs. This improves the maintainability of the Views.
  • Corporate Metrics – the business and technical definitions of corporate metrics are defined. These are metrics that have meaning to more than one department. These metrics are available in the Semantic Layer. Whenever users utilize this data they should make sure they are in compliance with the definitions provided.

Slowly Changing Dimensions(SCD)

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is dimensions that change slowly over time, rather than changing on regular schedule, time-base. In data warehouse we need a dimensional attribute in order to report historical data. Some of the examples of these dimensions would be customer address, geograhy ,employee salary etc.

It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

Categories off SCD

Most popular approaches to deal with SCD is given below .

Type 0 - The passive method In this method no special action is performed upon dimensional changes. Some dimension data can remain the same as it was first time inserted, others may be overwritten.

Type 1 - Overwriting the old value In a Type 1 SCD the new data overwrites the existing data. Thus the existing data is lost as it is not stored anywhere else. This is the default type of dimension you create. You do not need to specify any additional information to create a Type 1 SCD.

In this method no history of dimension changes is kept in the database. The old dimension value is simply overwritten be the new one. This type is easy to maintain and is often use for data which changes are caused by processing corrections(e.g. removal special characters, correcting spelling errors).

Type 2 - Creating a new additional record In this methodology all history of dimension changes is kept in the database. You capture attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key(or other durable identifier). Also ‘effective date’ and ‘current indicator’ columns are used in this method.

A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record. Each record contains the effective time and expiration time to identify the time period between which the record was active.

Type 3 - Adding a new column Adding a new column. In this type usually only the current and previous value of dimension is kept in the database. The new value is loaded into ‘current/new’ column and the old one into ‘old/previous’ column. Generally speaking the history is limited to the number of column created for storing historical data. A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.

Type 4 - Using historical table In this method a separate historical table is used to track all dimension’s attribute historical changes for each of the dimension.

Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)

Types of Layers in Enterprise Data Warehouse

Integration Layer

The Integration Layer is the heart of the Integrated Data Warehouse. The Integration Layer contains the lowest possible granularity available from an authoritative source, in near Third Normal Form (3NF). The term ‘near 3NF’ is used because there may be requirements for slight denormalization of the base data. This is done on an exception basis. All 3NF tables will be defined using the Natural (or Business) keys of the data. Surrogate keys will not be used.

Data in the higher layers of the architecture are derived from data in this layer. This layer includes all corporate data that has business value to more than one business area, meaning that it has corporate value. Contained in this layer is the ‘base’ business data.

The source of the data in this layer is a combination of the operational systems, base data, master data and possibly applications that are resident on the EDW (e.g., Marketing Applications, Supply Chain Applications, or other applications which create data on the EDW). Aggregation/summary tables that have broad business use could also be located here.

Guiding Principles for the Integration Layer are:

  • Commit to creating and maintaining a Logical Data Model (LDM) and a Physical Data Model (PDM)
  • Maintain the data as appropriate to meet current and future business needs.
  • Commit to backup, recovery, and business continuity that satisfy business requirements

Enterprise Base Tables : Near 3NF representation of all business data. May also contain summary tables that are applicable across multiple business areas. An example of this could be Store/Item/Day sales that are created by summarizing all sales transactions.

Semantic Layer

This layer consists of Views that access the tables contained in the Integration Layer. Views are used to define a ‘virtual’ dimensional star schema model to hide the complexity associated with normalized data in the Integration Layer.

A semantic / data access layer provides ease of use for BI Developers and adhoc users. This layer presents data in a format that is easy to use and eliminates the most common joins of the physical tables. The Semantic / Data Access Layer structures provide users with a view to the data. All applications and users consume / use the data via views. Tableau is the current BI tool for Reporting. SQL Assistant is used for adhoc queries. Security will be managed per Line of Business through separate sets of semantic views.

Semantic Views benefits

  • Simplification and Usability – provides a business specific view that may reduce attributes and combine tables to simplify usability for applications and for ad hoc access. Views can be used to create dimensional structures that are easier for BI tools to access and use.
  • Enables and implements Security by limiting the data returned based on the user’s access rights.
  • Manages database locking.
  • Provides a logical, more straight-forward view of data for business users and applications; reduces the learning curve to use the data.
  • Protect/isolate application code and user queries from changes to physical table structures.
  • Allows joins to be done in the database in parallel instead of in the application to improve performance. This layer is intended to improve usability of the data and make access to the data easy for both ad hoc users and BI Tools. All access to Integration Layer tables and Performance Layer tables will be through views. This is the external view of the Data Warehouse.

Semantic Layer Guiding Principals

  • Access to Enterprise Data or to application specific data must be performed through a view
  • Use the LOCKING modifier
  • Adopt and apply naming standards
  • Avoid excessive nesting of views
  • Semantic Layer Components and Descriptions

Integration Layer

The Integration Layer is the heart of the data warehouse and includes corporate data in a near 3NF data model.

Enterprise Views : One-to-One view on the base table. These include views with write permissions for ETL and ELT applications as well as views that provide read access to base tables. There may also be views that define corporate metrics and logical structures that are used across business areas.

Business Intelligence Reporting Views : Used by the reporting front end and most ad-hoc queries. May be a combination of Enterprise and Performance Layer access. May include views to create star schemas or dimensional models to simplify data usage

Security Views : Used to limit access to any sensitive data based on access rights

Performance Layer

If performance requirements dictate better response time from these normalized tables in the Integration Layer, de-normalization of these tables can be created in the Performance Layer as either physical tables or other performance structures such as aggregate join indexes (AJIs).

Performance Layer Guiding Principles:

  • Physical objects will only be created when a need is demonstrated; based on performance requirements and SLAs. Core Metrics and Key Performance Indicators that are used across business areas are best defined in this layer.

  • Performance can be improved through aggregates
  • Indexes and Partitions can be used to limit the I/O needed
  • Join Indexes can be used to pre-join data at load prior to application real-time requests

Data Warehouse schema and Modelling

  • Star Schema

It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of fact table and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form(3NF) whereas dimensional tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is most commonly used nowadays and is recommended by Oracle. It is an important special case of snowflake schema and is more effective in handling simpler queries.

  • Snowflake Schema

The snowflake schema represents a dimensional model which is also composed of a central fact table and a set of constituent dimension tables which are further normalized into sub-dimension tables. In a snowflake schema implementation, Warehouse Builder uses more than one table or view to store the dimension data. Separate database tables or views store data pertaining to each level in the dimension.

  • Fact constellation schema

For each star schema it is possible to construct fact constellation schema(for example by splitting the original star schema into more star schemes each of them describes facts on another level of dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.

The main shortcoming of the fact constellation schema is a more complicated design because many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension tables are still large.

When do you use Snowflake Schema

Ralph Kimball, the data warehousing guru, proposes three cases where snowflake implementation is not only acceptable but is also the key to a successful design:

  • Large customer dimensions where, for example, 80 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 20 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions

  • Financial product dimensions for banks, brokerage houses, and insurance companies, because each of the individual products has a host of special attributes not shared by other products

  • Multi-enterprise calendar dimensions because each organization has idiosyncratic fiscal periods, seasons, and holidays

Fact Tables

A fact table typically has two types of columns: foreign keys to dimension tables and measures those that contain numeric facts. A fact table can contain fact’s data on detail or aggregated level.

Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorizes data. If a dimension hasn’t got a hierarchies and levels it is called flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally small in size than fact table.

Typical fact tables store data about sales while dimension tables data about geographic region(markets, cities) , clients, products, times, channels.

References

Enterprise Data Warehouse

Share: Twitter Facebook Google+ LinkedIn
comments powered by Disqus