Monday, March 13, 2006

Beginners Guide to Business Intelligence – What is a Data Warehouse?

This is the second instalment of my blog feature on Beginners Guide to Business Intelligence. This time I am looking at: What is a Data Warehouse exactly?

Business Intelligence Context
First Business Intelligence in my opinion is not just architecture. It is a strategy of information within an organization. All organizations report on the performance of their business processes, services and sales. How they manage and organize information is maintained by a Business Intelligence Architecture, which is made up of several components. A data warehouse is just one such component. The data warehouse is sometimes seen as the corporate repository.

Before we get into the detail of data warehouse definitions lets have a flash back for a bit of data warehouse history.

Birth of Data warehousing
In a sense, companies have been reporting on information since the dawn of computing systems. Data warehousing is really a collection of technologies that have evolved within information systems. Listed below are some of the legacy terms that you may come across when talking about data warehouses.

Executive Information System (EIS)
One of the first attempts at giving executive managers a high level view of business activities, this kind of system was widely adopted in the late 70s and early 80s. Usually these systems would be summary tables built on top of transactional tables. EIS focused solely on the financial view of the organisation with limited information other than budget vs. actual. The tables would be maintained within the source system and loaded once a month and reports would be generated on top of them, almost always printed on line-flow paper.

Management Information System (MIS)
An extension of EIS, the MIS gave business users a wider view of information in the form of Balance Scorecards and other summarised views of information. The term Key Performance Indicators (KPI) was widely adopted for measures that were just not financial based. For the first time summary tables would be maintained independently of the transactional source system. This kind of system was widely adopted in the late 80’s early 90’s. In fact entire departments were named after MIS systems, hence the term MIS Department.

Decision Support Systems (DSS)
The next evolutionary step was the creation of the DSS. A term widely used when discussing OLAP solutions. In the 90’s you were more than likely building a DSS system to support OLAP reporting. This was the era of creating reporting systems for areas of the business no longer just focusing on Financial and Sales reporting. It has to be noted that if is was not for products like Essbase, Cognos and front-end tools like Lotus 1-2-3 and Microsoft Excel; DSS systems may not have been so successful. It was the first time that information was delivered to operational roles outside of transactional systems.

Data Warehouse
The data warehouse really is a central repository combining all of these earlier systems. The early data warehouses were third normal form 3NF databases, taking incredible amounts of time to develop. Due to the large nature of early data warehouse, data integration was heavily used in the creation of data warehouses, requiring special ETL (Extract, Transform, and Load) tools and skills - even worse to develop and create reports against. At this time two distinct trains of thought appeared when it came to defining what a data warehouse should be.

The first general definition was that the existing third normal form system, was the only solution to the eternal business questions that a drive a company. Bill Inmon was the main proponent of this concept stating that data warehouse architects could never predict all possible business questions just using summary tables alone. A data warehouse by its very definition should store data at the lowest level possible providing an Ad-hoc query environment. This created the need to store large amounts of data in a structure, which could never be fully tuned for all possible queries.

Many organisations had invested heavily in a data warehouse without fully realizing all the potential benefits. The original goals of data warehouse systems were to create an environment with the easy access, and usability too large amounts of data. Early data warehouses were too complex in nature and as a result fell well short of this goal.

Something had to change.

That happened to be Ralph Kimball and the concept of de-normalisation. Ralph proposed an ingenious way of data retrieval using existing RBMS systems. The main idea was to create low level transactional (facts) tables with attribute (Dimensions) tables relating to each transaction using an integer surrogate key system.

This allowed the database optimizers to summarize large amounts of data more efficiently. Due to the reduction in the number of tables required for a similar query in a traditional data warehouse. This definition describes the Star schema concept from a high-level technical view. Besides the query performance improvement there was a huge business benefit, for the first time business users had a view of data from an information context.

Another benefit of Star schemas was the business context or subject area. Business users were presented a view of transactional data from a business context. Instead of having to join multiple tables with complex joins to produce a query that only DBAs understood. Users where able to write queries that answered business questions for example, how many products did the company sell for this year compared to last year?

Third Normal Form vs. Star schemas
Kimball vs. Inmon is the most common discussion between Data Warehouse Architects, next to which database engine is the best for data warehousing, It maybe one of the few arguments that will never truly die down.

Proponents of 3NF push the robust nature of data warehouses as the corporate repository/information factory that will move with the business and insulate against change with in source systems.

On the other hand Star schemas provide a high performance user friendly reporting environment, which is the ultimate goal of any data warehouse. Even within the star schema approach there are differing views on whether the star should be snowflake (Relational dimensions with de-normalized fact table) or just solely the de-normalized.

The Next Step: Data Marts
Data warehouses became victims of their own success. Organisations started to realize the benefits but could not wait necessarily for a large system to be put in place, hence the data mart.

Although data warehouses were and still are a successful approach to the corporate data repository, the length of time, cost and complexity to develop and deliver data warehouse projects reduce some of the benefits. This was due to the fact that most organizations would change during the course of long projects.

A data mart tends not to suffer from such issues. As the data mart is a focused repository on one area of the business. Organisations are able to see a quick turn around and delivery of a high value repository that a data mart has to offer.

The most common form of a data warehouse will more than likely be a collection of data marts. This is sometimes called a federated data warehouse. Often this is the most successful form of data warehousing.

No matter what kind of system you are building the goals always the same: the delivery of information. Ultimately this kind of system undertaking should always support the business decision making process.

Want to know more about Data Warehouses? Check out the following references:

Bill Inmon

Ralph Kimball

Great Resources and Articles about Data Warehousing:

Beginners Guide to Business Intelligence: What is Business Intelligence About?
What is Business Intelligence About?

No comments: