Star Schema in Data Warehousing

In its most basic form, a database schema is a representation of the logical configuration of a relational database, in whole or in part. It can also be described as the skeletal structure that presents the entire database through a logical view. It may exist both as a visual representation and as a set of formulae, known as integrity constraints, that govern the database. These formulae are expressed in a data definition language that the system understands, such as the data definition statements of SQL.

The primary role of the schema is to define how the data is organized and how its elements are associated with one another. It contains descriptive detail about the database, which is portrayed through schema diagrams. It also specifies how entities such as tables, procedures and views relate to one another. In general, a schema drawn up by a database designer is meant to assist the programmers whose software will interact with the database. The process of creating a schema is known as data modeling.

The two main types of database schema are:

  • Logical Database Schema: A logical database schema defines the constraints that apply to the data stored in the database. It may define tables, views, integrity constraints and more.
  • Physical Database Schema: A physical database schema lays out how the data is physically stored on a storage system.
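
As a rough illustration of the logical side, the sketch below declares a table with integrity constraints in SQL and lets the database reject a row that violates them. It uses Python's built-in sqlite3 module with an in-memory database; the `customer` table and its columns are hypothetical names chosen for this example, not part of any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Logical schema: a table, its columns, and integrity constraints
# (all names here are hypothetical).
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE,
    age         INTEGER CHECK (age >= 0)
);
""")

conn.execute("INSERT INTO customer (email, age) VALUES ('a@example.com', 30)")

# A row that breaks the CHECK constraint is rejected by the database itself.
try:
    conn.execute("INSERT INTO customer (email, age) VALUES ('b@example.com', -5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(rejected, row_count)
```

How the rows are laid out on disk, which is the physical schema, is left entirely to the database engine in this sketch.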

In database design, the three-schema approach is followed to reduce the complexity of the system and make data retrieval efficient. The goal of the three-schema approach is a system in which every user sees a customized view of the data, according to their permissions and requirements, without having to deal with physical storage directly. The three schemas in this approach are:

  • Internal Level Schema: The Internal Level Schema is the lowest level of data abstraction and defines the physical storage of the database. It does not deal with physical devices directly; instead, it views a physical device as a collection of physical pages. This schema describes what data is stored in the database and how it is stored.
  • Conceptual or Logical Level Schema: The Conceptual (Logical) Schema describes the structure of the entire database for the whole community of users. It holds the definitions of all database entities, their attributes and their relationships, along with security information. This level sits between the internal level and the external level.
  • External Level Schema: The External Level Schema describes the part of the database that a particular user is concerned with, while hiding unrelated details. There may be any number of external views, where an external view is the content of the database as seen by a particular user. Each external view is defined by an external schema, which contains the definitions of the types of external record in that view.
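
One common way to approximate an external view in a relational database is an SQL view. The sketch below, which assumes a hypothetical `employee` table and `employee_directory` view, shows a user-facing view that hides the salary column while the conceptual-level table keeps it. It is only an illustration of the idea, using Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual level: the full table (hypothetical names).
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    department  TEXT NOT NULL,
    salary      REAL NOT NULL
);
INSERT INTO employee (name, department, salary) VALUES
    ('Asha', 'Sales',   52000.0),
    ('Ben',  'Support', 48000.0);

-- External level: a view for users who should not see salaries.
CREATE VIEW employee_directory AS
    SELECT employee_id, name, department FROM employee;
""")

cursor = conn.execute("SELECT * FROM employee_directory ORDER BY employee_id")
visible_columns = [col[0] for col in cursor.description]
directory_rows = cursor.fetchall()
print(visible_columns)
print(directory_rows)
```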

Types of Database Schema

In Data Warehousing, there are three major schemas, which are:

  • Star Schema
  • Snowflake Schema
  • Galaxy / Fact Constellation Schema

Star Schema

The Star Schema is a relational database schema used to represent multidimensional data. It is the simplest form of data warehouse schema and contains one or more fact tables, each referencing a set of dimension tables. It is called a Star Schema because the entity-relationship diagram of a fact table and its dimensions resembles a star. Its advantages are faster queries, easy slicing of data and a structure that is easier to understand.

Snowflake Schema

A schema in which the hierarchies within dimension tables are broken out into separate, simpler tables is known as a Snowflake Schema. It is an extension of the Star Schema and is named for the snowflake-like shape of its diagram.

Galaxy Schema / Fact Constellation Schema

The Galaxy Schema, or Fact Constellation Schema, is a schema design that contains two or more fact tables which share dimension tables. Since this schema can be seen as a collection of stars, it is called a Galaxy Schema.
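
A minimal sketch of the idea, using Python's built-in sqlite3 module: two fact tables, `fact_sales` and `fact_shipments`, both point at the same `dim_product` and `dim_date` tables. All table names, column names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT);

-- Two fact tables share the same dimension tables.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    units_sold  INTEGER
);
CREATE TABLE fact_shipments (
    product_key   INTEGER REFERENCES dim_product(product_key),
    date_key      INTEGER REFERENCES dim_date(date_key),
    units_shipped INTEGER
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_date VALUES (20240101, '2024-01-01');
INSERT INTO fact_sales VALUES (1, 20240101, 10), (2, 20240101, 4);
INSERT INTO fact_shipments VALUES (1, 20240101, 8), (2, 20240101, 4);
""")

# Aggregate each fact table separately, then combine on the shared dimension.
rows = conn.execute("""
    SELECT p.product_name, s.sold, h.shipped
    FROM dim_product AS p
    JOIN (SELECT product_key, SUM(units_sold) AS sold
          FROM fact_sales GROUP BY product_key) AS s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(units_shipped) AS shipped
          FROM fact_shipments GROUP BY product_key) AS h
      ON h.product_key = p.product_key
    ORDER BY p.product_name
""").fetchall()
print(rows)
```

Aggregating each fact table before joining avoids multiplying rows when both fact tables have several rows for the same product.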

Star Schema

In computing, the Star Schema is the simplest style of data mart schema and is extensively used in the development of dimensional data marts and data warehouses. It gets its name from its shape: the fact table sits at the center and the dimension tables surround it, like a star. This method splits business processes into facts, which hold the quantitative data about the business, and dimensions, which are descriptive attributes related to the data stored in the fact table.

In general, fact data may include sale price, quantity, time, distance and other measures. Dimension tables, on the other hand, may hold product size, salesperson details, product code, delivery location and more. In the special case where a Star Schema has many dimensions, it is sometimes called a centipede schema because of its shape.
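
The layout described above might look like the following sketch, written with Python's built-in sqlite3 module. The table and column names are assumptions loosely based on the examples in the text (sale price, quantity, product size, salesperson, location); a real warehouse would pick its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_code TEXT,
    product_size TEXT
);
CREATE TABLE dim_salesperson (
    salesperson_key INTEGER PRIMARY KEY,
    full_name       TEXT
);
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    city         TEXT
);

-- Fact table at the center: foreign keys plus numeric measures.
CREATE TABLE fact_sales (
    product_key     INTEGER REFERENCES dim_product(product_key),
    salesperson_key INTEGER REFERENCES dim_salesperson(salesperson_key),
    location_key    INTEGER REFERENCES dim_location(location_key),
    sale_price      REAL,
    quantity        INTEGER
);
""")

# Each foreign key of the fact table is one "point" of the star.
# Column 2 of PRAGMA foreign_key_list is the referenced table.
points = sorted(row[2] for row in conn.execute("PRAGMA foreign_key_list(fact_sales)"))
print(points)
```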

Advantages and Disadvantages of Star Schema

Before discussing the pros and cons of the Star Schema, it is important to understand that it is not the best option in every situation, and in some cases it may prove unsuitable. In most cases, however, its advantages outweigh its shortcomings. Some of the advantages and disadvantages are listed below.

Advantages

  • Easy to Understand: Users of OLTP (Online Transaction Processing) systems work with the application through a predefined graphical user interface or preset query templates, so they do not need to understand the data structures behind the scenes; the database schema and data structures remain the concern of IT professionals. Users of decision support systems, however, formulate their own queries. When they interact with the data warehouse, they must understand the data structures and know which data is available. The Star Schema, with its facts at the center and descriptive dimensions around them, is simple enough for such users to follow.
  • Optimizes Navigation: The purpose of defining relationships among entities in a database schema is to let you navigate through the database, moving from table to table to find the information you need. If the paths from one table to another are numerous and complicated, navigation becomes complex and slow. If the join paths are straightforward, navigation is simple and fast. This is where the Star Schema shines: even when a complex query result is expected, the navigation stays simple.
  • Highly Suitable for Query Processing: Since the Star Schema is a query-centric structure, it is highly suitable for query processing. Regardless of how many dimensions participate in a query and how complex the query is, it is executed by selecting rows from the dimension tables using filters based on the query parameters and then finding the corresponding fact table rows. This is possible because of the simple join paths in the Star Schema.
  • Feeding Cubes: The Star Schema is widely used by Online Analytical Processing (OLAP) systems to build OLAP cubes efficiently. In fact, most major OLAP systems provide a mode of operation called ROLAP, which uses the Star Schema directly as a source without building a proprietary cube structure.
  • Special Performance Techniques: With a Star Schema, the query processor can choose better execution plans by applying performance techniques designed for this structure. Two of the most prominent are star-join and star-index. A star-join joins multiple tables in a single operation, which boosts query performance; it is fast, parallelizable and single-pass. A star-index is a special index that speeds up joins by indexing one or more foreign keys of the fact table.
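
To make the query-processing and indexing points more concrete, here is a small sketch using Python's built-in sqlite3 module. The filters are applied to the dimension tables, and a single join per dimension leads to the matching fact rows. SQLite has no feature named star-join or star-index, so ordinary indexes on the fact table's foreign keys stand in for the idea here; the table names, index names and data are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    amount      REAL
);

-- Ordinary indexes on the fact table's foreign keys (a stand-in for a star-index).
CREATE INDEX idx_fact_sales_product ON fact_sales(product_key);
CREATE INDEX idx_fact_sales_store   ON fact_sales(store_key);

INSERT INTO dim_product VALUES (1, 'Toys'), (2, 'Books');
INSERT INTO dim_store   VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_sales  VALUES (1, 1, 20.0), (1, 2, 35.0), (2, 1, 12.5), (1, 1, 5.0);
""")

# Filters apply to the dimension tables; one join per dimension reaches the facts.
total = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_store   AS s ON s.store_key   = f.store_key
    WHERE p.category = ? AND s.region = ?
""", ("Toys", "North")).fetchone()[0]

# Column 1 of PRAGMA index_list is the index name.
index_names = sorted(row[1] for row in conn.execute("PRAGMA index_list(fact_sales)"))
print(total, index_names)
```

Whether a given engine actually uses such indexes for a particular query depends on its planner and on the data, so this should be read as a sketch of the structure rather than a performance claim.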

Disadvantages

  • Data Integrity: The biggest disadvantage of the Star Schema is that data integrity is only weakly enforced, because the data is kept in a highly denormalized state.
  • Lack of Flexibility: The Star Schema is less flexible for analytical needs than a normalized data model, which allows any analytical query to be executed as long as it follows the business logic. Since a Star Schema is built around a particular view of the data, it does not readily support more complex analytics.
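
The data integrity concern can be illustrated with a small sketch, again using Python's built-in sqlite3 module. In the hypothetical denormalized `dim_product` table below, the category name is repeated on every product row, so a partial update leaves the same category with two different names and the database does nothing to prevent it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension: the category name is repeated on every product row.
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_id   INTEGER,
    category_name TEXT
);
INSERT INTO dim_product VALUES
    (1, 'Robot',  10, 'Toys'),
    (2, 'Puzzle', 10, 'Toys');
""")

# A partial update renames the category on one row only; nothing stops it.
conn.execute("UPDATE dim_product SET category_name = 'Games' WHERE product_key = 1")

names_for_category = sorted(
    row[0] for row in conn.execute(
        "SELECT DISTINCT category_name FROM dim_product WHERE category_id = 10"
    )
)
print(names_for_category)
```

In practice, warehouses usually guard against this in the load process rather than through schema constraints.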

The steps involved in designing the Star Schema are:

  • Identification of the business process for analysis (for example, sales)
  • Identification of the measures or facts (sales amount)
  • Identification of the dimensions for the facts (product, location, time and company dimensions)
  • Listing of the columns that describe each dimension (city name, region name, branch name, branch code)
  • Determining the lowest level of summary, or grain, of the fact table (sales amount)
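
One possible way to capture the outcome of these steps is a small design record from which the table definitions are generated. The sketch below is an assumption-laden illustration in Python: the `design` dictionary, the `build_ddl` helper, the chosen grain and all column names are hypothetical, and only three of the dimensions mentioned above are included to keep it short.

```python
import sqlite3

# Hypothetical design record, one entry per step in the list above.
design = {
    "process": "sales",                          # business process
    "measures": ["sales_amount"],                # measures or facts
    "dimensions": {                              # dimensions and their columns
        "product": ["product_name"],
        "location": ["city_name", "region_name", "branch_name", "branch_code"],
        "time": ["calendar_date"],
    },
    "grain": "one row per product, location and day",  # assumed grain
}

def build_ddl(design):
    """Return CREATE TABLE statements for the dimensions and the fact table."""
    statements = []
    fact_columns = []
    for name, columns in design["dimensions"].items():
        column_sql = ", ".join(f"{col} TEXT" for col in columns)
        statements.append(
            f"CREATE TABLE dim_{name} ({name}_key INTEGER PRIMARY KEY, {column_sql})"
        )
        # The fact table gets one foreign key per dimension.
        fact_columns.append(f"{name}_key INTEGER REFERENCES dim_{name}({name}_key)")
    fact_columns += [f"{measure} REAL" for measure in design["measures"]]
    fact_name = "fact_" + design["process"]
    fact_sql = ", ".join(fact_columns)
    statements.append(f"CREATE TABLE {fact_name} ({fact_sql})")
    return statements

conn = sqlite3.connect(":memory:")
for statement in build_ddl(design):
    conn.execute(statement)

tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

The grain is recorded only as a comment-like string here; in a real design it determines which keys and measures belong in each fact row.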

Star Schema vs Snowflake Schema

As discussed above, the Star Schema is the basic data warehouse schema and is named for its structure, which resembles a star. The Snowflake Schema, on the other hand, is an extension of the Star Schema in which the dimensions are normalized into additional tables. These two schemas are the most commonly used data warehouse structures. The key differences between them are:

Star Schema | Snowflake Schema
Simple database design | Complex database design
The dimension table stores the hierarchies for the dimension | The hierarchies are separated into different tables
A single join relates the fact table to each dimension table | Many joins are required to fetch the data
The fact table is surrounded by dimension tables | The fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables
High data redundancy | Low data redundancy
Faster cube processing | Cube processing might be slower because of the complex joins
Denormalized data structure | Normalized data structure
Each dimension is held in a single dimension table | The data for a dimension is split across several tables
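
The difference in join paths can be seen in a small side-by-side sketch using Python's built-in sqlite3 module. The same question, total sales per category, needs one join against a star-style product dimension and two joins against a snowflaked one. The `star_` and `snow_` table names and the sample rows are invented for this comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the product dimension stores its category hierarchy inline.
CREATE TABLE star_dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT
);
-- Snowflake: the hierarchy is split into a separate category table.
CREATE TABLE snow_dim_category (
    category_key INTEGER PRIMARY KEY, category_name TEXT
);
CREATE TABLE snow_dim_product (
    product_key INTEGER PRIMARY KEY, product_name TEXT,
    category_key INTEGER REFERENCES snow_dim_category(category_key)
);
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

INSERT INTO star_dim_product VALUES (1, 'Robot', 'Toys'), (2, 'Novel', 'Books');
INSERT INTO snow_dim_category VALUES (1, 'Toys'), (2, 'Books');
INSERT INTO snow_dim_product VALUES (1, 'Robot', 1), (2, 'Novel', 2);
INSERT INTO fact_sales VALUES (1, 20.0), (2, 12.5), (1, 5.0);
""")

# Star: one join from the fact table to the dimension.
star_query = """
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN star_dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category_name ORDER BY p.category_name
"""
# Snowflake: an extra join to reach the category name.
snowflake_query = """
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN snow_dim_product  AS p ON p.product_key  = f.product_key
    JOIN snow_dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name ORDER BY c.category_name
"""
star_rows = conn.execute(star_query).fetchall()
snowflake_rows = conn.execute(snowflake_query).fetchall()
print(star_rows == snowflake_rows, star_rows)
```

Both queries return the same totals; the snowflake version trades the repeated category name for an extra table and an extra join.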

Conclusion

The Star Schema is the simplest form of dimensional model used in data warehousing. In this article, we discussed the features of the Star Schema, the steps for designing one, and its advantages and disadvantages, before comparing it with the Snowflake Schema, the second most commonly used schema structure.

Vijay Singh

My name is Vijay Singh Khatri, and I enjoy meeting new people and finding ways to help them have an uplifting experience. I have had a variety of customer service opportunities, through which I was able to have fewer returned products and increased repeat customers, when compared with co-workers. Currently working with hackr.io.
