What is data warehousing?
A data warehouse is a central repository of data designed to enable business intelligence (BI) and other business analytics. Data warehouses consolidate often historical data from various sources within an organization, such as transactional databases, spreadsheets, external data sources, and more. Business analysts, data engineers, data scientists, and others can then perform queries on the data to efficiently generate things like sales reports, marketing dashboards, data visualizations, and other analytics tools. They serve as a key component for data-driven decision-making and help organizations gain insights from their data to improve operations and strategy.
Data warehouses are designed to minimize the input and output of data and store large amounts of data efficiently. They can reduce data silos and help ensure data accuracy by collecting disparate data sources in one place. This allows everyone in an organization to work from the same information to draw conclusions about the business. Therefore, data warehouses often serve as an organization’s “single source of truth.”
How is a data warehouse architected?
Data warehouse architecture is composed of different layers that govern reporting, access, and storage. When data is added to the warehouse, it undergoes an extract, transform, and load (ETL) process that cleans and structures the data for analysis before loading it into the database server. This layer is designed for optimized querying and reporting, so data that is accessed more frequently is housed in fast storage like solid-state drives (SSDs).
A middle layer contains an engine that accesses and analyzes the data. It offers an interface for users and applications to interact with the data warehouse and can include components for data security, authentication, and query optimization.
A third tier is end-user facing. It includes query and analysis tools that allow users to interact with the data warehouse to run queries, create reports, build dashboards, and perform data analysis. Common tools include BI software, data visualization tools, and structured query language (SQL) tools.
Benefits of data warehousing
Data warehousing is a flexible and reliable way to support important business processes for reporting, business intelligence, analytics, and more. Key benefits include:
- Consistency. Data formats and values are standardized, complete, and accurate
- Nonvolatile storage. After data is added to a warehouse, it doesn’t change, allowing for more efficient data storage over extended periods.
- Subject-oriented programming. People can easily access data via topics tied to business units and processes that they work with daily.
- Ongoing analysis. Data changes are tracked over time to create an effective corporate memory of the enterprise.
- Performance. Transactional databases are separate from the analytics engine, which creates improved system performance.
These characteristics can equip data warehouses to deliver high amounts of current and historical data quickly and nimbly, allowing end users to efficiently analyze and format subsets of data for various business needs and outcomes.
Data warehouse vs. data lake vs. data lakehouse
Often, organizations will implement a combination of data warehouses and data lakes to complement each other and their distinct data needs. Let’s take a look at the differences between them and how they stack up to a data lakehouse.
The primary purpose of a data warehouse is to analyze data. It can intake vast amounts of data from various sources and store it for extended periods of time. Once stored, that data is formatted so it can be readily accessed and analyzed to deliver business insights and inform better decision-making. A data warehouse is a good option for organizations looking for a structured data solution focused on business intelligence, business reporting, and data analytics.
A data lake can also store and capture data in real time from a wide range of sources, including business applications, mobile apps, internet of things (IoT) devices, and more. But the data does not need to be formatted or defined first. This is referred to as “schema on read.” As a result, data lakes can cost-effectively hold vast amounts of raw data and easily scale as an organization’s data requirements change. In addition, users can access the data in various ways without having to know special languages like SQL. Data lakes are suitable for enterprises that need a flexible, affordable data solution to drive advanced analytics and machine learning (ML) models.
A data lakehouse is a relatively new architecture that combines the cost-effectiveness and scalability of a data lake with the robust information management of a data warehouse. A lakehouse can integrate various data sources automatically without having to make them fit formats or schemas. And it can store and process data in a single platform accessible to users across the enterprise. That allows enterprises to have a “single source of truth” for their data but still leverage advanced analytics and machine learning capabilities of a data lake to maximize business value—all in one solution.
|
Data warehouse |
Data lake |
Lakehouse |
Data type |
Relational and structured |
Relational and non-relational; can be structured, semi-structured, and unstructured |
Structured, semi-structured, and unstructured |
Schema |
On write |
On read |
On read and on write |
Data quality |
Consistent data that serves a single source of truth across the organization |
Raw data that has not been organized or curated |
A single source of truth for curated data and storage for raw data |
Use cases |
Core/batch reporting, data visualizations, business intelligence |
Machine learning, advanced analytics, forecasting, real-time analysis |
Real-time analytics, machine learning, immediate business insights, forecasting, reporting |
Primary users |
Business analysts |
Data scientists, data, developers, engineers |
Business analysts, data scientists, data, developers, engineers |
Scalability |
Difficult and expensive to scale |
Can scale easily and cost effectively |
Can automatically scale on demand |
Cost |
Storage and data management costs more than with a data lake |
Storage and data management costs less than with a data warehouse |
Storage is cost-effective and fast |
Designing and implementing a data warehouse
Designing a data warehouse requires close collaboration between information technology (IT), data professionals, and business stakeholders. It should be flexible enough to adapt to changing business requirements—and scalable to accommodate growing volumes of data.
Key considerations include:
- Business requirements. Start by understanding the specific business needs and objectives the data warehouse will address. This should identify which data needs to be stored, various ways it needs to be formatted, the types of analyses required, and the expected outcomes.
- Data sources. Determine the source from which the data will be extracted. This can include operational databases, external data feeds, spreadsheets, and more.
- Data modeling. Common models for data warehousing include star schemas and snowflake schemas. These models help to organize and structure data efficiently for analytical purposes.
- System requirements. Consider which systems will be needed and used to deliver, store, and access the data, and which ones will support the data warehouse environment. This can include selecting a database management system, ETL tools, query and analysis tools, and data governance and security solutions.
- Performance. Implement techniques like indexing, partitioning, and caching to optimize query performance and ensure that the data warehouse can efficiently handle large volumes of data.
- Business processes. Ensure that the data warehouse is seamlessly integrated with an organization's business processes to facilitate data-driven decision-making.
- Refresh and recovery processes. Consider how often the data will be refreshed, how data will be backed up, and how it will be restored if necessary.
Data warehousing solutions
Your unique business requirements should be top of mind when considering a data warehouse solution for your organization. How do you intend to use it to draw meaningful insights and improve business decision-making?
It’s also important to consider:
- Connectivity and advanced capabilities. Look for a data warehouse architecture that offers connectivity to various data sources within your organization so you can intake and store data seamlessly. That may require a solution with data modeling and mapping, quality assurance, ETL capabilities, connectivity to BI tools, and more. This is critical to ensuring you can load and analyze data quickly and efficiently.
- Performance. The quicker a data warehouse can process and export information, the faster your users can make critical business decisions and drive value. Look for tools that improve performance like parallel processing, advanced processing power, and maximum uptime that allow you to process data more rapidly and efficiently.
- Scalability. An effective data warehouse solution should also be future-ready with the ability to scale as your organization’s data needs change. That includes the ability to handle larger amounts of data over extended periods.
- Cloud vs. on-premises. An on-premises data warehouse will reside on your network, which can offer maximum control over things like latency. But it can also require much more investment in both hardware and IT support personnel. Cloud solutions don’t require a physical environment for your data warehouse, so you don’t need to independently house or maintain it. That means it’s easy to deploy, more affordable, and can scale more easily.
Why choose Teradata for your data warehouse needs?
With Teradata VantageCloud, our complete cloud analytics and data platform, you can deploy a data warehouse at an enterprise level that offers flexibility, performance, and analytics to a wide range of users.
Our next-generation cloud-native architecture can seamlessly bring disparate sources of data together to create a single source of truth for your organization. It easily manages mixed workloads. And it ensures data integrity with real-time updates.
That means greater flexibility, scalability, and security for your organization so you can access business intelligence more quickly and efficiently leverage more informed decision-making.
Data warehousing FAQ
What is an example of a data warehouse?
An enterprise data warehouse (EDW) is one example of this data storage technology. EDWs help businesses improve customer insights, optimize supply chain management, and increase productivity. Walmart was one of the first large organizations to employ an EDW, which it uses to guide suppliers, inform point-of-sales analytics, and more deeply understand customer buying habits. Today’s EDW solutions commonly involve a combination of environments designed to address a range of enterprise needs. For example, business users may require access to curated, structured data that can easily be applied to business intelligence applications. The EDW may also involve data lakes that store unstructured data with the potential to be used in business applications but that are not assigned to any particular function.
What are the key components of a data warehouse?
- A data warehouse generally includes:
- A database that stores and manages data
- An ETL process responsible for extracting data from source systems and transforming it into a format for it to be analyzed
- Query and analysis tools so end users can create reports, build dashboards, and conduct data analysis
- A data access layer that manages data access, security, and authentication
- Data governance, security, backup, and recovery tools
What schemas are often used in data warehouses?
The two most common schemas used in data warehouses are star schemas and snowflake schemas. A star schema simplifies queries and provides robust query performance because it minimizes the effort needed to retrieve data. So, they’re commonly used in scenarios where query performance is a top priority. A snowflake schema often requires more complex query processing due to an increased number of joins needed to retrieve data, but they use less space. Therefore, they can be suitable in scenarios where storage space is a concern.