Understand the differences between the two most popular options for storing big data.
When it comes to storing big data, the two most popular options are data lakes and data warehouses. Data warehouses are used for analyzing archived structured data, while data lakes are used to store big data of all structures.
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop’s cluster nodes of commodity computers.
Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being used to describe any large data pool in which the schema and data requirements are not defined until the data is queried.
The term describes a data storage strategy, not a specific technology, although it is frequently used in conjunction with a specific technology (Hadoop). The same can be said of the term data warehouse, which despite often referring to a specific technology (relational database), actually describes a broad data management strategy.
Data lake vs. data warehouse
Data lakes and data warehouses are two different strategies for storing big data. The most important distinction between them is that in a data warehouse, the schema for the data is preset; that is, there is a plan for the data upon its entry into the database. In a data lake, this is not necessarily the case. A data lake can house both structured and unstructured data and does not have a predetermined schema. A data warehouse handles primarily structured data and has a predetermined schema for the data it houses.
To put it more simply, think of the concept of a warehouse versus the concept of a lake. A lake is liquid, shifting, amorphous, largely unstructured and is fed from rivers, streams, and other unfiltered sources of water. A warehouse, on the other hand, is a man-made structure, with shelves and aisles and designated places for the things inside of it. Warehouses store curated goods from specific sources. Warehouses are prestructured, lakes are not.
This core conceptual difference manifests in several ways, including:
Technology typically used to host data — A data warehouse is usually a relational database housed on an enterprise mainframe server or the cloud, whereas a data lake is usually housed in a Hadoop environment or similar big data repository.
Source of the data — The data stored in a warehouse is extracted from various online transaction processing applications to support business analytics queries and data marts for specific internal business groups, such as sales or inventory teams. Data lakes typically receive both relational and non-relational data from IoT devices, social media, mobile apps and corporate applications.
Users — Data warehouses are useful when there is a massive amount of data from operational systems that need to be readily available for analysis. Data lakes are more useful when an organization needs a large repository of data, but does not have a purpose for all of it and can afford to apply a schema to it upon access.
Because the data in a lake is often uncurated and can originate from sources outside of the company’s operational systems, lakes are not a good fit for the average business analytics user. Instead, data lakes are better suited for use by data scientists, because it takes a level of skill to be able to sort through the large body of uncurated data and readily extract meaning from it.
Data quality — In a data warehouse, the highly curated data is generally trusted as the central version of true because it contains already processed data. The data in a data lake is less reliable because it could be arriving from any source in any state. It may be curated, and it may not be, depending on the source.
Processing — The schema for data warehouses is on-write, meaning it is pre-set for when the data is entered into the warehouse. The schema for a data lake is on-read, meaning it doesn’t exist until the data has been accessed and someone chooses to use it for something.
Performance/cost — Data warehouses are usually more expensive for large data volumes, but the trade-off is faster query results, reliability and higher performance. Data lakes are designed with low cost in mind, but query results are improving as the concept and surrounding technologies mature.
Agility — Data lakes are highly agile; they can be configured and reconfigured as needed. Data warehouses are less so.
Security — Data warehouses are generally more secure than data lakes because warehouses as a concept have existed for longer and therefore, security methods have had the opportunity to mature.
A side-by-side comparison of data lakes and data warehouses.
Because of their differences, and the fact that data lakes are a newer and still-evolving concept, organizations might choose to use both a data warehouse and a data lake in a hybrid deployment. This may be to accommodate the addition of new data sources, or to create an archive repository to deal with data roll-off from the main data warehouse. Frequently data lakes are an addition to, or evolution of, an organization’s current data management structure instead of a replacement.
Data lake architecture
The physical architecture of a data lake may vary, as data lake is a strategy that can be applied to multiple technologies. For example, the physical architecture of a data lake using Hadoop might differ from that of data lake using Amazon Simple Storage Service (Amazon S3).
However, there are three main principles that distinguish a data lake from other big data storage methods and make up the basic architecture of a data lake. They are:
- No data is turned away. All data is loaded in from various source systems and retained.
- Data is stored in an untransformed or nearly untransformed state, as it was received from the source.
- Data is transformed and fit into a schema based on analysis requirements.
Although data is largely unstructured and not geared toward answering any specific question, it should still be organized in some manner so that doing this in the future is possible. Whatever technology ends up being used to deploy an organization’s data lake, a few features should be included to ensure that the data lake is functional and healthy and that the large repository of unstructured data doesn’t go to waste. These include:
- A taxonomy of data classifications, which can include data type, content, usage scenarios and groups of possible users.
- A file hierarchy with naming conventions.
- Data profiling tools to provide insight for classifying data objects and addressing data quality issues.
- Standardized data access process to keep track of what members of an organization are accessing data.
- A searchable data catalog.
- Data protections including data masking, data encryption and automated monitoring to generate alerts when data is accessed by unauthorized parties.
- Data awareness among employees, which includes an understanding of proper data management and data governance, training on how to navigate the data lake, and an understanding of strong data quality and proper data usage.
Benefits of a data lake
The data lake offers several benefits, including:
- The ability of developers and data scientists to easily configure a given data model, application, or query on the fly. The data lake is highly agile.
- Data lakes are theoretically more accessible. Because there is no inherent structure, any user can technically access the data in the data lake, even though the prevalence of large amounts of unstructured data might inhibit less skilled users.
- The data lake supports users of varying levels of investment; users who want to return to the source to retrieve more information, those who seek to answer entirely new questions with the data and those who simply require a daily report. Access is possible for each of these user types.
- Data lakes are cheap to implement because most technologies used to manage them are open source (i.e., Hadoop) and can be installed on low-cost hardware.
- Labor-intensive schema development and data cleanup are deferred until after an organization has identified a clear business need for the data.
- Agility allows for a variety of different analytics methods to interpret data, including big data analytics, real-time analytics, machine learning and SQL queries.
- Scalable because of a lack of structure.
Despite the benefits of having a cheap, unstructured repository of data at an organization’s disposal, several legitimate criticisms have been levied against the strategy.
One of the biggest potential follies of the data lake is that it might turn into a data swamp, or data graveyard. If an organization practices poor data governance and management, it may lose track of the data that exists in the lake, even as more pours in. The result is a wasted body of potentially valuable data rotting away unseen at the “bottom” of the data lake, so to speak, rendering it deteriorated, unmanaged and inaccessible.
Data lakes, while providing theoretical accessibility to anyone in an organization, may not be as accessible in practical use, because business analysts may have a difficult time readily parsing unstructured data from a variety of sources. This practical accessibility challenge may also contribute to the lack of proper data maintenance and result in the development of a data graveyard. It’s important to maximize investment in a data lake and reduce the risk of failed deployment.
Another problem with the term data lake itself is that it is used in many contexts in public discourse. Although it makes most sense to use it to describe a strategy of data management, it has also commonly been used to describe specific technologies and as a result, has a level of arbitrariness to it. This challenge may cease to be once the term matures and finds a more concrete meaning in the public discourse.
Although a data lake isn’t a specific technology, there are several technologies that enable them. Some vendors that offer those technologies are:
- Apache offers the open-source ecosystem Hadoop, one of the most common data lake services.
- Amazon offers Amazon S3 with virtually unlimited scalability.
- Google offers Google Cloud Storage and a collection of services to pair with it for management.
- Oracle offers the Oracle Big Data Cloud and a variety of PaaS services to help manage it.
- Microsoft offers the Azure Data Lake as a scalable data storage and Azure Data Lake Analytics as a parallel analytics service. This is an example of when the term data lake is used to refer to a specific technology instead of a strategy.
- HVR offers a scalable solution for organizations that need to move large volumes of data and update it in real time.
- Podium offers a solution with an easy-to-implement and use suite of management features.
- Snowflake offers a solution that specializes in processing diverse datasets, including structured and semi-structured datasets such as JSON, XML and Parquet.
- Zaloni offers a solution that comes with Mica, a self-service data prep tool and data catalog. Zaloni has been branded as the data lake company.
Hadoop Data Lake
A Hadoop data lake is a data management platform comprising one or more Hadoop clusters. It is used principally to process and store nonrelational data, such as log files, internet clickstream records, sensor data, JSON objects, images and social media posts.
Such systems can also hold transactional data pulled from relational databases, but they’re designed to support analytics applications, not to handle transaction processing. As public cloud platforms have become common sites for data storage, many people build Hadoop data lakes in the cloud.
Hadoop data lake architecture
While the data lake concept can be applied more broadly to include other types of systems, it most frequently involves storing data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes based on commodity server hardware. The reliance on HDFS has, over time, been supplemented with data stores using object storage technology, but non-HDFS Hadoop ecosystem components typically are part of the enterprise data lake implementation.
With the use of commodity hardware and Hadoop’s standing as an open source technology, proponents claim that Hadoop data lakes provide a less expensive repository for analytics data than traditional data warehouses. In addition, their ability to hold a diverse mix of structured, unstructured and semistructured data can make them a more suitable platform for big data management and analytics applications than data warehouses based on relational software.
However, a Hadoop enterprise data lake can be used to complement an enterprise data warehouse (EDW) rather than to supplant it entirely. A Hadoop cluster can offload some data processing work from an EDW and, in effect, stand in as an analytical data lake. In such cases, the data lake can host new analytics applications. As a result, altered data sets or summarized results can be sent to the established data warehouse for further analysis.
Hadoop data lake best practices
The contents of a Hadoop data lake need not be immediately incorporated into a formal database schema or consistent data structure, which allows users to store raw data as is; information can then either be analyzed in its raw form or prepared for specific analytics uses as needed.
As a result, data lake systems tend to employ extract, load and transform (ELT) methods for collecting and integrating data, instead of the extract, transform and load (ETL) approaches typically used in data warehouses. Data can be extracted and processed outside of HDFS using MapReduce, Spark and other data processing frameworks.
Despite the common emphasis on retaining data in a raw state, data lake architectures often strive to employ schema-on-the-fly techniques to begin to refine and sort some data for enterprise uses. As a result, Hadoop data lakes have come to hold both raw and curated data.
As big data applications become more prevalent in companies, the data lake often is organized to support a variety of applications. While early Hadoop data lakes were often the province of data scientists, increasingly, these lakes are adding tools that allow analytics self-service for many types of users.
Hadoop data lake uses, challenges
Potential uses for Hadoop data lakes vary. For example, they can pool varied legacy data sources, collect network data from multiple remote locations and serve as a way station for data that is overloading another system.
Experimental analysis and archiving are among other Hadoop data lake uses. They have also become an integral part of Amazon Web Services (AWS) Lambda architectures that couple batch with real-time data processing.
The Hadoop data lake isn’t without its critics or challenges for users. Spark, as well as the Hadoop framework itself, can support file architectures other than HDFS. Meanwhile, data warehouse advocates contend that similar architectures — for example, the data mart — have a long lineage and that Hadoop and related open source technologies still need to mature significantly in order to match the functionality and reliability of data warehousing environments.
Experienced Hadoop data lake users say that a successful implementation requires a strong architecture and disciplined data governance policies; without those things, they warn, data lake systems can become out-of-control dumping grounds. Effective metadata management typically helps to drive successful enterprise data lake implementations.
Hadoop vs. Azure Data Lakes
There are other versions of data lakes, which offer similar functionality to the Hadoop data lake and also tie into HDFS.
Microsoft launched its Azure Data Lake for big data analytical workloads in the cloud in 2016. It is compatible with Azure HDInsight, Microsoft’s data processing service based on Hadoop, Spark, R and other open source frameworks. The main components of Azure Data Lake are Azure Data Lake Analytics, which is built on Apache YARN, Azure Data Lake Store and U-SQL. It uses Azure Active Directory for authentication and access control lists and includes enterprise-level features for manageability, scalability, reliability and availability.
Around the same time that Microsoft launched its data lake, AWS launched Data Lake Solutions an automated reference data lake implementation that guides users through creation of a data lake architecture on the AWS cloud, using AWS services, such as Amazon Simple Storage Service (S3) for storage and AWS Glue, a managed data catalog and ETL service.
Analytics platforms have made their way to the forefront of information-driven enterprises. Winning organizations know a core competency in analytics requires a modern data analytics platform architecture that delivers insights at critical junctures in their data pipelines while minimizing cost, redundancy and complexity.
What is a data analytics platform?
A data analytics platform can be defined as everything it takes to draw meaningful and useful insights from data. For a general concept of an analytics platform, think in terms of data, analytics and insights.
Delivering an analytics platform requires a robust architecture that serves as a blueprint for delivering business unit and enterprise analytics, communicating architectural decisions, reducing individual project risk and ensuring enterprise consistency.
A traditional data analytics platform architecture is often not well positioned to support today’s data-driven organizations. New business demands, enabling technologies and cost pressures are prompting organizations to modernize their analytics platforms in order to realize the full potential of data as a corporate asset.
Modernizing means rethinking a data analytics platform architecture, including these attributes:
- Agility at the speed of business
- Cost optimization
- Highly qualified personnel
- Process automation
- Best-in-class technology
- Handling of data at any speed, size and variety
- Seamless data integrations
- Timely insights throughout data pipelines
- Full spectrum of business intelligence capabilities
- Robust security architecture
- High-speed direct connect data fabric
- Loosely coupled technology ecosystem
- High-efficiency computing
- Strong governance controls and stewardship
- Rapid development and deployment
- Well-documented architecture and metadata
Data lake vs. data reservoir
A strong data analytics platform architecture will account for data lakes and data reservoirs. This coexistence is complementary as each repository addresses different data and analytical uses at different points in the pipeline.
The main differences between the two involve data latency and refinement. Both store structured and unstructured data, leveraging various data stores from simple object files to SQL and NoSQL database engines to big data stores.
Data lakes are raw data repositories located at the beginning of data pipelines, optimized for getting data into the analytics platform. Landing zones and sandboxes of independent data designed for ingestion and discovery, these native format data stores are open to private consumers for selective use. Analytics are generally limited to time-sensitive insights and exploratory inquiry by consumers who can tolerate the murky waters.
Data reservoirs are refined data repositories located at operational and back-end points of data pipelines, optimized for getting data out of the analytics platform. As sources of unified, harmonized and wrangled data designed for querying and analysis, data reservoirs are purpose-built data stores that are open to the public for general consumption. Analytics span a wide range of past, present and future insights for use by casual and sophisticated consumers, serving both tactical and strategic insights that run the business.
Determining at what point in the pipeline data becomes meaningful for a particular use case is often tempered by time and quality.
On one hand, access to data early in the pipeline will favor time-sensitive insights over the suitability of non-harmonized data, particularly for use cases that require the most recent data. On the other hand, access to data later in the pipeline will favor data accuracy over increased latency by virtue of curation, particularly for use cases that require data that has been cleaned, conformed and enriched, and that is of known quality.
Public cloud or on premises
Choosing where to run your analytics platform is not as easy decision. Fortunately, public cloud and on-premises deployments aren’t mutually exclusive. Smaller organizations typically gravitate to an entirely public cloud strategy, while midsize to large organizations often deploy a hybrid strategy or assume complete control with an all on-premises strategy.
Any decision on where to host a data analytics platform should minimally consider agility, scale, cost, security (particularly sensitive data protection), network latency and analytic capabilities.
A big part of the hosting decision comes down to control. Organizations that are comfortable sharing control are likely to lean more toward a cloud presence. Organizations that feel comfortable owning the end-to-end platform will likely lean more toward an on-premises option.
Regardless of where you run your analytics platform, modernization should not simply be a lift-and-shift approach. You may not need a complete overhaul, but take the opportunity to refresh select components and remove technical debt across your platform.
Organizations that choose the public cloud for some or all their data analytics platform architecture should take advantage of what the cloud does best. This means moving from IaaS to SaaS and PaaS models. Look to maximize managed services, migrate to native cloud services, automate elasticity, geo-disperse the analytics platform and move to consumption-based pricing whenever possible by using serverless technologies.
The importance of flexibility
Flexibility has become a necessary attribute of a modern data analytics platform architecture. An expanding demand for analytics is forcing analytics platforms to be more accessible, extensible and nimble while processing data at greater velocity, volume and variety.
One thing is for sure: Your data analytics platform architecture will change. A key measurement of a platform’s flexibility is how well it adapts to business and technology innovation. Expect the business to demand an accelerated analytics lifecycle and greater autonomy via self-service capabilities. To keep pace with the business, look for technology advancements in automation and artificial intelligence as well as catalysts for augmented data management and analytics.
A data warehouse is a repository for data generated and collected by an enterprise’s various operational systems. Data warehousing is often part of a broader data management strategy and emphasizes the capture of data from different sources for access and analysis by business analysts, data scientists and other end users.
Typically, a data warehouse is a relational database housed on a mainframe, another type of enterprise server or, increasingly, in the cloud. Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and consolidated for business intelligence (BI) activities that include decision support, enterprise reporting and ad hoc querying by users. Data warehouses also support online analytical processing (OLAP) technologies, which organize information into data cubes that are categorized by different dimensions to help accelerate the analysis process.
Basic components of a data warehouse
A data warehouse stores data that is extracted from internal data stores and, in many cases, external data sources. The data records within the warehouse must contain details to make it searchable and useful to business users. Taken together, there are three main components of data warehousing:
- A data integration layer that extracts data from operational systems, such as Excel, ERP, CRM or financial applications.
- A data staging area where data is cleansed and organized.
- A presentation area where data is warehoused and made available for use.
A data warehouse architecture can also be understood as a set of tiers, where the bottom tier is the database server, the middle tier is the analytics engine and the top tier is data warehouse software that presents information for reporting and analysis.
Data analysis tools, such as BI software, enable users to access the data within the warehouse. An enterprise data warehouse stores analytical data for all of an organization’s business operations; alternatively, individual business units may have their own data warehouses, particularly in large companies. Data warehouses can also feed data marts, which are smaller, decentralized systems in which subsets of data from a warehouse are organized and made available to specific groups of business users, such as sales or inventory management teams.
In addition, Hadoop has become an important extension of data warehouses for many enterprises because the distributed data processing platform can improve components of a data warehouse architecture — from data ingestion to analytics processing to data archiving. In some cases, Hadoop clusters serve as the staging area for traditional data warehouses. In others, systems that incorporate Hadoop and other big data technologies are deployed as full-fledged data warehouses themselves.
Data warehouse benefits and options
Data warehouses can benefit organizations from both an IT and a business perspective. For example:
- Separating analytical processes from operational ones can enhance the performance of operational systems and enable data analysts and business users to access and query relevant data faster from multiple sources.
- Data warehouses can offer enhanced data quality and consistency for analytics uses, thereby improving the accuracy of BI applications.
- Businesses can choose on-premises systems, conventional cloud deployments or data-warehouse-as-a-service (DWaaS) offerings.
- On-premises data warehouses offer flexibility and security so IT teams can maintain control over their data warehouse management and configuration; they’re available from IBM, Oracle and Teradata as an example.
- Cloud-based data warehouses such as Amazon Redshift, Google BigQuery, Microsoft Azure SQL Data Warehouse and Snowflake enable companies to quickly scale up their systems while eliminating the initial infrastructure investments and ongoing system maintenance requirements.
- DWaaS, an offshoot of database as a service, provides a managed cloud service that frees organizations from the need to deploy, configure and administer their data warehouses. Such services are being offered by a growing number of cloud vendors.
Types of data warehouses
There are three main approaches to implementing a data warehouse, which are detailed below. Some organizations have also adopted federated data warehouses that integrate separate analytical systems already put in place independently of one another — an approach proponents describe as a practical way to take advantage of existing deployments.
- Top-down approach: Created by data warehouse pioneer William H. Inmon, this method calls for building the enterprise data warehouse first. Data is extracted from operational systems and possibly third-party external sources and may be validated in a staging area before being integrated into a normalized data model. Data marts are then created from the data stored in the data warehouse.
- Bottom-up method: Consultant Ralph Kimball developed an alternative data warehousing architecture that calls for dimensional data marts to be created first. Data is extracted from operational systems, moved to a staging area and modeled into a star schema design, with one or more fact tables connected to one or more dimensional tables. The data is then processed and loaded into data marts, each of which focuses on a specific business process. The data marts are integrated using a data warehouse bus architecture to form an enterprise data warehouse.
- Hybrid method: Hybrid approaches to data warehouse design include aspects from both the top-down and bottom-up methods. Organizations often seek to combine the speed of the bottom-up approach with the data integration capabilities achieved in a top-down design.
Data warehouses vs. databases vs. data lakes
Databases and data lakes are often confused with data warehouses, but there are important differences. While data warehouses typically store data from multiple sources and utilize predefined schemas designed for data analytics, an operational database is generally used to capture, process and store data from a single source, such as a transactional system, and its schema is normalized. Such databases typically aren’t designed to run across very large data sets, as data warehouses are.
By contrast, a data lake is a central repository for all types of raw data, whether structured or unstructured, from multiple sources. Data lakes are most commonly built on Hadoop or other big data platforms. A schema doesn’t need to be defined upfront in them, which allows for more types of analytics than data warehouses, which have defined schemas. For example, data lakes can be used for text searches, machine learning and real-time analytics.
Data warehouse innovations throughout history
The concept of data warehousing can be traced back to work conducted in the mid-1980s by IBM researchers Barry Devlin and Paul Murphy. The duo coined the term business data warehouse in their 1988 paper, “An architecture for a business and information system,” which stated:
“The [business information system] architecture is based on the assumption that such a service runs against a repository of all required business information that is known as the Business Data Warehouse (BDW). … A necessary prerequisite for the physical implementation of a business data warehouse service is a business process and information architecture that defines (1) the reporting flow between functions and (2) the data required.”
Bill Inmon, as he is more familiarly known, furthered data warehouse development with his 1992 book Building the Data Warehouse, as well as by writing some of the first columns about the topic. Inmon’s top-down design method for building a data warehouse describes the technology as a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports an organization’s decision-making process.
The technology’s growth continued with the founding of The Data Warehousing Institute, now known as TDWI, in 1995, and with the 1996 publication of Ralph Kimball’s book The Data Warehouse Toolkit, which introduced his dimensional modeling approach to data warehouse design.
In 2008, Inmon introduced the concept of data warehouse 2.0, which focuses on the inclusion of unstructured data and corporate metadata.
You could be forgiven for thinking that operational data stores and data warehouses are synonymous. After all, a data warehouse is a place where operational data is stored for analysis and reporting. Case closed — two sides of the same coin, right?
Well, no, not so fast. There’s more to the question of operational data store vs. data warehouse than that. Both do store operational data, but in different forms and for different purposes. And in many cases, organizations incorporate both into their analytics architectures.
The operational data store (ODS) is a bit harder to pin down because there are diverging views on exactly what it is and for what it’s used. But, at heart, an ODS pulls together data from multiple transaction processing systems on a short-term basis, with frequent updates as new data is generated by the source systems. Operational data stores often serve as interim staging areas for data that’s ultimately headed to a data warehouse or a big data platform for long-term storage.
Uses and benefits of an ODS
An ODS generally holds detailed transaction data that has yet to be consolidated, aggregated and transformed into consistent data sets for loading into a data warehouse. From a data integration standpoint, then, an ODS might only involve the first and third elements of the extract, transform and load (ETL) process typically used to pull data from operational systems and to harmonize it for analysis.
In that sense, an operational data store can be thought of as a funnel that takes in raw data from various source systems and helps facilitate the process of feeding business intelligence and analytics systems with more refined versions of that data. The full ETL process is handled downstream, which streamlines data transformation workloads and minimizes the processing pipelines needed between the ODS and the source systems to which it’s connected.
However, some people also view the operational data store as a BI and analytics platform in its own right. Under that scenario, an ODS can be used to do near-real-time data analysis aimed at uncovering tactical insights that organizations can quickly apply to ongoing business operations — for example, to increase retail inventories of popular products based on fresh sales data. By comparison, data warehouses typically support historical analysis of data accumulated over a longer period of time.
Depending on the specific application, an ODS that’s used for data analysis might be updated multiple times daily, if not hourly or even more frequently. Real-time data integration tools, such as change data capture software, can be tapped to help enable such updates. In addition, some level of data cleansing and consistency checks might be applied in the ODS to help ensure that the analytics results are accurate.
ODS and data warehouse design
In weighing operational data store vs. data warehouse deployments, an ODS can potentially be built on a lighter data platform, especially if it’s primarily being used as a temporary way station for data.
For example, an operational data store architecture might be based on the MySQL open source database or the cloud-based Amazon Simple Storage Service as an alternative to traditional data warehouse platforms such as Oracle, Microsoft SQL Server, IBM DB2 and Teradata. In big data environments, Hadoop clusters can provide an ODS staging area for feeding data to either a data warehouse or another cluster built on top of the open source distributed processing framework.
While data usually passes through an ODS relatively quickly to make room for new data coming up behind it, things are different in a data warehouse. The purpose there is to create an archive of data that can be analyzed to track business performance and identify operational trends in order to guide strategic decision-making by corporate and business executives.
A data warehouse might be updated frequently — nightly, in some cases, weekly or monthly in others. But it’s a more static environment than an ODS: Data is typically added, but not deleted, especially in the case of an enterprise data warehouse (EDW), which is designed to provide a single source of consolidated and cleansed data from all of a company’s operations. EDWs tend to be large and complex platforms as a result — a combination that can make deploying them a challenge.
ODS vs. dart mart
Another facet of the operational data store vs. data warehouse discussion is how an ODS compares to a data mart. Data marts are purpose-built data warehouse offshoots — essentially, smaller warehouses that store data related to individual business units or specific subject areas. A data mart and an ODS might be in the same league on storage capacity, but otherwise, they differ in the same way that EDWs and operational data stores do. Like their bigger brethren, data marts are a repository for historical data that has been fully scrubbed and aggregated for analysis.
Two other things to keep in mind about operational data stores: First, they aren’t the same thing as an operational database. The latter is the database built into a transaction system — it’s the location from which the data flowing into an ODS comes. Put another way, transaction data is initially processed in operational databases and then moved to an ODS to begin its analytics journey.
Second, operational data stores are sometimes equated with master data management (MDM) systems. MDM processes enable companies to create common sets of master data on customers, products and suppliers. The master data can then be fed back to transaction systems via an MDM hub, where the data is managed and stored. Early on, some organizations built MDM capabilities into ODS platforms, but that approach seems to have lessened in recent years perhaps partly due to the MDM market not growing like proponents hoped it would, itself a result of MDM’s inherent complexities.