Data Preparation

The specifics of the data preparation process vary by industry, organization and need, but the framework remains largely the same.

• Gather data

The data preparation process begins with finding the right data, which can come from an existing data catalog or be added ad hoc.

• Discover and assess data

After collecting the data, it is important to discover each dataset. This step is about getting to know the data and understanding what has to be done before the data becomes useful in a particular context.

• Cleanse and validate data

Cleaning up the data is traditionally the most time-consuming part of the data preparation process, but it’s crucial for removing faulty data and filling in gaps. Important tasks here include:

  • Removing extraneous data and outliers.
  • Filling in missing values.
  • Conforming data to a standardized pattern.
  • Masking private or sensitive data entries.

Once data has been cleansed, it must be validated by testing for errors introduced in the data preparation process up to this point. Often, an error in the system will become apparent during this step and will need to be resolved before moving forward.
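As a rough illustration only, the sketch below shows how these cleansing and validation tasks might look in Python with pandas; the table, column names and thresholds are invented for the example, not part of any particular toolchain.

```python
import pandas as pd

# Hypothetical raw extract; the columns and values are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 29, 29, 260, None],              # 260 is an obvious outlier
    "country": ["us", "US", "US", "u.s.", "DE"],
    "ssn": ["111-22-3333", "222-33-4444", "222-33-4444",
            "333-44-5555", "444-55-6666"],
})

# Remove extraneous data and outliers.
df = df.drop_duplicates(subset="customer_id")
df = df[df["age"].between(0, 120) | df["age"].isna()]

# Fill in missing values.
df["age"] = df["age"].fillna(df["age"].median())

# Conform data to a standardized pattern.
df["country"] = df["country"].str.upper().str.replace(".", "", regex=False)

# Mask private or sensitive data entries.
df["ssn"] = df["ssn"].str.replace(r"^\d{3}-\d{2}", "***-**", regex=True)

# Validate: fail fast if the steps above left bad rows behind.
assert df["customer_id"].is_unique
assert df["age"].between(0, 120).all()
```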

• Transform and enrich data

Transforming data is the process of updating the format or value entries in order to reach a well-defined outcome, or to make the data more easily understood by a wider audience. Enriching data refers to adding and connecting data with other related information to provide deeper insights.
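As a small, hedged sketch (the sales and product tables below are invented for illustration), transformation and enrichment might look like this:

```python
import pandas as pd

sales = pd.DataFrame({"sku": ["A1", "B2"], "qty": [3, 5],
                      "unit_price": ["9.99", "4.50"]})
products = pd.DataFrame({"sku": ["A1", "B2"],
                         "category": ["Dairy", "Bakery"]})

# Transform: convert string prices to numbers and derive a revenue column.
sales["unit_price"] = sales["unit_price"].astype(float)
sales["revenue"] = sales["qty"] * sales["unit_price"]

# Enrich: join the sales data with related reference data for deeper insight.
enriched = sales.merge(products, on="sku", how="left")
print(enriched)
```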

• Store data

Once prepared, the data can be stored or channeled into a third-party application, such as a business intelligence tool, clearing the way for processing and analysis to take place.

Let’s look at some of these concepts in more detail:

Raw Data

Raw data (sometimes called source data or atomic data) is data that has not been processed for use. A distinction is sometimes made between data and information to the effect that information is the end product of data processing. Raw data that has undergone processing is sometimes referred to as cooked data.

Although raw data has the potential to become “information,” it requires selective extraction, organization, and sometimes analysis and formatting for presentation. For example, a point-of-sale terminal (POS terminal) in a busy supermarket collects huge volumes of raw data each day, but that data doesn’t yield much information until it is processed. Once processed, the data may indicate the particular items that each customer buys, when they buy them, and at what price. Such information can be further subjected to predictive technology analysis to help the owner plan future marketing campaigns.
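To make the POS example concrete, here is a minimal sketch (with invented transaction fields) of processing raw line items into information:

```python
import pandas as pd

# Raw, unprocessed transaction lines as they might arrive from a POS terminal.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "item": ["milk", "bread", "milk"],
    "price": [2.49, 1.99, 2.49],
    "timestamp": pd.to_datetime(["2024-01-05 09:12", "2024-01-05 09:12",
                                 "2024-01-05 17:40"]),
})

# Processing turns the raw rows into information: what each customer buys,
# when they buy it, and at what price.
summary = raw.groupby(["customer_id", "item"]).agg(
    purchases=("item", "count"),
    avg_price=("price", "mean"),
    last_bought=("timestamp", "max"),
)
print(summary)
```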

As a result of processing, raw data sometimes ends up in a database, which enables the data to become accessible for further processing and analysis in a number of different ways.

Data Profiling

Data profiling is the process of examining, analyzing and reviewing data to collect statistics surrounding the quality and hygiene of the dataset. Data quality refers to the accuracy, consistency, validity and completeness of data. Data profiling may also be known as data archeology, data assessment, data discovery or data quality analysis.

The first step of data profiling is gathering one or more data sources and their metadata for analysis. The data is then cleaned to unify structure, eliminate duplications, identify interrelationships and find any anomalies. Once the data is clean, different data profiling tools will return various statistics to describe the dataset. These could include the mean, minimum/maximum values, frequency, recurring patterns, dependencies or data quality risks.

For example, by examining the frequency distribution of different values for each column in a table, an analyst can gain insight into the type and use of each column. Cross-column analysis can be used to expose embedded value dependencies, and inter-table analysis allows the analyst to discover overlapping value sets that represent foreign key relationships between entities.
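A minimal, single-table profiling pass might look like the sketch below (pandas is assumed and the table is invented; dedicated profiling tools automate far more than this):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "status": ["shipped", "shipped", "returned", None],
    "amount": [19.99, 5.00, 19.99, 250.00],
})

# Column-level statistics: mean, minimum/maximum values, completeness.
print(df["amount"].describe())
print(df.isna().mean())                  # share of missing values per column

# Frequency distribution of each value, which hints at a column's type and use.
print(df["status"].value_counts(dropna=False))
```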

Organizations can use data profiling at the beginning of a project to determine if enough data has been gathered, if any data can be reused or if the project is worth pursuing. The process of data profiling itself can be based on specific business rules that will uncover how the dataset aligns with business standards and goals.

Profiling tools evaluate the actual content, structure and quality of the data by exploring relationships that exist between value collections both within and across data sets. Vendors that offer software and tools that can automate the data profiling process include Informatica, Oracle and SAS.

Types of data profiling

While all applications of data profiling involve organizing and collecting information about a database, there are also three specific types of data profiling.

  1. Structure discovery: This focuses on the formatting of the data, making sure everything is uniform and consistent. It also uses basic statistical analysis to return information about the validity of the data.
  2. Content discovery: This process assesses the quality of individual pieces of data; for example, ambiguous, incomplete and null values are identified.
  3. Relationship discovery: This detects connections, similarities, differences and associations between data sources (a brief sketch of all three types follows this list).
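The sketch below (with invented orders and customers tables) hints at what each type of discovery checks; it is illustrative only and not how any particular profiling tool works.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"customer_id": [1, 2, 9],
                       "zip": ["30301", "3030", None]})

# Structure discovery: is every zip code formatted as five digits?
print(orders["zip"].str.fullmatch(r"\d{5}"))

# Content discovery: how many values are null or incomplete?
print(orders.isna().sum())

# Relationship discovery: do order customer_ids match the customers table,
# as a foreign key relationship would require?
orphans = set(orders["customer_id"]) - set(customers["customer_id"])
print(orphans)   # {9} -> a row that violates the implied relationship
```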

Benefits of data profiling

Data profiling returns a high-level overview of data that can result in the following benefits:

  • Leads to higher quality, more credible data.
  • Helps with more accurate predictive analytics and decision making.
  • Makes better sense of the relationships between different datasets and sources.
  • Keeps company information centralized and organized.
  • Eliminates errors associated with high costs, such as missing values or outliers.
  • Highlights areas within a system that experience the most data quality issues, such as data corruption or user input errors.
  • Produces insights surrounding risks, opportunities and trends.

Examples of data profiling applications

Data profiling can be implemented in a variety of use cases where data quality is important. For example, projects that involve data warehousing or business intelligence may require gathering data from multiple disparate systems or databases for one report or analysis. Applying the data profiling process to these projects can help identify potential issues and corrections that need to be made in ETL processing before moving forward.

Additionally, data profiling is crucial in data conversion or data migration initiatives that involve moving data from one system to another. Data profiling can help identify data quality issues that may get lost in translation, or adaptations that must be made to the new system prior to migration.

Data Scrubbing

Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated. An organization in a data-intensive field like banking, insurance, retailing, telecommunications, or transportation might use a data scrubbing tool to systematically examine data for flaws by using rules, algorithms, and look-up tables. Typically, a database scrubbing tool includes programs that are capable of correcting a number of specific types of mistakes, such as adding missing zip codes or finding duplicate records. Using a data scrubbing tool can save a database administrator a significant amount of time and can be less costly than fixing errors manually.
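A toy scrubbing pass in that spirit might look like the sketch below; the look-up table and column names are hypothetical and not part of any vendor tool.

```python
import pandas as pd

# Hypothetical look-up table mapping city names to missing zip codes.
zip_lookup = {"Atlanta": "30301", "Boston": "02108"}

records = pd.DataFrame({
    "name": ["A. Smith", "A. Smith", "B. Jones"],
    "city": ["Atlanta", "Atlanta", "Boston"],
    "zip": [None, None, "02108"],
})

# Rule: fill in missing zip codes from the look-up table.
records["zip"] = records["zip"].fillna(records["city"].map(zip_lookup))

# Rule: remove exact duplicate records.
records = records.drop_duplicates()
print(records)
```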

Extract, Load, Transform

Extract, Load, Transform (ELT) is a data integration process for transferring raw data from a source server to a data system (such as a data warehouse or data lake) on a target server and then preparing the information for downstream uses. 

ELT consists of a data pipeline in which three operations are performed on the data:

The first step is to Extract the data. Extracting data is the process of identifying and reading data from one or more source systems, which may be databases, files, archives, ERP, CRM or any other viable source of useful data.

The second step is to Load the extracted data. Loading is the process of adding the extracted data to the target database.

The third step is to Transform the data. Data transformation is the process of converting data from its source format to the format required for analysis. Transformation is typically based on rules that define how the data should be converted for usage and analysis in the target data store. Although transforming data can take many different forms, it frequently involves converting coded data into usable data using code and lookup tables.

Examples of transformations include the following; a short code sketch follows the list:

  • Replacing codes with values
  • Aggregating numerical sums
  • Applying mathematical functions
  • Converting data types
  • Modifying text strings
  • Combining data from different tables and databases
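A few of the transformations listed above, sketched with invented codes and columns:

```python
import pandas as pd

raw = pd.DataFrame({"country_code": ["US", "DE", "US"],
                    "amount": ["10.50", "3.00", "7.25"]})

# Replace codes with values using a look-up table.
country_names = {"US": "United States", "DE": "Germany"}
raw["country"] = raw["country_code"].map(country_names)

# Convert data types and apply a mathematical function.
raw["amount"] = raw["amount"].astype(float).round(2)

# Aggregate numerical sums.
totals = raw.groupby("country", as_index=False)["amount"].sum()
print(totals)
```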

How ELT works

ELT is a variation of Extract, Transform, Load (ETL), a data integration process in which transformation takes place on an intermediate server before the data is loaded into the target. In contrast, ELT allows raw data to be loaded directly into the target and transformed there.

With an ELT approach, a data extraction tool is used to obtain data from a source or sources, and the extracted data is stored in a staging area or database. Any required business rules and data integrity checks can be run on the data in the staging area before it is loaded into the data warehouse. All data transformations occur in the data warehouse after the data is loaded.
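The end-to-end flow can be sketched as below; an in-memory SQLite database stands in for the warehouse, and the table and column names are invented for the example.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (a CSV string stands in for the source system).
source = io.StringIO("id,amount\n1,10.5\n2,3.0\n")
rows = list(csv.DictReader(source))

# Load: land the raw, untransformed data in the target store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (id TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (:id, :amount)", rows)

# Transform: use the target's own engine (plain SQL here) after the data is loaded.
conn.execute("""
    CREATE TABLE sales AS
    SELECT CAST(id AS INTEGER) AS id, CAST(amount AS REAL) AS amount
    FROM raw_sales
""")
print(conn.execute("SELECT SUM(amount) FROM sales").fetchone())   # (13.5,)
```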

ELT vs. ETL

The differences between ELT and a traditional ETL process are more significant than just switching the L and the T. The biggest difference is how, when and where the data transformations are performed.

With ETL, the raw data is not available in the data warehouse because it is transformed before it is loaded. With ELT, the raw data is loaded into the data warehouse (or data lake) and transformations occur on the stored data.

Staging areas are used for both ELT and ETL, but with ETL the staging areas are built into the ETL tool being used. With ELT, the staging area is in a database used for the data warehouse.

ELT is most useful for processing the large data sets required for business intelligence (BI) and big data analytics. Non-relational and unstructured data is better suited to an ELT approach because the data is copied “as is” from the source. Applying analytics to unstructured data typically uses a “schema on read” approach, as opposed to the traditional “schema on write” used by relational databases.

Loading data without first transforming it can be problematic if you are moving data from a non-relational source to a relational target because the data will have to match a relational schema. This means it will be necessary to identify and massage data to support the data types available in the target database.

Data type conversion may need to be performed as part of the load process if the source and target data stores do not support all the same data types. Such problems can also occur when moving data from one relational database management system (DBMS) to another, such as Oracle to Db2, because the data types supported differ from DBMS to DBMS.
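As one small, hedged illustration of that "identify and massage" step (the price column is invented), values that will not fit the target's numeric type can be flagged before loading:

```python
import pandas as pd

incoming = pd.DataFrame({"price": ["19.99", "N/A", "7"]})

# Coerce to the target's numeric type; values that cannot be converted become NaN.
converted = pd.to_numeric(incoming["price"], errors="coerce")

# Identify the rows that need attention before loading into a relational target.
print(incoming[converted.isna()])
```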

ETL should be considered the preferred approach over ELT when extensive data cleansing is needed before loading the data to the target system, when numerous complex computations are required on numeric data, and when all of the source data comes from relational systems.

The following comparison covers different facets of ELT and ETL:

Order of Processes
  • ELT: Extract, Load, Transform
  • ETL: Extract, Transform, Load

Flexibility
  • ELT: Because transformation is not dependent on extraction, ELT is more flexible than ETL for adding more extracted data in the future.
  • ETL: More upfront planning is needed to ensure that all relevant data is being integrated.

Administration
  • ELT: More administration may be required, as multiple tools may need to be adopted.
  • ETL: Typically, a single tool is used for all three stages, which can simplify administration.

Development Time
  • ELT: With a more flexible approach, development time may expand depending on requirements and approach.
  • ETL: ETL requires upfront design planning, which can result in less overhead and development time because only relevant data is processed.

End Users
  • ELT: Data scientists and advanced analysts.
  • ETL: Users reading reports and SQL coders.

Complexity of Transformation
  • ELT: Transformations are coded by programmers (e.g., using Java) and must be maintained like any other program.
  • ETL: Transformations are coded in the ETL tool by data integration professionals experienced with the tool.

Hardware Requirements
  • ELT: Typically, ELT tools do not require additional hardware, instead using existing compute power for transformations.
  • ETL: It is common for ETL tools to require specific hardware with their own engines to perform transformations.

Skills
  • ELT: ELT relies mostly on native DBMS functionality, so existing skills can be used in most cases.
  • ETL: ETL requires additional training and skills to learn the tool set that drives the extraction, transformation and loading.

Maturity
  • ELT: ELT is a relatively new practice, so there is less expertise and fewer best practices available.
  • ETL: ETL is a mature practice that has existed since the 1990s; there are many skilled technicians, established best practices and useful ETL tools on the market.

Data Stores
  • ELT: Mostly Hadoop, perhaps a NoSQL database; rarely a relational database.
  • ETL: Almost exclusively relational databases.

Use Cases
  • ELT: Best for unstructured and non-relational data. Ideal for data lakes. Can work for homogeneous relational data, too. Well suited for very large amounts of data.
  • ETL: Best for relational and structured data. Better for small to medium amounts of data.

Benefits of ELT

One of the main attractions of ELT is the reduction in load times relative to the ETL model. Taking advantage of the processing capability built into the data warehousing infrastructure reduces the time that data spends in transit and is usually more cost-effective. ELT can be more efficient because it utilizes the compute power of modern data storage systems.

When you use ELT, you move the entire data set as it exists in the source systems to the target. This means that you have the raw data at your disposal in the data warehouse, in contrast to the ETL approach where the raw data is transformed before it is loaded to the data warehouse. This flexibility can improve data analysis, enabling more analytics to be performed directly within the data warehouse without having to reach out to the source systems for the untransformed data.

Using ELT can make sense when adopting a big data initiative for analytics. Big data initiatives often involve large volumes and a wide variety of data, both of which are better suited to ELT.

Uses of ELT

ELT is often used in the following cases:

  • when the data is structured and the source and target databases are the same type (e.g., an Oracle source and target);
  • when the data is unstructured and massive, such as when processing and correlating data from log files and sensors;
  • when the data is relatively simple, but there are large amounts of it;
  • when there is a plan to use machine learning tools to process the data instead of traditional SQL queries; and
  • when a schema-on-read approach is planned.

ELT tools and software

Although ELT can be performed using separate tools for extracting, loading and transforming the data, tools exist that integrate all ELT processes. When seeking an ELT tool, users should look for the ability to read data from multiple sources, specifically the sources that their organization uses and intends to use. Most tools support a wide variety of source and target data stores and database systems.

Users may also want tools that can perform both ETL and ELT, as organizations are likely to need both data integration techniques.

A data store is needed to manage the target data mart, data warehouse and/or data lake. For an ELT approach, NoSQL database management systems and Hadoop are viable candidates, as are purpose-built data warehouse appliances. In some cases, a traditional relational DBMS may be appropriate.