SQL in Data Analytics

SQL (Structured Query Language) is a standardized programming language that’s used to manage relational databases and perform various operations on the data in them. Initially created in the 1970s, SQL is regularly used not only by database administrators, but also by developers writing data integration scripts and data analysts looking to set up and run analytical queries.

The uses of SQL include modifying database table and index structures; adding, updating and deleting rows of data; and retrieving subsets of information from within a database for transaction processing and analytics applications. Queries and other SQL operations take the form of commands written as statements — commonly used SQL statements include select, insert, update, delete, create, drop, alter and truncate.
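
For illustration, here’s a minimal sketch of several of these statements at work; the customers table and its columns are assumptions invented for the example, not taken from any particular product:

    -- DDL: create a table
    CREATE TABLE customers (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100),
        city        VARCHAR(50)
    );

    -- DML: add, change, read and remove rows
    INSERT INTO customers (customer_id, name, city) VALUES (1, 'Acme Corp', 'Boston');
    UPDATE customers SET city = 'Chicago' WHERE customer_id = 1;
    SELECT name, city FROM customers WHERE city = 'Chicago';
    DELETE FROM customers WHERE customer_id = 1;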

SQL became the de facto standard programming language for relational databases after they emerged in the late 1970s and early 1980s. Also known as SQL databases, relational systems comprise a set of tables containing data in rows and columns. Each column in a table corresponds to a category of data — for example, customer name or address — while each row contains a data value for each of those columns.

SQL standard and proprietary extensions

An official SQL standard was adopted by the American National Standards Institute (ANSI) in 1986 and then by the International Organization for Standardization, known as ISO, in 1987. More than a half-dozen joint updates to the standard have been released by the two standards development bodies since then; as of this writing, the most recent version is SQL:2011, approved that year.

Both proprietary and open source relational database management systems built around SQL are available for use by organizations. They include Microsoft SQL Server, Oracle Database, IBM DB2, SAP HANA, SAP Adaptive Server, MySQL (now owned by Oracle) and PostgreSQL. However, many of these database products support SQL with proprietary extensions to the standard language for procedural programming and other functions. For example, Microsoft offers a set of extensions called Transact-SQL (T-SQL), while Oracle’s extended version of the standard is PL/SQL. As a result, the different variants of SQL offered by vendors aren’t fully compatible with one another.
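
A small illustration of that incompatibility is row limiting, where the major products each use their own syntax; the queries below are sketches against the hypothetical customers table from above:

    -- T-SQL (Microsoft SQL Server)
    SELECT TOP 10 name FROM customers ORDER BY name;

    -- MySQL and PostgreSQL
    SELECT name FROM customers ORDER BY name LIMIT 10;

    -- Standard SQL (also Oracle Database 12c and later)
    SELECT name FROM customers ORDER BY name FETCH FIRST 10 ROWS ONLY;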

SQL commands and syntax

SQL commands are divided into several different types, among them data manipulation language (DML) and data definition language (DDL) statements, transaction controls and security measures. The DML vocabulary is used to retrieve and manipulate data, while DDL statements are for defining and modifying database structures. The transaction controls help manage transaction processing, ensuring that transactions are either completed or rolled back if errors or problems occur. The security statements are used to control database access as well as to create user roles and permissions.
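
A brief sketch of the four categories side by side, again using the hypothetical customers table (exact syntax varies somewhat by product):

    -- DML: retrieve and manipulate data
    SELECT name, city FROM customers;

    -- DDL: define and modify database structures
    ALTER TABLE customers ADD COLUMN region VARCHAR(50);

    -- Transaction control: complete the work or roll it back
    START TRANSACTION;
    UPDATE customers SET city = 'Boston' WHERE customer_id = 1;
    COMMIT;   -- or ROLLBACK if a problem occurs

    -- Security: control access through roles and permissions
    GRANT SELECT ON customers TO reporting_role;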

SQL syntax is the coding format used in writing statements. Figure 1 shows an example of a DDL statement written in Microsoft’s T-SQL to modify a database table in SQL Server 2016:

Figure 1. An example of T-SQL code in SQL Server 2016: the ALTER TABLE statement with the (ONLINE = ON | OFF) option.
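
The screenshot itself isn’t reproduced here, but such a statement looks roughly like the following sketch; the table and column names are assumptions, and SQL Server 2016 lets the column change run as an online operation so the table stays available while it executes:

    ALTER TABLE dbo.customers
        ALTER COLUMN name VARCHAR(200) NOT NULL
        WITH (ONLINE = ON);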

SQL-on-Hadoop tools

SQL-on-Hadoop query engines are a newer offshoot of SQL that enable organizations with big data architectures built around Hadoop systems to take advantage of SQL instead of having to use more complex and less familiar languages — in particular, the MapReduce programming environment for developing batch processing applications.

More than a dozen SQL-on-Hadoop tools have become available through Hadoop distribution providers and other vendors; many of them are open source software or commercial versions of such technologies. In addition, the Apache Spark processing engine, which is often used in conjunction with Hadoop, includes a Spark SQL module that similarly supports SQL-based programming.

In general, SQL-on-Hadoop is still an emerging technology, and most of the available tools don’t support all of the functionality offered in relational implementations of SQL. But they’re becoming a regular component of Hadoop deployments as companies look to get developers and data analysts with SQL skills involved in programming big data applications.

SQL-on-Hadoop

SQL-on-Hadoop is a class of analytical application tools that combine established SQL-style querying with newer Hadoop data framework elements. 

By supporting familiar SQL queries, SQL-on-Hadoop lets a wider group of enterprise developers and business analysts work with Hadoop on commodity computing clusters. Because SQL was originally developed for relational databases, it has to be adapted both for the Hadoop 1 model, which uses the Hadoop Distributed File System (HDFS) and MapReduce, and for the Hadoop 2 model, which can work without either HDFS or MapReduce.

The different means for executing SQL in Hadoop environments can be divided into (1) connectors that translate SQL into a MapReduce format; (2) “push down” systems that forgo batch-oriented MapReduce and execute SQL within Hadoop clusters; and (3) systems that apportion SQL work between MapReduce-HDFS clusters or raw HDFS clusters, depending on the workload.

One of the earliest efforts to combine SQL and Hadoop resulted in the Hive data warehouse, which featured HiveQL software for translating SQL-like queries into MapReduce jobs. Other tools that help support SQL-on-Hadoop include BigSQL, Drill, Hadapt, Hawq, H-SQL, Impala, JethroData, Polybase, Presto, Shark (Hive on Spark), Spark, Splice Machine, Stinger, and Tez (Hive on Tez).
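
As a simple illustration, a HiveQL query such as the sketch below (the orders table is hypothetical) reads like ordinary SQL, but Hive translates it into one or more MapReduce jobs behind the scenes:

    SELECT city, COUNT(*) AS order_count
    FROM orders          -- a Hive table over files stored in HDFS
    GROUP BY city
    ORDER BY order_count DESC;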

Selecting the right SQL-on-Hadoop engine to access big data

In the world of Hadoop and NoSQL, the spotlight is now on SQL-on-Hadoop engines. Today, many different engines are available, making it hard for organizations to choose. This article presents some important requirements to consider when selecting one of these engines.

With SQL-on-Hadoop technologies, it’s possible to access big data stored in Hadoop by using the familiar SQL language. Users can plug in almost any reporting or analytical tool to analyze and study the data. Before SQL-on-Hadoop, accessing big data was restricted to the happy few: you had to have in-depth knowledge of technical application programming interfaces, such as the ones for the Hadoop Distributed File System, MapReduce or HBase, to work with the data. Now, thanks to SQL-on-Hadoop, everyone can use their favorite tool. For an organization, that opens up big data to a much larger audience, which can increase the return on its big data investment.

The first SQL-on-Hadoop engine was Apache Hive, but during the last 12 months, many new ones have been released. These include CitusDB, Cloudera Impala, Concurrent Lingual, Hadapt, InfiniDB, JethroData, MammothDB, Apache Drill, MemSQL, Pivotal HawQ, Progress DataDirect, ScleraDB, Simba and Splice Machine.

In addition to these implementations, all the data virtualization servers should be included because they also offer SQL access to Hadoop data. In fact, they are designed to access all kinds of data sources, including Hadoop, and they allow different data sources to be integrated. Examples of data virtualization servers are Cirro Data Hub, Cisco/Composite Information Server, Denodo Platform, Informatica Data Services, Red Hat JBoss Data Virtualization and Stone Bond Enterprise Enabler Virtuoso.

And, of course, there are a few SQL database management systems that support polyglot persistence. This means that they can store data in their own native SQL database or in Hadoop; by doing so, they also offer SQL access to Hadoop data. Examples are EMC/Greenplum UAP, HP Vertica (on MapR), Microsoft PolyBase, Actian ParAccel and Teradata Aster Database (via SQL-H).

SQL equality on Hadoop?

In other words, organizations can choose from a wide range of SQL-on-Hadoop engines. But which one should be selected? Or are they so alike that it doesn’t matter which one is picked?

The answer is that it does matter, because not all of these technologies are created equal. On the outside, they all look the same, but internally they are very different. For example, CitusDB knows where all the data is stored and uses that knowledge to access the data as efficiently as possible. JethroData stores indexes to get direct access to data, and Splice Machine offers a transactional SQL interface.

Selecting the right SQL-on-Hadoop technology requires a detailed study. To get started, you should evaluate the following requirements before selecting one of the available engines.

SQL dialect. The richer the SQL dialect supported, the wider the range of applications that can benefit from it. In addition, the richer the dialect, the more query processing can be pushed to Hadoop and the less the applications and reporting tools have to do.
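
For example, if the dialect includes window functions, a query like this sketch (hypothetical orders table) can be evaluated entirely inside Hadoop instead of forcing the reporting tool to pull out raw rows and rank them itself:

    SELECT customer_id, region, order_total,
           RANK() OVER (PARTITION BY region ORDER BY order_total DESC) AS rank_in_region
    FROM orders;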

Joins. Executing joins on big tables fast and efficiently is not always easy, especially if the SQL-on-Hadoop engine has no idea where the data is stored. An inefficient style of join processing can lead to massive amounts of I/O and can cause colossal data transport between nodes. Both can result in really poor performance.
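
The stakes are visible in a join as ordinary as this sketch (hypothetical tables): an engine that knows both tables are partitioned on customer_id can join matching partitions locally on each node, while an engine without that knowledge may ship large parts of both tables across the network first:

    SELECT c.name, SUM(o.order_total) AS revenue
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.name;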

Non-traditional data. Initially, SQL was designed to process highly structured data: Each record in a table has the same set of columns, and each column holds one atomic value. Not all big data in Hadoop has this traditional structure. Hadoop files may contain nested data, variable data (with hierarchical structures), schema-less data and self-describing data. A SQL-on-Hadoop engine must be able to translate all these forms of data to flat relational data, and must be able to optimize queries on these forms of data as well.
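
Engines approach this differently; in Hive’s dialect, for instance, a nested array column can be flattened into relational rows with LATERAL VIEW and explode(), roughly as in this sketch (the orders table and its items array of structs are assumptions):

    SELECT o.order_id, item.product_id, item.quantity
    FROM orders o
    LATERAL VIEW explode(o.items) item_table AS item;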

Storage format. Hadoop supports some “standard” data storage formats, such as Parquet, Avro and ORCFile. The more SQL-on-Hadoop technologies use such formats, the more tools and other SQL-on-Hadoop engines can read that same data, which drastically reduces the need to replicate it. Thus, it’s important to verify whether a proprietary storage format is used.
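
In Hive’s dialect, for example, picking an open columnar format is a single clause at table creation time; the table definition itself is a sketch:

    CREATE TABLE orders (
        order_id    BIGINT,
        customer_id INT,
        order_total DOUBLE
    )
    STORED AS PARQUET;   -- an open format that other engines and tools can also read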

User-defined functions. To use SQL to execute complex analytical functions, such as Gaussian discriminative analysis and market basket analysis, it’s important that they’re supported by the engine’s SQL dialect or can be developed as custom code. Such custom functions are called user-defined functions (UDFs). It’s also important that the SQL-on-Hadoop engine can distribute the execution of UDFs over as many nodes and disks as possible.
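
In Hive, for instance, a UDF written in Java can be registered and then invoked like a built-in function; the jar path and class name below are placeholders, not a real library:

    ADD JAR /path/to/analytics_udfs.jar;
    CREATE TEMPORARY FUNCTION basket_affinity AS 'com.example.BasketAffinityUDF';

    SELECT basket_affinity(items) AS affinity_score FROM orders;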

Multi-user workloads. It must be possible to set parameters that determine how the engine should divide its resources among different queries and different types of queries. For example, queries from different applications may have different processing priorities; long-running queries should get less priority than simple queries being processed concurrently; and unplanned and resource-intensive queries may have to be cancelled or temporarily interrupted if they use too many resources. SQL-on-Hadoop engines require smart and advanced workload managers.
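
What those parameters look like differs per engine; on Hadoop clusters, one common mechanism is routing sessions to YARN queues with different capacities, as in this Hive-style sketch (the queue name is an assumption about how a cluster might be configured):

    -- Route this session's queries to a low-priority queue for ad hoc work
    SET mapreduce.job.queuename=adhoc_low;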

Data federation. Not all data is stored in Hadoop. Most enterprise data is still stored in other data sources, such as SQL databases. A SQL-on-Hadoop engine must support distributed joins on data stored in all kinds of data sources. In other words, it must support data federation.
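
In an engine that supports federation, a distributed join can reference tables living in different systems within one statement; this sketch is conceptual, since the way external sources are registered and the qualified names below vary per product:

    SELECT c.name, SUM(o.order_total) AS revenue
    FROM crm_db.customers c        -- a table in a relational database
    JOIN hadoop.orders o           -- a table stored in Hadoop
      ON o.customer_id = c.customer_id
    GROUP BY c.name;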

It would not surprise me if every organization that uses Hadoop eventually deploys a SQL-on-Hadoop engine (or maybe even a few). As organizations compare and evaluate the available technologies, assessing the engine’s capabilities for the requirements listed in this article is a great starting point.