How Hadoop and Spark Are Transforming Data Warehousing

Traditional data warehouses cannot handle big data. Hadoop, Spark, Impala, Parquet, and SQL-on-Hadoop will play a big role in data warehouse modernization.

The nature of data that enterprises deal with has been changing in the last few years. It is no longer just transactional data. The volume, velocity, and variety of data are increasing steadily. Why should you care? If your organization is not capturing, storing, and analyzing new types of data, it is potentially losing a competitive edge. Or, to put it the other way, leveraging these new types of data can give you a competitive edge.

In this and the following articles, I am going to cover:

  • How data warehouses are excellent at dealing with structured data
  • Rise of unconventional data
  • Shortcomings of traditional data warehouses
  • How the Hadoop ecosystem and Apache Spark can help

How data warehouses are excellent at dealing with structured data

Business Intelligence, analytics, and data warehousing are mature fields that have existed for more than three decades. Business Intelligence software is typically powered by an underlying data warehouse and data mart(s). Traditional data warehouses and data marts rely on relational databases like Oracle, Teradata, MySQL, and the like, following a star schema. SQL constructs make it possible to slice, dice, and drill down/up through data along different dimensions.
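To make the star-schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The fact table, dimension table, and field names are hypothetical; the point is only to show how a GROUP BY over a dimension "slices" the facts.

```python
import sqlite3

# Hypothetical minimal star schema: a sales fact table joined to a
# region dimension table, then sliced along the region dimension.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE fact_sales (region_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO dim_region VALUES (?, ?)",
                [(1, "East"), (2, "West")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])

# Slice total sales along the region dimension.
rows = cur.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_region d USING (region_id)
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(rows)  # [('East', 150.0), ('West', 75.0)]
```

In a real warehouse the fact table holds millions of rows and there are several dimension tables (date, product, customer, etc.), but the query shape is the same.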

Data warehouses have worked well, and they continue to serve their purpose. Most of this data has been highly structured, i.e., transactional data well suited to relational databases. But what about big data, the new kind of unconventional data that is semi-structured or unstructured? What about data arriving at very high velocity, at thousands or millions of events per second?

Rise of unconventional data

A lot has been written and said about how data is growing and about the 3 (or 4) V’s (Volume, Velocity, Variety, and Veracity). Sensor data, smart phones, images, social media, and machine logs are new types of unstructured and semi-structured data. They can be effectively used to gain a competitive edge.

Traditional data warehouses cannot handle these new types of data; they cannot cope with all the V's. The associated licensing and infrastructure costs are also prohibitive.

Shortcomings of traditional data warehouses

Inability to deal with the 3 V’s. It is hard to handle the 3 V’s efficiently without breaking the bank. Besides, there is no efficient way of handling unstructured and semi-structured data.

Lots of ETL. Moving data from source transactional systems (OLTP systems) requires a number of ETL jobs to move the data around and transform it for the target data mart.
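The extract-transform-load hops described above can be sketched in a few lines of plain Python. The source rows, field names, and transformations here are hypothetical, but they show the shape of the work these jobs do.

```python
# Extract: rows as they might arrive from a hypothetical OLTP source.
oltp_orders = [
    {"order_id": 1, "customer": "acme corp", "total_cents": 12050},
    {"order_id": 2, "customer": "globex", "total_cents": 990},
]

def transform(row):
    # Transform: normalize names and convert cents to dollars to
    # match the target data mart's conventions.
    return {
        "order_id": row["order_id"],
        "customer": row["customer"].title(),
        "total_usd": row["total_cents"] / 100.0,
    }

# Load: append the transformed rows into the target mart table.
mart_fact_orders = [transform(r) for r in oltp_orders]
print(mart_fact_orders[0])
# {'order_id': 1, 'customer': 'Acme Corp', 'total_usd': 120.5}
```

Real pipelines add scheduling, incremental loads, and error handling, which is where much of the ETL cost and complexity comes from.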

Inability to retain data. High storage costs mean data has to be archived to tape, where it becomes effectively inaccessible.

How the Hadoop ecosystem and Apache Spark can help

The Hadoop ecosystem has come a long way since its inception around 2005. It has proven itself in handling the famous 3 V’s. A great deal of research has gone into distributed file systems, storage formats, distributed computing, SQL-on-Hadoop engines, security, and more.

Hadoop’s MapReduce framework was traditionally used to analyze data stored in Hadoop. Apache Spark is a distributed computing framework that can read data from HDFS or Amazon S3 and perform fast in-memory analytics. Spark can be 10x to 100x faster than MapReduce, even when all the data cannot fit in the cluster’s memory.
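For readers unfamiliar with the MapReduce model itself, the following is an illustrative sketch in plain Python (not Hadoop's or Spark's actual API): map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group. Spark expresses the same pattern but keeps intermediate data in memory across steps instead of writing it to disk between jobs.

```python
from collections import defaultdict

# Toy input standing in for lines of a distributed file.
lines = ["big data", "fast data", "big cluster"]

# Map phase: emit (word, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'fast': 1, 'cluster': 1}
```

In a real cluster each phase runs in parallel across many machines, and the shuffle moves data over the network; avoiding repeated disk writes between such phases is where Spark's in-memory speedup comes from.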

Storage – HDFS and Amazon S3 are two widely used resilient, distributed storage systems that can scale from terabytes to petabytes and beyond.

Storage formats – Special-purpose, efficient columnar storage formats like Parquet have evolved to enable fast analytic queries.
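A minimal sketch of why columnar layouts like Parquet's suit analytics: an aggregate over one column touches only that column's values rather than whole rows. The records and field names below are hypothetical, and this only illustrates the layout idea, not Parquet's actual on-disk format.

```python
# Row layout: whole records stored together, as in a CSV or a
# row-oriented database page.
rows = [
    {"user": "a", "country": "US", "amount": 10.0},
    {"user": "b", "country": "DE", "amount": 20.0},
    {"user": "c", "country": "US", "amount": 5.0},
]

# Columnar layout: each field's values stored contiguously.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An analytic aggregate reads only the one column it needs.
total = sum(columns["amount"])
print(total)  # 35.0
```

Contiguous per-column storage also compresses well (similar values sit together), which is another reason formats like Parquet speed up scans.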

SQL-on-Hadoop engines – Cloudera’s Impala, Spark SQL, and Apache Drill are distributed query engines that provide low-latency, interactive SQL queries over HDFS files.

The SQL-on-Hadoop engines are getting faster by the day and are catching up with the speeds of relational databases. They also offer JDBC/ODBC connectivity, which means that existing BI tools like OBIEE, Tableau, MicroStrategy, Qlik, and others can connect to and query the big data.

Traditional data warehouses are here to stay, at least for the near future. The Hadoop ecosystem, Spark, and NoSQL databases can help modernize your BI infrastructure by providing the ability to store and analyze new types of unstructured, semi-structured, high-volume, high-velocity data.

Are you looking to modernize your DW/BI infrastructure?

Originally published at http://brevitaz.com/hadoop-spark-data-warehousing-business-intelligence/ on May 10, 2016.


Pranav Shukla

Pranav is a big data architect, technologist, and author with over 14 years of experience building scalable, reactive, data-intensive enterprise applications for large clients and startups. He is the founder of Valens DataLabs, a technology startup.
