Multiple shades of No-ETL
Late 2022, AWS announced the preview of a brand-new No-ETL integration from Aurora RDMS database to Redshift warehouse during their tech event re:Invent in Las Vegas. While being attracted by this news, I can not help to feel enthusiastically skeptical. The term No-ETL seems too good to be true. As serverless does not mean no server literally (but someone else taking care of it for you), one might wonder if No-ETL is just a marketing term? In this post, let’s deep dive into the concept and the latest implementation of this tech trend.
Energy cannot be created or destroyed, it can only be changed from one form to another.
— Abert Einstein
Inspiring from the law of energy conservation, one might just wonder where ETL would go? If ETL has been around for a while, it should be there for a good reason? What will take its place?
ETL, what is this?
ETL represents Extract (data from some data sources to a staging storage), Transform (the staged data to an easier-to-use data format or features) and Load (into a target storage).
Let’s say, we have a business analytic use case which requires data from different sources: Customer transactional record (in till-roll format), spatial location in shapefiles, processing logs in text files, historical sales in tabular (RDMBS), inventory network in a graph database, and maybe some trading info available via APIs.
The first step E in an ETL pipeline for this use case would be to extract relevant data from those sources to put into a staging storage, ideally in a common format to facilitate the next step.
In T step, data now needs to be cleaned, filling missing values, discarded corrupted values. Different pieces of data might need to be joint together. Data at different time frequencies or spatial resolutions would require to be resampled. New derivative calculations might need to be computed, or summary values might need to be aggregated. Also, depending on the regulation, some sensitive information would need to be anonymized, encrypted etc…
The transformed data will finally be Loaded into data warehouse or datalake ready for BI analysis, reporting or AI modelling.
But why do we ever need ETL?
These days, data informs business decision. Data analysis (BI, reporting) are essential to extract the valuable information from stored raw data. As a rule of thumb in the data analytics, we have: garbage in, garbage out. Pertinent analysis thus requires consolidated input data.
For enterprise standard, to avoid performance impact between online transactional database (OLTP) and online analytical processing (OLAP) database, those two set of databases are often separated. Also, with the growth of the business, the volume of data is much larger, and more databases are added. To scale the analysis solution, a central data repository, or often referred as the data warehouse, is required. Data from different databases now need to be pulled and preprocessed into the data warehouse, ready for BI or reporting. We are talking about data ingestion. To have a robust and up-to-date flow of data ingestion, we would need ETL pipelines that run periodically or reactively to pull data from different databases into the data warehouse.
With the growth of new type of data (unstructured, social network, etc…) & databases (noSQL, graph, documentDB), robust ETL has become an essential part of a data analysis workflow.
Draw back of ETL
Despite the important role of ETL in a data flow, it has inherent drawbacks. The two biggest challenges are: time lag and maintenance.
For a practical reason, ETL pipeline will run periodically, in batch or by event-trigger. Depending on the complexity of data reformatting, cleaning and feature engineering, data transformation also takes time. Those processes will introduce a time lag. In many cases, it does not matter. But for business cases that require real-time analysis and response (e.g. real time recommendation system, auction, online gaming), time lag due to ETL pipeline is a hindrance.
The second main problem is the maintenance of ETL pipelines. In practice, those pipelines are often handled by different entities or persons (e.g. data engineers) other than data analysists. A failed or erroneous ETL job will have direct impact on the input, and thus the result, of the data analysis side. In addition to the data quality, keeping ETL in good shape also incurs costs for additional staffing.
No ETL in different shades
To address those limitations of a canonical ETL, No-ETL is proposed. Although referencing to the same terminology, different platform or cloud providers implement No-ETL in different ways.
Azure
In 2020, Azure introduced Azure Synapse Link
that removes the need of an intermediate ETL pipeline to connect OLTP Azure Cosmos DB to OLAP Azure Synapse Analytics. They used an auto-sync from the transactional store, with row storage format optimized for transactional I/o, to a newly introduced analytical store, with column storage format that is optimized for analytical queries.
AWS
In the AWS re:Invent 2022, a preview feature for No-ETL has been introduced for transactional database Aurora
and Redshift
data warehouse. This is done by synchronously copying data from Aurora to Redshift. The end user thus can query directly all data from Aurora inside Reshift without the need, as previously, to build an intermediate pipeline to connect both. The copying is done efficiently at the storage level of both Aurora and Redshift to optimize the sync time.
In the same trend, AWS also integrates Spark
query in Redshift
. This makes it possible to run Spark jobs to read data from Redshift in EMR
or Sagemaker
. The steps of reading and SQL transforming of data will be executed directly inside Redshift and only the transformed data will be passed back to reduce the amount of data transferring and optimize the processing time.
GoogleCloud
Google has also proposed a No-ETL solution to query data from BigTable
into BigQuery
warehouse, using federated query. Query statements will be sent via API connection to the external databases in their SQL dialect, and the output will be sent back as a temporal table in Google SQL data format. In this way, data will not need to be copied, only required information will be retrieved.
With the advantage of flexibility and no intermediated storage, federated query also comes with inherent drawback. Compared to a warehouse, the federated query is often slower due to network bottle necks (to directly connect to external databases). The query is limited to read-only. In addition, federated query is good for one-off analytic use case but for frequency usage this might lead to waste of resources (or might need a well design caching strategy).
Databricks
Strictly speaking, the solution proposed by Databricks — Lakehouse is not exactly No-ETL. Yet, it is still worthy to note when discussing on this topics. With the fusion of the classical warehouse optimized for BI with structured data and the datalake with all data structures but more suitable for AI workload, Lakehouse might be a compromising land for both.
ETL will still be required but the burden is reduced with a better control of raw data formats (using Delta table), and a common platform for coding and governing for more flexible and robust ETL pipelines.
As a fair note, as mentioned in Databricks Blog, the current performance of the Data Lakehouse might not be as optimized for Data Warehouse for some functionalities (due to the maturation of this new architecture) but hopefully it will catch up in a near future.
So what?
If your use case requires real-time data insight, it is good news that we have now a spec of solutions to go for. As data sync from different sources to data warehouse is not that trivial, it is quite exciting that we can now, as AWS claimed, analyze “petabytes of transactional data in near realtime”. Knowing that No-ETL trends has just started, new developments might bring more exciting tech solutions to improve performance and increase adoption.
Yet, like any other architecture choice, there is always trade off. No-ETL is not applicable for all cases. In current commercialized solutions, No-ETL are not yet available for use cases with multiple types of databases and raw data formats. For the example use case mentioned at the beginning of this post, we still have a hard time to figure out a No-ETL solution.
Additionally, with No-ETL, data cleansing is missing. If input data are missing or erroneous, that will affect the BI output. As a mitigation, we can implement on-the-fly preprocessing data, but that will complexify the whole analytic solution. This means, extra code refractor, packaging, versioning would be required and thus incurs extra cost. If the preprocessing is heavy and requires extra feature generation, we will waste the computing power for repetitive (on-the-fly) runs.
It’s also worth notice that the current No-ETL solutions require the input and output data to be in the same cloud provider. Technically, it is understandable as the insider’s knowledge of the storage infractuctures is necessary to optimize data synchronization from the data sources to data warehouse or datalake. Yet, this imposes a vendor locking that, for strategic reasons, many companies might need to think twice.
Final Thoughts
From what we saw, it seems to me so-called No-ETL is more like a managed EL(T). In which the EL — extracting and loading parts — are implicitly handled by the cloud providers. The transforming part will be executed as needed before BI or reporting, and is in the hands of the data analysts.
EL(T) seems to be a perfect fit for the new Lakehouse architecture. In a near future, new services that manage to combine the efficient data sync (as in AWS No-ETL) for a fast EL- pipeline; and a uniformed data governance and computing platform for optimized T-pipeline, as well as downstream analysis (BI, ML), will be a very promising track to follow up.
…
Thanks a lot for reading. As usual, any comments or suggestions are more than welcome.
Thanks to Antoine Lagadec for proofreading.
References
- https://learn.microsoft.com/en-us/azure/cosmos-db/synapse-link
- https://cloud.google.com/blog/products/data-analytics/bigtable-bigquery-federation-brings-hot--cold-data-closer
- https://aws.amazon.com/about-aws/whats-new/2022/11/amazon-aurora-zero-etl-integration-redshift/
- https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html