“Extract, Transform, Load”
If you’re an engineer dealing with never-ending streams of data, these three separately innocuous words haunt your dreams when they come together.
The process of extracting data from multiple sources, transforming it to suit business needs, and loading it into a destination database is commonly called ETL, but while ETL is by definition three distinct steps, in reality it’s a much more nuanced process that requires a variety of actions.
As ETL pipelines have become increasingly complex, data engineers have become more innovative and creative in finding solutions for their numerous data streams.
Finding the right solutions can’t just fall on the engineers, however. Each organization needs buy-in from all stakeholders to decide the most important pieces of data to extract, how to transform that data, and where that data should live. In short, if ETL isn’t a team effort, you’re bound to fail (we’ll get to that below).
So what are the three most common ETL mistakes, and more importantly, how do you avoid them? Let’s dive in…
1. Choosing the Wrong Hardware or Software
The most common mistake made when designing and building an ETL solution is buying new tools and writing code before having a comprehensive understanding of business requirements/needs.
Survey every major stakeholder, understand everyone’s goals, and then begin constructing your solution. It’s vital to choose the right tool for each component of your stack and allow yourself to update components of your ETL process when technologies and business needs change without having to rebuild from the ground-up.
Many engineers have been trained to “just figure it out”, but communication is key when it comes to ETL.
2. Underestimating Volume
This is a big one that affects virtually every company that relies on an ETL pipeline at one point or another.
If you’re reading this, you know this, but nonetheless it’s worth repeating: the amount of data you’re processing only goes up, not down. It’s impossibly difficult to predict, but there are solutions made to tackle the incredibly common, undeniably burdensome issue of massive increases in data (this seems like a good spot to shamelessly plug our solution, PlasmaENGINE®).
When you’re choosing tools, make sure to leave plenty of bandwidth to scale up. If you’re processing tons of data, there’s a very small likelihood the amount of data you’re processing will ever go down…
3. Forgetting the Customer
We’ve already alluded to the importance of a “team effort”, but let’s hammer it home.
The first question is: Who wants the data you are ETLing? What information do they need to glean from it? Conduct interviews with everyone pertinent in your organization — not just managers — and gather all the “data” to understand how to handle your company’s data.
The easiest way to conceptualize the process is to view your business units as customers and over-communicate with them to ensure you build the system they need.