In the first installment of Three Questions, a Q&A series with the world’s foremost data experts, we chat with Dipin Hora of Movable Ink.
Dipin spent the first 12 years of his career working in data warehousing and BI before converting to product development for real-time streaming technology company Wallaroo Labs.
He’s currently the Principal Data Engineer at Movable Ink.
1. What is the most difficult ETL-related challenge you’ve faced?
The most difficult problem I’ve encountered in data warehousing and BI in general is getting people to define business rules and business definitions. Whether it’s defining a dashboard, a report, a set of data transformation rules for ETL purposes, or anything of that sort, that has always been the biggest challenge.
For me the actual technical work has always been easy. A technology, whatever it may be, may have its weaknesses, but in most scenarios you can get the job done.
There are some limited cases where the technology matters a lot more, namely when you get to extremely large data sizes. In those cases, the wrong technology simply won’t be able to solve the problem. But at that point it’s more about making sure you’ve chosen the right technology to handle the scale of the problem than about writing the right ETL job. I think most engineers will tell you the biggest difficulties are usually on the business side.
2. Which industries do you think will benefit the most from better stream processing?
I think every industry is going to benefit from stream processing, but I’m not sure we’re going to know which one is going to “win” the most for another 30-50 years. I say that knowing full well how fast technology moves. Stream processing is slowly gaining more and more prominence, but the process will still take a long time to come to fruition.
It’s hard to say which one will benefit the most because they all can, and the thing people underestimate the most is the benefit of being able to get answers quickly. There are folks now who are running a batch job and will say, “It only takes a few hours, nobody’s complaining about it, so what does it matter?”
Of course, on the flip side, if you could have that information in a second, it would dramatically change how valuable that information is to the business. And I think that process of implementing the technology, businesses adapting, and then businesses truly optimizing is going to take a long time.
3. If you could invent one tool to help in the world of streaming data processing, which tool would you invent?
The one thing that would be great for streaming data processing, and maybe a tool already exists for this, is a tool for better data dependency tracking. I don’t think that aspect of stream processing has caught up to where some of the traditional tools like Informatica have gotten to.
A lot of the metadata management and data traceability tools are more mature in the traditional ETL world, and in the streaming world I don’t think they’re there yet. That’s especially true when you take into account the fact that people are writing full programs, not just SQL, which makes traceability a lot harder.
Interested in being featured in the next installment of Three Questions? Email tim at fastdata.io to see if you’re eligible.