In the second installment of Three Questions, a Q&A series with the world’s foremost data experts, we chat with Tom Wentworth, Manager of Data Science at Wayfair.
1. What is the most difficult ETL-related challenge you’ve faced?
Pulling together data from multiple data sources where teams have different business objectives in mind. We use finance data in our pricing algorithms to find out what something actually costs. The finance team owns that data, and there’s also the accounting team to factor in, which has its own set of rules, so you get odd rules that affect the behavior of the data.
For example, if a customer orders a replacement item, it actually looks like a new order if you don’t know where to look in the data. Which order does the revenue go in on? There are accounting rules that actually define that, which depend on when the product was shipped.
Some of these intricacies don’t make sense to a non-accountant, so when building a shared view into our finance data to be used across pricing, we put a lot of effort into digging into the data, documenting how all of this analysis was happening, and presenting the data in a way that didn’t require the end user to understand everything when viewing the final result.
The big thing we did was ensure the table we were building contained only real costs and no estimates. The previous source was a mix of estimates and real costs, which meant you didn’t know what you were looking at.
We also grouped associated orders together to make them look like one, which allowed an analyst or a scientist to easily make sense of everything. From the pricing perspective, there was only one order, and one customer interaction where pricing was a factor.
A lot of these processes weren’t documented, which is common in a company that grows quickly. The biggest challenge was getting data that made sense, which we were ultimately able to do.
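The consolidation described above can be sketched in a few lines. This is a minimal illustration, not Wayfair’s actual pipeline: the field names (`parent_order_id`, `is_estimate`) are hypothetical stand-ins for whatever links a replacement order to its original and marks a cost as estimated.

```python
from collections import defaultdict

# Hypothetical order records: a replacement order carries the id of the
# original order it replaces; original orders have parent_order_id=None.
orders = [
    {"order_id": 1, "parent_order_id": None, "cost": 40.0, "is_estimate": False},
    {"order_id": 2, "parent_order_id": None, "cost": 55.0, "is_estimate": True},
    {"order_id": 3, "parent_order_id": 1, "cost": 40.0, "is_estimate": False},
]

def consolidate(orders):
    """Drop estimated costs and fold replacement orders into the
    original order, so each customer interaction appears once."""
    real = [o for o in orders if not o["is_estimate"]]  # real costs only
    grouped = defaultdict(lambda: {"total_cost": 0.0, "order_ids": []})
    for o in real:
        # A replacement rolls up under the order it replaces.
        root = o["parent_order_id"] if o["parent_order_id"] is not None else o["order_id"]
        grouped[root]["total_cost"] += o["cost"]
        grouped[root]["order_ids"].append(o["order_id"])
    return dict(grouped)

result = consolidate(orders)
# Order 3 (a replacement) folds into order 1; order 2 is dropped (estimate).
```

The end user sees one row per customer interaction and only real costs, without needing to know the accounting rules that produced it.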
2. Which industries do you think will benefit the most from better stream processing?
All businesses can benefit from better stream processing, but I think the most profound gains will take place in businesses that are transient, or a large part of IoT.
I look at Tesla’s autonomous driving. They have a large stream of data coming in from all of their cars, which they use to improve their models. If they want to test something out and focus on it, they need to collect data for that, so there’s an iteration cycle that can be improved by having streaming data come in. But the world today and the world a month ago are more or less the same.
On the other hand, there are financial trading companies whose models and understanding of the world need to be updated quickly, or they risk being wrong. When a new jobs report comes in, that needs to feed into their models. When another company starts changing its investment behavior, they need to know right away so they can better position themselves.
Another example would be Google’s ad decisions. They’re making billions of independent decisions. They have a model that decides, given this ad, where it will do well and who’s going to click on it. All of these decisions rely on machine learning models, which will be wrong some percentage of the time. Google wants to be able to fix that right away so it can pivot and monetize more efficiently.
I think it’s essential in those cases, for businesses in the financial realm and for Google, whereas for Tesla I think it’s more “a win”, if that makes sense.
“All businesses can benefit from better stream processing, but I think the most profound gains will take place in businesses that are transient, or a large part of IoT.”
3. If you could invent one tool to help in the world of streaming data processing, which tool would you invent?
Real-time anomaly detection.
If we can’t detect problems quickly enough, they start feeding into other systems. A person could catch these things if they saw them, but there are so many things to look at. There are a million things that could go wrong, so being able to test for the unknown mistake that will inevitably come up, and to do so in real time, would be incredible. It would allow companies to turn large mistakes into small mistakes.
Wayfair, as a company, is an incredibly complex machine, with hundreds, if not thousands, of moving parts. While I think Wayfair does exceptionally well in this regard, no company is perfect and errors do occur. An algorithm with the ability to understand trends in data and alert us to any abnormalities would allow us to detect and correct problems faster, while minimizing downstream impact.
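One simple way to realize the kind of tool described here is a streaming detector that flags values far from a rolling baseline. This is a minimal sketch of the general idea, not any system Wayfair uses; the class name, window size, and z-score threshold are all illustrative choices.

```python
import math
from collections import deque

class StreamingAnomalyDetector:
    """Flag incoming values that sit far from the rolling mean,
    measured in standard deviations (a streaming z-score test)."""

    def __init__(self, window=50, threshold=3.0):
        self.window = deque(maxlen=window)  # recent "normal" values
        self.threshold = threshold          # how many std devs is abnormal

    def observe(self, value):
        is_anomaly = False
        if len(self.window) >= 10:  # need some history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            self.window.append(value)  # only learn from normal points
        return is_anomaly

detector = StreamingAnomalyDetector(window=50, threshold=3.0)
baseline = [10.0, 10.2, 9.8, 10.1, 9.9] * 6  # steady metric with small noise
flags = [detector.observe(v) for v in baseline + [100.0]]
# The 100.0 spike is flagged; the baseline points are not.
```

A detector like this turns a large mistake into a small one precisely because it fires on the first bad value rather than after the error has fed into downstream systems; in practice you would want it to adapt to trends and seasonality as well.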
Interested in being featured in the next installment of Three Questions? Email tim at fastdata.io.