What machine learning practitioners can learn from data warehousing
As a big data platform, Cloudera is being used by enterprises to make sense of large amounts of data to generate insights on various aspects of business, such as customer preferences and manufacturing efficiency.
More recently, the company, founded in 2008 by engineers from Facebook, Yahoo and Google, made a deeper push into machine learning by forming three business units that will focus on what it calls its emerging business: machine learning, analytics and cloud.
Cloudera’s heightened focus on machine learning is not surprising, given that machine learning algorithms are only as good as the data available to them. That is where data warehousing, which cleans up and aggregates data from multiple sources onto a single system, comes in.
“Data warehousing has evolved a lot in the last 20 years – it’s a team sport and no one thinks about doing it on a desktop because you need all the data in the company,” said Charles Zedlewski, senior vice-president for emerging business at Cloudera.
“On the other hand, machine learning is an individual sport that only started to take off about eight years ago. There were a limited number of practitioners, most in the financial services and marketing industries.”
The way machine learning and data warehousing teams model their data also varies, said Zedlewski, noting that the former prefer “flatter” models, whereas the latter typically work with heavily modelled data.
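To make the contrast concrete, the sketch below joins a small, heavily modelled star schema into the kind of “flatter” table a machine learning practitioner might train on. It is a minimal illustration using Python’s built-in sqlite3 module; the table names, columns and rows are hypothetical and do not come from Cloudera.

```python
import sqlite3


def build_flat_features(conn):
    """Join a small star schema into one wide row per order."""
    cur = conn.cursor()
    # Hypothetical warehouse-style model: one fact table, two dimensions.
    cur.executescript("""
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_orders (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL
        );
        INSERT INTO dim_customer VALUES (1, 'retail'), (2, 'enterprise');
        INSERT INTO dim_product VALUES (10, 'hardware'), (11, 'software');
        INSERT INTO fact_orders VALUES
            (100, 1, 10, 25.0), (101, 2, 11, 400.0), (102, 1, 11, 60.0);
    """)
    # The "flat" view: every attribute a model needs sits on a single row.
    return cur.execute("""
        SELECT f.order_id, c.segment, p.category, f.amount
        FROM fact_orders AS f
        JOIN dim_customer AS c ON c.customer_id = f.customer_id
        JOIN dim_product AS p ON p.product_id = f.product_id
        ORDER BY f.order_id
    """).fetchall()


flat_rows = build_flat_features(sqlite3.connect(":memory:"))
for row in flat_rows:
    print(row)
```

Both views draw on the same underlying records. Only the shape differs, which is why the two teams can share one governed copy of the data.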
That said, the two disciplines have at least one thing in common, and it is a core aspect of business intelligence (BI): both teams use the same data to glean business insights.
“They operate off the same datasets and may organise data differently, but both sides want to understand their customers and operations, manage costs and lower risks,” said Zedlewski, pointing out that it would benefit both teams if the same data could be stored, secured and governed in a shared environment.
Machine learning experts would also do well to learn from data warehousing and BI practices that have been established over decades, he said.
Some aspects of data management in machine learning already hail from the data warehousing world, but Zedlewski said machine learning practitioners could also borrow from the traditional software development lifecycle.
“We are not trying to turn a machine learning practitioner into a software developer,” he said. “But for most customers five years ago, there was no such thing as source control for production models that are driving thousands of decisions a day.”
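As a rough illustration of what source control for a production model could involve, the sketch below records each model version under a content hash and allows a rollback to the previous one. The `ModelRegistry` class, its methods and the sample parameters are hypothetical, invented for this example; they do not describe any Cloudera product.

```python
import hashlib
import json


def fingerprint(obj):
    """Return a stable SHA-256 hex digest of a JSON-serialisable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ModelRegistry:
    """Hypothetical append-only history of model versions."""

    def __init__(self):
        self.versions = []

    def commit(self, params, training_data_id, message):
        """Record a new version unless it is identical to the latest one."""
        version_id = fingerprint({"params": params, "data": training_data_id})
        if self.versions and self.versions[-1]["id"] == version_id:
            return version_id  # nothing changed, so no new version
        parent = self.versions[-1]["id"] if self.versions else None
        self.versions.append({
            "id": version_id,
            "parent": parent,
            "params": params,
            "data": training_data_id,
            "message": message,
        })
        return version_id

    def rollback(self):
        """Drop the newest version and return the one before it."""
        if len(self.versions) < 2:
            raise ValueError("no earlier version to roll back to")
        self.versions.pop()
        return self.versions[-1]


registry = ModelRegistry()
v1 = registry.commit({"weights": [0.2, 1.5], "bias": 0.1}, "sales_2018_q1", "initial model")
v2 = registry.commit({"weights": [0.3, 1.4], "bias": 0.0}, "sales_2018_q2", "retrained on Q2 data")
restored = registry.rollback()
print(restored["message"])
```

Hashing the parameters together with an identifier for the training data means a version changes whenever either one does, which is the property that makes a model’s decisions traceable after the fact.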
And as the number of machine learning applications increases, said Zedlewski, “it’s going to necessitate a shared platform and a team-sport approach in a shared environment that makes sense now compared to 10 years ago”.
Meanwhile, machine learning is also being applied in data warehousing. At its Next ’18 conference, Google announced BigQuery ML, a service that enables data scientists and analysts to build and deploy machine learning models on massive structured or semi-structured datasets directly inside its BigQuery data warehouse using SQL statements.
This means they can perform predictive analytics such as forecasting sales and creating customer segments at the source, where they already store their data, without needing to move data out of the data warehouse to develop and train machine learning models.
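The sketch below shows roughly what such SQL statements look like, using BigQuery ML’s `CREATE MODEL` statement and `ML.PREDICT` function. The Python helpers only assemble the statement text and do not connect to BigQuery. The dataset, table and column names are hypothetical, and the helpers do no escaping, so they are not suitable for untrusted input.

```python
def create_model_sql(model_name, model_type, label_column, training_query):
    """Build a BigQuery ML CREATE MODEL statement as a string."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"{training_query}"
    )


def predict_sql(model_name, input_query):
    """Build a query that scores rows with a trained model via ML.PREDICT."""
    return f"SELECT * FROM ML.PREDICT(MODEL `{model_name}`, ({input_query}))"


# Hypothetical dataset, table and column names, for illustration only.
train_statement = create_model_sql(
    model_name="retail.sales_forecast",
    model_type="linear_reg",
    label_column="units_sold",
    training_query=(
        "SELECT store_id, promo_flag, month, units_sold "
        "FROM `retail.monthly_sales`"
    ),
)
predict_statement = predict_sql(
    "retail.sales_forecast",
    "SELECT store_id, promo_flag, month FROM `retail.next_month`",
)

print(train_statement)
print(predict_statement)
```

The training data is named in an ordinary `SELECT`, so it never leaves the warehouse, and an analyst who already writes SQL needs no additional language to train and query a model.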
More importantly, BigQuery ML has the potential to extend the use of machine learning to data analysts, who, unlike data scientists, may not be schooled in programming languages such as R and Python, which are commonly used to build machine learning models.