In his book “Predictive Analytics” Eric Siegel calls predictive analytics “the power to predict who will click, buy, lie, or die”. You can apply this to both people and machines.
With the increase of data-generating devices, sensors, and software, the amount of data in organizations is growing exponentially. But more data doesn’t automatically translate into value until you can extract actionable information from it. Unfortunately, the capacity of most organizations to analyze this data has not grown at the same pace as the data itself. To replace gut feeling based on experience with a data-driven approach, we need to expand this capacity by introducing predictive analytics.
Predictive analytics trains a computer model to automatically learn from large amounts of data to find the complex, hidden patterns that can optimize your inventory; predict fraud, maintenance, or customer retention; recommend the products that customers actually need; or even diagnose Alzheimer’s disease. Predictive analytics has gotten a lot of attention recently through the success of Kaggle. Kaggle is a web platform where organizations like General Electric, Pfizer and Facebook host predictive modeling competitions in which data scientists can win up to $3M.
Based on our experience with customer projects and Kaggle competitions, we at Algoritmica want to share some of the misunderstandings on this topic, and provide some lessons and takeaways that will benefit your predictive analytics projects.
1. Start with the end in mind
Big data might be the new oil, but don’t take this analogy too far. Domain knowledge (i.e. understanding your own product and customers) is essential at the outset for establishing the boundary conditions set by business and IT. Certain questions need to be answered first, like: How are we going to make money with this? How will the model be deployed? What does the data look like? How fast do the predictions need to be? How often do we need to update the model? How will the new output be used? How will we introduce change in the process?
2. No treasure hunting
Treasure hunting is rarely useful for predictive analytics. Your company has to identify, with some help, which processes are worth optimizing. Once the relevant data sets and processes are identified, the business opportunity can be reduced to a data problem. Handing over a data set in the hope that someone will find a pot of gold is not the way to go.
3. Don’t get lost in translation
The data-driven approach is gaining a lot of traction and success. In most companies, however, processes are still heavily designed and optimized around people. When you introduce predictive analytics into a process, you’re changing it. Adoption, and the best results, can only be achieved if everyone is on board and there’s a plan for introducing this change. The lack of a data-driven culture is the biggest hurdle for most predictive analytics projects.
4. Use modeling experts
Domain knowledge is important in every predictive analytics project. However, once the project is reduced to a data problem, you need people who can do modeling really well. Kaggle shows that people who are proficient at predictive modeling can solve a problem for an insurance company and an electronics company equally well. In this phase, focus on analytical ability, not industry knowledge.
5. Design your data collection
Data is collected according to decisions about where to put sensors, what sensor frequency to set, what data to aggregate, or how to design an app. Most of these decisions are made without any consideration of how the data will be analyzed. Some applications need longitudinal data, or data aggregated at a certain frequency. In other cases, missing data itself represents information that you want to include in your model, and imputing those values can destroy it. To avoid delays and design bias, it’s good practice to involve an analyst in this process as early as possible.
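To make the imputation point concrete, here is a minimal sketch (the column names and sensor values are hypothetical) of preserving the missingness pattern as its own feature before filling in the gaps, so the model can still learn from it:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings where a missing value means the
# sensor was offline -- itself a potentially predictive signal.
df = pd.DataFrame({"temp": [21.5, np.nan, 19.8, np.nan, 22.1]})

# Record the missingness pattern BEFORE imputing.
df["temp_missing"] = df["temp"].isna().astype(int)

# Now impute; the information survives in the indicator column.
df["temp"] = df["temp"].fillna(df["temp"].mean())

print(df)
```

Had we imputed first, the indicator column would be all zeros and the "sensor offline" signal would be gone for good.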
6. Compete on analytics
There was a time when there wasn’t a lot of data and analytics basically consisted of pie charts for management. Occasionally, someone would query a database. Today, most companies are becoming software and data companies. Look at Google and Amazon: they are leaders in their respective verticals. Google’s algorithms are in direct competition with Microsoft’s, and Google is creating whole new business models by leveraging its data with smart algorithms. To be the Google of your industry, compete on your analytical capability. In this light, human capital is crucial, and to attract the right talent, establishing the right data-centric culture is key.
7. Presentation matters
A well-known result from recommender systems is that presentation matters just as much as predictive accuracy. Sometimes the output of a predictive model is only one input to a larger optimization chain, but that doesn’t mean the learning should stop there: you can still test how different thresholds or colors perform for different groups of users. Spend the effort to create a visually compelling story out of your data insights.
8. Beware of data leakage
Data leakage is the phenomenon where you inadvertently design your predictive modeling pipeline to include information about the future that you wouldn’t normally have. You’ll get very good results when back-testing your model, but these good results won’t be reproducible in the real world. This type of mistake can sometimes be very subtle, but it will have a major impact on the usefulness of your results.
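A classic, subtle example of leakage is fitting a preprocessing step (like normalization) on the full data set, so statistics from the "future" test rows bleed into the training features. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # simulated feature matrix

# Time-ordered split: first 80 rows are the past, last 20 the "future".
X_train, X_test = X[:80], X[80:]

# LEAKY: statistics computed on ALL rows, so the training features
# already encode information about the future test rows.
leaky_train = (X_train - X.mean(axis=0)) / X.std(axis=0)

# CORRECT: fit the preprocessing on the training rows only, then
# apply that same frozen transformation to the test rows.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
clean_train = (X_train - mu) / sigma
clean_test = (X_test - mu) / sigma
```

The same discipline applies to imputation, feature selection, and target encoding: anything fitted on data must be fitted on the training portion alone.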
9. Take model training out of the database
Most analytics within companies takes place in the land of SQL. Traditional relational databases and SQL are great for storing, managing, and performing simple analytics jobs. Predictive modeling, however, automatically trains a model by looking at examples, and the algorithms used are often both data- and computation-intensive. This makes databases too slow for training, and SQL is not expressive enough for predictive modeling. Hadoop offers a solution if the work can be divided into pieces and a lot of data is needed for training, but it limits the complexity of the algorithms you can use, and the magic still happens on disk. Rarely will you need all of the raw data to distill a model, though. With smart sampling techniques you can end up with a small data set with hardly any loss of predictive accuracy. Data sets on Kaggle rarely exceed the order of a gigabyte, and most modeling in the world happens in-memory.
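To illustrate why sampling is usually safe, here is a minimal sketch (the data set and class rate are simulated) showing that a 1% random sample of a million-row data set preserves the class proportion almost exactly:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated "big" data set: 1,000,000 rows, ~5% positive class.
y = (rng.random(1_000_000) < 0.05).astype(int)

# A simple random sample of 1% of the rows...
idx = rng.choice(len(y), size=10_000, replace=False)
sample = y[idx]

# ...preserves the class proportion almost exactly.
print(f"full: {y.mean():.4f}  sample: {sample.mean():.4f}")
```

For rarer classes or grouped data you would reach for stratified or group-aware sampling instead, but the principle is the same: a well-drawn sample fits in memory and loses very little signal.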
10. Privacy is no afterthought
You have the potential to upset your customers and the media if you design a predictive analytics project without considering its privacy impact. People are put off by the idea of a company magically ‘knowing’ something about their private lives, even if that knowledge is obtained using only public data. On the other hand, customers want you to make their data work for them and playing it too safe will stifle your innovation. Each company’s situation is unique and requires special attention to how knowledge of the customer will be perceived.
You can see predictive analytics as the special sauce that adds value to your data and gives you an edge over the competition. There are great opportunities for organizations at the intersection of data and algorithms. It’s an exciting time to be working with data, and we wish you luck with your predictive analytics endeavors.