As I have written elsewhere, the most striking advances in AI in the last few years have been in computer vision, natural language processing and reinforcement learning: image classification, Google Translate and AlphaGo. While not as dramatic, innovation in machine and deep learning tools and techniques is also making important inroads in regular tabular data (i.e., relational data). In this blog, I’d like to reflect on some of the challenges that are particular to building models with relational data.
The most obvious difference is the data itself. When you are building a computer vision or natural language model, you start with a static set of images or a static corpus of text. Acquiring these data sets can be very difficult, but the challenges are fundamentally different from those involved in acquiring a data set based on relational data. Relational data is the lifeblood of an organization. It lives on a diverse set of operational databases, it may be spread across different organizations and owners, it may use different keys and hierarchies, and it is often subject to security and privacy constraints. And yet your ability to get the most complete view of a domain is critical to the success of your AI effort. Clearly this is a job for an enterprise data warehouse, whether it’s a product like Teradata Vantage, a data lake or some logical data warehouse. A data warehouse is your source of trusted, integrated data and arguably your most strategic AI tool.
The second challenge results from the nature of relational data: it’s dynamic. Whether you are building a model to forecast revenue, recommend products or detect fraud, your data is constantly changing. This means you need to constantly retrain your model as new data comes in. This is not to say that computer vision models don’t get retrained as you acquire more data, but the data tends to be additive and the basic patterns in the data are much more stable. By contrast, relational models will degrade if not continually refreshed with new data representing new products, new fraud scams, new preferences. Here again, a data warehouse can be a strategic tool. Rather than training models off multiple CSV extracts, a process which is cumbersome and fundamentally ungovernable, models should be trained against a query in the warehouse. When it comes time to retrain a model, you simply adjust the query to represent the new window of data you are interested in. Of course, behind the scenes the warehouse may be caching data using a variety of techniques, but this is transparent to the model developer.
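To make the "adjust the query" idea concrete, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for the warehouse; the `transactions` table, its columns and the `fetch_training_rows` helper are all hypothetical names invented for illustration, not part of any Teradata API.

```python
import sqlite3


def fetch_training_rows(conn, start_date, end_date):
    """Return (amount, is_fraud) rows with start_date <= txn_date < end_date."""
    query = (
        "SELECT amount, is_fraud FROM transactions "
        "WHERE txn_date >= ? AND txn_date < ? "
        "ORDER BY txn_date"
    )
    return conn.execute(query, (start_date, end_date)).fetchall()


# In-memory stand-in for the warehouse (assumption: ISO-formatted date strings,
# so plain string comparison orders them correctly).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (txn_date TEXT, amount REAL, is_fraud INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("2019-01-15", 20.0, 0),
        ("2019-02-03", 950.0, 1),
        ("2019-02-20", 35.5, 0),
        ("2019-03-07", 12.0, 0),
    ],
)

# The original training window, then the retraining window one month later.
# Only the query parameters change; the training code stays the same.
first_window = fetch_training_rows(conn, "2019-01-01", "2019-03-01")
next_window = fetch_training_rows(conn, "2019-02-01", "2019-04-01")
```

Because the window is a query parameter rather than a file on someone's laptop, the exact data behind each model version can be reproduced later by re-running the same query.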
The third challenge is intimately related to the second. In addition to being able to easily retrain models on new data, you need to be able to get those models into production, which means automation. At Teradata Consulting, we refer to this area of concern as ‘Analytic Ops’. Analytic Ops brings analytics and operations together in a way that is analogous to how DevOps brought development and operations together for software engineering. In practice, it means automated workflows that will retrain models on a scheduled basis, evaluate their performance relative to currently deployed models, deploy them in a fault-tolerant and scalable fashion, monitor them and, above all, ensure reproducibility, compliance and governance.
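One step in such a workflow, evaluating a retrained model against the currently deployed one, can be sketched in a few lines. This is only an illustration under my own assumptions: the `accuracy` and `select_model` helpers, the `min_gain` threshold and the toy threshold "models" are hypothetical, and a real pipeline would use a metric suited to the problem (accuracy is a poor fit for rare-event fraud data).

```python
def accuracy(model, examples):
    """Fraction of (features, label) examples the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def select_model(champion, challenger, holdout, min_gain=0.01):
    """Promote the challenger only if it beats the champion by at least min_gain."""
    champion_score = accuracy(champion, holdout)
    challenger_score = accuracy(challenger, holdout)
    if challenger_score >= champion_score + min_gain:
        return "challenger", challenger_score
    return "champion", champion_score


# Toy holdout set and toy models: each "model" is a simple amount threshold.
holdout = [
    ({"amount": 10.0}, 0),
    ({"amount": 900.0}, 1),
    ({"amount": 400.0}, 1),
    ({"amount": 50.0}, 0),
]


def champion(features):
    return int(features["amount"] > 500)  # currently deployed rule


def challenger(features):
    return int(features["amount"] > 300)  # freshly retrained rule


winner, score = select_model(champion, challenger, holdout)
print(winner, score)  # prints: challenger 1.0
```

Requiring a minimum gain before promotion is one way to avoid churning deployments over noise; the right threshold is a judgment call for each use case.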
The fourth challenge is perhaps the subtlest of all: determining what your training target is and what data to use. If you’ve spent any time on Kaggle, you know that in each contest you are given a training target and a data set. I think these contests are great and you can learn a lot from them, but they somewhat distort the reality of building models using relational data in the real world. Let’s start with the target. It is often not obvious what it should be. Take the case of credit card fraud detection. Do you want to detect the first incidence of fraud on a card or are you looking for fraud patterns that span multiple transactions? Are you looking for global fraud patterns or more emergent ones? What about a recommender system for coupons? Should your coupon recommendation align with the product a customer is most likely to consume? Or are you trying to shift them to a higher-margin in-house brand? Are you trying to simply have them use the coupon or are you trying to increase the basket of goods they purchase during their next visit? All of these considerations will impact what data you use and how you present it to the model.
What about the data? A simple answer is to just use it all! Unfortunately, this is rarely an option. Go back to the coupon recommender system and suppose you are building a Deep Q Learning model that will try to optimize profit for each customer. For any given customer and any potential product coupon, the model will need to estimate the total future discounted profit. What information does the model need? Information about the customer, either specific features or perhaps a learned dense embedding vector. What about affinity to the product in question based on their transaction history? Sure. What about affinity to other products that might often occur in a basket of goods? Not so fast: there are tens of thousands of products and there is no easy way to present this to the model. Maybe you can capture this in a dense embedding vector, or maybe there are some engineered features that you could use. Suffice it to say it’s a complicated problem, one that requires real domain expertise as well as experience in machine and deep learning.
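For readers less familiar with the term, "total future discounted profit" is easy to write down even though it is hard to estimate. The sketch below shows the quantity itself and the standard one-step Q-learning target; the function names, the discount factors and the profit figures are made up for illustration and are not taken from any real coupon system.

```python
def discounted_profit(profits, gamma=0.9):
    """Sum of profits where the profit t steps ahead is weighted by gamma**t."""
    total = 0.0
    # Work backwards so each step is: profit now + gamma * (value of the future).
    for profit in reversed(profits):
        total = profit + gamma * total
    return total


def q_target(reward, next_q_values, gamma=0.9):
    """One-step Q-learning target: reward plus the discounted best next value."""
    return reward + gamma * max(next_q_values)


# Profit from a customer's next three visits after receiving a coupon.
value = discounted_profit([10.0, 0.0, 20.0], gamma=0.5)
print(value)  # prints: 15.0  (10 + 0.5 * 0 + 0.25 * 20)

# Immediate profit of 2.0, with estimated values 1.0 and 4.0 for the next coupons.
target = q_target(2.0, [1.0, 4.0], gamma=0.5)
print(target)  # prints: 4.0  (2 + 0.5 * 4)
```

The arithmetic is the easy part. The hard part, as described above, is deciding which customer and product information the model should see in order to estimate these values.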
Unless you are doing cutting-edge research, you don’t run into this kind of challenge when building computer vision and NLP models. The data is typically uniform, either an image or text, and the target is typically well defined.
A fifth challenge also relates to data: feature engineering, data augmentation and data representation. Feature engineering refers to creating new features out of the data that’s presented to you. For example, a credit card transaction may include the zip code of the merchant where the card was used. There is important location information contained in the zip code, but it’s not likely that your model will learn it (especially when the training target relates to fraud detection), so you need to convert the zip code to location information before you present it to the model. Data augmentation means adding features which are not derivable from the input data itself; for example, you might include additional information about a merchant not present in the credit card transaction. Finally, data representation refers to the work that needs to be done on data before you can present it to an ML or DL model: categorical data needs to be converted to numerical (either one-hot encoded or using a dense embedding vector), and numerical data needs to be scaled. Now, what is challenging about all of this is that whatever feature engineering, augmentation and representation work is done at training time must also be done at inference time. This means you need a clear and consistent way of representing engineered and augmented features; a lightweight DSL is a good solution here. At a minimum, you want a shared repository of code that can be used at training and inference time. Consistent data representation between training and inference is increasingly being solved by pushing representation into the model graph (e.g., by using TensorFlow feature columns).
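Here is a minimal sketch of the "shared code" idea, using plain Python rather than any particular framework. The `FeaturePipeline` class and the `merchant_type` and `amount` fields are hypothetical. The point is only that the vocabulary and scaling statistics are learned once at training time and then reused unchanged at inference time.

```python
class FeaturePipeline:
    """One-hot encodes merchant_type and min-max scales amount."""

    def fit(self, rows):
        # Learn the category vocabulary and the scaling range from training data only.
        self.categories = sorted({row["merchant_type"] for row in rows})
        amounts = [row["amount"] for row in rows]
        self.min_amount = min(amounts)
        self.max_amount = max(amounts)
        return self

    def transform(self, row):
        # Categories never seen in training encode as all zeros.
        one_hot = [
            1.0 if row["merchant_type"] == category else 0.0
            for category in self.categories
        ]
        span = self.max_amount - self.min_amount
        scaled = (row["amount"] - self.min_amount) / span if span else 0.0
        return one_hot + [scaled]


training_rows = [
    {"merchant_type": "grocery", "amount": 20.0},
    {"merchant_type": "fuel", "amount": 60.0},
    {"merchant_type": "travel", "amount": 100.0},
]
pipeline = FeaturePipeline().fit(training_rows)

# Training time and inference time call the very same transform.
train_vector = pipeline.transform(training_rows[0])
serve_vector = pipeline.transform({"merchant_type": "grocery", "amount": 60.0})
print(serve_vector)  # prints: [0.0, 1.0, 0.0, 0.5]
```

In practice the fitted pipeline would be saved and versioned alongside the model, so that the deployed model always sees inputs encoded exactly as they were during training.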
Is that it? Well, I think it’s enough for now! There are lots of other challenges related to training, tuning, hyperparameter search, aggregated features and more, but I’ll leave those for another post!