New York Times chat about architecture search and meta-learning

These are the notes, lightly edited for grammar, clarity, and topicality, that I wrote to Cade Metz after our phone conversation about architecture search and meta-learning. We mostly talked about architecture search, but some questions were framed as meta-learning. I occasionally conflate the two, so keep that in mind while reading. While my take on these issues has evolved, these notes still capture something that I deeply believe in: architecture search is about better machine learning tools for experts and non-experts.

I’m writing this post alongside another post introducing DeepArchitect, our architecture search framework that we have been working on for almost two years now. I believe that for architecture search tools to have maximal impact, they should pay close attention to programmability, modularity, and expressivity. We designed DeepArchitect with these guiding principles in mind and I believe that it should be able to easily support many architecture search use cases. I’m proud of the ideas behind DeepArchitect, so I strongly encourage you to check it out. I’m looking forward to continuing its development, hopefully now with the help of the community. Reach out to negrinho@cs.cmu.edu or via Twitter if you have questions or want to get involved. Dive in for the goodies and read the New York Times article.

On architecture search

Q: Google likes to talk about using meta-learning to build better neural networks. They say that this has significantly improved their image recognition models. And in the long term, they see this meta-learning as an answer to the current shortage of AI researchers (AI can help build the AI, so to speak). I wanted to talk with you about whether this makes sense.

Handcrafted architectures and beyond

A: Currently, an expert using deep learning for a task needs to make many design decisions that jointly influence performance (e.g., architecture structure, parameter initialization scheme, optimization algorithm, learning rate schedule, and stopping criterion). The ML expert has to try many different configurations to get good performance, as the impact of different choices varies from task to task and there is often little insight into how to make these choices a priori. Manually making all these choices is overwhelming, even for experts.

Recent work on meta-learning for architecture search attempts to address these difficulties by setting up a search space of deep learning architectures and learning a meta-model that searches for a good architecture, i.e., an architecture from the search space that performs well on the task of interest. Compared to the typical workflow of an ML expert, the search space is like the set of architectures that the ML expert can construct by making all those design decisions, and the meta-model is like the ML expert who uses intuition and past experience to sequentially try architectures.
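To make the loop described above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the search space, the made-up `evaluate` function (a stand-in for actually training a model), and the `MetaModel` class, which uses a simple epsilon-greedy rule over per-choice average scores rather than the learned meta-models from the literature. It is not DeepArchitect's API.

```python
import random

# Hypothetical search space: each design decision maps to its allowed values.
SEARCH_SPACE = {
    "num_layers": [2, 4, 8],
    "hidden_units": [64, 128, 256],
    "learning_rate": [0.1, 0.01, 0.001],
}


def evaluate(arch):
    """Made-up stand-in for training `arch` and returning a validation score."""
    score = {2: 0.1, 4: 0.3, 8: 0.2}[arch["num_layers"]]
    score += {64: 0.1, 128: 0.2, 256: 0.3}[arch["hidden_units"]]
    score += {0.1: 0.0, 0.01: 0.3, 0.001: 0.1}[arch["learning_rate"]]
    return score


class MetaModel:
    """Toy meta-model: tracks the average score observed for each choice."""

    def __init__(self, space, explore_prob=0.2, seed=0):
        self.space = space
        self.explore_prob = explore_prob
        self.rng = random.Random(seed)
        # (decision, value) -> [sum of scores, number of observations]
        self.stats = {(k, v): [0.0, 0] for k, vs in space.items() for v in vs}

    def _mean(self, key, value):
        total, count = self.stats[(key, value)]
        # Untried values get an optimistic estimate so each is tried at least once.
        return total / count if count else float("inf")

    def propose(self):
        """Pick a value for every design decision (epsilon-greedy)."""
        arch = {}
        for key, values in self.space.items():
            if self.rng.random() < self.explore_prob:
                arch[key] = self.rng.choice(values)  # explore
            else:
                arch[key] = max(values, key=lambda v: self._mean(key, v))  # exploit
        return arch

    def update(self, arch, score):
        """Record the observed score for every choice made in `arch`."""
        for key, value in arch.items():
            self.stats[(key, value)][0] += score
            self.stats[(key, value)][1] += 1


def search(num_trials=30, seed=0):
    """Sequentially propose, evaluate, and learn from architectures."""
    meta_model = MetaModel(SEARCH_SPACE, seed=seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_trials):
        arch = meta_model.propose()
        score = evaluate(arch)
        meta_model.update(arch, score)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```

Calling `search()` returns the best architecture found and its score. The point of the sketch is the division of labor: the search space fixes what can be built, and the meta-model decides what to try next based on what it has seen so far.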

A cooking analogy

Cooking is a great analogy for architecture search. A chef has experience with existing dishes, ingredient combinations, and cooking techniques. The chef can rely on this experience to design a new recipe that can then be cooked and tasted. If the recipe was great, then elements of the recipe had merit to them; if the recipe was poor, maybe the chef can deduce what caused it to be poor and think of ways to improve it. Either way, the chef has gathered additional information about what works and what does not, and is therefore in a better position to design future trials. In manual architecture search, the space of recipes is like the space of architectures; the chef is like the ML expert; tasting a recipe is like evaluating the performance of an architecture on the task of interest; the process of sequentially designing a good recipe is like the process of sequentially designing a good deep learning architecture.

Automatic cooking is also a great analogy for automatic architecture search. We can replace the human chef by an automatic chef (e.g., a cooking robot) that designs recipes and cooks them. A person then tastes each recipe and sends the results back to the automatic chef. This information can then be used by the automatic chef to design better recipes in the future. The space of recipes that the automatic chef can design is like the space of architectures, and the automatic chef is like the meta-model.

On the potential of architecture search

Going back to automatic architecture search, I think that a meta-model can conceivably learn to explore the space of architectures better than an ML expert. The search process can try different architectures, observe what works and what does not, and encode this information in the meta-model. Because the environment is algorithmic, we can collect as much data as we would like, given sufficient computation. The potential of these methods, combined with the fact that they scale with computation (i.e., the more CPUs/GPUs you have, the more architectures the meta-model can try), explains why Google is so interested in this line of work. ML experts don’t scale as nicely with computation.
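A rough sketch of what "scaling with computation" means in practice: candidate evaluations are independent of each other, so they can be farmed out to as many workers as are available, and a larger evaluation budget can only improve the best architecture found. The sketch below uses plain random sampling instead of a meta-model to keep it short, and both the search space and the scoring function are made up. Threads are used only for illustration; real training runs would go to separate processes, GPUs, or machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_architecture(rng):
    """Sample one hypothetical architecture description."""
    return {
        "num_layers": rng.choice([2, 4, 8, 16]),
        "hidden_units": rng.choice([32, 64, 128, 256]),
    }


def evaluate(arch):
    """Made-up stand-in for training `arch` and measuring validation accuracy."""
    depth_term = 1.0 - abs(arch["num_layers"] - 8) / 16
    width_term = arch["hidden_units"] / 256
    return 0.5 * depth_term + 0.5 * width_term


def best_of(num_candidates, num_workers=4, seed=0):
    """Evaluate sampled architectures concurrently and return the best score."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(num_candidates)]
    # Each evaluation is independent, so the work spreads across workers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        scores = list(pool.map(evaluate, candidates))
    return max(scores)
```

With a fixed seed, the candidates for a small budget are a prefix of those for a larger one, so `best_of(64)` is never worse than `best_of(4)`. More workers shorten the wall-clock time for a given budget without changing the result.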

We are starting to see this kind of work in the literature. I believe that better tools for automatically designing architectures will become available in the next one or two years. The ML expert will not have to bother as much with all the small design decisions and will instead work at a higher level of abstraction. Once mature, these techniques have the potential to outperform architectures manually designed by experts. This does not imply super-intelligence, but rather better models and, as a result, better visual recognition, speech recognition, and machine translation systems.

I don’t think that meta-learning will replace ML researchers, but it can give them better tools, have them work at higher levels of abstraction, or shift their focus to problems that are hard to address with meta-learning. For example, before current deep learning models that are trained end-to-end, ML researchers would handcraft the feature extraction pipeline and have only a few learnable components. Before that, systems were often rule-based, which made them even more heavily reliant on experts. Nowadays, most of the deep learning model is learned (i.e., the sequence of transformations is fixed, but the transformations have learnable components that are automatically adjusted based on training data). In the future, even the sequence of transformations might be proposed by an ML model. With each of these transitions, the ML expert had to bother with progressively fewer extrinsic decisions (we mostly care about model performance; the model structure is just a means to an end), or at least the extrinsic decisions became less restrictive, allowing better models to be learned. This is the type of transition that we are now experiencing with meta-learning for architecture search.

On meta-learning

Q: Also wanted to talk about how meta-learning will develop in other ways. Can this help machines learn a new task based on their past experiences? Can it help machines learn tasks from much smaller amounts of data?

It’s all about supervision and the learning formulation

A: I don’t think that “meta-learning” or “learning to learn” are crisply defined concepts. Some of what is currently called meta-learning is simply the application of ML to ML (e.g., automatic architecture search). Other times, meta-learning seems to refer (at least in part) to multitask learning, where we hope to achieve better performance by learning multiple related tasks simultaneously. For example, automatic architecture search is often thought of as “meta-learning” because the meta-model picks architectures, which are then trained on a task of interest. As a result, there are two levels of learning: the meta-model learns to propose architectures, and each proposed architecture is itself trained. If we go back to our analogy and replace architecture search with recipe design, the setting would be fundamentally the same, but perhaps we would no longer think of it as meta-learning because tasting does not make us think about learning as much.

One of the most fundamental aspects of machine learning, be it meta-learning or something else, is how we set up our learning problem, namely what supervision we have and how we use it. If the learning formulation is such that multiple related tasks can be learned simultaneously while exploiting their commonalities, then I think that we can learn with less data or learn better representations. The properties that you mentioned are a result of the learning formulation rather than of whether it is meta-learning or not. This learning formulation is always designed by the ML expert; there is no way around it.
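As an illustration only, one common way to write down a formulation that exploits commonalities across tasks is multitask learning with a shared representation. The notation is assumed here, not taken from the conversation: there are T tasks, task t has n_t labeled examples (x_i^{(t)}, y_i^{(t)}) and loss \ell_t, f_\theta is a feature extractor shared by all tasks, and h_{\phi_t} is a small task-specific predictor.

```latex
\min_{\theta,\;\phi_1,\dots,\phi_T}\;
\sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t}
\ell_t\!\left( h_{\phi_t}\!\left( f_{\theta}\!\left(x_i^{(t)}\right) \right),\; y_i^{(t)} \right)
```

Because \theta receives supervision from every task, each task can get by with fewer examples of its own than it would need if trained in isolation. That benefit comes from how the problem is set up, not from labeling it meta-learning.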