This post was originally published by Sergi Castella i Sapé at Towards Data Science
Authors’ TL;DR → Learning recurrent mechanisms which operate independently, and sparingly interact can lead to better generalization to out of distribution samples.
❓Why → If artificial intelligence is ever to resemble human intelligence in some way, it needs to generalize beyond the training data distribution. This paper, originally released a bit more than a year ago, provides insight and empirical foundations, and makes progress towards this kind of generalization.
💡Key insights → Recurrent Independent Mechanisms (RIMs) are NNs that implement an attentional bottleneck. This method draws inspiration from how the human brain processes the world; that is, largely by identifying independent mechanisms that only interact sparsely and causally. For instance, a set of balls bouncing around can be largely modelled independently until they collide with each other, which is an event that occurs sparsely.
RIMs are a form of recurrent networks where most states evolve on their own most of the time and only interact with each other sparsely via an attention mechanism, which can be either top down (directly between hidden states) or bottom up (between input features and hidden states). This network shows stronger generalization than regular RNNs when the input data distribution shifts.
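To make the "only a few mechanisms update at each step" idea concrete, here is a minimal sketch of the bottom-up input attention. It is a toy under stated assumptions, not the authors' implementation: the learned query/key projections are left out, `add_input` is a hypothetical placeholder for a real recurrent cell, and the communication step between mechanisms is omitted entirely.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def input_attention(hidden, x):
    """Softmax weight a mechanism puts on the real input versus an all-zero null input."""
    score_input = dot(hidden, x)
    score_null = 0.0  # dot product with a zero vector
    top = max(score_input, score_null)
    e_in = math.exp(score_input - top)
    e_null = math.exp(score_null - top)
    return e_in / (e_in + e_null)

def rim_step(hiddens, x, k, update):
    """Update only the k mechanisms that attend most to x; the rest keep their state."""
    weights = [input_attention(h, x) for h in hiddens]
    ranked = sorted(range(len(hiddens)), key=lambda i: weights[i], reverse=True)
    active = set(ranked[:k])
    new_hiddens = [update(h, x) if i in active else list(h)
                   for i, h in enumerate(hiddens)]
    return new_hiddens, sorted(active)

# Toy usage: three 2-d mechanisms, only one of which is aligned with the input.
hiddens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [2.0, 0.0]
add_input = lambda h, inp: [hi + xi for hi, xi in zip(h, inp)]  # stand-in for a GRU/LSTM cell
new_hiddens, active = rim_step(hiddens, x, k=1, update=add_input)
```

Only the first mechanism is selected here, so the other two states pass through unchanged, which is the sparse-update behaviour described above.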
One of the big takeaways of the whole Transformers thing is that the importance of inductive biases in NNs was arguably overstated. However, this is only true when benchmarking models in-domain. This paper shows how, in order to prove the usefulness of strong priors such as the attention bottleneck, one needs to step outside of the training domain, and most current ML/RL systems are not benchmarked in this fashion.
While results might not be the most impressive, this paper — along with follow-up works (see below) — proposes an ambitious agenda of what’s the path forward for turning our ML systems into something that resembles our brain, might I even say merging the best from the good old symbolic AI with last decade’s DL revolution. We should celebrate such attempts!
You might also like: Fast And Slow Learning Of Recurrent Independent Mechanisms, Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments, In Search of Lost Domain Generalization.
By Yang Song et al.
Authors’ TL;DR → A general framework for training and sampling from score-based models that unifies and generalizes previous methods, allows likelihood computation, and enables controllable generation.
❓Why → GANs are still weird creatures… Alternatives are welcome, and this one is very promising: turning data into noise is easy, turning noise into images is… Generative modeling! And this is what this paper does.
💡Key insights → Okay I can’t say I fully understood all the details, cause there’s a lot of math that’s just over my head. But the gist of it is pretty simple: you can transform an image into “noise” as a “diffusion process”. Think of how individual water molecules move inside of flowing water: there’s some deterministic flow of the water that follows a gradient with some added random jiggling around. You can do the same with pixel images, diffusing them such that they end up as something like noise from a tractable probability distribution. This process can be modeled as a Stochastic Differential Equation, known in physics, basically a differential equation with some added jiggling at each point in time.
Now, what if I told you that this stochastic diffusion process is… reversible! You can basically sample from this noise and work your way back up to an image. And just like this the authors get a SOTA inception score of 9.89 and FID of 2.20 on CIFAR-10. Okay, there’s just so much more going on under the hood… you really need to check out this paper!
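The "deterministic flow plus random jiggling" picture can be simulated in a few lines. The sketch below only covers the forward (data to noise) direction for a single pixel value. The specific drift, the constant `beta`, and the Euler-Maruyama discretization are my assumptions for illustration, not the exact SDE or sampler used in the paper. Going back from noise to data additionally needs a learned score model, which is not shown.

```python
import math
import random

def forward_diffusion(x0, beta=10.0, t_end=1.0, steps=100, rng=random):
    """Euler-Maruyama integration of dx = -0.5*beta*x dt + sqrt(beta) dW."""
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        drift = -0.5 * beta * x                               # deterministic pull towards zero
        jiggle = math.sqrt(beta * dt) * rng.gauss(0.0, 1.0)   # random jiggling
        x = x + drift * dt + jiggle
    return x

# Diffuse the same "pixel" value many times and look at where it ends up.
rng = random.Random(0)
samples = [forward_diffusion(5.0, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Whatever the starting value, the samples end up looking roughly like draws from a standard normal (mean near 0, variance near 1), which is the "tractable probability distribution" the post mentions.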
By Nicola De Cao, Gautier Izacard, Sebastian Riedel, Fabio Petroni.
Authors’ TL;DR → We address entity retrieval by generating their unique name identifiers, left to right, in an autoregressive fashion, and conditioned on the context showing SOTA results in more than 20 datasets with a tiny fraction of the memory of recent systems.
❓Why → A new straightforward approach to entity retrieval that quite surprisingly shatters some existing benchmarks.
💡Key insights → Entity retrieval is the task of finding the exact entity that a piece of natural language refers to (which can be ambiguous at times). Existing approaches treated this as a search problem, where one retrieves an entity from a KG given a piece of text. Until now. This work proposes finding an entity identifier by autoregressively generating it, kind of like how Markdown syntax hyperlinks stuff:
[entity](identifier generated by the model). No search + reranking, nothing, plain and simple. Effectively, this means cross-encoding entities and their context with the advantage that the memory footprint scales linearly with the vocabulary size (no need to do a lot of dot products in the knowledge base space) and no need to sample negative data.
Starting with a pre-trained BART⁵, they fine-tune it by maximizing the likelihood of autoregressively generating a corpus annotated with entities (Wikipedia). At inference, they use constrained beam search to prevent the model from generating entities that are not valid (i.e. not in the KB). The results are just impressive, see an example in the table below.
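Constrained decoding is easier to grasp with a toy. A common way to restrict generation to valid names is a prefix tree (trie) over the tokenized entity names, so that at each step the decoder may only pick tokens that continue some valid name. The sketch below assumes whitespace tokenization and a hypothetical `toy_log_prob` scoring function standing in for the fine-tuned model; it is an illustration of the idea, not the authors' code.

```python
END = "<eos>"

def build_trie(names):
    """Nested-dict prefix tree over whitespace-tokenized entity names."""
    root = {}
    for name in names:
        node = root
        for tok in name.split() + [END]:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie, prefix):
    """Tokens that keep the prefix inside the set of valid entity names."""
    node = trie
    for tok in prefix:
        node = node[tok]
    return list(node)

def constrained_beam_search(trie, log_prob, beam_size=2):
    beams, finished = [([], 0.0)], []
    while beams:
        expanded = []
        for prefix, total in beams:
            for tok in allowed_next(trie, prefix):
                cand = (prefix + [tok], total + log_prob(prefix, tok))
                (finished if tok == END else expanded).append(cand)
        expanded.sort(key=lambda c: c[1], reverse=True)
        beams = expanded[:beam_size]
    best_tokens, _ = max(finished, key=lambda c: c[1])
    return " ".join(best_tokens[:-1])  # drop the end-of-sequence token

# Hypothetical scores; a real system would query the seq2seq model here.
PREFS = {"New": -0.1, "York": -0.1, "Times": -0.2, "City": -1.0, "Paris": -2.0, END: -0.1}
def toy_log_prob(prefix, tok):
    return PREFS.get(tok, -5.0)

trie = build_trie(["New York City", "New York Times", "Paris"])
result = constrained_beam_search(trie, toy_log_prob)
```

Because candidates are only ever drawn from the trie, the output is guaranteed to be one of the three valid names, no matter what the scoring function prefers.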
By Lee Xiong, Chenyan Xiong et al.
Authors’ TL;DR → Improve dense text retrieval using ANCE, which selects global negatives with bigger gradient norms using an asynchronously updated ANN index.
❓Why → Information Retrieval resisted the “neural revolution” for many more years than Computer Vision. But since BERT, the advances in dense retrieval have been giant, and this is a fantastic example of that.
💡Key insights → When training a model to do dense retrieval, the common practice is to learn an embedding space where query-document distance is semantically relevant. Contrastive learning is a standard technique to do so: minimize the distance of positive query-document pairs and maximize that of negative samples. However, negative samples are often chosen at random, which means they’re not very informative: most of the time negative documents are obviously not related to the query.
The authors of this paper propose to sample negatives from the Nearest Neighbours during training, which yields documents that are close to the query (i.e. documents that the current model thinks are relevant). In practice this means that an index of the corpus needs to be asynchronously updated during training (updating the index every iteration would be very slow). Fortunately, results confirm how BM25 baselines are finally being left behind!
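As a rough illustration of what "sampling negatives from the nearest neighbours" means, the sketch below ranks documents by similarity to the query under the current embeddings and keeps the top-ranked ones that are not labeled positives. A brute-force dot-product search stands in for the ANN index, and the vectors are made up. In the real setup the index would be rebuilt periodically from a recent model checkpoint rather than at every step.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hard_negatives(query_vec, doc_vecs, positive_ids, n):
    """Top-n documents closest to the query that are NOT labeled as relevant."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: dot(query_vec, doc_vecs[i]),
                    reverse=True)
    return [i for i in ranked if i not in positive_ids][:n]

# Toy corpus: document 0 is the labeled positive for this query.
query = [1.0, 0.0]
docs = [
    [0.9, 0.1],   # 0: relevant
    [0.8, 0.2],   # 1: looks relevant to the current model, but is not
    [0.0, 1.0],   # 2: obviously unrelated
    [0.7, 0.0],   # 3: another near miss
    [-1.0, 0.0],  # 4: obviously unrelated
]
negatives = hard_negatives(query, docs, positive_ids={0}, n=2)
```

Documents 1 and 3 are picked, the "near misses", whereas random sampling would often return the obviously unrelated documents 2 or 4.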
By Denis Yarats, Ilya Kostrikov, and Rob Fergus.
Authors’ TL;DR → The first successful demonstration that image augmentation can be applied to image-based Deep RL to achieve SOTA performance.
❓Why → What are you rooting for: model-based or model-free RL? Read this paper before answering the question!
💡Key insights → Existing model-free RL methods are successful at learning from state inputs but struggle to learn from images directly. Intuitively, this is because when learning from an early replay buffer, most images are highly correlated and present very sparse reward signals. This work shows how model-free approaches can hugely benefit from augmentations in pixel space to become more sample-efficient in learning, achieving competitive results when compared to existing model-based approaches on DeepMind control suite⁶ and 100k Atari⁷ benchmarks.
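For a feel of what a pixel-space augmentation looks like, here is a minimal sketch of a random shift: conceptually, pad the frame by a few pixels by repeating its border, then crop back to the original size at a random offset. The choice of this particular augmentation and the default `pad=4` are assumptions on my side for illustration. The post above does not spell out which augmentations the paper uses.

```python
import random

def random_shift(image, pad=4, rng=random):
    """Shift a 2-d image (list of rows) by up to `pad` pixels, repeating border pixels."""
    h, w = len(image), len(image[0])
    dy = rng.randint(0, 2 * pad)  # crop offset inside the padded image
    dx = rng.randint(0, 2 * pad)

    def clamp(v, hi):
        return min(max(v, 0), hi)

    # Reading with clamped indices is equivalent to edge-padding and then cropping.
    return [[image[clamp(y + dy - pad, h - 1)][clamp(x + dx - pad, w - 1)]
             for x in range(w)]
            for y in range(h)]

# Toy 5x5 "frame" with distinct pixel values.
frame = [[10 * r + c for c in range(5)] for r in range(5)]
shifted = random_shift(frame, pad=2, rng=random.Random(0))
```

Each call produces a slightly different view of the same observation, which is what breaks up the strong correlation between frames in the replay buffer.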
By Sashank J. Reddi, Zachary Charles et al.
Authors’ TL;DR → We propose adaptive federated optimization techniques, and highlight their improved performance over popular methods such as FedAvg.
❓Why → To make federated learning widespread, federated optimizers must become boring, just like ADAM¹¹ is in 2021. This paper precisely attempts that.
💡Key insights → Federated learning is an ML paradigm where a central model, hosted by the server, is trained by multiple clients in a distributed fashion. For instance, each client can use data on their own device, compute a gradient w.r.t. a loss function and communicate to a central server the updated weights. This process opens up many questions such as how one should combine weight updates from multiple clients.
This paper does a great job at explaining the current state of federated optimizers, builds a simple framework to discuss them and shows some theoretical results on convergence guarantees and empirical results to show their proposed adaptive federated optimizers work better than existing optimizers such as FedAvg⁸. The federated optimization framework presented in this paper is agnostic to the optimizer used by the client (ClientOpt) and that used by the server (ServerOpt), and enables them to plug in techniques such as momentum and adaptive learning rate into the federated optimization process. Interestingly though, the results they showcase always use vanilla SGD as a ClientOpt, and use adaptive optimizers (ADAM, YOGI) as ServerOpt.
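The ClientOpt/ServerOpt split can be sketched on a one-parameter toy problem. Each client runs a few steps of plain SGD on its own data and sends back a weight delta; the server averages the deltas and feeds that average to an Adam-style update as if it were a gradient step. All hyperparameters, the scalar "model", and the quadratic client losses below are assumptions chosen for readability, so treat this as a sketch of the idea rather than the paper's exact algorithm.

```python
import math

def client_update(w, target, lr=0.1, local_steps=5):
    """ClientOpt: vanilla SGD on the toy loss 0.5 * (w - target)**2; returns the weight delta."""
    w_local = w
    for _ in range(local_steps):
        w_local -= lr * (w_local - target)
    return w_local - w

def fed_adam(client_targets, rounds=200, server_lr=0.1,
             beta1=0.9, beta2=0.99, tau=1e-2):
    """ServerOpt: Adam-style update driven by the averaged client delta."""
    w, m, v = 0.0, 0.0, 0.0
    for _ in range(rounds):
        deltas = [client_update(w, c) for c in client_targets]
        delta = sum(deltas) / len(deltas)          # combine the client updates
        m = beta1 * m + (1 - beta1) * delta        # momentum on the pseudo-gradient
        v = beta2 * v + (1 - beta2) * delta ** 2   # running second moment
        w += server_lr * m / (math.sqrt(v) + tau)  # adaptive server step
    return w

# Three clients whose local optima disagree (1, 2 and 6); the global optimum is their mean, 3.
w_final = fed_adam([1.0, 2.0, 6.0])
```

Swapping the three server lines for `w += delta` would give plain averaging of client updates, which is the kind of baseline these adaptive variants are compared against.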
By Yuchen Liang et al.
Authors’ TL;DR → A network motif from the fruit fly brain can learn word embeddings.
❓Why → The premise of this paper was too irresistible to not include it here, and it is also a superb counterpoint to the dominant strain of massive ML.
💡Key insights → Words can be represented as sparse binary vectors quite effectively (even contextualized!). This work is very similar in spirit to established classics like Word2Vec⁹ and GloVe¹⁰ in the sense that word embeddings are learned using very simple neural networks and cleverly using co-occurrence corpus statistics to do so.
The architecture is inspired by how biological neurons from fruit flies are organized: sensory neurons (PN) map onto Kenyon cells (KC) which are connected to the anterior paired lateral neuron (APL) which is responsible for recurrently shutting down most KCs, leaving only a few sparse activations.
Translating this to language, words are represented in PN neurons as a concatenation of a bag-of-words context and a one-hot vector for the middle word (see figure below). Then this vector is considered a training sample, which is projected onto the KC neurons and sparsified (only top-k values survive). The network is trained by minimizing an energy function that enforces words that share contexts to be close to each other in KC space.
Interestingly, this allows for generating contextualized word embeddings on the fly (😉), given that the bag-of-words context can be different for a given word during inference.
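The encoding and sparsification steps described above are simple enough to sketch directly. Below, the input is the concatenation of a bag-of-words context and a one-hot target word, it is projected onto a handful of "Kenyon cells", and only the top-k activations survive as a binary code. The weight matrix is hand-picked for the example (in the paper it is learned by minimizing the energy function), and the tiny vocabulary is made up.

```python
def encode(context, target, vocab):
    """Concatenate a bag-of-words context vector with a one-hot target-word vector."""
    index = {w: i for i, w in enumerate(vocab)}
    bag = [0] * len(vocab)
    for w in context:
        bag[index[w]] = 1
    one_hot = [0] * len(vocab)
    one_hot[index[target]] = 1
    return bag + one_hot

def kc_hash(x, weights, k):
    """Project onto the KC layer and keep only the k strongest cells (binary code)."""
    activations = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    top = sorted(range(len(weights)), key=lambda i: activations[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(weights))]

vocab = ["the", "cat", "sat", "mat"]
weights = [  # 4 Kenyon cells x (4 context + 4 target) inputs, hand-picked
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 2, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0.5, 0, 0],
]
code_a = kc_hash(encode(["the", "sat"], "cat", vocab), weights, k=2)
code_b = kc_hash(encode(["mat"], "cat", vocab), weights, k=2)
```

The same word ("cat") gets a different sparse code depending on its context, which is the contextualization trick mentioned above.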