OCAML members, pre-pandemic picture

Objectively Casual Association for Machine Learning

Adam

Alicja

Frederic Grabowski, frederic.grabowski@ippt.pan.pl, IPPT PAN

Kuba Perlin

Paweł

Paweł, submitted on 2021-03-20. Will be discussed on 2021-06-07.

comic from xkcd.com

Everybody knows that margarine consumption is a cause of divorces (at least in Maine) and that a PhD in sociology can help you successfully launch a space rocket. Or… maybe not? More seriously, how do we calculate vaccine efficacy? Or how do we build robust models for healthcare that generalize across different hospitals? (See Building Reliable ML: Should you trust NNs?). Although many popular ML algorithms work on the level of statistical relations, the causal aspects are critical for model generalizability, robustness, and applications to domains in which decisions matter.
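
The margarine–divorce joke above is just a confounder at work. A minimal sketch (with entirely made-up numbers): a hidden variable drives both quantities, so they correlate strongly even though intervening on one leaves the other untouched.

```python
import numpy as np

# Hypothetical toy model: a hidden confounder ("era") drives both margarine
# consumption and the divorce rate, so the two correlate strongly even though
# neither causes the other. All numbers are made up.
rng = np.random.default_rng(0)
era = np.linspace(0.0, 1.0, 200)                      # hidden confounder
margarine = 8.0 - 4.0 * era + rng.normal(0, 0.2, 200)
divorces = 5.0 - 2.0 * era + rng.normal(0, 0.2, 200)

corr = np.corrcoef(margarine, divorces)[0, 1]         # strong, yet not causal

# Intervening do(margarine := x) changes nothing downstream here, because in
# this generative model `divorces` depends only on `era`.
```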

Reading:

- A Survey of Learning Causality with Data: Problems and Methods, an introductory paper on causality.
- Towards Causal Representation Learning, thanks Kuba!
- Counterfactuals uncover the modular structure of deep generative models, is causality in any way related to disentangled representations (see Disentangled Representations)?

Bonus:

- Nice to see

- See Berkson’s and Simpson’s paradoxes. There is a great Numberphile episode on Berkson’s paradox.
- If d-separation seems hard, try these: http://bayes.cs.ucla.edu/BOOK-2K/d-sep.html and Section 2.4 of Bayes Ball
- https://en.wikipedia.org/wiki/Lady_tasting_tea

- Another introduction to causality (quite a few options, in fact)

- Elements of Causal Inference, every expert in the topic I know said they had read that book
- https://www.bradyneal.com/causal-inference-course, a course from Mila. Great lecture notes and YouTube videos.
- A Practical Introduction to Bayesian Estimation of Causal Effects: Parametric and Nonparametric Approaches, another introductory paper to the domain.
- Causality For Machine Learning – and yet a different introduction
- Causality, Jonas Peter’s script, later this evolved into Elements of Causal Inference
- https://ff13.fastforwardlabs.com/, a nice review
- One from UCL
- https://causalinference.gitlab.io/kdd-tutorial/

- Causality and deep learning

- Deep Structural Causal Models for Tractable Counterfactual Inference
- Causal autoregressive flows
- Causal VAE
- Causal Effect Inference with Deep Latent-Variable Models
- Causality Learning: A New Perspective for Interpretable Machine Learning
- Examining the causal structures of deep neural networks using information theory
- Causal inference using invariant prediction

- Selected real-world applications

- Causality matters in medical imaging,
- Adapting Neural Networks for the Estimation of Treatment Effects
- Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks
- Estimating individual treatment effect: generalization bounds and algorithms, theorems on error bounds.
- Compositional perturbation autoencoder for single-cell response modeling, using interpretable linear models to learn treatment effects
- Bayesian Inference of Individualized Treatment Effects using Multi-task Gaussian Processes

- Frameworks

Kuba, submitted on 2020-10-31.

VQA: given an image and an English question about the image, produce a truthful text answer. VQA is perhaps the most studied task where two strikingly different modalities are used. I selected some recent papers from Deepmind on the topic.

Reading:

- VQA: Visual Question Answering, Agrawal et al. 2016 – foundational paper
- Learning Visual Question Answering by Bootstrapping Hard Attention, Malinowski et al. 2018
- Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, K. Yi et al. 2019

Paweł, submitted on 2020-11-01.

Talent wins games, but teamwork and intelligence win championships.

– Michael Jordan

Everybody knows that RL is excellent in one-vs-one games such as chess, go, or StarCraft. But nowadays teamwork is becoming more and more important – think about a fleet of autonomous taxicabs or an ensemble of delivery drones. How does RL work in the scenario of many autonomous agents? And, more importantly, does multi-agent RL work in the presence of impostors that try to harm the team? (See Avalon or Among Us).

Reading:

- OpenAI Five
- Multi-agent reinforcement learning: A selective overview of theories and algorithms
- Finding Friend and Foe in Multi-Agent Games

Paweł, submitted on 2020-12-01 (rebranded on 2021-02-14).

We build machine learning systems by taking a big neural network and feeding it millions of examples. However, for any given task, some architectures are far more sensible than others. For example, CNNs employ an inductive bias: “To recognize an object, find ears, nose, wheels. To recognize ears, nose, wheels, find appropriate edges. To recognize edges, ...”.

Very often we – humans – already have a pretty good intuition about what the solution should look like and which subproblems need to be solved. But how do we guide the NN to do that?
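
As a concrete instance of such a built-in bias, convolution is translation-equivariant by construction: shift the input and the feature map shifts with it. A toy check (circular convolution, so the shift works out exactly):

```python
import numpy as np

def circ_conv(x, w):
    """Circular 1-D cross-correlation: out[i] = sum_k x[(i + k) % n] * w[k]."""
    n = len(x)
    return np.array([sum(x[(i + k) % n] * w[k] for k in range(len(w)))
                     for i in range(n)])

rng = np.random.default_rng(1)
x = rng.normal(size=16)      # a toy 1-D "image"
w = rng.normal(size=3)       # a filter (here random, normally learned)

# The CNN inductive bias: translating the input translates the output.
lhs = circ_conv(np.roll(x, 5), w)        # shift, then convolve
rhs = np.roll(circ_conv(x, w), 5)        # convolve, then shift
```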

comic from xkcd.com

Reading:

- Augmenting Neural Networks with First-order Logic
- Group Equivariant Convolutional Networks, if a problem has a specific symmetry, we can use it to our advantage!
- Towards Expressive Priors for Bayesian Neural Networks: Poisson Process Radial Basis Function Networks, a novel Bayesian prior which allows an expert to input important problem properties into the model.

Bonus:

- Taco Cohen’s webpage
- NeurIPS workshop on equivariant networks
- IL-Net: Using Expert Knowledge to Guide the Design of Furcated Neural Networks

Paweł, submitted on 2021-02-03.

Quoting the wisest, “Universe is non-Euclidean, why should data be?”. Let’s explore how the ideas from geometry and topology can help neural networks work better.

In particular we will learn that all classification problems can always be solved (at least from the topologists’ perspective) and how autoencoders are linked to the Manifold Hypothesis. We will also discuss whether the task of neural network validation can be accomplished... without a validation data set!

Reading:

- Neural Persistence: A Complexity Measure for Deep Neural Networks Using Algebraic Topology
- A Topological Framework for Deep Learning
- Geometric deep learning: going beyond Euclidean data

Bonus:

- R. Grosse’s post: Differential geometry for machine learning
- Geometric Understanding of Deep Learning
- Differential Geometry for Machine Learning (NeurIPS 2020)

The story of Neural ODEs

Paweł, submitted on 2021-02-03.

We all know and love ResNets, which come in many flavours: ResNet50, ResNet101, ResNet1202… But what if we wanted to train an infinitely deep ResNet? We arrive at Neural ODEs!
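
The link is easy to see in one dimension: a residual block h ← h + f(h)·dt is exactly one Euler step of dh/dt = f(h), so growing the depth (while shrinking each block's update) approaches an ODE. A toy sketch with f(h) = −h, whose exact solution at T = 1 is e⁻¹:

```python
import numpy as np

def resnet_forward(h0, f, depth, T=1.0):
    """Stack of residual blocks h <- h + f(h) * dt, i.e. Euler steps of dh/dt = f(h)."""
    h, dt = h0, T / depth
    for _ in range(depth):
        h = h + dt * f(h)          # one residual block = one Euler step
    return h

f = lambda h: -h                   # a toy "layer"
exact = np.exp(-1.0)               # true ODE solution at T = 1 with h0 = 1
err_shallow = abs(resnet_forward(1.0, f, depth=4) - exact)
err_deep = abs(resnet_forward(1.0, f, depth=1024) - exact)
# err_deep << err_shallow: the deeper the ResNet, the closer to the ODE.
```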

Reading:

- Neural Ordinary Differential Equations, foundational, Best Paper Award on NeurIPS 2018
- Augmented Neural ODEs, three Oxford statisticians shed some light on what you can’t learn with vanilla Neural ODEs (and how to fix them)
- Dissecting Neural ODEs, why the statement “vanilla Neural ODEs are infinitely deep ResNets” is a (small) clickbait (and how to fix them)

Bonus:

- Stack Exchange thread on Neural ODEs
- NeurIPS 2020 tutorial on Deep Implicit Layers
- How to train your Neural ODE, which discusses how to efficiently train Neural ODEs on large data sets

Paweł, Adam, Frederic G., submitted on 2021-05-24.

Time to take a look at GNNs in real life!
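
Most molecular-property GNNs boil down to rounds of message passing plus a permutation-invariant readout. A minimal sketch (not any specific paper's model) on a hypothetical three-atom chain with one-hot atom-type features:

```python
import numpy as np

# Toy molecular graph: a C-C-O chain, one-hot atom-type features (hypothetical).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)      # adjacency matrix
X = np.array([[1.0, 0.0],                   # C
              [1.0, 0.0],                   # C
              [0.0, 1.0]])                  # O

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))                 # (untrained) weight matrix

# One round of message passing: average neighbour features, transform, ReLU.
deg = A.sum(axis=1, keepdims=True)
H = np.maximum(((A @ X) / deg) @ W, 0.0)

# Permutation-invariant readout -> graph-level embedding for property prediction.
graph_embedding = H.sum(axis=0)
```

Because aggregation and readout are symmetric in the nodes, relabelling the atoms leaves the embedding unchanged, which is the point of using a GNN here.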

Molecules:

- https://arxiv.org/pdf/2006.15426.pdf
- https://arxiv.org/abs/1806.02473
- https://www.sciencedirect.com/science/article/pii/S1740674920300305
- https://www.cell.com/cell/fulltext/S0092-8674(20)30102-1
- https://arxiv.org/pdf/2102.00546.pdf
- https://github.com/mufeili/DL4MolecularGraph
- https://github.com/BenevolentAI/guacamol

Other applications:

- https://deepmind.com/blog/article/traffic-prediction-with-advanced-graph-neural-networks
- https://sites.google.com/view/learning-to-simulate/

General resources:

- https://www.cs.mcgill.ca/~wlh/grl_book/ (not actually a real application, but an excellent resource on GNNs)
- https://github.com/DeepGraphLearning/LiteratureDL4Graph

Frameworks:

- https://github.com/divelab/DIG
- https://www.dgl.ai/
- https://github.com/chainer/chainer-chemistry
- https://pytorch-geometric.readthedocs.io/en/latest/
- https://graphneural.network/, TensorFlow-based

Frederic G., submitted on 2021-05-16. Discussed on 2021-05-24.

Abstract: We will learn what triplet loss is and revisit its usefulness for training classifiers. Next we’ll dive deeper into the slightly related topic of information retrieval, by using differentiable sorting/top-k classification.
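
For reference, the triplet loss itself is one line: pull an anchor closer to a positive (same identity) than to a negative, by at least a margin. A minimal sketch with made-up embeddings:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin), with Euclidean distances."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0])         # anchor embedding
p = np.array([0.1, 0.0])         # same identity, already close
n = np.array([1.0, 0.0])         # different identity, far away
easy = triplet_loss(a, p, n)     # satisfied triplet: zero loss, no signal
hard = triplet_loss(a, n, p)     # violated triplet: positive loss
```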

Reading:

- In Defense of the Triplet Loss for Person Re-Identification
- Stochastic Optimization of Sorting Networks via Continuous Relaxations
- SoftRank: Optimising Non-Smooth Rank Metrics

Alicja, submitted on 2021-04-25. Discussed on 2021-05-10.

Explaining Machine Learning predictions: is it even possible?

Let's try to find out.

Reading:

- "Why Should I Trust You?": Explaining the Predictions of Any Classifier
- A unified approach to interpreting model predictions

Bonus:

- https://christophm.github.io/interpretable-ml-book/agnostic.html
- https://thegradient.pub/interpretability-in-ml-a-broad-overview/
- The Mythos of Model Interpretability, ZC Lipton’s review of the domain
- Explainable Deep Learning, a survey paper on explainable DL

Kuba, discussed on 2020-10-31.

Knowledge distillation refers to training small networks given pre-trained large networks. The hope is to obtain a small network that performs better at the task than if it were trained directly. Small networks are faster to evaluate and require less memory. An example use case for small networks is mobile robotics, where the robots cannot carry around a GPU.
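
The core trick from the Hinton et al. paper fits in a few lines: soften the teacher's logits with a temperature T, so the student can also learn the relative probabilities of the wrong classes (the "dark knowledge"). A minimal sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([6.0, 2.0, 1.0])   # made-up teacher outputs
hard = softmax(teacher_logits)               # nearly one-hot: little signal
soft = softmax(teacher_logits / 4.0)         # T = 4: wrong classes now visible

def distill_loss(student_logits, T=4.0):
    """Cross-entropy of the student against the teacher's softened targets.
    In practice this is mixed with the usual loss on the true labels."""
    return -np.sum(soft * np.log(softmax(student_logits / T)))
```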

Reading:

- Distilling the Knowledge in a Neural Network, G. Hinton et al. 2015 – foundational paper
- Towards Understanding Knowledge Distillation, M. Phuong et al. ICML 2019 – early theoretical results about when distillation works well
- Zero-Shot Knowledge Distillation in Deep Networks, GK Nayak et al. ICML 2019

Bonus, related to adversarial examples:

- Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks, N. Papernot et al. 2015, IEEE Security 2016

Paweł, discussed on 2021-02-17.

When we fit a line to data, we minimize the squared distance. But why does this yield a sensible estimate for the slope? And what is the “uncertainty” of that slope? Bayesian (Deep) Learning answers these questions by interpreting the fitting process as a Bayesian update from prior to posterior knowledge, and applies the same idea to deep learning models by placing uncertainty on the learned weights.

But how is that related to the ensemble models?!
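
For the line-fitting example the whole machinery is a two-line conjugate update. A toy sketch for y = w·x + noise with known noise scale and a Gaussian prior on the slope w (all constants made up):

```python
import numpy as np

rng = np.random.default_rng(0)
true_w, sigma, tau = 2.0, 0.5, 10.0          # slope, noise std, prior std
x = rng.uniform(-1, 1, 50)
y = true_w * x + rng.normal(0, sigma, 50)

# Conjugate Gaussian update: posterior precision = prior + data precision.
post_var = 1.0 / (1.0 / tau**2 + (x @ x) / sigma**2)
post_mean = post_var * (x @ y) / sigma**2
post_std = np.sqrt(post_var)                 # the "uncertainty of the slope"

# As tau -> infinity (flat prior), post_mean reduces to the least-squares slope.
```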

Reading:

- Hands-on Bayesian Neural Networks --- a Tutorial for Deep Learning Users, a great (but a bit long) overview paper, covering Bayesian NNs, Probabilistic Graphical Models and Bayesian inference methods.
- Bayesian Deep Learning and a Probabilistic Perspective of Generalization, a real jewel, describing why ensemble models work (and how to solve the Double Descent Problem, arising from training a big neural network).
- Deep Ensembles: A Loss Landscape Perspective, why don’t Bayesian NNs work as well as ensemble models? After reading the previous paper, you may already know the answer, but this short and cute paper is also worth reading.

Bonus:

- Bayesian Deep Learning, slides to lecture by Yarin Gal, a great introductory resource. May be useful if the review paper above appears too complex.
- Deep Image Prior, Oh no, I’m running out of my TPU quota and I need to solve a Computer Vision problem! How well can I do without training?
- Is it possible to interpret the bootstrap from a Bayesian perspective? Ok, ensemble models suggest bootstrap and bootstrap suggests the frequentism. Aaaaa, frequentism is a heresy in the machine learning community! Is there any way to solve this disagreement?
- The Case for Bayesian Deep Learning, another overview. This time, a short one.
- Subspace Inference for Bayesian Deep Learning, there are better and worse weights configurations over which we can average. How to find a nice subspace of weights, generating a nice prior over models?
- Approximate inference, a NeurIPS2020 workshop on inference methods

Paweł and Adam, discussed on 2020-12-01.

[Achilles] Mr T., why do NNs work?

[Turtle] Because they are universal approximators, aren’t they? Any reasonable function can be approximated by a deep enough net.

[Achilles] But this is boooring… You can accomplish the same result with a look-up table! I wonder why they learn…

[Turtle] Oh, you mean why deep learning works? I don’t know, but let me show you NTK…

Reading:

- Deep Neural Networks as Gaussian Processes
- Neural Tangent Kernel: Convergence and Generalization in Neural Networks
- Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy

Bonus:

- Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks
- Every Model Learned by Gradient Descent Is Approximately a Kernel Machine (Nov 2020)
- Edge of Chaos/vanishing and exploding gradients
- Python library for infinite networks

Paweł, discussed on 2021-02-10.

How does one choose the best set of hyperparameters for a NN, decide which one-armed bandit to play or find the catchiest new topic for OCAML? And what do these things have in common with fitting a line to noisy data? (Or a curve. Or an infinite family of curves). If the optimized function is noisy and our observations are limited, we should consider Bayesian Optimization, powered by flexible regression models called Gaussian Processes (GPs). During this session we will read a review paper [1], which summarizes BayOpt and GPs in under thirty pages, and then see what happens when deep learning researchers enter the challenge [2,3].
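
The loop surveyed in [1] can be sketched in a few dozen lines: fit a GP to the evaluations made so far, then query the candidate maximizing an acquisition function such as expected improvement. A toy, untuned sketch (RBF kernel, noiseless 1-D objective; all constants made up):

```python
import numpy as np
from math import erf, sqrt

def rbf(a, b, ls=0.3):
    """RBF kernel between two 1-D arrays of inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(x_tr, y_tr, x_te, noise=1e-4):
    """GP posterior mean and std at test points, zero prior mean."""
    K = rbf(x_tr, x_tr) + noise * np.eye(len(x_tr))
    Ks = rbf(x_tr, x_te)
    mu = Ks.T @ np.linalg.solve(K, y_tr)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sd, best):
    """EI for maximization: E[max(0, f - best)] under the GP posterior."""
    z = (mu - best) / sd
    cdf = 0.5 * (1 + np.array([erf(v / sqrt(2)) for v in z]))
    pdf = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return (mu - best) * cdf + sd * pdf

f = lambda x: -(x - 0.6) ** 2                # "unknown" objective, max at 0.6
x_tr = np.array([0.0, 0.35, 1.0])            # points evaluated so far
y_tr = f(x_tr)
x_te = np.linspace(0, 1, 101)                # candidate grid
mu, sd = gp_posterior(x_tr, y_tr, x_te)
ei = expected_improvement(mu, sd, y_tr.max())
x_next = x_te[np.argmax(ei)]                 # where to evaluate f next
```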

Reading:

- Taking the Human Out of the Loop: A Review of Bayesian Optimization, an overview paper – the must-read review of BayOpt and GPs.
- Deep Gaussian Processes, how to make more flexible – deep – models,
- Neural Processes, a modern blend of GPs and neural nets advertised as “the best of both worlds”.

Bonus:

- http://gpss.cc/ Gaussian Process Summer School. Plenty of talks of varying difficulty.
- https://github.com/EmuKit/emukit and https://scikit-optimize.github.io/stable/ if you want to quickly optimize the NN hyperparameters.
- Input Warping for Bayesian Optimization of Non-stationary Functions, Oh no, accuracy on a validation data set is not a stationary function of learning rate! What should I do?
- Deep Gaussian Processes for Multi-fidelity Modeling, an actual use of Deep Gaussian Processes
- Batch Bayesian Optimization via Local Penalization, or how to optimize the hyperparameters if you have several GPUs/TPUs available.
- Computing with infinite networks, what is the relationship between NNs and GPs?
- Wide Neural Networks with Bottlenecks are Deep Gaussian Processes, what is the relationship between NNs and GPs? Round 2.

Kuba, discussed on 2021-01-22.

Humans arguably factorise situations / world states into familiar building blocks -- when you see a falling apple for the first time in your life, but you’ve seen apples before and you’ve also seen pears fall before, you know what’s going to happen. Disentanglement, or factorisation, in deep learning can make for cool sets of generated 2D images with faces gradually changing orientation, or colour of hair… But how can it be achieved and formalised? Let’s try to find out.

Reading:

- Disentangling by Factorising (FactorVAE), 2019, UCL, Brain, Helsinki
- Disentangling Disentanglement in Variational Autoencoders, 2019, Oxford
- Variational Autoencoders and Nonlinear ICA: A Unifying Framework, 2020, Deepmind, Oxford

Bonus:

- β-VAE 2017
- Understanding disentangling in β-VAE 2018
- Towards a Definition of Disentangled Representations 2018
- Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning 2019

Adam, Frederic, Kuba, discussed on 2020-10-31.

Transformers use self-attention to process sequences differently from recurrent architectures. The length of the ‘information path’ through the network, between any two sequence elements, is constant, as opposed to O(n) for RNNs. Transformer-based architectures have gotten very popular lately, with applications beyond NLP.
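
The core operation is compact enough to spell out: Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, which is what gives every pair of positions a length-one path. A minimal single-head sketch:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8                          # sequence length, model width
X = rng.normal(size=(n, d))          # toy token embeddings
Wq, Wk, Wv = rng.normal(size=(3, d, d))
out, attn = self_attention(X, Wq, Wk, Wv)
# attn[i, j] is how much position i attends to position j.
```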

Reading:

- Attention is all you need, Vaswani et al. 2017

Foundational paper introducing the Transformer architecture - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Google AI 2018

A majorly successful transformer-based language model from Google - Language Models are Few-Shot Learners, OpenAI 2020

The huge GPT-3 network from daddy Elon

Bonus:

- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, ICML 2020 – introduces a special case of transformers that is linear-time and shows comparable performance; also shows how to interpret any autoregressive transformer model as an RNN
- Improving Language Understanding by Generative Pre-Training, OpenAI 2018 – the GPT paper
- Transformers in simple language (and code), which nicely complements Attention is all you need.

- The Annotated Transformer
- The Illustrated Transformer
- Deep Learning: The transformer
- https://jalammar.github.io/illustrated-bert/
- http://jalammar.github.io/illustrated-gpt2/
- https://openai.com/blog/image-gpt/

Alicja, discussed on 2020-11-01.

Universe is non-Euclidean, why should data be? Graph Neural Networks can save the day!

Reading:

- A Comprehensive Survey on Graph Neural Networks
- Graph Attention Networks
- Neural Execution of Graph Algorithms

Bonus:

- https://arxiv.org/pdf/1904.03751.pdf

Kuba, discussed on 2020-10-31.

Normalising Flows are invertible operators used to build (easily) learnable, complicated probability distributions, for use in variational inference.

I especially encourage you to vote on this topic if you’re also not very familiar with variational inference / Bayesian ML – we can learn together (and hopefully at least one person will know a bit more to help out…)
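
The one identity everything rests on is the change-of-variables formula: if x = f(z) for invertible f and z ~ N(0, 1), then log p(x) = log p_z(f⁻¹(x)) + log|df⁻¹/dx|. A minimal sketch with a single affine "flow" x = a·z + b, chosen so we can check the result against the known N(b, a²) density:

```python
import numpy as np

a, b = 2.0, 1.0                           # parameters of the toy flow x = a*z + b

def log_normal(z):
    """log density of the standard normal base distribution."""
    return -0.5 * (z**2 + np.log(2 * np.pi))

def flow_log_prob(x):
    z = (x - b) / a                       # invert the flow
    return log_normal(z) - np.log(abs(a)) # Jacobian correction term

# Sanity check: this must equal the N(b, a^2) log density.
x = 0.7
expected = -0.5 * ((x - b) ** 2 / a**2 + np.log(2 * np.pi * a**2))
```

Real flows stack many such invertible maps (with richer parameterizations) and sum the per-layer log-Jacobian terms, which is exactly what the papers below build on.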

Reading:

- Variational Inference with Normalising Flows, DJ Rezende et al. (Deepmind) 2015 – the foundational paper on normalising flows. Also see these short tutorials about NFs for lighter reading: 1, 2.
- Improved Variational Inference with Inverse Autoregressive Flow, Kingma et al. (OpenAI) 2017 – introduces the inverse autoregressive flow (IAF), a well-scalable NF; Section 2 contains a quick review of variational inference.
- Parallel WaveNet, Deepmind 2017 – application of knowledge distillation to an IAF-based model to speed up speech synthesis, with a sequential WaveNet model as teacher.

Bonus:

- Brief overview of different kinds of generative models, not sure how good the quality of this is

Paweł, discussed on 2020-10-30.

Adversarial examples try to trick neural nets, much as magicians and optical illusions trick human brains. It’s useful to know how to defend your neural network from these kinds of attacks… or to use them to your advantage during the training phase!
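
The simplest such attack, the fast gradient sign method from the first paper below, is a one-liner: move the input by ε·sign(∇ₓ loss). A toy sketch on logistic regression, where the gradient is available in closed form (all numbers made up):

```python
import numpy as np

w = np.array([2.0, -1.0])         # a fixed "trained" linear model
b = 0.0
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = np.array([0.3, 0.1])          # clean input, true label y = 1
y = 1.0
p_clean = sigmoid(w @ x + b)      # correctly classified (p > 0.5)

# For the logistic loss, grad_x loss = (p - y) * w.
grad_x = (p_clean - y) * w
eps = 0.4
x_adv = x + eps * np.sign(grad_x) # FGSM step
p_adv = sigmoid(w @ x_adv + b)    # confidence in the true class collapses
```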

Reading:

- Explaining and Harnessing Adversarial Examples
- Feature Purification: How Adversarial Training Performs Robust Deep Learning
- Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning

Bonus:

- Adversarial Attacks and Defences: A Survey
- On Evaluating Adversarial Robustness, 2019, Goodfellow et al.
- Motivating the Rules of the Game for Adversarial Example Research, 2018, Goodfellow et al.
- A Research Agenda: Dynamic Models to Defend Against Correlated Attacks, 2019, Goodfellow
- https://adversarial-ml-tutorial.org/introduction/

Paweł, discussed on 2020-11-01.

Drug design is hard – molecules are tricky to describe and it’s not obvious at all how to predict the properties of a proposed molecule just from its chemical structure. Can neural nets, aka universal approximators, solve this problem? Let’s figure it out!

Reading:

- Constrained Graph Variational Autoencoders for Molecule Design
- Interpretable Deep Learning in Drug Discovery
- Generative chemistry: drug discovery with deep learning generative models

Paweł, discussed on 2021-01-11.

Machine learning is great – we are entering the age of self-driving cars, GPTs writing newspaper articles, and fast medical diagnosis. But how much should we trust the NNs when there is no human around? How should they be evaluated so that they are reliable? We will explore how one can discover spurious correlations that make trained models fail to generalize, and how to properly evaluate a novel architecture.
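
A toy illustration of the shortcut problem (hypothetical numbers): a spurious feature – think of a scanner watermark that happens to track the label in one hospital – dominates a least-squares classifier, then collapses under distribution shift:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n) * 2 - 1            # labels in {-1, +1}
core = y + rng.normal(0, 1.5, n)             # weakly predictive real signal
shortcut_tr = y.astype(float)                # spurious feature: perfect in training
X_tr = np.stack([core, shortcut_tr], axis=1)

# Least-squares classifier latches onto the shortcut.
w = np.linalg.lstsq(X_tr, y.astype(float), rcond=None)[0]

# Test distribution: same core signal, but the shortcut is now uninformative.
core_te = y + rng.normal(0, 1.5, n)
shortcut_te = rng.choice([-1.0, 1.0], n)
X_te = np.stack([core_te, shortcut_te], axis=1)

acc_tr = np.mean(np.sign(X_tr @ w) == y)     # near-perfect in training
acc_te = np.mean(np.sign(X_te @ w) == y)     # collapses toward chance
```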

Reading:

- Shortcut Learning in Deep Neural Networks
- Invariant Risk Minimization
- Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

Bonus:

- You Can’t Escape Hyperparameters and Latent Variables, Charles Isbell’s talk at NeurIPS 2020
- I Can’t Believe It’s Not Better!, NeurIPS 2020 workshop
- Underspecification Presents Challenges for Credibility in Modern Machine Learning
- Discriminative learning under covariate shift
- Amazon scraps secret AI recruiting tool that showed bias against women, Reuters
- The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel, with its NN that pauses Tetris just before it loses
- The Risks of Invariant Risk Minimization (ICLR'21)

Kuba, discussed on 2021-01-22.

This three-act story first introduces us to a neuro-symbolic architecture. Then, a video-based reasoning dataset with 4 fundamental types of tasks is presented, where symbolic models may be expected to excel. In a grand finale, DeEpmINd trains a TrAnSfoRMer and outperforms all neuro-symbolic approaches :P

Reading:

- The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, 2019
- CLEVRER: CoLlision Events for Video REpresentation and Reasoning, 2019
- Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures, 2020

Alicja, discussed on 2021-04-10.

Learning to learn - how cool is that?!

Reading:

- https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html
- https://lilianweng.github.io/lil-log/2019/06/23/meta-reinforcement-learning.html
