OCAML members, pre-pandemic picture

Objectively Casual Association for Machine Learning

members

Adam

Alicja

Frederic Grabowski, frederic.grabowski@ippt.pan.pl, IPPT PAN

Kuba Perlin

Paweł

upcoming topics

Causality matters!

Paweł, submitted on 2021-03-20. Will be discussed on 2021-06-07.

comic from xkcd.com

Everybody knows that margarine consumption is a cause of divorces (at least in Maine) and that a PhD in sociology can help you successfully launch a space rocket. Or… maybe not? More seriously, how do we calculate vaccine efficacy? Or how do we build robust models for healthcare, which generalize across different hospitals? (See Building Reliable ML: Should you trust NNs?). Although many popular ML algorithms work on the level of statistical relations, the causal aspects are critical for model generalizability, robustness, and applications to domains in which decisions are important.

Reading:

A Survey of Learning Causality with Data: Problems and Methods, an introductory paper on causality.
Towards Causal Representation Learning, thanks Kuba!
Counterfactuals uncover the modular structure of deep generative models, is causality in any way related to disentangled representations (see Disentangled Representations)?

Bonus:

Nice to see

See Berkson’s and Simpson’s paradoxes. There is a great Numberphile episode on Berkson’s paradox.
If d-separation seems hard, try these: http://bayes.cs.ucla.edu/BOOK-2K/d-sep.html and Section 2.4 of Bayes Ball
https://en.wikipedia.org/wiki/Lady_tasting_tea

Another introduction to causality (quite a few options, in fact)

Elements of Causal Inference, every expert in the topic I know said they had read that book
https://www.bradyneal.com/causal-inference-course, a course from Mila. Great lecture notes and YouTube videos.
A Practical Introduction to Bayesian Estimation of Causal Effects: Parametric and Nonparametric Approaches, another introductory paper to the domain.
Causality For Machine Learning – yet another introduction
Causality, Jonas Peters’ script, which later evolved into Elements of Causal Inference
https://ff13.fastforwardlabs.com/, a nice review
One from UCL
https://causalinference.gitlab.io/kdd-tutorial/

Causality and deep learning

Selected real-world applications

Causality matters in medical imaging
Adapting Neural Networks for the Estimation of Treatment Effects
Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks
Estimating individual treatment effect: generalization bounds and algorithms, theorems on error bounds.
Compositional perturbation autoencoder for single-cell response modeling, using interpretable linear models to learn treatment effects
Bayesian Inference of Individualized Treatment Effects using Multi-task Gaussian Processes

Frameworks

topic proposals

Visual Question Answering

Kuba, submitted on 2020-10-31.

VQA: given an image and an English question about the image, produce a truthful text answer. VQA is perhaps the most studied task where two strikingly different modalities are used. I selected some recent papers from Deepmind on the topic.

Reading:

VQA: Visual Question Answering, Agrawal et al. 2016
Foundational paper
Learning Visual Question Answering by Bootstrapping Hard Attention, Malinowski et al. 2018
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, K Yi et al. 2019

Can RL play Avalon?

Paweł, submitted on 2020-11-01.

Talent wins games, but teamwork and intelligence win championships.

– Michael Jordan

Everybody knows that RL is excellent at one-vs-one games such as chess, Go, or StarCraft. But nowadays teamwork is becoming more and more important – think about a fleet of autonomous taxicabs or an ensemble of delivery drones. How does RL work in the scenario of many autonomous agents? And, more importantly, does multi-agent RL work in the presence of impostors who try to harm the team? (See Avalon or Among Us).

Reading:

How to guide your NN?

Paweł, submitted on 2020-12-01 (rebranded on 2021-02-14).

We build machine learning systems by taking a big neural network and feeding it millions of examples. However, for every task there are more and less sensible architectures. For example, CNNs employ an inductive bias: “To recognize an object, find ears, nose, wheels. To recognize ears, nose, wheels, find appropriate edges. To recognize edges, ...”.

Very often we – humans – already have a pretty good intuition about what the solution should look like and what subproblems need to be solved. But how do we guide the NN to do that?

comic from xkcd.com

Reading:

Augmenting Neural Networks with First-order Logic
Group Equivariant Convolutional Networks, if a problem has a specific symmetry, we can use it to our advantage!
Towards Expressive Priors for Bayesian Neural Networks: Poisson Process Radial Basis Function Networks, a novel Bayesian prior which allows an expert to input important problem properties into the model.

Bonus:

The shape of deep learning

Paweł, submitted on 2021-02-03.

Quoting the wisest, “Universe is non-Euclidean, why should data be?”. Let’s explore how the ideas from geometry and topology can help neural networks work better.

In particular we will learn that all classification problems can always be solved (at least from the topologists’ perspective) and how autoencoders are linked to the Manifold Hypothesis. We will also discuss whether the task of neural network validation can be accomplished... without a validation data set!

Reading:

Bonus:

Can I have infinitely many layers, please?
The story of Neural ODEs

Paweł, submitted on 2021-02-03.

We all know and love ResNets, which come in many flavours: ResNet50, ResNet100, ResNet1202… But what if we wanted to train an infinitely deep ResNet? We arrive at Neural ODEs!
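The link between ResNets and ODEs can be sketched in a few lines. A residual block computes h + f(h), which has the same form as one explicit Euler step for dh/dt = f(h). The sketch below is a toy, not the method of any paper on the list: the dynamics f(h) = -h, the step size of total_time / depth, and the function names are all assumptions chosen so that the exact answer, exp(-1), is known. Unlike a real ResNet, every "layer" here shares the same f.

```python
import math

def dynamics(h):
    # Hypothetical "layer" function f(h); real networks learn this.
    return -h

def residual_network(h0, depth, total_time=1.0):
    # Each residual block h <- h + step * f(h) is one explicit Euler step.
    step = total_time / depth
    h = h0
    for _ in range(depth):
        h = h + step * dynamics(h)
    return h

# Exact solution of dh/dt = -h with h(0) = 1, evaluated at t = 1.
exact = math.exp(-1.0)

# The deeper the "network", the closer it gets to the ODE solution.
depths = (1, 10, 100, 1000)
errors = [abs(residual_network(1.0, d) - exact) for d in depths]
```

In this toy the error shrinks as depth grows, which is the intuition behind the "infinitely deep" picture. The reading below discusses where that picture needs care.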

Reading:

Neural Ordinary Differential Equations, foundational, Best Paper Award at NeurIPS 2018
Augmented Neural ODEs, three Oxford statisticians shed some light on what you can’t learn with vanilla Neural ODEs (and how to fix them)
Dissecting Neural ODEs, why the statement “vanilla Neural ODEs are infinitely deep ResNets” is (a bit of) clickbait (and how to fix them)

Bonus:

Stack Exchange thread on Neural ODEs
NeurIPS 2020 tutorial on Deep Implicit Layers
How to train your Neural ODE, which discusses how to efficiently train Neural ODEs on large data sets

Graph Neural Networks 9 3/4

Paweł, Adam, Frederic G., submitted on 2021-05-24.

Time to take a look at GNNs in real life!

Molecules:

Other applications:

General resources:

https://www.cs.mcgill.ca/~wlh/grl_book/ (not actually a real application, but an excellent resource on GNNs)
https://github.com/DeepGraphLearning/LiteratureDL4Graph

Frameworks:

topic archive

Triplet loss and smooth sorting

Frederic G., submitted on 2021-05-16. Discussed on 2021-05-24.

Abstract: We will learn what triplet loss is and revisit its usefulness for training classifiers. Next we’ll dive deeper into the slightly related topic of information retrieval, by using differentiable sorting/top-k classification.
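As a warm-up, here is a minimal sketch of a triplet loss. It assumes Euclidean distance between embeddings and a margin of 1.0. Both are illustrative choices, as are the function names, and some formulations use squared distances instead. The loss is zero once the positive is closer to the anchor than the negative by at least the margin.

```python
import math

def euclidean(u, v):
    # Plain Euclidean distance between two embedding vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Penalize triplets where the positive is not closer than the
    # negative by at least `margin`.
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

easy = triplet_loss(anchor, positive, negative)  # already well separated
hard = triplet_loss(anchor, negative, positive)  # roles swapped on purpose
```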

Reading:

Bonus:

Differentiable Ranks and Sorting using Optimal Transport

Explaining Machine Learning Predictions

Alicja, submitted on 2021-04-25. Discussed on 2021-05-10.

Explaining Machine Learning predictions: is it even possible?

Let's try to find out.

Reading:

Bonus:

https://christophm.github.io/interpretable-ml-book/agnostic.html
https://thegradient.pub/interpretability-in-ml-a-broad-overview/
The Mythos of Model Interpretability, ZC Lipton’s review of the domain
Explainable Deep Learning, a survey paper on explainable DL

Knowledge Distillation

Kuba, discussed on 2020-10-31.

Knowledge distillation refers to training small networks given pre-trained large networks. The hope is to obtain a small network that performs better on the task than it would if trained directly. Small networks are faster to evaluate and require less memory. An example use case for small networks is mobile robotics, where the robots cannot carry around a GPU.
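One common way to set this up is to have the small network match the large network's softened output distribution. The sketch below assumes a temperature-scaled softmax and a cross-entropy between teacher and student outputs. The function names, logits, and temperature values are made up for illustration, and practical recipes usually combine this term with the ordinary loss on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    # A higher temperature gives a softer (more uniform) distribution.
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy of the student's softened prediction against the
    # teacher's softened prediction.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

teacher_logits = [5.0, 1.0, 0.0]
sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)

matched = distillation_loss([5.0, 1.0, 0.0], teacher_logits)
mismatched = distillation_loss([0.0, 1.0, 5.0], teacher_logits)
```

A student that reproduces the teacher's logits gets a lower loss than one that ranks the classes the other way round.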

Reading:

Distilling the Knowledge in a Neural Network, G. Hinton et al. 2015
foundational paper
Towards Understanding Knowledge Distillation, M. Phuong et al. ICML 2019
early theoretical results about when distillation works well
Zero-Shot Knowledge Distillation in Deep Networks, GK Nayak et al. ICML 2019

Bonus, related to adversarial examples:

Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks, N. Papernot et al. 2015, IEEE Security 2016

Deep ensembles and Bayesian Deep Learning

Paweł, discussed on 2021-02-17.

When we fit a line to the data, we minimize the squared distance. But why does this yield a sensible estimate for the slope? And what is the “uncertainty” of that slope? Bayesian (Deep) Learning answers these questions by interpreting the fitting process as a Bayesian update from prior to posterior knowledge, and tries to do the same for a deep learning model by placing uncertainty on the learned weights.

But how is that related to the ensemble models?!
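For the line-fitting question, a small worked example may help. The sketch assumes the simplest possible model: a line through the origin, y = w * x plus Gaussian noise with known standard deviation, and a zero-mean Gaussian prior on the slope w. Under these assumptions the posterior over w is Gaussian in closed form, and with a very wide prior its mean approaches the least-squares slope. The data and hyperparameters are made up for illustration.

```python
def least_squares_slope(xs, ys):
    # Minimizer of sum (y - w * x)^2 for a line through the origin.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def slope_posterior(xs, ys, noise_std=1.0, prior_std=10.0):
    # Gaussian prior w ~ N(0, prior_std^2), Gaussian noise on y.
    # Returns the posterior mean and standard deviation of the slope.
    precision = 1.0 / prior_std ** 2 + sum(x * x for x in xs) / noise_std ** 2
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_std ** 2) / precision
    return mean, (1.0 / precision) ** 0.5

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

ls_slope = least_squares_slope(xs, ys)
post_mean, post_std = slope_posterior(xs, ys)

# Seeing the same data twice makes the posterior narrower.
_, post_std_more_data = slope_posterior(xs + xs, ys + ys)
```

The posterior standard deviation is one concrete reading of the “uncertainty” of the slope, and it shrinks as more observations arrive.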

Reading:

Hands-on Bayesian Neural Networks --- a Tutorial for Deep Learning Users, a great (but a bit long) overview paper, covering Bayesian NNs, Probabilistic Graphical Models and Bayesian inference methods.
Bayesian Deep Learning and a Probabilistic Perspective of Generalization, a real jewel, describing why ensemble models work (and how to solve the Double Descent Problem, arising from training a big neural network).
Deep Ensembles: A Loss Landscape Perspective, why don’t Bayesian NNs work as well as ensemble models? After reading the previous paper, you may already know the answer, but this short and cute paper is also worth reading.

Bonus:

Bayesian Deep Learning, slides from a lecture by Yarin Gal, a great introductory resource. May be useful if the review paper above appears too complex.
Deep Image Prior, Oh no, I’m running out of my TPU quota and I need to solve a Computer Vision problem! How well can I do without training?
Is it possible to interpret the bootstrap from a Bayesian perspective? Ok, ensemble models suggest bootstrap and bootstrap suggests frequentism. Aaaaa, frequentism is a heresy in the machine learning community! Is there any way to solve this disagreement?
The Case for Bayesian Deep Learning, another overview. This time, a short one.
Subspace Inference for Bayesian Deep Learning, there are better and worse weight configurations over which we can average. How to find a nice subspace of weights, generating a nice prior over models?
Approximate inference, a NeurIPS2020 workshop on inference methods

Neural Tangent Kernel

Paweł and Adam, discussed on 2020-12-01.

[Achilles] Mr T., why do NNs work?

[Turtle] Because they are universal approximators, aren’t they? Any reasonable function can be approximated by a deep enough net.

[Achilles] But this is boooring… You can accomplish the same result with a look-up table! I wonder why they learn…

[Turtle] Oh, you mean why deep learning works? I don’t know, but let me show you NTK…

Reading:

Bonus:

Understanding the Neural Tangent Kernel (Rajat's blog)

Gaussian Process Winter School

Paweł, discussed on 2021-02-10.

How does one choose the best set of hyperparameters for a NN, decide which one-armed bandit to play or find the catchiest new topic for OCAML? And what do these things have in common with fitting a line to noisy data? (Or a curve. Or an infinite family of curves). If the optimized function is noisy and our observations are limited, we should consider Bayesian Optimization powered by flexible regression models called Gaussian Processes (GPs). During this session we will read a review paper [1], which summarizes BayOpt and GPs in fewer than thirty pages, and then see what happens when deep learning researchers enter the challenge [2,3].
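To make the regression part concrete, here is a minimal sketch of GP regression in one dimension. It assumes a zero-mean prior, an RBF kernel with unit length scale, and a tiny noise variance. The helper names, the data, and the hand-written linear solver are illustrative only, and libraries such as those listed under Bonus handle this far more robustly.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(matrix, rhs):
    # Solve matrix @ x = rhs by Gaussian elimination with partial pivoting.
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x

def gp_posterior(train_x, train_y, query, noise_var=1e-6):
    # Posterior mean and variance of a zero-mean GP at a single query point.
    n = len(train_x)
    gram = [[rbf_kernel(train_x[i], train_x[j]) + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    cross = [rbf_kernel(query, xi) for xi in train_x]
    mean = sum(c * a for c, a in zip(cross, solve(gram, train_y)))
    reduction = sum(c * s for c, s in zip(cross, solve(gram, cross)))
    return mean, rbf_kernel(query, query) - reduction

train_x = [-1.0, 0.0, 1.0]
train_y = [0.5, 1.0, 0.2]
mean_at_data, var_at_data = gp_posterior(train_x, train_y, 0.0)
mean_far, var_far = gp_posterior(train_x, train_y, 10.0)
```

Near an observation the posterior is confident and close to the observed value, while far from the data it falls back to the prior. Bayesian Optimization uses this uncertainty to decide where to evaluate next.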

Reading:

Taking the Human Out of the Loop: A Review of Bayesian Optimization, an overview paper – the must-read review of BayOpt and GPs.
Deep Gaussian Processes, how to make more flexible – deep – models,
Neural Processes, a modern blend of GPs and neural nets advertised as “the best of both worlds”.

Bonus:

http://gpss.cc/ Gaussian Process Summer School. Plenty of talks of varying difficulty.
https://github.com/EmuKit/emukit and https://scikit-optimize.github.io/stable/ if you want to quickly optimize the NN hyperparameters.
Input Warping for Bayesian Optimization of Non-stationary Functions, Oh no, accuracy on a validation data set is not a stationary function of learning rate! What should I do?
Deep Gaussian Processes for Multi-fidelity Modeling, an actual use of Deep Gaussian Processes
Batch Bayesian Optimization via Local Penalization, or how to optimize the hyperparameters if you have several GPUs/TPUs available.
Computing with infinite networks, what is the relationship between NNs and GPs?
Wide Neural Networks with Bottlenecks are Deep Gaussian Processes, what is the relationship between NNs and GPs? Round 2.

Disentangled Representations

Kuba, discussed on 2021-01-22.

Humans arguably factorise situations / world states into familiar building blocks -- when you see a falling apple for the first time in your life, but you’ve seen apples before and you’ve also seen pears fall before, you know what’s going to happen. Disentanglement, or factorisation, in deep learning can make for cool sets of generated 2D images with faces gradually changing orientation, or colour of hair… But how can it be achieved and formalised? Let’s try to find out.

Reading:

Disentangling by Factorising (FactorVAE)
2019, UCL, Brain, Helsinki
Disentangling Disentanglement in Variational Autoencoders
2019, Oxford
Variational Autoencoders and Nonlinear ICA: A Unifying Framework
2020, Deepmind, Oxford

Bonus:

Transformers in NLP

Adam, Frederic, Kuba, discussed on 2020-10-31.

Transformers use self-attention to process sequences differently from recurrent architectures. The length of the ‘information path’ through the network, between any two sequence elements, is constant, as opposed to O(n) for RNNs. Transformer-based architectures have gotten very popular lately, with applications beyond NLP.
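The constant path length can be seen in a minimal sketch of scaled dot-product self-attention. To stay short, it assumes the queries, keys, and values are the token vectors themselves, whereas a real Transformer applies learned projections first and uses several heads. The token values and function names are made up for illustration.

```python
import math

def softmax(row):
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    # Scaled dot-product attention over one sequence.
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Every row of `weights` gives nonzero weight to every position, so any two sequence elements interact in a single step rather than through a chain of recurrent updates.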

Reading:

Attention is all you need, Vaswani et al. 2017
Foundational paper introducing the Transformer architecture
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Google AI 2018
A majorly successful transformer-based language model from Google
Language Models are Few-Shot Learners, OpenAI 2020
The huge GPT-3 network from daddy Elon

Bonus:

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
Introduces a special case of transformers that’s linear time and shows comparable performance. Shows how to interpret any autoregressive transformer model as an RNN. ICML 2020
Improving Language Understanding by Generative Pre-Training, OpenAI 2018
The GPT paper
Transformers in simple language (and code), which nicely complement Attention is all you need.

Graph Neural Networks

Alicja, discussed on 2020-11-01.

Universe is non-Euclidean, why should data be? Graph Neural Networks can save the day!

Reading:

Bonus:

https://arxiv.org/pdf/1904.03751.pdf

Normalising Flows and an application

Kuba, discussed on 2020-10-31.

Normalising Flows are invertible operators used to build (easily) learnable, complicated probability distributions, for use in variational inference.

I especially encourage you to vote on this topic if you’re also not very familiar with variational inference / Bayesian ML -- we can learn together (and hopefully at least one person will know a bit more to help out…)
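The core mechanism is the change-of-variables formula: push a simple base density through invertible maps and correct for how much each map stretches space. The sketch below assumes one-dimensional affine maps and a standard normal base distribution. These stand in for the richer flows in the papers, and the class and function names are hypothetical.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

class AffineFlow:
    # Invertible map x = scale * z + shift (scale must be nonzero).
    def __init__(self, scale, shift):
        self.scale = scale
        self.shift = shift

    def forward(self, z):
        return self.scale * z + self.shift

    def inverse(self, x):
        return (x - self.shift) / self.scale

    def log_abs_det_jacobian(self):
        return math.log(abs(self.scale))

def flow_logpdf(flows, x):
    # Undo the flows in reverse order, accumulating the log-determinants,
    # then evaluate the base density at the recovered point.
    log_det_total = 0.0
    for flow in reversed(flows):
        x = flow.inverse(x)
        log_det_total += flow.log_abs_det_jacobian()
    return standard_normal_logpdf(x) - log_det_total

# Composing z -> 2z + 1 and then -> 0.5 * (.) - 3 gives x = z - 2.5.
flows = [AffineFlow(2.0, 1.0), AffineFlow(0.5, -3.0)]
log_density_at_mode = flow_logpdf(flows, -2.5)

# A single stretch by 2 lowers the peak log-density by log(2).
stretched_peak = flow_logpdf([AffineFlow(2.0, 1.0)], 1.0)
```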

Reading:

Variational Inference with Normalising Flows, DJ Rezende et al. (Deepmind) 2015
The foundational paper on normalising flows. Also see these short tutorials about NFs for lighter reading: 1, 2.
Improved Variational Inference with Inverse Autoregressive Flow, Kingma et al. (OpenAI) 2017
Introduces the inverse autoregressive flow (IAF), an NF that scales well. Section 2 contains a quick review of variational inference.
Parallel WaveNet, Deepmind 2017
Application of knowledge distillation into an IAF-based model to speed up speech synthesis, with a sequential WaveNet model as Teacher

Bonus:

Brief overview of different kinds of generative models, not sure how good the quality of this is

Adversarial examples

Paweł, discussed on 2020-10-30.

Adversarial examples try to trick neural nets, much as magicians and optical illusions trick human brains. It’s useful to know how to defend your neural network from these kinds of attacks… or use them during the training phase to your advantage!
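A minimal sketch of one well-known attack, the fast gradient sign method (FGSM), shows how little it can take. To keep it self-contained, the "network" is assumed to be a logistic regression with fixed weights, so the gradient of the loss with respect to the input has a closed form. The weights, input, and epsilon are made up for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(weights, bias, x):
    # Probability of class 1 under a logistic regression model.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(weights, bias, x, label, epsilon):
    # The gradient of the cross-entropy loss with respect to the input is
    # (prediction - label) * weights; step by epsilon along its sign.
    error = predict(weights, bias, x) - label
    gradient = [error * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, gradient)]

weights, bias = [2.0, -1.0, 0.5], 0.0
x = [0.3, 0.1, 0.2]   # classified as class 1
epsilon = 0.3

x_adv = fgsm(weights, bias, x, label=1, epsilon=epsilon)
clean_prob = predict(weights, bias, x)
adv_prob = predict(weights, bias, x_adv)
```

No coordinate moves by more than epsilon, yet the predicted class flips. Feeding such perturbed inputs back into training is the basic idea behind adversarial training.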

Reading:

Bonus:

Adversarial Attacks and Defences: A Survey
On Evaluating Adversarial Robustness, 2019, Goodfellow et al.
Motivating the Rules of the Game for Adversarial Example Research, 2018, Goodfellow et al.
A Research Agenda: Dynamic Models to Defend Against Correlated Attacks, 2019, Goodfellow
https://adversarial-ml-tutorial.org/introduction/

Machine Learning in drug development

Paweł, discussed on 2020-11-01.

Drug design is hard – molecules are tricky to describe and it’s not obvious at all how to predict the properties of a proposed molecule just from its chemical structure. Can neural nets, aka universal approximators, solve this problem? Let’s figure it out!

Reading:

Building Reliable ML: Should you trust NNs?

Paweł, discussed on 2021-01-11.

Machine learning is great – we are entering the age of self-driving cars, GPTs writing newspapers, and fast medical diagnosis. But how much should we trust the NNs when there is no human around? How should they be evaluated so they are reliable? We will explore how one can discover spurious correlations leading to non-generalizability of trained models and how to properly evaluate a novel architecture.

Reading:

Bonus:

You Can’t Escape Hyperparameters and Latent Variables, Charles Isbell’s talk at NeurIPS 2020
I Can’t Believe It’s Not Better!, NeurIPS 2020 workshop
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Discriminative learning under covariate shift
Amazon scraps secret AI recruiting tool that showed bias against women, Reuters
The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel, with its NN that pauses Tetris just before it loses
The Risks of Invariant Risk Minimization (ICLR'21)

Symbolic ML & Transformers try to understand causality in videos

Kuba, discussed on 2021-01-22.

This three-act story first introduces us to a neuro-symbolic architecture. Then, a video-based reasoning dataset with 4 fundamental types of tasks is presented, where symbolic models may be expected to excel. In a grand finale, DeEpmINd trains a TrAnSfoRMer and outperforms all neuro-symbolic approaches :P

Reading:

Learning to learn

Alicja, discussed on 2021-04-10.

Learning to learn - how cool is that?!

Reading:

Bonus:

https://sites.google.com/view/icml19metalearning
