Introduction to Machine Learning principles

oriented toward understanding and mathematics
rather than impressive demonstrations

– I will try to answer some of your questions –

M-A Delsuc - IGBMC - Université de Strasbourg
produced with Quarto & reveal
CC BY 4.0

2024-05-13

New Science

New Science ?

An example of mine from 25 years ago.

A simple perceptron-type neural network to assign amino-acid signals in protein 1H NMR spectra.

- input layer : 32
- hidden layer: 4
- output layer: 20
- software: MATLAB module
- training time: long!
  • PROFFseq, Rost & Sander (1993) (Protein Secondary Structure Prediction)
  • Perceptrons, Minsky & Papert (1969)

Old Science ?

Well known example

Galileo, Pisa 1564 – Florence 1642

Analysis of a system, from measurements to model (source: Wikipedia)

So, what is new ?

Computers

  • Modern Machine Learning is about handling a lot of data
  • Large memory – Large computing capacities
  • parallelization – GPU

Mathematics / Algorithmics

  • new mathematics of high-dimensional spaces
    • Johnson–Lindenstrauss lemma (1984)
    • Kullback–Leibler divergence (1951)
  • automatic differentiation and back-propagation
  • Tensor objects
  • Stochastic Gradient Descent (SGD)

Efficiency

  • “bio-inspired” approaches
  • favour efficiency over understanding
  • size of training sets (digitalisation of the world)
  • driven by Large Companies
    • Google, Meta, Microsoft

The problem

General view of Data Analysis and Modeling:

DATA
can be anything
- images
- distributions
- values
- classification

\(N\) measurements
\(X_n \quad n : \{1 \cdots N\}\)
each \(X_n\) contains
1 or more “features”
\(\Rightarrow\) stored in a matrix \(X\)
NOT in Excel !

MODEL
can be anything
- analog
- equation
- program
- neural network

\(P\) parameters
\(M_p \quad p : \{1 \cdots P\}\)


\(\Rightarrow\) stored in a program
NOT in Excel !

Question
can be many things
- regression
- classification
- clustering
- model confirmation
- inversion
- denoising
- generative

“NOT in Excel” could be a definition of modern ML !

an optimization problem - 1

training

  • apply the model \(M\) to the data \(X\) and see how well it matches
  • modify the parameters to improve the match
  • loop until satisfied

\(\Rightarrow\) trained model

to do so:

build a target function \(T\) which measures the mismatch (or a similarity)

Either:

1/ some answers \(A\) are known - supervised training \(\;\equiv\quad\) knowledge extension

2/ No known answers - unsupervised training \(\;\equiv\quad\) knowledge extraction

an optimization problem - 2

1/ supervised training

some answers \(A\) are known - build \(T\) using the answers:

\[T(M) = d( M(X), A )\qquad \text{where } d \text{ is a distance}\]

2/ unsupervised training

No known answers - build \(T\) with other kinds of information:

  • use some kind of a priori information \(f\quad\) (positivity, sparsity, entropy, other global statistics, …) \[T(M) = f(M)\]
  • test the model against itself by cutting \(X\) into two parts: \(X_{training}\) and \(X_{validation}\) \[T(M) = d(M(X_t), X_v)\]

3/ combine both approaches \[T(M) = d( M(X), A ) + \alpha f(M)\] This is the regularisation approach – quite common in practice
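
A minimal sketch of such a regularised target function (the function names, the toy linear model and the data are my own illustration, not the author's):

```python
import numpy as np

# T(M) = d( M(X), A ) + alpha * f(M)   -- hypothetical names, toy example
def target(params, model, X, A, alpha=0.0):
    """Mismatch between model output and known answers, plus an a-priori penalty."""
    prediction = model(X, params)                       # apply the model M to the data X
    mismatch = np.sqrt(np.sum((prediction - A) ** 2))   # d(): the l2 distance
    penalty = np.sum(np.abs(params))                    # f(): e.g. an l1 (sparsity) prior
    return mismatch + alpha * penalty

# toy usage: a linear model with P = 5 parameters
model = lambda X, p: X @ p
X = np.random.randn(100, 5)                    # N = 100 measurements, 5 features
A = X @ np.array([1., 0., 0., 2., 0.])         # known answers
print(target(np.zeros(5), model, X, A, alpha=0.1))
```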

an optimization problem - 3

In all cases, find the set of parameters \(M_p\) so that \(T(M)\) is optimal:
a minimum if \(d()\) is a distance – a maximum if \(d()\) is a similarity

The distance can be the Euclidean distance: \[d(a,b) = \sqrt{ \sum_i (a_i - b_i)^2 }\] (also called the \(\ell_2\) norm)

but it can be any other norm (or even a pseudo-norm), in particular for high-dimensional datasets.

for instance

  • the \(\ell_1\) norm: \(\quad \ell_1(a,b) = \sum_i |a_i - b_i|\)
  • the nuclear (trace) norm: \(\quad d_S(a,b) = \sum_i \sigma(a-b)_i \quad\) where \(\sigma(M)_i\) is the i-th singular value of the matrix \(M\)
  • the KL divergence: \(\quad d_{KL}(a,b) = \sum_i a_i \log(\frac{ a_i} {b_i})\), a pseudo-distance which handles \(a\) and \(b\) as probability density functions (an information measure)
  • or the cosine similarity: \(\quad d_c(a,b) = \cos(\theta) = \frac{a \cdot b}{\Vert a\Vert \, \Vert b \Vert} \quad\) where \(\theta\) is the angle between \(a\) and \(b\) in the multidimensional vector space
    • equal to 1.0 when both vectors are proportional
  • or anything else…
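
For illustration, here is how these measures can be computed with numpy on two small vectors (toy values of my own choosing):

```python
import numpy as np

a = np.array([0.2, 0.5, 0.3])      # treated as PDFs for the KL divergence (sum to 1, > 0)
b = np.array([0.1, 0.6, 0.3])

l2  = np.sqrt(np.sum((a - b)**2))                        # Euclidean / l2 distance
l1  = np.sum(np.abs(a - b))                              # l1 distance
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # cosine similarity
kl  = np.sum(a * np.log(a / b))                          # KL divergence
print(l2, l1, cos, kl)
```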

an optimization problem - 4

We are dealing with a HUGE search space, so we need to compute the derivative of \(T\) with respect to the parameters \(M_p\):

\[ \nabla M = \frac{\partial T}{\partial M_p}\]

which is a vector of dimension \(P\); depending on the form of \(M\), it can be a very complex function.

\(\Rightarrow\) automatic differentiation comes to the rescue.

Gradient Descent single step

then, a single step is taken in the downward direction: \[ M_{n+1} = M_n - \gamma \nabla M (M_n)\] - steps (often called epochs) are iterated until convergence.
- \(\gamma\) (the learning rate) is a small number which ensures convergence

source Wikipedia
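
A minimal sketch of this iteration, on a simple convex toy target of my own choosing (not the author's example):

```python
import numpy as np

def T(M):
    return np.sum((M - 3.0)**2)            # a simple convex target, minimum at M = 3

def grad_T(M):
    return 2.0 * (M - 3.0)                 # its analytical gradient dT/dM

M = np.zeros(4)                            # initial parameters
gamma = 0.1                                # the learning rate
for step in range(100):                    # iterate steps until convergence
    M = M - gamma * grad_T(M)              # one step in the downward direction
print(M)                                   # ~ [3, 3, 3, 3]
```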

Stochastic Gradient Descent (SGD)

Computing the gradient \(\nabla M\) over the whole dataset is usually too expensive,
\(\Rightarrow\) at each step it is evaluated on a randomly selected subset of the data, called a mini-batch

Many improvements are possible :

  • parallelization of the code
  • descent momentum
  • adaptive learning rate

With SGD, the convergence is efficient; however, it is not monotonic

source Wikipedia
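
A minimal sketch of plain mini-batch SGD (no momentum, no adaptive rate) on a toy linear-regression problem of my own making:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # N = 1000 measurements, 5 features
w_true = np.array([1., -2., 0., 3., 0.5])
A = X @ w_true + 0.1 * rng.normal(size=1000)      # known answers, with noise

M = np.zeros(5)                                   # parameters to learn
gamma = 0.05                                      # learning rate
for epoch in range(50):
    idx = rng.permutation(1000)
    for batch in idx.reshape(-1, 50):             # mini-batches of 50 examples
        Xb, Ab = X[batch], A[batch]
        grad = 2 * Xb.T @ (Xb @ M - Ab) / len(batch)   # gradient on the mini-batch only
        M = M - gamma * grad
print(M)                                          # close to w_true, but not monotonically so
```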

Convexity

The value of the target function \(T(M)\) draws a surface over the space of parameters \(M_p\)

which can be convex or not

convex / non-convex, here in 2 dimensions

  • unique minimum or multiple minima!
  • algorithmics is very different !

\(\ell_1\) and \(\ell_2\) are convex, but the convexity of the problem also depends on the model.

We need data

a lot of data!

Data can be anything

A Dataset is a set of points in a multidimensional space.

  • numerical values
  • numerical values with error bars !
  • dates
  • texts
  • categorical values
    • colors
    • yes/no
    • quantiles
    • etc…

The number of dimensions \(N\) can be large !

Two different kind of data

tabulated datasets

  • heterogeneous data aggregated from several independent sources
  • tagged values
  • metadata

\(\Rightarrow\) statistical approaches well adapted
PCA / LDA — SVM — Random Forests — …

Not trendy but very efficient !

unstructured datasets

  • values from raw measurements
  • pictures / texts
  • genomics / protein sequences

\(\Rightarrow\) Deep Learning approaches well adapted
Deep Neural Networks (DNN)

Very trendy ! \(\qquad\) AI 😢

common concepts – vocabulary

  • training set / test set / validation set
  • cross validation
  • confusion matrix
  • hyper parameters
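
For illustration only, here is how these concepts map onto scikit-learn (the synthetic dataset and the choice of a random forest are my own, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# training set / test set split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100)     # n_estimators is a hyper-parameter
clf.fit(X_train, y_train)

# cross validation on the training set
print(cross_val_score(clf, X_train, y_train, cv=5))

# confusion matrix on the held-out test set
print(confusion_matrix(y_test, clf.predict(X_test)))
```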

Statistical approaches

scikit-learn

scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Python-based open-source tools

  • numpy: tools for general mathematical and array handling
  • scipy: advanced mathematical and statistical tools
  • matplotlib: generic plotting library
  • pandas: tabulated data handling
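
A possible minimal example of such a statistical pipeline (PCA followed by an SVM, on the classic iris dataset – my choice of example, not from the slides):

```python
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)          # a small tabulated dataset

# PCA to reduce the dimensionality, then an SVM classifier
model = make_pipeline(PCA(n_components=2), SVC(kernel="rbf"))
model.fit(X, y)
print(model.score(X, y))                   # accuracy on the training data
```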

Deep Learning

Also Deep Neural Networks or DNN

  • PyTorch: originally created by academics, developed by Facebook, released in 2016
  • TensorFlow: developed by Google, released in 2015
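
As a hedged sketch, the 32-4-20 perceptron of the first slide could be written in a few lines of modern PyTorch (the random data, the tanh non-linearity and the cross-entropy loss are my assumptions):

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(32, 4),    # input layer (32) -> hidden layer (4)
    nn.Tanh(),           # sigmoid-like non-linearity
    nn.Linear(4, 20),    # hidden layer (4) -> output layer (20 classes)
)
loss_fn = nn.CrossEntropyLoss()                              # the target function T
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)     # stochastic gradient descent

x = torch.randn(8, 32)                  # a mini-batch of 8 synthetic inputs
target = torch.randint(0, 20, (8,))     # synthetic class labels
loss = loss_fn(model(x), target)
loss.backward()                         # automatic differentiation (back-propagation)
optimizer.step()                        # one gradient-descent step
```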

what is a Neural Network

  • a “neuron” is a simple mathematical function that aggregates information
  • one neuron \(k\) has several inputs \(I_i\) and one output \(O^k\),
  • it is connected to the previous neurons, and computes a weighted sum of their outputs: \(O^k = \sum_i W_i^k I_i\)
  • this sum is passed through a non-linear function \(f^k()\) and the result is sent to the next set of neurons
  • neurons are usually organized in layers, in a feed-forward manner

Typically, each neuron has its own set of weights \(W_i\), one per input, and possibly parameters for \(f()\)
For a layer of \(K\) neurons connected to an input layer of \(L\) neurons, the number of parameters is proportional to \(K \times L\)
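
A minimal numpy sketch of one such layer (dimensions borrowed from the first slide, weights random, tanh as the non-linearity – my choices):

```python
import numpy as np

def layer(I, W, f=np.tanh):
    """One layer of K neurons: O^k = f( sum_i W[k, i] * I[i] )."""
    return f(W @ I)

L, K = 32, 4                      # L inputs feeding a layer of K neurons
W = np.random.randn(K, L)         # K x L weights: one set of W_i per neuron
I = np.random.randn(L)            # the input vector
print(layer(I, W))                # the K outputs, sent to the next layer
```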

implementation examples

the non-linear \(f()\) function

  • sigmoid: \(\quad \mathrm{sig}(x) = \tanh(x - x_0)\)
  • ReLU: \(\quad \mathrm{ReLU}(x) = \max(0, x - x_0)\)
  • SoftMax: \(\quad \sigma(z_i) = \frac {e^{z_i}}{\sum_k e^{z_k}}\)
  • LogSumExp: \(\quad \mathrm{LSE}(z) = \log\left( \sum_i e^{z_i} \right)\)
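
Written out in numpy (the offset \(x_0\) defaults to 0 here, my choice):

```python
import numpy as np

sigmoid   = lambda x, x0=0.0: np.tanh(x - x0)           # sigmoid-shaped, as on the slide
relu      = lambda x, x0=0.0: np.maximum(0.0, x - x0)   # ReLU
softmax   = lambda z: np.exp(z) / np.sum(np.exp(z))     # SoftMax (normalized to 1)
logsumexp = lambda z: np.log(np.sum(np.exp(z)))         # LogSumExp

z = np.array([-1.0, 0.5, 2.0])
print(relu(z), softmax(z), logsumexp(z))
```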

what is deep ?

In my first slide, we had a 3-layer NN \(\equiv\) perceptron
it was the maximum practical size before automatic differentiation !
\(\Rightarrow\) probably \(\approx\) 1000 parameters

Modern DNNs have dozens to hundreds of layers, each composed of thousands of neurons.

One recent example from my group (source Laura Duciel)

different NN types

  • Different geometries
    • perceptron NN
    • auto-encoder NN
    • convolutional NN – CNN
    • adversarial NN
    • Long Short Term Memory – LSTM
  • different techniques
    • Attention
    • Distillation
    • Latent variables
    • denoising
    • Generative NN
    • hidden Markov process

source: the Asimov Institute

Image processing

Example of an Image classifier DNN

source Wikipedia

  • Convolutional input layers
  • Classifier output layer
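
A hedged sketch of such a classifier in PyTorch (the layer sizes, image size and 10 output classes are arbitrary choices of mine, not the network shown on the slide):

```python
import torch
from torch import nn

model = nn.Sequential(
    # convolutional input layers
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    # classifier output layer: 10 classes
    nn.Linear(32 * 8 * 8, 10),
)
x = torch.randn(4, 3, 32, 32)          # a mini-batch of 4 RGB 32x32 images
print(model(x).shape)                  # torch.Size([4, 10])
```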

Non-convex ! \(\Rightarrow\) algorithm: SGD (Stochastic Gradient Descent)

Many parameters \(\Rightarrow\) training on a large series of examples

  • The randomness in the algo smoothes out the differences, and builds a general model.
  • The Law of Large Numbers plays here \(\Rightarrow\) quasi-deterministic
  • Equivalent to an implicit regularisation, but which one ? — possibly Entropy

BUT it is a perfect Black-Box

AlphaFold2

A large ML system

  • Training takes 2 weeks on 8 TPUv3 chips (~150 GPUs)
  • Inference takes a few minutes on a GPU chip
  • but not bigger than GPT-3 or AlphaGo, for instance

Specific structure

  • repetition of identical NN structures
    • unlike most text or image DNN
  • handles MSA and structure side-by-side

Relies on previous work

  • Structure generation uses the AMBER force field

AlphaFold2

Analysis

pLDDT

  • quality of the predicted structure
  • related to the quality of the tertiary fold and of the force field

Predicted Aligned Error

  • quality of the alignment
  • quality of the residue relative positions
  • quality of the Multiple Sequence Alignment

Evoformer

s:structures, r:residues, c:channels, x:coordinates

AlphaFold Key Methods

Evoformer

  • handles the MSA
  • formulates structural hypotheses
  • triangle / distance structural coding

Attention coding

Recycling

Distillation

AlphaFold Key Methods

Evoformer

Attention coding

  • a sub-network detects the important parameters
    • (those having a strong impact on the result)
    • a technique that comes from natural language processing
  • the search is then concentrated on these important parameters

Recycling

Distillation

AlphaFold Key Methods

Evoformer

Attention coding

Recycling

  • the whole process is iterated 3 times
  • feeding the results of one pass back to initialize the next one
  • made possible by a special structure of the Evoformer

Distillation

AlphaFold Key Methods

Evoformer

Attention coding

Recycling

Distillation

  • the network, once trained, is used to predict protein structures
  • these structures are then used to further train the program

Important details

The training set

Is the key element !

  • Check it
    • for completeness
    • for bias
    • for size
  • Enrich
    • collecting
    • generate synthetic examples
    • modify actual examples
      • add noise, artefacts, etc…
  • Reuse
    • distillation

Misconceptions

Do not believe

nice results often hide awful artefacts

Not a database

a data model

Data is primordial

  • organisation
  • directional ?
  • text-like ?
  • image-like ?

Questions ?

Everything You Always Wanted to Know About Sex

(But Were Afraid to Ask)

Questions ?

Everything You Always Wanted to Know About ML

(But Were Afraid to Ask)

ChatGPT

instant poetry

type the first word, then simply click systematically on the words proposed by your phone

on my Android phone…

Hello,sorry for the late reply and delete the message and any attachments are confidential and intended solely for the addressees to whom they are to the intended recipient only and is now totaly bricked and does not even react to the on the other hand I can miss the talks on the afternoon of the recipients

Would you please post this announcement of an upcoming virtual NMR conference for the message if altered commentaires de l’ANR aimerait des détails sur les frais de consommables

Can I ask you for the addressees for my jsme commentaires commentaires du moulin and commentaires de Laura

Interestingly, the program mixes different languages I use on my phone (English, French and even Czech – “jsme” in the last poem, prompted by “my” : “my jsme” \(\Rightarrow\) “we are” in Czech)

Hidden Markov Process

stochastic parrot

  • Large Language Models (LLM)
    • billions of texts
    • many languages
    • share common latent variables
  • attention is all you need
    • the context is fundamental
  • reminiscent of biology !
    • genetics
    • proteins
    • thanks to distillation
  • generative AI

emergence

We are impressed by a seemingly intelligent machine which produces meaningful texts. Because the output is meaningful, the machine seems to be intelligent and to understand what it says.

\(\Rightarrow\) there is a discontinuity in the process: Emergence

We tend to attribute to the machine the same inner world that inhabits us.

It is a well-known perception bias, in particular in image-processing techniques.

The problem is the lack of a clear measure of the quality of the text. With such a measure, it was shown that all improvements are incremental.

just a tool

A formidable language tool which can correct / summarize / develop / translate / etc…

BUT NOT A KNOWLEDGE BASE !

large dimension spaces

large dimension spaces are very unnatural to us

Law of large numbers

binomial law with 10 draws

with a large number of draws

With a large number of draws, every stochastic law becomes predictable
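
A quick numerical illustration (my own toy simulation): the mean of \(n\) binomial draws concentrates as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
# binomial law (p = 0.5): the mean of n draws concentrates around 0.5 as n grows
for n in (10, 1000, 100000):
    draws = rng.binomial(1, 0.5, size=n)
    print(n, draws.mean())       # fluctuations shrink roughly as 1/sqrt(n)
```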

large dimension spaces are very unnatural to us

Law of truly large numbers:

FTICR-MS FID, 8 million points, ~ 1 sec acquisition
I have thousands of these

zoomed

re-zoomed

even VERY unlikely events will occur
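
This can be checked numerically (a toy simulation of mine, pure Gaussian noise rather than a real FID):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=8_000_000)     # 8 million points of pure Gaussian noise
print(np.abs(noise).max())             # typically > 5 sigma: a "very unlikely" event,
                                       # yet virtually certain with this many points
```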

large dimension spaces are very unnatural to us

1D random distribution – distance histogram

2D random distribution – distance histogram

distance-to-center histogram

All points are at the same distance !
to the center - to each other
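
A small numerical check of this concentration effect (my own simulation, uniform random points):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (1, 2, 100, 10000):
    X = rng.uniform(-1, 1, size=(200, dim))     # 200 random points in dimension `dim`
    d = pdist(X)                                 # all pairwise distances
    print(dim, d.mean(), d.std() / d.mean())     # the relative spread shrinks with dimension
```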

large dimension spaces are very unnatural to us

random matrices and their \(AA^t\) product

All random matrices are invertible !
and their transpose is (almost) their inverse: \(A A^t \approx I\)
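
A quick numerical check (my own simulation, Gaussian entries scaled by \(1/\sqrt{N}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
A = rng.normal(size=(N, N)) / np.sqrt(N)    # random matrix, rows of ~unit norm
P = A @ A.T                                  # the A A^t product
# diagonal terms are ~1, off-diagonal terms vanish as ~1/sqrt(N): P is close to the identity
print(np.diag(P).mean(), np.abs(P - np.diag(np.diag(P))).mean())
```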