Introduction to Machine Learning principles

oriented toward understanding and mathematics
rather than impressive demonstrations

– I will try to answer some of your questions –

M-A Delsuc - IGBMC - Université de Strasbourg
produced with Quarto & reveal
CC BY 4.0

2024-05-13

New Science

New Science ?

An example of mine from 25 years ago.

A simple perceptron-type neural network to assign amino-acid signals in protein 1H NMR spectra.

- input layer : 32
- hidden layer: 4
- output layer: 20
- software: MATLAB module
- training time: long!
  • PROFFseq, Rost & Sander (1993) (Protein Secondary Structure Prediction)
  • Perceptrons, Minsky & Papert (1969)

Old Science ?

Well known example

Galileo, Pisa 1564 – Florence 1642

Analysis of a system, from measurements to model (source: Wikipedia)

So, what is new ?

Computers

  • Modern Machine Learning is about handling a lot of data
  • Large memory – Large computing capacities
  • parallelization – GPU

Mathematics / Algorithmics

  • new mathematics of high-dimensional spaces
    • Johnson–Lindenstrauss lemma (1984)
    • Kullback–Leibler divergence (1951)
  • automatic differentiation and back-propagation
  • Tensor objects
  • Stochastic Gradient Descent (SGD)

Efficiency

  • “bio-inspired” approaches
  • favour efficiency over understanding
  • size of training sets (digitalisation of the world)
  • driven by Large Companies
    • Google, Meta, Microsoft

The problem

General view of Data Analysis and Modeling:

DATA
can be anything
- images
- distributions
- values
- classification

\(N\) measurements
\(X_n \quad n : \{1 \cdots N\}\)
each \(X_n\) contains
1 or more “features”
\(\Rightarrow\) stored in a matrix \(X\)
NOT in Excel !

MODEL
can be anything
- analog
- equation
- program
- neural network

\(P\) parameters
\(M_p \quad p : \{1 \cdots P\}\)


\(\Rightarrow\) stored in a program
NOT in Excel !

Question
can be many things
- regression
- classification
- clustering
- model confirmation
- inversion
- denoising
- generative

“NOT in Excel” could be a definition of modern ML !

an optimization problem - 1

training

  • apply the model \(M\) to the data \(X\) and see how well it matches
  • modify the parameters to improve the match
  • loop until satisfied

\(\Rightarrow\) trained model

to do so:

build a target function \(T\) which measures the mismatch (or a similarity)

Either:

1/ some answers \(A\) are known - supervised training \(\;\equiv\quad\) knowledge extension

2/ No known answers - unsupervised training \(\;\equiv\quad\) knowledge extraction

an optimization problem - 2

1/ supervised training

some answers \(A\) are known - build \(T\) using the answers:

\[T(M) = d( M(X), A )\qquad \text{where } d \text{ is a distance}\]

2/ unsupervised training

No known answers - build \(T\) with other kinds of information:

  • use some kind of a priori information \(f\quad\) (positivity, sparsity, entropy, other global statistics, …) \[T(M) = f(M)\]
  • test the model against itself by cutting \(X\) into two parts: \(X_{training}\) and \(X_{validation}\) \[T(M) = d(M(X_t), X_v)\]

3/ combine both approaches \[T(M) = d( M(X), A ) + \alpha f(M)\] This is the regularisation approach – quite common in practice
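
A minimal sketch of such a regularised target function (the function names, the toy linear model and the data are my own illustration, not the author's):

```python
import numpy as np

# T(M) = d( M(X), A ) + alpha * f(M)   -- hypothetical names, toy example
def target(params, model, X, A, alpha=0.0):
    """Mismatch between model output and known answers, plus an a-priori penalty."""
    prediction = model(X, params)                       # apply the model M to the data X
    mismatch = np.sqrt(np.sum((prediction - A) ** 2))   # d(): the l2 distance
    penalty = np.sum(np.abs(params))                    # f(): e.g. an l1 (sparsity) prior
    return mismatch + alpha * penalty

# toy usage: a linear model with P = 5 parameters
model = lambda X, p: X @ p
X = np.random.randn(100, 5)                    # N = 100 measurements, 5 features
A = X @ np.array([1., 0., 0., 2., 0.])         # known answers
print(target(np.zeros(5), model, X, A, alpha=0.1))
```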

an optimization problem - 3

In all cases, find the set of parameters \(M_p\) so that \(T(M)\) is optimal:
a minimum if \(d()\) is a distance – a maximum if \(d()\) is a similarity

The distance can be the Euclidean distance: \[d(a,b) = \sqrt{ \sum_i (a_i - b_i)^2 }\] (also called the \(\ell_2\) norm)

but it can be any other norm (or even a pseudo-norm), in particular for high-dimensional datasets.

for instance

  • the \(\ell_1\) norm: \(\quad \ell_1(a,b) = \sum_i |a_i - b_i|\)
  • the nuclear (trace) norm: \(\quad d_S(a,b) = \sum_i \sigma(a-b)_i \quad\) where \(\sigma(M)_i\) is the i-th singular value of the matrix \(M\)
  • the KL divergence: \(\quad d_{KL}(a,b) = \sum_i a_i \log(\frac{ a_i} {b_i})\), a pseudo-distance which handles \(a\) and \(b\) as probability density functions (an information measure)
  • or the cosine similarity: \(\quad d_c(a,b) = \cos(\theta) = \frac{a \cdot b}{\Vert a\Vert \, \Vert b \Vert} \quad\) where \(\theta\) is the angle between \(a\) and \(b\) in the multidimensional vector space
    • equal to 1.0 when both vectors are proportional
  • or anything else…
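
For illustration, here is how these measures can be computed with numpy on two small vectors (toy values of my own choosing):

```python
import numpy as np

a = np.array([0.2, 0.5, 0.3])      # treated as PDFs for the KL divergence (sum to 1, > 0)
b = np.array([0.1, 0.6, 0.3])

l2  = np.sqrt(np.sum((a - b)**2))                        # Euclidean / l2 distance
l1  = np.sum(np.abs(a - b))                              # l1 distance
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # cosine similarity
kl  = np.sum(a * np.log(a / b))                          # KL divergence
print(l2, l1, cos, kl)
```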

an optimization problem - 4

We are dealing with a HUGE search space, so we need to compute the derivative of \(T\) with respect to the parameters \(M_p\):

\[ \nabla M = \frac{\partial T}{\partial M_p}\]

which is a vector of dimension \(P\); depending on the form of \(M\), it can be a very complex function.

\(\Rightarrow\) automatic differentiation comes to the rescue.

Gradient Descent single step

then, a single step is taken in the downward direction: \[ M_{n+1} = M_n - \gamma \nabla M (M_n)\] - steps (often called epochs) are iterated until convergence.
- \(\gamma\) (the learning rate) is a small number which ensures convergence

source Wikipedia
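
A minimal sketch of this iteration, on a simple convex toy target of my own choosing (not the author's example):

```python
import numpy as np

def T(M):
    return np.sum((M - 3.0)**2)            # a simple convex target, minimum at M = 3

def grad_T(M):
    return 2.0 * (M - 3.0)                 # its analytical gradient dT/dM

M = np.zeros(4)                            # initial parameters
gamma = 0.1                                # the learning rate
for step in range(100):                    # iterate steps until convergence
    M = M - gamma * grad_T(M)              # one step in the downward direction
print(M)                                   # ~ [3, 3, 3, 3]
```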

Stochastic Gradient Descent (SGD)

Computing the gradient \(\nabla M\) over the whole dataset is usually too expensive,
\(\Rightarrow\) at each step it is evaluated on a randomly selected subset of the data, called a mini-batch

Many improvements are possible :

  • parallelization of the code
  • descent momentum
  • adaptive learning rate

With SGD, the convergence is efficient; however, it is not monotonic

source Wikipedia
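
A minimal sketch of plain mini-batch SGD (no momentum, no adaptive rate) on a toy linear-regression problem of my own making:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # N = 1000 measurements, 5 features
w_true = np.array([1., -2., 0., 3., 0.5])
A = X @ w_true + 0.1 * rng.normal(size=1000)      # known answers, with noise

M = np.zeros(5)                                   # parameters to learn
gamma = 0.05                                      # learning rate
for epoch in range(50):
    idx = rng.permutation(1000)
    for batch in idx.reshape(-1, 50):             # mini-batches of 50 examples
        Xb, Ab = X[batch], A[batch]
        grad = 2 * Xb.T @ (Xb @ M - Ab) / len(batch)   # gradient on the mini-batch only
        M = M - gamma * grad
print(M)                                          # close to w_true, but not monotonically so
```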

Convexity

The value of the target function \(T(M)\) draws a surface over the space of parameters \(M_p\)

which can be convex or not

convex / non-convex, here in 2 dimensions

  • unique minimum or multiple minima!
  • algorithmics is very different !

\(\ell_1\) and \(\ell_2\) are convex, but the convexity of the problem also depends on the model.

We need data

a lot of data!

Data can be anything

A Dataset is a set of points in a multidimensional space.

  • numerical values
  • numerical values with error bars !
  • dates
  • texts
  • categorical values
    • colors
    • yes/no
    • quantiles
    • etc…

The number of dimensions \(N\) can be large !

Two different kind of data

tabulated datasets

  • heterogeneous data aggregated from several independent sources
  • tagged values
  • metadata

\(\Rightarrow\) statistical approaches well adapted
PCA / LDA — SVM — Random Forests — …

Not trendy but very efficient !

unstructured datasets

  • values from raw measurements
  • pictures / texts
  • genomics / protein sequences

\(\Rightarrow\) Deep Learning approaches well adapted
Deep Neural Networks (DNN)

Very trendy ! \(\qquad\) AI 😢

common concepts – vocabulary

  • training set / test set / validation set
  • cross validation
  • confusion matrix
  • hyper parameters
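
For illustration only, here is how these concepts map onto scikit-learn (the synthetic dataset and the choice of a random forest are my own, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# training set / test set split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100)     # n_estimators is a hyper-parameter
clf.fit(X_train, y_train)

# cross validation on the training set
print(cross_val_score(clf, X_train, y_train, cv=5))

# confusion matrix on the held-out test set
print(confusion_matrix(y_test, clf.predict(X_test)))
```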

Statistical approaches

scikit-learn

scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Python-based open-source tools

  • numpy: tools for general mathematical and array handling
  • scipy: advanced mathematical and statistical tools
  • matplotlib: generic plotting library
  • pandas: tabulated data handling
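
A possible minimal example of such a statistical pipeline (PCA followed by an SVM, on the classic iris dataset – my choice of example, not from the slides):

```python
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)          # a small tabulated dataset

# PCA to reduce the dimensionality, then an SVM classifier
model = make_pipeline(PCA(n_components=2), SVC(kernel="rbf"))
model.fit(X, y)
print(model.score(X, y))                   # accuracy on the training data
```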

Deep Learning

Also Deep Neural Networks or DNN

  • PyTorch: originally created by academics, developed by Facebook, released in 2016
  • TensorFlow: developed by Google, released in 2015
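
As a hedged sketch, the 32-4-20 perceptron of the first slide could be written in a few lines of modern PyTorch (the random data, the tanh non-linearity and the cross-entropy loss are my assumptions):

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(32, 4),    # input layer (32) -> hidden layer (4)
    nn.Tanh(),           # sigmoid-like non-linearity
    nn.Linear(4, 20),    # hidden layer (4) -> output layer (20 classes)
)
loss_fn = nn.CrossEntropyLoss()                              # the target function T
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)     # stochastic gradient descent

x = torch.randn(8, 32)                  # a mini-batch of 8 synthetic inputs
target = torch.randint(0, 20, (8,))     # synthetic class labels
loss = loss_fn(model(x), target)
loss.backward()                         # automatic differentiation (back-propagation)
optimizer.step()                        # one gradient-descent step
```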

what is a Neural Network

  • a “neuron” is a simple mathematical function that aggregates information
  • one neuron \(k\) has several inputs \(I_i\) and one output \(O^k\),
  • it is connected to the previous neurons, and computes a weighted sum of their outputs: \(O^k = \sum_i W_i^k I_i\)
  • this sum is passed through a non-linear function \(f^k()\) and the result is sent to the next set of neurons
  • neurons are usually organized in layers, in a feed-forward manner

Typically, each neuron has its own set of weights \(W_i\), one per input, and possibly parameters for \(f()\)
For a layer of \(K\) neurons connected to an input layer of \(L\) neurons, the number of parameters is proportional to \(K \times L\)
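
A minimal numpy sketch of one such layer (dimensions borrowed from the first slide, weights random, tanh as the non-linearity – my choices):

```python
import numpy as np

def layer(I, W, f=np.tanh):
    """One layer of K neurons: O^k = f( sum_i W[k, i] * I[i] )."""
    return f(W @ I)

L, K = 32, 4                      # L inputs feeding a layer of K neurons
W = np.random.randn(K, L)         # K x L weights: one set of W_i per neuron
I = np.random.randn(L)            # the input vector
print(layer(I, W))                # the K outputs, sent to the next layer
```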

implementation examples

the non-linear \(f()\) function

  • sigmoid: \(\quad \mathrm{sig}(x) = \tanh(x - x_0)\)
  • ReLU: \(\quad \mathrm{ReLU}(x) = \max(0, x - x_0)\)
  • SoftMax: \(\quad \sigma(z_i) = \frac {e^{z_i}}{\sum_k e^{z_k}}\)
  • LogSumExp: \(\quad \mathrm{LSE}(z) = \log\left( \sum_i e^{z_i} \right)\)
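
Written out in numpy (the offset \(x_0\) defaults to 0 here, my choice):

```python
import numpy as np

sigmoid   = lambda x, x0=0.0: np.tanh(x - x0)           # sigmoid-shaped, as on the slide
relu      = lambda x, x0=0.0: np.maximum(0.0, x - x0)   # ReLU
softmax   = lambda z: np.exp(z) / np.sum(np.exp(z))     # SoftMax (normalized to 1)
logsumexp = lambda z: np.log(np.sum(np.exp(z)))         # LogSumExp

z = np.array([-1.0, 0.5, 2.0])
print(relu(z), softmax(z), logsumexp(z))
```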

what is deep ?

In my first slide, we had a 3-layer NN \(\equiv\) perceptron
it was the maximum practical size before automatic differentiation !
\(\Rightarrow\) probably \(\approx\) 1000 parameters

Modern DNNs have dozens to hundreds of layers, each composed of thousands of neurons.

One recent example from my group (source Laura Duciel)

different NN types

  • Different geometries
    • perceptron NN
    • auto-encoder NN
    • convolutional NN – CNN
    • adversarial NN
    • Long Short Term Memory – LSTM
  • different techniques
    • Attention
    • Distillation
    • Latent variables
    • denoising
    • Generative NN
    • hidden Markov process

source: the Asimov Institute

Image processing

Example of an Image classifier DNN

source Wikipedia

  • Convolutional input layers
  • Classifier output layer
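
A hedged sketch of such a classifier in PyTorch (the layer sizes, image size and 10 output classes are arbitrary choices of mine, not the network shown on the slide):

```python
import torch
from torch import nn

model = nn.Sequential(
    # convolutional input layers
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    # classifier output layer: 10 classes
    nn.Linear(32 * 8 * 8, 10),
)
x = torch.randn(4, 3, 32, 32)          # a mini-batch of 4 RGB 32x32 images
print(model(x).shape)                  # torch.Size([4, 10])
```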

Non-convex ! \(\Rightarrow\) algorithm: SGD (Stochastic Gradient Descent)

Many parameters \(\Rightarrow\) training on a large series of examples

  • The randomness in the algo smoothes out the differences, and builds a general model.
  • The Law of Large Numbers plays here \(\Rightarrow\) quasi-deterministic
  • Equivalent to an implicit regularisation, but which one ? — possibly Entropy

BUT it is a perfect Black-Box

AlphaFold2

A large ML system

  • Training takes 2 weeks on 8 TPUv3 chips (~150 GPUs)
  • Inference takes a few minutes on a GPU chip
  • but not bigger than GPT-3 or AlphaGo, for instance

Specific structure

  • repetition of identical NN structures
    • unlike most text or image DNN
  • handles MSA and structure side-by-side

Relies on previous work

  • Structure generation uses the AMBER force field

AlphaFold2

Analysis

pLDDT

  • quality of the predicted structure
  • related to the quality of the tertiary fold and of the force field

Predicted Aligned Error

  • quality of the alignment
  • quality of the residue relative positions
  • quality of the Multiple Sequence Alignment

Evoformer

s:structures, r:residues, c:channels, x:coordinates

AlphaFold Key Methods

Evoformer

  • handles the MSA
  • formulates structural hypotheses
  • triangle / distance structural coding

Attention coding

Recycling

Distillation

AlphaFold Key Methods

Evoformer

Attention coding

  • a sub-network detects the important parameters
    • (those having a strong impact on the result)
    • a technique that comes from natural language processing
  • the search is then concentrated on these important parameters

Recycling

Distillation

AlphaFold Key Methods

Evoformer

Attention coding

Recycling

  • the whole process is iterated 3 times
  • feeding the results of one pass back to initialize the next one
  • made possible by a special structure of the Evoformer

Distillation

AlphaFold Key Methods

Evoformer

Attention coding

Recycling

Distillation

  • the network, once trained, is used to predict protein structures
  • these structures are then used to further train the program

Important details

The training set

Is the key element !

  • Check it
    • for completeness
    • for bias
    • for size
  • Enrich
    • collecting
    • generate synthetic examples
    • modify actual examples
      • add noise, artefacts, etc…
  • Reuse
    • distillation

Misconceptions

Do not believe

nice results often hide awful artefacts

Not a database

a data model

Data is primordial

  • organisation
  • directional ?
  • text-like ?
  • image-like ?

Questions ?

Everything You Always Wanted to Know About Sex

(But Were Afraid to Ask)

Questions ?

Everything You Always Wanted to Know About ML

(But Were Afraid to Ask)

ChatGPT

instant poetry

type the first word, then simply click systematically on the words proposed by your phone

on my Android phone…

Hello,sorry for the late reply and delete the message and any attachments are confidential and intended solely for the addressees to whom they are to the intended recipient only and is now totaly bricked and does not even react to the on the other hand I can miss the talks on the afternoon of the recipients

Would you please post this announcement of an upcoming virtual NMR conference for the message if altered commentaires de l’ANR aimerait des détails sur les frais de consommables

Can I ask you for the addressees for my jsme commentaires commentaires du moulin and commentaires de Laura

Interestingly, the program mixes different languages I use on my phone (English, French and even Czech – “jsme” in the last poem, prompted by “my” : “my jsme” \(\Rightarrow\) “we are” in Czech)

Hidden Markov Process

stochastic parrot

  • Large Language Models (LLM)
    • billions of texts
    • many languages
    • share common latent variables
  • attention is all you need
    • the context is fundamental
  • reminiscent of biology !
    • genetics
    • proteins
    • thanks to distillation
  • generative AI

emergence

We are impressed by a seemingly intelligent machine which produces meaningful texts. Because the output is meaningful, the machine seems to be intelligent and to understand what it says.

\(\Rightarrow\) there is a discontinuity in the process: Emergence

We tend to attribute to the machine the same inner world that inhabits us.

It is a well-known perception bias, in particular in image-processing techniques.

The problem is the lack of a clear measure of the quality of the text. With such a measure, it was shown that all improvements are incremental.

just a tool

A formidable language tool which can correct / summarize / develop / translate / etc…

BUT NOT A KNOWLEDGE BASE !

large dimension spaces

large dimension spaces are very unnatural to us

Law of large numbers

binomial law with 10 draws

with a large number of draws

With a large number of draws, every stochastic law becomes predictable
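
A quick numerical illustration (my own toy simulation): the mean of \(n\) binomial draws concentrates as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(0)
# binomial law (p = 0.5): the mean of n draws concentrates around 0.5 as n grows
for n in (10, 1000, 100000):
    draws = rng.binomial(1, 0.5, size=n)
    print(n, draws.mean())       # fluctuations shrink roughly as 1/sqrt(n)
```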

large dimension spaces are very unnatural to us

Law of truly large numbers:

FTICR-MS FID, 8 million points, ~ 1 sec acquisition
I have thousands of these

zoomed

re-zoomed

even VERY unlikely events will occur
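
This can be checked numerically (a toy simulation of mine, pure Gaussian noise rather than a real FID):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=8_000_000)     # 8 million points of pure Gaussian noise
print(np.abs(noise).max())             # typically > 5 sigma: a "very unlikely" event,
                                       # yet virtually certain with this many points
```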

large dimension spaces are very unnatural to us

1D random distribution – distance histogram

2D random distribution – distance histogram

distance-to-center histogram

All points are at the same distance !
to the center - to each other
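
A small numerical check of this concentration effect (my own simulation, uniform random points):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (1, 2, 100, 10000):
    X = rng.uniform(-1, 1, size=(200, dim))     # 200 random points in dimension `dim`
    d = pdist(X)                                 # all pairwise distances
    print(dim, d.mean(), d.std() / d.mean())     # the relative spread shrinks with dimension
```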

large dimension spaces are very unnatural to us

random matrices and their \(AA^t\) product

All random matrices are invertible !
and their transpose is (almost) their inverse: \(A A^t \approx I\)
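
A quick numerical check (my own simulation, Gaussian entries scaled by \(1/\sqrt{N}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
A = rng.normal(size=(N, N)) / np.sqrt(N)    # random matrix, rows of ~unit norm
P = A @ A.T                                  # the A A^t product
# diagonal terms are ~1, off-diagonal terms vanish as ~1/sqrt(N): P is close to the identity
print(np.diag(P).mean(), np.abs(P - np.diag(np.diag(P))).mean())
```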