Pan Zhang

Institute of Theoretical Physics, Chinese Academy of Sciences

Reference: Z. Han, J. Wang, H. Fan, L. Wang and P. Zhang, arXiv:1709.01662 (to appear in *Phys. Rev. X*)

Date: 06.11.2018

*Abstract*:

Generative modeling is an important task in machine learning and artificial intelligence: it asks to learn the joint probability distribution of training data and to generate samples from it. Many models have been proposed for generative modeling. In this tutorial, we introduce a new model inspired by the probabilistic interpretation of quantum physics, which we call the *Born Machine*. In particular, we introduce a generative model using matrix product states, a tensor network originally proposed for describing (particularly one-dimensional) entangled quantum states. The model enjoys efficient learning by utilizing the density matrix renormalization group method, and offers efficient direct sampling for generative tasks.

The tutorial contains three parts:

- An introduction to tensor networks and matrix product states (the tensor train format).
- An introduction to generative modeling, and how to do it with matrix product states, i.e. the MPS Born Machine.
- Simple Python code for training an MPS.

- Data can be expressed as (or mapped to) a vector in a Hilbert space $\mathcal{H}$ of dimension $n$. Different mappings result in different spaces.
- Linear operations in the space $\mathcal{H}$ can be expressed by an operator. Linear operators in a high-dimensional space correspond to non-linear operators in a low-dimensional space: kernel methods, the representer theorem.
- In practice, we use **tensors** (multi-way arrays) to represent both operators and vectors.
- **Curse of dimensionality** $\Longleftrightarrow$ a large effective dimension $\Longleftrightarrow$ tensors contain too many elements.
- Low-dimensional approximation to a tensor $\Longleftrightarrow$ compression of the Hilbert space $\Longleftrightarrow$ **tensor networks**.

- Tensor networks and graphical models: tensor contractions correspond to summation (or integration) over hidden (i.e. latent) variables.
- In physics, tensor networks are used for
  - expressing a wave function, by contracting virtual indices;
  - expressing the partition function of a classical system, by contracting all indices.

- In applied mathematics, tensor networks are used for
  - fitting a function or another tensor;
  - data compression.

In machine learning (new), tensor networks are used for

- predicting unknown elements of a tensor from its known elements;
- classification;
- generative modeling via Born's rule (the topic of this tutorial).
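The basic operation shared by all of these uses is contraction over shared indices. As a minimal illustration (assuming numpy), contracting the virtual index of two small tensors is exactly a summation over a hidden variable, as in marginalizing a latent variable of a graphical model:

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(2, 3, 4)   # indices (i, j, w); w is the shared (hidden) index
B = np.random.rand(4, 2, 2)   # indices (w, k, l)

# Contract over the shared index w
T = np.einsum('ijw,wkl->ijkl', A, B)

# The same contraction written as an explicit sum over the hidden variable
T_sum = np.zeros((2, 3, 2, 2))
for w in range(4):
    T_sum += A[:, :, w][:, :, None, None] * B[w][None, None, :, :]

assert np.allclose(T, T_sum)
```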

A biased conclusion:

**Optimization** in physics, **fitting** in applied mathematics, and **generalization** in machine learning.

$$\mathcal{A} = \mathbf{a}^{(1)}\circ \mathbf{a}^{(2)}\circ \cdots\circ \mathbf{a}^{(n)}$$

- The simplest tensor network: a rank-one tensor, or factorized (pure) state. The number of elements is proportional to the number of modes. The different modes are *independent*.
- A qubit has two eigenstates:
  - The first eigenstate (analogous to pixel *black*) is expressed as $| {\uparrow } \rangle =|0\rangle=\left(\begin{matrix}1\\0\end{matrix}\right)$;
  - The second eigenstate (analogous to pixel *white*) is expressed as $|{\downarrow }\rangle=|1\rangle=\left(\begin{matrix}0\\1\end{matrix}\right)$;
  - A pixel with gray level $0.8$ can be expressed as $|\Psi\rangle=\sqrt{0.8}|{\uparrow }\rangle+\sqrt{0.2}|{\downarrow }\rangle=\left(\begin{matrix}\sqrt{0.8}\\\sqrt{0.2}\end{matrix}\right)$.

Product states:
- Product of two qubits: $|\Psi\rangle =|{\uparrow }\rangle|{\downarrow }\rangle=\left(\begin{matrix}1\\0\end{matrix}\right)\otimes \left(\begin{matrix}0\\1\end{matrix}\right)=\left(\begin{matrix}0\\1\\0\\0\end{matrix}\right)=\left(\begin{matrix}0&0\\1&0\end{matrix}\right)$
- Product of two qubits: $|\Psi\rangle =|{\uparrow }\rangle\left(\frac{1}{\sqrt{2}}|{\uparrow }\rangle+\frac{1}{\sqrt{2}}|{\downarrow }\rangle\right)$
- A rank-one tensor can be used to fit any tensor, i.e. to extract the *key* information of the tensor. However, the error can be very large: the number of parameters in a rank-one tensor is very limited, so it has very *limited representation power*. Consider a general system of $n$ qubits: it has $2^{n}-1$ free parameters, while its rank-one fitting contains only $2n$ free parameters.

Non-factorized pure states:

- Entangled state of two qubits: $\frac{1}{\sqrt{2}}|{\uparrow }\rangle|{\downarrow }\rangle-\frac{1}{\sqrt{2}}|{\downarrow }\rangle|{\uparrow }\rangle$; it cannot be written as a product state in any basis. The entanglement comes from (anti-)correlations.
- EPR pair: $\frac{1}{\sqrt{2}}|{\uparrow }\rangle|{\uparrow }\rangle+\frac{1}{\sqrt{2}}|{\downarrow }\rangle|{\downarrow }\rangle$; what is the rank of its unfolded matrix?
- $|\textrm{GHZ}\rangle=\frac{1}{\sqrt{2}}\left( |{\uparrow }\rangle^{\otimes n}+ | {\downarrow }\rangle^{\otimes n}\right)$
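The rank question above can be checked with a few lines of numpy: unfold each two-qubit state into a $2\times 2$ matrix and ask for its matrix rank.

```python
import numpy as np

up = np.array([1.0, 0.0])
down = np.array([0.0, 1.0])

# Product state |up>|down>: its unfolded matrix has rank 1
product = np.outer(up, down)

# EPR pair (|up,up> + |down,down>)/sqrt(2): its unfolded matrix has rank 2
epr = (np.outer(up, up) + np.outer(down, down)) / np.sqrt(2)

print(np.linalg.matrix_rank(product))  # 1: no entanglement
print(np.linalg.matrix_rank(epr))      # 2: entangled
```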

Rank and entanglement:

- Rank one $\Longleftrightarrow$ no entanglement.
- Larger rank means more entanglement, quantitatively characterized by the *entanglement entropy*.
- Entanglement means that some information of the system is shared by two qubits. Looking at one isolated qubit loses information!
- Unfold a tensor into a $2\times 2$ matrix; the rank is the number of non-zero singular values of the matrix. It is also equal to the number of non-zero eigenvalues of the *reduced density matrix*.
- Why SVD? Because SVD = Schmidt decomposition, a natural orthogonal bisection method!
- How to determine whether there is entanglement between two sub-systems of a system with multiple (say $7$) qubits?
  - The same principle: unfold the size-$2^7$ tensor into a matrix of size $2^3\times 2^4$, then look at its rank.
  - Squared singular values of this matrix $=$ eigenvalues of the reduced density matrix.
  - Recall the definition of singular values: it is equivalent to finding eigenvalues of the related covariance matrix.
  - There are many ways to unfold this size-$2^7$ tensor, each corresponding to different *permute* $\to$ *reshape* operations. If we could do this for any tensor (with unlimited memory and computational power), it would give an exact estimate of the entanglement entropy, as well as the bond dimension for DMRG.
- A quick test: is $|\Psi\rangle=\frac{1}{ {2}}|{\uparrow }\rangle|{\uparrow }\rangle+\frac{1}{ {2}}|{\downarrow }\rangle|{\downarrow }\rangle+\frac{1}{ {2}}|{\uparrow }\rangle|{\downarrow }\rangle+\frac{1}{ {2}}|{\downarrow }\rangle|{\uparrow }\rangle$ entangled?

Entanglement entropy

- The entanglement entropy is defined in terms of the eigenvalues of the reduced density matrix.
- The density matrix $\rho=|\Psi\rangle\langle\Psi|$ of a pure state $\Psi$ has rank $1$.
- Whether the reduced density matrix $\rho_A=\textrm{Tr}_B|\Psi\rangle\langle\Psi|$ has rank larger than $1$ depends on whether sub-system A is entangled with sub-system B.
- $\psi_{AB}$ is the matrix obtained by unfolding the tensor $\Psi$, and $\rho_A=\psi_{AB}\psi_{AB}^\dagger$. Thus the rank of $\rho_A$ equals the rank of $\rho_B$, which is also the rank of $\psi_{AB}$.
- Eigenvalues of $\rho_A$ coincide with eigenvalues of $\rho_B$, which are the squared singular values of $\psi_{AB}$.
- The entanglement entropy is defined as $S(\rho_A)=S(U\rho_A U^\dagger)=S(P(\{\lambda_i\}))=-\sum_i\lambda_i\log\lambda_i$, where $U$ is a (column-orthogonal) unitary matrix and $P(\{\lambda_i\})$ denotes the distribution of eigenvalues of the reduced density matrix.
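This recipe can be checked numerically. The helper `entanglement_entropy` below is written just for this tutorial (not a library function): it takes the unfolded wave function, gets the squared singular values, and evaluates $-\sum_i\lambda_i\log\lambda_i$.

```python
import numpy as np

def entanglement_entropy(psi_AB):
    """S = -sum_i lambda_i log(lambda_i), where lambda_i are the squared
    singular values of the unfolded wave function psi_AB."""
    s = np.linalg.svd(psi_AB, compute_uv=False)
    lam = s**2
    lam = lam[lam > 1e-12]            # drop numerical zeros
    return -np.sum(lam * np.log(lam))

up, down = np.array([1.0, 0.0]), np.array([0.0, 1.0])

epr = (np.outer(up, up) + np.outer(down, down)) / np.sqrt(2)
prod = 0.5 * np.outer(up + down, up + down)   # the "quick test" state above

print(entanglement_entropy(epr))   # log 2 ~ 0.693: maximally entangled
print(entanglement_entropy(prod))  # 0: a product state, not entangled
```

This also answers the quick test: all four amplitudes equal $1/2$, so the state factorizes and its entropy vanishes.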

I want to emphasize that tensor networks are indeed tensor decompositions. This section focuses on their mathematical aspects.

Canonical Polyadic (CP) decomposition $[\mathbf{\lambda};\mathbf{A_1}, \mathbf{A_2},\cdots,\mathbf{A_n}]$.

- Number of parameters: $ndr$.
- It defines the (canonical) rank of the tensor.
- Finding the optimal rank is NP-hard.
- Alternating least squares (ALS) algorithm for the CP decomposition of a tensor $\mathcal{Y}$:
  - Having fixed all matrices but $A_i$, the problem reduces to linear least squares: $$A_i = \arg\min_{\hat A_i}\left|Y_{(i)}-\hat A_i(A_1\odot A_2\odot\cdots\odot A_{i-1}\odot A_{i+1}\odot\cdots\odot A_n)^{\top}\right|_2$$
  - Hence $A_i$ is determined by $$A_i = Y_{(i)}\left[(A_1\odot A_2\odot\cdots\odot A_{i-1}\odot A_{i+1}\odot\cdots\odot A_n)^{\top}\right]^\dagger$$ Here $Y_{(i)}$ is the matrix obtained by unfolding the tensor $\mathcal{Y}$ at its $i$-th mode, $\odot$ denotes the Khatri-Rao product, and $A^\dagger$ denotes the pseudo-inverse of the matrix $A$.
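A minimal sketch of ALS for a 3-way tensor, assuming numpy (`cp_als` is an illustrative helper, not a library routine). To sidestep unfolding conventions, each least-squares update is written via the normal equations: $Y_{(i)}M$ becomes an `einsum` over the other two factors, and $M^\top M$ is the Hadamard product of their Gram matrices.

```python
import numpy as np

def cp_als(Y, r, n_iter=200):
    """Rank-r CP decomposition of a 3-way tensor by alternating least squares."""
    I, J, K = Y.shape
    rng = np.random.default_rng(0)
    A, B, C = (rng.standard_normal((d, r)) for d in (I, J, K))
    for _ in range(n_iter):
        # each update solves the linear least-squares problem in closed form
        A = np.einsum('ijk,jr,kr->ir', Y, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', Y, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', Y, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Recover an exactly rank-2 tensor
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((d, 2)) for d in (4, 5, 6))
Y = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(Y, 2)
Yhat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(Y - Yhat) / np.linalg.norm(Y))  # small reconstruction error
```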

Tucker decomposition, also known as *higher-order SVD* or *$n$-mode PCA*: $[\mathcal{{G}};\mathbf{A_1}, \mathbf{A_2},\cdots,\mathbf{A_n}]$.

- Number of parameters: $ndr+r^n$.
- Can be seen as a compression of the original tensor using the core $\mathcal{G}$.
- Can be computed using the *HOSVD* algorithm:
  - unfold the tensor along modes $1,2,\ldots,n$, and perform a singular value decomposition on each unfolded matrix;
  - contract the original tensor with the inverses of the obtained unitary matrices one by one.

- Tensor train (TT) decomposition, known in physics as *matrix product states*.

- Number of parameters: $ndr^2$.
- The TT decomposition can be constructed exactly using *Schmidt decompositions*, also known as the *TT-SVD* algorithm:
  - choose an ordering of the modes;
  - unfold the tensor into a matrix of size $(d_1)\times(d_2 d_3\cdots d_n)$, then perform a singular value decomposition on the unfolded matrix;
  - unfold the tensor into a matrix of size $(d_1 d_2)\times(d_3 d_4\cdots d_n)$, then perform a singular value decomposition on the unfolded matrix;
  - $\cdots\cdots$
  - unfold the tensor into a matrix of size $(d_1 d_2\cdots d_{n-1})\times(d_n)$, then perform a singular value decomposition on the unfolded matrix.

- More stable than CP to train, using e.g. algorithms analogous to the density matrix renormalization group (DMRG).
- A CP format with rank $r$ can be converted into a TT format with rank $r$ by an exact construction using *direct sums*.
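The TT-SVD steps above can be sketched in a few lines of numpy; `tt_svd` here is an illustrative helper written for this tutorial, not part of any library. In practice one would keep dummy bond indices of size 1 at both ends, exactly as the MPS code later in this tutorial does.

```python
import numpy as np

def tt_svd(T, eps=1e-12):
    """Exact TT (MPS) decomposition by successive SVDs: at each step unfold
    the remainder into a matrix, split off one mode, and carry the rest on."""
    dims = T.shape
    cores = []
    C = T.reshape(dims[0], -1)
    r = 1
    for k in range(len(dims) - 1):
        C = C.reshape(r * dims[k], -1)
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        keep = S > eps                      # drop numerically zero singular values
        U, S, Vt = U[:, keep], S[keep], Vt[keep]
        cores.append(U.reshape(r, dims[k], -1))
        r = U.shape[1]
        C = np.diag(S) @ Vt                 # pass the remainder to the right
    cores.append(C.reshape(r, dims[-1], 1))
    return cores

# Check: contract the cores back and compare with the original tensor
rng = np.random.default_rng(0)
T = rng.standard_normal((2, 3, 4, 2))
cores = tt_svd(T)
R = cores[0]
for G in cores[1:]:
    R = np.einsum('...a,aib->...ib', R, G)
R = R.reshape(T.shape)
print(np.linalg.norm(R - T))  # ~0: exact reconstruction
```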

- Hierarchical Tucker decomposition, known in physics as *tree tensor networks*.

To be updated

There are mainly two kinds of tasks in machine learning:

- Supervised learning (discriminative learning), where data $x$ come with labels $t$. It asks to model the conditional distribution $P(t|x)$.
- Unsupervised learning (generative learning), where data $x$ do not have labels. It asks to model the joint distribution of the data, $P(x)$.

Many models have been proposed for generative learning, for example the hidden Markov model, Boltzmann machines, Gaussian mixture models, variational autoencoders, generative adversarial networks... Among them, the Boltzmann machine (BM) is closely related to the Ising model in statistical physics, and utilizes the Boltzmann distribution for modeling the data joint distribution: $$P(x)=\frac{1}{Z}e^{-E(x)},$$ where $E(x)$ denotes the energy of configuration $x$ and $Z$ is the partition function.

In this tutorial, we model joint distribution using *Born's* rule of quantum mechanics
$$P(x)=\frac{1}{\mathcal{N}}{|\Psi(x)|^2},$$
where $\Psi$ denotes wave function, $\mathcal{N}$ is the normalization factor. We call the model *Born Machine*.

- Each image (among $m$ images in total) is mapped to a product state $\mathbf{x}=v_1\circ v_2\circ v_3\circ\cdots\circ v_n$:
- A white pixel $i$ is represented as $v_i=| {\uparrow } \rangle =|0\rangle=\left(\begin{matrix}1\\0\end{matrix}\right)$,
- A black pixel $j$ is represented as $v_j=|{\downarrow }\rangle=|1\rangle=\left(\begin{matrix}0\\1\end{matrix}\right)$.

- The data distribution becomes $$ Q_{\mathrm{data}}(\mathbf{x}) = \frac{1}{m}\sum_{t=1}^m\delta(\mathbf{x} - v_1^t\circ v_2^t\circ\cdots\circ v_n^t)$$
- This can be seen as rank-$m$ CP decomposition of the original (unknown) tensor.

\begin{align} P(\mathbf{x})&=\frac{|\Psi(\mathbf{x})|^2}{Z}, \end{align} where the wave function is parametrized by an MPS: \begin{align} \Psi(\mathbf{x})&=\Psi_{\mathrm{mps}} \cdot \left( v_1\circ v_2\circ \cdots\circ v_n \right)\\ &=\sum_{w_1,w_2,...,w_{n-1}} A^{(1)}_{v_1,w_1}A^{(2)}_{w_1,v_2,w_2}\cdots A^{(n-1)}_{w_{n-2},v_{n-1},w_{n-1}}A^{(n)}_{w_{n-1},v_{n}}\\ &= A^{(1)}_{v_1}A^{(2)}_{v_2}\cdots A^{(n)}_{v_n}, \end{align} and $Z$ represents the partition function \begin{align} Z&=\sum_{\mathbf{x}}|\Psi(\mathbf{x})|^2. \end{align}

- The number of parameters is roughly $2nD^2$, rather than $2^n$.
- Training via an algorithm analogous to the Density Matrix Renormalization Group.
- Easy computation of the partition function (thanks to the data mapping).
- Direct sampling.
- Exact representation of the whole data set requires $D=m$ (recall the MPS and CP sections), but this is not what we want:
  - $m$ is too large as a bond dimension;
  - it is *overfitting*!

The goal of learning is to make the joint distribution given by Born's rule as close as possible to the data distribution. First one needs to quantify the *distance* between two probability distributions. Here we use the Kullback-Leibler divergence between the data distribution $Q_{\mathrm{data}}(\mathbf{x})$ and the MPS distribution $P_\theta(\mathbf{x})$, which is parametrized by $\theta$ (the tensor elements).
\begin{align}
D_{\mathrm{KL}}\left(Q_{\mathrm{data}}(\mathbf{x})|P_\theta(\mathbf{x})\right)&=\sum_{\mathbf{x}}Q_{\mathrm{data}}(\mathbf{x})\log\frac{Q_{\mathrm{data}}(\mathbf{x})}{P_\theta(\mathbf{x})}\\
&=\sum_{\mathbf{x}}Q_{\mathrm{data}}(\mathbf{x})\log Q_{\mathrm{data}}(\mathbf{x})-\sum_{\mathbf{x}}Q_{\mathrm{data}}(\mathbf{x}) \log P_\theta(\mathbf{x})\\
&=-S_Q+\mathcal{L_\theta},
\end{align}
where $S_Q$ is the entropy of the distribution $Q_{\mathrm{data}}$, and $\mathcal{L}_\theta$ is the so-called *negative log-likelihood* (NLL).

It is easy to show (e.g. using Jensen's inequality) that the KL divergence is non-negative. This gives a lower bound for the NLL: $\mathcal{L_\theta}\geq S_Q$. Notice that this lower bound is achieved only when $P_\theta(\mathbf{x}) = Q_{\mathrm{data}}(\mathbf{x})$.
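A tiny numerical sanity check of this bound (the numbers are illustrative): for $m$ distinct training points, the empirical distribution puts mass $1/m$ on each, so the best possible NLL equals $S_Q=\log m$, and any other distribution on the points does worse.

```python
import numpy as np

m = 8
P_uniform = np.full(m, 1.0 / m)          # P = Q_data on the training points
nll_best = -np.mean(np.log(P_uniform))   # equals the entropy S_Q = log m

rng = np.random.default_rng(0)
P_other = rng.dirichlet(np.ones(m))      # some other distribution on the points
nll_other = -np.mean(np.log(P_other))

print(nll_best, np.log(m))   # both equal log m
print(nll_other >= nll_best) # True: the lower bound holds
```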

The learning amounts to finding parameters $\theta$, i.e. tensor elements, such that the KL divergence is minimized. $$\hat \theta=\arg\min_\theta D_{\mathrm{KL}}\left(Q_{\mathrm{data}}(\mathbf{x})|P_\theta(\mathbf{x})\right) =\arg\min_\theta \mathcal{L_\theta}$$

Let us rewrite the NLL as \begin{align} \mathcal{L_\theta}&=-\frac{1}{m}\sum_{t=1}^m \log P_\theta(\mathbf{x}_t)\\ &=-\frac{1}{m}\sum_{t=1}^m\log |\Psi(\mathbf{x}_t)|^2 +\log Z, \end{align} then update the tensor elements using gradient descent.

The gradients are computed as
\begin{align}
\frac{\partial \mathcal{L}}{\partial A^{(i)}_{w_{i-1},v_i,w_{i}}}&=-\frac{2}{m}\sum_{t=1}^m\frac{\Psi'_{w_{i-1},v_i,w_i}(\mathbf{x_t})}{\Psi(\mathbf{x_t})}+\frac{2A^{(i)}_{w_{i-1},v_i,w_{i}}}{Z},
\end{align}
where $\Psi'_{w_{i-1},v_i,w_{i}}(\mathbf{x_t})$ is the derivative of $\Psi$ with respect to the tensor element $A^{(i)}_{w_{i-1},v_i,w_{i}}$, evaluated at $\mathbf{x_t}$. It is evaluated as $$\Psi'_{w_{i-1},v_i,w_{i}}(\mathbf{x_t}) =u^l_{w_{i-1}}(\mathbf{x_t})\cdot u^r_{w_i}(\mathbf{x_t})=\left(\sum_{w_1,w_2,...,w_{i-2}}A^{(1)}_{v_1,w_1}A^{(2)}_{w_1,v_2,w_2}\cdots A^{(i-1)}_{w_{i-2},v_{i-1},w_{i-1}}\right )\cdot \left ( \sum_{w_{i+1},w_{i+2},...,w_{n-1}}A^{(i+1)}_{w_i,v_{i+1},w_{i+1}}\cdots A^{(n)}_{w_{n-1},v_{n}}\right).$$
The gradients can actually be written in a more compact form:
$$\nabla_{A^{(i)}}\mathcal{L}=-\frac{2}{m}\sum_{t=1}^m\frac{ u^l(\mathbf{x_t})\circ u^r(\mathbf{x_t}) }{\Psi(\mathbf{x_t})}+\frac{2A^{(i)}}{Z},$$
where $u^l$ and $u^r$ are the vectors contracted from the left and from the right respectively, as shown in the formula above. $u^l(\mathbf{x_t})\circ u^r(\mathbf{x_t})$ denotes the rank-one tensor obtained as the tensor product of the vectors $u^l(\mathbf{x_t})$ and $u^r(\mathbf{x_t})$; it is also known as the *environment tensor*.
Another important point in computing the gradients is that the partition function $Z$ is easy to compute, thanks to the orthogonalization.

The most important feature of MPS is that the joint probability distribution can be computed exactly, as well as conditional probabilities. This property makes pixel-by-pixel sampling easy and exact.

A typical behavior of the log-likelihood: the NLL decreases monotonically, until it reaches the lower bound $S(Q)=\log(m)$ for non-overlapping training data. The lower bound is always achievable by increasing the bond dimension. However, one usually does not want to reach the lower bound, because that is overfitting: the training NLL does not mean much, the test NLL does!

After the MPS is well trained, it can find many unsupervised applications, e.g. reconstructing partial images.

In [18]:

```
import sys
import numpy as np
import torch # tested under pytorch 0.4.0
import math
%matplotlib inline
import matplotlib.pyplot as plt
torch.manual_seed(1) # Fix seed of the random number generators
np.random.seed(1)
```

The function below is used to plot MNIST images.

In [19]:

```
def show_imgs(imgs,l1=4,l2=5,s1=6,s2=6):
    """ Plot images on an l1-by-l2 grid """
    plt.rcParams['figure.figsize']=(s1,s2)
    imgs=imgs.reshape([-1,28,28])
    g, ax = plt.subplots(l1,l2)
    for i in range(l1):
        for j in range(l2):
            a=i*l2+j # index of the image shown at row i, column j
            if(a>=imgs.shape[0]):
                break
            ax[i][j].imshow(imgs[a,:,:],cmap='summer')
            ax[i][j].set_xticks([])
            ax[i][j].set_yticks([])
    plt.show()
```

$100$ MNIST images have been stored as "mnist_100_28x28_p0.5.npy".

Each image contains $n=28\times 28=784$ pixels, each of which takes value $0$ or $1$.

In our setting, each image is viewed as a product state in a Hilbert space of dimension $2^n$.

In [20]:

```
n=784 # number of qubits
m=20 # m images
data=np.load("imgs/mnist_100_28x28_p0.5.npy")
data=data[:m,:,:]
data=torch.LongTensor(data)
data=data.view(-1,784) # m images, each of which is reshaped into a vector of length 784
show_imgs(data,2,10,10,2)
```

Define the MPS: a list of 3-way tensors filled with random values.

In [21]:

```
Dmax=30 # maximum bond dimension
bond_dims=[Dmax for i in range(n-1)]+[1]
tensors= [ torch.randn(bond_dims[i-1],2,bond_dims[i]) for i in range(n)]
```

Now check the bond dimensions and tensors

In [22]:

```
print("There are %d tensors"%len(tensors))
print(bond_dims)
print(tensors[5].shape)
```

Question: does the contraction with one image give a probability of the image? Why?

Canonicalization using QR decompositions

In [23]:

```
def orthogonalize(site,going_right):
    dl=bond_dims[site-1] # left bond dimension
    d=bond_dims[site] # current bond dimension
    if(going_right):
        A=tensors[site].view(dl*2,d) # A is a matrix unfolded from the current tensor
        Q,R=torch.qr(A)
        R/=R.norm() # divided by norm
        tensors[site] = Q.contiguous().view(dl,2,-1)
        tensors[site+1] = (R@tensors[site+1].view(d,-1)).view(-1,2,bond_dims[site+1])
        bond_dims[site] = Q.shape[1] # economy QR, so the right dimension could be either dl or d
    else: # going left
        A=tensors[site].view(dl,d*2).t()
        Q,R=torch.qr(A)
        R/=R.norm()
        tensors[site]=Q.t().contiguous().view(-1,2,d)
        tensors[site-1] = (tensors[site-1].view(-1,dl)@R.t()).view(bond_dims[site-2],2,-1)
        bond_dims[site-1] = Q.shape[1]
```

Canonicalization gives several advantages:

- It makes the partition function of the model equal to $1$.
- The isometries have condition number $1$, preserving computational precision very well.
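As a self-contained numpy sketch of the same QR scheme (a toy MPS, not the notebook's `tensors`): after left-orthogonalizing every site and normalizing the last tensor, the partition function $Z=\sum_{\mathbf{x}}|\Psi(\mathbf{x})|^2$ contracts to $1$, because each left-orthogonal tensor contracts with itself to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 6, 4
dims = [1] + [D] * (n - 1) + [1]
mps = [rng.standard_normal((dims[i], 2, dims[i + 1])) for i in range(n)]

# Left-canonicalize with QR, absorbing R into the next tensor
for i in range(n - 1):
    dl, _, dr = mps[i].shape
    Q, R = np.linalg.qr(mps[i].reshape(dl * 2, dr))
    mps[i] = Q.reshape(dl, 2, -1)
    mps[i + 1] = np.einsum('ab,bvc->avc', R, mps[i + 1])
mps[-1] /= np.linalg.norm(mps[-1])   # normalize the last tensor

# Z via transfer matrices; left-orthogonal tensors contract to the identity,
# so Z reduces to the squared norm of the last tensor
E = np.ones((1, 1))
for A in mps:
    E = np.einsum('ab,avc,bvd->cd', E, A, A)
print(E.item())  # 1.0 up to floating point
```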

In [24]:

```
for site in range(n-1):
    orthogonalize(site,True)
```

In [25]:

```
tensors[783].shape
```

Out[25]:

Canonicalization changes the bond dimensions; to see this:

In [26]:

```
print(bond_dims)
```

Now contracting the MPS with one image gives the probability amplitude of the image.

In [27]:

```
def get_psi():
    psi=torch.ones([m,1,1])
    for site in range(n):
        psi = psi @ tensors[site][:,data[:,site],:].permute(1,0,2)
    return(psi)
```

In [28]:

```
def gen_samples(ns):
    samples=np.zeros([ns,n])
    for site in range(n-1): # left canonicalize
        orthogonalize(site,True)
    for s in range(ns):
        vec=torch.ones(1,1)
        for site in range(n-1,-1,-1):
            vec = (tensors[site].view(-1,bond_dims[site])@vec).view(-1,2)
            p0 = vec[:,0].norm()**2/ (vec.norm()**2)
            x = (0 if np.random.rand() < p0 else 1)
            vec = vec[:,x]
            samples[s][site]=x
    return samples
```

Computing the probability of an image consists of contracting bonds from the first tensor to the last. Many intermediate results can be reused in later computations, so we store the contraction results in a cache. Notice that the cache is kept for all images.

In [29]:

```
cache=[]
cache.append( torch.ones([m,1,1])) # the initial element: all images have cache 1
for site in range(n-1):
    B=cache[site] @ tensors[site][:,data[:,site],:].permute(1,0,2) # batched matrix multiplications
    B /= B.abs().max()
    cache.append( B )
cache.append( torch.ones(m,1,1)) # the last element, matrix [[1]] for all images
```

Notice that the line `B /= B.abs().max()` is used to preserve computational precision, preventing the maximum value of the cache from becoming too small. We can do this because the gradient is invariant under multiplying the cache by a constant, for each image.

The length of the cache is $n+1$. In the cache, each pixel of an image has a corresponding vector, holding the intermediate results of the tensor contractions (from left to right or from right to left).
Let us look at the content of the cache for the image alpha=1:

In [30]:

```
alpha=1 # the image 1
print("cache site 1 ",cache[0][alpha])
print("cache site 2 ",cache[1][alpha])
print("cache site 3 ",cache[2][alpha])
```

Then the probability amplitude $\psi$ for images can be obtained by

In [31]:

```
psi=get_psi()
```

Let us output the probability of an image, which equals $|\psi|^2$.

In [32]:

```
print("Probability of generating image 3 = %.5f"%(psi*psi)[3])
```

In [33]:

```
ax=plt.bar(range(1,psi.shape[0]+1),psi.squeeze()**2)
```

**Hey, the probability is $0$, what is wrong?**

Because the space is too large and the model is randomly initialized!

Actually the purpose of training is exactly to increase the probability of given images. This is the so-called *maximum likelihood learning*.

The basic procedure is sweeping back and forth: from right to left, then from left to right. During each sweep, the visited tensor is updated according to the *gradients* of the log-probability with respect to the tensor elements.

Some tips on the code:

- Only images whose $i^{\mathrm{th}}$ pixel equals $v_i$ contribute to the tensor element $A_{w_{i-1},v_i,w_i}$.
- `torch.sum(left_vec.permute(0,2,1) @ right_vec.permute(0,2,1), 0)` computes $\Psi'(\mathbf{x})$.
- `psi = left_vec @ A.permute(1,0,2) @ right_vec` does not compute the *real* $\Psi$, as the cache is rescaled to preserve precision. Thus one needs to call `psi=get_psi()` in order to compute the correct probability amplitude $\Psi$.
- `tensors_bak=tensors.copy()` backs up the training state so it can be restored after generating images.
- The `@` operator in `left_vec @ A.permute(1,0,2) @ right_vec` and in `cache[site] @ tensors[site][:,data[:,site],:].permute(1,0,2)` performs batched matrix-matrix multiplication.
- In `orthogonalize()`, imposing unit norm on the matrix $R$ via `R/=R.norm()` actually sets the partition function to $Z=1$.

In [34]:

```
learning_rate=0.08
for epoch in range(9): # one sweep: from right to left, then from left to right
    going_right=False
    for site in [i for i in range(n-1,0,-1)]+[i for i in range(n-1)]:
        # update tensors[site], which is a 3-way tensor of size [dl,2,dr]
        sys.stdout.write("\r Epoch #%d, site #%d / %d "%(epoch, site+1,n)); sys.stdout.flush()
        if(site==0): going_right=True
        gradients = torch.zeros_like(tensors[site])
        for i in [0,1]: # the pixel can be either 0 or 1
            idx=(data[:,site]==i).nonzero().type(torch.LongTensor).squeeze() # indices of images whose pixel at this site equals i
            if(idx.numel()==0): continue
            left_vec = cache[site][idx,:,:] # vectors on the left of the site
            right_vec = cache[site+1][idx,:,:] # vectors on the right of the site
            A=tensors[site][:,data[idx,site],:]
            if(idx.numel()==1):
                A=A.view(A.shape[0],1,A.shape[1])
                left_vec=left_vec.view(1,left_vec.shape[0],left_vec.shape[1])
                right_vec=right_vec.view(1,right_vec.shape[0],right_vec.shape[1])
            psi = left_vec @ A.permute(1,0,2) @right_vec
            gradients[:,i,:] = torch.sum(left_vec.permute(0,2,1) @ right_vec.permute(0,2,1) / psi,0)
        gradients = 2.0*(gradients/m-tensors[site])
        tensors[site] += learning_rate * gradients/gradients.norm()
        orthogonalize(site,going_right)
        if(going_right):
            cache[site+1] = cache[site] @ tensors[site][:,data[:,site],:].permute(1,0,2)
            cache[site+1] /= cache[site+1].abs().max()
        else:
            cache[site] = tensors[site][:,data[:,site],:].permute(1,0,2) @ cache[site+1]
            cache[site] /= cache[site].abs().max()
    psi=get_psi()
    tensors_bak=tensors.copy()
    sys.stdout.write("NLL=%.3f, LowerBound=%.3f, total_prob=%.3f "%(-torch.mean(torch.log(psi*psi)),math.log(m),torch.sum(psi.squeeze()**2) ))
    sys.stdout.write(" generating samples...")
    ax=plt.bar(range(1,psi.shape[0]+1),psi.squeeze()**2)
    imgs=gen_samples(30)
    show_imgs(imgs,2,15,15,2)
    tensors=tensors_bak.copy()
    if(epoch < 8):
        input("Press Enter to continue ")
```

- Ulrich Schollwöck, "The density-matrix renormalization group in the age of matrix product states," Annals of Physics 326, 96-192 (2011).
- Tamara G. Kolda and Brett W. Bader, "Tensor decompositions and applications," SIAM Review 51, 455 (2009).
- Ivan V. Oseledets, "Tensor-train decomposition," SIAM Journal on Scientific Computing 33, 2295-2317 (2011).
- Edwin Miles Stoudenmire and David J. Schwab, "Supervised Learning with Quantum-Inspired Tensor Networks," Advances in Neural Information Processing Systems 29, 4799 (2016), arXiv:1605.05775.
- Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets, "Exponential machines," arXiv:1605.03795 (2016).
- Jin-Guo Liu and Lei Wang, "Differentiable Learning of Quantum Circuit Born Machine," arXiv:1804.04168 (2018).
- Edwin Miles Stoudenmire, "Learning relevant features of data with multi-scale tensor networks," Quantum Science and Technology (2018).