# A trained generative model with a few qubits

Welcome back to our course on *generative modeling with quantum hardware*. In this lesson, we will develop a trained generative model, optimize the parameters of a quantum circuit and find out which role entanglement plays for learning complicated data distributions.

So far, we have learned that, in a generative machine learning task, you receive data and the goal is to learn a model of the true distribution underlying the data. Then, you want to be able to generate new data from this model, which has similar characteristics, but is not necessarily inside the training set.

Lets get started and import the necessary Python tools.

!pip install qiskit !pip install pylatexenc import numpy as np import matplotlib.pyplot as plt from qiskit.circuit import QuantumCircuit, Parameter from qiskit import Aer

## Recap: Defining a quantum circuit generative model and the dataset

Last time, we introduced a quantum circuit generative model which encodes probability distributions over discrete data. Particularly, the data are binary qubit measurements , e.g. for qubits. The probability can formally be written as

where are the parameters of a parametrized quantum circuit which prepares the quantum state . In practice, this represents the probability of obtaining the outcome when you measure the prepared quantum state.

Let us recollect the function that flexibly constructs a quantum circuit with any number of qubits and layers :

def get_circuit(number_of_qubits, number_of_layers): number_of_parameters = 2 * number_of_qubits * number_of_layers thetas = [Parameter(r"\theta" + f"{ii+1}") for ii in range(number_of_parameters)] # initialize the qiskit circuit circuit = QuantumCircuit(number_of_qubits) parameter_counter = 0 # loop over the layers of the quantum circuit for l in range(number_of_layers): # apply a layer of RX rotations for n in range(number_of_qubits): circuit.rx(thetas[parameter_counter], n) parameter_counter += 1 # apply a layer of RZ rotations for n in range(number_of_qubits): circuit.rz(thetas[parameter_counter], n) parameter_counter += 1 # apply a layer of CNOTs for n in range(number_of_qubits - 1): circuit.cnot(n, n + 1) # instructions to measure all qubits circuit.measure_all() return circuit

For and , the quantum circuit looks like this:

number_of_qubits = 4 number_of_layers = 2 circuit = get_circuit(number_of_qubits, number_of_layers) circuit.draw("mpl")

The dataset that we constructed last time was a discretized version of the dataset used in Part 1 and Part 2 of our course:

training_data_dict = {'0000': 1, '0001': 0, '0010': 2, '0011': 4, '0100': 4, '0101': 10, '0110': 7, '0111': 9, '1000': 7, '1001': 5, '1010': 3, '1011': 8, '1100': 14, '1101': 19, '1110': 6, '1111': 1} # normalize the counts to be interpreted as probabilities count_sum = sum(training_data_dict.values()) for x, prob in training_data_dict.items(): training_data_dict[x] = prob/count_sum print(training_data_dict)

{'0000': 0.01, '0001': 0.0, '0010': 0.02, '0011': 0.04, '0100': 0.04, '0101': 0.1, '0110': 0.07, '0111': 0.09, '1000': 0.07, '1001': 0.05, '1010': 0.03, '1011': 0.08, '1100': 0.14, '1101': 0.19, '1110': 0.06, '1111': 0.01}

For convenience, let us define a function to get samples from our model given a circuit and a vector of parameters.

simulator = Aer.get_backend("aer_simulator") def get_model_dict(circuit, parameters): # bind the circuit parameters to numbers circuit_with_params = circuit.bind_parameters(parameters) # run the circuit and get the result result = simulator.run(circuit_with_params, shots=10000).result() # extract the measurement counts as a python dictionary model_dict = result.get_counts() return model_dict

With a random initial guess for our parameter vector , we can plot the quantum circuit model distribution together with the dataset.

np.random.seed(42) # the quantum circuit has 2*n*d parameters number_of_parameters = 2 * number_of_qubits * number_of_layers # define value for the initial parameters parameters = np.random.uniform(0, 2 * np.pi, size=number_of_parameters) # get the model dict using our function model_dict = get_model_dict(circuit, parameters) # for convenience sort by binary order model_dict = dict(sorted(model_dict.items())) # do the plotting fig, ax = plt.subplots(1, 1) plt.bar( training_data_dict.keys(), np.array(list(training_data_dict.values())) / sum(training_data_dict.values()), # turn the counts into probabilities alpha=0.5, edgecolor="royalblue", linewidth=2, label="Training Distribution", ) plt.bar( model_dict.keys(), np.array(list(model_dict.values())) / sum(model_dict.values()), # turn the counts into probabilities alpha=0.5, edgecolor="red", linewidth=2, label="Model Distribution", ) plt.xticks(rotation="vertical") plt.legend(fontsize=15) ax.set_ylabel("Counts", fontsize=16) ax.set_xlabel("Outcome", fontsize=16) ax.tick_params(labelsize=14) plt.ylim(0, 0.30) plt.show()

## Getting a trained a generative model

A crucial component of our generative modeling task is the loss function that we optimize to learn the data. Here, we use the KL divergence

which measures the similarity between data generated from our model and from the dataset. The KL divergence approaches if the distributions match perfectly, i.e., .

def kl_divergence(data_distribution, model_distribution): # we will require a small offset "e" to avoid possible log(0) singularities e = 1e-6 # make sure the values in the dictionaries add up to 1 sum_data = sum(data_distribution.values()) sum_model = sum(model_distribution.values()) for bin_outcome, count in data_distribution.items(): data_distribution[bin_outcome] = max(count / sum_data, e) for bin_outcome, count in model_distribution.items(): model_distribution[bin_outcome] = max(count / sum_model, e) # calculate the first term in the KL divergence formula term1 = 0 for bin_outcome, prob in data_distribution.items(): term1 += prob * np.log(prob) # calculate the second term in the KL divergence formula term2 = 0 for bin_outcome, prob in data_distribution.items(): term2 += prob * np.log(model_distribution.get(bin_outcome, e)) return term1 - term2

Our goal now is to optimize the parameters of the quantum circuit with the goal of lowering the KL divergence. One approach for doing that is via *gradient descent*. Similarly to Part 2 of this course, we define a simple function to estimate a numerical gradient of the KL divergence using finite differences:

with a small (epsilon).

def calculate_gradient(model_parameters, circuit, finite_distance_eps): gradient_vector = np.zeros(len(model_parameters)) # loop over all parameters for kk in range(len(model_parameters)): # define the eps vector in the direction of the parameter delta_vector = np.zeros(len(model_parameters)) delta_vector[kk] = finite_distance_eps # add or subtract eps from the parameter params_plus = model_parameters + delta_vector params_minus = model_parameters - delta_vector # get the corresponding measurements from the quantum circuit model_dict_plus = get_model_dict(circuit, params_plus) model_dict_minus = get_model_dict(circuit, params_minus) # calculate the KL divergence values kl_plus = kl_divergence(training_data_dict, model_dict_plus) kl_minus = kl_divergence(training_data_dict, model_dict_minus) # calculate the finite difference gradient gradient = (kl_plus - kl_minus)/(2*finite_distance_eps) gradient_vector[kk] = gradient return gradient_vector

This time, we choose , which is a little larger than last time. We do this because we want to be able to confidently differentiate the two evaluations of the model at different parameters with a finite number of samples from the quantum computer.

And as in Part 2, we define a simple gradient descent algorithm via

with a *learning rate* (eta), where is the training iteration index.

eps = 5e-2 %%time # generate initial random parameters np.random.seed(42) parameters = np.random.uniform(0, 2 * np.pi, size=number_of_parameters) # learning rate and number of iterations learning_rate = 0.1 T = 100 all_kls = [] # evaluate and record the initial KL divergence loss model_dict = get_model_dict(circuit, parameters) all_kls.append(kl_divergence(training_data_dict, model_dict)) # train! for t in range(T): # one gradient descent step gradient_vector = calculate_gradient(parameters, circuit, eps) parameters -= learning_rate*gradient_vector # evaluate and record the KL divergence loss model_dict = get_model_dict(circuit, parameters) all_kls.append(kl_divergence(training_data_dict, model_dict)) final_parameters = parameters.copy()

And plot up the optimization results:

fig, ax = plt.subplots(1,1) plt.plot(all_kls, linewidth=3) ax.set_ylabel("KL Divergence", fontsize=16) ax.set_xlabel("Iteration ", fontsize=16) plt.ylim(0, 2) plt.show()

This is a lot of improvement in terms of the KL divergence! To visually confirm that the freshly optimized model represents the data distribution much better, let us plot the model distribution together with the training data.

np.random.seed(42) # get the model dict using our function model_dict = get_model_dict(circuit, final_parameters) # for convenience sort by binary order model_dict = dict(sorted(model_dict.items())) # do the plotting fig, ax = plt.subplots(1, 1) x_points = np.linspace(-1.5, 1.5, 1000) plt.bar( training_data_dict.keys(), np.array(list(training_data_dict.values())) / sum(training_data_dict.values()), # turn the counts into probabilities alpha=0.5, edgecolor="royalblue", linewidth=2, label="Training Distribution", ) plt.bar( model_dict.keys(), np.array(list(model_dict.values())) / sum(model_dict.values()), # turn the counts into probabilities alpha=0.5, edgecolor="red", linewidth=2, label="Optimized Model Distribution, d=2", ) plt.xticks(rotation="vertical") plt.legend(fontsize=15) ax.set_ylabel("Counts", fontsize=16) ax.set_xlabel("Outcome", fontsize=16) ax.tick_params(labelsize=14) plt.ylim(0, 0.30) plt.show()

This model distribution resembles the training data distribution pretty well, and our trained quantum circuit could already be used as a generative model.But let us take a few steps back and think about about the structure of the quantum circuit and how it may affect performance.

## The role of entanglement

Remember that we defined the number of layers 𝑑=2, which gave us the circuit layout that you can see above. How would this all have worked if we used 𝑑=1, i.e., only one layer of entangling `CNOT`

gates? This exploration is similar to our QML 104 lesson, which you can find here.

number_of_layers = 1 circuit1 = get_circuit(number_of_qubits, number_of_layers) circuit1.draw("mpl")

Train it …

%%time number_of_parameters1 = 2 * number_of_qubits * number_of_layers # generate initial random parameters np.random.seed(42) parameters = np.random.uniform(0, 2 * np.pi, size=number_of_parameters1) # learning rate and number of iterations learning_rate = 0.1 T = 100 all_kls1 = [] # evaluate and record the initial KL divergence loss model_dict = get_model_dict(circuit1, parameters) all_kls1.append(kl_divergence(training_data_dict, model_dict)) # train! for t in range(T): # one gradient descent step gradient_vector = calculate_gradient(parameters, circuit1, eps) parameters -= learning_rate*gradient_vector # evaluate and record the KL divergence loss model_dict = get_model_dict(circuit1, parameters) all_kls1.append(kl_divergence(training_data_dict, model_dict)) final_parameters1 = parameters.copy()

… and plot it together with the previous run with 𝑑=2:

fig, ax = plt.subplots(1,1) plt.plot(all_kls1, linewidth=3, label="d=1 layer") plt.plot(all_kls, linewidth=3, label="d=2 layers") ax.set_ylabel("KL Divergence", fontsize=16) ax.set_xlabel("Iteration ", fontsize=16) plt.legend(fontsize=14) plt.ylim(0, 2) plt.show()

We can see that 𝑑=2 reaches slightly better KL divergence values than 𝑑=1. Let’s try the more adventurous 𝑑=5 and compare.

number_of_layers = 5 circuit5 = get_circuit(number_of_qubits, number_of_layers) %%time number_of_parameters5 = 2 * number_of_qubits * number_of_layers # generate initial random parameters np.random.seed(42) parameters = np.random.uniform(0, 2 * np.pi, size=number_of_parameters5) # learning rate and number of iterations learning_rate = 0.1 T = 100 all_kls5 = [] # evaluate and record the initial KL divergence loss model_dict = get_model_dict(circuit5, parameters) all_kls5.append(kl_divergence(training_data_dict, model_dict)) # train! for t in range(T): # one gradient descent step gradient_vector = calculate_gradient(parameters, circuit5, eps) parameters -= learning_rate*gradient_vector # evaluate and record the KL divergence loss model_dict = get_model_dict(circuit5, parameters) all_kls5.append(kl_divergence(training_data_dict, model_dict)) final_parameters5 = parameters.copy()

Plot up all KL divergences that we collected:

fig, ax = plt.subplots(1,1) plt.plot(all_kls1, linewidth=3, label="d=1 layers") plt.plot(all_kls, linewidth=3, label="d=2 layers") plt.plot(all_kls5, linewidth=3, label="d=5 layers") ax.set_ylabel("KL Divergence", fontsize=16) ax.set_xlabel("Iteration ", fontsize=16) plt.legend(fontsize=14) plt.ylim(0, 2) plt.show()

𝑑=5 enables the KL divergence to go much lower, and approach the perfect value of 𝐾𝐿=0. The corresponding model distribution looks like this:

np.random.seed(42) # get the model dict using our function model_dict = get_model_dict(circuit5, final_parameters5) # for convenience sort by binary order model_dict = dict(sorted(model_dict.items())) # do the plotting fig, ax = plt.subplots(1, 1) plt.bar( training_data_dict.keys(), np.array(list(training_data_dict.values())) / sum(training_data_dict.values()), # turn the counts into probabilities alpha=0.5, edgecolor="royalblue", linewidth=2, label="Training Distribution", ) plt.bar( model_dict.keys(), np.array(list(model_dict.values())) / sum(model_dict.values()), # turn the counts into probabilities alpha=0.5, edgecolor="red", linewidth=2, label="Optimized Model Distribution, d=5", ) plt.xticks(rotation="vertical") plt.legend(fontsize=15) ax.set_ylabel("Counts", fontsize=16) ax.set_xlabel("Outcome", fontsize=16) ax.tick_params(labelsize=14) plt.ylim(0, 0.30) plt.show()

This really is a much better fit on the training distribution , just as the low KL divergence indicated. But why? Is it just the additional parameters in the quantum circuit with more layers? No! In fact, without any entangling gates, the additional parameters would be redundant. What makes this quantum circuit model powerful (also called more **expressive**) is the entangling gates sandwiched between parametrized single-qubit gates. Lets think about this on the level of a two-qubit **wavefunction**.

### Optional discussion on entanglement

**NOTE: If the following calculation is too involved for you, don’t worry! You will be still be able to take away the most important message.**

Assume that you want to learn the target distribution that contains and . You might notice that this is the distribution that you would expect if you measured a *Bell state*. See our recent lesson on this very special state here.

A general two-qubit wavefunction can be written as:

where all are complex numbers and their absolute squares add up to . If the two qubits are not entangled, then by definition, you can split up the state of both qubits into the product of the individual states:

Writing each state with it’s own amplitudes and , respectively, we can write

**Question**: Can you find a combination of such that you can multiply the terms back together and achieve , and all other terms are ? Such a state would create the desired target distribution.

No, that is not possible! You need at least one `CNOT`

gate, that entangles the two qubits.

The important message is:

**Without any entangling gates, you are very restricted in the distributions ** **that you can attain. And inversely, increasing the number of entangling gates in your quantum circuit allows you to model increasingly complicated distributions.**

## Conclusion

Quantum circuit generative models utilize the phenomena of entanglement and wavefunction collapse (i.e., the measurement) as a resource, and are thus promissing candidates for efficiently learning and sampling from complicated discrete probability distributions. In this lesson, you learned how to train a quantum circuit generative model using the KL divergence loss function, and how more entanglement enables the model to learn continuously better models of the training data distribution.

But there is much more to learn. For example, to we really want to reach 𝐾𝐿=0 on the training dataset? Can more entanglement be detrimental for training? How do you choose a good ansatz for a particular dataset? Does KL divergence training scale well as we increase the number of qubits? And much more…

Stay tuned for our next and final lesson of *Generative Modeling with Quantum Hardware*, where we explore cutting edge research in the domain quantum circuit generative models. See you soon!