Chapter 15: Generating Data

Source: Python Crash Course, 3rd Edition by Eric Matthes

Data visualization is the use of visual representations to explore and present patterns in datasets. It’s closely associated with data analysis, which uses code to explore the patterns and connections in a dataset. A dataset can be a small list of numbers that fits in a single line of code, or it can be terabytes of data that include many different kinds of information.

Creating effective data visualizations is about more than just making information look nice. When a representation of a dataset is simple and visually appealing, its meaning becomes clear to viewers. People will see patterns and significance in your datasets that they never knew existed.

Fortunately, you don’t need a supercomputer to visualize complex data. Python is so efficient that with just a laptop, you can quickly explore datasets containing millions of individual data points.

People use Python for data-intensive work in genetics, climate research, political and economic analysis, and much more. Data scientists have written an impressive array of visualization and analysis tools in Python, many of which are available to you as well. One of the most popular tools is Matplotlib, a mathematical plotting library. In this chapter, we’ll use Matplotlib to make simple plots, such as line graphs and scatter plots. Then we’ll create a more interesting dataset based on the concept of a random walk — a visualization generated from a series of random decisions.

We’ll also use a package called Plotly, which creates visualizations that work well on digital devices, to analyze the results of rolling dice. Plotly generates visualizations that automatically resize to fit a variety of display devices. These visualizations can also include a number of interactive features, such as emphasizing particular aspects of the dataset when users hover over different parts of the visualization.

Installing Matplotlib

To use Matplotlib for your initial set of visualizations, you’ll need to install it using pip, just like we did with pytest in Chapter 11 (see Installing pytest with pip on page 210).

To install Matplotlib, enter the following command at a terminal prompt:

$ python -m pip install --user matplotlib

If you use a command other than python to run programs or start a terminal session, such as python3, your command will look like this:

$ python3 -m pip install --user matplotlib

To see the kinds of visualizations you can make with Matplotlib, visit the Matplotlib home page at matplotlib.org and click Plot types. When you click a visualization in the gallery, you’ll see the code used to generate the plot.

Plotting a Simple Line Graph

Let’s plot a simple line graph using Matplotlib and then customize it to create a more informative data visualization. We’ll use the square number sequence 1, 4, 9, 16, and 25 as the data for the graph.

To make a simple line graph, specify the numbers you want to work with and let Matplotlib do the rest:

mpl_squares.py

import matplotlib.pyplot as plt

squares = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()    (1)
ax.plot(squares)

plt.show()

1	We follow a common Matplotlib convention by calling the `subplots()` function. This function can generate one or more plots in the same figure. The variable `fig` represents the entire figure, which is the collection of plots that are generated. The variable `ax` represents a single plot in the figure; this is the variable we’ll use most of the time when defining and customizing a single plot.

We first import the pyplot module using the alias plt so we don’t have to type pyplot repeatedly. (You’ll see this convention often in online examples, so we’ll use it here.) The pyplot module contains a number of functions that help generate charts and plots.

We create a list called squares to hold the data that we’ll plot. We then use the plot() method, which tries to plot the data it’s given in a meaningful way. The function plt.show() opens Matplotlib’s viewer and displays the plot. The viewer allows you to zoom and navigate the plot, and you can save any plot images you like by clicking the disk icon.

Changing the Label Type and Line Thickness

Although the initial plot shows that the numbers are increasing, the label type is too small and the line is a little thin to read easily. Fortunately, Matplotlib allows you to adjust every feature of a visualization.

We’ll use a few of the available customizations to improve this plot’s readability. Let’s start by adding a title and labeling the axes:

mpl_squares.py

import matplotlib.pyplot as plt

squares = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
ax.plot(squares, linewidth=3)   (1)

# Set chart title and label axes.
ax.set_title("Square Numbers", fontsize=24)   (2)
ax.set_xlabel("Value", fontsize=14)           (3)
ax.set_ylabel("Square of Value", fontsize=14)

# Set size of tick labels.
ax.tick_params(labelsize=14)                  (4)

plt.show()

1	The `linewidth` parameter controls the thickness of the line that `plot()` generates.
2	The `set_title()` method sets an overall title for the chart. The `fontsize` parameters control the size of the text in various elements on the chart.
3	The `set_xlabel()` and `set_ylabel()` methods allow you to set a title for each of the axes.
4	The method `tick_params()` styles the tick marks. Here it sets the font size of the tick mark labels to 14 on both axes.

Correcting the Plot

Now that we can read the chart better, we can see that the data is not plotted correctly. Notice at the end of the graph that the square of 4.0 is shown as 25! Let’s fix that.

When you give plot() a single sequence of numbers, it assumes the first data point corresponds to an x-value of 0, but our first point corresponds to an x-value of 1. We can override the default behavior by giving plot() both the input and output values used to calculate the squares:

mpl_squares.py

import matplotlib.pyplot as plt

input_values = [1, 2, 3, 4, 5]
squares = [1, 4, 9, 16, 25]

fig, ax = plt.subplots()
ax.plot(input_values, squares, linewidth=3)

# Set chart title and label axes.
# --snip--

Now plot() doesn’t have to make any assumptions about how the output numbers were generated, and the resulting plot is correct.
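To see why the first plot was off by one, it can help to write out the points that each call implies. The sketch below uses only built-in functions and does not call Matplotlib; the names `default_points` and `explicit_points` are ours, chosen for illustration.

```python
squares = [1, 4, 9, 16, 25]
input_values = [1, 2, 3, 4, 5]

# With a single sequence, plot() pairs each value with its index, starting at 0.
default_points = list(enumerate(squares))

# With both sequences, each input value is paired with its own square.
explicit_points = list(zip(input_values, squares))

print(default_points)   # [(0, 1), (1, 4), (2, 9), (3, 16), (4, 25)]
print(explicit_points)  # [(1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]
```

In the first list, 25 sits at an x-value of 4, which is the error visible in the original chart.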

You can specify a number of arguments when calling plot() and use a number of methods to customize your plots after generating them. We’ll continue to explore these approaches to customization as we work with more interesting datasets throughout this chapter.

Using Built-in Styles

Matplotlib has a number of predefined styles available. These styles contain a variety of default settings for background colors, gridlines, line widths, fonts, font sizes, and more. To see the full list of available styles, run the following lines in a terminal session:

>>> import matplotlib.pyplot as plt
>>> plt.style.available
['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', --snip--]

To use any of these styles, add one line of code before calling subplots():

mpl_squares.py

import matplotlib.pyplot as plt

input_values = [1, 2, 3, 4, 5]
squares = [1, 4, 9, 16, 25]

plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
# --snip--

A wide variety of styles is available; play around with these styles to find some that you like.

Plotting and Styling Individual Points with scatter()

Sometimes, it’s useful to plot and style individual points based on certain characteristics. For example, you might plot small values in one color and larger values in a different color. You could also plot a large dataset with one set of styling options and then emphasize individual points by replotting them with different options.

To plot a single point, pass the single x- and y-values of the point to scatter():

scatter_squares.py

import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.scatter(2, 4)

plt.show()

Let’s style the output to make it more interesting. We’ll add a title, label the axes, and make sure all the text is large enough to read:

import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.scatter(2, 4, s=200)   (1)

# Set chart title and label axes.
ax.set_title("Square Numbers", fontsize=24)
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)

# Set size of tick labels.
ax.tick_params(labelsize=14)

plt.show()

1	We call `scatter()` and use the `s` argument to set the size of the dots used to draw the graph.

Plotting a Series of Points with scatter()

To plot a series of points, we can pass scatter() separate lists of x- and y-values, like this:

scatter_squares.py

import matplotlib.pyplot as plt

x_values = [1, 2, 3, 4, 5]
y_values = [1, 4, 9, 16, 25]

plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, s=100)

# Set chart title and label axes.
# --snip--

The x_values list contains the numbers to be squared, and y_values contains the square of each number. When these lists are passed to scatter(), Matplotlib reads one value from each list as it plots each point. The points to be plotted are (1, 1), (2, 4), (3, 9), (4, 16), and (5, 25).

Calculating Data Automatically

Writing lists by hand can be inefficient, especially when we have many points. Rather than writing out each value, let’s use a loop to do the calculations for us.

Here’s how this would look with 1,000 points:

scatter_squares.py

import matplotlib.pyplot as plt

x_values = range(1, 1001)                          (1)
y_values = [x**2 for x in x_values]

plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, s=10)               (2)

# Set chart title and label axes.
# --snip--

# Set the range for each axis.
ax.axis([0, 1100, 0, 1_100_000])                   (3)

plt.show()

1	We start with a `range` of x-values containing the numbers 1 through 1,000. Next, a list comprehension generates the y-values by looping through the x-values, squaring each number, and assigning the results to `y_values`.
2	Because this is a large dataset, we use a smaller point size.
3	Before showing the plot, we use the `axis()` method to specify the range of each axis. The `axis()` method requires four values: the minimum and maximum values for the x-axis and the y-axis. Here, we run the x-axis from 0 to 1,100 and the y-axis from 0 to 1,100,000.
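The axis limits in this listing leave a little room beyond the largest data point. As a rough check, assuming the same `x_values` and `y_values`, you can compare the extremes of the data to those limits without drawing anything; the names `x_limit` and `y_limit` are used only for this illustration.

```python
x_values = range(1, 1001)
y_values = [x**2 for x in x_values]

# The upper limits passed to ax.axis() in the listing.
x_limit, y_limit = 1100, 1_100_000

print(len(y_values))                 # 1000
print(max(x_values), max(y_values))  # 1000 1000000

# Every point falls inside the plotted region, with a 10 percent margin.
print(max(x_values) < x_limit and max(y_values) < y_limit)  # True
```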

Customizing Tick Labels

When the numbers on an axis get large enough, Matplotlib defaults to scientific notation for tick labels. This is usually a good thing, but you can tell Matplotlib to keep using plain notation if you prefer:

# --snip--
# Set the range for each axis.
ax.axis([0, 1100, 0, 1_100_000])
ax.ticklabel_format(style='plain')

plt.show()

The ticklabel_format() method allows you to override the default tick label style for any plot.

Defining Custom Colors

To change the color of the points, pass the argument color to scatter() with the name of a color to use in quotation marks, as shown here:

ax.scatter(x_values, y_values, color='red', s=10)

You can also define custom colors using the RGB color model. To define a color, pass the color argument a tuple with three float values (one each for red, green, and blue, in that order), using values between 0 and 1. For example, the following line creates a plot with light-green dots:

ax.scatter(x_values, y_values, color=(0, 0.8, 0), s=10)

Values closer to 0 produce darker colors, and values closer to 1 produce lighter colors.
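If you are used to hexadecimal color codes, converting an RGB tuple to one can make the color easier to picture. The helper `rgb_to_hex()` below is a hypothetical function written for this illustration, not part of Matplotlib; it scales each float from the 0 to 1 range up to the 0 to 255 range.

```python
def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in 0..1 to a hex color string."""
    red, green, blue = (round(value * 255) for value in rgb)
    return f"#{red:02x}{green:02x}{blue:02x}"

print(rgb_to_hex((0, 0.8, 0)))   # #00cc00
print(rgb_to_hex((0, 0, 0)))     # #000000
print(rgb_to_hex((1, 1, 1)))     # #ffffff
```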

Using a Colormap

A colormap is a sequence of colors in a gradient that moves from a starting to an ending color. In visualizations, colormaps are used to emphasize patterns in data. For example, you might make low values a light color and high values a darker color.

The pyplot module includes a set of built-in colormaps. To use one of these colormaps, you need to specify how pyplot should assign a color to each point in the dataset. Here’s how to assign a color to each point, based on its y-value:

scatter_squares.py

# --snip--
plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, c=y_values, cmap=plt.cm.Blues, s=10)

# Set chart title and label axes.
# --snip--

The c argument is similar to color, but it’s used to associate a sequence of values with a color mapping. We pass the list of y-values to c, and then tell pyplot which colormap to use with the cmap argument. This code colors the points with lower y-values light blue and the points with higher y-values dark blue.
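Conceptually, a colormap needs each value rescaled to a position between 0 and 1 before it can pick a color. The sketch below shows that rescaling and a simple blend between two colors. It is a simplified illustration, not Matplotlib’s implementation, and the two end colors and the helper names `normalize()` and `blend()` are assumptions made for this example.

```python
def normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1.

    Assumes the values are not all equal.
    """
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]

def blend(start, end, position):
    """Mix two RGB tuples; position 0 gives start, position 1 gives end."""
    return tuple(s + (e - s) * position for s, e in zip(start, end))

y_values = [1, 4, 9, 16, 25]
positions = normalize(y_values)

# Hypothetical end colors: a pale blue and a dark blue.
light_blue = (0.9, 0.95, 1.0)
dark_blue = (0.0, 0.2, 0.5)

colors = [blend(light_blue, dark_blue, p) for p in positions]
print(positions[0], positions[-1])   # 0.0 1.0
```

The lowest y-value receives the light end of the gradient and the highest receives the dark end, which matches the behavior described for the Blues colormap.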

You can see all the colormaps available in pyplot at matplotlib.org. Go to Tutorials, scroll down to Colors, and click Choosing Colormaps in Matplotlib.

Saving Your Plots Automatically

If you want to save the plot to a file instead of showing it in the Matplotlib viewer, you can use plt.savefig() instead of plt.show():

plt.savefig('squares_plot.png', bbox_inches='tight')

The first argument is a filename for the plot image. Because the filename is a relative path, the image is saved in the current working directory, which is the same directory as scatter_squares.py if you run the program from there. The second argument trims extra whitespace from the plot. You can also call savefig() with a Path object and write the output file anywhere you want on your system.

Try It Yourself

15-1. Cubes: A number raised to the third power is a cube. Plot the first five cubic numbers, and then plot the first 5,000 cubic numbers.

15-2. Colored Cubes: Apply a colormap to your cubes plot.

Random Walks

In this section, we’ll use Python to generate data for a random walk and then use Matplotlib to create a visually appealing representation of that data. A random walk is a path that’s determined by a series of simple decisions, each of which is left entirely to chance. You might imagine a random walk as the path a confused ant would take if it took every step in a random direction.

Random walks have practical applications in nature, physics, biology, chemistry, and economics. For example, a pollen grain floating on a drop of water moves across the surface of the water because it’s constantly pushed around by water molecules. Molecular motion in a water drop is random, so the path a pollen grain traces on the surface is a random walk. The code we’ll write next models many real-world situations.

Creating the RandomWalk Class

To create a random walk, we’ll create a RandomWalk class, which will make random decisions about which direction the walk should take. The class needs three attributes: one variable to track the number of points in the walk, and two lists to store the x- and y-coordinates of each point in the walk.

We’ll only need two methods for the RandomWalk class: the __init__() method and fill_walk(), which will calculate the points in the walk. Let’s start with the __init__() method:

random_walk.py

from random import choice                      (1)

class RandomWalk:
    """A class to generate random walks."""

    def __init__(self, num_points=5000):       (2)
        """Initialize attributes of a walk."""
        self.num_points = num_points

        # All walks start at (0, 0).
        self.x_values = [0]                   (3)
        self.y_values = [0]

1	To make random decisions, we’ll store possible moves in a list and use the `choice()` function (from the `random` module) to decide which move to make each time a step is taken.
2	We set the default number of points in a walk to 5000, which is large enough to generate some interesting patterns but small enough to generate walks quickly.
3	We make two lists to hold the x- and y-values, and we start each walk at the point (0, 0).

Choosing Directions

We’ll use the fill_walk() method to determine the full sequence of points in the walk. Add this method to random_walk.py:

random_walk.py

    def fill_walk(self):
        """Calculate all the points in the walk."""

        # Keep taking steps until the walk reaches the desired length.
        while len(self.x_values) < self.num_points:   (1)

            # Decide which direction to go, and how far to go.
            x_direction = choice([1, -1])
            x_distance = choice([0, 1, 2, 3, 4])      (2)
            x_step = x_direction * x_distance          (3)

            y_direction = choice([1, -1])
            y_distance = choice([0, 1, 2, 3, 4])
            y_step = y_direction * y_distance          (4)

            # Reject moves that go nowhere.
            if x_step == 0 and y_step == 0:            (5)
                continue

            # Calculate the new position.
            x = self.x_values[-1] + x_step            (6)
            y = self.y_values[-1] + y_step

            self.x_values.append(x)
            self.y_values.append(y)

1	We first set up a loop that runs until the walk is filled with the correct number of points.
2	We use `choice([1, -1])` to choose a value for `x_direction`, which returns either 1 for movement to the right or −1 for movement to the left. Next, `choice([0, 1, 2, 3, 4])` randomly selects a distance to move in that direction. The inclusion of a 0 allows for the possibility of steps that have movement along only one axis.
3	We determine the length of each step in the x-direction by multiplying the direction of movement by the distance chosen.
4	We determine the length of each step in the y-direction the same way.
5	If the values of both `x_step` and `y_step` are 0, the walk doesn’t go anywhere; when this happens, we `continue` the loop.
6	To get the next x-value for the walk, we add the value in `x_step` to the last value stored in `x_values` and do the same for the y-values. When we have the new point’s coordinates, we append them to `x_values` and `y_values`.
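Because each step is a direction multiplied by a distance, it is worth listing every combination to see which step values are possible along one axis. This sketch enumerates the ten combinations; the name `step_counts` is ours.

```python
from collections import Counter
from itertools import product

directions = [1, -1]
distances = [0, 1, 2, 3, 4]

# Count how many direction/distance pairs produce each step value.
step_counts = Counter(d * dist for d, dist in product(directions, distances))

for step in sorted(step_counts):
    print(step, step_counts[step])
```

A step of 0 can be reached two ways (1 times 0 and −1 times 0), so it is twice as likely as any other single value along one axis. Both axes come up 0 together in 4 of the 100 equally likely combinations, and those are the moves that fill_walk() rejects.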

Plotting the Random Walk

Here’s the code to plot all the points in the walk:

rw_visual.py

import matplotlib.pyplot as plt

from random_walk import RandomWalk

# Make a random walk.
rw = RandomWalk()          (1)
rw.fill_walk()

# Plot the points in the walk.
plt.style.use('classic')
fig, ax = plt.subplots()
ax.scatter(rw.x_values, rw.y_values, s=15)   (2)
ax.set_aspect('equal')                        (3)
plt.show()

1	We create a random walk and assign it to `rw`, making sure to call `fill_walk()`.
2	To visualize the walk, we feed the walk’s x- and y-values to `scatter()` and choose an appropriate dot size.
3	By default, Matplotlib scales each axis independently. But that approach would stretch most walks out horizontally or vertically. Here we use the `set_aspect()` method to specify that both axes should have equal spacing between tick marks.

Generating Multiple Random Walks

Every random walk is different, and it’s fun to explore the various patterns that can be generated. One way to use the preceding code to make multiple walks without having to run the program several times is to wrap it in a while loop, like this:

rw_visual.py

import matplotlib.pyplot as plt

from random_walk import RandomWalk

# Keep making new walks, as long as the program is active.
while True:
    # Make a random walk.
    # --snip--
    plt.show()

    keep_running = input("Make another walk? (y/n): ")
    if keep_running == 'n':
        break

This code generates a random walk, displays it in Matplotlib’s viewer, and pauses with the viewer open. When you close the viewer, you’ll be asked whether you want to generate another walk.
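The loop above stops only when the reply is exactly a lowercase n. A slightly more forgiving check might ignore case and surrounding spaces. The function `wants_another()` here is a hypothetical helper, not part of the book’s code.

```python
def wants_another(answer):
    """Return False only when the reply starts with 'n', ignoring case and spaces."""
    return not answer.strip().lower().startswith('n')

print(wants_another('n'))     # False
print(wants_another(' N '))   # False
print(wants_another('y'))     # True
print(wants_another(''))      # True
```

In rw_visual.py, the test could then read `if not wants_another(keep_running): break`.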

Styling the Walk

In this section, we’ll customize our plots to emphasize the important characteristics of each walk and deemphasize distracting elements.

Coloring the Points

We’ll use a colormap to show the order of the points in the walk, and remove the black outline from each dot so the color of the dots will be clearer. To color the points according to their position in the walk, we pass the c argument a list containing the position of each point. Because the points are plotted in order, this list just contains the numbers from 0 to 4,999:

rw_visual.py

# --snip--
while True:
    # Make a random walk.
    rw = RandomWalk()
    rw.fill_walk()

    # Plot the points in the walk.
    plt.style.use('classic')
    fig, ax = plt.subplots()
    point_numbers = range(rw.num_points)   (1)
    ax.scatter(rw.x_values, rw.y_values, c=point_numbers, cmap=plt.cm.Blues,
        edgecolors='none', s=15)
    ax.set_aspect('equal')
    plt.show()
    # --snip--

1	We use `range()` to generate a sequence of numbers, one for each point in the walk. We assign this sequence to `point_numbers`, which we’ll use to set the color of each point in the walk. We pass `point_numbers` to the `c` argument, use the `Blues` colormap, and then pass `edgecolors='none'` to get rid of the black outline around each point. The result is a plot that varies from light to dark blue, showing exactly how the walk moves from its starting point to its ending point.

Plotting the Starting and Ending Points

In addition to coloring points to show their position along the walk, it would be useful to see exactly where each walk begins and ends. To do so, we can plot the first and last points individually after the main series has been plotted. We’ll make the end points larger and color them differently to make them stand out:

rw_visual.py

# --snip--
while True:
    # --snip--
    ax.scatter(rw.x_values, rw.y_values, c=point_numbers, cmap=plt.cm.Blues,
        edgecolors='none', s=15)
    ax.set_aspect('equal')

    # Emphasize the first and last points.
    ax.scatter(0, 0, c='green', edgecolors='none', s=100)
    ax.scatter(rw.x_values[-1], rw.y_values[-1], c='red', edgecolors='none',
        s=100)

    plt.show()
    # --snip--

To show the starting point, we plot the point (0, 0) in green and in a larger size (s=100) than the rest of the points. To mark the end point, we plot the last x- and y-values in red with a size of 100 as well. Make sure you insert this code just before the call to plt.show() so the starting and ending points are drawn on top of all the other points.

Cleaning Up the Axes

Let’s remove the axes in this plot so they don’t distract from the path of each walk. Here’s how to hide the axes:

rw_visual.py

# --snip--
while True:
    # --snip--
    ax.scatter(rw.x_values[-1], rw.y_values[-1], c='red', edgecolors='none',
        s=100)

    # Remove the axes.
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    plt.show()
    # --snip--

To modify the axes, we use the ax.get_xaxis() and ax.get_yaxis() methods to get each axis, and then chain the set_visible() method to make each axis invisible. As you continue to work with visualizations, you’ll frequently see this chaining of methods to customize different aspects of a visualization.

Adding Plot Points

Let’s increase the number of points, to give us more data to work with. To do so, we increase the value of num_points when we make a RandomWalk instance and adjust the size of each dot when drawing the plot:

rw_visual.py

# --snip--
while True:
    # Make a random walk.
    rw = RandomWalk(50_000)
    rw.fill_walk()

    # Plot the points in the walk.
    plt.style.use('classic')
    fig, ax = plt.subplots()
    point_numbers = range(rw.num_points)
    ax.scatter(rw.x_values, rw.y_values, c=point_numbers, cmap=plt.cm.Blues,
        edgecolors='none', s=1)
    # --snip--

This example creates a random walk with 50,000 points and plots each point at size s=1. The resulting walk is wispy and cloudlike. We’ve created a piece of art from a simple scatter plot! Experiment with this code to see how much you can increase the number of points in a walk before your system starts to slow down significantly.

Altering the Size to Fill the Screen

A visualization is much more effective at communicating patterns in data if it fits nicely on the screen. To make the plotting window better fit your screen, you can adjust the size of Matplotlib’s output. This is done in the subplots() call:

fig, ax = plt.subplots(figsize=(15, 9))

When creating a plot, you can pass subplots() a figsize argument, which sets the size of the figure. The figsize parameter takes a tuple that tells Matplotlib the dimensions of the plotting window in inches.

Matplotlib assumes your screen resolution is 100 pixels per inch; if this code doesn’t give you an accurate plot size, adjust the numbers as necessary. Or, if you know your system’s resolution, you can pass subplots() the resolution using the dpi parameter:

fig, ax = plt.subplots(figsize=(10, 6), dpi=128)

This should help make the most efficient use of the space available on your screen.
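The size of the window in pixels is the size in inches multiplied by the resolution. The helper `figure_pixels()` below is a hypothetical function that only does this arithmetic; its default of 100 matches the resolution that Matplotlib assumes, as described above.

```python
def figure_pixels(figsize, dpi=100):
    """Return the (width, height) of a figure in pixels."""
    width_in, height_in = figsize
    return round(width_in * dpi), round(height_in * dpi)

print(figure_pixels((15, 9)))            # (1500, 900)
print(figure_pixels((10, 6), dpi=128))   # (1280, 768)
```

Comparing these numbers to your screen’s resolution is a quick way to choose a figsize that fits.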

Try It Yourself

15-3. Molecular Motion: Modify rw_visual.py by replacing ax.scatter() with ax.plot(). To simulate the path of a pollen grain on the surface of a drop of water, pass in the rw.x_values and rw.y_values, and include a linewidth argument. Use 5,000 instead of 50,000 points to keep the plot from being too busy.

15-4. Modified Random Walks: In the RandomWalk class, x_step and y_step are generated from the same set of conditions. Modify the values in these lists to see what happens to the overall shape of your walks. Try a longer list of choices for the distance, such as 0 through 8, or remove the −1 from the x- or y-direction list.

15-5. Refactoring: The fill_walk() method is lengthy. Create a new method called get_step() to determine the direction and distance for each step, and then calculate the step. You should end up with two calls to get_step() in fill_walk():

x_step = self.get_step()
y_step = self.get_step()

This refactoring should reduce the size of fill_walk() and make the method easier to read and understand.

Rolling Dice with Plotly

In this section, we’ll use Plotly to produce interactive visualizations. Plotly is particularly useful when you’re creating visualizations that will be displayed in a browser, because the visualizations will scale automatically to fit the viewer’s screen. These visualizations are also interactive; when the user hovers over certain elements on the screen, information about those elements is highlighted.

We’ll build our initial visualization in just a couple lines of code using Plotly Express, a subset of Plotly that focuses on generating plots with as little code as possible. Once we know our plot is correct, we’ll customize the output just as we did with Matplotlib.

In this project, we’ll analyze the results of rolling dice. When you roll one regular, six-sided die, you have an equal chance of rolling any of the numbers from 1 through 6. However, when you use two dice, you’re more likely to roll certain numbers than others. We’ll try to determine which numbers are most likely to occur by generating a dataset that represents rolling dice.

Installing Plotly

Install Plotly using pip, just as you did for Matplotlib:

$ python -m pip install --user plotly
$ python -m pip install --user pandas

Plotly Express depends on pandas, which is a library for working efficiently with data, so we need to install that as well.

To see what kind of visualizations are possible with Plotly, visit the gallery of chart types at plotly.com/python.

Creating the Die Class

We’ll create the following Die class to simulate the roll of one die:

die.py

from random import randint

class Die:
    """A class representing a single die."""

    def __init__(self, num_sides=6):    (1)
        """Assume a six-sided die."""
        self.num_sides = num_sides

    def roll(self):
        """Return a random value between 1 and number of sides."""
        return randint(1, self.num_sides)   (2)

1	The `__init__()` method takes one optional argument. With the `Die` class, when an instance of our die is created, the number of sides will be six if no argument is included. If an argument is included, that value will set the number of sides on the die. (Dice are named for their number of sides: a six-sided die is a D6, an eight-sided die is a D8, and so on.)
2	The `roll()` method uses the `randint()` function to return a random number between 1 and the number of sides. This function can return the starting value (1), the ending value (`num_sides`), or any integer between the two.
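One way to confirm that both endpoints can be returned is to roll a die many times and look at the smallest and largest results. This sketch repeats the `Die` class so it runs on its own, and it uses an eight-sided die with an arbitrary seed; both choices are assumptions made only to keep the check repeatable.

```python
from random import randint, seed

class Die:
    """A class representing a single die."""

    def __init__(self, num_sides=6):
        """Assume a six-sided die."""
        self.num_sides = num_sides

    def roll(self):
        """Return a random value between 1 and number of sides."""
        return randint(1, self.num_sides)

seed(15)  # Arbitrary seed, used only to make this check repeatable.

d8 = Die(8)
rolls = [d8.roll() for _ in range(10_000)]

print(min(rolls), max(rolls))   # 1 8
```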

Rolling the Die

Before creating a visualization based on the Die class, let’s roll a D6, print the results, and check that the results look reasonable:

die_visual.py

from die import Die

# Create a D6.
die = Die()     (1)

# Make some rolls, and store results in a list.
results = []
for roll_num in range(100):     (2)
    result = die.roll()
    results.append(result)

print(results)

1	We create an instance of `Die` with the default six sides.
2	We roll the die 100 times and store the result of each roll in the list `results`.

Here’s a sample set of results:

[4, 6, 5, 6, 1, 5, 6, 3, 5, 3, 5, 3, 2, 2, 1, 3, 1, 5, 3, 6, 3, 6, 5,
 4, 1, 1, 4, 2, 3, 6, 4, 2, 6, 4, 1, 3, 2, 5, 6, 3, 6, 2, 1, 1, 3, 4,
 1, 4, 3, 5, 1, 4, 5, 5, 2, 3, 3, 1, 2, 3, 5, 6, 2, 5, 6, 1, 3, 2, 1,
 1, 1, 6, 5, 5, 2, 2, 6, 4, 1, 4, 5, 1, 1, 1, 4, 5, 3, 3, 1, 3, 5, 4,
 5, 6, 5, 4, 1, 5, 1, 2]

A quick scan of these results shows that the Die class seems to be working. We see the values 1 and 6, so we know the smallest and largest possible values are being returned, and because we don’t see 0 or 7, we know all the results are in the appropriate range.

Analyzing the Results

We’ll analyze the results of rolling one D6 by counting how many times we roll each number:

die_visual.py

# --snip--
# Make some rolls, and store results in a list.
results = []
for roll_num in range(1000):    (1)
    result = die.roll()
    results.append(result)

# Analyze the results.
frequencies = []
poss_results = range(1, die.num_sides+1)     (2)
for value in poss_results:
    frequency = results.count(value)         (3)
    frequencies.append(frequency)            (4)

print(frequencies)

1	Because we’re no longer printing the results, we can increase the number of simulated rolls to 1000.
2	To analyze the rolls, we create the empty list `frequencies` to store the number of times each value is rolled. We then generate all the possible results we could get.
3	We loop through the possible values and count how many times each number appears in `results`.
4	We append each count value to `frequencies`.

The printed frequencies should look reasonable: you should see six frequencies, one for each possible number when you roll a D6, and no frequency should be significantly higher than any other.
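As an alternative to calling count() once per possible value, the standard library’s `Counter` can tally every roll in a single pass. This is a different approach from the listing above, shown only as a sketch; to stay self-contained it calls `randint()` directly in place of `die.roll()`, and the seed is arbitrary.

```python
from collections import Counter
from random import randint, seed

seed(15)  # Arbitrary seed so the example is repeatable.
num_sides = 6
results = [randint(1, num_sides) for _ in range(1000)]

# Counter tallies every value in a single pass over results.
counts = Counter(results)
poss_results = range(1, num_sides + 1)
frequencies = [counts[value] for value in poss_results]

print(len(frequencies))   # 6
print(sum(frequencies))   # 1000
```

Checking that the frequencies add up to the number of rolls is a simple way to confirm that no result was missed.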

Making a Histogram

Now that we have the data we want, we can generate a visualization in just a couple lines of code using Plotly Express:

die_visual.py

import plotly.express as px

from die import Die
# --snip--

for value in poss_results:
    frequency = results.count(value)
    frequencies.append(frequency)

# Visualize the results.
fig = px.bar(x=poss_results, y=frequencies)
fig.show()

We first import the plotly.express module, using the conventional alias px. We then use the px.bar() function to create a bar graph. In the simplest use of this function, we only need to pass a set of x-values and a set of y-values. Here the x-values are the possible results from rolling a single die, and the y-values are the frequencies for each possible result.

The final line calls fig.show(), which tells Plotly to render the resulting chart as an HTML file and open that file in a new browser tab.

This chart is dynamic and interactive. If you change the size of your browser window, the chart will resize to match the available space. If you hover over any of the bars, you’ll see a pop-up highlighting the specific data related to that bar.

Feel free to try different chart types by changing px.bar() to something like px.scatter() or px.line(). You can find a full list of available chart types at plotly.com/python/plotly-express.

Customizing the Plot

Now that we know we have the correct kind of plot and our data is being represented accurately, we can focus on adding the appropriate labels and styles for the chart.

The first way to customize a plot with Plotly is to use some optional parameters in the initial call that generates the plot, in this case, px.bar(). Here’s how to add an overall title and a label for each axis:

die_visual.py

# --snip--
# Visualize the results.
title = "Results of Rolling One D6 1,000 Times"            # (1)
labels = {'x': 'Result', 'y': 'Frequency of Result'}      # (2)
fig = px.bar(x=poss_results, y=frequencies, title=title, labels=labels)
fig.show()

1	We first define the title that we want.
2	To define axis labels, we write a dictionary. The keys in the dictionary refer to the labels we want to customize, and the values are the custom labels we want to use. Here we give the x-axis the label `Result` and the y-axis the label `Frequency of Result`.

Rolling Two Dice

Rolling two dice results in larger numbers and a different distribution of results. Let’s modify our code to create two D6 dice to simulate the way we roll a pair of dice. Each time we roll the pair, we’ll add the two numbers (one from each die) and store the sum in results. Save a copy of die_visual.py as dice_visual.py and make the following changes:

dice_visual.py

import plotly.express as px

from die import Die

# Create two D6 dice.
die_1 = Die()
die_2 = Die()

# Make some rolls, and store results in a list.
results = []
for roll_num in range(1000):
    result = die_1.roll() + die_2.roll()   # (1)
    results.append(result)

# Analyze the results.
frequencies = []
max_result = die_1.num_sides + die_2.num_sides   # (2)
poss_results = range(2, max_result+1)            # (3)
for value in poss_results:
    frequency = results.count(value)
    frequencies.append(frequency)

# Visualize the results.
title = "Results of Rolling Two D6 Dice 1,000 Times"
labels = {'x': 'Result', 'y': 'Frequency of Result'}
fig = px.bar(x=poss_results, y=frequencies, title=title, labels=labels)
fig.show()

1	After creating two instances of `Die`, we roll the dice and calculate the sum of the two dice for each roll. The smallest possible result (2) is the sum of the smallest number on each die.
2	The largest possible result (12) is the sum of the largest number on each die, which we assign to `max_result`.
3	The variable `max_result` makes the code for generating `poss_results` much easier to read. This code allows us to simulate rolling a pair of dice with any number of sides.

This graph shows the approximate distribution of results you’re likely to get when you roll a pair of D6 dice. As you can see, you’re least likely to roll a 2 or a 12 and most likely to roll a 7. This happens because there are six ways to roll a 7: 1 and 6, 2 and 5, 3 and 4, 4 and 3, 5 and 2, and 6 and 1.
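The count of six ways to roll a 7 can be checked by listing every combination of two D6 faces. This is a small sketch that enumerates the 36 possible pairs instead of simulating rolls; the variable names are chosen for illustration only.

```python
from collections import Counter
from itertools import product

faces = range(1, 7)

# Count how many of the 36 (die_1, die_2) pairs produce each sum.
ways = Counter(a + b for a, b in product(faces, repeat=2))

for total in sorted(ways):
    print(f"{total:>2}: {ways[total]} way(s), probability {ways[total] / 36:.3f}")

# List the pairs that add up to 7.
sevens = [(a, b) for a, b in product(faces, repeat=2) if a + b == 7]
print(sevens)
```

Because these are exact counts, the simulated frequencies should drift toward these proportions as the number of rolls increases.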

Further Customizations

There’s one issue that we should address with the plot we just generated. Now that there are 11 bars, the default layout settings for the x-axis leave some of the bars unlabeled. Plotly has an update_layout() method that can be used to make a wide variety of updates to a figure after it’s been created. Here’s how to tell Plotly to give each bar its own label:

dice_visual.py

# --snip--
fig = px.bar(x=poss_results, y=frequencies, title=title, labels=labels)

# Further customize chart.
fig.update_layout(xaxis_dtick=1)

fig.show()

The update_layout() method acts on the fig object, which represents the overall chart. Here we use the xaxis_dtick argument, which specifies the distance between tick marks on the x-axis. We set that spacing to 1, so that every bar is labeled.

Rolling Dice of Different Sizes

Let’s create a six-sided die and a ten-sided die, and see what happens when we roll them 50,000 times:

dice_visual_d6d10.py

import plotly.express as px

from die import Die

# Create a D6 and a D10.
die_1 = Die()
die_2 = Die(10)     # (1)

# Make some rolls, and store results in a list.
results = []
for roll_num in range(50_000):
    result = die_1.roll() + die_2.roll()
    results.append(result)

# Analyze the results.
# --snip--

# Visualize the results.
title = "Results of Rolling a D6 and a D10 50,000 Times"   # (2)
labels = {'x': 'Result', 'y': 'Frequency of Result'}
# --snip--

1	To make a D10, we pass the argument 10 when creating the second `Die` instance.
2	We change the title of the graph as well.

Instead of one most likely result, there are five such results. This happens because there’s still only one way to roll the smallest value (1 and 1) and the largest value (6 and 10), but the smaller die limits the number of ways you can generate the middle numbers. There are six ways to roll a 7, 8, 9, 10, or 11, so these are the most common results, and you’re equally likely to roll any one of them.
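The same reasoning can be checked for any pair of dice by counting combinations directly. In this sketch, count_ways() and most_likely_sums() are hypothetical helper names, not part of the chapter's code.

```python
from collections import Counter


def count_ways(sides_1, sides_2):
    """Count the face combinations that produce each possible sum."""
    ways = Counter()
    for a in range(1, sides_1 + 1):
        for b in range(1, sides_2 + 1):
            ways[a + b] += 1
    return ways


def most_likely_sums(sides_1, sides_2):
    """Return the sums that can be rolled in the greatest number of ways."""
    ways = count_ways(sides_1, sides_2)
    most_ways = max(ways.values())
    return sorted(total for total, n in ways.items() if n == most_ways)


print(most_likely_sums(6, 10))   # a D6 and a D10
print(most_likely_sums(6, 6))    # two D6 dice, for comparison
```

For a D6 and a D10 this reports the five sums from 7 through 11, each reachable in six ways out of 60 combinations.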

Saving Figures

When you have a figure you like, you can always save the chart as an HTML file through your browser. But you can also do so programmatically. To save your chart as an HTML file, replace the call to fig.show() with a call to fig.write_html():

fig.write_html('dice_visual_d6d10.html')

The write_html() method requires one argument: the name of the file to write to. If you provide only a filename, the file is saved in the current working directory, which is the same directory as the .py file when you run the program from there. You can also call write_html() with a Path object and write the output file anywhere you want on your system.
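One way to use a Path object here is to build the output location first and then pass it to write_html(). The helper below is a sketch under assumptions: save_chart() and the charts folder name are made up for illustration, and fig can be any object with a write_html() method, such as the figure returned by px.bar().

```python
from pathlib import Path


def save_chart(fig, folder, filename):
    """Hypothetical helper: write fig as HTML inside folder; return the path."""
    out_dir = Path(folder)
    # Create the folder (and any missing parents) if it doesn't exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    fig.write_html(path)
    return path
```

In dice_visual_d6d10.py, the call might look like save_chart(fig, 'charts', 'dice_visual_d6d10.html') in place of fig.show().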

Try It Yourself

15-6. Two D8s: Create a simulation showing what happens when you roll two eight-sided dice 1,000 times. Try to picture what you think the visualization will look like before you run the simulation, then see if your intuition was correct. Gradually increase the number of rolls until you start to see the limits of your system’s capabilities.

15-7. Three Dice: When you roll three D6 dice, the smallest number you can roll is 3 and the largest number is 18. Create a visualization that shows what happens when you roll three D6 dice.

15-8. Multiplication: When you roll two dice, you usually add the two numbers together to get the result. Create a visualization that shows what happens if you multiply these numbers by each other instead.

15-9. Die Comprehensions: For clarity, the listings in this section use the long form of for loops. If you’re comfortable using list comprehensions, try writing a comprehension for one or both of the loops in each of these programs.

15-10. Practicing with Both Libraries: Try using Matplotlib to make a die-rolling visualization, and use Plotly to make the visualization for a random walk. (You’ll need to consult the documentation for each library to complete this exercise.)

Summary

In this chapter, you learned to generate datasets and create visualizations of that data. You created simple plots with Matplotlib and used a scatter plot to explore random walks. You also created a histogram with Plotly, and used it to explore the results of rolling dice of different sizes.

Generating your own datasets with code is an interesting and powerful way to model and explore a wide variety of real-world situations. As you continue to work through the data visualization projects that follow, keep an eye out for situations you might be able to model with code.

In Chapter 16, you’ll download data from online sources and continue to use Matplotlib and Plotly to explore that data.

Applied Exercises: Ch 15 — Generating Data

These exercises apply the chapter's data generation and analysis concepts (sequential data generation, random walk simulation, frequency analysis, and distribution modeling) to infrastructure, security, and language learning contexts. No Matplotlib or Plotly installation is required; results are printed or written to CSV/JSON for later visualization.

Domus Digitalis / Homelab

D15-1. Service Uptime Squares: Generate a list of simulated uptime percentages for nodes 1 through 20, calculated as min(100, node ** 1.5) rounded to 2 decimal places. Print the list and use min(), max(), and sum() / len() to compute statistics. This mirrors the square number sequence generation pattern.

D15-2. Network Latency Walk: Implement a LatencyWalk class modeled on RandomWalk. Initialize with num_points=1000 and a starting latency of 20.0 ms. In fill_walk(), each step adds a random delta from choice([-2, -1, 0, 1, 2, 3]) but clamps the value between 1 and 200 (no negative latency). After generating the walk, print the min, max, and average latency.

D15-3. VLAN Traffic Frequency: Create a TrafficDie class analogous to Die that returns a random VLAN ID from a weighted list (INFRA=100 appears 3×, SECURITY=110 appears 2×, SERVICES=120, DATA=10, GUEST=30 each appear once). Roll it 1000 times. Count the frequency of each VLAN ID using .count(). Print the frequency table.

D15-4. BGP Event Histogram Data: Simulate 500 BGP session events where each event is one of ['established', 'idle', 'active', 'connect', 'opensent', 'openconfirm'], chosen randomly. Count the frequency of each state. Print a text-based bar chart where each state is followed by a bar of # characters proportional to its count (scale to max 40 chars).

D15-5. Stack Health Time Series: Generate a time series of 100 health check results for a 5-service stack. Each check, each service has an 80% chance of being 'up' and 20% chance of being 'degraded'. Store results as a list of dicts. Print the percentage of checks where all 5 services were up simultaneously.
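The clamping step in D15-2 recurs in several later exercises, so here is one possible starting point. It is a sketch written as plain functions rather than the class the exercise asks for; clamp() and clamped_walk() are hypothetical names.

```python
from random import choice


def clamp(value, low, high):
    """Keep value inside the range low..high (inclusive)."""
    return max(low, min(high, value))


def clamped_walk(start, deltas, num_points, low, high):
    """Build a walk of num_points values, clamping after every step."""
    values = [start]
    while len(values) < num_points:
        next_value = values[-1] + choice(deltas)
        values.append(clamp(next_value, low, high))
    return values


# Parameters taken from D15-2: latency in ms, kept between 1 and 200.
walk = clamped_walk(20.0, [-2, -1, 0, 1, 2, 3], 1000, 1, 200)
print(f"min={min(walk)} max={max(walk)} avg={sum(walk) / len(walk):.2f}")
```

Clamping after each step, rather than once at the end, means every later step starts from a valid value.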

CHLA / ISE / Network Security

C15-1. Auth Success Rate Curve: Generate auth success counts for ISE policy sets 1 through 10, calculated as int(1000 * (1 - 0.05 * i)) where i is the policy set index. This simulates decreasing success rate as policies get stricter. Print the list and compute min, max, and mean.

C15-2. Log Volume Walk: Implement a LogVolumeWalk class modeled on RandomWalk. Initialize with num_points=500 and a starting volume of 1000. Each step adds a random delta from choice([-50, -25, 0, 25, 50, 100]), clamped between 0 and 10000. After generating the walk, print the min, max, and average volume across all points.

C15-3. Syslog Severity Frequency: Create a SyslogDie class that returns a random severity level (0–7) with realistic weights (0=1, 1=2, 2=5, 3=10, 4=20, 5=15, 6=30, 7=17 occurrences in the pool). Roll it 1000 times. Count the frequency of each severity. Print the frequency table with severity labels.

C15-4. Endpoint Auth Distribution: Simulate 500 endpoint authentication attempts where each attempt randomly selects one of ['EAP-TLS', 'EAP-TEAP', 'MSCHAPv2', 'MAB', 'WebAuth']. Count the frequency of each protocol. Print a text-based bar chart proportional to frequency (scale to max 40 chars).

C15-5. Pipeline Throughput Time Series: Generate a 200-step time series of pipeline throughput values starting at 5000 events/sec. Each step, throughput changes by a random delta from randint(-200, 300), clamped between 0 and 20000. Count how many steps the throughput was above 8000 (high load) and below 2000 (near-stall). Print a summary.
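The text-based bar charts in D15-4 and C15-4 depend on scaling each count against the largest one. A minimal sketch of that scaling follows; text_bar_chart() is a hypothetical name and the counts are made-up sample values, not simulation output.

```python
def text_bar_chart(counts, max_width=40):
    """Return one line per label, with the largest count drawn at max_width."""
    largest = max(counts.values(), default=0)
    label_width = max((len(str(label)) for label in counts), default=0)
    lines = []
    for label, count in counts.items():
        # Scale relative to the largest count; avoid dividing by zero.
        bar_length = round(count / largest * max_width) if largest else 0
        lines.append(f"{label:<{label_width}} | {'#' * bar_length} {count}")
    return lines


sample_counts = {"EAP-TLS": 120, "MAB": 60, "WebAuth": 30}
for line in text_bar_chart(sample_counts):
    print(line)
```

Scaling against the largest count keeps the longest bar at exactly 40 characters, whatever the number of events.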

General Sysadmin / Linux

L15-1. Disk Usage Squares: Generate a list of simulated disk usage percentages for filesystems 1 through 15, calculated as min(100, fs ** 1.8 / 10) rounded to 2 decimal places. Print the list and compute min, max, and mean. Flag any filesystem above 80%.

L15-2. CPU Load Walk: Implement a CPULoadWalk class modeled on RandomWalk. Initialize with num_points=300 and a starting load of 20.0. Each step adds a random delta from choice([-3, -2, -1, 0, 1, 2, 5]), clamped between 0 and 100. After generating the walk, print the min, max, and average CPU load. Count steps above 80% (critical).

L15-3. Package Install Frequency: Create an InstallDie class that returns a random exit code from [0, 0, 0, 0, 1, 2] (0=success, 1=partial, 2=failure). Roll it 500 times. Count the frequency of each exit code. Print the frequency table with labels and a success rate percentage.

L15-4. Service Restart Distribution: Simulate 300 service restart events where each restart selects a restart reason from ['OOM', 'segfault', 'config_change', 'manual', 'dependency_failure', 'watchdog']. Count the frequency of each reason. Print a text-based bar chart.

L15-5. Filesystem Check Time Series: Generate a 150-step time series tracking the number of active filesystem errors per check, starting at 0. Each step adds randint(0, 3) new errors and resolves randint(0, 1) of the active ones; clamp the active count to a minimum of 0. Count the number of steps with more than 5 active errors. Print a summary including the total number of errors added over the run.
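The weighted dice in D15-3, C15-3, L15-3, and E15-3 can all be built from a pool in which each outcome is repeated according to its weight. The class below is one possible sketch; PoolDie is a hypothetical name, and the weights shown are the ones from L15-3.

```python
from random import choice


class PoolDie:
    """A die that picks from a pool where outcomes repeat by weight."""

    def __init__(self, weights):
        # weights maps each outcome to how many times it appears in the pool.
        self.pool = []
        for outcome, occurrences in weights.items():
            self.pool.extend([outcome] * occurrences)

    def roll(self):
        """Return one outcome chosen at random from the pool."""
        return choice(self.pool)


# L15-3: exit code 0 appears four times; 1 and 2 appear once each.
die = PoolDie({0: 4, 1: 1, 2: 1})
results = [die.roll() for _ in range(500)]
for exit_code in (0, 1, 2):
    print(exit_code, results.count(exit_code))
```

The standard library's random.choices() accepts a weights argument and would be an alternative to building the pool by hand.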

Spanish / DELE C2

E15-1. Vocabulary Growth Curve: Generate a list of cumulative vocabulary words learned across 20 study sessions, calculated as int(50 * session ** 0.8). Print the list and compute min, max, and growth rate from session 1 to 20. This mirrors the square sequence pattern.

E15-2. Study Intensity Walk: Implement a StudyWalk class modeled on RandomWalk. Initialize with num_points=200 and a starting intensity of 50 (0–100 scale). Each step adds a random delta from choice([-5, -3, 0, 3, 5, 8]), clamped between 0 and 100. After generating the walk, print the min, max, and average intensity across all steps.

E15-3. Chapter Difficulty Frequency: Create a ChapterDie class that returns a difficulty rating (1–5) for Don Quijote chapters, with weights approximating a realistic distribution (1=5, 2=15, 3=40, 4=30, 5=10 in the pool). Roll it 500 times. Count the frequency of each rating. Print the frequency table with labels.

E15-4. DELE Exercise Type Distribution: Simulate 400 DELE practice sessions where each session randomly selects one of ['comprensión lectora', 'comprensión auditiva', 'expresión escrita', 'expresión oral', 'léxico', 'gramática']. Count the frequency of each exercise type. Print a text-based bar chart proportional to frequency.

E15-5. Vocabulary Retention Time Series: Generate a 100-step time series tracking retained vocabulary words, starting at 500. Each step adds randint(0, 15) new words and subtracts randint(0, 5) (forgot), clamped to 0 minimum. After generating the series, print the min, max, final count, and net words gained over all 100 steps.