Chapter 16: Downloading Data
Source: Python Crash Course, 3rd Edition by Eric Matthes
In this chapter, you’ll download datasets from online sources and create working visualizations of that data. You can find an incredible variety of data online, much of which hasn’t been examined thoroughly. The ability to analyze this data allows you to discover patterns and connections that no one else has found.
We’ll access and visualize data stored in two common data formats: CSV
and JSON. We’ll use Python’s csv module to process weather data stored
in the CSV format and analyze high and low temperatures over time in two
different locations. We’ll then use Matplotlib to generate a chart based
on our downloaded data to display variations in temperature in two
dissimilar environments: Sitka, Alaska, and Death Valley, California.
Later in the chapter, we’ll use the json module to access earthquake
data stored in the GeoJSON format and use Plotly to draw a world map
showing the locations and magnitudes of recent earthquakes.
By the end of this chapter, you’ll be prepared to work with various types of datasets in different formats, and you’ll have a deeper understanding of how to build complex visualizations. Being able to access and visualize online data is essential to working with a wide variety of real-world datasets.
The CSV File Format
One simple way to store data in a text file is to write the data as a series of values separated by commas, called comma-separated values. The resulting files are CSV files. For example, here’s a chunk of weather data in CSV format:
"USW00025333","SITKA AIRPORT, AK US","2021-01-01",,"44","40"
This is an excerpt of weather data from January 1, 2021, in Sitka, Alaska. It includes the day’s high and low temperatures, as well as a number of other measurements from that day. CSV files can be tedious for humans to read, but programs can process and extract information from them quickly and accurately.
We’ll begin with a small set of CSV-formatted weather data recorded in
Sitka; it is available in this book’s resources at
ehmatthes.github.io/pcc_3e. Make a folder called weather_data
inside the folder where you’re saving this chapter’s programs. Copy the
file sitka_weather_07-2021_simple.csv into this new folder.
Note: The weather data in this project was originally downloaded from ncdc.noaa.gov/cdo-web.
Parsing the CSV File Headers
Python’s csv module in the standard library parses the lines in a CSV
file and allows us to quickly extract the values we’re interested in.
Let’s start by examining the first line of the file, which contains a
series of headers for the data:
from pathlib import Path
import csv
path = Path('weather_data/sitka_weather_07-2021_simple.csv') (1)
lines = path.read_text(encoding='utf-8').splitlines()
reader = csv.reader(lines) (2)
header_row = next(reader) (3)
print(header_row)
| 1 | We first import Path and the csv module. We then build a Path
object that looks in the weather_data folder, and points to the
specific weather data file we want to work with. We read the file
and chain the splitlines() method to get a list of all lines in
the file, which we assign to lines. |
| 2 | We build a reader object by calling the function csv.reader() and
passing it the list of lines from the CSV file. This is an object
that can be used to parse each line in the file. |
| 3 | When given a reader object, the next() function returns the next
line in the file, starting from the beginning of the file. Here we
call next() only once, so we get the first line of the file, which
contains the file headers. We assign the data that’s returned to
header_row. |
Here’s the output:
['STATION', 'NAME', 'DATE', 'TAVG', 'TMAX', 'TMIN']
The reader object processes the first line of comma-separated values in
the file and stores each value as an item in a list. The header STATION
represents the code for the weather station that recorded this data. The
NAME header indicates that the second value in each line is the name
of the weather station. The rest of the headers specify what kinds of
information were recorded in each reading. The data we’re most
interested in for now are the date (DATE), the high temperature
(TMAX), and the low temperature (TMIN).
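To see how a reader object and next() behave on their own, here's a minimal sketch that uses a couple of hypothetical in-memory lines in place of a real file:

```python
import csv

# Two hypothetical CSV lines, standing in for a file's contents.
lines = ['STATION,NAME,DATE', 'USW00025333,SITKA,2021-07-01']

reader = csv.reader(lines)
print(next(reader))  # ['STATION', 'NAME', 'DATE']
print(next(reader))  # ['USW00025333', 'SITKA', '2021-07-01']
```

Each call to next() advances the reader by one line, which is why a single call returns the header row.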
Printing the Headers and Their Positions
To make it easier to understand the file header data, let’s print each header and its position in the list:
# --snip--
reader = csv.reader(lines)
header_row = next(reader)
for index, column_header in enumerate(header_row):
    print(index, column_header)
The enumerate() function returns both the index of each item and the
value of each item as you loop through a list.
Here’s the output showing the index of each header:
0 STATION
1 NAME
2 DATE
3 TAVG
4 TMAX
5 TMIN
We can see that the dates and their high temperatures are stored in columns 2 and 4. To explore this data, we’ll process each row of data and extract the values with the indexes 2 and 4.
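If you want to see enumerate() in isolation, here's a minimal sketch using the header values from this file:

```python
# enumerate() yields (index, value) pairs as you loop over a sequence.
headers = ['STATION', 'NAME', 'DATE', 'TAVG', 'TMAX', 'TMIN']
print(list(enumerate(headers))[:2])  # [(0, 'STATION'), (1, 'NAME')]
```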
Extracting and Reading Data
Now that we know which columns of data we need, let’s read in some of that data. First, we’ll read in the high temperature for each day:
# --snip--
reader = csv.reader(lines)
header_row = next(reader)
# Extract high temperatures.
highs = [] (1)
for row in reader: (2)
    high = int(row[4]) (3)
    highs.append(high)

print(highs)
| 1 | We make an empty list called highs. |
| 2 | We then loop through the remaining rows in the file. The reader object continues from where it left off in the CSV file and automatically returns each line following its current position. Because we’ve already read the header row, the loop will begin at the second line where the actual data begins. |
| 3 | On each pass through the loop we pull the data from index 4,
corresponding to the header TMAX, and assign it to the variable
high. We use the int() function to convert the data, which is
stored as a string, to a numerical format so we can use it. We then
append this value to highs. |
The following listing shows the data now stored in highs:
[61, 60, 66, 60, 65, 59, 58, 58, 57, 60, 60, 60, 57, 58, 60, 61,
63, 63, 70, 64, 59, 63, 61, 58, 59, 64, 62, 70, 70, 73, 66]
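As an aside, the standard library also offers csv.DictReader, which maps each row to the headers so you can look up values by name instead of by index. It's not the approach used in this chapter, but a minimal sketch, using hypothetical in-memory lines shaped like the Sitka file, looks like this:

```python
import csv

# Hypothetical lines mirroring the structure of the Sitka data file.
lines = [
    'STATION,NAME,DATE,TAVG,TMAX,TMIN',
    'USW00025333,"SITKA AIRPORT, AK US",2021-07-01,,61,53',
]

# DictReader uses the first line as keys for every following row.
reader = csv.DictReader(lines)
highs = [int(row['TMAX']) for row in reader]
print(highs)  # [61]
```

Looking values up by header name can make code more robust when column positions change between files.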
We’ve extracted the high temperature for each date and stored each value in a list. Now let’s create a visualization of this data.
Plotting Data in a Temperature Chart
To visualize the temperature data we have, we’ll first create a simple plot of the daily highs using Matplotlib, as shown here:
from pathlib import Path
import csv
import matplotlib.pyplot as plt
path = Path('weather_data/sitka_weather_07-2021_simple.csv')
lines = path.read_text(encoding='utf-8').splitlines()
# --snip--
# Plot the high temperatures.
plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.plot(highs, color='red') (1)
# Format plot.
ax.set_title("Daily High Temperatures, July 2021", fontsize=24) (2)
ax.set_xlabel('', fontsize=16) (3)
ax.set_ylabel("Temperature (F)", fontsize=16)
ax.tick_params(labelsize=16)
plt.show()
| 1 | We pass the list of highs to plot() and pass color='red' to
plot the points in red. (We’ll plot the highs in red and the lows in
blue.) |
| 2 | We then specify formatting details, such as the title, font size, and labels, just as we did in Chapter 15. |
| 3 | Because we have yet to add the dates, we won’t label the x-axis,
but ax.set_xlabel() does modify the font size to make the default
labels more readable. |
The datetime Module
Let’s add dates to our graph to make it more useful. The first date from the weather data file is in the second row of the file:
"USW00025333","SITKA AIRPORT, AK US","2021-07-01",,"61","53"
The data will be read in as a string, so we need a way to convert the
string "2021-07-01" to an object representing this date. We can
construct an object representing July 1, 2021, using the strptime()
method from the datetime module. Let’s see how strptime() works in a
terminal session:
>>> from datetime import datetime
>>> first_date = datetime.strptime('2021-07-01', '%Y-%m-%d')
>>> print(first_date)
2021-07-01 00:00:00
We first import the datetime class from the datetime module. Then we
call the method strptime() with the string containing the date we want
to process as its first argument. The second argument tells Python how
the date is formatted. In this example, '%Y-' tells Python to look for
a four-digit year before the first dash; '%m-' indicates a two-digit
month before the second dash; and '%d' means the last part of the
string is the day of the month, from 1 to 31.
| Argument | Meaning |
|---|---|
| %A | Weekday name, such as Monday |
| %B | Month name, such as January |
| %m | Month, as a number (01 to 12) |
| %d | Day of the month, as a number (01 to 31) |
| %Y | Four-digit year, such as 2019 |
| %y | Two-digit year, such as 19 |
| %H | Hour, in 24-hour format (00 to 23) |
| %I | Hour, in 12-hour format (01 to 12) |
| %p | AM or PM |
| %M | Minutes (00 to 59) |
| %S | Seconds (00 to 61) |
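Here's a quick terminal-style sketch that uses a few of these codes; strftime() performs the reverse conversion, from a datetime object back to a formatted string:

```python
from datetime import datetime

# Parse a timestamp that uses several of the formatting codes.
moment = datetime.strptime('2021-07-01 14:30:00', '%Y-%m-%d %H:%M:%S')
print(moment.hour)                   # 14

# strftime() formats a datetime object back into a string.
print(moment.strftime('%A, %B %d'))  # Thursday, July 01
```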
Plotting Dates
We can improve our plot by extracting dates for the daily high temperature readings, and using these dates on the x-axis:
from pathlib import Path
import csv
from datetime import datetime
import matplotlib.pyplot as plt
path = Path('weather_data/sitka_weather_07-2021_simple.csv')
lines = path.read_text(encoding='utf-8').splitlines()
reader = csv.reader(lines)
header_row = next(reader)
# Extract dates and high temperatures.
dates, highs = [], [] (1)
for row in reader:
    current_date = datetime.strptime(row[2], '%Y-%m-%d') (2)
    high = int(row[4])
    dates.append(current_date)
    highs.append(high)
# Plot the high temperatures.
plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.plot(dates, highs, color='red') (3)
# Format plot.
ax.set_title("Daily High Temperatures, July 2021", fontsize=24)
ax.set_xlabel('', fontsize=16)
fig.autofmt_xdate() (4)
ax.set_ylabel("Temperature (F)", fontsize=16)
ax.tick_params(labelsize=16)
plt.show()
| 1 | We create two empty lists to store the dates and high temperatures from the file. |
| 2 | We convert the data containing the date information (row[2]) to a
datetime object and append it to dates. |
| 3 | We pass the dates and the high temperature values to plot(). |
| 4 | The call to fig.autofmt_xdate() draws the date labels diagonally
to prevent them from overlapping. |
Plotting a Longer Timeframe
With our graph set up, let’s include additional data to get a more
complete picture of the weather in Sitka. Copy the file
sitka_weather_2021_simple.csv, which contains a full year’s worth of
weather data for Sitka, to the folder where you’re storing the data for
this chapter’s programs.
Now we can generate a graph for the entire year’s weather:
# --snip--
path = Path('weather_data/sitka_weather_2021_simple.csv')
lines = path.read_text(encoding='utf-8').splitlines()
# --snip--
# Format plot.
ax.set_title("Daily High Temperatures, 2021", fontsize=24)
ax.set_xlabel('', fontsize=16)
# --snip--
We modify the filename to use the new data file and we update the title of our plot to reflect the change in its content.
Plotting a Second Data Series
We can make our graph even more useful by including the low temperatures. We need to extract the low temperatures from the data file and then add them to our graph, as shown here:
# --snip--
reader = csv.reader(lines)
header_row = next(reader)
# Extract dates, and high and low temperatures.
dates, highs, lows = [], [], [] (1)
for row in reader:
    current_date = datetime.strptime(row[2], '%Y-%m-%d')
    high = int(row[4])
    low = int(row[5]) (2)
    dates.append(current_date)
    highs.append(high)
    lows.append(low)
# Plot the high and low temperatures.
plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.plot(dates, highs, color='red')
ax.plot(dates, lows, color='blue') (3)
# Format plot.
ax.set_title("Daily High and Low Temperatures, 2021", fontsize=24) (4)
# --snip--
| 1 | We add the empty list lows to hold low temperatures. |
| 2 | We extract and store the low temperature for each date from the
sixth position in each row (row[5]). |
| 3 | We add a call to plot() for the low temperatures and color these
values blue. |
| 4 | We update the title. |
Shading an Area in the Chart
Having added two data series, we can now examine the range of
temperatures for each day. Let’s add a finishing touch to the graph by
using shading to show the range between each day’s high and low
temperatures. To do so, we’ll use the fill_between() method, which
takes a series of x-values and two series of y-values and fills the
space between the two series of y-values:
# --snip--
# Plot the high and low temperatures.
plt.style.use('seaborn-v0_8')
fig, ax = plt.subplots()
ax.plot(dates, highs, color='red', alpha=0.5) (1)
ax.plot(dates, lows, color='blue', alpha=0.5)
ax.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1) (2)
# --snip--
| 1 | The alpha argument controls a color’s transparency. An alpha value
of 0 is completely transparent, and a value of 1 (the default) is
completely opaque. By setting alpha to 0.5, we make the red and
blue plot lines appear lighter. |
| 2 | We pass fill_between() the list dates for the x-values and then
the two y-value series highs and lows. The facecolor argument
determines the color of the shaded region; we give it a low alpha
value of 0.1 so the filled region connects the two data series
without distracting from the information they represent. |
The shading helps make the range between the two datasets immediately apparent.
Error Checking
We should be able to run the sitka_highs_lows.py code using data for
any location. But some weather stations collect different data than
others, and some occasionally malfunction and fail to collect some of
the data they’re supposed to. Missing data can result in exceptions that
crash our programs, unless we handle them properly.
For example, let’s see what happens when we attempt to generate a temperature plot for Death Valley, California. First, let’s run the code to see the headers that are included in this data file:
from pathlib import Path
import csv
path = Path('weather_data/death_valley_2021_simple.csv')
lines = path.read_text(encoding='utf-8').splitlines()
reader = csv.reader(lines)
header_row = next(reader)
for index, column_header in enumerate(header_row):
    print(index, column_header)
Here’s the output:
0 STATION
1 NAME
2 DATE
3 TMAX
4 TMIN
5 TOBS
The date is in the same position, at index 2. But the high and low
temperatures are at indexes 3 and 4, so we’ll need to change the indexes
in our code to reflect these new positions. Instead of including an
average temperature reading for the day, this station includes TOBS, a
reading for a specific observation time.
When we modify the program to read from the Death Valley data file and
change the indexes to correspond to this file’s TMAX and TMIN
positions, we get an error:
Traceback (most recent call last):
File "death_valley_highs_lows.py", line 17, in <module>
high = int(row[3])
ValueError: invalid literal for int() with base 10: '' (1)
| 1 | The traceback tells us that Python can’t process the high temperature
for one of the dates because it can’t turn an empty string ('')
into an integer. |
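You can reproduce this failure in isolation with a single call to int():

```python
# int() can't convert an empty string, so it raises ValueError.
try:
    high = int('')
except ValueError:
    print("Empty string: no temperature recorded.")
```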
We’ll run error-checking code when the values are being read from the CSV file to handle exceptions that might arise. Here’s how to do this:
# --snip--
for row in reader:
    current_date = datetime.strptime(row[2], '%Y-%m-%d')
    try: (1)
        high = int(row[3])
        low = int(row[4])
    except ValueError:
        print(f"Missing data for {current_date}") (2)
    else:
        dates.append(current_date) (3)
        highs.append(high)
        lows.append(low)
# Plot the high and low temperatures.
# --snip--
# Format plot.
title = "Daily High and Low Temperatures, 2021\nDeath Valley, CA" (4)
ax.set_title(title, fontsize=20)
ax.set_xlabel('', fontsize=16)
# --snip--
| 1 | Each time we examine a row, we try to extract the date and the high and low temperature. |
| 2 | If any data is missing, Python will raise a ValueError and we
handle it by printing an error message that includes the date of the
missing data. |
| 3 | If all data for a date is retrieved without error, the else block
will run and the data will be appended to the appropriate lists. |
| 4 | Because we’re plotting information for a new location, we update the title to include the location on the plot, and we use a smaller font size to accommodate the longer title. |
When you run death_valley_highs_lows.py now, you’ll see that only one
date had missing data:
Missing data for 2021-05-04 00:00:00
Because the error is handled appropriately, our code is able to generate
a plot, which skips over the missing data. Many datasets you work with
will have missing, improperly formatted, or incorrect data. You can use
the tools you learned in the first half of this book to handle these
situations. Here we used a try-except-else block to handle missing
data. Sometimes you’ll use continue to skip over some data, or use
remove() or del to eliminate some data after it’s been extracted.
Use any approach that works, as long as the result is a meaningful,
accurate visualization.
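As a sketch of the continue approach, here's a version that skips incomplete rows, assuming rows shaped like the Death Valley file (high at index 3, low at index 4) and using hypothetical data:

```python
# Hypothetical rows; the second is missing its temperature readings.
rows = [
    ['USC00042319', 'DEATH VALLEY, CA US', '2021-05-03', '90', '62', '81'],
    ['USC00042319', 'DEATH VALLEY, CA US', '2021-05-04', '', '', ''],
]

highs, lows = [], []
for row in rows:
    if not row[3] or not row[4]:
        continue  # Skip rows with missing temperature data.
    highs.append(int(row[3]))
    lows.append(int(row[4]))

print(highs, lows)  # [90] [62]
```

This avoids the try-except-else block entirely, at the cost of only catching empty strings rather than any badly formatted value.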
Downloading Your Own Data
To download your own weather data, follow these steps:
1. Visit the NOAA Climate Data Online site at www.ncdc.noaa.gov/cdo-web. In the Discover Data By section, click Search Tool. In the Select a Dataset box, choose Daily Summaries.
2. Select a date range, and in the Search For section, choose ZIP Codes. Enter the ZIP code you're interested in and click Search.
3. On the next page, you'll see a map and some information about the area you're focusing on. Below the location name, click View Full Details, or click the map and then click Full Details.
4. Scroll down and click Station List to see the weather stations that are available in this area. Click one of the station names and then click Add to Cart. This data is free, even though the site uses a shopping cart icon. In the upper-right corner, click the cart.
5. In Select the Output Format, choose Custom GHCN-Daily CSV. Make sure the date range is correct and click Continue.
6. On the next page, you can select the kinds of data you want. Make your choices and then click Continue.
7. On the last page, enter your email address and click Submit Order. You'll receive a confirmation that your order was received, and in a few minutes, you should receive another email with a link to download your data.
The data you download should be structured just like the data we worked with in this section. It might have different headers than those you saw in this section, but if you follow the same steps we used here, you should be able to generate visualizations of the data you’re interested in.
Try It Yourself
16-1. Sitka Rainfall: Sitka is located in a temperate rainforest, so
it gets a fair amount of rainfall. In the data file
sitka_weather_2021_full.csv is a header called PRCP, which represents
daily rainfall amounts. Make a visualization focusing on the data in
this column. You can repeat the exercise for Death Valley if you’re
curious how little rainfall occurs in a desert.
16-2. Sitka-Death Valley Comparison: The temperature scales on the Sitka and Death Valley graphs reflect the different data ranges. To accurately compare the temperature range in Sitka to that of Death Valley, you need identical scales on the y-axis. Change the settings for the y-axis on one or both of the charts in the figures, then make a direct comparison between temperature ranges in Sitka and Death Valley (or any two places you want to compare).
16-3. San Francisco: Are temperatures in San Francisco more like temperatures in Sitka or temperatures in Death Valley? Download some data for San Francisco, and generate a high-low temperature plot for San Francisco to make a comparison.
16-4. Automatic Indexes: In this section, we hardcoded the indexes
corresponding to the TMIN and TMAX columns. Use the header row to
determine the indexes for these values, so your program can work for
Sitka or Death Valley. Use the station name to automatically generate an
appropriate title for your graph as well.
16-5. Explore: Generate a few more visualizations that examine any other weather aspect you’re interested in for any locations you’re curious about.
Mapping Global Datasets: GeoJSON Format
In this section, you’ll download a dataset representing all the
earthquakes that have occurred in the world during the previous month.
Then you’ll make a map showing the location of these earthquakes and how
significant each one was. Because the data is stored in the GeoJSON
format, we’ll work with it using the json module. Using Plotly’s
scatter_geo() plot, you’ll create visualizations that clearly show the
global distribution of earthquakes.
Downloading Earthquake Data
Make a folder called eq_data inside the folder where you're saving
this chapter's programs. Copy the file eq_data_1_day_m1.geojson into this
new folder. Earthquakes are categorized by their magnitude on the Richter
scale. This file includes data for all earthquakes with a magnitude M1
or greater that took place in the last 24 hours (at the time of this
writing). This data comes from one of the United States Geological
Survey’s earthquake data feeds, at
earthquake.usgs.gov/earthquakes/feed.
Examining GeoJSON Data
When you open eq_1_day_m1.geojson, you’ll see that it’s very dense and
hard to read. The json module provides a variety of tools for exploring
and working with JSON data. Some of these tools will help us reformat the
file so we can look at the raw data more easily before we work with it
programmatically.
Let’s start by loading the data and displaying it in a format that’s easier to read. This is a long data file, so instead of printing it, we’ll rewrite the data to a new file. Then we can open that file and scroll back and forth through the data more easily:
from pathlib import Path
import json
# Read data as a string and convert to a Python object.
path = Path('eq_data/eq_data_1_day_m1.geojson')
contents = path.read_text(encoding='utf-8')
all_eq_data = json.loads(contents) (1)
# Create a more readable version of the data file.
path = Path('eq_data/readable_eq_data.geojson') (2)
readable_contents = json.dumps(all_eq_data, indent=4) (3)
path.write_text(readable_contents)
| 1 | We read the data file as a string, and use json.loads() to convert
the string representation of the file to a Python object. In this
case, the entire dataset is converted to a single dictionary, which
we assign to all_eq_data. |
| 2 | We then define a new path where we can write this same data in a more readable format. |
| 3 | The json.dumps() function can take an optional indent argument,
which tells it how much to indent nested elements in the data
structure. |
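You can see the effect of the indent argument with a small dictionary:

```python
import json

data = {'type': 'FeatureCollection', 'features': []}
print(json.dumps(data))            # one dense line
print(json.dumps(data, indent=4))  # one key per line, nested items indented
```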
When you look in your eq_data directory and open the file
readable_eq_data.geojson, here's the first part of what you'll see:
{
"type": "FeatureCollection",
"metadata": { (1)
"generated": 1649052296000,
"url": "https://earthquake.usgs.gov/earthquakes/.../1.0_day.geojson",
"title": "USGS Magnitude 1.0+ Earthquakes, Past Day",
"status": 200,
"api": "1.10.3",
"count": 160
},
"features": [ (2)
--snip--
| 1 | The first part of the file includes a section with the key
"metadata". This tells us when the data file was generated and
where we can find the data online. It also gives us a human-readable
title and the number of earthquakes included in this file. In this
24-hour period, 160 earthquakes were recorded. |
| 2 | The GeoJSON file has a structure helpful for location-based data.
The information is stored in a list associated with the key
"features". Because this file contains earthquake data, the data
is in list form where every item in the list corresponds to a single
earthquake. |
Let’s look at a dictionary representing a single earthquake:
{
"type": "Feature",
"properties": { (1)
"mag": 1.6,
--snip--
"title": "M 1.6 - 27 km NNW of Susitna, Alaska" (2)
},
"geometry": { (3)
"type": "Point",
"coordinates": [
-150.7585, (4)
61.7591, (5)
56.3
]
},
"id": "ak0224bju1jx"
},
| 1 | The key "properties" contains a lot of information about each
earthquake. We’re mainly interested in the magnitude of each
earthquake, associated with the key "mag". |
| 2 | We’re also interested in the "title" of each event, which provides
a nice summary of its magnitude and location. |
| 3 | The key "geometry" helps us understand where the earthquake
occurred. We’ll need this information to map each event. |
| 4 | The longitude for each earthquake is stored in a list associated
with the key "coordinates". |
| 5 | The latitude follows the longitude in the same list. |
Note: When we talk about locations, we often say the location's latitude first, followed by its longitude. This convention probably arose because humans discovered latitude long before we developed the concept of longitude. However, many geospatial frameworks list the longitude first and then the latitude, because this corresponds to the (x, y) convention we use in mathematical representations. The GeoJSON format follows the (longitude, latitude) convention. If you use a different framework, it's important to learn what convention that framework follows.
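Putting the pieces together, here's a minimal feature dictionary in the shape described above, along with the key lookups we'll use to pull the magnitude, longitude, and latitude:

```python
# A minimal feature, shaped like one entry in the "features" list.
feature = {
    'properties': {'mag': 1.6, 'title': 'M 1.6 - 27 km NNW of Susitna, Alaska'},
    'geometry': {'type': 'Point', 'coordinates': [-150.7585, 61.7591, 56.3]},
}

mag = feature['properties']['mag']
lon = feature['geometry']['coordinates'][0]
lat = feature['geometry']['coordinates'][1]
print(mag, lon, lat)  # 1.6 -150.7585 61.7591
```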
Making a List of All Earthquakes
First, we’ll make a list that contains all the information about every earthquake that occurred.
from pathlib import Path
import json
# Read data as a string and convert to a Python object.
path = Path('eq_data/eq_data_1_day_m1.geojson')
contents = path.read_text(encoding='utf-8')
all_eq_data = json.loads(contents)
# Examine all earthquakes in the dataset.
all_eq_dicts = all_eq_data['features']
print(len(all_eq_dicts))
We take the data associated with the key 'features' in the
all_eq_data dictionary, and assign it to all_eq_dicts. We know this
file contains records of 160 earthquakes, and the output verifies that
we’ve captured all the earthquakes in the file:
160
Notice how short this code is. The neatly formatted file
readable_eq_data.geojson has over 6,000 lines. But in just a few lines,
we can read through all that data and store it in a Python list. Next,
we’ll pull the magnitudes from each earthquake.
Extracting Magnitudes
We can loop through the list containing data about each earthquake, and extract any information we want. Let’s pull out the magnitude of each earthquake:
# --snip--
all_eq_dicts = all_eq_data['features']
mags = [] (1)
for eq_dict in all_eq_dicts:
    mag = eq_dict['properties']['mag'] (2)
    mags.append(mag)
print(mags[:10])
| 1 | We make an empty list to store the magnitudes, and then loop through
the list all_eq_dicts. Inside this loop, each earthquake is
represented by the dictionary eq_dict. |
| 2 | Each earthquake’s magnitude is stored in the 'properties' section
of this dictionary, under the key 'mag'. |
We print the first 10 magnitudes, so we can see whether we’re getting the correct data:
[1.6, 1.6, 2.2, 3.7, 2.92000008, 1.4, 4.6, 4.5, 1.9, 1.8]
Extracting Location Data
The location data for each earthquake is stored under the key
"geometry". Inside the geometry dictionary is a "coordinates" key,
and the first two values in this list are the longitude and latitude.
Here’s how we’ll pull this data:
# --snip--
all_eq_dicts = all_eq_data['features']
mags, lons, lats = [], [], []
for eq_dict in all_eq_dicts:
    mag = eq_dict['properties']['mag']
    lon = eq_dict['geometry']['coordinates'][0] (1)
    lat = eq_dict['geometry']['coordinates'][1]
    mags.append(mag)
    lons.append(lon)
    lats.append(lat)
print(mags[:10])
print(lons[:5])
print(lats[:5])
| 1 | The code eq_dict['geometry'] accesses the dictionary representing
the geometry element of the earthquake. The second key,
'coordinates', pulls the list of values associated with
'coordinates'. Finally, the 0 index asks for the first value in
the list of coordinates, which corresponds to an earthquake’s
longitude. |
When we print the first 5 longitudes and latitudes, the output shows that we’re pulling the correct data:
[1.6, 1.6, 2.2, 3.7, 2.92000008, 1.4, 4.6, 4.5, 1.9, 1.8]
[-150.7585, -153.4716, -148.7531, -159.6267, -155.248336791992]
[61.7591, 59.3152, 63.1633, 54.5612, 18.7551670074463]
With this data, we can move on to mapping each earthquake.
Building a World Map
Using the information we’ve pulled so far, we can build a simple world map. Although it won’t look presentable yet, we want to make sure the information is displayed correctly before focusing on style and presentation issues. Here’s the initial map:
from pathlib import Path
import json
import plotly.express as px
# --snip--
for eq_dict in all_eq_dicts:
    # --snip--
title = 'Global Earthquakes'
fig = px.scatter_geo(lat=lats, lon=lons, title=title) (1)
fig.show()
| 1 | We import plotly.express with the alias px, just as we did in
Chapter 15. The scatter_geo() function allows you to overlay a
scatterplot of geographic data on a map. In the simplest use of this
chart type, you only need to provide a list of latitudes and a list
of longitudes. |
When you run this file, you should see a simple map of global earthquake activity. This shows the power of the Plotly Express library; in just three lines of code, we have a map of global earthquake activity.
Representing Magnitudes
A map of earthquake activity should show the magnitude of each earthquake. We can also include more data, now that we know the data is being plotted correctly.
# --snip--
# Read data as a string and convert to a Python object.
path = Path('eq_data/eq_data_30_day_m1.geojson')
contents = path.read_text(encoding='utf-8')
# --snip--
title = 'Global Earthquakes'
fig = px.scatter_geo(lat=lats, lon=lons, size=mags, title=title)
fig.show()
We load the file eq_data_30_day_m1.geojson, to include a full 30 days'
worth of earthquake activity. We also use the size argument in the
px.scatter_geo() call, which specifies how the points on the map will
be sized. We pass the list mags to size, so earthquakes with a
higher magnitude will show up as larger points on the map.
Customizing Marker Colors
We can use Plotly’s color scales to customize each marker’s color, according to the severity of the corresponding earthquake. We’ll also use a different projection for the base map.
# --snip--
fig = px.scatter_geo(lat=lats, lon=lons, size=mags, title=title,
color=mags, (1)
color_continuous_scale='Viridis', (2)
labels={'color': 'Magnitude'}, (3)
projection='natural earth', (4)
)
fig.show()
| 1 | The color argument tells Plotly what values it should use to
determine where each marker falls on the color scale. We use the
mags list to determine the color for each point, just as we did
with the size argument. |
| 2 | The color_continuous_scale argument tells Plotly which color scale
to use. Viridis is a color scale that ranges from dark blue to
bright yellow, and it works well for this dataset. |
| 3 | By default, the color scale on the right of the map is labeled
color; this is not representative of what the colors actually
mean. The labels argument takes a dictionary as a value. We only
need to set one custom label on this chart, making sure the color
scale is labeled Magnitude instead of color. |
| 4 | The projection argument accepts a number of common map projections.
Here we use the 'natural earth' projection, which rounds the ends
of the map. Also, note the trailing comma after this last argument.
When a function call has a long list of arguments spanning multiple
lines like this, it’s common practice to add a trailing comma so
you’re always ready to add another argument on the next line. |
Other Color Scales
You can choose from a number of other color scales. To see the available color scales, enter the following two lines in a Python terminal session:
>>> import plotly.express as px
>>> px.colors.named_colorscales()
['aggrnyl', 'agsunset', 'blackbody', ..., 'mygbm']
Feel free to try out these color scales in the earthquake map, or with any dataset where continuously varying colors can help show patterns in the data.
Adding Hover Text
To finish this map, we’ll add some informative text that appears when you hover over the marker representing an earthquake. In addition to showing the longitude and latitude, which appear by default, we’ll show the magnitude and provide a description of the approximate location as well.
# --snip--
mags, lons, lats, eq_titles = [], [], [], [] (1)
for eq_dict in all_eq_dicts:
    mag = eq_dict['properties']['mag']
    lon = eq_dict['geometry']['coordinates'][0]
    lat = eq_dict['geometry']['coordinates'][1]
    eq_title = eq_dict['properties']['title'] (2)
    mags.append(mag)
    lons.append(lon)
    lats.append(lat)
    eq_titles.append(eq_title)
title = 'Global Earthquakes'
fig = px.scatter_geo(lat=lats, lon=lons, size=mags, title=title,
# --snip--
projection='natural earth',
hover_name=eq_titles, (3)
)
fig.show()
| 1 | We first make a list called eq_titles to store the title of each
earthquake. |
| 2 | The 'title' section of the data contains a descriptive name of the
magnitude and location of each earthquake. We pull this information
and assign it to the variable eq_title, and then append it to the
list eq_titles. |
| 3 | In the px.scatter_geo() call, we pass eq_titles to the
hover_name argument. Plotly will now add the information from the
title of each earthquake to the hover text on each point. |
In less than 30 lines of code, we’ve created a visually appealing and meaningful map of global earthquake activity that also illustrates the geological structure of the planet.
Try It Yourself
16-6. Refactoring: The loop that pulls data from all_eq_dicts uses
variables for the magnitude, longitude, latitude, and title of each
earthquake before appending these values to their appropriate lists.
Instead of using these temporary variables, pull each value from
eq_dict and append it to the appropriate list in one line. Doing so
should shorten the body of this loop to just four lines.
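One possible shape for the refactored loop, shown here with a hypothetical single-feature dict standing in for the parsed USGS data:

```python
# Hypothetical stand-in for the parsed GeoJSON data; in the chapter,
# all_eq_dicts comes from the downloaded USGS file.
all_eq_dicts = [
    {
        'properties': {'mag': 5.2, 'title': 'M 5.2 - sample quake'},
        'geometry': {'coordinates': [-150.5, 61.2]},
    },
]

mags, lons, lats, eq_titles = [], [], [], []
for eq_dict in all_eq_dicts:
    # Append each value directly, with no temporary variables.
    mags.append(eq_dict['properties']['mag'])
    lons.append(eq_dict['geometry']['coordinates'][0])
    lats.append(eq_dict['geometry']['coordinates'][1])
    eq_titles.append(eq_dict['properties']['title'])
```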
16-7. Automated Title: In this section, we used the generic title
Global Earthquakes. Instead, you can use the title for the dataset in
the metadata part of the GeoJSON file. Pull this value and assign it
to the variable title.
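A sketch of the metadata approach, using a hypothetical minimal GeoJSON file so the example is self-contained; the real file's metadata section includes a 'title' key like the one shown here:

```python
import json
from pathlib import Path

# Write a tiny GeoJSON-style file with a metadata title (invented
# filename and title text, mirroring the real file's structure).
path = Path('eq_data_sample.geojson')
path.write_text(json.dumps({
    'metadata': {'title': 'USGS Magnitude 1.0+ Earthquakes, Past Day'},
    'features': [],
}))

contents = path.read_text()
all_eq_data = json.loads(contents)

# Pull the dataset title from the metadata section.
title = all_eq_data['metadata']['title']
print(title)
```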
16-8. Recent Earthquakes: You can find online data files containing information about the most recent earthquakes over 1-hour, 1-day, 7-day, and 30-day periods. Go to earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php and you’ll see a list of links to datasets for various time periods, focusing on earthquakes of different magnitudes. Download one of these datasets and create a visualization of the most recent earthquake activity.
16-9. World Fires: In the resources for this chapter, you’ll find a
file called world_fires_1_day.csv. This file contains information about
fires burning in different locations around the globe, including the
latitude, longitude, and brightness of each fire. Using the
data-processing work from the first part of this chapter and the mapping
work from this section, make a map that shows which parts of the world
are affected by fires. You can download more recent versions of this data
at earthdata.nasa.gov/earth-observation-data/near-real-time/firms/active-fire-data.
Summary
In this chapter, you learned how to work with real-world datasets. You
processed CSV and GeoJSON files, and extracted the data you want to
focus on. Using historical weather data, you learned more about working
with Matplotlib, including how to use the datetime module and how to
plot multiple data series on one chart. You plotted geographical data on
a world map in Plotly, and learned to customize the style of the map.
As you gain experience working with CSV and JSON files, you’ll be able to process almost any data you want to analyze. You can download most online datasets in either or both of these formats. By working with these formats, you’ll be able to learn how to work with other data formats more easily as well.
In the next chapter, you’ll write programs that automatically gather their own data from online sources, and then you’ll create visualizations of that data. These are fun skills to have if you want to program as a hobby and are critical skills if you’re interested in programming professionally.
Applied Exercises: Ch 16 - Downloading Data
These exercises apply the chapter’s patterns (CSV parsing, datetime handling, JSON extraction, nested dict traversal, and error handling) to infrastructure logs, security events, and language learning data. No external data downloads are required; exercises generate or simulate their own data.
Domus Digitalis / Homelab
D16-1. Node Log CSV Parser: Create a CSV file called node_events.csv
with columns DATE,HOSTNAME,SERVICE,STATUS,LATENCY_MS. Add at least 15
rows of simulated data. Write a program that reads the file using
pathlib and csv.reader(), prints the header with enumerate(), then
extracts the date (using datetime.strptime()), hostname, and latency
into separate lists. Print the first 5 entries of each list.
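A minimal sketch of this pattern, with only a few invented rows rather than the full 15:

```python
import csv
from datetime import datetime
from pathlib import Path

# Write a few simulated rows (hostnames and values are invented).
path = Path('node_events.csv')
path.write_text(
    "DATE,HOSTNAME,SERVICE,STATUS,LATENCY_MS\n"
    "2024-01-01,kvm-01,nginx,up,12\n"
    "2024-01-02,kvm-02,dns,up,8\n"
)

lines = path.read_text().splitlines()
reader = csv.reader(lines)
header_row = next(reader)

# Print each header with its index, as in the chapter.
for index, column_header in enumerate(header_row):
    print(index, column_header)

dates, hostnames, latencies = [], [], []
for row in reader:
    dates.append(datetime.strptime(row[0], '%Y-%m-%d'))
    hostnames.append(row[1])
    latencies.append(int(row[4]))

print(dates[:5])
print(hostnames[:5])
print(latencies[:5])
```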
D16-2. BGP Session Timeline: Create a CSV file bgp_sessions.csv with
columns DATE,PEER,ASN,STATE,DURATION_SEC. Add 20 rows covering a
30-day period. Write a program that reads the file, parses dates with
strptime('%Y-%m-%d'), and builds a dict mapping each date to a list of
session records for that day. Print the count of sessions per day.
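One way to build the date-to-records mapping, sketched with three invented rows; `setdefault()` creates each day's list the first time that date appears:

```python
import csv
from datetime import datetime
from pathlib import Path

# Two sample days (peers and ASNs are invented).
path = Path('bgp_sessions.csv')
path.write_text(
    "DATE,PEER,ASN,STATE,DURATION_SEC\n"
    "2024-01-01,10.0.0.1,65001,Established,3600\n"
    "2024-01-01,10.0.0.2,65002,Idle,0\n"
    "2024-01-02,10.0.0.1,65001,Established,7200\n"
)

reader = csv.reader(path.read_text().splitlines())
next(reader)  # Skip the header row.

sessions_by_day = {}
for row in reader:
    day = datetime.strptime(row[0], '%Y-%m-%d')
    # setdefault() returns the existing list, or creates an empty one.
    sessions_by_day.setdefault(day, []).append(row)

for day, sessions in sessions_by_day.items():
    print(f"{day:%Y-%m-%d}: {len(sessions)} sessions")
```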
D16-3. Multi-Series Node Health: Create a CSV file
node_health.csv with columns DATE,CPU_PCT,MEM_PCT,DISK_PCT. Add 30
rows. Write a program that reads all three metrics into separate lists.
Find dates where CPU > 80 and print a warning message for each. Use
try-except ValueError to handle any missing values, printing the date
and skipping the row.
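A sketch of the error-handling part, with one deliberately missing CPU value to exercise the `except` branch:

```python
import csv
from pathlib import Path

# The second row has an empty CPU field (values are invented).
path = Path('node_health.csv')
path.write_text(
    "DATE,CPU_PCT,MEM_PCT,DISK_PCT\n"
    "2024-01-01,85,40,55\n"
    "2024-01-02,,42,56\n"
    "2024-01-03,30,41,57\n"
)

reader = csv.reader(path.read_text().splitlines())
next(reader)  # Skip the header row.

warning_dates = []
for row in reader:
    try:
        cpu = float(row[1])
    except ValueError:
        # float('') raises ValueError; report the date and move on.
        print(f"Missing data for {row[0]}")
        continue
    if cpu > 80:
        warning_dates.append(row[0])
        print(f"Warning: CPU at {cpu}% on {row[0]}")
```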
D16-4. GeoJSON-Style Stack Inventory: Create a dict in your program
that mimics GeoJSON structure: a 'features' list where each item has
'properties' (hostname, role, VLAN) and 'geometry' (rack, unit).
Write it to stack_inventory.json using json.dumps(indent=4). Read it
back with json.loads(), extract all hostnames and roles, and print a
summary table.
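A sketch of the round trip, with two invented features mirroring GeoJSON's features/properties/geometry layout:

```python
import json
from pathlib import Path

# Hostnames, roles, and rack positions are invented examples.
inventory = {
    'features': [
        {
            'properties': {'hostname': 'kvm-01', 'role': 'hypervisor',
                           'vlan': 100},
            'geometry': {'rack': 3, 'unit': 12},
        },
        {
            'properties': {'hostname': 'nas-01', 'role': 'storage',
                           'vlan': 200},
            'geometry': {'rack': 3, 'unit': 20},
        },
    ],
}

# Write with readable indentation, then read it back.
path = Path('stack_inventory.json')
path.write_text(json.dumps(inventory, indent=4))
data = json.loads(path.read_text())

hostnames = [f['properties']['hostname'] for f in data['features']]
roles = [f['properties']['role'] for f in data['features']]

for hostname, role in zip(hostnames, roles):
    print(f"{hostname:<10} {role}")
```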
D16-5. Node Alert Hover Data: Extend D16-4: add a 'title' field to
each feature’s 'properties' (e.g., "kvm-01 (hypervisor) | VLAN 100").
Extract all titles, latitudes (rack number as float), and longitudes
(unit number as float) into separate lists. Print the first 5 entries of
each list, mirroring the earthquake eq_titles / lons / lats
pattern.
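A sketch of the extraction step, using two invented features with the 'title' field already in place:

```python
# Invented features; 'title' mirrors the earthquake hover-text pattern,
# with rack standing in for latitude and unit for longitude.
features = [
    {
        'properties': {'title': 'kvm-01 (hypervisor) | VLAN 100'},
        'geometry': {'rack': 3, 'unit': 12},
    },
    {
        'properties': {'title': 'nas-01 (storage) | VLAN 200'},
        'geometry': {'rack': 3, 'unit': 20},
    },
]

titles, lats, lons = [], [], []
for feature in features:
    titles.append(feature['properties']['title'])
    lats.append(float(feature['geometry']['rack']))
    lons.append(float(feature['geometry']['unit']))

print(titles[:5])
print(lats[:5])
print(lons[:5])
```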
CHLA / ISE / Network Security
C16-1. ISE Auth Log CSV Parser: Create a CSV file ise_auth_log.csv
with columns DATE,USERNAME,MAC,PROTOCOL,RESULT. Add 20 rows. Write a
program that reads the file, prints headers with enumerate(), and
extracts date, username, and result into separate lists. Parse dates
with strptime('%Y-%m-%d %H:%M:%S'). Print the first 5 entries of each
list.
C16-2. Syslog Event Timeline: Create a CSV file syslog_events.csv
with columns DATE,SEVERITY,SOURCE,MESSAGE. Add 25 rows across a 7-day
period. Write a program that reads the file, parses dates, and builds a
dict mapping each date to a list of events. Print the count of events
per day and flag any day with more than 5 critical (severity 0-2) events.
C16-3. Multi-Series Auth Metrics: Create a CSV file
auth_metrics.csv with columns DATE,SUCCESS_COUNT,FAIL_COUNT,
TIMEOUT_COUNT. Add 30 rows. Extract all three series into separate
lists. Find dates where FAIL_COUNT > SUCCESS_COUNT and print a warning
for each. Use try-except ValueError for missing values.
C16-4. GeoJSON-Style Policy Inventory: Create a dict mimicking
GeoJSON structure: a 'features' list where each item has
'properties' (policy_name, protocol, result) and 'geometry'
(priority, set_id). Write to policy_inventory.json with
json.dumps(indent=4). Read back, extract all policy names and
protocols, and print a summary table.
C16-5. Pipeline Hover Data: Extend C16-4: add a 'title' field to
each feature (e.g., "802.1X Wired | EAP-TLS | Allow"). Extract all
titles, priorities (as float), and set IDs (as float) into separate
lists. Print the first 5 entries of each, mirroring the earthquake hover
text pattern.
General Sysadmin / Linux
L16-1. Service Log CSV Parser: Create a CSV file service_log.csv
with columns DATE,SERVICE,HOST,STATUS,DURATION_SEC. Add 20 rows.
Write a program that reads the file, prints headers with enumerate(),
and extracts date, service, and status into separate lists. Parse dates
with strptime('%Y-%m-%d'). Print the first 5 entries of each list.
L16-2. Package Install Timeline: Create a CSV file pkg_installs.csv
with columns DATE,PACKAGE,VERSION,EXIT_CODE. Add 20 rows across a
10-day period. Write a program that reads the file, parses dates, and
builds a dict mapping each date to a list of installs. Print the count
of installs per day and flag any day with a non-zero exit code.
L16-3. Multi-Series System Metrics: Create a CSV file
system_metrics.csv with columns DATE,CPU_PCT,MEM_PCT,NET_MBPS. Add
30 rows. Extract all three series. Find dates where any metric exceeds
90 and print a warning. Use try-except ValueError for missing values.
L16-4. GeoJSON-Style Server Inventory: Create a dict mimicking
GeoJSON structure: a 'features' list where each item has
'properties' (hostname, os, rack) and 'geometry' (datacenter, floor).
Write to server_inventory.json. Read back, extract all hostnames and
OS values, and print a summary table.
L16-5. Server Hover Data: Extend L16-4: add a 'title' field to
each feature (e.g., "kvm-01 | Rocky Linux 9 | Rack 3"). Extract all
titles, datacenter IDs, and floor numbers into separate lists. Print the
first 5 entries of each list.
Spanish / DELE C2
E16-1. Vocabulary Progress CSV Parser: Create a CSV file
vocab_progress.csv with columns DATE,CHAPTER,WORDS_LEARNED,
WORDS_REVIEWED. Add 20 rows. Write a program that reads the file,
prints headers with enumerate(), and extracts date, chapter, and
words_learned into separate lists. Parse dates with strptime('%Y-%m-%d').
Print the first 5 entries of each list.
E16-2. Study Session Timeline: Create a CSV file study_sessions.csv
with columns DATE,TOPIC,DURATION_MIN,SCORE. Add 25 rows across a
30-day period. Write a program that reads the file, parses dates, and
builds a dict mapping each date to a list of sessions. Print the count
of sessions per day and flag any day with a score below 60.
E16-3. Multi-Series DELE Metrics: Create a CSV file
dele_metrics.csv with columns DATE,VOCAB_SCORE,GRAMMAR_SCORE,
READING_SCORE. Add 30 rows. Extract all three series. Find dates where
any score drops below 50 and print a warning. Use try-except ValueError
for missing values.
E16-4. GeoJSON-Style Chapter Notes: Create a dict mimicking GeoJSON
structure: a 'features' list where each item has 'properties'
(chapter_number, title, notes) and 'geometry' (part, section). Write
to donquijote_notes.json with json.dumps(indent=4). Read back,
extract all chapter numbers and titles, and print a summary table.
E16-5. Chapter Hover Data: Extend E16-4: add a 'title' field to
each feature (e.g., "Ch. 30 | De lo que le aconteció al hidalgo…").
Extract all titles, part numbers (as float), and section numbers (as
float) into separate lists. Print the first 5 entries of each list,
mirroring the earthquake hover text pattern.