# Python pandas

pandas DataFrame patterns for system inventory, infrastructure data analysis, and reporting.
## Creating DataFrames

A DataFrame is a table — rows and columns, like a spreadsheet in memory. There are three common ways to create one, depending on your data source.
```python
import pandas as pd

# Each dict is a row. Keys become column names.
rows = [
    {'package': 'python-matplotlib', 'version': '3.10.8-5'},
    {'package': 'python-pandas', 'version': '2.3.3-2'},
]
df = pd.DataFrame(rows)
#              package   version
# 0  python-matplotlib  3.10.8-5
# 1      python-pandas   2.3.3-2
```
```python
# From JSON command output
import subprocess, json

result = subprocess.run(
    ['systemctl', 'list-units', '--type=service', '--all',
     '--no-pager', '--output=json'],
    capture_output=True, text=True
)
data = json.loads(result.stdout)
df = pd.DataFrame(data)  # List of dicts → DataFrame in one line
```
```python
# pacman -Q gives "name version" per line — no JSON available
result = subprocess.run(['pacman', '-Q'], capture_output=True, text=True)

rows = []
for line in result.stdout.strip().splitlines():
    parts = line.split(None, 1)  # Split on first whitespace
    if len(parts) == 2:
        rows.append({'package': parts[0], 'version': parts[1]})
df = pd.DataFrame(rows)
```
When to use each:

- `json.loads()` → DataFrame — when the command has a `-j`, `--json`, or `--output=json` flag
- Text parsing → list of dicts → DataFrame — when it doesn't (`pacman -Q`, `ps aux`, `ss -tlnp`)
- `pd.read_csv()` — when loading previously exported data
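For the third case, `pd.read_csv()` also accepts any file-like object, so a quick way to test parsing without touching the filesystem is to wrap a string in `io.StringIO` (standard library, shown here on made-up data):

```python
import io
import pandas as pd

# Simulate a previously exported CSV as an in-memory string
csv_text = "package,version\npython-pandas,2.3.3-2\nneovim,0.11.0-1\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['package', 'version']
```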
## Filtering Rows

Boolean indexing: put a condition inside `df[...]` and only matching rows come back. This replaces grep — but it's type-aware and chainable.
```python
# All failed services
failed = df[df['sub'] == 'failed']

# All active services
running = df[df['active'] == 'active']

# All packages starting with 'python-' — like grep '^python-'
python_pkgs = df[df['package'].str.startswith('python-')]

# Packages containing 'lib' anywhere — like grep 'lib'
lib_pkgs = df[df['package'].str.contains('lib')]

# Case-insensitive search; na=False prevents NaN values from causing errors
matches = df[df['description'].str.contains('network', case=False, na=False)]
```

```python
# Check which packages were explicitly installed
explicit_set = {'python-matplotlib', 'python-pandas', 'neovim', 'tmux'}
df['explicit'] = df['package'].isin(explicit_set)

# Filter to only UP or UNKNOWN interfaces
active = df[df['state'].isin(['UP', 'UNKNOWN'])]
```

```python
# Active AND running
healthy = df[(df['active'] == 'active') & (df['sub'] == 'running')]

# Failed OR errored
broken = df[(df['sub'] == 'failed') | (df['sub'] == 'error')]
```
Parentheses are mandatory with `&` and `|` — Python operator precedence requires them.
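Two related tools worth knowing, shown on a toy DataFrame: `~` negates a boolean mask, and `.query()` expresses the same conditions as a string, where plain `and`/`or` work without the parentheses issue:

```python
import pandas as pd

df = pd.DataFrame([
    {'unit': 'sshd',  'active': 'active', 'sub': 'running'},
    {'unit': 'nginx', 'active': 'active', 'sub': 'exited'},
    {'unit': 'cups',  'active': 'failed', 'sub': 'failed'},
])

# NOT running — ~ inverts the mask
not_running = df[~(df['sub'] == 'running')]

# Same AND filter as boolean indexing, no parentheses needed
healthy = df.query("active == 'active' and sub == 'running'")

print(len(not_running), len(healthy))  # 2 1
```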
## Sorting and Counting
```python
# Sort by column — ascending (default)
df.sort_values('package')

# Descending
df.sort_values('port', ascending=False)

# Sort by multiple columns
df.sort_values(['role', 'interface'])
```

```python
# Total rows
df.shape[0]  # shape returns (rows, cols)

# Count values in a column — like sort | uniq -c | sort -rn
df['sub'].value_counts()
# dead       68
# exited     30
# running    20

# Count True values in a boolean column
df['explicit'].sum()  # True = 1, False = 0
```
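A minimal end-to-end check of these counting idioms on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({
    'sub': ['running', 'exited', 'running', 'dead'],
    'explicit': [True, False, True, True],
})

counts = df['sub'].value_counts()
print(counts['running'])           # 2
print(df.shape[0])                 # 4
print(int(df['explicit'].sum()))   # 3
```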
## Groupby and Aggregation

`groupby()` splits the DataFrame into groups, then applies a function to each group. This replaces awk associative arrays — but with named operations.
```python
# How many interfaces per role?
df.groupby('role').size()
# role
# container    4
# loopback     1
# vpn          1
# wifi         1
# wired        1

# Per role: count interfaces AND list their names
by_role = df.groupby('role').agg(
    count=('interface', 'size'),
    interfaces=('interface', lambda x: ', '.join(x)),
)
# agg() takes named tuples: (column, function)
# 'size' = count rows in group
# lambda = custom function applied to the column
```

```python
# The common aggregation functions
df.groupby('category').agg(
    total=('value', 'sum'),
    average=('value', 'mean'),
    minimum=('value', 'min'),
    maximum=('value', 'max'),
    count=('value', 'size'),
    unique=('value', 'nunique'),
    first=('name', 'first'),
    all_names=('name', lambda x: ', '.join(x)),
)
```
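The named-aggregation syntax above can be verified on a small synthetic table (the column names here are illustrative, not from any real command output):

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['disk', 'disk', 'net'],
    'value': [10, 30, 5],
    'name': ['sda', 'sdb', 'eth0'],
})

summary = df.groupby('category').agg(
    total=('value', 'sum'),
    count=('value', 'size'),
    all_names=('name', lambda x: ', '.join(x)),
)
print(summary.loc['disk', 'total'])      # 40
print(summary.loc['disk', 'all_names'])  # sda, sdb
```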
## Flattening Nested JSON

Many infrastructure commands return nested JSON: `ip -j addr` nests `addr_info` inside each interface, and `lsblk -J` nests `children` inside parent devices. You must walk the tree and flatten it yourself.
```python
import subprocess, json
import pandas as pd

result = subprocess.run(['ip', '-j', 'addr'], capture_output=True, text=True)
data = json.loads(result.stdout)

rows = []
for iface in data:
    ipv4, ipv6 = None, None
    for addr in iface.get('addr_info', []):
        if addr.get('family') == 'inet' and not ipv4:
            ipv4 = addr.get('local')
        elif addr.get('family') == 'inet6' and not ipv6:
            ipv6 = addr.get('local')
    rows.append({
        'interface': iface.get('ifname'),
        'state': iface.get('operstate'),
        'mac': iface.get('address'),
        'ipv4': ipv4,
        'ipv6': ipv6,
    })
df = pd.DataFrame(rows)
```
```python
result = subprocess.run(
    ['lsblk', '-J', '-o', 'NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,FSUSE%,MODEL'],
    capture_output=True, text=True
)
data = json.loads(result.stdout)

rows = []

def walk(devices, parent=''):
    for dev in devices:
        rows.append({
            'name': dev.get('name'),
            'size': dev.get('size'),
            'type': dev.get('type'),
            'mountpoint': dev.get('mountpoint'),
            'usage_pct': dev.get('fsuse%'),
            'parent': parent,
        })
        if 'children' in dev:
            walk(dev['children'], parent=dev['name'])

walk(data.get('blockdevices', []))
df = pd.DataFrame(rows)
```
The pattern: walk the tree, build a flat list of dicts, create the DataFrame once at the end. Never call `pd.DataFrame()` inside a loop — build the list first, then convert.
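For shallow nesting, `pd.json_normalize()` can do part of this work: it expands one record path and carries parent-level fields along via `meta=`. A sketch on synthetic `ip -j addr`-shaped data — note it yields one row per address rather than per interface, so it is not a drop-in replacement for the manual walk:

```python
import pandas as pd

data = [
    {'ifname': 'lo', 'operstate': 'UNKNOWN',
     'addr_info': [{'family': 'inet', 'local': '127.0.0.1'}]},
    {'ifname': 'eth0', 'operstate': 'UP',
     'addr_info': [{'family': 'inet', 'local': '192.168.1.10'},
                   {'family': 'inet6', 'local': 'fe80::1'}]},
]

# One row per address; interface-level fields repeated via meta=
df = pd.json_normalize(data, record_path='addr_info',
                       meta=['ifname', 'operstate'])
print(df.shape[0])  # 3 — one row per address, not per interface
```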
## Exporting Data
```python
df.to_csv('/tmp/inventory/packages.csv', index=False)
# index=False omits the row numbers — you almost always want this

# Read it back
df = pd.read_csv('/tmp/inventory/packages.csv')
```

```python
df.to_json('/tmp/inventory/interfaces.json', orient='records', indent=2)
# orient='records' → [{row1}, {row2}] (array of objects)
# orient='columns' → {"col1": [...], "col2": [...]}
# orient='index'   → {"0": {row1}, "1": {row2}}
# 'records' is what you want 99% of the time

# Read it back
df = pd.read_json('/tmp/inventory/interfaces.json')
```
```python
# Export several DataFrames to both formats at once
from pathlib import Path

output = Path('/tmp/inventory')
output.mkdir(parents=True, exist_ok=True)

dataframes = {
    'packages': packages_df,
    'services': services_df,
    'interfaces': interfaces_df,
}
for name, df in dataframes.items():
    df.to_csv(output / f"{name}.csv", index=False)
    df.to_json(output / f"{name}.json", orient='records', indent=2)
```
Compare with the shell: each jq export pipeline produces one format. pandas exports the same in-memory data to any format, one line each.
## pandas ↔ Shell Equivalents

| Operation | Shell | pandas |
|---|---|---|
| Filter rows | `grep 'pattern'` | `df[df['col'].str.contains('pattern')]` |
| Count per group | `sort \| uniq -c \| sort -rn` | `df['col'].value_counts()` |
| Sort | `sort -k2` | `df.sort_values('col')` |
| Row count | `wc -l` | `df.shape[0]` |
| Select columns | `awk '{print $1, $3}'` | `df[['col1', 'col3']]` |
| Unique values | `sort -u` | `df['col'].unique()` |
| Handle nulls | `grep -v '^$'` | `df.dropna()` / `df.fillna(value)` |
| Join two files | `join file1 file2` | `pd.merge(df1, df2, on='key')` |
| Export CSV | `printf` loops | `df.to_csv(path, index=False)` |
The advantage: pandas operations chain without re-parsing. The shell re-parses text at every pipe stage. pandas transforms structured data in memory.
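That chaining claim can be made concrete: a multi-stage shell pipeline collapses into one in-memory expression (toy data, illustrative column names):

```python
import pandas as pd

df = pd.DataFrame({
    'package': ['python-pandas', 'python-numpy', 'neovim'],
    'size_kb': [41000, 28000, 9000],
})

# Shell: grep '^python-' | sort -k2 -rn | head -1
# pandas: one chained expression, no re-parsing between stages
top = (
    df[df['package'].str.startswith('python-')]
      .sort_values('size_kb', ascending=False)
      .head(1)
)
print(top.iloc[0]['package'])  # python-pandas
```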
## See Also

- Visualization — matplotlib dashboards consuming DataFrames
- File Operations — JSON/YAML/CSV parsing (pre-pandas patterns)
- CLI Patterns — subprocess, JSON processing
- jq Sysadmin — the shell commands these DataFrames wrap