Python pandas

pandas DataFrame patterns for system inventory, infrastructure data analysis, and reporting.

Creating DataFrames

A DataFrame is a table — rows and columns, like a spreadsheet in memory. Three ways to create one, depending on your data source.

From a list of dicts (most common — parsed command output)
import pandas as pd

# Each dict is a row. Keys become column names.
rows = [
    {'package': 'python-matplotlib', 'version': '3.10.8-5'},
    {'package': 'python-pandas',     'version': '2.3.3-2'},
]
df = pd.DataFrame(rows)
#              package    version
# 0  python-matplotlib  3.10.8-5
# 1      python-pandas    2.3.3-2
From JSON directly (when commands give structured output)
import subprocess, json

result = subprocess.run(
    ['systemctl', 'list-units', '--type=service', '--all',
     '--no-pager', '--output=json'],
    capture_output=True, text=True
)
data = json.loads(result.stdout)
df = pd.DataFrame(data)  # List of dicts → DataFrame in one line
From parsed text (when there’s no JSON output)
# pacman -Q gives "name version" per line — no JSON available
result = subprocess.run(['pacman', '-Q'], capture_output=True, text=True)
rows = []
for line in result.stdout.strip().splitlines():
    parts = line.split(None, 1)  # Split on first whitespace
    if len(parts) == 2:
        rows.append({'package': parts[0], 'version': parts[1]})
df = pd.DataFrame(rows)

When to use each:

  • json.loads() → DataFrame — when commands have -j, --json, --output=json flags

  • Text parsing → list of dicts → DataFrame — when they don’t (pacman -Q, ps aux, ss -tlnp)

  • pd.read_csv() — when loading previously exported data

Filtering Rows

Boolean indexing: put a condition inside df[...] and only matching rows come back. This replaces grep — but it’s type-aware and chainable.

Basic equality filter
# All failed services
failed = df[df['sub'] == 'failed']

# All running services
running = df[df['active'] == 'active']
String operations with .str accessor
# All packages starting with 'python-' — like grep '^python-'
python_pkgs = df[df['package'].str.startswith('python-')]

# Packages containing 'lib' anywhere — like grep 'lib'
lib_pkgs = df[df['package'].str.contains('lib')]

# Case-insensitive search
matches = df[df['description'].str.contains('network', case=False, na=False)]
# na=False prevents NaN values from causing errors
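Unlike grep, these string filters compose directly. A minimal sketch with invented sample data, combining .str.startswith and .str.contains in one mask:

```python
import pandas as pd

df = pd.DataFrame([
    {'package': 'python-pandas',   'description': 'Data analysis library'},
    {'package': 'python-requests', 'description': 'HTTP library'},
    {'package': 'libxml2',         'description': 'XML parsing library'},
])

# Both .str filters return boolean Series, so & combines them row-wise
hits = df[df['package'].str.startswith('python-')
          & df['description'].str.contains('http', case=False, na=False)]
print(hits['package'].tolist())  # ['python-requests']
```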
Set membership with .isin()
# Check which packages were explicitly installed
explicit_set = {'python-matplotlib', 'python-pandas', 'neovim', 'tmux'}
df['explicit'] = df['package'].isin(explicit_set)

# Filter to only UP or UNKNOWN interfaces
active = df[df['state'].isin(['UP', 'UNKNOWN'])]
Multiple conditions (& for AND, | for OR)
# Running AND loaded
healthy = df[(df['active'] == 'active') & (df['sub'] == 'running')]

# Failed OR errored
broken = df[(df['sub'] == 'failed') | (df['sub'] == 'error')]

Parentheses are mandatory with & and | — Python operator precedence requires them.
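To see why: & binds tighter than ==, so the unparenthesized form groups the wrong operands and raises an error. A minimal sketch with invented two-row data, including ~ for negation:

```python
import pandas as pd

df = pd.DataFrame({'active': ['active', 'inactive'],
                   'sub': ['running', 'dead']})

# Without parentheses Python would parse
#   df['active'] == 'active' & df['sub'] == 'running'
# as df['active'] == ('active' & df['sub']) == 'running', which raises
# a TypeError. The parenthesized form evaluates each comparison first:
healthy = df[(df['active'] == 'active') & (df['sub'] == 'running')]

# ~ negates a boolean mask; it needs parentheses for the same reason
not_running = df[~(df['sub'] == 'running')]
print(len(healthy), len(not_running))  # 1 1
```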

Sorting and Counting

Sort rows
# Sort by column — ascending (default)
df.sort_values('package')

# Descending
df.sort_values('port', ascending=False)

# Sort by multiple columns
df.sort_values(['role', 'interface'])
Count things
# Total rows
df.shape[0]  # shape returns (rows, cols)

# Count values in a column — like sort | uniq -c | sort -rn
df['sub'].value_counts()
# dead       68
# exited     30
# running    20

# Count True values in a boolean column
df['explicit'].sum()  # True = 1, False = 0
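value_counts can also report proportions instead of raw counts via normalize=True. A small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({'sub': ['running'] * 2 + ['dead'] * 3})

# normalize=True returns fractions; multiply by 100 for percentages
pct = df['sub'].value_counts(normalize=True) * 100
print(pct.round(1).to_dict())  # {'dead': 60.0, 'running': 40.0}
```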

Groupby and Aggregation

groupby() splits the DataFrame into groups, then applies a function to each group. This replaces awk associative arrays — but with named operations.

Count per group
# How many interfaces per role?
df.groupby('role').size()
# role
# container    4
# loopback     1
# vpn          1
# wifi         1
# wired        1
Multiple aggregations with .agg()
# Per role: count interfaces AND list their names
by_role = df.groupby('role').agg(
    count=('interface', 'size'),
    interfaces=('interface', lambda x: ', '.join(x)),
)
# agg() takes named aggregations: name=(column, function)
#   'size' counts rows in the group
#   the lambda receives the group's column values as a Series
Common aggregation functions
df.groupby('category').agg(
    total=('value', 'sum'),
    average=('value', 'mean'),
    minimum=('value', 'min'),
    maximum=('value', 'max'),
    count=('value', 'size'),
    unique=('value', 'nunique'),
    first=('name', 'first'),
    all_names=('name', lambda x: ', '.join(x)),
)

Flattening Nested JSON

Many infrastructure commands return nested JSON. ip -j addr nests addr_info inside each interface. lsblk -J nests children inside parent devices. You must walk the tree and flatten it yourself.

Network interfaces — extracting nested addr_info
import subprocess, json
import pandas as pd

result = subprocess.run(['ip', '-j', 'addr'], capture_output=True, text=True)
data = json.loads(result.stdout)

rows = []
for iface in data:
    ipv4, ipv6 = None, None
    for addr in iface.get('addr_info', []):
        if addr.get('family') == 'inet' and not ipv4:
            ipv4 = addr.get('local')
        elif addr.get('family') == 'inet6' and not ipv6:
            ipv6 = addr.get('local')

    rows.append({
        'interface': iface.get('ifname'),
        'state': iface.get('operstate'),
        'mac': iface.get('address'),
        'ipv4': ipv4,
        'ipv6': ipv6,
    })

df = pd.DataFrame(rows)
Block devices — recursive children
result = subprocess.run(
    ['lsblk', '-J', '-o', 'NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,FSUSE%,MODEL'],
    capture_output=True, text=True
)
data = json.loads(result.stdout)

rows = []

def walk(devices, parent=''):
    for dev in devices:
        rows.append({
            'name': dev.get('name'),
            'size': dev.get('size'),
            'type': dev.get('type'),
            'mountpoint': dev.get('mountpoint'),
            'usage_pct': dev.get('fsuse%'),
            'parent': parent,
        })
        if 'children' in dev:
            walk(dev['children'], parent=dev['name'])

walk(data.get('blockdevices', []))
df = pd.DataFrame(rows)

The pattern: walk the tree, build a flat list of dicts, create the DataFrame once at the end. Never grow a DataFrame inside a loop — repeated concat or append copies the entire frame every pass. Build the list first, then convert once.
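The cost difference, sketched with a few invented rows: growing a frame via concat copies all existing rows on every iteration (quadratic overall), while appending to a list is constant-time per row.

```python
import pandas as pd

items = [('sda', '500G'), ('sdb', '1T'), ('nvme0n1', '2T')]

# Anti-pattern: concat inside the loop copies the whole frame each time
df_slow = pd.DataFrame(columns=['name', 'size'])
for name, size in items:
    row = pd.DataFrame([{'name': name, 'size': size}])
    df_slow = pd.concat([df_slow, row], ignore_index=True)

# Pattern: list.append is O(1); convert to a DataFrame once at the end
rows = []
for name, size in items:
    rows.append({'name': name, 'size': size})
df_fast = pd.DataFrame(rows)

print(df_fast.to_dict('records') == df_slow.to_dict('records'))  # True
```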

Exporting Data

CSV — spreadsheet-compatible, Excel opens it directly
df.to_csv('/tmp/inventory/packages.csv', index=False)
# index=False omits the row numbers — you almost always want this

# Read it back
df = pd.read_csv('/tmp/inventory/packages.csv')
JSON — API-compatible, jq-compatible
df.to_json('/tmp/inventory/interfaces.json', orient='records', indent=2)
# orient='records' → [{row1}, {row2}]  (array of objects)
# orient='columns' → {"col1": [...], "col2": [...]}
# orient='index'   → {"0": {row1}, "1": {row2}}
# 'records' is what you want 99% of the time

# Read it back
df = pd.read_json('/tmp/inventory/interfaces.json')
Multiple DataFrames to a directory
from pathlib import Path

output = Path('/tmp/inventory')
output.mkdir(parents=True, exist_ok=True)

dataframes = {
    'packages': packages_df,
    'services': services_df,
    'interfaces': interfaces_df,
}

for name, df in dataframes.items():
    df.to_csv(output / f"{name}.csv", index=False)
    df.to_json(output / f"{name}.json", orient='records', indent=2)

The shell comparison: a jq export produces one format per invocation. pandas writes the same DataFrame to every format you need, one line per file.

pandas ↔ Shell Equivalents

Operation         Shell                               pandas
Filter rows       grep 'failed'                       df[df['sub'] == 'failed']
Count per group   awk '{print $4}' | sort | uniq -c   df['sub'].value_counts()
Sort              sort -k1                            df.sort_values('package')
Row count         wc -l                               df.shape[0]
Select columns    awk '{print $1, $3}'                df[['name', 'state']]
Unique values     sort -u                             df['role'].unique()
Handle nulls      sed 's/null//g'                     df.fillna('')
Join two files    join file1 file2                    df1.merge(df2, on='key')
Export CSV        | tee file.csv                      df.to_csv('file.csv')

The advantage: pandas operations chain without re-parsing. The shell re-parses text at every pipe stage. pandas transforms structured data in memory.
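The chaining claim made concrete: a filter, group, and sort in one expression over invented service data, replacing a grep | awk | sort | uniq -c pipeline with no intermediate text parsing.

```python
import pandas as pd

df = pd.DataFrame([
    {'unit': 'sshd.service',  'active': 'active',   'sub': 'running'},
    {'unit': 'nginx.service', 'active': 'active',   'sub': 'running'},
    {'unit': 'foo.service',   'active': 'failed',   'sub': 'failed'},
    {'unit': 'bar.service',   'active': 'inactive', 'sub': 'dead'},
])

# Filter -> group -> count -> sort, one chained expression in memory
summary = (
    df[df['active'] != 'inactive']
      .groupby('sub')
      .size()
      .sort_values(ascending=False)
)
print(summary.to_dict())  # {'running': 2, 'failed': 1}
```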

See Also