Phase 5: YAML Persistence

Phase 5: Save and Load with YAML

Objective

Give the graph the ability to persist to disk and load from disk using YAML files. The graph becomes durable — you can build it, save it, shut down, and reload it later. You can also split data across multiple YAML files by domain and merge them at load time.

Python Concepts

Concept | Plain English

import yaml

"Bring in the yaml library so I can use it." import is how Python loads code from another module, whether you wrote it or someone else did. It is conceptually similar to source ~/.bashrc: it makes functions and classes from another file available. yaml is the module provided by the PyYAML library. It is not part of Python’s standard library; you install it with uv add pyyaml.

from pathlib import Path

Import the Path class from Python’s standard library. Path treats file paths as objects with methods, not raw strings. Path("data") / "projects.yml" builds data/projects.yml using / as a path join operator. This replaces os.path.join and manual string concatenation, and it is the idiomatic approach in modern Python.
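A minimal sketch of the / operator and a few Path attributes. The directory and file names are illustrative only:

```python
from pathlib import Path

# Build a path with the / operator instead of string concatenation.
data_file = Path("data") / "projects.yml"

print(data_file.name)    # final path component
print(data_file.suffix)  # file extension, including the dot
print(data_file.parent)  # containing directory
```

Nothing here touches the disk. A Path is just a description of a location until you call a method such as open(), exists(), or mkdir() on it.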

with open(path) as f:

The with statement opens a file and guarantees it gets closed when the block ends, even if an error occurs. f is a file handle — you read from it or write to it. This is the Python equivalent of exec 3<file; read -r line <&3; exec 3<&- but without the complexity. The file closes automatically when the indented block exits.

try / except

Error handling. try: runs code that might fail. except FileNotFoundError: catches that specific error and runs recovery code instead of crashing. This is like command || fallback in bash, but with more precision — you specify which error to catch.
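The with statement and try / except usually appear together when reading files. A small sketch, using a hypothetical helper named read_or_default that is not part of this project:

```python
from pathlib import Path


def read_or_default(path: Path, default: str = "") -> str:
    """Return the file's text, or `default` if the file does not exist."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        # Only this specific error is handled; anything else still propagates.
        return default


missing = read_or_default(Path("no-such-dir") / "no-such-file.txt", default="(empty)")
print(missing)
```

Catching only FileNotFoundError is the point: a permissions problem or a typo in the code still surfaces as an error instead of being silently swallowed.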

yaml.safe_load(f)

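Parse YAML from a file handle into Python data structures. Sequences become list, mappings become dict, and scalars become str, int, float, bool, or None depending on how they are written, so an unquoted 2024 or yes does not stay a string. safe_load is critical: never parse untrusted YAML with yaml.load() and an unsafe loader such as yaml.Loader or yaml.UnsafeLoader, because those can construct arbitrary Python objects and execute code embedded in the YAML (a real security risk).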
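A short sketch of what safe_load returns, assuming PyYAML is installed. The second entry, with its version relation, is made up purely to show how an unquoted number is parsed:

```python
import yaml

text = """
associations:
  - source: CISSP
    relation: requires
    target: 5-years-experience
  - source: CISSP
    relation: version
    target: 2024
"""

data = yaml.safe_load(text)  # safe_load accepts a string or a file handle

first, second = data["associations"]
print(type(first["target"]).__name__)   # plain scalar that is not a number
print(type(second["target"]).__name__)  # plain scalar that looks like a number
```

If a value must remain a string, quote it in the YAML file, for example target: "2024".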

yaml.dump(data, f)

Serialize Python data structures to YAML and write to a file handle. default_flow_style=False forces block style (readable, one-item-per-line). sort_keys=False keeps dict keys in insertion order instead of sorting them alphabetically, which is what yaml.dump does by default.
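A sketch of the same call without a file, assuming PyYAML is installed. When no stream is passed, yaml.dump returns the YAML text as a string, which is convenient for experimenting:

```python
import yaml

data = {
    "associations": [
        {"source": "CISSP", "relation": "covers", "target": "cryptography"},
    ]
}

# With no stream argument, yaml.dump returns the YAML text as a string.
text = yaml.dump(data, default_flow_style=False, sort_keys=False)
print(text)
```

PyYAML typically writes the list dashes flush with the parent key rather than indented as in the hand-written example under step 2. Both forms are valid YAML and load to the same data.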

Path.glob("*.yml")

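Find all files matching a pattern in a directory. Returns an iterator of Path objects. Like ls data/*.yml but as a Python generator. Unlike ls, the order of results is not guaranteed, so wrap the call in sorted() when order matters.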
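A self-contained sketch of glob() with sorted(), using a throwaway directory from the standard library's tempfile module. The file names are placeholders:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    for name in ("skills.yml", "certs.yml", "notes.txt"):
        (directory / name).write_text("", encoding="utf-8")

    # glob() yields matches in no guaranteed order; sorted() makes it deterministic.
    found = [p.name for p in sorted(directory.glob("*.yml"))]

print(found)
```

notes.txt is skipped because it does not match the pattern, and the two .yml files come back in alphabetical order.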

@classmethod

A method that belongs to the class itself, not an instance. AssociationGraph.load(path) creates a new graph from a file — you do not need an existing instance to call it. The first argument is cls (the class) instead of self (an instance).
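A minimal sketch of an alternate constructor built with @classmethod. Counter and LoudCounter are hypothetical classes used only for illustration:

```python
class Counter:
    """Hypothetical class used only to illustrate @classmethod."""

    def __init__(self) -> None:
        self.count = 0

    @classmethod
    def starting_at(cls, value: int) -> "Counter":
        # cls is the class the method was called on, so subclasses
        # get an instance of their own type back.
        instance = cls()
        instance.count = value
        return instance


class LoudCounter(Counter):
    pass


a = Counter.starting_at(3)
b = LoudCounter.starting_at(5)
print(type(a).__name__, a.count)
print(type(b).__name__, b.count)
```

Because the method builds the object with cls() rather than naming Counter directly, calling it on the subclass produces a LoudCounter without any extra code.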

Steps

1. Add the PyYAML dependency

uv add pyyaml

This updates pyproject.toml and installs the library into .venv/.

2. Define the YAML schema

The file format is a flat list of triples — source, relation, target:

# data/projects.yml
associations:
  - source: CISSP
    relation: covers
    target: access-control

  - source: CISSP
    relation: covers
    target: cryptography

  - source: CISSP
    relation: requires
    target: 5-years-experience

This is deliberately simple. Each entry is one fact. No nesting beyond the triple. Easy to read, easy to grep, easy to add to.

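The schema rule: every entry must have exactly three keys: source, relation, target. All values are strings, so quote any value that YAML would otherwise read as a number, boolean, or null (for example "2024" or "yes").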
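The schema rule can be checked mechanically. A minimal sketch, assuming entries have already been parsed into Python objects; validate_entry and REQUIRED_KEYS are hypothetical names and not part of graph.py:

```python
REQUIRED_KEYS = {"source", "relation", "target"}


def validate_entry(entry: object) -> list[str]:
    """Return a list of problems with one entry; an empty list means it is valid."""
    if not isinstance(entry, dict):
        return [f"expected a mapping, got {type(entry).__name__}"]

    problems = []
    keys = set(entry)
    if keys != REQUIRED_KEYS:
        missing = sorted(REQUIRED_KEYS - keys)
        extra = sorted(keys - REQUIRED_KEYS, key=str)
        problems.append(f"missing keys {missing}, unexpected keys {extra}")
    # Check the type of every required key that is present.
    for key in sorted(REQUIRED_KEYS & keys):
        if not isinstance(entry[key], str):
            problems.append(f"{key!r} must be a string, got {type(entry[key]).__name__}")
    return problems


print(validate_entry({"source": "CISSP", "relation": "covers", "target": "cryptography"}))
print(validate_entry({"source": "CISSP", "relation": "covers"}))
```

Returning a list of problems instead of raising on the first one makes it possible to report every issue in a file at once. Whether to enforce this at load time is a design decision the steps below leave open.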

3. Write the save method

Add these imports at the top of graph.py:

from pathlib import Path

import yaml

Then add the method to AssociationGraph:

def save(self, path: Path) -> None:
    """Persist the graph to a YAML file.

    Writes all forward associations as a flat list of triples.
    The reverse dict is not saved -- it is reconstructed on load.
    """
    triples: list[dict[str, str]] = []

    for source, relations in self._forward.items():
        for relation, targets in relations.items():
            for target in targets:
                triples.append({
                    "source": source,
                    "relation": relation,
                    "target": target,
                })

    path.parent.mkdir(parents=True, exist_ok=True)

    with open(path, "w") as f:
        yaml.dump(
            {"associations": triples},
            f,
            default_flow_style=False,
            sort_keys=False,
        )

Walk through this:

  1. Three nested for loops iterate over every source, every relation under that source, and every target under that relation. Each combination becomes one triple.

  2. path.parent.mkdir(parents=True, exist_ok=True) — create the parent directory if it does not exist. parents=True creates intermediate directories (like mkdir -p). exist_ok=True means "don’t error if it already exists."

  3. with open(path, "w") as f: — open for writing. "w" means write mode (truncate the file).

  4. yaml.dump(…) writes the Python dict as YAML.

Only the forward dict is saved. The reverse dict is derived data — it is reconstructed by replaying the associations during load. This avoids storing the same information twice and eliminates any risk of the two dicts becoming inconsistent on disk.
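The flattening step can be tried on its own, without the class. This sketch assumes the forward dict maps source to relation to a list of targets, which is what the loops in save() imply; the variable name forward is a stand-in for self._forward:

```python
# Assumed shape of the forward dict: source -> relation -> list of targets.
forward = {
    "CISSP": {
        "covers": ["access-control", "cryptography"],
        "requires": ["5-years-experience"],
    },
}

# The three nested loops from save(), written as a list comprehension.
triples = [
    {"source": source, "relation": relation, "target": target}
    for source, relations in forward.items()
    for relation, targets in relations.items()
    for target in targets
]

for triple in triples:
    print(triple)
```

The comprehension and the explicit loops produce the same list in the same order. The loop form in save() is easier to step through in a debugger, so there is no need to switch.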

4. Write the load classmethod

@classmethod
def load(cls, path: Path) -> "AssociationGraph":
    """Load a graph from a YAML file.

    Returns an empty graph if the file does not exist.
    """
    graph = cls()

    try:
        with open(path) as f:
            data = yaml.safe_load(f)
    except FileNotFoundError:
        return graph

    if not data or "associations" not in data:
        return graph

    for entry in data["associations"]:
        graph.associate(
            source=entry["source"],
            relation=entry["relation"],
            target=entry["target"],
        )

    return graph

Key points:

  1. cls() calls the constructor — since this is a @classmethod, cls is AssociationGraph. This means if you later subclass, cls will be the subclass, and the method still works.

  2. try/except FileNotFoundError — if the file does not exist, return an empty graph. No crash. This is a deliberate design choice: loading a nonexistent file is not an error, it is an empty state.

  3. Each entry is passed to associate(), which rebuilds both the forward and reverse dicts. The save/load cycle goes through the same code path as manual association.
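The guard in load() handles an empty file (safe_load returns None) and a missing associations key. One shape still slips through: an associations key with no value, which parses to None and cannot be iterated. A sketch of a stricter check, using a hypothetical helper named extract_entries that is not part of the code above:

```python
def extract_entries(data: object) -> list:
    """Return the list under "associations", or [] for any other shape."""
    if not isinstance(data, dict):
        return []  # covers None (empty file), lists, and bare scalars
    entries = data.get("associations")
    if not isinstance(entries, list):
        return []  # covers a missing key and `associations:` with no value
    return entries


print(extract_entries(None))
print(extract_entries({"associations": None}))
print(extract_entries({"associations": [{"source": "A", "relation": "uses", "target": "B"}]}))
```

With a helper like this, the loop in load() could iterate over extract_entries(data) directly. That is one possible tightening, not a required change.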

5. Write the load_directory classmethod

@classmethod
def load_directory(cls, directory: Path) -> "AssociationGraph":
    """Load and merge all YAML files in a directory.

    Each .yml file is loaded independently and merged into one graph.
    This lets you organize data by domain:
        data/projects.yml
        data/certifications.yml
        data/skills.yml
    """
    graph = cls()

    if not directory.is_dir():
        return graph

    for yml_path in sorted(directory.glob("*.yml")):
        try:
            with open(yml_path) as f:
                data = yaml.safe_load(f)
        except FileNotFoundError:
            continue

        if not data or "associations" not in data:
            continue

        for entry in data["associations"]:
            graph.associate(
                source=entry["source"],
                relation=entry["relation"],
                target=entry["target"],
            )

    return graph

sorted() ensures files are loaded in alphabetical order — deterministic behavior. Each file adds its associations to the same graph, merging them. Duplicate associations are handled by the if target not in targets guard in associate().
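The merge behavior can be illustrated without touching the disk. In this sketch, associate is a simplified stand-in for AssociationGraph.associate that tracks only the forward side, and the two file names and their contents are hypothetical:

```python
def associate(forward: dict, source: str, relation: str, target: str) -> None:
    """Simplified stand-in for AssociationGraph.associate (forward side only)."""
    targets = forward.setdefault(source, {}).setdefault(relation, [])
    if target not in targets:  # the guard that makes merging idempotent
        targets.append(target)


# Parsed contents of two hypothetical files, keyed by file name.
files = {
    "b.yml": [
        {"source": "X", "relation": "uses", "target": "Y"},
        {"source": "X", "relation": "uses", "target": "Z"},
    ],
    "a.yml": [{"source": "X", "relation": "uses", "target": "Y"}],
}

merged: dict = {}
for name in sorted(files):  # same ordering rule as load_directory
    for entry in files[name]:
        associate(merged, **entry)

print(merged)
```

The triple X uses Y appears in both files but only once in the result, and Z is appended after it because b.yml is processed second.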

6. Write the tests

Create tests/test_persistence.py:

"""Tests for YAML persistence — Phase 5."""

from pathlib import Path

import pytest

from association_engine.graph import AssociationGraph


@pytest.fixture
def tmp_yaml(tmp_path: Path) -> Path:
    """Return a path for a temporary YAML file."""
    return tmp_path / "test.yml"


@pytest.fixture
def tmp_dir(tmp_path: Path) -> Path:
    """Return a temporary directory for multi-file tests."""
    d = tmp_path / "data"
    d.mkdir()
    return d


class TestSave:
    """Verify save writes valid YAML."""

    def test_creates_file(self, tmp_yaml: Path) -> None:
        g = AssociationGraph()
        g.associate("A", "relates-to", "B")
        g.save(tmp_yaml)
        assert tmp_yaml.exists()

    def test_file_is_valid_yaml(self, tmp_yaml: Path) -> None:
        import yaml

        g = AssociationGraph()
        g.associate("A", "relates-to", "B")
        g.save(tmp_yaml)

        with open(tmp_yaml) as f:
            data = yaml.safe_load(f)

        assert "associations" in data
        assert len(data["associations"]) == 1
        assert data["associations"][0]["source"] == "A"

    def test_creates_parent_directories(self, tmp_path: Path) -> None:
        deep_path = tmp_path / "a" / "b" / "c" / "data.yml"
        g = AssociationGraph()
        g.associate("X", "uses", "Y")
        g.save(deep_path)
        assert deep_path.exists()


class TestLoad:
    """Verify load restores graph state."""

    def test_roundtrip(self, tmp_yaml: Path) -> None:
        original = AssociationGraph()
        original.associate("CISSP", "covers", "access-control")
        original.associate("CISSP", "covers", "cryptography")
        original.associate("CCNP", "covers", "routing")
        original.save(tmp_yaml)

        restored = AssociationGraph.load(tmp_yaml)
        assert restored.query("CISSP") == original.query("CISSP")
        assert restored.query("CCNP") == original.query("CCNP")
        assert restored.keys() == original.keys()

    def test_reverse_rebuilt_on_load(self, tmp_yaml: Path) -> None:
        original = AssociationGraph()
        original.associate("CISSP", "covers", "access-control")
        original.save(tmp_yaml)

        restored = AssociationGraph.load(tmp_yaml)
        rev = restored.reverse_query("access-control")
        assert "CISSP" in rev["covered-by"]

    def test_missing_file_returns_empty(self, tmp_path: Path) -> None:
        g = AssociationGraph.load(tmp_path / "nonexistent.yml")
        assert g.keys() == []

    def test_empty_file_returns_empty(self, tmp_yaml: Path) -> None:
        tmp_yaml.write_text("")
        g = AssociationGraph.load(tmp_yaml)
        assert g.keys() == []


class TestLoadDirectory:
    """Verify multi-file loading and merging."""

    def test_merges_files(self, tmp_dir: Path) -> None:
        g1 = AssociationGraph()
        g1.associate("CISSP", "covers", "crypto")
        g1.save(tmp_dir / "certs.yml")

        g2 = AssociationGraph()
        g2.associate("Python", "uses", "pip")
        g2.save(tmp_dir / "tools.yml")

        merged = AssociationGraph.load_directory(tmp_dir)
        assert "CISSP" in merged.keys()
        assert "Python" in merged.keys()

    def test_missing_directory_returns_empty(self, tmp_path: Path) -> None:
        g = AssociationGraph.load_directory(tmp_path / "ghost")
        assert g.keys() == []

    def test_deduplicates_across_files(self, tmp_dir: Path) -> None:
        for name in ("a.yml", "b.yml"):
            g = AssociationGraph()
            g.associate("X", "uses", "Y")
            g.save(tmp_dir / name)

        merged = AssociationGraph.load_directory(tmp_dir)
        assert merged.query("X")["uses"].count("Y") == 1

tmp_path is a built-in pytest fixture. It provides a unique temporary directory for each test. pytest handles cleanup itself (by default it keeps the directories from the most recent few runs for debugging and removes older ones), so you never need to manage temp files yourself.
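A minimal sketch of a test that relies on tmp_path. The test name and file contents are illustrative and not part of tests/test_persistence.py:

```python
from pathlib import Path


def test_write_then_read(tmp_path: Path) -> None:
    # pytest injects tmp_path because the parameter name matches the fixture.
    target = tmp_path / "nested" / "note.txt"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("hello", encoding="utf-8")

    assert target.read_text(encoding="utf-8") == "hello"
```

No import of pytest is needed for this: requesting a fixture is done purely by naming it as a parameter.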

7. Run

uv run pytest tests/ -v
uv run ruff check src/ tests/

Checklist

  • uv add pyyaml completed

  • save() method writes YAML triples

  • load() classmethod restores a graph from a file

  • load_directory() merges multiple YAML files

  • Missing files return empty graphs, not errors

  • Reverse dict is rebuilt on load (not stored)

  • tests/test_persistence.py with 10 tests

  • All tests pass (Phase 4 tests + Phase 5 tests)

  • Create data/projects.yml with real associations

Verification

uv run pytest tests/ -v --tb=short 2>&1 | tail -5

The graph is now durable. Phase 6 wraps it in a CLI so you can query it from the terminal — the environment where you are already fluent.