Phase 5: Save and Load with YAML
Objective
Give the graph the ability to persist to disk and load from disk using YAML files. The graph becomes durable — you can build it, save it, shut down, and reload it later. You can also split data across multiple YAML files by domain and merge them at load time.
Python Concepts
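| Concept | Plain English |
|---|---|
| `import yaml` | Bring in the `yaml` module (provided by PyYAML). |
| `from pathlib import Path` | Import the `Path` class for working with filesystem paths. |
| `with open(...) as f:` | The `with` statement opens the file and closes it automatically when the block ends. |
| `try` / `except` | Error handling. |
| `yaml.safe_load(f)` | Parse YAML from a file handle into Python data structures. Lists become `list`; mappings become `dict`. |
| `yaml.dump(data, f)` | Serialize Python data structures to YAML and write to a file handle. |
| `Path.glob("*.yml")` | Find all files matching a pattern in a directory. Returns an iterator of `Path` objects. |
| `@classmethod` | A method that belongs to the class itself, not an instance. |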
Steps
1. Add the PyYAML dependency
```shell
uv add pyyaml
```
This updates pyproject.toml and installs the library into .venv/.
2. Define the YAML schema
The file format is a flat list of triples — source, relation, target:
```yaml
# data/projects.yml
associations:
  - source: CISSP
    relation: covers
    target: access-control
  - source: CISSP
    relation: covers
    target: cryptography
  - source: CISSP
    relation: requires
    target: 5-years-experience
```
This is deliberately simple.
Each entry is one fact.
No nesting beyond the triple.
Easy to read, easy to grep, easy to add to.
The schema rule: every entry must have exactly three keys: source, relation, target.
All values are strings.
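The schema rule can be checked mechanically. The sketch below is one possible way to do that; `REQUIRED_KEYS` and `is_valid_triple` are hypothetical names introduced for illustration and are not part of `graph.py`.

```python
# Hypothetical helper: checks one entry against the schema rule
# (exactly three keys, all values strings). Not part of graph.py.
REQUIRED_KEYS = {"source", "relation", "target"}


def is_valid_triple(entry: object) -> bool:
    """Return True if entry is a dict with exactly the three string-valued keys."""
    if not isinstance(entry, dict):
        return False
    if set(entry) != REQUIRED_KEYS:
        return False  # a key is missing, or an extra key is present
    return all(isinstance(value, str) for value in entry.values())


good = {"source": "CISSP", "relation": "covers", "target": "cryptography"}
missing_key = {"source": "CISSP", "relation": "covers"}
non_string = {"source": "CISSP", "relation": "requires", "target": 5}

print(is_valid_triple(good))         # True
print(is_valid_triple(missing_key))  # False
print(is_valid_triple(non_string))   # False
```

The `non_string` case matters in practice: PyYAML loads an unquoted scalar such as `5` as an `int`, so a value like that would need quotes in the file to satisfy the "all values are strings" rule.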
3. Write the save method
Add these imports at the top of graph.py:
```python
from pathlib import Path

import yaml
```
Then add the method to AssociationGraph:
```python
def save(self, path: Path) -> None:
    """Persist the graph to a YAML file.

    Writes all forward associations as a flat list of triples.
    The reverse dict is not saved -- it is reconstructed on load.
    """
    triples: list[dict[str, str]] = []
    for source, relations in self._forward.items():
        for relation, targets in relations.items():
            for target in targets:
                triples.append({
                    "source": source,
                    "relation": relation,
                    "target": target,
                })

    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        yaml.dump(
            {"associations": triples},
            f,
            default_flow_style=False,
            sort_keys=False,
        )
```
Walk through this:
- Three nested `for` loops iterate over every source, every relation under that source, and every target under that relation. Each combination becomes one triple.
- `path.parent.mkdir(parents=True, exist_ok=True)`: create the parent directory if it does not exist. `parents=True` creates intermediate directories (like `mkdir -p`). `exist_ok=True` means "don't error if it already exists."
- `with open(path, "w") as f:`: open for writing. `"w"` means write mode (truncate the file).
- `yaml.dump(…)` writes the Python dict as YAML.
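To see the three nested loops in isolation, here is a minimal sketch that runs them against a plain dict. It assumes the forward dict has the shape source → relation → list of targets, which is what the loops in `save()` imply; `flatten_forward` is a hypothetical standalone name, not a method of `AssociationGraph`.

```python
# Hypothetical standalone version of the flattening step in save().
# Assumes forward maps source -> relation -> list of targets.
def flatten_forward(forward: dict[str, dict[str, list[str]]]) -> list[dict[str, str]]:
    """Turn the nested forward dict into a flat list of triples."""
    triples: list[dict[str, str]] = []
    for source, relations in forward.items():
        for relation, targets in relations.items():
            for target in targets:
                triples.append(
                    {"source": source, "relation": relation, "target": target}
                )
    return triples


forward = {
    "CISSP": {
        "covers": ["access-control", "cryptography"],
        "requires": ["5-years-experience"],
    },
}
triples = flatten_forward(forward)

for triple in triples:
    print(triple["source"], triple["relation"], triple["target"])
```

With this input the function produces three triples, in the same order as the entries in the example `data/projects.yml` from step 2.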
Only the forward dict is saved. The reverse dict is derived data — it is reconstructed by replaying the associations during load. This avoids storing the same information twice and eliminates any risk of the two dicts becoming inconsistent on disk.
4. Write the load classmethod
```python
@classmethod
def load(cls, path: Path) -> "AssociationGraph":
    """Load a graph from a YAML file.

    Returns an empty graph if the file does not exist.
    """
    graph = cls()
    try:
        with open(path) as f:
            data = yaml.safe_load(f)
    except FileNotFoundError:
        return graph

    if not data or "associations" not in data:
        return graph

    for entry in data["associations"]:
        graph.associate(
            source=entry["source"],
            relation=entry["relation"],
            target=entry["target"],
        )
    return graph
```
Key points:
- `cls()` calls the constructor. Since this is a `@classmethod`, `cls` is `AssociationGraph`. If you later subclass, `cls` will be the subclass, and the method still works.
- `try`/`except FileNotFoundError`: if the file does not exist, return an empty graph. No crash. This is a deliberate design choice: loading a nonexistent file is not an error, it is an empty state.
- Each entry is passed to `associate()`, which rebuilds both the forward and reverse dicts. The save/load cycle goes through the same code path as manual association.
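The `cls()` and replay points can be illustrated without touching the file system. The sketch below uses a deliberately reduced stand-in class; `TinyGraph`, `AuditedGraph`, and `from_triples` are hypothetical names, and the stand-in keeps only a forward dict, so it is not the real `AssociationGraph`.

```python
class TinyGraph:
    """Hypothetical stand-in for AssociationGraph (forward dict only)."""

    def __init__(self) -> None:
        self.forward: dict[str, dict[str, list[str]]] = {}

    def associate(self, source: str, relation: str, target: str) -> None:
        targets = self.forward.setdefault(source, {}).setdefault(relation, [])
        if target not in targets:  # duplicate guard, as described for associate()
            targets.append(target)

    @classmethod
    def from_triples(cls, triples: list[dict[str, str]]) -> "TinyGraph":
        graph = cls()  # cls is whichever class the method was called on
        for entry in triples:
            # Replay each triple through the normal association path.
            graph.associate(entry["source"], entry["relation"], entry["target"])
        return graph


class AuditedGraph(TinyGraph):
    """Hypothetical subclass, used only to show what cls resolves to."""


triples = [
    {"source": "CISSP", "relation": "covers", "target": "cryptography"},
    {"source": "CISSP", "relation": "covers", "target": "cryptography"},  # duplicate
]
graph = AuditedGraph.from_triples(triples)

print(type(graph).__name__)  # AuditedGraph
print(graph.forward)         # {'CISSP': {'covers': ['cryptography']}}
```

Because the classmethod builds the instance with `cls()` rather than naming the class directly, the subclass gets an instance of itself, and the duplicate triple is absorbed by the guard inside `associate()`.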
5. Write the load_directory classmethod
```python
@classmethod
def load_directory(cls, directory: Path) -> "AssociationGraph":
    """Load and merge all YAML files in a directory.

    Each .yml file is loaded independently and merged into one graph.
    This lets you organize data by domain:
        data/projects.yml
        data/certifications.yml
        data/skills.yml
    """
    graph = cls()
    if not directory.is_dir():
        return graph

    for yml_path in sorted(directory.glob("*.yml")):
        try:
            with open(yml_path) as f:
                data = yaml.safe_load(f)
        except FileNotFoundError:
            continue

        if not data or "associations" not in data:
            continue

        for entry in data["associations"]:
            graph.associate(
                source=entry["source"],
                relation=entry["relation"],
                target=entry["target"],
            )
    return graph
```
`sorted()` ensures files are loaded in alphabetical order, so behavior is deterministic.
Each file adds its associations to the same graph, merging them.
Duplicate associations are handled by the `if target not in targets` guard in `associate()`.
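The file-discovery step can be tried on its own. The following sketch builds a throwaway directory with the standard-library `tempfile` module and lists what `sorted(directory.glob("*.yml"))` would visit; `list_yaml_files` and the file names are hypothetical, chosen only for the demonstration.

```python
import tempfile
from pathlib import Path


def list_yaml_files(directory: Path) -> list[str]:
    """Return the names of the *.yml files in load order (hypothetical helper)."""
    return [path.name for path in sorted(directory.glob("*.yml"))]


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    # Create files in a deliberately unsorted order, plus two that should be skipped.
    for name in ("skills.yml", "certifications.yml", "projects.yml",
                 "notes.txt", "extra.yaml"):
        (directory / name).write_text("")
    names = list_yaml_files(directory)

print(names)  # ['certifications.yml', 'projects.yml', 'skills.yml']
```

Note that the pattern `*.yml` does not match `extra.yaml`, so files with the longer extension are skipped unless the pattern is widened.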
6. Write the tests
Create tests/test_persistence.py:
```python
"""Tests for YAML persistence, Phase 5."""

from pathlib import Path

import pytest

from association_engine.graph import AssociationGraph


@pytest.fixture
def tmp_yaml(tmp_path: Path) -> Path:
    """Return a path for a temporary YAML file."""
    return tmp_path / "test.yml"


@pytest.fixture
def tmp_dir(tmp_path: Path) -> Path:
    """Return a temporary directory for multi-file tests."""
    d = tmp_path / "data"
    d.mkdir()
    return d


class TestSave:
    """Verify save writes valid YAML."""

    def test_creates_file(self, tmp_yaml: Path) -> None:
        g = AssociationGraph()
        g.associate("A", "relates-to", "B")
        g.save(tmp_yaml)
        assert tmp_yaml.exists()

    def test_file_is_valid_yaml(self, tmp_yaml: Path) -> None:
        import yaml

        g = AssociationGraph()
        g.associate("A", "relates-to", "B")
        g.save(tmp_yaml)
        with open(tmp_yaml) as f:
            data = yaml.safe_load(f)
        assert "associations" in data
        assert len(data["associations"]) == 1
        assert data["associations"][0]["source"] == "A"

    def test_creates_parent_directories(self, tmp_path: Path) -> None:
        deep_path = tmp_path / "a" / "b" / "c" / "data.yml"
        g = AssociationGraph()
        g.associate("X", "uses", "Y")
        g.save(deep_path)
        assert deep_path.exists()


class TestLoad:
    """Verify load restores graph state."""

    def test_roundtrip(self, tmp_yaml: Path) -> None:
        original = AssociationGraph()
        original.associate("CISSP", "covers", "access-control")
        original.associate("CISSP", "covers", "cryptography")
        original.associate("CCNP", "covers", "routing")
        original.save(tmp_yaml)
        restored = AssociationGraph.load(tmp_yaml)
        assert restored.query("CISSP") == original.query("CISSP")
        assert restored.query("CCNP") == original.query("CCNP")
        assert restored.keys() == original.keys()

    def test_reverse_rebuilt_on_load(self, tmp_yaml: Path) -> None:
        original = AssociationGraph()
        original.associate("CISSP", "covers", "access-control")
        original.save(tmp_yaml)
        restored = AssociationGraph.load(tmp_yaml)
        rev = restored.reverse_query("access-control")
        assert "CISSP" in rev["covered-by"]

    def test_missing_file_returns_empty(self, tmp_path: Path) -> None:
        g = AssociationGraph.load(tmp_path / "nonexistent.yml")
        assert g.keys() == []

    def test_empty_file_returns_empty(self, tmp_yaml: Path) -> None:
        tmp_yaml.write_text("")
        g = AssociationGraph.load(tmp_yaml)
        assert g.keys() == []


class TestLoadDirectory:
    """Verify multi-file loading and merging."""

    def test_merges_files(self, tmp_dir: Path) -> None:
        g1 = AssociationGraph()
        g1.associate("CISSP", "covers", "crypto")
        g1.save(tmp_dir / "certs.yml")
        g2 = AssociationGraph()
        g2.associate("Python", "uses", "pip")
        g2.save(tmp_dir / "tools.yml")
        merged = AssociationGraph.load_directory(tmp_dir)
        assert "CISSP" in merged.keys()
        assert "Python" in merged.keys()

    def test_missing_directory_returns_empty(self, tmp_path: Path) -> None:
        g = AssociationGraph.load_directory(tmp_path / "ghost")
        assert g.keys() == []

    def test_deduplicates_across_files(self, tmp_dir: Path) -> None:
        for name in ("a.yml", "b.yml"):
            g = AssociationGraph()
            g.associate("X", "uses", "Y")
            g.save(tmp_dir / name)
        merged = AssociationGraph.load_directory(tmp_dir)
        assert merged.query("X")["uses"].count("Y") == 1
```
tmp_path is a built-in pytest fixture.
It provides a unique temporary directory for each test, automatically cleaned up after the test run.
You never need to manage temp files yourself.
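As a minimal illustration of the fixture, the sketch below shows a test that simply declares a `tmp_path` parameter; pytest fills it in when the test runs. The function name and file contents are hypothetical and independent of the graph code.

```python
from pathlib import Path


def test_write_then_read(tmp_path: Path) -> None:
    """pytest passes a fresh, empty directory as tmp_path for each test."""
    target = tmp_path / "nested" / "note.txt"
    # Same idiom as save(): create missing parent directories first.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("hello")
    assert target.read_text() == "hello"
```

Nothing in the test creates or deletes the temporary directory itself; the test only builds paths underneath the one it was given.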
7. Run
```shell
uv run pytest tests/ -v
uv run ruff check src/ tests/
```
Checklist
- `uv add pyyaml` completed
- `save()` method writes YAML triples
- `load()` classmethod restores a graph from a file
- `load_directory()` merges multiple YAML files
- Missing files return empty graphs, not errors
- Reverse dict is rebuilt on load (not stored)
- `tests/test_persistence.py` with 10 tests
- All tests pass (Phase 4 tests + Phase 5 tests)
- Create `data/projects.yml` with real associations
Verification
```shell
uv run pytest tests/ -v --tb=short 2>&1 | tail -5
```
The graph is now durable. Phase 6 wraps it in a CLI so you can query it from the terminal — the environment where you are already fluent.