# Stage 6 — JoGPT Core Content Master Dataset

## What this folder is

Stage 6 is the **production-ready content layer** for the JoGPT shell. It replaces the Stage 4 output as the authoritative teaching dataset for all 42 Pearson T Level Digital Support and Security core spec sections.

This folder exists because Stage 4, while clean, had two categories of problems that made it unsuitable for direct use as a student-facing resource:

1. **Content weight imbalance** — scraped source volume varied widely between topics, causing some sections to appear more important than others based purely on how much material was found, not how much the spec requires.
2. **Spec compliance gaps** — several sections contained out-of-scope content (US legislation, ABAC), missing framing blocks (CIA triad, IAAA), incorrect command-word depth, and no diagram provision despite the spec mandating visual representations.

Stage 6 corrects all of these and adds the diagram layer.

---

## Files in this folder

### Core_Content_Master_Dataset.md *(primary — use this with the generator)*

The generator-ready master. All 42 spec sections, corrected and standardised. Pipeline metadata stripped. Mermaid diagram blocks removed — the generator (`generate-topic-content.js`) cannot yet parse fenced code blocks, so this is the file to point it at.

**4,675 lines.**

### Core_Content_Master_Dataset_DIAGRAMS.md *(authoring source — do not feed to generator)*

Identical to the above but with all 17 Mermaid diagram blocks restored. This is the long-term authoring source. Once the generator is extended to handle `type: 'mermaid'` blocks, this file replaces the clean version.

**4,975 lines** (300-line difference is the Mermaid code).
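
The difference between the two files can be checked by counting Mermaid fences. The following is a minimal sketch, not part of the existing scripts: `countMermaidBlocks` is a hypothetical helper, and it assumes every diagram opens with a fence line of three backticks followed by `mermaid`.

```javascript
// Build the fence marker programmatically so this example contains no literal fence.
const FENCE_MARK = '`'.repeat(3);

// Hypothetical helper (not part of the existing scripts): counts fenced blocks
// whose opening line is the fence marker followed by "mermaid".
function countMermaidBlocks(markdown) {
  let count = 0;
  let inFence = false;
  for (const line of markdown.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (!inFence) {
      if (trimmed.startsWith(FENCE_MARK)) {
        // Any fence opens a block; only mermaid fences are counted.
        inFence = true;
        if (trimmed === FENCE_MARK + 'mermaid') count += 1;
      }
    } else if (trimmed === FENCE_MARK) {
      inFence = false;
    }
  }
  return count;
}

// Small in-memory example with a single diagram block.
const exampleDoc = [
  '## 8.3',
  FENCE_MARK + 'mermaid',
  'graph LR',
  '  A --> B',
  FENCE_MARK,
  'Prose.',
].join('\n');
console.log(countMermaidBlocks(exampleDoc)); // prints 1
```

Run against the real files (for example, after reading them with `fs.readFileSync`), the count should be 0 for `Core_Content_Master_Dataset.md` and 17 for `Core_Content_Master_Dataset_DIAGRAMS.md`, if the figures in this README hold.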

### apply_stage6_updates.py

The Python script that produced both files above from the Stage 4 corrected output. Kept for reproducibility. Do not re-run unless you need to regenerate from scratch — it overwrites `Core_Content_Master_Dataset.md`.

### stage6_update_log.txt

Run log from the most recent execution of `apply_stage6_updates.py`. Confirms all 16 placeholder replacements, 42 metadata strips, 4 exam angle insertions, and 2 presentation comments.

### data_cleaning/

Contains the analysis inputs that drove Stage 6 corrections:
- `CONTENT_WEIGHT_REFERENCE.md` — per-section volume and tier analysis
- `DIAGRAM_REQUIREMENTS_MAP.html` — which spec sections require or benefit from diagrams, rated M (mandated), S (strongly recommended), or O (optional)
- `EXAMPLE_STANDARDISED_SECTION_2.2.md` — reference example for section structure
- `SECTION_PROMPT_TEMPLATE.md` — the template spec sections must conform to

---

## What changed between Stage 4 and Stage 6

### Stage 4 — `complete_core_master.md` (4,661 lines)

The frozen output from the Stage 3 to Stage 4 pipeline. Combines Core Paper 1 and Core Paper 2. Mechanically cleaned (encoding, deduplication, formatting) but not content-reviewed. Contains:

- US legislation in section 4.1 (HIPAA, ECPA, CFAA) — out of scope for Pearson spec
- ABAC (Attribute-Based Access Control) in section 8.4 — explicitly out of scope
- RuBAC description in section 3.11 using incorrect framing
- Section 1.2 spec table using wrong command words and listing pseudocode as the primary algorithm representation term
- No CIA triad framing block in section 8.1
- No IAAA framing block in section 8.3
- Section 6.1 (Emerging Issues) with only one sub-domain instead of three
- Pipeline metadata embedded in every section (`Sources file`, `Usable Stage 3 sources`, `Subtopic` fields)
- No diagram provision anywhere in the file
- No exam angle callouts

**Stage 4 is preserved as-is at its original path and must not be overwritten.**

### Stage 6 — corrections applied

| Section | What changed |
|---------|-------------|
| 1.1 | Presentation comments added at 1.1.7 and 1.1.9 (sparse sub-sections). Exam angle callout added. Two Mermaid diagrams: CT components graph + decomposition tree. |
| 1.2 | Spec coverage table rebuilt with correct Pearson command words. "Written descriptions using hierarchical markers" replaces pseudocode as the primary term. Exam angle callout added. Two Mermaid diagrams: flowchart symbol set + worked password-reset flowchart. |
| 2.7 | Four Mermaid diagrams: DFD symbol set, worked DFD, IFD symbol set, worked IFD. |
| 2.8 | 5x5 risk matrix added as a Markdown table (Mermaid cannot render grids). |
| 2.9 | Two Mermaid diagrams: Gantt chart + Kanban board. |
| 3.11 | RuBAC description corrected. RuBAC vs RBAC comparison table added. |
| 4.1 | US legislation (HIPAA, ECPA, CFAA) replaced entirely with Budapest Convention on Cybercrime (2001) content — the correct international law for this spec. |
| 6.1 | Expanded from one sub-domain to three: AI and automation, sustainability and e-waste, digital inclusion. |
| 7.1 | RAID 1, RAID 5, and RAID 10 diagrams added. Comparison table added. |
| 7.3 | TCP/IP disambiguation note added. Exam angle callout added. Four Mermaid diagrams: star/mesh/tree topologies, OSI 7-layer model, TCP/IP 4-layer model, data packet structure. |
| 8.1 | CIA triad framing block added at section open. |
| 8.3 | IAAA framing block added and confirmed present. IAAA flow Mermaid diagram added. |
| 8.4 | ABAC removed. CIA triad diagram added. Exam angle callout added. |
| All | Pipeline metadata stripped (42 blocks). All 16 diagram placeholders replaced with actual Mermaid or Markdown content. |
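
One way to confirm that the metadata strip left nothing behind is to scan for the three field names. This is a rough sketch only: `findLeftoverMetadata` is a hypothetical helper, and the line format it looks for (field name at the start of a line, optionally wrapped in list or bold markers, followed by a colon) is an assumption, since the exact Stage 4 metadata layout is not described here.

```javascript
// Field names taken from the Stage 4 description above.
const METADATA_FIELDS = ['Sources file', 'Usable Stage 3 sources', 'Subtopic'];

// Hypothetical helper: returns the 1-based line number and field name for
// every line that looks like a leftover pipeline metadata field.
// ASSUMPTION: a field appears as "Name:" at the start of a line, possibly
// preceded by list, quote, or bold markers.
function findLeftoverMetadata(markdown) {
  const hits = [];
  markdown.split(/\r?\n/).forEach((line, index) => {
    // Drop leading whitespace, list markers, quote markers, and bold markers.
    const cleaned = line.replace(/^[\s\-*_>]+/, '');
    const field = METADATA_FIELDS.find((name) =>
      new RegExp('^' + name + '\\**\\s*:').test(cleaned)
    );
    if (field) hits.push({ line: index + 1, field });
  });
  return hits;
}

const exampleSection = [
  '## 1.1',
  '**Sources file:** example.md',
  'Subtopics are introduced below.',
].join('\n');
console.log(findLeftoverMetadata(exampleSection).length); // prints 1
```

An empty result for `Core_Content_Master_Dataset.md` would be consistent with the 42 stripped blocks reported in the log, provided the format assumption matches the real data.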

---

## How diagrams reach the website

The current generator (`scripts/generate-topic-content.js`) does not parse fenced code blocks. To enable Mermaid rendering, three changes are needed:

1. **`index.html`**: load Mermaid.js from a CDN and call `mermaid.initialize({ startOnLoad: true })`.
2. **`generate-topic-content.js` parser**: in `parseLineBlocks`, detect a fence whose opening line is three backticks followed by `mermaid`, collect the lines up to the closing fence, and emit `{ type: 'mermaid', code: '...' }`.
3. **`generate-topic-content.js` renderer**: add a case in `renderSectionBlocks` for `block.type === 'mermaid'` that outputs `<div class="mermaid">...</div>`.

Once those changes are in place, point the generator at `Core_Content_Master_Dataset_DIAGRAMS.md` instead of the clean version.
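
The parser and renderer changes (steps 2 and 3) might look roughly like the sketch below. It is illustrative rather than a patch: the internals of `parseLineBlocks` and `renderSectionBlocks` are not shown in this README, so `parseMermaidFence`, `renderMermaidBlock`, and `escapeHtml` are hypothetical stand-ins for logic that would live inside those functions. Escaping the diagram source before writing it into HTML is also an assumption, made so that characters such as `<` in diagram text do not get parsed as markup.

```javascript
// Opening fence for a Mermaid block, built without a literal fence in this example.
const MERMAID_FENCE = '`'.repeat(3);

// Hypothetical stand-in for the fence handling inside parseLineBlocks.
// If lines[start] opens a mermaid fence, returns the block and the index of
// the first line after the closing fence; otherwise returns null.
function parseMermaidFence(lines, start) {
  if (lines[start].trim() !== MERMAID_FENCE + 'mermaid') return null;
  const code = [];
  let i = start + 1;
  // Collect lines until the closing fence (or end of input if unterminated).
  while (i < lines.length && lines[i].trim() !== MERMAID_FENCE) {
    code.push(lines[i]);
    i += 1;
  }
  return {
    block: { type: 'mermaid', code: code.join('\n') },
    nextIndex: i + 1,
  };
}

// Minimal HTML escaping for text placed inside an element.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical stand-in for the new case in renderSectionBlocks.
function renderMermaidBlock(block) {
  return '<div class="mermaid">\n' + escapeHtml(block.code) + '\n</div>';
}

// Example: parse one diagram out of a small list of lines and render it.
const demoLines = ['Intro', MERMAID_FENCE + 'mermaid', 'graph TD', '  A --> B', MERMAID_FENCE];
const demoResult = parseMermaidFence(demoLines, 1);
console.log(renderMermaidBlock(demoResult.block));
```

Returning `nextIndex` lets the calling loop skip past the whole fence in one step, so diagram lines never fall through to the ordinary paragraph or list handling.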

---

## Remaining root files (not yet created)

These were listed in the original Stage 6 plan but have not yet been produced:

- `Core_Content_Master_READABLE.pdf`
- `Core_Content_Master_Sources.md`
- `Core_Content_Master_Teaching.md`
- `Core_Content_Master_READABLE.svg`
- `mermaid.mid.js`

---

## Authority order

When any content decision is in dispute:

1. Pearson specification
2. JoGPT Master Source Teaching Guide (`master_source_teaching_FINAL.md`)
3. Stage 6 `Core_Content_Master_Dataset_DIAGRAMS.md`
4. Stage 4 `complete_core_master.md` (reference only — do not modify)
