Validating YAML frontmatter with JSONSchema

Consistency is hard

Since I started using Obsidian, I’ve independently authored around 400 notes. Along the way I’ve kept a relatively consistent schema for my tags and frontmatter attributes:

---
publish: false
summary: ""
aliases: []
title: ""
source: []
tags:
  - Status/New
---

Getting too deep into what all of these mean is outside the scope of this post. For now, it’s enough to know that for any Obsidian note, these properties must be present in order for my pipelines to do their job.

Manually Managed Metadata

Until now, I managed my note frontmatter by hand, or with sed/grep. I’ve got a bit of experience using these tools to manipulate text files, so it’s been relatively comfortable but extremely manual.

Configuration Drift

The problem is that over time, humans get sloppy, forget things, decide to do things differently. In practice, this doesn’t impact the usage of my vault in Obsidian; I access most of my notes via the Quick Switcher so filenames and aliases are the things I really focus on.

Where consistency does matter is automation. Tools that work with Markdown, like static site generators, care a lot about frontmatter metadata.

For these tools to work the way I expect and need them to, I need to guarantee that my notes are configured correctly.

What are the options?

This is a project I’ve been meditating on for a long time. The specific problem I had is that most markdown frontmatter is YAML. I’d done cursory searching and come up with no satisfying results for a “YAML schema engine”, something to formally validate the structure and content of a YAML document.

I was a fool. For years I’d known that YAML was a superset of JSON, and I’d assumed that the superset part meant no tool that expects JSON could ever be guaranteed to work on YAML, which isn’t acceptable for automation.

The detail that matters is that only the syntax is a superset of JSON. For the kind of data that shows up in frontmatter, the underlying data types (null, bool, number, string, array, and object) still map onto JSON 1 to 1. With that revelation, my work could finally begin.
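The JSON half of that mapping can be sketched with nothing but Go’s standard library. The document below is plain JSON, which also makes it valid YAML; `jsonKind` is a hypothetical helper that names the JSON type of each decoded value. This only illustrates the idea of a generic value tree that a schema validator can walk. It doesn’t show how any particular YAML decoder behaves.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonKind names the JSON data type of a value produced by encoding/json.
// (Hypothetical helper, for illustration only.)
func jsonKind(v interface{}) string {
	switch v.(type) {
	case nil:
		return "null"
	case bool:
		return "boolean"
	case float64, json.Number:
		return "number"
	case string:
		return "string"
	case []interface{}:
		return "array"
	case map[string]interface{}:
		return "object"
	default:
		return "not a JSON type"
	}
}

func main() {
	// Valid JSON, and therefore also valid YAML frontmatter.
	doc := `{"publish": false, "summary": "", "aliases": [], "tags": ["Status/New"]}`

	var m map[string]interface{}
	if err := json.Unmarshal([]byte(doc), &m); err != nil {
		fmt.Println("error:", err)
		return
	}
	// Iterate in a fixed order so the output is stable.
	for _, key := range []string{"publish", "summary", "aliases", "tags"} {
		fmt.Printf("%s: %s\n", key, jsonKind(m[key]))
	}
}
```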

golang and jsonschema

My implementation language of choice is Go, naturally. Speed, type-safety, and cross-compilation all make for a great pipeline.

import (
        "fmt"
        "io"

        "github.com/santhosh-tekuri/jsonschema/v5"
        // Imported for its side effect: registers loaders for http:// and https:// schema URLs.
        _ "github.com/santhosh-tekuri/jsonschema/v5/httploader"
        "gopkg.in/yaml.v3"
)

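// Validate decodes a YAML document from r and validates it against the
// JSON Schema found at schemaURL.
func Validate(schemaURL string, r io.Reader) error {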
        var m interface{}

        dec := yaml.NewDecoder(r)
        err := dec.Decode(&m)
        if err != nil {
                return fmt.Errorf("error decoding YAML: %w", err)
        }

        compiler := jsonschema.NewCompiler()
        schema, err := compiler.Compile(schemaURL)
        if err != nil {
                return fmt.Errorf("error compiling schema: %w", err)
        }
        if err := schema.Validate(m); err != nil {
                return fmt.Errorf("error validating target: %w", err)
        }

        return nil
}

Validate() is basically all you need in terms of Go code. The full code repo has a bit more complexity because I’m wiring things through Cobra and stuff, but here’s some sample output:

go run cmd/obp/*.go validate -s https://schemas.ndumas.com/obsidian/note.schema.json -t Resources/blog/published/
2023/06/01 10:31:27 scanning "mapping-aardwolf.md"
2023/06/01 10:31:27 scanning "schema-bad.md"
2023/06/01 10:31:27 validation error: &fmt.wrapError{msg:"error validating target: jsonschema: '' does not validate with https://schemas.ndumas.com/obsidian/note.schema.json#/required: missing properties: 'title', 'summary', 'tags'", err:(*jsonschema.ValidationError)(0xc0000b3740)}
2023/06/01 10:31:27 error count for "schema-bad.md": 1
2023/06/01 10:31:27 scanning "schema-good.md"

You get a relatively detailed summary of why validation failed and a non-zero exit code, exactly what you need to prevent malformed data from entering your pipeline.

how to schema library?

You might notice that when I specify a schema, it’s hosted at schemas.ndumas.com. Here you can find the repository powering that domain.

It’s pretty simple, just a handful of folders and the following Drone pipeline:

kind: pipeline
name: publish-schemas

clone:
  depth: 1


steps:
- name: publish
  image: drillster/drone-rsync
  settings:
    key:
      from_secret: BLOG_DEPLOY_KEY
    user: blog
    port: 22
    delete: true
    recursive: true
    hosts: ["schemas.ndumas.com"]
    source: /drone/src/
    target: /var/www/schemas.ndumas.com/
    include: ["*.schema.json"]
    exclude: ["**.*"]

and this Caddy configuration block:

schemas.ndumas.com {
    encode gzip
    file_server {
        browse
    }
    root * /var/www/schemas.ndumas.com
}

Feel free to browse around the schema site.

Success Story???

At the time of writing, I haven’t folded this into any pipelines. This code is basically my proof-of-concept for only a small part of a larger rewrite of my pipeline.

Future Use Cases

The one use-case that seemed really relevant was for users of the Breadcrumbs plugin. That one uses YAML metadata extensively to create complex hierarchies and relationships. Perfect candidate for a schema validation tool.
