df -h to see which partition is full, then hunting through directories with du
commands to find what's consuming all the space. Maybe it's log files that
weren't rotated properly. Maybe a backup process went rogue. Maybe someone's
script started dumping debug files everywhere.
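Those two manual steps can be bundled into a single shell function. The sketch below is illustrative and not part of Unpage. The function name triage_disk, the default path /, and the top-10 cutoff are assumptions. The du and sort flags assume GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Unpage): the two manual triage steps in one place.
# Assumes GNU coreutils for `du --max-depth` and `sort -h`.

triage_disk() {
  local target="${1:-/}"

  # Step 1: how full is the filesystem that holds $target?
  df -h "$target"

  # Step 2: which direct children of $target use the most space?
  # -x stays on one filesystem; permission errors are discarded.
  du -xh --max-depth=1 "$target" 2>/dev/null | sort -rh | head -n 10
}
```

Calling triage_disk /var, for instance, prints the df summary for the filesystem holding /var. It then lists the largest entries directly under it. du's total for /var itself usually comes first.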
Every single disk space incident follows the same forensic pattern, yet here you
are at 4am, manually typing the same commands while your production systems are
grinding to a halt.
This is the kind of toil that keeps SREs up at night, literally.
The investigation is completely systematic, the data sources are predictable,
and the triage steps never change. It’s the perfect candidate for automation.
$EDITOR. Paste the following Agent definition into the file:
The description of an Agent is used by the Router to decide which Agent to run
for a given input. In this example, we want the Agent to run only when the
alert is about critical disk space issues.
The prompt is where you give the Agent its instructions, written in a runbook
format. Make sure every instruction you give is achievable with the tools you
have allowed the Agent to use (see below).
The tools section explicitly grants the Agent permission to use specific tools.
You can list individual tools, or use wildcards and regex patterns to limit what
the Agent can use.
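This sketch filters a list of tool names both ways to show the difference between a wildcard and a regex pattern. It uses bash's own glob matching and POSIX extended regular expressions. It is not Unpage's implementation, so the exact pattern syntax Unpage accepts may differ. The name other_plugin_tool is hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: bash globs and POSIX EREs stand in for Unpage's pattern matching.
# The first two names come from this tutorial; other_plugin_tool is hypothetical.
tools=(shell_check_disk_space shell_check_large_files other_plugin_tool)

# Print every tool name that matches a glob such as 'shell_*'.
match_glob() {
  local pattern="$1" name
  for name in "${tools[@]}"; do
    # An unquoted right-hand side of == inside [[ ]] is matched as a glob.
    [[ $name == $pattern ]] && echo "$name"
  done
  return 0
}

# Print every tool name that matches an extended regex such as '^shell_check_(disk|large)_'.
match_regex() {
  local regex="$1" name
  for name in "${tools[@]}"; do
    # =~ is unanchored, so add ^ and $ explicitly when you want a full match.
    [[ $name =~ $regex ]] && echo "$name"
  done
  return 0
}
```

With these definitions, match_glob 'shell_*' prints both shell_check_ tools, while match_regex '^shell_check_disk_space$' prints only the first. A glob is usually enough to grant a whole family of tools. A regex helps when you need alternation or anchoring.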
To see all of the available tools your Unpage installation has access to, run:
The shell_check_disk_space and shell_check_large_files tools are custom shell
commands that check disk usage and identify large files on remote instances.
Custom shell commands let you extend Unpage's functionality without writing a
new plugin.
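This is a minimal local sketch of both checks, giving a rough idea of what such commands might run on an instance. The function names, arguments, and defaults are assumptions, not Unpage's configuration. The defaults are a 90% usage threshold and files over 100 MiB under /var/log. check_large_files relies on GNU find for -printf.

```shell
#!/usr/bin/env bash
# Hypothetical local equivalents of the two custom commands (not Unpage's config).

# Print "mount-point usage%" for every filesystem at or above a usage threshold.
# df -P gives one line per filesystem, with capacity in column 5 and mount point in column 6.
check_disk_space() {
  local threshold="${1:-90}"
  df -P | awk -v t="$threshold" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)
    if (pct + 0 >= t) print $6, $5
  }'
}

# Print "size-in-bytes<TAB>path" for the largest regular files above a size limit.
# -xdev stays on one filesystem; find rounds sizes up to whole units (k, M, G) when comparing.
check_large_files() {
  local dir="${1:-/var/log}" min_size="${2:-100M}" limit="${3:-20}"
  find "$dir" -xdev -type f -size "+$min_size" -printf '%s\t%p\n' 2>/dev/null \
    | sort -rn \
    | head -n "$limit"
}
```

For example, check_disk_space 80 lists every filesystem that is at least 80% full. check_large_files /var/log 500M 10 lists the ten largest files over 500 MiB in /var/log. Within Unpage, the equivalent commands target remote instances rather than the local machine.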
Open ~/.unpage/profiles/default/config.yaml and add the following:
Run unpage agent serve and add the webhook URL to your PagerDuty account: