
At Stately, we’ve been hosting our own GitHub Actions Runners on-prem, which has given us a lot of flexibility and control over our builds. Over the last few months we’ve had a few availability hiccups with our runners, which inspired us to whip up a quick Pingdom-style monitoring system that reports outages to our Slack. And as a bonus, we decided to go the serverless route and build it using AWS Lambda with StatelyDB for persistence.
Okay, so it was the third time in a week that our catwoman and scarecrow Runners had gone into the dreaded Offline status on our GitHub Actions page. It was time to take action. We quickly drafted a simple set of requirements: detect when a Runner goes offline, keep a history of outages, and alert the team in Slack.
A data model pretty quickly emerged:
- A Repository contains one or more Runners
- When a Runner becomes unhealthy, we want to create an OutageEvent
Simple! And since we’re using StatelyDB, coming up with an Elastic Schema was really natural. Let’s walk through what each of these logical models looks like expressed as StatelyDB Item Types:
A Repository contains the metadata you would expect, like owner, name, and some standard timestamps. You’ll also notice that our Key Path is /repo-:repoId, which means we’re going to partition our data by repository; this will come into play as you see our other Item Types.
import { itemType, string, bool, timestampMilliseconds } from "@stately-cloud/schema";

itemType("Repository", {
  keyPath: "/repo-:repoId",
  fields: {
    /** Repository identifier (owner/name format) */
    repoId: { type: string },
    /** GitHub repository owner */
    owner: { type: string },
    /** GitHub repository name */
    name: { type: string },
    /** Whether this repository is currently being monitored */
    isActive: { type: bool },
    /** When this repository was first added to monitoring */
    createdAt: {
      type: timestampMilliseconds,
      fromMetadata: "createdAtTime",
    },
    /** Last time monitoring was performed on this repository */
    lastSyncedAt: { type: timestampMilliseconds },
  },
});
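Purely as an illustration, here’s roughly what writing a Repository item with the client generated from this schema looks like. This is a minimal sketch: the import paths, store ID, access key, and region below are placeholders, and the real client setup is shown in the Lambda handler later in the post.
import { createClient } from "./schema/gen"; // hypothetical path to the generated client module
import { accessKeyAuth } from "@stately-cloud/client"; // import path is an assumption

const client = createClient(BigInt(1234567890), { // placeholder store ID
  authTokenProvider: accessKeyAuth({ accessKey: process.env.STATELY_ACCESS_KEY! }),
  region: "us-east-1", // placeholder region
});

// Write a Repository item; its key path (/repo-:repoId) is derived from repoId.
await client.put(
  client.create("Repository", {
    repoId: "StatelyCloud/action-runner-monitor",
    owner: "StatelyCloud",
    name: "action-runner-monitor",
    isActive: true,
    lastSyncedAt: BigInt(Date.now()),
  }),
);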
The star of our show: the Action Runner. Right away you see that we’re using a nested Key Path of /repo-:repoId/runner-:name, which means we’re using the Repository identifier as our partition key. The powerful side effect of this nesting is that we can use the StatelyDB list operation to easily query all of the Items by prefix for a given Repository in a single call!
itemType("Runner", {
// Primary key path: Each runner belongs to a repository
keyPath: "/repo-:repoId/runner-:name",
fields: {
/** GitHub's runner ID (numeric) */
runnerId: { type: uint },
/** Repository this runner belongs to */
repoId: { type: string },
/** Runner name as shown in GitHub */
name: { type: string },
/** Current status of the runner */
status: { type: RunnerStatus },
/** Whether the runner is enabled in GitHub */
enabled: { type: bool },
/** Operating system of the runner */
os: { type: string },
/** Labels assigned to this runner */
labels: { type: arrayOf(Label) },
/** Last time this runner was seen/checked */
lastSeenAt: { type: timestampMilliseconds },
/** First time this runner was discovered */
firstSeenAt: { type: timestampMilliseconds },
/** When this runner record was created */
createdAt: {
type: timestampMilliseconds,
fromMetadata: "createdAtTime",
},
/** Last time this runner record was updated */
updatedAt: {
type: timestampMilliseconds,
fromMetadata: "lastModifiedAtTime",
},
},
});
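That nesting is what lets us grab every Runner for a Repository in one shot. Purely as an illustration (not code from the monitor itself), a prefix list over /repo-:repoId/runner- with the generated TypeScript client might look like this, using the beginList and isType helpers that show up later in the post:
// DatabaseClient and Runner come from the client module generated from our schema.
async function listRunnersForRepo(
  client: DatabaseClient,
  repoId: string,
): Promise<Runner[]> {
  const runners: Runner[] = [];
  // One list call over the repository's key path prefix returns all of its Runner items.
  for await (const item of client.beginList(`/repo-${repoId}/runner-`)) {
    if (client.isType(item, "Runner")) {
      runners.push(item);
    }
  }
  return runners;
}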
And finally we have our OutageEvent Item Type that keeps track of every time a Runner has a hiccup. Notice again how we’re leveraging nested Key Paths to make querying a breeze. We’re also utilizing StatelyDB’s TTL feature to limit our retention period for OutageEvent records to 30 days; once they’re older than the TTL period, StatelyDB will automatically delete them.
itemType("OutageEvent", {
keyPath: "/repo-:repoId/history-:runnerId/outage-:outageId",
ttl: {
// Outage events are retained for 30 days
source: "fromCreated",
durationSeconds: 30 * 24 * 60 * 60,
},
fields: {
/**
* Unique identifier for this outage event.
* These will be sequential per runner, e.g. 1, 2, 3, 4.
*/
outageId: {
type: uint,
initialValue: "sequence",
},
/** Repository this outage belongs to */
repoId: { type: string },
/** Runner that experienced the outage */
runnerId: { type: uint },
/** Runner name as shown in GitHub */
runnerName: { type: string },
/** Status that triggered this outage event */
status: { type: RunnerStatus },
/** When the outage was first detected */
startedAt: { type: timestampMilliseconds },
/** When the outage was resolved (zero if ongoing) */
resolvedAt: {
type: timestampMilliseconds,
required: false,
},
/** Description of the outage */
description: { type: string },
/** Whether a notification was sent for this outage */
notificationSent: { type: bool },
/** When this outage record was created */
createdAt: {
type: timestampMilliseconds,
fromMetadata: "createdAtTime",
},
/** Last time this outage record was updated */
updatedAt: {
type: timestampMilliseconds,
fromMetadata: "lastModifiedAtTime",
},
},
});(The full schema is available here.)
The core of our monitoring system is a Lambda function that runs every five minutes to check the status of all our GitHub runners. This function handles several key tasks: fetching runner status from the GitHub API, syncing Runner records in StatelyDB, detecting when a Runner goes unhealthy or recovers, and sending Slack notifications.
Let’s look at the most important parts of the implementation:
export const handler = async (_event: Record<string, unknown>) => {
  console.log("Starting GitHub runner monitoring process");
  try {
    // Fetch all required parameters from SSM
    const params = await fetchSSMParameters();
    // Initialize StatelyDB client
    const statelyClient = createClient(BigInt(params.statelydbStoreId), {
      authTokenProvider: accessKeyAuth({
        accessKey: params.statelydbAccessKey,
      }),
      region: params.statelydbRegion,
    });
    // Parse the list of repositories to monitor
    const repositories = JSON.parse(params.repositories);
    // Process each repository
    for (const repoId of repositories) {
      // Fetch runners from GitHub API
      const runners = await fetchGitHubRunners(repoId, params.githubToken);
      // Get existing runners from StatelyDB
      const existingRunners = await fetchExistingRunners(statelyClient, repoId);
      // Process each runner...
    }
  } catch (error) {
    console.error("Error in GitHub runner monitoring:", error);
    // Error handling...
  }
};

We’re using AWS SSM Parameter Store to securely store sensitive values like our GitHub token and StatelyDB credentials. This is a best practice for serverless applications that avoids hardcoding secrets in your code.
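The fetchSSMParameters helper isn’t shown above. Here’s a minimal sketch of what it could look like using the AWS SDK v3 SSM client; the parameter names are assumptions (the real ones live under the /github-runner-monitor/ prefix that the CDK stack below grants access to):
import { SSMClient, GetParametersCommand } from "@aws-sdk/client-ssm";

const ssm = new SSMClient({});

// Sketch: fetch the monitor's configuration from SSM Parameter Store.
// Parameter names below are assumptions for illustration.
async function fetchSSMParameters() {
  const names = [
    "/github-runner-monitor/github-token",
    "/github-runner-monitor/statelydb-access-key",
    "/github-runner-monitor/statelydb-store-id",
    "/github-runner-monitor/statelydb-region",
    "/github-runner-monitor/repositories",
    "/github-runner-monitor/slack-webhook",
  ];
  const result = await ssm.send(
    new GetParametersCommand({ Names: names, WithDecryption: true }),
  );
  const byName = Object.fromEntries(
    (result.Parameters ?? []).map((p) => [p.Name, p.Value ?? ""]),
  );
  return {
    githubToken: byName["/github-runner-monitor/github-token"],
    statelydbAccessKey: byName["/github-runner-monitor/statelydb-access-key"],
    statelydbStoreId: byName["/github-runner-monitor/statelydb-store-id"],
    statelydbRegion: byName["/github-runner-monitor/statelydb-region"],
    repositories: byName["/github-runner-monitor/repositories"],
    slackWebhook: byName["/github-runner-monitor/slack-webhook"],
  };
}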
To create the StatelyDB client, we’re using the createClient function generated from our schema along with the accessKeyAuth authentication method, which is the recommended approach for server-side applications. For more information on authentication options, check out the StatelyDB documentation on creating a client.
The full implementation of the Lambda handler can be found in the index.ts file on GitHub.
When a Runner transitions from a healthy to unhealthy state, we need to create an outage record. Similarly, when it recovers, we need to resolve that outage. Here’s how we handle these transitions:
// Check if Runner entered an unhealthy state
if (UNHEALTHY_STATUSES.includes(status) && status !== oldStatus) {
  await handleUnhealthyRunner(
    statelyClient,
    existingRunner,
    status,
    params.slackWebhook,
  );
}

// Check if Runner recovered from an unhealthy state
if (
  !UNHEALTHY_STATUSES.includes(status) &&
  UNHEALTHY_STATUSES.includes(oldStatus)
) {
  const outageId = await resolveOutage(
    statelyClient,
    repoId,
    githubRunner.id,
  );
  if (params.slackWebhook) {
    await sendSlackRecoveryNotification(
      params.slackWebhook,
      existingRunner,
      status,
      outageId,
    );
  }
}

The handleUnhealthyRunner function creates a new OutageEvent record in StatelyDB:
async function handleUnhealthyRunner(
  client: DatabaseClient,
  runner: Runner,
  status: number,
  slackWebhook: string,
) {
  console.log(
    `Runner ${runner.name} (${runner.runnerId}) is now in unhealthy state: ${status}`,
  );
  // Create a new outage event
  const outage = await client.put(
    client.create("OutageEvent", {
      repoId: runner.repoId,
      runnerId: runner.runnerId,
      runnerName: runner.name,
      status,
      startedAt: BigInt(Date.now()),
      description: `Runner ${runner.name} entered ${statusToString(
        status,
      )} state`,
      notificationSent: false,
    }),
  );
  // Send notification to Slack...
}

And when a Runner recovers, we need to resolve any open outage events:
async function resolveOutage(
  client: DatabaseClient,
  repoId: string,
  runnerId: number,
): Promise<bigint> {
  let lastOutageId: bigint = BigInt(0);
  // Find the last outage for this runner
  for await (const item of client.beginList(
    `/repo-${repoId}/history-${runnerId}/outage-`,
    { limit: 1, sortDirection: SortDirection.SORT_DESCENDING },
  )) {
    if (client.isType(item, "OutageEvent") && !item.resolvedAt) {
      // Mark outage as resolved
      item.resolvedAt = BigInt(Date.now());
      await client.put(item);
      lastOutageId = item.outageId;
      console.log(`Resolved outage ${item.outageId} for runner ${runnerId}`);
    }
  }
  return lastOutageId;
}

This is a great example of how StatelyDB’s Key Path design makes querying easy. We can efficiently find the latest outage for a specific Runner by using the beginList operation with a Key Path prefix and sorting in descending order.
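The transition checks above lean on two small helpers, UNHEALTHY_STATUSES and statusToString, that we haven’t shown. The real versions live in the repo; here’s a hypothetical sketch of their shape, assuming the schema’s RunnerStatus enum maps GitHub’s runner states to numeric values (the values below are made up for illustration):
// Hypothetical numeric values standing in for the schema's RunnerStatus enum.
const RUNNER_STATUS = { UNKNOWN: 0, ONLINE: 1, OFFLINE: 2, BUSY: 3 } as const;

// Statuses we treat as "unhealthy" for outage tracking.
const UNHEALTHY_STATUSES: number[] = [RUNNER_STATUS.OFFLINE];

// Human-readable label for a status value, used in Slack messages and descriptions.
function statusToString(status: number): string {
  switch (status) {
    case RUNNER_STATUS.ONLINE:
      return "Online";
    case RUNNER_STATUS.OFFLINE:
      return "Offline";
    case RUNNER_STATUS.BUSY:
      return "Busy";
    default:
      return "Unknown";
  }
}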
A monitoring system isn’t very useful if no one knows when there’s a problem. We integrated Slack notifications to alert our team whenever a Runner goes offline or recovers:
async function sendSlackNotification(
  slackWebhook: string,
  runner: Runner,
  status: number,
  outageId: bigint,
) {
  const statusText = statusToString(status);
  const message = {
    text: `🚨 GitHub Runner Alert 🚨`,
    blocks: [
      {
        type: "header",
        text: {
          type: "plain_text",
          text: `🚨 GitHub Runner Alert: ${statusText} 🚨`,
        },
      },
      {
        type: "section",
        fields: [
          {
            type: "mrkdwn",
            text: `*Repository:*\n${runner.repoId}`,
          },
          {
            type: "mrkdwn",
            text: `*Runner:*\n${runner.name}`,
          },
          {
            type: "mrkdwn",
            text: `*Status:*\n${statusText}`,
          },
          {
            type: "mrkdwn",
            text: `*Outage ID:*\n${outageId}`,
          },
        ],
      },
      {
        type: "context",
        elements: [
          {
            type: "mrkdwn",
            text: `Detected at ${new Date().toISOString()}`,
          },
        ],
      },
    ],
  };
  await axios.post(slackWebhook, message);
  console.log(`Sent Slack notification for runner ${runner.name}`);
}

We use Slack’s Block Kit to create nicely formatted messages with all the relevant information about the outage.
Reactive notifications are great, but we also wanted a way to check in on our Runners on-demand. We implemented a second Lambda function that handles Slack slash commands, allowing team members to query runner status and history directly from Slack:
export const handler = async (
  event: APIGatewayProxyEvent,
): Promise<APIGatewayProxyResult> => {
  try {
    // Verify request signature...
    // Initialize StatelyDB client
    const statelyClient = await getStatelyClient();
    // For now, we only care about one repo
    const repoId = "stately";
    // Parse the event body as URL-encoded data
    const body = querystring.parse(event.body || "");
    const { command, text } = body;
    // Respond to slash commands
    let blocks: SlackBlock[] = [];
    switch (command) {
      case "/runner-history":
        if (!text) {
          blocks = [
            {
              type: "section",
              text: {
                type: "mrkdwn",
                text: "Please provide a runner name.",
              },
            },
          ];
        } else {
          blocks = await getRecentOutagesForRunner(
            statelyClient,
            repoId,
            text as string,
          );
        }
        break;
      case "/runner-status-all":
        blocks = await getStatusForRunners(statelyClient, repoId);
        break;
      default:
        // Handle unknown command...
        break;
    }
    return {
      statusCode: 200,
      headers: {
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        response_type: "in_channel",
        blocks: blocks,
      }),
    };
  } catch (error) {
    console.error("Error:", error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: "Internal server error" }),
    };
  }
};

We’ve implemented two commands:
- /runner-history [runner-name] - Shows the recent outage history for a specific runner
- /runner-status-all - Shows the current status of all runners
The implementation for getting runner history shows another great example of using StatelyDB’s list operation with Key Paths:
async function getRecentOutagesForRunner(
  statelyClient: DatabaseClient,
  repoId: string,
  runnerName: string,
): Promise<SlackBlock[]> {
  // First look up the Runner ID
  const runner = await statelyClient.get(
    "Runner",
    keyPath`/repo-${repoId}/runner-${runnerName}`,
  );
  if (!runner) {
    return [
      {
        type: "section",
        text: {
          type: "mrkdwn",
          text: `Sorry, I couldn't find a runner with the name "${runnerName}"`,
        },
      },
    ];
  }
  const latestOutages: OutageEvent[] = [];
  const iter = statelyClient.beginList(
    keyPath`/repo-${repoId}/history-${runner.runnerId}/outage-`,
    { limit: 5, sortDirection: SortDirection.SORT_DESCENDING },
  );
  for await (const item of iter) {
    if (statelyClient.isType(item, "OutageEvent")) {
      latestOutages.push(item);
    }
  }
  // Format blocks for Slack response...
}

Note the use of the keyPath tagged template literal, which is a helper function provided by StatelyDB that ensures IDs are correctly formatted in key paths. This is especially important when working with UUIDs and other binary data.
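The /runner-status-all command is backed by a similar helper, getStatusForRunners. The real implementation lives in slack.ts; a minimal sketch of the StatelyDB side could look like this, reusing the same prefix-list pattern (the formatting is simplified for illustration):
// Sketch: list every Runner under the repository prefix and summarize its status.
async function getStatusForRunners(
  statelyClient: DatabaseClient,
  repoId: string,
): Promise<SlackBlock[]> {
  const lines: string[] = [];
  for await (const item of statelyClient.beginList(
    keyPath`/repo-${repoId}/runner-`,
    { limit: 100 },
  )) {
    if (statelyClient.isType(item, "Runner")) {
      lines.push(`• *${item.name}* - ${statusToString(item.status)}`);
    }
  }
  return [
    {
      type: "section",
      text: { type: "mrkdwn", text: lines.join("\n") || "No runners found." },
    },
  ];
}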
The full implementation of the Slack command handler can be found in the slack.ts file on GitHub.
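One piece the handler elides is the "Verify request signature..." step. Slack signs every slash-command request, and we check that signature before touching the database. Here’s a minimal sketch of that verification, assuming the signing secret is available to the function (for example from SSM alongside the other parameters):
import * as crypto from "crypto";
import type { APIGatewayProxyEvent } from "aws-lambda";

// Sketch: validate Slack's request signature per
// https://api.slack.com/authentication/verifying-requests-from-slack
function isValidSlackRequest(event: APIGatewayProxyEvent, signingSecret: string): boolean {
  const timestamp =
    event.headers["x-slack-request-timestamp"] ?? event.headers["X-Slack-Request-Timestamp"];
  const signature =
    event.headers["x-slack-signature"] ?? event.headers["X-Slack-Signature"];
  if (!timestamp || !signature) {
    return false;
  }
  // Reject requests older than five minutes to guard against replays.
  if (Math.abs(Date.now() / 1000 - Number(timestamp)) > 60 * 5) {
    return false;
  }
  const baseString = `v0:${timestamp}:${event.body ?? ""}`;
  const expected =
    "v0=" + crypto.createHmac("sha256", signingSecret).update(baseString).digest("hex");
  if (expected.length !== signature.length) {
    return false;
  }
  return crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}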
We used AWS CDK to define our infrastructure as code. This makes our deployment process repeatable and transparent. Our CDK stack includes: an IAM execution role with access to our SSM parameters, the monitoring Lambda function, an EventBridge rule that triggers it every five minutes, and the Slack interaction Lambda fronted by API Gateway.
Here’s a simplified version of our CDK stack:
export class GitHubRunnerMonitorStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // Lambda execution role
    const lambdaRole = new iam.Role(this, "GitHubRunnerMonitorRole", {
      assumedBy: new iam.ServicePrincipal("lambda.amazonaws.com"),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          "service-role/AWSLambdaBasicExecutionRole",
        ),
      ],
    });
    // Add permissions to access SSM parameters
    lambdaRole.addToPolicy(
      new iam.PolicyStatement({
        actions: ["ssm:GetParameter", "ssm:GetParameters"],
        resources: [
          `arn:aws:ssm:${this.region}:${this.account}:parameter/github-runner-monitor/*`,
        ],
      }),
    );
    // Create monitoring Lambda function
    const monitorFunction = new lambda.Function(
      this,
      "GitHubRunnerMonitorFunction",
      {
        runtime: lambda.Runtime.NODEJS_18_X,
        handler: "index.handler",
        code: lambda.Code.fromAsset(path.join(__dirname, "../dist")),
        timeout: cdk.Duration.minutes(5),
        memorySize: 512,
        role: lambdaRole,
        environment: {
          NODE_OPTIONS: "--enable-source-maps",
        },
        description:
          "Lambda function to monitor GitHub self-hosted runners and alert when unhealthy",
      },
    );
    // Create EventBridge rule to trigger Lambda every 5 minutes
    const rule = new events.Rule(this, "ScheduleRule", {
      schedule: events.Schedule.rate(cdk.Duration.minutes(5)),
      description: "Trigger GitHub runner monitoring every 5 minutes",
    });
    // Add Lambda as target for the rule
    rule.addTarget(
      new targets.LambdaFunction(monitorFunction, {
        retryAttempts: 2,
      }),
    );
    // Create Slack interaction Lambda and API Gateway...
  }
}

The full CDK stack can be found in the github-runner-monitor-stack.ts file on GitHub.
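The stack above also elides the Slack pieces behind the "Create Slack interaction Lambda and API Gateway..." comment. Here’s a minimal sketch of what that could look like (construct names and the handler file are assumptions; the real definitions are in the repo):
// Assumes: import * as apigateway from "aws-cdk-lib/aws-apigateway";
// Sketch: a second Lambda for the Slack slash commands, reusing the same execution role.
const slackFunction = new lambda.Function(this, "SlackCommandFunction", {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: "slack.handler",
  code: lambda.Code.fromAsset(path.join(__dirname, "../dist")),
  timeout: cdk.Duration.seconds(30),
  memorySize: 256,
  role: lambdaRole,
});

// Expose the function at an endpoint that Slack can POST slash commands to.
const api = new apigateway.LambdaRestApi(this, "SlackCommandApi", {
  handler: slackFunction,
  proxy: true,
});

// Output the invoke URL so it can be pasted into the Slack app configuration.
new cdk.CfnOutput(this, "SlackCommandUrl", { value: api.url });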
So there you have it: a simple monitoring system using StatelyDB and AWS Lambda. This project was a perfect match for StatelyDB’s strengths: nested Key Paths made our queries simple and efficient, TTLs took care of data retention for us, and Elastic Schema let the data model come together naturally.
If you’re interested in using this monitoring system for your own GitHub Action runners, the full source code is available at https://github.com/StatelyCloud/action-runner-monitor.
Want to learn more about StatelyDB? Check out our documentation or read more about our Elastic Schema on our blog.