
At Stately, we’ve been hosting our own GitHub Actions Runners on-prem, which has given us a lot of flexibility and control over our builds. Over the last few months we’ve had a few availability hiccups with our runners, which inspired us to whip up a quick Pingdom-style monitoring system that reports outages to our Slack. And as a bonus, we decided to go the serverless route and build it using AWS Lambda with StatelyDB for persistence.
Okay, so it was the third time in a week that our catwoman and scarecrow Runners had gone into the dreaded Offline status on our GitHub Actions page. It was time to take action. We quickly drafted a simple set of requirements: detect when a Runner goes offline, keep a history of outages, and alert the team in Slack.
A data model pretty quickly emerged:
- A Repository contains one or more Runners
- When a Runner becomes unhealthy, we want to create an OutageEvent
Simple! And since we’re using StatelyDB, coming up with an Elastic Schema was really natural. Let’s walk through what each of these logical models looks like expressed as StatelyDB Item Types:
A Repository contains the metadata you would expect, like owner, name, and some standard timestamps. You’ll also notice that our Key Path is /repo-:repoId, which means we’re going to partition our data by repository; this will come into play as you see our other Item Types.
import { itemType, string, bool, timestampMilliseconds } from "@stately-cloud/schema";

itemType("Repository", {
  keyPath: "/repo-:repoId",
  fields: {
    /** Repository identifier (owner/name format) */
    repoId: { type: string },
    /** GitHub repository owner */
    owner: { type: string },
    /** GitHub repository name */
    name: { type: string },
    /** Whether this repository is currently being monitored */
    isActive: { type: bool },
    /** When this repository was first added to monitoring */
    createdAt: {
      type: timestampMilliseconds,
      fromMetadata: "createdAtTime",
    },
    /** Last time monitoring was performed on this repository */
    lastSyncedAt: { type: timestampMilliseconds },
  },
});
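Purely as an illustration, here’s roughly what writing a Repository item with the client generated from this schema looks like. This is a minimal sketch: the import paths, store ID, access key, and region below are placeholders, and the real client setup is shown in the Lambda handler later in the post.
import { createClient } from "./schema/gen"; // hypothetical path to the generated client module
import { accessKeyAuth } from "@stately-cloud/client"; // import path is an assumption

const client = createClient(BigInt(1234567890), { // placeholder store ID
  authTokenProvider: accessKeyAuth({ accessKey: process.env.STATELY_ACCESS_KEY! }),
  region: "us-east-1", // placeholder region
});

// Write a Repository item; its key path (/repo-:repoId) is derived from repoId.
await client.put(
  client.create("Repository", {
    repoId: "StatelyCloud/action-runner-monitor",
    owner: "StatelyCloud",
    name: "action-runner-monitor",
    isActive: true,
    lastSyncedAt: BigInt(Date.now()),
  }),
);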
The star of our show: the Action Runner. Right away you see that we’re using a nested Key Path of /repo-:repoId/runner-:name, which means we’re using the Repository identifier as our partition key. The powerful side effect of this nesting is that we can use the StatelyDB list operation to easily query all of the Items by prefix for a given Repository in a single call!
itemType("Runner", {
// Primary key path: Each runner belongs to a repository
keyPath: "/repo-:repoId/runner-:name",
fields: {
/** GitHub's runner ID (numeric) */
runnerId: { type: uint },
/** Repository this runner belongs to */
repoId: { type: string },
/** Runner name as shown in GitHub */
name: { type: string },
/** Current status of the runner */
status: { type: RunnerStatus },
/** Whether the runner is enabled in GitHub */
enabled: { type: bool },
/** Operating system of the runner */
os: { type: string },
/** Labels assigned to this runner */
labels: { type: arrayOf(Label) },
/** Last time this runner was seen/checked */
lastSeenAt: { type: timestampMilliseconds },
/** First time this runner was discovered */
firstSeenAt: { type: timestampMilliseconds },
/** When this runner record was created */
createdAt: {
type: timestampMilliseconds,
fromMetadata: "createdAtTime",
},
/** Last time this runner record was updated */
updatedAt: {
type: timestampMilliseconds,
fromMetadata: "lastModifiedAtTime",
},
},
});
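That nesting is what lets us grab every Runner for a Repository in one shot. Purely as an illustration (not code from the monitor itself), a prefix list over /repo-:repoId/runner- with the generated TypeScript client might look like this, using the beginList and isType helpers that show up later in the post:
// DatabaseClient and Runner come from the client module generated from our schema.
async function listRunnersForRepo(
  client: DatabaseClient,
  repoId: string,
): Promise<Runner[]> {
  const runners: Runner[] = [];
  // One list call over the repository's key path prefix returns all of its Runner items.
  for await (const item of client.beginList(`/repo-${repoId}/runner-`)) {
    if (client.isType(item, "Runner")) {
      runners.push(item);
    }
  }
  return runners;
}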
And finally we have our OutageEvent Item Type that keeps track of every time a Runner has a hiccup. Notice again how we’re leveraging nested Key Paths to make querying a breeze. We’re also utilizing StatelyDB’s TTL feature to limit our retention period for OutageEvent records to 30 days; once they’re older than the TTL period, StatelyDB will automatically delete them.
itemType("OutageEvent", {
keyPath: "/repo-:repoId/history-:runnerId/outage-:outageId",
ttl: {
// Outage events are retained for 30 days
source: "fromCreated",
durationSeconds: 30 * 24 * 60 * 60,
},
fields: {
/**
* Unique identifier for this outage event.
* These will be sequential per runner, e.g. 1, 2, 3, 4.
*/
outageId: {
type: uint,
initialValue: "sequence",
},
/** Repository this outage belongs to */
repoId: { type: string },
/** Runner that experienced the outage */
runnerId: { type: uint },
/** Runner name as shown in GitHub */
runnerName: { type: string },
/** Status that triggered this outage event */
status: { type: RunnerStatus },
/** When the outage was first detected */
startedAt: { type: timestampMilliseconds },
/** When the outage was resolved (zero if ongoing) */
resolvedAt: {
type: timestampMilliseconds,
required: false,
},
/** Description of the outage */
description: { type: string },
/** Whether a notification was sent for this outage */
notificationSent: { type: bool },
/** When this outage record was created */
createdAt: {
type: timestampMilliseconds,
fromMetadata: "createdAtTime",
},
/** Last time this outage record was updated */
updatedAt: {
type: timestampMilliseconds,
fromMetadata: "lastModifiedAtTime",
},
},
});(The full schema is available here.)
The core of our monitoring system is a Lambda function that runs every five minutes to check the status of all our GitHub runners. This function handles several key tasks: fetching runner status from the GitHub API, syncing Runner records in StatelyDB, detecting when a Runner goes unhealthy or recovers, and sending Slack notifications.
Let’s look at the most important parts of the implementation:
export const handler = async (_event: Record<string, unknown>) => {
  console.log("Starting GitHub runner monitoring process");
  try {
    // Fetch all required parameters from SSM
    const params = await fetchSSMParameters();
    // Initialize StatelyDB client
    const statelyClient = createClient(BigInt(params.statelydbStoreId), {
      authTokenProvider: accessKeyAuth({
        accessKey: params.statelydbAccessKey,
      }),
      region: params.statelydbRegion,
    });
    // Parse the list of repositories to monitor
    const repositories = JSON.parse(params.repositories);
    // Process each repository
    for (const repoId of repositories) {
      // Fetch runners from GitHub API
      const runners = await fetchGitHubRunners(repoId, params.githubToken);
      // Get existing runners from StatelyDB
      const existingRunners = await fetchExistingRunners(statelyClient, repoId);
      // Process each runner...
    }
  } catch (error) {
    console.error("Error in GitHub runner monitoring:", error);
    // Error handling...
  }
};

We’re using AWS SSM Parameter Store to securely store sensitive values like our GitHub token and StatelyDB credentials. This is a best practice for serverless applications that avoids hardcoding secrets in your code.
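The fetchSSMParameters helper isn’t shown above. Here’s a minimal sketch of what it could look like using the AWS SDK v3 SSM client; the parameter names are assumptions (the real ones live under the /github-runner-monitor/ prefix that the CDK stack below grants access to):
import { SSMClient, GetParametersCommand } from "@aws-sdk/client-ssm";

const ssm = new SSMClient({});

// Sketch: fetch the monitor's configuration from SSM Parameter Store.
// Parameter names below are assumptions for illustration.
async function fetchSSMParameters() {
  const names = [
    "/github-runner-monitor/github-token",
    "/github-runner-monitor/statelydb-access-key",
    "/github-runner-monitor/statelydb-store-id",
    "/github-runner-monitor/statelydb-region",
    "/github-runner-monitor/repositories",
    "/github-runner-monitor/slack-webhook",
  ];
  const result = await ssm.send(
    new GetParametersCommand({ Names: names, WithDecryption: true }),
  );
  const byName = Object.fromEntries(
    (result.Parameters ?? []).map((p) => [p.Name, p.Value ?? ""]),
  );
  return {
    githubToken: byName["/github-runner-monitor/github-token"],
    statelydbAccessKey: byName["/github-runner-monitor/statelydb-access-key"],
    statelydbStoreId: byName["/github-runner-monitor/statelydb-store-id"],
    statelydbRegion: byName["/github-runner-monitor/statelydb-region"],
    repositories: byName["/github-runner-monitor/repositories"],
    slackWebhook: byName["/github-runner-monitor/slack-webhook"],
  };
}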
To create the StatelyDB client, we’re using the createClient function generated from our schema along with the accessKeyAuth authentication method, which is the recommended approach for server-side applications. For more information on authentication options, check out the StatelyDB documentation on creating a client.
The full implementation of the Lambda handler can be found in the index.ts file on GitHub.
When a Runner transitions from a healthy to unhealthy state, we need to create an outage record. Similarly, when it recovers, we need to resolve that outage. Here’s how we handle these transitions:
// Check if Runner entered an unhealthy state
if (UNHEALTHY_STATUSES.includes(status) && status !== oldStatus) {
  await handleUnhealthyRunner(
    statelyClient,
    existingRunner,
    status,
    params.slackWebhook,
  );
}

// Check if Runner recovered from an unhealthy state
if (
  !UNHEALTHY_STATUSES.includes(status) &&
  UNHEALTHY_STATUSES.includes(oldStatus)
) {
  const outageId = await resolveOutage(
    statelyClient,
    repoId,
    githubRunner.id,
  );
  if (params.slackWebhook) {
    await sendSlackRecoveryNotification(
      params.slackWebhook,
      existingRunner,
      status,
      outageId,
    );
  }
}

The handleUnhealthyRunner function creates a new OutageEvent record in StatelyDB:
async function handleUnhealthyRunner(
  client: DatabaseClient,
  runner: Runner,
  status: number,
  slackWebhook: string,
) {
  console.log(
    `Runner ${runner.name} (${runner.runnerId}) is now in unhealthy state: ${status}`,
  );
  // Create a new outage event
  const outage = await client.put(
    client.create("OutageEvent", {
      repoId: runner.repoId,
      runnerId: runner.runnerId,
      runnerName: runner.name,
      status,
      startedAt: BigInt(Date.now()),
      description: `Runner ${runner.name} entered ${statusToString(
        status,
      )} state`,
      notificationSent: false,
    }),
  );
  // Send notification to Slack...
}

And when a Runner recovers, we need to resolve any open outage events:
async function resolveOutage(
  client: DatabaseClient,
  repoId: string,
  runnerId: number,
): Promise<bigint> {
  let lastOutageId: bigint = BigInt(0);
  // Find the last outage for this runner
  for await (const item of client.beginList(
    `/repo-${repoId}/history-${runnerId}/outage-`,
    { limit: 1, sortDirection: SortDirection.SORT_DESCENDING },
  )) {
    if (client.isType(item, "OutageEvent") && !item.resolvedAt) {
      // Mark outage as resolved
      item.resolvedAt = BigInt(Date.now());
      await client.put(item);
      lastOutageId = item.outageId;
      console.log(`Resolved outage ${item.outageId} for runner ${runnerId}`);
    }
  }
  return lastOutageId;
}

This is a great example of how StatelyDB’s Key Path design makes querying easy. We can efficiently find the latest outage for a specific Runner by using the beginList operation with a Key Path prefix and sorting in descending order.
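The transition checks above lean on two small helpers, UNHEALTHY_STATUSES and statusToString, that we haven’t shown. The real versions live in the repo; here’s a hypothetical sketch of their shape, assuming the schema’s RunnerStatus enum maps GitHub’s runner states to numeric values (the values below are made up for illustration):
// Hypothetical numeric values standing in for the schema's RunnerStatus enum.
const RUNNER_STATUS = { UNKNOWN: 0, ONLINE: 1, OFFLINE: 2, BUSY: 3 } as const;

// Statuses we treat as "unhealthy" for outage tracking.
const UNHEALTHY_STATUSES: number[] = [RUNNER_STATUS.OFFLINE];

// Human-readable label for a status value, used in Slack messages and descriptions.
function statusToString(status: number): string {
  switch (status) {
    case RUNNER_STATUS.ONLINE:
      return "Online";
    case RUNNER_STATUS.OFFLINE:
      return "Offline";
    case RUNNER_STATUS.BUSY:
      return "Busy";
    default:
      return "Unknown";
  }
}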
A monitoring system isn’t very useful if no one knows when there’s a problem. We integrated Slack notifications to alert our team whenever a Runner goes offline or recovers:
async function sendSlackNotification(
  slackWebhook: string,
  runner: Runner,
  status: number,
  outageId: bigint,
) {
  const statusText = statusToString(status);
  const message = {
    text: `🚨 GitHub Runner Alert 🚨`,
    blocks: [
      {
        type: "header",
        text: {
          type: "plain_text",
          text: `🚨 GitHub Runner Alert: ${statusText} 🚨`,
        },
      },
      {
        type: "section",
        fields: [
          {
            type: "mrkdwn",
            text: `*Repository:*\n${runner.repoId}`,
          },
          {
            type: "mrkdwn",
            text: `*Runner:*\n${runner.name}`,
          },
          {
            type: "mrkdwn",
            text: `*Status:*\n${statusText}`,
          },
          {
            type: "mrkdwn",
            text: `*Outage ID:*\n${outageId}`,
          },
        ],
      },
      {
        type: "context",
        elements: [
          {
            type: "mrkdwn",
            text: `Detected at ${new Date().toISOString()}`,
          },
        ],
      },
    ],
  };
  await axios.post(slackWebhook, message);
  console.log(`Sent Slack notification for runner ${runner.name}`);
}

We use Slack’s Block Kit to create nicely formatted messages with all the relevant information about the outage.
Reactive notifications are great, but we also wanted a way to check in on our Runners on-demand. We implemented a second Lambda function that handles Slack slash commands, allowing team members to query runner status and history directly from Slack:
export const handler = async (
  event: APIGatewayProxyEvent,
): Promise<APIGatewayProxyResult> => {
  try {
    // Verify request signature...
    // Initialize StatelyDB client
    const statelyClient = await getStatelyClient();
    // For now, we only care about one repo
    const repoId = "stately";
    // Parse the event body as URL-encoded data
    const body = querystring.parse(event.body || "");
    const { command, text } = body;
    // Respond to slash commands
    let blocks: SlackBlock[] = [];
    switch (command) {
      case "/runner-history":
        if (!text) {
          blocks = [
            {
              type: "section",
              text: {
                type: "mrkdwn",
                text: "Please provide a runner name.",
              },
            },
          ];
        } else {
          blocks = await getRecentOutagesForRunner(
            statelyClient,
            repoId,
            text as string,
          );
        }
        break;
      case "/runner-status-all":
        blocks = await getStatusForRunners(statelyClient, repoId);
        break;
      default:
        // Handle unknown command...
        break;
    }
    return {
      statusCode: 200,
      headers: {
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        response_type: "in_channel",
        blocks: blocks,
      }),
    };
  } catch (error) {
    console.error("Error:", error);
    return {
      statusCode: 500,
      body: JSON.stringify({ error: "Internal server error" }),
    };
  }
};

We’ve implemented two commands:
- /runner-history [runner-name] - Shows the recent outage history for a specific runner
- /runner-status-all - Shows the current status of all runners
The implementation for getting runner history shows another great example of using StatelyDB’s list operation with Key Paths:
async function getRecentOutagesForRunner(
  statelyClient: DatabaseClient,
  repoId: string,
  runnerName: string,
): Promise<SlackBlock[]> {
  // First look up the Runner ID
  const runner = await statelyClient.get(
    "Runner",
    keyPath`/repo-${repoId}/runner-${runnerName}`,
  );
  if (!runner) {
    return [
      {
        type: "section",
        text: {
          type: "mrkdwn",
          text: `Sorry, I couldn't find a runner with the name "${runnerName}"`,
        },
      },
    ];
  }
  const latestOutages: OutageEvent[] = [];
  const iter = statelyClient.beginList(
    keyPath`/repo-${repoId}/history-${runner.runnerId}/outage-`,
    { limit: 5, sortDirection: SortDirection.SORT_DESCENDING },
  );
  for await (const item of iter) {
    if (statelyClient.isType(item, "OutageEvent")) {
      latestOutages.push(item);
    }
  }
  // Format blocks for Slack response...
}

Note the use of the keyPath tagged template literal, which is a helper function provided by StatelyDB that ensures IDs are correctly formatted in key paths. This is especially important when working with UUIDs and other binary data.
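The /runner-status-all command is backed by a similar helper, getStatusForRunners. The real implementation lives in slack.ts; a minimal sketch of the StatelyDB side could look like this, reusing the same prefix-list pattern (the formatting is simplified for illustration):
// Sketch: list every Runner under the repository prefix and summarize its status.
async function getStatusForRunners(
  statelyClient: DatabaseClient,
  repoId: string,
): Promise<SlackBlock[]> {
  const lines: string[] = [];
  for await (const item of statelyClient.beginList(
    keyPath`/repo-${repoId}/runner-`,
    { limit: 100 },
  )) {
    if (statelyClient.isType(item, "Runner")) {
      lines.push(`• *${item.name}* - ${statusToString(item.status)}`);
    }
  }
  return [
    {
      type: "section",
      text: { type: "mrkdwn", text: lines.join("\n") || "No runners found." },
    },
  ];
}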
The full implementation of the Slack command handler can be found in the slack.ts file on GitHub.
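One piece the handler elides is the "Verify request signature..." step. Slack signs every slash-command request, and we check that signature before touching the database. Here’s a minimal sketch of that verification, assuming the signing secret is available to the function (for example from SSM alongside the other parameters):
import * as crypto from "crypto";
import type { APIGatewayProxyEvent } from "aws-lambda";

// Sketch: validate Slack's request signature per
// https://api.slack.com/authentication/verifying-requests-from-slack
function isValidSlackRequest(event: APIGatewayProxyEvent, signingSecret: string): boolean {
  const timestamp =
    event.headers["x-slack-request-timestamp"] ?? event.headers["X-Slack-Request-Timestamp"];
  const signature =
    event.headers["x-slack-signature"] ?? event.headers["X-Slack-Signature"];
  if (!timestamp || !signature) {
    return false;
  }
  // Reject requests older than five minutes to guard against replays.
  if (Math.abs(Date.now() / 1000 - Number(timestamp)) > 60 * 5) {
    return false;
  }
  const baseString = `v0:${timestamp}:${event.body ?? ""}`;
  const expected =
    "v0=" + crypto.createHmac("sha256", signingSecret).update(baseString).digest("hex");
  if (expected.length !== signature.length) {
    return false;
  }
  return crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}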
We used AWS CDK to define our infrastructure as code. This makes our deployment process repeatable and transparent. Our CDK stack includes: an IAM execution role with access to our SSM parameters, the monitoring Lambda function, an EventBridge rule that triggers it every five minutes, and the Slack interaction Lambda fronted by API Gateway.
Here’s a simplified version of our CDK stack:
export class GitHubRunnerMonitorStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // Lambda execution role
    const lambdaRole = new iam.Role(this, "GitHubRunnerMonitorRole", {
      assumedBy: new iam.ServicePrincipal("lambda.amazonaws.com"),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          "service-role/AWSLambdaBasicExecutionRole",
        ),
      ],
    });
    // Add permissions to access SSM parameters
    lambdaRole.addToPolicy(
      new iam.PolicyStatement({
        actions: ["ssm:GetParameter", "ssm:GetParameters"],
        resources: [
          `arn:aws:ssm:${this.region}:${this.account}:parameter/github-runner-monitor/*`,
        ],
      }),
    );
    // Create monitoring Lambda function
    const monitorFunction = new lambda.Function(
      this,
      "GitHubRunnerMonitorFunction",
      {
        runtime: lambda.Runtime.NODEJS_18_X,
        handler: "index.handler",
        code: lambda.Code.fromAsset(path.join(__dirname, "../dist")),
        timeout: cdk.Duration.minutes(5),
        memorySize: 512,
        role: lambdaRole,
        environment: {
          NODE_OPTIONS: "--enable-source-maps",
        },
        description:
          "Lambda function to monitor GitHub self-hosted runners and alert when unhealthy",
      },
    );
    // Create EventBridge rule to trigger Lambda every 5 minutes
    const rule = new events.Rule(this, "ScheduleRule", {
      schedule: events.Schedule.rate(cdk.Duration.minutes(5)),
      description: "Trigger GitHub runner monitoring every 5 minutes",
    });
    // Add Lambda as target for the rule
    rule.addTarget(
      new targets.LambdaFunction(monitorFunction, {
        retryAttempts: 2,
      }),
    );
    // Create Slack interaction Lambda and API Gateway...
  }
}

The full CDK stack can be found in the github-runner-monitor-stack.ts file on GitHub.
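The stack above also elides the Slack pieces behind the "Create Slack interaction Lambda and API Gateway..." comment. Here’s a minimal sketch of what that could look like (construct names and the handler file are assumptions; the real definitions are in the repo):
// Assumes: import * as apigateway from "aws-cdk-lib/aws-apigateway";
// Sketch: a second Lambda for the Slack slash commands, reusing the same execution role.
const slackFunction = new lambda.Function(this, "SlackCommandFunction", {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: "slack.handler",
  code: lambda.Code.fromAsset(path.join(__dirname, "../dist")),
  timeout: cdk.Duration.seconds(30),
  memorySize: 256,
  role: lambdaRole,
});

// Expose the function at an endpoint that Slack can POST slash commands to.
const api = new apigateway.LambdaRestApi(this, "SlackCommandApi", {
  handler: slackFunction,
  proxy: true,
});

// Output the invoke URL so it can be pasted into the Slack app configuration.
new cdk.CfnOutput(this, "SlackCommandUrl", { value: api.url });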
So there you have it: a simple monitoring system using StatelyDB and AWS Lambda. This project was a perfect match for StatelyDB’s strengths: nested Key Paths made our queries simple and efficient, TTLs took care of data retention for us, and Elastic Schema let the data model come together naturally.
If you’re interested in using this monitoring system for your own GitHub Action runners, the full source code is available at https://github.com/StatelyCloud/action-runner-monitor.
Want to learn more about StatelyDB? Check out our documentation or read more about our Elastic Schema on our blog.