High-Performance Slack Bot Automation for SRE Teams | Ugur Kaval

The 3 AM Context Switch

It is 3:14 AM. Your pager goes off because the checkout service latency just spiked to 12 seconds. You stumble to your laptop, open Slack, and find a chaotic mess. Three different engineers are typing in #general, someone else is DMing you asking for a database password, and the actual error logs are buried in a Datadog thread that nobody can find. By the time you actually identify the culprit—a misconfigured connection pool—twenty minutes have vanished.

In 2026, if your Slack strategy is just 'send a webhook when things break,' you are failing your team. We don't need more notifications; we need a command center. I've spent the last three years evolving our internal tooling from passive alerts to an agentic Slack-based orchestration layer. This isn't about 'ChatOps' buzzwords; it's about reducing Mean Time to Resolution (MTTR) by centralizing the execution of your runbooks where your team already lives.

Why Slack is Your Production UI

Context switching is the silent killer of engineering productivity. Every time an engineer leaves Slack to check a Grafana dashboard, search a Splunk index, or look up an on-call rotation in PagerDuty, they lose focus.

Modern Slack automation in 2026 has shifted from 'simple bots' to 'stateful orchestrators.' We now use Socket Mode as the standard for internal bots to bypass firewall headaches, and we rely on the community-maintained slack-go/slack library (Slack's official Bolt SDK ships for JavaScript, Python, and Java, not Go) to handle high-concurrency event processing. The goal is simple: the incident channel should be the single source of truth, containing not just the conversation, but the live state of the infrastructure and the audit trail of every command executed.

Building the Incident Command Engine

We don't use Python for our bots anymore. When an incident is blowing up, I want a compiled binary that handles thousands of concurrent socket events without breaking a sweat. Go 1.24 is our choice here.

Below is a trimmed-down snippet from a Slack bot that handles the initial 'Incident Declaration.' This isn't just a message; it kicks off a workflow. The snippet covers the first two steps: creating a dedicated channel and posting a 'Live Status' block. Inviting the on-call engineer, pinning the block, and updating it as the incident progresses build on the same client.

Code Example: The Incident Orchestrator in Go

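package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"strings"
	"time"

	"github.com/slack-go/slack"
	"github.com/slack-go/slack/socketmode"
)

func main() {
	api := slack.New(
		os.Getenv("SLACK_BOT_TOKEN"),
		slack.OptionAppLevelToken(os.Getenv("SLACK_APP_TOKEN")),
	)

	client := socketmode.New(api)

	// Consume Socket Mode events in the background while Run() keeps the
	// WebSocket connection alive.
	go func() {
		for evt := range client.Events {
			switch evt.Type {
			case socketmode.EventTypeSlashCommand:
				cmd, ok := evt.Data.(slack.SlashCommand)
				if !ok || evt.Request == nil {
					log.Printf("ignoring unexpected slash command payload: %T", evt.Data)
					continue
				}
				if cmd.Command == "/incident" {
					handleIncidentCommand(client, *evt.Request, cmd)
				} else {
					client.Ack(*evt.Request) // acknowledge commands we don't handle
				}
			}
		}
	}()

	if err := client.Run(); err != nil {
		log.Fatalf("socket mode client stopped: %v", err)
	}
}

// handleIncidentCommand creates the incident channel and posts the status UI.
// The Socket Mode request is passed in separately because the acknowledgment
// is tied to the request envelope, not to the slash command payload.
func handleIncidentCommand(client *socketmode.Client, req socketmode.Request, cmd slack.SlashCommand) {
	// 1. Create a dynamic incident channel name. Channel names must be
	// lowercase and cannot contain periods, so normalize the user name.
	user := strings.ReplaceAll(strings.ToLower(cmd.UserName), ".", "-")
	channelName := fmt.Sprintf("incident-%s-%s", user, time.Now().UTC().Format("2006-01-02"))

	// Recent slack-go releases take a params struct here; older releases
	// took (channelName, isPrivate) directly.
	channel, err := client.CreateConversationContext(context.Background(), slack.CreateConversationParams{
		ChannelName: channelName,
		IsPrivate:   false,
	})
	if err != nil {
		client.Ack(req, map[string]interface{}{
			"text": fmt.Sprintf("Failed to create channel: %v", err),
		})
		return
	}

	// 2. Post the 'War Room' UI using Block Kit
	blocks := []slack.Block{
		slack.NewHeaderBlock(slack.NewTextBlockObject("plain_text", "🚨 Active Incident: "+cmd.Text, false, false)),
		slack.NewSectionBlock(
			slack.NewTextBlockObject("mrkdwn", "*Commander:* <@"+cmd.UserID+">\n*Status:* Investigating", false, false),
			nil, nil,
		),
		slack.NewActionBlock("incident_actions",
			slack.NewButtonBlockElement("resolve", "resolve_id", slack.NewTextBlockObject("plain_text", "Resolve", false, false)),
			slack.NewButtonBlockElement("logs", "fetch_logs", slack.NewTextBlockObject("plain_text", "Fetch Logs", false, false)),
		),
	}

	if _, _, err := client.PostMessage(channel.ID, slack.MsgOptionBlocks(blocks...)); err != nil {
		log.Printf("failed to post incident status to %s: %v", channel.ID, err)
	}

	// 3. Acknowledge the slash command; the payload is shown to the caller.
	client.Ack(req, map[string]interface{}{"text": "Incident channel created: #" + channelName})
}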

Contextual Triage: Bringing Logs to the Thread

One of the biggest mistakes I see is bots that just link to an external logging tool. 'Here is a link to Kibana' is useless when you're on your phone or in the middle of a high-pressure triage.

Instead, we built a 'Log Sniffer' into our bot. When an engineer clicks the 'Fetch Logs' button in the Slack UI, the bot queries our telemetry backend (OpenTelemetry/Honeycomb) for the last 5 minutes of errors related to that service and posts the top 3 stack traces directly into the thread as a snippet.

Code Example: Fetching and Formatting Logs

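// postLogSummary replies in the incident thread with the latest errors.
// It belongs in the same package as the bot above (it uses the log package).
func postLogSummary(client *socketmode.Client, channelID string, threadTS string) {
	// In a real scenario, query your log provider API here
	logLines := []string{
		"[ERROR] 500 - Internal Server Error: connection pool exhausted",
		"[ERROR] 503 - Service Unavailable: upstream timeout",
	}

	// Put one error per line in the section text.
	logSnippet := ""
	for _, line := range logLines {
		logSnippet += line + "\n"
	}

	_, _, err := client.PostMessage(channelID,
		slack.MsgOptionText("Latest service errors:", false), // fallback text for notifications
		slack.MsgOptionTS(threadTS),                          // reply inside the thread
		slack.MsgOptionBlocks(
			slack.NewSectionBlock(slack.NewTextBlockObject("mrkdwn", logSnippet, false, false), nil, nil),
		),
	)
	if err != nil {
		log.Printf("failed to post log summary to %s: %v", channelID, err)
	}
}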

The Agentic Twist: Automated Post-Mortems

By 2026, the 'Post-Mortem' (or Incident Review) has become semi-automated. One of the most successful automations we implemented is the !close command. When run, the bot does the following:

Pulls the entire conversation history of the incident channel.
Sends the text to a privately hosted LLM (running locally via Ollama or behind an internal Bedrock endpoint) to summarize the timeline.
Identifies every command that was run and every graph that was shared.
Generates a draft markdown document in our internal Wiki (Notion/Confluence).

This saves the incident commander about 2 hours of manual transcription work. The bot doesn't write the whole report—it just builds the skeleton so the human can focus on the 'Why' instead of the 'What happened when.'

Gotchas and Hard-Learned Lessons

The Notification Loop: If your bot posts a message to a channel that triggers an alert that triggers the bot... you will hit Slack's posting rate limit (roughly 1 message per second per channel) in about 4 seconds. Have the bot ignore its own messages, and always implement a 'deduplication' layer in your event handler using Redis.
The 'Admin' Trap: Never give your bot admin scopes. I once saw a buggy bot script accidentally archive 400 channels because of a regex that matched too much. Use the principle of least privilege: channels:manage (the current name for the legacy channels:write scope), chat:write, and commands are usually enough.
Payload Limits: Slack's Block Kit has a 3000-character limit for the text of a section block. If you try to dump a full Java stack trace into a block, the API will reject the message with an invalid_blocks error. Truncate your logs aggressively and provide a link to the full trace.
Handling Retries: Slack expects every event to be acknowledged within 3 seconds: a 200 OK over HTTP, or an Ack of the envelope in Socket Mode. If your log-fetching logic takes 5 seconds before it acknowledges, Slack will send the event again, and you'll end up posting the same logs several times. Use a worker queue (like RabbitMQ or a simple Go channel) to handle the heavy lifting asynchronously while immediately acknowledging the Slack event.

Takeaway

Automation is not about replacing engineers; it's about removing the friction of their environment.

Your action item for today: Implement a /health slash command that doesn't just return 'OK', but queries your primary database and cache, returning a Slack Block with the current latency and connection count. Put the data where the eyes are, and you'll see your MTTR drop overnight.
