How my AI Agent views and maintains "our" homelab
from variety4me@lemmy.zip to selfhosted@lemmy.world on 27 Jun 03:40
https://lemmy.zip/post/66881247

The article below is written by the Agent, the backend for the agent is:

If you have questions or want me to elaborate, please ask.

I do not use this setup for anything other than what my Agent describes below. Everything from this point onwards is my Agent’s view.

---------------------------- xx ------------------------- xx ------------------------

How I Run My Homelab: An AI Agent’s Perspective

The Architecture

My homelab consists of four servers connected via Tailscale:

Server    Location                Purpose
nasbox    Home (192.168.150.2)    Primary hub: Caddy reverse proxy, DNS, monitoring, Signal API, Git server
mediabox  Home (192.168.150.3)    Media services: Jellyfin, Immich, Arr stack, downloaders
llmbox    Home (192.168.150.4)    AI inference: ik-llama.cpp backend
dms       Remote (192.168.15.30)  Remote services: Jellyfin, Immich, Arr stack, accessed via Tailscale

The router (GL-MT3000) is the Tailscale gateway — if it’s down, dms is unreachable, so it’s always checked first.
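
The gateway-first ordering can be sketched roughly as below. This is only an illustration under assumptions, not the agent's actual runbook: `check_hosts`, the `probe` callable, and the `behind_gateway` mapping are hypothetical names introduced for the example. The only thing taken from the text is the rule that dms is unreachable when the router is down.

```python
from typing import Callable, Dict, List


def check_hosts(
    hosts: List[str],
    behind_gateway: Dict[str, str],
    probe: Callable[[str], bool],
) -> Dict[str, str]:
    """Probe gateways first, then hosts.

    A host that sits behind a gateway which is down is reported as
    unreachable without being probed, so one router failure does not
    look like several independent outages.
    """
    results: Dict[str, str] = {}

    # Step 1: probe every gateway exactly once.
    for gateway in sorted(set(behind_gateway.values())):
        results[gateway] = "up" if probe(gateway) else "down"

    # Step 2: probe the remaining hosts, skipping those behind a dead gateway.
    for host in hosts:
        if host in results:
            continue
        gateway = behind_gateway.get(host)
        if gateway is not None and results[gateway] == "down":
            results[host] = "unreachable (gateway down)"
        else:
            results[host] = "up" if probe(host) else "down"
    return results
```

In a real setup `probe` might wrap a ping or an HTTP health check; it is passed in as a parameter here so the ordering logic can be exercised without touching the network.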

The Workspace

At /mnt/data/pi-space/ lives the workspace where the Pi agent operates. It’s a git repo that holds everything the agent needs:

                                                                                                                                                                            
pi-space/
├── homelab-index.yml          # Topology — servers, IPs, services
├── AGENTS.md                  # Agent instructions — operational modes, rules
├── .pi/
│   ├── extensions/
│   │   └── uptime-monitor.ts  # Alert polling extension
│   ├── skills/
│   │   ├── daily-maintenance/ # Health check runbook
│   │   ├── os-update/         # OS package updates
│   │   ├── nasbox-docker-update/
│   │   ├── mediabox-docker-update/
│   │   ├── dms-docker-update/
│   │   ├── ik-llama-upgrade/  # LLM backend upgrade
│   │   ├── backup/            # Backup + disk health
│   │   ├── signal-notify/     # Signal group messaging
│   │   ├── git-push/          # Push workspace changes
│   │   └── uptime-kuma-webhook/  # Webhook receiver
│   └── alerts/
│       ├── current-alert.txt  # Active alert (overwritten each event)
│       └── alert-2026-06-14-*.txt  # Timestamped history
├── incidents/
│   └── 2026-06-22-seerr-dms.md  # Incident reports
└── maintenance-log/
    ├── incident-2026-06-14.md   # Incident reports
    └── incident-2026-06-21.md
                                                                                                                                                                            

Two Modes: Preventive and Incident

The agent operates in two modes, switching between them based on alerts:

Routine Mode (Preventive)

When no alerts are active, the agent runs the daily-maintenance skill, which checks every server:

The report is saved to /mnt/myfiles/notes/notes/ranjan/PI-Notes/daily/YYYY-MM-DD.md and kept for 7 days.

Incident Mode (Breakdown)

When an alert arrives, the agent immediately pauses routine tasks and follows a five-step process:

  1. Acknowledge — reads the alert from current-alert.txt
  2. Diagnose — cross-references the affected service with homelab-index.yml to map dependencies
  3. Remediate — applies the safest fix (restart container, clear cache, revert config)
  4. Verify — confirms the service is healthy and the alert clears in Uptime Kuma
  5. Log — appends an incident summary to the maintenance log

The Alert System

This is the most interesting part of the setup. It’s a bidirectional alert system — the agent sees both DOWN and UP events:

Flow

  1. Uptime Kuma detects a monitor state change and sends a webhook to the Python server on nasbox:8080
  2. Webhook server (uptime-kuma-webhook.py) parses the JSON payload, formats it, and writes it to current-alert.txt
  3. Uptime-monitor extension (uptime-monitor.ts) polls the file every 10 seconds, compares the MD5 hash, and when it changes, injects the alert into the agent
    conversation via pi.sendUserMessage() with deliverAs: "steer"
  4. Agent analyzes the alert — is this a new incident or a recovery?
  5. Agent resolves the issue and calls clear_alerts to clear the file
  6. Agent sends a Signal notification to the “1 gamer 2 casuals” group confirming resolution
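
The hash-comparison step (step 3) can be sketched as follows. The real extension is written in TypeScript and hands the alert to the agent through pi.sendUserMessage(); this Python version only illustrates the change-detection logic, and `poll_once` is a hypothetical name. It assumes an empty or missing file means "no active alert", which seems consistent with clear_alerts clearing the file but is not confirmed above.

```python
import hashlib
from pathlib import Path
from typing import Optional, Tuple


def poll_once(path: Path, last_hash: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Check the alert file once.

    Returns (current_hash, alert_text). alert_text is None when the file
    is missing, empty, or unchanged since the previous poll.
    """
    try:
        # Read once, so the hash and the text come from the same bytes.
        data = path.read_bytes()
    except FileNotFoundError:
        return None, None

    if not data.strip():
        # Assumption: a cleared file means there is no active alert.
        # Resetting the hash lets an identical alert fire again later.
        return None, None

    digest = hashlib.md5(data).hexdigest()
    if digest == last_hash:
        return digest, None
    return digest, data.decode("utf-8", errors="replace")
```

A caller would invoke `poll_once` every 10 seconds, carry the returned hash into the next call, and forward any non-None text to the agent.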

Why Both UP and DOWN?

On June 14 alone, there were 8 DOWN events and 5 UP events. The current-alert.txt is overwritten each time (not appended), so the agent must determine
whether each event is a new incident or a recovery. This is crucial — a DOWN alert means investigate, but an UP alert means verify the recovery.

The agent also suppresses group monitor alerts from Uptime Kuma, since child services are tracked individually.
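
The DOWN/UP decision and the group-monitor suppression might look roughly like this. It is a sketch under assumptions: the field names `heartbeat.status` and `monitor.type`, and the status codes 0 (down) and 1 (up), are assumptions about the Uptime Kuma webhook payload that should be verified against a real payload. `classify_alert` and its return labels are hypothetical.

```python
from typing import Any, Dict

# Assumed Uptime Kuma heartbeat status codes; verify against a real payload.
STATUS_DOWN = 0
STATUS_UP = 1


def classify_alert(payload: Dict[str, Any]) -> str:
    """Map a webhook payload to 'investigate', 'verify_recovery', or 'ignore'."""
    monitor = payload.get("monitor") or {}
    heartbeat = payload.get("heartbeat") or {}

    # Group monitors are suppressed because child services are tracked individually.
    if monitor.get("type") == "group":
        return "ignore"

    status = heartbeat.get("status")
    if status == STATUS_DOWN:
        return "investigate"      # new incident: diagnose and remediate
    if status == STATUS_UP:
        return "verify_recovery"  # recovery: confirm the service is healthy
    return "ignore"               # anything else (e.g. a test message)
```

Doing this classification in plain code before the alert reaches the model keeps the decision deterministic, leaving the agent only the diagnosis and verification work.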

Maintenance Skills

The workspace has a collection of skills — reusable procedures the agent can execute:

Incident Response in Action

The system has handled several incidents:

The agent logs each incident in incidents/ or maintenance-log/ with date, service, cause, action, and result.

Safety Constraints

The agent operates under strict rules:

Why This Works

The key insight is that the workspace is a single source of truth — topology, procedures, and history are all in one place. The agent doesn’t need to guess; it
consults homelab-index.yml for the map, AGENTS.md for the rules, and the skills for the procedures. The alert system provides real-time awareness, and the maintenance
logs provide historical context.

It’s a system where an AI agent can reliably maintain a complex infrastructure — not because it’s magical, but because the workspace is designed to give it the
information and procedures it needs, and the constraints keep it from doing anything dangerous.

#selfhosted


Shimitar@downonthestreet.eu on 27 Jun 03:51

I hope you don’t get downvoted to oblivion. I found the read interesting.

While I still don’t see the advantage in using agents for these tasks, because I have fun doing them myself, I have great interest in seeing where all this leads.

variety4me@lemmy.zip on 27 Jun 03:58

Fair enough given the AI hate, but this is a local LLM setup, not for distribution. It’s a self-contained way I use to maintain my homelab. Some may find it useful, some may not.

Just as you have fun doing this yourself, I have fun making/configuring a local agent to do it for me.

CarlSagansMeatplanet@lemmy.world on 27 Jun 04:24

If nothing else this seems like a really fun experiment!

variety4me@lemmy.zip on 27 Jun 04:27

Thanks, it has been fun for sure, as much fun as I had setting up my homelab 5 or so years ago when there were no LLMs.

Shimitar@downonthestreet.eu on 27 Jun 04:31

I use AI, I don’t hate it at all. It’s a tool. And as such it needs to be used properly and not abused. Like a knife or a camera or a drone.

I am looking at agents with interest, and I believe it’s still too early to try them myself, but I take an interest in any early adopters and experiments …

variety4me@lemmy.zip on 27 Jun 04:35

Use it carefully with proper guard rails and you will be fine. OpenClaw (most horrible piece of shit software) kind of ruined the reputation of sensible agents.

I am just trying to explore and experiment. I configured my homelab on my own and can very easily take the agent down and go back to manual monitoring and maintenance, so it’s not like I am tied to this setup and can’t live without it!

cecilkorik@lemmy.ca on 27 Jun 08:07

Yes, I don’t hate the technology, I hate all the evil companies abusing this technology for profit, fascism and death, which is very close to all of them.

But they are not the only ones working on the technology, and even though they have stolen many people’s entire lives’ worth of work for making these models and not only didn’t compensate them, but in many cases replaced them and terminated their employment, in many cases we are stealing it right the fuck back from them and making it open to everybody, because fuck them. It doesn’t belong to them in the first place, it belongs to all of us. We can’t put the genie back in the lamp or the toothpaste back in the tube, but we can make sure we are keeping our own data for ourselves once we take these monstrous, bloated oligarchs down. We will not let the libraries of Alexandria burn down again, and we won’t let them have the only copies.

I won’t pretend I don’t have various issues with open weight models using this technology, but they’re more like “I don’t like systemd’s philosophy or developers” level of issues, not “I think they will destroy democracy, civilization and possibly all of humanity” level of issues.

With the evil companies, I absolutely DO have “I think they will destroy democracy, civilization and possibly all of humanity” level of issues.

one_old_coder@piefed.social on 27 Jun 03:52

The comment below is written by my agent:

You’re absolutely right, that’s very interesting /s

crash_thepose@lemmy.ml on 27 Jun 04:22

When you have a local llm, is it still relying on the energy resources of open ai or the like? Sorry for the dumb question

variety4me@lemmy.zip on 27 Jun 04:25

The local LLM runs on the homelab. Just like Immich runs on your homelab and doesn’t talk to Google Photos in any way, it’s the same for my model: self-contained, in-house, with no data leaving my network.

crash_thepose@lemmy.ml on 27 Jun 04:27

Meaning you download the entire large language model?

variety4me@lemmy.zip on 27 Jun 04:29

Yes, the download link for the model is in the original post

crash_thepose@lemmy.ml on 27 Jun 04:31

So it does use the resources of an LLM company like Google or open ai?

variety4me@lemmy.zip on 27 Jun 04:33

The model is an open-weight model; Google or OpenAI did not create it. I use my own electricity at home, so no, it doesn’t.

crash_thepose@lemmy.ml on 27 Jun 04:36

What’s an open weight model?

variety4me@lemmy.zip on 27 Jun 04:38

An Open-Weight Model is an AI model whose core components are publicly released, allowing anyone to download it. This lets users run the model on their own computers, study how it works, and even modify it for their own specific needs.

crash_thepose@lemmy.ml on 27 Jun 04:40

Is that different from an open source model?

variety4me@lemmy.zip on 27 Jun 04:50

Open-source models provide complete access to the entire model architecture, training methodology, and weights. This comprehensive access includes the model code, architecture design, training scripts, and parameter weights under licenses like MIT or Apache.

Open-weight models represent a more limited approach to model sharing. These models release only the trained parameter weights while restricting access to training methodologies and code. They are also released under MIT or Apache licenses.

frongt@lemmy.zip on 27 Jun 05:30

Yes. It’s a modified Qwen model from Alibaba in China, their local equivalent of Amazon.

SatyrSack@quokk.au on 27 Jun 05:51

Originally, training the model used the energy resources of the original corporation or whoever trained it. But when you download that model and start running it on your own hardware, you are using your own energy.

Think of it kind of like some software like Jellyfin. When the developers write the software, they do so using their own electricity. But when you download Jellyfin and actually run the software on your own hardware, you are now only using your electricity, not the developer’s electricity at all.

crash_thepose@lemmy.ml on 27 Jun 13:23

Thank you for explaining!

atzanteol@sh.itjust.works on 27 Jun 08:40

Does… It matter?

Energy use is energy use, no?

The energy required to do inference (i.e. asking it questions and the like) is no worse than doing some gaming for a short period of time.

That said, it’s probably less efficient to run locally, since Anthropic and OpenAI have been getting more efficient data center hardware from Nvidia compared to consumer desktop GPUs.

harmbugler@piefed.social on 27 Jun 10:33

Consider also the energy and water they use to cool the datacentre equipment, which doesn’t usually happen at home.

crash_thepose@lemmy.ml on 27 Jun 13:24

Well, questions of efficiency seem like it does matter, no?

call_me_xale@lemmy.zip on 27 Jun 04:27

ai; dr

If you couldn’t be bothered to write this up yourself, why should I spend my time reading it?

variety4me@lemmy.zip on 27 Jun 04:30

Fun fact: you don’t have to. I expected to be voted down on this post, but I have had fun setting it up and wanted to share.

ilmagico@lemmy.world on 27 Jun 05:40

Ignore the downvotes, this is fully selfhosted (not cloud LLM) and you set it up yourself, the agent is a tool you used, I think it’s pretty cool! I like the idea of selfhosted LLM where nothing phones home, and a human is always in control at the end.

variety4me@lemmy.zip on 27 Jun 06:01

Thanks! It’s a fun experiment!!

Azzu@leminal.space on 27 Jun 08:21

The problem is not doing it, the problem is feeding an AI generated text here.

Ooops@feddit.org on 27 Jun 09:53

the agent is a tool you used

My hammer is also a tool. But if I start using it (and talking about using it) to wash my clothes and do my dishes, I would really hope to get called out for being stupid.

puppinstuff@lemmy.ca on 27 Jun 13:56

And here I’ve been trying to hammer out this mustard stain for hours!

halm@leminal.space on 27 Jun 16:32

Did you expect to be blocked for posting a wall of tedious slop?

melmi@lemmy.blahaj.zone on 27 Jun 05:38

Having an autonomous LLM agent in a homelab like this seems like just a matter of time before things go wrong, but it seems like an interesting experiment.

Have you had any issues with the agent behaving unexpectedly?

variety4me@lemmy.zip on 27 Jun 05:58

My sudoers file restricts what the LLM can actually do. I also have robust backups and can spin up any of my servers really quickly, so I am not that worried. Just like you deal with human errors, you can deal with agent errors.

So far this has been running for a month, with no scares or unexpected behaviour other than looping on a task sometimes.

midribbon_action@lemmy.blahaj.zone on 27 Jun 11:49

Sorry I know you probably don’t want another tip from me, but the post did include the agent directly using the docker daemon, which runs as root typically. Because you didn’t mention running rootless docker or podman, your sudoers file probably allows the agent full access to root instead of preventing it.

blarg_dunsen@sh.itjust.works on 27 Jun 05:51

How are you running a 34B model without a GPU? You must be getting one token an hour! How much RAM do you have in the LLM box?

variety4me@lemmy.zip on 27 Jun 05:59

It’s an MoE model (en.wikipedia.org/wiki/Mixture_of_experts), so only 3B parameters are actually active.

I have 32GB RAM

cecilkorik@lemmy.ca on 27 Jun 08:20

Not what OP is using obviously, but AMD X3D CPUs and Mac systems can be quite competitive for AI if you’re lacking VRAM. Not all CPUs struggle with inference, and some GPUs aren’t so hot at it either. GPUs are generally better, especially the really high-end ones, but once you throw in low- and mid-range cards and high-end CPUs, things start to look somewhat muddier.

0x0f@piefed.social on 27 Jun 06:55

Thanks for sharing this, I have been looking for an AI setup without GPU, so this is right up my alley.

variety4me@lemmy.zip on 27 Jun 07:50

Welcome! If you have questions on ik build parameters for optimizations, feel free to ask. I will try my best to answer.

magnue@lemmy.world on 27 Jun 07:21

“single source of truth” gives me PTSD from the last wanker consultant that was hired at work to spew bullshit and fire people.

variety4me@lemmy.zip on 27 Jun 07:51

At 56, I was laid off from a Fortune 500 company, so I hear you. Today I am without a job, just trying to learn and keep up every day.

Edit: Spellings

midribbon_action@lemmy.blahaj.zone on 27 Jun 08:23

It seems the main use case is restarting docker containers, why not use the built-in healthcheck feature of docker? The automatic backup and upgrade are also confusing to me, operating systems come with that built in. I just don’t quite understand the point of replacing existing deterministic systems with a natural language interface, I would have trouble believing the logs at face value.

Edit: also your handling of current-alert.txt is a perfect example of a race condition, another potential source of indeterminism. An alert could be missed if the file is overwritten before being handled.

variety4me@lemmy.zip on 27 Jun 09:01

It’s a homelab, not a commercial production environment. I agree with you, but I am not too worried about it.

midribbon_action@lemmy.blahaj.zone on 27 Jun 09:10

I guess I’m confused… The built in functionality seems like the easier way to accomplish the same, you seemed to have spent a large amount of time and are proud of this project, and wanted to share it, but also acknowledge that it’s worse than what already exists, and uses more resources idly. Why should anybody else do this?

DeadDigger@lemmy.zip on 27 Jun 09:15

I mean it is an interesting test for ai capabilities and limitations. You have an existing low tech deterministic use case and setup and can compare that with the ai setup

midribbon_action@lemmy.blahaj.zone on 27 Jun 09:22

There is no comparison, I made the comparison myself. In all honesty I feel like they didn’t know about basic docker and linux concepts until my comment.

DeadDigger@lemmy.zip on 27 Jun 09:32

Well you asked why anybody else would do it and I answered on that

midribbon_action@lemmy.blahaj.zone on 27 Jun 09:53

Are you working at an ai startup or university? I asked op for their motivations, and comparisons to existing solutions seemed like the least of their concerns, maybe even unconsidered. But I guess it could be a fair answer from your pov if you are trying to test and improve llms. I just hope you’re getting paid for the research.

variety4me@lemmy.zip on 27 Jun 09:42

I knew dockhand/portainer would do docker updates better, I knew auto updates can be set up via cron for OS updates, etc.

I am neither a sysadmin nor a programmer. I just run a hobby homelab and like to tinker and learn. It’s a good enough use case for me to explore the possibilities.

midribbon_action@lemmy.blahaj.zone on 27 Jun 10:03

I knew dockhand/portainer would do docker updates better, I knew auto updates can be set up via cron for OS updates, etc.

Well, that could’ve been mentioned? Why did I have to bring that up? Nothing about the post is self reflective, it is entirely bragging. I get you didn’t write it, and that’s just how llms sound, but you did decide to post it.

variety4me@lemmy.zip on 27 Jun 09:38

So don’t do it!

It’s a learning experience. How can a coding agent be used in a non-coding way? Is it better or worse? I guess I have my answers now.

This may not be the ideal use case, but it surely shows that these agents can be used for other things.

midribbon_action@lemmy.blahaj.zone on 27 Jun 10:06

What answer did you arrive at? Are you planning on ending the test?

variety4me@lemmy.zip on 27 Jun 10:10

It has produced great documentation for my homelab. That’s what it did best, and I could not have gotten that without having it carry out the tasks it was asked to do.

[ranjan@llmbox Homelab Wiki]$ tree -L 2
.
├── Clients
│   ├── CachyOS-Laptop.md
│   └── README.md
├── Infrastructure
│   ├── README.md
│   ├── Router.md
│   └── Switch.md
├── README.md
├── Servers
│   ├── dms.md
│   ├── llmbox.md
│   ├── mediabox.md
│   ├── nasbox.md
│   ├── README.md
│   └── Router.md
└── Services
    ├── AdGuard-Home.md
    ├── BentoPDF.md
    ├── Beszel.md
    ├── Caddy.md
    ├── Collabora.md
    ├── Degoog.md
    ├── Dockhand.md
    ├── Flaresolverr.md
    ├── FMD.md
    ├── Food.md
    ├── Forgejo.md
    ├── Glance.md
    ├── Gonic.md
    ├── Homepage.md
    ├── Immich.md
    ├── Invidious.md
    ├── IT-Tools.md
    ├── Jellyfin.md
    ├── Jotty.md
    ├── Linkding.md
    ├── Llama-Swap.md
    ├── Metube.md
    ├── NFS.md
    ├── Ntfy.md
    ├── Omnitools.md
    ├── OpenCloud.md
    ├── Prowlarr.md
    ├── qBittorrent.md
    ├── Rackpeek.md
    ├── Radarr.md
    ├── Radicale.md
    ├── README.md
    ├── Redlib.md
    ├── SABnzbd.md
    ├── SearXNG.md
    ├── Seerr.md
    ├── Signal-DMS.md
    ├── Sonarr.md
    ├── Speedtest.md
    ├── Tailscale.md
    ├── Termix.md
    ├── Transmission.md
    ├── Uptime-Kuma.md
    └── Vaultwarden.md

midribbon_action@lemmy.blahaj.zone on 27 Jun 10:39

Wow amazing, an llm that can generate text!

I’m still curious though if you are going to change your approach after this test.

[deleted] on 27 Jun 09:32
.
midribbon_action@lemmy.blahaj.zone on 27 Jun 10:51

I don’t think you understand what an interface is, I don’t think you understood my comment, and my suspicion is you probably can’t or won’t even if I try to help clarify.

tjoa@feddit.org on 27 Jun 14:17

Oh yea I really didn’t see the natural language part and mistook your comment for actual architectural critique. Thanks for taking the time to write that comment

midribbon_action@lemmy.blahaj.zone on 27 Jun 14:50

I appreciate that, I spent time trying to write as clearly as possible, and it hurt to have someone, either through negligence or malice, completely misunderstand me. I still don’t see how the natural language part relates to creating systems without interfaces. An interfaceless system is nonsense, it doesn’t exist, it just betrays a lack of understanding. Even black holes have interfaces. The log files are an interface. The other interface you mentioned was actually a system to aggregate log files, and I just have to point out that systems are different from interfaces. The interface in that case would be the resulting directory of files, or maybe it could be a streaming interface directly into an llm. But anyways the thing that really bugged me was I was asking about the use of indeterminate systems to replace solved deterministic problems, and your response was something along the lines of ‘yeah I agree, add more llm!’ and I would be upset if anybody passing through thought that you and I agreed on that.

I’m sorry if my comment came off as rude, I’m having a bad hair day.

Fedegenerate@fedinsfw.app on 27 Jun 09:45

I’ve been dreaming of local AI for a hot minute. Sadly corporate AI has priced me out of personal computing, and current hardware isn’t up to it.

Maybe my n100 could, I don’t really care how slow it’s generating the tokens if it’s just resetting containers and logging errors.

Does your setup handle automagic updates, how do you handle prompt injection if so? Or, just error correction/logging?

variety4me@lemmy.zip on 27 Jun 09:49

Look at my setup…

  1. The Intel Xeon E-2224G was a server/workstation processor with 4 cores, launched in May 2019
  2. DDR4 32 GB non branded sticks

It’s ancient, cheap hardware. What’s stopping you?

Does your setup handle automagic updates, how do you handle prompt injection if so? Or, just error correction/logging?

Figure it out yourself with your agent. What suits your use case? What would you like the agent to do? How much of a risk can you take with your agent? It would vary depending on so many factors.

Fedegenerate@fedinsfw.app on 27 Jun 10:06

Sum total of my hardware:

Ugreen dxp 4800, the Pentium one. 32gb ram. My main box: jellyfin, arrs, immich, pihole, nginx, etc… It can’t go down. I don’t think I want an LLM here.

Beelink n100 16gb ram. Local back up, redundant pihole, immich machine learning… Generally under utilised, I’d like to move some services around, does proxmox have an auto balancer?!

Spare no name n100. 8, or 16gb, I can’t remember. Abandoned box, It’s what I would put the LLM on, I did have it reserved for a remote back up.

I think it would need a ram upgrade, see corporate AI pricing me out of personal computing. Currently Amazon has a Crucial 32gb ddr5 sodimm module for £280. Which is too high a price for what I’d use it for.

Oh, and I have an abandoned gaming rig with a gtx970, and some rPi0/3s

I’ve put Ollama on an n100 before. It obviously ate all of that box and made everything on it chug, and it was too slow for human use. But if it’s just generating logs, and resetting containers then I wouldn’t mind how slow it is.

irmadlad@lemmy.world on 27 Jun 14:42

Forgive my lack of understanding, but basically you have set up an automation system that starts/stops/upgrades/updates docker containers, and system management type of tasks? Do you pipe all this data to some type of monitoring dashboard…maybe something like Grafana? It seems like there would be a lot of data points that could/should be monitored. Do you get text/email alerts that confirm all is copacetic or not?

It sounds spectacular. Maybe a little too complicated for me to wrap my old head around all at once. One of these days, hopefully, I’m going to get AI into the lab as a useful tool and not as just an oddity that takes forever to compute.

Rock on with yo’ bad self bro! Thanks for sharing.

variety4me@lemmy.zip on 27 Jun 15:07

I have not yet tried that, but that’s the next step I should take.