Forgejo fills up hard drive with repo-archives
from jeena@piefed.jeena.net to selfhosted@lemmy.world on 08 Sep 01:17
https://piefed.jeena.net/post/235083

Ugh, apparently yesterday a bot visited my Forgejo instance and queried everything, which caused Forgejo to create repo archives for everything. The git data on the instance is only 2.1 GB, but the repo archives grew to 120 GB. I really didn't expect such a spike.

That meant the whole hard drive filled up, and the server with all its services and websites went down while I was sleeping.

Luckily it seems that just deleting that directory fixes the problem temporarily. I also disabled downloading archives from the UI, but I'm not sure that will prevent bots from generating those archives again. I also can't just make the directory read-only, because Forgejo uses it for other things like mirroring too.

For small instances like mine those archives are quite a headache.

#selfhosted

harsh3466@lemmy.ml on 08 Sep 01:36

Not saying this is an option for you, only that I kept my forgejo instance private to avoid dealing with this AI crawler bullshit. I hope you find a good solution.

jeena@piefed.jeena.net on 08 Sep 02:16

Yeah I understand, but the whole point of me hosting my instance was to make my code public.

harsh3466@lemmy.ml on 08 Sep 02:26

And I totally understand that. These AI crawlers really suck.

possiblylinux127@lemmy.zip on 08 Sep 04:29

Have you looked at Codeberg?

jeena@piefed.jeena.net on 08 Sep 05:09

Codeberg is an instance of Forgejo. I run my own instance because I don't want to be dependent on others.

omegabyte@piefed.zip on 10 Sep 03:44

I appreciate that you make your stuff public. I can't find the specific repos right now but I know I've referenced your code for various fediverse things I've dabbled in over the last year or so.

tired_n_bored@lemmy.world on 10 Sep 10:06

I was just about to install Gitea. Any substantial differences between the two?

harsh3466@lemmy.ml on 12 Sep 12:52

I don’t know the specifics, but Forgejo is a Gitea fork. There was/is some controversy around Gitea's governance and a movement towards prioritizing closed-source paid/private versions of Gitea.

Again, I don’t know the details, just very broad strokes. I chose Forgejo because it’s under active FOSS development, and I didn’t want to go with Gitea and then have to abandon it later for whatever reason might develop.

cmnybo@discuss.tchncs.de on 08 Sep 02:03

Are you using anything to defend against bots?

jeena@piefed.jeena.net on 08 Sep 02:18

I have nothing against bots per se; they help spread the word about my open source code, which I want to share with others.

It's just unfortunate that Forgejo fills up the hard drive to such an extent and doesn't quite let you disable this archive feature.

solrize@lemmy.ml on 08 Sep 04:01

Are you saying if someone (such as a scraper) tries to download a snapshot, forgejo makes a disk file containing the snapshot, sends it, and keeps it around forever? That sounds crazy to me and I’d open a bug or try to fix it.

jeena@piefed.jeena.net on 08 Sep 05:04

It makes a zip file and a tarball, and keeps them cached for other people to download in the future.
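
There does seem to be an age-based cleanup cron for those cached archives, at least if Forgejo kept Gitea's setting. Something like this in app.ini (not a size cap, and I haven't verified the key names for current Forgejo):

; cleans up cached repo archives older than OLDER_THAN
[cron.archive_cleanup]
ENABLED = true
SCHEDULE = @midnight
OLDER_THAN = 24h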

solrize@lemmy.ml on 08 Sep 05:36

Ok so it’s just a matter of limiting the cache size.

jeena@piefed.jeena.net on 08 Sep 11:26

There is no setting like that, at least I can't find it.

communism@lemmy.ml on 08 Sep 12:19

I think you should open a Forgejo issue requesting a cache size limit option. It seems like quite a big problem that bots can fill up your hard drive like this unless you set a limit on all data used by Forgejo (and for a single-user instance, you probably only want to limit the archive size, or the size of any data the public can create, not the size of your own repos).

jeena@piefed.jeena.net on 08 Sep 13:09

Ok, there was one issue already and I added my comment to it: https://codeberg.org/forgejo/forgejo/issues/7011#issuecomment-7022288

frongt@lemmy.zip on 08 Sep 02:20

Does it not require an account for that? If it doesn't, I would open a feature request; otherwise it amounts to a denial-of-service attack.

jeena@piefed.jeena.net on 08 Sep 02:28

It does not, because that feature is usually used by scripts to download a specific release archive, and other git hosting solutions do the same.
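
Typical use is something like this in a build script, with no login involved (URL illustrative):

# fetch a release snapshot straight from the archive endpoint
curl -LO https://git.example.com/someuser/someproject/archive/v1.0.tar.gz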

hddsx@lemmy.ca on 08 Sep 04:17

I used Cloudflare's CAPTCHA equivalent and my bot traffic dropped to zero.

jeena@piefed.jeena.net on 08 Sep 05:06

But then how do people searching for code like yours find your open source code, if not through a search engine that uses an indexing bot?

SteveTech@programming.dev on 08 Sep 13:01

Cloudflare usually blocks ‘unknown’ bots, which are basically bots that aren’t search crawlers. Also, I’ve got Cloudflare set up to challenge requests for .zip, .tar.gz, or .bundle files, so it doesn’t affect anyone unless they download from their browser.
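
Roughly, the custom rule expression is something like this (from memory, so double-check the function and field names against the Cloudflare docs), with the action set to Managed Challenge:

(ends_with(http.request.uri.path, ".zip"))
or (ends_with(http.request.uri.path, ".tar.gz"))
or (ends_with(http.request.uri.path, ".bundle"))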

There’s also probably a way to configure something similar in Anubis, if you don’t like a middleman snooping your requests.

possiblylinux127@lemmy.zip on 08 Sep 04:28

You should limit the amount of storage available to a single service.

Also, set up Anubis or restrict access

jeena@piefed.jeena.net on 08 Sep 05:07

Yeah, I really need to figure out how to do quotas per service.

foster@lemmy.hangdaan.com on 08 Sep 12:57

If you have a Linux server, you can try partitioning your drive using LVM. You can prevent services from consuming all disk space by giving each one its own logical volume.
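
Rough sketch, assuming an existing volume group called vg0 and Forgejo data under /var/lib/forgejo (adjust names to your setup):

# carve out a dedicated 20 GB logical volume for Forgejo
sudo lvcreate -L 20G -n forgejo vg0
sudo mkfs.ext4 /dev/vg0/forgejo
sudo mount /dev/vg0/forgejo /var/lib/forgejo
# plus an /etc/fstab entry so it survives reboots:
# /dev/vg0/forgejo  /var/lib/forgejo  ext4  defaults  0 2

That way, if the archives fill the volume, only Forgejo runs out of space instead of the whole server.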

jeena@piefed.jeena.net on 08 Sep 13:10

I already have LVM, but I was only using it to combine drives. It's not a bad idea though; if I can't do it with Docker, at least that would be an alternative solution.

fireshell@kbin.earth on 08 Sep 04:28

Anubis is usually installed in such a case.

jeena@piefed.jeena.net on 08 Sep 05:08

I need to look into it, thanks!

Korbs@lemmy.sudovanilla.org on 08 Sep 05:25

Yeah, I now put protection in front of mine, after noticing bots were scanning code and grabbing emails. Using Anubis for now, still looking at other alternatives.

Black616Angel@discuss.tchncs.de on 08 Sep 06:06

I’ve searched the docs a bit and found this setting: forgejo.org/docs/latest/…/config-cheat-sheet/#quo…

It seems to partially cover your case; I don’t see artifacts in there, but you could limit all of Forgejo to something like 5 GB and probably be good.
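
If I'm reading the cheat sheet right, it would be something along these lines in app.ini (treat the key names as a sketch, I haven't tested it):

; enable the quota feature and cap the default quota group
[quota]
ENABLED = true

[quota.default]
TOTAL = 5G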

jeena@piefed.jeena.net on 08 Sep 12:54

Hm, I'm afraid none of them really covers the repo-archives case, and therefore I'm afraid size:all doesn't include the repo-archives either.

But I'm running it in a container, so perhaps I can limit the size the container gets assigned.

Black616Angel@discuss.tchncs.de on 08 Sep 13:30

It kinda seems like it. Docker apparently does have this functionality as seen here: stackoverflow.com/questions/40494536/…/40499023#4…

You could try limiting it to 5 GB using the Forgejo settings and 7 GB using Docker, and then just see how big it gets.
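
Per that answer, the Docker side would be roughly this (image tag illustrative; note it caps the container's writable layer, not named volumes, and with overlay2 it needs the backing filesystem to be XFS mounted with pquota):

# limit the container's writable layer to 7 GB
docker run -d --storage-opt size=7G codeberg.org/forgejo/forgejo:9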

jeena@piefed.jeena.net on 08 Sep 14:59

Hm, but this only works on tmpfs, which is in memory. It seems I could have done it with XFS too: https://fabianlee.org/2020/01/13/linux-using-xfs-project-quotas-to-limit-capacity-within-a-subdirectory/ but I used ext4 out of habit.
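
For the record, the XFS recipe from that article boils down to this (paths and project ID illustrative; the filesystem has to be mounted with the prjquota option, which XFS won't enable on a live remount):

# register the directory as quota project 10, then cap it at 5 GB
echo "10:/srv/forgejo/repo-archive" | sudo tee -a /etc/projects
echo "forgejo-archives:10" | sudo tee -a /etc/projid
sudo xfs_quota -x -c 'project -s forgejo-archives' /srv
sudo xfs_quota -x -c 'limit -p bhard=5g forgejo-archives' /srv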

fireshell@kbin.earth on 08 Sep 06:29

Script for monitoring disk space in Linux

The script below is designed to monitor disk space usage on a specified server partition. Configurable parameters include the maximum allowable percentage of disk space usage (MAX), the e-mail address to receive alerts (EMAIL) and the target partition (PARTITION).

The script uses the df command to collect disk usage information and sends email alerts if the current usage exceeds the specified threshold.

#!/bin/bash
# Script: df_guard.sh

# Set the maximum allowed disk space usage percentage
MAX=90

# Set the email address to receive alerts
EMAIL=user@example.com

# Set the partition to monitor (change accordingly, e.g., /dev/sda1)
PARTITION=/dev/sda1

# Get the current disk usage percentage and related information
USAGE_INFO=$(df -h "$PARTITION" | awk 'NR==2 {print $5, $1, $2, $3, $4}' | tr '\n' ' ')
USAGE=$(echo "$USAGE_INFO" | awk '{print int($1)}') # Remove the percentage sign

if [ "$USAGE" -gt "$MAX" ]; then
# Send an email alert with detailed disk usage information
echo -e "Warning: Disk space usage on $PARTITION is $USAGE%.\n\nDisk Usage Information:\n$USAGE_INFO" | \
mail -s "Disk Space Alert on $HOSTNAME" "$EMAIL"
fi

Installation

sudo install -m 0755 df_guard.sh /usr/local/bin/df_guard.sh

The -m 0755 flag already makes the script executable, so no separate chmod step is needed.

Launch examples

  • Every 15 minutes.

In crontab (root)

*/15 * * * * /usr/local/bin/df_guard.sh

jeena@piefed.jeena.net on 08 Sep 12:22

I have monitoring for that, but it happened during the night while I was sleeping.

Actually, I saw a lot of Forgejo activity on the server yesterday, but I didn't think it would go so fast.

Moonrise2473@feddit.it on 08 Sep 09:12

I “fixed” the problem of those fucking bots by blocking everyone except my country.

jeena@piefed.jeena.net on 08 Sep 12:55

Sadly that's not the solution to my problem. The whole point of open-sourcing, for me, is to make the code accessible to as many people as possible.

jeena@piefed.jeena.net on 08 Sep 14:51

For now I asked ChatGPT to help me implement a simple return 403 on bot user agents. I looked into my logs and collected the bot names I saw. I know it won't hold forever, but for now it's quite nice. I just added this file at /etc/nginx/conf.d/block_bots.conf; it gets loaded before all the vhosts and rejects all the bots, while the rest goes to the vhosts as normal. This way I don't need to implement it in each vhost separately.

➜ jeena@Abraham conf.d cat block_bots.conf 
# /etc/nginx/conf.d/block_bots.conf  

# 1️⃣ Map user agents to $bad_bot  
map $http_user_agent $bad_bot {
    default 0;

    ~*SemrushBot                                 1;
    ~*AhrefsBot                                  1;
    ~*PetalBot                                   1;
    ~*YisouSpider                                1;
    ~*Amazonbot                                  1;
    ~*VelenPublicWebCrawler                      1;
    ~*DataForSeoBot                              1;
    "~*Expanse, a Palo Alto Networks company"    1;
    ~*BacklinksExtendedBot                       1;
    ~*ClaudeBot                                  1;
    ~*OAI-SearchBot                              1;
    ~*GPTBot                                     1;
    ~*meta-externalagent                         1;
}

# 2️⃣ Global default server to block bad bots  
server {  
    listen 80 default_server;  
    listen [::]:80 default_server;  
    listen 443 ssl default_server;  
    listen [::]:443 ssl default_server;  

    # dummy SSL cert for HTTPS  
    ssl_certificate     /etc/ssl/certs/ssl-cert-snakeoil.pem;  
    ssl_certificate_key /etc/ssl/private/ssl-cert-snakeoil.key;  

    # block bad bots  
    if ($bad_bot) {  
        return 403;  
    }  

    # close connection for anything else hitting default server  
    return 444;  
}  
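
One caveat I still want to double-check: the map gets evaluated for every request, but this return 403 only fires in the default server, i.e. for requests whose Host header doesn't match any real vhost's server_name. If the bots send proper Host headers, the same check has to go into each vhost too:

# inside a vhost's server block
if ($bad_bot) {
    return 403;
}
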
JadedBlueEyes@programming.dev on 11 Sep 22:51

A few days late, but I have a pretty similar use case to yours on forgejo.ellis.link. My solution is go-away (git.gammaspectra.live/git/go-away), which just sits as a reverse proxy between traefik and Forgejo. I haven't enabled fancy stuff like TLS fingerprinting. It's been effective enough at killing the bots that were downloading archives and DDoSing the server from residential IPs. My config is based on the example Forgejo config, but with a few tweaks. Too long to post here, though, so message me if you need access.

jeena@piefed.jeena.net on 12 Sep 00:35

For now, disabling archives plus my simple list of bots to drop in Nginx seems to work very well; it doesn't create the archives anymore, and the load on the server went down too.
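
For reference, if I'm reading the config cheat sheet right, the relevant app.ini switch for turning archives off entirely is this one (key name inherited from Gitea, double-check it for your Forgejo version):

; disable source-archive downloads altogether
[repository]
DISABLE_DOWNLOAD_SOURCE_ARCHIVES = true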