Off-and-on trying out an account over at @[email protected] due to scraping bots bogging down lemmy.today to the point of near-unusability.

  • 58 Posts
  • 1.48K Comments
Joined 2 years ago
Cake day: October 4th, 2023

  • I don’t know of a pre-wrapped utility to do that, but assuming that this is a Linux system, here’s a simple bash script that’d do it.

    #!/bin/bash
    
    # Set this.  Path to a directory that will retain a history of your file
    # list.  You probably don't actually want this in /tmp, or it'll be wiped
    # on reboot.
    
    file_list_location=/tmp/storage-history
    
    # Set this.  Path to the location with the files that you want to monitor.
    
    path_to_monitor=path-to-monitor
    
    # If the file list location doesn't yet exist, create it and initialize
    # a git repo there.  (-b master forces the branch name, so the checkout
    # below works even if your git defaults to "main".)
    if [[ ! -d "$file_list_location" ]]; then
        mkdir -p "$file_list_location"
        git -C "$file_list_location" init -b master
    fi
    
    # In case someone's checked out things at a different point in history.
    # This harmlessly fails on the very first run, before any commit exists.
    git -C "$file_list_location" checkout master 2>/dev/null
    
    find "$path_to_monitor" | sort > "$file_list_location/files.txt"
    git -C "$file_list_location" add "$file_list_location/files.txt"
    git -C "$file_list_location" commit -m "Updated file list for $(date)"
    
    

    That’ll drop a text file at /tmp/storage-history/files.txt with a list of the files at that location, and create a git repo at /tmp/storage-history that will contain a history of that file.

    When your drive array kerplodes or something, your files.txt file will probably become empty if the mount goes away, but you’ll have a git repository containing a full history of your list of files, so you can go back to a list of the files there as they existed at any historical date.

    Run that script nightly out of your crontab or something ($ crontab -e to edit your crontab).
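    As a sketch, a nightly crontab entry might look like this (the script path is hypothetical; put yours wherever you saved it):

```
# m  h  dom mon dow  command  --  run at 03:15 every night
15 3 * * * /usr/local/bin/update-file-list.sh
```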

    As the script says, you need to choose a file_list_location (not /tmp, since that’ll be wiped on reboot), and set path_to_monitor to wherever the tree of files is that you want to keep track of (like, /mnt/file_array or whatever).

    You could save a bit of space by adding a line at the end to remove the current files.txt after generating the current git commit if you want. The next run will just regenerate files.txt anyway, and you can use git to regenerate a copy of the file as of any historical day you want. If you’re not familiar with git, run $ git log to find the hashref for a given day, then $ git checkout <hashref> to move to where things were on that day.
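    If you want a concrete recovery recipe, here's a self-contained sketch you can run in a throwaway directory, assuming git is installed. It builds a tiny history repo with two "daily" commits, then pulls the older day's list back out (all paths and messages are made up; in real use you'd pick the hashref out of git log by date):

```shell
#!/bin/bash
set -e

# Build a throwaway history repo with two "daily" snapshots.
demo=$(mktemp -d)
git -C "$demo" init -q -b master
git -C "$demo" config user.email demo@example.com
git -C "$demo" config user.name demo

echo "/mnt/array/a.txt" > "$demo/files.txt"
git -C "$demo" add files.txt
git -C "$demo" commit -q -m "Updated file list for day one"

echo "/mnt/array/b.txt" > "$demo/files.txt"
git -C "$demo" add files.txt
git -C "$demo" commit -q -m "Updated file list for day two"

# Find the hashref of the older commit (here, simply the second-newest)...
old=$(git -C "$demo" rev-list --max-count=1 --skip=1 HEAD)

# ...and restore that day's files.txt into the working tree.
git -C "$demo" checkout -q "$old" -- files.txt
cat "$demo/files.txt"
```

    Note this uses the `git checkout <hashref> -- <file>` form, which restores just that one file without detaching HEAD, so the next nightly commit still lands on master.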

    EDIT: Moved the git checkout up.



  • One issue I do wonder about — the Threadverse is pretty small today compared to something like Reddit, but I also suspect that a lot of these Peertube hosts don’t have massive amounts of spare bandwidth to handle sudden, coordinated spikes in demand. I wonder if enough people start using this and a lot of people hit the same Peertube instance at the same time in response to a post of a video, whether it might be enough to produce bandwidth congestion on that instance.







  • Is this worth the effort?

    In terms of electricity cost?

    I wouldn’t do it myself.

    If you want to know whether it’s going to save money, you want to see how much power it uses — you can use a wattmeter, or look up the maximum amount on the device ratings to get an upper end. Look up how much you’re paying per kWh in electricity. Price the hardware. Put a price on your labor. Then you can get an estimate.
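    To make that concrete, here's a back-of-the-envelope sketch; every number in it is made up, so substitute your own wattmeter reading and electricity rate:

```shell
#!/bin/bash
set -e
# Hypothetical figures: a 40 W machine at 30 cents/kWh, running 24/7.
watts=40
price_cents_per_kwh=30

hours_per_year=$((24 * 365))
kwh_per_year=$((watts * hours_per_year / 1000))          # 40 * 8760 / 1000 = 350
cost_dollars_per_year=$((kwh_per_year * price_cents_per_kwh / 100))

echo "~${kwh_per_year} kWh/year, roughly \$${cost_dollars_per_year}/year in electricity"
```

    Compare that yearly electricity figure against the price of the hardware plus your labor, and you have your estimate.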

    My guess, without having any of those numbers, is that it probably isn’t.



  • I’m fine with it — and I think that it improves searchability to have one per community, rather than some bulk post – but the posts need to be marked NSFW, and some of the announcement posts are not, which means that people who have NSFW stuff blocked are still getting them. I think that that’s the real problem.

    My understanding is that this is something of an exceptional situation, as apparently lemmynsfw.com — the biggest NSFW community host on the Threadiverse — went down and the admin was supposed to be away for some months, so it’s not coming back up in at least the near future, and so it sounds like people are setting up alternatives on other instances.


  • You would typically want to use static IP addresses for servers (because if you use DHCP, the IP is gonna change sooner or later, and it’s gonna be a pain in the butt).

    In this case, he controls the local DHCP server, which is gonna be running on the OpenWRT box, so he can set it to always assign whatever he wants to a given MAC.


  • tal@lemmy.today to Selfhosted@lemmy.world, re: “[Solved] OpenWrt & fail2ban”

    except that all requests’ IP addresses are set to the router’s IP address (192.168.3.1), so I am unable to use proper rate limiting and especially fail2ban.

    I’d guess that however the network is configured, you have the router NATting traffic going from the LAN to the Internet (typical for a home broadband router) as well as from the home LAN to the server.

    That does provide security benefits in that you’ve basically “put the server on the Internet side of things”, and the server can’t just reach into the LAN, same as anything else on the Internet. The NAT table has to have someone on the LAN side opening a connection to establish a new entry.

    But…then all of those hosts on the LAN are going to have the same IP address from the server’s standpoint. That’s the experience that hosts on the Internet have towards the same hosts on your LAN.

    It sounds like you also want to use DHCP:

    Getting the router to actually assign an IP address to the server was quite a headache

    I’ve never used VLANs on Linux or OpenWRT, and don’t know how OpenWRT interacts with the router’s switch hardware.

    I guess what you want to do is to not NAT traffic between the LAN (where most of your hardware lives) and the DMZ (where the server lives), but still to disallow the DMZ from initiating connections to the LAN.

    considers

    So, I don’t know whether the VLAN stuff is necessary on your hardware to prevent the router hardware from acting like a switch, moving Ethernet packets directly, without them going to Linux. Might be the case.

    I suppose what you might do — from a network standpoint, don’t know off-the-cuff how to do it on OpenWRT, though if you’re just using it as a generic Linux machine, without using any OpenWRT-specific stuff, I’m pretty sure that it’s possible — is to give the OpenWRT machine two non-routable IP addresses, something like:

    192.168.1.1 for the LAN

    and

    192.168.2.1 for the DMZ

    The DHCP server listens on 192.168.1.1 and serves DHCP responses for the LAN that tell it to use 192.168.1.1 as the default route. Ditto for hosts in the DMZ. It hands out addresses from the appropriate pool. So, for example, the server in the DMZ would maybe be assigned 192.168.2.2.
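    OpenWRT's DHCP server is normally dnsmasq under the hood. As a sketch of the two-pool setup (the ranges, lease time, and MAC address are all hypothetical, and OpenWRT would normally express this in its own UCI config rather than raw dnsmasq syntax):

```
# One pool per subnet; dnsmasq picks the pool matching the interface
# the request arrived on.
dhcp-range=192.168.1.100,192.168.1.200,12h
dhcp-range=192.168.2.100,192.168.2.200,12h
# Pin the server's MAC to a fixed DMZ address.
dhcp-host=aa:bb:cc:dd:ee:ff,192.168.2.2
```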

    Then it should be possible to have routing table entries that route traffic from the LAN to 192.168.2.0/24 via 192.168.2.1 and, vice versa, from the DMZ to 192.168.1.0/24 via 192.168.1.1. Linux is capable of doing that, as that’s standard IP routing stuff.
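    On a generic Linux box, that part amounts to giving the kernel both addresses and turning on forwarding. A sketch (the interface names lan0 and dmz0 are placeholders, not anything OpenWRT-specific):

```
ip addr add 192.168.1.1/24 dev lan0
ip addr add 192.168.2.1/24 dev dmz0
sysctl -w net.ipv4.ip_forward=1
# Both subnets are now directly connected, so the kernel's connected routes
# already cover forwarding between them; no explicit 'ip route add' is needed.
```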

    When a LAN host initiates a TCP connection to a DMZ host, it’ll look up the destination IP address in its routing table, say “hey, that isn’t on the same network as me, send it to the default route”. That’ll go to 192.168.1.1, with a destination address of 192.168.2.2. The OpenWRT box forwards it, doing IP routing, to 192.168.2.1, and then that box says “ah, that’s on my network, send it out the network port with VLAN tag whatever” and the switch fabric is configured to segregate the ports based on VLAN tag, and only sends the packet out the port associated with the DMZ.

    The problem is that the reason home users typically derive indirect security benefits from NAT is that it intrinsically disallows incoming connections from the server to the LAN. This setup makes that go away: the LAN hosts and DMZ hosts will be on separate “networks”, so things like ARP requests and other stuff at the purely-Ethernet level won’t reach each other, but they can freely communicate with each other at the IP level, because the two 192.168.X.1 virtual addresses will route packets between the two networks.

    So you’re going to need to firewall off incoming TCP connections (and maybe UDP and ICMP and whatever else you want to block) inbound on the 192.168.1.0/24 network from the 192.168.2.0/24 network. You can probably do that with iptables at the Linux level. OpenWRT may have some sort of existing firewall package that applies a set of iptables rules. I think that all the traffic should be reaching the Linux kernel in this scenario.
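    A sketch of what those rules might look like with plain iptables, using the subnets from above (this is generic Linux firewalling, not OpenWRT's own firewall config, and rule ordering matters):

```
# Let DMZ hosts answer connections that the LAN opened...
iptables -A FORWARD -s 192.168.2.0/24 -d 192.168.1.0/24 \
         -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# ...but drop anything the DMZ tries to initiate toward the LAN.
iptables -A FORWARD -s 192.168.2.0/24 -d 192.168.1.0/24 -j DROP
# LAN -> DMZ is unrestricted.
iptables -A FORWARD -s 192.168.1.0/24 -d 192.168.2.0/24 -j ACCEPT
```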

    If you get that set up, hosts at 192.168.2.2, on the DMZ, should be able to see connections from 192.168.1.2, on the LAN, using its original IP address.

    That should work if what you had was a Linux box with three Ethernet cards (one each for the Internet, the LAN, and the DMZ) and the VLAN switch hardware stuff wasn’t in the picture; you’d just not do any VLAN stuff then. I’m not 100% certain that the VLAN switching fabric stuff won’t muck that up — I’ve only very rarely touched VLANs myself, and never tried to do this: using VLANs to make switch fabric attached directly to a router act like independent NICs. But I can believe that it’d work.

    If you do set it up, I’d also fire up sudo tcpdump on the server. If things are working correctly, sudo ping -b 192.168.1.255 on a host on the LAN shouldn’t show up as reaching the server. However, ping 192.168.2.2 should.

    You’re going to want traffic that doesn’t match a NAT table entry and is coming in from the Internet to be forwarded to the DMZ vlan.

    That’s a high-level view of what I believe needs to happen. But I can’t give you a hand-holding walkthrough to configure it off the cuff, because I haven’t done a fair bit of this myself — sorry about that.

    EDIT: This isn’t the question you asked, but I’d also add that what I’d probably do myself if I were planning to set something like this up is get a small, low-power Linux machine with multiple NICs (well, okay, probably one NIC with multiple ports). That takes the switch-level stuff that I think you’d otherwise need to contend with out of the picture, and then I don’t think that you’d need to deal with VLANs, which is a headache that I wouldn’t want, especially if getting it wrong might have security implications. If you need more ports for the LAN, then just throw a regular old separate hardware Ethernet switch on the LAN port. You know that the switch can’t be moving traffic between the LAN and DMZ networks itself then, because it can’t touch the DMZ. But I don’t know whether that’d make financial sense in your case, if you’ve already got the router hardware.





  • Ehhh…I mean, if the things have a GPS receiver, which I assume that they do, they can probably be configured to move to a given location and then only then flip on the cell radio to act as a relay.

    EDIT: Honestly, I’m kind of surprised that someone hasn’t tried a drone that can deploy, say hydrogen or helium balloons with a relay radio hanging from them. It’s gotta be a complete pain in the ass to try to shoot balloons down, as they’re cheap, and they probably linger in an area long enough to permit for operations using them as a relay on an extended basis. They can also probably get a lot higher than a comparable drone, if that’s desirable.





  • Actually, thinking about this…a more-promising approach might be deterrent via poisoning the information source. Not bulletproof, but that might have some potential.

    So, the idea here is that what you’d do there is to create a webpage that looks, to a human, as if only the desired information shows up.

    But you include false information as well. Not just an insignificant difference, as with a canary trap, or a real error intended to have minimal impact, only to identify an information source, as with a trap street. But outright wrong information, stuff where reliance on the stuff would potentially be really damaging to people relying on the information.

    You stuff that information into the page in a way that a human wouldn’t readily see. Maybe you cover that text up with an overlay or something. That’s not ideal, and someone browsing using, say, a text-mode browser like lynx might see the poison, but you could probably make that work for most users. That has some nice characteristics:

    • You don’t have to deal with the question of whether the information rises to the level of copyright infringement or not. It’s still gonna dick up responses being issued by the LLM.

    • Legal enforcement, which is especially difficult across international borders — The Pirate Bay continues to operate to this day, for example — doesn’t come up as an issue. You’re deterring via a different route.

    • The Internet Archive can still archive the pages.

    Someone could make a bot that post-processes your page to strip out the poison, but you could sporadically change up your approach, change it over time, and the question for an AI company is whether it’s easier and safer to just license your content and avoid the risk of poison, or to risk poisoned content slipping into their model whenever a media company adopts a new approach.

    I think the real question is whether someone could reliably make a mechanism that’s a general defeat for that. For example, most AI companies probably are just using raw text today for efficiency, but for specifically news sources known to do this, one could generate a screenshot of a page in a browser and then OCR the text. The media company could maybe still take advantage of ways in which generalist OCR and human vision differ — like, maybe humans can’t see text that’s 1% gray on a black background, but OCR software sees it just fine, so that’d be a place to insert poison. Or maybe the page displays poisoned information for a fraction of a second, long enough to be screenshotted by a bot, and then it vanishes before a human would have time to read it.

    shrugs

    I imagine that there are probably already companies working on the problem, on both sides.