• 1 Post
  • 68 Comments
Joined 1 year ago
Cake day: June 2nd, 2023


  • Traditional graphics code works by having the CPU generate a sequence of commands which are packed together and sent to the GPU to run. This extension lets you write code which runs on the GPU to generate commands, and then execute those same commands on the GPU without involving the CPU at all.

    This is a super powerful feature which makes it possible to do things which simply weren’t feasible in the traditional model. Vulkan improved on OpenGL by allowing people to build command buffers on multiple threads and to re-use existing command buffers. But GPU pipelines are getting so wide that scenes containing many objects with different render settings are bottlenecked by the rate at which the CPU can prepare commands, not by GPU throughput. Letting the GPU generate its own commands means you can leverage the GPU’s massive parallelism for the entire render process, and can also make render state changes much cheaper.

    (For anyone familiar, this is basically a more fleshed out version of NVIDIA’s proprietary NV_command_list extension for OpenGL, except that it’s in Vulkan and standardized across all GPU drivers)




  • You’ve made me uncertain if I’ve somehow never noticed this before, so I gave it a shot. I’ve been dd-ing /dev/random onto one of those drives for the last 20 minutes and the transfer rate has only dropped by about 4MB/s since I started, which is about the kind of slowdown I would expect as the drive head gets closer to the center of the platter.

    EDIT: I’ve now been doing 1.2GB/s onto an 8 drive RAID0 (8x 600GB 15k SAS Seagates) for over 10 minutes with no noticeable slowdown. That comes out to 150MB/s per drive, and these drives are from 2014 or 2015. If you’re only getting 60MB/s on a modern non-SMR HDD, especially something as dense as an 18TB drive, you’ve either configured something wrong or your hardware is broken.


  • This is for very long sustained writes, like 40TiB at a time. I can’t say I’ve ever noticed any slowdown, but I’ll keep a closer eye on it next time I do another huge copy. I’ve also never seen any kind of noticeable slowdown on my four 8TB SATA WD Golds, although they only get to about 150MB/s each.

    EDIT: The effect would be obvious pretty fast at even moderate write speeds, since I’ve never seen a drive with more than a GB of cache. My 16TB drives have 256MB, and the 8TB drives only 64MB of cache.
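
    To put rough numbers on "pretty fast", here is a back-of-the-envelope sketch in Python. The cache sizes and the 150MB/s rate are the figures quoted above; the 1GB entry is a hypothetical upper bound, and the model deliberately ignores that the drive drains its cache to the platters while it fills, so real fill times would be longer.

```python
def seconds_to_fill(cache_mb: float, write_mb_per_s: float) -> float:
    """Time until a cache of cache_mb is full if nothing is flushed meanwhile.

    Simplifying assumption: data arrives at write_mb_per_s and none of it is
    drained to the platters, which gives the shortest possible fill time.
    """
    return cache_mb / write_mb_per_s


WRITE_RATE_MB_S = 150  # per-drive rate mentioned in the comments above

drives = [
    ("8TB drive, 64MB cache", 64),
    ("16TB drive, 256MB cache", 256),
    ("hypothetical drive, 1GB cache", 1024),  # upper bound, not a real model
]

for label, cache_mb in drives:
    print(f"{label}: {seconds_to_fill(cache_mb, WRITE_RATE_MB_S):.1f} s")
```

    Under these assumptions every case lands in the range of seconds, so a cache running out should show up almost immediately and not 20 minutes into a copy.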










  • It’s not that obscure - I had a use case a while back where I had multiple rocksdb instances running on the same machine and wanted each of them to store their WAL only on SSD storage with compression and have the main tables be stored uncompressed on an HDD array with write-through SSD cache (ideally using the same set of SSDs for cost). I eventually did it, but it required partitioning the SSDs in half, using one half for a bcache (not bcachefs) in front of the HDDs and then using the other half of the SSDs to create a compressed filesystem, on which I created one subdirectory per instance and bind mounted each into the corresponding rocksdb database.

    Yes, it works, but it’s also ugly as sin and the SSD allocation between the cache and the WAL storage is also fixed (I’d like to use as much space as possible for caching). This would be just a few simple commands using bcachefs, and would also be completely transparent once configured (no messing around with dozens of fstab entries or bind mounts).


  • ext4 aims to not lose data under the assumption that the single underlying drive is reliable. btrfs/bcachefs/ZFS assume that one/many of the perhaps dozens of underlying drives could fail entirely or start returning garbage at any time, and try to ensure that the bad drive can be kicked out and replaced without losing any data or interrupting the system. They’re both aiming for stability, but the stability requirements at scale go well beyond what a “dumb” filesystem can offer, because once you have enough drives one of them WILL fail and ext4 cannot save you in that situation.

    Complaining that datacenter-grade filesystems are unreliable when using them in your home computer is like removing all but one of the engines from a 747 and then complaining that it’s prone to crashing. Of course it is, because it was designed under the assumption that there would be redundancy.
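
    The "one of them WILL fail" point is just compounding probability. A minimal Python sketch, assuming drives fail independently and using a made-up 2% annual failure rate (a hypothetical figure chosen for illustration, not a measurement):

```python
def p_any_failure(n_drives: int, p_single: float) -> float:
    """Probability that at least one of n independent drives fails.

    Computed as 1 minus the chance that every drive survives.
    """
    return 1.0 - (1.0 - p_single) ** n_drives


# Hypothetical per-drive annual failure rate, purely for illustration.
ASSUMED_ANNUAL_FAILURE_RATE = 0.02

for n in (1, 8, 48, 500):
    chance = p_any_failure(n, ASSUMED_ANNUAL_FAILURE_RATE)
    print(f"{n:>4} drives: {chance:.1%} chance of at least one failure per year")
```

    Whatever per-drive rate you plug in, the result climbs toward certainty as the drive count grows, which is the situation these filesystems are designed around.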


  • XFS still isn’t a multi-device filesystem, though… of course you can run it on top of mdraid/LVM, but that still doesn’t come close to the flexibility of what these specialized filesystems can do. Being able to simply run btrfs device add /dev/sdx1 / and immediately have the new space available is far less hassle than adding a device to an md array, then resizing the partition and then resizing the filesystem (and removing a device is even worse). Snapshots are a similar deal - sure, LVM can let you snapshot your entire virtual block device, but your snapshots are block devices themselves which need to be explicitly mounted, while in btrfs/bcachefs a snapshot is just a directory, and can be isolated to a specific subvolume rather than the entire block device.

    Data checksums are also substantially less useful when the filesystem can’t address the underlying devices individually, because it makes repairing the data from a replica impossible. If you have a file on an md RAID1 device and one of the replicas has a bad block, you might be able to detect the bitrot by verifying the checksum, but you can’t actually fix it, because even though there is a second copy of the data on another drive, mdadm exposes a single plain block device and doesn’t provide any way to read from “the other copy”. mdraid can recover from total drive failure, but not data corruption.
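
    The difference can be shown with a toy model in Python. This is only a sketch of the idea, not how btrfs or md are implemented: replicas are plain in-memory buffers, zlib.crc32 stands in for whatever checksum the filesystem really uses, and both function names are made up for illustration.

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in checksum for this sketch; real filesystems use their own.
    return zlib.crc32(block)


def read_with_repair(replicas: list[bytearray], expected: int) -> bytes:
    """Filesystem that can address every replica: detect AND repair."""
    good = next((bytes(r) for r in replicas if checksum(bytes(r)) == expected), None)
    if good is None:
        raise IOError("all replicas are corrupt")
    for r in replicas:
        if checksum(bytes(r)) != expected:
            r[:] = good  # rewrite the bad copy from the good one
    return good


def read_through_mirror(replicas: list[bytearray], expected: int, pick: int = 0) -> bytes:
    """Filesystem on top of an opaque mirror: it only ever sees one copy."""
    data = bytes(replicas[pick])  # the mirror layer chose this copy for us
    if checksum(data) != expected:
        raise IOError("checksum mismatch, and no way to request the other copy")
    return data


original = b"important data"
expected = checksum(original)
replicas = [bytearray(original), bytearray(original)]
replicas[0][0] ^= 0xFF  # simulate bitrot on the first copy

try:
    read_through_mirror(replicas, expected)
except IOError as err:
    print("layered setup:", err)

print("multi-device fs:", read_with_repair(replicas, expected))
```

    In the layered case the corruption is detected but the read fails, while the replica-aware read returns the good data and overwrites the bad copy.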



  • bcachefs is way more flexible than btrfs on multi-device filesystems. You can group storage devices together based on performance/capacity/whatever else, and then do funky things like assigning a group of SSDs as a write-through/write-back cache for a bigger array of HDDs. You can also configure a ton of properties for individual files or directories, including the cache+main storage group, number of data replicas, compression type, and quite a bit more.

    So you could have two files in the same folder, one of them stored compressed on an array of HDDs in RAID10 and the other one stored on a different array of HDDs uncompressed in RAID5 with a write-back SSD cache, and wouldn’t have to fiddle around with multiple filesystems and bind mounts - everything can be configured by simply setting xattr values. You could even have a third file which is striped across both groups of HDDs without having to partition them up.


  • I don’t recall the name, but I saw something some time ago which got infinite google drive storage by creating a bunch of empty folders and packing the data into the folder names. The storage limit is the sum of the size of all your files, so if there are no files then you don’t have any storage used, even if you have 100GB of folder names.
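
    I don’t know how that tool actually did it, but the encoding side of the trick is easy to sketch in Python. Everything here is an assumption for illustration: the 200 character name limit is made up (not Google Drive’s real limit), base32 is just one convenient name-safe encoding, and nothing talks to any Drive API; it only turns bytes into a list of folder names and back.

```python
import base64

MAX_NAME_LEN = 200  # hypothetical per-name limit, not a documented Drive value
INDEX_WIDTH = 8     # zero-padded sequence number so names sort in order


def pack(data: bytes, max_name_len: int = MAX_NAME_LEN) -> list[str]:
    """Encode data as a list of folder names of at most max_name_len chars."""
    chunk_len = max_name_len - INDEX_WIDTH - 1  # leave room for "NNNNNNNN-"
    encoded = base64.b32encode(data).decode("ascii")
    chunks = [encoded[i:i + chunk_len] for i in range(0, len(encoded), chunk_len)]
    return [f"{i:0{INDEX_WIDTH}d}-{chunk}" for i, chunk in enumerate(chunks)]


def unpack(names: list[str]) -> bytes:
    """Rebuild the original bytes from folder names in any order."""
    ordered = sorted(names)  # zero-padded indices make string sort correct
    encoded = "".join(name.split("-", 1)[1] for name in ordered)
    return base64.b32decode(encoded)


payload = bytes(range(256)) * 4
names = pack(payload)
print(len(names), "folder names, longest is", max(len(n) for n in names), "chars")
assert unpack(list(reversed(names))) == payload
```

    Base32 inflates the data by 60%, which doesn’t matter when the names aren’t counted against the quota in the first place.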