COMMENT PAGE FOR:
HTML Garage – An S3 object store so reliable you can run it outside datacenters
adamcharnock wrote 2 hours 37 min ago:
Copy/paste from a previous thread [0]:
We've done some fairly extensive testing internally recently and
found that Garage is somewhat easier to deploy in comparison to our
existing use of MinIO, but is not as performant at high speeds. IIRC we
could push about 5 gigabits of (not small) GET requests out of it, but
something blocked it from reaching the 20-25 gigabits (on a 25g NIC)
that MinIO could reach (also 50k STAT requests/s, over 10 nodes)
I don't begrudge it that. I get the impression that Garage isn't
necessarily focussed on this kind of use case.
---
In addition:
Next time we come to this we are going to look at RustFS [1], as well
as Ceph/Rook [2].
We can see we're going to have to move away from MinIO in the
foreseeable future. My hope is that the alternatives get a boost of
interest given the direction MinIO is now taking.
HTML [0]: https://news.ycombinator.com/item?id=46140342
HTML [1]: https://rustfs.com/
HTML [2]: https://rook.io/
nine_k wrote 32 min ago:
They explicitly say that top performance is not a goal: «high
performances constrain a lot the design and the infrastructure; we
seek performances through minimalism only» ( [1] )
But it might be interesting to see where the time is spent. I suspect
they may be doing fewer things in parallel than MinIO, but maybe it's
something entirely different.
HTML [1]: https://garagehq.deuxfleurs.fr/documentation/design/goals/
__turbobrew__ wrote 1 hour 24 min ago:
I wouldn't use rook if you solely want S3. It is a massively
complex system which you really need to invest in understanding or
else your cluster will croak at some point and you will have no idea
on how to fix it.
breakingcups wrote 1 hour 11 min ago:
Is there a better solution for self-healing S3 storage that you
could recommend? I'm also curious what will make a rook cluster
croak after some time and what kind of maintenance is required in
your experience.
adamcharnock wrote 48 min ago:
Haven't used it yet, but RustFS sounds like it has self-healing
HTML [1]: https://docs.rustfs.com/troubleshooting/healing.html
adastra22 wrote 57 min ago:
ceph?
hardwaresofton wrote 2 hours 4 min ago:
Please also consider including SeaweedFS in the testing.
awoimbee wrote 2 hours 48 min ago:
How is Garage for a simple local dev env?
I recently used seaweedfs since they have a super simple minimal setup
compared to garage which seemed to require a config file just to get
started.
supernes wrote 2 hours 57 min ago:
I tried it recently. Uploaded around 300 documents (1GB) and then went
to delete them. Maybe my client was buggy, because the S3 service
inside the container crashed and couldn't recover - I had to restart
it. It's a really cool project, but I wouldn't really call it
"reliable" from my experience.
allanrbo wrote 3 hours 10 min ago:
I use Syncthing a lot. Is Garage only really useful if you specifically want to expose a drop-in S3-compatible API, or does it also provide other benefits over Syncthing?
sippeangelo wrote 2 hours 57 min ago:
You use Syncthing for object storage?
lxpz wrote 2 hours 58 min ago:
They are not solving the same problem.
Syncthing will synchronize a full folder between an arbitrary number
of machines, but you still have to access this folder one way or
another.
Garage provides an HTTP API for your data, and handles internally the
placement of this data among a set of possible replica nodes. But the
data is not in the form of files on disk like the ones you upload to
the API.
Syncthing is good for, e.g., synchronizing your documents or music
collection between computers. Garage is good as a storage service for
back-ups with e.g. Restic, for media files stored by a web
application, for serving personal (static) web sites to the Internet.
Of course, you can always run something like Nextcloud in front of
Garage and get folder synchronization between computers somewhat like
what you would get with Syncthing.
But to answer your question, yes, Garage specifically provides only an S3-compatible API.
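A minimal sketch of what "an HTTP API for your data" means in practice, using Python's boto3 against a hypothetical local Garage endpoint (the port, region, bucket, and credentials below are placeholders, not anything from the thread):

    import boto3

    # Any S3 client works; only the endpoint and credentials point it at Garage.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:3900",   # assumed local Garage S3 port
        aws_access_key_id="GK_EXAMPLE_KEY_ID",
        aws_secret_access_key="EXAMPLE_SECRET",
        region_name="garage",
    )

    # Objects go in and out over HTTP; Garage decides which replica nodes store them.
    s3.put_object(Bucket="my-bucket", Key="docs/report.pdf", Body=b"hello")
    obj = s3.get_object(Bucket="my-bucket", Key="docs/report.pdf")
    print(obj["Body"].read())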
ekjhgkejhgk wrote 3 hours 16 min ago:
Anybody understand how this compares with Vast?
JonChesterfield wrote 4 hours 13 min ago:
Corrupts data on power loss according to their own docs. Like what you
get outside of data centers. Not reliable then.
lxpz wrote 3 hours 27 min ago:
Losing a node is a regular occurrence, and a scenario for which
Garage has been designed.
The assumption Garage makes, which is well-documented, is that of 3
replica nodes, only 1 will be in a crash-like situation at any time.
With 1 crashed node, the cluster is still fully functional. With 2
crashed nodes, the cluster is unavailable until at least one
additional node is recovered, but no data is lost.
In other words, Garage makes a very precise promise to its users,
which is fully respected. Database corruption upon power loss falls under the definition of a "crash state", just like a node being offline due to a lost internet connection. We recommend making
metadata snapshots so that recovery of a crashed node is faster and
simpler, but it's not required per se: Garage can always start over
from an empty database and recover data from the remaining copies in
the cluster.
To talk more about concrete scenarios: if you have 3 replicas in 3
different physical locations, the assumption of at most one crashed node is pretty reasonable: it's quite unlikely that 2 of the 3 locations will be offline at the same time. As for data corruption on a power loss, the probability of losing power at 3 distant sites at the exact same time, with the same data in the write buffers, is extremely low, so I'd say in practice it's not a problem.
Of course, this all implies a Garage cluster running with 3-way
replication, which everyone should do.
JonChesterfield wrote 2 hours 6 min ago:
That is a much stronger guarantee than your documentation currently
claims. One site falling over and being rebuilt without loss is
great. One site losing power, corrupting the local state, then
propagating that corruption to the rest of the cluster would not be
fine. Different behaviours.
lxpz wrote 1 hour 40 min ago:
Fair enough, we will work on making the documentation clearer.
jiggawatts wrote 2 hours 46 min ago:
So if you put a 3-way cluster in the same building and they lose
power together, then what? Is your data toast?
InitialBP wrote 2 hours 0 min ago:
It sounds like that's a possibility, but why on earth would you take the time to set up a 3-node cluster of object storage for reliability and ignore one of the key tenets of what makes it reliable?
lxpz wrote 2 hours 42 min ago:
If I make certain assumptions and you respect them, I will give
you certain guarantees. If you don't respect them, I won't
guarantee anything. I won't guarantee that your data will be
toast either.
topspin wrote 4 hours 13 min ago:
No tags on objects.
Garage looks really nice: I've evaluated it with test code and
benchmarks and it looks like a winner. Also, very straightforward
deployment (self contained executable) and good docs.
But no tags on objects is a pretty big gap, and I had to shelve it. If
Garage folk see this: please think on this. You obviously have the
talent to make a killer application, but tags are table stakes in the
"cloud" API world.
lxpz wrote 3 hours 20 min ago:
Thank you for your feedback, we will take it into account.
topspin wrote 1 hour 9 min ago:
Great, and thank you.
I really, really appreciate that Garage accommodates running as a
single node without work-arounds and special configuration to yield
some kind of degraded state. Despite the single-minded focus on distributed operation you no doubt hear about endlessly (as seen among some comments here), there are, in fact, traditional use cases
where someone will be attracted to Garage only for the API
compatibility, and where they will achieve availability in
production sufficient to their needs by means other than
clustering.
apawloski wrote 4 hours 36 min ago:
Is it the same consistency model as S3? I couldn't see anything about
it in their docs.
lxpz wrote 3 hours 24 min ago:
Read-after-write consistency: yes (after PutObject has finished, the
object will be immediately visible in all subsequent requests,
including GetObject and ListObjects)
Conditional writes: no, we can't do it with CRDTs, which are the
core of Garage's design.
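For readers wondering what such a conditional write looks like, a short sketch with boto3 (assuming a recent botocore that exposes S3's If-None-Match parameter; the endpoint, bucket, and key names are placeholders). This is the kind of compare-and-swap-style request that Garage says it cannot provide on top of CRDTs:

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3", endpoint_url="http://localhost:3900",
                      aws_access_key_id="GK_EXAMPLE", aws_secret_access_key="EXAMPLE")

    try:
        # "Create only if the key does not already exist" -- an If-None-Match PUT.
        s3.put_object(Bucket="my-bucket", Key="locks/leader",
                      Body=b"node-1", IfNoneMatch="*")
    except ClientError as e:
        # On AWS this fails with PreconditionFailed if the key already exists;
        # Garage does not implement the conditional semantics at all.
        print("conditional write refused:", e.response["Error"]["Code"])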
skrtskrt wrote 2 hours 31 min ago:
Does RAMP or CURE offer any possibility of conditional writes with
CRDTs?
I have had these papers on my list to read for months, specifically
wondering if it could be applied to Garage [1]
HTML [1]: https://dd.thekkedam.org/assets/documents/publications/Rep...
HTML [2]: http://www.bailis.org/papers/ramp-sigmod2014.pdf
lxpz wrote 1 hour 25 min ago:
I had a quick look at these two papers; it looks like neither of them allows the implementation of compare-and-swap, which is
required for if-match / if-none-match support. They have a weaker
definition of a "transaction". Which is to be expected as they
only implement causal consistency at best and not consensus,
whereas consensus is required for compare-and-swap.
thhck wrote 5 hours 26 min ago:
BTW [1] is one of the most beautiful websites I have ever seen
HTML [1]: https://deuxfleurs.fr/
codethief wrote 3 hours 49 min ago:
It's beautiful from an artistic point of view but also rather hard to
read and probably not very accessible (haven't checked it, though,
since I'm on my phone).
isoprophlex wrote 2 hours 57 min ago:
Works perfectly on an iPhone. I can't attest to the accessibility
features, but the aesthetic is absolutely wonderful. Something I
love, and went for on my own portfolio/company website... this is
executed 100x better tho, clearly a labor of love and not 30
minutes of shitting around in vi.
wyattjoh wrote 5 hours 47 min ago:
Wasn't expecting to see it hosted on forgejo. Kind of a breath of fresh
air to be honest.
Eikon wrote 5 hours 50 min ago:
Unfortunately, this doesn't support conditional writes through
if-match and if-none-match [0] and thus is not compatible with ZeroFS
[1].
HTML [0]: https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1052
HTML [1]: https://github.com/Barre/ZeroFS
faizshah wrote 5 hours 52 min ago:
One really useful use case for Garage for me has been data engineering scripts. I can just use the S3 integration that every tool has to dump to Garage, and then I can more easily scale up to the cloud later.
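A sketch of that pattern (the environment variable and bucket name are made up): the same boto3 call targets Garage locally or real AWS S3 later, depending only on whether an endpoint override is set:

    import os
    import boto3

    # S3_ENDPOINT is an assumed convention, e.g. http://localhost:3900 for Garage.
    # When it is unset, endpoint_url is None and boto3 falls back to AWS S3.
    s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT"))

    s3.upload_file("events.parquet", "analytics", "raw/2024/events.parquet")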
agwa wrote 6 hours 2 min ago:
Does this support conditional PUT (If-Match / If-None-Match)?
codethief wrote 3 hours 46 min ago:
HTML [1]: https://news.ycombinator.com/item?id=46328218
doctorpangloss wrote 6 hours 9 min ago:
[1] this is the reliability question, no?
HTML [1]: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main-v1...
lxpz wrote 3 hours 23 min ago:
I talked about the meaning of the Jepsen test and the results we obtained in the FOSDEM'24 talk [1]. Slides are available at [2].
HTML [1]: https://archive.fosdem.org/2024/schedule/event/fosdem-2024-3...
HTML [2]: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4efc8...
fabian2k wrote 6 hours 10 min ago:
Looks interesting for something like local development. I don't intend
to run production object storage myself, but some of the stuff in the
guide to the production setup ( [1] ) would scare me a bit:
> For the metadata storage, Garage does not do checksumming and
integrity verification on its own, so it is better to use a robust
filesystem such as BTRFS or ZFS. Users have reported that when using
the LMDB database engine (the default), database files have a tendency
of becoming corrupted after an unclean shutdown (e.g. a power outage),
so you should take regular snapshots to be able to recover from such a
situation.
It seems like you can also use SQLite, but a default database that
isn't robust against power failure or crashes seems surprising to me.
HTML [1]: https://garagehq.deuxfleurs.fr/documentation/cookbook/real-wor...
lxpz wrote 3 hours 17 min ago:
If you know of an embedded key-value store that supports
transactions, is fast, has good Rust bindings, and does
checksumming/integrity verification by default such that it almost
never corrupts upon power loss (or at least, is always able to
recover to a valid state), please tell me, and we will integrate it
into Garage immediately.
__turbobrew__ wrote 1 hour 21 min ago:
RocksDB possibly. Used in high throughput systems like Ceph OSDs.
patmorgan23 wrote 1 hour 48 min ago:
Valkey?
fabian2k wrote 2 hours 9 min ago:
I don't really know enough about the specifics here. But my main point isn't about checksums; it's more about something like the WAL in Postgres. For an embedded KV store this is probably not the
solution, but my understanding is that there are data structures
like LSM that would result in similar robustness. But I don't
actually understand this topic well enough.
Checksumming detects corruption after it happened. A database like
Postgres will simply notice it was not cleanly shut down and put
the DB into a consistent state by replaying the write ahead log on
startup. So that is kind of my default expectation for any DB that
handles data that isn't ephemeral or easily regenerated.
But I also likely have the wrong mental model of what Garage does
with the metadata, as I wouldn't have expected it to ever be limited by SQLite.
lxpz wrote 1 hour 55 min ago:
So the thing is, different KV stores have different trade-offs,
and for now we haven't yet found one that has the best of all
worlds.
We do recommend SQLite in our quick-start guide to set up a single-node deployment for small/moderate workloads, and it works
fine. The "real world deployment" guide recommends LMDB because
it gives much better performance (with the current status of
Garage, not to say that this couldn't be improved), and the risk
of critical data loss is mitigated by the fact that such a
deployment would use multi-node replication, meaning that the
data can always be recovered from another replica if one node is
corrupted and no snapshot is available. Maybe this should be worded better; I can see that the alarmist wording of the deployment guide is creating quite a debate, so we probably need to make these facts clearer.
We are also experimenting with Fjall as an alternate KV engine based on LSM trees, as it theoretically has good speed and crash resilience, which would make it the best option. We are just not recommending it by default yet, as we don't have much data to confirm that it lives up to these expectations.
agavra wrote 2 hours 46 min ago:
Sounds like a perfect fit for [1] -- it's just that (an embedded, Rust KV store that supports transactions). It's built specifically to run on object storage. It currently relies on the `object_store` crate, but we're considering OpenDAL instead, so if Garage works with those crates (I assume it does if it's S3 compatible) it should just work OOTB.
HTML [1]: https://slatedb.io/
BeefySwain wrote 3 hours 12 min ago:
(genuinely asking) why not SQLite by default?
lxpz wrote 3 hours 5 min ago:
We were not able to get good enough performance compared to LMDB. We will work on this more though; there are probably many ways performance can be increased by reducing load on the KV store.
srcreigh wrote 1 hour 10 min ago:
Did you try WITHOUT ROWID? Your sqlite implementation[1] uses a
BLOB primary key. In SQLite, this means each operation requires
2 b-tree traversals: the BLOB->rowid tree and the rowid->data tree.
If you use WITHOUT ROWID, you traverse only the BLOB->data
tree.
Looking up lexicographically similar keys gets a huge
performance boost since sqlite can scan a B-Tree node and the
data is contiguous. Your current implementation is chasing
pointers to random locations in a different b-tree.
I'm not sure exactly whether the on-disk size would get smaller or larger. It probably depends on the key size and value size
compared to the 64 bit rowids. This is probably a well studied
question you could find the answer to.
HTML [1]: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit...
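A small sqlite3 sketch of the difference being described (the table and column names are made up, not Garage's schema):

    import sqlite3

    con = sqlite3.connect(":memory:")

    # Default rowid table: the table b-tree is keyed on a hidden 64-bit rowid,
    # so a BLOB primary key lives in a separate index b-tree (key -> rowid -> row).
    con.execute("CREATE TABLE kv_default (k BLOB PRIMARY KEY, v BLOB)")

    # WITHOUT ROWID: the table b-tree itself is keyed on k, so a point lookup is
    # one traversal and lexicographically close keys sit next to each other on disk.
    con.execute("CREATE TABLE kv_clustered (k BLOB PRIMARY KEY, v BLOB) WITHOUT ROWID")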
lxpz wrote 50 min ago:
Very interesting, thank you. It would probably make sense for
most tables but not all of them because some are holding
large CRDT values.
tensor wrote 2 hours 13 min ago:
Keep in mind that write safety comes with performance
penalties. You can turn off write protections and many databases will be super fast, but easily corrupted.
skrtskrt wrote 2 hours 37 min ago:
Could you use something like Fly's Corrosion to shard and
distribute the SQLite data?
It uses CRDT-based reconciliation, which is familiar territory for Garage.
lxpz wrote 2 hours 15 min ago:
Garage already shards data by itself if you add more nodes,
and it is indeed a viable path to increasing throughput.
moffkalast wrote 5 hours 23 min ago:
That's not something you can do reliably in software; datacenter-grade NVMe drives come with power loss protection and additional capacitors to handle that gracefully. Otherwise, if power is cut at the wrong moment, the partition may not be mountable afterwards.
If you really live somewhere with frequent outages, buy an industrial
drive that has a PLP rating. Or get a UPS, they tend to be cheaper.
crote wrote 4 hours 57 min ago:
Isn't that the entire point of write-ahead logs, journaling file
systems, and fsync in general? A roll-back or roll-forward due to a
power loss causing a partial write is completely expected, but
surely consumer SSDs wouldn't just completely ignore fsync and
blatantly lie that the data has been persisted?
As I understood it, the capacitors on datacenter-grade drives are
to give it more flexibility, as it allows the drive to issue a
successful write response for cached data: the capacitor guarantees
that even with a power loss the write will still finish, so for all
intents and purposes it has been persisted, so an fsync can return
without having to wait on the actual flash itself, which greatly
increases performance. Have I just completely misunderstood this?
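For reference, the durability contract being discussed, as a tiny Python sketch (the file name is arbitrary): data is only expected to survive a power cut once fsync has returned, which is exactly the promise a lying drive breaks:

    import os

    with open("journal.log", "ab") as f:
        f.write(b"record-42\n")
        f.flush()              # move Python's buffer into the OS page cache
        os.fsync(f.fileno())   # ask the OS (and the drive) to make it durable;
                               # a drive that acks this without persisting can
                               # lose the record on power loss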
unsnap_biceps wrote 4 hours 17 min ago:
You actually don't need capacitors for rotating media: Western Digital has a feature called "ArmorCache" that uses the rotational energy in the platters to power the drive long enough to sync the volatile cache to non-volatile storage.
HTML [1]: https://documents.westerndigital.com/content/dam/doc-lib...
patmorgan23 wrote 1 hour 47 min ago:
Good I love engineers
toomuchtodo wrote 4 hours 8 min ago:
Very cool, like the ram air turbine that deploys on aircraft in
the event of a power loss.
Nextgrid wrote 4 hours 43 min ago:
> ignore fsync and blatantly lie that the data has been persisted
Unfortunately they do:
HTML [1]: https://news.ycombinator.com/item?id=38371307
btown wrote 4 hours 33 min ago:
If the drives continue to have power, but the OS has crashed,
will the drives persist the data once a certain amount of time
has passed? Are datacenters set up to take advantage of this?
unsnap_biceps wrote 4 hours 20 min ago:
Yes, the drives are unaware of the OS state.
Nextgrid wrote 4 hours 20 min ago:
> will the drives persist the data once a certain amount of
time has passed
Yes, otherwise those drives wouldn't work at all and would
have a 100% warranty return rate. The reason they get away
with it is that the misbehavior is only a problem in a
specific edge-case (forgetting data written shortly before a
power loss).
igor47 wrote 5 hours 29 min ago:
I've been using minio for local dev but that version is unmaintained
now. However, I was put off by the minimum requirements for garage
listed on the page -- does it really need a gig of RAM?
dsvf wrote 2 hours 24 min ago:
I always understood this requirement as "garage will run fine on
hardware with 1GB RAM total" - meaning the 1GB includes the RAM
used by the OS and other processes. I think that most current consumer hardware that is a potential Garage host, even on the low end, has at least 1GB of total RAM.
lxpz wrote 3 hours 16 min ago:
It does not, at least not for a small local dev server. I believe
RAM usage should be around 50-100MB, increasing if you have many
requests with large objects.
archon810 wrote 5 hours 27 min ago:
The current latest Minio release that is working for us for local
development is now almost a year old and soon enough we will have
to upgrade. Curious what others have replaced it with that is as
easy to set up and has a management UI.
mbreese wrote 2 hours 55 min ago:
I think that's part of the pitch here... swapping out Minio for
Garage. Both scale a lot more than for just local development,
but local dev certainly seems like a good use-case here.
Powdering7082 wrote 6 hours 28 min ago:
No erasure coding seems like a pretty big loss in terms of how many resources you need to get good resiliency & efficiency.
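Rough numbers behind that trade-off, as a sketch (the 8+4 Reed-Solomon layout is an arbitrary example, not something Garage offers):

    # Raw disk consumed per usable byte. 3-way replication survives 2 lost copies;
    # an 8+4 erasure code survives 4 lost shards, at much lower overhead.
    replication_overhead = 3 / 1                 # 3 full copies -> 3.0x
    data_shards, parity_shards = 8, 4            # example erasure-coded layout
    erasure_overhead = (data_shards + parity_shards) / data_shards   # -> 1.5x
    print(f"3-way replication: {replication_overhead:.1f}x, RS 8+4: {erasure_overhead:.1f}x")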
munro wrote 4 hours 32 min ago:
I was looking at using this on an LTO tape library. It seems the only resiliency is through replication, and this was my main concern with this project: what happens when HW goes bad?
lxpz wrote 3 hours 7 min ago:
If you have replication, you can lose one of the replicas; that's the point. This is what Garage was designed for, and it works. Erasure coding is another debate: for now we have chosen not to implement it, but I would personally be open to having it supported by Garage if someone codes it up.
hathawsh wrote 1 hour 59 min ago:
Erasure coding is an interesting topic for me. I've run some
calculations on the theoretical longevity of digital storage. If
you assume that today's technology is close to what we'll be
using for a long time, then cross-device erasure coding wins,
statistically. However, if you factor in the current exponential
rate of technological development, simply making lots of copies
and hoping for price reductions over the next few years turns out
to be a winning strategy, as long as you don't have vendor
lock-in. In other words, I think you're making great choices.
ai-christianson wrote 6 hours 42 min ago:
I love Garage. I think it has applications beyond the standard self-hosted S3 alternative.
It's a really cool system for hyperconverged architectures where storage requests can pull data from the local machine and only hit the
network when needed.
SomaticPirate wrote 6 hours 42 min ago:
Seeing a ton of adoption of this after the Minio debacle [1] was
useful.
RustFS also looks interesting but for entirely non-technical reasons we
had to exclude it.
Anyone have any advice for swapping this in for Minio?
HTML [1]: https://www.repoflow.io/blog/benchmarking-self-hosted-s3-compa...
klooney wrote 4 hours 6 min ago:
Seaweed looks good in those benchmarks; I haven't heard much about it for a while.
scottydelta wrote 4 hours 49 min ago:
From what I have seen in the previous discussions here (since and before the Minio debacle) and at work, Garage is a solid replacement.
Implicated wrote 6 hours 26 min ago:
> but for entirely non-technical reasons we had to exclude it
Able/willing to expand on this at all? Just curious.
NitpickLawyer wrote 6 hours 5 min ago:
Not the same person you asked, but my guess would be that it is
seen as a Chinese product.
dewey wrote 5 hours 40 min ago:
What is this based on? Honest question, as from the landing page I don't get that impression. Are many committers China-based?
NitpickLawyer wrote 5 hours 33 min ago:
[1] > Beijing Address: Area C, North Territory, Zhongguancun
Dongsheng Science Park, No. 66 Xixiaokou Road, Haidian
District, Beijing
> Beijing ICP Registration No. 2024061305-1
HTML [1]: https://rustfs.com.cn/
dewey wrote 5 hours 30 min ago:
Oh, I misread the initial comment and thought they had to
exclude Garage. Thanks!
lima wrote 5 hours 43 min ago:
RustFS appears to be very early-stage with no real distributed
systems architecture: [1] I'm not sure if it even has any sort of
cluster consensus algorithm? I can't imagine it not eating
committed writes in a multi-node deployment.
Garage and Ceph (well, radosgw) are the only open-source S3-compatible object stores which have undergone serious
durability/correctness testing. Anything else will most likely
eat your data.
HTML [1]: https://github.com/rustfs/rustfs/pull/884
dpedu wrote 6 hours 27 min ago:
I have not tried either myself, but I wanted to mention that Versity
S3 Gateway looks good too. [1] I am also curious how Ceph S3 gateway
compares to all of these.
HTML [1]: https://github.com/versity/versitygw
skrtskrt wrote 2 hours 34 min ago:
When I was there, DigitalOcean was writing a complete replacement
for the Ceph S3 gateway because its performance under high
concurrency was awful.
They just completely swapped out the whole service from the stack and wrote a new one in Go because of how much better the concurrency management was, and because Ceph's C++ team and codebase were too resistant to change.
zipzad wrote 3 hours 40 min ago:
I'd be curious to know how versitygw compares to rclone serve S3.