codevoid.de/1/hn/comments_45694122.gph

  URI:

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   A sharded DuckDB on 63 nodes runs 1T row aggregation challenge in 5 sec
       
       
        peter_d_sherman wrote 13 hours 10 min ago:
        >"The GizmoEdge Server receives a SQL query from the client, parses it,
        and generates two statements:
        
        o A worker SQL to execute on each distributed node
        
        o A combinatorial SQL to run server-side for final aggregation"
        
        A specific instance of MapReduce (using SQL!):
        
  HTML  [1]: https://en.wikipedia.org/wiki/MapReduce
       
        fifilura wrote 20 hours 26 min ago:
        Isn't Trino built for exactly this, without the quirky workarounds?
       
        up2isomorphism wrote 20 hours 36 min ago:
        Sensational title, a reflection of âattention is all you needâ.(pun
        intended)
       
        hbarka wrote 22 hours 22 min ago:
        SELECT COUNT(DISTINCT) has entered the challenge.
       
          philbe77 wrote 21 hours 41 min ago:
          good point :) - we can re-aggregate HyperLogLog (HLL) sketches to get
          a pretty accurate NDV (Count Distinct) - see Query.farm's
          DataSketches DuckDB extension here: [1] We also have Bitmap
          aggregation capabilities for exact count distinct - something I
          worked with Oracle, Snowflake, Databricks, and DuckDB labs on
          implementing.  It isn't as fast as HLL - but it is 100% accurate...
          
  HTML    [1]: https://github.com/Query-farm/datasketches
       
            fifilura wrote 19 hours 30 min ago:
            I remember BigQuery had Distinct with HLL accuracy 10 years ago but
            rather quickly replaced it with actual accuracy.
            
            How would you compare this solution to BigQuery?
       
        fHr wrote 22 hours 25 min ago:
        Better and worth more then all the quantum bs I have to listen to.
       
        kwillets wrote 1 day ago:
        This is fun, but I'm confused by the architecture. Duckdb is based on
        one-off queries that can scale momentarily and then disappear, but this
        seems to run on k8s and maintain a persistent distributed worker pool.
        
        This pool lacks many of the features of a distributed cluster such as
        recovery, quorum, and storage state management, and queries run through
        a single server. What happens when a node goes down? Does it give up,
        replan, or just hang? How does it divide up resources between multiple
        requests? Can it distribute joins and other intermediate operators?
        
        I have a soft spot in my heart for duckdb, but its uniqueness is in
        avoiding the large-scale clustering that other engines already do
        reasonably well.
       
        afpx wrote 1 day ago:
        Iâve never used DuckDB, but I was surprised by the 30 GiB of memory.
        Many years ago when I used to use EMR a lot, I would go for > 10 TiB of
        RAM to keep all the data in memory and only spill over to SSD on big
        joins.
       
        sdairs wrote 1 day ago:
        Pretty big caveat; 5 seconds AFTER all data has been loaded into memory
        - over 2 minutes if you also factor reading the files from S3 and
        loading memory. So to get this performance you will need to run hot:
        4000 CPUs and and 30TB of memory going 24/7.
       
          lisbbb wrote 20 hours 39 min ago:
          Wow (Owen Wilson voice).  That's still impressive that it can be
          done.  Just having 4k cpus going reliably for any period of time is
          pretty nifty.  The problem I have run into is that even big companies
          say they want this kind of compute until they get the bill for it.
       
          philbe77 wrote 21 hours 44 min ago:
          hi sdairs, we did store the data on the worker nodes for the
          challenge, but not in memory.  We wrote the data to the local NVMe
          SSD storage on the node.  Linux may cache the filesystem data, but we
          didn't load the data directly into memory.  We like to preserve the
          memory for aggregations, joins, etc. as much as possible...
          
          It is true you would need to run the instance(s) 24/7 to get the
          performance all day, the startup time over a couple minutes is not
          ideal.    We have a lot of work to do on the engine, but it has been a
          fun learning experience...
       
            jamesblonde wrote 20 hours 19 min ago:
            With 2 modern NVMe disks per host (15 GB/s) and pcie 5.0, it should
            only take 15s to read 30 TB into memory on 63 hosts.
            
            You can find those disks on Hetzner. Not AWS, though.
       
              jiggawatts wrote 15 hours 23 min ago:
              I donât understand why both Azure and AWS have local SSDs that
              are an order of magnitude slower than what I can get in a laptop.
              If Hetzner can do it, surely so can they!
              
              Not to mention that Azure now exposes local drives as raw NVMe
              devices mapped straight through to the guest with no
              virtualisation overheads.
       
                jamesblonde wrote 3 hours 50 min ago:
                It would undercut all their higher level services - like
                DynamoDB, CosmosDB, etc.
                
                Databases would suddenly go BRRR in the cloud and show up
                cloud-native (S3) based databases for the high latency services
                they are.
       
            sdairs wrote 21 hours 13 min ago:
            Thanks for clarifying; I'm not trying to take anything away from
            you, I work in the OLAP space too so it's always good to see people
            pushing it forwards. It would be interesting to see a comparison of
            totally cold Vs hot caches.
            
            Are you looking at distributed queries directly over S3? We did
            this in ClickHouse and can do instant virtual sharding over large
            data sets S3. We call it parallel replicas
            
  HTML      [1]: https://clickhouse.com/blog/clickhouse-parallel-replicas
       
              tanelpoder wrote 18 hours 36 min ago:
              (I submitted this link). My interest in this approach in general
              is about observability infra at scale - thinking about buffering
              detailed events, metrics and thread samples at the edge and later
              only extract things of interest, after early filtering at the
              edge. Iâm a SQL & database nerd, thus this approach looks
              interesting.
       
            otterley wrote 21 hours 20 min ago:
            âLinux may cache the filesystem dataâ means thereâs a
            non-zero likelihood that the data in memory unless you dropped
            caches right before you began the benchmark. You donât have to
            explicitly load it into memory for this to be true. Whatâs more,
            unless you are in charge of how memory is used, the kernel is going
            to make its own decisions as to what to cache and what to evict,
            which can make benchmarks unreproducible.
            
            Itâs important to know what you are benchmarking before you start
            and to control for extrinsic factors as explicitly as possible.
       
          trhway wrote 22 hours 59 min ago:
          the [1] has always stipulated "Must sort to and from operating system
          files on secondary storage." and thus felt as a more reasonable
          estimate of overall system performance
          
  HTML    [1]: https://sortbenchmark.org
       
          mey wrote 23 hours 48 min ago:
          So how would that compare to DynamoDB or BigQuery?  (I have zero
          interest in paying for running that experiment).
          
          In theory a Zen 5 / Eypc Turin can have up to 4TB of ram.  So how
          would a more traditional non-clustered DB stand up?
          
          1000 k8s pods, each with 30gb of ram, there has to be a bit of
          overhead/wastage going on.
       
            mulmen wrote 23 hours 37 min ago:
            Are you asking how Dynamo compares at the storage level?  Like in
            comparison to S3?  As a key-value database it doesnât even have a
            native aggregation capability.    Itâs a very poor choose for OLAP.
            
            BigQuery is comparable to DuckDB.  Iâm curious how the various
            Redshift flavors (provisioned, serverless, spectrum) and Spark
            compare.
            
            I donât have a lot of experience with DuckDB but it seems like
            Spark is the most comparable.
       
              fifilura wrote 19 hours 8 min ago:
              BigQuery is built for the distributed case while DuckDB is single
              CPU and requires the workarounds described in the article to act
              like a distributed engine.
       
                tishj wrote 2 hours 41 min ago:
                DuckDB is not single CPU, it's single machine - big difference
       
                mulmen wrote 18 hours 23 min ago:
                Ah ok.    Maybe that does make sense as a comparison to ask if
                you need an analytics stack or can just grind through your prod
                Dynamo.
       
          lumost wrote 1 day ago:
          It does make me wonder whether all of the investment in hot-loading
          of GPU infrastructure for LLM workloads is portable to databases.
          30TB of GPU memory will be roughly 200 B200 cards or roughly 1200 per
          hour compared to the $240/hour pricing for the CPU based cluster. The
          GPU cluster would assuredly crush the CPU cluster with a suitable DB
          given it has 80x the FP32 FLOP capacity. You'd expect the in-memory
          GPU solution to be cheaper (assuming optimized software) with a 5x
          growth in GPU memory per card, or today if the workload can be
          bin-packed efficiently.
       
            eulgro wrote 22 hours 55 min ago:
            Do databases do matrix multiplication? Why would they even use
            floats?
       
              lisbbb wrote 20 hours 35 min ago:
              That's a great question.  I never worked on any cool NASA stuff
              which would involve large scale number crunching.  In the corpo
              space, that's not been my experience at all.  We were trying to
              solve big data problems of like, how to report on medical claims
              that are in flight (which are hardly ever static until much later
              after the claim is long completed and no longer interesting to
              anyone) and do it at scale of tens of thousands per hour.  It
              never went that well, tbh, because it is so hard to validate what
              a "claim" even is since it is changing in real time.  I don't
              think excess GPUs would help with that.
       
              radarsat1 wrote 21 hours 48 min ago:
              what do you think vector databases are? absolutely. i think the
              idea of a database and a "model" could start to really be merged
              this way..
       
              lumost wrote 22 hours 6 min ago:
              lot's of columns are float valued, GPU tensor cores can be
              programmed to do many operations between different float/int
              valued vectors. Strings can also be processed in this manner as
              they are simply vectors of integers. NVidia publishes official
              TPC benchmarks for each GPU release.
              
              The idea of a GPU database has been reasonably well explored,
              they are extremely fast - but have been cost ineffective due to
              GPU costs. When the dataset is larger than GPU memory, you also
              incur slowdowns due to cycling between CPU and GPU memory.
       
          CaptainOfCoit wrote 1 day ago:
          Yeah, pretty misleading it feels like.
          
          For background, here is the initial ideation of the "One Trillion Row
          Challenge" challenge this submission originally aimed to participate
          in:
          
  HTML    [1]: https://docs.coiled.io/blog/1trc.html
       
        vysakh0 wrote 1 day ago:
        Duckdb is an excellent OLAP db, I have had customers who had s3 data
        lake of parquet and use databricks or other expensive tool, when they
        could easily use duckdb.. Given we have cursor/claude code, it is not
        that hard for lot of use cases, I think the lack of documentation on
        how duckdb functions -- in terms of how it loads these files etc are
        some of the reasons companies are not even trying to adopt duckdb. I
        think blogs like this is a great testament for duckdb's performance!
       
          adammarples wrote 22 hours 0 min ago:
          I have been playing today with ducklake, and I have to confess I
          don't quite get what it does that duckdb doesn't already do, if
          duckdb can just run on top of parquet files quite happily without
          this extension...
       
            RobinL wrote 20 hours 58 min ago:
            It's main purpose is to solve the problem of upserts to a data
            lake, because upsert operations to file based data storage are a
            real pain.
       
          lopatin wrote 23 hours 51 min ago:
          DuckDB is great but itâs barely OLAP right? A key part of OLAP is
          âonlineâ. Since the writer process blocks any other processes
          from doing reads, calling it OLAP is a stretch I think.
       
            ansgri wrote 22 hours 42 min ago:
            Isn't the Online part here about getting results immediately after
            query, as opposed to overnight batch reports? So if you don't
            completely overwhelm DuckDB with writes, it still qualifies. The
            quality you're describing is something like "realtime analytics",
            and is a whole another category: Clickhouse doesn't qualify
            (batching updates, merging etc. â but it's clearly OLAP), Druid
            does.
       
              sdairs wrote 22 hours 10 min ago:
              ClickHouse is the market leader in real-time analytics so it's an
              interesting take that you don't think it qualifies.
       
                ansgri wrote 19 hours 17 min ago:
                For certain definition of realtime, certainly (as would any
                system with bounded ingestion latency), but itâs not
                low-latency streaming realtime. Tens of seconds or more can
                pass before new data becomes visible in queries in normal
                operation. Thereâs batching, thereâs merging, and its
                overall architecture prioritizes throughput over latency.
       
              lopatin wrote 22 hours 12 min ago:
              Huh yeah looks like I was totally wrong about what online meant.
              So yeah DuckDB is OLAP. Not that anyone was asking me in the
              first place. Carry on :)
       
          mrtimo wrote 1 day ago:
          I have experience with duckDB but not databricks... from the
          perspective of a company, is a tool like databricks more "secure"
          than duckdb? If my company adopts duckdb as a datalake, how do we
          secure it?
       
            rapatel0 wrote 23 hours 47 min ago:
            Duckdb can run as a local instance that points to parquet files in
            a n s3 bucket. So your "auth" can live on the layer that gives
            permissions to access that bucket.
       
        1a527dd5 wrote 1 day ago:
        The title buries the lede a little
        
        > Our cluster ran on Azure Standard E64pds v6 nodes, each providing 64
        vCPUs and 504 GiB of RAM.
        
        Yes, I would _expect_ when each node has that kind of power it should
        return very impressive speeds.
       
        lolive wrote 1 day ago:
        Why doesn't such large-scale test the big feature everyone needs, which
        is inner join at scale?
       
          philbe77 wrote 1 day ago:
          This is something we are trying to take a novel approach to as well. 
          We have a video demonstrating some TPC-H SF10TB queries which perform
          inner joins, etc. - with GizmoEdge as well:
          
  HTML    [1]: https://www.youtube.com/watch?v=hlSx0E2jGMU
       
            lolive wrote 1 day ago:
            Does that study go into the global vision of DuckLake ?
       
        sammy2255 wrote 1 day ago:
        How would a 63 node Clickhouse cluster compare? >:)
       
        djhworld wrote 1 day ago:
        Interesting and fun
        
        > Workers download, decompress, and materialize their shards into
        DuckDB databases built from Parquet files.
        
        I'm interested to know whether the 5s query time includes this
        materialization step of downloading the files etc, or is this result
        from workers that have been "pre-warmed". Also is the data in DuckDB in
        memory or on disk?
       
          philbe77 wrote 1 day ago:
          hi djhworld.  The 5s does not include the download/materialization
          step.  That parts takes the worker about 1 to 2 minutes for this data
          set.  I didn't know that this was going on HackerNews or would be
          this popular - I will try to get more solid stats on that part, and
          update the blog accordingly.
          
          You can have GizmoEdge reference cloud (remote) data as well, but of
          course that would be slower than what I did for the challenge here...
          
          The data is on disk - on locally mounted NVMe on each worker - in the
          form of a DuckDB database file (once the worker has converted it from
          parquet).  I originally kept the data in parquet, but the duckdb
          format was about 10 to 15% faster - and since I was trying to squeeze
          every drop of performance - I went ahead and did that...
          
          Thanks for the questions.
          
          GizmoEdge is not production yet - this was just to demonstrate the
          art of the possible.  I wanted to divide-and-conquer a huge dataset
          with a lot of power...
       
            philbe77 wrote 1 day ago:
            I've since learned (from a DuckDB blog) - that DuckDB seems to do
            better when the XFS filesytem.    I used ext4 for this, so I may be
            able to get another 10 to 15% (maybe!).
            
            DuckDB blog:
            
  HTML      [1]: https://duckdb.org/2025/10/09/benchmark-results-14-lts
       
        mosselman wrote 1 day ago:
        Are there any good instructions somewhere on how to set this up? As in
        not 63 nodes. But a distributed duckdb instance
       
          philbe77 wrote 1 day ago:
          Hi mosselman, GizmoEdge is not open-source.  DeepSeek has "smallpond"
          however, which is open-source: [1] I plan on getting GizmoEdge to
          production-grade quality eventually so folks can use it as a service
          or licensed software.  There is a lot of work to do, though :)
          
  HTML    [1]: https://github.com/deepseek-ai/smallpond
       
        ta12653421 wrote 1 day ago:
        When reading such extreme numbers, I'm always thinking what I may be
        doing wrong, when my MSSQL based CRUD application warms up its caches
        with around 600.000 rows and it takes 30 seconds to load them from DB
        into RAM on my 4x3GHz machine :-D
        
        Maybe I'm missing something fundamental here
       
          riku_iki wrote 1 day ago:
          Could you run some query like select sum(banch of columns) from
          my_table and see how long it will take?
          
          600k rows is likely less than 1GB of data, and should take about
          second to load into RAM on modern nvme ssd raids.
       
          RobinL wrote 1 day ago:
          Yes - OLAP database are built with a completely different performance
          tradeoff.  The way data is stored and the query planner are optimised
          for exactly these types of queries.  
          If you're working in an oltp system, you're not necessarily doing it
          wrong, but you may wish to consider exporting the data to use in an
          OLAP tool if you're frequently doing big queries.  And nowadays
          there's ways to 'do both ' e.g. you can run the duckdb query engine
          within a postgres instance
       
          zwnow wrote 1 day ago:
          This type of stuff is usually hyperoptimized for no reason and serves
          no real purpose, you are doing just fine
       
          rovr138 wrote 1 day ago:
          Would OLAP be better than OLTP for those queries you're doing?
       
          dgan wrote 1 day ago:
          I also had misfortune working with MSSQL is it was so so unbearably
          slow, because i couldnt upload data in bulk. I guess its forbidden
          technology
       
            Foobar8568 wrote 1 day ago:
            Or you didn't use MSSQL properly, there are at least 2 or 3 ways to
            do bulk upload on MS SQL, not sure in today era.
       
              dgan wrote 21 hours 11 min ago:
              Maybe? Don't know. I never had problemes bulk uploading into
              Postgres tho, it's right there in documentation and I don't have
              to have a weird executable on my corporately castrated laptop
       
                Foobar8568 wrote 11 hours 11 min ago:
                 [1] That's one way, another was BCP.
                
                But yeah if you are using python and loading row by row, or a
                large amount into a large table that has a clustered index,
                chances are that it'll be dead slow but that's expected.
                
  HTML          [1]: https://learn.microsoft.com/en-us/sql/t-sql/statements...
       
                  dgan wrote 8 hours 36 min ago:
                  The documentation you providee requires for the file to be
                  present on the server side, not the client side, which is
                  very dufferent from postgres.
                  
                  As for BCP executable, i couldnt find a way for it to accept
                  any type of date[time] at all
       
        NorwegianDude wrote 1 day ago:
        This is very silly. You're not doing the challenge if you do the work
        up front. The idea is that you start with a file and the goal is to get
        the result as fast as possible.
        
        How long did it take to distribute and import the data to all workers,
        what is the total time from file to result?
        
        I can do this a million times faster on one machine, it just depends on
        what work I do up front.
       
          philbe77 wrote 1 day ago:
          You should do it then, and post it here.  I did do it with one
          machine as well:
          
  HTML    [1]: https://gizmodata.com/blog/gizmosql-one-trillion-row-challen...
       
            NorwegianDude wrote 1 day ago:
            Nobody cares if I can do it a million times faster, everyone can.
            It's cheating.
            
            The whole reason you have to account for the time you spend setting
            it up is so that all work spent processing the data is timed.
            Otherwise we can just precomputed the answer and print it on
            demand, that is very fast and easy.
            
            Just getting it into memory is a large bottleneck in the actual
            challenge.
            
            If I first put it into a DB with statistics that tracks the needed
            min/max/mean then it's basically instant to retrieve, but also
            slower to set up because that work needs to be done somewhere.
            That's why the challenge is time from file to result.
       
        ferguess_k wrote 1 day ago:
        Wait until you see a 800-line Tableau query that joins TB data with TB
        data /s
       
          kwillets wrote 1 day ago:
          Don't forget the 2 hour tableau cloud runtime limit.
       
          philbe77 wrote 1 day ago:
          :D that is scary!
       
        boshomi wrote 1 day ago:
        >âIn our talk, we will describe the design rationale of the DuckLake
        format and its principles of simplicity, scalability, and speed. We
        will show the DuckDB implementation of DuckLake in action and discuss
        the implications for data architecture in general.
        
        Prof. Hannes MÃ¼hleisen, cofounder of DuckDB:
        
        [DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by
        Prof. Hannes MÃ¼hleisen]( [1] ) (53 min) 
        Talk from Systems Distributed '25:
        
  HTML  [1]: https://www.youtube.com/watch?v=YQEUkFWa69o
  HTML  [2]: https://systemsdistributed.com
       
        nodesocket wrote 1 day ago:
        > Each GizmoEdge worker pod was provisioned with 3.8 vCPUs (3800 m) and
        30 GiB RAM, allowing roughly 16 workers per nodeâmeaning the test
        required about 63 nodes in total.
        
        How was this node setup chosen? Specially 3.8 vCPU and 30 GiB RAM per?
        Why not just run 16 workers total using the entire 64 vCPU and 504 GiB
        of memory each?
       
          philbe77 wrote 1 day ago:
          Hi nodesocket - I tried to do 4 CPUs per node, but Kubernetes takes a
          small (about 200m) CPU request amount for daemon processes - so if
          you try to request 4 (4000m) CPUs x 16 - you'll spill one pod over -
          fitting only 15 per node.
          
          I was out of quota in Azure - so I had to fit in the 63 nodes... :)
       
            nodesocket wrote 1 day ago:
            But why split up a vm into so many workers instead of utilizing the
            entire vm as a dedicated single worker? Whatâs the performance
            gain and strategy?
       
              philbe77 wrote 1 day ago:
              I'm not exactly sure yet.  My goal was to not have the shards be
              too large so as to be un-manageable.  In theory - I could just
              have had 63 (or 64) huge shards - and 1 worker per K8s node, but
              I haven't tried it.
              
              There are so many variables to try - it is a little
              overwhelming...
       
                nodesocket wrote 1 day ago:
                Would be interesting to test. Iâm thinking there may not be a
                benefit to having so many workers on a vm instead of just the
                entire vm resources as a single worker. Could be wrong, but
                that would be a bit surprising.
       
        MobiusHorizons wrote 1 day ago:
        > Once trusted, each worker executes its local query through DuckDB and
        streams intermediate Arrow IPC datasets back to the server over secure
        WebSockets. The server merges and aggregates all results in parallel to
        produce the final SQL resultâoften in seconds.
        
        Can someone explain why you would use websockets in an application
        where neither end is a browser? Why not just use regular sockets and
        cut the overhead of the http layer? Is there a real benefit Iâm
        missing?
       
          rcxdude wrote 7 hours 30 min ago:
          I've done this. It's a reasonably straightforward way to multiplex
          multiple endpoints over a single TCP socket, and it also gives you a
          framing protocol. It doesn't have a particularly high overhead (past
          the initial headers and such it's just a socket with a pretty
          lightweight frame header). You can find a library in basically every
          language and/or framework and they often deal with a bunch of other
          details for you.
       
          lantastic wrote 20 hours 28 min ago:
          Others pointed plenty of arguments, but the ones I find most
          compelling (not necessarily useful in this context) are:
          
          - you can serve any number of disjoint websocket services via same
          port via HTTP routing
          - this also means you can do TLS termination in one place, so
          downstream websocket service doesn't have to deal with the
          nitty-gritty of certificates.
          
          Sure, it adds a hop compared to socket passing, and there are ways to
          get similar fanout with TCP with a custom protocol. But you need to
          add this to every stack that interacting components use, while
          websockets libraries exist for most languages that are likely to be
          used in such an endeavor.
       
          nurettin wrote 1 day ago:
          http 101 upgrade isn't much of an overhead and there are tried and
          tested websocket/ssl libraries with pretty callback interfaces versus
          your custom binary protocol. I would still choose the latter but I
          wouldn't recommend it.
       
            immibis wrote 20 hours 44 min ago:
            you can apply this reasoning to a lot of protocols, like why not
            use Nostr over websockets? I mean, I don't see any reason to do
            this with Nostr over websockets, but also, why not? it's not much
            overhead right?
       
              stevenhuang wrote 16 hours 3 min ago:
              Comparing ws to nostr shows you might not understand how ws
              actually works. You realize after connection setup it's just a
              tcp socket? It's not framed by http headers if that's what you're
              wondering. The ws frame is like 6 bytes.
       
          lucideer wrote 1 day ago:
          > overhead of the http layer
          
          Detail of this well-covered in sibling comments, but at a
          higher-level, two thoughts on this:
          
          1. I see a lot of backlash lately against everything being
          HTTP-ified, with little justification other than a presumption that
          it necessarily adds overhead. Perf-wise, HTTP has come a long way &
          modern HTTP is a very efficient protocol. I think this has cleared
          the way for it to be a foundation for many more things than in the
          past. HTTP/3 being over UDP might clear the way for more of this
          (albeit I think the overhead of TCP/IP is also often overstated - see
          e.g. MQTT).
          
          2. Overhead can be defined in two ways: perf. & maintenance
          complexity. Modern HTTP does add a bit of the latter, so in that
          context it may be a fair concern, but I think the large range of
          competing implementations probably obviates any concern here & the
          alternative usually involves doing something custom (albeit simpler),
          so you run into inconsistency, re-invented wheels & bus factor issues
          there.
       
            immibis wrote 1 day ago:
            Using stuff like HTTP signals a lack of understanding of the whole
            stack. IMO it's important for programmers to understand computers.
            You can write programs without understanding computers, but it's
            best if you go and learn about computers first. You can use
            abstractions but you should also understand the abstractions.
            
            There are two ways I've noticed to design an application.
            
            Some people grab some tools out of their toolbox that look like
            they fit - I need a client/server, I know web clients/servers, so
            I'll use a web client/server.
            
            Other people think about what the computer actually has to do and
            then write code to achieve that: Computer A has to send a block of
            data to computer B, and this has to work on Linux (which means no
            bit-banging - you can only go as low as raw sockets). This type of
            person may still take shortcuts, but it's by intention, not because
            it's the only thing they know: if HTTP is only one function call in
            Python, it makes sense to use HTTP, not because it's the only thing
            you know but because it's good enough, you know it works well
            enough for this problem, and you can change it later if it becomes
            a bottleneck.
            
            Websockets are an odd choice because they're sort of the worst of
            both worlds: they're barely more convenient as raw sockets (there's
            framing, but framing is easy), but they also add a bunch of
            performance and complexity overhead over raw sockets, and more
            things that can go wrong. So it doesn't seem to win on the
            convenience/laziness front nor the performance/security/robustness
            front. If your client had to be a web browser, or could sometimes
            be a web browser, or if you wanted to pass the connections through
            an HTTP reverse proxy, those would be good reasons to choose
            websockets, but none of them are the case here.
       
              gr4vityWall wrote 19 hours 38 min ago:
              > they're barely more convenient as raw sockets (there's framing,
              but framing is easy)
              
              I think it's significant more convenient if your stack touches
              multiple programming languages. Otherwise you'd have to implement
              framing yourself for all of them. Not hard, but I don't see the
              benefit either.
              
              > they also add a bunch of performance and complexity overhead
              over raw sockets
              
              What performance overhead is there over raw sockets once you're
              past the protocol upgrade? It seems negligible if you connection
              is even slightly long-lived.
       
              lucideer wrote 1 day ago:
              Acknowledging that a huge number of people (the vast majority)
              are going to use the only option they know rather than the best
              of a set of options they know, I still think that for a person
              who's fully versed in all available options, Websockets is a
              better option than you make out.
              
              > they're barely more convenient as raw sockets
              
              Honestly, raw sockets are pretty convenient - I'm not convinced
              Websockets are more convenient at all (assuming you already know
              both & there's no learning curves). Raw sockets might even be
              more convenient.
              
              I think it's features rather than convenience that is more likely
              to drive Websocket usage when comparing the two.
              
              > they also add a bunch of performance and complexity overhead
              over raw sockets
              
              This is the part that I was getting at in my above comment. I
              agree in theory, but I just think that the "a bunch" quantifier
              is bit of an exaggeration. They really add very very little
              performance overhead in practice: a negligible amount in most
              cases.
              
              So for a likely-negligible performance loss, & a
              likely-negligible convenience difference, you're getting a
              protocol with built-in encryption, widespread documentation &
              community support - especially important if you're writing code
              that other people will need to take over & maintain - & as you
              alluded to: extensibility (you may never need browser support or
              http proxying, but having the option is compelling when the
              trade-offs are so negligible).
       
            fweimer wrote 1 day ago:
            One reason comes to my mind: HTTP is no longer a stable protocol
            with well-understood security properties. If you deploy it today,
            people expect interoperability with clients and servers that
            implement future protocol upgrades, resulting in an ongoing
            maintenance burden that a different protocol choice would avoid.
       
              lucideer wrote 1 day ago:
              I'm absolutely not an expert of any kind on protocol details, so
              pardon my ignorance here but this surprises me: is this true?
              
              High-level spec changes have been infrequent, with long dual
              support periods, & generally seen pretty slow gradual client &
              server adoption. 1.1 was 1997 & continues to have widespread
              support today. 2 & 3 were proposed in 2015 & 2016 - almost 2
              decades later - & 2 is only really starting to see wide support
              today, with 3 still broadly unsupported.
              
              I'm likely missing a lot of nuance in between versioned releases
              though - I know e.g. 2 saw at least two major additions/updates,
              though I thought those were mostly additive security features
              rather than changes to existing protocol features.
       
                tsimionescu wrote 9 hours 57 min ago:
                I also don't understand what GP meant. Not only is  HTTP/1.1
                universally supported by every HTTP client and server today,
                HTTP/1.0 is as well, and you'll even find lots of support for
                HTPP/0.9. I have never heard of a program or security device
                that speaks HTTP/2.0 but doesn't allow HTTP/1.1.
       
          simonw wrote 1 day ago:
          If you're using sockets you still need to come up with some kind of
          protocol on top of those sockets for the data that's being
          transferred - message delimiter, a data format etc. Then you have to
          build client libraries for that protocol.
          
          WebSockets solve a bunch of those low level problems for you, in a
          well specified way with plenty of existing libraries.
       
            zerd wrote 1 day ago:
            WebSocket doesn't specify data format, it's just bytes, so they
            have to handle that themselves. It looks like they're using Arrow
            IPC.
            
            Since they're using Arrow they might look into Flight RPC [1] which
            is made for this use case.
            
  HTML      [1]: https://arrow.apache.org/docs/format/Flight.html
       
            HumblyTossed wrote 1 day ago:
            ASCII table codes 1,2,3 & 4  pretty simple to use.
       
              tracker1 wrote 23 hours 34 min ago:
              Now solve for encryption, authorization, authentication...
              
              WS(S) has in the box solutions for a lot of these... on top of
              that, application gateways, distribution, failover etc.  You get
              a lot of already solved solutions in the box, so to speak.  If
              you use raw sockets, now you have to implement all of these
              things yourself, and you aren't gaining much over just using WSS.
       
              dns_snek wrote 1 day ago:
              Sure, in principle. Someone already mentioned binary data, then
              you come up with a framing scheme and get to write protocol
              documentation, but why? What's the benefit?
       
                HumblyTossed wrote 1 day ago:
                Simplicity.
       
                  ryanjshaw wrote 1 day ago:
                  You misspelled âbugs and maintenance nightmareâ
       
              jcheng wrote 1 day ago:
              Not if you're passing binary data
       
                woodruffw wrote 1 day ago:
                Even beyond that: the ASCII delimiter control codes are
                perfectly valid UTF-8 (despite not being printable), so using
                them for in-band signaling is a recipe for pain on arbitrary
                UTF-8 data.
       
                  mananaysiempre wrote 1 day ago:
                  If you know your data is UTF-8, then bytes 0xFE and 0xFF are
                  guaranteed to be free. Strictly speaking, 0xC0, 0xC1, and
                  0xF5 through 0xFD also are, but the two top values are free
                  even if you are very lax and allow overlong encodings as well
                  as codepoints up to 2Â³Â² â 1.
       
                    woodruffw wrote 23 hours 36 min ago:
                    I think it would probably be better to invest in a proper
                    framing design than trying to poke holes in UTF-8.
                    
                    (This is true regardless of UTF-8 -- in-band encodings are
                    almost always brittle!)
       
          kevincox wrote 1 day ago:
          > the overhead of the http layer
          
          There isn't much overhead here other than connection setup. For
          HTTP/1 the connection is just "upgraded" to websockets. For HTTP/2 I
          think the HTTP layer still lives on a bit so that you can use
          connection multiplexing (which maybe be overhead if you have no use
          for it here) but that is still a very thin layer.
          
          So I think the question isn't so much HTTP overhead but WebSocket
          overhead. WebSockets add a bit of message framing and whatnot that
          may be overhead if you don't need it.
          
          In 99% of applications if you need encryption, authentication and
          message framing you would be hard-pressed to find a significantly
          more efficient option.
       
            toast0 wrote 1 day ago:
            > In 99% of applications if you need encryption, authentication and
            message framing you would be hard-pressed to find a significantly
            more efficient option.
            
            AFAIK, websockets doesn't do authentication? And the encryption it
            does is minimal, optional xor with a key disclosed in the
            handshake. It does do framing.
            
            It's not super common, but if all your messages have a 16-bit
            length, you can just use TLS framing. I would argue that TLS
            framing is ineffecient (multiple length terms), but using it by
            itself is better than adding a redundant framing layer.
            
            But IMHO, there is significant benefit from removing a layer where
            it'd unneeded.
       
              LunaSea wrote 1 day ago:
              > AFAIK, websockets doesn't do authentication?
              
              Websocket allows for custom header and query parameters which
              make it possible to run a basic authentication scheme and later
              on additional autorisation in the message themselves if really
              necessary.
              
              > And the encryption it does is minimal, optional xor with a key
              disclosed in the handshake. It does do framing.
              
              Web Secure Socket (WSS) is the TLS encrypted version of
              Websockets (WS) (similar to HTTP vs. HTTPS).
       
                fyrn_ wrote 1 day ago:
                Worth noting that wbesockets in the browser don't allow custom
                headers and custom header support is spotty accross sever
                impls.
                It's just not exposed in the javascript API. There has been an
                open chrome bug for that for like 15 years
       
                  LunaSea wrote 21 hours 31 min ago:
                  > Worth noting that wbesockets in the browser don't allow
                  custom headers
                  
                  They do during the initial handshake (protocol upgrade from
                  HTTP to WebSocket).
                  
                  Afterwards the message body can be used to send authorisation
                  data.
                  
                  Server support will depend on tech but Node.js has great
                  support.
       
                    fyrn_ wrote 15 hours 11 min ago:
                     [1] No, I don't think you get it.
                    `new Websocket()` from JS takes no arguments for headers.
                    You literally can't send headers during the handshake from
                    JS. [2] Actually will look into using the subprotocol as a
                    way to do auth, but most impls in the wild send the auth as
                    the first message.
                    
                    The fact the protocol in theory supports it doesn't really
                    matter much since no browser implements that part of the
                    spec.
                    
  HTML              [1]: https://github.com/whatwg/websockets/issues/16
  HTML              [2]: https://developer.mozilla.org/en-US/docs/Web/API/W...
       
          sureglymop wrote 1 day ago:
          Wait but websockets aren't over http right? Just the initiation and
          then there is a protocol upgrade or am I wrong? What overhead is
          there otherwise?
       
            tsimionescu wrote 1 day ago:
            You're right, WebSockets aren't over HTTP, they just use HTTP for
            the connection initiation. They do add some overhead in two places:
            one, when opening a new connection, since you go TCP -> TLS -> HTTP
            -> WebSockets -> Your protocol ; and two, they do add some per
            packet overhead, since there is a WebSocket encapsulation of your
            data - but this is much smaller than typical HTTP request/response
            overhead.
       
          philbe77 wrote 1 day ago:
          Hi MobiusHorizons, I happened to use websockets b/c it was the
          technology I was familiar with.  I will try to learn more about
          normal sockets to see if I could perhaps make them work with the app.
           Thanks for the suggestion...
       
            gopalv wrote 1 day ago:
            >  will try to learn more about normal sockets to see if I could
            perhaps make them work with the app.
            
            There's a whole skit in the vein of "What have the Romans ever done
            for us?" about ZeroMQ[1] which has probably lost to the search
            index now.
            
            As someone who has held a socket wrench before, fought tcp_cork and
            dsack, Websockets isn't a bad abstraction to be on top of,
            especially if you are intending to throw TLS in there anyway.
            
            Low level sockets is like assembly, you can use it but it is a
            whole box of complexity (you might use it completely raw sometimes
            like a tickle ack in the ctdb[2] implementation). [1] - [1] [2] -
            
  HTML      [1]: https://news.ycombinator.com/item?id=32242238
  HTML      [2]: https://linux.die.net/man/1/ctdb
       
            DanielHB wrote 1 day ago:
            if you really want maximum performance maybe consider using CoAP
            for node-communication: [1] It is UDP-based but adds handshakes and
            retransmissions. But I am guessing for your benchmark transmission
            overhead isn't a major concern.
            
            Websockets are not that bad, only the initial connection is HTTP.
            As long as you don't create a ton of connections all the time it
            shouldn't be much slower than a TCP-based socket (purely
            theoretical assumption on my part, I never tested).
            
  HTML      [1]: https://en.wikipedia.org/wiki/Constrained_Application_Prot...
       
        shinypenguin wrote 1 day ago:
        Is the dataset somewhere accessible? Does anyone know more about the
        "1T challenge", or is it just the 1B challenge moved up a notch?
        
        Would be interesting to see if it would be possible to handle such data
        on one node, since the servers they are using are quite beefy.
       
          achabotl wrote 1 day ago:
          The One Trillion Row Challenge was proposed by Coiled in 2024.
          
  HTML    [1]: https://docs.coiled.io/blog/1trc.html
       
          philbe77 wrote 1 day ago:
          Hi shinypenguin - the dataset and challenge are detailed here: [1]
          The data is in a publicly accessible bucket, but the requester is
          responsible for any egress fees...
          
  HTML    [1]: https://github.com/coiled/1trc
       
            simonw wrote 1 day ago:
            I suggest linking to that from the article, it is a useful
            clarification.
       
              philbe77 wrote 1 day ago:
              Good point - I'll update it...
       
            shinypenguin wrote 1 day ago:
            Hi, thank you for the link and quick response! :)
            
            Do you know if anyone attempted to run this on the least amount of
            hardware possible with reasonable processing times?
       
              philbe77 wrote 1 day ago:
              Yes - I also had GizmoSQL (a single-node DuckDB database engine)
              take the challenge - with very good performance (2 minutes for
              $0.10 in cloud compute cost):
              
  HTML        [1]: https://gizmodata.com/blog/gizmosql-one-trillion-row-cha...
       
        tgv wrote 1 day ago:
        Impressive, but those 63 nodes were "Azure Standard E64pds v6 nodes,
        each providing 64 vCPUs and 504 GiB of RAM." That's 4000 CPUs and 30TB
        memory.
       
          ramraj07 wrote 1 day ago:
          Sounds like the equivalent of a 4xl snowflake warehouse, which for
          such queries would take 30 seconds, with the added benefit of the
          data being cold stored in s3. Thus you only pay by the minute.
       
            philbe77 wrote 1 day ago:
            Challenge accepted - I'll try it on a 4XL Snowflake to get actual
            perf/cost
       
            hobs wrote 1 day ago:
            No, that would be equivalent to 64 4xl snowflake warehouses (though
            the rest of your point still stands).
       
              ramraj07 wrote 21 hours 6 min ago:
              Apologize for getting it wrong a few orders of magnitude, but
              thats even more ghastly if its so overpowered and yet takes this
              long.
       
              philbe77 wrote 1 day ago:
              Cost-wise, 64 4xl Snowflake clusters would cost: 64 x $384/hr -
              for a total of: $24,576/hr (I believe)
       
                __mharrison__ wrote 1 day ago:
                What was the cost of the duck implementation?
       
          RamtinJ95 wrote 1 day ago:
          At that scale it cannot be cheaper than just running the same
          workload on BigQuery or Snowflake or?
       
            philbe77 wrote 1 day ago:
            A Standard E64pds v6 costs: $3.744 / hr on demand.  At 63 nodes -
            the cost is: $235.872 / hr - still cheaper than a Snowflake 4XL
            cluster - costing: 128 credits / hr at $3/credit = $384 / hr.
       
              philbe77 wrote 1 day ago:
              If I used "spot" instances - it would have been 63 x $0.732/hr
              for a total of: $45.99 / hr.
       
              philbe77 wrote 1 day ago:
              At 5 seconds - the query technically cost: $0.3276
       
                Keyframe wrote 1 day ago:
                That's like calculating a trip cost based on gas cost without
                accounting for car rental, gas station food, and especially
                mandatory bathroom fee after said food.
       
          ralegh wrote 1 day ago:
          Just noting that 4000 vCPUs usually means 2000 cores, 4000 threads
       
            electroly wrote 1 day ago:
            It doesn't mean that here. Epdsv6 is 1 core = 1 vCPU.
       
              ralegh wrote 1 day ago:
              I stand correctedâ¦
       
        maxmcd wrote 1 day ago:
        Are there any open sourced sharded query planners like this? Something
        that can aggregate queries across many duckdb/sqlite dbs?
       
          mritchie712 wrote 1 day ago:
          DeepSeek released smallpond
          
          0 - [1] 1 - [2] (overview for data engineers, practical application)
          
  HTML    [1]: https://github.com/deepseek-ai/smallpond
  HTML    [2]: https://www.definite.app/blog/smallpond
       
          hobofan wrote 1 day ago:
          Not directly DuckDB (though I think it might be able to be connected
          to that), but I think Apache Datafusion Ballista[0] would be a
          typical modern open source benchmark here.
          
          [0]:
          
  HTML    [1]: https://datafusion.apache.org/ballista/contributors-guide/ar...
       
       
   DIR <- back to front page