codevoid.de/1/hn/comments_46251460.gph

  URI:

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   Lite^3, a JSON-compatible zero-copy serialization format
       
       
        IshKebab wrote 4 hours 21 min ago:
        I'm suspicious of their FlatBuffers performance comparison.
       
        yIt9R8 wrote 11 hours 1 min ago:
        The benchmarks are flawed, verification is not generally used after
        serialization with flatbuffers. Deserialization with flatbuffers is a
        simple reinterpret_cast so it makes no sense for it to be 41.69ms.
        
        It's just dishonest.
       
        mhalle wrote 11 hours 13 min ago:
        It would be interesting to use lite3 for blob storage in or with
        sqlite.
       
          weitendorf wrote 9 hours 46 min ago:
          That's kind of similar to my project collector: [1] It's
          protobuf/grpc based but uses json for serialization to make use of
          sqlite's json filtering functionality. However, it cannot be said to
          be zero-copy. It serializes binary protos into json and stores the
          binary protos directly for fast access, which allows you to skip
          deserialization when pulling out query results
          
  HTML    [1]: https://github.com/accretional/collector
       
        eliasdejong wrote 13 hours 47 min ago:
        Author here,
        
        First of all, hello Hacker News :)
        
        Many of the comments seem to address the design of key hashing. The
        reason for using hashed keys inside B-tree nodes instead of the string
        keys directly is threefold:
        
        1) The implementation is simplified.
        
        2) When performing a lookup, it is faster to compare fixed-sized
        elements than it is to do variable length string comparison.
        
        3) The key length is unlimited.
        
        I should say the documentation page is out of date regarding hash
        collisions. The format now supports probing thanks to a PR merged
        yesterday. So inserting colliding keys will actually work.
        
        It is true that databases and other formats do store string keys
        directly in the nodes. However as a memory format, runtime performance
        is very important. There is no disk or IO latency to 'hide behind'.
        
        Right now the hash function used is DJB2. It has the interesting
        property of somewhat preserving the lexicographical ordering of the key
        names. So hashes for keys like "item_0001", "item_0002" and "item_0003"
        are actually more likely to also be placed sequentially inside the
        B-tree nodes. This can be useful when doing a sequential scan on the
        semantic key names, otherwise you are doing a lot more random access.
        Also DJB2 is so simple that it can be calculated entirely by the C
        preprocessor at compile time, so you are not actually paying the
        runtime cost of hashing.
        
        We will be doing a lot more testing before DJB2 is finalized in the
        spec, but might later end up with a 'better' hash function such as
        XXH32.
        
        Finally, TRON/LiteÂ³ compared to other binary JSON formats (BSON,
        MsgPack, CBOR, Amazon Ion) is different in that:
        
        1) none of the formats mentioned provide direct zero-copy indexed
        access to the data
        
        2) none of the formats mentioned allow for partial mutation of the data
        without rewriting most of the document
        
        This last point 2) is especially significant. For example, JSONB in
        Postgres is immutable. When replacing or inserting one specific value
        inside an object or array, with JSONB you will rewrite the entire
        document as a result of this, even if it is several megabytes large. If
        you are performing frequent updates inside JSONB documents, this will
        cause severe write amplification. This is the case for all current
        Postgres versions.
        
        TRON/LiteÂ³ is designed to blur the line between memory and
        serialization format.
       
          p0w3n3d wrote 51 min ago:
          That's really impressive. As you wrote it in C it gets automatically
          compilable to webasm and usable in js. I wonder how Java would behave
          here... As JNI is not the fastest (used to be not the fastest?)
       
          andreyvit wrote 10 hours 43 min ago:
          Hey, I'm sorry, but your Postgres example is completely wrong:
          because of MVCC, a new version of the data will be stored on every
          update regardless of the choice of data representation, making the
          in-place mutability much less of an advantage. (It might be faster
          than a pair of a compact immutable format + mutable patch layer on
          top, or it might be slower; the answer ain't immediately obvious to
          me!)
          
          What you should be imagining instead is a document database entirely
          built around LiteÂ³-encoded documents, using something like rollback
          journals instead of MVCC.
          
          We're doing something similar in my company, storing
          zero-serialization immutable [1] docs in a key-value store (which are
          read via mmap with zero copying disk-to-usage) and using a mutable
          [2] overlay patch format for updates. In our analytics use cases,
          compact storage is very important, in-place mutability is irrelevant
          (again because of Copy-on-Write at the key-value store level), and
          the key advantage is zero serialization overhead.
          
          What I'm saying is that LiteÂ³ is a very timely and forward-looking
          format, but the merging of immutable and mutable formats into one
          carries tradeoffs that you probably want to discuss, and the
          discussion into the appropriate use cases is very much worth having.
          [1]
          
  HTML    [1]: https://github.com/andreyvit/edb/blob/main/kvo/immutable.go
  HTML    [2]: https://github.com/andreyvit/edb/blob/main/kvo/mutable.go
       
            eliasdejong wrote 9 hours 47 min ago:
            Hi, you are right in calling out the Postgres example in the
            context of DBs/MVCC. The purpose of JSONB is to be an indexable
            representation of JSON inside a Postgres database. It is not trying
            to be a standalone format for external interchange and therefore it
            is fulfilling very different requirements.
            
            A serialization format does not care about versioning or rollbacks.
            It is simply trying to organize data such that it can be sent over
            a network. If updates can be made in-place without requiring
            re-serialization, then that is always a benefit.
            
            Write amplification is still a fact however that I think deserves
            to be mentioned. To tackle this problem in the context of DBs/MVCC,
            you would have to use techniques other than in-place mutation like
            you mention: overlay/COW. Basically, LMDB-style.
            
            And yes I think databases is where this technology will eventually
            have the greatest potential, so that is where I am also looking.
       
        Jean-Papoulos wrote 14 hours 32 min ago:
        This is nice, but please don't clickbait headlines with straight-up
        lies. This is not JSON-compatible.
       
          koolala wrote 14 hours 0 min ago:
          Yeah JSON compatable is very different from convertable.
       
        bawolff wrote 15 hours 15 min ago:
        This is cool, but the headline makes it sound like the wire format is
        json compatible which is not the case. I'm not really sure why there is
        a focus on json here at all - its the least interesting part of this
        and the same could be said for almost every serialization format.
       
        tarasglek wrote 15 hours 41 min ago:
        hash collision limitation for keys is the most questionable part of
        design. Usually thats handled by forcing key lookup to verify that what
        you looked up matches what you tried to lookup. 
        Resolving this perf hit is probably doable by having an extra table of
        conflicting hashes
       
          eliasdejong wrote 13 hours 26 min ago:
          (author here)
          
          The documentation page is out of date, the format now resolves
          collisions through quadratic probing.
       
        lsb wrote 17 hours 5 min ago:
        This is super interesting!
        
        Apache Arrow is trying to do something similar, using Flatbuffer to
        serialize with zero-copy and zero-parse semantics, and an index
        structure built on top of that.
        
        Would love to see comparisons with Arrow
       
        koolala wrote 17 hours 13 min ago:
        GLTF is like this too (or PLY)? The main difference is the format of
        their headers? Just by reading the header you can parse the binary
        data. I'm surprised BSON and any of the other binary JSON formats they
        list don't support reading the memory layout in a header.
       
        rixed wrote 17 hours 38 min ago:
        So it's not really a serialization format, it's a compact, modifiable
        untyped tree, that one can therefore send to another machine with the
        same architecture. Or deserialise into native language specific data
        structures.
        
        Don't get me wrong, I find this type of data structures interesting and
        useful, but it's misleading to call it "serialization", unless my
        understanding is wrong.
       
          jesse__ wrote 2 hours 34 min ago:
          What is a serialization format, if not a data encoding "that one can
          therefore send to another machine"  ..    "Or deserialise into native
          language specific data structurs"  ..?
          
          I'm very confused by your comment.
       
          bawolff wrote 15 hours 10 min ago:
          I'm not sure what the distinction you are trying to make here is?
          
          How does machine architecture play into it? It sounds like int sizes
          are the same regardless of word sizes of the machine, the choices
          made just happen to have high performance for common machine
          architectures. Or is it about endianess? Do big endian machines even
          exist anymore?
       
            rixed wrote 59 min ago:
            Yes, integer sizes, float sizes, endianess, alignment
            requirement...
       
          koolala wrote 16 hours 39 min ago:
          You have to encode the type of all the binary data. Does that make it
          serialization?
       
        al2o3cr wrote 6 days ago:
        The docs mention that space for overwritten variable-sized values in
        the buffer is not reclaimed:
        
            The overridden space is never recovered, causing buffer size
            to grow indefinitely.
        
        Is the garbage at least zeroed? Otherwise seems like it could "leak"
        overwritten values when sending whole buffers via memcpy
       
          mjd wrote 17 hours 46 min ago:
          âBy default, deleted values are overwritten with NULL bytes (0x00).
          This is a safety feature since not doing so would leave 'deleted'
          entries intact inside the datastructure until they are overwritten by
          other values. If the user wishes to maximize performance at the cost
          of leaking deleted data, LITE3_ZERO_MEM_DELETED should be
          disabled.â
       
        cryptonector wrote 6 days ago:
        Lite^3 is a clever encoding for JSON data that is indexed as-encoded
        and is mutable in place.
        
        Perhaps I should have posted this URI instead: [1] Lite^3 deserves to
        be noticed by HN.  u/eliasdejong (the author) posted it 23 days ago but
        it didn't get very far.  I'm hoping this time it gets noticed.
        
  HTML  [1]: https://lite3.io/design_and_limitations.html
       
          dang wrote 2 hours 49 min ago:
          I've added that second link to the toptext.
          
          I'm sorry we missed that Show HN ( [1] )! It belonged in the SCP (
          [2] ).
          
  HTML    [1]: https://news.ycombinator.com/item?id=45992832
  HTML    [2]: https://news.ycombinator.com/item?id=26998308
       
          Someone wrote 16 hours 13 min ago:
          FTA#1: âHashmaps do not (efficiently) support range queries. Since
          the keys are stored in pseudorandom orderâ
          
          FTA#2: âObject keys (think JSON) are hashed to a 4-byte digest and
          stored inside B-tree nodesâ
          
          It still will likely be faster because of better cache locality, but
          doesnât that means this also does not (efficiently) support range
          queries?
          
          That page also says
          
          âtree traversal inside the critical path can be satisfied entirely
          using fixed 4-byte word comparisons, never actually requiring string
          comparisons except for detection of hash collisions. This design
          choice alone contributes to much of the runtime performance of
          LiteÂ³.â
          
          How can that be true, given that this beats libraries that use hash
          maps, that also rarely require string comparisons, by a large margin?
          
          Finally, [1] says:
          
          âInserting a colliding key will not corrupt your data or have side
          effects. It will simply fail to insert.â
          
          I also notice this uses the DJB2 hash function, which has hash
          collisions between short strings ( [2] ), and those are more likely
          to be present in json documents. You get about 8 + 3 Ã 5 = 23 bits
          of hash for four-character strings, for example, increasing the risk
          of collisions to, ballpark, about one in three thousand.
          
          => I think that needs fixing before this can be widely used.
          
  HTML    [1]: https://lite3.io/design_and_limitations.html#autotoc_md37
  HTML    [2]: http://dmytry.blogspot.com/2009/11/horrible-hashes.html
       
            nneonneo wrote 15 hours 31 min ago:
            Looking at the actual code ( [1] ), it seems like it performs up to
            128 probes to find a target before failing, rather than bailing
            immediately if a collision is detected. It seems like maybe the
            documentation needs to be updated?
            
            It's a bit unfortunate that the wire format is tied to a specific
            hash function. It also means that the spec will ossify around a
            specific hash function, which may not end up being the optimal
            choice. Neither JSON nor Protobuf have this limitation. One way
            around this would be to ditch the hashing and use the keys for the
            b-tree directly. It might be worth benchmarking - I don't think
            it's necessarily any slower, and an inline cache of key prefixes
            (basically a cheapo hash using the first N chars) should help
            preserve performance for common cases.
            
  HTML      [1]: https://github.com/fastserial/lite3/blob/main/src/lite3.c#...
       
              Someone wrote 13 hours 52 min ago:
              > It seems like maybe the documentation needs to be updated
              
              Looks like it, yes:
              
                /**
                  Enable hash probing to tolerate 32-bit hash collisions.
              
                  Hash probing configuration (quadratic open addressing for
              32-bit hashes:
                  h_i = h_0 + i^2)
              
                  Limit attempts with `LITE3_HASH_PROBE_MAX` (defaults to 128).
                  Probing cannot be disabled.
                */
                #ifndef LITE3_HASH_PROBE_MAX
                  #define LITE3_HASH_PROBE_MAX 128U
                #endif
              
                #if LITE3_HASH_PROBE_MAX < 2
                  #error "LITE3_HASH_PROBE_MAX must be >= 2"
                #endif
              
              > It also means that the spec will ossify around a specific hash
              function
              
              It is a bit ugly, and will break backwards compatibility, but
              supporting a second hash function isnât too hard.
              
              You can, on load, hash a few keys, compare them to the hashes,
              and, from that, if the input has many keys with high probability,
              infer which hash function was used.
              
              There also might be spare bit somewhere to indicate âuse the
              alternative hash functionâ.
              
              Reading the code (nice-looking, BTW, for C code, but since it is
              C code, also full of warnings that other languages can protect
              you from) I spotted this ( [1] ):
              
              > The JSON standard requires that the root-level type always be
              an âobject'
              > or 'array'. This also applies to LiteÂ³.
              
              I donât think that is true, and [2] agrees with that. Single
              values (numbers, strings, booleans, null) also are valid json.
              
  HTML        [1]: https://github.com/fastserial/lite3/blob/acbb97984eca118...
  HTML        [2]: https://www.json.org/json-en.html
       
          eric-p7 wrote 18 hours 25 min ago:
          This needs more attention than it's getting.  Perhaps if you made
          some changes to the landing pages could help?
          
          "outperforms the fastest JSON libraries (that make use of SIMD) by up
          to 120x depending on the benchmark. It also outperforms schema-only
          formats, such as Google Flatbuffers (242x). LiteÂ³ is possibly the
          fastest schemaless data format in the world."
          
          ^ This should be a bar graph at the top of the page that shows both
          serializing sizes and speeds.
          
          It would also be nice to see a json representation on the left and a
          color coded string of bytes on the right that shows how the data is
          packed.
          
          Then the explanation follows.
       
            sirfz wrote 14 hours 41 min ago:
            As already mentioned in other comments, it doesn't really make
            sense to compare to json parsers since lite3 parses, well, lite3
            and not json. It serves a different use case and I think focusing
            on performance vs json (especially json parsers) is not the best
            thing about this project
       
       
   DIR <- back to front page