codevoid.de/1/hn/comments_45580795.gph

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
  HTML   Kaitai Struct: declarative binary format parsing language
       
       
        casey2 wrote 13 hours 23 min ago:
         [1] highly recommended if you like functional languages
        
  HTML  [1]: https://www.erlang.org/doc/system/bit_syntax.html
       
        metaPushkin wrote 15 hours 0 min ago:
        Enjoyable tool. When I developed my text RPG game, I prepared a Kaitai
        specification for the save file data format so that it would be easy to
        create third-party software for viewing and modifying it =)
       
        depierre wrote 17 hours 34 min ago:
        One of my personal favorites. I've used it for parsing SAP's RPC
        network protocol, reverse-engineering Garmin apps [0], and more
        recently in a CTF challenge that involved an unknown file format, among
        others. It's surprisingly quick to pick up once you get the hang of the
        syntax.
        
        The serialization branch for Python [1] (I haven't tried the Java one)
        has generally done the job for me, though I've had to patch a few edge
        cases.
        
        One feature I've often wished for is access to physical offsets within
        the file being parsed (e.g. being able to tell that a field foo that
        you just parsed starts at offset 0x100 from the beginning of the file).
        As far as I know, you only get relative offsets to the parent
        structure.
        
  HTML  [0]: https://github.com/anvilsecure/garmin-ciq-app-research/blob/ma...
  HTML  [1]: https://doc.kaitai.io/serialization.html
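        
        A minimal sketch of the absolute-offset idea described above. This is
        plain Python over io.BytesIO, not Kaitai's generated code or API; the
        helper parse_with_offsets and the field layout are hypothetical. The
        point is only that recording the stream position before each read
        yields offsets from the start of the file rather than from the parent.

```python
import io
import struct

def parse_with_offsets(stream, fields):
    """Parse (name, struct_format) pairs, recording absolute start offsets."""
    result = {}
    for name, fmt in fields:
        start = stream.tell()              # absolute offset from start of stream
        size = struct.calcsize(fmt)
        raw = stream.read(size)
        if len(raw) != size:
            raise EOFError(f"truncated field {name!r} at offset {start:#x}")
        (value,) = struct.unpack(fmt, raw)
        result[name] = {"value": value, "offset": start, "size": size}
    return result

# Hypothetical layout: 0x100 bytes of padding, then little-endian u4 "foo", u2 "bar".
data = bytes(0x100) + struct.pack("<IH", 0xDEADBEEF, 7)
stream = io.BytesIO(data)
stream.seek(0x100)
parsed = parse_with_offsets(stream, [("foo", "<I"), ("bar", "<H")])
```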
       
        Rucadi wrote 17 hours 52 min ago:
        The most success I've had so far on a project that involved binary
        data parsing was with Deku in Rust. I would give this a try if I
        have the opportunity.
       
        Locutus_ wrote 18 hours 1 min ago:
        How is the write support nowadays? Is it production quality now?
        
        I used Kaitai in an IoT project for building data ingress parsers and it
        was great. But not having write support was a bummer.
       
        somethingsome wrote 18 hours 21 min ago:
        I didn't check exactly what Kaitai does, but MPEG uses a custom SDL for
        its binary syntax: [1] Just sharing, in case someone is interested :)
        
  HTML  [1]: https://mpeggroup.github.io/mpeg-sdl-editor/
       
        kodachi wrote 22 hours 0 min ago:
        The recent release of 0.11 marks the inclusion of the long-awaited
        serialization feature. Python and Java only for now.
        I've been using it for a while for Python and although it has some
        rough edges, it works pretty well and I'm super excited for the
        project.
       
        pabs3 wrote 22 hours 10 min ago:
        Kaitai is one of many different tools that do this; there is a list of
        them here: [1] Personally I like GNU Poke.
        
  HTML  [1]: https://github.com/dloss/binary-parsing
       
          dhx wrote 21 hours 18 min ago:
          Is anyone aware of a project that provides simplified declaration of
          constraint checking?
          
          For example:
          
            structures:
              struct a { strz b, strz c, int d, str[d] e }
            
            constraints:
              len(b) > len(c)
              d > 6
              d <= 10
              e ~ /^ABC\d+/
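          
          A minimal sketch of what such a declaration could desugar to, written
          in plain Python rather than any existing tool. Assumptions that are
          not in the comment: int is a 4-byte little-endian signed integer,
          strings are ASCII, and the names parse_a, CONSTRAINTS and violations
          are hypothetical.

```python
import re
import struct

def read_strz(buf, pos):
    """Read a NUL-terminated ASCII string; return (text, next position)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("ascii"), end + 1

def parse_a(buf):
    """Parse struct a { strz b, strz c, int d, str[d] e }."""
    b, pos = read_strz(buf, 0)
    c, pos = read_strz(buf, pos)
    (d,) = struct.unpack_from("<i", buf, pos)   # assumption: 4-byte little-endian int
    pos += 4
    e = buf[pos:pos + d].decode("ascii")
    if len(e) != d:
        raise ValueError("buffer too short for field e")
    return {"b": b, "c": c, "d": d, "e": e}

# Constraints kept as data: (description, predicate) pairs.
CONSTRAINTS = [
    ("len(b) > len(c)", lambda r: len(r["b"]) > len(r["c"])),
    ("d > 6",           lambda r: r["d"] > 6),
    ("d <= 10",         lambda r: r["d"] <= 10),
    (r"e ~ /^ABC\d+/",  lambda r: re.match(r"ABC\d+", r["e"]) is not None),
]

def violations(record):
    """Return the descriptions of every constraint the record fails."""
    return [desc for desc, check in CONSTRAINTS if not check(record)]

good = parse_a(b"hello\x00hi\x00" + struct.pack("<i", 7) + b"ABC1234")
bad = parse_a(b"hi\x00hello\x00" + struct.pack("<i", 3) + b"XYZ")
```

          Keeping the constraints as a list separate from the parser means a
          failing record can report every violated rule at once instead of
          stopping at the first.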
       
        whitten wrote 22 hours 34 min ago:
        To quote from the page:
            id: flags
            type: u1
        
        This seems to say flags is a sort of unsigned integer.
        
        Is there a way to break the flags into big-endian bits where the first
        two bits are either 01 or 10 but not 00 or 11, with 01 meaning DATA and
        10 meaning POINTER, with the next five bits as a counter of segments and
        the next bit is 1 if the default is BLACK and 1 if the default is WHITE
        ?
       
          CGamesPlay wrote 21 hours 34 min ago:
          Appears so:
          
  HTML    [1]: https://doc.kaitai.io/user_guide.html#_bit_sized_integers
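          
          For comparison, a minimal hand-written sketch of the same split in
          plain Python (not Kaitai output). It assumes the layout from the
          question read most-significant bit first, with 01 as DATA and 10 as
          POINTER. The mapping 0 = BLACK, 1 = WHITE is an assumption, since the
          question does not pin it down; decode_flags is a hypothetical name.

```python
KINDS = {0b01: "DATA", 0b10: "POINTER"}

def decode_flags(byte):
    """Split one byte into (kind, segment_count, default_colour), MSB first."""
    kind_bits = (byte >> 6) & 0b11        # top two bits
    segments = (byte >> 1) & 0b11111      # next five bits
    colour_bit = byte & 0b1               # lowest bit
    if kind_bits not in KINDS:
        raise ValueError(f"invalid kind bits {kind_bits:02b}")
    # Assumption: 0 means BLACK and 1 means WHITE; the question leaves this open.
    colour = "WHITE" if colour_bit else "BLACK"
    return KINDS[kind_bits], segments, colour
```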
       
        carom wrote 23 hours 0 min ago:
        My dream for a parsing library / language is that it would be able to
        read, manipulate, and then re-serialize the data. I'm sure there are a
        ton of edge cases there, but the round trip would be so useful for
        fuzzing and program analysis.
       
          jaxefayo wrote 20 hours 3 min ago:
          From what I've read, kaitai does that now. For the longest time it
          could only parse, but I believe now it can generate/serialize.
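          
          A minimal sketch of the parse, modify, re-serialize round trip being
          discussed, using only Python's struct module rather than Kaitai's
          serialization support. The header layout (magic, version, payload
          length) and the Record class are hypothetical.

```python
import struct
from dataclasses import dataclass

HEADER = struct.Struct("<4sHI")   # magic, version, payload length (hypothetical layout)

@dataclass
class Record:
    magic: bytes
    version: int
    payload: bytes

    @classmethod
    def parse(cls, data):
        magic, version, length = HEADER.unpack_from(data, 0)
        payload = data[HEADER.size:HEADER.size + length]
        if len(payload) != length:
            raise ValueError("truncated payload")
        return cls(magic, version, payload)

    def serialize(self):
        # The length field is derived from the payload, so edits stay consistent.
        return HEADER.pack(self.magic, self.version, len(self.payload)) + self.payload

original = HEADER.pack(b"SAVE", 1, 3) + b"abc"
record = Record.parse(original)
assert record.serialize() == original      # unmodified round trip is byte-identical
record.payload += b"def"                   # mutate, e.g. for fuzzing
mutated = record.serialize()
```

          Deriving the length on write is one of the edge cases mentioned
          above: a fuzzer may instead want to write a deliberately wrong
          length, which this sketch does not allow.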
       
        lzcdhr wrote 23 hours 23 min ago:
        Does it support incremental parsing? For example, when I am parsing a
        network protocol, can it still consume some data from the head of the
        buffer even if the data is incomplete? This would not only avoid
        multiple attempts to restart parsing from the beginning but also
        prevent the buffer from growing excessively.
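        
        A minimal sketch of the incremental behaviour being asked about,
        written by hand in Python. It says nothing about what Kaitai itself
        supports. The wire format (a 2-byte big-endian length followed by the
        payload) and the FrameParser class are hypothetical.

```python
import struct

class FrameParser:
    """Incremental parser for frames of the form: u2 big-endian length + payload."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk):
        """Append bytes and return every frame that is now complete."""
        self._buf += chunk
        frames = []
        while True:
            if len(self._buf) < 2:
                break                       # length prefix not complete yet
            (length,) = struct.unpack_from(">H", self._buf, 0)
            if len(self._buf) < 2 + length:
                break                       # payload not complete yet
            frames.append(bytes(self._buf[2:2 + length]))
            del self._buf[:2 + length]      # drop consumed bytes so the buffer stays small
        return frames

    @property
    def pending(self):
        """Number of bytes still waiting for the rest of their frame."""
        return len(self._buf)

parser = FrameParser()
first = parser.feed(b"\x00\x03ab")      # frame incomplete, nothing returned
second = parser.feed(b"c\x00\x01")      # completes the first frame, starts a second
third = parser.feed(b"z")               # completes the second frame
```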
       
        bburky wrote 1 day ago:
        Kaitai is pretty nice. Hex editors with structure parsing support used
        to be more rare than they are now, so I've used [1] instead a few
        times.
        
        Also, the newest Kaitai release added (long awaited) serialization
        support! I haven't had a chance to try it out.
        
  HTML  [1]: https://ide.kaitai.io/
  HTML  [2]: https://kaitai.io/news/2025/09/07/kaitai-struct-v0.11-released...
       
        Everdred2dx wrote 1 day ago:
        I had a ton of fun using Kaitai to write an unpacking script for a
        video game's proprietary pack file format. Super cool project.
        
        I did NOT have fun trying to use Kaitai to pack the files back
        together. Not sure if this has improved at all but a year or so ago you
        had to build dependencies yourself and the process was so cumbersome it
        ended up being easier to just write imperative code to do it myself.
       
          kodachi wrote 21 hours 49 min ago:
          It hasn't improved that much: you need to know the final size and
          fill all attributes; there are no defaults, at least in Python.
       
        dgan wrote 1 day ago:
        Wow this is good. My only complaint is the annoyingly verbose YAML. What
        if I would like to use Kaitai instead of protobufs? My .proto file is
        already a thousand lines; splitting each of these lines into 3-4
        indented YAML lines would hurt readability.
       
          indrora wrote 8 hours 32 min ago:
          You're using it for the wrong thing, then.
          
          KS isn't for general data mangling, it's for "I have this format and
          I need a de novo parser for it that works under explicit rules" and
          you're willing to do the work of fully implementing it from the bytes
          up.
       
            dgan wrote 3 hours 8 min ago:
            Ok. What if I want a general, polyglot data mangling tool which
            doesn't produce metric tons of bloat like protobuf does?
       
        imtringued wrote 1 day ago:
         [1] DFDL is heavily encroaching on Kaitai Struct's territory.
        
  HTML  [1]: https://en.wikipedia.org/wiki/Data_Format_Description_Language
       
        ginko wrote 1 day ago:
        No pure C backend?
       
          vendiddy wrote 1 day ago:
          It's not C but we have sponsored a Zig target for Kaitai. If anyone
          reading this knows Zig well, please comment because we would love to
          get a code review of the generated code!
       
            vitalnodo wrote 18 hours 51 min ago:
            Can you share the link? I wonder also whether it uses comptime
            features.
       
              vendiddy wrote 15 hours 14 min ago:
              It is not yet ready but the master branch has an initial draft.
              [1] It would be premature to review now because there are some
              missing features and stuff that has to be cleaned up.
              
              But I am interested in finding someone experienced in Zig to help
              the maintainer with a sanity check to make sure best practices
              are being followed. (Would be willing to pay for their time.)
              
              If comptime is used, it would be minimal. This is because
              code-generation is being done anyway so that can be an explicit
              alternative to comptime. But we have considered using it in a few
              places to simplify the code-generation.
              
  HTML        [1]: https://github.com/kaitai-io/kaitai_struct_compiler/comm...
       
          dhsysusbsjsi wrote 1 day ago:
          This would be great for most projects, as the Swift target, for
          example, is abandoned: 6+ years since the last commit.
       
        sitkack wrote 1 day ago:
        What was the Python based binary parsing library from around 2010?
        Hachoir?
        
  HTML  [1]: https://hachoir.readthedocs.io/en/latest/index.html
       
          jonstewart wrote 1 day ago:
          Hachoir was rad, just not very fast.
       
          ctoth wrote 1 day ago:
          Construct?
       
        okanat wrote 1 day ago:
        Even if you don't want to use it since it is not as efficient as a
        hand-written specialized parser, Kaitai Struct gives a perfect way of
        documenting file formats. I love the idea and every bit of the project!
       
          jonstewart wrote 1 day ago:
          I like using it for parsing structs but then intersperse procedural
          code in it for loops/containers, so not everything gets read into RAM
          all at once.
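          
          A minimal sketch of that split, assuming a file made of fixed-size
          entries: the per-entry struct is parsed declaratively, while the loop
          over entries is an ordinary generator, so only one entry is held in
          memory at a time. This is plain Python, not Kaitai-generated code;
          the entry layout and iter_entries are hypothetical.

```python
import io
import struct

ENTRY = struct.Struct("<IH")   # hypothetical fixed-size entry: u4 id, u2 flags

def iter_entries(stream):
    """Yield one parsed entry at a time instead of building a full list."""
    while True:
        raw = stream.read(ENTRY.size)
        if not raw:
            return                          # clean end of stream
        if len(raw) < ENTRY.size:
            raise EOFError("truncated entry")
        entry_id, flags = ENTRY.unpack(raw)
        yield {"id": entry_id, "flags": flags}

# Stand-in for a large file: 1000 entries whose flags are twice their id.
stream = io.BytesIO(b"".join(ENTRY.pack(i, i * 2) for i in range(1000)))
flagged = sum(1 for entry in iter_entries(stream) if entry["flags"] > 1000)
```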
       
        layoric wrote 1 day ago:
        I discovered this project recently and used it for Himawari Standard
        Data format and it made it so much easier. Definitely recommend using
        this if you need to create binary readers for uncommon formats.
       
        setheron wrote 1 day ago:
        Great timing!
        I just published [1] and contributed the Kaitai C++ STL runtime to
        nixpkgs [2].
        
  HTML  [1]: https://github.com/fzakaria/nix-nar-kaitai-spec
  HTML  [2]: https://github.com/NixOS/nixpkgs/pull/454243
       
        mturk wrote 1 day ago:
        Kaitai is absolutely one of my favorite projects.  I use it for work
        (parsing scientific formats, prototyping and exploring those formats,
        etc) as well as for fun (reverse engineering games, formats for DOSbox
        core dumps, etc).
        
        I gave a guest lecture in a friend's class last week where we used
        Kaitai to back out the file format used in "Where in Time is Carmen
        Sandiego" and it was a total blast.  (For me.  Not sure that the class
        agreed?  Maybe.)  The Web IDE made this super easy -- [1] .
        
        (On my youtube page I've got recordings of streams where I work with
        Kaitai to do projects like these, but somehow I am not able to work up
        the courage to link them here.)
        
  HTML  [1]: https://ide.kaitai.io/
       
          heromal wrote 22 hours 52 min ago:
          I'm curious, how do you use it for Game RE?
       
            ACCount37 wrote 13 hours 43 min ago:
            Not the author, but also in RE.
            
            RE, especially of older and more specialized software, involves
            dealing with odd undocumented binary formats. Which you may have to
            dissect carefully with a hex editor and a decompiler, so that you
            can get at the data inside.
            
            Kaitai lets you prototype a parser for formats like that on the go,
            quick and easy.
       
              pvitz wrote 10 hours 6 min ago:
              A shot in the dark, but maybe you could give me a hint. Recently,
              I was interested in extracting sprites from an old game. I was
              able to reverse the file format of the data archive, which
              contained the game assets as files. However, I got stuck because
              the image files were obviously compressed. By chance, I found an
              open source reimplementation of the game and realised it was
              LZ77+Huffman compressed, but how would one detect the type of
              compression and parameters with only the file? That seems a
              pretty hard problem or are there good heuristics to detect that?
       
                ACCount37 wrote 9 hours 16 min ago:
                Some simpler cases like various RLE-type encodings can be
                figured out with that pattern recognizing brain - by staring at
                them really really hard.
                
                For harder cases? You take the binaries that read or write your
                compressed files, load them in your tool (typically Ghidra
                nowadays), and track down the code that does it.
                
                Then you either recognize what that code does (by staring at it
                really really hard), or try to re-implement it by hand while
                reading up on various popular compression algos in hope that
                doing this enlightens you.
                
                Third option now: feed the decompiled or reimplemented code to
                the best LLM you have access to, and ask it. Those things are
                downright superhuman at pattern matching known algorithms, so
                use them, no such thing as "cheating" in RE.
                
                The "hard mode" is compression implemented in hardware, with
                neither a software encoder nor a software decoder available. In
                which case you better be ready for a lot of "feed data to the
                magic registers, see results, pray they give you a useful hint"
                type of blind hardware debugging. Sucks ass.
                
                The "impossible" is when you have just the compressed binaries,
                with no encoder or decoder or plaintext data available to you
                at all. Better hope it's something common or simple enough or
                it's fucking hopeless. Solving that kind of thing is
                cryptanalysis level of mind fuck and I am neither qualified
                enough nor insane enough to advise on that.
                
                Another thing. For practical RE? ALWAYS CHECK PRIOR WORK FIRST.
                You finding an open source reimplementation? Good job, that's
                what you SHOULD be doing, no irony, that's what you should be
                doing ALWAYS. Always check whether someone has been there and
                done that! Always! Check whether someone has worked on this
                thing, or the older version of it, or another game in the same
                engine - anything at all. Can save you literal months of
                banging your head against the wall.
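                
                A minimal first-pass triage sketch for the "only the file"
                case, using cheap checks before any serious reversing. It is a
                set of common heuristics, not a reliable detector: it only
                knows gzip and zlib headers plus raw DEFLATE, and a custom
                LZ77+Huffman scheme like the one mentioned above would most
                likely show up only as "high entropy". The function names and
                the 7.5 bits/byte threshold are assumptions.

```python
import math
import zlib
from collections import Counter

def shannon_entropy(data):
    """Bits of entropy per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(n / total * math.log2(n / total) for n in Counter(data).values())

def sniff(data):
    """Return a rough guess about how a blob is encoded."""
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    # zlib streams start with a 2-byte header: method nibble 8, and the
    # big-endian 16-bit value is a multiple of 31.
    if len(data) >= 2 and data[0] & 0x0F == 8 and int.from_bytes(data[:2], "big") % 31 == 0:
        return "zlib"
    if shannon_entropy(data) > 7.5:
        return "high entropy: compressed or encrypted, format unknown"
    return "low entropy: probably uncompressed or lightly encoded"

def try_raw_deflate(data):
    """Attempt headerless DEFLATE; return the decoded bytes or None on error."""
    try:
        return zlib.decompressobj(wbits=-15).decompress(data)
    except zlib.error:
        return None
```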
       
                  pvitz wrote 9 hours 3 min ago:
                  Thanks for your reply and advice! I guess what you describe
                  as "impossible" is the case I am mostly interested in, though
                  more for non-executable binary data. If I am not mistaken,
                  this goes under the term "file fragment classification", but
                  I have been wondering if practitioners might have figured out
                  some better ways than what one can find in scholarly
                  articles.
                  
                  And yes, searching for the reimplementation beforehand would
                  have saved me some hours :D
       
                    ACCount37 wrote 8 hours 12 min ago:
                    It's not about the data being executable. It's about having
                    access to whatever reads or writes this data.
                    
                    Whatever reads or writes this data has to be able to
                    compress or decompress it. And with any luck, you'll be
                    able to take the compression magic sauce from there.
       
                      pvitz wrote 7 hours 56 min ago:
                      I understood "binaries" in "compressed binaries" as
                      "executables", e.g. like a packed executable, but I see
                      that you mean indeed a binary file (and not e.g. a text
                      file).
       
                        ACCount37 wrote 5 hours 18 min ago:
                        Reread that just now, sorry for not making it clearer.
                        I kind of just used "binaries" in both senses? Hope the
                        context clears it up.
       
        theLiminator wrote 1 day ago:
        Is the main difference from [1] that Kaitai is declarative?
        
  HTML  [1]: https://github.com/google/wuffs
       
          nigeltao wrote 1 day ago:
          See [1]:
          
          > Kaitai Struct is in a similar space, generating safe
          parsers for multiple target programming languages from one
          declarative specification. Again, Wuffs differs in that it is a
          complete (and performant) end to end implementation, not just for the
          structured parts of a file format. Repeating a point in the previous
          paragraph, the difficulty in decoding the GIF format isn't in the
          regularly-expressible part of the format, it's in the LZW
          compression. Kaitai's GIF parser returns the compressed LZW data as
          an opaque blob.
          
          Taking PNG as an example, Kaitai will tell you the image's metadata
          (including width and height) and that the compressed pixels are in
          the such-and-such part of the file. But unlike Wuffs, Kaitai doesn't
          actually decode the compressed pixels.
          
          ---
          
          Wuffs' generated C code also doesn't need any capabilities, including
          the ability to malloc or free. Its example/mzcat program (equivalent
          to /bin/bzcat or /bin/zcat, for decoding BZIP2 or GZIP) self-imposes
          a SECCOMP_MODE_STRICT sandbox, which is so restrictive (and secure!)
          that it prohibits any syscalls other than read, write, _exit and
          sigreturn.
          
          (I am the Wuffs author.)
          
  HTML    [1]: https://github.com/google/wuffs/blob/main/doc/related-work.m...
       
            corysama wrote 21 hours 55 min ago:
            Wuffs looks pretty awesome. Thanks for making it.
            
            Wuffs is intended for files. But, would it be a bad idea to use it
            to parse network data from untrusted endpoints?
       
              nigeltao wrote 21 hours 20 min ago:
              It's a great idea. Chromium uses Wuffs to parse GIF data from the
              untrusted network.
              
              There's also a "wget some JSON and pipe that to what Wuffs calls
              example/jsonptr" example at
              
  HTML        [1]: https://nigeltao.github.io/blog/2020/jsonptr.html#sandbo...
       
          Sesse__ wrote 1 day ago:
          They overlap, but neither does strictly more than the other.
          
          Kaitai is for describing, encoding and decoding file formats. Wuffs
          is for decoding images (which includes decoding certain file
          formats). Kaitai is multi-language, Wuffs compiles to C only. If you
          wrote a parser for PNGs, your Kaitai implementation could tell you
          what the resolution was, where the palette information was (if any),
          what the comments look like and on what byte the compressed pixel
          chunk started. Your Wuffs implementation would give you back the
          decoded pixels (OK, and the resolution).
          
          Think of Kaitai as an IDL generator for file formats, perhaps. It
          lets you parse the file into some sort of language-native struct
          (say, a series of nested objects) but doesn't try to process it
          beyond the parse.
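          
          A minimal sketch of the "parse but don't decode" level described
          above, hand-written in Python rather than generated by Kaitai. It
          walks the PNG chunk list, reads the width and height from IHDR, and
          records where the compressed IDAT data sits, without inflating any
          pixels. describe_png is a hypothetical name, and CRCs are skipped
          rather than verified.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def describe_png(data):
    """Return IHDR metadata and the offsets of IDAT payloads, without inflating them."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    info = {"idat": []}
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack_from(">I4s", data, pos)
        body = pos + 8                       # chunk data starts after length + type
        if ctype == b"IHDR":
            info["width"], info["height"] = struct.unpack_from(">II", data, body)
        elif ctype == b"IDAT":
            info["idat"].append((body, length))   # (absolute offset, byte count)
        elif ctype == b"IEND":
            break
        pos = body + length + 4              # skip data and the 4-byte CRC
    return info
```

          Turning those IDAT bytes into pixels (inflate, then undo the
          per-scanline filters) is the part a decoder such as Wuffs covers.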
       
          setheron wrote 1 day ago:
          Looking at that repo... I have no clue how to get started.
       
            nigeltao wrote 1 day ago:
            The top-level README has a link called "Getting Started".
       
        woodruffw wrote 1 day ago:
        Kaitai Struct is really great. I've used it several times over the
        years to quickly pull in a parser that I'd otherwise have to hand-roll
        (and almost certainly get subtly wrong).
        
        Their reference parsers for Mach-O and DER work quite nicely in
        abi3audit[1]:
        
  HTML  [1]: https://github.com/pypa/abi3audit/tree/main/abi3audit/_vendor
       
        jdp wrote 1 day ago:
        I also like Protodata [1]. It's complementary as an exploration and
        transformation tool when working with binary data formats.
        
  HTML  [1]: https://github.com/evincarofautumn/protodata
       
        zzlk wrote 1 day ago:
        I wanted to use this a long time ago but the Rust support wasn't there.
        I can see now that it's on the front page with apparently first class
        support so looks like I can give it a go again.
       
       