        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
       
       
       COMMENT PAGE FOR:
  HTML   How memory maps (mmap) deliver faster file access in Go
       
       
        lzaf wrote 8 hours 52 min ago:
         In a similar vein, some time ago I had written a small toy lib
         that emulates the os.File interface for mmap-backed files: [1]
         It even handled bus errors without panicking: [2]
        
         Read and write performance was usually better than os.File,
         especially with larger write sizes:
        
          ~/src/yammap$ go test -benchtime=4s -bench .
          goos: linux
          goarch: amd64
          pkg: github.com/zaf/yammap
          cpu: AMD Ryzen 9 5900X 12-Core Processor
           BenchmarkWrite-24      29085    164744 ns/op    25459.52 MB/s
           BenchmarkOSWrite-24    22204    215131 ns/op    19496.54 MB/s
           BenchmarkRead-24       29113    166820 ns/op    25142.72 MB/s
           BenchmarkOSRead-24     27451    172685 ns/op    24288.69 MB/s
        
  HTML  [1]: https://github.com/zaf/yammap
  HTML  [2]: https://github.com/zaf/yammap/blob/186f714343906bb9304ad5f30b1...
       
        benjiro wrote 15 hours 42 min ago:
        People are so focused on the mmap part, and the latency, that the usage
        is overlooked.
        
        > The last couple of weeks I've been working on an HTTP-backed
        filesystem.
        
         It feels like these are micro-optimizations that are going to get
         blocked by the whole HTTP cycle anyway.
        
        There is also the benchmark issue:
        
         The enhanced CDB format seems to be focused on read-only benefits,
         as writes introduce a lot of latency and issues with mmap. In
         other words, there is a need to freeze for the mmap, then unfreeze
         and write for updates, then freeze for the mmap again ...
         
         This cycle introduces overhead, does it not? Has this been
         benchmarked? Because from what I am seeing, the benefits are
         mostly in the frozen state (aka read only).
        
         If the data is changed infrequently, why not just use JSON? No
         matter how slow it is, if you're just going to do HTTP requests
         for the directory listing, your overhead is not the actual file
         format.
        
         If this enhanced file format were used as file storage, and you
         want to be able to read files fast, that is a different matter.
         Then there are ways around it, like keeping "part" files where
         files 1 ... 1000 are in file.01 and 1001 ... 2000 in file.02
         (thus reducing overhead from the file system). Those are memory
         mapped for fast reading, and updates are handled by invalidating
         files/rewriting (as I do not see any delete/vacuum ability in the
         file format).
        
         So, the actual benefits just for a file directory listing db
         escape me.
       
          dahfizz wrote 12 hours 29 min ago:
           This reads like complete nonsense. If HTTP is involved, let's
           just give up and make the system as slow as possible?
          
          The HTTP request needs to actually be actioned by the server before
          it can respond. Reducing the time it takes for the server to do the
          thing (accessing files) will meaningfully improve overall
          performance.
          
          Switching out to JSON will meaningfully degrade performance. For no
          benefit.
       
            benjiro wrote 11 hours 28 min ago:
             > If HTTP is involved, let's just give up and make the system
             as slow as possible?
             
             Did I write that? Please leave flamebait out of these
             discussions.
            
             The original author (today) answered why they wanted to use
             this approach and the benefits from it. This has been missing
             from this entire discussion. So I really do not understand
             where you get this confidence.
            
            > Switching out to JSON will meaningfully degrade performance. For
            no benefit.
            
             Without knowing why or how the system was used, none of us
             could tell. Now that we know it is used as a transport medium
             between the db/nodes, it's more clear why JSON is an issue for
             them. That still does not explain how you conclude it will
             "meaningfully degrade performance" when this information was
             not available to any of us.
       
          perbu wrote 13 hours 38 min ago:
          We need to support over 10M files in each folder. JSON wouldn't fare
          well as the lack of indices makes random access problematic.
          Composing a JSON file with many objects is, at least with the current
          JSON implementation, not feasible.
          
           CDB is only a transport medium. The data originates in
           PostgreSQL and is, upon request, stored in CDB and transferred.
           Writing/freezing to CDB is faster than encoding JSON.
          
          CDB also makes it possible to access it directly, with ranged HTTP
          requests. It isn't something I've implemented, but having the option
          to do so is nice.
       
            benjiro wrote 11 hours 31 min ago:
            > CDB is only a transport medium. The data originates in PostgreSQL
            and upon request, stored in CDB and transferred. Writing/freezing
            to CDB is faster than encoding JSON.
            
             Might have been interesting to actually include this in the
             article, do you not think so? ;-)
             
             The way the article is written made it seem that you used CDB
             on edge nodes to store metadata, with no information as to
             what you're storing/accessing, how, or why ... This is part of
             the reason we have these discussions here.
       
              perbu wrote 10 hours 2 min ago:
               The post is about mmap and my somewhat successful use of it.
               If I'd described my whole stack it would have been a small
               thesis and not really interesting.
       
        karel-3d wrote 16 hours 41 min ago:
        mmap is fine when you know the file fits in memory, and you need random
        file reads/writes of only some parts of the file. It's not magic.
        
         It's also quite hard to debug in Go, because mmapped files are not
         visible in pprof; when you run out of memory, mmap starts behaving
         really suboptimally. And it's hard to see which file takes how
         much memory (again, it doesn't show in pprof).
       
          perbu wrote 10 hours 6 min ago:
           Random reads are OK. Writes through a mmap are a disaster.
       
            vlowther wrote 9 hours 2 min ago:
            Only if you are doing in-place updates.  If append-only datastores
            are your jam, writes via mmap are Just Fine:
            
              $ go test -v
              === RUN   TestChunkOps
                  chunk_test.go:26: Checking basic persistence and Store
            expansion.
                  chunk_test.go:74: Checking close and reopen read-only
                  chunk_test.go:106: Checking that readonly blocks write ops
                  chunk_test.go:116: Checking Clear
                  chunk_test.go:175: Checking interrupted write
              --- PASS: TestChunkOps (0.06s)
              === RUN   TestEncWriteSpeed
                  chunk_test.go:246: Wrote 1443 MB/s
                  chunk_test.go:264: Read 5525.418751 MB/s
              --- PASS: TestEncWriteSpeed (1.42s)
              === RUN   TestPlaintextWriteSpeed
                  chunk_test.go:301: Wrote 1693 MB/s
                  chunk_test.go:319: Read 10528.744206 MB/s
              --- PASS: TestPlaintextWriteSpeed (1.36s)
              PASS
       
        gethly wrote 18 hours 8 min ago:
        I have never used mmap, as I had no need, but I know BoltDB uses it and
         from what I remember, mmap is good for when you are working with
         whole disk pages, which BoltDB does. Otherwise it seems to be the
         wrong use case for it?
       
        philippta wrote 18 hours 51 min ago:
         At computerenhance.com [1], Casey Muratori shows that memory
         mapped files actually perform worse at sequential reads, which is
         the common case for file access.
        
         That's because the kernel won't prefetch data as effectively and
         has to rely on page faults to know what to read next. With
         regular, sequential file reads, the kernel can be much smarter and
         prefetch the next page while the program is consuming the previous
         one.
        
        
  HTML  [1]: https://www.computerenhance.com/p/memory-mapped-files
       
          atombender wrote 10 hours 7 min ago:
          Does madvise(..., MADV_SEQUENTIAL) not help here?
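           
           (A minimal sketch of what that might look like in Go, assuming
           golang.org/x/sys/unix, an open *os.File f, and its length size:)
           
             data, err := unix.Mmap(int(f.Fd()), 0, int(size),
                 unix.PROT_READ, unix.MAP_SHARED)
             if err != nil {
                 return err
             }
             // Hint that the mapping will be read front to back, so the
             // kernel can read ahead more aggressively.
             if err := unix.Madvise(data, unix.MADV_SEQUENTIAL); err != nil {
                 return err
             }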
       
          vlovich123 wrote 18 hours 14 min ago:
           io_uring should be outperforming both - you can configure the
           read-ahead optimally, there are no page faults, and there are no
           copies as there are with buffered I/O:
          
  HTML    [1]: https://archive.is/vkdCo
       
        charlietap wrote 20 hours 1 min ago:
         This article is nonsensical. If you're reading this, please don't
         start mmap'ing files just to read from them. It proposes an
         incredibly unrealistic scenario where the program is making
         thousands of random, incredibly small, unbuffered reads from a
         file. In reality, 99 percent of programs will sequentially read
         bytes into a buffer, which makes orders of magnitude fewer
         syscalls.
        
        Mmap is useful in niche scenarios, it's not magic.
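         
         (For reference, a minimal sketch of that common sequential case,
         where a buffered reader amortizes one syscall over many small
         reads, assuming an open *os.File f:)
         
           // Each Read is usually just a copy out of the 64 KiB buffer;
           // a read(2) syscall happens only when the buffer runs dry.
           r := bufio.NewReaderSize(f, 64<<10)
           rec := make([]byte, 100)
           if _, err := io.ReadFull(r, rec); err != nil {
               return err
           }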
       
          icedchai wrote 11 hours 32 min ago:
           At a previous company, we had a custom "database" (I use that
           term very loosely) built on memory mapped files. At startup, all
           pages were read to ensure the data was hot and page faults were
           unlikely. It worked well for the application, but obviously
           because the whole thing fit in memory and was preloaded. We also
           had our own custom write-ahead log. Today, I'd probably use
           sqlite.
       
          perbu wrote 16 hours 25 min ago:
          This is a niche scenario. The scenario outlined is reading CDB
          databases.
       
          karel-3d wrote 16 hours 38 min ago:
           That is not unrealistic if you are using the file to store
           binary data at given positions and don't need to read all the
           data. For example, if you have a big matrix of fixed-size
           structs and you need to read only some of them.
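           
           (A minimal sketch of that pattern, assuming the file is already
           mmapped into data []byte and holds fixed 64-byte records:)
           
             const recSize = 64
             // Record i is just a subslice of the mapping: no seek, no
             // syscall, no copy - at most a page fault if it isn't
             // resident.
             rec := data[i*recSize : (i+1)*recSize]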
       
        Animats wrote 20 hours 43 min ago:
        I never knew that Linux memory mapped files were copy-on-write. I'd
        assumed they let you alter the page and wrote out dirty pages later.
       
          pengaru wrote 20 hours 9 min ago:
          MAP_PRIVATE vs. MAP_SHARED
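           
           (Roughly, via golang.org/x/sys/unix; error handling elided:)
           
             // MAP_SHARED: stores become visible to other processes and
             // dirty pages are written back to the file by the kernel.
             shared, _ := unix.Mmap(fd, 0, length,
                 unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
             // MAP_PRIVATE: copy-on-write; stores never reach the file.
             private, _ := unix.Mmap(fd, 0, length,
                 unix.PROT_READ|unix.PROT_WRITE, unix.MAP_PRIVATE)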
       
        kragen wrote 1 day ago:
        The simple answer to "How do memory maps (mmap) deliver faster file
        access?" is "sometimes", but the blog post does give some more details.
        
        I was suspicious of the 25× speedup claim, but it's a lot more
        plausible than I thought.
        
        On this Ryzen 5 3500U running mostly at 3.667GHz (poorly controlled),
        reading data from an already-memory-mapped page is as fast as memcpy
        (about 10 gigabytes per second when not cached on one core of my
        laptop, which works out to 0.1 nanoseconds per byte, plus about 20
        nanoseconds of overhead) while lseek+read is two system calls (590ns
        each) plus copying bytes into userspace (26–30ps per byte for small
        calls, 120ps per byte for a few megabytes).  Small memcpy (from, as it
        happens, an mmapped page) also costs about 25ps per byte, plus about
        2800ps per loop iteration, probably much of which is incrementing the
        loop counter and passing arguments to the memcpy function (GCC is
        emitting an actual call to memcpy, via the PLT).
        
        So mmap will always be faster than lseek+read on this machine, at least
        if it doesn't have a page fault, but the point at which memcpy from
        mmap would be 25× faster than lseek+read would be where 2×590 + .028n
        = 25×(2.8 + .025n) = 70 + .625n.  Which is to say 1110 = .597n ∴ n =
        1110/.597 = 1859 bytes.  At that point, memcpy from mmap should be 49ns
        and lseek+read should be 1232ns, which is 25× as big.    You can cut
        that size more than in half if you use pread() instead of lseek+read,
        and presumably io_uring would cut it even more.  If we assume that
        we're also taking cache misses to bring in the data from main memory in
        both cases, we have 2×590 + .1n = 25×(2.8 + .1n) = 70 + 2.5n, so 1110
        = 2.4n ∴ n = 1110/2.4 = 462 bytes.
        
        On the other hand, mmap will be slow if it's hitting a page fault,
        which sort of corresponds to the case where you could have cached the
        result of lseek+read in private RAM, which you could do on a
        smaller-than-pagesize granularity, which potentially means you could
        hit the slow path much less often for a given working set.  And
         lseek+read has several possible ways to make the I/O asynchronous,
        while the only way to make mmap page faults asynchronous is to hit the
        page faults in different threads, which is a pretty heavyweight
        mechanism.
        
        On the other hand, lseek+read with a software cache is sort of using
        twice as much memory (one copy is in the kernel's buffer cache and
        another copy is in the application's software cache) so mmap could
        still win.  And, if there are other processes writing to the data being
        queried, you need some way to invalidate the software cache, which can
        be expensive.
        
        (On the gripping hand, if you're reading from shared memory while other
        processes are updating it, you're probably going to need some kind of
        locking or lock-free synchronization with those other processes.)
        
        So I think a reasonably architected lseek+read (or pread) approach to
        the problem might be a little faster or a little slower than the mmap
        approach, but the gap definitely won't be 25×.  But very simple
        applications or libraries, or libraries where many processes might be
        simultaneously accessing the same data, could indeed get 25× or even
        256× performance improvements by letting the kernel manage the cache
        instead of trying to do it themselves.
        
        Someone at a large user of Varnish told me they've mostly removed mmap
        from their Varnish fork for performance.
       
          kragen wrote 21 hours 3 min ago:
          It's worth reading bcrl's comment at [1] for more depth on some of
          these issues.
          
  HTML    [1]: https://news.ycombinator.com/item?id=45690006
       
          loeg wrote 22 hours 41 min ago:
          > lseek+read is two system calls
          
          You'd never do that, though -- you'd use pread.
       
            kragen wrote 21 hours 7 min ago:
            The article I'm commenting on said its author used seek and read,
            so I don't know if maybe for some reason they did do that instead
            of pread(), which it also mentioned.  I didn't want to
            optimistically assume otherwise.  Is pread() available in the
            Golang standard library? [1] is someone using os.File.ReadAt, which
            is a method name that makes me even more uncertain.  But there's
            also syscall.Pread apparently, so it should be fine?
            
            If you are making only one system call, the 25× crossover point is
            800-some bytes by my measurements.
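             
             (For what it's worth, on Linux os.File.ReadAt behaves as the
             pread analogue; a sketch, assuming an open *os.File f:)
             
               buf := make([]byte, 100)
               // One positioned read via pread(2): no separate lseek, and
               // safe for concurrent use since it doesn't move the file
               // offset.
               if _, err := f.ReadAt(buf, off); err != nil {
                   return err
               }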
            
  HTML      [1]: https://github.com/golang/go/issues/19563
       
        liuliu wrote 1 day ago:
         mmap is a good crutch when you 1. don't have a busy polling /
         async IO API available and want to do some quick & dirty
         preloading tricks; 2. don't want to manage the complexity of an
         in-memory cache, especially cross-process ones.
         
         Obviously if you have kernel-backed async IO APIs (io_uring) and
         are willing to dig into the deeper end (for a better managed
         cache), you can get better performance than mmap. But in many
         cases, mmap is "good-enough".
       
        gustavpaul wrote 1 day ago:
        The MmapReader is not copying the requested byte range into the buf
        argument, so if ever the underlying file descriptor is closed (or the
        file truncated out of band) any subsequent slice access will throw
        SIGBUS, which is really unpleasant.
        
         It also means the latency due to page faults is shifted from
         inside mmapReader.ReadRecord() (where it would be expected) to
         wherever in the application the bytes are first accessed, leading
         to spooky, unpredictable latency spikes in what are otherwise pure
         functions. That inevitably leads to wild arguments about how bad
         GC stalls are :-)
        
        An apples to apples comparison should be copying the bytes from the
        mmap buffer and returning the resulting slice.
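         
         (A minimal sketch of such a copying variant, assuming the
         article's mmapReader keeps its mapping in a data []byte field:)
         
           // Copy the record out of the mapping, so any page fault is
           // paid here and the caller never aliases mmapped memory.
           func (r *mmapReader) ReadRecordCopy(off, n int64) []byte {
               out := make([]byte, n)
               copy(out, r.data[off:off+n])
               return out
           }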
       
          dahfizz wrote 12 hours 25 min ago:
           Being able to avoid an extra copy is actually a huge performance
           gain when you can safely do it. You shouldn't discount how
           useful mmap is just because it's not useful in every scenario.
          
          You shouldn't replace every single file access with mmap. But when it
          makes sense, mmap is a big performance win.
       
          loeg wrote 22 hours 45 min ago:
          > so if ever the underlying file descriptor is closed
          
          Nit: Mmap mapping lifetimes are not attached to the underlying fd. 
          The file truncation and latency concerns are valid, though.
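           
           (E.g., this is fine; the mapping outlives the descriptor until
           munmap. A sketch, assuming golang.org/x/sys/unix:)
           
             data, err := unix.Mmap(int(f.Fd()), 0, length,
                 unix.PROT_READ, unix.MAP_SHARED)
             if err != nil {
                 return err
             }
             f.Close() // fine: the mapping stays valid until Munmap
             defer unix.Munmap(data)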
       
          dapperdrake wrote 23 hours 54 min ago:
          It’s not accessible until it is in user space.  (Virtual memory
          addresses mapped to physical RAM holding the data.)
          
          Good point.
       
        Ingon wrote 1 day ago:
         When I adopted mmap in klevdb [1], I saw dramatic performance
         improvements. So, even as klevdb completes a write segment, it
         will reopen the segment, on demand, for reading with mmap
         (segments are basically parts of a write-only log). With this, any
         random reads are super fast (but of course not as fast as
         sequential ones).
        
  HTML  [1]: https://github.com/klev-dev/klevdb
       
        commandersaki wrote 1 day ago:
        This is a good article but I'm wondering what is the relationship
        between this website/company and varnish-cache.org, since in the
        article they make claims of releasing Varnish Cache, and the article
        wasn't written by Poul-Henning Kamp.
       
          wmf wrote 23 hours 15 min ago:
          Varnish hasn't been a solo project for many years. Also PHK's version
          is now called Vinyl Cache while the corporate fork is called Varnish.
       
            commandersaki wrote 18 hours 8 min ago:
            The article says "when we launched Varnish Cache back in 2006". Who
            is we? My memory was that around that time PHK released it to the
            world and was the sole developer at the time.
       
              kragen wrote 17 hours 17 min ago:
              I was wondering about this too.  He apparently worked at the
              company for a while?  Did he found it?
       
                perbu wrote 16 hours 40 min ago:
                Yes. When Varnish Cache launched, in 2006, I worked in a rather
                small OSS consultancy, which did the Linux port of Varnish
                Cache and provided maintenance and funding for the project.
       
                  kragen wrote 16 hours 35 min ago:
                  You say, "Yes. When Varnish Cache launched, in 2006, I worked
                  in a rather small OSS consultancy, which did the Linux port
                  of Varnish Cache and provided maintenance and funding for the
                  project."
                  
                  But eventually phk left, and you came into conflict with him
                  over the name, which was resolved by him choosing a different
                  name for his version of Varnish?
       
                    perbu wrote 10 hours 8 min ago:
                    Not really.
                    
                     We've been funding phk's work on Varnish and Vinyl
                     Cache for 20 years. Do you think phk can write,
                     maintain and release something on his own? Vinyl Cache
                     cannot be a one-man show, be real.
       
                      kragen wrote 6 hours 53 min ago:
                      (I do, in fact, think phk can write, maintain, and
                      release something on his own.)
       
                        perbu wrote 5 hours 39 min ago:
                         He knows a lot of things and is amongst the best
                         software developers I've worked with, but on a
                         project like this you need a lot more breadth than
                         any single developer can bring.
       
                      kragen wrote 9 hours 38 min ago:
                      I see.    Thank you for explaining!
       
        mholt wrote 1 day ago:
        Just this month, I've learned the hard way that some file systems do
        not play well with mmap: [1] In my case, it seems that Mac's ExFAT
        driver is incompatible with sqlite's WAL mode because the driver
        returned a memory address that is misaligned on ARM64. Most bizarre
        error I've encountered in years.
        
        So, uh, mind your file systems, kids!
        
  HTML  [1]: https://github.com/mattn/go-sqlite3/issues/1355
       
          vlovich123 wrote 1 day ago:
          I would be very careful about that conclusion. Reading that thread it
          sounds like you’re relying on Claude to make this conclusion but
          you haven’t actually verified what the address being returned
          actually is.
          
           The reason I'm skeptical is threefold. The first is that it's
           generally impossible for a filesystem's mmap to return a pointer
           that's not page-boundary aligned. The second is that unaligned
           accesses are still fine on modern ARM and do not raise a SIGBUS.
           The third is that Claude's reasoning - that the pointer must be
           8-byte aligned and that this indicates a misaligned read - is
           flawed: how do you know that SQLite isn't doing a 2-byte read at
           that address?
          
          If you really think it’s a bad alignment it should be trivial to
          reproduce - mmap the file explicitly and print the address or modify
          the SQLite source to print the mmap location it gets.
       
            mholt wrote 1 day ago:
            I'd love to be wrong, but the address it's referring to is the
            correct address from the error / stack trace.
            
            I honestly don't know anything about this. There's no search
            results for my error. ChatGPT and Claude and Grok all agreed one
            way or another, with various prompts.
            
            Would be happy to have some help verifying any of this. I just know
            that disabling WAL mode, and not using Mac's ExFAT driver, both
            fixed the error reliably.
       
              achierius wrote 23 hours 30 min ago:
              But is that the address being returned by mmap?
              Furthermore, what instruction is this crashing on? You should be
              able to look up the specific alignment requirements of that
              instruction to verify.
              
              > ChatGPT and Claude and Grok all agreed one way or another, with
              various prompts.
              
              This means less than you'd think: they're all trained on a
              similar corpus, and Grok in particular is probably at least
              partially distilled from Claude. So they tend to come to similar
              conclusions given similar data.
       
                mholt wrote 20 hours 49 min ago:
                I believe it's being returned by the FS driver, not mmap()
                necessarily. I think I knew what instruction it was when I was
                debugging it but don't remember right now. (I could probably
                dig through my LLM history and get it though.)
                
                 And yeah, I know AI is useless, I try to avoid it, but
                 when I'm in way over my head it's better than nothing (it
                 did lead me to the workaround that I mentioned in my
                 previous comment).
       
                  vlovich123 wrote 18 hours 53 min ago:
                   If it was in the FS driver (which runs in the kernel / a
                   different process?), why would your process be dying?
       
        MayCXC wrote 1 day ago:
        wowie. mmap also dramatically improved perf for LLaMA:
        
  HTML  [1]: https://justine.lol/mmap/
       
          kristjansson wrote 20 hours 51 min ago:
          uh.  there was a bit more to the story than 'yup totally unalloyed
          free lunch'
       
        buybackoff wrote 1 day ago:
        It looks suspicious at 25x. Even 2.5x would be suspicious unless
        reading very small records.
        
        I assume both cases have the file cached in RAM already fully, with a
        tiny size of 100MB. But the file read based version actually copies the
        data into a given buffer, which involves cache misses to get data from
        RAM to L1 for copying. The mmap version just returns the slice and it's
        discarded immediately, the actual data is not touched at all. Each
        record is 2 cache lines and with random indices is not prefetched. For
        the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100
        bytes from RAM to L1 should take ~100 nanos.
        
         The benchmark compares actually getting data vs. getting the data
         location. Single-digit nanos is the scale of good hash table
         lookups with data in CPU caches, not actual IO. For fairness, both
         should use/touch the data, e.g. copy it.
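         
         (A minimal sketch of touching the bytes inside the benchmark loop
         so the access can't be optimized away, assuming rec is the slice
         the lookup returned:)
         
           // XOR every byte so both cache lines are actually fetched
           // from memory instead of just a slice header being computed.
           var sink byte
           for _, b := range rec {
               sink ^= b
           }
           _ = sink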
       
          Tuna-Fish wrote 12 hours 30 min ago:
          > For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading
          100 bytes from RAM to L1 should take ~100 nanos.
          
           It's important to note that throughput is not just the inverse
           of latency, because modern OoO CPUs with modern memory
           subsystems can have hundreds of requests in flight. If your code
           doesn't serialize accesses, latency numbers are irrelevant to
           throughput.
       
          checker659 wrote 13 hours 20 min ago:
          Latency Numbers Every Programmer Should Know (originally by Jeff Dean
          / Peter Norvig)
          
  HTML    [1]: https://gist.github.com/jboner/2841832
       
          hyc_symas wrote 17 hours 8 min ago:
          That's such an obvious error in their benchmark code. In my benchmark
          code I make sure to touch the data so at least the 1st page is
          actually paged in from disk.
          
  HTML    [1]: https://github.com/LMDB/dbbench/blob/1281588b7fdf119bcba65ce...
       
          a-dub wrote 23 hours 49 min ago:
          doing these sorts of benchmarks is actually quite tricky. you must
          clear the page cache by allocating >1x physical ram before each
          attempt.
          
          moreover, mmap by default will load lazy, where mmap with
          MAP_POPULATE will prefetch. in the former case, reporting average
          operation times is not valid because the access time distributions
          are not gaussian (they have a one time big hit at first touch). with
          MAP_POPULATE (linux only), there is long loading delay when mmap is
          first called, but then the average access times will be very low.
          when pages are released will be determined by the operating system
          page cache eviction policy.
          
          the data structure on top is best chosen based on desired runtime
          characteristics. if it's all going in ram, go ahead and use a
          standard randomized hash table. if it's too big to fit in ram,
          designing a structure that is aware of lru style page eviction
          semantics may make sense (ie, a hash table or other layout that
          preserves locality for things that are expected to be accessed in a
          temporally local fashion.)
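           
           (a rough sketch of the eager variant via golang.org/x/sys/unix,
           linux-only:)
           
             // MAP_POPULATE prefaults the whole mapping up front: mmap
             // blocks while pages load, then accesses are uniformly fast.
             data, err := unix.Mmap(fd, 0, length, unix.PROT_READ,
                 unix.MAP_SHARED|unix.MAP_POPULATE)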
       
            codedokode wrote 18 hours 35 min ago:
            >  you must clear the page cache
            
            In Linux there is a /proc/sys/vm/drop_caches pseudo file that does
            this. Look how great Linux is compared to other OSes.
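             
             (From Go, a sketch; needs root:)
             
               // "3" drops the page cache plus dentries and inodes.
               // Sync first: only clean pages can be dropped.
               err := os.WriteFile("/proc/sys/vm/drop_caches",
                   []byte("3"), 0644)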
       
              a-dub wrote 11 hours 51 min ago:
              that's super cool! live and learn. even better would be the
              capability to drop caches from a supplied point in the filesystem
              hierarchy.
       
                ahoka wrote 10 hours 35 min ago:
                People would run it from cron to "free memory", believe it or
                not.
       
                  DoctorOW wrote 10 hours 14 min ago:
                  Hence,
                  
  HTML            [1]: https://www.linuxatemyram.com/
       
          kragen wrote 1 day ago:
          > For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading
          100 bytes from RAM to L1 should take ~100 nanos.
          
          I think this is the wrong order of magnitude.  One core of my Ryzen 5
          3500U seems to be able to run memcpy() at 10 gigabytes per second
          (0.1 nanoseconds per byte) and memset() at 31 gigabytes per second
          (0.03 nanoseconds per byte).  I'd expect a sequential read of 100
          bytes to take about 3 nanoseconds, not 100 nanoseconds.
          
          However, I think random accesses do take close to 100 nanoseconds to
          transmit the starting row and column address and open the row.    I
          haven't measured this on this hardware because I don't have a test
          I'm confident in.
       
            bcrl wrote 23 hours 52 min ago:
            100 nanoseconds from RAM is correct.  Latency != bandwidth.  3
            nanoseconds would be from cache or so on a Ryzen.  You ain't gonna
            get the benefits of prefetching on the first 100 bytes.
       
              kragen wrote 23 hours 47 min ago:
              Yes, my comment clearly specified that I was talking about
              sequential reads, which do get the benefits of prefetching, and
              said, "I think random accesses do take close to 100 nanoseconds".
       
                bcrl wrote 23 hours 37 min ago:
                If you're doing large amounts of sequential reads from a
                filesystem, it's probably not in cache.  You only get latency
                that low if you're doing nothing else that stresses the memory
                subsystem, which is rather unlikely.  Real applications have
                overhead, which is why microbenchmarks like this are useless. 
                Microbenchmarks are not the best first order estimate for
                programmers to think of.
       
                  kragen wrote 23 hours 28 min ago:
                  Yes, I went into more detail on those issues in [1] , but
                  overhead is irrelevant to the issue we were discussing, which
                  is about how long it takes to read 100 bytes from memory. 
                  Microbenchmarks are generally exactly the right way to answer
                  that question.
                  
                  Memory subsystem bottlenecks are real, but even in real
                  applications, it's common for the memory subsystem to not be
                  the bottleneck.  For example, in this case we're discussing
                  system call overhead, which tends to move the system
                  bottleneck inside the CPU (even though a significant part of
                  that effect is due to L1I cache evictions).
                  
                  Moreover, even if the memory subsystem is the bottleneck, on
                  the system I was measuring, it will not push the sequential
                  memory access time anywhere close to 1 nanosecond per byte. 
                  I just don't have enough cores to oversubscribe the memory
                  bus 30×.  (1.5×, I think.)  Having such a large ratio of
                  processor speed to RAM interconnect bandwidth is in fact very
                  unusual, because it tends to perform very poorly in some
                  workloads.
                  
                  If microbenchmarks don't give you a pretty good first-order
                  performance estimate, either you're doing the wrong
                  microbenchmarks or you're completely mistaken about what your
                  application's major bottlenecks are (plural, because in a
                  sequential program you can have multiple "bottlenecks",
                   colloquially, unlike in concurrent systems where you almost
                   always have exactly one bottleneck.)  Both of these problems
                  do happen often, but the good news is that they're fixable. 
                  But giving up on microbenchmarking will not fix them.
                  
  HTML            [1]: https://news.ycombinator.com/item?id=45689464
       
                    bcrl wrote 22 hours 44 min ago:
                    If you're bottlenecked on a 100 byte read, the app is
                    probably doing something really stupid, like not using
                    syscalls the way they're supposed to.  Buffered I/O has
                    existed from fairly early on in Unix history, and it exists
                    because it is needed to deal with the mismatch between what
                    stupid applications want to do versus the guarantees the
                    kernel has to provide for file I/O.
                    
                    The main benefit from the mmap approach is that the fast
                    path then avoids all the code the kernel has to execute,
                    the data structures the kernel has to touch, and everything
                    needed to ensure the correctness of the system.  In modern
                    systems that means all kinds of synchronization and
                    serialization of the CPU needed to deal with
                    $randomCPUdataleakoftheweek (pipeline flushes ftw!).
                    
                    However, real applications need to deal with correctness. 
                    For example, a real database is not just going to just do
                    100 byte reads of records.  It's going to have to take
                    measures (locks) to ensure the data isn't being written to
                    by another thread.
                    
                    Rarely is it just a sequential read of the next 100 bytes
                    from a file.
                    
                    I'm firmly in the camp that focusing on microbenchmarks
                    like this is frequently a waste of time in the general
                    case.  You have to look at the application as a whole
                    first.    I've implemented optimizations that looked great in
                    a microbenchmark, but showed absolutely no difference
                    whatsoever at the application level.
                    
                    Moreover, my main hatred for mmap() as a file I/O mechanism
                    is that it moves the context switches when the data is not
                    present in RAM from somewhere obvious (doing a read() or
                    pread() system call) to somewhere implicit (reading 100
                    bytes from memory that happens to be mmap()ed and was
                    passed as a pointer to a function written by some other
                    poor unknowing programmer).  Additionally, read ahead
                    performance for mmap()s when bringing data into RAM is
                    quite a bit slower than on read()s in large part because it
                    means that the application is not providing a hint (the
                    size argument to the read() syscall) to the kernel for how
                    much data to bring in (and if everything is sequential as
                    you claim, your code really should know that ahead of
                    time).
                    
                    So, sure, your 100 byte read in the ideal case when
                    everything is cached is faster, but warming up the cache is
                    now significantly slower.  Is shifting costs that way
                    always the right thing to do?  Rarely in my experience.
                    
                    And if you don't think about it (as there's no obvious
                    pread() syscall anymore), those microseconds and sometimes
                    milliseconds to fault in the page for that 100 byte read
                    will hurt you.    It impacts your main event loop, the size
                    of your pool of processes / threads, etc.  The programmer
                    needs to think about these things, and the article
                    mentioned none of this.  This makes me think that the
                    author is actually quite naive and merely proud in thinking
                    that he discovered the magic Go Faster button without
                    having been burned by the downsides that arise in the Real
                    World from possible overuse of mmap().
       
                      kragen wrote 21 hours 4 min ago:
                      Perhaps surprisingly, I agree with your entire comment
                      from beginning to end.
                      
                      Sometimes mmap can be a real win, though.  The poster
                      child for this is probably LMDB.  Varnish also does
                      pretty well with mmap, though see my caveat on that in my
                      linked comment.
       
                        bcrl wrote 16 min ago:
                         Varnish was very well done.  It's disappointing
                         that with HTTPS-first nowadays there is very
                         little opportunity to make good use of local web
                         caches of web content across browsers / clients.
                         Caches would have been a godsend back in the 1990s
                         when we had to use shared dialup to connect to the
                         internet while using Netscape in a classroom full
                         of computers.
       
          Scaevolus wrote 1 day ago:
          Yeah, 3.3ns is about 12 CPU cycles. You can indeed create a pointer
          to a memory location that fast!
       
        habibur wrote 1 day ago:
         Is mmap still faster than fread? That might have been true in the
         90s, but I wonder about current improvements.
         
         If you have enough free memory, the file will be cached in memory
         anyway instead of residing on disk. Therefore both will be reading
         from memory, albeit through different APIs.
         
         Looking for recent benchmarks or views from OS developers.
       
          loeg wrote 22 hours 43 min ago:
          read, or fread?  fread is the buffered version that does an extra
          copy for no reason that would benefit this use case.
       
          do_not_redeem wrote 1 day ago:
          Even if the file is cached, fread has to do a memcpy. mmap doesn't.
       
            gpderetta wrote 1 day ago:
            fread is (usually) buffered io, so it actually does two additional
            mem copies (kernel to FILE buffer then to user buffer)
       
              assbuttbuttass wrote 11 hours 40 min ago:
              Not in Go
       
                gpderetta wrote 9 hours 55 min ago:
                 oh, right, this is Go ( [1] ). Do the strings it returns
                 share memory with the internal buffer?
                
  HTML          [1]: https://pkg.go.dev/github.com/odeke-em/go-utils/fread#...
       
          stingraycharles wrote 1 day ago:
           In our experience building a high performance database server:
           absolutely. If your line of thinking is “if you have enough free
           memory”, then these types of optimizations aren’t for you. One
           of the main benefits is eliminating an extra copy.
           
           Additionally, mmap is heavily optimized for random access, so if
           that’s what you’re doing, then you’ll have a much better time
           with it than with fread.
          
          (I hope a plug is not frowned upon here: if you like this kind of
          stuff, we’re a fully remote company and hiring C++ devs: [1] )
          
  HTML    [1]: https://apply.workable.com/quasar/j/436B0BEE43/
       
            YouAreWRONGtoo wrote 1 day ago:
            If you can't post a salary, you shouldn't post a job opening.
            
            (Not that you can afford me.)
            
            Also, your company is breaking the law by false advertising. It
            suggests your current leadership is fucking stupid. Why do you work
            for a criminal enterprise?
       
              jasonwatkinspdx wrote 1 day ago:
              I'd be shocked if anyone would hire you after seeing this
              behavior...
       
              vlovich123 wrote 1 day ago:
              What’s the false advertising?
       
                deaddodo wrote 1 day ago:
                Yeah, I took a look at the posting and it’s a bog standard
                job posting.
                
                I assume they’re referring to the no-salary aspect and (based
                on their speech style) are in the US. But, even in that case,
                it would only matter if the posting were targeted to one of the
                states that require salary information and the company operated
                or had a presence in said state. Since it’s an EU company,
                that’s almost definitely not the case.
       
                  vlovich123 wrote 19 hours 9 min ago:
                  > and the company operated or had a presence in said state
                  
                  And the company was big enough. AFAIK the salary transparency
                  stuff only applies when your headcount exceeds some number.
       
        nteon wrote 1 day ago:
         the downside is that the Go runtime doesn't expect memory reads to
         page fault, so you may end up with stalls/latency/under-utilization
         if part of your dataset is paged out (like if you have a large cdb
         file w/ random access patterns). Using file IO, the Go runtime
         could be running a different goroutine during a disk read, but
         with mmap that thread is descheduled while holding an M and a P.
         I'm also not sure if there would be increased stop-the-world
         pauses, or if the async preemption stuff would "just work".
        
        Section 3.2 of this paper has more details:
        
  HTML  [1]: https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
       
          perbu wrote 15 hours 48 min ago:
          This is amazingly good feedback. I hadn't thought of that at all. It
          is so much harder to reason about the Go runtime as opposed to a
          threaded application.
       
          vlovich123 wrote 1 day ago:
             To me this indicates a limitation of the API. Because you do
             want the kernel to be able to page out that memory under
             pressure while userspace accesses it asynchronously, while
             allowing the thread to do other asynchronous things. There's
             no good programming model / OS API that can accomplish this
             today.
       
            twic wrote 9 hours 59 min ago:
            There isn't today, but there was in 1991, scheduler activations:
            [1] The rough idea is that if the kernel blocks a thread on
            something like a page cache miss, then it notifies the program
            through something a bit like a signal handler; if the program is
            doing user-level scheduling, it can then take account of that
            thread being blocked. The actual mechanism in the paper is more
            refined than that.
            
  HTML      [1]: https://dl.acm.org/doi/10.1145/121132.121151
       
              scottlamb wrote 9 hours 8 min ago:
               Nice find. That going nowhere seems like a classic
               consequence of the cyclical nature of these things:
               user-managed concurrency was cool, then it wasn't, then Go
               (and others) brought it back.
              
              I think the more recent UMCG [1] (kind of a hybrid approach, with
              threads visible by the kernel but mostly scheduled by userspace)
              handles this well. Assuming it ever actually lands in upstream,
              it seems reasonable to guess Go would adopt it, given that both
              originate within Google.
              
              It's worth pointing out that the slow major page fault problem is
              not unique to programs using mmap(..., fd, ...). The program
              binary is implicitly mmaped, and if swap is enabled, even
              anonymous memory can be paged out. I prefer to lock ~everything
              [2] into RAM to avoid this, but most programs don't do this, and
              default ulimits prevent programs running within login shells from
              locking much if anything. [1] 
              
              [2] particularly on (mostly non-Go) programs with many threads,
              it's good to avoid locking into RAM the guard pages or stack
              beyond what is likely to be used, so better not to just use
              mlockall(MCL_CURRENT | MCL_FUTURE) unfortunately.
              
  HTML        [1]: https://lwn.net/Articles/879398/
       
            wmf wrote 23 hours 24 min ago:
            If C had exceptions a page fault could safely unwind the stack up
            to the main loop which could work on something else until the page
            arrives. This has the advantage that there's no cost for the common
            case of accessing resident pages. Exceptions seem to have fallen
            out of favor so this may trade one problem for another.
       
              gpderetta wrote 16 hours 18 min ago:
               you can longjmp, swapcontext or whatever from a signal
               handler into another lightweight fiber. The problem is that
               there is no "until the page arrives" notification. You would
               have to poll mincore, which is awful.
              
               You could of course imagine an asynchronous "mmap complete
               notification" syscall, but at that point why not just use
               io_uring; it will be simpler and it has the benefit of
               actually existing.
       
              vlovich123 wrote 19 hours 6 min ago:
               C++ has exceptions, but having seen the vast majority of
               code, the way it's written, and the understanding of people
               writing it, exception safety is a foreign concept. Doing it
               in C without RAII seems particularly masochistic and doomed
               to fail.
               
               And unwinding the stack isn't what you want to do, because
               you're basically signaling that you want to cancel the
               operation and throwing away all the state when you precisely
               don't want to do that - you just want to pause the current
               task and do other I/O in the meantime.
       
              pjmlp wrote 19 hours 44 min ago:
               Windows C has exceptions, and no one has ever thought about
               doing something like this.
               
               They are only used for the same purpose as UNIX signals,
               without their flaws.
               
               In any case, page faults are OS specific; how would one
               standardise such behaviour, given the added performance loss
               of switching between userspace and kernel?
       
            avianlyric wrote 1 day ago:
             There is no sensible OS API that could support this, because
             fundamentally memory access is a hardware API. The OS isn't
             involved in normal memory reads, because that would be
             ludicrously inefficient, effectively requiring a syscall for
             every memory operation, which means a syscall for any
             operation involving data, i.e. all operations.
            
            Memory operations are always synchronous because they’re
            performed directly as a consequence of CPU instructions. Reading
            memory that’s been paged out results in the CPU itself detecting
            that the virtual address isn’t in RAM, and performing a hardware
            level interrupt. Literally abandoning a CPU instruction mid
            execution to start executing an entirely separate set of
            instructions which will hopefully sort out the page fault that just
            occurred, then kindly ask the CPU to go back and repeat the
            operation that caused the page fault.
            
             The OS is only involved because it's the thing that provided
             the handling instructions for the CPU to execute in the event
             of a page fault. But it's not in any way actually capable of
             changing how the CPU initially handles the page fault.
            
            Also the current model does allow other threads to continue
            executing other work while the page fault is handled. The fault is
             completely localised to the individual thread that triggered
             the fault. The CPU has no concept of the idea that multiple
             threads running on different physical cores are in any way
             related to each other. It also wouldn't make sense to allow
             the interrupted thread to somehow kick off a separate
             asynchronous operation, because where is it going to execute?
             The CPU core where the page fault happened
            is needed to handle the actual page fault, and copy in the needed
            memory. So even if you could kick off an async operation, there
            wouldn’t be any available CPU cycles to carry out the operation.
            
            Fundamentally there aren’t any sensible ways to improve on this
            problem, because the problem only exists due to us pretending that
            our machines have vastly more memory than they actually do. Which
            comes with tradeoffs, such as having to pause the CPU and steal CPU
            time to maintain the illusion.
            
            If people don’t like those tradeoffs, there’s a very simple
            solution. Put enough memory in your machine to keep your entire
            working set in memory all the time. Then page faults can never
            happen.
       
              blibble wrote 23 hours 28 min ago:
               > There is no sensible OS API that could support this,
               because fundamentally memory access is a hardware API.
               
               there's nothing magic about demand paging; faulting is one
               way it can be handled
               
               another could be that the OS exposes the present bit on the
               PTE to userland, and it has to check it itself; Linux
               already has asynchronous "please back this virtual address"
               APIs
              
              > Memory operations are always synchronous because they’re
              performed directly as a consequence of CPU instructions.
              
              although most CPU instructions may look synchronous they really
              aren't, the memory controller is quite sophisticated
              
              > Fundamentally there aren’t any sensible ways to improve on
              this problem, because the problem only exists due to us
              pretending that our machines have vastly more memory than they
              actually do. Which comes with tradeoffs, such as having to pause
              the CPU and steal CPU time to maintain the illusion.
              
               modern demand paging is one possible model that happens to
               be near universal amongst operating systems today
              
              there are many, many other architectures that are possible...
       
                avianlyric wrote 1 hour 8 min ago:
                > although most CPU instructions may look synchronous they
                really aren't, the memory controller is quite sophisticated
                
                 I was eliding a lot of details. But my broader point is
                 that from the perspective of the thread being interrupted,
                 the paging process is completely synchronous. Sure,
                 advanced x86 CPUs may be tracking data dependencies
                 between instructions and actively reordering instructions
                 to reduce the impact of the pipeline stall caused by the
                 page fault. But those are all low level optimisations that
                 are (or should be) completely invisible to the executing
                 thread.
                
                > there are many, many other architectures that are possible...
                
                I would be curious to see any examples of those alternatives.
                Demand paging provides a powerful abstraction, and it’s not
                clear to me how you can sensibly move page management into
                applications. At a very minimum that would suggest that every
                 programming language would need a memory management
                 runtime capable of predicting possible memory reads ahead
                 of time in a sensible fashion, and triggering its own
                 paging logic.
       
              kragen wrote 23 hours 50 min ago:
              > There is no sensible OS API that could support this, because
              fundamentally memory access is a hardware API.
              
              Not only is there a sensible OS API that could support this,
              Linux already implements it; it's the SIGSEGV signal.  The
              default way to respond to a SIGSEGV is by exiting the process
              with an error, but Linux provides the signal handler with enough
              information to do something sensible with it.  For example, it
              could map a page into the page frame that was requested, enqueue
              an asynchronous I/O to fill it, put the current green thread to
              sleep until the I/O completes, and context-switch to a different
              green thread.
              
              Invoking a signal handler only has about the same inherent
              overhead as a system call.  But then the signal handler needs
              another couple of system calls.  So on Linux this is over a
              microsecond in all.  That's probably acceptable, but it's slower
              than just calling pread() and having the kernel switch threads.
              
              Some garbage-collected runtimes do use SIGSEGV handlers on Linux,
              but I don't know of anything using this technique for user-level
              virtual memory.  It's not a very popular technique in part
              because, like inotify and epoll, it's nonportable; POSIX doesn't
              specify that the signal handler gets the arguments it would need,
              so running on other operating systems requires extra work.
              
              im3w1l also mentions userfaultfd, which is a different
              nonportable Linux-only interface that can solve the same thing
              but is, I think, more efficient.
       
                maxdamantus wrote 20 hours 40 min ago:
                Just to clarify, I think the parent posts are talking about
                non-failing page faults, ie where the kernel just needs to
                update the mapping in the MMU after finding the existing page
                already in memory (minor page fault), or possibly reading it
                from filesystem/swap (major page fault).
                
                SIGSEGV isn't raised during a typical page fault, only ones
                that are deemed to be due to invalid reads/writes.
                
                When one of the parents talks about "no good programming
                model/OS api", they basically mean an async option that gives
                the power of threads; threading allows concurrency of page
                faults, so the kernel is able to perform concurrent reads
                against the underlying storage media.
                
                Off the top of my head, a model I can think of for supporting
                concurrent mmap reads might involve a function:
                
                  bool hint_read(void *data, size_t length);
                
                When the caller is going to read various parts of an mmapped
                region, it can call `hint_read` multiple times beforehand to
                add regions into a queue. When the next page fault happens,
                instead of only reading the currently accessed page from disk,
                it can drain the `hint_read` queue for other pages
                concurrently. The `bool` return indicates whether the queue was
                full, so the caller stops making useless `hint_read` calls.
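
                 Hypothetical usage, assuming a true return means the queue
                 just filled up:
                 
                   /* Queue hints for the regions we know we'll touch; the
                      next page fault then drains the queue and reads all
                      hinted pages concurrently. */
                   for (size_t i = 0; i < nregions; i++)
                       if (hint_read(base + offsets[i], lengths[i]))
                           break;  /* queue full; further hints are useless */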
                
                I'm not familiar with userfaultfd, so don't know if it relates
                to this functionality. The mechanism I came up with is still a
                bit clunky and probably sub-optimal compared to using io_uring
                or even `readv`, but these are alternatives to mmap.
       
                  gpderetta wrote 16 hours 34 min ago:
                  Are you reinventing madvise?
       
                    maxdamantus wrote 14 hours 34 min ago:
                    I think the model I described is more precise than madvise.
                    I think madvise would usually be called on large sequences
                    of pages, which is why it has `MADV_RANDOM`,
                    `MADV_SEQUENTIAL` etc. You're not specifying which
                    memory/pages are about to be accessed, but the likely
                    access pattern.
                    
                    If you're just using mmap to read a file from start to
                    finish, then the `hint_read` mechanism is indeed pointless,
                    since multiple `hint_read` calls would do the same thing as
                    a single `madvise(..., MADV_SEQUENTIAL)` call.
                    
                    The point of `hint_read`, and indeed io_uring or `readv` is
                    the program knows exactly what parts of the file it wants
                    to read first, so it would be best if those are read
                    concurrently, and preferably using a single system call or
                    page fault (ie, one switch to kernel space).
                    
                     I would expect the `hint_read` function to push to a
                     queue in thread-local storage, so it shouldn't need a
                     switch to kernel space. User/kernel space switches are
                     slow, topping out on the order of tens of millions per
                     second. This is why the vDSO exists, and why libc
                     buffers writes made through `fwrite`/`printf`/etc:
                     function calls within userspace can happen at rates of
                     billions per second.
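
                     For what it's worth, madvise does also take specific
                     ranges: MADV_WILLNEED starts readahead for exactly the
                     pages named, though each call is its own kernel entry
                     rather than a drained queue. A sketch (file name is a
                     placeholder; 4 KiB pages assumed; error handling
                     elided):
                     
                       #include <fcntl.h>
                       #include <sys/mman.h>
                       #include <sys/stat.h>
                       #include <unistd.h>
                       
                       int main(void) {
                           int fd = open("data.bin", O_RDONLY);
                           struct stat st;
                           fstat(fd, &st);
                           char *map = mmap(0, st.st_size, PROT_READ,
                                            MAP_PRIVATE, fd, 0);
                           /* Start readahead for two regions we know we'll
                              touch soon (ranges must be page-aligned). */
                           madvise(map, 1 << 20, MADV_WILLNEED);
                           madvise(map + ((st.st_size / 2) & ~0xfffL),
                                   1 << 20, MADV_WILLNEED);
                           /* ... later accesses to those ranges are more
                              likely to take minor, not major, faults ... */
                           munmap(map, st.st_size);
                           close(fd);
                           return 0;
                       }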
       
                      gpderetta wrote 13 hours 21 min ago:
                      you can do fine grained madvise via io_uring, which
                      indeed uses a queue. But at that point why use mmap at
                      all, just do async reads via io_uring.
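
                       A sketch of that route with liburing: two reads at
                       different offsets queued and submitted with a single
                       syscall, so the kernel can service them concurrently
                       (file name is a placeholder; error handling elided):
                       
                         #include <fcntl.h>
                         #include <liburing.h>
                         #include <stdio.h>
                         #include <unistd.h>
                         
                         int main(void) {
                             struct io_uring ring;
                             io_uring_queue_init(8, &ring, 0);
                             int fd = open("data.bin", O_RDONLY);
                             static char a[4096], b[4096];
                             struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                             io_uring_prep_read(sqe, fd, a, sizeof a, 0);
                             sqe = io_uring_get_sqe(&ring);
                             io_uring_prep_read(sqe, fd, b, sizeof b, 1 << 20);
                             io_uring_submit(&ring);     /* one syscall */
                             for (int i = 0; i < 2; i++) {
                                 struct io_uring_cqe *cqe;
                                 io_uring_wait_cqe(&ring, &cqe);
                                 printf("read %d bytes\n", cqe->res);
                                 io_uring_cqe_seen(&ring, cqe);
                             }
                             io_uring_queue_exit(&ring);
                             close(fd);
                             return 0;
                         }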
       
                        vlovich123 wrote 13 hours 2 min ago:
                        The entire point I was trying to make at the beginning
                        of the thread is that mmap gives you memory pages in
                        the page cache that the OS can drop on memory pressure.
                         io_uring is close on the performance and
                         fine-grained access-pattern front. It’s not so good
                         on the system-wide cooperative behavior with memory,
                         and it has a higher cost: either you’re still
                         copying from the page cache into a user buffer (a
                         non-trivial performance impact versus the read
                         itself, plus trashing your CPU caches), or you’re
                         doing direct I/O and having to implement a page
                         cache manually (which risks duplicating page data
                         inefficiently in userspace if the same file is
                         accessed by multiple processes).
       
                          gpderetta wrote 9 hours 42 min ago:
                          Right, so zero copy IO but still having the ability
                           to share the page cache across processes and
                           allow the kernel to drop caches under high memory
                           pressure. One issue is that, when under pressure,
                           a process might not really be able to successfully
                           read a page, and keeps retrying and failing (with
                           an LRU replacement policy it is unlikely and
                           probably self-limiting, but still...).
       
                            kragen wrote 8 hours 2 min ago:
                            To take advantage of zero-copy I/O, which I believe
                            has become much more important since the shift from
                            spinning rust to Flash, I think applications often
                            need to adopt a file format that's amenable to
                            zero-copy access.  Examples include Arrow (but not
                            compressed Feather), HDF5, FlatBuffers, Avro, and
                            SBE.  A lot of file formats developed during the
                            spinning-rust eon require full parsing before the
                            data in them can be used, which is fine for a 1KB
                            file but suboptimal for a 1GB file.
       
                  vlovich123 wrote 18 hours 59 min ago:
                  You’ve actually understood my suggestion - thank you.
                   Unfortunately I think hint_read inherently can’t work,
                   because there’s a race between the read and how long you
                   access the page, and that race is inherent in any
                   attempted solution and still needs to be solved. Signals
                   are also the wrong abstraction mechanism (and are slow and
                   have all sorts of other problems).
                  
                   You need something more complicated, I think: like rseq
                   and futex, some shared data structure that both userspace
                   and the kernel understand how to mutate atomically. You
                   could literally use rseq to abort if the page isn’t in
                   memory and then submit an io_uring task to get signaled
                   when it gets paged in again, but rseq is a bit too coarse
                   (it’ll trigger on any preemption).
                  
                  There’s a race condition starvation danger here (it gets
                  evicted between when you get the signal and the sequence
                  completes) but something like this conceptually could maybe
                  be closer to working.
                  
                  But yes it’s inherently difficult which is why it doesn’t
                  exist but it is higher performance. And yes, this only makes
                  sense for mmap not all allocations so SIGSEGV is irrelevant
                  if looking at today’s kernels.
       
                  kragen wrote 20 hours 25 min ago:
                  If you want accessing a particular page to cause a SIGSEGV so
                  your custom fault handler gets invoked, you can just munmap
                  it, converting that access from a "non-failing page fault"
                  into one "deemed to be invalid".  Then the mechanism I
                  described would "allow[] concurrency of page faults, so the
                  [userspace threading library] is able to perform concurrent
                  reads against the underlying storage media".  As long as you
                  were aggressive enough about unmapping pages that none of
                  your still-mapped pages got swapped out by the kernel.    (Or
                  you could use mlock(), maybe.)
                  
                  I tried implementing your "hint_read" years ago in userspace
                  in a search engine I wrote, by having a "readahead thread"
                  read from pages before the main thread got to them.  It made
                  it slower, and I didn't know enough about the kernel to
                  figure out why.  I think I could probably make it work now,
                  and Linux's mmap implementation has improved enormously since
                  then, so maybe it would just work right away.
       
                    maxdamantus wrote 13 hours 45 min ago:
                    The point about inducing segmentation faults is interesting
                    and sounds like it could work to implement the `hint_read`
                    mechanism. I guess it would mostly be a question of how
                    performant userfaultfd or SIGSEGV handling is. In any case
                     it will be sub-optimal compared to having it in the kernel's own
                    fault handler, since each userfaultfd read or SIGSEGV
                    callback is already a user-kernel-user switch, and it still
                    needs to perform another system call to do the actual
                    reads, and even more system calls to mmap the bits of
                    memory again.
                    
                    Presumably having fine-grained mmaps will be another source
                    of overhead. Not to mention that each mmap requires another
                    system call. Instead of a single fault or a single call to
                    `readv`, you're doing many `mmap` calls.
                    
                    > I tried implementing your "hint_read" years ago in
                    userspace in a search engine I wrote, by having a
                    "readahead thread" read from pages before the main thread
                    got to them.
                    
                    Yeah, doing it in another thread will also have quite a bit
                    of overhead. You need some sort of synchronisation with the
                    other thread, and ultimately the "readahead" thread will
                    need to induce the disk reads through something other than
                    a page fault to achieve concurrent reads, since within the
                    readahead thread, the page faults are still synchronous,
                    and they don't know what the future page faults will be.
                    
                    It might help to do `readv` into dummy buffers to force the
                    kernel to load the pages from disk to memory, so the
                    subsequent page faults are minor instead of major. You're
                    still not reducing the number of page faults though, and
                    the total number of mode switches is increased.
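
                     A sketch of that dummy-buffer trick; note readv/preadv
                     scatters across buffers in memory, not across file
                     offsets, so each discontiguous range needs its own call:
                     
                       #include <fcntl.h>
                       #include <sys/uio.h>
                       
                       /* Read a contiguous file range into throwaway buffers
                          so its pages land in the page cache, making later
                          faults on the mapping minor instead of major. */
                       static char dummy[2][4096];
                       
                       void prefetch(int fd, off_t off) {
                           struct iovec iov[2] = {
                               { dummy[0], sizeof dummy[0] },
                               { dummy[1], sizeof dummy[1] },
                           };
                           preadv(fd, iov, 2, off);  /* result ignored: only
                                                        the caching matters */
                       }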
                    
                    Anyway, all of these workarounds are very complicated and
                    will certainly be a lot more overhead than vectored IO, so
                    I would recommend just doing that. The overall point is
                    that using mmap isn't friendly to concurrent reads from
                    disk like io_uring or `readv` is.
                    
                    Major page faults are basically the same as synchronous
                    read calls, but Golang read calls are asynchronous, so the
                    OS thread can continue doing computation from other
                    Goroutines.
                    
                    Fundamentally, the benchmarks in this repository are broken
                    because in the mmap case they never read any of the data
                    [0], so there are basically no page faults anyway. With a
                    well-written program, there shouldn't be a reason that mmap
                    would be faster than IO, and vectored IO can obviously be
                    faster in various cases.
                    
                    [0] Eg, see here where the byte slice is assigned to `_`
                    instead of being used:
                    
  HTML              [1]: https://github.com/perbu/mmaps-in-go/blob/7e24f154...
       
                      immibis wrote 9 hours 5 min ago:
                      Inducing segmentation faults is literally how the kernel
                      implements memory mapping, and virtual memory in general,
                      by the way. From the CPU's perspective, that page is
                       unmapped. The kernel gets its equivalent of a SIGSEGV
                       signal, a "page fault" interrupt (interrupts are to
                       the kernel what signals are to a process), checks its
                       own private tables,
                      decides the page is currently on disk, schedules it to be
                      read from disk, does other stuff in the meantime, and
                      when the page has finished being read from disk, it
                      returns from the interrupt.
                      
                      (It does get even deeper than that: from the CPU's
                      perspective, the interrupt is very brief, just long
                      enough to take note that it happened and avoid switching
                      back to the thread that page-faulted. The rest of the
                      stuff I mentioned, although logically an "interrupt" from
                      the application's perspective, happens with the CPU's "am
                      I handling an interrupt?" flag set to false. This is
                      equivalent to writing a signal handler that sets a flag
                      saying the thread is blocked, edits its own return
                      address so it will return to the scheduler instead of the
                      interrupted code, then calls sigreturn to exit the signal
                      handler.)
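
                       On Linux/x86-64 that last trick looks roughly like
                       this sketch (current_thread and scheduler_trampoline
                       are hypothetical; _GNU_SOURCE is needed for REG_RIP):
                       
                         #define _GNU_SOURCE
                         #include <signal.h>
                         #include <ucontext.h>
                         
                         static void on_fault(int sig, siginfo_t *si,
                                              void *ucv) {
                             ucontext_t *uc = ucv;
                             current_thread->blocked = 1;  /* note the fault */
                             /* Edit the return address, so sigreturn resumes
                                in the scheduler, not the faulting code. */
                             uc->uc_mcontext.gregs[REG_RIP] =
                                 (greg_t)scheduler_trampoline;
                         }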
       
                      vlovich123 wrote 13 hours 6 min ago:
                       munmap + signal handling is terrible, not least
                       because you don’t want to be fucking with the page
                       table that way: an unmap involves a cross-CPU TLB
                       shootdown, which is slooow in a “make the entire
                       machine slow” kind of way.
       
              im3w1l wrote 1 day ago:
              I think you have a misunderstanding of how disk IO happens. The
              CPU core sends a command to the disk "I want some this and that
              data", then the CPU core can go do something else while the disk
              services that request. From what I read the disk actually puts
              the data directly into memory by using DMA, without needing to
              involve the CPU.
              
               So far so good, but then the question is how to ensure that the
               CPU core has something more productive to do than just check
               "did the data arrive yet?" over and over, and coordinating that
               is where good APIs come in.
       
                lmz wrote 23 hours 42 min ago:
                It's hard to say on one hand "I use mmap because I don't want
                 fancy APIs for every read" and on the other "I want to do
                something useful on page fault" because you don't want to make
                every memory read a possible interruption point.
       
                ori_b wrote 1 day ago:
                I think you have a misunderstanding of how the OS is signaled
                about disk I/O being necessary. Most of the post above was
                discussing that aspect of it, before the OS even sends the
                command to the disk.
       
                dapperdrake wrote 1 day ago:
                (Not the person you are replying to.)
                
                There is nothing in the sense of Python async or JS async that
                the OS thread or OS process in question could usefully do on
                the CPU until the memory is paged into physical RAM. DMA or no
                DMA.
                
                The OS process scheduler can run another process or thread. 
                But your program instance will have to wait.  That’s the
                point.    It doesn’t matter whether waiting is handled by a
                busy loop a.k.a. polling or by a second interrupt that wakes
                the OS thread up again.
                
                That is why Linux calls it uninterruptible sleep.
                
                EDIT: io_uring would of course change your thread from blocking
                syscalls to non-blocking syscalls.  Page faults are not a
                syscall, as GP pointed out.  They are, however, a
                context-switch to an OS interrupt handler.  That is why you
                have an OS.  It provides the software drivers for your CPU,
                MMU, and disks/storage.  Here this is the interrupt handler for
                a page fault.
       
                  hyghjiyhu wrote 12 hours 33 min ago:
                  (I am the person you are replying to)
                  
                   It could work like this: "Hey OS, I would like to process
                   these pages*; are they good to go? If not, could you fetch
                   and lock them for me?" Then, if they are ready, you process
                   them knowing they won't fault, and if they are not, you do
                   something else and try again later.
                  
                  It's a sort of hybrid of the mmap and fread  paradigms in
                  that there are both explicit read requests but the kernel can
                  also get you data on its own initiative if there are spare
                  resources for it.
                  
                  * to amortize syscall overhead.
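
                   One polling step of that protocol can be built from
                   existing pieces: mincore answers "good to go?",
                   MADV_WILLNEED starts the fetch, and mlock pins the pages
                   while they're processed. A sketch (process is a
                   hypothetical callback; addr must be page-aligned):
                   
                     #include <sys/mman.h>
                     #include <unistd.h>
                     
                     /* Returns 1 if the range was resident and processed,
                        0 if a fetch was started and the caller should do
                        other work and retry later. */
                     int try_process(void *addr, size_t len,
                                     void (*process)(void *, size_t)) {
                         long ps = sysconf(_SC_PAGESIZE);
                         size_t n = (len + ps - 1) / ps;
                         unsigned char vec[n];
                         if (mincore(addr, len, vec) == 0) {
                             size_t i;
                             for (i = 0; i < n && (vec[i] & 1); i++)
                                 ;
                             if (i == n) {            /* all resident */
                                 mlock(addr, len);    /* pin */
                                 process(addr, len);
                                 munlock(addr, len);
                                 return 1;
                             }
                         }
                         madvise(addr, len, MADV_WILLNEED);  /* fetch */
                         return 0;
                     }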
       
                    avianlyric wrote 1 hour 16 min ago:
                     What advantages does that provide over using more OS
                     threads? Ultimately this model is based on the idea that we
                    want our programming runtimes to become increasingly
                    responsible for low level scheduling concerns that have
                    traditionally been handled by the OS scheduler.
                    
                    I can broadly understand why there may be a desire to go
                    down that path. But I’m not convinced that it would
                     produce meaningfully better performance than the current
                     abstractions. Especially if you take a step back and ask
                     the question: is mmap the right tool to be using in these
                     situations, rather than other tools like io_uring?
                    
                    To be clear I don’t know the answer to this question. But
                    the complexity of the solutions being suggested to
                    potentially improve the mmap API really make me question if
                    they’re capable of producing meaningful improvements.
       
                  bcrl wrote 23 hours 48 min ago:
                  What everyone forgets is just how expensive context switches
                  are on modern x86 CPUs.  Those 512 bit vector registers fill
                  up a lot of cache lines.  That's why async tends to win over
                  processes / threads for many workloads.
       
            im3w1l wrote 1 day ago:
            There are apis that sort of let you do it: mincore, madvise,
            userfaultfd.
       
              bcrl wrote 23 hours 47 min ago:
              None of those APIs are cheap enough to call in a fast path.
       
                gpderetta wrote 16 hours 16 min ago:
                 No syscall will be cheap to call in a fast path. You would need
                 a hardware instruction that tells you if a load or store would
                fault.
       
                  vlovich123 wrote 12 hours 52 min ago:
                  Rather than a direct syscall, you could imagine something
                  like rseq where you have a shared userspace / kernel data
                  structure where the userspace code gets aborted and restarted
                  if the page was evicted while being processed. But making
                  this work correctly and actually not have a perf overhead and
                   also be an ergonomic API is super hard. In practice,
                   people who care are probably satisfied by direct I/O via
                   io_uring with a custom page cache; a truly optimal
                   implementation, where the OS can still manage and evict
                   file pages but the application still knows when that
                   happened, isn’t worth it.
       
                    bcrl wrote 30 min ago:
                    Unfortunately, a lot of the shared state with userland
                    became much more difficult to implement securely when the
                    Meltdown and Spectre (and others) exploits became concerns
                     that had to be mitigated.  They make the OS's job a heck
                    of a lot harder.
                    
                    Sometimes I feel modern technology is basically a
                    delicately balanced house of cards that falls over when
                    breathed upon or looked at incorrectly.
       
                  zozbot234 wrote 15 hours 54 min ago:
                  > You would need an hardware instruction that tells you if a
                  load or store would fault.
                  
                  You have MADV_FREE pages/ranges. They get cleared when
                  purged, so reading zeros tells you that the load would have
                  faulted and needs to be populated from storage.
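
                   A sketch of that pattern for a private anonymous cache
                   page, assuming a nonzero sentinel byte at a known offset
                   (MADV_FREE applies to anonymous mappings):
                   
                     #include <sys/mman.h>
                     
                     /* buf is a private anonymous mapping; buf[0] always
                        holds a nonzero sentinel while the data is valid. */
                     void release(char *buf, size_t len) {
                         /* The kernel may now reclaim these pages lazily
                            under memory pressure. */
                         madvise(buf, len, MADV_FREE);
                     }
                     
                     int purged(const char *buf) {
                         /* Reclaimed MADV_FREE pages read back as zeros, so
                            a zero sentinel means the data is gone.  Any
                            write cancels the FREE state, "locking" the page
                            again. */
                         return buf[0] == 0;
                     }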
       
                    vlovich123 wrote 12 hours 49 min ago:
                    MADV_FREE is insufficient - userspace doesn’t get a
                    signal from the OS to know when there’s system wide
                     memory pressure, and having userspace try to respond to
                     such a signal would be counterproductive and slow in a
                     kernel operation that needs to be a fast path. It’s more
                     that you want to madvise a memory range into the page
                     cache and then have some way to have a shared data
                     structure where you are told
                    if it’s still resident and can lock it from being paged
                    out.
       
                      bcrl wrote 19 min ago:
                      MADV_FREE is also extremely expensive.    CPU vendors have
                      finally simplified TLB shootdown in recent CPUs with both
                      AMD and Intel now having instructions to broadcast TLB
                      flushes in hardware, which gets rid of one of the worst
                      sources of performance degradation in threaded multicore
                      applications (oh the pain of IPIs mixed with TLB
                      flushing!).  However, it's still very expensive to walk
                      page tables and free pages.
                      
                      Hardware reference counting of memory allocations would
                      be very interesting.  It would be shockingly simple to
                      implement compared to many other features hardware
                      already has to tackle.
       
                      zozbot234 wrote 11 hours 3 min ago:
                      > userspace doesn’t get a signal from the OS to know
                      when there’s system wide memory pressure
                      
                       Memory pressure indicators exist. [1]
                       
                       > have some way to have a shared data structure where
                       you are told if it’s still resident and can lock it
                       from being paged out.
                       
                       What's more efficient than fetching data and comparing
                       it with zero? Any write within the range will then
                       cancel the MADV_FREE property on the written-to page,
                       thus "locking" it again, and this is also very
                       efficient.
                      
  HTML                [1]: https://docs.kernel.org/accounting/psi.html
       
        nawgz wrote 1 day ago:
        Sounds interesting. Why wouldn’t the OS itself default to this
        behavior? Could it fall apart under load, or is it just not important
        enough to replace the legacy code relying on it?
       
          perbu wrote 15 hours 51 min ago:
          The point is that invoking the OS has a cost. Using mmap, for those
          situations where it makes sense, lets you avoid that cost.
       
          kragen wrote 17 hours 5 min ago:
          Multics did default to this behavior, but Unix was written on the
          PDP-7 and later the PDP-11, neither of which supported virtual memory
          or paging, so the Unix system call interface necessarily used read()
          and write() calls instead.
          
          This permitted the use of the same system calls on files, on the
          teletype, on the paper tape reader and punch, on the magtape, on the
          line printer, and eventually on pipes.    Even before pipes, the
          ability to "record" a program's terminal output in a file or "play
          back" simulated user input from a file made Unix especially
          convenient.
          
          But pipes, in turn, permitted entire programs on Unix to be used as
          the building blocks of a new, much more powerful programming
          language, one where you manipulated not just numbers or strings but
          potentially endless flows of data, and which could easily orchestrate
          computations that no single program in the PDP-11's 16-bit address
          space could manage.
          
          And that was how Unix users in the 01970s had an operating system
          with the entire printed manual available in the comprehensive online
          help system, a way to produce publication-quality documents on the
          phototypesetter, incremental recompilation, software version control,
          full-text search, WYSIWYG screen editors that could immediately jump
          to the definition of a function, networked email, interactive
          source-level debugging, a relational database, etc.—all on a 16-bit
          computer that struggled to run half a million instructions per
          second, which at most companies might have been relegated to
          controlling some motors and heaters in a chemical plant or something.
          
          It turns out that often what you can do matters even more than how
          fast you can do it.
       
          toast0 wrote 21 hours 34 min ago:
          > Why wouldn’t the OS itself default to this behavior? Could it
          fall apart under load, or is it just not important enough to replace
          the legacy code relying on it?
          
          Mmap and read/write syscalls are both ways to interact with files,
          but they have different behaviors. You can't exactly swap one for the
           other without knowledge of the caller. What you likely do see is that
           OS utilities use mmap when it makes sense and makes a difference.
          
          You also have a lot of things that can work on files or pipes/etc and
          having a common interface is more useful than having more potential
          performance (sometimes the performance is enough to warrant writing
          it twice).
       
          trenchpilgrim wrote 1 day ago:
             1. mmap was added to Unix later by Sun; it wasn't in the original
             Unix.
          
          2. As the article points out mmap is very fast for reading huge
          amounts of data but is a lot slower at other file operations. For
          reading smallish files, which is the majority of calls most software
          will make to the filesystem, the regular file syscalls are better.
          
          3. If you're on a modern Linux you might be better off with io_uring
          than mmap.
       
            scottlamb wrote 1 day ago:
            All true, and it's not just performance either. The API is just
            different. mmap data can change at any time. In fact, if the file
            shrinks, access to a formerly valid region of memory has behavior
            that is unspecified by the Single Unix Specification. (On Linux, it
            causes a SIGBUS if you access a page that is entirely invalid;
            bytes within the last page after the last valid byte probably are
            zeros or something? unsure.)
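
             A small demonstration of that Linux behavior (the temp file
             path is a placeholder; the program dies with SIGBUS by design):
             
               #include <fcntl.h>
               #include <sys/mman.h>
               #include <unistd.h>
               
               int main(void) {
                   long ps = sysconf(_SC_PAGESIZE);
                   int fd = open("/tmp/shrinkme", O_RDWR | O_CREAT, 0600);
                   ftruncate(fd, ps);                  /* one valid page */
                   char *p = mmap(0, ps, PROT_READ, MAP_SHARED, fd, 0);
                   ftruncate(fd, 0);                   /* shrink while mapped */
                   return p[0];  /* page now entirely past EOF: SIGBUS */
               }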
            
            In theory I suppose you could have a libc that mostly emulates
            read() and write() calls on files [1] with memcpy() on mmap()ed
            regions. But I don't think it'd be quite right. For one thing, that
            read() behavior after shrink would be a source of error.
            
            Higher-level APIs might be more free to do things with either mmap
            or read/write.
            
            [1] just on files; so it'd have to track which file descriptors are
            files as opposed to sockets/pipes/etc, maintaining the cached
            lengths and mmap()ed regions and such. libc doesn't normally do
            that, and it'd go badly if you bypass it with direct system calls.
       
              nawgz wrote 20 hours 28 min ago:
              Interesting callouts! Thanks
       
       