COMMENT PAGE FOR:
HTML How memory maps (mmap) deliver faster file access in Go
lzaf wrote 8 hours 52 min ago:
In a similar vein, some time ago I had written a small toy lib that
emulates the os.File interface for mmap-backed files: [1]
It even handled bus errors without panicking: [2]
/blob/186f714343906bb9304ad5f30...
Read and write performance was usually better, especially with larger
write sizes.
Compared to os.File:
~/src/yammap$ go test -benchtime=4s -bench .
goos: linux
goarch: amd64
pkg: github.com/zaf/yammap
cpu: AMD Ryzen 9 5900X 12-Core Processor
BenchmarkWrite-24 29085 164744 ns/op 25459.52 MB/s
BenchmarkOSWrite-24 22204 215131 ns/op 19496.54 MB/s
BenchmarkRead-24 29113 166820 ns/op 25142.72 MB/s
BenchmarkOSRead-24 27451 172685 ns/op 24288.69 MB/s
HTML [1]: https://github.com/zaf/yammap
HTML [2]: https://github.com/zaf/yammap/blob/186f714343906bb9304ad5f30b1...
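For anyone curious what the core of such a wrapper looks like, here is a minimal, hypothetical sketch of an mmap-backed ReadAt (not the yammap code; assumes golang.org/x/sys/unix on a Unix-like system):

package mmfile

import (
    "io"
    "os"

    "golang.org/x/sys/unix"
)

// File is a read-only, mmap-backed file that satisfies io.ReaderAt,
// similar in spirit to hiding mmap behind an os.File-like interface.
type File struct {
    data []byte
}

// Open maps the whole file into memory read-only.
func Open(path string) (*File, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close() // the mapping outlives the descriptor

    fi, err := f.Stat()
    if err != nil {
        return nil, err
    }
    data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()),
        unix.PROT_READ, unix.MAP_SHARED)
    if err != nil {
        return nil, err
    }
    return &File{data: data}, nil
}

// ReadAt copies out of the mapping, following io.ReaderAt semantics.
func (f *File) ReadAt(p []byte, off int64) (int, error) {
    if off < 0 || off >= int64(len(f.data)) {
        return 0, io.EOF
    }
    n := copy(p, f.data[off:])
    if n < len(p) {
        return n, io.EOF
    }
    return n, nil
}

// Close unmaps the file.
func (f *File) Close() error {
    return unix.Munmap(f.data)
}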
benjiro wrote 15 hours 42 min ago:
People are so focused on the mmap part, and the latency, that the usage
is overlooked.
> The last couple of weeks I've been working on an HTTP-backed
filesystem.
It feels like these are micro-optimizations that are going to get
swamped by the whole HTTP cycle anyway.
There is also the benchmark issue:
The enhanced CDB format seems to be focused on read-only benefits, as
writes introduce a lot of latency and issues with mmap. In other
words, there is a need to freeze for the mmap, then unfreeze and write
for updates, then freeze for mmap again ...
This cycle introduces overhead, does it not? Has this been benchmarked?
Because from what I am seeing, the benefits are mostly in the frozen
state (aka read-only).
If the data is changed infrequently, why not just use JSON? No matter
how slow it is, if you're just going to do HTTP requests for the
directory listing, your overhead is not the actual file format.
If this enhanced file format were used as file storage, and you want to
be able to read files fast, that is a different matter. Then there are
ways around it by keeping "part" files where files 1 ... 1000 are in
file.01, files 1001 ... 2000 in file.02 (thus reducing overhead from the
file system), and those are memory-mapped for fast reading, with
updates handled as invalidated files/rewrites (as I do not see any
delete/vacuum ability in the file format).
So, the actual benefits just for a file directory listing db escape
me.
dahfizz wrote 12 hours 29 min ago:
This reads like complete nonsense. If HTTP is involved, let's just
give up and make the system as slow as possible?
The HTTP request needs to actually be actioned by the server before
it can respond. Reducing the time it takes for the server to do the
thing (accessing files) will meaningfully improve overall
performance.
Switching out to JSON will meaningfully degrade performance. For no
benefit.
benjiro wrote 11 hours 28 min ago:
> If HTTP is involved, let's just give up and make the system as
slow as possible?
Did I write that? Please leave flamebait out of these discussions.
The original author answered (today) why they wanted to use this
approach and what the benefits are. That has been missing from this
entire discussion, so I really do not understand where you get this
confidence.
> Switching out to JSON will meaningfully degrade performance. For
no benefit.
We did not know why or how the system was used; now that we know it
is used as a transport medium between the db/nodes, it's clearer
why JSON is an issue for them. That does not explain how you
conclude it will "meaningfully degrade performance" when this
information was not available to any of us.
perbu wrote 13 hours 38 min ago:
We need to support over 10M files in each folder. JSON wouldn't fare
well as the lack of indices makes random access problematic.
Composing a JSON file with many objects is, at least with the current
JSON implementation, not feasible.
CDB is only a transport medium. The data originates in PostgreSQL and,
upon request, is stored in CDB and transferred. Writing/freezing to CDB
is faster than encoding JSON.
CDB also makes it possible to access it directly, with ranged HTTP
requests. It isn't something I've implemented, but having the option
to do so is nice.
benjiro wrote 11 hours 31 min ago:
> CDB is only a transport medium. The data originates in PostgreSQL
and upon request, stored in CDB and transferred. Writing/freezing
to CDB is faster than encoding JSON.
Might have been interesting to actually include this in the
article, do you not think so? ;-)
The way the article is written made it seem that you used CDB on
edge nodes to store metadata, with no information as to what you're
storing/accessing, how, or why ... This is part of the reason we have
these discussions here.
perbu wrote 10 hours 2 min ago:
The post is about mmap and my somewhat successful use of it. If
I'd described my whole stack it would have been a small thesis
and not really interesting.
karel-3d wrote 16 hours 41 min ago:
mmap is fine when you know the file fits in memory, and you need random
file reads/writes of only some parts of the file. It's not magic.
It's also quite hard to debug in Go, because mmapped files are not
visible in pprof; when you run out of memory, mmap starts behaving
really suboptimally. And it's hard to see which file takes how much
memory (again, it doesn't show in pprof).
perbu wrote 10 hours 6 min ago:
random reads are ok. writes through a mmap are a disaster.
vlowther wrote 9 hours 2 min ago:
Only if you are doing in-place updates. If append-only datastores
are your jam, writes via mmap are Just Fine:
$ go test -v
=== RUN TestChunkOps
chunk_test.go:26: Checking basic persistence and Store
expansion.
chunk_test.go:74: Checking close and reopen read-only
chunk_test.go:106: Checking that readonly blocks write ops
chunk_test.go:116: Checking Clear
chunk_test.go:175: Checking interrupted write
--- PASS: TestChunkOps (0.06s)
=== RUN TestEncWriteSpeed
chunk_test.go:246: Wrote 1443 MB/s
chunk_test.go:264: Read 5525.418751 MB/s
--- PASS: TestEncWriteSpeed (1.42s)
=== RUN TestPlaintextWriteSpeed
chunk_test.go:301: Wrote 1693 MB/s
chunk_test.go:319: Read 10528.744206 MB/s
--- PASS: TestPlaintextWriteSpeed (1.36s)
PASS
gethly wrote 18 hours 8 min ago:
I have never used mmap, as I had no need, but I know BoltDB uses it and
from what I remember, mmap is good for when you are working with
whole disk pages, which BoltDB does. Otherwise it seems to be the wrong
use case for it?
philippta wrote 18 hours 51 min ago:
At computerenhance.com [1], Casey Muratori shows that memory-mapped
files actually perform worse at sequential reads, which is the common
case for file access.
That's because the CPU won't prefetch data as effectively and has
to rely on page faults to know what to read next. With regular,
sequential file reads, the CPU can be much smarter and prefetch the
next page while the program is consuming the previous one.
HTML [1]: https://www.computerenhance.com/p/memory-mapped-files
atombender wrote 10 hours 7 min ago:
Does madvise(..., MADV_SEQUENTIAL) not help here?
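For reference, issuing that hint from Go is a one-liner; a minimal sketch assuming golang.org/x/sys/unix and a slice obtained from unix.Mmap (whether it recovers the lost readahead in practice would still need measuring):

package mmadvise

import "golang.org/x/sys/unix"

// AdviseSequential hints the kernel that data (a slice returned by
// unix.Mmap) will be read front to back, so it can read ahead more
// aggressively and drop already-consumed pages sooner. It is only a
// hint; the kernel is free to ignore it.
func AdviseSequential(data []byte) error {
    return unix.Madvise(data, unix.MADV_SEQUENTIAL)
}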
vlovich123 wrote 18 hours 14 min ago:
io_uring should outperform both - you can configure the readahead
optimally, there are no page faults, and there are no copies as
there are with buffered I/O:
HTML [1]: https://archive.is/vkdCo
charlietap wrote 20 hours 1 min ago:
This article is nonsensical. If you're reading this please don't start
mmap'ing files just to read from them. It proposes an incredibly
unrealistic scenario where the program is making thousands of random
incredibly small unbuffered reads from a file. In reality, 99 percent of
programs will sequentially read bytes into a buffer, which makes
orders of magnitude fewer syscalls.
Mmap is useful in niche scenarios, it's not magic.
icedchai wrote 11 hours 32 min ago:
At a previous company, we had a custom "database" (I use that term
very loosely) built on memory mapped files. At startup, all pages
were read to ensure the data was hot and page faults were unlikely.
It worked well for the application, but obviously because the whole
thing fit in memory and was preloaded. We also had our own custom
write-ahead-log. Today, I'd probably use sqlite.
perbu wrote 16 hours 25 min ago:
This is a niche scenario. The scenario outlined is reading CDB
databases.
karel-3d wrote 16 hours 38 min ago:
That is not unrealistic if you are using the file to save binary data
at given positions and don't need to read all the data. For example, if
you have a big matrix of fixed-size structs and you need to read only
some of them.
Animats wrote 20 hours 43 min ago:
I never knew that Linux memory mapped files were copy-on-write. I'd
assumed they let you alter the page and wrote out dirty pages later.
pengaru wrote 20 hours 9 min ago:
MAP_PRIVATE vs. MAP_SHARED
kragen wrote 1 day ago:
The simple answer to "How do memory maps (mmap) deliver faster file
access?" is "sometimes", but the blog post does give some more details.
I was suspicious of the 25× speedup claim, but it's a lot more
plausible than I thought.
On this Ryzen 5 3500U running mostly at 3.667GHz (poorly controlled),
reading data from an already-memory-mapped page is as fast as memcpy
(about 10 gigabytes per second when not cached on one core of my
laptop, which works out to 0.1 nanoseconds per byte, plus about 20
nanoseconds of overhead) while lseek+read is two system calls (590ns
each) plus copying bytes into userspace (26–30ps per byte for small
calls, 120ps per byte for a few megabytes). Small memcpy (from, as it
happens, an mmapped page) also costs about 25ps per byte, plus about
2800ps per loop iteration, probably much of which is incrementing the
loop counter and passing arguments to the memcpy function (GCC is
emitting an actual call to memcpy, via the PLT).
So mmap will always be faster than lseek+read on this machine, at least
if it doesn't have a page fault, but the point at which memcpy from
mmap would be 25× faster than lseek+read would be where 2×590 + .028n
= 25×(2.8 + .025n) = 70 + .625n. Which is to say 1110 = .597n ∴ n =
1110/.597 = 1859 bytes. At that point, memcpy from mmap should be 49ns
and lseek+read should be 1232ns, which is 25× as big. You can cut
that size more than in half if you use pread() instead of lseek+read,
and presumably io_uring would cut it even more. If we assume that
we're also taking cache misses to bring in the data from main memory in
both cases, we have 2×590 + .1n = 25×(2.8 + .1n) = 70 + 2.5n, so 1110
= 2.4n ∴ n = 1110/2.4 = 462 bytes.
On the other hand, mmap will be slow if it's hitting a page fault,
which sort of corresponds to the case where you could have cached the
result of lseek+read in private RAM, which you could do on a
smaller-than-pagesize granularity, which potentially means you could
hit the slow path much less often for a given working set. And
lseek+read has several possible ways to make the I/O asynchronous,
while the only way to make mmap page faults asynchronous is to hit the
page faults in different threads, which is a pretty heavyweight
mechanism.
On the other hand, lseek+read with a software cache is sort of using
twice as much memory (one copy is in the kernel's buffer cache and
another copy is in the application's software cache) so mmap could
still win. And, if there are other processes writing to the data being
queried, you need some way to invalidate the software cache, which can
be expensive.
(On the gripping hand, if you're reading from shared memory while other
processes are updating it, you're probably going to need some kind of
locking or lock-free synchronization with those other processes.)
So I think a reasonably architected lseek+read (or pread) approach to
the problem might be a little faster or a little slower than the mmap
approach, but the gap definitely won't be 25×. But very simple
applications or libraries, or libraries where many processes might be
simultaneously accessing the same data, could indeed get 25× or even
256× performance improvements by letting the kernel manage the cache
instead of trying to do it themselves.
Someone at a large user of Varnish told me they've mostly removed mmap
from their Varnish fork for performance.
kragen wrote 21 hours 3 min ago:
It's worth reading bcrl's comment at [1] for more depth on some of
these issues.
HTML [1]: https://news.ycombinator.com/item?id=45690006
loeg wrote 22 hours 41 min ago:
> lseek+read is two system calls
You'd never do that, though -- you'd use pread.
kragen wrote 21 hours 7 min ago:
The article I'm commenting on said its author used seek and read,
so I don't know if maybe for some reason they did do that instead
of pread(), which it also mentioned. I didn't want to
optimistically assume otherwise. Is pread() available in the
Golang standard library? [1] is someone using os.File.ReadAt, which
is a method name that makes me even more uncertain. But there's
also syscall.Pread apparently, so it should be fine?
If you are making only one system call, the 25× crossover point is
800-some bytes by my measurements.
HTML [1]: https://github.com/golang/go/issues/19563
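For what it's worth, os.File.ReadAt is implemented with pread(2) on Linux, so plain Go already gets the single-syscall version; a small sketch with a made-up fixed-record file:

package main

import (
    "fmt"
    "os"
)

const recordSize = 100 // hypothetical fixed-size records

// readRecord fetches one record with os.File.ReadAt, which on Linux is
// implemented with pread(2): one system call, no separate lseek.
func readRecord(f *os.File, idx int64) ([]byte, error) {
    buf := make([]byte, recordSize)
    if _, err := f.ReadAt(buf, idx*recordSize); err != nil {
        return nil, err
    }
    return buf, nil
}

func main() {
    f, err := os.Open("records.bin") // hypothetical data file
    if err != nil {
        panic(err)
    }
    defer f.Close()

    rec, err := readRecord(f, 42)
    if err != nil {
        panic(err)
    }
    fmt.Printf("record 42 starts with % x\n", rec[:8])
}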
liuliu wrote 1 day ago:
mmap is a good crutch when you 1. don't have busy polling / async IO
API available and want to do some quick & dirty preloading tricks; 2.
don't want to manage the complexity of an in-memory cache, especially
cross-process ones.
Obviously if you have kernel-backed async IO APIs (io_uring) and are
willing to dig into the deeper end (for a better managed cache), you can
get better performance than mmap. But in many cases, mmap is
"good-enough".
gustavpaul wrote 1 day ago:
The MmapReader is not copying the requested byte range into the buf
argument, so if ever the underlying file descriptor is closed (or the
file truncated out of band) any subsequent slice access will throw
SIGBUS, which is really unpleasant.
It also means the latency due to pagefaults is shifted from inside
mmapReader.ReadRecord() (where it would be expected) to wherever in the
application the bytes are first accessed, leading to spooky,
unpredictable latency spikes in what are otherwise pure functions.
That inevitably leads to wild arguments about how bad GC stalls are :-)
An apples to apples comparison should be copying the bytes from the
mmap buffer and returning the resulting slice.
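A hypothetical copying variant could look like this (type and method names are made up, not the article's actual reader):

package mmapdemo

import "fmt"

// mmapReader is a stand-in for the reader under discussion; data is
// assumed to hold the mmap'ed file contents.
type mmapReader struct {
    data []byte
}

// ReadRecordCopy returns a private copy of the requested byte range, so
// the caller never holds a slice into the mapping and any page-fault
// latency is paid here, inside the read call, rather than at some later
// surprising access.
func (r *mmapReader) ReadRecordCopy(off, n int) ([]byte, error) {
    if off < 0 || n < 0 || off+n > len(r.data) {
        return nil, fmt.Errorf("record out of range: off=%d n=%d", off, n)
    }
    out := make([]byte, n)
    copy(out, r.data[off:off+n]) // the copy touches every page of the record
    return out, nil
}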
dahfizz wrote 12 hours 25 min ago:
Being able to avoid an extra copy is actually a huge performance gain
when you can safely do it. You shouldn't discount how useful mmap is
just because it's not useful in every scenario.
You shouldn't replace every single file access with mmap. But when it
makes sense, mmap is a big performance win.
loeg wrote 22 hours 45 min ago:
> so if ever the underlying file descriptor is closed
Nit: Mmap mapping lifetimes are not attached to the underlying fd.
The file truncation and latency concerns are valid, though.
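A quick way to see that from Go (sketch; golang.org/x/sys/unix, made-up filename):

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    f, err := os.Open("data.bin") // hypothetical, non-empty file
    if err != nil {
        panic(err)
    }
    fi, err := f.Stat()
    if err != nil {
        panic(err)
    }
    data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()),
        unix.PROT_READ, unix.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    f.Close() // closing the descriptor does not tear down the mapping

    fmt.Println(data[0]) // still readable
    _ = unix.Munmap(data) // only munmap releases the mapping
}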
dapperdrake wrote 23 hours 54 min ago:
It's not accessible until it is in user space. (Virtual memory
addresses mapped to physical RAM holding the data.)
Good point.
Ingon wrote 1 day ago:
When I adopted mmap in klevdb [1], I saw dramatic performance
improvements. So, even as klevdb completes a write segment, it will
reopen the segment, on demand, for reading with mmap (segments are
basically parts of a write-only log). With this, any random reads are
super fast (but of course not as fast as sequential ones).
HTML [1]: https://github.com/klev-dev/klevdb
commandersaki wrote 1 day ago:
This is a good article but I'm wondering what is the relationship
between this website/company and varnish-cache.org, since in the
article they make claims of releasing Varnish Cache, and the article
wasn't written by Poul-Henning Kamp.
wmf wrote 23 hours 15 min ago:
Varnish hasn't been a solo project for many years. Also PHK's version
is now called Vinyl Cache while the corporate fork is called Varnish.
commandersaki wrote 18 hours 8 min ago:
The article says "when we launched Varnish Cache back in 2006". Who
is we? My memory was that around that time PHK released it to the
world and was the sole developer at the time.
kragen wrote 17 hours 17 min ago:
I was wondering about this too. He apparently worked at the
company for a while? Did he found it?
perbu wrote 16 hours 40 min ago:
Yes. When Varnish Cache launched, in 2006, I worked in a rather
small OSS consultancy, which did the Linux port of Varnish
Cache and provided maintenance and funding for the project.
kragen wrote 16 hours 35 min ago:
You say, "Yes. When Varnish Cache launched, in 2006, I worked
in a rather small OSS consultancy, which did the Linux port
of Varnish Cache and provided maintenance and funding for the
project."
But eventually phk left, and you came into conflict with him
over the name, which was resolved by him choosing a different
name for his version of Varnish?
perbu wrote 10 hours 8 min ago:
Not really.
We've been funding phk's work on Varnish and Vinyl Cache for
20 years. Do you think phk can write, maintain and release
something on his own? Vinyl Cache cannot be a one-man-show,
be real.
kragen wrote 6 hours 53 min ago:
(I do, in fact, think phk can write, maintain, and
release something on his own.)
perbu wrote 5 hours 39 min ago:
He knows a lot of things and is amongst the best
software developers I've worked with, but on a project
like this you need a lot more breadth than any single
developer can bring.
kragen wrote 9 hours 38 min ago:
I see. Thank you for explaining!
mholt wrote 1 day ago:
Just this month, I've learned the hard way that some file systems do
not play well with mmap: [1] In my case, it seems that Mac's ExFAT
driver is incompatible with sqlite's WAL mode because the driver
returned a memory address that is misaligned on ARM64. Most bizarre
error I've encountered in years.
So, uh, mind your file systems, kids!
HTML [1]: https://github.com/mattn/go-sqlite3/issues/1355
vlovich123 wrote 1 day ago:
I would be very careful about that conclusion. Reading that thread it
sounds like you're relying on Claude to make this conclusion but
you haven't actually verified what the address being returned
actually is.
The reason I'm skeptical is threefold. The first is that it's
generally impossible for a filesystem to return an mmap pointer
that's not page-boundary aligned. The second is that unaligned
accesses are still fine on modern ARM and are not a SIGBUS. The third
is that Claude's reasoning that the pointer must be 8-byte aligned and
that this indicates a misaligned read is flawed - how do you know that
SQLite isn't doing a 2-byte read at that address?
If you really think it's a bad alignment it should be trivial to
reproduce - mmap the file explicitly and print the address, or modify
the SQLite source to print the mmap location it gets.
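A reproduction along those lines is only a few lines of Go (sketch; assumes golang.org/x/sys/unix and a non-empty file passed as the first argument):

package main

import (
    "fmt"
    "os"
    "unsafe"

    "golang.org/x/sys/unix"
)

// Maps the file named on the command line (e.g. the SQLite database or
// -wal file on the ExFAT volume) and reports whether the mapping is
// page aligned.
func main() {
    f, err := os.Open(os.Args[1])
    if err != nil {
        panic(err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        panic(err)
    }
    data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()),
        unix.PROT_READ, unix.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer unix.Munmap(data)

    addr := uintptr(unsafe.Pointer(&data[0]))
    page := uintptr(os.Getpagesize())
    fmt.Printf("mapping at %#x, page aligned: %v\n", addr, addr%page == 0)
}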
mholt wrote 1 day ago:
I'd love to be wrong, but the address it's referring to is the
correct address from the error / stack trace.
I honestly don't know anything about this. There's no search
results for my error. ChatGPT and Claude and Grok all agreed one
way or another, with various prompts.
Would be happy to have some help verifying any of this. I just know
that disabling WAL mode, and not using Mac's ExFAT driver, both
fixed the error reliably.
achierius wrote 23 hours 30 min ago:
But is that the address being returned by mmap?
Furthermore, what instruction is this crashing on? You should be
able to look up the specific alignment requirements of that
instruction to verify.
> ChatGPT and Claude and Grok all agreed one way or another, with
various prompts.
This means less than you'd think: they're all trained on a
similar corpus, and Grok in particular is probably at least
partially distilled from Claude. So they tend to come to similar
conclusions given similar data.
mholt wrote 20 hours 49 min ago:
I believe it's being returned by the FS driver, not mmap()
necessarily. I think I knew what instruction it was when I was
debugging it but don't remember right now. (I could probably
dig through my LLM history and get it though.)
And yeah, I knew AI is useless, I try to avoid it, but when I'm
way over my head it's better than nothing (it did lead me to
the workaround that I mentioned in my previous comment).
vlovich123 wrote 18 hours 53 min ago:
If it was in the FS driver (which runs in the kernel / a
different process?) why would your process be dying?
MayCXC wrote 1 day ago:
wowie. mmap also dramatically improved perf for LLaMA:
HTML [1]: https://justine.lol/mmap/
kristjansson wrote 20 hours 51 min ago:
uh. there was a bit more to the story than 'yup totally unalloyed
free lunch'
buybackoff wrote 1 day ago:
It looks suspicious at 25x. Even 2.5x would be suspicious unless
reading very small records.
I assume both cases have the file already fully cached in RAM, with a
tiny size of 100MB. But the file-read-based version actually copies the
data into a given buffer, which involves cache misses to get data from
RAM to L1 for copying. The mmap version just returns the slice and it's
discarded immediately; the actual data is not touched at all. Each
record is 2 cache lines and, with random indices, is not prefetched. For
the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100
bytes from RAM to L1 should take ~100 nanos.
The benchmark compares actually getting data vs getting the data's
location. Single digit nanos is the scale of good hash table lookups
with data in CPU caches, not actual IO. For fairness, both should
use/touch the data, e.g. copy it.
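Concretely, the fairness fix is to make the mmap side pay for the bytes too; a hypothetical benchmark shape (fixture name and record layout are made up, not the repo's code):

package mmbench

import (
    "math/rand"
    "os"
    "testing"

    "golang.org/x/sys/unix"
)

const recordSize = 100 // roughly the two-cache-line records discussed above

// BenchmarkMmapReadTouched makes the mmap side actually read the bytes:
// copying each record into buf touches both of its cache lines, so the
// numbers become comparable to a read/pread loop instead of measuring how
// fast a slice header can be constructed.
func BenchmarkMmapReadTouched(b *testing.B) {
    f, err := os.Open("testdata.cdb") // hypothetical fixture file
    if err != nil {
        b.Skip("fixture missing: ", err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        b.Fatal(err)
    }
    data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()),
        unix.PROT_READ, unix.MAP_SHARED)
    if err != nil {
        b.Fatal(err)
    }
    defer unix.Munmap(data)

    nRecords := len(data) / recordSize
    if nRecords == 0 {
        b.Skip("fixture too small")
    }
    buf := make([]byte, recordSize)
    rng := rand.New(rand.NewSource(1))

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        off := rng.Intn(nRecords) * recordSize
        copy(buf, data[off:off+recordSize]) // actually read the bytes
    }
}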
Tuna-Fish wrote 12 hours 30 min ago:
> For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading
100 bytes from RAM to L1 should take ~100 nanos.
It's important to note that throughput is not just an inverse of
latency, because modern OoO cpus with modern memory subsystems can
have hundreds of requests in flight. If your code doesn't serialize
accesses, latency numbers are irrelevant to throughput.
checker659 wrote 13 hours 20 min ago:
Latency Numbers Every Programmer Should Know (originally by Jeff Dean
/ Peter Norvig)
HTML [1]: https://gist.github.com/jboner/2841832
hyc_symas wrote 17 hours 8 min ago:
That's such an obvious error in their benchmark code. In my benchmark
code I make sure to touch the data so at least the 1st page is
actually paged in from disk.
HTML [1]: https://github.com/LMDB/dbbench/blob/1281588b7fdf119bcba65ce...
a-dub wrote 23 hours 49 min ago:
doing these sorts of benchmarks is actually quite tricky. you must
clear the page cache by allocating >1x physical ram before each
attempt.
moreover, mmap by default will load lazy, where mmap with
MAP_POPULATE will prefetch. in the former case, reporting average
operation times is not valid because the access time distributions
are not gaussian (they have a one time big hit at first touch). with
MAP_POPULATE (linux only), there is long loading delay when mmap is
first called, but then the average access times will be very low.
when pages are released will be determined by the operating system
page cache eviction policy.
the data structure on top is best chosen based on desired runtime
characteristics. if it's all going in ram, go ahead and use a
standard randomized hash table. if it's too big to fit in ram,
designing a structure that is aware of lru style page eviction
semantics may make sense (ie, a hash table or other layout that
preserves locality for things that are expected to be accessed in a
temporally local fashion.)
codedokode wrote 18 hours 35 min ago:
> you must clear the page cache
In Linux there is a /proc/sys/vm/drop_caches pseudo file that does
this. Look how great Linux is compared to other OSes.
a-dub wrote 11 hours 51 min ago:
that's super cool! live and learn. even better would be the
capability to drop caches from a supplied point in the filesystem
hierarchy.
ahoka wrote 10 hours 35 min ago:
People would run it from cron to "free memory", believe it or
not.
DoctorOW wrote 10 hours 14 min ago:
Hence,
HTML [1]: https://www.linuxatemyram.com/
kragen wrote 1 day ago:
> For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading
100 bytes from RAM to L1 should take ~100 nanos.
I think this is the wrong order of magnitude. One core of my Ryzen 5
3500U seems to be able to run memcpy() at 10 gigabytes per second
(0.1 nanoseconds per byte) and memset() at 31 gigabytes per second
(0.03 nanoseconds per byte). I'd expect a sequential read of 100
bytes to take about 3 nanoseconds, not 100 nanoseconds.
However, I think random accesses do take close to 100 nanoseconds to
transmit the starting row and column address and open the row. I
haven't measured this on this hardware because I don't have a test
I'm confident in.
bcrl wrote 23 hours 52 min ago:
100 nanoseconds from RAM is correct. Latency != bandwidth. 3
nanoseconds would be from cache or so on a Ryzen. You ain't gonna
get the benefits of prefetching on the first 100 bytes.
kragen wrote 23 hours 47 min ago:
Yes, my comment clearly specified that I was talking about
sequential reads, which do get the benefits of prefetching, and
said, "I think random accesses do take close to 100 nanoseconds".
bcrl wrote 23 hours 37 min ago:
If you're doing large amounts of sequential reads from a
filesystem, it's probably not in cache. You only get latency
that low if you're doing nothing else that stresses the memory
subsystem, which is rather unlikely. Real applications have
overhead, which is why microbenchmarks like this are useless.
Microbenchmarks are not the best first order estimate for
programmers to think of.
kragen wrote 23 hours 28 min ago:
Yes, I went into more detail on those issues in [1] , but
overhead is irrelevant to the issue we were discussing, which
is about how long it takes to read 100 bytes from memory.
Microbenchmarks are generally exactly the right way to answer
that question.
Memory subsystem bottlenecks are real, but even in real
applications, it's common for the memory subsystem to not be
the bottleneck. For example, in this case we're discussing
system call overhead, which tends to move the system
bottleneck inside the CPU (even though a significant part of
that effect is due to L1I cache evictions).
Moreover, even if the memory subsystem is the bottleneck, on
the system I was measuring, it will not push the sequential
memory access time anywhere close to 1 nanosecond per byte.
I just don't have enough cores to oversubscribe the memory
bus 30×. (1.5×, I think.) Having such a large ratio of
processor speed to RAM interconnect bandwidth is in fact very
unusual, because it tends to perform very poorly in some
workloads.
If microbenchmarks don't give you a pretty good first-order
performance estimate, either you're doing the wrong
microbenchmarks or you're completely mistaken about what your
application's major bottlenecks are (plural, because in a
sequential program you can have multiple "bottlenecks",
colloquially, unlike in concurrent systems where you almost
always have exactly one bottleneck.) Both of these problems
do happen often, but the good news is that they're fixable.
But giving up on microbenchmarking will not fix them.
HTML [1]: https://news.ycombinator.com/item?id=45689464
bcrl wrote 22 hours 44 min ago:
If you're bottlenecked on a 100 byte read, the app is
probably doing something really stupid, like not using
syscalls the way they're supposed to be used. Buffered I/O has
existed from fairly early on in Unix history, and it exists
because it is needed to deal with the mismatch between what
stupid applications want to do versus the guarantees the
kernel has to provide for file I/O.
The main benefit from the mmap approach is that the fast
path then avoids all the code the kernel has to execute,
the data structures the kernel has to touch, and everything
needed to ensure the correctness of the system. In modern
systems that means all kinds of synchronization and
serialization of the CPU needed to deal with
$randomCPUdataleakoftheweek (pipeline flushes ftw!).
However, real applications need to deal with correctness.
For example, a real database is not just going to just do
100 byte reads of records. It's going to have to take
measures (locks) to ensure the data isn't being written to
by another thread.
Rarely is it just a sequential read of the next 100 bytes
from a file.
I'm firmly in the camp that focusing on microbenchmarks
like this is frequently a waste of time in the general
case. You have to look at the application as a whole
first. I've implemented optimizations that looked great in
a microbenchmark, but showed absolutely no difference
whatsoever at the application level.
Moreover, my main hatred for mmap() as a file I/O mechanism
is that it moves the context switches when the data is not
present in RAM from somewhere obvious (doing a read() or
pread() system call) to somewhere implicit (reading 100
bytes from memory that happens to be mmap()ed and was
passed as a pointer to a function written by some other
poor unknowing programmer). Additionally, read ahead
performance for mmap()s when bringing data into RAM is
quite a bit slower than on read()s in large part because it
means that the application is not providing a hint (the
size argument to the read() syscall) to the kernel for how
much data to bring in (and if everything is sequential as
you claim, your code really should know that ahead of
time).
So, sure, your 100 byte read in the ideal case when
everything is cached is faster, but warming up the cache is
now significantly slower. Is shifting costs that way
always the right thing to do? Rarely in my experience.
And if you don't think about it (as there's no obvious
pread() syscall anymore), those microseconds and sometimes
milliseconds to fault in the page for that 100 byte read
will hurt you. It impacts your main event loop, the size
of your pool of processes / threads, etc. The programmer
needs to think about these things, and the article
mentioned none of this. This makes me think that the
author is actually quite naive and merely proud in thinking
that he discovered the magic Go Faster button without
having been burned by the downsides that arise in the Real
World from possible overuse of mmap().
kragen wrote 21 hours 4 min ago:
Perhaps surprisingly, I agree with your entire comment
from beginning to end.
Sometimes mmap can be a real win, though. The poster
child for this is probably LMDB. Varnish also does
pretty well with mmap, though see my caveat on that in my
linked comment.
bcrl wrote 16 min ago:
Varnish was very well done. It's disappointing that
with HTTPS-first nowadays there is very little
opportunity to make good use of local web caches of
web content across browsers / clients. Caches would
have been a godsend back in the 1990s when we had to
use shared dialup to connect to the internet while
using NetScape in a classroom full of computers.
Scaevolus wrote 1 day ago:
Yeah, 3.3ns is about 12 CPU cycles. You can indeed create a pointer
to a memory location that fast!
habibur wrote 1 day ago:
Is mmap still faster than fread? That might have been true in the 90s
but I was wondering about current improvements.
If you have enough free memory, the file will be cached in memory
anyway instead of residing on disk. Therefore both will be reading from
memory, albeit through different APIs.
Looking for recent benchmarks or views from OS developers.
loeg wrote 22 hours 43 min ago:
read, or fread? fread is the buffered version that does an extra
copy for no reason that would benefit this use case.
do_not_redeem wrote 1 day ago:
Even if the file is cached, fread has to do a memcpy. mmap doesn't.
gpderetta wrote 1 day ago:
fread is (usually) buffered io, so it actually does two additional
mem copies (kernel to FILE buffer then to user buffer)
assbuttbuttass wrote 11 hours 40 min ago:
Not in Go
gpderetta wrote 9 hours 55 min ago:
oh, right, this is Go ( [1] ). Do the strings it returns share
memory with the internal buffer?
HTML [1]: https://pkg.go.dev/github.com/odeke-em/go-utils/fread#...
stingraycharles wrote 1 day ago:
In our experience building a high performance database server:
absolutely. If your line of thinking is "if you have enough free
memory", then these types of optimizations aren't for you. One of
the main benefits is eliminating an extra copy.
Additionally, mmap is heavily optimized for random access, so if
that's what you're doing, then you'll have a much better time
with it than fread.
(I hope a plug is not frowned upon here: if you like this kind of
stuff, we're a fully remote company and hiring C++ devs: [1] )
HTML [1]: https://apply.workable.com/quasar/j/436B0BEE43/
YouAreWRONGtoo wrote 1 day ago:
If you can't post a salary, you shouldn't post a job opening.
(Not that you can afford me.)
Also, your company is breaking the law by false advertising. It
suggests your current leadership is fucking stupid. Why do you work
for a criminal enterprise?
jasonwatkinspdx wrote 1 day ago:
I'd be shocked if anyone would hire you after seeing this
behavior...
vlovich123 wrote 1 day ago:
What's the false advertising?
deaddodo wrote 1 day ago:
Yeah, I took a look at the posting and it's a bog-standard
job posting.
I assume they're referring to the no-salary aspect and (based
on their speech style) are in the US. But, even in that case,
it would only matter if the posting were targeted to one of the
states that require salary information and the company operated
or had a presence in said state. Since it's an EU company,
that's almost definitely not the case.
vlovich123 wrote 19 hours 9 min ago:
> and the company operated or had a presence in said state
And the company was big enough. AFAIK the salary transparency
stuff only applies when your headcount exceeds some number.
nteon wrote 1 day ago:
the downside is that the Go runtime doesn't expect memory reads to page
fault, so you may end up with stalls/latency/under-utilization if part
of your dataset is paged out (like if you have a large cdb file w/
random access patterns). Using file IO, the Go runtime could be
running a different goroutine if there is a disk read, but with mmap
that thread is descheduled but holding an m & p. I'm also not sure if
there would be increased stop the world pauses, or if the async
preemption stuff would "just work".
Section 3.2 of this paper has more details:
HTML [1]: https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
perbu wrote 15 hours 48 min ago:
This is amazingly good feedback. I hadn't thought of that at all. It
is so much harder to reason about the Go runtime as opposed to a
threaded application.
vlovich123 wrote 1 day ago:
To me this indicates a limitation of the API. Because you do want to
maintain that the kernel can page out that memory under pressure
while userspace accesses that memory asynchronously, while allowing
the thread to do other asynchronous things. There's no good
programming model/OS API that can accomplish this today.
twic wrote 9 hours 59 min ago:
There isn't today, but there was in 1991, scheduler activations:
[1] The rough idea is that if the kernel blocks a thread on
something like a page cache miss, then it notifies the program
through something a bit like a signal handler; if the program is
doing user-level scheduling, it can then take account of that
thread being blocked. The actual mechanism in the paper is more
refined than that.
HTML [1]: https://dl.acm.org/doi/10.1145/121132.121151
scottlamb wrote 9 hours 8 min ago:
Nice find. That going nowhere seems like a classic consequence of
the cyclical nature of these things: user-managed concurrency was
cool, then it wasn't, then Go (and others) brought it back.
I think the more recent UMCG [1] (kind of a hybrid approach, with
threads visible by the kernel but mostly scheduled by userspace)
handles this well. Assuming it ever actually lands in upstream,
it seems reasonable to guess Go would adopt it, given that both
originate within Google.
It's worth pointing out that the slow major page fault problem is
not unique to programs using mmap(..., fd, ...). The program
binary is implicitly mmaped, and if swap is enabled, even
anonymous memory can be paged out. I prefer to lock ~everything
[2] into RAM to avoid this, but most programs don't do this, and
default ulimits prevent programs running within login shells from
locking much if anything. [1]
[2] particularly on (mostly non-Go) programs with many threads,
it's good to avoid locking into RAM the guard pages or stack
beyond what is likely to be used, so better not to just use
mlockall(MCL_CURRENT | MCL_FUTURE) unfortunately.
HTML [1]: https://lwn.net/Articles/879398/
wmf wrote 23 hours 24 min ago:
If C had exceptions a page fault could safely unwind the stack up
to the main loop which could work on something else until the page
arrives. This has the advantage that there's no cost for the common
case of accessing resident pages. Exceptions seem to have fallen
out of favor so this may trade one problem for another.
gpderetta wrote 16 hours 18 min ago:
you can longjmp, swapcontext or whatever from a signal handler
into another lightweight fiber. The problem is that there is no
"until the page arrive" notification. You would have to poll
mincore which is awful.
You could of course imagine an ansychronous "mmap complete
notification" syscal, but at that point why not just use
io_uring, it will be simpler and it has the benefit of actually
existing.
vlovich123 wrote 19 hours 6 min ago:
C++ has exceptions and having seen the vast majority of code and
the way it's written and the understanding of people writing
it, exception safety is a foreign concept. Doing it in C without
RAII seems particularly masochistic and doomed to fail.
And unwinding the stack isn't what you want to do, because
you're basically signaling you want to cancel the operation and
you're throwing away all the state when you precisely don't want
to do that - you just want to pause the current task and do other
I/O in the meantime.
pjmlp wrote 19 hours 44 min ago:
Windows C has exceptions, and no one has ever thought about doing
something like this.
They are only used for the same purpose as UNIX signals, without
their flaws.
In any case, page faults are OS specific; how would you standardise
such behaviour, with the added performance loss of switching between
userspace and kernel?
avianlyric wrote 1 day ago:
There is no sensible OS API that could support this, because
fundamentally memory access is a hardware API. The OS isn't
involved in normal memory reads, because that would be ludicrously
inefficient, effectively requiring a syscall for every memory
operation, which effectively means a syscall for any operation
involving data, i.e. all operations.
Memory operations are always synchronous because they're
performed directly as a consequence of CPU instructions. Reading
memory that's been paged out results in the CPU itself detecting
that the virtual address isn't in RAM, and performing a hardware
level interrupt. Literally abandoning a CPU instruction mid
execution to start executing an entirely separate set of
instructions which will hopefully sort out the page fault that just
occurred, then kindly ask the CPU to go back and repeat the
operation that caused the page fault.
The OS is only involved because it's the thing that provided the
handling instructions for the CPU to execute in the event of a page
fault. But it's not in any way actually capable of changing how
the CPU initially handles the page fault.
Also the current model does allow other threads to continue
executing other work while the page fault is handled. The fault is
completely localised to the individual thread that triggered the fault.
The CPU has no concept of the idea that multiple threads running on
different physical cores are in any way related to each other. It
also wouldn't make sense to allow the interrupted thread to
somehow kick off a separate asynchronous operation, because where
is it going to execute? The CPU core where the page fault happened
is needed to handle the actual page fault, and copy in the needed
memory. So even if you could kick off an async operation, there
wouldn't be any available CPU cycles to carry out the operation.
Fundamentally there aren't any sensible ways to improve on this
problem, because the problem only exists due to us pretending that
our machines have vastly more memory than they actually do. Which
comes with tradeoffs, such as having to pause the CPU and steal CPU
time to maintain the illusion.
If people don't like those tradeoffs, there's a very simple
solution. Put enough memory in your machine to keep your entire
working set in memory all the time. Then page faults can never
happen.
blibble wrote 23 hours 28 min ago:
> There is no sensible OS API that could support this, because
fundamentally memory access is a hardware API.
there's nothing magic about demand paging, faulting is one way it
can be handled
another could be that the OS could expose the present bit on the
PTE to userland, and it has to check it itself, and linux already
has asynchronous "please back this virtual address" APIs
> Memory operations are always synchronous because they're
performed directly as a consequence of CPU instructions.
although most CPU instructions may look synchronous they really
aren't, the memory controller is quite sophisticated
> Fundamentally there arenât any sensible ways to improve on
this problem, because the problem only exists due to us
pretending that our machines have vastly more memory than they
actually do. Which comes with tradeoffs, such as having to pause
the CPU and steal CPU time to maintain the illusion.
modern demand paging is one possible model that happens to be
near universal amongst operating system today
there are many, many other architectures that are possible...
avianlyric wrote 1 hour 8 min ago:
> although most CPU instructions may look synchronous they
really aren't, the memory controller is quite sophisticated
I was eliding a lot of details. But my broader point is that
from the perspective of the thread being interrupted, the
paging process is completely synchronous. Sure, an advanced x86 CPU
may be tracking data dependencies between instructions and
actively reordering instructions to reduce the impact of the
pipeline stall caused by the page fault. But those are all
low-level optimisations that are (or should be) completely
invisible to the executing thread.
> there are many, many other architectures that are possible...
I would be curious to see any examples of those alternatives.
Demand paging provides a powerful abstraction, and it's not
clear to me how you can sensibly move page management into
applications. At a very minimum that would suggest that every
programming language would need a memory management runtime
capable of predicting possible memory reads ahead of time in a
sensible fashion, and triggering its own paging logic.
kragen wrote 23 hours 50 min ago:
> There is no sensible OS API that could support this, because
fundamentally memory access is a hardware API.
Not only is there a sensible OS API that could support this,
Linux already implements it; it's the SIGSEGV signal. The
default way to respond to a SIGSEGV is by exiting the process
with an error, but Linux provides the signal handler with enough
information to do something sensible with it. For example, it
could map a page into the page frame that was requested, enqueue
an asynchronous I/O to fill it, put the current green thread to
sleep until the I/O completes, and context-switch to a different
green thread.
Invoking a signal handler only has about the same inherent
overhead as a system call. But then the signal handler needs
another couple of system calls. So on Linux this is over a
microsecond in all. That's probably acceptable, but it's slower
than just calling pread() and having the kernel switch threads.
Some garbage-collected runtimes do use SIGSEGV handlers on Linux,
but I don't know of anything using this technique for user-level
virtual memory. It's not a very popular technique in part
because, like inotify and epoll, it's nonportable; POSIX doesn't
specify that the signal handler gets the arguments it would need,
so running on other operating systems requires extra work.
im3w1l also mentions userfaultfd, which is a different
nonportable Linux-only interface that can solve the same thing
but is, I think, more efficient.
maxdamantus wrote 20 hours 40 min ago:
Just to clarify, I think the parent posts are talking about
non-failing page faults, ie where the kernel just needs to
update the mapping in the MMU after finding the existing page
already in memory (minor page fault), or possibly reading it
from filesystem/swap (major page fault).
SIGSEGV isn't raised during a typical page fault, only ones
that are deemed to be due to invalid reads/writes.
When one of the parents talks about "no good programming
model/OS api", they basically mean an async option that gives
the power of threads; threading allows concurrency of page
faults, so the kernel is able to perform concurrent reads
against the underlying storage media.
Off the top of my head, a model I can think of for supporting
concurrent mmap reads might involve a function:
bool hint_read(void *data, size_t length);
When the caller is going to read various parts of an mmapped
region, it can call `hint_read` multiple times beforehand to
add regions into a queue. When the next page fault happens,
instead of only reading the currently accessed page from disk,
it can drain the `hint_read` queue for other pages
concurrently. The `bool` return indicates whether the queue was
full, so the caller stops making useless `hint_read` calls.
I'm not familiar with userfaultfd, so don't know if it relates
to this functionality. The mechanism I came up with is still a
bit clunky and probably sub-optimal compared to using io_uring
or even `readv`, but these are alternatives to mmap.
gpderetta wrote 16 hours 34 min ago:
Are you reinventing madvise?
maxdamantus wrote 14 hours 34 min ago:
I think the model I described is more precise than madvise.
I think madvise would usually be called on large sequences
of pages, which is why it has `MADV_RANDOM`,
`MADV_SEQUENTIAL` etc. You're not specifying which
memory/pages are about to be accessed, but the likely
access pattern.
If you're just using mmap to read a file from start to
finish, then the `hint_read` mechanism is indeed pointless,
since multiple `hint_read` calls would do the same thing as
a single `madvise(..., MADV_SEQUENTIAL)` call.
The point of `hint_read`, and indeed io_uring or `readv` is
the program knows exactly what parts of the file it wants
to read first, so it would be best if those are read
concurrently, and preferably using a single system call or
page fault (ie, one switch to kernel space).
I would expect the `hint_read` function to push to a queue
in thread-local storage, so it shouldn't need a switch to
kernel space. User/kernel space switches are slow, in the
order of a couple of 10s of millions per second. This is
why the vDSO exists, and why the libc buffers writes
through `fwrite`/`println`/etc, because function calls
within userspace can happen at rates of billions per
second.
gpderetta wrote 13 hours 21 min ago:
you can do fine grained madvise via io_uring, which
indeed uses a queue. But at that point why use mmap at
all, just do async reads via io_uring.
vlovich123 wrote 13 hours 2 min ago:
The entire point I was trying to make at the beginning
of the thread is that mmap gives you memory pages in
the page cache that the OS can drop on memory pressure.
Io_uring is close on the performance and fine-grained
access patterns front. It's not so good on the
system-wide cooperative behavior with memory front, and it
has a higher cost: either you're still copying data
from the page cache into a user buffer (a non-trivial
performance impact vs the read itself) and trashing your
CPU caches, or you're doing direct I/O and having to
implement a page cache manually (which risks duplicating
page data inefficiently in userspace if the same file
is accessed by multiple processes).
gpderetta wrote 9 hours 42 min ago:
Right, so zero copy IO but still having the ability
to share the page cache across processes and allow the
kernel to drop caches under high memory pressure. One issue
is that when under pressure, a process might not
really be able to successfully read a page and keeps
retrying and failing (with an LRU replacement policy
it is unlikely and probably self-limiting, but
still...).
kragen wrote 8 hours 2 min ago:
To take advantage of zero-copy I/O, which I believe
has become much more important since the shift from
spinning rust to Flash, I think applications often
need to adopt a file format that's amenable to
zero-copy access. Examples include Arrow (but not
compressed Feather), HDF5, FlatBuffers, Avro, and
SBE. A lot of file formats developed during the
spinning-rust eon require full parsing before the
data in them can be used, which is fine for a 1KB
file but suboptimal for a 1GB file.
vlovich123 wrote 18 hours 59 min ago:
You've actually understood my suggestion - thank you.
Unfortunately I think hint_read inherently can't work
because there's a race condition between the read and how long
you access the page, and this race is inherent in any
attempted solution that needs to be solved. Signals are also
the wrong abstraction mechanism (and are slow and have all
sorts of other problems).
You need something more complicated I think, like rseq and
futex, where you have some shared data structure that both
sides understand how to mutate atomically. You could literally use
rseq to abort if the page isn't in memory and then submit
an io_uring task to get signaled when it gets paged in again,
but rseq is a bit too coarse (it'll trigger on any
preemption).
There's a race-condition starvation danger here (the page gets
evicted between when you get the signal and the sequence
completes) but something like this conceptually could maybe
be closer to working.
But yes, it's inherently difficult, which is why it doesn't
exist, but it would be higher performance. And yes, this only makes
sense for mmap, not all allocations, so SIGSEGV is irrelevant
if looking at today's kernels.
kragen wrote 20 hours 25 min ago:
If you want accessing a particular page to cause a SIGSEGV so
your custom fault handler gets invoked, you can just munmap
it, converting that access from a "non-failing page fault"
into one "deemed to be invalid". Then the mechanism I
described would "allow[] concurrency of page faults, so the
[userspace threading library] is able to perform concurrent
reads against the underlying storage media". As long as you
were aggressive enough about unmapping pages that none of
your still-mapped pages got swapped out by the kernel. (Or
you could use mlock(), maybe.)
I tried implementing your "hint_read" years ago in userspace
in a search engine I wrote, by having a "readahead thread"
read from pages before the main thread got to them. It made
it slower, and I didn't know enough about the kernel to
figure out why. I think I could probably make it work now,
and Linux's mmap implementation has improved enormously since
then, so maybe it would just work right away.
maxdamantus wrote 13 hours 45 min ago:
The point about inducing segmentation faults is interesting
and sounds like it could work to implement the `hint_read`
mechanism. I guess it would mostly be a question of how
performant userfaultfd or SIGSEGV handling is. In any case
it will be sub-optimal to having it in the kernel's own
fault handler, since each userfaultfd read or SIGSEGV
callback is already a user-kernel-user switch, and it still
needs to perform another system call to do the actual
reads, and even more system calls to mmap the bits of
memory again.
Presumably having fine-grained mmaps will be another source
of overhead. Not to mention that each mmap requires another
system call. Instead of a single fault or a single call to
`readv`, you're doing many `mmap` calls.
> I tried implementing your "hint_read" years ago in
userspace in a search engine I wrote, by having a
"readahead thread" read from pages before the main thread
got to them.
Yeah, doing it in another thread will also have quite a bit
of overhead. You need some sort of synchronisation with the
other thread, and ultimately the "readahead" thread will
need to induce the disk reads through something other than
a page fault to achieve concurrent reads, since within the
readahead thread, the page faults are still synchronous,
and they don't know what the future page faults will be.
It might help to do `readv` into dummy buffers to force the
kernel to load the pages from disk to memory, so the
subsequent page faults are minor instead of major. You're
still not reducing the number of page faults though, and
the total number of mode switches is increased.
Anyway, all of these workarounds are very complicated and
will certainly be a lot more overhead than vectored IO, so
I would recommend just doing that. The overall point is
that using mmap isn't friendly to concurrent reads from
disk like io_uring or `readv` is.
Major page faults are basically the same as synchronous
read calls, but Golang read calls are asynchronous, so the
OS thread can continue doing computation from other
Goroutines.
Fundamentally, the benchmarks in this repository are broken
because in the mmap case they never read any of the data
[0], so there are basically no page faults anyway. With a
well-written program, there shouldn't be a reason that mmap
would be faster than IO, and vectored IO can obviously be
faster in various cases.
[0] Eg, see here where the byte slice is assigned to `_`
instead of being used:
HTML [1]: https://github.com/perbu/mmaps-in-go/blob/7e24f154...
immibis wrote 9 hours 5 min ago:
Inducing segmentation faults is literally how the kernel
implements memory mapping, and virtual memory in general,
by the way. From the CPU's perspective, that page is
unmapped. The kernel gets its equivalent of a SIGSEGV
signal (which is a "page fault"=SIGSEGV
"interrupt"=signal), checks its own private tables,
decides the page is currently on disk, schedules it to be
read from disk, does other stuff in the meantime, and
when the page has finished being read from disk, it
returns from the interrupt.
(It does get even deeper than that: from the CPU's
perspective, the interrupt is very brief, just long
enough to take note that it happened and avoid switching
back to the thread that page-faulted. The rest of the
stuff I mentioned, although logically an "interrupt" from
the application's perspective, happens with the CPU's "am
I handling an interrupt?" flag set to false. This is
equivalent to writing a signal handler that sets a flag
saying the thread is blocked, edits its own return
address so it will return to the scheduler instead of the
interrupted code, then calls sigreturn to exit the signal
handler.)
vlovich123 wrote 13 hours 6 min ago:
munmap + signal handling is terrible, not least because
you don't want to be fucking with the page table
in that way: an unmap involves a cross-CPU TLB shoot
down, which is slooow in a "make the entire machine
slow" kind of way.
im3w1l wrote 1 day ago:
I think you have a misunderstanding of how disk IO happens. The
CPU core sends a command to the disk "I want some this and that
data", then the CPU core can go do something else while the disk
services that request. From what I read the disk actually puts
the data directly into memory by using DMA, without needing to
involve the CPU.
So far so good, but then the question is to ensure that the CPU
core has something more productive to do than just check "did the
data arrive yet?" over and over, and coordinating that is where
good APIs come in.
lmz wrote 23 hours 42 min ago:
It's hard to say on one hand "I use mmap because I don't want
fancy APIs for every read" and on the other "I want to do
something useful on page fault" because you don't want to make
every memory read a possible interruption point.
ori_b wrote 1 day ago:
I think you have a misunderstanding of how the OS is signaled
about disk I/O being necessary. Most of the post above was
discussing that aspect of it, before the OS even sends the
command to the disk.
dapperdrake wrote 1 day ago:
(Not the person you are replying to.)
There is nothing in the sense of Python async or JS async that
the OS thread or OS process in question could usefully do on
the CPU until the memory is paged into physical RAM. DMA or no
DMA.
The OS process scheduler can run another process or thread.
But your program instance will have to wait. That's the
point. It doesn't matter whether waiting is handled by a
busy loop a.k.a. polling or by a second interrupt that wakes
the OS thread up again.
That is why Linux calls it uninterruptible sleep.
EDIT: io_uring would of course change your thread from blocking
syscalls to non-blocking syscalls. Page faults are not a
syscall, as GP pointed out. They are, however, a
context-switch to an OS interrupt handler. That is why you
have an OS. It provides the software drivers for your CPU,
MMU, and disks/storage. Here this is the interrupt handler for
a page fault.
hyghjiyhu wrote 12 hours 33 min ago:
(I am the person you are replying to)
It could work like this: "Hey OS, I would like to process
these pages*; are they good to go? If not, could you fetch and
lock them for me?" Then, if they are ready, you process them
knowing it won't fault, and if they are not, you do something
else and try again later.
It's a sort of hybrid of the mmap and fread paradigms in
that there are explicit read requests, but the kernel can
also get you data on its own initiative if there are spare
resources for it.
* to amortize syscall overhead.
avianlyric wrote 1 hour 16 min ago:
What advantages does that provide over using more OS
threads? Ultimately this model is based on the idea that we
want our programming runtimes to become increasingly
responsible for low level scheduling concerns that have
traditionally been handled by the OS scheduler.
I can broadly understand why there may be a desire to go
down that path. But I'm not convinced that it would
produce meaningfully better performance than the current
abstractions. Especially if you take a step back and ask the
question: is mmap the right tool to be using in these
situations, rather than other tools like io_uring?
To be clear, I don't know the answer to this question. But
the complexity of the solutions being suggested to
potentially improve the mmap API really makes me question
whether they're capable of producing meaningful improvements.
bcrl wrote 23 hours 48 min ago:
What everyone forgets is just how expensive context switches
are on modern x86 CPUs. Those 512-bit vector registers fill
up a lot of cache lines. That's why async tends to win over
processes / threads for many workloads.
im3w1l wrote 1 day ago:
There are apis that sort of let you do it: mincore, madvise,
userfaultfd.
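As a rough illustration of the "are these pages good to go?"
idea above, here is a minimal, Linux-only Go sketch using
unix.Mincore and MADV_WILLNEED from golang.org/x/sys/unix; the
pagesReady name and package layout are assumptions, not from the
article, and mlock would be needed if the pages must also be
pinned:
package residency
import (
    "os"
    "golang.org/x/sys/unix"
)
// pagesReady reports whether every page backing mapped (a
// page-aligned slice returned by unix.Mmap) is currently resident.
// If any page is missing, it asks the kernel to start reading the
// range in the background via MADV_WILLNEED and returns false, so
// the caller can do other work and try again later.
func pagesReady(mapped []byte) (bool, error) {
    pageSize := os.Getpagesize()
    vec := make([]byte, (len(mapped)+pageSize-1)/pageSize)
    if err := unix.Mincore(mapped, vec); err != nil {
        return false, err
    }
    for _, v := range vec {
        if v&1 == 0 { // low bit clear: page not resident
            if err := unix.Madvise(mapped, unix.MADV_WILLNEED); err != nil {
                return false, err
            }
            return false, nil
        }
    }
    return true, nil
}
A loop over pagesReady interleaved with other work approximates the
hybrid explicit-request / kernel-readahead model described above.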
bcrl wrote 23 hours 47 min ago:
None of those APIs are cheap enough to call in a fast path.
gpderetta wrote 16 hours 16 min ago:
No syscall will be cheap to call in a fast path. You would need
a hardware instruction that tells you whether a load or store
would fault.
vlovich123 wrote 12 hours 52 min ago:
Rather than a direct syscall, you could imagine something
like rseq, where you have a shared userspace / kernel data
structure and the userspace code gets aborted and restarted
if the page was evicted while being processed. But making
this work correctly, not have a perf overhead, and also be an
ergonomic API is super hard. In practice, people who care are
probably satisfied by direct I/O within io_uring plus a custom
page cache, and a truly optimal implementation, where the OS can
still manage file pages and evict them but the application still
knows when that happened, isn't worth it.
bcrl wrote 30 min ago:
Unfortunately, a lot of the shared state with userland
became much more difficult to implement securely when the
Meltdown and Spectre (and others) exploits became concerns
that had to be mitigated. They make the OS's job a heck
of a lot harder.
Sometimes I feel modern technology is basically a
delicately balanced house of cards that falls over when
breathed upon or looked at incorrectly.
zozbot234 wrote 15 hours 54 min ago:
> You would need a hardware instruction that tells you whether a
load or store would fault.
You have MADV_FREE pages/ranges. They get cleared when
purged, so reading zeros tells you that the load would have
faulted and needs to be populated from storage.
vlovich123 wrote 12 hours 49 min ago:
MADV_FREE is insufficient: userspace doesn't get a
signal from the OS to know when there's system-wide
memory pressure, and having userspace try to respond to such
a signal would be counterproductive and slow in a kernel
operation that needs to be a fast path. It's more that
you want to MADV (page cache) a memory range and then have
some way to have a shared data structure where you are told
if it's still resident and can lock it from being paged
out.
bcrl wrote 19 min ago:
MADV_FREE is also extremely expensive. CPU vendors have
finally simplified TLB shootdown in recent CPUs with both
AMD and Intel now having instructions to broadcast TLB
flushes in hardware, which gets rid of one of the worst
sources of performance degradation in threaded multicore
applications (oh the pain of IPIs mixed with TLB
flushing!). However, it's still very expensive to walk
page tables and free pages.
Hardware reference counting of memory allocations would
be very interesting. It would be shockingly simple to
implement compared to many other features hardware
already has to tackle.
zozbot234 wrote 11 hours 3 min ago:
> userspace doesn't get a signal from the OS to know when
there's system-wide memory pressure
Memory pressure indicators exist. [1]
> have some way to have a shared data structure where you are
told if it's still resident and can lock it from being paged out
What's more efficient than fetching data and comparing it
with zero? Any write within the range will then cancel
the MADV_FREE property on the written-to page, thus
"locking" it again, and this is also very efficient.
HTML [1]: https://docs.kernel.org/accounting/psi.html
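A toy, Linux-only Go sketch of that scheme, assuming an anonymous
mapping used as an application-managed cache (MADV_FREE does not
apply to file-backed MAP_SHARED mappings); the slot type, its
methods, and the sentinel convention are illustrative assumptions,
not anything from the article:
package madvcache
import (
    "golang.org/x/sys/unix"
)
// slot is one cache entry backed by anonymous memory.
type slot struct{ buf []byte }
func newSlot(size int) (*slot, error) {
    buf, err := unix.Mmap(-1, 0, size,
        unix.PROT_READ|unix.PROT_WRITE,
        unix.MAP_PRIVATE|unix.MAP_ANON)
    if err != nil {
        return nil, err
    }
    return &slot{buf: buf}, nil
}
// fill stores data plus a nonzero sentinel in byte 0. Writing to
// the pages also cancels any pending MADV_FREE on them.
func (s *slot) fill(data []byte) {
    s.buf[0] = 1 // sentinel: must never legitimately be zero
    copy(s.buf[1:], data)
}
// release tells the kernel it may reclaim the pages under memory
// pressure; until it actually does, the contents remain readable.
func (s *slot) release() error {
    return unix.Madvise(s.buf, unix.MADV_FREE)
}
// valid reports whether the kernel has purged the slot: a purged
// MADV_FREE page reads back as zeros, so a zero sentinel means the
// contents must be regenerated before use.
func (s *slot) valid() bool {
    return s.buf[0] != 0
}
The scheme has an inherent race: the kernel may purge a page between
the valid() check and the next read, so a caller that needs certainty
writes to the page first, which cancels the pending free.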
nawgz wrote 1 day ago:
Sounds interesting. Why wouldnât the OS itself default to this
behavior? Could it fall apart under load, or is it just not important
enough to replace the legacy code relying on it?
perbu wrote 15 hours 51 min ago:
The point is that invoking the OS has a cost. Using mmap, for those
situations where it makes sense, lets you avoid that cost.
kragen wrote 17 hours 5 min ago:
Multics did default to this behavior, but Unix was written on the
PDP-7 and later the PDP-11, neither of which supported virtual memory
or paging, so the Unix system call interface necessarily used read()
and write() calls instead.
This permitted the use of the same system calls on files, on the
teletype, on the paper tape reader and punch, on the magtape, on the
line printer, and eventually on pipes. Even before pipes, the
ability to "record" a program's terminal output in a file or "play
back" simulated user input from a file made Unix especially
convenient.
But pipes, in turn, permitted entire programs on Unix to be used as
the building blocks of a new, much more powerful programming
language, one where you manipulated not just numbers or strings but
potentially endless flows of data, and which could easily orchestrate
computations that no single program in the PDP-11's 16-bit address
space could manage.
And that was how Unix users in the 01970s had an operating system
with the entire printed manual available in the comprehensive online
help system, a way to produce publication-quality documents on the
phototypesetter, incremental recompilation, software version control,
full-text search, WYSIWYG screen editors that could immediately jump
to the definition of a function, networked email, interactive
source-level debugging, a relational database, etc., all on a 16-bit
computer that struggled to run half a million instructions per
second, which at most companies might have been relegated to
controlling some motors and heaters in a chemical plant or something.
It turns out that often what you can do matters even more than how
fast you can do it.
toast0 wrote 21 hours 34 min ago:
> Why wouldnât the OS itself default to this behavior? Could it
fall apart under load, or is it just not important enough to replace
the legacy code relying on it?
Mmap and read/write syscalls are both ways to interact with files,
but they have different behaviors. You can't exactly swap one for the
other without knowledge of the caller. What you likely do see is that
OS utilities use mmap where it makes sense and makes a difference.
You also have a lot of things that can work on files or pipes/etc and
having a common interface is more useful than having more potential
performance (sometimes the performance is enough to warrant writing
it twice).
trenchpilgrim wrote 1 day ago:
1. mmap was added to Unix later by Sun, it wasn't in the original
Unix
2. As the article points out, mmap is very fast for reading huge
amounts of data but is a lot slower at other file operations. For
reading smallish files, which is the majority of calls most software
will make to the filesystem, the regular file syscalls are better.
3. If you're on a modern Linux you might be better off with io_uring
than mmap.
scottlamb wrote 1 day ago:
All true, and it's not just performance either. The API is just
different. mmap data can change at any time. In fact, if the file
shrinks, access to a formerly valid region of memory has behavior
that is unspecified by the Single Unix Specification. (On Linux, it
causes a SIGBUS if you access a page that is entirely invalid;
bytes within the last page after the last valid byte probably are
zeros or something? unsure.)
In theory I suppose you could have a libc that mostly emulates
read() and write() calls on files [1] with memcpy() on mmap()ed
regions. But I don't think it'd be quite right. For one thing, that
read() behavior after shrink would be a source of error.
Higher-level APIs might be more free to do things with either mmap
or read/write.
[1] just on files; so it'd have to track which file descriptors are
files as opposed to sockets/pipes/etc, maintaining the cached
lengths and mmap()ed regions and such. libc doesn't normally do
that, and it'd go badly if you bypass it with direct system calls.
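In Go specifically, one hedged way to survive that SIGBUS rather
than crash is runtime/debug.SetPanicOnFault plus recover; a minimal
sketch (the safeCopy name is illustrative, not from the article):
package mmapread
import (
    "errors"
    "runtime/debug"
)
// safeCopy copies from an mmap'd slice into dst. If the backing file
// has shrunk and the access faults (SIGBUS on Linux), the fault is
// turned into a runtime panic that we recover from and report as an
// ordinary error instead of killing the process.
func safeCopy(dst, mapped []byte) (err error) {
    old := debug.SetPanicOnFault(true) // per-goroutine setting
    defer debug.SetPanicOnFault(old)
    defer func() {
        if r := recover(); r != nil {
            err = errors.New("mapping no longer backed by the file")
        }
    }()
    copy(dst, mapped)
    return nil
}
Note that SetPanicOnFault only affects the calling goroutine, so all
access to the mapping has to go through a wrapper like this.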
nawgz wrote 20 hours 28 min ago:
Interesting callouts! Thanks