_______ __ _______
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----.
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --|
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____|
on Gopher (unofficial)
COMMENT PAGE FOR:
How We Found 7 TiB of Memory Just Sitting Around
Aeolun wrote 5 hours 5 min ago:
I read this and I have to wonder: did anyone ever think it was
reasonable that a cluster that apparently needed only 120 GB of memory
was consuming 1.2 TB just for logging (or whatever Vector does)?
fock wrote 1 hour 6 min ago:
we have on-prem with heavy spikes (our batch workload can easily use
the 20 TB of memory in the cluster) and we just don't care much; we
add 10% every year to the hardware requested. Compared to employing
people or paying other vendors (relational databases with many
TB-sized tables...) this is just irrelevant.
Sadly devs are incentivized by that, and going towards the cloud might
be a fun story. Given the environment, I hope they scrap the effort
sooner rather than later, buy some Oxide systems for the people who
need to iterate faster than the usual process of getting a VM, and
replace/reuse the 10% of the company occupied with the cloud (mind
you: no real workload runs there yet...) to actually improve local
processes...
devjab wrote 1 hour 38 min ago:
We're a much smaller-scale company and the cost we lose on these
things is insignificant compared to what's in this story. Yesterday I
was improving the process for creating databases in our Azure and I
stumbled upon a subscription which was running 7 MSSQL servers for 12
databases. These weren't elastic, and they were each paying a license
that we don't have to pay because we qualify for the base cost
through our contract with our Microsoft partner. This company has
some of the tightest control over their cloud infrastructure of any
organisation I've worked with.
This is anecdotal, but if my experiences aren't unique then there is
a lot that isn't reasonable in DevOps.
bstack wrote 2 hours 51 min ago:
Author here: You'd be surprised what you don't notice given
enough nodes and slow enough resource growth over time! Even at its
high-water mark, this daemonset was still a small portion of the
total resource usage in these clusters.
Aeolun wrote 20 min ago:
I'm not sure if that makes it better or worse.
fock wrote 1 hour 1 min ago:
how large are the clusters then?
hinkley wrote 10 hours 29 min ago:
Keys require O(log n) space per key, or n log n for the entire data
set, simply to avoid key collisions. But human-friendly key spaces grow
much, much faster, and I don't think many people have looked too hard
at that.
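(A one-line version of that bound, with n the number of keys and k the
bits per key: keeping n keys collision-free needs at least log2(n) bits
each.)
    k \ge \log_2 n \quad\Rightarrow\quad \text{total key space} \ge n \log_2 n = \Theta(n \log n)\ \text{bits}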
There were recent changes to the Node.js Prometheus client that
eliminate tag names from the keys used for storing the tag cardinality
for metrics. The memory savings weren't reported, but the CPU savings
for recording data points were over 1/3, and about twice that when
applied to the aggregation logic.
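(A rough sketch of that kind of change, illustrative only and not the
actual prom-client internals; the metric's label names and values below
are made up. Since a metric's label names are fixed at registration,
the per-series key only needs the values:)
    // Illustrative: building the per-series lookup key for a metric whose
    // label names are fixed at registration time. Names are hypothetical.
    type LabelValues = Record<string, string>;

    const labelNames = ["method", "status", "region"]; // fixed per metric

    // Old style: the key repeats every label name for every series.
    function keyWithNames(labels: LabelValues): string {
      return labelNames.map((n) => `${n}="${labels[n]}"`).join(",");
    }

    // New style: names are implied by position, so the key stores values only.
    function keyValuesOnly(labels: LabelValues): string {
      return labelNames.map((n) => labels[n]).join("\u0000"); // NUL-separated
    }

    const series: LabelValues = { method: "GET", status: "200", region: "eu" };
    console.log(keyWithNames(series));  // method="GET",status="200",region="eu"
    console.log(keyValuesOnly(series)); // "GET\u0000200\u0000eu" -- shorter to build and hash
Multiplied across every recorded data point, the shorter key means less
string building and hashing, which is plausibly where the CPU savings
come from.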
Lookups are rarely O(1), even in hash tables.
I wonder if there's a general solution for keeping names concise
without triggering transposition or reading-comprehension errors, and
what the space complexity of such an algorithm would be.
vlovich123 wrote 3 hours 14 min ago:
Why aren't they just 128-bit UUIDs? Those are guaranteed to be
globally unique and don't require so much space.
hinkley wrote 37 min ago:
Why aren't what 128-bit UUIDs?
> keeping names concise without triggering transposition or reading
comprehension errors.
Code that doesn't work for developers first will soon cease to
work for anyone. Plus, how do you look up a UUID for a set of tags?
What's your perfect-hash plan to make sure you don't
misattribute stats to the wrong place?
UUIDs are entirely opaque and difficult to tell apart consistently.
nitinreddy88 wrote 1 day ago:
The other way to look at it is: why does adding the namespace label
cause so much memory footprint in Kubernetes? Shouldn't fixing that
(which could be a much bigger design change) benefit the whole Kube
community?
bstack wrote 15 hours 54 min ago:
Author here: yeah, that's a good point. tbh I was mostly unfamiliar
with Vector, so I took the shortest path to the goal, but that could
be an interesting followup. It does seem like there's a lot of bytes
per namespace!
stackskipton wrote 4 hours 31 min ago:
You mentioned in the blog article that it's doing list-watch.
List-watch registers with the Kubernetes API to get a list of all
objects AND a notification whenever any object you registered for
changes. A bunch of Vector pods saying "Hey, send me a notification
when anything with namespaces changes" and poof goes your memory,
keeping track of who needs to know what.
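(A minimal sketch of that list-then-watch pattern against the raw
Kubernetes REST API, no client library; the API server address is
assumed to be the in-cluster default, and auth/TLS, reconnects and
partial-line buffering are omitted:)
    // List namespaces once, then watch for changes from that resourceVersion.
    // Node 18+ (global fetch, TextDecoder).
    const API = "https://kubernetes.default.svc"; // assumed in-cluster address

    async function listWatchNamespaces(): Promise<void> {
      // 1. LIST: fetch every namespace and remember the list's resourceVersion.
      const list: any = await (await fetch(`${API}/api/v1/namespaces`)).json();
      const names = new Set<string>(list.items.map((ns: any) => ns.metadata.name));
      const rv = list.metadata.resourceVersion;

      // 2. WATCH: the API server streams one JSON event per line from that version on.
      const res = await fetch(`${API}/api/v1/namespaces?watch=1&resourceVersion=${rv}`);
      const decoder = new TextDecoder();
      for await (const chunk of (res.body as any)) {
        for (const line of decoder.decode(chunk).split("\n").filter(Boolean)) {
          const event = JSON.parse(line); // { type: ADDED|MODIFIED|DELETED, object: {...} }
          const name = event.object?.metadata?.name;
          if (event.type === "DELETED") names.delete(name);
          else names.add(name);
        }
      }
    }
Every pod running a watch like this holds its own copy of the
namespace state in memory, so the cost scales with pods x namespaces,
which is the "poof" above.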
At this point, I wonder if, instead of relying on daemonsets, you
just gave every namespace a Vector instance that was responsible
for that namespace and the pods within it. ElasticSearch or whatever
you pipe logging data to might not be happy with all those TCP
connections.
Just my SRE brain thoughts.
fells wrote 3 hours 45 min ago:
>you just gave every namespace a vector instance that was
responsible for that namespace and pods within.
Vector is a daemonset because it needs to tail the log files on
each node. A single Vector per namespace might not reside on the
nodes that each pod is on.
shanemhansen wrote 1 day ago:
The unreasonable effectiveness of profiling and digging deep strikes
again.
hinkley wrote 11 hours 1 min ago:
The biggest tool in the performance toolbox is stubbornness. Without
it, all the mechanical sympathy in the world will go unexploited.
There's about a factor of 3 improvement that can be made to most
code after the profiler has given up. That probably means there are
better profilers that could be written, but in 20 years of having
them I've only seen 2 that tried. Sadly, I think flame graphs made
profiling more accessible to the unmotivated but didn't actually
improve overall results.
jesse__ wrote 8 hours 54 min ago:
Broadly agree.
I'm curious, what're the profilers you know of that tried to be
better? I have a little homebrew game engine with an integrated
profiler that I'm always looking for ideas to make more effective.
hinkley wrote 8 hours 21 min ago:
Clinic.js tried and lost steam. I have a recollection of a
profiler called JProfiler that represented space and time as a
graph, but also a recollection that they went under. And there is a
company selling a product of that name that has been around since
that time, but it doesn't quite look how I recalled, so I don't
know if I was mistaken about their demise or have swapped product
names in my brain. It was 20 years ago, which is a long time for
mush to happen.
The common element between attempts is new visualizations. And
like drawing a projection of an object in a mechanical
engineering drawing, there is no one projection that contains the
entire description of the problem. You need to present several
and let the brain synthesize the data missing in each individual
projection into an accurate model.
Negitivefrags wrote 9 hours 48 min ago:
I think the biggest tool is higher expectations. Most programmers
really haven't come to grips with the idea that computers are fast.
If you see a database query that takes 1 hour to run and only
touches a few GB of data, you should be thinking "Well, NVMe
bandwidth is multiple gigabytes per second, why can't it run in 1
second or less?"
The idea that anyone would accept a request to a website taking
longer than 30 ms (the time it takes for a game to render its
entire world, including both the CPU and GPU parts, at 60 fps) is
insane, and nobody should really accept it, but we commonly do.
mjevans wrote 1 hour 45 min ago:
30 ms for a website is a tough bar to clear considering the speed of
light (or rather electrons in copper / light in fiber) [1]. Just
as an example, round-trip delay from where I rent to the local
backbone is about 14 ms alone, and the average to a webserver is
53 ms, just for a simple echo reply. (I picked it because I'd
hoped it was in Redmond or some nearby datacenter, but it looks
more likely to be in a cheaper labor area.)
However, it's only the bloated ECMAScript (JavaScript) trash web
of today that makes a website take longer than ~1 second to load
on a modern PC. Plain old HTML, images on a reasonable diet, and
some script elements only for interactive things can scream.
mtr -bzw microsoft.com
(columns: Loss%  Snt  Last   Avg  Best  Wrst  StDev)
 6. AS7922 be-36131-cs03.seattle.wa.ibone.comcast.net (2001:558:3:942::1)
      0.0%  10  12.9  13.9  11.5  18.7   2.6
 7. AS7922 be-2311-pe11.seattle.wa.ibone.comcast.net (2001:558:3:3a::2)
      0.0%  10  11.8  13.3  10.6  17.2   2.4
 8. AS7922 2001:559:0:80::101e
      0.0%  10  15.2  20.7  10.7  60.0  17.3
 9. AS8075 ae25-0.icr02.mwh01.ntwk.msn.net (2a01:111:2000:2:8000::b9a)
      0.0%  10  41.1  23.7  14.8  41.9  10.4
10. AS8075 be140.ibr03.mwh01.ntwk.msn.net (2603:1060:0:12::f18e)
      0.0%  10  53.1  53.1  50.2  57.4   2.1
11. AS8075 2603:1060:0:10::f536
      0.0%  10  82.1  55.7  50.5  82.1   9.7
12. AS8075 2603:1060:0:10::f3b1
      0.0%  10  54.4  96.6  50.4 147.4  32.5
13. AS8075 2603:1060:0:10::f51a
      0.0%  10  49.7  55.3  49.7  78.4   8.3
14. AS8075 2a01:111:201:f200::d9d
      0.0%  10  52.7  53.2  50.2  58.1   2.7
15. AS8075 2a01:111:2000:6::4a51
      0.0%  10  49.4  51.6  49.4  54.1   1.7
20. AS8075 2603:1030:b:3::152
      0.0%  10  50.7  53.4  49.2  60.7   4.2
[1]: https://en.wikipedia.org/wiki/Speed_of_light
hinkley wrote 8 hours 25 min ago:
Lowered expectations come in part from people giving up on
theirs. Accepting versus pushing back.
antonymoose wrote 8 hours 21 min ago:
I have high hopes and expectations, unfortunately my chain of
command does not, and is often an immovable force.
hinkley wrote 7 hours 24 min ago:
This is a terrible time to tell someone to find a movable
object in another part of the org or elsewhere. :/
I always liked Shaw's "The reasonable man adapts himself
to the world: the unreasonable one persists in trying to
adapt the world to himself. Therefore all progress depends on
the unreasonable man."
azornathogron wrote 8 hours 59 min ago:
Pedantic nit: at 60 fps the per-frame time is 16.66... ms, not 30
ms. Having said that, a lot of games run at 30 fps, or run
different parts of their logic at different frequencies, or do
other tricks that mean there isn't exactly one FPS rate the
thing is running at.
Negitivefrags wrote 8 hours 39 min ago:
The CPU part happens on one frame, the GPU part happens on the
next frame. If you want to talk about the total time for a game
to render a frame, it needs to count two frames.
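(Worked out, that pipelining is where the ~30 ms figure comes from:)
    \frac{1}{60\ \text{fps}} \approx 16.7\ \text{ms per frame}, \qquad \text{CPU frame} + \text{GPU frame} \approx 2 \times 16.7\ \text{ms} \approx 33\ \text{ms total}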
wizzwizz4 wrote 7 hours 25 min ago:
Computers are fast. Why do you accept a frame of lag? The
average game for a PC from the 1980s ran with less lag than
that. Super Mario Bros had less than a frame between
controller input and character movement on the screen.
(Technically, it could be more than a frame, but only if
there were enough objects in play that the processor couldn't
handle all the physics updates in time and missed the v-blank
interval.)
Negitivefrags wrote 7 hours 1 min ago:
If Vsync is on (which was my assumption in my previous
comment), then if your computer is fast enough you might be
able to run the CPU and GPU work entirely in a single frame,
using Reflex to delay when simulation starts to lower
latency. But regardless, you still have a total time budget
of 1/30th of a second to do all your combined CPU and GPU
work to get to 60 fps.
javier2 wrote 9 hours 9 min ago:
It's also about cost. My game computer has 8 cores + 1 expensive
GPU + 32 GB RAM, for me alone. We don't have that per customer.
Aeolun wrote 5 hours 4 min ago:
If your websites take less than 16ms to serve, you can serve 60
customers per second with that. So you sorta do have it per
customer?
vlovich123 wrote 3 hours 43 min ago:
That's per core, assuming the 16 ms is CPU-bound activity (so
100 cores would serve 100 customers). If it's I/O you can
overlap a lot of customers, since a single core could easily
keep track of thousands of in-flight requests.
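(A back-of-envelope sketch of that distinction; the 16 ms figure is
taken from the comment above, everything else is illustrative:)
    // CPU-bound vs I/O-bound throughput on a single core / event loop.
    const handlerMs = 16; // per-request time quoted above

    // CPU-bound: requests serialize on the core.
    const cpuBoundPerSecond = 1000 / handlerMs; // ~62 requests/s per core

    // I/O-bound: the core mostly waits, so requests overlap on one event loop.
    async function ioBoundHandler(): Promise<void> {
      await new Promise((resolve) => setTimeout(resolve, handlerMs)); // simulated DB/network wait
    }

    async function main(): Promise<void> {
      console.log(`CPU-bound: ~${Math.floor(cpuBoundPerSecond)} req/s per core`);

      // 1000 overlapping "requests" finish in roughly handlerMs of wall time,
      // because the event loop just tracks them while the I/O is outstanding.
      const start = Date.now();
      await Promise.all(Array.from({ length: 1000 }, ioBoundHandler));
      console.log(`1000 I/O-bound requests overlapped in ~${Date.now() - start} ms`);
    }

    main();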
oivey wrote 8 hours 44 min ago:
This is again a problem understanding that computers are fast.
A toaster can run an old 3D game like Quake at hundreds of FPS.
A website primarily displaying text should be way faster. The
reasons websites often aren't have nothing to do with the
user's computer.
paulryanrogers wrote 8 hours 2 min ago:
That's a dedicated toaster serving only one client. Websites
usually aren't backed by bare metal per visitor.
oivey wrote 7 hours 31 min ago:
Right. I'm replying to someone talking about their
personal computer.
avidiax wrote 8 hours 57 min ago:
It's also about revenue.
Uber could run the complete global rider/driver flow from a
single server.
It doesn't, in part because all of those individual trips earn
$1 or more each, so it's perfectly acceptable to the business
to be more inefficient and use hundreds of servers for this
task.
Similarly, a small website taking 150 ms to render the page only
matters if the lost productivity costs more than the
engineering time to fix it, and even then, only makes sense if
that engineering time isn't more productively used to add
features or reliability.
onethumb wrote 50 min ago:
Uber could not run the complete global rider/driver flow from
a single server.
zahlman wrote 9 hours 58 min ago:
> The biggest tool in the performance toolbox is stubbornness.
Without it all the mechanical sympathy in the world will go
unexploited.
The sympathy is also needed. Problems aren't found when people
don't care, or consider the current performance acceptable.
> There's about a factor of 3 improvement that can be made to
most code after the profiler has given up. That probably means
there are better profilers that could be written, but in 20 years
of having them I've only seen 2 that tried.
It's hard for profilers to identify slowdowns that are due to the
architecture. Making the function do less work to get its result
feels different from determining that the function's result is
unnecessary.
hinkley wrote 8 hours 23 min ago:
Architecture, cache eviction, memory bandwidth, thermal
throttling.
All of which have gotten perhaps an order of magnitude worse in
the time since I started on this theory.