_______ __ _______
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----.
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --|
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____|
on Gopher (unofficial)
COMMENT PAGE FOR:
HTML PyTorch Monarch
bjourne wrote 20 hours 34 min ago:
> Monarch lets you program distributed systems the way you'd program
a single machine, hiding the complexity of distributed computing:
There is some infamous tech based on the "hiding" paradigm. PHP comes
to mind: by hiding how the HTTP request/response cycle actually works,
it fostered a generation of web developers who didn't know what a
session cookie was, resulting in login systems that leaked like a
sieve. Distributed computing is complicated. There are many parameters
you need to tweak and many design decisions you need to make to get
distributed model training running smoothly. I think explicit and
transparent architectures are far better. Distributed model training
shouldn't "feel" like running on a single device, because it isn't.
semessier wrote 23 hours 28 min ago:
this could become a major thing in the coarray world, but the issues
start already:
> ...Note that this does not support tensor engine, which is tied to
CUDA and RDMA (via ibverbs).
I.e., yet another CUDA-married approach. The issue is not ibverbs but
that the code shows they use GPUDirect RDMA; from there, this can only
get worse, with more CUDA dependencies. OpenUCX would have been an
alternative.
fadedsignal wrote 1 day ago:
It is a nice project. I have questions.
- Is this similar to Open MPI?
- How is a mesh established? Do they need to be on the same host?
SomaticPirate wrote 1 day ago:
"Our Rust-based backend facilitates our performance, scale, and
robustness â we amply use Rustâs fearless concurrency in
Monarchâs implementation"
Found a few typo's. The em dash makes me suspect an LLM was involved in
proofreading
hellohello2 wrote 1 day ago:
I would argue that typos suggest an LLM did not proofread.
whimsicalism wrote 1 day ago:
that it is surrounded by spaces makes this less likely
ComputerGuru wrote 22 hours 56 min ago:
Most style guides would call that an error: an em dash should be used
without surrounding spaces (while an en dash requires them). The only
publication I know of that has (recently?) eschewed that advice is
WaPo. If the idea was to make the dash more visible, I believe the
correct solution would have been for WaPo to use an en dash but render
it longer in their typeface.
whimsicalism wrote 21 hours 46 min ago:
yes, i agree with you and this is how i used to use em dashes.
chatgpt also agrees with you, which is why spaces are a pretty
good indicator that it's not an LLM
alt187 wrote 1 day ago:
HTML [1]: https://www.scottsmitelli.com/articles/em-dash-tool/
geedzmo wrote 1 day ago:
That was a really good read. Glad I clicked
alt187 wrote 1 day ago:
It's not even one of the author's funniest pieces, and that says a
lot.
chandureddyvari wrote 1 day ago:
Interesting - this seems to target a different layer than services like
Tinker [1]. Monarch provides the infrastructure primitives, while
Tinker is a managed finetuning service. Could someone build something
like Tinker on top of Monarch?
HTML [1]: https://thinkingmachines.ai/blog/announcing-tinker/
gaogao wrote 1 day ago:
Yup, there's stuff like [1] on top of it now
HTML [1]: https://pytorch.org/blog/introducing-torchforge/
pstoll wrote 17 hours 12 min ago:
"Service Adverbs - like 'route' and 'fanout'"
Grammarians are going to be big angry here. Ain't an adverb in
sight.
chandureddyvari wrote 1 day ago:
Nice, so the open-source equivalent now exists. Meta basically
commoditized Tinker's ($12B valuation) value prop by giving away the
infra (Monarch) and the RL framework (TorchForge). It will be
interesting to see how a managed service competes with free and open
source at this layer.
porridgeraisin wrote 1 day ago:
> This lets us avoid single-host bottlenecks, effectively using the
whole mesh as a distributed cluster for message forwarding. (Cite
scalability numbers here.)
In case someone who can fix this is reading here
nothrowaways wrote 1 day ago:
FB should create a PyTorch foundation and set it free before they fuck
it up.
gooodvibes wrote 1 day ago:
HTML [1]: https://pytorch.org/foundation/
dkdcio wrote 1 day ago:
damn that was fast!
logicchains wrote 1 day ago:
This seems strictly less powerful than JAX, whose compiler optimises
how cross-node communication is conducted.
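For reference, a minimal sketch of what that compiler-managed
communication looks like (current jax.sharding API; under jit, XLA
decides which collectives to emit and when):

    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Declare how the array is split across devices; the compiler, not
    # the programmer, schedules the cross-device communication.
    mesh = Mesh(jax.devices(), axis_names=("x",))
    xs = jax.device_put(jnp.arange(8.0), NamedSharding(mesh, P("x")))
    total = jax.jit(lambda v: v.sum())(xs)  # XLA inserts the reduction
    print(total)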
gaogao wrote 1 day ago:
Nah, it's focusing on a different controller paradigm. JAX is focused
on multi-controller SPMD, while this is focused on a single-controller
setup. Both have their place: single-controller is generally easier to
reason about, while multi-controller is better suited to certain
dataflows. There are also some interesting mixes of the two control
paradigms.
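To make the distinction concrete (a toy sketch using only the stdlib,
not any framework's API): in a single-controller setup one driver
process issues commands to the workers, whereas under multi-controller
SPMD every process would run the same script and coordinate via
collectives.

    import multiprocessing as mp

    def worker(rank, inbox):
        # Single-controller style: a worker only does what the driver says.
        cmd = inbox.get()
        print(f"worker {rank}: {cmd}")

    if __name__ == "__main__":
        queues = [mp.Queue() for _ in range(4)]
        procs = [mp.Process(target=worker, args=(r, q))
                 for r, q in enumerate(queues)]
        for p in procs:
            p.start()
        for q in queues:      # the single driver fans work out
            q.put("run step 0")
        for p in procs:
            p.join()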
alyxya wrote 1 day ago:
I made my own single-controller PyTorch extension [1], though mine
doesn't yet support cross-node communication. I found it interesting
to compare how Monarch makes things performant. I believe Monarch also
uses cloudpickle to share code among all the nodes, which is probably
the only performant way to have many nodes execute arbitrary work,
since shipping the code becomes a one-time setup cost (sketched
below). I also found the fan-out of messages from the single
controller really interesting: it makes the controller unlikely to be
the bottleneck, aside from any synchronous operations.
As far as things that might be a performance loss here, one thing I'm
wondering is if custom kernels are supported. I'm also wondering how
much granularity of control there is over communication between
different actors calling a function. Overall, I really like this
project and hope to see it used over multi-controller setups.
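Roughly, the cloudpickle pattern in question (a sketch of the standard
dumps/loads round trip; whether Monarch does exactly this internally
is my guess above):

    import cloudpickle

    def train_step(x):  # defined only on the controller
        return x * 2

    payload = cloudpickle.dumps(train_step)  # ship bytes to each node once
    fn = cloudpickle.loads(payload)          # node side: rebuild and call it
    assert fn(21) == 42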
HTML [1]: https://github.com/alyxya/mycelya-torch
gaogao wrote 1 day ago:
> As far as things that might be a performance loss here, one thing
I'm wondering is if custom kernels are supported
Yeah, you might end up needing some changes to remote worker
initialization, but you can generally bake in whatever kernels and
other system code you need.
milancurcic wrote 1 day ago:
Cool! Essentially Fortran coarrays from 2008.
philipallstar wrote 1 day ago:
Or Hadoop from 2006? But you don't need to write MapReduce or
Fortran, so it's probably far nicer.
pjmlp wrote 9 hours 51 min ago:
Fortran 2023 is already quite nice, and doesn't need to rewrite
stuff in C for performance.
valzam wrote 1 day ago:
I assume this is similar to Ray?
cwp wrote 13 hours 23 min ago:
The code example is very similar to Ray.
Monarch:
class Example(Actor):
@endpoint
def say_hello(self, txt):
return f"hello {txt}"
procs = this_host().spawn_procs({"gpus": 8})
actors = procs.spawn("actors", Example)
hello_future = actors.say_hello.call("world")
hello_future.get()
Ray:
@ray.remote(num_gpus=1)
class Example:
def say_hello(self, txt):
return f"hello {txt}"
actors = [Example.remote() for _ in range(8)]
hello_object_refs = [a.say_hello.remote("world") for a in actors]
ray.get(hello_object_refs)
unnah wrote 1 day ago:
There's also Dask, which can do distributed pandas and numpy
operations etc. However it was originally developed for traditional
HPC systems and has only limited support for GPU computing.
HTML [1]: https://www.dask.org/
disattention wrote 1 day ago:
I had the same thought, especially because of their recent
collaboration.
HTML [1]: https://pytorch.org/blog/pytorch-foundation-welcomes-ray-to-...
lairv wrote 1 day ago:
I'm also curious what the use case is for this over Ray. Tighter
integration with PyTorch/tensor abstractions?
porridgeraisin wrote 1 day ago:
That.
Also, it has RDMA. Last I checked, Ray did not support RDMA.
There are probably other differences as well, but the lack of RDMA
immediately splits the world into things you can do with Ray and
things you cannot.
zacmps wrote 1 day ago:
Not currently, but it is being worked on [1].
HTML [1]: https://github.com/ray-project/ray/issues/53976
jonapro wrote 1 day ago:
Beowulf then.
pjmlp wrote 1 day ago:
Apparently PyTorch oxidation has started.
> Monarch is split into a Python-based frontend, and a backend
implemented in Rust.
Other than that, it looks like quite an interesting project.
dhrt12327 wrote 1 day ago:
Multiple sources say that it is an experimental framework around
PyTorch, not a replacement. People will still get to enjoy circular
graphs built from std::shared_ptr, complete with memory leaks.
It's a pity they don't do a complete rewrite with a functional
language as the driver.
bullfightonmars wrote 16 hours 28 min ago:
You might be looking for elixir/nx and axon
HTML [1]: https://github.com/elixir-nx/axon
hansvm wrote 1 day ago:
Arc has entered the chat.
pjmlp wrote 1 day ago:
Interesting, by the way, you can replicate the experience in Rust.
gaogao wrote 1 day ago:
> It's a pity they don't do a complete rewrite with a functional
language as the driver.
It's open source, so seeing such an extension would be quite cool.
There's a lot that could be done with native Rust actors and code that
might get at what you want, and nothing precludes mixing PyTorch with
other backends.
For example, you could wrap a C++ inference engine as part of one
of the actors generating data for other actors doing distributed
training.
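A sketch of that shape, reusing the Actor/endpoint API shown elsewhere
in this thread (my_cpp_engine stands in for a hypothetical pybind11
binding, not a real module):

    from monarch.actor import Actor, endpoint, this_host

    class InferenceActor(Actor):
        @endpoint
        def generate(self, prompt):
            # In a real setup this would call the C++ engine binding,
            # e.g. my_cpp_engine.generate(prompt) (hypothetical).
            return f"completion for: {prompt}"

    procs = this_host().spawn_procs({"gpus": 1})
    actor = procs.spawn("inference", InferenceActor)
    print(actor.generate.call("hello").get())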
galangalalgol wrote 1 day ago:
This is a new project right? Not the oxidation of an existing one.
gaogao wrote 1 day ago:
Yup. hyperactor, one of the new crates that's part of it, does
some particularly interesting things for efficient parallel
distributed channels.