URI: 
        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   PyTorch Monarch
       
       
        bjourne wrote 20 hours 34 min ago:
        > Monarch lets you program distributed systems the way you’d program
        a single machine, hiding the complexity of distributed computing:
        
         There is some infamous tech based on the "hiding" paradigm. PHP comes
         to mind. By hiding how the HTTP request/response cycle actually works,
        it fostered a generation of web developers who didn't know what a
        session cookie was, resulting in login systems that leaked like a
        sieve. Distributed computing is complicated. There are many parameters
        you need to tweak and many design decisions you need to take to make
        distributed model training run smoothly. I think explicit and
        transparent architectures are way better. Distributed model training
        shouldn't "feel" like running on a single device because it isn't.
       
        semessier wrote 23 hours 28 min ago:
         this could become a major thing in the coarray world, but the issues
         start already:
        
        > ...Note that this does not support tensor engine, which is tied to
        CUDA and RDMA (via ibverbs).
        
         I.e. yet another CUDA-married approach: the issue is not ibverbs, but
         the code shows they use GPUDirect RDMA, and from there this can only
         get worse - more CUDA dependencies. OpenUCX would have been an
         alternative.
       
        fadedsignal wrote 1 day ago:
        It is a nice project. I have questions.
        
         - Is this similar to Open MPI?
        
        - How is a mesh established? Do they need to be on the same host?
       
        SomaticPirate wrote 1 day ago:
        "Our Rust-based backend facilitates our performance, scale, and
        robustness  — we amply use Rust’s fearless concurrency in
        Monarch’s implementation"
        
         Found a few typos. The em dash makes me suspect an LLM was involved in
         proofreading.
       
          hellohello2 wrote 1 day ago:
          I would argue that typos suggest an LLM did not proofread.
       
          whimsicalism wrote 1 day ago:
          that it is surrounded by spaces makes this less likely
       
            ComputerGuru wrote 22 hours 56 min ago:
            Most style guides would call that an error, em dash should be used
            without surrounding spaces (while an en dash requires them). The
            only publication I know that has (recently?) eschewed that advice
            is WaPo. If the idea was to make it more visible, I believe the
            correct solution would have been for WaPo to use an en dash but
            render it longer in their typeface.
       
              whimsicalism wrote 21 hours 46 min ago:
               yes, i agree with you and this is how i used to use em dashes.
              chatgpt also agrees with you, which is why spaces are a pretty
              good indicator that it's not an LLM
       
          alt187 wrote 1 day ago:
          
          
  HTML    [1]: https://www.scottsmitelli.com/articles/em-dash-tool/
       
            geedzmo wrote 1 day ago:
            That was a really good read. Glad I clicked
       
              alt187 wrote 1 day ago:
               It's not even one of the author's funniest pieces, and that
               says a lot.
       
        chandureddyvari wrote 1 day ago:
        Interesting - this seems to target a different layer than services like
        Tinker ( [1] ). Monarch provides the infrastructure primitives while
        Tinker is a managed finetuning service. Could someone build something
        like Tinker on top of Monarch?
        
  HTML  [1]: https://thinkingmachines.ai/blog/announcing-tinker/
       
          gaogao wrote 1 day ago:
          Yup, there's stuff like [1] on top of it now
          
  HTML    [1]: https://pytorch.org/blog/introducing-torchforge/
       
            pstoll wrote 17 hours 12 min ago:
            “Service Adverbs - like ‘route’ and ‘fanout’”
            
            Grammarians are going to be big angry here. Ain’t an adverb in
            sight.
       
            chandureddyvari wrote 1 day ago:
            Nice, so the open source equivalent now exists. Meta basically
             commoditized Tinker's ($12B valuation) value prop by giving away the
            infra (Monarch) and the RL framework (TorchForge). Will be
            interesting to see how a managed service competes with free + open
            source at this layer.
       
        porridgeraisin wrote 1 day ago:
        > This lets us avoid single-host bottlenecks, effectively using the
        whole mesh as a distributed cluster for message forwarding. (Cite
        scalability numbers here.)
        
         In case someone who can fix this is reading here.
       
        nothrowaways wrote 1 day ago:
        FB should create a pytorch foundation and set it free before they fuck
        it up.
       
          gooodvibes wrote 1 day ago:
          
          
  HTML    [1]: https://pytorch.org/foundation/
       
            dkdcio wrote 1 day ago:
            damn that was fast!
       
        logicchains wrote 1 day ago:
        This seems strictly less powerful than Jax, which comes with a powerful
        compiler that optimises how cross-node communication is conducted.
       
          gaogao wrote 1 day ago:
          Nah, focusing on a different controller paradigm. Jax is focused on
          multi-controller SPMD, while this is focused on a single-controller
          setup. Both have their place, with single-controller being generally
          easier to reason about, and multi-controller more optimal for certain
          dataflows. There's also some interesting mixes of the two control
          paradigms.
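           
           For intuition, a toy sketch of the two paradigms in plain Python
           (all names here are invented for illustration; this is not the
           Monarch or JAX API, and the "workers" are just function calls in
           one process):
           
```python
# Toy contrast between single-controller and multi-controller SPMD.
# Invented for illustration only; not the Monarch or JAX API.

def step(shard):
    """Pretend training step: reduce one data shard to a partial result."""
    return sum(shard)

def single_controller(shards):
    # One driver owns the control flow: it dispatches work to every
    # worker and performs the final reduction itself.
    partials = [step(s) for s in shards]
    return sum(partials)

def spmd_program(rank, shards):
    # Every rank runs this same program on its own shard; exchanging
    # partials stands in for a collective like all_reduce, after which
    # each rank holds the same global result.
    exchanged = [step(s) for s in shards]
    assert exchanged[rank] == step(shards[rank])
    return sum(exchanged)

shards = [[1, 2], [3, 4]]
print(single_controller(shards))                         # 10
print(spmd_program(0, shards), spmd_program(1, shards))  # 10 10
```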
       
        alyxya wrote 1 day ago:
         I made my own single-controller PyTorch extension [1], though mine
         doesn't yet support cross-node communication. I found it interesting to
         compare how Monarch makes things performant. I believe Monarch uses
         cloudpickle to share code among all the nodes, which is probably the
         only performant way to have the various nodes execute work, since
         serializing the code becomes a one-time setup cost. I also found the
         fan-out of messages from the single controller really interesting: it
         means the controller is unlikely to be the bottleneck, aside from any
         synchronous operations.
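         
         The code-shipping idea can be sketched with stdlib pickle (a stand-in
         here; cloudpickle extends the same round-trip to lambdas, closures,
         and interactively defined functions by serializing them by value):
         
```python
import pickle

def train_step(x):
    # Stand-in for work the controller wants every node to run.
    return 2 * x

# Controller side: serialize the code once (a one-time setup cost).
payload = pickle.dumps(train_step)

# Worker side: deserialize and call. Stdlib pickle stores module-level
# functions by reference (module + name); cloudpickle can instead ship
# the function body itself, which is what makes sending arbitrary code
# to remote processes possible.
fn = pickle.loads(payload)
print(fn(21))  # prints 42
```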
        
        As far as things that might be a performance loss here, one thing I'm
        wondering is if custom kernels are supported. I'm also wondering how
        much granularity of control there is with communication between
        different actors calling a function. Overall, I really like this
        project and hope to see it used over multi-controller setups.
        
  HTML  [1]: https://github.com/alyxya/mycelya-torch
       
          gaogao wrote 1 day ago:
          > As far as things that might be a performance loss here, one thing
          I'm wondering is if custom kernels are supported
          
          Yeah, you might end up needing some changes to remote worker
          initialization, but you can generally bake in whatever kernels and
          other system code you need.
       
        milancurcic wrote 1 day ago:
        Cool! Essentially Fortran coarrays from 2008.
       
          philipallstar wrote 1 day ago:
          Or Hadoop from 2006? But you don't need to write MapReduce or
          Fortran, so it's probably far nicer.
       
            pjmlp wrote 9 hours 51 min ago:
            Fortran 2023 is already quite nice, and doesn't need to rewrite
            stuff in C for performance.
       
        valzam wrote 1 day ago:
        I assume this is similar to Ray?
       
          cwp wrote 13 hours 23 min ago:
          The code example is very similar to Ray.
          
          Monarch:
          
            class Example(Actor):
               @endpoint
                def say_hello(self, txt):
                    return f"hello {txt}"
          
            procs = this_host().spawn_procs({"gpus": 8})
            actors = procs.spawn("actors", Example)
            hello_future = actors.say_hello.call("world")
            hello_future.get()
          
          Ray:
          
            @ray.remote(num_gpus=1)
            class Example:
                 def say_hello(self, txt):
                     return f"hello {txt}"
          
            actors = [Example.remote() for _ in range(8)]
            hello_object_refs = [a.say_hello.remote("world") for a in actors]
            ray.get(hello_object_refs)
       
          unnah wrote 1 day ago:
          There's also Dask, which can do distributed pandas and numpy
          operations etc. However it was originally developed for traditional
          HPC systems and has only limited support for GPU computing.
          
  HTML    [1]: https://www.dask.org/
       
          disattention wrote 1 day ago:
          I had the same thought, especially because of their recent
          collaboration.
          
  HTML    [1]: https://pytorch.org/blog/pytorch-foundation-welcomes-ray-to-...
       
          lairv wrote 1 day ago:
          I'm also curious what's the use case of this over Ray. Tighter
          integration with PyTorch/tensors abstractions?
       
            porridgeraisin wrote 1 day ago:
            That.
            
            Also, it has RDMA. Last I checked, Ray did not support RDMA.
            
             There are probably other differences as well, but the lack of RDMA
             immediately splits the world into things you can do with Ray and
             things you cannot.
       
              zacmps wrote 1 day ago:
              Not currently, but it is being worked on [1] .
              
  HTML        [1]: https://github.com/ray-project/ray/issues/53976
       
        jonapro wrote 1 day ago:
        Beowulf then.
       
        pjmlp wrote 1 day ago:
        Apparently PyTorch oxidation has started.
        
        > Monarch is split into a Python-based frontend, and a backend
        implemented in Rust.
        
        Other than that, looks like a quite interesting project.
       
          dhrt12327 wrote 1 day ago:
          Multiple sources say that it is an experimental framework around
          PyTorch, not a replacement. People will still get to enjoy a circular
          graph using std::shared_ptr with memory leaks.
          
          It's a pity they don't do a complete rewrite with a functional
          language as the driver.
       
            bullfightonmars wrote 16 hours 28 min ago:
            You might be looking for elixir/nx and axon
            
  HTML      [1]: https://github.com/elixir-nx/axon
       
            hansvm wrote 1 day ago:
            Arc has entered the chat.
       
            pjmlp wrote 1 day ago:
            Interesting, by the way, you can replicate the experience in Rust.
       
            gaogao wrote 1 day ago:
            > It's a pity they don't do a complete rewrite with a functional
            language as the driver.
            
            It's open source, so seeing such an extension would be quite cool.
             There's much that could be done with native Rust actors and code
             that might get at what you want, but nothing precludes mixing
            PyTorch and other backends.
            
            For example, you could wrap a C++ inference engine as part of one
            of the actors generating data for other actors doing distributed
            training.
       
          galangalalgol wrote 1 day ago:
          This is a new project right? Not the oxidation of an existing one.
       
            gaogao wrote 1 day ago:
             Yup, hyperactor, one of the new crates that's part of it, does
            some particularly interesting things for efficient parallel
            distributed channels.
       
       
   DIR <- back to front page