We organized a tutorial at ACM SIGCOMM 2021, based on our work on optimizing distributed deep learning systems (SwitchML, OmniReduce, GRACE).
A recording of the tutorial is available below. The slides and other hands-on materials are available at this GitHub repository.