Model and Data Parallelism

Diffulex exposes tensor, data, and expert parallel dimensions. The effective world size is computed from these values, and the number of visible CUDA devices must be large enough for the requested topology.

Tensor Parallelism

tensor_parallel_size partitions model compute across devices.

Set it to an integer from 1 to 8. The core Config default is 2, while the server and benchmark CLIs default to 1 so a fresh run can start on a single GPU.

Increase tensor parallelism when one model replica does not fit or when a model family expects tensor-parallel execution. Use 1 for initial debugging.

Data Parallelism

data_parallel_size runs independent request-processing groups.

Set it to an integer from 1 to 1024. The default is 1, which means a single request-processing group.
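
The documented ranges for both settings can be checked before a run starts. The following is a minimal sketch, not Diffulex's own validation code: validate_parallel_sizes is a hypothetical helper, and only the bounds (1 to 8 and 1 to 1024) come from this page.

```python
def validate_parallel_sizes(tensor_parallel_size: int = 1,
                            data_parallel_size: int = 1) -> None:
    """Raise ValueError if either size is outside the documented range.

    Hypothetical helper for illustration; Diffulex may validate differently.
    """
    # (value, lowest allowed, highest allowed) per setting, as documented above.
    limits = {
        "tensor_parallel_size": (tensor_parallel_size, 1, 8),
        "data_parallel_size": (data_parallel_size, 1, 1024),
    }
    for name, (value, low, high) in limits.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")


# Passes silently: both values are inside their ranges.
validate_parallel_sizes(tensor_parallel_size=2, data_parallel_size=4)
```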

Data parallelism improves serving throughput when each group can own a model-parallel worker set. Because every additional group needs its own devices, a larger data_parallel_size raises the required CUDA device count.
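
A rough way to reason about device requirements is sketched below. It assumes that each data-parallel group owns a full tensor-parallel worker set, so the two sizes multiply. That formula is an assumption for illustration, not a statement of how Diffulex computes its world size, and expert parallelism is deliberately left out. Both function names are hypothetical.

```python
def required_device_count(tensor_parallel_size: int, data_parallel_size: int) -> int:
    """Estimate how many CUDA devices a topology needs.

    ASSUMPTION: every data-parallel group owns its own tensor-parallel
    worker set, so the sizes multiply. Expert parallelism is not modeled.
    """
    return tensor_parallel_size * data_parallel_size


def check_topology(tensor_parallel_size: int,
                   data_parallel_size: int,
                   visible_devices: int) -> int:
    """Return the estimated device count, or raise if too few are visible."""
    needed = required_device_count(tensor_parallel_size, data_parallel_size)
    if needed > visible_devices:
        raise RuntimeError(
            f"topology needs {needed} devices but only {visible_devices} are visible"
        )
    return needed


# 2-way tensor parallelism in each of 2 groups fits on 4 visible devices.
print(check_topology(tensor_parallel_size=2, data_parallel_size=2, visible_devices=4))  # 4
```

In a PyTorch environment, the visible device count could be taken from torch.cuda.device_count() instead of being passed in by hand.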

Device Selection

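Use device_ids or --device-ids to select logical CUDA device IDs. When CUDA_VISIBLE_DEVICES is set, only the listed physical GPUs are exposed to the process, and PyTorch numbers them as logical IDs starting at 0 in the order they are listed. The IDs passed to device_ids therefore refer to positions in that list, not to physical GPU indices.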
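
The remapping can be illustrated without touching a GPU. This sketch assumes CUDA_VISIBLE_DEVICES holds a plain comma-separated list of valid entries; logical_to_physical is a hypothetical helper that is not part of Diffulex or PyTorch.

```python
from typing import Dict


def logical_to_physical(cuda_visible_devices: str) -> Dict[int, str]:
    """Map logical CUDA IDs (0, 1, ...) to entries of CUDA_VISIBLE_DEVICES.

    Entries stay strings because the variable may hold GPU UUIDs as well
    as integer indices. Assumes every entry is valid.
    """
    entries = [e.strip() for e in cuda_visible_devices.split(",") if e.strip()]
    # Logical ID i refers to the i-th listed entry.
    return dict(enumerate(entries))


# With CUDA_VISIBLE_DEVICES=2,5 the process sees two devices: logical 0 and 1.
print(logical_to_physical("2,5"))  # {0: '2', 1: '5'}
```

Under this mapping, selecting logical device IDs 0 and 1 would run on physical GPUs 2 and 5.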