# Model and Data Parallelism

Diffulex exposes tensor, data, and expert parallel dimensions. The effective
world size is computed from these values, and the run needs at least that many
visible CUDA devices to build the requested topology.
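
As a rough illustration of the device-count requirement, the sketch below assumes the world size is the product of the tensor and data parallel sizes. That formula is an assumption made for illustration; the expert parallel dimension is left out because this page does not say how it is factored in. `required_devices` and `check_topology` are hypothetical helpers, not part of Diffulex.

```python
def required_devices(tensor_parallel_size: int, data_parallel_size: int) -> int:
    """Assumed world size: one tensor-parallel worker set per data-parallel group."""
    return tensor_parallel_size * data_parallel_size


def check_topology(tensor_parallel_size: int, data_parallel_size: int,
                   visible_devices: int) -> None:
    """Raise ValueError if the assumed world size exceeds the visible device count."""
    needed = required_devices(tensor_parallel_size, data_parallel_size)
    if needed > visible_devices:
        raise ValueError(
            f"topology needs {needed} devices but only {visible_devices} are visible"
        )


# Example: 2-way tensor parallelism with 2 data-parallel groups on a 4-GPU host.
check_topology(tensor_parallel_size=2, data_parallel_size=2, visible_devices=4)
```

In a real run, the visible device count could be taken from `torch.cuda.device_count()` instead of being passed by hand.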

## Tensor Parallelism

`tensor_parallel_size` partitions model compute across devices.

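Set it to an integer from `1` to `8`. The core `Config` default is `2`, while
the server and benchmark CLIs default to `1` so a fresh run can start on a
single GPU. When constructing `Config` directly on a single-GPU machine, set
`tensor_parallel_size` to `1` explicitly.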

Increase tensor parallelism when one model replica does not fit in the memory
of a single device or when a model family expects tensor-parallel execution.
Use `1` for initial debugging.

## Data Parallelism

`data_parallel_size` runs independent request-processing groups.

Set it to an integer from `1` to `1024`. The default is `1`, which means a
single request-processing group.

Data parallelism improves serving throughput when each group can own a
model-parallel worker set. Every additional group needs its own devices, so
the required CUDA device count grows with `data_parallel_size`.
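
The documented ranges for `tensor_parallel_size` (`1` to `8`) and `data_parallel_size` (`1` to `1024`) can be checked before launch. The following is a minimal sketch of such a check; `validate_parallel_sizes` is a hypothetical helper, and Diffulex may perform its own validation with different error messages.

```python
TENSOR_PARALLEL_RANGE = (1, 8)     # documented range for tensor_parallel_size
DATA_PARALLEL_RANGE = (1, 1024)    # documented range for data_parallel_size


def _check_range(name: str, value: int, bounds: tuple) -> None:
    """Reject non-integers and values outside the inclusive bounds."""
    low, high = bounds
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError(f"{name} must be an integer, got {value!r}")
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def validate_parallel_sizes(tensor_parallel_size: int = 1,
                            data_parallel_size: int = 1) -> None:
    """Hypothetical pre-launch check; defaults mirror the CLI defaults of 1."""
    _check_range("tensor_parallel_size", tensor_parallel_size, TENSOR_PARALLEL_RANGE)
    _check_range("data_parallel_size", data_parallel_size, DATA_PARALLEL_RANGE)


# Example: a valid 4-way tensor-parallel, 2-group configuration passes silently.
validate_parallel_sizes(tensor_parallel_size=4, data_parallel_size=2)
```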

## Device Selection

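Use `device_ids` or `--device-ids` to select logical CUDA device IDs. When
`CUDA_VISIBLE_DEVICES` is set, only the listed physical GPUs are visible to
PyTorch, and they are renumbered as logical IDs starting at `0` in the order
listed. For example, with `CUDA_VISIBLE_DEVICES=2,3`, logical ID `0` refers to
physical GPU `2`.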
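
To see which physical GPU a logical ID points at, the mapping can be reconstructed from the environment variable. The sketch below is illustrative only: `logical_to_physical` is a hypothetical helper, and it simplifies the CUDA rules by treating every comma-separated entry as valid.

```python
import os


def logical_to_physical(env=None):
    """Map logical CUDA IDs to the entries of CUDA_VISIBLE_DEVICES, in order.

    Returns None when the variable is unset, meaning no remapping is applied.
    Entries are kept as strings because they may be indices or GPU UUIDs.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    entries = [entry.strip() for entry in raw.split(",") if entry.strip()]
    return dict(enumerate(entries))


# Example: physical GPUs 2 and 3 become logical IDs 0 and 1.
print(logical_to_physical({"CUDA_VISIBLE_DEVICES": "2,3"}))  # {0: '2', 1: '3'}
```

With that environment, passing `0` as a device ID selects physical GPU `2`.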

## Related Arguments

| Surface | Names | Notes |
| --- | --- | --- |
| Python/config | `tensor_parallel_size`, `data_parallel_size`, `device_ids` | Use these when constructing `Config` or editing YAML. |
| CLI | `--tensor-parallel-size`, `--data-parallel-size`, `--device-ids` | Use these when launching server or benchmark commands. |