# Troubleshooting

Start with the narrowest failing command and verify the runtime layer before
debugging Diffulex-specific behavior. Most failures fall into environment,
configuration, model loading, scheduler capacity, or serving lifecycle issues.

## Import Errors

Confirm that Diffulex is installed in the active Python environment:

```bash
python -c "from diffulex import Diffulex, SamplingParams; print('ok')"
```

If this fails, check that the shell is using the expected virtual environment
and that the package was installed from the repository root.

Also check lightweight kernel imports:

```bash
python -c "import diffulex_kernel; print('ok')"
```

This import should not eagerly load all optional kernels.

## CUDA Availability

Confirm that PyTorch can see the expected GPUs:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```

If CUDA is unavailable, fix the PyTorch/CUDA installation before changing
Diffulex settings. If the device count is lower than expected, inspect
`CUDA_VISIBLE_DEVICES` and cluster scheduler allocation.

## Model Loading

Check that model and tokenizer paths are local directories. Diffulex validates
the model directory and then loads Hugging Face config and tokenizer metadata.

For LoRA runs:

| Setting | What to check |
| --- | --- |
| `use_lora` / `--use-lora` | Enable adapter loading only when a LoRA checkpoint should be used. |
| `lora_path` / `--lora-path` | Point to the adapter checkpoint directory when LoRA is enabled. |
| Adapter and base model | Confirm the adapter was trained for the selected base model family. |

If startup fails after model loading begins, retry with `tensor_parallel_size=1`
and `data_parallel_size=1` to separate model compatibility from distributed
topology problems.

## Configuration Errors

Validation errors usually name the invalid field. Common examples:

| Field or condition | Constraint |
| --- | --- |
| `block_size`, `page_size` | `block_size` must be less than or equal to `page_size`, and both values must be supported for the selected model family. |
| `decoding_strategy="dmax"` | Requires `sampling_mode="edit"` and a compatible model name. |
| Parallel world size | Must not exceed the number of visible CUDA devices. |
| `max_num_batched_tokens`, `max_model_len` | `max_num_batched_tokens` must be at least the effective `max_model_len`. |

Fix the first validation error before looking at later symptoms.

## Serving

Use a small `max_num_reqs`, `max_num_batched_tokens`, and `max_model_len` while
validating a new serving command.

If the HTTP server starts but the client cannot connect, check the host, port,
and client base URL. If the server exits during startup, inspect backend worker
logs and reduce engine limits.

## Benchmarking

Use `--dataset-limit` when testing a new config. If lm-eval cannot find a task,
check `--dataset` and `--include-path`. If code tasks fail before scoring, check
whether `--confirm-run-unsafe-code` is required.

## Performance Problems

First confirm correctness with eager mode and small batches. Then measure one
change at a time:

| Change | What it measures |
| --- | --- |
| Remove `--enforce-eager` | Measures optimized execution after eager-mode correctness is established. |
| Enable CUDA graph paths | Measures launch-overhead reduction. |
| Increase request and token limits | Measures scheduler and memory behavior under larger serving load. |
| Increase tensor or data parallelism | Measures multi-GPU scaling after single-device behavior is stable. |

Avoid changing model family, strategy, thresholds, and optimization flags in the
same experiment.