Troubleshooting

Start with the narrowest failing command and verify the runtime layer before debugging Diffulex-specific behavior. Most failures fall into one of five categories: environment, configuration, model loading, scheduler capacity, or serving lifecycle.

Import Errors

Confirm that Diffulex is installed in the active Python environment:

python -c "from diffulex import Diffulex, SamplingParams; print('ok')"

If this fails, check that the shell is using the expected virtual environment and that the package was installed from the repository root.
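When it is unclear which environment the shell is actually using, a short script can report the interpreter path and where the package resolves from. This is a minimal sketch that uses only the standard library; the helper name `describe_install` is hypothetical and not part of Diffulex.

```python
import importlib.util
import sys


def describe_install(package: str = "diffulex") -> dict:
    """Report the running interpreter and where `package` resolves from."""
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        # In a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in describe_install().items():
        print(f"{key}: {value}")
```

If `found` is false, or `location` points outside the repository checkout you expected, the package is probably installed in a different environment than the one the shell is running.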

Also check lightweight kernel imports:

python -c "import diffulex_kernel; print('ok')"

This import should not eagerly load all optional kernels.

CUDA Availability

Confirm that PyTorch can see the expected GPUs:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

If CUDA is unavailable, fix the PyTorch/CUDA installation before changing Diffulex settings. If the device count is lower than expected, inspect `CUDA_VISIBLE_DEVICES` and the GPU allocation granted by the cluster scheduler.
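The two checks above can be combined into one report that shows the device mask next to what PyTorch sees. The sketch below is illustrative: `visible_device_report` is a hypothetical helper, and it returns a partial report if PyTorch is not installed.

```python
import os


def visible_device_report(env=None) -> dict:
    """Summarize the CUDA device mask and, if available, PyTorch's view of it."""
    # `env` is only used to read the mask; PyTorch itself always reads the
    # real process environment.
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    report = {
        "CUDA_VISIBLE_DEVICES": raw,
        # None means the variable is unset; an empty list means it is set
        # to an empty string, which hides every GPU.
        "masked_ids": None if raw is None
        else [part.strip() for part in raw.split(",") if part.strip()],
    }
    try:
        import torch
    except ImportError:
        report["torch"] = "not installed"
        return report

    report["torch"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    report["device_names"] = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    return report


if __name__ == "__main__":
    for key, value in visible_device_report().items():
        print(f"{key}: {value}")
```

A `torch_cuda_build` of `None` indicates a CPU-only PyTorch build, which no Diffulex setting can work around.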

Model Loading

Check that model and tokenizer paths are local directories. Diffulex validates the model directory and then loads Hugging Face config and tokenizer metadata.
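A quick pre-check of the directory can rule out path mistakes before the engine starts. The sketch below is not Diffulex's own validation: `check_model_dir` is a hypothetical helper, and the file names it looks for follow the common Hugging Face checkpoint layout, which is an assumption about what a given model directory contains.

```python
from pathlib import Path

# Assumed file names, based on the usual Hugging Face checkpoint layout.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json", "tokenizer.model")


def check_model_dir(path: str) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    root = Path(path).expanduser()
    if not root.is_dir():
        return [f"{root} is not a local directory"]

    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    if not any((root / name).is_file() for name in TOKENIZER_FILES):
        problems.append("no tokenizer file found")
    return problems
```

Run the same check against the tokenizer path if it differs from the model path.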

For LoRA runs:

| Setting | What to check |
| --- | --- |
| `use_lora` / `--use-lora` | Enable adapter loading only when a LoRA checkpoint should be used. |
| `lora_path` / `--lora-path` | Point to the adapter checkpoint directory when LoRA is enabled. |
| Adapter and base model | Confirm the adapter was trained for the selected base model family. |

If startup fails after model loading begins, retry with `tensor_parallel_size=1` and `data_parallel_size=1` to separate model compatibility problems from distributed topology problems.

Configuration Errors

Validation errors usually name the invalid field. Common examples:

| Field or condition | Constraint |
| --- | --- |
| `block_size`, `page_size` | `block_size` must be less than or equal to `page_size`, and both values must be supported for the selected model family. |
| `decoding_strategy="dmax"` | Requires `sampling_mode="edit"` and a compatible model name. |
| Parallel world size | Must not exceed the number of visible CUDA devices. |
| `max_num_batched_tokens`, `max_model_len` | `max_num_batched_tokens` must be at least the effective `max_model_len`. |

Fix the first validation error before looking at later symptoms.
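The constraints listed above can be approximated in plain Python to sanity-check a configuration before launching. This is only a sketch, not Diffulex's validator: `preflight` is a hypothetical helper, it treats the parallel world size as `tensor_parallel_size * data_parallel_size` (an assumption), it compares against the configured `max_model_len` rather than the effective one, and it does not check per-model-family support or model-name compatibility.

```python
def preflight(cfg: dict, visible_devices: int) -> list:
    """Return constraint violations in order; an empty list means none found."""
    errors = []

    if cfg["block_size"] > cfg["page_size"]:
        errors.append("block_size must be <= page_size")

    if cfg.get("decoding_strategy") == "dmax" and cfg.get("sampling_mode") != "edit":
        errors.append('decoding_strategy="dmax" requires sampling_mode="edit"')

    # Assumption: world size is the product of the two parallel sizes.
    world_size = cfg.get("tensor_parallel_size", 1) * cfg.get("data_parallel_size", 1)
    if world_size > visible_devices:
        errors.append(
            f"world size {world_size} exceeds {visible_devices} visible CUDA device(s)"
        )

    # Assumption: the configured max_model_len stands in for the effective one.
    if cfg["max_num_batched_tokens"] < cfg["max_model_len"]:
        errors.append("max_num_batched_tokens must be >= max_model_len")

    return errors


if __name__ == "__main__":
    # Illustrative values only.
    example = {
        "block_size": 32,
        "page_size": 32,
        "max_num_batched_tokens": 4096,
        "max_model_len": 2048,
        "tensor_parallel_size": 2,
    }
    for problem in preflight(example, visible_devices=1):
        print(problem)
```

Because the list preserves order, the first entry is the one to fix first, which mirrors the advice above.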

Serving

Use small values for `max_num_reqs`, `max_num_batched_tokens`, and `max_model_len` while validating a new serving command.

If the HTTP server starts but the client cannot connect, check the host, port, and client base URL. If the server exits during startup, inspect backend worker logs and reduce engine limits.
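To separate network problems from client configuration problems, it can help to test the TCP connection directly, without going through the HTTP client. The sketch below uses only the standard library; `can_connect` is a hypothetical helper, and the host and port in the example are placeholders for whatever the serving command was given.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder address; substitute the host and port the server was started with.
    print(can_connect("127.0.0.1", 8000))
```

If this returns `True` but the client still fails, the base URL (scheme, path prefix, or port) is the more likely culprit. Note that a server bound to `0.0.0.0` is reached through a concrete address such as `127.0.0.1` or the machine's IP, not through `0.0.0.0` itself.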

Benchmarking

Use `--dataset-limit` when testing a new config. If `lm-eval` cannot find a task, check `--dataset` and `--include-path`. If code tasks fail before scoring, check whether `--confirm-run-unsafe-code` is required.

Performance Problems

First confirm correctness with eager mode and small batches. Then measure one change at a time:

| Change | What it measures |
| --- | --- |
| Remove `--enforce-eager` | Measures optimized execution after eager-mode correctness is established. |
| Enable CUDA graph paths | Measures launch-overhead reduction. |
| Increase request and token limits | Measures scheduler and memory behavior under larger serving load. |
| Increase tensor or data parallelism | Measures multi-GPU scaling after single-device behavior is stable. |

Avoid changing model family, strategy, thresholds, and optimization flags in the same experiment.
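One way to keep experiments isolated is to generate the run configurations programmatically, so that each run differs from the previous one by exactly one setting. The sketch below is illustrative only: `stepwise_runs` is a hypothetical helper, and the keys in the example (`enforce_eager`, `max_num_reqs`, `tensor_parallel_size`) are assumed names standing in for whatever options your launch command uses.

```python
def stepwise_runs(baseline: dict, steps: list) -> list:
    """Build (label, config) pairs where each run changes one key from the last."""
    runs = [("baseline", dict(baseline))]
    current = dict(baseline)
    for key, value in steps:
        # Copy so earlier configurations are never mutated by later steps.
        current = dict(current)
        current[key] = value
        runs.append((f"{key}={value}", current))
    return runs


if __name__ == "__main__":
    # Assumed option names and illustrative values.
    baseline = {"enforce_eager": True, "max_num_reqs": 4, "tensor_parallel_size": 1}
    steps = [
        ("enforce_eager", False),
        ("max_num_reqs", 32),
        ("tensor_parallel_size", 2),
    ]
    for label, config in stepwise_runs(baseline, steps):
        print(label, config)
```

If a step regresses throughput or correctness, the label identifies the single setting responsible.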