Troubleshooting
Start with the narrowest failing command and verify the runtime layer before debugging Diffulex-specific behavior. Most failures fall into one of five categories: environment, configuration, model loading, scheduler capacity, or serving lifecycle.
Import Errors
Confirm that Diffulex is installed in the active Python environment:
python -c "from diffulex import Diffulex, SamplingParams; print('ok')"
If this fails, check that the shell is using the expected virtual environment and that the package was installed from the repository root.
Also check lightweight kernel imports:
python -c "import diffulex_kernel; print('ok')"
This import should not eagerly load all optional kernels.
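When it is unclear which environment the shell is using, it can help to see where Python would resolve each package from before importing anything. The sketch below uses only the standard library; `locate_modules` is a hypothetical helper name for illustration, not part of Diffulex.

```python
import importlib.util
import sys


def locate_modules(names):
    """Map each top-level module name to where it resolves, or None if not found."""
    found = {}
    for name in names:
        # find_spec locates a top-level module without executing it.
        spec = importlib.util.find_spec(name)
        if spec is None:
            found[name] = None
        else:
            # Namespace packages have no single origin file.
            found[name] = spec.origin or "<namespace package>"
    return found


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    for name, origin in locate_modules(["diffulex", "diffulex_kernel"]).items():
        print(f"{name}: {origin or 'NOT FOUND in this environment'}")
```

If the printed interpreter path is not inside the expected virtual environment, the import failure is probably an environment problem rather than a packaging problem.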
CUDA Availability
Confirm that PyTorch can see the expected GPUs:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
If CUDA is unavailable, fix the PyTorch/CUDA installation before changing
Diffulex settings. If the device count is lower than expected, inspect
CUDA_VISIBLE_DEVICES and cluster scheduler allocation.
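A lower-than-expected device count can be narrowed down by comparing what `CUDA_VISIBLE_DEVICES` exposes with what PyTorch reports. This is a minimal sketch: `parse_visible_devices`, `describe_mismatch`, and the `expected` count are hypothetical names introduced here, and the parser only splits the comma-separated list without validating individual entries.

```python
import os


def parse_visible_devices(value):
    """Return None when the variable is unset (no restriction),
    otherwise the list of device identifiers it names."""
    if value is None:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


def describe_mismatch(visible, device_count, expected):
    """Return a short diagnosis for a lower-than-expected device count."""
    if device_count >= expected:
        return "ok"
    if visible is not None and len(visible) < expected:
        return f"CUDA_VISIBLE_DEVICES exposes only {len(visible)} device(s)"
    return "check the cluster scheduler allocation or driver state"


if __name__ == "__main__":
    visible = parse_visible_devices(os.environ.get("CUDA_VISIBLE_DEVICES"))
    try:
        import torch
        count = torch.cuda.device_count()
    except ImportError:
        count = 0  # PyTorch itself is missing; fix that first.
    print(visible, count, describe_mismatch(visible, count, expected=1))
```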
Model Loading
Check that model and tokenizer paths are local directories. Diffulex validates the model directory and then loads Hugging Face config and tokenizer metadata.
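A quick pre-flight check of the directory can rule out path mistakes before the engine starts. The sketch below is not Diffulex's own validation, whose exact rules are not described here. It assumes the usual Hugging Face layout, where `config.json` holds the model config and tokenizer metadata lives in `tokenizer_config.json` or `tokenizer.json`. `check_model_dir` is a hypothetical helper name.

```python
from pathlib import Path


def check_model_dir(path):
    """Return a list of problems found with a local model directory (empty if none)."""
    root = Path(path)
    if not root.exists():
        return [f"{root} does not exist"]
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    # Assumption: tokenizer metadata is stored in one of these common files.
    tokenizer_files = ("tokenizer_config.json", "tokenizer.json")
    if not any((root / name).is_file() for name in tokenizer_files):
        problems.append("no tokenizer_config.json or tokenizer.json")
    return problems
```

An empty list only means the directory looks plausible; loading can still fail later for reasons this check does not cover.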
For LoRA runs:
| Setting | What to check |
|---|---|
| | Enable adapter loading only when a LoRA checkpoint should be used. |
| | Point to the adapter checkpoint directory when LoRA is enabled. |
| Adapter and base model | Confirm the adapter was trained for the selected base model family. |
If startup fails after model loading begins, retry with tensor_parallel_size=1
and data_parallel_size=1 to separate model compatibility from distributed
topology problems.
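One way to make that retry repeatable is to derive the single-device settings from the failing ones instead of editing them by hand. This sketch assumes the engine settings are kept as a dictionary of keyword arguments. `single_device_overrides` is a hypothetical helper, and the values in `original` are placeholders.

```python
def single_device_overrides(engine_kwargs):
    """Return a copy of the engine settings with both parallel sizes set to 1."""
    isolated = dict(engine_kwargs)  # leave the caller's settings untouched
    isolated["tensor_parallel_size"] = 1
    isolated["data_parallel_size"] = 1
    return isolated


# Placeholder settings for illustration only.
original = {"tensor_parallel_size": 4, "data_parallel_size": 2, "max_model_len": 2048}
print(single_device_overrides(original))
```

If the isolated settings start cleanly, the model itself is likely compatible and the distributed topology is the next thing to inspect.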
Configuration Errors
Validation errors usually name the invalid field. Common examples:
| Field or condition | Constraint |
|---|---|
| | |
| | Requires |
| Parallel world size | Must not exceed the number of visible CUDA devices. |
| | |
Fix the first validation error before looking at later symptoms.
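The parallel world size constraint can be checked by hand before launching. The sketch below assumes the world size is the product of `tensor_parallel_size` and `data_parallel_size`; that formula is an assumption made for illustration and may not match how Diffulex computes it. `check_world_size` is a hypothetical helper.

```python
def check_world_size(tensor_parallel_size, data_parallel_size, visible_devices):
    """Raise ValueError if the assumed world size exceeds the visible device count."""
    # Assumption: world size = tensor parallel size * data parallel size.
    world_size = tensor_parallel_size * data_parallel_size
    if world_size > visible_devices:
        raise ValueError(
            f"world size {world_size} exceeds {visible_devices} visible CUDA device(s)"
        )
    return world_size
```

In practice `visible_devices` would come from `torch.cuda.device_count()`.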
Serving
Use a small max_num_reqs, max_num_batched_tokens, and max_model_len while
validating a new serving command.
If the HTTP server starts but the client cannot connect, check the host, port, and client base URL. If the server exits during startup, inspect backend worker logs and reduce engine limits.
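To separate a wrong base URL from a server that is not listening, a plain TCP probe is often enough. The sketch below uses only the standard library. `endpoint_from_base_url` and `can_connect` are hypothetical helper names, and the URL in the usage line is a placeholder rather than a documented Diffulex default.

```python
import socket
from urllib.parse import urlsplit


def endpoint_from_base_url(base_url):
    """Extract (host, port) from a client base URL, using the scheme's default port."""
    parts = urlsplit(base_url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


if __name__ == "__main__":
    host, port = endpoint_from_base_url("http://127.0.0.1:8000/v1")  # placeholder URL
    print(host, port, can_connect(host, port))
```

A failed probe on the host and port the client uses, combined with a successful probe on the address the server logged at startup, points to a client-side URL mismatch.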
Benchmarking
Use --dataset-limit when testing a new config. If lm-eval cannot find a task,
check --dataset and --include-path. If code tasks fail before scoring, check
whether --confirm-run-unsafe-code is required.
Performance Problems
First confirm correctness with eager mode and small batches. Then measure one change at a time:
| Change | What it measures |
|---|---|
| Remove | Measures optimized execution after eager-mode correctness is established. |
| Enable CUDA graph paths | Measures launch-overhead reduction. |
| Increase request and token limits | Measures scheduler and memory behavior under larger serving load. |
| Increase tensor or data parallelism | Measures multi-GPU scaling after single-device behavior is stable. |
Avoid changing model family, strategy, thresholds, and optimization flags in the same experiment.
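The one-change-at-a-time rule can be enforced mechanically by generating each experiment from a fixed baseline. This is a minimal sketch: `one_change_experiments` is a hypothetical helper, and the baseline values are placeholders rather than recommended settings.

```python
def one_change_experiments(baseline, candidates):
    """Yield (label, config) pairs that each differ from the baseline in one setting."""
    for key, value in candidates.items():
        if baseline.get(key) == value:
            continue  # same as the baseline, so there is nothing to measure
        config = dict(baseline)
        config[key] = value
        yield f"{key}={value}", config


# Placeholder values for illustration only.
baseline = {"max_num_reqs": 8, "max_num_batched_tokens": 2048, "tensor_parallel_size": 1}
candidates = {"max_num_reqs": 32, "max_num_batched_tokens": 8192, "tensor_parallel_size": 2}

for label, config in one_change_experiments(baseline, candidates):
    print(label, config)
```

Because every generated config shares the baseline for all other settings, any difference in a measurement can be attributed to the single labeled change.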