Known Issues and Updates Coming Soon

As of this writing, RapidFire AI has the following known issues. We list recommended workarounds where available. We are actively working on resolving these issues and welcome feedback on how well the workarounds serve your use cases to help with prioritization.

Multi-GPU Model Support

Any given run’s model(s), at its configured batch size, must fit in a single GPU’s memory. For DPO and GRPO, both the policy and reference models must fit on the GPU together. We will soon release an update with more sophisticated native support for multi-GPU model execution with parallelization based on DDP, FSDP, and/or DeepSpeed.
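Before launching a run, it can help to sanity-check whether a model is likely to fit in a single GPU's memory. The helper below is our own back-of-envelope sketch, not part of RapidFire AI; the overhead factor and default precision are illustrative assumptions:

```python
def fits_on_gpu(num_params: float, gpu_mem_gib: float,
                bytes_per_param: int = 2, overhead: float = 1.5) -> bool:
    """Rough single-GPU memory check (illustrative, not a RapidFire API).

    Multiplies parameter memory by an overhead factor to loosely account
    for activations, gradients, and CUDA context. bytes_per_param=2
    assumes fp16/bf16 weights. This is a heuristic, not an exact
    accounting of optimizer state or activation memory.
    """
    needed_bytes = num_params * bytes_per_param * overhead
    return needed_bytes <= gpu_mem_gib * 1024**3

# Example: a 7B-parameter model in bf16 with 1.5x overhead needs ~19.6 GiB,
# so it fits on a 24 GiB GPU but not a 16 GiB one.
print(fits_on_gpu(7e9, 24))  # True
print(fits_on_gpu(7e9, 16))  # False
```

For DPO and GRPO, remember to budget for both the policy and reference models together (roughly double the parameter count if they are the same size).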

ImportError Between Experiments

If you run multiple experiments back-to-back from the same notebook/IDE session, you may occasionally see the following error:

ImportError: cannot import name 'GenerationMixin' from 'transformers.generation'

This is caused by stray Python processes from the previous experiment that did not exit properly. If you see this error, we recommend the following steps:

  • Run the command ps -ef | grep python, look for processes mentioning "multiprocessing.spawn" or marked "defunct", and kill any you find with the command kill -9 [PID].

  • Wait about 2 minutes, regardless of whether there were any processes to kill.

  • Restart the kernel and then proceed with your new experiment.
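The process-hunting step above can also be scripted. The sketch below is our own illustrative helper (the `find_stray_pids` name and the parsing logic are assumptions, not part of RapidFire AI); to stay safe, it only prints the kill commands rather than running them:

```python
import subprocess

def find_stray_pids(ps_output: str) -> list[int]:
    """Return PIDs (the second column of `ps -ef` output) for lines that
    mention multiprocessing.spawn workers or defunct processes."""
    pids = []
    for line in ps_output.splitlines():
        if "multiprocessing.spawn" in line or "defunct" in line:
            fields = line.split()
            if len(fields) > 1 and fields[1].isdigit():
                pids.append(int(fields[1]))
    return pids

if __name__ == "__main__":
    ps = subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout
    for pid in find_stray_pids(ps):
        print(f"stray process {pid}: run `kill -9 {pid}`")
```

After killing any stray processes, still wait the ~2 minutes and restart the kernel as described above.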

Recovering Storage Space

If you run out of storage space on your machine due to experimenting with lots of LLMs, we recommend clearing out the “.cache” folder in your home directory, which Hugging Face creates when it imports base models. One experiment’s imported models are not needed by another, so it is safe to delete them.

If you want to reclaim even more space, look at the artifacts from your experiments and either delete some of the files or move them to other/remote storage. Note that when you use LoRA adapters, RapidFire AI saves only the trained adapters in the checkpoints of the runs, not the base models.

Semi-Automated IC Ops

Triggering IC Ops manually from the dashboard is feasible only if there is a human in the loop. But IC Ops are also useful in offline, scripted settings driven by application logic, e.g., stopping 90% of runs with poor eval metrics and clone-modifying the top 10% to drill down into more fine-grained values for their knobs.

In the near future, we plan to update the Experiment API to let you specify such custom semi-automation logic for IC Ops in code using the runs’ metrics and progress so far.
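Until that API lands, the kind of selection logic described above can be expressed as plain Python over a run-to-metric mapping. Everything below (the `partition_runs` helper and its arguments) is a hypothetical sketch of such application logic, not the RapidFire AI Experiment API:

```python
def partition_runs(eval_metrics: dict[str, float],
                   keep_frac: float = 0.1) -> tuple[list[str], list[str]]:
    """Split run IDs into (top, rest) by eval metric, higher is better.

    Hypothetical sketch: the top fraction would be clone-modified with
    finer-grained knob values, and the rest would be stopped.
    """
    ranked = sorted(eval_metrics, key=eval_metrics.get, reverse=True)
    k = max(1, int(len(ranked) * keep_frac))
    return ranked[:k], ranked[k:]

# Example: keep the best 10% of 10 runs, stop the other 9.
metrics = {f"run{i}": i / 10 for i in range(10)}
top, rest = partition_runs(metrics, keep_frac=0.1)
print(top)   # ['run9']
print(len(rest))  # 9
```

The planned Experiment API update would let you wire logic like this directly to IC Ops using each run's metrics and progress so far.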