
Dataset Augmentation

Dataset augmentation produces a new HuggingFace dataset by applying transforms to an existing one. This is a useful step between data collection and fine-tuning when you want to improve the robustness of your model without collecting additional demonstrations.

Augmentation helps in two common scenarios:

  • Robustness to lighting and visual noise — applying brightness, contrast, color temperature, blur, and occlusion variations teaches the model to generalize beyond the exact conditions under which the data was recorded.
  • Small dataset expansion — generating multiple augmented copies per episode increases the effective size of your dataset, which can stabilize training on small recordings.

Two augmentation types are available:

| Type | Description |
| --- | --- |
| `deterministic` | Fast CPU transforms: lighting, noise, blur, occlusion |
| `generative` | Cosmos Transfer2.5 generative re-rendering (currently disabled; coming soon) |

Deterministic augmentation runs on a single GPU instance and completes in minutes for typical datasets. Generative augmentation is being integrated and will be available in a future release.

Deterministic augmentation supports four transforms, applied per frame:

| Transform | Effect |
| --- | --- |
| `lighting` | Adjusts brightness, contrast, and color temperature within configurable ranges |
| `noise` | Adds Gaussian sensor noise to simulate lower-quality cameras |
| `blur` | Applies motion or focus blur |
| `occlusion` | Randomly masks regions of the frame to simulate partial occlusions |
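To make the `noise` transform concrete, here is a minimal pure-Python sketch of adding Gaussian sensor noise to an 8-bit frame. This is an illustration of the idea, not the platform's actual implementation; the function name and `sigma` parameter are our own.

```python
import random

def add_gaussian_noise(frame, sigma=8.0, seed=None):
    """Add zero-mean Gaussian sensor noise to an 8-bit grayscale frame.

    `frame` is a list of pixel rows; noisy values are clamped back to [0, 255].
    """
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in frame
    ]

frame = [[120, 121], [119, 122]]
noisy = add_gaussian_noise(frame, sigma=8.0, seed=42)
```

The real transforms operate on full RGB video frames, but the principle is the same: perturb each pixel independently, then clamp to the valid range.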

You can combine transforms in two pipeline modes:

  • stacked — applies all selected transforms to each augmented copy. Use this when you want every copy to contain a mix of variations.
  • independent — creates one copy per transform. Use this when you want to control which transform produced which copy.
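The two modes can be summarized as a small sketch of how transform chains map onto copies (illustrative only; this helper is not part of the SDK):

```python
def plan_copies(transforms, mode, copies=1):
    """Return the transform chain applied to each augmented copy.

    stacked:     every copy gets the full chain of selected transforms.
    independent: one copy per transform, each applying a single transform.
    """
    if mode == "stacked":
        return [list(transforms) for _ in range(copies)]
    if mode == "independent":
        return [[t] for t in transforms]
    raise ValueError(f"unknown pipeline mode: {mode}")

plan_copies(["lighting", "noise"], "stacked", copies=2)
# → [["lighting", "noise"], ["lighting", "noise"]]
plan_copies(["lighting", "noise"], "independent")
# → [["lighting"], ["noise"]]
```

Note that in `independent` mode the number of copies is determined by the number of selected transforms, not by a copy count.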

The `dataset_mode` setting controls how augmented frames are added to the output dataset:

  • copy — augmented copies are added alongside the original episodes. The original data is preserved and the dataset grows by copies × the original episode count.
  • inplace — the original episodes are replaced with augmented versions. Use this when you only need the augmented data and want to keep the dataset size constant.
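The resulting dataset size follows directly from the mode. A quick sketch of the arithmetic (the helper is illustrative, not an SDK function):

```python
def output_episode_count(original_episodes, copies, dataset_mode):
    """Episode count of the output dataset under each dataset_mode."""
    if dataset_mode == "copy":
        # Originals are preserved, plus `copies` augmented versions per episode.
        return original_episodes * (1 + copies)
    if dataset_mode == "inplace":
        # Originals are replaced, so the size stays constant.
        return original_episodes
    raise ValueError(f"unknown dataset_mode: {dataset_mode}")

output_episode_count(50, copies=3, dataset_mode="copy")     # → 200
output_episode_count(50, copies=3, dataset_mode="inplace")  # → 50
```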
To start an augmentation job:

  1. Click New Job in the top-right of your project page.
  2. Select Dataset Augmentation.
  3. Choose the dataset you want to augment.
  4. Pick the transforms and tune the ranges if needed.
  5. Click Start Job.

The job will progress through queuing → instance_booting → instance_setup → augmentation_running → dataset_uploading → completed. Once finished, the new dataset is published to your linked HuggingFace account.

See the SDK reference for the full Python API. A minimal example:

```python
import time

from qualia import Qualia

client = Qualia()

job = client.augmentation.create(
    project_id="...",
    dataset_id="qualiaadmin/spoon10",
    transforms="lighting,noise",
    copies=3,
)

# Poll until the job reaches a terminal state
while True:
    status = client.augmentation.get(job.job_id)
    if status.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(10)

if status.status == "completed":
    print(f"Augmented dataset: {status.augmented_dataset_id}")
```

The lighting transform takes three parameter ranges. Defaults are tuned for mild variation and work well in most cases, but you can widen them for more aggressive augmentation:

| Parameter | Default range | Allowed bounds |
| --- | --- | --- |
| `brightness_range` | (-15, 15) % | min ∈ [-50, 0], max ∈ [0, 50] |
| `contrast_range` | (-10, 10) % | min ∈ [-50, 0], max ∈ [0, 50] |
| `color_temp_range` | (4000, 6500) K | min ∈ [2000, 6500], max ∈ [4000, 10000] |

For each augmented frame the platform samples a value uniformly from the range. Wider ranges produce more visual diversity but can also push the data further from your robot’s real operating conditions — keep the ranges realistic for your deployment environment.
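The per-frame sampling described above amounts to a uniform draw from each configured range. A minimal sketch, assuming the default ranges from the table (the function name is illustrative, not part of the SDK):

```python
import random

# Default ranges from the table above
DEFAULT_RANGES = {
    "brightness": (-15, 15),      # percent
    "contrast": (-10, 10),        # percent
    "color_temp": (4000, 6500),   # Kelvin
}

def sample_lighting_params(ranges=DEFAULT_RANGES, seed=None):
    """Sample one lighting parameter set, uniformly per range."""
    rng = random.Random(seed)
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

params = sample_lighting_params(seed=0)
```

Each augmented frame gets its own draw, so even a single episode copy contains a spread of lighting conditions across its frames.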