
Dataset Augmentation

Dataset augmentation produces a new HuggingFace dataset by applying transforms to an existing one. This is a useful step between data collection and fine-tuning when you want to improve the robustness of your model without collecting additional demonstrations.

Augmentation helps in two common scenarios:

  • Robustness to lighting and visual noise — applying brightness, contrast, color temperature, blur, and occlusion variations teaches the model to generalize beyond the exact conditions under which the data was recorded.
  • Small dataset expansion — generating multiple augmented copies per episode increases the effective size of your dataset, which can stabilize training on small recordings.

Two augmentation types are available:

| Type | Description |
| --- | --- |
| `deterministic` | Fast CPU transforms: lighting, noise, blur, occlusion |
| `generative` | Cosmos Transfer2.5 generative re-rendering (currently disabled; coming soon) |

Deterministic augmentation runs on a single GPU instance and completes in minutes for typical datasets. Generative augmentation is being integrated and will be available in a future release.

Deterministic augmentation supports four transforms, applied per frame:

| Transform | Effect |
| --- | --- |
| `lighting` | Adjusts brightness, contrast, and color temperature within configurable ranges |
| `noise` | Adds Gaussian sensor noise to simulate lower-quality cameras |
| `blur` | Applies motion or focus blur |
| `occlusion` | Randomly masks regions of the frame to simulate partial occlusions |
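To make the `noise` transform concrete, here is a minimal pure-Python sketch of adding Gaussian sensor noise to an 8-bit frame. This is an illustration of the idea, not the platform's actual implementation; the function name and `sigma` parameter are our own.

```python
import random

def add_gaussian_noise(frame, sigma=8.0, seed=None):
    """Add zero-mean Gaussian sensor noise to an 8-bit grayscale frame.

    `frame` is a list of pixel rows; noisy values are clamped back to [0, 255].
    """
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in row]
        for row in frame
    ]

frame = [[120, 121], [119, 122]]
noisy = add_gaussian_noise(frame, sigma=8.0, seed=42)
```

The real transforms operate on full RGB video frames, but the principle is the same: perturb each pixel independently, then clamp to the valid range.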

You can combine transforms in two pipeline modes:

  • stacked — applies all selected transforms to each augmented copy. Use this when you want every copy to contain a mix of variations.
  • independent — creates one copy per transform. Use this when you want to control which transform produced which copy.
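The two modes can be summarized as a small sketch of how transform chains map onto copies (illustrative only; this helper is not part of the SDK):

```python
def plan_copies(transforms, mode, copies=1):
    """Return the transform chain applied to each augmented copy.

    stacked:     every copy gets the full chain of selected transforms.
    independent: one copy per transform, each applying a single transform.
    """
    if mode == "stacked":
        return [list(transforms) for _ in range(copies)]
    if mode == "independent":
        return [[t] for t in transforms]
    raise ValueError(f"unknown pipeline mode: {mode}")

plan_copies(["lighting", "noise"], "stacked", copies=2)
# → [["lighting", "noise"], ["lighting", "noise"]]
plan_copies(["lighting", "noise"], "independent")
# → [["lighting"], ["noise"]]
```

Note that in `independent` mode the number of copies is determined by the number of selected transforms, not by a copy count.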

The `dataset_mode` setting controls how augmented frames are added to the output dataset:

  • copy — augmented copies are added alongside the original episodes. The original data is preserved and the dataset grows by copies × the original episode count.
  • inplace — the original episodes are replaced with augmented versions. Use this when you only need the augmented data and want to keep the dataset size constant.
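The resulting dataset size follows directly from the mode. A quick sketch of the arithmetic (the helper is illustrative, not an SDK function):

```python
def output_episode_count(original_episodes, copies, dataset_mode):
    """Episode count of the output dataset under each dataset_mode."""
    if dataset_mode == "copy":
        # Originals are preserved, plus `copies` augmented versions per episode.
        return original_episodes * (1 + copies)
    if dataset_mode == "inplace":
        # Originals are replaced, so the size stays constant.
        return original_episodes
    raise ValueError(f"unknown dataset_mode: {dataset_mode}")

output_episode_count(50, copies=3, dataset_mode="copy")     # → 200
output_episode_count(50, copies=3, dataset_mode="inplace")  # → 50
```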
To start an augmentation job:

  1. Click New Job in the top-right of your project page.
  2. Select Dataset Augmentation.
  3. Choose the dataset you want to augment.
  4. Pick the transforms and tune the ranges if needed.
  5. Click Start Job.

The job will progress through queuing → instance_booting → instance_setup → augmentation_running → dataset_uploading → completed. Once finished, the new dataset is published to your linked HuggingFace account.

See the SDK reference for the full Python API. A minimal example:

```python
import time

from qualia import Qualia

client = Qualia()

job = client.augmentation.create(
    project_id="...",
    dataset_id="qualiaadmin/spoon10",
    transforms="lighting,noise",
    copies=3,
)

# Poll until the job reaches a terminal state
while True:
    status = client.augmentation.get(job.job_id)
    if status.status in ("completed", "failed", "cancelled"):
        break
    time.sleep(10)

if status.status == "completed":
    print(f"Augmented dataset: {status.augmented_dataset_id}")
```

The lighting transform takes three parameter ranges. Defaults are tuned for mild variation and work well in most cases, but you can widen them for more aggressive augmentation:

| Parameter | Default range | Allowed bounds |
| --- | --- | --- |
| `brightness_range` | (-15, 15) % | min ∈ [-50, 0], max ∈ [0, 50] |
| `contrast_range` | (-10, 10) % | min ∈ [-50, 0], max ∈ [0, 50] |
| `color_temp_range` | (4000, 6500) K | min ∈ [2000, 6500], max ∈ [4000, 10000] |

For each augmented frame the platform samples a value uniformly from the range. Wider ranges produce more visual diversity but can also push the data further from your robot’s real operating conditions — keep the ranges realistic for your deployment environment.
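The per-frame sampling described above amounts to a uniform draw from each configured range. A minimal sketch, assuming the default ranges from the table (the function name is illustrative, not part of the SDK):

```python
import random

# Default ranges from the table above
DEFAULT_RANGES = {
    "brightness": (-15, 15),      # percent
    "contrast": (-10, 10),        # percent
    "color_temp": (4000, 6500),   # Kelvin
}

def sample_lighting_params(ranges=DEFAULT_RANGES, seed=None):
    """Sample one lighting parameter set, uniformly per range."""
    rng = random.Random(seed)
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

params = sample_lighting_params(seed=0)
```

Each augmented frame gets its own draw, so even a single episode copy contains a spread of lighting conditions across its frames.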