Datasets & Training Data

Available Datasets

Browse

Datasets are available for research and evaluation. Commercial licensing is available on request.

Available

RTL Corpus

Curated collection of open-source Verilog and SystemVerilog designs. Cleaned, categorized, and annotated for LLM training and code generation research.

Request access →

Available

Verification Benchmarks

Testbench-design pairs with coverage data, assertion libraries, and pass/fail labels. For training verification-aware models.

Request access →

Coming Soon

EDA Log Dataset

Structured logs from synthesis, simulation, and timing analysis runs. Annotated with error categories, root causes, and resolutions.

Register interest →

Coming Soon

Spec-to-RTL Pairs

Natural-language specification excerpts mapped to their corresponding RTL implementations. For spec comprehension and code generation.

Register interest →

Data Quality

How we build datasets

Engineer-Annotated

All labels and annotations are created or reviewed by semiconductor engineers, not crowd-sourced. Domain accuracy matters.

Cleaned & Deduplicated

Raw data goes through automated and manual cleaning pipelines. Duplicates, license-incompatible code, and low-quality samples are removed.
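
As a rough illustration of what an automated deduplication step can look like, here is a minimal sketch that drops exact duplicates after stripping comments and whitespace. It is not our production pipeline: the function names `normalize` and `deduplicate` are hypothetical, and the regex-based comment stripping is a simplifying assumption that ignores edge cases such as `//` inside string literals.

```python
import hashlib
import re


def normalize(source: str) -> str:
    """Strip Verilog-style comments and collapse whitespace (illustrative only)."""
    # Remove /* ... */ block comments, including multi-line ones.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    # Remove // line comments (does not account for '//' inside string literals).
    source = re.sub(r"//[^\n]*", " ", source)
    # Collapse all runs of whitespace to single spaces.
    return " ".join(source.split())


def deduplicate(samples: list[str]) -> list[str]:
    """Keep the first occurrence of each sample, comparing normalized content."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in samples:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing normalized text only catches copies that differ in comments or formatting. Near-duplicates, such as the same design with renamed signals, would need a similarity-based method on top of this.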

Versioned & Documented

Each dataset release includes version tags, changelogs, data cards, and usage guidelines. Reproducibility is non-negotiable.

Need custom data?

We build domain-specific datasets for research teams and companies. Tell us what you need.

Get in Touch