Neural Architecture Search (NAS) Implementation

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab, but in real business.

Neural Architecture Search (NAS)

Hand-designed architectures are the result of experience and intuition. NAS is an algorithmic search over the space of architectures, optimized for a specific task, dataset, and hardware target. It is not a replacement for architectural thinking, but a way to find configurations that a human simply would not reach in a reasonable amount of time.

Why Naive NAS Kills Your GPU Budget

The classic NAS result, NASNet (Google, 2017), required on the order of 2000 GPU days of search: each candidate architecture was trained from scratch to convergence. With a search space on the order of 10^10 configurations, exhaustive search is fundamentally impossible.

Modern approaches address this through three fundamentally different ideas:

One-shot NAS / Weight Sharing. The supernet contains all possible subgraphs; each candidate is a "path" through it that reuses the shared, already-trained supernet weights instead of training from scratch. DARTS, SNAS, and Single-Path NAS are all built on this idea. Search time drops from hundreds of GPU days to 1–4 days.

Predictor-based NAS. A surrogate model is trained to predict an architecture's accuracy without fully training it. BANANAS, NASBOWL, and NAO use this approach: sample the search space, train 100–200 architectures to get real accuracy estimates, and fit a predictor that scores the next million candidates.
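
The sample-then-predict loop can be sketched as follows. The fixed-length encoding, MLP size, and random placeholder data are illustrative assumptions, not the exact BANANAS or NAO recipe:

```python
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """Surrogate mapping a fixed-length architecture encoding
    (e.g. one-hot operation choices per edge) to predicted accuracy.
    Encoding scheme and layer sizes are assumptions for the sketch."""
    def __init__(self, encoding_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoding_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # predicted accuracy in [0, 1]
        )

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        return self.net(enc).squeeze(-1)

def fit_predictor(encodings, accuracies, epochs=200):
    """Fit the surrogate on the 100-200 architectures actually trained."""
    model = AccuracyPredictor(encodings.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(encodings), accuracies)
        loss.backward()
        opt.step()
    return model

# Rank a large pool of unseen candidates by predicted accuracy
# (random placeholder data stands in for real evaluations).
train_enc = torch.rand(150, 32)
train_acc = torch.rand(150)
predictor = fit_predictor(train_enc, train_acc)
pool = torch.rand(10_000, 32)
with torch.no_grad():
    top = predictor(pool).topk(10).indices  # 10 most promising candidates
```

The promising candidates are then trained for real, and their measured accuracies are fed back to refit the surrogate.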

Hardware-aware NAS. Optimization not only for accuracy but also for latency on a specific device. MNasNet, FBNet, and Once-for-All search for solutions on the Pareto frontier of accuracy, latency, and MACs. Critical for edge deployments.

Deep Dive: DARTS and Its Production Challenges

DARTS (Differentiable Architecture Search) is the most widely used one-shot method. The idea is to use continuous α weights for each operation, optimized through gradient descent, instead of discretely choosing the operation (3×3 conv vs. 5×5 conv vs. skip).

import torch
import torch.nn as nn
from torch.nn import functional as F

class MixedOp(nn.Module):
    """
    DARTS mixed operation: a weighted sum of all candidate operations.
    The alpha weights are optimized via the architecture gradient.
    """
    def __init__(self, C: int, stride: int):
        super().__init__()
        self._ops = nn.ModuleList()
        for primitive in PRIMITIVES:  # ['none', 'skip_connect', 'sep_conv_3x3', ...]
            op = OPS[primitive](C, stride, affine=False)
            self._ops.append(op)

    def forward(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # weights = softmax(alpha): the architecture weights
        return sum(w * op(x) for w, op in zip(weights, self._ops))


class DARTSCell(nn.Module):
    def __init__(self, steps: int, multiplier: int, C_prev_prev: int,
                 C_prev: int, C: int, reduction: bool, reduction_prev: bool):
        super().__init__()
        self._steps = steps       # number of intermediate nodes (usually 4)
        self._multiplier = multiplier  # how many nodes are concatenated at the output
        # ... initialization of preprocess layers and mixed ops (elided)

    def forward(self, s0: torch.Tensor, s1: torch.Tensor,
                weights: torch.Tensor) -> torch.Tensor:
        states = [s0, s1]
        offset = 0
        for i in range(self._steps):
            s = sum(
                self._ops[offset + j](h, weights[offset + j])
                for j, h in enumerate(states)
            )
            offset += len(states)
            states.append(s)
        return torch.cat(states[-self._multiplier:], dim=1)
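
After the search, the continuous α are discretized into a genotype: for each intermediate node, keep the two strongest incoming edges and the highest-α operation on each, excluding 'none'. A minimal sketch, assuming the standard DARTS α layout (one row per edge, nodes in order):

```python
import torch

PRIMITIVES = ['none', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5',
              'dil_conv_3x3', 'max_pool_3x3']

def derive_genotype(alpha: torch.Tensor, steps: int = 4):
    """
    Discretize DARTS architecture weights into a cell genotype.
    alpha: (num_edges, num_ops), rows ordered node by node as in DARTSCell.
    Returns a list of (op_name, predecessor_index) pairs, two per node.
    """
    weights = torch.softmax(alpha, dim=-1)
    gene, offset = [], 0
    none_idx = PRIMITIVES.index('none')
    for i in range(steps):
        n_inputs = 2 + i                      # node i sees s0, s1, and nodes < i
        w = weights[offset:offset + n_inputs].clone()
        w[:, none_idx] = 0.0                  # 'none' never enters the genotype
        # strength of an edge = max weight over its real operations
        edge_strength = w.max(dim=-1).values
        top2 = edge_strength.topk(2).indices  # two strongest predecessors
        for j in top2.tolist():
            op = PRIMITIVES[w[j].argmax().item()]
            gene.append((op, j))
        offset += n_inputs
    return gene

alpha = torch.randn(sum(2 + i for i in range(4)), len(PRIMITIVES))
genotype = derive_genotype(alpha)  # 8 (op, predecessor) pairs for a 4-node cell
```

The derived genotype is then instantiated as a fixed network and retrained from scratch, as the work process below requires.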

The bilevel optimization of DARTS is the main engineering challenge: network weights w and architecture weights α are optimized alternately:

def train_darts_step(model, architect, optimizer_w, optimizer_alpha,
                     train_queue, valid_queue, lr_w: float):
    """
    DARTS: alternate optimization of network weights and architecture weights.
    """
    for step, (input_train, target_train) in enumerate(train_queue):
        # 1. Architecture step: update alpha on the validation loss
        # (next(iter(...)) restarts the loader each step; use a persistent
        # iterator in real code)
        input_valid, target_valid = next(iter(valid_queue))
        architect.step(
            input_train, target_train,
            input_valid, target_valid,
            lr=lr_w, optimizer=optimizer_w,
            unrolled=False  # True = second-order DARTS, roughly 2x the cost
        )

        # 2. Weight step: update w on the training loss
        optimizer_w.zero_grad()
        logits = model(input_train)
        loss = F.cross_entropy(logits, target_train)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer_w.step()
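
The `architect.step` call above hides the α update. In the first-order approximation (`unrolled=False`) it is just a gradient step on the validation loss with respect to α. A minimal sketch with a simplified signature; `model.arch_parameters()` returning the α tensors is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

class Architect:
    """First-order DARTS architecture update (unrolled=False):
    a plain gradient step on the validation loss w.r.t. alpha.
    Assumes model.arch_parameters() returns the alpha tensors."""
    def __init__(self, model, alpha_lr: float = 3e-4,
                 weight_decay: float = 1e-3):
        self.model = model
        self.optimizer = torch.optim.Adam(
            model.arch_parameters(), lr=alpha_lr,
            betas=(0.5, 0.999), weight_decay=weight_decay)

    def step(self, input_valid, target_valid):
        self.optimizer.zero_grad()
        loss = F.cross_entropy(self.model(input_valid), target_valid)
        # gradients reach all parameters, but the optimizer updates only alpha
        loss.backward()
        self.optimizer.step()
```

Second-order DARTS additionally approximates how a one-step update of w would change the validation loss, which is what makes it roughly twice as expensive.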

The Operation Collapse Problem. In pure DARTS, skip-connection operations almost always "win": they have zero parameters, train well early on, and their architecture weights α[skip] grow steadily. The derived architecture degenerates into a nearly skip-only network with poor generalization. Solutions:

  • DARTS+: early stopping of the search once the number of skip connections in a cell exceeds a threshold
  • P-DARTS: progressive increase of network depth during the search
  • GDAS: Gumbel-softmax instead of softmax over α, giving sparse operation selection
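
The GDAS fix can be illustrated directly: `torch.nn.functional.gumbel_softmax` with `hard=True` samples a single operation per forward pass while keeping gradients flowing to α via the straight-through estimator. A sketch with placeholder candidate ops (GDAS itself evaluates only the sampled op; the sum here keeps the sketch simple):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelMixedOp(nn.Module):
    """GDAS-style mixed op: sample one operation per forward pass via
    Gumbel-softmax instead of averaging all of them (DARTS softmax).
    The candidate ops are placeholders for the real PRIMITIVES."""
    def __init__(self, C: int):
        super().__init__()
        self._ops = nn.ModuleList([
            nn.Identity(),                              # skip_connect
            nn.Conv2d(C, C, 3, padding=1, bias=False),  # conv_3x3
            nn.Conv2d(C, C, 5, padding=2, bias=False),  # conv_5x5
        ])

    def forward(self, x: torch.Tensor, alpha: torch.Tensor,
                tau: float = 1.0) -> torch.Tensor:
        # hard=True -> one-hot sample in the forward pass,
        # soft gradients to alpha in the backward pass (straight-through)
        weights = F.gumbel_softmax(alpha, tau=tau, hard=True)
        return sum(w * op(x) for w, op in zip(weights, self._ops))
```

In practice the temperature `tau` is annealed toward zero during the search, so the sampling distribution hardens as α converges.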

Hardware-Aware NAS in Practice

For mobile deployments (Android, CoreML), accuracy isn't the only metric. Latency on the target hardware matters more than FLOPs, because the same FLOP count maps to very different runtimes depending on the operation and the device.
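
A common way to make latency differentiable, used in FBNet-style search, is a lookup table of per-operation latencies measured once on the target device; the expected latency of an edge is then the α-weighted sum, which can be added to the loss. The numbers below are made up for the sketch:

```python
import torch

# Illustrative per-op latencies in ms, measured once on the target device
# (values here are placeholders, not real measurements).
LATENCY_LUT = {
    'skip_connect': 0.05,
    'sep_conv_3x3': 1.40,
    'sep_conv_5x5': 2.90,
    'max_pool_3x3': 0.30,
}
OPS = list(LATENCY_LUT)
lat = torch.tensor([LATENCY_LUT[op] for op in OPS])

def expected_latency(alpha: torch.Tensor) -> torch.Tensor:
    """Differentiable expected latency of one edge: softmax(alpha) . lat"""
    return torch.softmax(alpha, dim=-1) @ lat

alpha = torch.zeros(len(OPS), requires_grad=True)
total = expected_latency(alpha)
# joint objective in search: loss = task_loss + lambda_lat * total
total.backward()  # gradients push alpha toward cheaper operations
```

Summing this term over all edges gives a network-level latency estimate that the architecture gradient can trade off against accuracy.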

Once-for-All (MIT) — one supernet is trained, from which subnets are extracted without further training for any hardware constraint:

from ofa.model_zoo import ofa_net

# Load the pretrained OFA supernet
ofa_network = ofa_net('ofa_mbv3_d234_e346_k357_w1.0', pretrained=True)

# Specialize for a specific device under a latency constraint
from ofa.nas.efficiency_predictor import Latency_MBV3_MeasuredNet
efficiency_predictor = Latency_MBV3_MeasuredNet(
    'note10',   # Samsung Note10: real latency measurements
    ofa_network
)

# Evolutionary search: find a subnet with latency < 25 ms and maximum accuracy
from ofa.nas.search_algorithm.evolution_finder import EvolutionFinder
finder = EvolutionFinder(
    efficiency_constraint=25,           # ms
    efficiency_predictor=efficiency_predictor,
    accuracy_predictor=accuracy_predictor,  # built beforehand, elided here
    population_size=100,
    max_time_budget=500                  # evolutionary steps
)
best_valids, best_info = finder.run_evolution_search()

In a real project, we ran NAS over the MobileNetV3 space for classifying manufacturing defects (640×480 input, 12 classes). Target platform: NVIDIA Jetson Nano (4 GB RAM, 128 CUDA cores). Constraint: latency < 30 ms at batch=1. The hand-tuned MobileNetV3-Large gave 28.4 ms and 91.3% accuracy. OFA search found a subnet in 6 hours: 22.1 ms, 92.7% accuracy, without a single manual architecture change.

The Practical Stack and When NAS Is Justified

Scenario                           | Approach             | Search time                  | Tool
Image classification, standard     | DARTS / PC-DARTS     | 1–2 days (4× A100)           | nni (Microsoft) or automl (torchvision)
Edge deployment (mobile, MCU)      | OFA / MNasNet-style  | 6–24 hours                   | Once-for-All, TuNAS
NLP / Transformer architectures    | NAS-BERT, AutoFormer | 2–5 days                     | Hugging Face NAS toolkit
Custom operations, custom hardware | Predictor-based NAS  | 1–3 days + ~100 evaluations  | BANANAS, NASBOWL

Microsoft NNI is the most mature open-source framework for NAS. It supports DARTS, ENAS, Random NAS, and SPOS out of the box. Integration with PyTorch and TensorFlow.

import nni
from nni.nas.pytorch.darts import DartsTrainer
from nni.nas.pytorch.callbacks import LRSchedulerCallback

trainer = DartsTrainer(
    model=model,
    loss=nn.CrossEntropyLoss(),
    # `accuracy` is a user-supplied top-k metric helper
    metrics=lambda output, target: accuracy(output, target, topk=(1,)),
    optimizer=optimizer,
    num_epochs=50,
    dataset_train=dataset_train,
    dataset_valid=dataset_valid,
    batch_size=64,
    log_frequency=10,
    callbacks=[LRSchedulerCallback(scheduler)]
)
trainer.fit()
# Export the final (discretized) architecture
export_result = trainer.export()

When NAS isn't needed. If the task is standard and the dataset contains <50k examples, use a pretrained ResNet-50 or EfficientNet-B0 and fine-tune it. NAS is justified in the following cases: custom hardware limitations, atypical input data (hyperspectral images, specific modalities), or the need to significantly reduce the model size without losing quality.

Work process

  1. Defining the search space: a critical step where we fix the blocks, operations, channel ranges, and maximum depth. A poorly chosen search space gives poor results regardless of the algorithm.
  2. Selecting a search strategy: DARTS for GPU-rich environments, evolutionary search for hardware-aware targets, predictor-based for a limited evaluation budget.
  3. Target hardware profiling: real latency/throughput measurements of search-space operations on production hardware.
  4. Candidate search and evaluation: Weights & Biases for tracking, MLflow for storing found architectures.
  5. Full training of the found architecture from scratch; weights from the search phase are not reused.
  6. Validation on the holdout set and profiling on production hardware.

Timeline: Search space definition and setup — 1 week. Search itself — 1–5 days of computing. Full candidate training and validation — 1–2 weeks. Total: 3–6 weeks for a full NAS cycle.