Technology

How did foundational models learn to navigate the busy streets of Bengaluru?

On May 20, 2025, we showcased an end-to-end foundational model learning to navigate busy streets in Bengaluru, trained in a self-supervised manner on raw, unlabeled driving data without any hard-coded rules or predefined maps. This was the first time end-to-end autonomous driving was demonstrated on Indian roads, and we wanted to share what went on behind the scenes - from the technology stack to the engineering challenges & key insights - with the community.

Author

Minus Zero

15 Jun 2025



Why the shift to end-to-end autonomous driving?

Traditional AV systems (AV 1.0) rely on modular, rule-based architectures and extensive human-labeled data with distinct components dedicated to tasks like object detection, tracking, trajectory prediction, path planning, and vehicle control. This approach limits scalability and struggles to handle long-tail or edge cases effectively.

The rise of foundational models transforms this landscape by enabling robust cross-modal understanding, particularly in vision and language, through training on large volumes of unlabeled, task-agnostic data and exhibiting strong generalization capabilities.

Given the inherently dynamic and unpredictable nature of driving, a data-driven paradigm is not just preferable but necessary. In our approach, we replace the traditional AV stack with a unified end-to-end foundational model that learns to predict safe driving trajectories directly from camera inputs, trained on large-scale real-world fleet data.

Since the end-to-end model observes the complete image as input, instead of breaking it down into hard-defined representations like bounding boxes or semantic maps, it can use the complete, more generalized visual context for decision making without human labelling.

For example, in the images above, a classical object detector with pre-defined labels may struggle to decide whether the object is a vehicle or a human. End-to-end models can perceive the complete context and take appropriate actions in such edge cases.

While the end-to-end approach to autonomous driving has been a key topic of discussion in the AV industry & research community for the last 3-4 years, we specifically wanted to test & demonstrate its capabilities in some of the most challenging traffic conditions, such as those encountered in India.

How are we attempting this problem?

We will talk about this in two parts -

• Training the Foundational Model

• The Learning Loop - from data to deployment

Training the Foundational Model

This workflow broadly has two major parts:

• Pre-training a foundational backbone on large-scale diverse data to learn general feature representations that can be fine-tuned for driving-specific tasks.

• Fine-tuning the model for the downstream planning task by learning to predict trajectories using conditional imitation learning.

Pre-training the foundational model: 

We pre-train our foundational backbone in the manner described above on millions of unlabeled video frames collected from diverse cities, environmental conditions, and sensor setups. Using self-supervised tasks like next-frame prediction, this process encodes broad real-world knowledge into transferable representations that power our end-to-end driving models.

To build models that generalize, exposure to diverse geographies, weather, and rare events is critical. Human drivers rely on both experience and prior knowledge to handle rare events - something imitation learning models struggle with, as they are limited by the data they are trained on. Edge cases, like an elephant crossing at night, may never appear in the collected dataset. However, such cases could well be present in internet-scale data. Can we learn from that data, even though we have no ground truth for it? This is where foundational models come in: they distill this world knowledge using self-supervision objectives (e.g. next-frame prediction, masked autoencoders, reconstruction) instead of supervision from ground truth like trajectories or bounding boxes. Because foundation models are exposed to millions of images, they build a richer and more abstract representation of the world, which we use to bootstrap our end-to-end driving models.

This allows us to combine the best of both worlds: world knowledge embedded in massive internet data, and driving-specific knowledge present in our specially collected datasets.
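To make the idea concrete, here is a minimal sketch of one such self-supervision objective, a masked-autoencoder-style reconstruction loss, written in PyTorch. The patch size, masking ratio, and tiny transformer are illustrative assumptions, not a description of our production backbone.

```python
import torch
import torch.nn as nn

class MaskedFrameAutoencoder(nn.Module):
    """Toy MAE-style objective: hide random patches of a frame and train
    the network to reconstruct them from the visible context alone."""

    def __init__(self, patch=16, dim=256, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Linear(3 * patch * patch, dim)        # patch -> token
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.Linear(dim, 3 * patch * patch)      # token -> pixels

    def forward(self, imgs):                                  # imgs: (B, 3, H, W)
        B, C, H, W = imgs.shape
        p = self.patch
        # Cut the image into non-overlapping patches: (B, N, C*p*p).
        patches = imgs.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        # Randomly mask a fraction of patches (zeroed out here for simplicity).
        mask = torch.rand(B, patches.size(1), device=imgs.device) < self.mask_ratio
        visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)
        recon = self.decoder(self.encoder(self.embed(visible)))
        # Reconstruction loss is computed only on the hidden patches.
        return ((recon - patches) ** 2)[mask].mean()
```

Training then reduces to repeatedly calling `loss = model(frames); loss.backward()` over unlabeled frames; the encoder weights become the transferable representation.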

Fine-tuning using Imitation Learning: 

We make use of imitation learning to fine-tune the foundational backbone for the downstream task of trajectory prediction. In imitation learning, machine learning models are trained to copy an agent's behaviour from a large amount of demonstration data; in our case, the demonstrations are the actual driving data of expert drivers. During data collection we use high-precision localization that computes the pose and orientation of the vehicle to sub-10 cm accuracy, and a sequence of such waypoints is post-processed into the trajectories used for fine-tuning.
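As a rough sketch of that post-processing step (the pose format, sampling rate, and horizon below are illustrative assumptions): the future poses recorded by the localization stack are re-expressed in the ego frame of the frame being labeled, yielding the supervision trajectory.

```python
import numpy as np

def ego_frame_trajectory(poses, t, horizon=20):
    """Turn future global poses into an ego-centric trajectory label.

    poses:   (N, 3) array of (x, y, yaw) from the localization stack,
             assumed time-ordered at a fixed rate.
    t:       index of the frame being labeled.
    Returns: (horizon, 2) future waypoints in the ego frame at time t.
    """
    x0, y0, yaw0 = poses[t]
    offsets = poses[t + 1 : t + 1 + horizon, :2] - np.array([x0, y0])
    c, s = np.cos(-yaw0), np.sin(-yaw0)
    R = np.array([[c, -s], [s, c]])        # rotation from world to ego frame
    return offsets @ R.T
```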

In the fine-tuning stage, we use the checkpoints trained in the previous pretraining step as the starting point and fine-tune our end-to-end models with images as input and trajectory data as the supervision signal for imitation learning.
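A minimal sketch of that stage might look like the following; the toy backbone stands in for the pretrained checkpoint, and the waypoint count, loss choice, and optimizer settings are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained backbone; in practice its weights are loaded
# from the pretraining checkpoint rather than initialized from scratch.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 256))
H_PRED = 20                                    # waypoints to predict
head = nn.Linear(256, H_PRED * 2)              # trajectory head
opt = torch.optim.AdamW([*backbone.parameters(), *head.parameters()], lr=1e-4)

def imitation_step(images, expert_traj):
    """images: (B, 3, H, W); expert_traj: (B, H_PRED, 2) ego-frame waypoints."""
    pred = head(backbone(images)).view(-1, H_PRED, 2)
    loss = nn.functional.smooth_l1_loss(pred, expert_traj)   # imitate the expert
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```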

This helps us build models that exhibit human-like driving behaviours by implicitly learning the underlying decision making and preferences that humans use, rather than relying on hand-coded rules.

For example, when a lead vehicle stops abruptly, a skilled driver intuitively balances braking based on road friction, safe headway, and passenger comfort—factors hard to encode with rules. Imitation learning captures such nuanced decisions by learning directly from expert behavior.

Generalisation:

The primary benefit of using end-to-end foundational models is their capability to generalise across geographies, obstacles, environments, sensor setups & vehicle form factors. It is usually not feasible to capture in training datasets all possible scenarios that can occur on roads, so these generalization capabilities enable learning of new, unseen scenarios from limited data.

We observed some interesting examples of generalisation of our models in the following unseen cases:

  1. Night time: Our data collection to date has focused mainly on daytime driving, so we took a variant of our model that had not been exposed to night or low-light data and tested it at night for emergency braking. The model performed well in those scenarios too, showcasing generalisation across times of day.

  2. Animals & Pushcarts: Our training dataset contains an extremely low percentage of animals and no pushcarts at all, yet our models were able to negotiate these unseen obstacles on the road, as shown in the following video.


The Learning Loop

In a data-driven approach, our autopilot system learns progressively, adding new skills over time. Each interaction reveals new insights, helping us avoid design pitfalls, generalization debt, and short-term thinking. Hence, it is important to build a Continuous Learning Loop that automates the entire process to enable learning & deployment at scale. This requires rigorous engineering to build a system that scales effectively.

1. Create Curriculum

We begin by defining learning objectives - what capabilities the model needs at this stage. These span a spectrum from basic lane-following to lane maintenance in the absence of lane markings, taking narrow turns, handling aggressive cut-ins by other vehicles, different weather conditions, and so on.

Each objective defines the tasks, corner cases, and behaviors we want the model to learn. This helps scope data needs and evaluation strategies.
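As an illustration, a curriculum objective can be captured as structured configuration that downstream data collection and evaluation consume. The schema and values below are hypothetical, not our actual format:

```python
from dataclasses import dataclass

@dataclass
class CurriculumObjective:
    """Hypothetical schema for one learning objective in a curriculum stage."""
    name: str
    scenarios: list          # corner cases and behaviors to cover
    data_quota_hours: float  # targeted collection budget
    eval_metrics: list       # how success will be measured

stage = [
    CurriculumObjective(
        name="lane_keeping_unmarked",
        scenarios=["faded markings", "unpaved road edges", "construction zones"],
        data_quota_hours=50.0,
        eval_metrics=["lateral deviation", "intervention rate"]),
    CurriculumObjective(
        name="aggressive_cut_ins",
        scenarios=["two-wheeler cut-in", "auto-rickshaw merge"],
        data_quota_hours=30.0,
        eval_metrics=["min time-to-collision", "hard-brake rate"]),
]
```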

2. Data Collection & Curation

Our data collection vehicle continuously captures driving data across city and highway environments using proprietary onboard software that provides:

  • High-res camera data with efficient compression and serialization

  • Sub-10 cm accurate localization (GNSS and Inertial)

  • Low-latency runtime with multi-sensor sync, monitoring, and visualization

  • Onboard/offboard data verification with real-time alerts

  • Scalable & structured logging formats

The data is uploaded to our cloud platform for:

  • Data cleaning

  • Scene understanding using vision-language models (VLMs), plus other relevant metadata from pseudo-labels and rule-based filters

  • Curriculum-aligned dataset curation

Rich scene descriptions, pseudo-labeling, and rigorous QA help reduce bias and ensure high-quality & diverse training data.

3. Model Training

Our team of researchers trains and releases a new model candidate.

4. Offline Evaluation

Before testing in the field, models go through regression tests in two stages:

• Open Loop Evaluation: Benchmarking on pre-recorded test datasets using metrics like similarity scores, collision rate, etc.

• Closed Loop Evaluation in Simulation: Scenario-wise closed-loop testing in simulated environments and replays of real-world logs.

This enables us to efficiently identify the best-performing models on identical test scenarios, ensuring safety before deploying them on real roads.
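For reference, the open-loop similarity scores mentioned above usually reduce to simple displacement errors between the predicted and logged trajectories; a minimal sketch of the two most common ones (often called ADE and FDE):

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement error between two trajectories.

    pred, gt: (T, 2) arrays of (x, y) waypoints.
    ADE averages the pointwise L2 distance; FDE is the endpoint distance.
    """
    dists = np.linalg.norm(pred - gt, axis=1)
    return dists.mean(), dists[-1]
```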

5. Online Evaluation - Rigorous Real-World Testing

The models are optimized for edge inference and deployed using our in-house onboard software. The model takes the camera feed and predicts the trajectory for the car to follow. This is ingested by the controller module, which converts it into a series of steer, throttle, and brake commands transmitted to our drive-by-wire test vehicle.
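The post does not describe the controller itself, but as a generic illustration of how a predicted trajectory can be turned into a steering command, here is a pure-pursuit-style sketch; the wheelbase and lookahead distance are placeholder values.

```python
import numpy as np

def pure_pursuit_steer(traj, wheelbase=2.7, lookahead=5.0):
    """Steering angle to track a trajectory given in the ego frame.

    traj: (T, 2) waypoints, x forward, y left. Picks the first waypoint at
    least `lookahead` meters away and returns the bicycle-model steering
    angle that arcs the vehicle through it.
    """
    dists = np.linalg.norm(traj, axis=1)
    ahead = dists >= lookahead
    idx = int(np.argmax(ahead)) if ahead.any() else -1   # fall back to last point
    x, y = traj[idx]
    curvature = 2.0 * y / (x * x + y * y)    # pure-pursuit circle through the point
    return np.arctan(wheelbase * curvature)  # front-wheel steering angle (radians)
```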

The entire system, from sensors to control, is tested in both controlled environments and on public roads under the supervision of a safety driver. This covers both scenario-specific testing and continuous driving at varying speeds (currently up to 25 km/h) across different weather conditions.

Our engineers visualise the runs, log performance, and identify failure modes. The team then analyses the data offline against our test metrics - interventions, safety envelope violations, consistency across multiple trials, etc. We will discuss these test metrics in upcoming blogs.

6. Triage Failures and Loop Back

Every test run becomes a diagnostic probe. Failures are triaged into categories and mapped to data, model, or software engineering gaps.

This feeds back into:

• Curriculum updates (new objectives or re-weighted priorities)

• Dataset augmentation (targeted collection and hard-negative mining)

• Model architecture or loss function changes

• Feature additions and improvements in onboard software

Each cycle tightens the model’s capabilities and improves its generalization, safety, and performance in open-world driving.

What are the key insights and the new challenges we are working on?

From our recent experiments, we learnt a few important insights & identified new challenges that we are working on:

Data Quality and Distribution are Crucial:

Since end-to-end models learn without intermediate modules, high-quality data is crucial: errors, noise, and distribution biases can easily creep into the accuracy of decision making. To ensure data quality, we implement multiple checks: well-trained drivers, a dedicated QA team, automated alerts for policy violations, etc. These measures help us curate high-quality training data.

Beyond quality, datasets must capture diverse driving behaviors (e.g., turns, merges), times of day, weather, and road types (single/double lane, urban/rural, etc.).

We use Vision-Language Models (VLMs) to assist in data curation at scale by automatically tagging and organizing large volumes of driving footage with natural language descriptions. They enable efficient filtering of rare or critical scenarios (e.g., jaywalking, unusual obstacles) using plain-language queries instead of manual annotation.
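Conceptually, the curation flow looks like the sketch below. `describe_scene` and `matches` are hypothetical stand-ins for the VLM captioning call and the text-matching step (e.g., embedding similarity); neither is a real API from this post.

```python
def curate_rare_clips(clips, describe_scene, matches):
    """Tag clips with VLM captions, then filter them by plain-language queries.

    describe_scene(clip) -> str caption   (hypothetical VLM call)
    matches(caption, query) -> bool       (hypothetical text matcher)
    """
    captions = {clip.id: describe_scene(clip) for clip in clips}
    queries = [
        "pedestrian jaywalking between moving vehicles",
        "unusual obstacle on the road, e.g. a pushcart",
    ]
    return [cid for cid, cap in captions.items()
            if any(matches(cap, q) for q in queries)]   # ids to pull into the dataset
```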

Open loop metrics can be misleading:

Many state-of-the-art models report open-loop regression metrics like L2 score, or simulation-based scores, to quantify performance, but these metrics fail to reflect real-world performance.

Open-loop testing uses pre-recorded scenarios and cannot capture how other agents react to the ego vehicle's actions. Metrics like L2 score can give a misleading picture because they:

• Penalize valid, safe alternatives

• Encourage behavior averaging, leading to unsafe, indecisive outputs (see the toy example after this list)

• Ignore safety, feasibility, and rule compliance

• Miss closed-loop effects where small errors escalate

• Don’t evaluate the full stack, overlooking sensor noise, latencies, and system-level failures
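The behavior-averaging failure in particular is easy to see with a toy example: if the data contains two equally valid maneuvers around an obstacle, a model trained with a squared-error loss is pulled toward their pointwise mean, which drives straight at the obstacle.

```python
import numpy as np

# Two valid expert maneuvers around an obstacle sitting at (10, 0):
swerve_left  = np.array([[5, 0], [10,  2], [15, 0]], dtype=float)
swerve_right = np.array([[5, 0], [10, -2], [15, 0]], dtype=float)

# The MSE-optimal single prediction is their pointwise mean...
mean_traj = (swerve_left + swerve_right) / 2
print(mean_traj[1])        # [10. 0.] -> heads straight through the obstacle

# ...and on the middle waypoint it scores a lower mean squared error
# (2^2 = 4 against each expert) than committing to either maneuver
# ((0 + 4^2) / 2 = 8), even though it is the only unsafe option.
```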

Narrow accuracy metrics alone can lead to a false sense of progress. A robust evaluation framework must capture reliability, generalization, and real-world readiness.

We’re building such a cohesive framework—spanning multi-modal prediction accuracy, collision rates, safety violations, closed-loop performance, intervention rates, scenario diversity, and user comfort—to truly assess system performance at scale. We will discuss this in more detail in our upcoming blogs.

Higher Speeds:

Scaling autonomy to highway speeds also reveals fundamentally harder engineering problems, not just data and model problems. High speeds demand longer prediction horizons, kinodynamically feasible planning, more stringent QA standards, and near-zero tolerance for software latency, bugs, or control instability. We will be testing and optimizing for higher speeds in the coming months.

Simulation:

Because of the gap between open-loop metrics and real-world performance, closed-loop testing in simulation plays a core role in the model release pipeline. But simulation suffers from the sim-to-real gap, distribution bias, and a lack of natural stochasticity in other agents. We aim to improve our simulation workflow with photo-realistic Operational Design Domains (ODDs), the ability to generate scenarios autonomously, and seamless integration into the evaluation loop.

Interpretability:

As end-to-end models operate in latent space, their decision making is hard to interpret. We are exploring methods like supervised probes, activation maps, and attention maps to improve the explainability and interpretability of our models.
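A supervised probe is the simplest of these: freeze the driving model, collect its latent features, and fit a small linear classifier to predict a human-interpretable concept (say, traffic-light state). High probe accuracy suggests the concept is linearly decodable from the representation. A hedged sketch, with the probe set and labels assumed to be annotated separately for this analysis:

```python
import torch
import torch.nn as nn

def train_linear_probe(features, labels, num_classes, epochs=100):
    """Fit a linear probe on frozen latent features.

    features: (N, D) activations collected from the frozen driving model.
    labels:   (N,) integer concept labels (e.g., red / yellow / green).
    """
    probe = nn.Linear(features.size(1), num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(probe(features), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        acc = (probe(features).argmax(dim=1) == labels).float().mean().item()
    return probe, acc    # high accuracy => concept is decodable from the latent
```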

Causal confusion:

End-to-end models can sometimes form spurious connections between what they see and what they do. While the model's predicted action might look reasonable (say, stopping at a traffic light), the underlying reason could be spurious: for example, the model might learn to stop by detecting the traffic pole rather than the color of the light. Due to this spurious correlation, the model might stop even at a green light or at random poles. We will be actively working on this in the coming weeks.

Catastrophic forgetting:

Catastrophic forgetting is a key challenge in foundational models—where learning from new data leads to a decline in performance on previously seen scenarios. This often stems from limited model capacity, data imbalance, or suboptimal training procedures. We’ve observed such regressions in our experiments and are actively developing training strategies and architectural tweaks to mitigate forgetting and ensure consistent, incremental learning.
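One common mitigation, among the strategies hinted at above, is experience replay: each new training batch mixes fresh data with samples replayed from earlier datasets so that old scenarios keep exerting gradient pressure. A minimal sketch, with the mixing ratio as an illustrative assumption:

```python
import random

def mixed_batches(new_data, replay_buffer, batch_size=32, replay_frac=0.3):
    """Yield batches that mix new samples with replayed older ones."""
    n_replay = int(batch_size * replay_frac)
    n_fresh = batch_size - n_replay
    random.shuffle(new_data)
    for i in range(0, len(new_data), n_fresh):
        fresh = new_data[i : i + n_fresh]
        old = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
        yield fresh + old        # old scenarios keep contributing gradients
```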

We believe we have just touched the tip of the iceberg, and we will continue to share updates on our progress as we improve our AI driver, delivering ADAS/AD and building the most human-like AI driver for some of the toughest road conditions.


