Google Fixes Cross-Tenant RCE Flaw in Vertex AI SDK

Jun 17, 2026

The rapid integration of artificial intelligence into enterprise workflows has introduced a class of vulnerabilities that sidestep traditional defenses by targeting the infrastructure of machine learning development itself. As of early 2026, cloud-native AI platforms are complex enough that a minor oversight in a software development kit can cause security failures across tenant boundaries. A remote code execution vulnerability in the Google Cloud Vertex AI SDK shows how fragile the balance between user convenience and security can be in automated model deployment pipelines. The flaw allowed an attacker to hijack model uploads without any prior access to the victim’s project, which underscores the need to validate every automated step in the machine learning lifecycle. By exploiting deterministic resource naming and a missing ownership check, an attacker could turn a standard convenience feature into a vector for cross-tenant compromise. As organizations accelerate their adoption of AI services, the tools used to manage those services become a critical point of failure that demands prompt remediation.

1. Executive Overview: A Critical Security Resolution

A critical vulnerability disclosed in 2026 in the Vertex AI SDK for Python allowed unauthorized code execution. Dubbed Pickle in the Middle, the flaw let an adversary operating from a completely separate Google Cloud project intercept and poison machine learning models uploaded by a target user. The attacker needed no permissions or credentials within the victim’s environment and relied instead on the predictable nature of the staging process. By combining bucket squatting with insecure deserialization, an attacker could replace legitimate model artifacts with malicious payloads. Once the victim deployed the model to a production endpoint, the malicious code would execute within the Google-managed serving infrastructure. The attack bypassed traditional identity and access management controls and gave the attacker a foothold in high-value AI environments without raising initial alarms.

Google has since addressed the issue in version 1.148.0 and later of the google-cloud-aiplatform SDK. The vulnerability resided in the way the SDK created and selected the Google Cloud Storage buckets used to stage model data before it reached the Model Registry. Because the SDK named these buckets with a deterministic pattern based on the project ID and region, an attacker could preemptively claim the bucket name in their own project, and the SDK would then use the attacker’s bucket for the victim’s model storage. The discovery prompted a comprehensive review of how cloud service providers manage default resource allocation and of the assumptions automated tools make about resource ownership. Developers are strongly encouraged to upgrade to a patched version of the SDK and to define staging buckets explicitly to mitigate similar resource-naming exploits.

2. Core Concepts and Infrastructure: Understanding Vertex AI Operations

Vertex AI serves as a comprehensive machine learning platform designed to streamline the training and deployment of models and AI applications. Within this ecosystem, the Vertex AI SDK for Python acts as the primary interface for developers to interact with various cloud services programmatically. A central component of this platform is the Model Registry, a repository where users manage different versions of their machine learning models before they are served to end users. When a model is uploaded via the SDK, it undergoes a two-step process: first, the model artifacts are staged in a Google Cloud Storage bucket, and then they are registered with the service. This staging area is crucial because it acts as the source of truth for the serving infrastructure. When a model is deployed, Google’s internal service agents retrieve the files from this bucket and load them into a specialized container environment to handle inference requests.

The security of this workflow relies heavily on the isolation of storage resources and the integrity of the files being transferred. Bucket squatting is a technique that exploits the global uniqueness requirement of cloud storage names across the entire Google Cloud ecosystem. Since no two buckets can share the same name anywhere in the world, an attacker who can predict the name of a bucket that a service intends to use can create it first. This creates a scenario where a legitimate user’s data is directed to a storage container controlled by a third party. Furthermore, the use of Python’s pickle module for serializing models introduces additional risks. Because pickle is designed for flexibility rather than security, it allows for the inclusion of arbitrary logic during the deserialization process. When the Vertex AI infrastructure loads a poisoned model using standard Python libraries, it unintentionally triggers any embedded code, leading to remote code execution within the tenant project hosting the model endpoint.
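The deserialization risk described above can be shown with a deliberately harmless sketch. The class name Harmless and the choice of len as the callable are illustrative assumptions; a real payload would name a dangerous callable in the same position.

```python
import pickle


class Harmless:
    """Stand-in for a poisoned model object (illustrative only)."""

    def __reduce__(self):
        # pickle stores "call len('not a model')" in the byte stream.
        # Any importable callable could be named here instead of len.
        return (len, ("not a model",))


blob = pickle.dumps(Harmless())

# The callable runs during loading; no method on the result is ever invoked.
result = pickle.loads(blob)

print(type(result).__name__)  # an int comes back, not a Harmless instance
print(result)
```

The loader never reconstructs a Harmless object at all. It simply calls whatever the stream tells it to call, which is why loading an untrusted pickle is equivalent to running untrusted code.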

3. Technical Breakdown: How the Flaw Manifests in Code

The vulnerability was identified within the gcs_utils.py component of the google-cloud-aiplatform library, which manages the interaction between the SDK and cloud storage. Analysis of the source code revealed that the SDK followed a deterministic logic to generate a staging bucket name when one was not explicitly provided by the user. The pattern combined the project ID, the region, and a static suffix, resulting in a predictable string that remained constant for any given project. When the SDK prepared to upload a model, it would check for the existence of a bucket with this specific name. If the bucket was already present, the code proceeded to use it without verifying the identity of the project that owned the bucket. This oversight meant that as long as a bucket with the correct name existed, the SDK assumed it was the intended destination, regardless of its actual ownership or security configuration.
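The flawed decision can be modelled in a few lines. This is a simplified sketch, not the SDK's actual source: the function names, the suffix string, and the ordering of the name components are assumptions made for illustration, and the bucket lookup is replaced by a plain callable.

```python
from typing import Callable, Tuple

# Illustrative value; the article only states that "a static suffix" was used.
STATIC_SUFFIX = "vertex-staging"


def default_staging_bucket_name(project_id: str, region: str) -> str:
    """Build the predictable name: the same inputs always give the same string."""
    return f"{project_id}-{STATIC_SUFFIX}-{region}"


def pick_bucket_vulnerable(
    project_id: str, region: str, bucket_exists: Callable[[str], bool]
) -> Tuple[str, str]:
    """Mimic the flawed logic: reuse any existing bucket with the expected name."""
    name = default_staging_bucket_name(project_id, region)
    if bucket_exists(name):
        # Nothing here asks which project owns the bucket.
        return name, "reuse"
    return name, "create"


# An attacker who knows only the project ID and region derives the same name
# and registers it first (modelled as a set of globally taken names).
taken_names = {default_staging_bucket_name("victim-project", "us-central1")}

decision = pick_bucket_vulnerable(
    "victim-project", "us-central1", taken_names.__contains__
)
print(decision)
```

Because the name depends only on public or guessable inputs, the existence check is the sole gate, and an existence check says nothing about ownership.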

The discovery of this specific path was accelerated by large language models designed for security research and code analysis. These tools scanned the SDK’s extensive codebase for patterns in which resource names were constructed from user-controllable or predictable inputs. By feeding an LLM the logic used for GCS bucket interactions, researchers pinpointed the exact lines of code where the ownership check was missing. This represents a significant shift in how vulnerabilities are found in 2026: automated reasoning can now connect disparate pieces of logic, such as naming conventions and permission checks, to find exploit chains that a manual audit might miss. The failure to confirm that the staging bucket belonged to the same project as the Vertex AI instance created a serious security hole that anyone with knowledge of a target’s project ID could weaponize.

4. The Sequential Attack Chain: Exploiting Predictable Logic

A Pickle in the Middle attack begins with the adversary identifying a target project ID and the region where the victim is likely to deploy AI models. Using this information, the attacker calculates the predictable staging bucket name and creates that bucket within their own malicious Google Cloud project. To keep the attack invisible and efficient, the attacker configures a Cloud Function or a similar serverless trigger that monitors the newly created bucket for incoming file uploads. This automation is critical because it lets the attacker respond in real time the moment a victim starts the model registration process. At this stage, the attacker’s project acts as a silent intermediary, waiting for the SDK in the victim’s environment to mistake the attacker’s bucket for a legitimate staging area.

Once the victim initiates a model upload through a vulnerable version of the Vertex AI SDK, the client library detects the existence of the attacker-controlled bucket and begins transferring the model artifacts. During this process, the attacker’s automated script detects the new files and immediately replaces the legitimate model files with a malicious version. This malicious file contains a serialized Python object designed to execute code upon loading, often utilizing the __reduce__ method of the pickle protocol to trigger a reverse shell or a data exfiltration script. The timing of this substitution is vital, as it must occur after the upload completes but before the Vertex AI service agent attempts to read the artifacts for registration. Because the SDK does not verify the integrity of the staged files after they are uploaded, the victim unknowingly registers a compromised model in their registry.
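The missing integrity check can be pictured as a digest comparison between the local artifact and the staged copy. The sketch below is an assumption about what such a check could look like, not a description of the SDK; the helper names are hypothetical and the artifacts are plain byte strings rather than real bucket objects.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the artifact as a hex string."""
    return hashlib.sha256(data).hexdigest()


def staged_copy_matches(local_bytes: bytes, staged_bytes: bytes) -> bool:
    """Compare the digest of the local artifact with the digest of the staged copy."""
    return hmac.compare_digest(sha256_hex(local_bytes), sha256_hex(staged_bytes))


local_model = b"legitimate model bytes"
swapped_model = b"attacker-supplied bytes"

print(staged_copy_matches(local_model, local_model))
print(staged_copy_matches(local_model, swapped_model))
```

A check like this only detects a swap that has already happened when the comparison runs. An attacker who controls the bucket can still replace the file afterwards, so it narrows the window rather than closing it.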

The final phase of the attack occurs when the victim deploys the registered model to a Vertex AI endpoint for production use. At this point, the Google-managed service account responsible for model serving fetches the artifacts from the attacker’s bucket and loads them into the serving container. As the container initializes the model using libraries such as joblib or pickle, the embedded malicious code executes with the privileges of the service agent. This grants the attacker remote code execution within the tenant project, a specialized environment that Google uses to host customer-specific resources. From this vantage point, the attacker can move laterally, access sensitive metadata, and interact with other cloud services that the service agent has permission to reach. The entire chain completes without the attacker ever compromising the victim’s local machine or primary Google Cloud credentials.

5. Impact and Post-Exploitation Analysis: The Scope of Potential Damage

Achieving remote code execution within the Vertex AI serving infrastructure provides an attacker with a high-privilege platform for further exploitation. One of the most immediate risks is the theft of OAuth tokens from the metadata server of the serving container. These tokens are associated with the per-project service accounts that Vertex AI uses to function, and they often possess significant permissions to read from storage buckets, interact with other AI models, and manage cloud resources. By exfiltrating these credentials, an attacker can extend their reach far beyond the initial serving container, potentially accessing proprietary datasets or modifying other models within the victim’s organization. This creates a secondary layer of risk where the initial RCE is merely the entry point for a wider campaign of corporate espionage or data destruction.

Beyond credential theft, the attacker can perform extensive reconnaissance within the internal networks of the tenant project. This includes enumerating BigQuery datasets to identify sensitive business intelligence or customer data that might be linked to the AI pipeline. Because Vertex AI often runs on managed Kubernetes clusters, the attacker may also be able to gather intelligence about the underlying infrastructure, such as cluster logs, configuration maps, and network policies. In some scenarios, this could lead to the discovery of other users’ models hosted on the same shared infrastructure, although Google maintains strong isolation between tenants. Even so, the ability to observe the operational metadata of a high-scale AI environment gives an adversary the tactical knowledge needed to craft more sophisticated attacks against the platform’s core architecture.

6. Mitigation and Resolution: Securing the Machine Learning Pipeline

To resolve the Pickle in the Middle vulnerability, Google implemented a series of critical changes to the Vertex AI SDK designed to eliminate the predictability of staging resources. In version 1.148.0 of the SDK, the logic for generating default bucket names was updated to include random UUIDs, making it practically impossible for an attacker to guess and preemptively create the bucket. More importantly, the SDK now includes a mandatory ownership check that verifies the project ID associated with a bucket before any data is uploaded. If the bucket does not belong to the user’s current project, the SDK will refuse to use it and will instead prompt the user to provide a secure, verified storage location. This double-layered defense ensures that even if a naming conflict occurs, the SDK will not inadvertently leak data to an external project.
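The two defenses described above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the patched SDK code: every name here is hypothetical, the name format is invented, and the ownership lookup is a callable that returns the owning project number (or None if the bucket does not exist). With the google-cloud-storage client, a bucket object's project_number property could plausibly supply that value.

```python
import uuid
from typing import Callable, Optional


class BucketOwnershipError(Exception):
    """Raised when an existing bucket belongs to a different project."""


def randomized_bucket_name(project_id: str, region: str) -> str:
    """Append a random UUID so the default name cannot be predicted."""
    return f"{project_id}-vertex-staging-{region}-{uuid.uuid4().hex}"


def select_staging_bucket(
    name: str,
    caller_project_number: int,
    owner_of: Callable[[str], Optional[int]],
) -> str:
    """Accept a bucket only if it is absent or owned by the caller's project."""
    owner = owner_of(name)
    if owner is None:
        # The bucket does not exist yet; it can be created in the caller's project.
        return name
    if owner != caller_project_number:
        raise BucketOwnershipError(
            f"bucket {name!r} is owned by project {owner}, "
            f"not {caller_project_number}"
        )
    return name


# Toy ownership table: the squatted bucket is owned by project 222.
owners = {"victim-project-vertex-staging-us-central1": 222}

try:
    select_staging_bucket(
        "victim-project-vertex-staging-us-central1", 111, owners.get
    )
except BucketOwnershipError as exc:
    print("refused:", exc)
```

Randomizing the name removes the attacker's ability to guess it, while the ownership check covers the case where a name collision happens anyway, which is why the two measures complement each other.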

The disclosure of this flaw followed a coordinated process between security researchers and the Google security team, emphasizing the value of bug bounty programs in the AI sector. Upon receiving the report, Google validated the findings and worked to deploy a fix that would protect all users of the managed service. For developers, the most immediate action is to update their local and CI/CD environments to the latest version of the google-cloud-aiplatform library. Additionally, organizations should adopt the practice of explicitly defining a staging bucket using the staging_bucket parameter in the SDK’s initialization. By managing their own storage resources rather than relying on default behaviors, teams can maintain a higher level of visibility and control over where their sensitive model artifacts are stored during the development lifecycle.
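Both recommendations can be checked in code. The sketch below assumes 1.148.0 is the first patched release, as stated above; the helper names, project ID, and bucket URI are hypothetical placeholders. The aiplatform import sits inside the function so the version check works even where the SDK is not installed.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

FIRST_PATCHED = (1, 148, 0)  # first fixed release according to the article


def version_tuple(text: str) -> Tuple[int, int, int]:
    """Parse the leading 'major.minor.patch' part of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def is_patched(text: str) -> bool:
    return version_tuple(text) >= FIRST_PATCHED


def installed_sdk_is_patched() -> Optional[bool]:
    """Return None when google-cloud-aiplatform is not installed."""
    try:
        return is_patched(metadata.version("google-cloud-aiplatform"))
    except metadata.PackageNotFoundError:
        return None


def init_with_explicit_bucket(project: str, location: str, bucket_uri: str) -> None:
    """Initialize the SDK with a bucket the team created and controls."""
    from google.cloud import aiplatform  # imported lazily; requires the SDK

    aiplatform.init(project=project, location=location, staging_bucket=bucket_uri)


print(installed_sdk_is_patched())
```

A call such as init_with_explicit_bucket("my-project", "us-central1", "gs://my-team-staging-bucket") would then route staging through a bucket the team owns instead of a default name derived by the SDK.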

7. Protective Measures: Proactive Defense in an AI-Driven Landscape

As the industry moves forward in 2026, the lessons learned from the Vertex AI SDK vulnerability highlight the need for specialized security tools focused on AI posture management. Organizations should implement monitoring solutions that can detect anomalies in cloud storage usage, such as unauthorized access to buckets or unexpected file modifications during model staging. Identity and entitlement management remains a cornerstone of this defense, where the principle of least privilege should be applied to service accounts used by AI platforms. By restricting the permissions of these accounts to only the specific resources they need, the impact of a potential RCE can be significantly contained. Furthermore, security teams must treat machine learning models as untrusted code, particularly when they are serialized using formats like pickle that are known to be inherently insecure.
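One way to treat a pickled model as untrusted input is to restrict which globals the unpickler may resolve, a technique described in the Python pickle documentation. The sketch below is illustrative: the allowlist contents and the names AllowlistUnpickler and safe_loads are assumptions, and real model pickles typically reference many more classes than this toy list permits.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist; a real one would be built from the classes
# a given model format is known to need.
ALLOWED_GLOBALS = {("collections", "OrderedDict")}


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the allowlist."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def safe_loads(data: bytes):
    """Deserialize bytes using the allowlisting unpickler."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


# Plain containers and allowlisted classes load normally.
weights = safe_loads(pickle.dumps(OrderedDict(layer1=[0.1, 0.2])))
print(weights)


class Payload:
    """Harmless stand-in for a malicious object."""

    def __reduce__(self):
        return (len, ("blocked before this runs",))


try:
    safe_loads(pickle.dumps(Payload()))
except pickle.UnpicklingError as exc:
    print("rejected:", exc)
```

The allowlist blocks the callable lookup before anything is executed, so the payload is rejected rather than run. It reduces risk but does not make pickle a safe format for untrusted sources.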

The remediation of this cross-tenant flaw represents a significant milestone in securing the automated pipelines that power modern machine learning. The industry has often overlooked the security implications of the staging and registration phases, focusing instead on the security of the final inference endpoint. The resolution of the Pickle in the Middle vulnerability shifts attention back to the entire model supply chain, so that every step from local development to cloud deployment is protected against hijacking. Developers are encouraged to integrate static and dynamic analysis into their AI workflows to identify potential resource naming conflicts and insecure deserialization points. By adopting these proactive measures and staying current with security patches, organizations can reduce their exposure to the evolving threats targeting artificial intelligence infrastructure.