TwelveLabs' $100M Series B Signals the End of Unsearchable Video

Video is the last major unstructured data frontier. TwelveLabs' Series B and Marengo 3.0 signal that enterprise AI is expanding from text to the 90% of organizational knowledge locked in recordings, meetings, and footage.

By James Whitfield, Enterprise SaaS · Jul 2, 2026 · 13 min read

TwelveLabs closed a $100 million Series B in late June 2026, co-led by NEA and NAVER Ventures, with Amazon investing directly alongside an AWS preferred cloud partnership. The round validates a specific thesis: video is the last major unstructured data frontier, and the infrastructure to search, analyze, and extract structured insight from video at enterprise scale does not yet exist at the required fidelity — but it is about to.

The company's two foundation models — Marengo 3.0 for video understanding and Pegasus 1.5 for video-to-structured-data extraction — represent a qualitatively different approach to video AI than what general-purpose multimodal LLMs provide. Where GPT-4o or Gemini Ultra are designed to answer questions about individual video frames or short clips, TwelveLabs' models are designed for temporal understanding: tracking objects, events, speakers, and concepts across hours of video, with the ability to locate specific moments by semantic description rather than timestamp or manual tagging.

The $100M That Signals the End of Unsearchable Video

Enterprise organizations have been drowning in video for five years. The COVID-era shift to remote work created video meeting libraries (Zoom recordings, Teams calls, recorded product demos, customer onboarding sessions) that sit in organizational storage at scale but are effectively inaccessible as knowledge assets. You cannot search a Zoom recording the way you search a document. You cannot ask "find every customer call where the prospect mentioned competitor X in Q1" across a library of 50,000 recordings. The knowledge in those recordings stays locked unless someone watches them, takes notes, and files those notes correctly, which almost no one does consistently at scale.

The parallel video archive in media, security, sports, and government is larger still. A broadcast media company with 40 years of archive footage cannot monetize that archive because there is no infrastructure to retrieve specific moments by semantic description. A security operations center monitoring 1,000 cameras cannot retroactively search footage for a specific event. A sports analytics team cannot automatically extract performance metrics from game film at scale. The value is real; the unlock has been technically infeasible until now.

CEO Jae Lee has framed TwelveLabs' mission as making "video as searchable and analyzable as text" — a formulation that sounds simple until you understand the temporal complexity that makes it hard. The $100M Series B is the capital to pursue that mission at enterprise scale across five verticals and four international office locations.

Marengo 3.0 and Pegasus 1.5: What the Models Actually Do

The technical differentiation matters for understanding why TwelveLabs raised at this valuation rather than being dismissed as a feature within a general-purpose multimodal AI product.

Marengo 3.0 is TwelveLabs' video understanding foundation model. It ingests video files of any length and format, with or without audio, and indexes them across multiple modalities simultaneously: visual content (objects, scenes, colors, actions), audio content (speech, music, ambient sound, acoustic signatures), and motion patterns (movement trajectories, gesture recognition, event sequences). The indexed representation enables semantic search queries like "find all moments where a presenter references pricing" or "find every clip where a vehicle turns left at the intersection" across multi-hour video libraries in under two seconds.
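To make the idea of searching an indexed representation concrete, here is a minimal sketch of embedding-based retrieval over video segments. It is an illustration only, not Marengo's implementation: the `Segment` type, the three-dimensional toy vectors, and the `search` helper are all assumptions, and a real system would use model-generated embeddings with far more dimensions.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    """One indexed slice of a video (hypothetical schema)."""
    video_id: str
    start_s: float
    end_s: float
    embedding: Tuple[float, ...]  # toy vector standing in for a model embedding


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: List[Segment], query: Sequence[float], top_k: int = 3) -> List[Segment]:
    """Return the top_k segments most similar to the query vector."""
    ranked = sorted(index, key=lambda s: cosine(s.embedding, query), reverse=True)
    return ranked[:top_k]


# Toy index: made-up vectors, not real model output.
INDEX = [
    Segment("call-001", 0.0, 30.0, (0.9, 0.1, 0.0)),
    Segment("call-001", 30.0, 60.0, (0.1, 0.9, 0.1)),
    Segment("demo-007", 120.0, 150.0, (0.2, 0.8, 0.3)),
]

# Stand-in for an embedded text query such as "presenter references pricing".
QUERY = (0.0, 1.0, 0.1)

hits = search(INDEX, QUERY, top_k=2)
for hit in hits:
    print(hit.video_id, hit.start_s, hit.end_s)
```

The point of the sketch is that results come back as time ranges inside specific videos, which is what distinguishes moment-level retrieval from file-level search.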

The key architectural difference from general multimodal LLMs is temporal chunking with cross-chunk relationship modeling. Marengo 3.0 processes video in segments while maintaining a temporal context graph across the entire video length, enabling queries that reference events separated by hours in the source footage. This is not a capability that prompt-engineering around GPT-4o provides — it requires a model architecture designed for long-form temporal reasoning from the ground up, with training data that reflects the cross-temporal relationships present in real video content.
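As a rough intuition for chunked processing with cross-chunk linking, the sketch below splits a video into fixed windows and keeps a per-entity timeline so that a query can relate events hours apart. This is a deliberately simplified stand-in, not Marengo's architecture: the function names, the fixed one-hour window, and the hand-written detections are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Window = Tuple[float, float]


def chunk_bounds(duration_s: float, chunk_s: float) -> List[Window]:
    """Split [0, duration_s) into consecutive (start, end) windows."""
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return bounds


def build_entity_timeline(
    chunks: List[Window], detections: Dict[int, Set[str]]
) -> Dict[str, List[Window]]:
    """Map each entity label to the time-ordered windows it appears in."""
    timeline: Dict[str, List[Window]] = defaultdict(list)
    for i, window in enumerate(chunks):
        for label in sorted(detections.get(i, ())):
            timeline[label].append(window)
    return timeline


def first_then(timeline: Dict[str, List[Window]], first: str, then: str) -> List[Tuple[Window, Window]]:
    """Pairs of windows where `then` begins at or after the end of `first`."""
    return [
        (a, b)
        for a in timeline.get(first, [])
        for b in timeline.get(then, [])
        if b[0] >= a[1]
    ]


# A three-hour recording cut into one-hour chunks (illustrative sizes).
chunks = chunk_bounds(10800.0, 3600.0)

# Hypothetical per-chunk detections, keyed by chunk index.
detections = {0: {"pricing slide"}, 2: {"contract signing"}}

timeline = build_entity_timeline(chunks, detections)
pairs = first_then(timeline, "pricing slide", "contract signing")
print(pairs)
```

Because the timeline spans the whole video rather than a single chunk, the query can connect a moment in the first hour to one in the third, which is the property the paragraph above describes.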

Pegasus 1.5 sits on top of Marengo and handles video-to-structured-data extraction. Given an input video, Pegasus produces structured output: scene boundaries with semantic labels, entity mentions with temporal coordinates, speaker diarization, event sequences, and customizable taxonomy classifications. The structured output integrates directly into enterprise data warehouses, business intelligence platforms, and downstream AI workflows.
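One way to picture structured output that is ready for a warehouse is as typed records flattened into rows. The sketch below assumes a schema for illustration; the `Scene` and `EntityMention` classes, their field names, and the `to_rows` helper are hypothetical and do not reflect Pegasus's actual response format.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EntityMention:
    """An entity with temporal coordinates (hypothetical schema)."""
    label: str
    start_s: float
    end_s: float


@dataclass
class Scene:
    """A scene boundary with a semantic label, speakers, and entities."""
    start_s: float
    end_s: float
    semantic_label: str
    speakers: List[str] = field(default_factory=list)
    entities: List[EntityMention] = field(default_factory=list)


def to_rows(video_id: str, scenes: List[Scene]) -> List[Dict[str, object]]:
    """Flatten scenes into one row per entity mention."""
    rows = []
    for scene in scenes:
        for ent in scene.entities:
            rows.append({
                "video_id": video_id,
                "scene_label": scene.semantic_label,
                "entity": ent.label,
                "entity_start_s": ent.start_s,
                "entity_end_s": ent.end_s,
            })
    return rows


# Made-up example data for a single recorded call.
scenes = [
    Scene(0.0, 95.0, "introductions", ["speaker_1", "speaker_2"]),
    Scene(95.0, 410.0, "pricing discussion", ["speaker_1"], [
        EntityMention("competitor X", 120.5, 124.0),
        EntityMention("annual contract", 300.0, 306.5),
    ]),
]

rows = to_rows("call-001", scenes)
payload = json.dumps(rows, indent=2)  # JSON suitable for loading into a table
print(payload)
```

Flattening to one row per mention is a design choice for this sketch: it keeps the output directly queryable with SQL at the cost of repeating scene-level fields.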

The combination makes the enterprise value proposition coherent: Marengo 3.0 makes content findable; Pegasus 1.5 makes it measurable.

The Enterprise Data Problem That Video Intelligence Solves

The McKinsey Global Institute estimates that AI-driven productivity improvements could generate $4.4 trillion in annual economic value. A disproportionate share of that potential is locked in unstructured data — and video represents roughly 90% of the unstructured data enterprise organizations generate but cannot systematically analyze.

The five categories of enterprise video data with the clearest ROI for intelligence infrastructure:

Video Data Category	Volume Pattern	Primary Business Value	TwelveLabs Use Case
Customer calls (sales + CS)	High frequency, moderate length	Revenue intelligence, coaching, compliance	Conversation intelligence at scale
Internal meetings	Very high frequency, variable length	Knowledge management, action item extraction	Meeting intelligence search
Product tutorials and demos	Low frequency, high reuse value	Support deflection, feature adoption	Searchable support library
Security and surveillance	Continuous, very high volume	Incident investigation, compliance	Retroactive event search
Media archive	Historical, large catalog	Monetization, content repurposing	Semantic archive search

The customer call category is where immediate enterprise ROI is clearest. Revenue operations teams at enterprise SaaS companies pay $20,000–$80,000 per year for conversation intelligence platforms like Gong and Chorus that address a subset of what TwelveLabs' API can do — at the cost of vendor lock-in, limited integrations, and a product scope that does not extend to non-sales video. TwelveLabs' API allows enterprises to build custom conversation intelligence workflows on top of their existing recording infrastructure rather than adopting another point solution with its own data silo.

The internal meeting category represents a larger total video volume — most enterprise employees generate more recorded meeting content than sales calls — but the immediate ROI is less clear because the search and retrieval problem is more ambiguous than the revenue intelligence use case. The companies getting early traction with meeting intelligence applications of TwelveLabs are those with highly regulated information environments, where the ability to search and audit meeting recordings has compliance value that is easily quantified.

The AWS Partnership: Why Cloud-Native Distribution Matters

Amazon's direct investment in TwelveLabs, paired with the AWS preferred cloud partnership and optimization for AWS Trainium inference chips, is a distribution story as much as a capital story.

Enterprise AI infrastructure procurement in 2026 increasingly flows through cloud marketplace channels. AWS Marketplace and Azure Marketplace have become procurement paths that bypass traditional enterprise software sales cycles — IT organizations with existing cloud commitments and marketplace credits prefer to source AI infrastructure as marketplace transactions rather than standalone vendor contracts. Being an AWS preferred partner with Trainium optimization means TwelveLabs' API can be provisioned through channels that enterprise buyers already have open.

The Trainium optimization is technically significant for a different reason. AWS Trainium is Amazon's custom AI accelerator, built primarily for model training and also used for high-throughput inference at a lower per-inference cost than standard GPU instances. TwelveLabs' video processing model, which handles long-form content with very high token-equivalent processing volume, is exactly the kind of workload where Trainium's architecture provides a meaningful cost advantage over standard GPU infrastructure. The partnership means TwelveLabs can offer enterprise pricing with better margins than competitors running on standard compute.

The enterprise AI distribution challenge has been a consistent theme in Signal's coverage of AI infrastructure: the companies that solve the last-mile distribution problem through cloud partnerships and marketplace integration close enterprise deals faster than those relying solely on direct sales. TwelveLabs' AWS partnership is explicitly solving that problem.

The Five Verticals Where Video Intelligence Has Enterprise ROI

TwelveLabs' Series B funded an expansion from three offices (San Francisco, Seoul, New York) to four (adding London), reflecting a deliberate multi-vertical enterprise strategy rather than a single-market focus.

1. Media and broadcasting is the company's most established vertical. Broadcast media organizations (sports leagues, news networks, entertainment studios) have the largest historical video archives and the clearest monetization path for semantic search. A sports league that can license specific gameplay moments by semantic query rather than by manual curation has a different archive monetization business than one operating on manual tagging at scale. TwelveLabs' Marengo model identifies specific events in sports footage with accuracy approaching human editorial standards for highlight reel generation.

2. Security and compliance is the fastest-growing vertical. Enterprise security operations centers use TwelveLabs to enable retroactive search of surveillance footage — finding specific individuals, vehicles, or events across days of footage from hundreds of cameras without real-time human monitoring at each feed. The compliance use case is growing in financial services, where recorded advisory conversations must be searchable for regulatory examination purposes on demand.

3. Advertising and brand safety has emerged as a significant use case as the scale of programmatic video advertising creates a moderation challenge that human review cannot handle. TwelveLabs' Pegasus model classifies video content by brand safety taxonomy, identifying segments containing violence, competitor mentions, or contextual risk factors, which enables automated pre-bid filtering at the speed programmatic requires.

4. Government and intelligence applications are served through TwelveLabs' relationship with NAVER (the South Korean search and technology conglomerate that co-led the round), which brings government customer relationships across the Asia-Pacific region. The London office expansion adds access to UK and European government customer channels under GDPR-compliant deployment configurations.

5. Enterprise knowledge management is the long-horizon opportunity. As enterprise organizations accumulate years of recorded meetings, training sessions, and product documentation video, the ability to retrieve specific moments by semantic query rather than timestamp becomes a knowledge management infrastructure capability rather than a point solution. This vertical requires TwelveLabs to integrate with enterprise content management and collaboration platforms — the AWS partnership is partly a distribution mechanism for those integrations.
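For the brand-safety vertical (item 3 above), the pre-bid filtering step can be sketched as a simple set check over classified segments. This assumes the classifier has already attached taxonomy labels to each segment; the label names, the segment dictionaries, and both helper functions are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List

# Illustrative blocklist; a real taxonomy would be configured per advertiser.
BLOCKED_LABELS: FrozenSet[str] = frozenset({"violence", "competitor_mention"})


def is_bid_eligible(labels: Iterable[str], blocked: FrozenSet[str] = BLOCKED_LABELS) -> bool:
    """A segment is eligible when none of its labels are blocked."""
    return blocked.isdisjoint(labels)


def eligible_segments(
    classified: List[Dict[str, object]], blocked: FrozenSet[str] = BLOCKED_LABELS
) -> List[Dict[str, object]]:
    """Keep only the segments that pass the brand-safety check."""
    return [seg for seg in classified if is_bid_eligible(seg["labels"], blocked)]


# Made-up classifier output for one video.
classified = [
    {"start_s": 0.0, "end_s": 30.0, "labels": ["sports", "crowd"]},
    {"start_s": 30.0, "end_s": 45.0, "labels": ["violence"]},
    {"start_s": 45.0, "end_s": 90.0, "labels": ["interview"]},
]

safe = eligible_segments(classified)
print([(seg["start_s"], seg["end_s"]) for seg in safe])
```

The filter itself is cheap; in this framing the cost and latency sit in the upstream classification, which is why it has to run before bid time.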

The Product Architecture for Enterprise Deployment

TwelveLabs' enterprise deployment model reflects the lessons of prior API-first infrastructure companies: a horizontal API layer plus vertical configurations that address the specific integration and compliance requirements of each customer segment.

The core API exposes three primary endpoints: index (ingest and process a video into TwelveLabs' semantic index), search (query the index with natural language), and extract (run Pegasus 1.5 to produce structured output from indexed video). These three operations compose into the full range of enterprise use cases without requiring enterprises to adopt a purpose-built product surface area for each workflow.
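The way these three operations compose can be sketched with an in-memory stand-in. The class below is not the TwelveLabs SDK and does not call any real service: `FakeVideoClient`, its method signatures, and the keyword matching that substitutes for semantic search are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class FakeVideoClient:
    """Hypothetical in-memory stand-in for an index/search/extract API."""

    def __init__(self) -> None:
        self._videos: Dict[str, List[Tuple[float, str]]] = {}

    def index(self, video_id: str, segments: List[Tuple[float, str]]) -> str:
        """Store (start_s, text) segments; a real service would ingest video."""
        self._videos[video_id] = list(segments)
        return video_id

    def search(self, query: str) -> List[Tuple[str, float]]:
        """Naive keyword match standing in for natural-language search."""
        terms = query.lower().split()
        hits = []
        for video_id, segments in self._videos.items():
            for start_s, text in segments:
                if all(term in text.lower() for term in terms):
                    hits.append((video_id, start_s))
        return sorted(hits)

    def extract(self, video_id: str) -> Dict[str, object]:
        """Return a structured summary of one indexed video."""
        segments = self._videos[video_id]
        return {
            "video_id": video_id,
            "segment_count": len(segments),
            "segments": [{"start_s": s, "text": t} for s, t in segments],
        }


client = FakeVideoClient()
client.index("call-001", [(12.0, "Let's talk about pricing tiers"), (80.0, "Next steps")])
client.index("demo-007", [(5.0, "Welcome to the demo"), (42.0, "Enterprise pricing is custom")])

# Search first, then extract structured data only for the videos that matched.
hits = client.search("pricing")
reports = [client.extract(video_id) for video_id in sorted({v for v, _ in hits})]
print(hits)
```

The search-then-extract pattern shown here is one plausible composition: it limits the more expensive extraction step to videos that a query has already surfaced.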

Enterprise deployments add: data residency options (US-only, EU-only, or sovereign deployment for government customers); SOC 2 Type II compliance documentation; API rate limiting and quota management; custom taxonomy configurations for Pegasus classification; and integration connectors for the enterprise platforms where video already lives — Zoom, Microsoft Teams, AWS S3, Google Cloud Storage, and major enterprise video platforms.

Pricing follows an API-consumption model: per minute of video indexed plus per-query cost for search and extraction operations. For high-volume deployments, TwelveLabs offers committed-use contracts with volume discounts, similar to the hyperscaler committed-use model. For enterprise customers running above $500K per year in API consumption, custom enterprise agreements with SLA guarantees and dedicated support are available.
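The consumption model above reduces to simple arithmetic, sketched here for budgeting purposes. The per-minute rate, per-query rate, and discount used in the example are placeholders invented for illustration, not TwelveLabs' published prices; only the $500K custom-agreement threshold comes from the text.

```python
# Threshold above which custom enterprise agreements are described as available.
CUSTOM_AGREEMENT_THRESHOLD_USD = 500_000


def estimate_annual_cost(
    minutes_indexed: float,
    queries: float,
    rate_per_minute: float,
    rate_per_query: float,
    volume_discount: float = 0.0,
) -> float:
    """Indexing cost plus query cost, less a committed-use discount."""
    if not 0.0 <= volume_discount < 1.0:
        raise ValueError("volume_discount must be in [0, 1)")
    gross = minutes_indexed * rate_per_minute + queries * rate_per_query
    return gross * (1.0 - volume_discount)


def qualifies_for_custom_agreement(annual_cost_usd: float) -> bool:
    """True when annual consumption exceeds the stated threshold."""
    return annual_cost_usd > CUSTOM_AGREEMENT_THRESHOLD_USD


# Placeholder inputs: 1M minutes indexed, 200K queries, made-up rates.
cost = estimate_annual_cost(
    minutes_indexed=1_000_000,
    queries=200_000,
    rate_per_minute=0.05,   # hypothetical USD per indexed minute
    rate_per_query=0.01,    # hypothetical USD per query
    volume_discount=0.10,   # hypothetical committed-use discount
)
print(round(cost, 2), qualifies_for_custom_agreement(cost))
```

With real rates substituted in, the same function gives a quick check of whether a planned deployment lands in committed-use or custom-agreement territory.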

The middleware tax question — when does an AI infrastructure layer add defensible value versus becoming a rent-seeking intermediary — has a clear answer in TwelveLabs' case: temporal reasoning is a capability that cannot be adequately addressed by prompting a general-purpose multimodal LLM, which means the value is architectural rather than a convenience wrapper around existing API calls.

The Competitive Landscape After the Series B

TwelveLabs operates in a space where three categories of competitors are present but none fully overlaps with its positioning.

General multimodal AI providers — OpenAI GPT-4o, Google Gemini Ultra, Anthropic Claude with vision — offer video frame analysis through multimodal APIs. Their limitation is temporal: they process individual frames or short clips without the long-form temporal reasoning that makes TwelveLabs' models useful for feature films, multi-hour meeting recordings, or surveillance footage spanning days. These models will improve at temporal reasoning over time, but the training data and architecture investments required are different from their core text-and-reasoning optimization.

Point-solution conversation intelligence platforms — Gong, Chorus, Grain — solve a subset of TwelveLabs' use case: sales call analysis, with purpose-built product interfaces. Their advantage is product polish and focused go-to-market. Their limitation is scope: they do not generalize to non-sales video, do not offer API access for custom workflows, and do not address the broader video intelligence use case beyond sales enablement.

Cloud-native computer vision APIs (AWS Rekognition, Google Video Intelligence API, Azure Video Analyzer) offer video analysis capabilities from the hyperscaler infrastructure layer. Their limitation is semantic depth: they perform object detection and scene classification well, but lack the natural-language semantic search and temporal reasoning that make TwelveLabs' API significantly more capable for enterprise use cases requiring semantic retrieval across long-form content.

The enterprise AI infrastructure gap is wider in video than in text, and TwelveLabs is positioned at the widest part of that gap. The technical moat is temporal reasoning at scale — a capability requiring different model architecture, training data, and infrastructure optimization than the current generation of general-purpose multimodal AI systems can provide without significant dedicated investment.

What the Next 24 Months Look Like for Enterprise Video AI

TwelveLabs' London office expansion and NAVER co-lead reflect a Series B operating plan oriented around geographic expansion and vertical depth rather than horizontal API feature development alone.

The $100M funds three primary initiatives. First, foundation model capability expansion: Marengo 4.0 (targeted for H1 2027) will extend the temporal reasoning window to 24-hour footage coverage, enabling full-day surveillance analysis and broadcast event analysis across multi-day coverage. Second, enterprise distribution partnerships beyond AWS — Google Cloud and Azure marketplace integrations targeting European and Asian markets where NAVER and the London office provide relationship infrastructure. Third, vertical product development: purpose-built configurations for media, security, and enterprise knowledge management that reduce integration effort and create application-layer differentiation commanding premium enterprise pricing.

Ambient enterprise AI distribution — the pattern where AI capabilities flow through existing enterprise software rather than requiring separate product adoption — is the most likely path for video intelligence to reach mass enterprise adoption. TwelveLabs' integration strategy with Zoom, Microsoft Teams, and enterprise cloud storage is the right bet: the video AI that enterprises actually use is the one embedded in the workflow where the video already lives, not the one requiring a separate procurement decision and a distinct onboarding flow.

The question for the next 24 months is whether video intelligence becomes horizontal infrastructure, like object storage or search indexing, that every enterprise eventually procures as a commodity, or whether TwelveLabs can maintain premium positioning through model quality and temporal reasoning depth in a maturing market. The $100M Series B is the capital to pursue the former while building the architectural defensibility required for the latter.

Takeaway: TwelveLabs' $100M Series B is a bet that the 90% of enterprise knowledge locked in video recordings represents the largest unaddressed data intelligence opportunity remaining in enterprise AI. Marengo 3.0's temporal reasoning and Pegasus 1.5's structured extraction are the technical foundations for that bet. The AWS partnership and London expansion provide the distribution infrastructure to turn technical differentiation into enterprise revenue at scale. For enterprise technology leaders evaluating AI infrastructure investments in 2026, video intelligence is the capability gap where the current tool stack — general multimodal APIs, point-solution conversation intelligence, hyperscaler computer vision — leaves the most enterprise value unrealized. TwelveLabs is the clearest available bet on closing it.

Frequently Asked Questions

What did TwelveLabs raise in its Series B and why did Amazon invest?

TwelveLabs closed a $100 million Series B in late June 2026, co-led by NEA and NAVER Ventures, with Amazon investing directly alongside an AWS preferred cloud partnership. The round reflects the intersection of three investor theses: NEA's conviction in enterprise AI infrastructure, NAVER's strategic interest in video AI for its search and media business in Asia, and Amazon's recognition that video intelligence will become a significant workload category on AWS infrastructure. The AWS preferred cloud partnership gives TwelveLabs both distribution through AWS Marketplace and optimization for AWS Trainium chips, which provide cost-per-inference advantages for the long-form video processing workloads that represent TwelveLabs' core use cases. The company operates offices in San Francisco, Seoul, and New York, plus a newly opened London office funded by the Series B capital.

How does TwelveLabs Marengo 3.0 differ from GPT-4o for video analysis?

Marengo 3.0 is designed for temporal understanding of long-form video, which is architecturally different from what general-purpose multimodal models like GPT-4o offer. GPT-4o can analyze individual video frames or short clips, answering questions about what it sees in a specific moment. Marengo 3.0 ingests full-length video — hours of content — and builds a temporal context graph that tracks objects, events, speakers, and concepts across the entire timeline. This enables semantic queries like 'find all moments where the presenter mentions churn' across a library of 500 hours of recorded customer calls, or 'find every scene where a specific vehicle is visible' across days of surveillance footage. The distinction matters for enterprise use cases involving meeting recordings, archive footage, surveillance, and media — where the value is in cross-temporal retrieval, not single-frame analysis.

Which industries use video intelligence AI for enterprise applications in production?

TwelveLabs has production deployments across five major verticals. Media and broadcasting organizations use TwelveLabs to enable semantic search of historical archives — sports leagues licensing specific gameplay moments by query, news networks retrieving historical footage by topic without manual tagging, entertainment studios identifying repurposable clips at scale. Security operations centers use the API to enable retroactive event search across days of surveillance footage from hundreds of cameras. Advertising and brand safety teams use Pegasus 1.5 to classify video content for brand risk factors before programmatic ad placement. Financial services organizations use TwelveLabs to make recorded advisory conversations searchable for regulatory compliance examination. Enterprise knowledge management is the emerging long-horizon use case, applied to recorded meeting libraries and internal training video catalogs.

What makes TwelveLabs different from Gong or Chorus for conversation intelligence?

Gong and Chorus are product-layer conversation intelligence platforms built specifically for sales team workflows, with purpose-built interfaces for call recording, coaching, deal intelligence, and CRM integration. TwelveLabs operates at the infrastructure layer: it provides an API that enterprises use to build custom video understanding workflows, of which conversation intelligence is one use case among many. The practical difference is scope and flexibility. TwelveLabs' API applies to any video type — customer calls, internal meetings, product demos, archived recordings, surveillance footage — not just sales calls. Enterprise customers can build custom taxonomies, integrations, and data pipelines on top of TwelveLabs' API that Gong and Chorus cannot support. For organizations that already have video infrastructure and want to build proprietary video intelligence capabilities rather than adopting another vendor's product surface area, TwelveLabs' API is the right architectural choice.

How does AWS Trainium help with video AI inference costs and performance?

AWS Trainium is Amazon's purpose-built AI accelerator, designed primarily for model training and also used for high-throughput, cost-optimized inference. TwelveLabs' video processing model, which processes long-form content with very high token-equivalent volume per video minute, is a natural fit for Trainium's architecture, which provides better cost-per-inference than standard GPU instances for sustained high-throughput workloads. The optimization partnership between TwelveLabs and AWS means TwelveLabs' inference infrastructure is tuned specifically for Trainium, enabling the company to offer enterprise pricing with better margins than competitors running on standard NVIDIA-based GPU infrastructure. For enterprise customers, this translates to lower per-minute indexing costs and better throughput SLAs for high-volume video processing workloads like continuous security camera feeds or large historical archive ingestion.