
How We Built a Startup Success Prediction Engine from Public Data

A look at the research, data sources, and early findings behind VentureFlow's deal scoring model

VentureFlow Research Team | April 2026 | 8 min read
225,000+ unique companies in the training dataset
86.3% ROC AUC on the held-out test set
84% Precision@50 (42 actual successes among the top 50 scored companies)
4 public data sources combined

The problem with gut instinct at scale

Venture capital has always run on pattern recognition. The best investors develop an intuition for what a winning team looks like, what a real market opportunity feels like, and what signals separate the companies worth backing from the ones that aren't. The problem is that this intuition doesn't scale.

A typical seed-stage fund sees 1,000 to 3,000 inbound pitches per year. A partner can only take so many meetings. Something has to give — and what usually gives is the quality of evaluation on anything outside a partner's existing network or thesis.

VentureFlow was built to solve the prioritization problem. But we wanted the scoring behind our deal rankings to be more than a rules-based rubric or a Claude prompt. We wanted it to reflect what actually predicts startup success — grounded in data.

This post describes what we found.

Starting with the research

Before writing a single line of model code, we spent time in the academic literature on startup success prediction. The field has grown significantly in the past five years, with several rigorous studies using machine learning against large VC datasets.

A few findings shaped our thinking most:

First, textual signals matter as much as structured data. Research from LMU Munich (Maarouf et al., 2024) found that combining structured company fundamentals with free-text descriptions — the kind of language a founder uses to describe their business — improved prediction accuracy meaningfully, with the text component alone responsible for a significant share of the model's lift. The implication: how a founder describes what they're building is itself a signal.

Second, team composition outranks almost everything else in investor surveys, but it is harder to quantify than it seems. A survey of 885 institutional VCs found that 95% of respondents cited the team as an essential investment factor. Yet when researchers try to encode founder quality as structured features, the signal is noisier than expected — prior exits matter, domain expertise matters, but the relationship isn't linear and varies by sector.

Third, look-ahead bias is the silent killer of ML models in this domain. Many published models inadvertently train on information that wouldn't have been available at the time of the actual investment decision — like whether a company later raised a Series C. Building a time-aware training pipeline, where each prediction is evaluated only against information available at a specific historical date, turned out to be one of the most important design decisions we made.
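
As a concrete illustration, here is a minimal sketch of a time-aware split, assuming a snapshot table with one row per company per evaluation date; the column names are illustrative, not our production schema.

```python
# Minimal sketch of a time-aware split: features for each snapshot are built
# only from events on or before as_of_date, and labels only from events after
# it, so the model never trains on information from "the future".
# Column names ("as_of_date") are illustrative assumptions.
import pandas as pd

def time_aware_split(snapshots: pd.DataFrame, cutoff: str):
    cutoff_ts = pd.Timestamp(cutoff)
    train = snapshots[snapshots["as_of_date"] <= cutoff_ts]
    test = snapshots[snapshots["as_of_date"] > cutoff_ts]
    return train, test

# Example: fit on snapshots up to the end of 2015 and evaluate on later ones,
# so no training example can "see" post-2015 outcomes.
# train_df, test_df = time_aware_split(snapshots, "2015-12-31")
```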

Building the dataset

We made an early decision to build our training dataset entirely from public sources rather than licensing proprietary databases. There were a few reasons for this. Public data has no usage restrictions for model training. It's reproducible. And frankly, the signal quality in public startup data is better than most people expect.

Our dataset combines several sources:

Historical startup funding data. We ingested multiple publicly available datasets covering startup companies, funding rounds, acquisition events, and IPO records — spanning roughly 2000 through 2020. After deduplication and cleaning, this gave us a foundation of over 200,000 unique companies with varying degrees of structured data.

SEC EDGAR filings. US companies that raise money from accredited investors under a Regulation D exemption file a Form D with the SEC. These filings are public, machine-readable, and cover private fundraising activity going back decades. We use S-1 filings — the registration statements companies file when going public — as ground-truth success labels: a company that filed an S-1 registered for a public offering, which we treat as a successful exit.
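
For readers who want to explore this themselves, EDGAR exposes filing histories as JSON at data.sec.gov. The sketch below is a simplified example of checking a company's filings for Form D and S-1 events; the CIK and the labeling rule shown are illustrative, and the SEC asks clients to identify themselves via the User-Agent header.

```python
# Simplified sketch of pulling a company's filing history from SEC EDGAR's
# public submissions API and checking for Form D / S-1 events.
import requests

def fetch_recent_filings(cik: int, user_agent: str = "research@example.com"):
    url = f"https://data.sec.gov/submissions/CIK{cik:010d}.json"
    resp = requests.get(url, headers={"User-Agent": user_agent})
    resp.raise_for_status()
    recent = resp.json()["filings"]["recent"]
    return list(zip(recent["form"], recent["filingDate"]))

def has_filed(filings, form_type: str) -> bool:
    return any(form == form_type for form, _ in filings)

# filings = fetch_recent_filings(320193)        # hypothetical example CIK
# raised_privately = has_filed(filings, "D")    # Form D: exempt private raise
# went_public = has_filed(filings, "S-1")       # S-1: our success label
```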

Patent data. Through the USPTO's public data API, we enriched company records with patent filing history. Prior research (Ross et al., 2021, the CapitalVX model) found that patent activity was one of the strongest predictors of acquisition — companies with recent patent filings are materially more likely to be acquired by larger players.

Accelerator portfolios. Public directories from major accelerator programs provide curated, high-quality company profiles with outcome data. These companies are labeled with known statuses — active, acquired, public, or inactive — giving us reliable outcome labels for a meaningful slice of the training set.

After deduplication across sources — matching companies by domain and name using fuzzy matching — we ended up with a clean training set of roughly 225,000 unique companies with known or estimable outcomes.
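
The matching logic itself is simple in principle. Below is a rough sketch of the kind of rule we mean: exact match on a normalized domain first, fuzzy name matching as a fallback. The thresholds and normalization rules here are simplified assumptions.

```python
# Rough sketch of cross-source company matching: exact domain match first,
# then fuzzy name matching. Threshold and scorer choice are assumptions.
import re
from rapidfuzz import fuzz

def normalize_domain(url: str) -> str:
    return re.sub(r"^(https?://)?(www\.)?", "", url.strip().lower()).split("/")[0]

def normalize_name(name: str) -> str:
    name = name.strip().lower()
    return re.sub(r"\b(inc|llc|ltd|corp)\.?$", "", name).strip(" ,.")

def same_company(a: dict, b: dict, name_threshold: float = 92.0) -> bool:
    if a.get("domain") and b.get("domain"):
        if normalize_domain(a["domain"]) == normalize_domain(b["domain"]):
            return True
    score = fuzz.token_sort_ratio(normalize_name(a["name"]), normalize_name(b["name"]))
    return score >= name_threshold

# same_company({"name": "Acme, Inc.", "domain": "https://acme.com"},
#              {"name": "Acme", "domain": "www.acme.com"})   # -> True
```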

What we measured

Our feature set covers four categories:

Funding signals (8 features): stage, capital raised, round velocity, investor backing
Team signals (4 features): founder count, prior exits, technical backgrounds, investor tier
Company signals (4 features): age, industry, geography, data completeness
Text signals (768-dim): semantic embeddings of company descriptions

Funding signals: Stage at time of evaluation, total capital raised (log-scaled), number of rounds, velocity between rounds (average days between successive raises), days since last round, and whether the company had institutional VC backing.

Team signals: Number of founders, whether any founder had a prior exit, whether the founding team had domain-relevant technical backgrounds, and investor tier (whether lead investors were top-tier institutional funds versus angels or unknowns).

Company signals: Founding year, company age at time of evaluation, industry category, and geography.

Text signals: Sentence embeddings of the company's description — a 768-dimensional vector encoding the semantic content of how the company describes itself. This is where the text research matters most: companies that describe their business with specificity and clarity tend to score differently than those with vague, jargon-heavy descriptions.

One feature we borrowed from the CapitalVX research that might seem counterintuitive: data completeness. How many of a company's profile fields are filled in turns out to be a meaningful signal on its own. Companies with more complete public profiles tend to have higher outcomes, likely because profile completeness correlates with intentional company-building and active investor relations.
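
In practice this is a one-line feature, something like the sketch below; the specific field list is an assumption for illustration.

```python
# Sketch of the data-completeness feature: the fraction of expected profile
# fields that are actually populated. The field list is illustrative.
PROFILE_FIELDS = ["description", "website", "founded_year", "industry",
                  "location", "founders", "total_raised", "last_round_date"]

def completeness(profile: dict) -> float:
    filled = sum(1 for f in PROFILE_FIELDS if profile.get(f) not in (None, "", []))
    return filled / len(PROFILE_FIELDS)

# completeness({"description": "B2B payments API", "website": "acme.com"})  # -> 0.25
```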

The model architecture

Architecture overview: a pitch deck or company profile feeds a structured tower (XGBoost / MLP over 20 features) and a text tower (sentence embeddings, 768 → 64), which meet in a fusion layer that outputs a success probability with SHAP attribution.

We use a two-tower architecture. The first tower is a gradient-boosted tree model (XGBoost) trained on the structured features — fast, interpretable, and well-suited to tabular data with missing values. The second tower processes the text embeddings through a projection layer.

At scoring time, the structured tower runs on whatever fields are available in the pitch deck or company profile. Missing fields are handled with explicit defaults rather than dropped — the model was trained to handle incomplete data because incomplete data is the norm in early-stage evaluation.

The text tower encodes any available description using a pre-trained sentence transformer model and projects it into the same embedding space used during training.

Both towers are fused through a final layer that produces a single probability score: the estimated likelihood that a company with these characteristics represents a successful investment outcome.
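
To make the shape of the scoring path concrete, here is a simplified sketch. The encoder name, layer sizes, and the choice to fuse the structured tower's probability (rather than its raw features or logits) with the projected text vector are illustrative assumptions, not our production configuration.

```python
# Simplified sketch of the two-tower scoring path. Encoder, layer sizes, and
# the fusion input are assumptions for illustration.
import numpy as np
import torch
import torch.nn as nn
import xgboost as xgb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")  # 768-dim sentence embeddings

class TextTower(nn.Module):
    """Projects a 768-dim description embedding down to 64 dims."""
    def __init__(self, in_dim: int = 768, out_dim: int = 64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x):
        return self.proj(x)

class FusionHead(nn.Module):
    """Combines the structured tower's probability with the text projection."""
    def __init__(self, text_dim: int = 64):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(text_dim + 1, 32), nn.ReLU(),
                                  nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, structured_prob, text_vec):
        return self.head(torch.cat([structured_prob, text_vec], dim=-1))

def score_company(description: str, structured_features: np.ndarray,
                  booster: xgb.XGBClassifier, text_tower: TextTower,
                  fusion: FusionHead) -> float:
    # Structured tower: XGBoost handles missing values (np.nan) natively,
    # so incomplete profiles are scored rather than dropped.
    p_structured = booster.predict_proba(structured_features.reshape(1, -1))[:, 1]
    # Text tower: encode the description and project it to 64 dims.
    emb = torch.tensor(encoder.encode(description), dtype=torch.float32).unsqueeze(0)
    text_vec = text_tower(emb)
    # Fusion layer: a single success probability.
    p = fusion(torch.tensor(p_structured, dtype=torch.float32).unsqueeze(0), text_vec)
    return float(p.item())
```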

For explainability, we use SHAP values to attribute each score back to individual features. When VentureFlow tells you that a deal's top positive signals are “Tier-1 investor backing” and “prior exit founder,” those attributions are derived from the model's actual feature weights, not from a template.
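
If you haven't worked with SHAP before, the attribution step looks roughly like the self-contained example below, which trains a toy XGBoost model and prints per-feature contributions for one prediction; the feature names and data are purely illustrative.

```python
# Self-contained toy example of SHAP attribution on a gradient-boosted
# structured tower. Feature names and data are illustrative only.
import numpy as np
import shap
import xgboost as xgb

feature_names = ["tier1_investor", "prior_exit_founder",
                 "round_velocity_days", "log_capital_raised"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

# TreeExplainer attributes each prediction back to individual features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])

for name, contribution in sorted(zip(feature_names, shap_values[0]),
                                 key=lambda kv: -abs(kv[1])):
    print(f"{name}: {contribution:+.3f}")
```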

Research timeline

Literature review → Data sourcing → Pipeline build → Model training → Integration

Early findings

We want to be careful not to overstate what a model trained on historical data can tell you about any specific company today. Startup outcomes are inherently noisy, and the best companies often look wrong by historical metrics early on.

That said, a few patterns emerged clearly from our analysis:

Funding velocity is underweighted by human screeners. The time between funding rounds — how quickly a company moves from seed to Series A, or A to B — is a stronger predictor of eventual outcomes than total capital raised. A company that raises slowly may simply be capital-efficient, but a company that raises quickly usually has demonstrated traction.

Investor network effects are real and measurable. Companies backed by investors who have previously backed successful exits have meaningfully higher success rates — not just because those investors pick better, but because their involvement changes the trajectory of the companies they back. Sorting by investor tier is one of the highest-signal quick filters a VC can apply.

Geography matters less than it used to, but still matters. US-based companies in our dataset had higher outcome rates than international peers, but the gap has narrowed significantly in the post-2015 cohorts. The predictive power of geography is declining as global startup infrastructure improves.

The text signal is real. Companies that describe their business with precision — specific customer segments, named problems, quantified claims — score differently than those that rely on category buzzwords. This held up even when controlling for funding stage and industry. It's one reason we believe pitch deck analysis is a higher-signal input than a Crunchbase profile alone.

What this means for deal evaluation

Our model doesn't replace judgment — it informs it. The probability score VentureFlow produces is best understood as a prior: given what we can observe about a company's characteristics, how does it compare to the historical base rate of success for companies with similar profiles?

A high score on a company you'd otherwise pass on is worth a second look. A low score on a company that excites you is worth understanding — which specific signals are driving it down, and do you have information the model doesn't?

The most useful thing a model can do in this context isn't to make the decision. It's to surface the deals you might have missed and the assumptions you might not have questioned.

What's next

The current model is our baseline. We're actively expanding the dataset with additional public sources, improving the text encoding pipeline to handle pitch deck content directly, and building the feedback loop that will let VentureFlow's model improve over time as partner decisions are recorded in the platform.

We're also exploring how thesis-conditioning — training or fine-tuning the model on the specific historical patterns of a given fund — changes the predictive power. A model tuned to what Hustle Fund has historically invested in should look different from one tuned to Sequoia's patterns.

We'll share more as the research develops.

VentureFlow is an AI-powered deal evaluation platform for venture capital firms. If you're interested in early access, reach out here.