Measuring a complex disease — Working Complexity

We have agreed, almost without dissent, that interstitial lung disease is complex. We say it in the first slide of every advisory board. And then, having said it, we reach for forced vital capacity.

— The contradiction worth sitting with

The pathways are heterogeneous. The trajectories are non-linear. A single label — IPF, NSIP, HP, SARD-ILD — sits over a biology that refuses to hold still. Progression is not one thing happening at one rate. It is fibroblast foci, epithelial injury, immune feedback, pulmonary vasculopathy, deconditioning, anxiety, and the accumulating weight of comorbidity, all moving at once and all coupled to one another.

The language we use to describe the disease is the language of complexity. The instruments we use to measure it are the instruments of reduction. We declare the system irreducible and then we reduce it — to a single scalar, sampled a few times a year, modelled as if it fell along a straight line. The gap between the two is not a detail. It is the reason so many of our trials are underpowered against the thing patients actually experience, and so much of our prognostication is a population average wearing the costume of a personal forecast.

This piece is an argument and a path. The argument: reductionist measurement is not a regrettable compromise we tolerate until something better arrives — it is the primary epistemology of the field, and the composite endpoint, which we treat as our concession to complexity, is a far smaller concession than we think. The path: complexity science already owns a toolkit built for exactly this kind of system. We outline it, and we are specific about how each instrument could be deployed in ILD.

§ 01

The anatomy of a reduction

Look honestly at what we measure and how.

FVC as a scalar. The forced vital capacity is effort-dependent, sampled sparsely, and reported as a single number per visit. We then define meaningful decline as a threshold crossing (a relative fall of 10% or more, sometimes 5–10% when symptoms or imaging also worsen) and we model the change as approximately linear over 52 weeks. Every one of those moves is a reduction. A lung is reduced to a volume. A volume is reduced to a slope. A slope is reduced to a binary: progressed, or not. By the time a patient becomes a data point, almost everything that made the disease complex has been quotiented away.
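The chain of reductions can be made concrete in a few lines. What follows is a minimal sketch, not anyone's analysis plan: the visit schedule and both FVC series are invented for illustration, and the function names are ours. It fits a straight line through the visits and then applies a 10% relative-decline rule.

```python
from statistics import mean

def ols_slope(weeks, fvc_ml):
    """Least-squares slope of FVC (mL) against time (weeks)."""
    t_bar, y_bar = mean(weeks), mean(fvc_ml)
    sxx = sum((t - t_bar) ** 2 for t in weeks)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(weeks, fvc_ml))
    return sxy / sxx

def relative_decline(fvc_ml):
    """Fall from first to last visit, as a fraction of the first value."""
    return (fvc_ml[0] - fvc_ml[-1]) / fvc_ml[0]

def progressed(fvc_ml, threshold=0.10):
    """The final reduction: one boolean per patient."""
    return relative_decline(fvc_ml) >= threshold

# Hypothetical visits and values, illustrative only (not patient data).
weeks = [0, 12, 24, 36, 52]
stepwise = [3000, 2995, 2990, 2700, 2690]  # stable, then a sudden step down
gradual = [3000, 2925, 2850, 2770, 2690]   # steady erosion

for name, series in [("stepwise", stepwise), ("gradual", gradual)]:
    print(name, round(ols_slope(weeks, series), 2), progressed(series))
```

The two invented patients start and finish at the same volumes, so the binary cannot tell them apart, even though one declined steadily and the other fell in a single step.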

The single primary endpoint. Regulatory convention rewards one pre-specified primary, and FVC decline has become the field's default because it is familiar, it moves within a trial's horizon, and it is defensible to a committee. The convention is reasonable as governance. It is poor as biology. A single endpoint can only test a single, pre-committed hypothesis about what the drug does — and a drug acting on a complex system rarely does one clean thing.

The univariate prognostic marker. DLCO. Six-minute walk distance. Each is read as a channel in isolation, its trend interpreted on its own axis, as if the patient were several diseases running in parallel rather than one system whose parts are wired together.

The point

None of these instruments is wrong. FVC tracks something real, and the antifibrotic trials that relied on it changed practice. The problem is not that the instruments are invalid. The problem is that they are the primary epistemology — the thing we reach for first and trust most — for a disease we have already conceded is the wrong shape for them.

§ 02

The composite is a smaller nod to complexity than we think

When pressed on this, the field has an answer ready: the composite.

We have GAP and ILD-GAP, which fold gender, age and physiology into a staging score (Ley et al., 2012; Ryerson et al., 2014). We have the Composite Physiologic Index, built explicitly to net out the confounding effect of emphysema on function (Wells et al., 2003). We have the multi-criterion definition of progressive pulmonary fibrosis used to enrol INBUILD, which combines physiological, symptomatic and radiological worsening (Flaherty et al., 2019). And we now have hierarchical composites and the win ratio, which rank patients across a priority order of outcomes rather than summing them.

These are genuine intellectual work, and we should say so. But notice what kind of object a composite is. It is an aggregation of reductions. We take several univariate measures — each already a scalar, each already stripped of its dynamics — and we combine them with fixed weights, or a fixed hierarchy, into one number or one ordering. The composite acknowledges that the disease has more than one dimension. It does not, in any deep sense, acknowledge that the disease is a system.
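To see what "a fixed hierarchy" means in practice, here is a minimal sketch of an unmatched win ratio. The outcome names, their priority order and the four patients are hypothetical. The sketch also ignores censoring and follow-up time, which any real analysis would have to handle.

```python
from itertools import product

# Hypothetical priority order, highest priority first: (field, direction).
# +1 means a larger value is better, -1 means a smaller value is better.
HIERARCHY = [
    ("days_alive", +1),
    ("exacerbations", -1),
    ("fvc_change_ml", +1),
]

def compare(treated, control, hierarchy=HIERARCHY):
    """Return +1 if the treated patient wins, -1 if they lose, 0 if tied."""
    for field, direction in hierarchy:
        diff = (treated[field] - control[field]) * direction
        if diff > 0:
            return 1
        if diff < 0:
            return -1
    return 0  # tied on every level of the hierarchy

def win_ratio(treated_arm, control_arm):
    """Wins divided by losses over every treated-control pair."""
    results = [compare(t, c) for t, c in product(treated_arm, control_arm)]
    return results.count(1) / results.count(-1)  # fails if there are no losses

# Invented patients, illustrative only.
treated = [
    {"days_alive": 365, "exacerbations": 0, "fvc_change_ml": -80},
    {"days_alive": 365, "exacerbations": 1, "fvc_change_ml": -150},
]
control = [
    {"days_alive": 365, "exacerbations": 0, "fvc_change_ml": -200},
    {"days_alive": 300, "exacerbations": 2, "fvc_change_ml": -250},
]
print(win_ratio(treated, control))
```

Each pair is decided by the first outcome on which the two patients differ. Lower-priority outcomes are only consulted on a tie, and nothing in the comparison knows how the outcomes influence one another.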

Three things a composite still cannot do:

It cannot represent interaction. A weighted sum treats dyspnoea, walk distance and FVC as additive contributions. But in the patient, deconditioning drives breathlessness drives inactivity drives deconditioning. The clinically decisive fact is the loop, and a loop is precisely what a sum erases.
It cannot represent dynamics. A composite is computed at a visit. It carries no information about the rate of change of the rate of change, the rising variance, the loss of regulatory tone — the signatures by which complex systems announce that they are about to tip.
It cannot represent heterogeneity of structure. The same composite score can sit over two patients whose diseases are organised completely differently: one driven by a vascular hub, one by an inflammatory one. Pooled in a trial, a response in one group is diluted by the absence of response in the other, and the drug looks inert.

The composite is real progress and a small nod at once. It is multidimensional measurement. It is not yet complexity-aware measurement.

The distinction matters because it tells us the work is not finished — and points at what the next instruments have to be able to do.

§ 03

The toolkit complexity science already owns

Complexity science is not a metaphor we are importing for colour. It is a set of mathematics — dynamical systems, networks, statistical mechanics, information theory — developed precisely to characterise systems with feedback, heterogeneity, and emergent behaviour. ILD is such a system. The tools below already exist, are validated in adjacent fields, and map onto specific, named problems in our discipline. We take them in turn, with the deployment spelled out.

01 · Critical transitions & early-warning signals

The single highest-value target in ILD is the acute exacerbation: sudden, often fatal, and at present essentially unpredictable on the clinical timescale that would let us act. Dynamical systems theory offers a reason for hope. As a system approaches a tipping point — a bifurcation — it exhibits critical slowing down: it recovers more sluggishly from small perturbations, and that shows up as rising temporal autocorrelation, rising variance, and rising cross-correlation between coupled signals before the transition itself (Scheffer et al., 2009; Scheffer et al., 2012). These early-warning signals have been demonstrated in ecosystems, climate, and physiological collapse.

Deployment in ILD — forecasting the acute exacerbation

Home spirometry, wearable oximetry, actigraphy and acoustic cough monitoring already generate the high-frequency longitudinal data in which early-warning statistics live. Instead of asking only "has FVC dropped 10%?", we compute the rolling variance and lag-1 autocorrelation of daily home FVC, nocturnal SpO₂ and cough count, and watch for the rise. The deliverable is not a better cross-sectional score. It is a time-to-event alarm — a personalised forecast that this patient's system is losing resilience and an exacerbation is becoming probable, issued in the window where steroids, referral, or trial rescue could change the outcome.
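The two statistics named above are simple to compute. The sketch below uses a trailing window and a simulated first-order autoregressive series as a stand-in for daily home measurements. The window length, the simulation and the function names are our assumptions. A real pipeline would also detrend the signal and test the trend in each indicator, both of which are left out here.

```python
import random
from statistics import mean, pvariance

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a sequence."""
    m = mean(x)
    den = sum((v - m) ** 2 for v in x)
    if den == 0:
        return 0.0
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(len(x) - 1))
    return num / den

def rolling_ews(series, window):
    """(variance, lag-1 autocorrelation) for each full trailing window."""
    out = []
    for end in range(window, len(series) + 1):
        w = series[end - window:end]
        out.append((pvariance(w), lag1_autocorr(w)))
    return out

def simulate_ar1(phi, n, seed):
    """Synthetic AR(1) series; phi near 1 mimics slow recovery from perturbation."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

# Synthetic signals only: a quickly recovering system and a slowly recovering one.
resilient = simulate_ar1(phi=0.1, n=2000, seed=1)
slowing = simulate_ar1(phi=0.9, n=2000, seed=1)
indicators = rolling_ews(slowing, window=30)
```

In the simulation, the slowly recovering series shows both higher variance and higher autocorrelation. That is the pattern an alarm would watch for in a patient's own stream.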

02 · Network analysis — symptoms & multimorbidity as a wired structure

Borrowing from psychometric network theory (Borsboom & Cramer, 2013) and network medicine (Barabási et al., 2011), we stop treating symptoms and comorbidities as a checklist and start treating them as nodes in a graph, with edges estimated from their conditional dependencies. Dyspnoea, cough, fatigue, anxiety, depression, deconditioning, GORD, pulmonary hypertension, coronary disease — these do not co-occur by accident. They are coupled.

Deployment in ILD — find the hub, not the sum

Estimate the patient- or subgroup-level symptom-and-comorbidity network from longitudinal PRO and clinical data, and compute centrality. The actionable output is the identification of hub and bridge nodes — the symptom or comorbidity through which the most activation flows. If anxiety is the bridge that couples breathlessness to inactivity, the highest-leverage intervention is the anxiety node, even though it never appears on a respiratory endpoint. A composite sums these nodes; a network tells us which one to push. The same method, run on the referral pathway rather than the patient, exposes the structural bottlenecks behind the diagnostic delay we keep describing and rarely re-engineer.
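Once a network has been estimated, finding the hub and the bridge is a matter of summing edge weights. The sketch below assumes the estimation step is already done. The edge weights and the grouping of nodes into communities are invented for illustration and are not findings. It computes strength centrality and a simple bridge strength.

```python
# Hypothetical edge weights (absolute conditional associations), illustrative only.
EDGES = {
    ("dyspnoea", "anxiety"): 0.35,
    ("anxiety", "inactivity"): 0.30,
    ("inactivity", "deconditioning"): 0.40,
    ("deconditioning", "dyspnoea"): 0.25,
    ("cough", "GORD"): 0.20,
    ("cough", "dyspnoea"): 0.15,
    ("fatigue", "deconditioning"): 0.20,
}

# Hypothetical communities, used only to decide which edges count as bridges.
COMMUNITY = {
    "dyspnoea": "respiratory", "cough": "respiratory",
    "anxiety": "psychological",
    "inactivity": "functional", "deconditioning": "functional",
    "fatigue": "functional",
    "GORD": "comorbidity",
}

def strength(edges):
    """Sum of absolute edge weights attached to each node."""
    s = {}
    for (a, b), w in edges.items():
        s[a] = s.get(a, 0.0) + abs(w)
        s[b] = s.get(b, 0.0) + abs(w)
    return s

def bridge_strength(edges, community):
    """Sum of absolute weights on edges that leave a node's own community."""
    s = {node: 0.0 for node in community}
    for (a, b), w in edges.items():
        if community[a] != community[b]:
            s[a] += abs(w)
            s[b] += abs(w)
    return s

node_strength = strength(EDGES)
node_bridge = bridge_strength(EDGES, COMMUNITY)
hub = max(node_strength, key=node_strength.get)
bridge = max(node_bridge, key=node_bridge.get)
print(hub, bridge)
```

In this made-up graph, the node with the greatest overall strength and the node that does the most to connect communities are different nodes. A summed score cannot make that distinction.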

03 · Joint longitudinal–survival & latent trajectory models

FVC decline is not linear and not homogeneous. The population contains distinct trajectory shapes — slow, rapid, stepwise, relapsing — and the slope so far is informative about the hazard ahead. Latent class mixed models discover the trajectory subgroups from the data rather than imposing them. Joint models link the longitudinal biomarker process to the time-to-event process, so the entire shape of a patient's FVC and DLCO history — not just the latest value — drives a continuously updated survival prediction (Rizopoulos, 2012).

Deployment in ILD — prognosis that updates

Replace static GAP-at-baseline with dynamic prediction: at each visit, the model ingests the full longitudinal record and returns an updated, individualised survival and progression forecast with honest uncertainty bands. In trials, the same machinery enriches enrolment — selecting the rapid-trajectory latent class for whom an antifibrotic effect is detectable in 52 weeks — and recovers power that a single-slope analysis throws away. Prognosis as a moving estimate, not a one-time stamp.
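For readers who want the structure behind "dynamic prediction", a standard shared-parameter joint model can be written as follows. This is the generic textbook form, given as a sketch rather than a specification for ILD. Here $y_i(t)$ is the observed FVC of patient $i$, $m_i(t)$ the underlying error-free trajectory, $b_i$ the patient-specific random effects, $w_i$ baseline covariates, $\alpha$ the strength of association between trajectory and hazard, and $\mathcal{Y}_i(t)$ the measurements recorded up to time $t$.

```latex
\begin{aligned}
y_i(t) &= m_i(t) + \varepsilon_i(t), \qquad \varepsilon_i(t) \sim \mathcal{N}(0, \sigma^2) \\
m_i(t) &= x_i(t)^\top \beta + z_i(t)^\top b_i, \qquad b_i \sim \mathcal{N}(0, D) \\
h_i(t) &= h_0(t)\, \exp\{\gamma^\top w_i + \alpha\, m_i(t)\} \\
\pi_i(u \mid t) &= \Pr\big(T_i^* \ge u \mid T_i^* > t,\ \mathcal{Y}_i(t)\big), \qquad u > t
\end{aligned}
```

The last line is the quantity that updates at each visit: the probability of surviving to time $u$, conditional on being alive at $t$ and on everything measured so far.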

04 · Computational phenotyping — endotypes, not imposed labels

Our diagnostic taxonomy is historical, not mechanistic. IPF-versus-not is a clinically useful line that almost certainly cuts across the biology rather than along it. Unsupervised learning on multimodal data — HRCT radiomic texture, longitudinal physiology, peripheral blood transcriptomics and proteomics, genetics — can recover data-driven endotypes: clusters of patients who share an underlying mechanism regardless of which named ILD they carry.

Deployment in ILD — randomise within mechanism

Cluster the multimodal feature space; characterise each endotype biologically; then re-stratify. The prize is trial design that randomises within a mechanistically coherent endotype rather than within a label, so a drug acting on, say, a vascular-predominant cluster is tested in the patients who can respond, not diluted across a population defined by chest-CT pattern and clinical history. This is how we stop averaging responders and non-responders into a null.
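The cluster-then-randomise sequence can be sketched end to end on toy data. Everything here is an assumption made for illustration: the two unnamed features, the eight invented patients, the choice of k-means as the clustering method, and the alternating allocation rule. A real endotyping effort would use far richer data and would have to justify both the method and the number of clusters.

```python
import random
from statistics import mean, pstdev

def standardise(rows):
    """Scale each feature to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=50):
    """Plain k-means with deterministic farthest-first initial centres."""
    centres = [rows[0]]
    while len(centres) < k:
        centres.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centres)))
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(r, centres[j])) for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centres[j] = [mean(c) for c in zip(*members)]
    return labels

def randomise_within(labels, seed=0):
    """Shuffle patients inside each cluster, then alternate arm assignment."""
    rng = random.Random(seed)
    arms = [None] * len(labels)
    for lab in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == lab]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            arms[i] = "active" if pos % 2 == 0 else "placebo"
    return arms

# Invented two-feature profiles for eight patients, illustrative only.
features = [[1.0, 8.0], [1.2, 7.5], [0.8, 8.4], [1.1, 7.9],
            [6.0, 2.0], [6.3, 2.2], [5.8, 1.7], [6.1, 2.4]]
labels = kmeans(standardise(features), k=2)
arms = randomise_within(labels)
```

Because allocation happens inside each cluster, the arms are balanced within every candidate endotype rather than only across the pooled population.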

05 · Mechanistic multi-scale models & the patient digital twin

Fibrosis is an emergent property of a feedback system — epithelial injury, fibroblast activation, ECM stiffening that itself drives further activation, a mechano-biological positive loop with the hallmarks of self-organised criticality. Multi-scale mechanistic models couple this tissue-level biology to organ-level lung mechanics. Combine such a model with a patient's own streaming data through data assimilation — Kalman filtering, particle filters — and you have a digital twin: a personalised, continuously corrected simulation of that individual's lung.
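The assimilation step is the least exotic part of the twin, and it fits in a few lines. The sketch below is a Kalman filter over a deliberately simple linear state (an FVC level and its daily rate of change) standing in for the mechanistic model. It is not a model of fibrosis. The noise variances, starting values and synthetic readings are assumptions chosen only to show the predict-and-correct cycle.

```python
def kalman_step(state, cov, z, dt=1.0, q_level=1.0, q_slope=0.01, r=2500.0):
    """One predict-update cycle for a [level, slope] state observed via its level.

    state: (level, slope); cov: (p00, p01, p11) of the symmetric 2x2 covariance.
    q_level, q_slope and r are assumed noise variances, not calibrated values.
    """
    level, slope = state
    p00, p01, p11 = cov

    # Predict: the level advances by slope * dt, the slope persists.
    level_p = level + dt * slope
    pp00 = p00 + 2.0 * dt * p01 + dt * dt * p11 + q_level
    pp01 = p01 + dt * p11
    pp11 = p11 + q_slope

    # Update: correct the prediction with the new measurement z.
    innovation = z - level_p
    s = pp00 + r
    k0, k1 = pp00 / s, pp01 / s
    new_state = (level_p + k0 * innovation, slope + k1 * innovation)
    new_cov = ((1.0 - k0) * pp00, (1.0 - k0) * pp01, pp11 - k1 * pp01)
    return new_state, new_cov

# Start with a vague belief: FVC near 3000 mL, no known trend.
state, cov = (3000.0, 0.0), (100.0 ** 2, 0.0, 10.0 ** 2)
for day in range(1, 61):
    z = 3000.0 - 5.0 * day  # synthetic, noise-free daily reading (mL)
    state, cov = kalman_step(state, cov, z)

print(round(state[0], 1), round(state[1], 2))
```

Swapping the two-line prediction step for a mechanistic simulator, and the linear update for a particle filter where the model is non-linear, is what would turn this scaffold into the twin described above.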

Deployment in ILD — forecast and simulate

The twin does two things a statistical model cannot. It forecasts the individual trajectory with a mechanistic basis, so the prediction degrades gracefully and explains itself. And it lets us run in silico interventions — simulate this patient on antifibrotic A versus B versus add-on before committing them — and supports in silico trial arms that reduce the real-world sample size needed to detect an effect. Ambitious, data-hungry, and the most genuinely systems-level instrument on this list.

06 · Information-theoretic complexity of physiological signals

A robust and counter-intuitive finding from physiology: health is complex, and disease is a loss of complexity. Healthy physiological output — heart rate, breathing, gait, daily activity — carries rich, fractal, multi-scale variability; ageing and disease flatten it (Lipsitz & Goldberger, 1992; Goldberger et al., 2002). The breath of a failing respiratory system is more regular, not less. Variability is regulatory reserve, and losing it is the signal.

Deployment in ILD — loss of complexity as a biomarker

From the same passive wearable streams, compute multiscale entropy and fractal scaling of breathing pattern, heart-rate dynamics and activity. A falling complexity index becomes a continuous, effort-independent digital biomarker of declining reserve — one that does not depend on a patient performing a maximal manoeuvre on a clinic spirometer, and that can move months before the threshold crossing we currently wait for.
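Multiscale entropy has two ingredients: coarse-graining the signal at successive scales, and computing sample entropy at each one. The sketch below implements both in plain Python on synthetic signals. The parameter choices (template length 2, tolerance 0.15 of the original standard deviation) follow common practice but are assumptions here, and the quadratic-time loop is written for clarity, not for long recordings.

```python
import math
import random
from statistics import mean, pstdev

def coarse_grain(x, scale):
    """Average consecutive, non-overlapping blocks of length `scale`."""
    n = len(x) // scale
    return [mean(x[i * scale:(i + 1) * scale]) for i in range(n)]

def sample_entropy(x, m, tol):
    """SampEn: -ln(A / B), where B and A count matching template pairs
    of length m and m + 1 (self-matches excluded), tol is an absolute tolerance."""
    n = len(x)

    def matches(length):
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if max(abs(a - b) for a, b in zip(templates[i], templates[j])) <= tol:
                    count += 1
        return count

    b, a = matches(m), matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # too few matches to estimate
    return -math.log(a / b)

def multiscale_entropy(x, scales, m=2, r_frac=0.15):
    """Sample entropy of the coarse-grained series at each scale.
    The tolerance is fixed from the original series, not re-estimated per scale."""
    tol = r_frac * pstdev(x)
    return [sample_entropy(coarse_grain(x, s), m, tol) for s in scales]

# Synthetic signals only: a perfectly regular rhythm and an irregular one.
rng = random.Random(0)
regular = [math.sin(2 * math.pi * i / 20) for i in range(300)]
irregular = [rng.gauss(0.0, 1.0) for _ in range(300)]

mse_regular = multiscale_entropy(regular, scales=[1, 2, 3])
mse_irregular = multiscale_entropy(irregular, scales=[1, 2, 3])
```

White noise is only a convenient irregular contrast for the regular rhythm. Physiological complexity shows up as entropy that stays high across scales, which is why the profile over scales is reported and not the single-scale value.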

§ 04

What changes if we take this seriously

These tools are not a wish list to be deferred. Several are deployable now on data we are already collecting, and they imply concrete change at three levels.

In the clinic

Prognosis becomes a living estimate that updates at every visit and integrates the whole record, rather than a stage assigned once at diagnosis.

Monitoring shifts from sparse maximal tests to dense passive signals, and from threshold-crossing to resilience-watching. The question moves from "has it progressed?" to "is it about to?"

In the trial

Enrolment is enriched by trajectory class and endotype, not by label. Endpoints can become genuinely systemic — a forecast of time-to-exacerbation, a network-derived measure of symptom-system burden, a complexity index — rather than a single slope.

In silico arms and dynamic prediction recover statistical power that reductionist analysis discards. In a disease this heterogeneous, that is the difference between a viable programme and an underpowered one.

And then the honest hard part: the conversation with regulators and payers. A novel endpoint has to be validated, qualified, and accepted, and that is slow. So the route is not to abandon FVC but to run these instruments alongside it — accumulating the evidence that an early-warning alarm or a complexity index predicts the outcomes that matter, until the new measure earns standing on its own. We win the argument with data, not assertion.

We should name the costs plainly, because this field has a habit of overpromising. These methods are data-hungry and demand longitudinal, multimodal collection most centres do not yet do at scale. They can overfit: a flexible model that fits the past and forecasts nothing is worse than a humble slope. They need external validation across cohorts and countries before they touch a decision.

None of this is a reason to wait. It is the specification for doing it properly.