Jio Arogya Pre-Investment Validation — Data

Jio Arogya — Data Strategy

Section 1.1: Focus on Longitudinal Data

This is not a shotgun data-acquisition business. We do not believe that all data is equally valuable. We believe that specific data, captured systematically and continuously across cohorts, can train models of biology, inform the discovery of novel molecules and therapeutics, create evidence for personalized pathologies and care pathways, and change disease trajectories forever. We also believe that the companies building such models today are powering them with relatively small data sets. Tempus, valued at ~$6B at its 2024 IPO, was built on approximately 7 million patient records, almost exclusively in oncology. Roche acquired Flatiron Health for $1.9B on the strength of 2.2 million linked, longitudinal cancer records. The most valuable and life-changing company will be the one that builds the right data platform for discovery.

Example: The premium for depth

23andMe (Shallow genetics, self-reported phenotypes): ~$60 per participant implied value.
deCODE Genetics (Deep whole-genome, rich clinical linkage): ~$2,600 per participant implied value.
The Delta: A 43x premium for depth.

That 43x gap is the entire business model. The market is already telling us that fragmented data has value, but linked longitudinal data has monopoly value.
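As a quick check of the arithmetic behind the 43x figure, the ratio can be reproduced from the two implied per-participant values quoted above. This is a minimal sketch; the variable and function names are illustrative only.

```python
# Implied per-participant values quoted above (USD, approximate).
shallow_value = 60      # 23andMe: shallow genetics, self-reported phenotypes
deep_value = 2_600      # deCODE Genetics: deep WGS, rich clinical linkage


def depth_premium(deep: float, shallow: float) -> float:
    """Ratio of implied per-participant value, deep vs. shallow data."""
    return deep / shallow


premium = depth_premium(deep_value, shallow_value)
print(f"Depth premium: roughly {premium:.0f}x")
```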

There is, of course, service-specific data to be captured and included as part of our inputs, for example imaging scans and OPD data. This can be captured through innovative low-cost partnerships and value exchanges (e.g., providing low-cost devices for free in exchange for data). However, the majority of the value comes from capturing complete, specific, pre-determined data and stitching it together. Service-level or episodic clinical data already exists with various parties today, but its monetization has been weak.

Data quality, storage, and labelling matter tremendously. If they are not in our control, the final value of what we are trying to create for discovery, life sciences, therapeutics and care is negated. For example, without consistent, documented proteomics sample handling, you cannot distinguish a real biological signal from a degradation artefact. Without a patient identifier linking a pharmacy dispensing record to a lab result, the drug exposure data is commercially worthless. These are make-or-break inputs, which is why we must have end-to-end control.
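To make the linkage point concrete, the sketch below joins pharmacy dispensing records to lab results on a shared member identifier and sets aside any record that cannot be linked. The records, the field names (`member_id`, `drug`, `test`) and the function name are hypothetical, chosen only for illustration; they are not a proposed schema.

```python
from collections import defaultdict

# Hypothetical records; field names are illustrative, not a real schema.
dispensing = [
    {"member_id": "M001", "drug": "metformin"},
    {"member_id": "M002", "drug": "empagliflozin"},
    {"member_id": None, "drug": "metformin"},  # identifier missing at source
]
labs = [
    {"member_id": "M001", "test": "HbA1c", "value": 7.4},
    {"member_id": "M002", "test": "HbA1c", "value": 6.9},
]


def link_exposure_to_labs(dispensing, labs):
    """Pair each dispensing record with the same member's lab results.

    Returns (linked, unlinked): unlinked records carry no usable
    identifier or have no matching lab result for that member.
    """
    labs_by_member = defaultdict(list)
    for lab in labs:
        labs_by_member[lab["member_id"]].append(lab)

    linked, unlinked = [], []
    for rx in dispensing:
        member = rx["member_id"]
        if member is None or member not in labs_by_member:
            unlinked.append(rx)
            continue
        for lab in labs_by_member[member]:
            linked.append({"member_id": member, "drug": rx["drug"],
                           "test": lab["test"], "value": lab["value"]})
    return linked, unlinked


linked, unlinked = link_exposure_to_labs(dispensing, labs)
print(len(linked), "linked;", len(unlinked), "unlinked")
```

In this toy data, the dispensing record without an identifier cannot contribute to any drug-exposure analysis, which is the failure mode described above.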

Our conviction is that the data that is going to be truly valuable is:

Longitudinal
Continuous (not episodic)
Biological (not just clinical)
Quality controlled

Section 1.2: Existing Longitudinal Data Assets and Cohorts

Dataset	Geography / Population	Baseline Size	Primary Focus	Biosample collection (Yes/No)
Prospective Study of 500,000 Adults in Chennai, India	Chennai adults aged 35+	500,816	Large urban adult cohort; chronic disease and mortality epidemiology	No
Indian Study of Healthy Ageing (ISHA - Barshi)	Solapur district, Maharashtra; 362 villages and 3 towns	219,888	Healthy ageing, lifestyle, obesity, chronic disease, genetic determinants	No
Vadu Health and Demographic Surveillance Site (Vadu HDSS)	22 villages near Pune	200,000	Demographic surveillance; births, deaths, migration; sub-cohorts for NCDs and cognition	Yes
Hodal Demographic Surveillance System	Hodal Block, Palwal district	199,000+	Pregnancy risk factors and infant outcomes; repeated surveillance	No
Population Registry of Lifestyle Diseases (PROLIFE)	Varkala area, Thiruvananthapuram district, Kerala	161,943	Population registry for lifestyle diseases	No
Dibrugarh Health and Demographic Surveillance System	60 villages and tea gardens in Assam	106,769	District-level surveillance and morbidity/program monitoring	No
Mumbai Cohort Study	Mumbai adults aged 35+	99,570	Mortality and chronic disease epidemiology in urban adults	No
Ballabgarh Health and Demographic Surveillance System	28 villages in northern India	90,240	Long-running rural demographic and health surveillance	No
Birbhum Health & Demographic Surveillance System	351 villages in West Bengal	59,395	Population health and healthcare utilisation surveillance	No
India Human Development Survey (IHDS)	National household survey across India	41,554 households	Human development, socioeconomic conditions, education, health	No
New Delhi Birth Cohort (NDBC)	South Delhi mothers and births	20,755 women; 8,181 babies	Birth-to-adult and intergenerational growth/cardiometabolic outcomes	Yes
AIIMS Cohort Study (ACS)	Rural and urban India; adults 50+	15,000	Stroke and cognitive decline / dementia risk	No
Vellore Birth Cohort (VBC)	North Arcot district, urban and rural	10,670 live-born babies	Maternal health, pregnancy outcomes, growth and adult follow-up	Yes
LoCARPoN	South Delhi; urban upper-middle-class adults 50+	~7,500	Ageing, stroke, and dementia incidence	No
India Rural Economic Development Survey (REDS)	Rural India across major states	4,527 households	Rural household economics and agriculture	No
Andhra Pradesh Children and Parents Study (APCAPS)	29 rural and peri-urban villages in Telangana	2,601 children; 1,815 families	Intergenerational nutrition and cardiometabolic health	Yes
Pune Maternal Nutrition Study (PMNS)	Six rural villages in Pune district, Maharashtra	2,466 women; 770 children	Maternal nutrition, fetal growth, diabetes/CVD susceptibility	Yes
10/66 Dementia Research Group Population-Based Cohort Study – India	South Chennai adults aged 65+	2,004	Ageing, dementia, mortality follow-up	No
Young Lives India	Andhra Pradesh and Telangana children	2,011 younger cohort; 1,008 older cohort	Childhood poverty, development, education	No
LIFE – Longitudinal Indian Family hEalth Pilot Study	Rural and peri-urban Hyderabad/Telangana	1,227	Maternal and infant health; environmental and socioeconomic determinants	Yes
1934–66 Mysore Birth Records Cohort	Mysore birth records-linked cohort	1,069	Birth size and later cardiometabolic, lung, and ageing outcomes	Yes
Mysore Parthenon Cohort	Mysore antenatal clinic recruits and offspring	830 mothers; 830 fathers; 663 children	Developmental origins and parental-child longitudinal follow-up	Yes

Section 1.3: Potential Valuable Short-Medium Term Data Assets (Life Science)

(If building a cardiometabolic-focused cohort / intervention)

Cohort target mix (held constant at every scale): 60% healthy / pre-diabetic · 25% established T2D · 15% established CVD / post-cardiac-event

Scale: Y1 5K · Y2 25K · Y3 100K · Y4 250K

Strengthening notes indicate what additional activity would materially improve each asset.
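The sub-cohort sizes quoted throughout this section follow directly from the fixed 60/25/15 mix and the four scale milestones. The sketch below reproduces them; the arm names and the `arm_sizes` helper are illustrative assumptions, not part of any existing system.

```python
# Target mix in whole percent, as stated above; arm names are illustrative.
MIX_PCT = {"healthy_prediabetic": 60, "t2d": 25, "cvd": 15}

# Cohort size at each scale milestone, as stated above.
SCALE = {"Y1": 5_000, "Y2": 25_000, "Y3": 100_000, "Y4": 250_000}


def arm_sizes(total_members: int) -> dict:
    """Split a cohort of `total_members` into arms using the fixed mix."""
    return {arm: total_members * pct // 100 for arm, pct in MIX_PCT.items()}


for year, total in SCALE.items():
    print(year, arm_sizes(total))
```

Integer percentages are used so the split stays exact at every milestone (for example, 1,250 T2D and 750 CVD members at 5,000).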

1. Indian Metabolomics Reference + Dietary Annotation

Lead commercial asset at every scale threshold

Data Required:

Participants: Full disease spectrum — 60/25/15 mix at every scale

Samples: Fasting plasma + urine annually; post-prandial plasma (2hr) for T2D subset

Dietary recall: 2 × 24-hr recalls/yr via WhatsApp chatbot in 6 regional languages; IFCT 2017 validated

Medications: Full medication list at every visit — metformin, SGLT2i, GLP-1, statins; drug-metabolite confound mapping

Prerequisite: Indian dietary capture tool built before first enrollment — 3–4 months, non-negotiable

What It Means:

Untargeted plasma and urine metabolomics — 500–3,000 small molecules per sample — paired with validated Indian dietary data in regional languages. Indian dietary metabolites from dal, fermented foods, spices, and regional cuisines are absent from every existing Western reference database.

The gap: Every pharma company running metabolomics in Indian trials gets 20–40% of its signals back as unknown. We map them. At scale, our reference covers every clinically relevant Indian subpopulation, making it the only usable Indian metabolomics reference globally.

State at 18 months (5,000 Members):

5,000 members; ~1,250 T2D + ~750 CVD. First full-spectrum South Asian metabolomics baseline.

Disease comparison live: Healthy vs. T2D vs. CVD metabolite map available within 12 months — the commercial product pharma pays for immediately

Drug-metabolite map: 1,250 T2D members on known medications — controlled confound dataset unique to this program

Reference beginning: Metro Indian dietary metabolite signatures characterised for first time

What would strengthen it:

Post-prandial subset: 100–200 T2D members doing a structured Indian meal challenge — anchors diet-to-metabolite relationships scientifically; publishable standalone study

Tier 2 supplement: Adding 500 members from non-metro cities via a D2C postal kit credibly broadens the reference from ‘metro Indian’ to ‘South Asian’

State at Scale (Y2 to Y4, up to 250K Members):

Year 2 — 25,000 members

Subgroup products emerge: Metabolomics by regional diet (Gujarati / Tamil / Bengali), by drug class (metformin vs. SGLT2i vs. GLP-1), by age decade, by sex — each subgroup is a separately licensable reference product

~6,250 T2D: Metabolite response to every major diabetes drug class in South Asians — direct regulatory affairs value for pharma label extensions

Year 3 — 100,000 members

National reference: Geographic and dietary diversity now spans the full Indian dietary ecology — the ‘South Asian’ claim is scientifically defensible, not aspirational

Predictive asset: ~18,000 pre-diabetic members with 3 years of follow-up; metabolite trajectories before T2D onset identifiable with statistical power. Candidate biomarkers publishable.

Platform revenue: Model shifts from partnership fees to tiered data platform licences — multiple simultaneous pharma licensees at ₹5–15Cr each annually

Year 4 — 250,000 members

Dietary biomarker product: Validated panel assessing diet quality from a single blood draw — no self-report needed. Standalone commercial diagnostic. No global equivalent for any non-Western population.

~62,500 T2D: Real-world drug effectiveness data at a scale that rivals any global pharma RCT population — CDSCO real-world evidence submissions, post-market surveillance, label expansions

What would strengthen it at scale:

Outcomes linkage: HN hospitalisation DPA for enrolled members closes the outcome loop — converts reference into predictive asset by Y3

Value to Pharma, Biotech & Life Sciences:

Novo Nordisk / AstraZeneca: Indian trial correction dataset — ₹10–25Cr collaboration Y1; re-prices to ₹30–60Cr platform licence by Y3 as subgroup depth grows

Roche Diagnostics: South Asian cardiometabolic panel calibration — ₹4–8Cr upfront + per-test royalty; royalty stream grows with panel adoption across Indian labs

Boehringer Ingelheim: Empagliflozin South Asian drug-metabolite interaction — direct regulatory value — ₹6–15Cr

At 100K+ scale: The dataset becomes the mandatory reference for any company running metabolomics in Indian trials. Recurring annual licence revenue from 10–20 simultaneous pharma partners. Total platform revenue ₹100–300Cr/yr by Y4.

Moat: Dietary recall tool is proprietary. Disease-spectrum annotation cannot be replicated from device or lab data. 3-year head start at Y4 is structurally unassailable.

2. South Asian Plasma Proteomics Cohort

Crosses rare-variant pQTL threshold at Y2. Drug target platform by Y3.

What It Means:

Measurement of 500–7,000 proteins in blood plasma repeated annually across the full cardiometabolic spectrum, linked to outcomes over time. Proteins are the effectors of disease — closer to clinical phenotype than DNA, detectable from a blood draw.

What scale unlocks: Common variant pQTL is powered at 5,000. Rare variant pQTL — South Asian-specific variants absent from European populations — is powered at 25,000. Novel drug target discovery from ancestry-specific biology begins at Year 2, not Year 1.

Data Required:

Platform: Targeted Olink panel (500–1,000 proteins) for all members annually; deep SomaScan 7K for 2,000-person stratified subsample annually — discovery depth at manageable cost

WGS: All members at onboarding — essential for pQTL; do not defer this

Disease sub-cohorts: T2D and CVD members provide comparison arm and medication confound layer; drug-protein interaction mapping at scale is a standalone product

Cost strategy: Olink co-funding partnership — subsidised platform cost ₹5–10Cr in exchange for dataset access rights; approach before enrollment opens

State at 18 months (5,000 Members):

5,000 members: complete targeted proteomics baseline across disease spectrum

Full spectrum baseline: Healthy / pre-diabetic / T2D / CVD proteome comparison live — cross-sectional South Asian disease-state proteome; the commercial product at Y1

Common variant pQTL: Powered at 5,000 with WGS — South Asian protein-controlling genetic variants identifiable; first South Asian pQTL map in production

Reference: South Asian protein reference ranges across disease spectrum — correction dataset for Indian trial data

What would strengthen it:

Olink deal: Formalise before enrollment — cost reduction is material at this sample volume

Outcome disclosure: Be explicit with pharma partners that outcome-linked proteomics is Y3 upside; cross-sectional disease-spectrum baseline is the Y1 product

State at Scale (Y2 to Y4, up to 250K Members):

Year 2 — 25,000 members

Rare variant pQTL powered: South Asian-specific genetic variants controlling protein levels — biology invisible in European cohorts. Novel drug targets specific to South Asian ancestry begin to emerge. This is the scientific output pharma pays the most for.

~6,250 T2D: Drug-protein interaction map across diabetes drug classes in South Asians — SGLT2i, GLP-1, metformin effects on protein expression at scale; regulatory-grade pharmacodynamic dataset

Year 3 — 100,000 members

Drug target platform: South Asian pQTL + disease-state proteomics at 100,000 is the reference dataset for every pharma company developing metabolic drugs for the Indian market. The asset category shifts from ‘research dataset’ to ‘discovery infrastructure’.

Longitudinal signal: Members with 3 serial proteomic timepoints; proteins that change before T2D and CVD events identifiable from pre-disease members who convert. First validated South Asian proteomic biomarkers publishable.

Year 4 — 250,000 members

~62,500 T2D + ~37,500 CVD: Outcome-linked proteomic prediction models — which protein patterns at Year 1 predict MI, renal failure, or HbA1c >10% at Year 4 — powered for rare outcomes at this scale

Biomarker licensing: Validated South Asian cardiometabolic risk protein panel — licensable to every Indian diagnostics lab and insurer. ₹200–500Cr annual market at full Indian lab penetration.

What would strengthen at scale:

Consortium linkage: Joint pQTL analysis with GenomeIndia and UK Biobank South Asian participants multiplies rare-variant discovery power without additional enrollment cost

Value to Pharma, Biotech & Life Sciences:

Boehringer Ingelheim / AstraZeneca: South Asian pQTL co-development — ₹15–30Cr Y1; re-prices to ₹50–100Cr exclusive platform access by Y3

Olink / SomaLogic: ₹5–10Cr in platform subsidies + co-publication; approached at Y1, renegotiated at Y2 as cohort scale proves out

Eli Lilly: Tirzepatide South Asian proteomics — GIP/GLP-1 pathway in South Asian T2D vs. Western trial populations — ₹8–20Cr

At 100K+: The pQTL map and disease-state proteomics become the definitive South Asian molecular reference. Multi-year exclusive access agreements at ₹50–150Cr each. Multiple simultaneous partners.

Novel target value: A validated, novel, South Asian-specific drug target — a protein whose variant uniquely predicts T2D in South Asians — is worth ₹500Cr–₹2,000Cr in licensing or M&A value.

3. South Asian Glycaemic Reference (CGM, Full Disease Spectrum)

Pharmacovigilance grade at Y2. Regulatory endpoint platform by Y3.

Data Required:

Healthy/pre-diabetic: Annual 14-day CGM; wearable HRV and sleep integration; concurrent food log

T2D members: Quarterly 14-day CGM (4 cycles/yr); HbA1c, fasting glucose, insulin at each visit; medication records for each CGM period

CVD members: Annual CGM; cardiac medication list; captures glucose-cardiac interaction patterns

Anchor: HbA1c + HOMA-IR at every visit validates CGM metrics against clinical standards for regulatory use

What It Means:

Continuous glucose profiles — every 15 minutes, 14 days — across the full South Asian cardiometabolic spectrum, with full clinical phenotyping and dietary annotation. Non-diabetic reference, pre-diabetic trajectory, and established T2D management patterns in one cohort under one protocol.

Why clinical annotation is the differentiator: Abbott and Dexcom have volume. They do not have clinically phenotyped, research-consented, medication-annotated, dietary-linked CGM data across the full disease spectrum. That annotation layer — which only a care program can generate — is what creates the regulatory-grade research asset.

State at 18 months (5,000 Members):

5,000 members: ~1,250 T2D on quarterly CGM; ~750 CVD on annual CGM; ~3,000 healthy/pre-diabetic on annual CGM

T2D quarterly data: ~1,250 T2D members × 4 CGM cycles = ~5,000 CGM wear-periods in Year 1 from established South Asian T2D. Does not exist anywhere.

Non-diabetic reference: South Asian time-in-range norms, post-prandial glucose responses to Indian meals, glycaemic variability by metabolic risk category

Device licensing: Clinically annotated disease-spectrum CGM data — materially more valuable to Abbott/Dexcom than raw device uploads; licensable within Year 1

What would strengthen it:

Meal challenge subset: 50–100 T2D members eating a weighed standardised Indian meal while on CGM — Indian food glycaemic index in T2D physiology; publishable standalone study

ECG addition: Annual 2-week ECG patch for CVD members is viable at 750 members — adds HRV and arrhythmia layer; revisit at month 9

State at Scale (Y2 to Y4, up to 250K Members):

Year 2 — 25,000 members

Pharmacovigilance grade: ~6,250 T2D members on quarterly CGM; real-world glucose outcomes across diabetes drug classes at scale. CDSCO and FDA accept real-world CGM evidence for label extensions and post-market surveillance at this population size.

Drug class comparison: CGM-derived time-in-range and variability across metformin / SGLT2i / GLP-1 / insulin regimens in South Asian T2D — the real-world effectiveness comparison pharma cannot generate in a controlled trial

Year 3 — 100,000 members

Endpoint qualification: ~60,000 healthy/pre-diabetic with 3 years of annual CGM; incident T2D conversions with pre-conversion CGM trajectory identifiable. CGM-derived glycaemic variability qualifiable as a South Asian regulatory surrogate endpoint — reduces T2D prevention trial length by 2–3 years.

~25,000 T2D: Complication signal detectable — glycaemic patterns preceding eGFR decline, retinopathy progression, and cardiovascular events in South Asian T2D. First outcome-linked CGM findings publishable.

Year 4 — 250,000 members

Digital health infrastructure: ~62,500 T2D members on quarterly monitoring = ~250,000 CGM wear-periods annually. Large enough to train and validate South Asian glucose prediction and complication risk ML models. Asset transitions from dataset to intellectual property.

Insurer value: Real-world glycaemic control data across 62,500 T2D members enables outcome-linked insurance pricing for the Indian diabetes population — a completely new actuarial product

What would strengthen at scale:

Outcome linkage: HN hospitalisation DPA ties CGM patterns to hard events — converts to regulatory endpoint qualification dataset by Y3

Endpoint qualification target: Regulatory-grade surrogate endpoint qualification requires 15,000–20,000 in most frameworks — crossed at Y3; prioritise regulatory submission process from Y2

Value to Pharma, Biotech & Life Sciences:

Abbott / Dexcom: Clinically annotated South Asian disease-spectrum CGM licence — ₹4–10Cr Y1; renegotiates to annual platform licence at Y2 as quarterly T2D data volume grows

Novo Nordisk: Semaglutide and oral GLP-1 CGM response in South Asian T2D — companion to metabolomics partnership — ₹5–15Cr

Drug trial endpoints: Validated South Asian CGM surrogate endpoint reduces T2D prevention trial length by 2–3 years — ₹100–400Cr in saved trial costs per program; highest-value long-term use of this asset

At 100K+: Real-world evidence platform for CDSCO and FDA submissions. Post-market surveillance licensing. Indian insurer actuarial product. Multiple simultaneous revenue streams.

Digital therapeutics: Fitterfly, BeatO, Ultrahuman — South Asian T2D CGM reference for CDSCO clinical validation — ₹1–3Cr each; fast domestic closes in Y1

4. South Asian Pre-Disease Trajectory Cohort (Multi-omic)

First findings at Y2. Most valuable asset globally by Y4.

What It Means:

Longitudinal multi-omic study following South Asians before disease onset until a fraction develops T2D or CVD — identifying the molecular changes that precede disease by 2–5 years. The established T2D and CVD members in the same cohort are the biological destination, making the full journey visible in one dataset from day one.

Why scale is transformative: At 5,000, this is a well-designed pilot. At 25,000, it generates first findings. At 100,000, it is the largest prospective South Asian multi-omic cohort ever assembled — larger than any Asian biobank at equivalent molecular depth. At 250,000, it is a national health infrastructure asset.

Data Required:

Pre-disease arm: ~60% of cohort (healthy/pre-diabetic, high-risk enriched; HbA1c 5.5–6.4%, family history, central obesity); WGS once at onboarding; annual proteomics, metabolomics, CGM

Disease endpoint arm: ~25% T2D + ~15% CVD — biological destination; same omic protocol as pre-disease arm; medication records; incident events tracked

Serial design: Same participant, same protocol, every year — the trajectory is the asset

Conversion tracking: Annual HbA1c flags members who progress from pre-diabetic to T2D — incident conversions with complete pre-disease omic history are the scientific prize

State at 18 months (5,000 Members):

5,000 members: ~3,000 pre-disease + ~1,250 T2D + ~750 CVD. Complete multi-omic baseline across full spectrum.

Destination in cohort from day 1: The full biological journey — pre-disease to established disease — is present within the cohort without waiting for conversions. Cross-sectional trajectory comparison is live.

Commercial position: Sold as co-design access — pharma pays to nominate questions and get preferential findings access. Standard early-cohort model globally.

No spontaneous conversions yet: 18 months is too short. Honest disclosure — conversion data begins at Y2. The Y1 story is architecture, scale, and the disease-endpoint arm.

What would strengthen it:

High-risk enrichment: Targeting HbA1c 6.0–6.4% specifically accelerates conversion events — more signal per participant per year

Intervention sub-study: Nested 500-person RCT randomising high-risk members to lifestyle or metformin — prevention evidence; multiplies commercial value to prevention-focused pharma

State at Scale (Y2 to Y4, up to 250K Members):

Year 2 — 25,000 members

First conversion wave: ~5% annual T2D conversion from ~15,000 pre-diabetic members = ~750 incident T2D cases with complete pre-disease omic histories. First longitudinal molecular signal available. Candidate pre-disease biomarkers identifiable and publishable.

Rare variant pQTL: 25,000 with WGS crosses the threshold for South Asian rare-variant analysis — novel pre-disease genetic architecture emerging; drug target identification begins
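The first-conversion-wave estimate for Year 2 is a simple product of the at-risk pool and the assumed annual conversion rate. The sketch below reproduces it under the planning assumptions stated in the text (60% pre-disease arm, ~5% annual conversion); it ignores attrition and enrolment timing, and the function name is illustrative.

```python
def expected_incident_cases(at_risk: int, annual_rate_pct: float) -> int:
    """Expected new cases in one year for a given at-risk pool."""
    return round(at_risk * annual_rate_pct / 100)


# Pre-disease arm at the Y2 scale: 60% of 25,000 members.
pre_disease_members = 25_000 * 60 // 100

# ~5% annual conversion is the planning assumption used in the text.
first_wave = expected_incident_cases(pre_disease_members, 5)
print(pre_disease_members, "at risk ->", first_wave, "expected incident T2D cases")
```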

Year 3 — 100,000 members

Largest Asian multi-omic cohort: ~60,000 pre-disease members with 3 annual timepoints; ~3,000+ cumulative T2D conversions with pre-disease histories. More scientifically powerful than UK Biobank’s South Asian sub-cohort by any molecular metric.

Biomarker validation: Candidate pre-disease protein and metabolite signatures validated across independent conversion waves. First South Asian pre-disease biomarker ready for diagnostic development licensing.

Category shift: Asset transitions from research cohort to national reference platform. ICMR and government engagement likely. Data governance structure review required.

Year 4 — 250,000 members

~150,000 pre-disease: 5,000–10,000 cumulative T2D conversions; 3,000–5,000 CVD events. Powered for rare outcome biomarker discovery — which pre-disease signals predict uncommon but devastating outcomes like sudden cardiac death and ESRD in South Asians.

The prize: A validated South Asian pre-disease biomarker — protein or metabolite predicting T2D 3 years before diagnosis — ready for licensing as a companion diagnostic alongside a prevention drug program. This is the ₹500Cr–₹2,000Cr licensing event.

What would strengthen at scale:

International consortium: Joint analysis with Singapore Indian Cohort, MASALA study, UK Biobank South Asian arm — multiplies statistical power for rare variants and rare outcomes without additional Indian enrollment

Intervention sub-study at Y2: A funded nested prevention RCT from Y2 converts the observational cohort into an interventional platform — the highest-value configuration for prevention pharma

Value to Pharma, Biotech & Life Sciences:

Novo Nordisk: Prevention franchise — semaglutide/oral GLP-1 for high-risk South Asians; South Asian pre-disease trajectory prerequisite for CDSCO prevention indication — ₹12–25Cr co-design Y1; re-prices to ₹50–120Cr exclusive platform by Y3

Pfizer / Lilly: Pre-disease drug target ID — naturally structured as rider on proteomics partnership — ₹5–15Cr additional

At Y2 — first findings: Candidate biomarker announcement drives immediate partnership re-pricing. Companies not yet in the partnership pay a premium to access findings. Entry cost rises at each publication.

The long-game value: A validated South Asian pre-disease biomarker is a ₹500Cr–₹2,000Cr licensing event. A 250,000-person prospective cohort is a permanent national health infrastructure asset that generates revenue indefinitely.

M&A value: At 250,000 longitudinally enrolled, multi-omic, research-consented South Asians with outcomes data, the data entity itself becomes an acquisition target for a global genomics or pharma infrastructure company.

Y1 commercial target: Metabolomics collaboration ₹10–25Cr (Asset 1) + proteomics co-development ₹15–30Cr (Asset 2) + CGM device licence ₹4–10Cr (Asset 3).

Y2 inflection: Rare-variant pQTL powered (Asset 2), first conversion wave (Asset 4), pharmacovigilance-grade CGM (Asset 3). All three re-price existing partnerships upward and open new partner categories.

Non-negotiable: Indian dietary capture tool built before first enrollment. Disease-state proportion (25% T2D / 15% CVD) defended at every scale milestone through physician referral channels.

Section 1.4: Data Validation and Assessment

Coding Standards

Domain	Coding Standard	Exchange Format	Terminology	Regulatory Mapping	Delivery Format
Clinical labs	LOINC 2.77	HL7 FHIR R4	LOINC + SNOMED CT	CDISC SDTM LB domain	FHIR Bundle / SDTM XPT
Medications	ATC (WHO)	HL7 FHIR R4 MedicationDispense	RxNorm + ATC	CDISC SDTM CM domain	FHIR Bundle / SDTM XPT
Diagnoses	ICD-10-CM	HL7 FHIR R4 Condition	ICD-10 + SNOMED CT	CDISC SDTM MH domain	FHIR Bundle / SDTM XPT
Genomics (WGS)	GRCh38 + GATK	gVCF / VCF 4.3	HGVS nomenclature	N/A (exploratory)	gVCF per sample + joint VCF
Proteomics	Olink NPX / SomaScan RFU	Structured matrix (CSV/Parquet)	UniProt protein IDs	N/A (exploratory)	NPX/RFU matrix + metadata
Metabolomics	HMDB / KEGG / METLIN	mzML + feature table	HMDB identifiers	N/A (exploratory)	Feature table (CSV/Parquet)
CGM	AGP standard metrics	CSV + structured JSON	TIR/TBR/CV%/MAGE	FDA CGM guidance	AGP report + raw glucose CSV
Dietary	IFCT 2017 food codes	Structured JSON	IFCT 2017 + FoodEx2	N/A	Macronutrient matrix per meal
Imaging	DICOM 3.0	DICOM + structured report	RadLex	CDISC SDTM FA domain	DICOM + structured findings
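For the CGM row, the sketch below shows how the listed summary metrics (TIR, TBR, CV%) could be derived from a raw glucose series. The 70–180 mg/dL range is the commonly used target range and is an assumption here, since the table does not fix thresholds; MAGE is omitted because it needs excursion detection. The function name and the sample readings are hypothetical.

```python
from statistics import mean, pstdev


def cgm_summary(readings_mg_dl, low=70.0, high=180.0):
    """Summarise one CGM wear period with AGP-style metrics.

    `low`/`high` default to the commonly used 70-180 mg/dL range;
    treat them as assumptions, not thresholds set by this document.
    """
    if not readings_mg_dl:
        raise ValueError("no glucose readings supplied")
    n = len(readings_mg_dl)
    avg = mean(readings_mg_dl)
    return {
        "tir_pct": 100 * sum(low <= g <= high for g in readings_mg_dl) / n,
        "tbr_pct": 100 * sum(g < low for g in readings_mg_dl) / n,
        "tar_pct": 100 * sum(g > high for g in readings_mg_dl) / n,
        "mean_mg_dl": avg,
        # Coefficient of variation: SD relative to the mean glucose.
        "cv_pct": 100 * pstdev(readings_mg_dl) / avg,
    }


sample = [60, 100, 120, 150, 200]  # hypothetical readings in mg/dL
summary = cgm_summary(sample)
print(summary)
```

A real pipeline would run this per wear period on the raw glucose CSV and store the result alongside the AGP report.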

Quality Assurance Architecture

Quality Dimension	Definition	Measurement	Target	Escalation
Completeness	% of expected data points captured	Monthly audit per domain	>85% across all domains	Domain lead review; participant outreach
Accuracy	Agreement between captured and true value	Duplicate sample validation; phantom calibration	<5% discordance rate	Recollection or reprocessing
Timeliness	Lag between collection and data availability	SLA tracking per vendor	<7 days for labs; <30 days for omics	Vendor escalation protocol
Consistency	Cross-domain data coherence	Cross-validation rules (e.g., HbA1c vs. CGM TIR correlation)	No unexplained major discordances	Clinical data scientist review
Provenance	Full audit trail from collection to analysis	Metadata tags: source, timestamp, operator, QC status, consent tier	100% of records traceable	System architecture requirement
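The provenance requirement can be pictured as a small metadata record attached to every data point, carrying the five tags named in the table. The sketch below is one possible shape, not a specification: the class name, the example values and the rule that a record is traceable only when all five tags are present are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceTag:
    """The five metadata tags listed in the Provenance row."""
    source: Optional[str]
    timestamp: Optional[datetime]
    operator: Optional[str]
    qc_status: Optional[str]
    consent_tier: Optional[str]

    def is_traceable(self) -> bool:
        """Assumed rule: traceable only if no tag is missing."""
        fields = (self.source, self.timestamp, self.operator,
                  self.qc_status, self.consent_tier)
        return all(value is not None for value in fields)


def traceability_pct(tags) -> float:
    """Share of records with a complete audit trail, in percent."""
    if not tags:
        return 0.0
    return 100 * sum(tag.is_traceable() for tag in tags) / len(tags)


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
tags = [
    ProvenanceTag("lab_vendor_a", now, "tech_017", "passed", "research"),
    ProvenanceTag("lab_vendor_a", now, None, "passed", "research"),  # operator missing
]
print(f"{traceability_pct(tags):.0f}% traceable")
```

Against the 100% target, the second record in this toy example would be flagged for remediation.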

Data Completeness Targets

Domain	Target Completeness	Minimum Acceptable	Measurement Method	Escalation if Below Minimum
Clinical biomarkers	>98%	>95%	% of scheduled draws completed	Care coordinator outreach within 7 days
CGM wear compliance	>90%	>80%	% of scheduled wear-periods with ≥10 days data	Device troubleshooting + replacement protocol
Medication dispensing (NetMeds)	>85%	>75%	% of members with active NetMeds dispensing records	Incentive program (NetMeds discount for members)
Omics (WGS/proteomics/metabolomics)	>95%	>90%	% of baseline samples successfully processed by Strand	Reprocessing protocol; recollection if needed
Dietary recall	>80%	>70%	% of scheduled 24-hr recalls completed	WhatsApp chatbot follow-up; dietitian outreach
Wearable data	>75%	>65%	% of enrolled members with active device pairing	Onboarding protocol check; device pairing support
Follow-up retention (12-month)	>75%	>65%	% of enrolled members completing 12-month visit	Retention program: WhatsApp, reports, travel reimbursement
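The table above maps naturally onto a monthly check that compares observed completeness with the target and the minimum for each domain. The sketch below is a hedged illustration: the domain keys, the status labels and both function names are hypothetical, while the thresholds and the "at least 10 days of data" rule for CGM wear are copied from the table.

```python
# (target, minimum) in percent, copied from the table; keys are illustrative.
THRESHOLDS = {
    "clinical_biomarkers": (98, 95),
    "cgm_wear": (90, 80),
    "medication_dispensing": (85, 75),
    "omics": (95, 90),
    "dietary_recall": (80, 70),
    "wearable": (75, 65),
    "retention_12m": (75, 65),
}


def completeness_status(domain: str, observed_pct: float) -> str:
    """Classify observed completeness against target and minimum."""
    target, minimum = THRESHOLDS[domain]
    if observed_pct > target:
        return "on_target"
    if observed_pct > minimum:
        return "acceptable"
    return "escalate"  # triggers the escalation path listed for the domain


def cgm_wear_compliance(days_with_data, min_days=10) -> float:
    """% of scheduled wear-periods with at least `min_days` days of data."""
    if not days_with_data:
        return 0.0
    ok = sum(days >= min_days for days in days_with_data)
    return 100 * ok / len(days_with_data)


print(completeness_status("cgm_wear", cgm_wear_compliance([14, 9, 12, 10])))
```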

Section 1.5: Sample and Data Collection Framework

Sample/Data	Collection Method	Analysis Outputs	When
Whole Genome Sequencing (WGS)	Blood Draw (10ml EDTA Tube)	Genetic Variants For pQTL Analysis, Drug-Gene Interactions, Ancestry-Specific Rare Variants	At Onboarding (Once)
Fasting Plasma (500–1,000 μl × 4 Tubes)	Blood Draw (Fasting)	Untargeted Metabolomics (500–3,000 Small Molecules), Targeted Proteomics (Olink 500–1,000 Proteins), Baseline Protein Reference Values	Baseline, Then Annually
Fasting Urine	Urine Collection (Morning, 50ml)	Urine Metabolomics, Dietary Metabolite Fingerprinting, Medication Metabolites	Baseline, Then Annually
HbA1c + Fasting Glucose + Insulin + HOMA-IR	Blood Draw (Fasting, 5ml)	Disease Classification (Healthy/Pre-Diabetic/T2D), Insulin Resistance Quantification, Clinical Anchor For CGM Validation	Baseline, Then Cadence Varies By Disease State
Full Medication List + History	Structured Interview (Medical Record)	Drug-Metabolite Confound Mapping, Drug-Protein Interaction Baseline	Baseline + Updated Each Visit
Cardiac Medication List (CVD Members Only)	Structured Interview	Glucose-Cardiac Interaction Pattern Mapping	Baseline + Updated Annually
Clinical Phenotyping Form	Structured Questionnaire	Disease State Classification, Risk Stratification, Dietary Pattern Baseline	Baseline Only
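Two of the baseline outputs above reduce to simple calculations. HOMA-IR is commonly approximated as fasting glucose (mg/dL) × fasting insulin (µU/mL) / 405, and the conversion flag used in this plan is HbA1c crossing 6.5%. The sketch below illustrates both; the 5.7% lower bound for pre-diabetes is a widely used convention and an assumption here (the enrichment criteria elsewhere in this document use 5.5–6.4%), and the function names are hypothetical. HbA1c alone would not classify a treated T2D member, so this is a screening sketch only.

```python
def homa_ir(fasting_glucose_mg_dl: float, fasting_insulin_uU_ml: float) -> float:
    """Common HOMA-IR approximation for glucose measured in mg/dL."""
    return fasting_glucose_mg_dl * fasting_insulin_uU_ml / 405


def classify_by_hba1c(hba1c_pct: float, prediabetic_from: float = 5.7) -> str:
    """Screening-style classification from a single HbA1c value.

    6.5% is the conversion threshold used in this plan; `prediabetic_from`
    is an assumed convention and can be changed by the caller.
    """
    if hba1c_pct >= 6.5:
        return "t2d"
    if hba1c_pct >= prediabetic_from:
        return "pre_diabetic"
    return "healthy"


print(round(homa_ir(100, 10), 2), classify_by_hba1c(6.0))
```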

Ongoing Dietary Data:

Sample/Data	Collection Method	Analysis Outputs	Cadence
24-Hr Dietary Recall × 2 Per Year	WhatsApp Chatbot (6 Regional Languages)	Indian Dietary Metabolite Signatures, Regional Diet-Metabolite Mapping, Dietary Quality Assessment	2 Times Per Year (e.g., Jan + Jul)
Concurrent Food Log (During CGM Wear Only)	Photo-Logged Or Written	Linking Specific Meals To Post-Prandial Glucose Excursions, Indian Food Glycaemic Index In Participant Physiology	During Each CGM Wear Period

Section 1.6: Data Collection Variations based on Disease State

Healthy / Pre-Diabetic Members

Sample/Data	Collection Method	Analysis Outputs	Cadence	Annual Events
14-Day CGM Wear	Glucose Monitor Patch (Abbott/Dexcom) + Concurrent HRV + Sleep Wearable Data	South Asian Time-In-Range Norms, Post-Prandial Glucose Response Baseline, Glycaemic Variability By Metabolic Risk, Incident T2D Conversion Trajectory If Applicable	Annually	1 Wear Period/Year
HbA1c + HOMA-IR	Blood Draw (Fasting, 5ml)	Disease Progression Tracking, Pre-Diabetic Trajectory Quantification, Conversion Flag (HbA1c Crossing 6.5%)	Annually	1 Draw/Year
Fasting Plasma + Urine	Blood Draw (Fasting) + Urine Collection	Annual Metabolomics And Proteomics Refresh, Trajectory Biomarkers	Annually	1 Collection/Year
Targeted Olink Proteomics	From Fasting Plasma Above	Protein Reference Trajectory, pQTL Mapping, Pre-Disease Protein Signature Evolution	Annually	Included In Plasma Draw
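
Two of the CGM outputs in the table above, time-in-range and glycaemic variability, reduce to simple summaries of a wear period's readings. The sketch below assumes the commonly used 70–180 mg/dL target range and the coefficient of variation (standard deviation ÷ mean) as the variability measure; both are assumptions here, since a South Asian norm-setting exercise may well adopt different bounds or metrics.

```python
from statistics import mean, pstdev


def time_in_range(readings_mg_dl, low=70.0, high=180.0):
    """Fraction of CGM readings falling inside [low, high] mg/dL."""
    if not readings_mg_dl:
        raise ValueError("no readings supplied")
    in_range = sum(1 for g in readings_mg_dl if low <= g <= high)
    return in_range / len(readings_mg_dl)


def glucose_cv(readings_mg_dl):
    """Coefficient of variation of glucose readings (SD / mean)."""
    if not readings_mg_dl:
        raise ValueError("no readings supplied")
    return pstdev(readings_mg_dl) / mean(readings_mg_dl)


if __name__ == "__main__":
    sample = [60, 100, 150, 200]  # made-up readings, not participant data
    print(time_in_range(sample), round(glucose_cv(sample), 3))
```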

T2D Members

Sample/Data	Collection Method	Analysis Outputs	Cadence	Annual Events
14-Day CGM Wear	Glucose Monitor Patch + Medication Record For Each Period	Real-World Glucose Control By Drug Class, Drug Effectiveness Comparison (Metformin Vs. SGLT2i Vs. GLP-1 Vs. Insulin), Complication Risk Signal Detection	Quarterly (4 Cycles/Year)	4 Wear Periods/Year (e.g., Jan, Apr, Jul, Oct) + Medication Logs For Each
HbA1c + Fasting Glucose + Insulin	Blood Draw (Fasting, 5ml)	Glycaemic Control Trend, Insulin Dynamics, Diabetes Progression	Quarterly At Each Visit	4 Draws/Year (Aligned With CGM Periods)
Medication Records	Structured List (Updated At Each Visit)	Drug-Metabolite Interaction Mapping, Drug-Protein Effect Quantification	Updated Quarterly	Logged 4 Times/Year
Fasting Plasma (For Metabolomics + Proteomics)	Blood Draw (Fasting, 500–1,000 μl × 4 Tubes)	Annual Metabolomics Refresh + Drug-Metabolite Confound Dataset, Annual Proteomics Refresh	Annually	1 Draw/Year
Fasting Urine	Urine Collection (Morning, 50ml)	Annual Urine Metabolomics	Annually	1 Collection/Year
Post-Prandial Plasma (If In Subset)	Blood Draw 2 Hours After Standardised Indian Meal	Post-Prandial Metabolite Response, Diet-To-Metabolite Relationship Anchoring, Indian Food Glycaemic Index In Diabetes Physiology	If Selected For Meal Challenge Subset: 1 Structured Meal Challenge, 1 Fasting + 1 Post-Prandial Draw	—
Targeted Olink Proteomics	From Fasting Plasma Above	Drug-Protein Interaction Mapping, Annual Protein Trajectory, Pharmacodynamic Response To Medication	Annually	Included In Plasma Draw
Deep SomaScan 7K (If In Stratified Subsample Of 2,000)	From Fasting Plasma Above (Deeper Analysis Of Same Draw)	High-Resolution Protein Discovery, Rare Protein-Genetic Variant Mapping	Annually (If Selected)	Included In Plasma Draw If Selected

CVD Members

Sample/Data	Collection Method	Analysis Outputs	Cadence	Annual Events
14-Day CGM Wear	Glucose Monitor Patch	Glucose-Cardiac Interaction Patterns, Glycaemic Control In Cardiac Context, CV Event Risk Signal	Annually	1 Wear Period/Year
Annual ECG Patch (Optional Expansion)	Wearable ECG Monitor (2-Week Continuous)	HRV Integration, Arrhythmia Detection, Glucose-Arrhythmia Correlation	Annually If Implemented	1 Wear Period/Year
HbA1c + HOMA-IR	Blood Draw (Fasting, 5ml)	Glycaemic Control In CVD Context, Metabolic Progression	Annually	1 Draw/Year
Cardiac Medication List	Structured Interview	Medication-Metabolite / Medication-Protein Confounding	Updated Annually	1 Update/Year
Fasting Plasma + Urine	Blood Draw (Fasting) + Urine Collection	Annual Metabolomics And Proteomics Refresh For CVD Trajectory	Annually	1 Collection/Year
Targeted Olink Proteomics	From Fasting Plasma Above	Cardio-Metabolic Protein Signature Evolution, Outcome-Linked Proteomics	Annually	Included In Plasma Draw
HbA1c + Hospitalisation Tracking (If Outcomes Linkage Activated)	Clinical Records + Participant Report	HN Hospitalisation Outcome Linkage For CVD Event Prediction	Ongoing Tracking	Tracked Continuously If DPA In Place

Section 1.7: (Denovo-only) Annual Data Collection Summary by Disease State

Healthy / Pre-Diabetic Member

1 Annual Clinic Visit (HbA1c + HOMA-IR Blood Draw)
1 Annual Metabolomics/Proteomics Sample (Fasting Plasma + Urine)
1 Annual CGM Wear (14 Days)
2 Dietary 24-Hr Recalls (WhatsApp, Not In-Person)
Total In-Person Time: ~2 Hours (Enrollment + 1 Annual Visit + CGM Instruction)
Total Blood Draws: 2 (Baseline + Annual)
Total Urine Collections: 2 (Baseline + Annual)

T2D Member

4 Quarterly Clinic Visits (HbA1c + Glucose + Insulin)
1 Annual Metabolomics/Proteomics Sample (Fasting Plasma + Urine)
4 CGM Wear Periods (14 Days Each = ~2 Months Cumulative Data)
2 Dietary 24-Hr Recalls (WhatsApp)
4 Medication Logs (One Per Quarter, Can Be Submitted Digitally)
Total In-Person Time: ~4–5 Hours (Enrollment + 4 Quarterly Visits + CGM Resets)
Total Blood Draws: 5 (Baseline + 4 Quarterly)
Total Urine Collections: 2 (Baseline + Annual)

CVD Member

1 Annual Clinic Visit (HbA1c + HOMA-IR)
1 Annual Metabolomics/Proteomics Sample (Fasting Plasma + Urine)
1 Annual CGM Wear (14 Days)
1 Optional Annual ECG Patch (2 Weeks, If Implemented)
2 Dietary 24-Hr Recalls (WhatsApp)
Total In-Person Time: ~2 Hours (Enrollment + 1 Annual Visit + Device Resets)
Total Blood Draws: 2 (Baseline + Annual)
Total Urine Collections: 2 (Baseline + Annual)
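
The per-cohort totals above follow mechanically from the visit schedule, which can be sketched as a small data structure. The class and field names below are hypothetical; the counts are taken from the summaries in this section, and the Year 1 totals assume one baseline draw and one baseline urine collection at enrollment.

```python
from dataclasses import dataclass

CGM_DAYS_PER_WEAR = 14


@dataclass(frozen=True)
class AnnualSchedule:
    clinic_visits: int      # recurring in-person visits, each with a blood draw
    cgm_wear_periods: int   # 14-day CGM wears per year
    omics_samples: int      # fasting plasma + urine collections per year
    dietary_recalls: int    # 24-hr recalls submitted via WhatsApp


SCHEDULES = {
    "Healthy/Pre-Diabetic": AnnualSchedule(1, 1, 1, 2),
    "T2D": AnnualSchedule(4, 4, 1, 2),
    "CVD": AnnualSchedule(1, 1, 1, 2),
}


def year_one_blood_draws(s: AnnualSchedule) -> int:
    """Baseline draw plus one draw per recurring clinic visit."""
    return 1 + s.clinic_visits


def year_one_urine_collections(s: AnnualSchedule) -> int:
    """Baseline collection plus the annual omics collection(s)."""
    return 1 + s.omics_samples


def cgm_days_per_year(s: AnnualSchedule) -> int:
    return s.cgm_wear_periods * CGM_DAYS_PER_WEAR


if __name__ == "__main__":
    for cohort, s in SCHEDULES.items():
        print(cohort, year_one_blood_draws(s), cgm_days_per_year(s))
```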

Section 1.8: Implied Testing Cost today (Estimated)

Implied Testing Cost Per Member-Year (At Current Prices)

Cohort	Year 1 (Baseline)	Year 2+ (Recurring)
Healthy/Pre-Diabetic	₹62,200	₹62,200
T2D	₹65,000	₹81,000
CVD	₹62,200	₹65,000–69,000
Blended (Cohort Mix)	₹62,200	₹69,300
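
The blended row is a weighted average of the cohort rows, weighted by cohort mix. The sketch below shows the arithmetic only: the cohort mix is a placeholder (this section does not state the actual mix), and the CVD figure uses the midpoint of the ₹65,000–69,000 range, so the result should not be read as a reproduction of the blended figure in the table.

```python
def blended_cost(cost_per_cohort: dict, cohort_mix: dict) -> float:
    """Weighted average cost per member-year across cohorts."""
    if set(cost_per_cohort) != set(cohort_mix):
        raise ValueError("cost and mix must cover the same cohorts")
    if abs(sum(cohort_mix.values()) - 1.0) > 1e-9:
        raise ValueError("cohort mix must sum to 1")
    return sum(cost_per_cohort[c] * cohort_mix[c] for c in cost_per_cohort)


# Year 2+ recurring costs from the table (INR per member-year).
YEAR2_COST = {
    "Healthy/Pre-Diabetic": 62_200,
    "T2D": 81_000,
    "CVD": 67_000,  # midpoint of the 65,000-69,000 range (assumption)
}

# Hypothetical mix for illustration only; not taken from the document.
HYPOTHETICAL_MIX = {"Healthy/Pre-Diabetic": 0.6, "T2D": 0.3, "CVD": 0.1}

if __name__ == "__main__":
    print(round(blended_cost(YEAR2_COST, HYPOTHETICAL_MIX)))
```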

Note: This assumes all tests are run during the measurement year. If we instead store samples in a biobank and test only those from members who have a clinical event or a change in health state, the smaller fraction of samples tested, combined with the expected decline in sequencing costs by that time, would likely reduce our blended cost per member by 50–90%.
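
The biobank scenario in the note can be sketched as a simple two-factor model: the share of stored samples eventually tested, and the price decline by the time they are tested. This is a deliberate simplification with hypothetical parameter names. It treats the whole per-member cost as deferrable, whereas some tests (for example HbA1c or CGM) would presumably still run in-year, and it carries biobank storage only as an optional flat cost.

```python
def deferred_testing_cost(cost_today: float,
                          fraction_tested: float,
                          price_decline: float,
                          storage_cost: float = 0.0) -> float:
    """Expected cost per member if samples are banked and tested selectively."""
    if not 0.0 <= fraction_tested <= 1.0:
        raise ValueError("fraction_tested must be between 0 and 1")
    if not 0.0 <= price_decline <= 1.0:
        raise ValueError("price_decline must be between 0 and 1")
    return storage_cost + cost_today * fraction_tested * (1.0 - price_decline)


def cost_reduction(cost_today: float, new_cost: float) -> float:
    """Fractional reduction relative to testing everything today."""
    return 1.0 - new_cost / cost_today


if __name__ == "__main__":
    today = 69_300  # blended Year 2+ figure from the table above
    for tested, decline in [(0.5, 0.0), (0.2, 0.5)]:  # illustrative inputs
        new = deferred_testing_cost(today, tested, decline)
        print(tested, decline, round(cost_reduction(today, new), 2))
```

Under these illustrative inputs, testing half of the samples at today's prices gives a 50% reduction, and testing one in five at half the price gives 90%, which brackets the range quoted in the note.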