This is not a shotgun data acquisition business. It does not believe that all data is equal. It believes that specific data, captured systematically and continuously across cohorts, can train models on biology, inform discovery of novel molecules and therapeutics, create evidence for personalized pathologies and care pathways, and permanently change disease trajectories. It also believes that the companies building models today are powering them with relatively small data sets. For example, Tempus was valued at ~$6B at its 2024 IPO on approximately 7 million patient records, almost exclusively in oncology, and Flatiron Health was acquired by Roche for $1.9B on 2.2 million linked, longitudinal cancer records. The most valuable and life-changing company will be the one that builds the right data platform for discovery.
Example: The premium for depth
That 43x gap is the entire business model. The market is already telling us that fragmented data has value, but linked longitudinal data has monopoly value.
There is, of course, service-specific data to be captured and included as part of our inputs, for example imaging scans and OPD data. This can be captured through innovative low-cost partnerships (e.g., free low-cost devices in exchange for data) and value exchanges. However, the majority of the value comes from capturing complete, specific, pre-determined data and stitching it together. Service-level or episodic clinical data already exists with other parties today, but its monetization has been weak.
Data quality, storage, and labelling matter tremendously. If they are not in our control, the final value of what we are trying to create for discovery, life sciences, therapeutics, and care is negated. For example, without consistent, documented proteomics sample handling, you cannot distinguish a real biological signal from a degradation artefact. Without a patient identifier linking a pharmacy dispensing record to a lab result, the drug exposure data is commercially worthless. These are make-or-break inputs, which is why we must have end-to-end control.
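The linkage point can be made concrete with a minimal sketch. The record shapes, field names (`patient_id`, `drug`, `test`) and values below are hypothetical and chosen only for illustration; the sketch simply shows that a dispensing record without a shared identifier cannot be tied to any lab outcome.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative assumptions.
dispensing = [
    {"patient_id": "P001", "drug": "metformin", "date": "2025-01-10"},
    {"patient_id": None, "drug": "empagliflozin", "date": "2025-01-12"},  # no identifier
]
labs = [
    {"patient_id": "P001", "test": "HbA1c", "value": 7.4, "date": "2025-04-02"},
]

def link_by_patient(dispensing, labs):
    """Group dispensing records and lab results under a shared patient identifier."""
    linked = defaultdict(lambda: {"dispensing": [], "labs": []})
    unlinked = []
    for rec in dispensing:
        if rec["patient_id"] is None:
            unlinked.append(rec)  # drug exposure that cannot be tied to an outcome
        else:
            linked[rec["patient_id"]]["dispensing"].append(rec)
    for rec in labs:
        if rec["patient_id"] is not None:
            linked[rec["patient_id"]]["labs"].append(rec)
    return dict(linked), unlinked

linked, unlinked = link_by_patient(dispensing, labs)
print(f"{len(linked)} linked patient(s), {len(unlinked)} unlinkable dispensing record(s)")
```

In this toy input, the second dispensing record is the kind of data the paragraph above calls commercially worthless: the exposure is known, but it cannot be joined to a result.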
Our conviction is that the data that is going to be truly valuable is
| Dataset | Geography / Population | Baseline Size | Primary Focus | Biosample collection (Yes/No) |
|---|---|---|---|---|
| Prospective Study of 500,000 Adults in Chennai, India | Chennai adults aged 35+ | 500,816 | Large urban adult cohort; chronic disease and mortality epidemiology | No |
| Indian Study of Healthy Ageing (ISHA - Barshi) | Solapur district, Maharashtra; 362 villages and 3 towns | 219,888 | Healthy ageing, lifestyle, obesity, chronic disease, genetic determinants | No |
| Vadu Health and Demographic Surveillance Site (Vadu HDSS) | 22 villages near Pune | 200,000 | Demographic surveillance; births, deaths, migration; sub-cohorts for NCDs and cognition | Yes |
| Hodal Demographic Surveillance System | Hodal Block, Palwal district | 199,000+ | Pregnancy risk factors and infant outcomes; repeated surveillance | No |
| Population Registry of Lifestyle Diseases (PROLIFE) | Varkala area, Thiruvananthapuram district, Kerala | 161,943 | Population registry for lifestyle diseases | No |
| Dibrugarh Health and Demographic Surveillance System | 60 villages and tea gardens in Assam | 106,769 | District-level surveillance and morbidity/program monitoring | No |
| Mumbai Cohort Study | Mumbai adults aged 35+ | 99,570 | Mortality and chronic disease epidemiology in urban adults | No |
| Ballabgarh Health and Demographic Surveillance System | 28 villages in northern India | 90,240 | Long-running rural demographic and health surveillance | No |
| Birbhum Health & Demographic Surveillance System | 351 villages in West Bengal | 59,395 | Population health and healthcare utilisation surveillance | No |
| India Human Development Survey (IHDS) | National household survey across India | 41,554 households | Human development, socioeconomic conditions, education, health | No |
| New Delhi Birth Cohort (NDBC) | South Delhi mothers and births | 20,755 women; 8,181 babies | Birth-to-adult and intergenerational growth/cardiometabolic outcomes | Yes |
| AIIMS Cohort Study (ACS) | Rural and urban India; adults 50+ | 15,000 | Stroke and cognitive decline / dementia risk | No |
| Vellore Birth Cohort (VBC) | North Arcot district, urban and rural | 10,670 live-born babies | Maternal health, pregnancy outcomes, growth and adult follow-up | Yes |
| LoCARPoN | South Delhi; urban upper-middle-class adults 50+ | ~7,500 | Ageing, stroke, and dementia incidence | No |
| India Rural Economic Development Survey (REDS) | Rural India across major states | 4,527 households | Rural household economics and agriculture | No |
| Andhra Pradesh Children and Parents Study (APCAPS) | 29 rural and peri-urban villages in Telangana | 2,601 children; 1,815 families | Intergenerational nutrition and cardiometabolic health | Yes |
| Pune Maternal Nutrition Study (PMNS) | Six rural villages in Pune district, Maharashtra | 2,466 women; 770 children | Maternal nutrition, fetal growth, diabetes/CVD susceptibility | Yes |
| 10/66 Dementia Research Group Population-Based Cohort Study – India | South Chennai adults aged 65+ | 2,004 | Ageing, dementia, mortality follow-up | No |
| Young Lives India | Andhra Pradesh and Telangana children | 2,011 younger cohort; 1,008 older cohort | Childhood poverty, development, education | No |
| LIFE – Longitudinal Indian Family hEalth Pilot Study | Rural and peri-urban Hyderabad/Telangana | 1,227 | Maternal and infant health; environmental and socioeconomic determinants | Yes |
| 1934–66 Mysore Birth Records Cohort | Mysore birth records-linked cohort | 1,069 | Birth size and later cardiometabolic, lung, and ageing outcomes | Yes |
| Mysore Parthenon Cohort | Mysore antenatal clinic recruits and offspring | 830 mothers; 830 fathers; 663 children | Developmental origins and parental-child longitudinal follow-up | Yes |
(*If building a cardiometabolic focused cohort / intervention)
Cohort target mix (held constant at every scale): 60% healthy / pre-diabetic · 25% established T2D · 15% established CVD / post-cardiac-event
Scale: Y1 5K · Y2 25K · Y3 100K · Y4 250K
Strengthening notes indicate what additional activity would materially improve each asset.
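The sub-cohort sizes quoted throughout the asset descriptions follow directly from the fixed 60/25/15 mix applied at each scale milestone. A minimal sketch of that arithmetic, using only the percentages and member counts stated above (the dictionary keys are illustrative names):

```python
# Target mix and scale milestones as stated in the plan.
MIX = {"healthy_or_prediabetic": 0.60, "t2d": 0.25, "cvd": 0.15}
SCALE = {"Y1": 5_000, "Y2": 25_000, "Y3": 100_000, "Y4": 250_000}

def subcohort_sizes(total_members: int) -> dict:
    """Split a member count by the fixed 60/25/15 target mix."""
    return {group: round(total_members * share) for group, share in MIX.items()}

sizes_by_year = {year: subcohort_sizes(n) for year, n in SCALE.items()}
for year, sizes in sizes_by_year.items():
    print(year, sizes)
```

These are planning targets, not enrolment forecasts; actual proportions depend on recruitment channels holding the mix.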
Asset 1 (Metabolomics): Lead commercial asset at every scale threshold
Data Required:
Participants: Full disease spectrum — 60/25/15 mix at every scale
Samples: Fasting plasma + urine annually; post-prandial plasma (2hr) for T2D subset
Dietary recall: 2 × 24-hr recalls/yr via WhatsApp chatbot in 6 regional languages; IFCT 2017 validated
Medications: Full medication list at every visit — metformin, SGLT2i, GLP-1, statins; drug-metabolite confound mapping
Prerequisite: Indian dietary capture tool built before first enrollment — 3–4 months, non-negotiable
What It Means:
Untargeted plasma and urine metabolomics — 500–3,000 small molecules per sample — paired with validated Indian dietary data in regional languages. Indian dietary metabolites from dal, fermented foods, spices, and regional cuisines are absent from every existing Western reference database.
The gap: Every pharma company running metabolomics in Indian trials gets 20–40% of their signals back as unknown. You map them. At scale, your reference covers every clinically relevant Indian subpopulation — making it the only usable Indian metabolomics reference globally.
State at 18 months (5,000 Members):
5,000 members; ~1,250 T2D + ~750 CVD. First full-spectrum South Asian metabolomics baseline.
Disease comparison live: Healthy vs. T2D vs. CVD metabolite map available within 12 months — the commercial product pharma pays for immediately
Drug-metabolite map: 1,250 T2D members on known medications — controlled confound dataset unique to this program
Reference beginning: Metro Indian dietary metabolite signatures characterised for first time
What would strengthen it:
Post-prandial subset: 100–200 T2D members doing a structured Indian meal challenge — anchors diet-to-metabolite relationships scientifically; publishable standalone study
Tier 2 supplement: 500 members from non-metro cities via D2C postal kit converts ‘metro Indian’ to ‘South Asian’ reference credibly
State at Y2–Y4 (scaling to 250K Members):
Year 2 — 25,000 members
Subgroup products emerge: Metabolomics by regional diet (Gujarati / Tamil / Bengali), by drug class (metformin vs. SGLT2i vs. GLP-1), by age decade, by sex — each subgroup is a separately licensable reference product
~6,250 T2D: Metabolite response to every major diabetes drug class in South Asians — direct regulatory affairs value for pharma label extensions
Year 3 — 100,000 members
National reference: Geographic and dietary diversity now spans the full Indian dietary ecology — the ‘South Asian’ claim is scientifically defensible, not aspirational
Predictive asset: ~18,000 pre-diabetic members with 3 years of follow-up; metabolite trajectories before T2D onset identifiable with statistical power. Candidate biomarkers publishable.
Platform revenue: Model shifts from partnership fees to tiered data platform licences — multiple simultaneous pharma licensees at ₹5–15Cr each annually
Year 4 — 250,000 members
Dietary biomarker product: Validated panel assessing diet quality from a single blood draw — no self-report needed. Standalone commercial diagnostic. No global equivalent for any non-Western population.
~62,500 T2D: Real-world drug effectiveness data at a scale that rivals any global pharma RCT population — CDSCO real-world evidence submissions, post-market surveillance, label expansions
What would strengthen it at scale:
Outcomes linkage: HN hospitalisation DPA for enrolled members closes the outcome loop — converts reference into predictive asset by Y3
Value to Pharma, Biotech & Life Sciences:
Novo Nordisk / AstraZeneca: Indian trial correction dataset — ₹10–25Cr collaboration Y1; re-prices to ₹30–60Cr platform licence by Y3 as subgroup depth grows
Roche Diagnostics: South Asian cardiometabolic panel calibration — ₹4–8Cr upfront + per-test royalty; royalty stream grows with panel adoption across Indian labs
Boehringer Ingelheim: Empagliflozin South Asian drug-metabolite interaction — direct regulatory value — ₹6–15Cr
At 100K+ scale: The dataset becomes the mandatory reference for any company running metabolomics in Indian trials. Recurring annual licence revenue from 10–20 simultaneous pharma partners. Total platform revenue ₹100–300Cr/yr by Y4.
Moat: Dietary recall tool is proprietary. Disease-spectrum annotation cannot be replicated from device or lab data. 3-year head start at Y4 is structurally unassailable.
Asset 2 (Proteomics): Crosses rare-variant pQTL threshold at Y2. Drug target platform by Y3.
What It Means:
Measurement of 500–7,000 proteins in blood plasma repeated annually across the full cardiometabolic spectrum, linked to outcomes over time. Proteins are the effectors of disease — closer to clinical phenotype than DNA, detectable from a blood draw.
What scale unlocks: Common variant pQTL is powered at 5,000. Rare variant pQTL — South Asian-specific variants absent from European populations — is powered at 25,000. Novel drug target discovery from ancestry-specific biology begins at Year 2, not Year 1.
Data Required:
Platform: Targeted Olink panel (500–1,000 proteins) for all members annually; deep SomaScan 7K for 2,000-person stratified subsample annually — discovery depth at manageable cost
WGS: All members at onboarding — essential for pQTL; do not defer this
Disease sub-cohorts: T2D and CVD members provide comparison arm and medication confound layer; drug-protein interaction mapping at scale is a standalone product
Cost strategy: Olink co-funding partnership — subsidised platform cost ₹5–10Cr in exchange for dataset access rights; approach before enrollment opens
State at 18 months (5,000 Members):
5,000 members: complete targeted proteomics baseline across disease spectrum
Full spectrum baseline: Healthy / pre-diabetic / T2D / CVD proteome comparison live — cross-sectional South Asian disease-state proteome; the commercial product at Y1
Common variant pQTL: Powered at 5,000 with WGS — South Asian protein-controlling genetic variants identifiable; first South Asian pQTL map in production
Reference: South Asian protein reference ranges across disease spectrum — correction dataset for Indian trial data
What would strengthen it:
Olink deal: Formalise before enrollment — cost reduction is material at this sample volume
Outcome disclosure: Be explicit with pharma partners that outcome-linked proteomics is Y3 upside; cross-sectional disease-spectrum baseline is the Y1 product
State at Y2–Y4 (scaling to 250K Members):
Year 2 — 25,000 members
Rare variant pQTL powered: South Asian-specific genetic variants controlling protein levels — biology invisible in European cohorts. Novel drug targets specific to South Asian ancestry begin to emerge. This is the scientific output pharma pays the most for.
~6,250 T2D: Drug-protein interaction map across diabetes drug classes in South Asians — SGLT2i, GLP-1, metformin effects on protein expression at scale; regulatory-grade pharmacodynamic dataset
Year 3 — 100,000 members
Drug target platform: South Asian pQTL + disease-state proteomics at 100,000 is the reference dataset for every pharma company developing metabolic drugs for the Indian market. The asset category shifts from ‘research dataset’ to ‘discovery infrastructure’.
Longitudinal signal: Members with 3 serial proteomic timepoints; proteins that change before T2D and CVD events identifiable from pre-disease members who convert. First validated South Asian proteomic biomarkers publishable.
Year 4 — 250,000 members
~62,500 T2D + ~37,500 CVD: Outcome-linked proteomic prediction models — which protein patterns at Year 1 predict MI, renal failure, or HbA1c >10% at Year 4 — powered for rare outcomes at this scale
Biomarker licensing: Validated South Asian cardiometabolic risk protein panel — licensable to every Indian diagnostics lab and insurer. ₹200–500Cr annual market at full Indian lab penetration.
What would strengthen it at scale:
Consortium linkage: Joint pQTL analysis with GenomeIndia and UK Biobank South Asian participants multiplies rare-variant discovery power without additional enrollment cost
Value to Pharma, Biotech & Life Sciences:
Boehringer Ingelheim / AstraZeneca: South Asian pQTL co-development — ₹15–30Cr Y1; re-prices to ₹50–100Cr exclusive platform access by Y3
Olink / SomaLogic: ₹5–10Cr in platform subsidies + co-publication; approach at Y1, renegotiate at Y2 as cohort scale proves out
Eli Lilly: Tirzepatide South Asian proteomics — GIP/GLP-1 pathway in South Asian T2D vs. Western trial populations — ₹8–20Cr
At 100K+: The pQTL map and disease-state proteomics become the definitive South Asian molecular reference. Multi-year exclusive access agreements at ₹50–150Cr each. Multiple simultaneous partners.
Novel target value: A validated, novel, South Asian-specific drug target — a protein whose variant uniquely predicts T2D in South Asians — is worth ₹500Cr–₹2,000Cr in licensing or M&A value.
Asset 3 (CGM): Pharmacovigilance grade at Y2. Regulatory endpoint platform by Y3.
Data Required:
Healthy/pre-diabetic: Annual 14-day CGM; wearable HRV and sleep integration; concurrent food log
T2D members: Quarterly 14-day CGM (4 cycles/yr); HbA1c, fasting glucose, insulin at each visit; medication records for each CGM period
CVD members: Annual CGM; cardiac medication list; captures glucose-cardiac interaction patterns
Anchor: HbA1c + HOMA-IR at every visit validates CGM metrics against clinical standards for regulatory use
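For reference, the HOMA-IR anchor can be derived from the fasting glucose and insulin values already collected at each visit. The sketch below assumes the standard HOMA1 approximation (glucose in mg/dL times insulin in µU/mL, divided by 405); the plan does not state which HOMA variant will be used, and the input values are illustrative rather than cohort data.

```python
def homa_ir(fasting_glucose_mg_dl: float, fasting_insulin_uU_ml: float) -> float:
    """HOMA-IR from fasting glucose (mg/dL) and fasting insulin (µU/mL).

    Assumes the HOMA1 approximation; with glucose in mmol/L the divisor is 22.5.
    """
    return fasting_glucose_mg_dl * fasting_insulin_uU_ml / 405.0

# Illustrative values only, not cohort data.
example = homa_ir(fasting_glucose_mg_dl=100.0, fasting_insulin_uU_ml=10.0)
print(round(example, 2))
```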
What It Means:
Continuous glucose profiles — every 15 minutes, 14 days — across the full South Asian cardiometabolic spectrum, with full clinical phenotyping and dietary annotation. Non-diabetic reference, pre-diabetic trajectory, and established T2D management patterns in one cohort under one protocol.
Why clinical annotation is the differentiator: Abbott and Dexcom have volume. They do not have clinically phenotyped, research-consented, medication-annotated, dietary-linked CGM data across the full disease spectrum. That annotation layer — which only a care program can generate — is what creates the regulatory-grade research asset.
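As a rough sketch of what one wear period reduces to analytically: a reading every 15 minutes for 14 days gives 1,344 expected glucose values, from which the standard summary metrics follow. The 70–180 mg/dL range is the commonly used consensus target range and is an assumption here, since the plan does not fix the thresholds; the function name and the synthetic trace are illustrative.

```python
import statistics

READINGS_PER_DAY = 24 * 60 // 15            # one reading every 15 minutes
EXPECTED_READINGS = READINGS_PER_DAY * 14   # 14-day wear period

def cgm_summary(glucose_mg_dl, low=70, high=180):
    """Summarise one wear period: time in range, time below range, CV%, coverage."""
    n = len(glucose_mg_dl)
    in_range = sum(low <= g <= high for g in glucose_mg_dl)
    below = sum(g < low for g in glucose_mg_dl)
    mean = statistics.fmean(glucose_mg_dl)
    sd = statistics.pstdev(glucose_mg_dl)  # population SD; sample SD is also common
    return {
        "tir_pct": 100 * in_range / n,
        "tbr_pct": 100 * below / n,
        "cv_pct": 100 * sd / mean,
        "coverage_pct": 100 * n / EXPECTED_READINGS,
    }

# Synthetic trace for illustration only; a real wear period has ~1,344 values.
trace = [60, 90, 110, 150, 200, 120, 100, 130]
summary = cgm_summary(trace)
print(EXPECTED_READINGS, summary)
```

The metrics themselves are commodity; the differentiator described above is that each summary can be joined to medication, diet, and clinical phenotype for the same member.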
State at 18 months (5,000 Members):
5,000 members: ~1,250 T2D on quarterly CGM; ~750 CVD on annual CGM; ~3,000 healthy/pre-diabetic on annual CGM
T2D quarterly data: ~1,250 T2D members × 4 CGM cycles = ~5,000 CGM wear-periods in Year 1 from established South Asian T2D. Does not exist anywhere.
Non-diabetic reference: South Asian time-in-range norms, post-prandial glucose responses to Indian meals, glycaemic variability by metabolic risk category
Device licensing: Clinically annotated disease-spectrum CGM data — materially more valuable to Abbott/Dexcom than raw device uploads; licensable within Year 1
What would strengthen it:
Meal challenge subset: 50–100 T2D members eating a weighed standardised Indian meal while on CGM — Indian food glycaemic index in T2D physiology; publishable standalone study
ECG addition: Annual 2-week ECG patch for CVD members is viable at 750 members — adds HRV and arrhythmia layer; revisit at month 9
State at Y2–Y4 (scaling to 250K Members):
Year 2 — 25,000 members
Pharmacovigilance grade: ~6,250 T2D members on quarterly CGM; real-world glucose outcomes across diabetes drug classes at scale. CDSCO and FDA accept real-world CGM evidence for label extensions and post-market surveillance at this population size.
Drug class comparison: CGM-derived time-in-range and variability across metformin / SGLT2i / GLP-1 / insulin regimens in South Asian T2D — the real-world effectiveness comparison pharma cannot generate in a controlled trial
Year 3 — 100,000 members
Endpoint qualification: ~60,000 healthy/pre-diabetic with 3 years of annual CGM; incident T2D conversions with pre-conversion CGM trajectory identifiable. CGM-derived glycaemic variability qualifiable as a South Asian regulatory surrogate endpoint — reduces T2D prevention trial length by 2–3 years.
~25,000 T2D: Complication signal detectable — glycaemic patterns preceding eGFR decline, retinopathy progression, and cardiovascular events in South Asian T2D. First outcome-linked CGM findings publishable.
Year 4 — 250,000 members
Digital health infrastructure: ~62,500 T2D members on quarterly monitoring = ~250,000 CGM wear-periods annually. Large enough to train and validate South Asian glucose prediction and complication risk ML models. Asset transitions from dataset to intellectual property.
Insurer value: Real-world glycaemic control data across 62,500 T2D members enables outcome-linked insurance pricing for the Indian diabetes population — a completely new actuarial product
What would strengthen it at scale:
Outcome linkage: HN hospitalisation DPA ties CGM patterns to hard events — converts to regulatory endpoint qualification dataset by Y3
Endpoint qualification target: Regulatory-grade surrogate endpoint qualification requires 15,000–20,000 in most frameworks — crossed at Y3; prioritise regulatory submission process from Y2
Value to Pharma, Biotech & Life Sciences:
Abbott / Dexcom: Clinically annotated South Asian disease-spectrum CGM licence — ₹4–10Cr Y1; renegotiates to annual platform licence at Y2 as quarterly T2D data volume grows
Novo Nordisk: Semaglutide and oral GLP-1 CGM response in South Asian T2D — companion to metabolomics partnership — ₹5–15Cr
Drug trial endpoints: Validated South Asian CGM surrogate endpoint reduces T2D prevention trial length by 2–3 years — ₹100–400Cr in saved trial costs per program; highest-value long-term use of this asset
At 100K+: Real-world evidence platform for CDSCO and FDA submissions. Post-market surveillance licensing. Indian insurer actuarial product. Multiple simultaneous revenue streams.
Digital therapeutics: Fitterfly, BeatO, Ultrahuman — South Asian T2D CGM reference for CDSCO clinical validation — ₹1–3Cr each; fast domestic closes in Y1
Asset 4 (Pre-Disease Longitudinal Cohort): First findings at Y2. Most valuable asset globally by Y4.
What It Means:
Longitudinal multi-omic study following South Asians before disease onset until a fraction develops T2D or CVD — identifying the molecular changes that precede disease by 2–5 years. The established T2D and CVD members in the same cohort are the biological destination, making the full journey visible in one dataset from day one.
Why scale is transformative: At 5,000, this is a well-designed pilot. At 25,000, it generates first findings. At 100,000, it is the largest prospective South Asian multi-omic cohort ever assembled — larger than any Asian biobank at equivalent molecular depth. At 250,000, it is a national health infrastructure asset.
Data Required:
Pre-disease arm: ~60% of cohort — healthy/pre-diabetic, high-risk enriched; HbA1c 5.5–6.4%, family history, central obesity; annual WGS, proteomics, metabolomics, CGM
Disease endpoint arm: ~25% T2D + ~15% CVD — biological destination; same omic protocol as pre-disease arm; medication records; incident events tracked
Serial design: Same participant, same protocol, every year — the trajectory is the asset
Conversion tracking: Annual HbA1c flags members who progress from pre-diabetic to T2D — incident conversions with complete pre-disease omic history are the scientific prize
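Conversion tracking can be sketched as a simple rule over each pre-disease member's annual HbA1c series. The 6.5% threshold matches the conversion flag used in the collection schedule; the function name and the example series are hypothetical, and a production rule would likely also require a confirmatory test before flagging.

```python
T2D_THRESHOLD = 6.5  # HbA1c %, the conversion flag used in this plan

def first_conversion_year(annual_hba1c):
    """Return the first year (1-indexed) at which HbA1c reaches the T2D
    threshold after at least one earlier reading below it, else None.

    Intended for members enrolled in the pre-disease arm.
    """
    seen_below = False
    for year, value in enumerate(annual_hba1c, start=1):
        if value < T2D_THRESHOLD:
            seen_below = True
        elif seen_below:
            return year
    return None

# Hypothetical annual HbA1c series for two pre-disease members.
converter = first_conversion_year([6.1, 6.3, 6.6])
non_converter = first_conversion_year([5.8, 5.9, 6.0])
print(converter, non_converter)
```

Every year before the flagged year is the pre-disease omic history that the plan treats as the scientific prize.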
State at 18 months (5,000 Members):
5,000 members: ~3,000 pre-disease + ~1,250 T2D + ~750 CVD. Complete multi-omic baseline across full spectrum.
Destination in cohort from day 1: The full biological journey — pre-disease to established disease — is present within the cohort without waiting for conversions. Cross-sectional trajectory comparison is live.
Commercial position: Sold as co-design access — pharma pays to nominate questions and get preferential findings access. Standard early-cohort model globally.
No spontaneous conversions yet: 18 months is too short. Honest disclosure — conversion data begins at Y2. The Y1 story is architecture, scale, and the disease-endpoint arm.
What would strengthen it:
High-risk enrichment: Targeting HbA1c 6.0–6.4% specifically accelerates conversion events — more signal per participant per year
Intervention sub-study: Nested 500-person RCT randomising high-risk members to lifestyle or metformin — prevention evidence; multiplies commercial value to prevention-focused pharma
State at Y2–Y4 (scaling to 250K Members):
Year 2 — 25,000 members
First conversion wave: ~5% annual T2D conversion from ~15,000 pre-diabetic members = ~750 incident T2D cases with complete pre-disease omic histories. First longitudinal molecular signal available. Candidate pre-disease biomarkers identifiable and publishable.
Rare variant pQTL: 25,000 with WGS crosses the threshold for South Asian rare-variant analysis — novel pre-disease genetic architecture emerging; drug target identification begins
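The Year 2 conversion estimate is straightforward arithmetic on the target mix and the assumed ~5% annual conversion rate. A minimal sketch follows; the 3% and 7% rates are hypothetical sensitivity bounds added for illustration and do not come from the plan.

```python
def expected_conversions(total_members, pre_disease_share=0.60, annual_rate=0.05):
    """Expected incident T2D cases in one year from the pre-disease arm."""
    pre_disease_members = total_members * pre_disease_share
    return round(pre_disease_members * annual_rate)

year2_cases = expected_conversions(25_000)

# Sensitivity to the conversion-rate assumption (3% and 7% are hypothetical bounds).
sensitivity = {rate: expected_conversions(25_000, annual_rate=rate)
               for rate in (0.03, 0.05, 0.07)}
print(year2_cases, sensitivity)
```

The estimate is only as good as the conversion rate, which is why high-risk enrichment matters: it raises the rate rather than the headcount.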
Year 3 — 100,000 members
Largest Asian multi-omic cohort: ~60,000 pre-disease members with 3 annual timepoints; ~3,000+ cumulative T2D conversions with pre-disease histories. More scientifically powerful than UK Biobank’s South Asian sub-cohort by any molecular metric.
Biomarker validation: Candidate pre-disease protein and metabolite signatures validated across independent conversion waves. First South Asian pre-disease biomarker ready for diagnostic development licensing.
Category shift: Asset transitions from research cohort to national reference platform. ICMR and government engagement likely. Data governance structure review required.
Year 4 — 250,000 members
~150,000 pre-disease: 5,000–10,000 cumulative T2D conversions; 3,000–5,000 CVD events. Powered for rare outcome biomarker discovery — which pre-disease signals predict uncommon but devastating outcomes like sudden cardiac death and ESRD in South Asians.
The prize: A validated South Asian pre-disease biomarker — protein or metabolite predicting T2D 3 years before diagnosis — ready for licensing as a companion diagnostic alongside a prevention drug program. This is the ₹500Cr–₹2,000Cr licensing event.
What would strengthen it at scale:
International consortium: Joint analysis with Singapore Indian Cohort, MASALA study, UK Biobank South Asian arm — multiplies statistical power for rare variants and rare outcomes without additional Indian enrollment
Intervention sub-study at Y2: A funded nested prevention RCT from Y2 converts the observational cohort into an interventional platform — the highest-value configuration for prevention pharma
Value to Pharma, Biotech & Life Sciences:
Novo Nordisk: Prevention franchise — semaglutide/oral GLP-1 for high-risk South Asians; South Asian pre-disease trajectory prerequisite for CDSCO prevention indication — ₹12–25Cr co-design Y1; re-prices to ₹50–120Cr exclusive platform by Y3
Pfizer / Lilly: Pre-disease drug target ID — naturally structured as rider on proteomics partnership — ₹5–15Cr additional
At Y2 — first findings: Candidate biomarker announcement drives immediate partnership re-pricing. Companies not yet in the partnership pay a premium to access findings. Entry cost rises at each publication.
The long-game value: A validated South Asian pre-disease biomarker is a ₹500Cr–₹2,000Cr licensing event. A 250,000-person prospective cohort is a permanent national health infrastructure asset that generates revenue indefinitely.
M&A value: At 250,000 longitudinally enrolled, multi-omic, research-consented South Asians with outcomes data, the data entity itself becomes an acquisition target for a global genomics or pharma infrastructure company.
Y1 commercial target: Metabolomics collaboration ₹10–25Cr (Asset 1) + proteomics co-development ₹15–30Cr (Asset 2) + CGM device licence ₹4–10Cr (Asset 3).
Y2 inflection: Rare-variant pQTL powered (Asset 2), first conversion wave (Asset 4), pharmacovigilance-grade CGM (Asset 3). All three re-price existing partnerships upward and open new partner categories.
Non-negotiable: Indian dietary capture tool built before first enrollment. Disease-state proportion (25% T2D / 15% CVD) defended at every scale milestone through physician referral channels.
| Domain | Coding Standard | Exchange Format | Terminology | Regulatory Mapping | Delivery Format |
|---|---|---|---|---|---|
| Clinical labs | LOINC 2.77 | HL7 FHIR R4 | LOINC + SNOMED CT | CDISC SDTM LB domain | FHIR Bundle / SDTM XPT |
| Medications | ATC (WHO) | HL7 FHIR R4 MedicationDispense | RxNorm + ATC | CDISC SDTM CM domain | FHIR Bundle / SDTM XPT |
| Diagnoses | ICD-10-CM | HL7 FHIR R4 Condition | ICD-10 + SNOMED CT | CDISC SDTM MH domain | FHIR Bundle / SDTM XPT |
| Genomics (WGS) | GRCh38 + GATK | gVCF / VCF 4.3 | HGVS nomenclature | N/A (exploratory) | gVCF per sample + joint VCF |
| Proteomics | Olink NPX / SomaScan RFU | Structured matrix (CSV/Parquet) | UniProt protein IDs | N/A (exploratory) | NPX/RFU matrix + metadata |
| Metabolomics | HMDB / KEGG / METLIN | mzML + feature table | HMDB identifiers | N/A (exploratory) | Feature table (CSV/Parquet) |
| CGM | AGP standard metrics | CSV + structured JSON | TIR/TBR/CV%/MAGE | FDA CGM guidance | AGP report + raw glucose CSV |
| Dietary | IFCT 2017 food codes | Structured JSON | IFCT 2017 + FoodEx2 | N/A | Macronutrient matrix per meal |
| Imaging | DICOM 3.0 | DICOM + structured report | RadLex | CDISC SDTM FA domain | DICOM + structured findings |
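To illustrate what the clinical labs row implies in practice, here is a minimal sketch of a single HbA1c result expressed as an HL7 FHIR R4 Observation coded with LOINC. The helper name, patient reference, and values are hypothetical; a production resource would carry additional elements (performer, specimen, reference range, consent tier) that are omitted here.

```python
import json

def hba1c_observation(patient_id: str, value_pct: float, effective_date: str) -> dict:
    """Build a minimal FHIR R4 Observation for an HbA1c lab result."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/observation-category",
                "code": "laboratory",
            }]
        }],
        "code": {
            "coding": [{
                "system": "http://loinc.org",
                "code": "4548-4",  # LOINC: Hemoglobin A1c/Hemoglobin.total in Blood
                "display": "Hemoglobin A1c/Hemoglobin.total in Blood",
            }]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": effective_date,
        "valueQuantity": {
            "value": value_pct,
            "unit": "%",
            "system": "http://unitsofmeasure.org",  # UCUM
            "code": "%",
        },
    }

# Hypothetical member identifier and value, for illustration only.
obs = hba1c_observation("example-001", 6.2, "2025-01-15")
print(json.dumps(obs, indent=2))
```

Emitting every lab result in this shape is what makes the same record deliverable both as a FHIR Bundle and, after mapping, as an SDTM LB dataset.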
| Quality Dimension | Definition | Measurement | Target | Escalation |
|---|---|---|---|---|
| Completeness | % of expected data points captured | Monthly audit per domain | >85% across all domains | Domain lead review; participant outreach |
| Accuracy | Agreement between captured and true value | Duplicate sample validation; phantom calibration | <5% discordance rate | Recollection or reprocessing |
| Timeliness | Lag between collection and data availability | SLA tracking per vendor | <7 days for labs; <30 days for omics | Vendor escalation protocol |
| Consistency | Cross-domain data coherence | Cross-validation rules (e.g., HbA1c vs. CGM TIR correlation) | No unexplained major discordances | Clinical data scientist review |
| Provenance | Full audit trail from collection to analysis | Metadata tags: source, timestamp, operator, QC status, consent tier | 100% of records traceable | System architecture requirement |
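One way the consistency rule could be implemented is sketched below. Instead of a TIR correlation, it swaps in a simpler per-member check: the Glucose Management Indicator (GMI), an HbA1c-like estimate derived from CGM mean glucose via the published formula GMI (%) = 3.31 + 0.02392 × mean glucose (mg/dL), compared against lab HbA1c. The 1.0 percentage-point tolerance and the function names are assumptions for illustration, not thresholds from the plan.

```python
def gmi_percent(mean_glucose_mg_dl: float) -> float:
    """Glucose Management Indicator (%) from CGM mean glucose in mg/dL."""
    return 3.31 + 0.02392 * mean_glucose_mg_dl

def is_discordant(lab_hba1c_pct: float, cgm_mean_glucose_mg_dl: float,
                  tolerance_pct: float = 1.0) -> bool:
    """Flag a record for clinical data scientist review when lab HbA1c and
    CGM-derived GMI differ by more than the (assumed) tolerance."""
    return abs(lab_hba1c_pct - gmi_percent(cgm_mean_glucose_mg_dl)) > tolerance_pct

# Illustrative values only.
coherent = is_discordant(lab_hba1c_pct=7.0, cgm_mean_glucose_mg_dl=150.0)
flagged = is_discordant(lab_hba1c_pct=9.5, cgm_mean_glucose_mg_dl=150.0)
print(coherent, flagged)
```

A flag is a prompt for review, not an error: real HbA1c and GMI gaps can reflect biology (for example, red cell turnover) as well as sample or device problems.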
| Domain | Target Completeness | Minimum Acceptable | Measurement Method | Escalation if Below Minimum |
|---|---|---|---|---|
| Clinical biomarkers | >98% | >95% | % of scheduled draws completed | Care coordinator outreach within 7 days |
| CGM wear compliance | >90% | >80% | % of scheduled wear-periods with ≥10 days data | Device troubleshooting + replacement protocol |
| Medication dispensing (NetMeds) | >85% | >75% | % of members with active NetMeds dispensing records | Incentive program (NetMeds discount for members) |
| Omics (WGS/proteomics/metabolomics) | >95% | >90% | % of baseline samples successfully processed by Strand | Reprocessing protocol; recollection if needed |
| Dietary recall | >80% | >70% | % of scheduled 24-hr recalls completed | WhatsApp chatbot follow-up; dietitian outreach |
| Wearable data | >75% | >65% | % of enrolled members with active device pairing | Onboarding protocol check; device pairing support |
| Follow-up retention (12-month) | >75% | >65% | % of enrolled members completing 12-month visit | Retention program: WhatsApp, reports, travel reimbursement |
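The completeness table translates naturally into an automated monthly check. In the sketch below, the thresholds are taken from the table above, while the domain keys, status labels, and observed percentages are hypothetical.

```python
# (target %, minimum acceptable %) per domain, from the completeness table.
THRESHOLDS = {
    "clinical_biomarkers": (98, 95),
    "cgm_wear_compliance": (90, 80),
    "medication_dispensing": (85, 75),
    "omics": (95, 90),
    "dietary_recall": (80, 70),
    "wearable_data": (75, 65),
    "followup_retention_12m": (75, 65),
}

def completeness_status(domain: str, observed_pct: float) -> str:
    """Classify observed completeness against the target and minimum thresholds."""
    target, minimum = THRESHOLDS[domain]
    if observed_pct > target:
        return "on_target"
    if observed_pct > minimum:
        return "below_target"
    return "escalate"  # triggers the escalation listed for that domain

# Hypothetical monthly audit results.
audit = {"cgm_wear_compliance": 85.0, "dietary_recall": 68.0, "omics": 96.0}
report = {domain: completeness_status(domain, pct) for domain, pct in audit.items()}
print(report)
```

The comparisons are strict because the table states its thresholds as "greater than"; a domain sitting exactly on its minimum is treated as needing escalation.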
| Sample/Data | Collection Method | Analysis Outputs | When |
|---|---|---|---|
| Whole Genome Sequencing (WGS) | Blood Draw (10ml EDTA Tube) | Genetic Variants For pQTL Analysis, Drug-Gene Interactions, Ancestry-Specific Rare Variants | At Onboarding (Baseline) |
| Fasting Plasma (500–1,000 μl × 4 Tubes) | Blood Draw (Fasting) | Untargeted Metabolomics (500–3,000 Small Molecules), Targeted Proteomics (Olink 500–1,000 Proteins), Baseline Protein Reference Values | Baseline, Then Annually |
| Fasting Urine | Urine Collection (Morning, 50ml) | Urine Metabolomics, Dietary Metabolite Fingerprinting, Medication Metabolites | Baseline, Then Annually |
| HbA1c + Fasting Glucose + Insulin + HOMA-IR | Blood Draw (Fasting, 5ml) | Disease Classification (Healthy/Pre-Diabetic/T2D), Insulin Resistance Quantification, Clinical Anchor For CGM Validation | Baseline, Cadence Varies By State |
| Full Medication List + History | Structured Interview (Medical Record) | Drug-Metabolite Confound Mapping, Drug-Protein Interaction Baseline | Baseline + Updated Each Visit |
| Cardiac Medication List (CVD Members Only) | Structured Interview | Glucose-Cardiac Interaction Pattern Mapping | Baseline + Updated Annually |
| Clinical Phenotyping Form | Structured Questionnaire | Disease State Classification, Risk Stratification, Dietary Pattern Baseline | Baseline Only |
Ongoing Dietary Data:
| Sample/Data | Collection Method | Analysis Outputs | Cadence |
|---|---|---|---|
| 24-Hr Dietary Recall × 2 Per Year | WhatsApp Chatbot (6 Regional Languages) | Indian Dietary Metabolite Signatures, Regional Diet-Metabolite Mapping, Dietary Quality Assessment | 2 Times Per Year (e.g., Jan + Jul) |
| Concurrent Food Log (During CGM Wear Only) | Photo-Logged Or Written | Linking Specific Meals To Post-Prandial Glucose Excursions, Indian Food Glycaemic Index In Participant Physiology | During Each CGM Wear Period |
Ongoing Data (Healthy / Pre-Diabetic Members):
| Sample/Data | Collection Method | Analysis Outputs | Cadence | Annual Events |
|---|---|---|---|---|
| 14-Day CGM Wear | Glucose Monitor Patch (Abbott/Dexcom) + Concurrent HRV + Sleep Wearable Data | South Asian Time-In-Range Norms, Post-Prandial Glucose Response Baseline, Glycaemic Variability By Metabolic Risk, Incident T2D Conversion Trajectory If Applicable | Annually | 1 Wear Period/Year |
| HbA1c + HOMA-IR | Blood Draw (Fasting, 5ml) | Disease Progression Tracking, Pre-Diabetic Trajectory Quantification, Conversion Flag (HbA1c Crossing 6.5%) | Annually | 1 Draw/Year |
| Fasting Plasma + Urine | Blood Draw (Fasting) + Urine Collection | Annual Metabolomics And Proteomics Refresh, Trajectory Biomarkers | Annually | 1 Collection/Year |
| Targeted Olink Proteomics | From Fasting Plasma Above | Protein Reference Trajectory, pQTL Mapping, Pre-Disease Protein Signature Evolution | Annually | Included In Plasma Draw |
Ongoing Data (T2D Members):
T2D Cohort:
| Sample/Data | Collection Method | Analysis Outputs | Cadence | Annual Events |
|---|---|---|---|---|
| 14-Day CGM Wear | Glucose Monitor Patch + Medication Record For Each Period | Real-World Glucose Control By Drug Class, Drug Effectiveness Comparison (Metformin Vs. SGLT2i Vs. GLP-1 RA Vs. Insulin), Complication Risk Signal Detection | Quarterly (4 Cycles/Year) | 4 Wear Periods/Year (e.g., Jan, Apr, Jul, Oct) + Medication Logs For Each |
| HbA1c + Fasting Glucose + Insulin | Blood Draw (Fasting, 5ml) | Glycaemic Control Trend, Insulin Dynamics, Diabetes Progression | Quarterly At Each Visit | 4 Draws/Year (Aligned With CGM Periods) |
| Medication Records | Structured List (Updated At Each Visit) | Drug-Metabolite Interaction Mapping, Drug-Protein Effect Quantification | Updated Quarterly | Logged 4 Times/Year |
| Fasting Plasma (For Metabolomics + Proteomics) | Blood Draw (Fasting, 500–1,000 μl × 4 Tubes) | Annual Metabolomics Refresh + Drug-Metabolite Confound Dataset, Annual Proteomics Refresh | Annually | 1 Draw/Year |
| Fasting Urine | Urine Collection (Morning, 50ml) | Annual Urine Metabolomics | Annually | 1 Collection/Year |
| Post-Prandial Plasma (If In Subset) | Blood Draw 2 Hours After Standardised Indian Meal | Post-Prandial Metabolite Response, Diet-To-Metabolite Relationship Anchoring, Indian Food Glycaemic Index In Diabetes Physiology | If Selected For Meal Challenge Subset: 1 Structured Meal Challenge, 1 Fasting + 1 Post-Prandial Draw | — |
| Targeted Olink Proteomics | From Fasting Plasma Above | Drug-Protein Interaction Mapping, Annual Protein Trajectory, Pharmacodynamic Response To Medication | Annually | Included In Plasma Draw |
| Deep SomaScan 7K (If In Stratified Subsample Of 2,000) | From Fasting Plasma Above (Deeper Analysis Of Same Draw) | High-Resolution Protein Discovery, Rare Protein-Genetic Variant Mapping | Annually (If Selected) | Included In Plasma Draw If Selected |
CVD Cohort:
| Sample/Data | Collection Method | Analysis Outputs | Cadence | Annual Events |
|---|---|---|---|---|
| 14-Day CGM Wear | Glucose Monitor Patch | Glucose-Cardiac Interaction Patterns, Glycaemic Control In Cardiac Context, CV Event Risk Signal | Annually | 1 Wear Period/Year |
| Annual ECG Patch (Optional Expansion) | Wearable ECG Monitor (2-Week Continuous) | HRV Integration, Arrhythmia Detection, Glucose-Arrhythmia Correlation | Annually If Implemented | 1 Wear Period/Year |
| HbA1c + HOMA-IR | Blood Draw (Fasting, 5ml) | Glycaemic Control In CVD Context, Metabolic Progression | Annually | 1 Draw/Year |
| Cardiac Medication List | Structured Interview | Medication-Metabolite / Medication-Protein Confounding | Updated Annually | 1 Update/Year |
| Fasting Plasma + Urine | Blood Draw (Fasting) + Urine Collection | Annual Metabolomics And Proteomics Refresh For CVD Trajectory | Annually | 1 Collection/Year |
| Targeted Olink Proteomics | From Fasting Plasma Above | Cardio-Metabolic Protein Signature Evolution, Outcome-Linked Proteomics | Annually | Included In Plasma Draw |
| HbA1c + Hospitalisation Tracking (If Outcomes Linkage Activated) | Clinical Records + Participant Report | HN Hospitalisation Outcome Linkage For CVD Event Prediction | Ongoing Tracking | Tracked Continuously If DPA In Place |
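As an illustration of how these schedules differ in participant burden, the sketch below tallies the recurring annual events listed in the three schedule tables above. It assumes the tables correspond, in order, to the Healthy/Pre-Diabetic, T2D, and CVD cohorts named in the cost table. The class and field names are hypothetical, and optional or subset-only items (ECG patch, meal challenge, SomaScan) are left out.

```python
from dataclasses import dataclass

CGM_WEAR_DAYS = 14  # each wear period in the tables is a 14-day CGM wear


@dataclass(frozen=True)
class AnnualSchedule:
    """Recurring events per member per year (hypothetical structure)."""
    cgm_wear_periods: int
    hba1c_draws: int
    plasma_collections: int

    @property
    def cgm_days(self) -> int:
        # Total days of CGM data captured per member per year.
        return self.cgm_wear_periods * CGM_WEAR_DAYS


# Counts copied from the "Annual Events" columns; the cohort labels are
# an assumption about which table belongs to which cohort.
SCHEDULES = {
    "Healthy/Pre-Diabetic": AnnualSchedule(1, 1, 1),
    "T2D": AnnualSchedule(4, 4, 1),
    "CVD": AnnualSchedule(1, 1, 1),
}

for cohort, schedule in SCHEDULES.items():
    print(
        f"{cohort}: {schedule.cgm_days} CGM days, "
        f"{schedule.hba1c_draws} HbA1c draw(s), "
        f"{schedule.plasma_collections} plasma collection(s) per year"
    )
```

Under these assumptions the T2D schedule yields four times the CGM days and HbA1c draws of the other two cohorts, which is consistent with its higher recurring cost.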
Cost Per Member By Cohort:
| Cohort | Year 1 (Baseline) | Year 2+ (Recurring) |
|---|---|---|
| Healthy/Pre-Diabetic | ₹62,200 | ₹62,200 |
| T2D | ₹65,000 | ₹81,000 |
| CVD | ₹62,200 | ₹65,000–69,000 |
| Blended (Cohort Mix) | ₹62,200 | ₹69,300 |
Note: These figures assume every test is run in the year the sample is collected. If we instead store samples in a biobank and test only those from members who have a clinical event or a change in health state, the smaller fraction of samples tested, combined with the expected fall in sequencing costs by the time those samples are run, would likely reduce our blended cost per member by 50–90%.
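To make that range concrete, the sketch below applies a 50% and a 90% reduction to the blended Year 2+ figure from the cost table. It is simple arithmetic and not a cost model: it assumes the reduction applies uniformly to the whole recurring cost, and the function name is illustrative.

```python
# Blended Year 2+ cost per member from the cost table (INR).
BLENDED_RECURRING_INR = 69_300


def cost_after_reduction(base_inr: int, reduction_pct: int) -> float:
    """Return the cost per member after a percentage reduction (0-100)."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return base_inr * (100 - reduction_pct) / 100


# Bounds of the 50-90% range quoted in the note.
cost_at_50 = cost_after_reduction(BLENDED_RECURRING_INR, 50)
cost_at_90 = cost_after_reduction(BLENDED_RECURRING_INR, 90)

print(f"At a 50% reduction: INR {cost_at_50:,.0f} per member per year")
print(f"At a 90% reduction: INR {cost_at_90:,.0f} per member per year")
```

Under that assumption, the blended recurring cost would fall from ₹69,300 to between ₹6,930 and ₹34,650 per member per year.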