A Sea of Information
In this series, I put the spotlight on data and its various uses in the life sciences industry, touching on the value of information, its intrinsic role in the development and manufacturing of new pharmaceuticals, and the potential weaponization of certain types of information in an uncertain world. What this really points to is the need for life science organizations to understand what data they have and how to monetize it most effectively. [3]
![Figure 1 Source: Grunge, How Long Could a Person Survive Lost at Sea? [2]](https://greenleafgrp.com/wp-content/uploads/2025/11/several-years-with-a-sustainable-setup-1690392187.jpg)
As I mentioned in earlier posts, the amount of information companies need to handle is enormous and grows with the size of the organization. At the same time, in the life sciences industry, data isn’t just an operational by-product—it underpins scientific breakthroughs, regulatory adherence, and market delivery. Yet many organizations underestimate the gap between their current data practices and the level of maturity needed to treat data fully as a strategic asset.
Numerous issues hinder life science companies from unlocking the full potential of their information assets. Amid this information density, it is difficult to distinguish the rocks from the open sea. These issues include:
Data Volume, Variety, and Velocity
The life sciences industry is seeing an explosion of data sources. In previous posts, I mentioned genomics, proteomics, and clinical trial information. Added to this list are wearables, imaging, and real-world evidence from open sources. Advanced research activities such as next-generation sequencing and high-resolution microscopy also produce terabytes or even petabytes of data. All of these generate vast, heterogeneous datasets. Combining structured data (e.g., clinical trial data) with unstructured data (e.g., lab notes, PDFs, images) is complex, and storing, retrieving, and processing large datasets becomes expensive and slow. For example, sequencing a single genome can generate hundreds of gigabytes of raw data. [4]
Data Fragmentation and Silos
Data is often scattered across different systems (Laboratory Information Management Systems (LIMS), ELNs, clinical databases, ERP, CRM, etc.), business units, and research teams. The size and heterogeneity of this information make it difficult to integrate or compare results across experiments, studies, or geographies. Genomic information, the subject of my last post, is a great example: in many cases, data from one lab’s system cannot easily be combined with clinical trial outcomes stored elsewhere.
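A minimal sketch of this integration problem, using hypothetical field names (`subject_id`, `assay_value`, `outcome`), shows how records from a LIMS export and a separate clinical database might be joined on a shared subject identifier; only subjects present in both systems survive the join, which is exactly how silos silently shrink usable datasets:

```python
# Hypothetical records from two siloed systems, keyed on subject_id.
lims_records = [
    {"subject_id": "S001", "assay_value": 4.2},
    {"subject_id": "S002", "assay_value": 5.8},
]
clinical_records = [
    {"subject_id": "S001", "outcome": "responder"},
    {"subject_id": "S003", "outcome": "non-responder"},
]

def merge_on_subject(lims, clinical):
    """Inner-join two record sets on subject_id."""
    outcomes = {r["subject_id"]: r["outcome"] for r in clinical}
    return [
        {**r, "outcome": outcomes[r["subject_id"]]}
        for r in lims
        if r["subject_id"] in outcomes
    ]

merged = merge_on_subject(lims_records, clinical_records)
print(merged)  # only S001 appears in both systems
```

Of the three subjects known across both systems, only one survives the join, so two-thirds of the available evidence is unusable until the silos are reconciled.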
Multiple Standards with Varying Levels of Abstraction
The diversity of life sciences data is vast, encompassing everything from traditional formats such as text files, tables, and PDFs to more complex ‘frontier’ data, including population genomics, single-cell data, bioimaging, and various forms of multi-omics data. [5] Data formats, ontologies, and metadata standards vary widely. These weaknesses hinder interoperability, automation, and cross-dataset analysis. For example, different organizations use different terms and units to record and analyze blood glucose information. This terminology mismatch makes integration difficult and requires programmatic intervention to normalize the data to a single standard unit of measurement.
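As a hedged illustration of that programmatic intervention, assuming readings arrive tagged with either mg/dL or mmol/L, normalization to a single canonical unit might look like this (the 18.016 conversion factor is the molar mass of glucose in g/mol):

```python
# Sketch of normalizing blood-glucose readings recorded in
# different units to a single canonical unit (mmol/L).
GLUCOSE_MG_PER_MMOL = 18.016  # molar mass of glucose (g/mol)

def to_mmol_per_l(value, unit):
    """Normalize a glucose reading to mmol/L."""
    if unit == "mmol/L":
        return value
    if unit == "mg/dL":
        return value / GLUCOSE_MG_PER_MMOL
    raise ValueError(f"unrecognized unit: {unit}")

# Two readings that are clinically similar but recorded differently.
readings = [(99.0, "mg/dL"), (5.5, "mmol/L")]
normalized = [round(to_mmol_per_l(v, u), 2) for v, u in readings]
print(normalized)  # both readings now comparable in mmol/L
```

The point is not the arithmetic, which is trivial, but that every incoming record must carry enough metadata (here, the unit tag) for the conversion to be applied safely; records without it cannot be normalized at all.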
![Figure 2 - Life Science Data Reference Models [1]](https://greenleafgrp.com/wp-content/uploads/2025/11/multiple-standards-with-varying-levels-of-abstraction.png)
Conceptual Definitions and Standards
Several organizations attempt to define standards to avoid these issues. These include:
- ISO/TS 22220:2011 — Health informatics: Defines core elements for identifying subjects of healthcare, though true comparability across datasets remains challenging.[6]
- Health Level Seven International (HL7)/ FHIR (Fast Healthcare Interoperability Resources): Establishes standardized structures and terminologies for clinical data; HL7 v2 remains the dominant messaging standard.[7, 8]
- The Life Sciences Domain Analysis Model (LS DAM) comprises 130 classes across several core areas, including Experiment, Molecular Biology, Molecular Databases, and Specimen. Developed under HL7’s Biomedical Research & Regulation (BR&R) group, this model structures data for life sciences R&D and harmonizes with the BRIDG model. [7, 9]
- Biomedical Research Integrated Domain Group (BRIDG) Model: Enables computable semantic interoperability (CSI) between clinical research and healthcare data, aligning with HL7’s Reference Information Model (RIM).
Focus distinction: BRIDG covers regulated clinical research, while LS DAM supports preclinical and discovery domains.
Terminology Models
- SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms): clinical terminology for consistent labeling. [9]
- LOINC (Logical Observation Identifiers Names and Codes): standard codes for lab and clinical measurements. [10]
- UCUM (Unified Code for Units of Measure): unified system for unit symbols and conversions, enabling unambiguous communication across systems. [9]
Together, these standards operate at different abstraction levels but are interrelated within the biomedical data ecosystem. Despite shared labels, data values may not be directly comparable; ongoing research aims to normalize disparate sources.
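To make the idea of shared labels concrete, here is a hypothetical sketch that maps site-specific lab-test names to a common LOINC code (2345-7, glucose in serum or plasma). The local names and the mapping table are illustrative; a real implementation would draw on a curated terminology service rather than a hand-built dictionary:

```python
# Illustrative mapping from site-specific test names to LOINC codes,
# so results from different sites can be pooled under one label.
LOCAL_TO_LOINC = {
    "GLU": "2345-7",
    "Glucose (serum)": "2345-7",
    "BLOOD SUGAR": "2345-7",
}

def to_loinc(local_name):
    """Translate a site-specific test name to its LOINC code."""
    code = LOCAL_TO_LOINC.get(local_name)
    if code is None:
        raise KeyError(f"no LOINC mapping for {local_name!r}")
    return code

site_a = to_loinc("GLU")
site_b = to_loinc("BLOOD SUGAR")
print(site_a == site_b)  # same analyte despite different local names
```

Note that a shared code only establishes that two results measure the same analyte; as the paragraph above cautions, the values themselves may still need unit normalization (e.g., via UCUM) before they are directly comparable.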
Implementation
- The Clinical Data Interchange Standards Consortium (CDISC): Focuses on interoperable standards across the research data lifecycle to make clinical data reusable, efficient, and impactful, advancing global health and scientific discovery. [9, 11]
Data Quality and Integrity
As with the issues presented above, experimental and clinical data can be incomplete, noisy, or error-prone due to manual entry, inconsistent protocols, or faulty instruments. Poor data quality reduces confidence in results, slows regulatory submissions, and leads to rework.
Building strong data governance and quality assurance processes is key to gaining dependable insights. Green Leaf helps organizations implement data validation frameworks, governance models, and monitoring systems to improve accuracy and maintain compliance throughout the data lifecycle. [12]
Integration of Multi-Omics and Heterogeneous Data
I discussed the importance of this issue for the life sciences industry from both legal and regulatory perspectives in my October 28 post [13]. Modern life science research produces large, diverse datasets, including genomics, proteomics, imaging, clinical outcomes, and more. Integrating structured data (such as clinical records) with unstructured data (such as imaging and notes) is both technically and scientifically complex. For example, combining genomic variant data with phenotypic and real-world evidence (RWE) requires advanced data models and analytics pipelines. [13]
Compliance and Data Privacy
As noted in my last post on the new DOJ rules on bulk sensitive information, life science data is often regulated and personally identifiable (particularly clinical data). Compliance with regulations such as GDPR, HIPAA, and 21 CFR Part 11, as well as the recent Presidential order, adds complexity and limits data sharing. [13]
Collaboration and Access Control
Balancing open collaboration—both internal and external—with secure data governance is challenging because overly restrictive access hampers teamwork, while overly open access compromises data security. For example, outsourced CROs (Contract Research Organizations) need data access without risking exposure of proprietary information. In my previous role as CISO, this was a primary concern: how to provide access to valuable data while maintaining security and compliance with key regulations.
Data Lifecycle Management
Scientific data must be tracked from creation through analysis, publication, archiving, and reuse. Poor lifecycle management leads to lost data, duplication, and difficulty reproducing results; for example, when a project ends and the researcher leaves, key datasets or scripts are often lost. More broadly, life science companies are data-rich but insight-poor: they generate vast, valuable data that is often underutilized due to fragmentation, poor quality, regulatory barriers, and lack of integration. [14]
Artificial Intelligence Readiness
The observable trend across these discussions of large datasets is that human intelligence alone cannot fully comprehend the value hidden within the petabytes of information life science companies depend on. Automation and artificial intelligence may be the only way for life science companies to navigate this sea of information.
However, problems remain. Data is often not “AI-ready” — it is commonly unstructured, unlabeled, or lacking context. This limits the ability to apply machine learning or predictive analytics effectively to datasets that are already proving too tricky to normalize and analyze manually. [4]
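One simple, illustrative way to quantify this readiness gap, using hypothetical field names, is to measure what fraction of records carry every field a model would actually need before any training begins:

```python
# Illustrative "AI-readiness" spot check: count records that carry
# every required label and metadata field. Field names are hypothetical.
records = [
    {"sample_id": "A1", "label": "tumor", "units": "mg/dL"},
    {"sample_id": "A2", "label": None, "units": "mg/dL"},
    {"sample_id": "A3", "label": "normal", "units": None},
]

def readiness_report(records, required=("label", "units")):
    """Return the fraction of records with every required field set."""
    complete = sum(
        1 for r in records if all(r.get(f) is not None for f in required)
    )
    return complete / len(records)

score = readiness_report(records)
print(f"{score:.0%} of records are fully labeled")  # 33% here
```

A check this crude obviously cannot assess context or label quality, but even a basic completeness score gives teams a baseline to track as they invest in making data AI-ready.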
At Green Leaf, we help life sciences organizations bridge the gap with our Comprehensive AI partnership framework—a clear approach that guides teams from scattered data to actionable insights. By assessing data readiness, defining use cases, and establishing the right technical and organizational foundations, this partnership enables companies to safely experiment, demonstrate value, and scale AI efforts that turn complex data into meaningful discoveries and measurable results.
Conclusion
In the life sciences industry, data is not just an operational by-product—it is the foundation for scientific discovery, regulatory compliance, and market delivery. However, many organizations underestimate the gap between their current data practices and the level of maturity needed to fully leverage data as a strategic asset.
Organizations should conduct a data readiness assessment (DRA) specifically designed for the life sciences industry, providing a clear, evidence-based overview of current practices and the essential steps for meaningful improvement. Using global best practices, this process helps organizations move from reactive data management to a proactive, value-driven approach.
The increasing volume of information, the proliferation of data classification standards, and regulatory and legal compliance demands all mean that life science organizations must manage ever-larger amounts of data. The most effective solution is intelligent automation. However, to leverage these advanced automated tools, the relevant datasets must be semantically normalized: data science leaders must understand the purpose and context of data collection and apply consistent terminology and measurement units. Doing this manually would be impossible even for the largest organizations. Agentic AI solutions offer an opportunity to support data normalization and information restructuring, unlocking business value.
Partnering with experienced data experts like Green Leaf enables life science organizations to shift from reactive data management to proactive, insight-driven operations. Through data strategy, governance, and AI enablement services, Green Leaf helps clients extract measurable value from their information assets, turning data complexity into an operational advantage.
References
- OpenAI, Life Sciences Data Models Relationships. 2025, OpenAI: Internet.
- Grunge.com, How Long Could a Person Potentially Survive Lost At Sea? – Several Years With a Sustainable Set Up, Getty Images: https://www.grunge.com/1349654/how-long-could-you-survive-lost-at-sea/
- NNIT, Data Maturity Turning Potential into Real Business Value, in Insights. 2025, NNIT: Copenhagen, Denmark.
- Automata, Automation Helping with the Challenges of Data in Life Sciences, in Automata – Insights. 2024, Automata: Newton, MA USA.
- Papadopoulos, S., The paradox of data in precision medicine, in Drug Target Review. 2024, Russell Publishing Ltd.: Brasted, Kent, United Kingdom.
- International Organization for Standardization (ISO), ISO/TS 22220:2011 – Health informatics — Identification of subjects of health care. 2025 [cited 2025 October 31].
- HL7, Fast Healthcare Interoperability Resources (FHIR) v5.0.0 – Executive Summary. 2023 [cited 2025]; Available from: https://hl7.org/fhir/summary.html.
- Freimuth, R.R., et al., Life sciences domain analysis model. J Am Med Inform Assoc, 2012. 19(6): p. 1095–102.
- Regenstrief Institute, Inc., LOINC Quick Start User Guide. 2015, Regenstrief Institute, Inc.: Indianapolis, IN.
- SNOMED International, What is SNOMED CT. 2025; Available from: https://www.snomed.org/what-is-snomed-ct.
- Schadow, G. and C.J. McDonald, The Unified Code for Units of Measure (UCUM): Specification. 2017, Regenstrief Institute, Inc.: Indianapolis, IN.
- CDISC, CDISC Standards. 2025 [cited 2025 November 3]; Available from: https://www.cdisc.org/standards.
- Ferrara, E., Data in the Evolving World of Life Sciences Part 3 in Insights, N. Miner, Editor. 2025, Green Leaf Consulting Group: Ambler, PA USA.
- Odebrecht, C., Research Data Governance. The Need for a System of Cross-organisational Responsibility for the Researcher’s Data Domain. Data Science Journal, 2025. 24.