Talk Data to Me: Conversational AI for FAIR and Accessible Biomedical Data Discovery

  • Room: 80/1-001 - Globe of Science and Innovation - 1st Floor
  • Speaker:
    • Alberto Pepe, CTO, Sage Bionetworks (https://sagebionetworks.org/). Alberto Pepe is a scientist and entrepreneur specializing in data science, research infrastructure, and scholarly communication. He is the CTO of Sage Bionetworks. Previously, he was the co-founder and CEO of Authorea, a collaborative platform for writing and publishing scientific documents, which was acquired by Wiley. Prior to this, he held research roles at Harvard University and the Harvard-Smithsonian Center for Astrophysics, focusing on the analysis and management of large-scale scientific data. Alberto also worked at CERN, where he contributed to projects aimed at improving data sharing and collaboration in high-energy physics. He holds a Ph.D. in Information Science from the University of California, Los Angeles.

This demo will discuss how conversational interfaces can bridge technical, linguistic, and policy gaps in data discovery—particularly for regulated data that must remain access-controlled. It also explores the integration of safeguards such as access permissions, response transparency, and bias auditing.

The biomedical research ecosystem is rich in data but poor in discoverability. Researchers often struggle to identify and evaluate datasets across fragmented platforms, inconsistent metadata schemas, and access-restricted environments. Traditional search interfaces fail to accommodate the exploratory and interdisciplinary nature of modern research.

We propose a new paradigm: using conversational AI to transform metadata search into natural dialogue. Powered by large language models, the chatbot prototype on Synapse.org interprets user intent, translates natural language into structured queries, and surfaces metadata summaries—enabling ethical and efficient discovery even in regulated domains.

Users can ask nuanced questions like “Which Alzheimer's datasets involve Type II Diabetes in patients over 60?” and receive synthesized metadata responses, access notes, and provenance trails. The chatbot supports non-technical users and facilitates equitable access by acting as a semantic translator between scientific domains. Moreover, logs of user queries inform improvements in metadata quality and usability.
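As a rough illustration of the "natural language into structured queries" step, the sketch below takes the example question above and shows what a structured intent and the resulting metadata query might look like. It is not the Synapse.org implementation: the `DatasetIntent` schema, the `dataset_metadata` table, and all column names are hypothetical, and the language-model step that would extract the intent from the user's question is assumed rather than shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DatasetIntent:
    """Structured form of a user's question (hypothetical schema)."""
    primary_condition: str
    comorbidity: Optional[str] = None
    min_age_exclusive: Optional[int] = None


def intent_to_query(
    intent: DatasetIntent, table: str = "dataset_metadata"
) -> Tuple[str, List[object]]:
    """Build a parameterised SQL-style query from a structured intent.

    Values are passed as parameters rather than interpolated into the
    query text. The table name is a fixed constant chosen by the
    developer, not user input.
    """
    clauses = ["primary_condition = ?"]
    params: List[object] = [intent.primary_condition]

    if intent.comorbidity is not None:
        clauses.append("comorbidity = ?")
        params.append(intent.comorbidity)

    if intent.min_age_exclusive is not None:
        # "Patients over 60" means at least one participant older than 60.
        clauses.append("max_participant_age > ?")
        params.append(intent.min_age_exclusive)

    sql = (
        f"SELECT id, name, access_level FROM {table} WHERE "
        + " AND ".join(clauses)
    )
    return sql, params


# Hand-written intent standing in for what a language model might extract
# from: "Which Alzheimer's datasets involve Type II Diabetes in patients
# over 60?"
intent = DatasetIntent(
    primary_condition="Alzheimer's disease",
    comorbidity="Type II Diabetes",
    min_age_exclusive=60,
)
sql, params = intent_to_query(intent)
print(sql)
print(params)
```

Selecting a hypothetical `access_level` column alongside each result is one way the response could carry access notes, so that restricted datasets are described without exposing their contents.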

We argue that conversational AI is not just an interface improvement, but a step toward inclusive and intuitive open science infrastructure.