A framework for modeling human monogenic diseases by deploying organism selection
Drug development requires organismal models to evaluate the efficacy and safety of therapeutic candidates. Most pharmaceutical research uses rodents, assuming they're similar enough to humans to be useful; however, as others have noted [1][2], such models can be expensive, slow, and even inaccurate. Can we unlock new opportunities by studying human diseases in different organisms?
We previously released a computational method to systematically identify similarities between proteins in humans and diverse research organisms by comparing protein secondary structural properties and correcting for phylogenetic relationships [3]. We found that phylogenetic distance doesn't always determine modeling utility; the best predicted organisms for a given gene could sometimes be very unexpected. We created the Zoogle interface, hoping this would make it easier for both basic science researchers and drug developers to use our dataset to create disease models. However, users struggled to leverage Zoogle for their own work.
In an effort to improve the usefulness of our predictions for external users, we tried to use Zoogle ourselves to identify actionable organism–gene pairings. We focused on developing a workflow for a particular user type, namely “organism experts” in biology. Such experts have critical, specialized knowledge about the life cycle, phenotypes, experimental tools, and relevant datasets for their organismal model of choice. They’re often part of larger organismal research communities, which helps with troubleshooting and collaborations. To test our workflow, we worked with experts on two organisms with unique biology that are suitable for genetic experiments — a unicellular protist that's closely related to animals, Salpingoeca rosetta, and a sea squirt that's closely related to vertebrates, Ciona intestinalis.
In this context, we aimed to identify which genes within a given organism might offer the greatest relevance to human biology and disease, helping experts prioritize their experimental efforts. Here, we present a heuristic decision-making approach that combines computational filtering with manual diligence to evaluate gene–disease pairs. We prioritized experimental feasibility and therapeutic impact by evaluating disease mechanisms, protein conservation, available genetic tools, and phenotypic assays. We ultimately identified seven actionable genes in S. rosetta and three in C. intestinalis. We’re funding two academic labs to pursue experimental testing of our predictions.
Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.
Through the Zoogle interface, we present a list of matches between the proteins in an organism’s proteome and the proteins in the human proteome. Each match represents a hypothesis about the utility of modeling a human protein’s function using the homologous protein in a non-human organism. These matches are ranked based on how unexpected their similarity is with respect to the phylogenetic distance between the human and non-human proteins.
What this means on a practical level is that Zoogle presents a ranked list of tens of thousands of predictions of genetic similarity to scientific users. For Salpingoeca rosetta, Zoogle catalogs 27,354 predictions; for Ciona robusta, there are 50,693. How can a scientist determine which, among these thousands of predictions, represents the most actionable set for experimental testing? We combined computational filtering with manual diligence into an overarching framework for winnowing these predictions to an actionable short list for downstream experiments.
When considering what it means to model a human disease, there are many different strategies [1]. The most technically accurate model for human diseases would be humans. However, due to obvious ethical and safety considerations, this isn't the preferred starting point for drug development.
In practice, all drug development relies on disease models. The most common approaches to disease modeling are:

- In vitro cell culture using immortalized cell lines
- Patient-derived primary cells
- Organoids grown from human cells
- Non-human organisms, most often rodents
Borrowing a common saying from statistics, we’d argue that all models are wrong, but some are useful – each strategy has its pitfalls. In vitro cell culture models often use immortalized cell lines with abnormal karyotypes [6]. Patient-derived primary cells have genetic and environmental variability and are expensive to acquire and maintain [1]. Organoids require long experimental timelines, while not fully capturing the complexity of real human tissues [7]. Non-human models have fundamental differences at the molecular level — human and mouse proteins aren't identical and can have drastically different properties, which can lead to costly failures to translate [2][8].
The inaccuracies and inefficiencies of existing models have been recognized by authorities such as the FDA, which recently announced a plan to phase out animal testing requirements for monoclonal antibody therapies.
Our strategy of using unconventional organisms doesn’t overcome concerns with using non-human models. Sequence and structural differences between human and model proteins remain relevant. We account for some of these differences by identifying those pairs of non-human and human proteins with unusually high similarity [3]. But the ultimate goal of modeling diseases using such organisms isn’t to eliminate the use of human cell or mouse models; rather, it's to complement them.
Our hypothesis is that experimentally tractable and more scalable model organisms, such as invertebrates and unicellular eukaryotes, are advantageous and underutilized tools at the earliest stages of therapeutic research and development. Some of these organisms may be more accurate biological models for a particular human disease than rodents are. Others might be comparable to existing models, and also have experimental advantages that complement rodents or in vitro studies, such as tissue-level testing opportunities or cheaper, higher-throughput ways to conduct early screens. Moreover, expanding the list of organisms with the potential for disease modeling provides more avenues for basic science to have translational impact.
Figure 1. Assumptions we’re challenging with our framework.
We aim to challenge three major assumptions about non-human models, summarized in Figure 1.
Figure 2. Graphical abstract illustrating the overall framework. The 27K and 50K protein counts correspond to the number of proteins in the final organism selection dataset for each organism.
What we’re ultimately interested in identifying are genes with the potential to be modeled advantageously in our organisms of interest, where we can identify a measurable phenotype to test therapeutic mechanisms of action. This leads to a simple overall experimental framework (Figure 2, right):

1. Create a mutant version of the candidate gene in the organism.
2. Identify a measurable phenotype associated with the mutation.
3. Use that phenotype as a readout to test therapeutic mechanisms of action.
While this experimental plan is fairly straightforward, choosing which of the thousands of candidate genes to pursue within a given organismal model is far more opaque. We spent a lot of time investigating tractable candidate genes for two example organisms, Salpingoeca rosetta and Ciona intestinalis (check out specific findings in individual pubs about each in [4] and [5]), and describe our overall approach (Figure 2, left) in the rest of this pub.
To understand how organism experts might approach designing experiments using Zoogle, we needed feedback from external scientists. We interviewed a handful of experts in our personal networks whose organisms are included in Zoogle to understand their workflows, the practical constraints of their systems, and how they might act on our predictions.
We decided to work with experts in two academic research laboratories: David Booth’s laboratory at UCSF, which uses S. rosetta, and Alberto Stolfi’s laboratory at the Georgia Institute of Technology, which uses Ciona. These experts provided invaluable feedback during our diligence process.
Skip to “Methods” for nitty-gritty details, or read on to get a big-picture sense of how we tried to select the most useful and feasible disease-associated genes to study in two uncommon organismal models.
Predictions within the organism selection dataset in Zoogle are currently ranked based on the percentile of the phylogenetically-corrected structural distance of proteins within gene families. This is essentially the relative ranking of each protein compared to others in the same gene family. While this metric was easy to implement into a web interface, it doesn’t account for variability among gene families in their size and distribution of distances from human homologs.
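To make the ranking concrete, here’s a minimal pandas sketch of a within-gene-family percentile rank. The table and column names (`gene_family`, `distance`) are hypothetical stand-ins, not the actual schema of our dataset:

```python
import pandas as pd

# Hypothetical table of phylogenetically corrected structural distances
# between non-human proteins and their human homologs.
df = pd.DataFrame({
    "gene_family": ["F1", "F1", "F1", "F2", "F2"],
    "distance":    [0.10, 0.40, 0.90, 0.20, 0.25],
})

# Percentile of each protein's distance relative to others in the same
# gene family; lower percentiles flag unusually human-like proteins.
df["percentile"] = df.groupby("gene_family")["distance"].rank(pct=True)
print(df)
```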
To account for these differences, we included two new metrics aimed at quantifying whether each distance to human homologs was exceptionally similar or not, given the observed distribution of distances in each gene family. Specifically, we used a permutation test-based approach to calculate two p-values: one “within organism,” and one “across organisms.” The within-organism p-value indicates, for each gene family, whether the degree of similarity with respect to the human homolog is exceptional for a given species. In contrast, the across-organism p-value indicates, for each gene family, which species are exceptionally similar to humans.
For a detailed description of how we carried out these analyses, see “Methods.”
Access our code for generating these p-values in an updated version of our organism selection GitHub repo (DOI: 10.5281/zenodo.15693939).
Access the updated organism selection dataset with p-values at this Zenodo deposition (DOI: 10.5281/zenodo.15685124).
We then filtered our genes for each organism using a series of computational steps, illustrated in Figure 3.
Figure 3. Funnel chart illustrating the stages of the computational filtering pipeline and the corresponding number of predictions remaining after each filter.
This set of crude filters is an initial prototype, and we recognize there are many ways to improve upon our approach. Our primary goal in the filtering process was to decrease the number of genes we needed to manually diligence. At the end of this filtering process, we were left with a “long list” of 153 genes in Salpingoeca rosetta and 192 genes in Ciona intestinalis.
Access our filtering pipeline notebook on GitHub. This pipeline also generates hyperlinks to external resources for filtered genes, such as OMIM, OpenTargets, and MARRVEL. We added this functionality because we found it useful in our downstream manual diligence process.
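To give a flavor of that convenience step, here’s a minimal sketch of link generation. The URL patterns are our assumptions about each resource’s scheme, and the example IDs are illustrative; they aren’t necessarily what the notebook emits:

```python
# Hypothetical link builders for manual-diligence resources.
def omim_url(mim_number: str) -> str:
    return f"https://omim.org/entry/{mim_number}"

def opentargets_url(ensembl_gene_id: str) -> str:
    return f"https://platform.opentargets.org/target/{ensembl_gene_id}"

# Example IDs (Huntington disease and the HTT gene).
print(omim_url("143100"))
print(opentargets_url("ENSG00000197386"))
```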
From our long list, we performed manual diligence to evaluate how actionable each possible gene would be for downstream experiments. We considered two high-level questions during our process:

1. Is it experimentally feasible to model this gene in the organism?
2. Would a successful model have therapeutic impact?
We didn't pursue comprehensive diligence for each hypothesis in our long list; rather, we assessed each individual gene until the first point of failure — that is, as soon as we determined that making a model wouldn’t be easy or useful. We also didn’t perform manual diligence on every single member of our long list, as this is time-consuming. Instead, we diligenced ~30–40 genes from each organism, starting with those that had low percentile scores. We added a handful of others based on the research interests of the two research groups we’re funding to experimentally test our Zoogle predictions.
For each of our high-level questions, we cataloged a number of failure modes based on our technical analyses. The details of the technical analyses are described in the Methods section.
A schematic diagram of the desired qualities of a candidate gene and areas of consideration can be found in Figure 4.
Figure 4. Guide showing areas to consider when diligencing essential qualities of a candidate gene. Rows are areas of evaluation; columns are the technical analyses we took into account to perform each evaluation.
When considering experimental feasibility, we encountered several recurring failure modes. These criteria aren’t necessarily dealbreakers for the overall utility of models — rather, they helped us eliminate options that weren’t low-hanging fruit.
When considering therapeutic impact, we cataloged a separate set of failure modes.
Notably, we didn't consider whether there were substantial market opportunities to treat a given disease (whether due to market size, incidence, or degree of unmet need). A challenge for drug development in rare diseases is that economic forces make it difficult to justify the high capital cost of developing a drug for a small number of patients. For our proof-of-concept experiments, focusing on the financial upside would have been prohibitively limiting. Our hope is that our framework can help match academic researchers focused on specific model organisms with rare disease communities to spur transformative research without having to worry about turning a profit.
To get a sense of how it looks to do this process of elimination, check out copies of the working documents we used to catalog our thoughts for each organism:
Below are the technical analyses we performed as part of this work.
We originally quantified the degree of molecular conservation between non-human gene copies and their human homologs within each of the 9,260 gene families containing human proteins that we assessed in our recent pub [3] (see “The approach” for a detailed description of methods). Here, we extended this approach, statistically quantifying our confidence that measured distances were exceptional, whether looking within species and across gene families (“within organism”) or within gene family and across species (“across organism”).
We calculated the within-organism p-value by permuting the distances from human homologs observed within each species and across gene families 10,000 times, determining the number of times a distance was smaller than observed. We calculated the across-organism p-value by permuting the distances within each gene family and across species 10,000 times, with the p-value corresponding to the probability of observing a distance smaller than observed. We carried out all analyses in R [12], with the permutation tests implemented as custom Rcpp scripts (found here) and called by the dist_permute_test function implemented here.
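For readers who want the gist without reading the Rcpp source, here’s a minimal Python sketch of the within-organism test as described above. The function name and data layout are hypothetical; the real implementation is the linked dist_permute_test R code:

```python
import numpy as np

def within_organism_pvalue(family_distances: np.ndarray,
                           observed: float,
                           n_perm: int = 10_000,
                           seed: int = 0) -> float:
    """Fraction of permuted draws from this species' distance distribution
    (pooled across gene families) that are at least as small as the
    observed distance. Small values mean the observed similarity to the
    human homolog is exceptional for this species."""
    rng = np.random.default_rng(seed)
    null = rng.choice(family_distances, size=n_perm, replace=True)
    # Add-one smoothing so the p-value is never exactly zero.
    return (np.sum(null <= observed) + 1) / (n_perm + 1)

# Toy usage with made-up distances.
distances = np.array([0.8, 0.6, 0.9, 0.7, 0.75, 0.85])
print(within_organism_pvalue(distances, observed=0.3))
```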
We manually reviewed the existing literature summaries on the known function of the wild-type protein and the consequences of mutation found in OMIM, MARRVEL [13], and UniProt [14]. For genes we were interested in modeling, we dove deeper by reading the primary literature for each disease.
We manually reviewed the existing literature summaries on human genetic variation found in OMIM’s case studies and in MARRVEL. We used the case studies highlighted in OMIM’s “Allelic Variants” sections to understand the mechanistic underpinnings of mutations that can contribute to disease. We also reviewed the ClinVar [9], Geno2MP, and gnomAD [15] data compiled in MARRVEL to understand the broader scope of human variation.
We used the literature summaries from OMIM and the loss-of-function observed/expected (LoF o/e) score and lethality evaluation from MARRVEL’s gnomAD module to evaluate whether a gene is likely to be lethal upon knockout.
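For example, here’s a minimal pandas sketch of this kind of triage using a gnomAD constraint table. The column names follow gnomAD’s published constraint files as we understand them, and the LOEUF cutoff of 0.35 is a common heuristic rather than a threshold from our process:

```python
import pandas as pd

# A gnomAD gene-constraint table (TSV); column names assume gnomAD's
# published constraint file format.
constraint = pd.read_csv("gnomad_constraint.tsv", sep="\t")

# Genes with a low upper bound on the LoF observed/expected ratio (LOEUF)
# are depleted for loss-of-function variants in humans, hinting that a
# full knockout may be lethal or strongly deleterious.
likely_essential = constraint[constraint["oe_lof_upper"] < 0.35]
print(likely_essential[["gene", "oe_lof", "oe_lof_upper"]])
```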
We used the disease-specific literature summaries from OMIM, disease summaries from MedlinePlus, disease descriptions in Orphanet, and descriptions of patient phenotypes and experience from patient advocacy group websites, if available, to understand the severity of diseases.
We used the literature summaries from OMIM focused on organismal models (usually found in the “Animal Model” section) to understand whether phenotypic information has been generated for organisms such as mice and zebrafish. In some cases, we also checked whether a mouse mutant exists in the Mouse Genome Informatics (MGI) database or used web searches to look for literature not cataloged in OMIM on existing mouse or zebrafish models.
We used the “Approved Drugs and Active Ligands from PHAROS” info in MARRVEL and the protein-specific pages in OpenTargets, as well as treatment information from OMIM, Orphanet, and MedlinePlus to understand the current state of therapeutics for each disease.
We used pre-folded structures from AlphaFold [16], retrieved via UniProt ID, to search proteomes within the Foldseek Search server. We ran Foldseek searches in 3Di/AA mode against the AFDB-Proteome and AFDB-SwissProt databases with a taxonomic filter for human proteins. We evaluated whether the top human hit to the non-human protein matched the predictions in our organism selection pipeline. We also checked for differences in overall protein structure, such as new or differently sized domains. For proteins of very large size or with multiple domains, we didn’t consider a low TM-score to be disqualifying, as TM-score relies on static structural alignment, which doesn't account for the possibility of flexible domains.
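As a sketch of how the top-hit check could be scripted, here’s a minimal parser for Foldseek’s tabular output, assuming the default BLAST-m8-style columns. The file name is hypothetical, and we actually used the Foldseek Search web server rather than a script like this:

```python
import csv

# Default Foldseek tabular columns (BLAST m8 style).
COLS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
        "qstart", "qend", "tstart", "tend", "evalue", "bits"]

def top_hit(m8_path: str) -> dict:
    """Return the highest-scoring hit from a Foldseek tabular result."""
    with open(m8_path) as f:
        rows = [dict(zip(COLS, line)) for line in csv.reader(f, delimiter="\t")]
    return max(rows, key=lambda r: float(r["bits"]))

# Hypothetical check: is the top human hit the homolog Zoogle predicted?
hit = top_hit("foldseek_vs_human.m8")
print(hit["target"], hit["evalue"])
```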
For one pair of proteins — human VWA8 (UniProt ID A3KMH1) and its Ciona homolog (UniProt ID F6QXZ7) — which were too large to be folded by AlphaFold (> 1,500 aa) and therefore aren't included in publicly available datasets, we used ESMFold [17] to generate predicted structures. You can download these structures below.
Download an ESM-predicted structure for Homo sapiens VWA8.
Download an ESM-predicted structure for Ciona intestinalis Vwa8.
For visualization figures, we used the Pairwise Structural Alignment tool found in the RCSB databank [18]. We generated structural alignments using either TM-align [19] or, for large proteins where fixed-body alignment was likely to fail, using jFATCAT [20].
We used Clustal Omega [21], provided by the UniProt web interface, to perform pairwise sequence alignment with default settings.
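If you’d rather script this step, here’s a minimal Biopython sketch for global pairwise alignment. This is an alternative to the Clustal Omega web interface we actually used, and the sequences below are placeholders:

```python
from Bio import Align
from Bio.Align import substitution_matrices

# Global alignment with BLOSUM62 and affine gap penalties.
aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5

# Placeholder sequences; in practice, fetch the real ones from UniProt.
human = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
model = "MKTAYIAKQRQISFVKSHFSRQLEERLGMIEVK"

alignment = aligner.align(human, model)[0]
print(alignment.score)
print(alignment)
```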
We created a lightweight, locally installable Python package called zoogletools to organize the code we used in our computational analyses, which is part of our GitHub repository. This package contains scripts for our filtering pipeline, as well as scripts for visualizing gene expression from S. rosetta and Ciona resources described in our companion pubs.
All of our code for this pub, including the zoogletools package, is in this GitHub repo (DOI: 10.5281/zenodo.15724881).
As part of our evaluation process, we assessed the state of genetic tools in each organism by reviewing existing literature and consulting with experts. This helped us determine what kinds of experiments would be easy to perform without substantial technical development. You can read more about the state of technology for each organism in the corresponding pubs.
We used ChatGPT to help write code and comment our code. We used Claude to help write code, clean up code, comment our code, suggest wording ideas that informed our phrasing choices, write text that we edited, expand on summary text that we provided, and clarify and streamline our text. We also used Cursor to help generate and revise code.
We used plotly (v5.17.0) [22] and arcadia-pycolor (v0.6.3) [23] to generate figures before manual adjustment.
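As one illustration, a funnel chart like Figure 3 takes only a few lines of plotly. The stage labels are ours, and only the S. rosetta start and end counts come from this pub:

```python
import plotly.graph_objects as go

# Illustrative two-stage funnel for S. rosetta: 27,354 Zoogle predictions
# winnowed to a long list of 153 genes (counts from this pub).
fig = go.Figure(go.Funnel(
    y=["All Zoogle predictions", "Filtered long list"],
    x=[27354, 153],
))
fig.show()
```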
You can read more about the specific takeaways from each organism in two companion pubs.
In this pub, we present a framework for leveraging organism selection to identify which monogenic diseases you might effectively model with your favorite research organism. This framework isn’t a computational pipeline; rather, it’s a recorded example of the kind of scientific reasoning researchers perform almost every day.
As we developed our framework, we realized that this work was more challenging and time-consuming than we expected, particularly as novices in working with these organisms. The predictions presented in Zoogle were a useful starting point, but we needed to gather a lot of additional information about diseases and the technologies in each organism to develop an actionable experimental plan. In some cases, we had to onboard to community resources or integrate expert opinions on the most practical ways to test our predictions. Access to such implicit and explicit knowledge was essential.
We’ve sometimes described this work as a miniature version of a qualifying exam, and hope that sharing our framework will help others identify new ways to deploy their favorite research organisms for broader impact on human health. Below, we summarize some of the lessons we learned from this exercise.
When we performed our initial reasoning, we relied on existing literature and publicly available resources to help us understand the state of technology in each organism. Relying on existing literature sometimes failed to give us a clear picture of which experiments were trusted in the field; in other cases, it completely misled us. For example, a single literature report of targeted genetic engineering may not reflect the likelihood of success or its applicability to other cases, nor provide a dependable protocol. Speaking with experts cleared this up quickly. It also helped us understand which data resources were most useful to integrate into our analyses and how to navigate the bespoke datasets that are common in emerging research organisms. When attempting to design experiments in unfamiliar organisms, human experts remain irreplaceable.
We recommend that organism experts who follow our framework carefully consider the unique strengths and limitations of their organism of interest when evaluating which diseases to model. What opportunities are uniquely unlocked by the biology of your organism? And how does your organism provide an advantage over the status quo?
Our final gene lists contained some intuitive examples — for instance, modeling the function of a stereocilia gene in S. rosetta, an organism with a stereocilium, might appear sensible and even obvious. In other cases, our reasoning arrived at highly counterintuitive results. For example, all three of the genes we chose in Ciona are implicated in immunodeficiencies, yet all three might be well-modeled through a completely different cellular process: notochord lumen morphogenesis.
It’s important to note that we didn’t set out to search for either intuitive or counterintuitive examples when performing our reasoning exercise. Starting from the data, we reasoned through the options with a practical lens to arrive at our final short list. It was heartening to see that taking a data-driven approach to choosing scientific questions can lead to surprising and exciting new research directions.
Our most important next step is to evaluate whether our predictions or decision-making steps led to actionable outcomes. We’ll pursue this by funding organism experts to perform experiments based on our predictions. Results from this work will be published openly through modular units (see below). Assuming that our framework proves useful, there are a variety of ways we could imagine improving it.
The framework presented in this pub is a prototype with many possible areas for improvement.
This pub and its organism-specific companion pieces [4][5] are just the start of a longer series of experiments. Stay tuned to learn more about the results of our testing, which the two labs we’re funding for this work will publish through an open, journal-independent approach.
[22] Plotly Technologies Inc. (2015). Collaborative data science. https://plot.ly