Modeling human monogenic diseases using the choanoflagellate Salpingoeca rosetta
We developed a framework for identifying actionable disease–gene pairs from organism selection data [1]. Here, we apply that framework to genes in the choanoflagellate Salpingoeca rosetta and identify seven genes to pursue for experimental testing. We share information about S. rosetta, including its life history and currently available genetic tools. We also provide examples of genes we rejected or were particularly excited about in our decision-making process.
Scientists interested in using choanoflagellates or other microbial eukaryotes to model human disease may find our analyses and experimental design process useful in their work. We include scripts to help visualize gene expression in S. rosetta and a summary of the state of technology, which can aid in scientific decision-making.
Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.
Salpingoeca rosetta is a choanoflagellate and an emerging genetic model organism [2]. Choanoflagellates are a group of microbial eukaryotes and the closest known living relatives of animals [3]. Research groups often study this diverse group of organisms to understand the origins of multicellularity. Choanoflagellates often have transient multicellular life stages generated through cell division [4]. Cooperativity in these colonies provides insight into the transition from unicellular to multicellular life.
Free-swimming single S. rosetta cells have a stereotypical cell shape characterized by a flagellum circumscribed by an array of microvilli referred to as a “collar” [5] (Figure 1, A). S. rosetta cells can divide to form fragile “chains” of related cells attached by cell–cell adhesion [5]. S. rosetta also forms spherical colonies referred to as “rosettes” in response to a sulfonolipid produced by the bacterium Algoriphagus machipongonensis [6]. In addition to its multicellular life stages, S. rosetta also has diverse unicellular life stages, including slow swimmers, fast swimmers, and thecate cells (Figure 1, A) [5]. S. rosetta can even form amoeboid cells in response to physical compression [7].
Figure 1. Overview of diligence on Salpingoeca rosetta.
(A) Illustration of different cell types of S. rosetta and schematic of available genetic tools and types of human alleles that we expect to be able to model, given available tools. Cell types include slow swimmers, rosettes, fast swimmers, and thecate cells.
(B) Illustration of manual diligencing outcomes listing the number of genes eliminated by each failure mode. The initial short list was seven genes, all of which we selected for experimental testing.
S. rosetta has robust genetic tools and genomic resources.
You can generate gene expression visualizations from S. rosetta data by following the examples in this Jupyter notebook.
We manually diligenced 41 different genes in Salpingoeca rosetta to arrive at seven potentially actionable genes for pilot experiments (Figure 1, B).
You can see a copy of the working document we used to catalog our thoughts for each gene we diligenced in S. rosetta here.
Download Foldseek results for the S. rosetta plastin 1 (F2TWP3) and coronin 1a (F2TZV2) homologs, plus ProteinCartography data for the S. rosetta homolog of the protein encoded by DDC (F2UCR6) from Zenodo (DOI: 10.5281/zenodo.15724261).
Here, we describe our specific reasoning for rejecting a handful of select genes from our short list for S. rosetta to give you a sense of the variety of reasons we might have passed on a given gene.
Lack of homolog confidence
The DDC gene is implicated in deficiency of aromatic-L-amino-acid decarboxylase, which leads to severe postembryonic developmental phenotypes in patients [13]. S. rosetta is a top hit for this gene in our organism selection analysis. However, the S. rosetta protein (UniProt ID F2UCR6) is putatively annotated as “glutamate decarboxylase,” and Foldseek searches with this protein don’t return the protein encoded by DDC as a top hit.
To understand the broader scope of related gene families, we ran sequence- and structure-based searching and clustering using ProteinCartography [14] and observed that the S. rosetta protein is found in a different Leiden cluster (LC00) than dopamine decarboxylases from vertebrates (mostly in LC12) and invertebrates (LC05) (Figure 2; download inputs and results from Zenodo). This suggests that the S. rosetta protein may not be a dopamine decarboxylase. Due to this uncertainty, we rejected this candidate gene.
Figure 2. Interactive scatter plot that ProteinCartography generated for the choanoflagellate homolog of the protein encoded by DDC identified in the organism selection dataset.
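The cluster comparison described above can be illustrated with a minimal sketch. It assumes the cluster assignments have already been exported as simple (protein, group, Leiden cluster) records. Every protein ID other than F2UCR6 is a hypothetical placeholder, and the record layout is an assumption rather than ProteinCartography's actual output format.

```python
from collections import Counter, defaultdict

# Placeholder records: (protein_id, group, leiden_cluster).
# Only F2UCR6 and the cluster labels come from the text; all other IDs
# and the number of reference proteins are hypothetical.
RECORDS = [
    ("F2UCR6", "query", "LC00"),
    ("vertebrate_ddc_1", "vertebrate", "LC12"),
    ("vertebrate_ddc_2", "vertebrate", "LC12"),
    ("vertebrate_ddc_3", "vertebrate", "LC12"),
    ("invertebrate_ddc_1", "invertebrate", "LC05"),
    ("invertebrate_ddc_2", "invertebrate", "LC05"),
]


def tally_clusters(records):
    """Count how many proteins from each group fall into each Leiden cluster."""
    tally = defaultdict(Counter)
    for _protein_id, group, cluster in records:
        tally[group][cluster] += 1
    return tally


def majority_cluster(tally, group):
    """Return the cluster label that holds the most proteins from `group`."""
    return tally[group].most_common(1)[0][0]


tally = tally_clusters(RECORDS)
query_cluster = majority_cluster(tally, "query")
for group in ("vertebrate", "invertebrate"):
    reference = majority_cluster(tally, group)
    verdict = "shared" if reference == query_cluster else "different"
    print(f"{group}: {reference} vs query {query_cluster} ({verdict})")
```

A query that lands in a different cluster from every reference group, as here, is the pattern that led us to treat the homolog assignment as uncertain.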
Lack of unmet need
Trehalose is a disaccharide commonly found in mushrooms and yeast. Patients with a deficiency in α,α-trehalase are unable to process this sugar, resulting in gastrointestinal symptoms such as vomiting and diarrhea [15]. While unpleasant for patients, this disease is treatable by omitting food products containing mushrooms and yeast in patient diets. Due to the low severity of the disease’s symptoms and ease of treatment, we didn’t consider this disease to have a high unmet need, and therefore, we rejected it.
Treatment not possible
Homozygous loss of function of HYLS1 causes a lethal malformation syndrome with defects across organ systems, resulting in stillbirth or death shortly after birth [16]. Due to the lethal and developmental nature of this disease, we didn't believe it would be possible to develop a therapeutic, and therefore rejected this gene.
Here are some of the best-supported case studies from our short list in S. rosetta.
Figure 3. Structural comparisons of top-selected short-list proteins to human proteins.
(A) Plastin 1
(B) Coronin 1a
(C) Beta-1,3-N-acetylgalactosaminyltransferase 2
Tan proteins are human structures; blue proteins are predicted S. rosetta homolog structures. “Trait distance” is a multivariate Mahalanobis distance between pairs of proteins calculated based on 10 physicochemical properties (see the “Methods” section of our previous pub for more details [17]). “Portfolio rank” is the relative rank of this protein compared to all other proteins found in the organism selection/the Zoogle portal. TM-score is a standard measure of solid-body protein structural similarity [18]. In general, TM-scores of 0.5 or above indicate the same fold.
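To make the “trait distance” concrete, here is a minimal sketch of a Mahalanobis distance between two proteins described by 10 numeric properties. The feature values are randomly generated placeholders, and the choice of background set for the covariance is an assumption. The actual calculation is described in the “Methods” section of our previous pub [17] and may differ in its details.

```python
import numpy as np


def mahalanobis_distance(x, y, background):
    """Mahalanobis distance between feature vectors x and y.

    `background` is a (proteins x properties) matrix used to estimate the
    covariance between properties. Which proteins belong in it is an
    assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.cov(np.asarray(background, dtype=float), rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates collinear properties
    diff = x - y
    return float(np.sqrt(diff @ inv_cov @ diff))


# Placeholder data: 200 proteins x 10 physicochemical properties.
rng = np.random.default_rng(0)
background = rng.normal(size=(200, 10))
human_protein = background[0]
homolog = background[1]

trait_distance = mahalanobis_distance(human_protein, homolog, background)
print(f"trait distance: {trait_distance:.3f}")
```

Unlike a plain Euclidean distance, this measure rescales each property by its variance and discounts properties that are correlated with one another, so no single property dominates the comparison.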
Plastin 1, also known as fimbrin, is an actin-bundling protein implicated in autosomal-dominant hearing loss [19]. Missense mutations in this protein lead to destabilization of plastin binding to actin and progressive or non-progressive hearing loss with variable age of onset [19]. Although the human hearing-loss alleles are missense mutations, mouse knockouts reproduce key features of the human disease. Pls1 knockout mice show defects in the morphology of stereocilia on ear hair cells [20], which are the mechanosensitive cells in the cochlea responsible for transducing sound into an electrical signal. Knockout mice also show an increasing loss of hearing as they age [20].
Choanoflagellate cells normally exhibit a stereociliar structure, the collar, which resembles ear hair cell stereocilia [21]. This structure is involved in feeding and possibly in mechanosensation [22]. The choanoflagellate plastin homolog shows high structural similarity (TM-score = 0.9) to plastin 1 and slightly lower structural similarity to two other major human homologs of plastin, LCP1 (also known as plastin 2, TM-score = 0.88) and plastin 3 (TM-score = 0.85) (Figure 3) (download Foldseek results from Zenodo). The choanoflagellate gene also appears to be expressed in all four cell states in RNA-seq data, with particularly high expression in fast swimmers (Figure 4).
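The TM-score comparison above can be sketched as a small ranking step over tabular Foldseek hits. This example assumes the results were exported as tab-separated rows of query, target, and TM-score. That column layout is an assumption, and the targets are written as gene symbols for readability rather than as the identifiers found in the real results on Zenodo. The scores are the ones quoted in the text.

```python
import csv
import io

# Illustrative rows: query, target, TM-score (tab-separated).
# Scores are those quoted in the text; the layout is an assumption.
EXAMPLE_HITS = (
    "F2TWP3\tPLS3\t0.85\n"
    "F2TWP3\tPLS1\t0.90\n"
    "F2TWP3\tLCP1\t0.88\n"
)

# TM-scores of 0.5 or above generally indicate the same fold.
SAME_FOLD_THRESHOLD = 0.5


def rank_hits(tsv_text, threshold=SAME_FOLD_THRESHOLD):
    """Keep hits at or above the threshold, sorted from best to worst."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    hits = [(target, float(score)) for _query, target, score in reader]
    kept = [hit for hit in hits if hit[1] >= threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


ranked = rank_hits(EXAMPLE_HITS)
for target, score in ranked:
    print(f"{target}\t{score:.2f}")
```

All three human plastins clear the same-fold threshold by a wide margin, so the ranking mainly shows that plastin 1 is the closest match rather than the only plausible one.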
Given the high degree of structural similarity between the human and choanoflagellate plastins, we were interested in generating knockouts of choanoflagellate plastin and assaying for possible phenotypes. Modeling plastin function in choanoflagellates also poses certain advantages over modeling the same function in mice or in differentiated human stem cells. Protocols for differentiating iPSC-derived hair-like cells take 50–60 days to complete [23], while assaying for strong phenotypes in mice can take months, due to an increase in hearing loss signal as the animals age [20]. In contrast, choanoflagellate cells exhibit collar morphologies at all life stages. Generating knockouts, identifying possible phenotypes, and screening for molecules that rescue plastin function can all be performed with greater speed in choanoflagellates than in existing models. For these reasons, we were particularly excited to model PLS1 function using S. rosetta.
Figure 4. Bulk gene expression data for S. rosetta.
(A) Illustration of different cell states of S. rosetta, including slow swimmers, rosettes, fast swimmers, and thecate cells. Colors used for different cell states are reflected across the remaining charts.
(B–D) Bulk RNA-seq expression of pls1 (B), coro1a (C), and b4galt7 (D) across four cell states.
(B′–D′) Matrix showing degrees of significance of differential expression between pairs of cell states for pls1 (B′), coro1a (C′), and b4galt7 (D′).
Coronin 1a is associated with severe combined immunodeficiency, also known as immunodeficiency 8 (IMD8). Patients with loss of coronin have early-childhood onset of recurrent infections, often associated with Epstein-Barr virus [24]. Mice with missense mutations in Coro1a show defects in actin localization in the leading edge of T cells [25].
Choanoflagellate cells produce a variety of actin-mediated cell structures, such as ruffles, filopodia, exocytic cups, and others [22][5][26][7]. The choanoflagellate homolog of coronin 1a also has a high structural similarity to multiple coronins (the top seven hits in our Foldseek search included CORO6, CORO1C, CORO2B, CORO1A, CORO1B, CORO2A, and CORO7) (Figure 3) (download Foldseek results from Zenodo). This gene is also expressed across cell states with three different expression classes: slow swimmers and rosettes, fast swimmers, and thecate cells (Figure 4). Due to the high number of different phenotypes that could be captured in S. rosetta, we felt confident that we could identify some defect in cells with a mutant version of this gene.
Missense, nonsense, and frameshift mutations in B3GALNT2, which encodes the glycosylation enzyme beta-1,3-N-acetylgalactosaminyltransferase 2, are associated with a form of muscular dystrophy accompanied by congenital brain and eye abnormalities [27][28]. Morpholino knockdown of the zebrafish homolog of beta-1,3-N-acetylgalactosaminyltransferase 2 results in severe morphological defects, including those of the eye, brain, and spinal cord [28]. Mutations in B3GALNT2 also appear to cause hydrocephalus in horses [29].
The S. rosetta homolog of beta-1,3-N-acetylgalactosaminyltransferase 2 is a strong structural match (TM-score = 0.79) to the human protein (Figure 3) and is expressed at low levels in all cells, with slightly higher levels in thecate cells (Figure 4).
Glycosylation plays an important role in the proper development of choanoflagellate rosettes. Recent work has shown that loss of function of glycosylation enzymes can lead to “clumpy” choanoflagellate cells that stick together [30]. This can be assayed by mixing non-fluorescent and fluorescent choanoflagellate cells and looking for aggregates across strains. We expect that loss of function in this particular glycosylation enzyme may produce similar phenotypes, which can be easily measured using this assay. For these reasons, we decided to pursue this gene.
Four other genes, B4GALT7, CLTC, UNC13D, and GALC, also made our short list (Table 1).
HGNC gene symbol (UniProt ID) | Human protein name | Associated human disease (OMIM) | S. rosetta protein UniProt ID | S. rosetta identifier | Possible S. rosetta phenotypes |
PLS1 | Plastin 1 | | | PTSG_00515 | Effects on collar morphology |
CORO1A | Coronin 1a | | | PTSG_01270 | Effects on membrane dynamics |
B4GALT7 (Q9UBV7) | Beta-1,4-galactosyltransferase 7 | | | PTSG_07984 | Effects on glycosylation, cell clumping |
CLTC (Q00610) | Clathrin heavy chain | | | PTSG_02909 | Effects on membrane dynamics |
UNC13D | Unc-13 homolog D | | | PTSG_11723 | Effects on exocytosis |
GALC (P54803) | Galactosylceramidase | | | PTSG_00266 | Effects on glycosylation, lysosomes |
B3GALNT2 (Q8NCR0) | Beta-1,3-N-acetylgalactosaminyltransferase 2 | Muscular dystrophy-dystroglycanopathy (congenital with brain and eye anomalies), type A, 11 | | PTSG_06143 | Effects on glycosylation, cell clumping |
Table 1. Summary of the relevant diseases and our initial guess for phenotypes for each short-list gene.
HGNC: HUGO gene nomenclature committee.
We sought expert feedback on our short list from David Booth, a choanoflagellate researcher at UCSF, one of the external scientists we consulted as part of this work. While we didn’t have strong phenotypic hypotheses for all seven genes on our short list, given the ease of performing simple phenotypic assays such as growth, morphology, cell state transitions, and cell clumping, we decided to pursue all of our hits. Below is a rough overview of our projected experimental plan in S. rosetta.
For details about the technical analyses described in this pub, see the companion pub describing our evaluation framework.
All code related to this pub is available in this GitHub repo (DOI: 10.5281/zenodo.15707938). Code specific to S. rosetta is in this analysis notebook and these visualization scripts.
We downloaded the S. rosetta cell type-specific RNA-seq data from this FigShare deposition and used a custom script found in our GitHub repository to visualize gene expression counts from bulk data for four cell types: slow swimmers, fast swimmers, thecate cells, and rosettes. We also visualized the pairwise differential-expression p-values between cell types, as shown in Figure 4.
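As a rough illustration of the pairwise significance matrices, the sketch below turns a set of pairwise p-values into a symmetric grid of significance labels. It is not the script from our repository. The p-values are placeholders, and the cutoffs are conventional choices that we assume here for illustration.

```python
from itertools import combinations

CELL_STATES = ["slow swimmer", "rosette", "fast swimmer", "thecate"]


def significance_label(p_value):
    """Map a p-value to a label. Cutoffs are conventional, not from the pub."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"


def significance_matrix(pairwise_p, states=CELL_STATES):
    """Build a symmetric matrix of labels from {(state_a, state_b): p_value}."""
    index = {state: i for i, state in enumerate(states)}
    matrix = [["-" for _ in states] for _ in states]  # "-" marks the diagonal
    for (state_a, state_b), p_value in pairwise_p.items():
        label = significance_label(p_value)
        matrix[index[state_a]][index[state_b]] = label
        matrix[index[state_b]][index[state_a]] = label
    return matrix


# Placeholder p-values for the six pairs of cell states.
placeholder_p = dict(
    zip(combinations(CELL_STATES, 2), [0.2, 0.0004, 0.03, 0.008, 0.6, 0.04])
)

matrix = significance_matrix(placeholder_p)
for state, row in zip(CELL_STATES, matrix):
    print(f"{state:>12}  " + "  ".join(f"{label:>3}" for label in row))
```

The resulting grid can be passed to a plotting library as a heatmap, using the same cell-state order as the expression charts so the panels line up.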
We used Claude to help write code, clean up code, comment our code, draft text that we edited, suggest wording ideas for small phrases and sentence structures, and clarify and streamline text that we wrote. We also used Cursor to access Claude.
We used plotly (v5.17.0) [31] and arcadia-pycolor (v0.6.3) [32] to generate figures before manual adjustment.
During our diligence process, we came to realize a variety of advantages that make S. rosetta a valuable potential model for human diseases, including:
Our diligence process uncovered just seven of many possible disease-causing genes that researchers can feasibly model using S. rosetta. Moreover, we sought to identify “low-hanging fruit” among our candidates. Scientists with greater experience with particular types of assays or a willingness to generate targeted mutations might be able to pursue disease–gene pairs that we rejected in our analysis. We encourage others to review our working diligence document and see whether their evaluation of feasibility differs from ours.
We’re funding downstream work by researchers in the David Booth lab at UCSF. They’ll share their results openly during their experiments. Stay tuned for more!
Plotly Technologies Inc. (2015). Collaborative data science. https://plot.ly