Raman spectroscopy enables rapid and inexpensive exploration of biology
To test its utility in analyzing biological samples, we built an open-source Raman spectrometer and collected spectra from chilis, beer, and algae. We could stratify samples, classify replicates, and link spectra with quantitative traits of beer (ABV) and chilis (perceived heat).
Purpose
Raman spectroscopy is a non-destructive technique that provides a unique chemical fingerprint based only on the interaction of light with a sample. It has been used extensively in materials science and, more recently, in biology. This technique doesn’t require molecular or chemical labeling (it’s “label-free”), making it a potentially useful tool for studying organisms without genetic tools.
We wondered if we could build a Raman spectrometer using open-source protocols and use it to rapidly distinguish samples based on chemical properties in a label-free way, with minimal data processing. We decided to try a hackathon to test this idea — we selected three types of samples (beer, chilis, and algae) and found that the spectra were reproducible and had sufficient dynamic range to do comparative analyses. We were able to use the Raman spectra to differentiate the three types of samples and to distinguish subgroups of samples within a given type. Beer sample spectra varied by alcohol content and by type. Chili pepper data clustered by perceived heat (Scoville units) and color. We could differentiate algae by genetic background. Finally, we found that specific spectral regions correlate with quantitative characteristics of beer (alcohol by volume) and chilis (perceived heat).
Our work highlights the utility and ease of this technique. We hope it will empower scientists to capture the chemical composition of samples and extract rich, high-dimensional data from Raman spectra. We imagine this report could also be useful for science educators who want to use the OpenRAMAN resource and our code to run a lab class on Raman spectroscopy. We’d love to know if you try this technique and whether it allows you to distinguish features in a way that isn’t possible or is more difficult using other methods.
All associated code for analyzing the spectral data is available in this GitHub repository.
Data from this pub, including the raw spectra of beer samples, chili peppers (seeds and flesh), and algal samples, are available in the “data” folder of the GitHub repo.
The comprehensive parts list that we used to build the OpenRAMAN spectrometer is in the “resources” folder of the GitHub repo.
Share your thoughts!
Watch a video tutorial on making a PubPub account and commenting. Please feel free to add line-by-line comments anywhere within this text, provide overall feedback by commenting in the box at the bottom of the page, or use the URL for this page in a tweet about this work. Please make all feedback public so other readers can benefit from the discussion.
Background and goals
At Arcadia, we’re mapping genetic and phenotypic diversity across the tree of life to aid in predictive modeling and biological discovery. We’ve recently shown that high-dimensional phenotyping can improve the accuracy of phenotypic models [1] and, likely, genotype-to-phenotype mappings. However, measuring high-dimensional phenotypes is often laborious, most studies only measure one phenotype, and phenotyping often requires you to know what you’re looking for by pre-selecting a specific phenotype to quantify. In this pub, we evaluate the suitability of Raman spectroscopy for high-throughput, high-dimensional agnostic phenotype acquisition.
Raman spectra capture information about the chemical composition of a sample. Samples are briefly exposed to a high-intensity, single-wavelength light source. Most of the light is reflected or scattered elastically and has the same wavelength as the incident light. A minor fraction of the scattered light shifts wavelength. These shifts arise when photons exchange energy (most often losing it) with vibrational or rotational modes of molecules, and they are characteristic of specific chemical bonds. Thus, the spectral distribution and intensity of this inelastically scattered light provide a fingerprint for the chemical bonds in the sample [2].
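As a rough illustration of the wavelength shift described above: Raman shifts are conventionally reported in wavenumbers (cm⁻¹), computed from the incident and scattered wavelengths. The Python sketch below shows that conversion. The 532 nm excitation wavelength and the function names are assumptions chosen for illustration, not values reported in this pub.

```python
def raman_shift_cm1(excitation_nm, scattered_nm):
    """Raman shift in cm^-1 from incident and scattered wavelengths in nm."""
    # 1 nm = 1e-7 cm, so 1e7 / wavelength_nm is the wavenumber in cm^-1.
    return 1e7 / excitation_nm - 1e7 / scattered_nm


def scattered_wavelength_nm(excitation_nm, shift_cm1):
    """Wavelength (nm) of light scattered with a given Raman shift."""
    return 1e7 / (1e7 / excitation_nm - shift_cm1)


EXCITATION_NM = 532.0  # assumed laser wavelength, for illustration only

# Where light with a few example shifts would land on the wavelength axis.
for shift in (500, 1000, 1500, 3000):
    wavelength = scattered_wavelength_nm(EXCITATION_NM, shift)
    print(f"{shift:>5} cm^-1 -> {wavelength:.1f} nm")
```

A positive shift corresponds to scattered light at a longer wavelength than the excitation light, which is the energy-loss case.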
Raman spectra of cells have recently been shown to contain holistic proteomic [3] and expression [4] data. In these studies, the authors used cellular Raman spectra to predict entire proteomes and single-cell expression profiles. Furthermore, we’ve shown that spectra of differing species reflect their phylogenetic relationships [5].
To better evaluate the utility of Raman spectroscopy for the analysis of biological information, we conducted a two-day hackathon [6]. Using a Raman spectrometer (OpenRAMAN) that we had built in advance, we collected spectra for three types of biological samples (beer, chili peppers, and algal species). We then looked to see if we could 1) use the spectra for clustering/classification and trait/feature prediction, and 2) identify the importance of specific wavelengths for these predictive tasks. We selected samples that were likely to have clear and quantifiable dimensions of variation, such as alcohol content for beer and perceived heat for chili peppers.
Raman spectra contain enough information to not only differentiate samples but also to differentiate sample types based on combinations of features. Skip straight to these results or continue reading to review our methodology.
SHOW ME THE DATA: Data from this pub, including the raw spectral data, are available here (DOI: 10.5281/zenodo.11406248).
The approach
We ran an internal hackathon to quickly assess the utility of Raman spectroscopy in analyzing complex biological samples. Hoping to answer this question in just a few days, we chose a low-cost, open-source spectrometer to build ahead of time and test during the hackathon (OpenRAMAN). We designed our experiment to test three types of samples with varying attributes that we expected could be differentiated by their Raman spectra. We selected beer with varying levels of alcohol content (ethanol) and of different varieties representing different brewing yeasts, hops, malt, and other ingredients. We chose chilis that ranged in capsaicin level, color, and state (fresh vs. dried). Finally, we used algae species of varied genetic backgrounds that we were already using in other projects [7].
We built our spectrometer according to the directions for the “Starter Edition” with a few minor changes. Namely, we made the 3D-printed components using inexpensive fused deposition modeling instead of the suggested selective laser sintering due to tool availability. We also modified the inner diameter of the camera bracket from 32 mm to 34 mm to accommodate our camera lens. Finally, the 550 nm dichroic mirror was not available, so we replaced it with a 567 nm dichroic mirror (Thorlabs DMLP567). For ease of communication with our analysis computer, our camera (Teledyne FLIR BFS-U3-16S2M-CS) used a USB 3 interface instead of a Gigabit Ethernet interface.
We’ve put together a comprehensive parts list that includes all the parts we used, plus other necessary tools and materials, which you can find in the “resources” folder of our GitHub repo.
Data collection and sample preparation
From the options available at Berkeley Bowl West (Berkeley, CA, USA), we selected a variety of beers differing in alcohol content (alcohol by volume, ABV) and style. We collected the characteristics of these beers from both brewery web pages and the beer information aggregation website Untappd. These data reflect the values as of March 21st, 2024; given their crowdsourced origin, they’re likely to change over time. For sample preparation, we poured beer into weigh boats, where we agitated the beer to reduce bubbles and carbonation before pipetting 5 µl of each sample onto Parafilm and placing it in the sample chamber of the spectrometer.
We selected 20 chili peppers from Berkeley Bowl West (Berkeley, CA, USA), aiming for a wide distribution of spiciness and color. We dissected fresh and dried whole chili pepper varieties into two different sample types (flesh and seed) using razor blades on aluminum foil. Crushed red pepper flakes contain both seeds and flesh, so we selected a fragment of flesh and a fragment of seed for testing. We cut the flesh into roughly 0.5 cm³ pieces and collected spectra from the interior face. We found that spectra from whole seeds were qualitatively similar to those from dissected seeds, so we’re presenting only spectra captured from whole seeds here, but we included the data acquired from the pepper flesh in our GitHub repo. We used forceps to transfer pepper samples onto Parafilm for data collection.
| Chili variety (as labeled at Berkeley Bowl) | Abbreviation in GitHub repo (arbitrarily assigned) | Chili condition | Perceived heat range (Scoville units) | Median Scoville units | Chili color | Average length (inches) (ChatGPT) | Typical use (ChatGPT) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Green bell | GrBe | Fresh | 0 | 0 | Green | 4.5 | Raw, salads |
| Red Thai | ReTh | Fresh | 110,000 | 110,000 | Red | 1.5 | Curries, soups |
| Hot Italian frying | HoIt | Fresh | 100–1,000 | 550 | Green | 6 | Fried, sauteed |
| Poblano | Pbl | Fresh | 1,000–1,500 | 1,250 | Green | 5 | Stuffed, roasted |
| Ancho (dried poblano) | Ancho | Dried | 1,000–1,500 | 1,250 | Red | 4 | Dried, powdered |
| Hungarian wax | HuWa | Fresh | 1,000–15,000 | 8,000 | Yellow | 4 | Pickled, stuffed |
| Chilaca | Chil | Fresh | 1,000–2,500 | 1,750 | Green | 8 | Dried, sauces |
| Serrano | Serr | Fresh | 10,000–23,000 | 16,500 | Red | 2 | Salsas, raw |
| Chili de arbol | Arbol | Dried | 15,000–30,000 | 22,500 | Red | 2.5 | Dried, powdered |
| Orange habañero | OrHa | Fresh | 150,000–350,000 | 250,000 | Orange | 1.5 | Salsas, hot sauces |
| Red Fresno | Fres | Fresh | 2,500–10,000 | 6,250 | Red | 3 | Salsas, salads |
| Jalapeño | Jala | Fresh | 2,500–8,000 | 5,250 | Green | 3 | Stuffed, salsas |
| Chipotle (dried jalapeno) | Chip | Dried | 2,500–8,000 | 5,250 | Red | 3 | Smoke, sauces |
| Indian long | InLo | Fresh | 25,000–100,000 | 62,500 | Green | 6 | Curries, stir-fries |
| Crushed red | CrRe | Dried | 32,000–48,000 | 40,000 | Red | 0.25 | Dried, seasoning |
| Shishito | Shis | Fresh | 50–200 | 125 | Green | 4 | Grilled, raw |
| Anaheim | Anah | Fresh | 500–2,500 | 1,500 | Green | 5 | Stuffed, roasted |
| Yellow wax | YeWa | Fresh | 5,000–15,000 | 10,000 | Yellow | 4 | Pickled, sauteed |
| Green Thai | GrTh | Fresh | 50,000–100,000 | 75,000 | Green | 1.5 | Curries, soups |
| New Mexico | NeMe | Dried | 800–1,400 | 1,100 | Red | 6 | Stuffed, sauces |
Table 2. Pepper varieties and phenotypes. We collected spectra for both flesh and seed for each sample, but only present data for the seeds here. All chili pepper samples are cultivars within the species Capsicum annuum except the orange habañero (Capsicum chinense).
We collected spectra from several unicellular algae, including freshwater Chlamydomonas reinhardtii, Chlamydomonas smithii, four hybrid strains from crossing these species [7], and the marine alga Isochrysis galbana. Using sterile loops, we transferred algae from solid media culture plates to Parafilm for data collection.
Data analysis
We clustered spectra using linear dimensionality-reduction methods. First, we performed unsupervised clustering of the full spectral dataset via principal component analysis (PCA). We assessed sample relationships by comparing the first two principal components (Figure 3). We then used linear discriminant analysis (LDA) to assess the extent to which we could classify individual samples within each data class (beer, chilis, algae). For each, we used the lda function in the R package MASS [8] to find a linear combination of spectral features that best classified samples (i.e., beer type, chili variety, and algal species). We assessed each LDA by comparing the first two linear discriminants (Figure 4).
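To make the PCA step concrete, here is a minimal sketch of projecting spectra onto their first two principal components. Our analyses were run in R; this is an independent Python/NumPy illustration, not the code from our repository. The spectra are synthetic stand-ins (two groups with a peak at different pixel positions), and the function name `pca_scores` is hypothetical.

```python
import numpy as np


def pca_scores(spectra, n_components=2):
    """Project spectra (rows = samples, columns = pixels) onto the top PCs."""
    X = np.asarray(spectra, dtype=float)
    X_centered = X - X.mean(axis=0)  # center each pixel across samples
    # SVD of the centered matrix: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per PC
    return scores, explained[:n_components]


# Synthetic stand-in data: two sample groups with peaks at different pixels.
rng = np.random.default_rng(0)
pixels = np.arange(200)
peak_a = np.exp(-0.5 * ((pixels - 60) / 8.0) ** 2)
peak_b = np.exp(-0.5 * ((pixels - 140) / 8.0) ** 2)
group_a = peak_a + 0.05 * rng.normal(size=(5, 200))
group_b = peak_b + 0.05 * rng.normal(size=(5, 200))
spectra = np.vstack([group_a, group_b])

scores, explained = pca_scores(spectra)
print("PC1 scores:", np.round(scores[:, 0], 2))
print("Variance explained by PC1, PC2:", np.round(explained, 3))
```

In this toy case the two groups land on opposite sides of PC1. LDA differs in that it uses the sample labels to choose the projection, so it is a supervised counterpart to this unsupervised step.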
Next, we assessed the extent to which we could identify regions of these spectra that correlate with quantitative features of different beers or chilis. Specifically, we examined the alcohol content of each beer (ABV), and, independently, the perceived heat of each chili (Scoville units). We obtained ABV values from each beer can (Table 1) and Scoville units from several sites including Wikipedia, Bonnie Plants, Chili Pepper Madness, and Scoville Scale (Table 2). In cases where a chili variety had a range of reported Scoville values, we used the median. The distribution of Scoville units was highly skewed, so we transformed the data so that we could perform analyses that assume a normal distribution. We added one to all Scoville values to eliminate zeros and transformed these measures using log10. For each sample, we collected between two and four spectra. We used the median of these spectra for subsequent analyses.
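The two preprocessing steps above (the log10 transform after adding one, and taking the median across replicate spectra) can be sketched as follows. This is a Python illustration with hypothetical function names; the Scoville medians come from Table 2, while the replicate intensities are made-up numbers.

```python
import math
from statistics import median


def log_scoville(scoville_units):
    """log10 transform after adding one, so a value of 0 maps to 0."""
    return math.log10(scoville_units + 1)


def median_spectrum(replicates):
    """Pixel-wise median across replicate spectra of equal length."""
    return [median(values) for values in zip(*replicates)]


# Median Scoville units for three varieties in Table 2.
median_scoville = {"GrBe": 0, "Jala": 5250, "OrHa": 250000}
for name, value in median_scoville.items():
    print(name, round(log_scoville(value), 3))

# Three made-up replicate spectra (four pixels each) for one sample.
replicates = [
    [10.0, 52.0, 31.0, 12.0],
    [11.0, 50.0, 35.0, 12.5],
    [9.5, 55.0, 33.0, 11.0],
]
print(median_spectrum(replicates))
```

Adding one before taking the logarithm keeps the green bell pepper (0 Scoville units) in the dataset, since log10(0) is undefined.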
We expect that many of the components of these spectra will not be useful in predicting any particular quantitative feature of the samples. We therefore chose the least absolute shrinkage and selection operator (LASSO) regression [9] as implemented using the glmnet R package (version 4.1.8) [10]. Unlike the ordinary least squares solution to regression problems, this method is regularized using the L1 norm and expects that few model parameters contribute to a trait.
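To show why an L1 penalty suits this setting, the sketch below fits a LASSO by cyclic coordinate descent on synthetic data in which only two of thirty "pixels" influence the trait. It is a Python/NumPy illustration, not the glmnet implementation we used; the function names, the data, and the value of λ are all assumptions chosen for the example.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by cyclic updates."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)  # centering handles the intercept
    yc = y - y.mean()
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    residual = yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the residual that excludes column j.
            rho = Xc[:, j] @ residual / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            residual += Xc[:, j] * (beta[j] - new_beta)
            beta[j] = new_beta
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta


# Synthetic data: 60 samples, 30 pixels, only pixels 3 and 17 matter.
rng = np.random.default_rng(0)
n_samples, n_pixels = 60, 30
X = rng.normal(size=(n_samples, n_pixels))
true_beta = np.zeros(n_pixels)
true_beta[3], true_beta[17] = 2.0, -1.5
y = 5.0 + X @ true_beta + 0.1 * rng.normal(size=n_samples)

intercept, beta = lasso_coordinate_descent(X, y, lam=0.2)
selected = np.flatnonzero(beta)
print("Pixels with nonzero coefficients:", selected)
print("Their coefficients:", np.round(beta[selected], 2))
```

The soft-thresholding step is what sets most coefficients exactly to zero, which ordinary least squares would not do. The surviving coefficients are shrunk toward zero relative to their true values, a known side effect of the penalty.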
LASSO has a single tunable parameter, the L1 penalty (or λ), that determines the degree of regularization. To identify a value of λ that leads to the most usefully predictive model, we took a permutation-based approach. For 5,000 permutations, we randomly subsampled 75% of our data. We then used this 75% to tune λ through cross-validation (according to [10]). We tested the predictions for each permutation on the 25% of data that we didn’t use in the training. Following all permutations, we then used the λ that resulted in the most accurate predictive model to train a final model using all of the data. For significance testing, we calculated confidence intervals for each spectral position (pixel) from these permutations. We considered each location significant at p < 0.05. We note that these are local statistical tests that do not account for the multiple tests conducted in this study. The coefficients resulting from that final model are those presented in Figure 5 and Figure 6.
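The resampling scaffold described above can be sketched in a simplified form. To keep the example short, the Python code below swaps the LASSO for a one-variable least-squares line, uses 500 repeats rather than 5,000, and omits the λ tuning step entirely. It shows only the repeated 75%/25% splits, the held-out error, and a percentile interval for the coefficient. All names and data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def subsample_intervals(xs, ys, n_repeats=500, train_fraction=0.75, seed=0):
    """Repeated random train/test splits with a 95% percentile interval."""
    rng = random.Random(seed)
    n = len(xs)
    n_train = int(round(train_fraction * n))
    slopes, test_errors = [], []
    for _ in range(n_repeats):
        order = rng.sample(range(n), n)  # random permutation of the indices
        train, test = order[:n_train], order[n_train:]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Mean squared error on the held-out 25%.
        mse = mean((ys[i] - (intercept + slope * xs[i])) ** 2 for i in test)
        slopes.append(slope)
        test_errors.append(mse)
    slopes.sort()
    lower = slopes[int(0.025 * (n_repeats - 1))]
    upper = slopes[int(0.975 * (n_repeats - 1))]
    return (lower, upper), mean(test_errors)


# Synthetic data: intensity at one pixel vs. a quantitative trait.
data_rng = random.Random(1)
intensities = [float(i) for i in range(20)]
trait = [4.0 + 0.5 * x + data_rng.gauss(0.0, 0.3) for x in intensities]

(lower, upper), held_out_mse = subsample_intervals(intensities, trait)
print(f"95% interval for the slope: ({lower:.3f}, {upper:.3f})")
print(f"Mean held-out MSE: {held_out_mse:.3f}")
```

Under this reading, a spectral position would count as significant at p < 0.05 when its 95% interval excludes zero. That is one plausible interpretation of the procedure, not a statement of exactly how our scripts compute it.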
All code generated and used for the pub is available in this GitHub repository (DOI: 10.5281/zenodo.11406248), including scripts and notebooks used for processing and visualizing the data.
Additional methods
We used ChatGPT to help write code and add comments to our code. We also used it to generate the average length and typical uses of the peppers in Table 2.
The results
SHOW ME THE DATA: Access our raw Raman spectral data here.
Raw spectra are reproducible across technical replicates
Since spectroscopic measurements can be influenced by various noise sources (sample heterogeneity, hardware variability, fluorescence), we wanted to qualitatively assess how consistent our spectra were before attempting more complex analyses (Figure 2). Encouragingly, spectra were similar within sample type (e.g., within beer or chilis) and reproducible across technical replicates (Figure 2). Furthermore, the spectra differed across sample types (Figure 2). Some of these differences seemed to reflect readily apparent features of the samples. For example, samples with “greener” color (green/yellow chilis and algae) seemed to have increased spectral intensity in the 1,200–1,400 pixel region (consistent with chlorophyll fluorescence; Figure 2, B–C). Similarly, light beers displayed a spectral peak between 1,300 and 1,400 px that other beer types lacked (Figure 2, A). We concluded that our measurements were sufficiently consistent, and displayed enough dynamic range across samples, that quantitative analyses would be interesting to pursue.
Clustering the spectra lets us separate samples by type
A potential benefit of Raman spectroscopy is that a single rapidly acquired measurement may provide enough information to classify complex biological samples. We explored this possibility by performing unsupervised clustering via principal component analysis (PCA) on raw spectra. We reasoned that the outcome of the PCA could inform us about the structure and richness of information contained within the spectra. For example, if we observed extreme mixing of samples among the principal components (i.e., no clustering), then we might conclude that the spectra are either too complex or too noisy to easily identify samples from raw measurements. On the other hand, if we found tight clusters corresponding to sample type, then spectra may be highly sample-specific but lack enough quantitative information to usefully stratify similar samples based on their biochemical differences.
Comparing the first two principal components, we qualitatively found that samples largely clustered by type and that we could separate them linearly (Figure 3). For example, PC1 appeared to mostly separate algae from the other samples, while PC2 delineated beer from chilis (Figure 3). Sample types also displayed qualitatively differing amounts of variation. Algae samples were the most variable, followed by beer and then chilis (Figure 3). These findings suggest that our spectra fall in between the two extremes outlined above: they contain enough information to cluster sample types but there is also measurable variation within the different sample types (i.e., beer, chilis, and algae). This encouraged us to explore the nuances of spectral data within sample types.
We were interested to see how a classifier might perform when applied to our spectra. Specifically, for each sample type, we used linear discriminant analysis (LDA) to build a linear classifier that predicts the identity of individual samples from their spectra. We found that, in each case, the first two linear discriminants grouped technical replicates together.
Individual beer samples clustered approximately according to their alcohol content. The three highest-ABV beers clustered together: Dark Majik at 11%, Sneaky AF at 10%, and Big Love at 9% (Figure 4, A). Interestingly, though two of these three are IPAs, similar-style beers like a second Hazy Double IPA did not join this cluster. We also found that three of the lighter-style beers with lower alcohol content clustered together, including the Kolsch Kolchstastic at 5.2%, the lager Helles (Long Nights Edition) at 4.9%, and the light lager Party Wave at 4.2% (Figure 4, A). The key exception was the pilsner, Temescal Pils (5.0%), which did not cluster with the other lighter-style, low-alcohol beers. Instead, the pilsner joined the third cluster, which includes beers with an intermediate ABV (Figure 4, A).
The chili seed samples tended to be sorted by color of the chili on LD1, with the red chilis and the various dried chilis to the left and the green chilis to the right (Figure 4, B). Across samples, one of the dominant signals was pigment fluorescence, including chlorophyll and carotenoids. This held true even for chili seeds.
Finally, we found that each algal sample clustered independently, demonstrating that the cross between Chlamydomonas reinhardtii and Chlamydomonas smithii resulted in unique progeny that are differentiable from either parent (Figure 4, C). This suggests that the genetic and resultant physiological and chemical differences between these unique hybrid strains are captured in Raman spectra. These spectra can be used as high-dimensional phenotypes to differentiate both species and strains and potentially improve genotype-to-phenotype mappings [1].
Specific regions of the spectra correlate with quantitative features of the samples
Our clustering results show that these Raman spectra contain sufficient information to identify individual biological samples, suggesting they might also contain information about quantitative features that varied across those same samples. To test this possibility, we identified spectral regions that significantly capture information about beer alcohol content (ABV) and the perceived heat of a chili (Scoville units). We did not analyze quantitative traits for algae because we tested fewer individual samples (i.e., strains) than we did for chilis and beer. For both ABV (Figure 5) and Scoville units (Figure 6), we fit a LASSO model, a regularized form of regression, where intensities at individual spectral positions were independent variables and the quantitative trait was the dependent variable. We chose LASSO because it is effective in cases where only very few of the model parameters (intensity at individual pixels in the spectra) influence the response variable, something we expect to be true for these data. We optimized our model for the prediction of “test” data not used during training. Therefore, significant spectral features are predictive of the particular quantitative trait. We determined the significance of each spectral position by permutation test (see “Data analysis” for details).
Our analysis of beer samples identified several regions of Raman spectra that significantly predict ABV (Figure 5, bootstrapped confidence intervals, p < 0.05). Although the LASSO regression treats each spectral position as independent of the others, the spectral positions with significant coefficients appear (qualitatively) to cluster in spectral space, though we did not formally test this. For instance, the major peaks in spectral intensity for lower-ABV beers are often flanked by spectral positions with significant coefficients (Figure 5, B). There are apparent clusters of significant coefficients at these positions, where the intensity of Raman signal begins to shift. Thus, we can use these spectra to identify features that significantly predict the ABV of a sample.
Across the chili seed samples, chlorophyll fluorescence drove much of the variation (Figure 6, A, pixels 1,200–1,440). Despite this, we identified spectral regions that predict perceived heat (Figure 6, B; bootstrapped confidence intervals, p < 0.05). The regression coefficients for spectral regions with variation driven by chlorophyll or carotenoid fluorescence (Figure 6, B; pixels 1,200–1,440) are much smaller than coefficients for other sections of the spectra. This pattern could indicate that chemicals causing Raman shifts in this spectral range contribute less to a pepper’s perceived heat than chemicals causing Raman shifts in other spectral ranges. Alternatively, it could be that the strong chlorophyll or carotenoid fluorescence reduces our ability to estimate the contribution of truly meaningful features. A less exploratory study would benefit from more rigorous control of this confounder. One could explore this further by comparing the spectral data from seeds to flesh and isolating the spectral contribution of the pigment (chlorophyll and carotenoids). Though not presented here, our data from the analysis of chili flesh samples are also available in our GitHub repository.
The analyses of both beer and chilis show that these spectra contain information about quantitative features of these biological samples and we can identify the components of the spectra that contribute to these features.
Key takeaways
Raman spectroscopy yields meaningful data about the chemical composition of biological samples, and there’s a cheap, quick, easy, and open-source way to build your own Raman spectrometer (OpenRAMAN).
Testing the OpenRAMAN spectrometer on chilis, beer, and algae showed that this approach is sufficient to classify samples by their spectra and associate them with quantitative traits.
High-dimensional phenotyping through Raman spectroscopy is useful and accessible.
Next steps
In this pub, we rapidly tested the feasibility of using a tool for our downstream work by running a hackathon. This hackathon structure was quite useful for constraining a small project in time and scope and we’ll likely try it again in the future. Because of the ease of data collection and application of machine learning algorithms, we’ll continue to leverage Raman spectroscopy, including using the inexpensive OpenRAMAN spectrometer, as a powerful approach for probing biology. We’d like to help make Raman spectra from biological samples easier to interpret, so we’d love to hear if there are any Raman-focused FAIR databases that would be appropriate for these spectra. We’ve shared our data in the GitHub repo associated with this pub, but it would be great to make them more discoverable and contribute to a shared, centralized resource.
I love this study! I’m curious if these spectra reflect a direct measure of ABV, or whether it’s more of an indirect measure. An interesting follow-up would be to add pure ethanol to raise ABV in the lower-ABV samples to see how the spectra are modified, and whether they shift to match higher-ABV beers.
Tara Essock-Burns:
Hi Fabrice, thanks for your interest in the pub. We expect the Raman signal shown here to be an indirect measure of ABV, because we can’t ascribe a particular feature we’re tracking. The spectra we show here have a mixture of Raman and other signals, like fluorescence. Because ethanol is only one component in the beer and we performed PCA and LDA on the full spectra (without subtracting out fluorescence), we see that ABV explains some of the sorting on the LDA (Fig 4A) but not all. It would be interesting to do your follow-up study to isolate the Raman signal of ethanol and test how adding ethanol changes the readout for low-ABV beers. In fact, it seems many other people use Raman spectroscopy to test beers. See the study described in this link as an example for looking for changes in particular peaks, indicative of modes that are in the ethanol Raman spectrum, correlated to ABV (https://www.metrohm.com/en/applications/application-notes/raman-anram/an-rs-041.html).
Keep an eye out for upcoming pubs with improvements to our Raman spectroscopy approaches and we’d love to hear your feedback!
Sanchari Saha:
Have you ever thought about comparing the Raman spectra of complex materials with those of their original components or ingredients? This method could offer valuable insights when correlating genotypes and phenotypes of other species through Raman spectroscopy. For instance, in the case of beer have you looked at whether the Raman spectra of ingredients like wheat, barley, and hops match the spectra of the final product, such as IPA beer? If so, did this approach improve the correlation or prediction of genotypes and phenotypes in other complex biological species?
Ben Braverman:
Hi Sanchari, thank you for your question. We have not tried to get Raman spectra of the individual components of beer, but I think that would be a great follow up to this experiment. We are continuing to improve the spectra quality from our OpenRaman system and hope to share more soon!
Sunanda Sharma:
The camera listed in the text and the BOM do not appear to be the same. It might also be useful to add a sentence just above this line saying that the camera is otherwise functionally similar to the original OpenRaman camera - same pixel size, number of pixels etc.
Ben Braverman:
Thank you for pointing this out. The camera in the BOM is the one recommended on the OpenRaman website; we opted to use a USB version that we already had on hand. The camera we used has a smaller sensor size (1440 x 1080 px) compared to the recommended camera (2048 x 1536 px). We have since upgraded the camera to the BFS-U3-31S4M-C, which has a 2048 x 1536 resolution and a USB interface instead of a GigE interface.
Manon Morin:
Were the spectra subjected to any preprocessing steps (blank subtraction, baseline correction, or normalization)? Additionally, were the technical replicates averaged before clustering the spectra, or were they analyzed separately?
Ryan Lane:
Hi Manon, thank you for the question! The spectra were indeed pre-processed with background subtraction, and we will update the text to reflect that. The technical replicates were only averaged for illustrative purposes in Figure 2 — they were not averaged prior to clustering in the analysis.
Manon Morin:
Could you specify the data acquisition parameters used for each type of sample, including exposure time, number of frames, and the number of technical replicates?
Ben Braverman:
Hi Manon, thanks for your question. Our exposure time was about 2000ms, our number of frames per acquisition was 5, and the number of technical replicates was 3.