Automating identification and quantification of mouse scratch behavior in video recordings
We sought to speed drug discovery by building an end-to-end video analysis pipeline to assess itch in mice treated with itch-inducing substances (pruritogens) and tick extracts that contain compounds that may alleviate itch. The pipeline leverages machine learning (ML)-based pose estimation to track mouse scratching behavior and automation in the cloud to accelerate analysis.
This resource might be useful to researchers working on itch, applying ML-based models to assess other behavioral readouts, or hoping to extract general lessons for productionizing ML workflows in the cloud.
Feel free to provide feedback by commenting in the box at the bottom of this page or by posting about this work on social media. Please make all feedback public so other readers can benefit from the discussion.
Itch is a debilitating symptom and driver of numerous diseases that have a devastating impact on physical and emotional health. Being able to categorize and quantify itch is necessary to assess disease progression and the efficacy of treatment for itch-associated diseases. The gold standard for quantifying itch in preclinical translational models has been human observation and tallying of itch-specific behaviors from video recordings, specifically pruritogen-induced scratching in mouse models [1].
Mouse scratching has traditionally been quantified through manual annotation of mouse videos, counting mouse paw swipes across affected areas. This approach has translated reasonably well to human clinical phenotypes [2], although the process of manual annotation can be slow and tedious. Given our goal of using behavioral phenotyping as a guide for our fractionation experiments, we needed ways to speed up our in vivo analyses.
Figure 1. Schematic of the automated workflow for analyzing mouse scratching.
We uploaded unprocessed video files and their matching metadata (TXT) files to AWS, where we triggered the NextFlow workflow. We cropped the files to include only one mouse per video and renamed them based on the data in the metadata file (capture time, date, camera, and position in the frame). We then used DeepLabCut (DLC) to track six body parts, indicated by the colored circles. We processed the pose estimation output from DLC with a peak-finding script to identify and quantify episodes of scratching. The script looks at the difference in x,y positions between the two rear hindlimbs to identify the stereotypical cyclical movement pattern of the scratching motion.
To accelerate our studies on anti-itch compounds, we built a computational pipeline to automate the steps involved in quantifying mouse scratching (please check out our companion protocol on how to perform the mouse scratch assay). Here, we present the workflow and code we developed. We estimate that it ultimately sped up our data analysis by at least 50×. One limitation, however, is that our pipeline performs well only on DBA/2J mice, which have unique coloration patterns. Further work is necessary to make it generalizable for all mice.
Others have also developed ML-based training sets to track and quantify scratching in rodents [3][4]. While ML-based tracking of mice isn't a new aid for scratch quantification, automating every arm of the process in the cloud was unique and a game-changer for decreasing our analysis time between experiments.
We built this pipeline to quantify scratching in mice, but we recognize that there are other, less-appreciated, itch-related behaviors that aren't categorized as scratching. Since the pipeline provides positional data for numerous body parts of the mouse, we hope others will use it to identify and track metrics besides scratching using unbiased ML-based clustering algorithms.
Figure 1 outlines the pipeline. It automates video cropping and cataloging, body position tracking using DeepLabCut, and scratch quantification.
Code, including the NextFlow workflow, packages to preprocess videos and analyze them with DLC, and the peak-finding scripts, is available in our GitHub repo (DOI: 10.5281/zenodo.16879067).
In this section, we dive into the details of our automated scratch assay analysis pipeline. Figure 1 outlines the key steps of the computational analysis.
We treated DBA/2J mice with pruritogens (itch inducers) and captured videos of their behavior (see our protocol for this assay on protocols.io). Each video captured two mice. To unambiguously catalog each mouse recording, we automatically cropped each two-mouse video into two single-mouse videos (Figure 1, top panel), renaming them based on position in the frame (right vs. left enclosure), capture time, and date. We affixed quick response (QR) codes just outside each mouse enclosure, and a script assigned a 900 × 900 pixel bounding box anchored to the corner of each QR code; the two boxes defined the two regions we cropped into separate videos. We then extracted the video capture time and date from an associated metadata file and used them to rename the two cropped video files.
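To make this step concrete, here's a minimal sketch (not our exact implementation, which is in the GitHub repo) of how one could detect the QR code with OpenCV and write out a fixed 900 × 900 pixel crop. The function name and file paths are placeholders.

```python
# Illustrative sketch (not our exact implementation): locate the QR code in the
# first frame with OpenCV, then write a 900 x 900 pixel crop anchored at its corner.
import cv2

def crop_enclosure(video_path, out_path, box_size=900):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError(f"Could not read {video_path}")

    # Detect the QR code affixed just outside the enclosure; `points` holds its corners.
    detector = cv2.QRCodeDetector()
    _, points, _ = detector.detectAndDecode(frame)
    if points is None:
        raise RuntimeError("No QR code found in the first frame")
    x0, y0 = points[0][0].astype(int)  # top-left QR corner, used as the crop anchor

    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (box_size, box_size))
    while ok:
        # Crop the fixed bounding box from each frame and write it to the new video.
        writer.write(frame[y0:y0 + box_size, x0:x0 + box_size])
        ok, frame = cap.read()
    cap.release()
    writer.release()
```

In practice, one crop region per enclosure yields the two single-mouse videos described above.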
We then employed automated, machine learning-guided body position tracking using DeepLabCut (v2.2.1, RRID: SCR_021391) [5]. We manually labeled six body parts, generating a skeleton of the front left paw, front right paw, rear left paw, rear right paw, nose, and rear of the animal. Some other studies [3] label only the limbs that perform the scratching, which misses other behaviors that tracking additional body parts could reveal. We performed training on an Intel Core i9 CPU system with an NVIDIA GeForce RTX 3080 (10 GB) graphics card. We prepared a training dataset with the ResNet-50 convolutional neural network backbone [5], using 43 videos with 2,200 total frames, extracting an average of 50 frames per video to label the six body parts manually. We split the dataset 95:5 for training and testing, respectively. At each iteration of training, we extracted outlier frames, corrected their labels, and used them to retrain. Using a confidence threshold of 0.6, training achieved an average train error of 0.32 mm (1.89 pixels) and an average test error of 1.25 mm (7.39 pixels).
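For readers new to DeepLabCut, the calls below sketch the training loop we describe (project creation, frame labeling, ResNet-50 training, and outlier-frame refinement). The project name and video paths are placeholders; our actual configuration lives in the repo.

```python
# Sketch of the DeepLabCut (v2.2.x) training loop described above.
# Project name and video paths are placeholders; our actual config is in the GitHub repo.
import deeplabcut

config = deeplabcut.create_new_project(
    "scratch-tracking", "experimenter", ["videos/mouse_left.mp4"],
    copy_videos=True)

# Extract frames and manually label the six body parts (paws, nose, rear) in the GUI.
deeplabcut.extract_frames(config, mode="automatic")
deeplabcut.label_frames(config)

# Build the ResNet-50 training dataset and train; the 95:5 train/test split is
# controlled by the TrainingFraction setting in config.yaml.
deeplabcut.create_training_dataset(config, net_type="resnet_50")
deeplabcut.train_network(config)
deeplabcut.evaluate_network(config)

# Analyze videos, then pull outlier frames, refine their labels, and retrain as needed.
deeplabcut.analyze_videos(config, ["videos/mouse_left.mp4"], save_as_csv=True)
deeplabcut.extract_outlier_frames(config, ["videos/mouse_left.mp4"])
deeplabcut.refine_labels(config)
deeplabcut.merge_datasets(config)
```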
We then used the pose estimation data to explore the most robust means to quantify scratching. Scratching in mice involves raising a right or left rear hindlimb to scratch the affected area [6]. When plotting the change in hindlimb x,y position over time, the cyclical nature of the scratching motion appears as a distinct oscillating curve (Figure 1, bottom panel). We empirically determined that the clearest scratching pattern (the greatest and most consistent difference in x,y positions over time) was revealed when plotting the difference in the positions of the right and left hindlimbs during scratching episodes.
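As a minimal sketch of that calculation (assuming standard DeepLabCut CSV output and hypothetical body-part column names), the snippet below builds the hindlimb-displacement signal that we plot and pass to the peak finder, here using the Euclidean distance between the two rear paws as one plausible reading of that positional difference.

```python
# Minimal sketch: build the hindlimb-displacement signal from a DeepLabCut CSV.
# The body-part names ("rear_left_paw", "rear_right_paw") are placeholders and may
# differ from the labels in our project.
import numpy as np
import pandas as pd

def hindlimb_displacement(dlc_csv, likelihood_cutoff=0.6):
    # DLC CSVs carry three header rows: scorer, bodyparts, coords.
    df = pd.read_csv(dlc_csv, header=[0, 1, 2], index_col=0)
    df.columns = df.columns.droplevel(0)  # drop the scorer level

    left = df["rear_left_paw"]
    right = df["rear_right_paw"]

    # Mask frames where either paw was tracked with low confidence.
    confident = (left["likelihood"] >= likelihood_cutoff) & \
                (right["likelihood"] >= likelihood_cutoff)

    # Distance between the two rear paws in each frame; scratching appears as a
    # distinct oscillation in this signal over time.
    dist = np.hypot(right["x"] - left["x"], right["y"] - left["y"])
    return dist.where(confident)
```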
We then wrote a Python peak-finding script to automatically identify the itch-associated cyclical hindlimb displacement pattern noted above. To accurately flag "waves" of rear hindlimb swipes in displacement-time graphs as scratching events, we included several key, tunable variables in the script (Figure 1, bottom panel).
We adjusted these parameters, generated output tables with predicted scratching events, and compared the predictions to ground-truth video footage and manual quantification to confirm that the automated calls were correct. After some iteration, we determined how to adjust the variable settings to call scratching events accurately.
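As an illustration of the approach (not our tuned script), this is roughly what such peak detection looks like with scipy.signal.find_peaks applied to the hindlimb-displacement signal; the prominence and spacing thresholds shown are placeholder values, not the settings we converged on.

```python
# Illustrative peak-finding sketch: flag oscillations in the hindlimb-displacement
# signal as candidate scratch swipes. Threshold values are placeholders, not the
# tuned parameters from our pipeline.
import numpy as np
from scipy.signal import find_peaks

FPS = 120  # video capture rate (frames per second)

def find_scratch_peaks(displacement, min_prominence=20.0, min_spacing_s=0.05):
    signal = np.asarray(displacement, dtype=float)
    signal = np.nan_to_num(signal, nan=0.0)  # zero out masked low-confidence frames

    # Each peak corresponds to one swipe of the rear hindlimb; require a minimum
    # prominence (in pixels) and a minimum spacing between swipes (in frames).
    peaks, props = find_peaks(
        signal,
        prominence=min_prominence,
        distance=max(1, int(min_spacing_s * FPS)),
    )
    return peaks, props
```

Peaks that pass these filters become candidate swipes, which downstream steps group into scratching events.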
Using this script, we captured detailed measures of scratching, with temporal precision dictated by the video capture rate (1/120 of a second).
Overall, these measures help capture the dynamics of scratching across the entire time course of the video.
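As one hedged example of how such measures can be derived, detected swipe peaks can be grouped into bouts and summarized per bout (count, duration, and within-bout frequency). The bout gap threshold below is an arbitrary placeholder rather than our calibrated setting.

```python
# Rough sketch: group detected swipe peaks into scratching bouts and summarize them.
# The bout gap threshold is an example value, not our calibrated setting.
import numpy as np

FPS = 120  # frames per second, so each frame is 1/120 s

def summarize_bouts(peak_frames, max_gap_s=0.5):
    peak_frames = np.asarray(peak_frames)
    if peak_frames.size == 0:
        return []

    # Split peaks into bouts wherever the gap between swipes exceeds max_gap_s.
    gaps = np.diff(peak_frames) / FPS
    bout_edges = np.where(gaps > max_gap_s)[0] + 1
    bouts = np.split(peak_frames, bout_edges)

    summaries = []
    for bout in bouts:
        start_s, end_s = bout[0] / FPS, bout[-1] / FPS
        duration = max(end_s - start_s, 1 / FPS)
        summaries.append({
            "start_s": start_s,                    # bout onset in seconds
            "n_swipes": int(bout.size),            # swipes within the bout
            "duration_s": duration,                # bout length
            "swipes_per_s": bout.size / duration,  # within-bout scratch frequency
        })
    return summaries
```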
With all the different modules of the analysis pipeline complete, we wanted to take advantage of the parallel processing power available in the cloud to further speed up the video analysis. We ported the modules to Amazon Web Services (AWS) and built a workflow using NextFlow to orchestrate the pipeline.
The NextFlow workflow is available in our GitHub repo.
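We don't reproduce the workflow here, but to give a flavor of the cloud automation, the sketch below shows one way an S3 upload could kick off a containerized NextFlow run via an AWS Lambda handler and AWS Batch. The bucket layout, job queue, and job definition names are hypothetical, and the actual trigger logic is in our repo.

```python
# Purely illustrative sketch of cloud automation: an S3-triggered AWS Lambda handler
# that submits a Nextflow run to AWS Batch for each newly uploaded video.
# The job queue, job definition, and bucket layout are hypothetical placeholders;
# the real orchestration lives in our GitHub repo.
import boto3

batch = boto3.client("batch")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if not key.endswith(".mp4"):
            continue  # only launch the workflow for video uploads

        batch.submit_job(
            jobName="scratch-pipeline",
            jobQueue="scratch-pipeline-queue",   # hypothetical queue name
            jobDefinition="nextflow-runner",     # hypothetical job definition
            containerOverrides={
                "command": [
                    "nextflow", "run", "main.nf",
                    "--video", f"s3://{bucket}/{key}",
                ]
            },
        )
```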
The outcome of this effort was automation and parallel processing of analysis that significantly accelerated the quantification of scratching and allowed rapid and nearly real-time adjustment of experimental parameters for follow-up experiments. Less than a day after an experiment was complete, we would have the results and could perform the next experiment.
Direct comparison of manual quantification time to automated quantification is difficult because a huge efficiency gain came from running the automated analysis on many videos in parallel. However, a good approximation comes from comparing the analysis time for an average single experiment of 48 15-minute videos. Manually quantifying 48 videos took approximately 36 hours of hands-on time, which generally spanned seven days. With the automated pipeline, the same 48 videos could be processed in about three hours, compressing that roughly seven-day turnaround (~56× faster in elapsed time).
As noted by others [4], ML-based training weights for pose estimation are sensitive to the specific conditions in the video (lighting, camera resolution, camera angle, frame rate, etc.). For instance, our pipeline successfully tracks mice with coloration patterns specific to DBA/2J mice, and re-training and fine-tuning will be necessary to generalize to other mouse strains that are visually distinct from DBA/2J. Additionally, slight variations in our experimental setup (described here) may also require re-training.
We used ChatGPT and Claude to review our code, selectively incorporating their feedback, and used Claude to help clarify and streamline text that we wrote. We also used Grammarly Business to suggest wording ideas and then chose which small phrases or sentence-structure ideas to use.
In summary, we built a video analysis pipeline to quantify itch in mice treated with itch-inducing substances and potential anti-itch compounds. Our pipeline significantly accelerates video analysis and removes the variation inherent in using different human observers in the same experiment. It also yields more granular data, such as scratching frequency. Additionally, it provides the framework to use positional data and non-scratching behavioral information to track other outcomes of experimental treatments.
We’re currently winding down the Trove effort [7], so we won’t continue development of this resource. For those who may want to use this tool, further work should be done to generalize ML training to follow mouse strains other than those with DBA/2J coloration patterns. Further optimization of GPU use in the cloud could also increase efficiency and decrease the analysis time.
Unsupervised ML-based algorithms [8][9] could also be applied to our positional data to identify emergent behavioral patterns caused by treatment regimens. These emergent behaviors could provide another metric to follow that may show less mouse-to-mouse variability than scratching.