2024–2025

Audio-Based Human Activity Recognition

Rethinking how activity data is collected — faster, more scalable, and less tedious.

Python · NumPy · Pandas · PyTorch · LSTM

Traditional human activity recognition (HAR) datasets rely heavily on video recording and manual annotation, which are time-consuming and difficult to scale. Our project explored whether synchronized audio instructions could replace these methods, making data collection faster and more accessible for both researchers and participants.

We designed a pipeline that aligned audio cues with time-series data from 9-axis IMU sensors, allowing activities to be labeled in real time without any post-processing. I focused on the data preprocessing scripts and on structuring the dataset for LSTM-based models, which ultimately achieved higher validation accuracy than button-based labeling and other standard baselines.
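The core idea can be sketched roughly as follows: each audio cue carries a timestamp and an activity label, so IMU samples can be labeled by the most recent cue, then cut into fixed-length windows for an LSTM. This is a minimal illustration, not the project's actual code; the cue log format, 50 Hz sampling rate, 2 s windows, and the `HARNet` model are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
import torch
import torch.nn as nn

# Hypothetical cue log: each audio instruction's start time (s) and label.
cues = pd.DataFrame({
    "t_start": [0.0, 10.0, 20.0],
    "label":   ["walk", "sit", "stand"],
})

# Hypothetical 9-axis IMU stream (accel + gyro + mag) sampled at 50 Hz.
fs = 50
t = np.arange(0, 30, 1 / fs)
imu = pd.DataFrame(np.random.randn(len(t), 9),
                   columns=[f"ch{i}" for i in range(9)])
imu["t"] = t

# Real-time labeling: each sample takes the label of the most recent cue.
imu["label"] = pd.cut(imu["t"], bins=list(cues["t_start"]) + [np.inf],
                      labels=cues["label"], right=False)

# Fixed-length windows for the LSTM (2 s windows, 50 % overlap).
win, hop = 2 * fs, fs
classes = {lab: i for i, lab in enumerate(cues["label"])}
X, y = [], []
for start in range(0, len(imu) - win + 1, hop):
    seg = imu.iloc[start:start + win]
    if seg["label"].nunique() == 1:   # skip windows spanning a cue boundary
        X.append(seg[[f"ch{i}" for i in range(9)]].to_numpy())
        y.append(classes[seg["label"].iloc[0]])
X = torch.tensor(np.stack(X), dtype=torch.float32)   # shape (N, win, 9)
y = torch.tensor(y)

# Minimal LSTM classifier over the windows (illustrative architecture).
class HARNet(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(input_size=9, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        _, (h, _) = self.lstm(x)      # use the final hidden state
        return self.head(h[-1])

logits = HARNet(len(classes))(X)      # shape (N, n_classes)
```

Because the cue timestamps label the stream as it is recorded, the windowing step needs no manual annotation pass; windows that straddle a cue boundary are simply dropped.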

Looking back, I think we could have explored generalization across different environments and users more thoroughly. Still, this project shifted how I think about data collection: not just optimizing models, but questioning the assumptions behind how the data is gathered in the first place.