Blog Posts

Week 15 – FDR

Team Noesys and AGIS AI liaisons at FDR

Team Noesys successfully completed our Final Design Review (FDR) presentation this week! We presented our emotional analysis system for AGIS AI at the Reitz Union, showcasing our web application’s ability to detect emotions in real time across audio, video, and text modalities. The presentation highlighted our journey from concept to functional prototype. Check out our poster and video below for more information.

As we wrap up the IPPD experience, we want to express our gratitude to our liaisons at AGIS AI, our coach, and the IPPD staff who made this project possible. The past two semesters have been both challenging and incredibly rewarding. We’re proud of what we’ve accomplished as Team Noesys and excited about the potential impact of our work. Thanks for reading our blog, and farewell from all of us at Team Noesys!

Week 15 – Poster and Video Demo

Our team’s poster

To accompany our FDR presentation, the team developed a poster explaining our project’s background, describing how the emotional analysis system works behind the scenes, and highlighting important system features.

We also recorded a video that illustrates our product’s use case and showcases key features in action. Check it out below:

Week 14 – FDR Preparation and Final System Refinements

Updated demo with action units

This week, we enhanced our webapp with action unit detection and recording capabilities while updating it to incorporate our latest audio model. We also introduced speech semantic analysis using 24 nuanced emotional categories, providing more detailed insights. Our team completed the draft of our Final Design Review (FDR) report and presented our FDR slides to peer teams and coaches, receiving valuable feedback. We made necessary improvements to our product demonstration video and evaluated our final fusion model on the curated IPPD dataset.

For the coming week, we’ll focus on making final updates to our system functionality and completing the revision of our FDR report. We’ll refine our FDR presentation to ensure it fits within the allotted time frame and improve its overall readability. We’ll also finalize our project video and poster. As we approach the end of the semester, our project remains on schedule with all major components nearing completion.

Week 13 – Enhanced UI and Expanded Emotional Analysis

Sample session report from our web app

Our team improved our web application and emotional analysis capabilities this week. We enhanced the web app by adding a dedicated audio window, substantially improving transcript accuracy. We also introduced key moments detection and an emotional state timeline, providing users with a more comprehensive analysis of emotional patterns over time. The audio team tested various weighting approaches in our audio models and integrated the best-performing version into the web application. Our transcript analysis received a big upgrade as we expanded from our seven core emotions to utilize all 24 emotions available in the model, resulting in more nuanced emotional detection.
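For anyone curious how the full emotion distribution comes out of the transcript model, here’s a minimal sketch using the Hugging Face pipeline API. The checkpoint below is a public GoEmotions model chosen for illustration, not necessarily the exact model running in our web app:

```python
from transformers import pipeline

# Illustrative checkpoint -- any fine-grained emotion classifier on Hugging Face
# exposes its scores the same way.
classifier = pipeline(
    "text-classification",
    model="SamLowe/roberta-base-go_emotions",
    top_k=None,  # return a score for every emotion label, not just the top one
)

sentence = "I honestly didn't expect the demo to go that smoothly."
scores = classifier([sentence])[0]  # list of {"label": ..., "score": ...} dicts
scores.sort(key=lambda s: s["score"], reverse=True)

for entry in scores[:5]:
    print(f'{entry["label"]:>12}: {entry["score"]:.3f}')
```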

Next week, we’ll continue refining our system with several planned enhancements. The audio team will incorporate Silero VAD for improved voice activity detection, while the transcript team will integrate the detailed emotion spectrum into our summary generation. We’ll evaluate the current Hugging Face model in our web app against our best-performing custom models. Additionally, we’ll update our poster and promotional video based on the feedback received and begin work on our Final Design Review draft.
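We haven’t wired Silero in yet, but the intended usage looks roughly like the sketch below, assuming a 16 kHz mono recording (the file name is a placeholder):

```python
import torch

# Load the pre-trained Silero VAD model from torch.hub (internet access required).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

# Read a 16 kHz mono clip and find the spans that actually contain speech.
wav = read_audio("session_audio.wav", sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# Each entry gives sample offsets; only these spans would be passed to the
# audio emotion model, so silence and background noise are skipped.
for ts in speech_timestamps:
    print(f'speech from {ts["start"] / 16000:.2f}s to {ts["end"] / 16000:.2f}s')
```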

Week 12 – Prototype Inspection Day

Team Noesys and coach Dr. Fang at Prototype Inspection Day

Our team presented our prototype at Prototype Inspection Day and received valuable reviewer feedback. We substantially improved our web application interface, adding data exporting capabilities and an LLM-based emotion summary feature. Our audio team achieved an 80% macro-F1 score by applying weighted loss and balanced sampling techniques to acted, exaggerated-emotion datasets including RAVDESS, TESS, and CREMA-D. The video team trained and tested a new ResEmoteNet model on our curated FAFE dataset. Meanwhile, our text team expanded our transcript dataset to include over 5,000 sentences evenly distributed across all seven emotion classes.
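The idea behind the summary feature is simple: turn the session’s timestamped predictions into a prompt and hand it to a chat-style LLM. The sketch below is illustrative only; the provider, model name, and log format are stand-ins rather than what’s actually wired into the app:

```python
from openai import OpenAI  # example client; any chat-completion API could fill this role

# A few timestamped predictions as they might come out of the session log.
timeline = [
    ("00:12", "happiness", "I'm glad we finally got the demo working."),
    ("01:05", "surprise",  "Wait, it picked that up from my tone?"),
    ("02:30", "sadness",   "The last test run was honestly discouraging."),
]

prompt = "Summarize the emotional arc of this session in two or three sentences:\n"
prompt += "\n".join(f"{t} [{emotion}] {text}" for t, emotion, text in timeline)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```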

Next week, we’ll be filming and editing our promotional video to showcase our system. The audio team will evaluate our model performance on our custom dataset, while the video team continues to fine-tune new models and establish accuracy benchmarks for the FAFE dataset. Our text team will fine-tune the original RoBERTa go-emotions model using our expanded transcript dataset combined with MOSEI. Based on PID feedback, we’ll enhance our webapp by incorporating action units with emotional data and adding separate text emotion analysis along with other suggested functionality.
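For readers wondering what that fine-tuning step looks like in practice, here’s a rough sketch with the Hugging Face Trainer. The checkpoint name, CSV layout, and hyperparameters are illustrative placeholders, not our exact training setup:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "SamLowe/roberta-base-go_emotions"  # illustrative base checkpoint
NUM_LABELS = 7  # our seven core emotion classes

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=NUM_LABELS,
    ignore_mismatched_sizes=True,  # swap the original GoEmotions head for a 7-way head
)

# Expanded transcript dataset: one sentence and one integer emotion label per row.
df = pd.read_csv("expanded_transcripts.csv")  # assumed columns: text, label
splits = Dataset.from_pandas(df).train_test_split(test_size=0.1)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

splits = splits.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-emotions-ft",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
print(trainer.evaluate())
```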

Week 11 – Prototype Refinement and Dataset Expansion

Team Noesys discussing potential improvements to the audio model

Our team has been busy this week preparing for our Prototype Inspection Day (PID) presentation. The text modality team created a comprehensive CSV with over 1,000 sentences labeled across seven emotion classes, expanding our training and testing capabilities. We fine-tuned our BART model to work more effectively across different datasets. We’ve continued iterating on our demo webapp, improving live emotional prediction and transcript recording functionality.

The audio team implemented weighted loss and balanced sampling techniques for COVAREP LSTM and wav2vec2 models, training on the MOSEI dataset. Our visual team obtained ResEmoteNet, a new open-source model with pre-trained weights, and developed a combined dataset from our previous resources.

Next week, we’ll be preparing for our PID presentation. We’ll finalize our prototype, prepare presentation slides and the demo video, and focus on further model improvements. The text team plans to expand their sentence collection and combine it with MOSEI for training. Our fusion team will implement weighted loss in the intermediate fusion model, while the audio team will explore LSTM implementations from the CMU-MOSEI paper. The visual team will fine-tune and test new and existing models on our combined dataset. We’ll also enhance our demo’s data flow and potentially add summary generation functionality.

Week 9 – Dataset Creation and Model Improvements

Sample sentences from our new dataset

This week, we started creating our emotion-labeled sentence dataset with over 500 entries categorized into our seven emotion classes, which will be used to generate our custom testing dataset. We also developed our demonstration webapp, which now implements live emotional prediction and transcript recording. This provides a tangible way to showcase our technology to stakeholders and users.

Our audio team implemented weighted loss for both wav2vec2 and Whisper fine-tuning to address class imbalance issues. Meanwhile, the visual team expanded our dataset resources by obtaining two new datasets: Emo135 and ExpW-Cleaned. They also successfully tested EmotionCLIP on the AffectNet-YOLO dataset and validated video functionality with the MOSEI dataset. Our late fusion system was tested on CMU-MOSEI using all our best-performing models.

For the coming week, we’ll focus on collecting recordings of sentences from our dataset to finalize our multimodal testing data. The transcript team will complete BART fine-tuning and compare its performance against our other models. Our fusion efforts will concentrate on implementing weighted loss functions, while the audio team will develop LSTM capabilities for COVAREP features and survey state-of-the-art emotional classification models. The visual team will evaluate current model performance on our new datasets and prepare the ExpW dataset for training. We’ll also enhance our demo by adding individual modality prediction information and incorporating pitch and volume markers.
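As a preview of the pitch and volume markers, here’s a rough sketch using librosa; the file name and the two-standard-deviation thresholds are placeholder choices, not our final tuning:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder file name

# Fundamental frequency (pitch) via probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Short-term volume as RMS energy per frame.
rms = librosa.feature.rms(y=y)[0]
times = librosa.times_like(rms, sr=sr)

# Mark frames that are unusually loud or unusually high-pitched relative to the clip.
loud = rms > rms.mean() + 2 * rms.std()
high_pitch = np.nan_to_num(f0) > np.nanmean(f0) + 2 * np.nanstd(f0)

for t, is_loud, is_high in zip(times, loud, high_pitch):
    if is_loud or is_high:
        print(f"marker at {t:.2f}s  loud={bool(is_loud)}  high_pitch={bool(is_high)}")
```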

Week 8 – Late Fusion Demo

Screenshot of our late fusion demo

This week, our team built our first real-time demo incorporating all modalities with late fusion. As the user speaks, the demo captures visual, audio, and textual information and integrates the predictions from each model to determine the likelihood of each emotion. We were pleasantly surprised at how low the latency was: classifications arrived within 1–2 seconds of the spoken expression. The program also logs the transcript and emotional predictions with timestamps in CSV format, which will make it easy to add post-processing features later.
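Under the hood, the fusion step is essentially a weighted average of the probability vectors coming out of the three models, followed by a CSV write. The sketch below captures the idea; the label set, trust weights, and example numbers are placeholders rather than our production values:

```python
import csv
import time
import numpy as np

# Assumed label set: six basic emotions plus neutral.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

# Placeholder trust weights for each modality's classifier.
WEIGHTS = {"video": 0.4, "audio": 0.35, "text": 0.25}

def late_fusion(probs_by_modality):
    """Weighted average of per-modality probability vectors over the emotion classes."""
    fused = np.zeros(len(EMOTIONS))
    total = 0.0
    for modality, probs in probs_by_modality.items():
        fused += WEIGHTS[modality] * np.asarray(probs)
        total += WEIGHTS[modality]
    return fused / total

# Dummy outputs standing in for the real video/audio/text model predictions.
frame_probs = {
    "video": [0.05, 0.02, 0.03, 0.70, 0.10, 0.05, 0.05],
    "audio": [0.10, 0.05, 0.05, 0.55, 0.15, 0.05, 0.05],
    "text":  [0.02, 0.02, 0.02, 0.80, 0.10, 0.02, 0.02],
}

fused = late_fusion(frame_probs)
print("Predicted emotion:", EMOTIONS[int(np.argmax(fused))])

# Log the transcript and fused probabilities with a timestamp, as the demo does.
with open("session_log.csv", "a", newline="") as f:
    csv.writer(f).writerow([time.strftime("%H:%M:%S"),
                            "I'm really excited about this!",
                            *np.round(fused, 3)])
```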

We’ve made some progress toward correcting prediction biases caused by class imbalances in our training data, but there’s still more work to be done. Our audio team implemented a weighted loss function in our fine-tuning, but imbalances remain. Our next step is to try balanced sampling and see whether that alleviates the issue.

Our vision modality team received the weights for EmotionCLIP, an implementation of OpenAI’s CLIP model fine-tuned for emotional classification. Its performance on our test datasets still needs improvement, so we’ve been exploring ways to adapt it to our use case.

Week 7 – QRB2

Team Noesys preparing to present for QRB2

This week, our team presented our QRB2 update to the review committee. During the review, we received valuable feedback that will guide our ongoing development efforts. The committee identified several key areas for improvement in our emotional analysis system, and we’ll be implementing a number of changes in response. We’ll evaluate a weighted prediction approach that gives higher priority to better-performing modalities, as well as an intermediate fusion model that can dynamically learn which inputs deserve more weight.
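To make the second idea concrete, here’s a minimal sketch of what such a fusion head could look like in PyTorch: each modality’s features produce a learned importance score, and the softmax-normalized scores weight the fused representation. The embedding sizes and layer choices are placeholders, not our final architecture:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Intermediate fusion head that learns how much to trust each modality."""

    def __init__(self, dims=None, hidden=256, num_classes=7):
        super().__init__()
        # Placeholder feature sizes for the video/audio/text encoders.
        dims = dims or {"video": 512, "audio": 768, "text": 768}
        self.project = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        # One scalar importance score per modality, computed from its own features.
        self.gate = nn.ModuleDict({m: nn.Linear(hidden, 1) for m in dims})
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, features):
        # features: dict mapping modality name -> (batch, dim) embeddings
        projected = {m: torch.tanh(self.project[m](x)) for m, x in features.items()}
        scores = torch.cat([self.gate[m](h) for m, h in projected.items()], dim=1)
        weights = torch.softmax(scores, dim=1)                   # (batch, n_modalities)
        stacked = torch.stack(list(projected.values()), dim=1)   # (batch, n_mod, hidden)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)
        return self.classifier(fused), weights

# Random embeddings stand in for real encoder outputs.
model = GatedFusion()
batch = {"video": torch.randn(4, 512), "audio": torch.randn(4, 768), "text": torch.randn(4, 768)}
logits, weights = model(batch)
print(logits.shape)   # torch.Size([4, 7])
print(weights[0])     # learned per-modality weights for the first sample
```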

Data imbalance between emotion classes emerged as another challenge, resulting in rare emotions being predicted less frequently and with lower accuracy. Our solution includes implementing balanced sampling during training, utilizing weighted loss functions to account for these imbalances, and expanding our training datasets.
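In PyTorch terms, the two fixes look roughly like the sketch below; the toy dataset and class counts are made up purely for illustration:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-in for a training set with 7 emotion classes and a deliberate imbalance.
counts = [400, 250, 150, 100, 50, 30, 20]
labels = torch.cat([torch.full((n,), i) for i, n in enumerate(counts)])
features = torch.randn(len(labels), 40)
train_dataset = TensorDataset(features, labels)

# 1) Balanced sampling: rare classes are drawn more often within each epoch.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

# 2) Weighted loss: misclassifying a rare emotion costs more than a common one.
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Inside the usual training loop the loss would be computed on real model outputs.
for x, y in loader:
    logits = torch.randn(x.size(0), len(counts), requires_grad=True)  # stand-in for model(x)
    loss = criterion(logits, y)
    loss.backward()
    break
```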

Additional improvement areas include finding more representative datasets that better match our use case, integrating attention layers to help weaker models focus on informative features, standardizing preprocessing across modalities, and incorporating prediction confidence into our late fusion weighting function.

Week 6 – Late Fusion Testing

Example data from CMU-MOSEI

This week, our team integrated text and bounding-box components into our late fusion model. We verified that our fusion accuracy exceeds that of the individual modalities, validating our multi-modal approach. Our audio team completed 1D CNN model evaluations and began testing wav2vec2 for improved performance. Meanwhile, the video team conducted extensive cross-dataset evaluations, testing CLIP trained on AffectNet-YOLO against CMU-MOSEI data and DINOv2 trained on CMU-MOSEI against AffectNet-YOLO.

Next week, we’ll focus on addressing class imbalances in our datasets through weighted loss functions and will begin reporting results using macro-F1 scores for more representative performance metrics. We’re working to incorporate training datasets that better represent our intended use case. The audio team will continue fine-tuning and evaluating wav2vec2 while performing hyperparameter tuning on models trained on MELD. Our video team plans to cross-compare CLIP and DINOv2 performance and identify another compatible commercial dataset for additional cross-evaluation. All teams will standardize their evaluation approach by training and testing models on identical datasets to ensure valid comparisons.
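For reference, macro-F1 simply averages the per-class F1 scores, so every emotion counts equally even when one class dominates; scikit-learn computes it directly. The toy labels below are for illustration only:

```python
from sklearn.metrics import classification_report, f1_score

# Toy predictions for a 7-class problem; in practice these come from our models.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 4, 5, 6]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 2, 3, 0, 5, 6]

# Plain accuracy looks fine because class 0 dominates; macro-F1 exposes the
# classes that are being missed.
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred, zero_division=0))
```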