torch-log-wmse: Audio Quality Metric and Loss Function Library

GitHub Stats:

  • Stars: 26
  • Forks: 1
  • Language: Python
  • Created: 2024-03-28
  • License: Apache License 2.0

The torch-log-wmse project provides a PyTorch implementation of the logWMSE (logarithm of frequency-weighted Mean Squared Error) metric, originally developed by Iver Jordal of Nomono. This metric and loss function are designed to evaluate and improve the quality of audio signals, particularly addressing the limitations of traditional audio metrics, such as the inability to handle digital silence targets. It can be used both as a metric to assess audio quality and as a loss function for training audio separation and denoising models. Installing it is straightforward with pip install torch-log-wmse, making it a valuable tool for audio processing tasks.

Key features of torch-log-wmse:

  • Custom Metric and Loss Function: logWMSE calculates the logarithm of a frequency-weighted Mean Squared Error, addressing shortcomings of common audio metrics, especially supporting digital silence targets.
  • Usage in Training Models: It can be used as a loss function for training audio separation and denoising models.
  • Frequency Weighting: The metric applies frequency weighting to align with human hearing sensitivity, giving more weight to frequencies humans are more sensitive to.
  • Tensor Inputs: Accepts three torch tensors: unprocessed audio, processed audio, and target audio.
  • Tensor Shapes: The unprocessed tensor is shaped (batch, channels, samples); the processed and target tensors add an audio-stems (sources) dimension: (batch, stems, channels, samples).
  • Supports Digital Silence: Unlike other metrics, it handles digital silence targets.
  • Scaled for Human Hearing: Logarithmic scaling reflects human hearing sensitivity and avoids small value range issues of MSE.
  • Scale-Invariant: The metric remains consistent even if all inputs are scaled by the same amount.
  • Limitations: Not invariant to arbitrary scaling, polarity inversion, or offsets applied to the estimated audio alone.
  • Perceptual Gaps: Does not fully model human auditory perception, e.g. auditory masking.
  • Installation: Available via pip install torch-log-wmse.
  • Example Usage: The README shows how to instantiate and use the LogWMSE metric.
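
The logarithm-of-MSE step at the heart of the metric can be sketched in a few lines. This is an illustrative toy only: the hypothetical toy_log_mse below omits the psychoacoustic frequency-weighting filter and the input-dependent scaling that the real logWMSE applies.

python

import torch

def toy_log_mse(processed: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Toy stand-in for logWMSE: plain MSE on a log scale.

    The real metric first applies a frequency-weighting filter that
    emphasizes the bands humans hear best and scales by the level of
    the unprocessed input; both are omitted here for clarity.
    """
    mse = torch.mean((processed - target) ** 2)
    # Log scaling keeps values in a useful range even for tiny errors,
    # such as near-digital-silence targets where raw MSE values become
    # impractically small.
    return 10.0 * torch.log10(mse + eps)

Unlike this toy, the real logWMSE stays consistent when all three inputs are scaled by the same factor, thanks to its input-dependent scaling.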

This project is licensed under the Apache License 2.0 and welcomes contributions.

You can use torch-log-wmse as a loss function to train models for audio source separation. For example, if you are separating a mixed audio signal into vocals, drums, bass, and other instruments, you can calculate the logWMSE between the unprocessed mixed audio, the separated audio stems, and the target clean audio.

python

# Example: audio source separation
import torch
from torch_log_wmse import LogWMSE

unprocessed_audio = torch.rand(4, 2, 44100)   # batch 4, stereo, 1 second at 44.1 kHz
processed_audio = torch.rand(4, 4, 2, 44100)  # batch 4, 4 stems, stereo, 1 second
target_audio = torch.rand(4, 4, 2, 44100)     # batch 4, 4 stems, stereo, 1 second

log_wmse = LogWMSE(audio_length=1.0, sample_rate=44100)
loss = log_wmse(unprocessed_audio, processed_audio, target_audio)
print(loss)  # the calculated logWMSE value

For denoising audio, you can use torch-log-wmse to evaluate and train models that remove noise from an audio signal.

python

# Example: audio denoising
import torch
from torch_log_wmse import LogWMSE

unprocessed_audio = torch.rand(4, 2, 44100)   # batch 4, stereo, 1 second at 44.1 kHz
processed_audio = torch.rand(4, 1, 2, 44100)  # batch 4, 1 stem (denoised), stereo
target_audio = torch.rand(4, 1, 2, 44100)     # batch 4, 1 stem (clean), stereo

log_wmse = LogWMSE(audio_length=1.0, sample_rate=44100)
loss = log_wmse(unprocessed_audio, processed_audio, target_audio)
print(loss)  # the calculated logWMSE value

  • Installation: Install the package using pip install torch-log-wmse.
  • Usage: Use the LogWMSE class to calculate the logWMSE metric or loss function for your audio processing tasks.
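
Because the returned value is an ordinary differentiable torch tensor, it can drive a training loop like any other loss. A minimal sketch, where the single Conv1d layer is a stand-in for a real denoising model and is not part of the library:

python

import torch
from torch_log_wmse import LogWMSE

# Stand-in "denoiser": one conv layer that preserves the (batch, 2, samples) shape.
model = torch.nn.Conv1d(2, 2, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

log_wmse = LogWMSE(audio_length=1.0, sample_rate=44100)

unprocessed_audio = torch.rand(4, 2, 44100)  # noisy input: batch 4, stereo, 1 second
target_audio = torch.rand(4, 1, 2, 44100)    # clean reference: 1 stem

optimizer.zero_grad()
estimate = model(unprocessed_audio).unsqueeze(1)  # add the stem dimension
loss = log_wmse(unprocessed_audio, estimate, target_audio)
# Check the project README for whether the returned value is already
# oriented for minimization before using it as a training objective.
loss.backward()
optimizer.step()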

Key Points:

  • Custom Metric and Loss Function: logWMSE is designed for audio quality evaluation and training audio separation and denoising models, addressing shortcomings of common metrics like MSE.
  • Digital Silence Support: It supports digital silence targets, overcoming a limitation of other audio metrics.
  • Human Hearing Alignment: Logarithmic scaling and frequency weighting align the metric with human hearing sensitivity.
  • Multi-Input Capability: Requires unprocessed, processed, and target audio inputs for comprehensive evaluation.
  • Limitations: Not invariant to arbitrary scaling, polarity inversion, or offsets; does not fully model human auditory perception.

Future Potential:

  • Enhanced Audio Models: Can improve the performance of audio separation and denoising models by providing a more accurate and relevant loss function.
  • Broader Applications: Potentially applicable in various audio processing tasks beyond separation and denoising, such as speech enhancement and music restoration.
  • Contributions and Improvements: Open to contributions for further enhancements and new features, potentially addressing current limitations.

To explore the project in more depth, check out the original crlandsc/torch-log-wmse repository.

Content derived from the crlandsc/torch-log-wmse repository on GitHub. Original materials are licensed under their respective terms.