Handling audio data is an essential task for machine learning engineers working in the fields of speech analytics, music information retrieval and multimodal data analysis, but also for developers who simply want to edit, record and transcode sounds. This article shows the basics of handling audio data using command-line tools, and also provides a not-so-deep dive into handling sounds in Python.
So what is sound and what are its basic attributes?
According to physics, sound is a travelling vibration, i.e. a wave that moves through a medium such as air. The sound wave transfers energy from particle to particle until it is finally "received" by our ears and perceived by our brains. The two basic attributes of sound are amplitude (which we perceive as loudness) and frequency (a measure of the wave's vibrations per time unit).
Similarly to images and videos, sound is an analog signal that has to be transformed to a digital signal in order to be stored in computers and analyzed by software. This analog to digital conversion includes two processes: sampling and quantization.
Sampling is used to convert the time-varying continuous signal x(t) to a discrete sequence of real numbers x(n). The interval between two successive discrete samples is the sampling period (Ts). We use the sampling frequency (fs = 1/Ts) as the attribute that describes the sampling process.
Typical sampling frequencies are 8 kHz, 16 kHz and 44.1 kHz. 1 Hz means one sample per second, so higher sampling frequencies mean more samples per second and therefore better signal quality.
(More precisely, a higher sampling frequency means that the discrete signal can capture a wider range of frequencies, namely from 0 to fs/2 Hz, according to the Nyquist rule.)
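To make the Nyquist rule concrete, here is a minimal sketch, assuming NumPy is available. The helper name dominant_frequency and the chosen tone and sampling frequencies are illustrative choices, not part of any library. It samples a 3 kHz sine at two sampling frequencies and reads back the strongest frequency from the FFT of the sampled sequence:

```python
import numpy as np

def dominant_frequency(tone_hz, fs, duration=1.0):
    """Sample a sine of tone_hz Hz at fs Hz and return the frequency (Hz)
    of the strongest FFT bin of the sampled sequence."""
    n = np.arange(int(fs * duration))          # sample indices
    x = np.sin(2 * np.pi * tone_hz * n / fs)   # x(n) = x(t) at t = n * Ts
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

f_ok = dominant_frequency(3000, fs=8000)       # 3 kHz is below fs/2 = 4 kHz
f_alias = dominant_frequency(3000, fs=4000)    # 3 kHz is above fs/2 = 2 kHz
print(f_ok, f_alias)
```

With fs = 8 kHz the tone is found at 3 kHz, while with fs = 4 kHz it shows up as a 1 kHz alias, since frequencies above fs/2 cannot be represented.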
Quantization is the process of replacing each real number, x(n), of the sequence of samples with an approximation from a finite set of discrete values. In other words, quantization is the process of reducing the infinite number precision of an audio sample to a finite precision as defined by a particular number of bits.
In the majority of cases, 16 bits are used to represent each quantized sample, which means that there are 2¹⁶ levels for the quantized signal. For that reason, raw audio values usually vary from -2¹⁵ to 2¹⁵ - 1 (1 bit is used for the sign); however, as we will see later, this is usually normalized to the (-1, 1) range for the sake of simplicity.
We usually call this bit resolution property of the quantization procedure "sample resolution" and it is measured in bits per sample.
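The effect of the sample resolution can be illustrated with a small sketch, assuming NumPy. The function name quantize and the toy sine signal are hypothetical. It rounds a signal in the (-1, 1) range to signed integers of a given number of bits and measures the largest approximation error:

```python
import numpy as np

def quantize(x, bits=16):
    """Quantize samples in [-1, 1) to signed integers with `bits` bits."""
    levels = 2 ** (bits - 1)                   # 2^15 for 16 bits
    q = np.round(x * levels)
    return np.clip(q, -levels, levels - 1).astype(np.int64)

x = 0.9 * np.sin(2 * np.pi * np.arange(1000) / 100.0)   # a toy signal
errors = {}
for bits in (8, 16):
    q = quantize(x, bits)
    # map the integers back to (-1, 1) and compare with the original
    errors[bits] = np.max(np.abs(q / 2 ** (bits - 1) - x))
    print(bits, "bits -> max error", errors[bits])
```

The maximum error is bounded by half a quantization step, so it becomes much smaller as the number of bits per sample grows.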
Tools and libraries used in this article
I've selected the following command-line tools, programs and libraries to use for basic handling of audio data:
- ffmpeg/libav. FFmpeg (https://ffmpeg.org) is a free, open-source project for handling multimedia files and streams. Some think that ffmpeg and libav are the same, but libav is actually a project forked from ffmpeg
- sox (http://sox.sourceforge.net), aka "the Swiss Army knife of sound processing programs", is a free cross-platform command-line utility for basic audio processing. Despite the fact that it has not been updated since 2015, it is still a good solution. In this article we mostly demonstrate ffmpeg, along with a couple of examples in sox
- audacity (https://www.audacityteam.org) is a free, open-source and cross-platform program for editing sounds
- programming: we will use pydub (https://github.com/jiaaro/pydub) and scipy (https://scipy-cookbook.readthedocs.io) for reading audio data, and librosa (https://librosa.github.io/librosa/) for tempo estimation.
We could also use pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis) for IO or for more advanced feature extraction and signal analysis.
Finally, we will also use plotly (https://plotly.com) for basic signal visualization.
This article is divided into two parts:
- 1st part: how to use ffmpeg and sox to handle audio files
- 2nd part: how to programmatically handle audio files and perform basic processing
Part I: Handling audio data - the command-line way
Below are some examples for the most basic audio handling such as conversion between formats, temporal trimming, merging and segmentation, using mostly ffmpeg and sox.
To convert video (mkv) to audio (mp3):
ffmpeg -i video.mkv audio.mp3
For downsampling to 16 kHz, converting stereo (2 channels) to mono (1 channel) and converting MP3 to WAV (uncompressed audio samples), one needs to use the -ar (audio rate) and -ac (audio channels) options:
ffmpeg -i audio.mp3 -ar 16000 -ac 1 audio_16K_mono.wav
Note that, in that case, stereo to mono conversion means that the two channels are averaged into one. Downsampling and stereo to mono conversion can also be achieved using sox in the following manner: sox <source_file> -r <new_sampling_rate> -c 1 <output_file>
Now let's see the new file's attributes using ffmpeg:
ffmpeg -i audio_16K_mono.wav
will return:
Input #0, wav, from 'audio_16K_mono.wav':
  Metadata:
    encoder         : Lavf57.71.100
  Duration: 00:03:10.29, bitrate: 256 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
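The reported bitrate follows directly from the other attributes: 16000 samples per second x 16 bits per sample x 1 channel = 256 kb/s. As a hedged cross-check, the sketch below uses Python's standard wave module to write a short silent file with the same format and read its header back. The file name and the helper wav_info are hypothetical:

```python
import os
import tempfile
import wave

def wav_info(path):
    """Return (sampling rate, channels, bits per sample, bitrate in bits/s)."""
    with wave.open(path, "rb") as w:
        fs = w.getframerate()
        channels = w.getnchannels()
        bits = 8 * w.getsampwidth()
    return fs, channels, bits, fs * channels * bits

# Write one second of 16 kHz, mono, 16-bit silence to a temporary file
path = os.path.join(tempfile.mkdtemp(), "silence_16K_mono.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)              # 2 bytes = 16 bits per sample
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

info = wav_info(path)
print(info)
```

The last element of the returned tuple is the bitrate in bits per second, computed from the header fields only.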
To trim an audio file, e.g. from the 60th to the 80th second (20 seconds new duration):
ffmpeg -i audio.wav -ss 60 -t 20 audio_small.wav
(This can also be achieved with the -to argument, which defines the end of the trimmed segment instead of its duration; in the example above that would be -to 80.)
To concatenate two or more audio files one can use the "ffmpeg -f concat" command. Suppose you want to concatenate the files file1.wav, file2.wav and file3.wav into a large file called output.wav. What you need to do is create a text file of the following format (say, named "list_of_files_to_concat"):
file 'file1.wav'
file 'file2.wav'
file 'file3.wav'
and then run
ffmpeg -f concat -i list_of_files_to_concat -c copy output.wav
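When there are many files, the list can be generated programmatically. The following is a minimal sketch using only the standard library. The helper name write_concat_list is hypothetical, and it assumes that the file names contain no single quotes, which would need escaping:

```python
import os
import tempfile

def write_concat_list(wav_paths, list_path):
    """Write one "file '<path>'" line per input, in the format shown above."""
    with open(list_path, "w") as f:
        for p in wav_paths:
            f.write("file '{}'\n".format(p))

list_path = os.path.join(tempfile.mkdtemp(), "list_of_files_to_concat")
write_concat_list(["file1.wav", "file2.wav", "file3.wav"], list_path)
with open(list_path) as f:
    content = f.read()
print(content)
```

The resulting file can then be passed to ffmpeg through the -i argument exactly as in the command above.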
On the other hand, breaking an audio file into successive chunks (segments) of the same specified duration can be done with the "ffmpeg -f segment" option. For example, the following command will break output.wav into 1-second, non-overlapping segments named out00000.wav, out00001.wav, etc.:
ffmpeg -i output.wav -f segment -segment_time 1 -c copy out%05d.wav
With regard to channel handling, apart from simple mono to stereo conversion (or stereo to mono) through the -ac option, one may want to swap the stereo channels (right to left). The way to achieve this is through the ffmpeg -map_channel option:
ffmpeg -i stereo.wav -map_channel 0.0.1 -map_channel 0.0.0 stereo_inverted.wav
To create a stereo file from two mono files, say left.wav and right.wav:
ffmpeg -i left.wav -i right.wav -filter_complex "[0:a][1:a]join=inputs=2:channel_layout=stereo[a]" -map "[a]" mix_channels.wav
In the opposite direction, to split a stereo file into two mono files (one for each channel):
ffmpeg -i stereo.wav -map_channel 0.0.0 left.wav -map_channel 0.0.1 right.wav
-map_channel can also be used to mute a channel of a stereo signal, e.g. (below, the left channel is muted):
ffmpeg -i stereo.wav -map_channel -1 -map_channel 0.0.1 muted.wav
Volume adaptation can also be achieved through ffmpeg, e.g.
ffmpeg -i data/music_44100.wav -filter:a "volume=0.5" data/music_44100_volume_50.wav
ffmpeg -i data/music_44100.wav -filter:a "volume=2.0" data/music_44100_volume_200.wav
The figure below presents a screenshot from viewing (with Audacity) the original, the 50% volume adaptation and the x2 (200%) volume adaptation signals. The x2 volume-boosted signal is clearly clipped, i.e. some samples cannot be represented and they are assigned the maximum allowed value (2¹⁵ - 1 for 16-bit signals):
Volume change can be achieved with sox as well in the following way:
sox -v 0.5 data/music_44100.wav data/music_44100_volume_50_sox.wav
sox -v 2.0 data/music_44100.wav data/music_44100_volume_200_sox.wav
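To see why boosting the volume can lead to clipping, here is a small NumPy sketch of what a gain does to 16-bit samples. It only illustrates the arithmetic and is not a description of how ffmpeg or sox implement their volume filters. The function name change_volume and the sample values are made up:

```python
import numpy as np

def change_volume(samples, gain):
    """Scale 16-bit samples by `gain`, clipping to the int16 range."""
    scaled = samples.astype(np.float64) * gain
    return np.clip(scaled, -2**15, 2**15 - 1).astype(np.int16)

x = np.array([-30000, -1000, 0, 1000, 20000], dtype=np.int16)
half = change_volume(x, 0.5)      # every sample fits in 16 bits
double = change_volume(x, 2.0)    # large samples hit the int16 limits
print(half)
print(double)
```

With a gain of 2.0 the samples -30000 and 20000 cannot be represented any more, so they are assigned the minimum and maximum 16-bit values respectively.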
Part II: Handling audio data - the programming way
Load WAV and MP3 files to array
Let us first load our sampled audio data into a numpy array (we use numpy arrays as they are the most widely adopted way to process numerical sequences/vectors). The most common way to load WAV data into numpy arrays is scipy.io.wavfile, while for MP3 data one can use pydub (https://github.com/jiaaro/pydub), which uses ffmpeg for encoding / decoding audio data.
In the following example, the same signal stored in WAV and MP3 files is loaded into numpy arrays.
# Read WAV and MP3 files to array
from pydub import AudioSegment
import numpy as np
from scipy.io import wavfile
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
import plotly
# read WAV file using scipy.io.wavfile
fs_wav, data_wav = wavfile.read("data/music_8k.wav")
# read MP3 file using pydub
audiofile = AudioSegment.from_file("data/music_8k.mp3")
data_mp3 = np.array(audiofile.get_array_of_samples())
fs_mp3 = audiofile.frame_rate
print('Sq Error Between mp3 and wav data = {}'.
      format(((data_mp3 - data_wav)**2).sum()))
print('Signal Duration = {} seconds'.
      format(data_wav.shape[0] / fs_wav))
result:
Sq Error Between mp3 and wav data = 0
Signal Duration = 5.256 seconds
Note: the overall duration of the loaded signal (in seconds) is computed by dividing the number of samples by the sampling frequency (Hz = samples per second). Also, in the example above we compute the sum of squared errors to check that the two signals are identical despite the conversion between MP3 and WAV.
Stereo signals
Stereo signals are handled through 2D arrays. In the following example, the data_wav array has two columns, one for each channel. By convention, the first column is the left channel and the second is the right channel.
# Handling stereo signals
fs_wav, data_wav = wavfile.read("data/stereo_example_small_8k.wav")
time_wav = np.arange(0, len(data_wav)) / fs_wav
plotly.offline.iplot({"data": [go.Scatter(x=time_wav,
                                          y=data_wav[:, 0],
                                          name='left channel'),
                               go.Scatter(x=time_wav,
                                          y=data_wav[:, 1],
                                          name='right channel')]})
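Since each channel is just a column, channel operations become simple array operations. As a hedged illustration, the sketch below (assuming NumPy, with a made-up three-sample stereo array and a hypothetical helper name) separates the two channels and averages them into a mono signal, which is the same kind of downmix described in Part I:

```python
import numpy as np

def stereo_to_mono(data):
    """Average the two columns (left, right) of a stereo array into one channel."""
    return data.astype(np.float64).mean(axis=1)

stereo = np.array([[100, 300], [-200, 200], [50, 50]], dtype=np.int16)
left, right = stereo[:, 0], stereo[:, 1]
mono = stereo_to_mono(stereo)
print(left, right, mono)
```

The conversion to float before averaging avoids integer overflow when both channels hold large 16-bit values.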
Normalization
Normalization is necessary for performing computations on the audio signal values, as it makes the signal values independent of the sample resolution (e.g. signals with 24 bits per sample have a much higher range of values than signals with 16 bits per sample). The following example demonstrates how to normalize an audio signal to the (-1, 1) range, by simply dividing by 2¹⁵.
This works because we know that the sample resolution is 16 bits per sample. In the rare case of 24 bits per sample, the normalization factor should change accordingly.
# Normalization
fs_wav, data_wav = wavfile.read("data/lost_highway_small.wav")
data_wav_norm = data_wav / (2**15)
time_wav = np.arange(0, len(data_wav)) / fs_wav
plotly.offline.iplot({"data": [go.Scatter(x=time_wav,
                                          y=data_wav_norm,
                                          name='normalized audio signal')]})
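If the sample resolution is not known in advance, the normalization factor can be derived from the array's integer type instead of being hard-coded. This is a minimal sketch under the assumption that the samples are stored as signed integers (e.g. int16 or int32); the function name normalize is hypothetical:

```python
import numpy as np

def normalize(data):
    """Map signed-integer PCM samples to floats in the [-1, 1) range."""
    if np.issubdtype(data.dtype, np.integer):
        # e.g. 32767 + 1 = 2^15 for int16
        return data / float(np.iinfo(data.dtype).max + 1)
    return data.astype(np.float64)    # assume floats are already normalized

x16 = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
x32 = np.array([-2**31, 0, 2**30], dtype=np.int32)
print(normalize(x16))
print(normalize(x32))
```

Both arrays end up in the same range, so the rest of the processing does not need to know how many bits per sample the file used.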
Trim / Segment
The following example shows how to get seconds 2 to 4 from the previously loaded and normalized signal. This is done by simply referring to the respective indices in the numpy array. The indices must be in audio samples, so seconds need to be multiplied by the sampling frequency.
# Trim (segment) audio signal (2 seconds)
data_wav_norm_crop = data_wav_norm[2 * fs_wav: 4 * fs_wav]
# the time axis must have as many points as the cropped signal
time_wav_crop = np.arange(0, len(data_wav_norm_crop)) / fs_wav
plotly.offline.iplot({"data": [go.Scatter(x=time_wav_crop,
                                          y=data_wav_norm_crop,
                                          name='cropped audio signal')]})
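The seconds-to-samples conversion can be wrapped in a small helper. The sketch below is one possible way to do it, assuming NumPy; the function name trim and the random toy signal are illustrative. It also returns a time axis that starts at the trimming point, so that a plot shows the original timestamps:

```python
import numpy as np

def trim(signal, fs, start_sec, end_sec):
    """Return the samples between start_sec and end_sec, plus their time axis."""
    start, end = int(start_sec * fs), int(end_sec * fs)
    cropped = signal[start:end]
    time = start_sec + np.arange(len(cropped)) / fs
    return cropped, time

fs = 8000
signal = np.random.uniform(-1, 1, 5 * fs)     # 5 seconds of toy noise
cropped, time = trim(signal, fs, 2, 4)
print(len(cropped) / fs, time[0], time[-1])
```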
Fixed-size segmentation
In the first part we showed how we can segment a long recording into non-overlapping segments using ffmpeg. The following code sample shows how to do the same with Python. The list comprehension does the actual segmentation in a single statement. Overall, the following script loads and normalizes an audio signal, then breaks it into 1-second segments and writes each one of them to a file.
(Note that after normalization the samples are floating-point numbers, so they need to be scaled back by 2¹⁵ and cast to 16-bit integers if the files are to be saved with 16 bits per sample.)
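# Fixed-size segmentation (breaks a signal into non-overlapping segments)
fs, signal = wavfile.read("data/obama.wav")
signal = signal / (2**15)
signal_len = len(signal)
segment_size_t = 1  # segment size in seconds
segment_size = segment_size_t * fs  # segment size in samples
# Break signal into list of segments in a single-line Python code
segment_list = [signal[x:x + segment_size] for x in
                np.arange(0, signal_len, segment_size)]
# Store the segments in an object array: the last segment may be shorter,
# and recent numpy versions refuse to build a regular array from ragged input
segments = np.empty(len(segment_list), dtype=object)
for i, seg in enumerate(segment_list):
    segments[i] = seg
# Save each segment in a separate file
for iS, s in enumerate(segments):
    # scale back and cast to 16-bit integers before saving, because the
    # normalization above converted the samples to 64-bit floats
    wavfile.write("data/obama_segment_{0:d}_{1:d}.wav".format(segment_size_t * iS,
                  segment_size_t * (iS + 1)), fs, np.int16(s * 2**15))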
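One detail worth checking is what happens when the signal length is not an exact multiple of the segment size. The following sketch, assuming NumPy and using a hypothetical helper name and 2.5 seconds of toy data, shows that the last segment simply comes out shorter, which is why a plain Python list is a safe container for the segments:

```python
import numpy as np

def split_fixed(signal, fs, segment_sec=1.0):
    """Split a 1-D signal into consecutive segments of segment_sec seconds.
    The last segment is shorter when the length is not an exact multiple."""
    size = int(segment_sec * fs)
    return [signal[i:i + size] for i in range(0, len(signal), size)]

fs = 8000
signal = np.zeros(int(2.5 * fs))              # 2.5 seconds of toy data
segments = split_fixed(signal, fs)
print([len(s) / fs for s in segments])
```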
A simple algorithm to remove silent segments from a recording
The previous script has broken a recording into a list of 1-second segments. The code below implements a very simple silence removal method. To this end, it computes the energy of each segment as the mean of its squared samples, then calculates a threshold as 50% of the median energy value, and finally keeps the segments whose energy is above that threshold:
import IPython
# Remove pauses using an energy threshold = 50% of the median energy:
energies = np.array([(s**2).sum() / len(s) for s in segments])
# (attention: integer overflow would occur without normalization here!)
thres = 0.5 * np.median(energies)
index_of_segments_to_keep = (np.where(energies > thres)[0])
# get segments that have energies higher than the threshold:
segments2 = [segments[i] for i in index_of_segments_to_keep]
# concatenate segments to signal:
new_signal = np.concatenate(segments2)
# and write to file:
wavfile.write("data/obama_processed.wav", fs, new_signal)
plotly.offline.iplot({"data": [go.Scatter(y=energies, name="energy"),
                               go.Scatter(y=np.ones(len(energies)) * thres,
                                          name="thres")]})
# play the initial and the generated files in notebook:
IPython.display.display(IPython.display.Audio("data/obama.wav"))
IPython.display.display(IPython.display.Audio("data/obama_processed.wav"))
The energy / threshold plot is shown in the figure below (all segments whose energies are below the red line are removed from the processed recording). Also, note the last two lines of code (using the IPython.display.display() function) that are used to add a clickable audio clip directly in the notebook for both the initial and the processed audio files, as the following screenshot shows:
You can listen to the original and processed (after silence removal) recordings below:
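The same thresholding logic can be checked without any audio file. The sketch below, assuming NumPy, wraps the steps in a hypothetical function remove_silence and applies it to a synthetic signal made of 1-second tones and 1-second silences. It is a toy verification and not a robust voice activity detector; for instance, it would fail on a recording where no segment exceeds the threshold:

```python
import numpy as np

def remove_silence(signal, fs, segment_sec=1.0, ratio=0.5):
    """Keep only the segments whose mean energy exceeds ratio * median energy."""
    size = int(segment_sec * fs)
    segments = [signal[i:i + size] for i in range(0, len(signal), size)]
    energies = np.array([np.mean(s ** 2) for s in segments])
    threshold = ratio * np.median(energies)
    kept = [s for s, e in zip(segments, energies) if e > threshold]
    return np.concatenate(kept), energies, threshold

fs = 8000
t = np.arange(fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 220 * t)       # 1 second of "activity"
silence = np.zeros(fs)                         # 1 second of silence
signal = np.concatenate([tone, silence, tone, tone, silence])
processed, energies, threshold = remove_silence(signal, fs)
print(len(signal) / fs, "->", len(processed) / fs, "seconds")
```

Out of the five 1-second segments, only the three that contain the tone survive.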
Music analysis: a toy example on bpm (beats per minute) estimation
Music analysis is an application domain of signal processing and machine learning that focuses on analyzing musical signals, mostly for content-based retrieval and recommendation. One of the major tasks in music analysis is to extract high-level attributes that describe a song, such as its musical genre and the underlying mood.
Tempo is one of the most important attributes of a song. Tempo tracking is the task of automatically estimating a song's tempo (in bpm) directly from the signal. One of the basic implementations of tempo tracking is included in the librosa library.
The following toy example takes as input a mono audio file where a song is stored and produces a stereo file where the left channel is the initial song, while the right channel is an artificially generated periodic "beep" sound that "follows" the main tempo of the song:
import numpy as np
import scipy.io.wavfile as wavfile
import librosa
import IPython
# load file and extract tempo and beats:
[Fs, s] = wavfile.read('data/music_44100.wav')
tempo, beats = librosa.beat.beat_track(y=s.astype('float'), sr=Fs, units="time")
beats -= 0.05
# add small 220Hz sounds on the 2nd channel of the song ON EACH BEAT
s = s.reshape(-1, 1)
s = np.array(np.concatenate((s, np.zeros(s.shape)), axis=1))
for ib, b in enumerate(beats):
    t = np.arange(0, 0.2, 1.0 / Fs)
    amp_mod = 0.2 / (np.sqrt(t)+0.2) - 0.2
    amp_mod[amp_mod < 0] = 0
    x = s.max() * np.cos(2 * np.pi * t * 220) * amp_mod
    # keep the beep inside the signal limits (first / last beats)
    start = max(int(Fs * b), 0)
    end = min(start + x.shape[0], s.shape[0])
    s[start:end, 1] = x[:end - start].astype('int16')
# write a wav file where the 2nd channel has the estimated tempo:
wavfile.write("data/music_44100_with_tempo.wav", Fs, np.int16(s))
# play the generated file in notebook:
IPython.display.display(IPython.display.Audio("data/music_44100_with_tempo.wav"))
The result of the script above is a WAV file where the left channel is the initial song and the right channel is the sequence of beep sounds on the estimated tempo onsets. Below are two examples of generated sounds for two different initial songs:
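As a quick sanity check of the relation between beat times and bpm, the tempo can be approximated from the distance between successive beats. The sketch below assumes NumPy and uses synthetic beat times and a hypothetical function name. It is not how librosa estimates the tempo internally, just the basic arithmetic (60 seconds divided by the typical inter-beat interval):

```python
import numpy as np

def bpm_from_beats(beat_times):
    """Estimate tempo (bpm) from beat times in seconds via the median interval."""
    intervals = np.diff(beat_times)
    return 60.0 / np.median(intervals)

beat_times = np.arange(0.5, 10.0, 0.5)   # synthetic beats every 0.5 seconds
bpm = bpm_from_beats(beat_times)
print(bpm)
```

Beats that are half a second apart correspond to 120 beats per minute.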
Real-time recording and frequency analysis
All of the code samples presented above have mainly focused on reading audio data from files and performing some very basic processing on the audio data, such as trimming or segmentation into fixed-size windows, and then either plotting or saving the processed sounds to files.
The following code goes one step further in two ways: (a) by showing how sound can be captured by a microphone in a way that allows real-time and online processing, and (b) by introducing the frequency domain representation of a sound. Our goal here is to create a simple Python script that captures sound on a segment basis and, for each segment, plots the segment's frequency distribution in the terminal.
Real-time audio capturing is achieved through the pyaudio library. Audio samples are captured in small segments (say, 200 ms long). Then, for each segment, the code presented below computes a basic frequency representation by running the following steps:
- compute the magnitude X of the Fast Fourier Transform (FFT) of the recorded segment. Also, keep the frequency values (in Hz) in a separate array, say freqs. Then, to put it simply, according to the DFT definition, X(i) is the energy of the audio signal that is concentrated in frequency freqs(i) Hz
- downsample X and freqs, so that we keep much fewer frequency coefficients to visualize
- the script also calculates the segment's total energy (not just the energy at particular frequency bins as described in the first step). This is done just to scale the maximum bar width of the frequency visualization.
- plot the downsampled frequency energies X for all (downsampled as well) frequencies using a simple bar plot.
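The steps above can be tried offline on a synthetic segment before involving a microphone. The sketch below is a simplified illustration, assuming NumPy: it uses np.fft instead of scipy.fftpack, a generated 1 kHz tone instead of recorded audio, and a crude text bar plot instead of termplotlib. All variable names are illustrative:

```python
import numpy as np

fs = 8000                        # sampling frequency (Hz)
buff_size = 0.2                  # segment length (seconds)
num_bins = 40                    # number of frequency bins to display

# a synthetic 200 ms segment: a 1 kHz tone instead of microphone input
t = np.arange(int(fs * buff_size)) / fs
x = 0.5 * np.sin(2 * np.pi * 1000 * t)

# step 1: FFT magnitude and the frequency (Hz) of each coefficient
X = np.abs(np.fft.rfft(x))[:-1]              # drop the Nyquist bin
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)[:-1]

# step 2: average groups of neighbouring coefficients down to num_bins values
step = len(X) // num_bins
X2 = X[:step * num_bins].reshape(num_bins, step).mean(axis=1)
freqs2 = freqs[:step * num_bins:step]        # lowest frequency of each bin

# step 3: total energy of the segment
energy = np.mean(x ** 2)

# step 4: a crude text bar plot of the strongest bins
peak = int(np.argmax(X2))
for f, v in zip(freqs2, X2):
    if v > 0.1 * X2[peak]:
        print("{:5d} Hz {}".format(int(round(f)), "#" * int(30 * v / X2[peak])))
```

For a pure 1 kHz tone, the only bar that stands out is the bin that starts at 1000 Hz.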
These four steps are implemented in the following script. The code is also available as part of the paura library (https://github.com/tyiannak/paura/blob/master/paura_lite.py). See the inline comments for a more detailed explanation:
# paura_lite:
# An ultra-simple command-line audio recorder with real-time
# spectrogram visualization
import numpy as np
import pyaudio
import struct
import scipy.fftpack as scp
import termplotlib as tpl
import os
# get window's dimensions
rows, columns = os.popen('stty size', 'r').read().split()
buff_size = 0.2 # window size in seconds
wanted_num_of_bins = 40 # number of frequency bins to display
# initialize soundcard for recording:
fs = 8000
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=fs,
input=True, frames_per_buffer=int(fs * buff_size))
while 1:  # for each recorded window (until Ctrl+C is pressed)
    # get current block and convert to list of short ints,
    block = stream.read(int(fs * buff_size))
    format = "%dh" % (len(block) / 2)
    shorts = struct.unpack(format, block)
    # then normalize and convert to numpy array:
    x = np.double(list(shorts)) / (2**15)
    seg_len = len(x)
    # get total energy of the current window and compute a normalization
    # factor (to be used for visualizing the maximum spectrogram value)
    energy = np.mean(x ** 2)
    max_energy = 0.02  # energy for which the bars are set to max
    max_width_from_energy = int((energy / max_energy) * int(columns)) + 1
    if max_width_from_energy > int(columns) - 10:
        max_width_from_energy = int(columns) - 10
    # get the magnitude of the FFT and the corresponding frequencies
    X = np.abs(scp.fft(x))[0:int(seg_len/2)]
    freqs = (np.arange(0, 1 + 1.0/len(X), 1.0 / len(X)) * fs / 2)
    # ... and resample to a fixed number of frequency bins (to visualize)
    wanted_step = (int(freqs.shape[0] / wanted_num_of_bins))
    freqs2 = freqs[0::wanted_step].astype('int')
    X2 = np.mean(X.reshape(-1, wanted_step), axis=1)
    # plot (freqs, fft) as horizontal histogram:
    fig = tpl.figure()
    fig.barh(X2, labels=[str(int(f)) + " Hz" for f in freqs2[0:-1]],
             show_vals=False, max_width=max_width_from_energy)
    fig.show()
    # add exactly as many new lines as are needed to
    # clear the screen in the next iteration:
    print("\n" * (int(rows) - freqs2.shape[0] - 1))
And this is an execution example of the script:
All code examples presented in Part II are available as a Jupyter notebook in this GitHub repo: https://github.com/tyiannak/basic_audio_handling
The last example (the real-time command-line spectrum analyzer) is available at https://github.com/tyiannak/paura/blob/master/paura_lite.py
About the author (tyiannak.github.io)
Thodoris is currently the Director of ML at behavioralsignals.com, where his work focuses on building algorithms that recognise emotions and behaviors based on audio information. He also teaches multimodal information processing in a Data Science and AI master's program in Athens, Greece.