Audio to Spectrogram and Back

Having established how I was going to build my dataset and that I was going to use pix2pix (initially at least), I needed to work out how to convert the audio to a spectrogram image that could be run through pix2pix and then converted back to audio.

Converting the audio to a spectrogram and saving it as a 32-bit floating point TIFF image in Python was reasonably easy.

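import librosa
import numpy as np
from PIL import Image

# Load the audio (librosa resamples to 22050 Hz by default)
x, sr = librosa.load("My.wav")
# Short-time Fourier transform: complex spectrogram
X = librosa.stft(x)
# Keep the magnitude only and convert it to decibels
Xdb = librosa.amplitude_to_db(np.abs(X))
# Store the dB values as a 32-bit float image
im = Image.fromarray(Xdb).convert('F')
im.save("test.tiff")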

Importing that TIFF and converting it back to audio works quite well. The phase is discarded when the magnitude is taken, so Griffin-Lim has to estimate it, and there’s some loss of quality. For my purposes it’s not significant, and I actually like the slightly ‘artificial’ tonality that results from the process.

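import librosa
import numpy as np
import soundfile as sf
from PIL import Image

# Read the dB spectrogram back from the float TIFF
img = Image.open("test.tiff")
recspec = np.array(img)

# Convert dB back to linear magnitude
X2 = librosa.db_to_amplitude(recspec)
# Griffin-Lim estimates the missing phase and returns a waveform
audio = librosa.griffinlim(X2)
# sr is the sample rate returned by librosa.load above
sf.write("test1.wav", audio, sr)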
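To see what the dB conversion on its own can and cannot recover, here is a minimal NumPy sketch. It is a simplified stand-in for the decibel conversion (20·log10 with a floor 80 dB below the peak, which I believe matches librosa's defaults), not librosa's actual code. The names to_db and from_db and the sample magnitudes are hypothetical.

```python
import numpy as np

def to_db(mag, amin=1e-5, top_db=80.0):
    """Simplified amplitude-to-dB conversion (illustrative only)."""
    db = 20.0 * np.log10(np.maximum(mag, amin))
    # Anything more than top_db below the peak is clipped to the floor
    return np.maximum(db, db.max() - top_db)

def from_db(db):
    """Inverse mapping: dB back to linear amplitude."""
    return 10.0 ** (db / 20.0)

# Hypothetical magnitudes spanning a wide dynamic range
mag = np.array([1.0, 0.1, 1e-3, 1e-6])
restored = from_db(to_db(mag))
print(restored)
```

Values within 80 dB of the peak come back unchanged, while the quietest one is raised to the floor. The phase is not part of this mapping at all, which is why Griffin-Lim is needed afterwards.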

Unfortunately I can’t use 32-bit floating point TIFF images with pix2pix. I can save the spectrograms as JPEGs and recreate the audio from them, but it does come at a loss of quality.

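import librosa
import numpy as np
from PIL import Image

x, sr = librosa.load("My.wav")
X = librosa.stft(x)
# Reference the dB scale to the median magnitude
Xdb = librosa.amplitude_to_db(np.abs(X), ref=np.median)
# Mode 'L' is 8-bit grayscale: values outside 0..255 are clipped
im = Image.fromarray(Xdb).convert('L')
im.save("test.jpg")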

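import librosa
import numpy as np
import soundfile as sf
from PIL import Image

img = Image.open("test.jpg")
# Pixels are uint8; convert to float before treating them as dB
recspec = np.array(img).astype(np.float32)

X2 = librosa.db_to_amplitude(recspec)
audio = librosa.griffinlim(X2)
# sr is the sample rate returned by librosa.load above
sf.write("testjpg.wav", audio, sr)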
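One possible way to reduce the loss in the 8-bit version is to rescale the dB values onto the full 0..255 range before saving, and to undo that scaling after loading. The sketch below is an assumption on my part rather than what the code above does: db_to_uint8 and uint8_to_db are hypothetical helpers, and a random array stands in for a real dB spectrogram so that the example runs without an audio file.

```python
import numpy as np

def db_to_uint8(db):
    """Linearly map a dB array onto 0..255; also return the range used."""
    lo, hi = float(db.min()), float(db.max())
    span = (hi - lo) or 1.0  # guard against a constant image
    scaled = np.round((db - lo) / span * 255.0)
    return scaled.astype(np.uint8), lo, hi

def uint8_to_db(pixels, lo, hi):
    """Undo db_to_uint8 given the stored range."""
    return pixels.astype(np.float32) / 255.0 * (hi - lo) + lo

# Stand-in for a real dB spectrogram (hypothetical data)
rng = np.random.default_rng(0)
db = rng.uniform(-40.0, 40.0, size=(1025, 64)).astype(np.float32)

pixels, lo, hi = db_to_uint8(db)
restored = uint8_to_db(pixels, lo, hi)
max_error = float(np.abs(restored - db).max())
print(f"worst-case quantisation error: {max_error:.3f} dB")
```

The range (lo, hi) has to be kept alongside the image to reverse the mapping, and JPEG compression would add its own error on top of the quantisation measured here.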

I may be able to live with this loss of quality, but I will be exploring methods to improve it. At the moment the spectrograms are grayscale, so they only use one colour channel and therefore only 8 of the available bits. I will look at some methods of colour mapping the spectrogram to try to make use of all 3 colour channels and all 24 bits. I would also like to experiment with Tacotron or WaveNet to convert the spectrograms back to audio using synthesis; since my source is all dialogue, this might yield better results.
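As a rough illustration of how much precision 24 bits could hold, the sketch below quantises the dB values to a 24-bit integer and splits it across the R, G and B channels. This is plain bit packing, not a perceptual colour map, and it is only an assumption about one way to fill three channels. pack_rgb and unpack_rgb are hypothetical names, and the random array again stands in for a real spectrogram. It would need a lossless format such as PNG, since JPEG compression (and most likely pix2pix itself) would scramble the low-order channel.

```python
import numpy as np

FULL_SCALE = 2**24 - 1  # largest 24-bit value

def pack_rgb(db):
    """Quantise dB values to 24 bits and split them into R, G, B bytes."""
    lo, hi = float(db.min()), float(db.max())
    span = (hi - lo) or 1.0
    # float64 represents every 24-bit integer exactly
    q = np.round((db.astype(np.float64) - lo) / span * FULL_SCALE).astype(np.uint32)
    rgb = np.stack([(q >> 16) & 0xFF, (q >> 8) & 0xFF, q & 0xFF], axis=-1)
    return rgb.astype(np.uint8), lo, hi

def unpack_rgb(rgb, lo, hi):
    """Rebuild the 24-bit integer and map it back to dB."""
    c = rgb.astype(np.uint32)
    q = (c[..., 0] << 16) | (c[..., 1] << 8) | c[..., 2]
    return q.astype(np.float64) / FULL_SCALE * (hi - lo) + lo

# Stand-in for a real dB spectrogram (hypothetical data)
rng = np.random.default_rng(0)
db = rng.uniform(-40.0, 40.0, size=(1025, 64)).astype(np.float32)

rgb, lo, hi = pack_rgb(db)
restored = unpack_rgb(rgb, lo, hi)
max_error = float(np.abs(restored - db).max())
print(f"worst-case error with 24 bits: {max_error:.2e} dB")
```

The resulting array has shape (height, width, 3) and dtype uint8, so it can be passed to Image.fromarray as an RGB image.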
