Audio to Spectrogram and Back

Having established how I was going to build my dataset, and that I was going to use pix2pix (initially at least), I needed to work out how to convert the audio to a spectrogram image that could be run through pix2pix and then converted back to audio.

Converting the audio to a spectrogram and saving it as a 32-bit floating point TIFF image in Python was reasonably easy.

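import librosa
from PIL import Image

# Load the audio (librosa resamples to 22050 Hz mono by default)
x, sr = librosa.load("My.wav")
# Short-time Fourier transform, then magnitude in decibels
X = librosa.stft(x)
Xdb = librosa.amplitude_to_db(abs(X))
# 'F' is Pillow's 32-bit floating point mode, so the dB values are kept as floats
im = Image.fromarray(Xdb).convert('F')
im.save("test.tiff")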

Importing that TIFF and converting it back to audio works quite well. There’s some loss of quality, since the phase is discarded when taking the magnitude and Griffin-Lim has to estimate it, but for my purposes it’s not significant and I actually like the slightly ‘artificial’ tonality that results from the process.

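import librosa
import numpy as np
import soundfile as sf
from PIL import Image

# Read the dB spectrogram back from the TIFF
img = Image.open("test.tiff")
recspec = np.array(img)

# Undo the dB scaling, then estimate the missing phase with Griffin-Lim
X2 = librosa.db_to_amplitude(recspec)
audio = librosa.griffinlim(X2)
# sr is the sample rate returned by librosa.load earlier
sf.write("test1.wav", audio, sr)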

Unfortunately I can’t use 32-bit floating point TIFF images with pix2pix. I can save the spectrograms as JPEGs and recreate the audio from them, but it does come at a loss of quality.

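import librosa
import numpy as np  # needed for np.median below
from PIL import Image

x, sr = librosa.load("My.wav")
X = librosa.stft(x)
# dB relative to the median magnitude
Xdb = librosa.amplitude_to_db(abs(X), ref=np.median)
# 'L' is 8-bit grayscale: values are clipped to 0..255 and lose their
# fractional part, so anything below the median (negative dB) becomes 0
im = Image.fromarray(Xdb).convert('L')
im.save("test.jpg")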

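import librosa
import numpy as np
import soundfile as sf
from PIL import Image

# Read the 8-bit grayscale JPEG back and treat the pixel values as dB
img = Image.open("test.jpg")
recspec = np.array(img)

X2 = librosa.db_to_amplitude(recspec)
audio = librosa.griffinlim(X2)
# sr is the sample rate returned by librosa.load earlier
sf.write("testjpg.wav", audio, sr)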
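Part of the loss in the JPEG version happens before any compression: converting to 'L' clips the dB values to 0..255 and drops the fractional part. One possible workaround, shown here only as a minimal sketch, is to rescale the dB range linearly onto 0..255 and keep the minimum and maximum so the scaling can be undone. The helper names db_to_uint8 and uint8_to_db are hypothetical, and the random array is only a stand-in for Xdb, not real audio.

```python
import numpy as np

def db_to_uint8(spec_db):
    """Linearly map a dB spectrogram onto 0..255 and return the range used."""
    lo, hi = float(spec_db.min()), float(spec_db.max())
    span = hi - lo if hi > lo else 1.0  # avoid dividing by zero on a flat input
    scaled = (spec_db - lo) / span * 255.0
    return np.round(scaled).astype(np.uint8), lo, hi

def uint8_to_db(pixels, lo, hi):
    """Invert db_to_uint8 using the stored range."""
    return pixels.astype(np.float32) / 255.0 * (hi - lo) + lo

# Stand-in for Xdb: random values across an 80 dB range (assumption, not real audio)
rng = np.random.default_rng(0)
fake_db = rng.uniform(-40.0, 40.0, size=(1025, 64)).astype(np.float32)

pixels, lo, hi = db_to_uint8(fake_db)
restored = uint8_to_db(pixels, lo, hi)

# Worst-case error from 8-bit quantisation alone is about half a grey level
step = (hi - lo) / 255.0
max_error = float(np.abs(restored - fake_db).max())
print(f"grey level = {step:.3f} dB, worst error = {max_error:.3f} dB")
```

The pixels array could be passed to Image.fromarray in place of the clipped version. The lo and hi values would need to be stored somewhere (a sidecar file, or a fixed range used for every clip) so the spectrogram can be rescaled before db_to_amplitude. JPEG compression would still add its own error on top of this.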

I may be able to live with this loss of quality, but I will be exploring methods to improve it. At the moment the spectrograms are grayscale, so they only use one colour channel and therefore only 8 of the 24 available bits. I will look at some methods of colour mapping the spectrogram to try to make use of all 3 colour channels and all 24 bits. I would also like to experiment with Tacotron or WaveNet to convert the spectrograms back to audio using synthesis; since my source is all dialogue, this might yield better results.
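To make the 8-bit versus 24-bit idea concrete, here is a rough sketch of one way to spread a single value across three channels: quantise each dB value to 24 bits and store the high, middle and low bytes in R, G and B. This is an assumption about how it could be done, not a tested part of the pipeline, and the names db_to_rgb24 and rgb24_to_db are hypothetical. It only survives a lossless format such as PNG, because JPEG compression would alter the low-order bytes.

```python
import numpy as np

LEVELS = 2**24 - 1  # largest value that fits in 24 bits

def db_to_rgb24(spec_db, lo, hi):
    """Quantise dB values in [lo, hi] to 24 bits and split them across R, G, B."""
    norm = (np.clip(spec_db, lo, hi).astype(np.float64) - lo) / (hi - lo)
    q = np.round(norm * LEVELS).astype(np.uint32)
    r = (q >> 16) & 0xFF  # most significant byte
    g = (q >> 8) & 0xFF
    b = q & 0xFF          # least significant byte
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def rgb24_to_db(rgb, lo, hi):
    """Reassemble the 24-bit value from R, G, B and map it back to dB."""
    rgb = rgb.astype(np.uint32)
    q = (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]
    return q.astype(np.float64) / LEVELS * (hi - lo) + lo

# Stand-in data across an assumed fixed range of -40..40 dB (not real audio)
rng = np.random.default_rng(1)
demo_db = rng.uniform(-40.0, 40.0, size=(8, 8))

rgb = db_to_rgb24(demo_db, -40.0, 40.0)
back = rgb24_to_db(rgb, -40.0, 40.0)
worst = float(np.abs(back - demo_db).max())
print(rgb.shape, rgb.dtype, worst)
```

A byte-split like this is not a perceptual colour map: neighbouring dB values can end up with very different colours, which may be harder for pix2pix to learn than a smooth mapping. Any small error in the red channel of a generated image would also turn into a large error in dB.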
