One of my main goals for this project was for the sound generated to be unique to each audience member. My solution was to use their face (or visage) as the input for the sound generation. I initially had ideas around using a camera to read biometric information and then utilising a sonification process to generate the sound. While this would give me the unique experience I was after, the sonification process was always going to be based around arbitrary connections I had preset, which I found deeply unsatisfying. Without really understanding what it would involve, I decided the best approach would be to train an AI for the sound generation.
The first thing I would need was a training dataset. I had thought I could create my own, and looked at personally collecting images of people's faces and recording them reciting their names. It quickly became obvious that I was never going to be able to collect enough examples myself, and that the demographic covered would be quite limited. I gave a fair amount of thought to what I wanted the connection to be between the person experiencing the work and the sound being generated. The concept behind using the audio of the recited names was the idea that we grow into our names; who we are is somehow reflected in how we look. What I realised was that the most important thing was not the words that were recorded but the tonality of the spoken words: the emotion in the words as they are spoken is reflected in the person's facial expression. Because of this I have decided that it doesn't matter what is being said, or even in what language; what I am after is the connection between the facial expression being displayed and the emotion in what is being said.
So I am currently piecing together some code in Python to construct a dataset by pulling facial images and the accompanying audio from video files. Using OpenCV's face detection, I can run through a video file and store images of any detected faces. I am also able to write out to a file the audio present in the video on either side of the saved visage. As I am planning to use PIX2PIX (at least initially) to train the AI, I need to convert the saved audio to an image. I have been able to save the audio as a spectrogram in TIFF format and then successfully convert that TIFF back to audio. For PIX2PIX, though, I need the spectrogram image saved in JPG format, which is where I am stuck at the moment. All my attempts to convert the 32-bit float TIFF image to JPG have resulted in a significant loss of contrast; I think it's going to require a Stack Overflow post.
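For anyone curious, the face-grabbing step boils down to something like the sketch below. It uses OpenCV's bundled Haar cascade detector and saves each face crop along with its timestamp, so the audio either side of that moment can be cut out afterwards. The paths, filenames and detector parameters here are placeholders rather than my final values.

```python
import os
import cv2

# Simplified sketch: walk through a video, save a crop of every detected face,
# and keep the timestamp in the filename so the surrounding audio can be sliced later.
os.makedirs("faces", exist_ok=True)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

video = cv2.VideoCapture("input_video.mp4")   # placeholder path
fps = video.get(cv2.CAP_PROP_FPS)
frame_idx = 0
saved = 0

while True:
    ok, frame = video.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        timestamp = frame_idx / fps   # seconds into the video
        cv2.imwrite(f"faces/face_{saved:05d}_{timestamp:.2f}.jpg",
                    frame[y:y + h, x:x + w])
        saved += 1
    frame_idx += 1

video.release()
```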
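On the contrast problem, my current suspicion is that the 32-bit float values are being clipped rather than rescaled when they get squeezed into the 8 bits a JPG can hold. Something along the lines of the sketch below is what I plan to try next; it is untested, assumes the spectrogram TIFF is a single-channel float image, and the filenames are placeholders.

```python
import numpy as np
from PIL import Image

# Untested sketch: rescale the 32-bit float spectrogram into the 0-255 range
# before writing it out as an 8-bit JPG, instead of letting the values clip.
spec = np.array(Image.open("spectrogram.tiff"), dtype=np.float32)
spec_min, spec_max = spec.min(), spec.max()
spec_8bit = ((spec - spec_min) / (spec_max - spec_min) * 255.0).astype(np.uint8)
Image.fromarray(spec_8bit).save("spectrogram.jpg")
```

The catch is that the min and max used for the rescale would need to be stored (or fixed in advance) so the JPG can be mapped back to the original float range when converting the spectrogram back to audio.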
I am creating and running all the code in Google Colab. It has been extremely helpful being able to write and run the code from any internet-connected machine, as well as having the processing power that Colab provides. I am aiming to get all the code syncing up to my GitHub account soon.