To create the Cepstral Text-to-Speech (TTS) voice of President Obama, we first gathered publicly available recordings of Obama speaking, most of them speeches. In total, we collected around five hours of recorded speech. Typically, we want our speakers to be recorded in a professional, consistent environment, but because Obama's data came from 40 different public speeches, we had to process the audio so that the speeches would sound as if they had all been recorded in the same room. Furthermore, the speeches were often recorded in rooms with heavy background noise, microphone feedback, and echo. Our technicians meticulously reviewed the audio to remove most of these problems.
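One small part of making many recordings sound consistent is bringing them to a common loudness. The sketch below illustrates that idea only; it is not Cepstral's actual pipeline, and it does not address noise, feedback, or echo removal. The function names `rms` and `match_level` and the target level of 0.1 are assumptions chosen for the example, and audio is represented simply as a list of float samples between -1.0 and 1.0.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_level(samples, target_rms=0.1):
    """Scale a clip so its RMS level is close to target_rms.

    target_rms is an illustrative value, not a recommended setting.
    Silent clips are returned unchanged.
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    # Clamp each sample so the applied gain cannot push it out of range.
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


# Two hypothetical clips recorded at very different levels.
quiet_clip = [0.01, -0.01, 0.01, -0.01]
loud_clip = [0.5, -0.5, 0.5, -0.5]
matched = [match_level(clip) for clip in (quiet_clip, loud_clip)]
```

After this step both clips have roughly the same RMS level, so neither sounds markedly louder than the other when their sounds are later mixed together.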
Our TTS technology works by recombining recorded sounds to make novel words and phrases, so once our technicians had corrected Obama's audio, we segmented, or "chopped up," his speech into its component sounds. This portion of the project involved some speaker adaptation, as people have their own unique ways of pronouncing certain words. To capture Obama's voice, we adapted our build process to his particular style of speaking.
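The idea of recombining recorded sounds can be sketched in a few lines. This is a simplified illustration, not a description of Cepstral's engine: it assumes a hypothetical `inventory` that maps each sound label to exactly one recorded snippet (a list of float samples), and it joins snippets with a short linear crossfade. Practical systems of this kind generally choose among many candidate recordings of each sound, which is not shown here.

```python
def crossfade_concat(units, overlap=2):
    """Join sample lists, linearly crossfading `overlap` samples at each seam."""
    out = []
    for unit in units:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # fade-in weight for the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(labels, inventory):
    """Build a new utterance by concatenating one recorded unit per label."""
    missing = [label for label in labels if label not in inventory]
    if missing:
        raise KeyError(f"no recorded unit for: {missing}")
    return crossfade_concat([inventory[label] for label in labels])


# Hypothetical two-sound inventory with tiny made-up waveforms.
inventory = {
    "HH": [0.0, 0.1, 0.2],
    "AY": [0.2, 0.1, 0.0],
}
utterance = synthesize(["HH", "AY"], inventory)
```

Here the two three-sample snippets share a two-sample crossfade, so the result is four samples long. The crossfade is there to soften the seam between sounds that were originally spoken in different contexts.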
Governor Romney does not have many publicly available speeches, so we created our version of his voice by adapting an existing synthesis voice. Our sound experts studied Romney's voice carefully to learn the attributes of his speaking style. They then edited the audio of the existing voice to reproduce Romney's particular speech characteristics.
Once we had completed our TTS voices, we partnered with another company, Evil Genius Designs, to create the interface, imagery, and animations that make up the SoapBoxing app. Finally, we put the TTS voices in the cloud so that they could be accessed from any iPhone or iPad.