Once More, with (Less) Feeling: artificialized vocals

This semester has been challenging and fun. One class, in particular, really pushed me. It’s a class on music information processing. In other words, it’s a class on how computers interpret and process music as audio. I’ll spare you a lot of the technical stuff, but generally speaking we were treating audio recordings are vectors with each value of the vector corresponding to the amplitude of a sample. This allowed us to do all sorts of silly and interesting things to the audio files.

The culmination of the class is an independent project that utilizes principles learned from the class. This presented a unique opportunity to design an effect that I’ve wanted but couldn’t find: a way to make my voice sound like a machine. Sure, there’s vocoders, pitch quantizers, ring modulators, choruses, and more… but they don’t quite do what I want. The vocoder gets awfully close, but having to speak the vocals and also perform the melody on a keyboard is no fun. iZotope’s VocalSynth actually gets very close to what I want, but even that is hard to blend the real and the artificial. There had to be something different!

And now there is. Before I can explain what I did, here’s a little primer on some stuff:

Every sound we hear can be broken down into a combination of sine waves. Each wave has 3 parameters: frequency (pitch), amplitude (loudness), and phase. You’ll note that phase doesn’t have an everyday analog like frequency does with pitch. That’s probably because our hearing isn’t sensitive to phase (with some exceptions not covered here). Below is a picture of a sine wave.

zxfec

See how the wave starts at the horizontal line that bisects the wave? This sine wave has a phase of 0 degrees. If it started at the peak and went down, it would have a phase of 90 degrees. If it started in the middle and went down, it would have a phase of 180, and so forth.

As I said, we don’t really hear phase, but it’s a crucial part of a sound because multiple sine waves are added together to make complex sounds. Some of them reinforce each other, others cancel each other out. All in all, they have a very complex relationship to each other.

This notion of a complex wave represented by a series of sine waves comes from a guy named Fourier. (He’s French so it’s “Four-E-ay.”) There’s a lot of different flavors of the Fourier Transforms, but the type relevant here is the Finite (or Fast) Fourier Transform. This one only deals with finite numbers, which are very computer friendly.

There’s a subset of the FFT called the STFT (short-time Fourier Transform) that maintains phase information in such a way that it’s easier to play with. One of the simplest tricks is to set all of the phases to 0. This makes a monotone, robotic voice with a few parameters changed. Hm! That’s fun, but not very musical.

STFTs, as the name implies, analyze very short segments of audio then jump forward and analyze another short segment. Short, in this case, means something like 0.023 seconds (1024 samples at 44.1k) of audio at a time. Here’s where the robot voice comes in: instead of jumping ahead to the next unread segment, I’ll tell it to jump ahead, say, a quarter of the way and grab 0.023 seconds, then jump another quarter and so on. This imposes a sort of periodicity to the sound, and periodicity is pitch!

By manipulating the distance I am jumping ahead, I can impose different pitches on the audio. This is essentially what I did in my project. More specifically, I:

Made a sample-accurate score of the desired pitches
Made a bunch of vectors for start time, end time, and desired pitches (expressed as a ratio)
Made a loop to step through these vectors
Grabbed a chunk of sound from a WAV file
Performed an STFT using the pitches I plugged in
Did an inverse STFT to turn it back into a vector with just amplitube values for samples
Turned that back into a WAV file

(See the end of the post for a copy of my code.)

Here’s what I ended up with!

And here’s what it started as:

Please be forgiving of the original version. It’s not great… I was trying to perform in such a way that would make this process easier. It did, but the trade off was a particularly weak vocal performance. Yeesh. My pitch, vowels, and timbre were all over the place!

Anyway, here’s the code. You’ll need R (or R Studio!) and TuneR. Oh, and the solo vocal track.

setWavPlayer("/Library/Audio/playRWave")

stft = function(y,H,N) {
 v = seq(from=0,by=2*pi/N,length=N) 
 win = (1 + cos(v-pi))/2
 cols = floor((length(y)-N)/H) + 1
 stft = matrix(0,N,cols)
 for (t in 1:cols) {
 range = (1+(t-1)*H): ((t-1)*H + N)
 chunk = y[range]
 stft[,t] = fft(chunk*win)
 } 
 stft
}

istft = function(Y,H,N) {
 v = seq(from=0,by=2*pi/N,length=N) 
 win = (1 + cos(v-pi))/2
 y = rep(0,N + H*ncol(Y))
 for (t in 1:ncol(Y)) {
 chunk = fft(Y[,t],inverse=T)/N
 range = (1+(t-1)*H): ((t-1)*H + N)
 y[range] = y[range] + win*Re(chunk)
 }
 y
}

spectrogram = function(y,N) {
 bright = seq(0,1,by=.01) 
 power = .2
 bright = seq(0,1,by=.01)^power
 grey = rgb(bright,bright,bright) # this will be our color palate --- all grey
 frames = floor(length(y)/N) # number of "frames" (like in movie)
 spect = matrix(0,frames,N/2) # initialize frames x N/2 spectrogram matrix to 0
 # N/2 is # of freqs we compute in fft (as usual)
 v = seq(from=0,by=2*pi/N,length=N) # N evenly spaced pts 0 -- 2*pi
 win = (1 + cos(v-pi))/2 # Our Hann window --- could use something else (or nothing)
 for (t in 1:frames) {
 chunk = y[(1+(t-1)*N):(t*N)] # the frame t of audio data
 Y = fft(chunk*win)
 # Y = fft(chunk)
 spect[t,] = Mod(Y[1:(N/2)]) 
 # spect[t,] = log(1+Mod(Y[1:(N/2)])/1000) # log(1 + x/1000) transformation just changes contrast
 }
 image(spect,col=grey) # show the image using the color map given by "grey"
}


library(tuneR) 
N = 1024
w = readWave("VoxRAW.wav")
y = w@left
full_length = length(y)


bits = 16
i = 1
# this is a vector containing all of the pitch change onsets, in samples
start = c(0,131076,141117,152552,241186,272557,292584,329239,402666,
 459154,474012,491649,697317,786623,804970,824932,900086,924171,
 944914,968743,984086,1082743,1088571,1120457,1132371,1151571,
 1335171,1476343,1614943,1643400,1666886,1995600,2133514,2274429,
 2300571,2325686,3332571,3412114,3437400,3451800,3526457,3540343,
 3569314,3581657,3600943,3610371,3681086,3694800,3745200,3763371,
 3990000,4072371,4091143,4113000,4195286,4216200,4233429,4254000,
 4286743,4380771,4407701,4422086,4443686,4630114,4750886,4768029,
 4906371,4934829,4958914,5286171,5409686,5428714,5565943,5595086,
 5618829,5944543,6068829,6086057,6223714,6250543,6275057)

#this is a vector containing all of the last samples necessary for pitch changes. in samples
end = c(131075,141116,152551,241185,272556,292583,329238,402665,459153,
 474011,491648,697316,786622,804969,824931,900085,924170,944913,
 968742,984085,1082742,1088570,1120456,1132370,1151570,1335170,
 1476342,1614942,1643399,1666885,1995599,2133513,2274428,2300570,
 2325685,3332570,3412113,3437399,3451799,3526456,3540342,3569313,
 3581656,3600942,3610370,3681085,3694799,3745199,3763370,3989999,
 4072370,4091142,4112999,4195285,4216199,4233428,4253999,4286742,
 4380770,4407700,4422085,4443685,4630113,4750885,4768028,4906370,
 4934828,4958913,5286170,5409685,5428713,5565942,5595085,5618828,
 5944542,6068828,6086056,6223713,6250542,6275056, full_length)

#this ratio determines the pitch we hear by manipulating the window size
ratio = c(4.18128465,3.725101135,3.318687826,3.725101135,4.693333333,
 4.972413456,4.693333333,3.132424191,4.18128465,3.725101135,
 3.318687826,3.132424191,4.18128465,3.725101135,3.318687826,
 3.725101135,4.693333333,4.972413456,4.693333333,3.725101135,
 3.318687826,4.18128465,4.18128465,3.725101135,3.318687826,
 3.132424191,4.972413456,5.581345393,3.725101135,4.18128465,
 4.972413456,4.972413456,5.581345393,3.725101135,4.18128465,
 4.972413456,4.18128465,3.725101135,3.318687826,3.725101135,
 4.18128465,4.693333333,4.972413456,4.693333333,3.725101135,
 3.318687826,2.486206728,4.18128465,3.725101135,3.132424191,
 4.18128465,3.725101135,3.318687826,3.725101135,4.693333333,
 4.972413456,4.693333333,3.725101135,3.318687826,4.18128465,
 3.725101135,3.318687826,3.132424191,4.972413456,3.725101135,
 5.581345393,3.725101135,4.18128465,4.972413456,4.972413456,
 3.725101135,5.581345393,3.725101135,4.18128465,4.972413456,
 4.972413456,3.725101135,5.581345393,3.725101135,4.18128465,
 4.972413456)

w = readWave("VoxRAW.wav")
sr = w@samp.rate
y = w@left
ans = 0

for (i in 1:81) {
#the loop steps through each of the 3 above vectors 
frame = y[start[i]:end[i]] #take a bit of the wave from start to end
 
H = N/ratio[i] #make the window this size to change the perceived pitch
Y = stft(frame,H,N)


Y = matrix(complex(modulus = Mod(Y), argument = rep(0,length(Y))),nrow(Y),ncol(Y)) # robotization
ybar = istft(Y,H,N)
ans = c(ans,ybar) #concatinate all of the steps along the way

i = i + 1 #step through the loops
}

ans = (2^14)*ans/max(ans) #do some rounding to make sure it all fits
u = Wave(round(ans), samp.rate = sr, bit=bits) # make wave struct
#writeWave(u, "robotvox.wav") #save the robot version
o = readWave("VoxRAW.wav")
o = o@left
spectrogram(o, 1024) #what does the original recording look like?
r = readWave("robotvox.wav")
r = r@left
spectrogram(r, 1024) #what does the robot version look like?
#play(u) #listen to the robot version

Joshua D. Sites

A quasi-academic blog about media psychology & cognition of popular music

Once More, with (Less) Feeling: artificialized vocals

Leave a comment Cancel reply

Share this:

Leave a comment Cancel reply