Once More, with (Less) Feeling: artificialized vocals

Standard

 

This semester has been challenging and fun. One class, in particular, really pushed me. It’s a class on music information processing. In other words, it’s a class on how computers interpret and process music as audio. I’ll spare you a lot of the technical stuff, but generally speaking we were treating audio recordings are vectors with each value of the vector corresponding to the amplitude of a sample. This allowed us to do all sorts of silly and interesting things to the audio files.

The culmination of the class is an independent project that utilizes principles learned from the class. This presented a unique opportunity to design an effect that I’ve wanted but couldn’t find: a way to make my voice sound like a machine. Sure, there’s vocoders, pitch quantizers, ring modulators, choruses, and more… but they don’t quite do what I want. The vocoder gets awfully close, but having to speak the vocals and also perform the melody on a keyboard is no fun. iZotope’s VocalSynth actually gets very close to what I want, but even that is hard to blend the real and the artificial. There had to be something different!

And now there is. Before I can explain what I did, here’s a little primer on some stuff:

Every sound we hear can be broken down into a combination of sine waves. Each wave has 3 parameters: frequency (pitch), amplitude (loudness), and phase. You’ll note that phase doesn’t have an everyday analog like frequency does with pitch. That’s probably because our hearing isn’t sensitive to phase (with some exceptions not covered here). Below is a picture of a sine wave.

zxfec

See how the wave starts at the horizontal line that bisects the wave? This sine wave has a phase of 0 degrees. If it started at the peak and went down, it would have a phase of 90 degrees. If it started in the middle and went down, it would have a phase of 180, and so forth.

As I said, we don’t really hear phase, but it’s a crucial part of a sound because multiple sine waves are added together to make complex sounds. Some of them reinforce each other, others cancel each other out. All in all, they have a very complex relationship to each other.

This notion of a complex wave represented by a series of sine waves comes from a guy named Fourier. (He’s French so it’s “Four-E-ay.”) There’s a lot of different flavors of the Fourier Transforms, but the type relevant here is the Finite (or Fast) Fourier Transform. This one only deals with finite numbers, which are very computer friendly.

There’s a subset of the FFT called the STFT (short-time Fourier Transform) that maintains phase information in such a way that it’s easier to play with. One of the simplest tricks is to set all of the phases to 0. This makes a monotone, robotic voice with a few parameters changed. Hm! That’s fun, but not very musical.

STFTs, as the name implies, analyze very short segments of audio then jump forward and analyze another short segment. Short, in this case, means something like 0.023 seconds (1024 samples at 44.1k) of audio at a time. Here’s where the robot voice comes in: instead of jumping ahead to the next unread segment, I’ll tell it to jump ahead, say, a quarter of the way and grab 0.023 seconds, then jump another quarter and so on. This imposes a sort of periodicity to the sound, and periodicity is pitch!

By manipulating the distance I am jumping ahead, I can impose different pitches on the audio. This is essentially what I did in my project. More specifically, I:

  1. Made a sample-accurate score of the desired pitches
  2. Made a bunch of vectors for start time, end time, and desired pitches (expressed as a ratio)
  3. Made a loop to step through these vectors
  4. Grabbed a chunk of sound from a WAV file
  5. Performed an STFT using the pitches I plugged in
  6. Did an inverse STFT to turn it back into a vector with just amplitube values for samples
  7. Turned that back into a WAV file

(See the end of the post for a copy of my code.)

Here’s what I ended up with!

And here’s what it started as:

Please be forgiving of the original version. It’s not great… I was trying to perform in such a way that would make this process easier. It did, but the trade off was a particularly weak vocal performance. Yeesh. My pitch, vowels, and timbre were all over the place!

Anyway, here’s the code. You’ll need R (or R Studio!) and TuneR. Oh, and the solo vocal track.

setWavPlayer("/Library/Audio/playRWave")

stft = function(y,H,N) {
 v = seq(from=0,by=2*pi/N,length=N) 
 win = (1 + cos(v-pi))/2
 cols = floor((length(y)-N)/H) + 1
 stft = matrix(0,N,cols)
 for (t in 1:cols) {
 range = (1+(t-1)*H): ((t-1)*H + N)
 chunk = y[range]
 stft[,t] = fft(chunk*win)
 } 
 stft
}

istft = function(Y,H,N) {
 v = seq(from=0,by=2*pi/N,length=N) 
 win = (1 + cos(v-pi))/2
 y = rep(0,N + H*ncol(Y))
 for (t in 1:ncol(Y)) {
 chunk = fft(Y[,t],inverse=T)/N
 range = (1+(t-1)*H): ((t-1)*H + N)
 y[range] = y[range] + win*Re(chunk)
 }
 y
}

spectrogram = function(y,N) {
 bright = seq(0,1,by=.01) 
 power = .2
 bright = seq(0,1,by=.01)^power
 grey = rgb(bright,bright,bright) # this will be our color palate --- all grey
 frames = floor(length(y)/N) # number of "frames" (like in movie)
 spect = matrix(0,frames,N/2) # initialize frames x N/2 spectrogram matrix to 0
 # N/2 is # of freqs we compute in fft (as usual)
 v = seq(from=0,by=2*pi/N,length=N) # N evenly spaced pts 0 -- 2*pi
 win = (1 + cos(v-pi))/2 # Our Hann window --- could use something else (or nothing)
 for (t in 1:frames) {
 chunk = y[(1+(t-1)*N):(t*N)] # the frame t of audio data
 Y = fft(chunk*win)
 # Y = fft(chunk)
 spect[t,] = Mod(Y[1:(N/2)]) 
 # spect[t,] = log(1+Mod(Y[1:(N/2)])/1000) # log(1 + x/1000) transformation just changes contrast
 }
 image(spect,col=grey) # show the image using the color map given by "grey"
}


library(tuneR) 
N = 1024
w = readWave("VoxRAW.wav")
y = w@left
full_length = length(y)


bits = 16
i = 1
# this is a vector containing all of the pitch change onsets, in samples
start = c(0,131076,141117,152552,241186,272557,292584,329239,402666,
 459154,474012,491649,697317,786623,804970,824932,900086,924171,
 944914,968743,984086,1082743,1088571,1120457,1132371,1151571,
 1335171,1476343,1614943,1643400,1666886,1995600,2133514,2274429,
 2300571,2325686,3332571,3412114,3437400,3451800,3526457,3540343,
 3569314,3581657,3600943,3610371,3681086,3694800,3745200,3763371,
 3990000,4072371,4091143,4113000,4195286,4216200,4233429,4254000,
 4286743,4380771,4407701,4422086,4443686,4630114,4750886,4768029,
 4906371,4934829,4958914,5286171,5409686,5428714,5565943,5595086,
 5618829,5944543,6068829,6086057,6223714,6250543,6275057)

#this is a vector containing all of the last samples necessary for pitch changes. in samples
end = c(131075,141116,152551,241185,272556,292583,329238,402665,459153,
 474011,491648,697316,786622,804969,824931,900085,924170,944913,
 968742,984085,1082742,1088570,1120456,1132370,1151570,1335170,
 1476342,1614942,1643399,1666885,1995599,2133513,2274428,2300570,
 2325685,3332570,3412113,3437399,3451799,3526456,3540342,3569313,
 3581656,3600942,3610370,3681085,3694799,3745199,3763370,3989999,
 4072370,4091142,4112999,4195285,4216199,4233428,4253999,4286742,
 4380770,4407700,4422085,4443685,4630113,4750885,4768028,4906370,
 4934828,4958913,5286170,5409685,5428713,5565942,5595085,5618828,
 5944542,6068828,6086056,6223713,6250542,6275056, full_length)

#this ratio determines the pitch we hear by manipulating the window size
ratio = c(4.18128465,3.725101135,3.318687826,3.725101135,4.693333333,
 4.972413456,4.693333333,3.132424191,4.18128465,3.725101135,
 3.318687826,3.132424191,4.18128465,3.725101135,3.318687826,
 3.725101135,4.693333333,4.972413456,4.693333333,3.725101135,
 3.318687826,4.18128465,4.18128465,3.725101135,3.318687826,
 3.132424191,4.972413456,5.581345393,3.725101135,4.18128465,
 4.972413456,4.972413456,5.581345393,3.725101135,4.18128465,
 4.972413456,4.18128465,3.725101135,3.318687826,3.725101135,
 4.18128465,4.693333333,4.972413456,4.693333333,3.725101135,
 3.318687826,2.486206728,4.18128465,3.725101135,3.132424191,
 4.18128465,3.725101135,3.318687826,3.725101135,4.693333333,
 4.972413456,4.693333333,3.725101135,3.318687826,4.18128465,
 3.725101135,3.318687826,3.132424191,4.972413456,3.725101135,
 5.581345393,3.725101135,4.18128465,4.972413456,4.972413456,
 3.725101135,5.581345393,3.725101135,4.18128465,4.972413456,
 4.972413456,3.725101135,5.581345393,3.725101135,4.18128465,
 4.972413456)

w = readWave("VoxRAW.wav")
sr = w@samp.rate
y = w@left
ans = 0

for (i in 1:81) {
#the loop steps through each of the 3 above vectors 
frame = y[start[i]:end[i]] #take a bit of the wave from start to end
 
H = N/ratio[i] #make the window this size to change the perceived pitch
Y = stft(frame,H,N)


Y = matrix(complex(modulus = Mod(Y), argument = rep(0,length(Y))),nrow(Y),ncol(Y)) # robotization
ybar = istft(Y,H,N)
ans = c(ans,ybar) #concatinate all of the steps along the way

i = i + 1 #step through the loops
}

ans = (2^14)*ans/max(ans) #do some rounding to make sure it all fits
u = Wave(round(ans), samp.rate = sr, bit=bits) # make wave struct
#writeWave(u, "robotvox.wav") #save the robot version
o = readWave("VoxRAW.wav")
o = o@left
spectrogram(o, 1024) #what does the original recording look like?
r = readWave("robotvox.wav")
r = r@left
spectrogram(r, 1024) #what does the robot version look like?
#play(u) #listen to the robot version

MP3s don’t matter (until they do)

Standard

I’ve written before on some of the differences in MP3s vs WAVs, specifically how MP3s seem to invoke more negativity than WAVs in a blind test. I don’t know about you, but I thought those results were interesting and weird. So, I thought it made sense to kind of zoom out and try and get a bigger picture of this phenomenon.

A logical first step was to ask “Can people even hear the difference between WAVs and MP3s in their day-to-day life? If so, in what circumstances?” As the title implies, people generally can’t tell in most circumstances but once they do, it is a very pronounced shift.

The Experiment

I made an online experiment, asking people to listen to 16 different pairs of song segments and select the one they thought sounded better. There were 4 levels of MP3 compression: 320k, 192k, 128k, and 64k.

‘Why those levels of compression?’ you might be wondering. Amazon and Tidal deliver at 320k, Spotify premium does 192k, YouTube does 128k, and Pandora’s free streaming is 64k.

For each pair, one version of the segment was a WAV and the other was an MP3. (See below for more detail.) I also asked basic demographic information and how they usually listen to music and how they were listening to the experiment. For example, a lot of people use Spotify regularly for music listening on their phones, and a lot of people used their phones to do the experiment. Doing the experiment gave up a lot of control over how and where people listened, but the goal was to capture a realistic listening environment.

The Songs

I selected songs that are generally considered to be good recordings capable of offering a kind of audiophile experience. Also, I tried to choose “brighter” sounding recordings because they are particularly susceptible to MP3 artifacts. The thought behind this was to maximize the chance for identification of sonic differences, because I was doubtful there would be any difference until a very high level of compression.

I also split the songs into eras: Pre and Post MP3. I thought that maybe music production techniques might change to accommodate the MP3 medium, and maybe MP3s would be easier to detect in recordings that were not conceived for the medium.

The Song List by Era

Pre MP3 (pre 1993):

  1. David Bowie – Golden Years (1999 remaster)
  2. NIN – Terrible Lie
  3. Cowboy Junkies – Sweet Jane
  4. U2 – With Or Without You
  5. Lou Reed – Underneath the Bottle
  6. Lou Reed & John Cale – Style It Takes
  7. Yes – You and I
  8. Pink Floyd – Time

Post MP3:

  1. Buena Vista Social Club – Chan Chan
  2. Lou Reed – Future Farmers of America
  3. Air – Tropical Disease
  4. David Bowie – Battle for Britain
  5. Squarepusher – Ultravisitor
  6. The Flaming Lips – Race for the Prize
  7. Daft Punk – Giving Life Back to Music
  8. Nick Cave & The Bad Seeds – Jesus Alone

The Song List by Compression Level

320k

  1. Cowboy Junkies – Sweet Jane
  2. Lou Reed – Underneath the Bottle
  3. Squarepusher – Ultravisitor
  4. Daft Punk – Giving Life Back to Music

192k

  1. David Bowie – Golden Years (1999 remaster)
  2. NIN – Terrible Lie
  3. The Flaming Lips – Race for the Prize
  4. Air – Tropical Disease

128k

  1. U2 – With Or Without You
  2. Lou Reed & John Cale – Style It Takes
  3. Buena Vista Social Club – Chan Chan
  4. Nick Cave & The Bad Seeds – Jesus Alone

64k

  1. Pink Floyd – Time
  2. Bowie – Battle for Britain
  3. Lou Reed – Future Farmers of America
  4. Yes – You and I

The Participants

I had a total of 17 participants complete the experiment (and 1 more do part of the listening task) and a whole lot of bogus entries by bots…. sigh. Here’s some info on the real humans that did the experiment:

Pie Charts2.png

Note: options with 0 responses are not shown

Pie Charts3.png

Pie Charts4.png

“Which best describes your favorite way to listen to music that you have regular access to?” was the full question. I didn’t want everyone to think back to that one time they heard a really nice stereo!

Pie Charts5.png

Pie Charts6.png

Pie Charts7.png

“This includes informal or self-taught training. Examples of this include – but are not limited to – musicians, audio engineers, and audiophiles.”

 

Unfortunately, the sample size wasn’t big enough to do any interesting statistical analyses with this demographic info, but it’s still informative to help understand who created this data set.

The Results

Participants reliably (meaning, a statistically significant binomial test) selected WAVs as higher fidelity when the MP3s were 64k. Other than that, there was no statistical difference.

OUTPUT.png

OUTPUT1.png

OUTPUT2.png

OUTPUT3.png

11 to 57 in favor of WAV, p <0.001

When I first looked at the Pre/Post MP3 comparison, I was flummoxed. There is a statistical difference in the Post MP3 category… favoring WAVs.

866

That’s pretty counter-intuitive. That would be like finding that people preferred listening to the Beatles on CD instead of vinyl. It just doesn’t make sense. Why would recordings sound worse in the new hip medium that everyone’s using?

They don’t. My categorization was clumsy. So, yes, I selected 8 songs that were recorded after MP3s were invented, but what I didn’t consider is that the MP3 was not a cultural force until about a decade later, and not a force in the music industry until later than that even. So I went back and looked at just the Post MP3 category and split it again. Figuring out when the MP3 because a major force in the recording industry was a rabbit hole I didn’t want to go down, so I used a proxy: Jonathan Sterne, a scholar who looks at recording technology, published an article in 2006 discussing the MP3 as a cultural artifact. And luckily enough, using 2006 ended up being fruitful because of my 8 songs in the Post MP3 category, none were released on or even near 2006. I had 5 released before and 3 released after, and when I analyzed those groups, there was a strong preference for WAV in the older recordings but not in the newest recordings. This suggests that yes, recordings, after a certain date, are generally recorded to sound just as good as MP3s of a certain quality or WAVs. Here’s the analysis:

better-mp31

25 to 60 in favor of WAV, p < 0.001

 

better-mp3

So, to sum up: the debate between WAV and MP3 doesn’t matter in terms of identifying fidelity differences in real world situations for these participants UNTIL the compression levels are extreme. And, recordings designed for CDs and not MP3s sound better on CDs than MP3s, but it doesn’t matter for older recordings. If I had to guess it could be because some of the limitations of the vinyl medium are similar to MP3 (gasp! Heresy!) and so recordings designed for vinyl work kinda well as MP3s, too.