Once More, with (Less) Feeling: artificialized vocals


This semester has been challenging and fun. One class in particular really pushed me: a class on music information processing. In other words, it’s a class on how computers interpret and process music as audio. I’ll spare you a lot of the technical stuff, but generally speaking, we treated audio recordings as vectors, with each value of the vector corresponding to the amplitude of one sample. This allowed us to do all sorts of silly and interesting things to the audio files.
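For a concrete sense of what that looks like, here’s a minimal sketch using the tuneR package (the same one the project below relies on); the file name is hypothetical:

library(tuneR)
w = readWave("example.wav") # hypothetical file name
y = w@left # one amplitude value per sample
length(y)/w@samp.rate # duration of the recording in seconds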

The culmination of the class is an independent project that utilizes principles learned in the class. This presented a unique opportunity to design an effect that I’ve wanted but couldn’t find: a way to make my voice sound like a machine. Sure, there are vocoders, pitch quantizers, ring modulators, choruses, and more… but they don’t quite do what I want. The vocoder gets awfully close, but having to speak the vocals while also performing the melody on a keyboard is no fun. iZotope’s VocalSynth gets very close too, but even with it, blending the real and the artificial is hard. There had to be something different!

And now there is. Before I can explain what I did, here’s a little primer on some stuff:

Every sound we hear can be broken down into a combination of sine waves. Each wave has three parameters: frequency (pitch), amplitude (loudness), and phase. You’ll note that phase doesn’t have an everyday analog the way frequency does with pitch. That’s probably because our hearing isn’t sensitive to phase (with some exceptions not covered here). Below is a picture of a sine wave.

[Figure: a sine wave]

See how the wave starts at the horizontal line that bisects it? This sine wave has a phase of 0 degrees. If it started at the peak and went down, it would have a phase of 90 degrees. If it started in the middle and went down, it would have a phase of 180 degrees, and so forth.

As I said, we don’t really hear phase, but it’s a crucial part of a sound because multiple sine waves are added together to make complex sounds. Some of them reinforce each other, others cancel each other out. All in all, they have a very complex relationship to each other.
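To make those three parameters (and the cancellation) concrete, here’s a small, self-contained R sketch with arbitrary numbers:

sr = 44100 # sample rate: samples per second
t = seq(0, 1, length.out = sr) # one second of time points
freq = 440 # frequency in Hz (heard as pitch)
amp = 0.5 # amplitude (heard as loudness)
phase = 0 # phase in radians (90 degrees = pi/2)
y = amp*sin(2*pi*freq*t + phase)

# two identical waves 180 degrees (pi radians) apart cancel completely
cancel = y + amp*sin(2*pi*freq*t + phase + pi)
max(abs(cancel)) # effectively zero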

This notion of a complex wave represented by a series of sine waves comes from a guy named Fourier. (He’s French, so it’s “Four-E-ay.”) There are a lot of different flavors of the Fourier transform, but the type relevant here is the Discrete Fourier Transform, usually computed with the Fast Fourier Transform (FFT). It only deals with a finite number of samples, which is very computer friendly.
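If you’re curious what the FFT actually computes, here’s a tiny R check: output bin k is just the sum of x[n]*exp(-2i*pi*k*n/N) over all N samples. (The samples and the bin index here are arbitrary.)

x = rnorm(8) # any 8 samples
N = length(x)
k = 2 # pick one frequency bin
manual = sum(x*exp(-2i*pi*k*(0:(N-1))/N))
manual - fft(x)[k+1] # ~0: fft() computes exactly this sum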

There’s a technique built on the FFT called the STFT (short-time Fourier transform) that maintains phase information in a way that’s easy to play with. One of the simplest tricks is to set all of the phases to 0. With a few parameters changed, this produces a monotone, robotic voice. Hm! That’s fun, but not very musical.
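Here’s a minimal sketch of that zero-phase trick on a single chunk, using random numbers as a stand-in for real audio:

N = 1024
chunk = rnorm(N) # stand-in for N audio samples
Y = fft(chunk) # forward FFT
Yr = complex(modulus = Mod(Y), argument = 0) # keep the magnitudes, zero every phase
robot = Re(fft(Yr, inverse = TRUE))/N # back to the time domain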

STFTs, as the name implies, analyze very short segments of audio, then jump forward and analyze another short segment. Short, in this case, means something like 0.023 seconds of audio at a time (1024 samples at 44.1k). Here’s where the robot voice gets its pitch: instead of jumping ahead to the next unread segment, I’ll tell it to jump ahead, say, a quarter of the way and grab 0.023 seconds, then jump another quarter, and so on. This imposes a sort of periodicity on the sound, and periodicity is pitch!
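Concretely, the imposed pitch is the sample rate divided by the hop size. A back-of-the-envelope sketch, using the first pitch ratio from the score vectors that appear in the code below:

sr = 44100 # sample rate
N = 1024 # window length in samples
ratio = 4.18128465 # first ratio from the score vectors below
H = N/ratio # hop size in samples
sr/H # imposed fundamental: about 180 Hz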

By manipulating the distance I am jumping ahead, I can impose different pitches on the audio. This is essentially what I did in my project. More specifically, I:

  1. Made a sample-accurate score of the desired pitches
  2. Made a bunch of vectors for start time, end time, and desired pitches (expressed as a ratio)
  3. Made a loop to step through these vectors
  4. Grabbed a chunk of sound from a WAV file
  5. Performed an STFT using the pitches I plugged in
  6. Did an inverse STFT to turn it back into a vector of amplitude values for samples
  7. Turned that back into a WAV file

(See the end of the post for a copy of my code.)

Here’s what I ended up with!

And here’s what it started as:

Please be forgiving of the original version. It’s not great… I was trying to perform in a way that would make this process easier. It did, but the trade-off was a particularly weak vocal performance. Yeesh. My pitch, vowels, and timbre were all over the place!

Anyway, here’s the code. You’ll need R (or RStudio!) and the tuneR package. Oh, and the solo vocal track.

library(tuneR) # tuneR must be loaded before calling setWavPlayer()
setWavPlayer("/Library/Audio/playRWave")

stft = function(y,H,N) {
  v = seq(from=0,by=2*pi/N,length=N)
  win = (1 + cos(v-pi))/2               # Hann window
  cols = floor((length(y)-N)/H) + 1     # number of analysis frames
  stft = matrix(0,N,cols)
  for (t in 1:cols) {
    range = (1+(t-1)*H):((t-1)*H + N)   # indices of frame t
    chunk = y[range]
    stft[,t] = fft(chunk*win)
  }
  stft
}

istft = function(Y,H,N) {
  v = seq(from=0,by=2*pi/N,length=N)
  win = (1 + cos(v-pi))/2               # same Hann window as the analysis
  y = rep(0,N + H*ncol(Y))
  for (t in 1:ncol(Y)) {
    chunk = fft(Y[,t],inverse=T)/N      # inverse FFT of frame t
    range = (1+(t-1)*H):((t-1)*H + N)
    y[range] = y[range] + win*Re(chunk) # overlap-add
  }
  y
}

spectrogram = function(y,N) {
  power = .2
  bright = seq(0,1,by=.01)^power
  grey = rgb(bright,bright,bright)   # this will be our color palette --- all grey
  frames = floor(length(y)/N)        # number of "frames" (like in a movie)
  spect = matrix(0,frames,N/2)       # initialize frames x N/2 spectrogram matrix to 0
                                     # N/2 is # of freqs we compute in fft (as usual)
  v = seq(from=0,by=2*pi/N,length=N) # N evenly spaced pts 0 -- 2*pi
  win = (1 + cos(v-pi))/2            # our Hann window --- could use something else (or nothing)
  for (t in 1:frames) {
    chunk = y[(1+(t-1)*N):(t*N)]     # frame t of the audio data
    Y = fft(chunk*win)
    spect[t,] = Mod(Y[1:(N/2)])
    # spect[t,] = log(1+Mod(Y[1:(N/2)])/1000) # log(1 + x/1000) just changes contrast
  }
  image(spect,col=grey)              # show the image using the color map given by "grey"
}


N = 1024                   # STFT window length: 1024 samples is ~0.023 s at 44.1k
w = readWave("VoxRAW.wav") # the solo vocal track
sr = w@samp.rate
y = w@left
full_length = length(y)


bits = 16
# this is a vector containing all of the pitch change onsets, in samples
start = c(0,131076,141117,152552,241186,272557,292584,329239,402666,
 459154,474012,491649,697317,786623,804970,824932,900086,924171,
 944914,968743,984086,1082743,1088571,1120457,1132371,1151571,
 1335171,1476343,1614943,1643400,1666886,1995600,2133514,2274429,
 2300571,2325686,3332571,3412114,3437400,3451800,3526457,3540343,
 3569314,3581657,3600943,3610371,3681086,3694800,3745200,3763371,
 3990000,4072371,4091143,4113000,4195286,4216200,4233429,4254000,
 4286743,4380771,4407701,4422086,4443686,4630114,4750886,4768029,
 4906371,4934829,4958914,5286171,5409686,5428714,5565943,5595086,
 5618829,5944543,6068829,6086057,6223714,6250543,6275057)

# this is a vector containing the last sample of each pitch segment, in samples
end = c(131075,141116,152551,241185,272556,292583,329238,402665,459153,
 474011,491648,697316,786622,804969,824931,900085,924170,944913,
 968742,984085,1082742,1088570,1120456,1132370,1151570,1335170,
 1476342,1614942,1643399,1666885,1995599,2133513,2274428,2300570,
 2325685,3332570,3412113,3437399,3451799,3526456,3540342,3569313,
 3581656,3600942,3610370,3681085,3694799,3745199,3763370,3989999,
 4072370,4091142,4112999,4195285,4216199,4233428,4253999,4286742,
 4380770,4407700,4422085,4443685,4630113,4750885,4768028,4906370,
 4934828,4958913,5286170,5409685,5428713,5565942,5595085,5618828,
 5944542,6068828,6086056,6223713,6250542,6275056, full_length)

# these ratios determine the pitch we hear by setting the analysis hop size
ratio = c(4.18128465,3.725101135,3.318687826,3.725101135,4.693333333,
 4.972413456,4.693333333,3.132424191,4.18128465,3.725101135,
 3.318687826,3.132424191,4.18128465,3.725101135,3.318687826,
 3.725101135,4.693333333,4.972413456,4.693333333,3.725101135,
 3.318687826,4.18128465,4.18128465,3.725101135,3.318687826,
 3.132424191,4.972413456,5.581345393,3.725101135,4.18128465,
 4.972413456,4.972413456,5.581345393,3.725101135,4.18128465,
 4.972413456,4.18128465,3.725101135,3.318687826,3.725101135,
 4.18128465,4.693333333,4.972413456,4.693333333,3.725101135,
 3.318687826,2.486206728,4.18128465,3.725101135,3.132424191,
 4.18128465,3.725101135,3.318687826,3.725101135,4.693333333,
 4.972413456,4.693333333,3.725101135,3.318687826,4.18128465,
 3.725101135,3.318687826,3.132424191,4.972413456,3.725101135,
 5.581345393,3.725101135,4.18128465,4.972413456,4.972413456,
 3.725101135,5.581345393,3.725101135,4.18128465,4.972413456,
 4.972413456,3.725101135,5.581345393,3.725101135,4.18128465,
 4.972413456)

ans = 0 # accumulator for the processed audio

for (i in 1:81) {
  # the loop steps through each of the 3 vectors above
  frame = y[start[i]:end[i]] # take a bit of the wave from start to end

  H = N/ratio[i] # hop size: smaller hops impose higher perceived pitches
  Y = stft(frame,H,N)

  # robotization: keep the magnitudes, zero out every phase
  Y = matrix(complex(modulus = Mod(Y), argument = rep(0,length(Y))),nrow(Y),ncol(Y))
  ybar = istft(Y,H,N)
  ans = c(ans,ybar) # concatenate the processed chunks along the way
}

ans = (2^14)*ans/max(ans) # scale so the peak fits comfortably in 16 bits
u = Wave(round(ans), samp.rate = sr, bit=bits) # make wave struct
writeWave(u, "robotvox.wav") # save the robot version (it is read back below)
o = readWave("VoxRAW.wav")
o = o@left
spectrogram(o, 1024) #what does the original recording look like?
r = readWave("robotvox.wav")
r = r@left
spectrogram(r, 1024) #what does the robot version look like?
#play(u) #listen to the robot version

MP3s don’t matter (until they do)


I’ve written before on some of the differences between MP3s and WAVs, specifically how MP3s seem to invoke more negativity than WAVs in a blind test. I don’t know about you, but I thought those results were interesting and weird. So I thought it made sense to zoom out and try to get a bigger picture of this phenomenon.

A logical first step was to ask, “Can people even hear the difference between WAVs and MP3s in their day-to-day lives? If so, in what circumstances?” As the title implies, people generally can’t tell in most circumstances, but once the compression gets extreme enough, the shift is very pronounced.

The Experiment

I made an online experiment, asking people to listen to 16 different pairs of song segments and select the one they thought sounded better. There were 4 levels of MP3 compression: 320k, 192k, 128k, and 64k.

‘Why those levels of compression?’ you might be wondering. Amazon and Tidal deliver at 320k, Spotify premium does 192k, YouTube does 128k, and Pandora’s free streaming is 64k.

For each pair, one version of the segment was a WAV and the other was an MP3. (See below for more detail.) I also asked for basic demographic information, how they usually listen to music, and how they were listening during the experiment. For example, a lot of people use Spotify regularly for music listening on their phones, and a lot of people used their phones to do the experiment. Running the experiment online gave up a lot of control over how and where people listened, but the goal was to capture a realistic listening environment.
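For reference, here’s a hypothetical sketch of how stimulus pairs like these can be produced with the LAME command-line encoder, called from R (the file names are made up):

system("lame -b 64 segment.wav segment64.mp3") # encode at 64 kbps
system("lame --decode segment64.mp3 segment64.wav") # decode back to WAV for playback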

The Songs

I selected songs that are generally considered to be good recordings, capable of offering a kind of audiophile experience. I also tried to choose “brighter”-sounding recordings, because they are particularly susceptible to MP3 artifacts. The thought was to maximize the chance of identifying sonic differences, because I was doubtful there would be any difference until a very high level of compression.

I also split the songs into eras: Pre and Post MP3. I thought that music production techniques might have changed to accommodate the MP3 medium, and that MP3s might be easier to detect in recordings that were not conceived for the medium.

The Song List by Era

Pre MP3 (pre 1993):

  1. David Bowie – Golden Years (1999 remaster)
  2. NIN – Terrible Lie
  3. Cowboy Junkies – Sweet Jane
  4. U2 – With Or Without You
  5. Lou Reed – Underneath the Bottle
  6. Lou Reed & John Cale – Style It Takes
  7. Yes – You and I
  8. Pink Floyd – Time

Post MP3:

  1. Buena Vista Social Club – Chan Chan
  2. Lou Reed – Future Farmers of America
  3. Air – Tropical Disease
  4. David Bowie – Battle for Britain
  5. Squarepusher – Ultravisitor
  6. The Flaming Lips – Race for the Prize
  7. Daft Punk – Giving Life Back to Music
  8. Nick Cave & The Bad Seeds – Jesus Alone

The Song List by Compression Level

320k

  1. Cowboy Junkies – Sweet Jane
  2. Lou Reed – Underneath the Bottle
  3. Squarepusher – Ultravisitor
  4. Daft Punk – Giving Life Back to Music

192k

  1. David Bowie – Golden Years (1999 remaster)
  2. NIN – Terrible Lie
  3. The Flaming Lips – Race for the Prize
  4. Air – Tropical Disease

128k

  1. U2 – With Or Without You
  2. Lou Reed & John Cale – Style It Takes
  3. Buena Vista Social Club – Chan Chan
  4. Nick Cave & The Bad Seeds – Jesus Alone

64k

  1. Pink Floyd – Time
  2. Bowie – Battle for Britain
  3. Lou Reed – Future Farmers of America
  4. Yes – You and I

The Participants

I had a total of 17 participants complete the experiment (and 1 more complete part of the listening task), plus a whole lot of bogus entries by bots… sigh. Here’s some info on the real humans who did the experiment:

[Pie chart: Pie Charts2.png]

Note: options with 0 responses are not shown

[Pie chart: Pie Charts3.png]

[Pie chart: Pie Charts4.png]

“Which best describes your favorite way to listen to music that you have regular access to?” was the full question. I didn’t want everyone to think back to that one time they heard a really nice stereo!

[Pie chart: Pie Charts5.png]

[Pie chart: Pie Charts6.png]

[Pie chart: Pie Charts7.png]

“This includes informal or self-taught training. Examples of this include – but are not limited to – musicians, audio engineers, and audiophiles.”


Unfortunately, the sample size wasn’t big enough to do any interesting statistical analyses with this demographic info, but it’s still informative to help understand who created this data set.

The Results

Participants reliably (meaning a statistically significant binomial test) selected WAVs as higher fidelity when the MP3s were 64k. Other than that, there was no statistical difference.

[Figures: OUTPUT.png–OUTPUT3.png, WAV vs. MP3 selections at each compression level]

11 to 57 in favor of WAV, p < 0.001
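For the stats-minded, here’s a minimal sketch of that binomial test in R, using the counts from the caption above:

binom.test(x = 57, n = 68, p = 0.5) # 57 of 68 choices favored the WAV; p is well below 0.001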

When I first looked at the Pre/Post MP3 comparison, I was flummoxed. There is a statistical difference in the Post MP3 category… favoring WAVs.


That’s pretty counter-intuitive. That would be like finding that people preferred listening to the Beatles on CD instead of vinyl. It just doesn’t make sense. Why would recordings sound worse in the new hip medium that everyone’s using?

They don’t. My categorization was clumsy. Yes, I selected 8 songs that were recorded after MP3s were invented, but what I didn’t consider is that the MP3 was not a cultural force until about a decade later, and not a force in the music industry until even later than that. So I went back and looked at just the Post MP3 category and split it again. Figuring out when the MP3 became a major force in the recording industry was a rabbit hole I didn’t want to go down, so I used a proxy: Jonathan Sterne, a scholar who studies recording technology, published an article in 2006 discussing the MP3 as a cultural artifact. Luckily, using 2006 ended up being fruitful, because of my 8 songs in the Post MP3 category, none were released in or even near 2006. I had 5 released before and 3 released after, and when I analyzed those groups, there was a strong preference for WAV in the older recordings but not in the newest recordings. This suggests that recordings made after a certain date are generally produced so that an MP3 of reasonable quality sounds just as good as the WAV. Here’s the analysis:

[Figure: better-mp31]

25 to 60 in favor of WAV, p < 0.001


[Figure: better-mp3]

So, to sum up: for these participants, the debate between WAV and MP3 doesn’t matter in terms of identifying fidelity differences in real-world situations UNTIL the compression levels are extreme. And recordings designed for CDs rather than MP3s sound better as CDs than as MP3s, but this doesn’t matter for older recordings. If I had to guess, it could be because some of the limitations of the vinyl medium are similar to MP3’s (gasp! Heresy!), and so recordings designed for vinyl work kinda well as MP3s, too.

Let’s define music!


Goodness, I have written lots of words about music, but I’m not sure I have ever thoroughly defined what I mean by “music.” In this post you’ll find my definition, of course, but I want to clarify right up front that this may read as slightly antagonistic. In a sense it is meant to be, but ultimately it is about how to define music in the context of communication. I’m trying to push boundaries, not hurt feelings.

I don’t claim all of these thoughts as my own, but this may be a unique synthesis of standing ideas. I’ve also touched on some of these ideas in previous posts, but I wanted to put them all together.

Music describes a way of thinking about sound.

Music is a bit like the infamous Supreme Court ruling on pornography: it’s hard to define but when you’re presented with an example, you recognize it immediately. Once you start leaving the very obvious examples, it gets kind of hard to find the boundary between music and regular sound. That’s because music describes a way of thinking about sound, not a specific kind of sound.

I think the most famous example of pushing the boundaries of music in the western world might be John Cage’s 4’33”. A pianist sits down, prepares to play, then does nothing for 4 minutes and 33 seconds. Is that music? Well, Cage would certainly say so, but the audience in the music hall is split. Some say yes, some say no. Who is right?

I would argue that 4’33” in that example is definitively music, and here is why: the context. In his autobiography, Frank Zappa argued that context is key. He called it “putting a frame around it.” Let’s explore this a bit. The audience in my example above is at a music hall to hear music. A performer sits at an instrument, prepares to play, then plays silence for 4’33”. While it is certainly up to audience members to decide how much they enjoy the performance, they can’t really argue about whether or not music happened because the context clearly articulated that music happened.

Here’s another example: you’re walking in the woods alone, and you come to a clearing to find a pianist sitting at a piano. As you approach, she hops up and says “Ah! I just finished my performance of 4’33”! What did you think?” Did you hear music for the last 4 minutes and 33 seconds? I don’t think so. There was no contextual clue to encourage you to think about sounds as music for the previous four and a half minutes. (Unless, of course, you just so happened to be doing so of your own free will, but the odds of that are remote.)

Another way to think about it is the old paradox: don’t think about an elephant. It’s impossible to not think about an elephant when you are given this prompt. Similarly, the people in the music hall are thinking about music and thinking about sound as music. Even if they’re thinking “ugh, this is stupid, this isn’t music,” they are still thinking about sound as music.

Music is communication.

When we hear sound as music, we are interpreting and processing it. Music is inherently more vague in its meaning than language, but there is still meaning. Music has emotional impacts, triggers memories, and causes physiological responses. Language does all of these things, too.

I think a lot of people get hung up on the idea of “music is communication” because music isn’t specific or declarative. I agree wholly that music is non-specific and non-declarative. I can’t play you a tune on a recorder to ask you to get me a beer (I would if I could, though!). And if you ask 10 people to listen to the same song, they’ll each tell you something different when asked what it means.

However, language suffers some of the same faults. Has anyone ever misunderstood you? Or have you ever said something that came out wrong? Of course you have. Language is specific, but the interpretation is difficult. I think music suffers a somewhat similar fate: a composer can intend to convey a scene or a feeling, but different audience members will have different responses.

Also, I’m blogging right now. (Duh.) But why? Well, blogging has a certain set of affordances that other kinds of communication lack. I could say this out loud, but only the other people near my desk would hear me. And once I’ve said it, it’s gone forever. I could write a book, but that means people need to buy it to read my thoughts. I could write a poem, but my poetry is terrible. The point is that I’m writing this in blog form because it seems to be the best way for me to share these specific ideas in the way that I want to share them. Music is no different. I can express things that are difficult or impossible to express outside of music.

I think a more complete analysis of the affordances of music would be a swell thing to do, but here’s a short sketch: musical expression has no substitute mode of expression. I can’t accurately tell you about a piece of music, I can only approximate it in words. Information is lost when I talk about it compared to you experiencing it first hand. I think what is lost is the thrill and the emotion. Not only am I sharing words, but I’m sharing my interpretation of it. I’ve taken the experience out of it. It’s like baby food: the nutrition is there, but the experience of texture is lost in the processing.

Music is interesting.

Unlike language, music is inherently interesting. Language is designed to convey specific ideas. The goal is clarity and meeting expectations of normal patterns of communication. Sentences have at least a noun and a verb. Normal communication is utilitarian and functional. Musical communication is impressionistic and fanciful.

Part of the joy of listening to music is the blend of having your expectations met and defied in unexpected but carefully constructed ways. A piece of music establishes or implies a set of rules, but then defies those rules for your enjoyment. For example, a common thing to do in a pop song is to modulate up part of the way through the song. This defies expectations because the song has clearly established itself to exist in a given key, but then everything suddenly shifts upwards. The foundation the song was built on just got pushed upward a little bit. It’s startling, but it can be pleasant when done artfully. Another example is establishing a phrase (a pattern) by repeating the structure, but then unexpectedly stopping the pattern short. Again, this can be quite exhilarating and pleasant when done carefully. Imagine that happening in a conversation, though. Someone is talking to you and they just stop right in the

… Language doesn’t work that way, does it? Language is meant to inform and music is meant to challenge and entertain you, in a broad sense. Attempts to describe music in terms of musical forces (like physical forces) sometimes stumble because music does unexpected things. A thrown ball will always obey physical forces. In that sense, it is uninteresting. Music, however, will only sometimes obey musical forces and that’s part of the point.

Music is important.

Music is a means of expression for both performers and listeners. It is therapeutic. Music helps build identity both for individuals and groups. These are concrete, real psychological benefits. Music helps us survive, and it helps shape societies.

And now, I think a brief explanation of what music is not would be useful.

Sheet music is a lie.

Sheet music is not music, nor is it an accurate representation of music. It is a shorthand expression and was a necessary means to preserve musical ideas in the era before recording audio was possible. It is a useful guide for memorization and performance. Systems that explicitly or implicitly treat sheet music as if it were real music are faulty. Sheet music captures onsets and durations in an abstract and imperfect way, and makes little to no attempt to capture feeling.

Schenkerian analysis is a way to analyze music, but it is not the way.

Schenkerian analysis is a useful tool for analyzing music of a certain type when asking certain questions. However, since it is by far the dominant (heh) method of musical analysis, it is often applied to situations where it is not relevant or meaningful. Schenkerian analysis also presumes that sheet music is an accurate representation of music. It is performed on sheet music, not actual music. It also produces a tautological result: each piece of music can be reduced to simpler and simpler versions, eventually ending in a descending pattern of notes. On the surface, this is a stunning revelation about how music works; the problem is that Schenkerian analysis demands this outcome.

When studying the psychological implications of music, it is important to ask questions about the music that most people actually experience.

Remember, music is a phenomenon that exists in the mind. It then follows that it is important to study the kinds of music found in most minds. And I think it’s safe to say that Schubert isn’t it. It’s time to roll up our sleeves and dig into the music of the now.

Music perception and cognition research largely limits itself to SERIOUS CLASSICAL MUSIC, and maybe jazz when feeling cheeky. This is a problem! And please don’t think I’m knocking serious classical music or jazz, or the study of this music. It’s very important and relevant, and I am grateful that people do it, because both of these forms of music profoundly influence our current popular music.

What I am advocating is that music be studied in a way that better reflects how most people experience it. Artificiality is a challenge in any line of research, but this stumbling block seems easy enough to avoid. The barrier to studying popular music is institutional elitism, not practical issues.

Anyway, I hope you enjoyed this or at the very least found it provocative. I know it helped me a lot to codify all of these thoughts in one place, so I thank you for the indulgence.

Beauty and the Beast: is there any difference between listening to MP3 vs CD quality?


TL;DR: yes. But come on! There’s a bunch of graphs and some lame jokes if you actually read the post.

Preface

As I sit here at my desk, I am surrounded by audio equipment and CDs. Spotify is open right now (streaming quality set to “Extreme,” thank you very much). My favorite pair of headphones is within arm’s reach. My studio monitors are effortlessly reproducing a lovely Terry Riley piece. Clearly, I am spoiled. But wait, let’s rewind a moment: I’ve got a stack of CDs next to me, but I’m streaming compressed audio when I could be enjoying clean, uncompressed audio from my CDs? Why would I do that? (I also have a record player and a few choice vinyls, but vinyl is an obviously inferior format to CD, so it’s not part of the comparison.)

I do it because it’s convenient. And there’s a massive amount of diversity on Spotify that simply isn’t legally accessible to me on my grad student budget. And I’m not alone: a whole heck of a lot of people in the US use streaming services. But all of them, save one, stream in what are called lossy formats. In fact, other than listening to a CD or vinyl, the music you listen to is probably in a lossy format. That means the previously uncompressed and pristine digital audio of a CD is reduced not just in file size, but in the information it contains. WAVs, by comparison, are lossless. It’s kind of bonkers to think, but MP3s and other lossy formats throw away a LOT of sound. That’s partially why they’re so small. The goal, of course, is to only throw away things you can’t hear.

It might sound kind of like science fiction (or the fantasy of scared parents of metal fans): unheard sounds in recordings? It’s true, though. In fact, our cognitive systems are really excellent at filtering out unwanted noise. It’s called the cocktail party effect. So why not automate the process and only save the parts that we hear anyway? It might not be that simple. I, along with a classmate and our advisor, decided to test if there was a difference in the subjective enjoyment of music listening between WAVs and MP3s.

The Experiment

We selected eight songs: four recorded before MP3s were even a glimmer in the Fraunhofer Institute’s eye, and four very recent songs. We did this because there’s an idea floating around audio engineering and audiophile circles that, for example, the Beatles sound better on vinyl than on CD because the albums were recorded with the idiosyncrasies of vinyl in mind. The easiest way to control for this was to have two “early” songs and two “recent” songs as MP3s, and another set of two and two as WAVs.

The Song List

  • Aretha Franklin – RESPECT
  • Michael Jackson – Thriller *
  • The Eagles – Hotel California
  • The Beatles – Help! *
  • Carly Rae Jepsen – Call Me Maybe
  • Sia – Chandelier *
  • Rihanna – We Found Love
  • Daft Punk – Get Lucky *

* = MP3, 128k, LAME encoder

Note: the oldest available CD mastering was used for the pre-MP3 songs to eliminate / reduce the chance that some modern mastering techniques would be used to make it more MP3 friendly. For example, “Hotel California” was sourced from the original CD release in 1989.

We had people come in, put on headphones we provided, and listen to all 8 songs, presented to each person in a random order. After each song, they rated how positive it made them feel, how negative it made them feel, and how much they enjoyed it. The reason we asked about positive and negative separately is that we conceptualize those feelings as representing activations of the appetitive and aversive systems, respectively. They can activate separately, or they can activate together.

Keep in mind, we told the participants nothing about the sound quality, MP3s or WAVs. As far as they knew, they just had to listen to 8 songs and respond to those 3 questions for each.
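For the curious, the results below come from repeated-measures ANOVAs. Here’s a minimal, hypothetical sketch of one in base R with simulated stand-in data — the column names and values are assumptions, not our actual data set:

set.seed(1)
# one rating per subject per format-by-era cell (the real design had two songs per cell)
ratings = expand.grid(subject = factor(1:18),
                      format = factor(c("WAV", "MP3")),
                      era = factor(c("early", "recent")))
ratings$negativity = rnorm(nrow(ratings), mean = 3)
fit = aov(negativity ~ format*era + Error(subject/(format*era)), data = ratings)
summary(fit) # F tests with (1, 17) degrees of freedom, as in the figures below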

Results

I instigated this experiment because I didn’t think there would be a difference. We ended up hypothesizing that there would be a difference between the formats, such that people would like WAVs more. But to be honest I was skeptical, even if I had a theory-driven rationalization as to why I thought it would come out this way. (More on that later.) I thought people might even prefer MP3s since our participants are young and have probably been listening to MP3s their whole lives, give or take.

[Figure: H1 figure.png]

F(1, 17) = 2.162, p = 0.16

The graph above shows the mean positivity results by Format. It’s not statistically significant, but it is in the direction we predicted. Admittedly, this one result alone isn’t convincing. But wait — there’s more!

[Figure: H2 figure.png]

F(1, 17) = 5.224, p < 0.05

And this is a prime example of why we split out positivity and negativity into two measurements: the negative scores are significant, and support our hypothesis that people would like MP3s less.

[Figure: H3 figure.png]

F(1, 17) = 1.7, p = 0.21

Again, not statistically significant findings here but the data are trending in the direction we predicted.

[Figure: RQ1 figure.png]

F(1,17) = 5.285, p < 0.05

And here’s the kicker: people rated early era songs as MP3s more negatively than anything else. And this finding is statistically significant.

Discussion

So what gives? Well, it could be as simple as our participants just hating “Thriller” and “Help!” as songs. But more than they hated the Eagles’ “Hotel California”? I sincerely doubt it. But it is possible, I’ll admit that openly.

Here’s what I think went on, though: remember how I said that MP3s strip out a lot of information, most of which you can’t hear anyway? I bet that process is flawed. It clearly works very well, but I bet that it is imperfect and listening to MP3s is actually MORE work for your brain than uncompressed audio (like WAVs). Our minds are very lazy and, under most circumstances, seek the path of least resistance when hit with a task. If MP3s tax the cognitive systems more than WAVs because we need to actively fill in some of the missing gaps or work harder to do our usual filtering, then it seems logical that we would rate the experience more negatively.

Moving Forward

This study isn’t perfect. I would prefer to have run it with a counterbalanced design where some participants heard Song A as MP3 and others heard Song A as a WAV. That would help remove unwanted effects of the song itself. That, and while I have some ideas as to why these results came about, this experiment doesn’t prove or even directly support my ideas. I need more information before I can put that claim forward more strongly.

The good news is that we have a lot more research in the pipeline regarding audio compression and how it impacts the listening experience.

What does it mean to be interdisciplinary?


This semester, I took a class where we spent a lot of time talking about what it means to be an interdisciplinary scholar. The class was kind of a mess. We talked about being interdisciplinary all semester, but never got anywhere. In fact, on the last day of class we collectively realized that we still did not have a working definition of interdisciplinary research, let alone a definition of science or the humanities.

So, in an effort to construct some value from this class, I’m going to write my thoughts down – with definitions – and go from there.

As I’ve made clear in the past, I am a scientist. Depending on the day or my mood, if pressed harder I’d tell you I’m either a psychologist or a cognitive scientist. Frankly the distinction is pretty vague from my vantage point. But what does it mean to do science or to be a scientist? What are the underlying assumptions? What do I even do?

One thing I was surprised to learn early in grad school is that there is no clear definition of science. As an outsider, I was confounded. All I really knew about science up to that point was The Scientific Method, skepticism, and an attempt to explain reality through observation. When you don’t think about it too hard, that seems to paint a pretty clear picture of science. Closer inspection, however, shows that this definition is incomplete. More than incomplete: there is disagreement about what science is! So, I’ll share with you where I’m coming from.

There are three primary texts from which I draw my current understanding of science: Susan Haack’s Defending Science – within reason, Thomas Kuhn’s The Structure of Scientific Revolutions, and Alfred Crosby’s The Measure of Reality. (I have Annie Lang to thank for introducing me to all three.)

My main take-away from Crosby is that societal constructs working under the surface can greatly impact our understanding of the world. Crosby calls this our mentalité. Mentalité is so insidiously subtle that attempting to define your own mentalités is exceedingly difficult. So, I’ll use one of Crosby’s examples to illustrate something seemingly bizarre by our standards: time.

In Ye Olde Days, prior to the prevalence of clocks, people viewed time as fluid and vague. Think about it: information could only travel as fast as the fastest horse. For people living now, news from far away would be difficult to collect and place in a specific order. But for people living then, it wasn’t really a concern, because that’s just how things worked. Time was inconsistent and relative.

I can’t wrap my brain around that. I can read about it, I can think about it, but I can’t empathize with it. I can’t try it on and think like that. Crosby argues that I can’t do it because the people in this example have a different mentalité about time. Mentalités are not hard wired from birth, but they’re learned implicitly and are constantly reinforced. Undoing that kind of statistical learning would take immense work.

What all of this says to me personally is that humans are incapable of being objective, and also incapable of being aware of their blind spots caused by mentalités. Mentalités even help dictate what kinds of questions are asked and how they are answered. Powerful stuff.

Kuhn’s work on the history of science is profound and highly contentious, even some 55 years after it was first published. Kuhn says, in short, that everything you learned about the history of science is wrong. In grade school, I was taught the history of atomic theory. The story starts in Greece (and India!) in philosophy. Then comes Dalton with the first measures of atomic weights in the early 19th century. Later there were Avogadro and Bohr and so forth.

Kuhn says that’s all crap. Bohr has nothing to do with Ancient Greece. Kuhn argues that scientific knowledge isn’t cumulative. Instead, it is destructive. Kuhn talks about paradigms as a way to describe eras of scientific knowledge. Ancient Greece and India may have been talking about similar things to Bohr, but Bohr’s knowledge does not come from them. Let me use another example: Newtonian physics.

Physics is a staple of the high school curriculum. I remember my Physics teacher, Mr. Pettit, explaining early in the class that we would be learning Newtonian physics but we should keep in mind that Newtonian physics is wrong. (Of course, it’s still useful in day-to-day life and that’s why it’s taught!) Einstein didn’t build on Newton: he blew it up. Einsteinian physics is fundamentally incompatible with Newtonian physics. Countless minds kept poking at Newtonian physics and finding flaws. Efforts were made to patch these flaws, but it was clear this was a sinking ship long before Einstein. The problem was, no one had a better answer until then.

This old notion of cumulative knowledge of science commonly uses a metaphor of building a house, but Kuhn says this is flawed. If Newton built the foundation with round pegs and round holes, Einstein’s first floor using square pegs and square holes won’t work. Also, this metaphor of house building suggests that there is a clear goal. But is that reasonable? If so, what is that goal? Is the goal to know everything? I doubt even the most hard-nosed objectivist would ever assert that humans as a species could ever actually know everything. If, instead, the goal is to better explain reality than previously possible, there is a direction, but there isn’t a goal.

And remember Crosby? As mentalités shift, so do the kinds of questions we ask. Mentalités provide implicit structure to science, and paradigms are the next layer. They’re much more explicit, but they too provide structure. Anyone could go out and study alchemy, but the field isn’t exactly thriving anymore. The prevailing paradigm does not include alchemy.

Finally, my favorite: Haack. I want to someday give Haack the full attention she deserves, so I’ll be brief here. Haack attempts to contextualize science in society. She synthesizes Crosby and Kuhn, and puts them into the current social and political landscape. In short, Haack says that science can do a lot of things, but because we are imperfect human beings, we need to stay skeptical. Science is not fact. Science does not “prove” anything. But science is the best we’ve got and it’s incredibly powerful. Good science acknowledges and limits subjectivity, but it never claims to eliminate it.

Whew! So there you have it: a primer on my epistemology and ontology. And I basically co-opt Haack’s definition of science wholly.

Don’t worry. My definition of the humanities will be much more brief (and probably cringe-inducing to those who know better than I do). All I know about the humanities is conjecture. I spent some time as a member of a digital humanities scholar group – how I got in is unclear to me – and I am learning from my humanities classmates in the new Media School at IU.

Subjectivity and interpretation seem to be at the core of the humanities scholarship I’ve been exposed to. There is no attempt to answer questions directly but instead craft compelling arguments. I have to assume this means the humanities assume there is no objective reality and everything is relative. The story or argument is the answer itself.

And so now the question: what does it mean to be interdisciplinary?

My view of science argues that there is an objective reality and we can interact with and measure it. We probably get it wrong some/most/all of the time, but through empirical evidence and rigor, we can at least demonstrate our findings and produce things that reliably work. I am typing this on a computer, after all. Clearly, scientific knowledge can at least be functional.

My understanding of humanities says that there is no objective reality and everything is relative. Answers are not interesting, but the arguments that precede them are. I think humanities scholars may disagree with this, but if everything is relative then I don’t see how answers actually matter.

Before I can attempt to reconcile these two fields, I need to first define what it means to be interdisciplinary. In the class I was talking about at the beginning of this post, we read a profoundly awful book about interdisciplinary research. It was Really Egregiously and Profoundly Knavish and Oafish. I won’t name it, but maybe you can figure out what I’m talking about. The book didn’t really put in the effort to define interdisciplinary research. So I’ll try.

“Inter” is a prefix that means between. So interdisciplinary research can first and foremost be thought of “between disciplines.” So what is a discipline? I’ll use some shorthand and say that the classic institutional departments more or less align with disciplines: English, Physics, Psychology, etc.

Interdisciplinary research is when a scholar or scholars produce a work that synthesizes two or more disciplines into one example of research in such a way that elements from the two disciplines are intertwined and impossible to separate without rendering the research fundamentally broken.

In other words, a large report on the impacts of rainfall shortages that has sections dedicated to ecology, sociology, and chemistry is not interdisciplinary. It is multidisciplinary. The sociology section could be dropped without fundamentally breaking the other sections. Of course, the scope of the research would be narrower, but it would still “work.”

To me, interdisciplinary research is rare and often game-changing. Hofstadter’s Gödel, Escher, Bach: an Eternal Golden Braid comes to mind. Hofstadter brought together comparative literature, cognitive psychology, philosophy, and mathematics in such a way that they could not be disentangled. I don’t think it is incidental that this book is also one of the founding texts of Cognitive Science. Interdisciplinary research makes something fundamentally new from old parts.

I do think there are varying levels of analysis, though. For example, the Media School at IU is interdisciplinary. It combines vastly different scholars that all study media in radically different ways. The Media School wouldn’t be the Media School if one kind of scholar was removed from the mix. It would be fundamentally changed.

However, as a scholar in the Media School, I do not see myself as interdisciplinary. Sure, I seek sources of outside inspiration like any other inquisitive mind but at the end of the day, I’m doing Psychology. I might borrow from Music, but I twist it around and mold it to fit into Psychology.

It would also be possible for me to contribute psychology research to a larger work that another scholar synthesizes with another discipline in such a way that it becomes an interdisciplinary work. But I still didn’t do the heavy lifting.

To purposefully understate it: interdisciplinary work is very hard and very rare. Let it come to you. Don’t force it.

In the preface to GEB, Hofstadter talks about feeling like he had a question that couldn’t be answered within any one discipline. So he set aside all of those boundaries and struck out on his own. I can’t fathom the risk he took. I would be willing to bet that for every GEB, there’s thousands of malformed monstrosities and failed experiments out there.

There is a lot of pressure to be interdisciplinary (and, in turn, to be evaluated as such), as if it’s a desirable goal in itself. And sure! I’d love to achieve a truly interdisciplinary work. But I don’t feel an itching urge like Hofstadter’s. At least not yet. I also don’t feel like I know enough about my own discipline to feel constrained. This is probably because I exist in two interdisciplines: Media and Cognitive Science. These disciplines are defined by their object of study and not much else. There is little, if anything, that is out of bounds. To be interdisciplinary, I think you need to fight your way out of your box. I haven’t found the walls of my boxes yet, and I’m not sure I ever will.

So what does this all mean? Attempting to be interdisciplinary is a waste of time and effort. It’s also disingenuous scholarship. Instead, seek to answer questions you have in whatever way you think best serves the question.

Thanks for reading.

Training Yourself to Forget


David Byrne in True Stories

“I really enjoy forgetting. When I first come to a place, I notice all the little details. I notice the way the sky looks. The color of white paper. The way people walk. Doorknobs. Everything. Then I get used to the place and I don’t notice those things anymore. So only by forgetting can I see the place again as it really is.” – David Byrne as Narrator in True Stories

True Stories, the movie quoted above, was released in 1986, during the peak of Talking Heads’ popularity, when Warner Brothers decided to let Byrne make a movie. It ended up as a critically acclaimed flop, making less than $2.5 million at the box office. As a Talking Heads fan, I had to see it. The movie is as bizarre and disjointed as you might expect, but it’s thoughtful and sweet, too. Also, the endlessly likable John Goodman is in it. So yeah, give it a watch.

Anyway, the quote I opened with about forgetting has, ironically, been burned into my memory ever since I saw the movie. Initially, I applied it to my creative endeavors: a large portion of being an audio engineer is being able to listen attentively but dispassionately. The engineer needs to forget that this is a song, or a beloved musician, or an instrument they dislike, and just hear it for what it is. At times, an audio engineer needs to listen to a mix and hear not a song but a series of puzzle pieces. Engineers who allow themselves to get swept up in the emotions of the music aren’t able to objectively identify problems in the mix. Of course, an engineer who only listens attentively and dispassionately may make a technically superb mix, but it could be completely devoid of emotion. Whoops. The reason I advocate focusing on forgetting is not that I prefer sterile mixes; it’s that if I don’t focus on forgetting, I never get past the emotional components of the music, and I can’t mix well.

But now I see relevance in this quote beyond my creative endeavors. There’s a benefit to forgetting when doing research, too. I don’t mean “forget to email your IRB paperwork to your adviser” kind of forgetting, but instead allowing yourself to reanalyze a situation/idea/problem for the first time again. By setting aside everything you think you know, or everything you thought you saw/heard/read, you just might find a crucial detail that was previously glossed over.

Funnily enough, some of the same techniques I use to forget what I (think I) know as an audio engineer and musician can also help me as a researcher. They just might work for you, too.

  • Always listen in different environments. Listen in the studio. Listen in the car. Listen on headphones. These environmental changes will shine a light into different corners of the mix. Sometimes a high hat will sound great in the studio, but will be The High Hat That Ate New York! in headphones.
  • Don’t get caught up in your perceived value of the playback. My studio monitors are subjectively and objectively a superior playback system to my car, especially in terms of stereo separation (c’mon, Kia!) so it’s easy to assume that there’s no reason to try out other playback systems. But that’s not right. Those imperfections in the car stereo can turn my argument for how the mix should sound on its side and force me to reevaluate my choices. Good advice can come from anywhere.
  • Love what you do, but sometimes put yourself in taxi-cab mode. Taxi-cab mode is a concept I learned from one of my mentors while pursuing my undergrad. His angle was this: Some days you get your “A” rate, other days you earn your “C” rate. In other words, you don’t always get to do what you love. So just imagine yourself as a cab driver. Are you having fun? No, but the meter is running and you’re getting paid. We’ve all been there, but I’ve also found value in putting myself in taxi-cab mode. When I set aside my enjoyment of the work, even for a few moments, I have clearer vision.
  • Put it on the shelf for a week. I’ve definitely gotten stuck in a mix before where I know where I want to go, but I can’t figure out how to get there. Nothing is working. I’m learning to stop myself sooner, before I get frustrated, and walking away for a week or so. Then, problems that seemed insurmountable are suddenly clearly solved.
  • Fear ear fatigue. Don’t work too many hours at a time – your ears get tired and you make bad choices.
  • Use lateral thinking. Sometimes when I feel like I’m in a rut and need a fresh perspective, I turn to some lateral thinking exercises. Lateral thinking embraces the idea that while the shortest distance between two points may be a straight line (let’s keep it Euclidean, folks), it isn’t necessarily the best path. By purposefully going off course, you might be able to circumvent the problem entirely. From Wikipedia:

    Critical thinking is primarily concerned with judging the true value of statements and seeking errors. Lateral thinking is more concerned with the “movement value” of statements and ideas. A person uses lateral thinking to move from one known idea to creating new ideas. Edward de Bono defines four types of thinking tools:

    1. idea-generating tools intended to break current thinking patterns—routine patterns, the status quo
    2. focus tools intended to broaden where to search for new ideas
    3. harvest tools intended to ensure more value is received from idea generating output
    4. treatment tools that promote consideration of real-world constraints, resources, and support

So there you have it. Those are some of my tools that I use to forget myself and what I think I know so that I can approach a problem/situation/challenge with a fresh perspective. Hopefully they can be helpful to you too.

As a parting thought, here’s a song about remembering too much. The protagonist is haunted by what he’s seen – or are they hallucinations? This is the 2008 ‘remix’ of Bowie’s 1987 single “Time Will Crawl.” Large portions of the instruments were rerecorded for this version to showcase the song. The original arrangement was… uninspired, but a strong song still lived on underneath it. Maybe, in time, people will forget the 1987 version altogether.

Year-End Wrapup


Much like my social life, this blog had a bombastic start that petered out. But hey, it’s the summertime now and I’m learning to come back out of my shell a bit.

This past April, I was able to attend the Broadcast Education Association (BEA) conference. As it was my first academic conference, I’m not sure how much valuable insight I can offer in that regard. But I can offer a little advice:

  • Be prepared to eat over-priced, terrible food. Seriously, like $8 for a soggy sandwich made with only the cheapest ingredients.
  • You won’t be able to get to everything you want to.

I felt that from a scheduling standpoint, BEA was very well organized. There was minimal – if any – overlap within divisions. Most overlaps occurred when my interests were represented in different divisions and those were scheduled simultaneously. Obviously, no one could predict that.

One experience that I found to be surprising and strange was in a business meeting for the audio division. Each year, the audio division holds a contest for students and faculty to show off their projects. The focus is on news reporting, story telling, and advertisements. It is my understanding that other types of content are not explicitly excluded. That being said, a few music submissions are made each year and each year they are rejected.

This year, a music recording submission was not immediately rejected and went through several stages of the contest before being removed. Of course, the person whose entry was removed was not gracious and complained to the division and to HQ. So it boiled down to this: the audio division was debating in this meeting whether or not to include music submissions in the contest. I couldn’t fathom this. Music radio programming dwarfs news programming – Country music stations alone beat out all of news/talk listening.

The choice wasn’t necessarily ideological. It stemmed from an inability to judge the material and an unfamiliarity with the core concepts of music production. Let me be clear: I am grateful that the group recognized these issues instead of plowing forward with hubris and ego. However, it still struck me as odd. It would be like a television division having no idea what a TV show even is, really. People don’t tune into radio for advertisements, or even necessarily for the news. The product, to the customer, is the content. It baffled me to sit in with a group of radio experts who had no idea how to judge the content of a non-news/talk radio station.

The good news is that the group is working on solving this problem. And again, I am grateful that they didn’t think they could just wing it.

I hope the take away from this is not that I dislike BEA. Quite the opposite. It was an opportunity for me to engage with a lot of different educators on a lot of different levels. BEA has research, production, and pedagogy. What more could you want? Oh, you want free passes to NAB? Well sure, you can have those too.


Anyway, aside from reflections on the conference, here’s an update on some of the research I’ve been working on:

  • Evolved vs Symbolic Persuasion: working within the Dynamic Human-Centered Communication Systems Theory framework to see if mediated messages that don’t rely on symbolic communication (written or spoken language) and only use real/natural sounds and visuals are more persuasive (in the form of an implicit attitude change) than those that do use symbolic communication. Results: Not much. I still feel like we are onto something, but I feel like we need different stimuli.
  • My Missing Bridge: working within the Limited Capacity Model of Motivated Mediated Message Processing framework to see if familiarity with a song impacts the orienting response to edited out sections. Meaning, all of the songs were edited to remove every and any section of a song that was not a chorus after the first chorus. Most of our edited songs had this structure: Intro – Verse 1 – Chorus 1 – Chorus 2 – Chorus 3 – Outro. Typically a pop song would have a structure like this: Intro – Verse 1 – Chorus 1 – Verse 2 – Chorus 2 – Bridge – Chorus 3 – Outro. We thought that people that were more familiar with the song would respond more strongly to the edited version because it defied their expectations. Results: Not much. I feel like we struggled against the language of music in this one. Regardless of familiarity, people have implicit attitudes about western pop music and its song structure. Even if a person is not familiar with the song itself, they’re familiar with how similar songs should be shaped. When this song defies that expectation, people orient to it. That’s my thought anyway.
  • Fletcher-Munson Revisited: put on hold for the time being. Hopefully it will be resurrected this coming semester.
  • Up The Hill Backwards: to be confused with the track off of Scary Monsters. This is the working name for the generative music systems in video games study previously mentioned. This will very likely be my thesis. The literature review is done and the study is designed. Now I just need to polish up the proposal paper. I’m excited about this one because if there are significant findings, it should demonstrate the value of generative music systems in interactive media. Could you imagine playing a video game where each moment feels like the soundtrack to a movie? Perfectly timed with every movement and action on screen? That’s where I’d like to go, personally.
  • Spotify: in limbo. My fault, really. Going to be looking for evidence of inverted payola in Spotify. As my committee chair put it: “Well, of course there IS payola, but the question is where?” Hopefully it’ll be the first place I look. That’d just be convenient.
  • Party Music: a bit hush-hush for now. It’s a collaboration between Cognitive Science, Psychology, and Telecommunication people. What I can say is this: it’s partially about party music.
  • Change Deafness: Ever see one of those comedy sketches where an actor approaches a person to ask for help with a map, then the actor is briefly blocked from the person’s vision only to be replaced by an obviously different (to the audience) actor but the person doesn’t notice? That’s called change blindness. So logically, if change blindness is a thing, then is change deafness? Participants listened to a series of messages where the voice actors change during the message. How big of a change needs to occur before it’s reliably noticeable to the participant? Results: Participants notice some changes but not others. We’re working on quantifying the differences in the voices to better understand why some were noticed and others weren’t.

Other than all that, I’m teaching a course of my own design: Location Audio. It’s all about capturing good audio in the field and how to fix it when it goes wrong. Oh, and my fiancée and I are tying the knot in two and a half weeks.

I’ll sleep when I’m dead.