Real-time online music performance – fact or fiction?

Down to the River by The Virtual Choir 2017 

So far my articles have focused on the impact of using online video tools on teaching interaction. This time I am going to talk about something more technical – why there is a communication delay online, and what this means for musicians who want to perform in real time online. This is a slightly longer read than my previous pieces, but as we approach 3 months of lockdown in the UK, hopefully some of you will be bored enough by now to persevere to the end! 

Why is this important right now?

Social media is currently awash with virtual choirs and orchestras, with the now familiar Zoom style composite of all the members, seemingly performing together. This is understandable; musicians need a creative outlet. Professional performers have had their future income stream wiped out overnight (and for the foreseeable future). Music students in lockdown cannot finish their graded year-end performances. Amateur and community choirs and orchestras can no longer get together at a time when the distraction and social aspect would be a valuable support. These online projects are a way for amateur ensembles to retain their connection with each other, and for professional musicians to promote themselves, showcase material when they can’t perform live and provide content that, in part, might fulfil a booking that would otherwise have been cancelled. Many of these videos support appropriate and deserving causes such as the NHS (for example, The Choir Project MCR Fix You) and other Covid-19 impacted professions, including musicians and music venues.

Is this a response to Covid-19?

This isn’t a new concept. An early example is Eric Whitacre’s Virtual Choir, which started in 2009 as a simple experiment in social media, when a fan of Eric’s music recorded a video of themselves singing his piece “Sleep” and shared it on YouTube. Eric responded by sending a call out to his online fans to record themselves singing along to the recording, and upload the result. He was so impressed by the result that he formed the project The Virtual Choir (VC). He recorded himself conducting ‘Lux Aurumque’, asking the members to sing along. VC has grown to more than 8,000 singers, aged 4-87, from 120 countries in VC5, and you can currently take part in Virtual Choir 6: Sing Gently, described as a response to ‘these challenging times’.

Is this live performance?

Whilst the performers appear to be performing together, especially in some of the slicker examples, each of the separate parts has been recorded by the individual at home, usually to a click or backing track, and then edited together afterwards (which is no mean feat). Whilst the backing track helps to ensure each part will be in sync and in tune when it is stitched together, performers will not have been able to hear or respond to each other’s contributions.

For example, in Down to the River by The Virtual Choir the director writes “The 36 voices you hear were all recorded separately at different times and in different locations. Each woman had only the melody playing in her ear and then sang the part that she wanted, no music was written out.” Nine self-isolating musicians from the Netherlands’ Rotterdam Philharmonic Orchestra came together to perform a virtual rendition of Beethoven’s ‘Ode to Joy’. Each soundtrack was recorded in time to a click track in the musicians’ bedrooms. Richard Morrison reviews some recent examples in conversation with Jess Gillam about her virtual scratch orchestra. The results are a bit hit and miss, “some demonstrating only too clearly how hard it is, when nobody is in the same room as anyone else, to get not only the metrical precision right but also the tuning and the blend.”

I particularly enjoy the honest videos showing the humour, community and the process behind making them – including false starts (Where Have All The Flowers Gone for Help Musicians), pet and small child interruptions and epic fails. Creative editing exploits the format with separated members performing around the house with bewildered partners trying to work around them (Down for the Count Let’s Face the Music and Dance) or seemingly passing each other drinks through the frames to celebrate VE Day (The Haywood Sisters VE Day 75th Anniversary Celebration Video).

The Haywood Sisters VE Day 75th Anniversary Celebration Video

In the end, the quality of the perceived ensemble performance is often down to the skill of the audio and video editor pulling together the individual contributions. This isn’t in itself a bad thing, but there is a risk that the slicker examples propagate the myth that it is possible for musicians to perform together in real time over the internet. It also masks the creative input of an often anonymous contributor – the compiling editors are making decisions, for example about balance and synchronisation, which the performers would usually manage themselves in a live performance. In this way, the outcome is closer to a mixed and mastered recording than a live performance.  

So why can’t we perform together online?

Jess Gillam talks openly about the challenges in relation to her virtual scratch orchestra.

“It was my original dream to have hundreds of people playing simultaneously, connected by the internet, but the technology just can’t cope. The lag is too great, even for a small group to play together.”

There are many technologies offering to enable “musicians to play music together in real time” or “virtually zero latency”. We are going to talk about what that means, and why it is a promise they can’t really deliver. First, let’s look at what generally happens when an ensemble comes together to rehearse or perform in the same room.

Playing together is about more than synchronous sound

Simply Saxes performing on Brighton Seafront Bandstand

Consider a saxophone quartet (an example I know well as a member of Simply Saxes). Each of the four saxophones varies in pitch and timbre – a soprano, alto, tenor and baritone saxophone. Whilst many players have more than one saxophone, generally they will rehearse and perform on a specific saxophone in a given ensemble. This is so that they (and the other members) can get used to their own parts, and how they fit with the others. When the quartet play together, they will often have a preferred seating/standing formation. They will be used to the balance of sound that they perceive from their position in that formation. 

Self-directed ensembles (i.e. ones with no conductor), such as quartets, often rehearse in a configuration that enables them to have maximum visibility of each other. They can see each other’s non-verbal cues, as well as directional cues from the nominated leader. In a saxophone quartet the soprano saxophone generally leads. They coordinate pauses, entrances and endings through playing techniques, and gestures with their body and instrument. There is already a body of work on how players lead an ensemble (for example, King, 2006; Seddon & Biasutti, 2009) so we will not go into depth on this here. What is important here is that this non-verbal interaction is significantly disrupted by video-mediated communication (Duffy, 2015, Chapter 14; Duffy & Healey, 2012; Duffy et al., 2012). 

For example, how does the soprano gesture to one player specifically when everyone else is arranged on one flat screen? From their perspective, how does everyone else know which player the soprano was gesturing at, when each has the same face-to-face view of each other? The composite image of the four players on screen is small. The screen size reduces the image of each participant to significantly less than life-size, which reduces the efficiency of non-verbal communication such as gesture and gaze (Cooperstock, 2005). It is also far more difficult to peripherally monitor each other for non-verbal cues on a small, flat screen.

The group balances their combined ensemble volume through listening to each other in their environment. If they usually rehearse in the same place they will be used to the ‘live’ aspect of the space – they don’t just hear the direct sound from other players, they may also hear sound reflected from walls and ceiling (reverberation). This is affected by the furniture in the room, and floor coverings such as carpet which absorb rather than reflect sound. Playing outside is especially challenging as there is no reverb: the sound dissipates quickly and it is much harder to hear each other if there is a breeze. Bandstands (as seen in the photo above) are specifically designed so that the roof reflects sound back to the performers. Similarly, ensemble balance is disrupted when an ensemble is geographically distributed, coming together through online technology. Volume is no longer controlled collectively; each player’s volume is affected by their individual set-up and equipment. They cannot easily perceive how loud they are in the other three locations. They no longer share the same acoustic space.

Balancing sound between members of the group becomes very difficult. The volume of one player relative to another no longer depends just on relative distance from each other in a room, or how loudly they are playing. The volume now depends on how close each player is to the microphone in their room, the quality of their microphone, the volume of the speaker in the room of the other players, and the quality of their kit. Given volume is an aspect of performance that quartet players constantly adjust for expressive effect, this is almost impossible to control consistently. The overall quality of the session will be determined by the weakest point in the chain (in terms of broadband speed and audio quality).

So what causes the online delay, and why does it matter?

I promised at the start that I would talk about the delay experienced when communicating online, and how this affects real-time performance. Whilst all of the factors I have just talked about are important, and often overlooked, it is the delay that is the real showstopper.

The delay that occurs when sound is transmitted is also known as latency. This is made up of the time that it takes sound to travel the distance required, and the processing time required by the technology that carries the sound. The time taken for a signal to travel to a destination and back is commonly known as the round-trip time (RTT) or ‘ping time’. Sound travels in the air at 343 m/s, but when it is transmitted over the internet it also travels through media such as fibre optic or copper cable (which is much slower). To give you an idea of the duration of delays involved, the time it takes sound to travel from London to Manchester (261 km) using a fibre optic cable is estimated at around 7 ms. London to New York (5,548 km) is estimated at 75 ms (source). This sounds pretty fast, right? But in reality you will probably experience something slower. Unless you have a direct bespoke link between locations, the sound signal has to travel through various switches, servers and exchanges. If you are based in a large city, chances are you can access a fast fibre network directly. If you are based in a rural area you might not be so lucky, and may have a slower link to navigate before the signal reaches a fast network. This adds time to all signals transmitted and received – increasing the ping time. The speed of signal transmission is only as good as the weakest point. Latency also fluctuates during the day between peak and off-peak times, and by internet provider.
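As a rough check on the scale of these delays, the travel time alone can be estimated from distance and signal speed. This is only a minimal sketch: it assumes a signal moves through optical fibre at about two thirds of the speed of light in a vacuum, and that the cable follows the distances quoted above in a straight line. The function and constant names are made up for illustration.

```python
# Back-of-envelope estimate of propagation delay over optical fibre.
# Assumption: signals in fibre travel at roughly 2/3 of the speed of light.
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBRE_VELOCITY_FACTOR = 2 / 3  # assumed, varies with the cable


def one_way_delay_ms(distance_km: float) -> float:
    """Time for a signal to cover distance_km of fibre, in milliseconds."""
    speed_km_s = SPEED_OF_LIGHT_KM_S * FIBRE_VELOCITY_FACTOR
    return distance_km / speed_km_s * 1000


def round_trip_ms(distance_km: float) -> float:
    """There-and-back travel time (the floor for a 'ping'), in milliseconds."""
    return 2 * one_way_delay_ms(distance_km)


# Distances as quoted in the article.
ROUTES_KM = {"London to Manchester": 261, "London to New York": 5548}

for route, km in ROUTES_KM.items():
    print(
        f"{route}: one-way {one_way_delay_ms(km):.1f} ms, "
        f"round trip {round_trip_ms(km):.1f} ms"
    )
```

Under these assumptions the physical minimum comes out below the estimates quoted above, which fits the point that real routes add time for switches, servers, exchanges and cables that do not run in a straight line.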

The effect of delay on music and speech

The effect of the delay becomes more complex when you consider more than two people connecting, each with different fluctuating ping times, from different technology, locations and internet providers. The delay may not always be perceptible, for example to two people having a conversation. When it is, the disruption is something that can be managed so that conversation can continue – albeit with increased interruptions and overlap (Duffy, 2015, Chapter 4.4 pp. 50-55). Even when delays in speech are noticeable, they can be accommodated as people are taking turns at talk, not trying to talk together at the same time in a coordinated way. Musicians playing together have a much lower tolerance for latency, and perceive it more readily. One study found that drummers could not synchronise if the latency was above 100 ms, while other players found latency disruptive with a delay of just 20-40 ms. Duets were difficult to perform with musicality with latency at or above 100 ms (Bartlette, Headlam, Bocko, & Velikic, 2006). Another study describes how remote singers coped with the delay by ignoring the other singer altogether and relying on their own inner beat. They described needing to focus at all times and could not “miss a single millisecond of attention”, which distracted them from the perspective of presence and interaction (Olmos et al., 2009). Whilst experienced musicians can find strategies to cope, this doesn’t necessarily make for a productive rehearsal or learning experience.
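To put these thresholds in perspective, a delay can be translated into two things musicians already have a feel for: how far apart you would have to stand in a room for sound to take that long to reach you, and how much of a short note the delay eats up. The sketch below uses the 343 m/s speed of sound mentioned earlier. The tempo of 120 beats per minute and the choice of a semiquaver are assumptions picked purely for illustration, as are the function names.

```python
# Translate network delay into more familiar musical terms.
SPEED_OF_SOUND_M_S = 343.0  # speed of sound in air, as quoted in the article
TEMPO_BPM = 120  # hypothetical tempo, chosen only for illustration


def equivalent_distance_m(latency_ms: float) -> float:
    """Distance sound covers in air during latency_ms."""
    return SPEED_OF_SOUND_M_S * latency_ms / 1000


def semiquaver_ms(tempo_bpm: float) -> float:
    """Length of a semiquaver (a quarter of a crotchet beat) in milliseconds."""
    return 60_000 / tempo_bpm / 4


# Delays taken from the studies cited above.
for latency_ms in (20, 40, 100):
    distance = equivalent_distance_m(latency_ms)
    share = latency_ms / semiquaver_ms(TEMPO_BPM)
    print(
        f"{latency_ms} ms delay: like standing {distance:.1f} m apart, "
        f"{share:.0%} of a semiquaver at {TEMPO_BPM} bpm"
    )
```

On these assumptions a 100 ms delay is comparable to playing with someone more than 30 metres away, and takes up most of a semiquaver at a moderate tempo.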

Delay isn’t the only technical problem

There are other technical problems associated with the use of technology designed to support online interaction. Acoustic Echo Cancellation (AEC) mechanisms, which are optimised for speech and conversation, selectively suppress one or more of the participants if more than one person is talking at the same time. Now consider how that might work when more than one instrument is playing at the same time.

Software often screens out some audible frequencies to focus on the frequencies most important for speech intelligibility, but this removes data that is a fundamental part of the characteristics of sound production by musical instruments (Drioli, Allocchio, & Buso, 2013). Transient detection and suppression, designed to reduce the sound of meeting participants typing on a keyboard, blurs the attack of a note played on a piano keyboard. 
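To get a feel for what a speech-focused filter could discard, here is a purely illustrative sketch. It assumes a hypothetical pass band of 300 to 3,400 Hz (the traditional telephone voice band), which is not taken from any particular conferencing tool, and it looks only at the fundamental frequency of each piano key in equal temperament with A4 tuned to 440 Hz. Real software behaves differently, and the harmonics of a note matter as well as its fundamental.

```python
# Which piano fundamentals would survive a narrow, speech-oriented pass band?
# Assumption: a hypothetical 300-3400 Hz band, not any specific product.
BAND_LOW_HZ = 300.0
BAND_HIGH_HZ = 3400.0

# MIDI note numbers for the 88 keys of a standard piano (A0 to C8).
PIANO_KEYS = range(21, 109)


def midi_to_hz(midi_note: int) -> float:
    """Equal-temperament frequency, with A4 (MIDI note 69) at 440 Hz."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


inside = [
    note for note in PIANO_KEYS
    if BAND_LOW_HZ <= midi_to_hz(note) <= BAND_HIGH_HZ
]

print(f"Lowest key: {midi_to_hz(PIANO_KEYS[0]):.1f} Hz")
print(f"Highest key: {midi_to_hz(PIANO_KEYS[-1]):.1f} Hz")
print(f"{len(inside)} of {len(PIANO_KEYS)} fundamentals fall inside the band")
```

With this assumed band, 42 of the 88 fundamentals fall inside it, so fewer than half of the piano's notes would keep their fundamental.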

Some online conferencing solutions allow users to suppress the speech support that causes so many problems for musical performance. For example, you can access Zoom Music Mode in the advanced settings, and after a recent update, users can now “Enable Original Sound” on smartphones and tablets. This will take off the speech codec. See a helpful overview from Professor Jim Daus Hjernøe at RAMA Vocal Centre, Denmark.

Whilst there are low latency, high fidelity solutions available, they are generally expensive, require considerable space and technical support, and are often only accessible via the large organisations or educational institutions that have invested in them. During the coronavirus lockdown, when students need them more than ever, their institutions are closed and they are not accessible.

The impact depends on the context and purpose of the musical interaction

One of the things I document in detail in my PhD (Duffy, 2015), and in my previous posts, is that the structure of student-tutor interaction in a music lesson is closer to that of a conversation than a musical performance. Unless the student is specifically practising an accompanied piece or a duet with the tutor, there is much less synchronous playing than you might think. Instead the student and tutor exchange turns at talk, or short musical fragments, for the majority of the lesson. Guidance and support is needed to help first time users, thrown unexpectedly into online teaching, but there is still a lot that can be achieved online; it is simply a matter of understanding the limitations, planning lessons differently and finding creative workarounds. However, the challenges for remote, real-time rehearsal or performance, which are rooted in the technical issues discussed above, are not easy to solve.

Bibliography

Bartlette, C., Headlam, D., Bocko, M., & Velikic, G. (2006). Effect of network latency on interactive musical performance. Music Perception, 24(1), 49–62.

Cooperstock, J. (2005). Interacting in shared reality. HCI International 2005, 1–7.

Drioli, C., Allocchio, C., & Buso, N. (2013). Networked performances and natural interaction via LOLA: Low latency high quality A/V streaming system. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7990 LNCS, 240–250.

Duffy, S. (2015). Shaping Musical Performance Through Conversation (Queen Mary University of London).

Duffy, S., & Healey, P. G. T. (2012). Spatial Co-ordination in Music Tuition. In N. Miyake, D. Peebles, & R. P. Cooper (Eds.), Proceedings of the 34th Annual Conference of the Cognitive Science Society (pp. 1512–1517).

Duffy, S., Williams, D., Stevens, T., Kegel, I., Jansen, J., Cesar, P., & Healey, P. G. T. (2012). Remote Music Tuition. Proceedings of 9th Sound and Music Computing Conference (SMC ’12), 333–338.

King, E. C. (2006). The roles of student musicians in quartet rehearsals. Psychology of Music, 34(2), 262–282.

Olmos, A., Brulé, M., Bouillot, N., Benovoy, M., Blum, J., Sun, H., … Cooperstock, J. R. (2009). Exploring the role of latency and orchestra placement on the networked performance of a distributed opera. 12th Annual International Workshop on Presence, 1–9.

Seddon, F., & Biasutti, M. (2009). A comparison of modes of communication between members of a string quartet and a jazz sextet. Psychology of Music, 37(4), 395–415.

Author: Sam Duffy

Dr Sam Duffy works as an academic, with interests in music education, social music making and how technology can be used to transform these interactions. Currently Centre Manager for PRiSM, the Centre for Practice & Research in Science & Music, at RNCM. An alum of the Media and Arts Technology doctoral training programme, the Cognitive Science Research Group and the Music Cognition Lab at Queen Mary University of London.
