Meet the Packets: How audio travels into your
browser Sara Fecadu
KATIE: Hello. Welcome back. So, I keep forgetting to do this and I apologize. But the big announcement
right now is that the swag is ready. But do not go get swag now because we’re about to
have a really awesome talk by Sara Fecadu. I asked Sara for a fun fact and her fun fact
was that she bakes a mean cookie, which unfortunately we can’t all indulge in. So,
as a follow up question, I said, what prompted you to write this talk about an audio API? And
she said, well, I had spent a year building a checkout form and I just couldn’t stand
to look at it or think about it anymore and I had to do something different. Which I think
is something that literally all of us can probably identify really strongly with.
So, anyways, Sara is gonna come up and talk to us about the audio API. So, give it up
for Sara. [ Applause ]
SARA: Hello. See if I can get my computer started here. Okay. Welcome to my talk. Meet
the Packets. If not everyone has realized, it’s a play on Meet the Parents. I spent
a lot of time working on that. [ Laughter ]
Let’s see here. One second. Gonna progress? No. Okay. We’re gonna do it without the clicker.
So, this will be interesting. As Katie said, my name oh. My whole slide deck isn’t progressing.
Okay. One second. There we go. Okay. Thank you for coming to my talk. As Katie said, my
name is Sara Fecadu. I am from Seattle, Washington. And I don’t have a ton of hobbies besides
making cookies and listening to a lot of podcasts. And by day I’m a software developer at Nordstrom.
And Nordstrom is a clothing retailer founded in 1901. While people don’t usually associate
100 year old companies with tech, we have a thriving tech org working on innovative
ways to get you what you need and feel your best. And a year ago I was hired on to do
a rewrite of Nordstrom.com’s checkout. And as of last May, we have been taking 100% of customer
orders. Now, why am I talking about audio streaming? Katie may have taken my joke here,
but the answer is: Form fields. Our checkout UI has 22 form fields. And they come in different
groupings for different reasons. But many of my waking moments over the past year have
been spent thinking about these form fields. And I just wanted to do anything else. So,
I was sitting on my couch one night reading a book on packet analysis, like one does,
and watching a YouTube video. And I thought to myself, how does that work? Like, on the
packet level, how does audio video streaming work? So, to answer the larger question, I
started small with: What is audio streaming? And audio streaming is the act of sending
audio files over the network. And this talk will be about on demand audio streaming. Now,
the major difference between on demand streaming and live streaming, is with on demand streaming
we need all of the packets to get across the wire. Whereas with live streaming, you may
be more interested in keeping them up with the event and a certain amount of packet loss
is acceptable. Over the past few months, I learned that audio streaming, even when limited
to on demand, is as wide a subject as it is deep. I have picked three topics that exemplify
what audio streaming is. Why it’s hard and how to get started yourself. And we will talk
about audio streaming protocols, TCP congestion control and client players. Audio streaming
protocols give us a standard for how to encode, segment and ship your audio to the client. TCP congestion
control handles congestion on the TCP layer of the stack. And it is relevant with on demand
audio streaming because we’re shipping larger audio files and we need every single packet
to make its way to the client to play audio. A client player is any network connected device
with a play and pause button. So, this could be your phone, your TV, your laptop, et cetera.
And client players not only allow us to play our audio, but when paired with modern audio
streaming protocols, they hold a lot of decision making power. Well, audio streaming protocols
are the heart of audio streaming. And today we’ll talk about adaptive bitrate streaming
and its benefits, and how to convert your own audio files to work with two popular audio
streaming protocols. Before we get started, I wanted to go over some terms that will come
up. A codec encodes data and uses compression techniques to get the highest quality for
the smallest footprint. Encoding and transcoding both convert audio from one format to another.
Transcoding converts from one digital format to another, while encoding moves from analog
to digital. Bitrate is how many bits it takes to encode a second of audio. And this number
usually refers to the quality of the audio file. When I think of playing music on the
Internet, I think of an HTML5 audio tag with a source attribute set to the path of my audio
file. And this is a perfectly reasonable way to do it. You can request and receive a single
file containing an entire song. And it would be referred to as progressive streaming and
the major benefit here is you only have one file to deal with. But let’s say, for instance,
you have a user and they have a slow network connection and they can’t download your one
file. They’re stuck. So, adaptive bitrate streaming aims to solve this problem by encoding
your audio in multiple bitrates and allowing the client player to decide which quality
is best for the user to listen to your audio uninterrupted. This allows more users to access
your audio. But it does add a layer of operational complexity because now you’ve got a lot more
moving parts. The audio streaming protocols we’ll talk about not only leverage
adaptive bitrate streaming, but also use HTTP web servers. They do this by encoding the
file, segmenting it, placing the segments on a web server, and then, once requested, the partial
audio files are sent to the client one at a time. Here’s the secret to our modern audio
streaming protocols: it’s more a series of downloads than it really is a stream. But
we’ll refer to it as streaming anyway. The two most popular audio streaming protocols
today are HTTP Live Streaming, or HLS, and Dynamic Adaptive Streaming over HTTP, or MPEG
DASH. HLS was created by Apple to support streaming to mobile devices and it is the default on all
Mac OS and Apple devices. And MPEG DASH was a direct alternative to HLS. It was created
by a forum that wants to make MPEG DASH the international streaming standard. Let’s look at them
side by side. HLS takes MP3, AAC, AC-3, or EC-3 files and encodes them into segmented transport stream
files. Those segmented files are listed in a playlist. If you have multiple bitrate streams,
each stream will be in a media playlist and all of your media playlists will be in a
master playlist. With MPEG DASH, the format is agnostic; in theory, you can convert any file type into MPEG DASH.
The audio will be segmented into fragmented MP4 files. Those will be described in an XML manifest
file called a media presentation description. Okay. We’ve talked about what files will be
used and what they’ll be segmented into, but how do you get it there? You’ve got this audio
file. What tools allow you to convert the audio file? Well, you’ve got options. But
most of these options are paid options. Except for FFmpeg, which is an open source command
line tool that, among other things, allows you to convert audio files to be HLS or MPEG DASH.
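As a rough illustration, a conversion like the one she describes might look like this with FFmpeg's HLS muxer (a sketch only; the file names and bitrate are made up, and a real multi-bitrate stream would repeat this per rendition):

```shell
# Encode an MP3 as 64 kbps AAC and segment it for HLS.
# Input/output names here are hypothetical.
ffmpeg -i input.mp3 -c:a aac -b:a 64k \
  -hls_time 10 \
  -hls_list_size 0 \
  -hls_segment_filename 'seg_%05d.ts' \
  playlist.m3u8
```

FFmpeg picks the HLS muxer from the `.m3u8` extension; `-hls_time` sets the target segment duration and `-hls_list_size 0` keeps every segment in the playlist, which is what you want for on demand streaming.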
However, I found the learning curve for FFmpeg to be pretty steep. And a lot of the documentation
for HLS and MPEG DASH was for video streams. Instead I used Amazon Elastic Transcoder.
It’s an AWS offering that converts files of one type to another. In our case, we’re taking
an audio file and converting it to be used with HLS and MPEG DASH. It’s pretty much plug
and play. You tell Amazon Elastic Transcoder what type of files you have and what type
of files you want and it outputs the stream for you. And even though it’s easy to use,
it’s not a free service. So, if you were going to be converting a lot of files, it may be
worth your time to learn more about an open source alternative like FFmpeg. My workflow
when working with Amazon Elastic Transcoder was to upload my audio file to an S3 bucket, AWS’s object store. I told
Amazon Elastic Transcoder where my audio file was and what settings I needed it to convert
my audio files to. And Amazon Elastic Transcoder output my streams into that same S3 bucket.
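That workflow can be sketched with the AWS CLI (a sketch only, assuming an Elastic Transcoder pipeline already exists; the bucket name, pipeline ID, and preset ID below are placeholders, not real values):

```shell
# Upload the source audio to S3 (bucket and key names are hypothetical).
aws s3 cp song.m4a s3://my-audio-bucket/input/song.m4a

# Ask Elastic Transcoder for an HLS rendition.
# <PIPELINE_ID> and <HLS_AUDIO_PRESET_ID> are placeholders.
aws elastictranscoder create-job \
  --pipeline-id <PIPELINE_ID> \
  --input '{"Key": "input/song.m4a"}' \
  --outputs '[{"Key": "hls/song-64k", "PresetId": "<HLS_AUDIO_PRESET_ID>", "SegmentDuration": "10"}]'

# Download the generated playlists and segments to inspect them.
aws s3 cp s3://my-audio-bucket/hls/ ./hls --recursive
```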
And I downloaded them for us to explore. This is the basic set of files you would get with
an HLS stream. And it kind of looks like a lot. But we’re going to break it down into
four groups. In the top left, the master playlist. In our case, we have two bitrate streams
represented and they will be linked out from the master playlist. And then in the top
right you’ll see those media playlists, which have each bitrate stream. And those will contain
all of our links to our transport stream files which are the fragmented audio files represented
in both the bottom left and the bottom right. On the bottom right we have our 64K bitrate
stream segmented audio files. And in the bottom, oh. Did I get that backwards? I’m not really
good at right and left. But in the bottom section you’ll have your fragmented audio
files. We’ll take a closer look at those so you can see really what’s in it. This is the
entirety of the HLS master playlist. It contains information about the specific bitrate streams
and links out to those media playlists that represent the streams themselves. Let’s look
at the 64K bitrate stream media playlist. It has even more information about the stream
including caching information, the target duration of each segmented audio file, and
most importantly, links out to our transport streams. This is what one of those fragmented
audio files looks like. And there’s something a little interesting going on here. If you’ll
notice, it’s color coded and I kept trying to figure out why. But then I realized a transport
stream has the file extension .ts. And something else has the file extension .ts, TypeScript.
Ignore the colors. It’s just a binary coded file. Now our MPEG DASH audio stream has fewer
files and looks more manageable. But it’s similar. We have our media presentation description,
which is an XML manifest file which contains all of our information about the stream. Then
below we have our two segmented audio files. All of the segments are encapsulated in a
single file, but within that file there are still segments. That’s why there are fewer files in the MPEG
DASH audio stream than in the HLS audio stream. Let’s look at the media presentation description. We see a lot
of stuff here. But there are three important elements. Each bitrate stream is represented
in a representation tag. And then all of the bitrate streams are enclosed in an adaptation set.
Within the representation tag, we do have our URL to our audio files. And taking a look
at one of those audio files, we’ll see it looks fairly similar to the segmented audio file
we saw with HLS. Minus the color coding, because it’s a .mp4 versus a .ts.
Visual Studio is not confused in this case. Earlier we talked about progressive streaming,
which is streaming an entire audio file in one go. We used an audio element and a source
attribute with the path of our audio file. With MPEG DASH and HLS, it’s very similar.
But instead of having the path to our audio file, we have the path to the master play
list for HLS or media presentation description for MPEG DASH. We’re going to take a hard
left here and we’re gonna talk about the second topic in my talk. Which is TCP congestion
control. And TCP is a transport layer protocol and it has mechanisms in both its sender and
receiver which are defined by the operating systems of each to react to and hopefully
avoid congestion when sending packets over the wire. And they are called TCP congestion
control. And today we’ll talk about packet loss based congestion control and why it isn’t so great.
And more specifically, the congestion window and duplicate acknowledgments in packet loss based
congestion control. Before we get started, some more terms: bandwidth is the rate at
which data can be sent. And throughput is the rate at which data can be received. The
congestion window is a TCP variable that defines the amount of data that can be sent before
the acknowledgment is received by the sender. Let’s say you have a user who has requested
your audio file from the server. Your audio packets travel down the network stack, across
the physical layer, up the data link layer and the network layer, and arrive at the transport
layer, and unfortunately there’s congestion right before we reach our destination. Now,
traffic congestion and network congestion have very similar beginnings. Either too many
cars or too many packets have entered the roadway and there’s nowhere for them to go.
With traffic, you have to wait it out. Luckily for us, TCP congestion control allows packets
to flow over the wire, even during congestion. And before we get to the specifics of these
TCP congestion control algorithms, let’s talk about the TCP happy path. We’re going to start
with a single packet sent from the sender to the receiver flowing through the receiver’s
buffer. And being acknowledged by the receiver and having an acknowledgment packet sent back
to the requester. We talked about the congestion window, the amount of data before a sender
receives an acknowledgment. Another way of thinking about the congestion window is as
a sending rate. As the sender receives acknowledgments, the congestion window grows. And as the receiver’s
buffers fill and they drop all excess packets, the sender responds by shrinking the congestion
window. A second way of thinking about the congestion window is as a bucket. And as packet
loss occurs, the bucket shrinks. And as acknowledgments are received by the sender, the bucket grows.
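The bucket metaphor can be sketched as a toy model (just the metaphor, not a real TCP stack; the starting window and the halving rule are assumptions, matching the additive-increase/multiplicative-decrease behavior discussed in this section):

```javascript
// Toy model of the congestion-window "bucket" -- not a real TCP implementation.
// Assumptions: the window is counted in whole segments, grows by one segment
// per acknowledgment, and is halved when packet loss is detected.
function ackReceived(cwnd) {
  return cwnd + 1; // an ACK came back: the bucket grows
}

function lossDetected(cwnd) {
  return Math.max(1, Math.floor(cwnd / 2)); // packet loss: the bucket shrinks
}

// Example: start at 4 segments, receive 6 ACKs, then hit congestion once.
let cwnd = 4;
for (let i = 0; i < 6; i++) cwnd = ackReceived(cwnd); // cwnd is now 10
cwnd = lossDetected(cwnd); // cwnd drops to 5
```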
There’s a slight oversight in the bucket explanation in that the receiver has no way of telling
the sender that it is dropping packets due to congestion. But one option the receiver does
have is to send a duplicate acknowledgment. A duplicate acknowledgment comes into play when packets
arrive out of order. Say the sender sends packets one, two and three. For the purposes of our example, the
receiver’s not going to process them right away. So, when the sender sends packet four, the buffer’s
full and it has nowhere to go. So, packet four is dropped due to congestion. The receiver then moves
on to process packet one and sends an acknowledgment, then one for packet two and one for packet three. However,
when it looks at packet five, it says, I can’t process you, because this would be an out of
order packet. It drops packet five and sends back another acknowledgment asking for packet four. The sender is tipped off that
it needs to send packets four and five again. So, a more truthful version of the bucket
metaphor would be that the congestion window shrinks as duplicate acknowledgments are received
by the sender. And the bucket grows as new acknowledgments are received by the sender.
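The duplicate-acknowledgment exchange above can be sketched as a toy model (a simplified sketch where the lost packet is dropped in the network and the receiver uses cumulative ACKs; not a real TCP implementation):

```javascript
// Toy cumulative-ACK receiver: it can only process the next in-order packet.
function makeReceiver() {
  let next = 1; // the next sequence number the receiver can process
  return {
    // Returns the acknowledgment, which always asks for `next`.
    deliver(seq) {
      if (seq === next) next += 1; // in order: process it
      // out of order: `next` is unchanged, so this ACK is a duplicate
      return next;
    },
  };
}

// Packet four is lost to congestion, so five, six and seven all
// arrive out of order.
const receiver = makeReceiver();
const acks = [1, 2, 3, 5, 6, 7].map((seq) => receiver.deliver(seq));
// acks is [2, 3, 4, 4, 4, 4] -- the repeated fours are duplicate ACKs.

// The sender counts the duplicates; three of them tip it off to
// retransmit packet four without waiting for a timeout.
const last = acks[acks.length - 1];
const duplicates = acks.filter((a) => a === last).length - 1; // 3
const shouldRetransmit = duplicates >= 3; // true
```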
The first TCP congestion control algorithms were written in the 1980s and the most recent
were a couple years ago. We will talk about TCP Reno and BBR. TCP Reno is the classic.
And BBR was created by Google engineers a few years ago to address issues that they
saw when using packet loss based algorithms. TCP Reno starts with a congestion window that’s
set at some value, increasing by some rate. As the sender receives acknowledgments, the
congestion window grows by one. And when packet loss occurs, the congestion window is divided
by some rate; in Reno’s case, it’s divided by two. And the main issue with TCP Reno is that it assumes that small amounts of packet
loss are congestion. And in a world where the sender doesn’t know the state of the receiver’s
buffer and the receiver is unable to tell the sender that it has room left to process
packets, you have an Internet moving at a fraction of the capacity. In 2016, BBR was
created to help you get the most out of your Internet connection. It looks for the place
where sending rate is equal to bandwidth. In theory, you should be able to send to the
receiver and move on to the application without any queuing. Some companies have reported
positive outcomes when using BBR in their production systems. Firstly, it only has to
be implemented on the sender’s side, and it’s in Linux operating systems with kernel 4.9
or higher. And they found BBR increased bandwidth for their low bandwidth users by 10 to 15%, and
the bandwidth for their median group by 5 to 7%. Additionally, users in Latin America and Asia
saw additional increases. But is it a fair algorithm? Fairness, or using your fair share
of bandwidth, is the goal of every TCP congestion control algorithm. And in experiments at Google and
Spotify, they found that BBR was able to coexist with congestion control algorithms like
TCP Reno or CUBIC. However, some researchers found that BBR’s initial start algorithm pushed
CUBIC senders back to where they couldn’t reestablish their fair share of bandwidth.
And this is an issue currently being looked at both inside and outside of Google. We’ve
reached the final section in this talk. And so far we’ve talked about how audio files
are processed to be streamed and issues that may occur as they travel to devices. We’ll
wrap up by talking about the role of the client player and how to create your own audio streams.
Now, I’m a pretty big fan of Spotify and I use it regularly. But have you ever looked
at what’s being sent back from the web server to create those audio streams? This should
look pretty familiar to what we were looking at with our segmented audio files with HLS
and MPEG DASH. But when I first saw these, I did not have this context. And I kept thinking,
is there an NPM package I can use? Or is there something simple and obvious going on here? Thanks to
the web, there is. Because HLS and MPEG DASH handed over a lot of responsibility to the
clients that process their streams. And this not only includes picking the correct quality
of audio to play, but it also includes allowing elements like the audio element to process
segmented audio files without any modification. And most browsers do this by leveraging the
Media Source Extensions API and the Encrypted Media Extensions API. Additionally, libraries
like hls.js and dash.js are available where native cross browser support is low. As a side note,
if you need to support iOS Safari, you need HLS. But with most other browsers, you have
options. So, it would have been really fun to reverse engineer Spotify’s audio player.
But I got tired of reading their minified code. So, I decided to make my own audio player.
And I started with a cassette that I found in a box of cassettes. And I chose it because
it has the words “Map squad” written on it. And I used my iPhone’s voice memo application
to record the audio so the quality is so so at best. But it works. And you can try it
right now. But maybe wait until the end of the talk because I want to show you how it’s
made. The entire application is a single index.html file with an audio element in the
body. When loaded into the browser, the immediately invoked function runs the init function.
And at the top, we define audio, which is equal to our audio element. Next, we see if
the Media Source Extensions API is supported in our browser. If it is, we will assume we
can use dash.js to enable MPEG DASH in most browsers. We pass our audio element to the dash.js media player.
And when the player is initialized, our audio will be loaded with it. If the Media Source
Extensions API is not available, we’re going to assume we’re using iOS Safari and we need
to have an HLS stream. We will do this by setting the source attribute of our audio
element to the path to our master playlist. And this file
is all you need to stream audio to most browsers in 2019. If you want to try it out in the
browser for yourself, or you want to create your own audio streams, please feel free to
fork the repo. Thank you. [ Applause ]
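For reference, the single-file player she describes might look roughly like this (a sketch, assuming dash.js is loaded from its CDN; the stream paths are made up and would point at your own master playlist and media presentation description):

```html
<!DOCTYPE html>
<html>
  <body>
    <audio controls></audio>
    <script src="https://cdn.dashjs.org/latest/dash.all.min.js"></script>
    <script>
      (function init() {
        var audio = document.querySelector("audio");
        if (window.MediaSource) {
          // MSE is available: let dash.js drive an MPEG DASH stream.
          var player = dashjs.MediaPlayer().create();
          player.initialize(audio, "stream/manifest.mpd", /* autoplay */ false);
        } else {
          // No MSE (e.g. iOS Safari): fall back to native HLS support
          // by pointing the source at the master playlist.
          audio.src = "stream/master.m3u8";
        }
      })();
    </script>
  </body>
</html>
```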
KATIE: I’m sorry. I think that scared me more than it scared you. Thank you so much, Sara.
Can you believe that is the first talk she has ever given at a conference? Yes. Amazing.
All right. So, we have about a 15 minute break right now. So, go out and pick up your
swag bags. And we’ll see you back here at 3:00. Patricia Ruiz Realini is talking about
the importance of your local library. Which is pretty cool because I hang out at the library.
We’ll see you back here at 3:00. No, wait. 3:00, yeah.