Let’s Write a Web Browser


Okay, so all that was pretty complex,
learning the HTTP application protocol. It’s going to be simpler when
we actually do it in Python. That’s what’s fun about writing code. It’s like, once you get it working
then you can borrow my code. So if you recall,
we start with this socket, right? Our application is
running on one side, here on our
computer. It creates a socket, and then remember the connect? These first three lines we saw before. The
connect extends the socket to talk to the web server, which has the files on its disk drives,
etc., etc. So, this connect is the thing
that extends the socket. I like to think of it as like,
you have a socket and then you sort of push it across the Internet and
then lock in on the other side. If there was no web server,
this would blow up right here. Creating the socket, on the other hand, works all the time, because it’s
like just open a connection that I’m going to tell you what to connect to later, and
then the connect says make the connection. So when we’re all done at this point in
our code, we have imported the socket library, created the end point, and
pushed the end point through the web, and we now have a socket that sort of starts
in our computer, our application, and ends in the web server application. So that’s the web server, and port 80 just happens to be
the phone number we called them on. And the one thing that’s different about
a socket compared to a file is you can both send and receive to the socket. And so like I said, the first thing that you got to figure
out in a protocol is who starts. We are the browser, we are the client,
we initiated the connection. We’re not going to look at the
code in the server, but it has a similar set of first few lines
to say, I’m ready to get a connection. So that’s different code. But
we’re initiating the connection. So if this is the HTTP protocol, then we have the
responsibility of sending the GET request. But, it looks exactly like
what we sent before. We did this with Telnet before. We send the word GET followed by
a space, followed by the document we’re interested in, followed by a space,
followed by the web protocol we want to use. We’re using the old protocol, HTTP 1.0,
and then we hit Enter twice, \n\n. So it’s exactly what I typed. The difference is Python’s
typing it now, right? I’m not typing it, Python’s typing it. And so we’ve sent this request across,
we send the GET across. And then the server receives it,
parses it. Says, I know what you want. Let me go open a file and
let me send that file back. And it starts sending the file back. Send, send, send, send, send, right? And so then what we got to do is
we got to write a loop to read, read, read, read, read. And that’s what this little
loop right here is doing. While True, we’re going to receive
up to 512 characters at a time. That says give me up to 512. If it’s only sent a little bit, you know
like 100 characters, you’ll get that back. So the length of the data is important. If you get nothing, you’ve got the end
of file. When the server has finally sent all of its data, it closes its end of the connection,
and that says, oh that’s the end of file. And when that end of file reaches you,
then this call to receive will give you back nothing. There is no data, the length is less than one.
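The receive loop described here can be sketched roughly as follows. This is a minimal Python 3 sketch under a few assumptions, not the program from the lecture: the helper name `receive_all` is made up for illustration, and `socket.socketpair()` stands in for a real web server so the example runs without a network. In Python 3, `recv` hands back bytes, and an empty result means the other side has finished sending.

```python
import socket


def receive_all(sock, chunk_size=512):
    """Read from a connected socket until the other side stops sending.

    receive_all is a hypothetical helper name used only in this sketch.
    """
    chunks = []
    while True:
        data = sock.recv(chunk_size)  # gives back at most chunk_size bytes
        if len(data) < 1:             # empty result: end of the stream
            break
        chunks.append(data)
    return b"".join(chunks)


# Stand-in for a web server: two already-connected sockets on this machine.
ours, theirs = socket.socketpair()
theirs.sendall(b"x" * 1200)  # more than two 512-byte chunks
theirs.close()               # closing is what signals "end of file"

received = receive_all(ours)
ours.close()
print(len(received))  # prints 1200
```

Each call to `recv` may return fewer bytes than requested, which is why the loop keeps going until it sees an empty result rather than counting chunks.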
And then we break out of the loop. And then all we do is, we just print that,
so the data comes out on the screen. Print the data, and
then close the socket, okay? So that is a very simple web browser. So let’s run that. So here’s a little trick. Don’t name your programs socket.py, because
then you actually conflict with the socket library, and then
this import will start blowing up, right? So don’t do that. So this is the code I just showed you. Import the socket, create the end point, connect to the end point, send
the application GET request down, and then receive the data and
then close the socket, okay? And the host that we’re connecting to and
the document is all hard coded in there. Just say python socket1.py. This is an application that’s
going to make a network connection. If you’re not connected to the network,
this is not going to work very well. And there we got, it was exactly
the same stuff as we got before, right? We got the headers that told us something
about medadata about the document, and then a blank line. This one’s a little different
our content type on this one. Remember that romeo.txt is just flat
text because it’s a .txt file. So this is plain text. So there’s no
less thans or greater thans or anything. So this is how a text document looks like. So we just wrote a web browser.
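Put together, the program just described might look something like the sketch below, written for Python 3, where the request has to be encoded to bytes before sending. Several things here are assumptions rather than part of the lecture text: the function names `fetch_raw` and `split_response`, the `\r\n\r\n` ending (HTTP's official line ending, standing in for "Enter twice"), and the canned sample response used for the demonstration.

```python
import socket


def fetch_raw(host, url, port=80):
    """Send one HTTP/1.0 GET request and return everything the server sends.

    fetch_raw is a hypothetical name; the steps follow the lecture.
    """
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create the end point
    mysock.connect((host, port))                  # extend it to the server
    request = "GET " + url + " HTTP/1.0\r\n\r\n"  # GET, space, document, space, protocol
    mysock.sendall(request.encode())              # the client speaks first

    response = b""
    while True:
        data = mysock.recv(512)  # up to 512 bytes at a time
        if len(data) < 1:        # nothing left: the server is done
            break
        response += data
    mysock.close()
    return response


def split_response(response):
    """Split raw response bytes into (headers, body) at the first blank line."""
    headers, _, body = response.partition(b"\r\n\r\n")
    return headers.decode(), body.decode()


# A canned, made-up response so the demonstration needs no network.
sample = b"HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nBut soft what light"
headers, body = split_response(sample)
print(headers)
print(body)
```

On a machine with a network connection, a call such as `fetch_raw("data.pr4e.org", "http://data.pr4e.org/romeo.txt")` would be the equivalent of running the program in the lecture; that host name is an assumption, since the transcript only names the document romeo.txt.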
It’s not a pretty web browser, but we made a connection, sent a request
down, and then got the data back and showed it to ourselves on the screen. So, that’s pretty easy. So that’s all it takes. Like 12 lines of code. 11, 12 lines of code. Easy money, right? So there we go. And that’s what you get back. And again, this first part is the header
part, and then there’s a blank line. If you go read the spec, that’s what
the spec says you’re supposed to do. You get a blank line, and then you know it’s the separation
between the header and the data. Okay, so that seams easy, but if we’re
going to do this a whole bunch of times we don’t even want to
write 12 lines of code. So we can make this even easier
with another library called urllib. So socket is this low level,
like make a phone call, and then you choose how to talk. urllib is like an application layer
library that knows about GET and all these other things. And it knows about headers and
it knows about the blank lines. Knows about all the rules. So urllib makes it even easier. So to do the same thing,
urllib makes URLs seem like files, okay? So
when we’re talking socket, we’re talking Transport Layer, and when we’re talking
urllib, we’re talking Application Layer. It should probably be called “HTTP lib”
because that’s sort of what it’s doing, but actually no. It can talk FTP URLs and
other things as well, so I guess it’s okay to call it
urllib. Given that I didn’t name it, and people smarter than me named it,
we’ll keep it that way. Okay, so here is that same code. I mean, we’re solving that same problem
now in four lines of code. And one of them’s the import of urllib. We say urllib.urlopen. This is the method, this is the library,
and we give it one parameter. We don’t have to worry about port 80,
it knows about port 80. We don’t have to worry about GET,
it knows about GET. We don’t have to worry about anything. Okay? We just say give me this URL,
and open it, and give me something back.
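As a rough illustration of opening a URL and reading it like a file, here is a minimal Python 3 sketch. It rests on some assumptions: in Python 3 the function the lecture calls urllib.urlopen lives at `urllib.request.urlopen`, the lines it hands back are bytes that need decoding, and the helper name `read_lines` is invented for this example. To stay runnable without a network, the demonstration opens a `file:` URL for a temporary file instead of a web address.

```python
import pathlib
import tempfile
import urllib.request


def read_lines(url):
    """Open a URL like a file and return its lines as text.

    read_lines is a hypothetical helper name used only in this sketch.
    """
    lines = []
    with urllib.request.urlopen(url) as fhand:   # behaves much like a file handle
        for line in fhand:                       # loop line by line, as with a file
            lines.append(line.decode().strip())  # bytes -> str, drop the newline
    return lines


# Demonstration without a network: urlopen also understands file: URLs.
with tempfile.TemporaryDirectory() as folder:
    path = pathlib.Path(folder) / "sample.txt"
    path.write_text("first line\nsecond line\n")
    for line in read_lines(path.as_uri()):
        print(line)
```

With a network connection, the same function could be handed an http URL for romeo.txt, and urllib would take care of the port, the GET request, and the headers.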
This is like a file handle. And you can see that we can
then use this in a for loop just like we would use a file handle. Now this code should start
to look kind of familiar. We’re going to open a URL and
loop through line by line through the URL, and then print it out. That’s what this does. But from here on, that could be the same
thing for opening a local disk file. Opening romeo.txt off of your disk. So let me run that one. Another reminder: don’t call your program urllib.py. If you name your file the same
thing as a Python library, it will not go well. urllib1.py. Boom. Now one thing you’ll
notice is that with urllib we don’t get the headers. We only get the text, and that’s because urllib assumes
we want to read the content of the file. The headers are just metadata. They’re useful. Now it turns out there’s
a way in urllib to say, hey, give me the headers instead of the body, but
the common thing you pretty much want to do is not see the headers,
but instead just see the body. So urllib has simplified that,
and just given us that. But urllib is really beautiful,
because it turns something super complex, [LAUGH] of course,
it was 12 lines of code super complex,
but it then reduces it to two lines of code. So it’s pretty cool. And so, that’s what it does,
we already saw that. But the whole idea of urllib is
that it just turns URLs into files. And so, we can put these first two
lines at the top, import and open. And then we write pretty much any
program we want to do, right? So, this is a program that we’ve
done before where we’re going to loop through all the lines in the file [SOUND]. Then we’re going to split the lines into
words, and then we’re going to loop through all the words in the file, and then we’re
going to do a dictionary get pattern, right? And then we’re going to print the counts out. The point is, this code is
identical to a program that we did earlier that read a file and counted
the frequency of words in the file. Now, we’re using this exact same
code, only changing the top part. Open a URL and then read it,
versus open a file and then read it. So everything that you’ve been
doing with a file in Python, you can just as easily do with the URL. And you’re saying, like why did he tell
us about all that crazy detail? I don’t know,
I want you to know the detail. When it’s easy to use, I want you to understand that it’s
amazing that it’s this easy to use. Okay, so now we have got to
the point where we can retrieve and view the contents of a URL. The next thing we’re going to do is
we’re going to tear apart and try to make sense of that HTML
in our Python code.
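The word-counting program described earlier in this section can be sketched along the same lines. This is a Python 3 sketch under stated assumptions: the function name `count_words` is made up, `urllib.request.urlopen` is the Python 3 home of the urlopen call, and the demonstration reads a `file:` URL for a temporary file so that it runs without a network.

```python
import pathlib
import tempfile
import urllib.request


def count_words(url):
    """Count how often each word appears in the document at url.

    count_words is a hypothetical name used only in this sketch.
    """
    counts = dict()
    with urllib.request.urlopen(url) as fhand:
        for line in fhand:                 # loop through the lines
            words = line.decode().split()  # split each line into words
            for word in words:             # loop through the words
                counts[word] = counts.get(word, 0) + 1  # the dictionary get pattern
    return counts


# Demonstration on a small temporary file, opened through a file: URL.
with tempfile.TemporaryDirectory() as folder:
    path = pathlib.Path(folder) / "sample.txt"
    path.write_text("the quick fox and the lazy dog\n")
    print(count_words(path.as_uri()))
```

Apart from the decode step, only the line that opens the URL differs from the version that reads a local file.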
