Web Scraping using lxml and Python 2018: Extracting data from Steam


Hello everyone! I hope you are doing well. I am Yasoob, and most of you may remember me from the Python Tips blog. Today I’ll teach you the basics of web scraping using Python.

First of all, why should you even bother learning how to web scrape? If your job doesn’t require you to learn it, then let me give you some motivation. What if you want to create a website which curates the cheapest products from Amazon, Walmart, and a couple of other online stores? A lot of these online stores don’t provide you with an easy way to access their information using an API. Even if they have an API, it is usually rate-limited, so you don’t have a lot of flexibility there. In the absence of an API, your only choice is to create a web scraper which can extract information from these websites automatically and provide you with that information in an easy-to-use way.

Here is an example of a typical API response
in JSON. This is the response from Reddit.

There are a lot of Python libraries out there which can help you with web scraping. There is lxml, BeautifulSoup, and a full-fledged framework called Scrapy. Most of the tutorials discuss BeautifulSoup and Scrapy, so I decided to go with lxml in this screencast. I will teach you the basics of XPaths and how you can use them to extract data from an HTML document. I will take you through a couple of different examples so that you can quickly get up to speed with lxml and XPaths.

If you are a gamer, you will already know of (and likely love) this website. We will be trying to extract data from Steam. More specifically, we will be extracting data from the “popular new releases” section. I might convert this into a two-part series. For now, we will be creating a Python script which can extract the names of the games, the prices of the games, the different tags
associated with each game and the target platforms.

Step 1: Exploring Steam

First of all, open up the “popular new releases” page on Steam and scroll down until you see the Popular New Releases tab. At this point, I usually open up Chrome developer tools and see which HTML tags contain the required data. I extensively use the element inspector tool, which is the button in the top left of the developer tools. It allows you to see the HTML markup behind a specific element on the page with just one click.

As a high-level overview, everything on a web page is encapsulated in an HTML tag, and tags are usually nested. You need to figure out which tags you need to extract the data from and you are good to go. In our case, if we take a look, we can see that every separate list item is encapsulated in an anchor tag. The anchor tags themselves are encapsulated in the div with an id of tab_newreleases_content. I am mentioning the id because there are two tabs on this page. The second tab is the standard “New Releases” tab, which we don’t want to extract the information from. Hence, we will first extract the “Popular New Releases” tab, and then we will extract the required information from this extracted
tag.

Step 2: Start writing a Python script

This is a perfect time to create a new Python file and start writing down our script. I am going to create a scrape.py file. Now let’s go ahead and import the required libraries. The first one is the requests library and the second one is the lxml.html library. The requests library is going to help us open the web page in Python. We could have used lxml to open the HTML page, but it doesn’t work well with all web pages, so to be on the safe side I am going to use requests.

Now let’s open up the web page using requests and pass that response to the lxml.html.fromstring method. This provides us with an object of HtmlElement type. This object has the xpath method, which we can use to query the HTML document. This provides us with a structured way to
extract information from an HTML document.

Step 3: Fire up the Python Interpreter

Now save this file and open up a terminal. Copy the code from the scrape.py file and paste it in a Python interpreter session. We are doing this so that we can quickly test our XPaths without continuously editing, saving and executing our scrape.py file.

Let’s try writing an XPath for extracting the div which contains the “Popular New Releases” tab. I will explain the code as we go along. This statement, which queries //div[@id="tab_newreleases_content"], will return a list of all the divs in the HTML page which have an id of tab_newreleases_content. Now, because we know that only one div on the page has this id, we can take out the first element from the list, and that would be our required div.

Let’s break down the XPath and try to understand it. These double forward slashes (//) tell lxml that we want to search for all tags in the HTML document which match our requirements. Another option was to use a single forward slash. The single forward slash returns only the immediate child tags/nodes which match our requirements. div tells lxml that we are searching for divs in the HTML page. And this particular piece of code, [@id="tab_newreleases_content"], tells lxml that we are only interested in those divs which have an id of tab_newreleases_content.

Cool! We have got the required div. Now let’s go back to Chrome and check which
tag contains the titles of the new releases.

Step 4: Extract the titles & prices

The title is contained in a div with a class of tab_item_name. Now that we have the “Popular New Releases” tab extracted, we can run further XPath queries on that tab. Write down the following code in the same Python console which we previously ran our code in. This gives us the titles of all of the games in the “Popular New Releases” tab. Here is the expected output.

Let’s break down this XPath a little bit as well, because it is a bit different from the last one. Here a “.” tells lxml that we are only interested in the tags which are nested inside the new_releases tag.
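
To see what the leading “.” changes, here is a small sketch run against made-up markup rather than the live Steam page. The id tab_other_content and the game names are hypothetical, and the title XPath is only my reading of the query described in this step, not necessarily the exact one used in the video.

```python
import lxml.html

# Hypothetical, simplified markup; the real Steam page is far larger.
SAMPLE_HTML = """
<html><body>
  <div id="tab_newreleases_content">
    <div class="tab_item_name">Game A</div>
    <div class="tab_item_name">Game B</div>
  </div>
  <div id="tab_other_content">
    <div class="tab_item_name">Game C</div>
  </div>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)
new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]

# "." anchors the search at new_releases, so only tags nested inside it match.
relative_titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')

# Without ".", "//" starts from the document root, even when called on new_releases.
document_titles = new_releases.xpath('//div[@class="tab_item_name"]/text()')

print(relative_titles)
print(document_titles)
```

In this sketch relative_titles holds only the two titles inside the tab, while document_titles also picks up the title from the other div.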
This particular filter is pretty similar to how we were filtering divs based on id. The only difference is that here we are filtering based on the class name. And /text() tells lxml that we want the text contained within the tag we just extracted. In this case, it returns the title contained in the div with the tab_item_name class name.

Now we need to extract the prices for the games. We can easily do that by running the following code. I don’t think I need to explain this code, as it is pretty similar to the title extraction code. The only change we made is the change in the
class name.

Step 5: Extracting tags

Now we need to extract the tags associated with the titles. Here is the HTML markup. Write down the following code in the Python terminal to extract the tags. You can pause the video at any time to take a better look at the code. I will also provide code files with this video so that it is easier for you to check whether you wrote the correct code or not.

So what are we doing here? We are extracting the divs containing the tags for the games. Then we loop over the list of extracted tags and extract the text from those tags using the text_content() method. The text_content() method returns the text contained within an HTML tag without the HTML markup.

We could have also made use of a list comprehension to make that code shorter. I wrote it down in this way so that even those who don’t know about list comprehensions can understand the code. Either way, let’s go ahead and convert this into a list comprehension, because you already know how to do it normally. Let’s separate the tags in a list as well so
that each tag is a separate element.

Step 6: Extracting the platforms

Now the only remaining thing is to extract the platforms associated with each title. Here is the HTML markup. The major difference here is that the platforms are not contained as text within a specific tag. They are listed as the class name. Some titles only have one platform associated with them, like this, while some titles have 5 platforms associated with them, like this. As we can see, these spans contain the platform type as the class name. The only common thing between these spans is that all of them contain the platform_img class.

First of all, we will extract the divs with the tab_item_details class, then we will extract the spans containing the platform_img class, and finally we will extract the second class name from those spans. Write down the following code.

In line 1 we start with extracting the tab_item_details
div. The XPath in line 5 is a bit different. Here we have [contains(@class, "platform_img")] instead of simply having [@class="platform_img"]. The reason is that [@class="platform_img"] returns only those spans whose class attribute is exactly platform_img. If the spans have an additional class, they won’t be returned. Whereas [contains(@class, "platform_img")] matches all the spans which have the platform_img class. It doesn’t matter whether it is the only class or if there are more classes associated with that tag.

In line 6 we are making use of a list comprehension
to reduce the code size. The .get() method allows us to extract an attribute of a tag. Here we are using it to extract the class attribute of a span. We get a string back from the .get() method. In the case of the first game, the string being returned is “platform_img win” (“win” as in “windows”), so we split that string on the whitespace and store the last part (which is the actual platform name) of the split string in the list.

In lines 7-8 we are removing the hmd_separator from the list if it exists. This is because hmd_separator is not a platform. It is just a vertical separator bar used to
separate actual platforms from VR/AR hardware.

Step 7: Conclusion

This is the code we have so far. Now we just need this to return a JSON response so that we can easily turn this into a Flask-based API. Here is the code. This code is self-explanatory. We are using the zip function to loop over all of those lists in parallel. Then we create a dictionary for each game and assign the title, price, tags, and platforms as separate keys in that dictionary. Lastly, we append that dictionary to the output list.

In a future screencast, we might take a look at how to convert this into a Flask-based API and host it on Heroku. But this is going to be the final step in this particular screencast.

I hope you guys enjoyed this tutorial. If you want to read more tutorials of a similar nature, please go to Python Tips. I regularly write Python tips, tricks, and tutorials on that blog. And if you are interested in learning intermediate Python, then please check out my open source book here. If you have any feedback based on this screencast, please share it in the comments below. I would love to improve future screencasts, and your feedback is the only way for me to know what to improve. Have a great day!
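
For reference, the steps walked through in this screencast can be sketched end to end roughly as follows. This is a minimal sketch and not necessarily the exact code shown in the video. It parses a small made-up HTML string instead of fetching the live Steam page, so it runs offline; in a real script the string would be the content of the response returned by requests. Only tab_newreleases_content, tab_item_name, tab_item_details, platform_img and hmd_separator come from the transcript. The class names item_price and item_tags, the game data, and the variable names are placeholders assumed for illustration.

```python
import json

import lxml.html

# Hypothetical, heavily simplified stand-in for the Steam page.
# item_price and item_tags are placeholder class names, not Steam's real ones.
SAMPLE_HTML = """
<html><body>
<div id="tab_newreleases_content">
  <div class="item_price">$9.99</div>
  <div class="tab_item_name">Game A</div>
  <div class="tab_item_details">
    <span class="platform_img win"></span>
    <div class="item_tags">Action, Indie</div>
  </div>
  <div class="item_price">$19.99</div>
  <div class="tab_item_name">Game B</div>
  <div class="tab_item_details">
    <span class="platform_img win"></span>
    <span class="platform_img mac"></span>
    <span class="platform_img hmd_separator"></span>
    <span class="platform_img htcvive"></span>
    <div class="item_tags">Strategy, <span>VR</span></div>
  </div>
</div>
</body></html>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# Only one div has this id, so take the first (and only) match.
new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]

# Titles and prices: same query shape, different class name.
titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')
prices = new_releases.xpath('.//div[@class="item_price"]/text()')

# Tags: text_content() drops nested markup; split so each tag is its own element.
tag_divs = new_releases.xpath('.//div[@class="item_tags"]')
tags = [div.text_content().strip().split(', ') for div in tag_divs]

# Platforms: the platform name is the last class on each platform_img span.
total_platforms = []
for details in new_releases.xpath('.//div[@class="tab_item_details"]'):
    spans = details.xpath('.//span[contains(@class, "platform_img")]')
    platforms = [span.get('class').split(' ')[-1] for span in spans]
    if 'hmd_separator' in platforms:
        platforms.remove('hmd_separator')  # a separator bar, not a platform
    total_platforms.append(platforms)

# Walk the four lists in parallel and build one dictionary per game.
output = []
for title, price, game_tags, platforms in zip(titles, prices, tags, total_platforms):
    output.append({
        'title': title,
        'price': price,
        'tags': game_tags,
        'platforms': platforms,
    })

print(json.dumps(output, indent=2))
```

Because zip stops at the shortest list, a game that is missing a price or a tags div on the real page would silently shift or drop entries, so it is worth checking that the four lists have the same length before trusting the output.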

17 thoughts on “Web Scraping using lxml and Python 2018: Extracting data from Steam”

  1. Hi, thank you for the wonderful article.
    I played a little bit with the code as per your guide and tried to implement something a little bit by myself:
    https://pastebin.com/TpviE7sn

    The code can be run saving the pages but I got this message and can't figure out what I've done wrong as I'm a beginner. Why is it "out of range"?

    Traceback (most recent call last):
    File "h:python_scraping.py", line 19, in <module>
    if contain_prod_data(data):
    File "h:python_scraping.py", line 14, in contain_prod_data
    prod_list = doc.xpath('//div[@class="catalog-items-list view-list"]')[0]
    IndexError: list index out of range

    I also noticed that when the cycle breaks this line is not executed:
    26. print("Finished saving pages: " +str(page_len))

    Thank you for any comments in advance.
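
One possible reading of the traceback above: xpath() returns a plain Python list, and that list is empty when nothing on the page matches the query, so indexing it with [0] raises IndexError. The sketch below reproduces that on a made-up page and shows one way to guard against it. The helper name first_match and the sample markup are hypothetical; only the XPath string is taken from the traceback, and the actual cause on the commenter's page may differ.

```python
import lxml.html


def first_match(doc, query):
    """Return the first XPath match, or None when nothing matched."""
    matches = doc.xpath(query)
    return matches[0] if matches else None


# Hypothetical page that does not contain the div the query is looking for.
page = lxml.html.fromstring(
    '<html><body><div class="other">no catalog here</div></body></html>'
)
query = '//div[@class="catalog-items-list view-list"]'

results = page.xpath(query)
print(results)  # an empty list, so results[0] would raise IndexError

prod_list = first_match(page, query)
if prod_list is None:
    print("No matching div on this page")
```

Note that [@class="..."] compares the whole class attribute, so a div whose classes appear in a different order or with extra classes would not match either.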
