Scraping Steam for Data using Python + BeautifulSoup

As promised in yesterday’s blog post about analyzing public Steam numbers, here are the juicy technical details behind scraping a website using Python, and a Python library called BeautifulSoup.


The Method

I chose to use Python because I’ve been using it for a little under two years to do number crunching, as well as building a few automation scripts for work. It’s very lightweight, very easy to read, and quite a mature language.

That said, you can probably do this in whatever you feel like, but my approach consisted of the following steps:

  1. Poke the Steam & Game Stats page and get the HTML page that is served up to the browser
  2. Parse the HTML code and pull out specific numbers that would be useful for analysis
  3. Open a specified CSV file, and add lines to the file with all of the relevant data
  4. Close the file and stand by for the next script run

What Was Used

The script is a very small file (33 lines!), and uses the following:

  * Python's built-in urllib, time, and datetime modules
  * The BeautifulSoup parsing library

And you can take a look at the Gist itself to see the full script, but I am going to use this post to explain some of the methodology behind the script, to help people who want to learn about writing Python and scraping web pages!

Alright, shut up, explain your code.

Of course!

steampage = BeautifulSoup(urllib.urlopen('http://store.steampowered.com/stats/?l=english').read())

This gets the ball rolling for the scraper. We use urllib to open a connection to the Steam & Game Stats page and read the response, then hand it to BeautifulSoup for parsing. (This is Python 2's urllib; in Python 3 the equivalent call is urllib.request.urlopen.) If you are unfamiliar (I was!), BeautifulSoup is a very powerful Python library that makes it super easy to navigate, search, and modify the parsed markup you receive from websites.

In short: read the code of a webpage using BeautifulSoup, and you get all kinds of methods to chop and screw it to your liking.

timestamp = time.time()
currentTime = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')

I wanted to use a consistent timestamp when recording data into the CSVs because it would allow me to group results in a sane manner. It uses the current time (of the script running) and formats it into YYYY-MM-DD HH:MM:SS so that when imported into Google Sheets, it would preserve the actual date and time aspects.
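As a sanity check of that format string, here is the same strftime() call run on a fixed datetime (fixed so the output is predictable, unlike time.time()):

```python
from datetime import datetime

# A hard-coded moment instead of time.time(), so the result
# doesn't depend on when (or where) this snippet is run.
moment = datetime(2016, 5, 1, 14, 30, 5)
stamp = moment.strftime('%Y-%m-%d %H:%M:%S')
print(stamp)  # 2016-05-01 14:30:05
```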

top100CSV = open('SteamTop100byTime.csv', 'a')

You’ll see two open(…) lines in my script, and both of them point to a specific CSV. This is where I dumped all of my data. The second parameter (‘a’) opens the file in append mode, adding new lines to the end rather than overwriting the file on every run.
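If the difference between file modes is unfamiliar, this throwaway sketch (the file name here is made up, not the script's CSV) shows why ‘a’ matters across repeated runs:

```python
import os
import tempfile

# Simulate two separate script runs appending to the same file.
# 'a' adds to the end each time; 'w' would truncate it first.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
for line in ('first run\n', 'second run\n'):
    with open(path, 'a') as f:
        f.write(line)

with open(path) as f:
    print(f.read())  # both lines survive: 'first run\nsecond run\n'
```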

for row in steampage('tr', {'class': 'player_count_row'}):
  steamAppID = row.a.get('href').split("/")[4]
  steamGameName = row.a.get_text().encode('utf-8')
  currentConcurrent = row.find_all('span')[0].get_text()
  maxConcurrent = row.find_all('span')[1].get_text()
  top100CSV.write('{0},{1},"{2}","{3}","{4}"\n'.format(currentTime, steamAppID, steamGameName, currentConcurrent, maxConcurrent))

This simple-looking for loop pulls out the Steam app ID for the game, the English game name (as listed on the Top 100 list), the number of concurrent players (as of the script reading the page), and the peak concurrent players seen throughout the day (I forget why I wanted this). It also adds that information on a new line inside of the CSV.

In addition, this loop shows off how much power BeautifulSoup packs into a few simple calls. Let me break it down into smaller pieces, because each one uses a different BeautifulSoup method.

for row in steampage('tr', {'class': 'player_count_row'}):

When I dug through the Steam & Game Stats source code, I realized that every game was listed inside of a table row with the class player_count_row. Upon seeing the pattern, I simply asked BeautifulSoup to iterate through every single table row using that class (calling the soup object directly like this is shorthand for find_all()), and since they are all uniform in their markup, I could consistently pull out the information we need.
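To see this in action without hitting Steam itself, here is a toy snippet in the same shape as the stats page rows; the HTML, the game, and the numbers in it are invented for illustration:

```python
from bs4 import BeautifulSoup

# A made-up table fragment mimicking the stats page markup.
html = """
<table>
  <tr class="player_count_row">
    <td><span>812,555</span></td>
    <td><span>851,509</span></td>
    <td><a href="http://store.steampowered.com/app/570/">Dota 2</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Calling the soup object directly is shorthand for find_all().
for row in soup('tr', {'class': 'player_count_row'}):
    print(row.a.get_text())                    # Dota 2
    print(row.find_all('span')[0].get_text())  # 812,555
```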

steamAppID = row.a.get('href').split("/")[4]

With BeautifulSoup, you can make direct references to markup (as seen above), and then grab attributes within the markup itself (like ‘href’). I did this to grab the URL of each Steam game, break it apart on the forward slashes (‘/’), and pull out the app ID that was nestled inside of the URL.
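To make the indexing concrete, here is what that split produces for a sample store URL (the URL is just an example, not taken from the script's output):

```python
# Splitting a store URL on '/' leaves the app ID at index 4.
url = 'http://store.steampowered.com/app/570/'
parts = url.split("/")
print(parts)     # ['http:', '', 'store.steampowered.com', 'app', '570', '']
print(parts[4])  # 570
```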

steamGameName = row.a.get_text().encode('utf-8')

Like the above attribute grabbing, get_text() is a very neat BeautifulSoup function that grabs the text inside a tag. The Steam & Game Stats page uses the game name itself as the link text, so it was a breeze to add to our collection of data.

currentConcurrent = row.find_all('span')[0].get_text()

Nabbing the current and peak concurrent users is the same procedure, so I only need to explain it once. The find_all() function from BeautifulSoup finds every instance of the specified markup and returns them as a list, which can then be indexed for easy modification or evaluation.

With the same methods and functions, I managed to easily find the current and peak concurrent users for Steam altogether.

Simple, right?

What else?

There’s nothing else, really! Those 33 lines were more than enough to collect lines upon lines of data inside of two CSV files.

There’s plenty of work that went into the analysis of that data, but that’s for another day.

Thanks for reading! As always, happy to answer questions or take feedback, so leave comments or yell at me on Twitter!