
Scraping Steam for Data using Python + BeautifulSoup

As promised in yesterday’s blog post about analyzing public Steam numbers, here are the juicy technical details behind scraping a website using Python, and a Python library called BeautifulSoup.


The Method

I chose to use Python because I’ve been using it for a little under two years to do number crunching, as well as building a few automation scripts for work. It’s very lightweight, very easy to read, and quite a mature language.

That said, you can probably do this in whatever you feel like, but my approach consisted of the following steps:

  1. Poke the Steam & Game Stats page and get the HTML page that is served up to the browser
  2. Parse the HTML code and pull out specific numbers that would be useful for analysis
  3. Open a specified CSV file, and add lines to the file with all of the relevant data
  4. Close file, standby for next script run

What Was Used

The script is a very small file (33 lines!), and uses the following: Python, the BeautifulSoup library for parsing, urllib for fetching the page, and the standard time and datetime modules, with a cronjob kicking it off on a schedule.

You can take a look at the Gist itself to see the full script, but I am going to use this post to explain some of the methodology behind it, to help people who want to learn about writing Python and scraping web pages!

Alright, shut up, explain your code.

Of course!

import time
import urllib
from datetime import datetime
from bs4 import BeautifulSoup

# Fetch the Steam & Game Stats page and hand the raw HTML to BeautifulSoup
steampage = BeautifulSoup(urllib.urlopen('http://store.steampowered.com/stats/?l=english').read())

This gets the ball rolling for the scraper. We use urllib to open a connection to the Steam & Game Stats page, and then feed what it reads to BeautifulSoup. If you are unfamiliar (I know I was), BeautifulSoup is a very powerful Python library that makes it super easy to navigate, search, and modify the parsed HTML you receive from websites.

In short: read the code of a webpage using BeautifulSoup, and you get all kinds of methods to chop and screw it to your liking.
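
Just to illustrate (these lines aren't part of the script, only examples of the kinds of methods you get):

print steampage.title.get_text()                            # the page's <title> text
print len(steampage.find_all('a'))                          # count every link on the page
print steampage.find('tr', {'class': 'player_count_row'})   # the first stats row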

# Grab the current time and format it as YYYY-MM-DD HH:MM:SS
timestamp = time.time()
currentTime = datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')

I wanted to use a consistent timestamp when recording data into the CSVs because it would allow me to group results in a sane manner. The script captures the current time (at the moment it runs) and formats it into YYYY-MM-DD HH:MM:SS so that, when imported into Google Sheets, the actual date and time aspects are preserved.

# Open the CSV in append mode ('a') so each run adds rows instead of overwriting
top100CSV = open('SteamTop100byTime.csv', 'a')

You’ll see two open(…) lines in my script, and both of them point to a specific CSV. This is where I dumped all of my data. The second parameter (‘a’) ensured I was appending to the CSV rather than overwriting it on every run.

for row in steampage('tr', {'class': 'player_count_row'}):
    steamAppID = row.a.get('href').split("/")[4]            # app ID, pulled from the store URL
    steamGameName = row.a.get_text().encode('utf-8')        # game name is the link text
    currentConcurrent = row.find_all('span')[0].get_text()  # players in-game right now
    maxConcurrent = row.find_all('span')[1].get_text()      # today's peak

    top100CSV.write('{0},{1},"{2}","{3}","{4}"\n'.format(currentTime, steamAppID, steamGameName, currentConcurrent, maxConcurrent))

This simple-looking for loop pulls out the Steam ID for the game, the English game name (as listed on the Top 100 list), the number of concurrent players (at the moment the script reads the page), and the peak concurrent players seen throughout the day (I forget why I wanted this). It then writes that information to a new line in the CSV.

In addition, this loop shows off the simplicity and power of BeautifulSoup. Let me break it down into smaller pieces, because each one uses a different BeautifulSoup method.

for row in steampage('tr', {'class': 'player_count_row'}):

When I dug through the Steam & Game Stats source code, I realized that every game was listed inside a table row with the class player_count_row: each row holds a link to the game’s store page and two <span>s with the current and peak player counts. Upon seeing the pattern, I simply asked BeautifulSoup to iterate through every table row with that class (calling steampage(…) like this is shorthand for steampage.find_all(…)), and since the rows are all uniform in their markup, the script can consistently pull out the information we need.

    steamAppID = row.a.get('href').split("/")[4]

With BeautifulSoup, you can make direct references to markup (row.a above grabs the first link in the row), and then grab attributes from within the markup itself (like ‘href’). I did this to grab the URL of each Steam game, break it apart at the forward slashes (‘/’), and pull out the app ID nestled inside the URL.
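
To make that concrete, here is the split in action (the Dota 2 URL is just an example):

parts = 'http://store.steampowered.com/app/570/'.split("/")
# parts == ['http:', '', 'store.steampowered.com', 'app', '570', '']
# index:     0       1    2                         3      4     5
print parts[4]  # -> '570', the app ID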

    steamGameName = row.a.get_text().encode('utf-8')

Like the attribute grabbing above, get_text() is a very neat BeautifulSoup method that grabs the visible text inside a tag. The Steam & Game Stats page uses the game name itself as the link text, so it was a breeze to add to our collection of data. (The .encode('utf-8') is there because some game names contain non-ASCII characters.)

    currentConcurrent = row.find_all('span')[0].get_text()

Nabbing the current and peak concurrent users is the same procedure, so I only need to explain it once. BeautifulSoup’s find_all() function finds every instance of the specified markup and returns them as a list, which can then be indexed for easy access.
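
To make the indexing concrete (the numbers here are made up):

spans = row.find_all('span')
# spans -> [<span>612,196</span>, <span>673,018</span>]
# spans[0].get_text() is the current count, spans[1].get_text() is today's peak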

With the same methods and functions, I easily grabbed the current and peak concurrent users for Steam as a whole, which go into the second CSV.
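
I won’t paste that second half verbatim, but it follows the exact same shape. Here is a rough sketch, where the CSV filename and the 'stats_top' class are placeholders rather than the real script’s names or the page’s actual markup:

# Placeholders below: 'SteamOverallbyTime.csv' and 'stats_top' are illustrative
overallCSV = open('SteamOverallbyTime.csv', 'a')
overallSpans = steampage.find('div', {'class': 'stats_top'}).find_all('span')
overallCSV.write('{0},"{1}","{2}"\n'.format(currentTime, overallSpans[0].get_text(), overallSpans[1].get_text()))

# Step 4 of the method: close the files and stand by for the next run
overallCSV.close()
top100CSV.close()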

Simple, right?

What else?

There’s nothing else, really! Those 33 lines were more than enough to collect lines upon lines of data inside of two CSV files.

There’s plenty of work that went into the analysis of that data, but that’s for another day.

Thanks for reading! As always, happy to answer questions or take feedback, so leave comments or yell at me on Twitter!

An Analysis of Activity on Steam

I have wanted to flex my analysis muscles for quite some time and thought I’d get some practice on publicly available data: the Steam & Game Stats page.


For the unaware: Steam is one of the largest gaming platforms available for PCs, and it has a wonderful stats page that lists the current number of concurrent players for each game, as well as the current number of concurrent users overall. Ever since seeing this page, I’ve wondered what actual activity levels looked like, so I decided to scrape it for a week or two and see what sort of data I could get.

NOTE: If you’re interested in the technical details behind the scraping, I will provide that in a separate post in the near future.

The data I have managed to mine from this page is trivial at best, but I had a ton of fun learning how to build the scraper (uses Python, BeautifulSoup, and a handy dandy cronjob) as well as figuring out the required Google Sheets equations to put it all together.

Some Fun Facts About Steam Activity

On average, 19.95% of concurrent Steam users are actually playing games.

Here is a visualization of what Steam activity levels have looked like from March 7, 2014 to March 19, 2014:

On average, 1/5th of the Steam “concurrent” user base is actually playing a game. I will concede that this is not dead-on accurate, as I only have numbers for the top 100 games at any given moment. It is still reliable as an estimate, though: games ranked 90-100 account for only 0.02% of the user base, so any additional users playing non-top-100 games have a relatively insignificant effect.
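
For the curious, here is roughly how that percentage falls out of the two CSVs. This is a sketch, not my actual spreadsheet formulas: the top-100 columns match the scraper, but the overall CSV’s filename and layout are assumptions:

import csv
from collections import defaultdict

# Sum the top-100 in-game player counts for each timestamp
inGame = defaultdict(int)
for ts, appID, name, current, peak in csv.reader(open('SteamTop100byTime.csv')):
    inGame[ts] += int(current.replace(',', ''))

# Assumed layout for the overall file: timestamp, current users, peak users
overall = {}
for row in csv.reader(open('SteamOverallbyTime.csv')):
    overall[row[0]] = int(row[1].replace(',', ''))

ratios = [float(inGame[ts]) / overall[ts] for ts in overall if ts in inGame]
print 'Average in-game share: {0:.2%}'.format(sum(ratios) / len(ratios))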

There was a big spike on March 9, 2014 at 4:00pm EDT (16:00), when the number of Counter-Strike: Global Offensive players ballooned to a staggering 111,893, presumably for the start of a tournament. (I haven’t dug into this one too much.)

Dota 2 dominates games played on Steam, accounting for an average of 5.58% of the concurrent player base.

Not a real surprise, given the popularity of Dota 2. The second most played game is Counter-Strike: Global Offensive, sitting at a distant average of 1.66%, about three times smaller than Dota 2. Dota 2’s share translates to an average of about 400,000 concurrent players, with the highest I’ve recorded at 673,018 concurrent players on March 15, 2014 at 10:00am EDT.

In fact, most of the highest concurrent player counts in Dota 2 happen in the mornings, from 9:00am to 11:00am.

However, this might be indicative of the real struggle MOBAs face against the titan amongst gods, League of Legends, which boasted an impressive peak of 7.5 million concurrent users in January 2014.

Hard to gain ground on such an entrenched competitor, but they’re definitely doing their best.

160 games have been a part of the Steam Top 100 between March 7, 2014 and March 19, 2014.

While that sounds like a lot, we have to remember that the Steam catalogue currently sits at over 3,000 titles and growing, so it’s pretty safe to say that breaking into the top 100 is no easy feat.

Further analysis I could do as time goes on is a breakdown of the genres represented in the Top 100 list, which would also provide a decent idea of what is and is not popular on Steam. This is also an extremely small sample size; it would be more worthwhile to collect this data over a period of a year to make it really meaningful.

What’s Next?

Well, this is a big pile of data, and it’s growing by the hour. This is great, but what can I really do with this data?

For starters, the original goal was to figure out whether there is a link between publishers’ digital marketing behaviours and the level of concurrent players on Steam, as well as growth or decline in player base from that activity (or lack thereof). I will have to explore whether that is still possible to figure out, as a lot of marketing activities are hard to track down, or hard to attribute to the success of a game.

Secondly, I will have to step up the data storage game a bit to make it much more accessible. Currently, a Python script scrapes the Steam & Game stats page, adding a line to two separate CSV files with all the relevant data. I’d like to transition this into an actual database (probably MySQL) and maybe make it open to the public to poke at and do their own analysis.

Lastly, I’m really not sure. It was a fun side project in the first place, and I feel like it was a great learning experience and a fantastic way to brush up on my analytical skills.

Have any ideas or want access to the data? Give me a shout, I’m happy to share!

Week 1 of Coding: Ouch, right in the feels.

Ah, glorious Thursday. Not even close to the end of the week and I’m writing a wrap-up of my week so far. Want to know why?

My project is broken.

Yup, absolutely broken. Well, it works up to a certain point (and was really fun writing!) but there’s no logical way to finish it.

Recapping

This week, I spent the majority of my days getting set up with Python and writing a script that accesses the Last.FM API to grab all of my scrobbles, categorize them by genre, and present the data by year to visualize the change in my musical tastes over time.

Everything went well until the “categorizing into genres” part, because trying to pin a single genre to an artist is apparently very difficult. Last.FM doesn’t use “genres”; it has tags, which are user-applied and include a high level of variance. MusicBrainz, which Last.FM utilizes (I believe), also uses tags. Scraping Wikipedia and AllMusic resulted in gigantic piles of genres for each artist, so I’ve resorted to a more manual approach: tagging every artist with a specific genre by hand.

It hurts. Right in the wrists. (And the feels.)
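
To show what I was up against, here is roughly what asking Last.FM for an artist’s tags looks like (API_KEY is a placeholder, and this is a sketch rather than my project’s actual code):

import json
import urllib

API_KEY = 'your_lastfm_api_key'  # placeholder

def top_tags(artist):
    # artist.gettoptags returns the user-applied tags, not one clean genre
    url = ('http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags'
           '&artist={0}&api_key={1}&format=json').format(urllib.quote(artist), API_KEY)
    data = json.loads(urllib.urlopen(url).read())
    return [t['name'] for t in data['toptags']['tag']]

print top_tags('Radiohead')  # a pile of tags like 'alternative', 'rock', 'seen live'...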

However, this week has been a really fun experience and only makes me look forward to the next project. Before I talk about that, there are some important lessons I learned throughout the week.

1) Homebrew and pip are your best friends.

Writing basic scripts in Python? Yeah, that’s no sweat.
Want to write more complex scripts that might require external libraries? Yeah, have fun compiling and installing that stuff.

Well, okay, in actuality, it’s still relatively simple. But compared to typing ‘pip install pymongo‘ into the terminal? It’s quite a bit more complex!

With Homebrew and pip, I managed to get MongoDB onto my development machine, install the PyMongo driver, and install the unidecode library, all in a matter of seconds.

Sa-weet!

2) Unicode can burn in hell.

I spent the better part of today and yesterday figuring out how to wrangle unicode. A few of the artists from my scrobble list have Asian characters in their names on Last.FM, and Python (or MongoDB) automatically turns them into their escaped unicode representations.

That’s all good and well, but turning them back (and using them in functions) is an absolute nightmare. Thank goodness for unidecode for (temporarily) solving that nightmare.
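
The fix ends up being about one line. A minimal sketch (the artist name is just an example, and the transliteration is approximate):

from unidecode import unidecode

artist = u'東京事変'      # stored internally as an escaped unicode string
print unidecode(artist)  # -> an ASCII-safe approximation like 'Dong Jing Shi Bian'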

3) MongoDB is pretty awesome.

During the project, I was able to pull my scrobbles down from Last.FM, but I wanted to insert them into a database.

I’m used to working with MySQL, so I attempted to get that up and running. Well, after half an hour of yelling at my computer, I decided to take the lazy route and look for alternatives that might be quicker. The suggested alternatives were SQLite or some sort of NoSQL solution. I figured it would also be a good opportunity to try out those fancy datastores I kept seeing on Hacker News, and settled on MongoDB.

Got it up and running within minutes on default settings, and it’s been pretty smooth sailing so far. Inserting and retrieving data has been a breeze (my dataset is only 50,000 items) and I have enjoyed the experience.
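
A taste of how simple it is (the field names here are examples, not my actual schema):

from pymongo import MongoClient

client = MongoClient()  # default settings: localhost:27017
scrobbles = client.lastfm.scrobbles

scrobbles.insert({'artist': 'Radiohead', 'track': 'Reckoner', 'played_at': 1394000000})
print scrobbles.find({'artist': 'Radiohead'}).count()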

I’m not skilled enough (yet) to really grasp the differences between the different types of datastores, and I make no attempt at doing so. I was just enamoured by the incredibly short amount of time it took for me to get up and running on Mongo.

4) I took breaks by learning Spanish.

I’ve been experimenting with the Pomodoro technique (25 minute sprints, 5 minute breaks) and it’s been a really good way of creating hard deadlines and stop-points for work.

However, I generally surf during breaks and get carried away for more than 5 minutes, so I wanted to do something that allowed for shorter bursts.

Enter Duolingo.

It turns out, doing one or two lessons on Duolingo was perfect: I would sit here shouting Spanish phrases and words at my computer, laughing all the while; then my alarm would go off and I would get right back to work.

As a result, I’ve familiarized myself with basic Spanish words and phrases, and I am working my way through as much of the Spanish portion of Duolingo as I can. It’s a win-win situation, as far as I can tell!

Want to follow my Spanish-learning progress? Check out my profile on Duolingo.

Here comes week 2!

I haven’t decided exactly what project I’ll be working on next week, but I would really like to focus on starting to incorporate tests into my code. I’ve been ignoring them for now because I figured it would be beneficial to familiarize myself with syntax over anything else.

However, if I want to get any better at this, I’ll have to learn to write code that isn’t complete crap, and I think tests would be a good place to start.
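
Something like this is the kind of small, dumb test I have in mind (pick_genre is a hypothetical helper, not code from the project):

import unittest

def pick_genre(tags):
    # hypothetical helper: call the first tag the "genre"
    return tags[0] if tags else 'unknown'

class TestPickGenre(unittest.TestCase):
    def test_first_tag_wins(self):
        self.assertEqual(pick_genre(['indie rock', 'seen live']), 'indie rock')

    def test_empty_tag_list(self):
        self.assertEqual(pick_genre([]), 'unknown')

if __name__ == '__main__':
    unittest.main()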

In the meantime, I’m going to try to categorize my music and complete my musical visualization. Thanks for reading my weekly brain dump!