Web Scraping - Football Matches on the Jockey Club

Posted by Chris IO on September 21, 2017

Recently I have been drawn to the technique of web scraping. It is a handy way to collect first-hand data that is not readily available through an API or in preprocessed form. The good thing is, since the dataset is collected first-hand, your analysis may well be unique and turn up findings that are interesting to yourself and others. The tradeoff is that you have to spend extra time collecting and cleaning the data, and carefully design the architecture that runs the scraping code. Overall, web scraping allows more flexibility in data analysis and supports both preliminary analysis and large-scale data collection.

As a toy problem, I attempted to scrape the betting odds for football matches on the Hong Kong Jockey Club website. I don't have the habit of watching football, but the data is quite straightforward, and yet not available through any systematic channel like an API.

I ended up doing exactly what I wanted in fewer than 30 lines of code, much fewer than I would have expected. The procedure is quite intuitive and, with some adjustment, applicable to other cases. Basically, I used Cron and Splinter.

Cron is a built-in utility on macOS (and other Unix-like systems) that runs scheduled tasks. The idea is to schedule the crawler to run at a fixed interval and collect the data through a browser. In practice, the scraping should be done on a server to avoid interruption, but getting the program to run on a local machine is still a nice way to start.

Second, Splinter is a Python layer on top of Selenium. It allows code-driven browsing and information extraction, with a high-level API for writing automated tests and scraping web pages. For anyone with some experience in Python, it is a no-brainer. Of course there are web-automation libraries in other languages as well, for example RSelenium/rvest in R, and comparable options in JavaScript.

If you want to install Splinter, you first need a browser driver:

$ brew install chromedriver
$ brew install geckodriver

chromedriver is for Chrome and geckodriver is for Firefox; you only need the one that matches your browser. Then install Splinter itself:

$ pip install Splinter

Then you are good to go!
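As a quick smoke test that everything is wired up, something like this should open a Chrome window, load a page, and print a couple of values (example.com and the text checked for are just placeholders):

from splinter import Browser

browser = Browser('chrome')                       # launch a Chrome window via chromedriver
browser.visit('https://example.com')              # load a page
print(browser.title)                              # page title
print(browser.is_text_present('Example Domain'))  # True if the text appears on the page
browser.quit()                                    # close the browser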

Since I had never used Cron before, I decided to play around with it first to understand how it works:

$ crontab -e 

opens the crontab file, by default in the Vim editor. You may specify a custom editor with:

$ export EDITOR=youreditor

For testing purposes, I created a file called 123.sh which just prints the current time into a new file. Inside the crontab, I typed:

* * * * * /bin/sh /path/to/my/file/123.sh

The * * * * * part is the schedule: the five positions correspond to minute, hour, day of the month, month, and day of the week, so five asterisks means execute every minute. Cron should therefore run my 123.sh every minute, and it did, nice!

Once the entry is in the crontab, the script runs automatically every minute. In addition, you can set MAILTO=someemail in the crontab so that any output or errors are mailed to that address; by default they go to the local system mailbox at /var/mail/<user>.

To execute my Python script, the crontab entry is:

* * * * * /Library/Frameworks/Python.framework/Versions/3.6/bin/python3 /Users/pathtomyfile

The first path points to the Python interpreter I want to invoke, and the second path points to my Python script.

Lastly, type $ which python to check where your Python lives. For me it is the path above; replace it with your own.
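As a small aside, you can also get the interpreter path from inside Python itself, which is handy when several versions are installed:

import sys
print(sys.executable)   # full path of the interpreter currently running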

Once the crontab is up and running, we can turn to the crawler itself. Before writing it, I took a look at the website first.

[Screenshot of the HKJC football betting odds page]

As you can see, the website already presents the data in a clean table, including match number, match date, team names and odds.

On top of these, I decided to also record the refresh time and the current date, for indexing the data. Now, let's open an editor and put down a few lines:


from splinter import Browser
browser = Browser('chrome')
browser.visit('http://bet.hkjc.com/football/default.aspx')

These three lines open a Chrome browser and visit the stated website.
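One thing worth knowing for later: browser.html exposes the rendered page source as a plain Python string, which is what the regular expression below will scan.

# browser.html is just a (long) string of the rendered page source
print(type(browser.html), len(browser.html))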

Now I go straight to inspecting the ids of the elements I want:

  • Right-click on the page and choose “Inspect”.
  • Inspect each element in the HTML, including its class, id, CSS selector and XPath.

I went to the table and found that the id of each row is formatted as rmid110876, rmid110877 and so on, so I decided to use a regular expression to extract the ids and then loop through each of them.


import re
pattern = '(rmid11[0-9]{4})'
ids = re.findall(pattern, browser.html)

This block should return a list of the ids found on the page. Unfortunately, it returned an empty list. After some inspection I found that the entire table is embedded in a frame, which actually loads another url. Maybe they don't want others to crawl the betting odds, I guess? Anyway, I replaced the previous url with the new one.
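If the frame's target url is not obvious from the devtools, the frame elements themselves carry it in their src attribute. A hedged sketch (Splinter lets you read element attributes with dict-style access):

# print the target url of every iframe on the current page
for frame in browser.find_by_tag('iframe'):
    print(frame['src'])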


browser.visit('http://bet.hkjc.com/football/index.aspx?lang=en')
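Since the script will eventually run unattended from Cron, it may also be worth waiting for the table to render before parsing anything. A hedged sketch, using the sRefreshTime element that appears further down:

# wait up to 10 seconds for the refresh-time element before scraping
if not browser.is_element_present_by_id('sRefreshTime', wait_time=10):
    raise RuntimeError('odds table did not load')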

Now ids indeed contains a list of ids obtained from the website. Next, I need to crawl the text of each element and format it nicely. Each element has a text attribute, which is a string like this: 'THU 3 Orebro vs AFC Eskilstuna 22/09 00:00\n1.48 4.00 5.10'

so I can simply separate the information I want into groups with another regular expression.
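To see what the pattern pulls out of that sample string, here is a quick check (an illustrative snippet, separate from the actual script):

import re

sample = 'THU 3 Orebro vs AFC Eskilstuna 22/09 00:00\n1.48 4.00 5.10'
match = re.search(r'[\d]\s(.+)\svs\s(\D+)\s(.+)\n(\S+)\s(\S+)\s(\S+)', sample)
print(match.groups())
# ('Orebro', 'AFC Eskilstuna', '22/09 00:00', '1.48', '4.00', '5.10')

With that confirmed, the full loop over all the match rows looks like this: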

matchtime, home, away = [], [], []
odd_home, odd_draw, odd_away = [], [], []

# pull the displayed text of each match row and split it with the regex above
for match in ids:
    entry = browser.find_by_id(match).text
    regex = re.search(r'[\d]\s(.+)\svs\s(\D+)\s(.+)\n(\S+)\s(\S+)\s(\S+)', entry)
    home.append(regex.group(1))
    away.append(regex.group(2))
    matchtime.append(regex.group(3))
    odd_home.append(regex.group(4))
    odd_draw.append(regex.group(5))
    odd_away.append(regex.group(6))

#get the latest refresh time
refresh = [browser.find_by_id('sRefreshTime').value] * len(ids)

The code may look repetitive but its logic is simple. We can write the information into a csv file with a few lines:


import csv
import time

#append the current date to each row so the data can be sorted later
date = [time.strftime("%Y/%m/%d")] * len(ids)

#zip the eight columns into rows
rows = zip(matchtime,date,refresh,home,away,odd_home,odd_draw,odd_away)

#append to the csv file; the with-block closes it automatically
with open('football.odds.csv','a',newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)

browser.quit()

There you go. Put the code inside a script and invoke it from the crontab as described above; each run then appends one row per match to football.odds.csv.
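Since the date column is there precisely for sorting, here is one way the accumulated file could be read back and ordered later (a hedged sketch; the file has no header row, so columns are addressed by position):

import csv

with open('football.odds.csv') as f:
    rows = list(csv.reader(f))

# sort by the date column (index 1), then by the refresh time (index 2)
rows.sort(key=lambda r: (r[1], r[2]))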

To stop the job, you can back up your crontab with crontab -l > backup.txt, then erase it with crontab -r.

This is only an experiment; crawling at a larger scale would need to consider latency, setting up a remote server, and the possibility of being rejected by the site for overly frequent access.
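On the last point, one simple defensive tweak would be to retry the visit a couple of times with a pause, instead of letting a single flaky page load kill the whole cron run (a hedged sketch, not something the current script does):

import time
from splinter import Browser

browser = Browser('chrome')
for attempt in range(3):
    try:
        browser.visit('http://bet.hkjc.com/football/index.aspx?lang=en')
        break                # page loaded, stop retrying
    except Exception:
        time.sleep(10)       # wait a bit before the next attempt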

As a next step, I would probably test the code on a remote server or a virtual machine, just to see whether any new problems arise.