Simple web scraping with Google Sheets

Our goal, as discussed, is to get data into a spreadsheet application which can be used to perform analysis. In the last topic, we looked at scraping data from PDFs with Tabula.

If a table of data has been published on a web page, sometimes you can scrape it directly into your spreadsheet application. Here we are going to look at an example with Google Sheets.

Take a look at this page, which shows current and historical tender notices for the state owned entity (SOE) CIPC. http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/

It doesn't look like it, but the list of tenders on this page is published as a table. If we want to import this list - for example to count how many tenders were issued in different months of the year - we can do it directly from within Google Sheets. First of all, we open up a new spreadsheet.

Put your cursor in cell A1 (for more information about how to navigate spreadsheets, see our Data Pipeline module). Now type the following formula. =IMPORTHTML("http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/", "table", 1)

Let's pause for a second to look at that formula.

The equals sign at the start tells Google Sheets that this cell is a formula, not a string of text and numbers (like a paragraph). A formula is an instruction to do something, for example =sum(A2:E2) tells Google Sheets to add the values in column 2, rows A to E together. =average(A2:E2) would look for the arithmetic mean of the same numbers.

In our formula, =importhtml tells Google Sheets to load the webpage inside the brackets ("http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/"). It's important to notice the quote marks around the website URL.

Each element of the formula is separate by commas. The second element tells Google Sheets to load a table from the webpage, and the final element - the 1 - which table to load. For humans, reading the formula backwards makes more sense:

"Load the first table you find on http://www.cipc.co.za/index.php/tenders-and-supplier-database-form. It's in HTML"

Working with imported data

After you've entered the formula, your spreadsheet should look like this.

We can tidy this up by making the columns a little wider. Just put your cursor on the line between A and B at the top of the sheet, for example, and drag it manually to the right. You can also double click this line to automatically resize the column.

Now we have a table of all the latest tenders from CIPC, in a machine readable format. Because this table is reloaded from the original website every time we refresh our sheet, you might want to copy the data onto a fresh tab to store a copy for working on. Just press CTRL + A to select the whole table, and click the + icon in the bottom left to add a new tab to your spreadsheet. In Cell A1 on the new sheet, press CTRL + SHIFT + V. This is the command to paste values only, in other words, it takes the characters in each cell and not the formula.

Now your spreadsheet is ready for work.

Last updated