Simple web scraping with Google Sheets
Last updated
Last updated
Our goal, as discussed, is to get data into a spreadsheet application which can be used to perform analysis. In the last topic, we looked at scraping data from PDFs with Tabula.
If a table of data has been published on a web page, sometimes you can scrape it directly into your spreadsheet application. Here we are going to look at an example with Google Sheets.
Take a look at this page, which shows current and historical tender notices for the state owned entity (SOE) CIPC. http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/
It doesn't look like it, but the list of tenders on this page is published as a table. If we want to import this list - for example to count how many tenders were issued in different months of the year - we can do it directly from within Google Sheets. First of all, we open up a new spreadsheet.
Put your cursor in cell A1 (for more information about how to navigate spreadsheets, see our Data Pipeline module). Now type the following formula. =IMPORTHTML("http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/", "table", 1)
Let's pause for a second to look at that formula.
The equals sign at the start tells Google Sheets that this cell is a formula, not a string of text and numbers (like a paragraph). A formula is an instruction to do something, for example =sum(A2:E2) tells Google Sheets to add the values in column 2, rows A to E together. =average(A2:E2) would look for the arithmetic mean of the same numbers.
In our formula, =importhtml tells Google Sheets to load the webpage inside the brackets ("http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/"). It's important to notice the quote marks around the website URL.
Each element of the formula is separate by commas. The second element tells Google Sheets to load a table from the webpage, and the final element - the 1 - which table to load. For humans, reading the formula backwards makes more sense:
"Load the first table you find on http://www.cipc.co.za/index.php/tenders-and-supplier-database-form. It's in HTML"
After you've entered the formula, your spreadsheet should look like this.
We can tidy this up by making the columns a little wider. Just put your cursor on the line between A and B at the top of the sheet, for example, and drag it manually to the right. You can also double click this line to automatically resize the column.
Now we have a table of all the latest tenders from CIPC, in a machine readable format. Because this table is reloaded from the original website every time we refresh our sheet, you might want to copy the data onto a fresh tab to store a copy for working on. Just press CTRL + A to select the whole table, and click the + icon in the bottom left to add a new tab to your spreadsheet. In Cell A1 on the new sheet, press CTRL + SHIFT + V. This is the command to paste values only, in other words, it takes the characters in each cell and not the formula.
Now your spreadsheet is ready for work.