LogoLogo
Procurement Data Crash Course
Procurement Data Crash Course
  • About this course
    • Course introduction
  • Module 1: How the public procurement process works
    • 1.1 Understanding the public procurement process
      • ❓Why the public procurement process exists
      • ⚖️What rules govern the public procurement process?
      • ⚙️RFQ or RFP? An introduction to the different types of tender
      • 📋The key stages of the procurement process
        • 📑Stage 1: Planning
        • 🚴‍♂️Stage 2: Initiation
        • ✔️Stage 3: Selection & award
        • 🤝Stage 4: Contract
        • 🏗️Stage 5: Implementation
      • 🛡️Why monitoring the procurement process is important
      • Test yourself: Understanding the public procurement process
    • 1.2 What does procurement data look like?
      • 💰Budgets & IRPs
      • 📃RFPs & RFQs
      • 🏆Awards
      • 📖Annual Reports
      • 🏛️The Auditor General's report
    • 1.3 Where is public procurement data published?
      • 🔍Where to find procurement data
      • 📚Maintaining your own library of procurement data
    • 1.4 Procurement oversight and monitoring for NPOs and media
      • ✋Procurement oversight guide for CSOs
      • 📺Procurement oversight guide for media
  • Module 2: Working with procurement data
    • 2.1 Whey we need machine readable data
      • Important data formats: CSVs, Excel and Google Sheets
    • 2.2 Turning websites and PDFs into machine readable data
      • Scraping data with Tabula
      • Simple web scraping with Google Sheets
      • Web scraping by inspecting network traffic
  • Useful resources and libraries
    • 3.1 Procurement data online resources
      • Importance reference resources
      • Online data repositories
  • Course testing & feedback
    • 🎓Extended course exam
    • 📝Surveys & feedback
    • ⏱️Quick course exam
  • MODULE4: Explore the OCPO procurement dashboard
    • 4.1 A walk through the OCPO COVID-19 reporting dashboard
      • Summary and Supplier page of the dashboard
      • Find supplier information from external sources
      • Navigating COVID19 Item Spend Page
      • Navigating the Transactions List Page
    • 4.2 Keep the Receipts Tool
      • Background and Introduction
      • Download data from Keep the Receipts
    • 4.3 Using KeeptheReceipts and Google Sheet for Procurement Data Analysis
      • Infrastructure Order Analysis
      • Mask Price Analysis
Powered by GitBook
On this page
  1. Module 2: Working with procurement data
  2. 2.2 Turning websites and PDFs into machine readable data

Simple web scraping with Google Sheets

PreviousScraping data with TabulaNextWeb scraping by inspecting network traffic

Last updated 2 years ago

Our goal, as discussed, is to get data into a spreadsheet application which can be used to perform analysis. In the last topic, we looked at scraping data from PDFs with Tabula.

If a table of data has been published on a web page, sometimes you can scrape it directly into your spreadsheet application. Here we are going to look at an example with Google Sheets.

Take a look at this page, which shows current and historical tender notices for the state owned entity (SOE) CIPC.

It doesn't look like it, but the list of tenders on this page is published as a table. If we want to import this list - for example to count how many tenders were issued in different months of the year - we can do it directly from within Google Sheets. First of all, we open up a new spreadsheet.

Put your cursor in cell A1 (for more information about how to navigate spreadsheets, see our Data Pipeline module). Now type the following formula. =IMPORTHTML("http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/", "table", 1)

Let's pause for a second to look at that formula.

The equals sign at the start tells Google Sheets that this cell is a formula, not a string of text and numbers (like a paragraph). A formula is an instruction to do something, for example =sum(A2:E2) tells Google Sheets to add the values in column 2, rows A to E together. =average(A2:E2) would look for the arithmetic mean of the same numbers.

In our formula, =importhtml tells Google Sheets to load the webpage inside the brackets ("http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/"). It's important to notice the quote marks around the website URL.

Each element of the formula is separate by commas. The second element tells Google Sheets to load a table from the webpage, and the final element - the 1 - which table to load. For humans, reading the formula backwards makes more sense:

"Load the first table you find on http://www.cipc.co.za/index.php/tenders-and-supplier-database-form. It's in HTML"

Working with imported data

After you've entered the formula, your spreadsheet should look like this.

We can tidy this up by making the columns a little wider. Just put your cursor on the line between A and B at the top of the sheet, for example, and drag it manually to the right. You can also double click this line to automatically resize the column.

Now we have a table of all the latest tenders from CIPC, in a machine readable format. Because this table is reloaded from the original website every time we refresh our sheet, you might want to copy the data onto a fresh tab to store a copy for working on. Just press CTRL + A to select the whole table, and click the + icon in the bottom left to add a new tab to your spreadsheet. In Cell A1 on the new sheet, press CTRL + SHIFT + V. This is the command to paste values only, in other words, it takes the characters in each cell and not the formula.

Now your spreadsheet is ready for work.

http://www.cipc.co.za/index.php/tenders-and-supplier-database-form/
Companies and Intellectual Property Commission
Google Sheets
CIPC Scraper
CIPC Data