Procurement Data Crash Course

Scraping data with Tabula


Last updated 2 years ago

Many tender notices and award notifications are published in PDF format. PDFs are highly versatile files and can contain both text and images. Unlike with, say, a Word document, copying and pasting text and numbers from a PDF rarely works. When a scanned document is shared as a PDF, copy and paste won't work at all.

Here is an example from the Department of Labour (you can download the original file here).

This PDF contains a table, with the bid number, tender value and winning organisation. But how do we copy that into a spreadsheet when the document is actually just an image with no "text"?

Using Tabula and OCR

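The answer is that we need an application that can pull the table out of the PDF for us. Optical Character Recognition (OCR) is a technique by which computers can "read" images and look for letters and numbers. In this case, we are going to use Tabula, an application which is specifically designed to extract tables from PDFs. Note that Tabula itself reads the text stored inside a PDF and does not perform OCR, so a PDF that is purely a scanned image must first be run through OCR software before Tabula can extract anything from it.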

Once you've imported the file, click on Extract Data. Your screen should look like this.

Tabula is very easy to use. Simply use your mouse to drag a selection rectangle around the table you want to extract.

And then click Preview & Export Extracted Data. Your screen should look like this.

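In this instance, you can see that the table hasn't extracted cleanly. The text is split over rows that don't appear in the original, making it hard to analyse. You can change the way Tabula detects rows and columns by clicking on the Stream/Lattice buttons. Stream uses the white space between pieces of text to work out where cells begin and end, while Lattice uses the ruling lines drawn between cells.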
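When an export still contains rows that were split like this, one possible clean-up is to fold each "continuation" row back into the row above it. The sketch below is only an illustration and is not part of Tabula: the function name merge_split_rows, the sample rows and the rule it relies on (a continuation row has an empty first cell) are all assumptions that you would need to check against your own data.

```python
def merge_split_rows(rows):
    """Fold continuation rows (empty first cell) into the row above them.

    Assumption: a row whose first cell is blank is the overflow of the
    previous row, not a record in its own right.
    """
    merged = []
    for row in rows:
        is_continuation = bool(merged) and bool(row) and not row[0].strip()
        if is_continuation:
            previous = merged[-1]
            for i, cell in enumerate(row):
                cell = cell.strip()
                if cell and i < len(previous):
                    # Join the fragment onto the text already in that column.
                    previous[i] = f"{previous[i]} {cell}".strip()
        else:
            merged.append([cell.strip() for cell in row])
    return merged


# Hypothetical extraction in which one organisation name was split in two.
raw_rows = [
    ["Bid number", "Tender value", "Winning organisation"],
    ["BID-001", "R 150 000.00", "Example Trading"],
    ["", "", "Enterprise CC"],
    ["BID-002", "R 89 500.00", "Sample Supplies"],
]

for row in merge_split_rows(raw_rows):
    print(row)
```

Always compare the merged result against the original PDF, because a genuinely blank first cell would be merged by mistake under this rule.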

Changing the method to Lattice gives us exactly what we want: an error-free table in the preview. From here, we can click Export and save the extracted table as a CSV file.
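Once the table is saved as a CSV file, it can be opened in a spreadsheet or read by a short script. The following is a minimal sketch using Python's built-in csv module. The column names follow the table described above, but the rows, the parse_rand helper and the currency format (an "R" prefix, spaces or commas between thousands, a full stop for decimals) are assumptions made for illustration.

```python
import csv
import io

# Stand-in for the CSV file exported from Tabula (hypothetical values).
exported_csv = io.StringIO(
    "Bid number,Tender value,Winning organisation\n"
    'BID-001,"R 150,000.00",Example Trading CC\n'
    "BID-002,R 89 500.00,Sample Supplies\n"
)


def parse_rand(value):
    """Turn text such as 'R 150,000.00' into a number.

    Assumption: a full stop is the decimal separator.
    """
    cleaned = value.replace("R", "").replace(",", "").replace(" ", "")
    return float(cleaned)


awards = []
for row in csv.DictReader(exported_csv):
    # Replace the text value with a number so it can be summed and sorted.
    row["Tender value"] = parse_rand(row["Tender value"])
    awards.append(row)

total = sum(row["Tender value"] for row in awards)
print(f"{len(awards)} awards, total value R {total:,.2f}")
```

To read a real export instead of the sample text, replace the io.StringIO object with an opened file, for example open("tabula-export.csv", newline="", encoding="utf-8"), where the filename is whatever you chose when saving.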

This example uses a very simple (although somewhat poor quality) PDF, but Tabula can extract large tables too. It can even extract tables that are split over multiple pages of a PDF.
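When a table runs over several pages, the PDF often repeats the header row at the top of each page, and those repeats can end up in the extracted data. The sketch below shows one way to stitch per-page results together while keeping the header only once. It assumes each page has already been read into a list of rows and that the repeated header is identical on every page. The function name combine_pages and the sample rows are hypothetical.

```python
def combine_pages(pages):
    """Join per-page row lists, keeping the header row only once.

    Assumption: the first row of the first page is the header, and any
    later row identical to it is a repeat rather than real data.
    """
    combined = []
    header = None
    for rows in pages:
        for row in rows:
            if header is None:
                header = row
                combined.append(row)
            elif row != header:
                combined.append(row)
    return combined


# Hypothetical rows extracted from two pages of the same table.
page_1 = [
    ["Bid number", "Tender value", "Winning organisation"],
    ["BID-001", "R 150 000.00", "Example Trading CC"],
]
page_2 = [
    ["Bid number", "Tender value", "Winning organisation"],
    ["BID-002", "R 89 500.00", "Sample Supplies"],
]

for row in combine_pages([page_1, page_2]):
    print(row)
```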

You can download Tabula here. You'll find installation instructions on the same site. Once you have installed it, Tabula doesn't run like a normal desktop application; it is accessed through your web browser (usually by browsing to http://127.0.0.1:8080).

Download the PDF file we showed you above. Once you have Tabula running, open the file using the Browse button in this view.