# Scraping data with Tabula

Much data is published in PDF format. PDF is a highly versatile file type that can contain both text and images. Unlike with, say, Word documents, copying and pasting text and numbers from a PDF document rarely works. When a scanned document is shared as a PDF, copy and paste won't work at all.

Here is a simple example from the Department of Labour - [you can download the original file here](https://www.labour.gov.za/Tenders/Awarded-Tenders/Lists/Awarded%20Tenders/Attachments/73/AWARDED%20BIDDER%202.pdf).

![](https://2315907434-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fi8JLXAtCzbzpRcnAIJsv%2Fuploads%2Fx7UY8ShYlN1DpmgVgunD%2Fspaces_uSSbOeqjFMxbM6oPQFVO_uploads_zkb78u8VwhW92H84Qi1L_image.webp?alt=media\&token=18cabbc9-03d1-4587-991f-f34794d6d118)

This PDF contains a table, with the bid number, tender value and winning organisation. But how do we copy that into a spreadsheet when the document is actually just an image with no "text"?

### Using Tabula

The answer is that we need an application such as **Tabula**, which is specifically designed to extract tables from PDFs.

You can download [Tabula here](https://tabula.technology/). You'll find installation instructions on the same site. Once installed, Tabula doesn't run like a normal desktop application: it is accessed through your web browser (usually by browsing to <http://127.0.0.1:8080>).

Download the [PDF file we showed you above](https://www.labour.gov.za/Tenders/Awarded-Tenders/Lists/Awarded%20Tenders/Attachments/73/AWARDED%20BIDDER%202.pdf). Once you have Tabula running, open the downloaded file using the **Browse** button in this view.

<figure><img src="https://2315907434-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fi8JLXAtCzbzpRcnAIJsv%2Fuploads%2FhKMh3Zkx7olsRQPHQDIm%2Fspaces_uSSbOeqjFMxbM6oPQFVO_uploads_zRuTn9BvpAZmPs0N7kKw_image.webp?alt=media&#x26;token=3ab7b04c-1ef8-4995-8b5a-b78bed5d07c6" alt=""><figcaption></figcaption></figure>

Once you've imported the file, click on **Extract Data**. Your screen should look like this.

<figure><img src="https://2315907434-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fi8JLXAtCzbzpRcnAIJsv%2Fuploads%2F8f5mC6OrpEwfjWXuySpF%2Fspaces_uSSbOeqjFMxbM6oPQFVO_uploads_MsnZFOg0JXEH5tOOHOTw_image.webp?alt=media&#x26;token=0a6e7765-b54a-4c57-b806-e7c096bbf1a1" alt=""><figcaption></figcaption></figure>

Tabula is very easy to use. Simply use your mouse to drag a selection rectangle around the table you want to extract.

<figure><img src="https://2315907434-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fi8JLXAtCzbzpRcnAIJsv%2Fuploads%2FdqwAzudCwVTANU2RPMzo%2Fspaces_uSSbOeqjFMxbM6oPQFVO_uploads_6s4DuGP7cLs5t84Y3AVm_image.webp?alt=media&#x26;token=433d67bb-40cf-4e56-a8c3-85f5631da614" alt=""><figcaption></figcaption></figure>

Then click **Preview & Export Extracted Data**. Your screen should look like this.

<figure><img src="https://2315907434-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fi8JLXAtCzbzpRcnAIJsv%2Fuploads%2F1CHB0zimUP15Mju3ijjc%2Fspaces_uSSbOeqjFMxbM6oPQFVO_uploads_jGesuBDNVzk1J05fTwVj_image.webp?alt=media&#x26;token=1a63af5d-502d-4c63-95ac-0f6d5afee2d1" alt=""><figcaption></figcaption></figure>

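In this instance, you can see that the table hasn't extracted cleanly. The text is split over rows that don't appear in the original, making it hard to analyse. You can change the way Tabula performs its data extraction by clicking on the **Stream/Lattice** buttons. **Lattice** looks for the ruling lines drawn between cells, while **Stream** uses the white space between columns, so it is worth trying both to see which gives the cleaner result.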

<figure><img src="https://2315907434-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fi8JLXAtCzbzpRcnAIJsv%2Fuploads%2FS8yapwGaUk45MMEY9gtN%2Fspaces_uSSbOeqjFMxbM6oPQFVO_uploads_BJ89hUmTEe7TLe4XbXvQ_image.webp?alt=media&#x26;token=daf03c16-d04b-464e-bd9b-faa457f8477d" alt=""><figcaption></figcaption></figure>
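If you'd rather script this step than click through the browser interface, the same two extraction modes are available in the `tabula-py` Python package, which wraps the extraction engine Tabula is built on. The sketch below is not part of the walkthrough above and rests on assumptions: that `tabula-py` and Java are installed, and that the helper names `extraction_options` and `extract_tables` (made up for illustration) suit your project.

```python
from typing import Any


def extraction_options(mode: str, pages: str = "all") -> dict[str, Any]:
    """Build keyword arguments for tabula.read_pdf for one extraction mode.

    mode is "lattice" (use ruling lines between cells) or "stream"
    (use white space between columns), mirroring the two buttons in Tabula.
    """
    if mode not in ("stream", "lattice"):
        raise ValueError(f"mode must be 'stream' or 'lattice', got {mode!r}")
    return {
        "pages": pages,
        "lattice": mode == "lattice",
        "stream": mode == "stream",
    }


def extract_tables(pdf_path: str, mode: str = "lattice", pages: str = "all"):
    """Return the tables found in the PDF as a list of pandas DataFrames."""
    # Imported here so the helper above works even without tabula-py.
    # Assumption: tabula-py is installed and a Java runtime is available.
    import tabula

    return tabula.read_pdf(pdf_path, **extraction_options(mode, pages))
```

Calling `extract_tables("AWARDED BIDDER 2.pdf", mode="stream")` and then again with `mode="lattice"` lets you compare the two results side by side, much as you would by switching buttons in the browser.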

This example uses a very simple (although somewhat poor-quality) PDF, but Tabula can extract large tables too. It can even extract tables that are split over multiple pages of a PDF.

The only thing it cannot do is **Optical Character Recognition (OCR)**. OCR is a technique by which computers "read" images and pick out letters and numbers. Tabula may work with some scanned documents (typically those that already carry a text layer), but not all. If it doesn't, you will likely need OCR software, such as the paid-for Adobe Acrobat or an application based on the open source [Tesseract](https://github.com/tesseract-ocr) libraries.
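One rough way to tell in advance whether a PDF needs OCR is to check whether any text can be pulled out of it at all. The sketch below assumes the third-party `pypdf` package is installed. The function names and the 20-character threshold are illustrative assumptions, not anything Tabula itself uses.

```python
def has_text_layer(page_texts, min_chars=20):
    """Guess whether a PDF has real text, given the text extracted per page.

    min_chars is an arbitrary threshold (assumption): scans with no text
    layer usually yield nothing, or only a few stray characters.
    """
    total = sum(len(text.strip()) for text in page_texts if text)
    return total >= min_chars


def pdf_page_texts(pdf_path, max_pages=5):
    """Extract text from the first few pages of a PDF using pypdf."""
    # Imported here so has_text_layer can be used without pypdf.
    # Assumption: the pypdf package is installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    texts = []
    for index, page in enumerate(reader.pages):
        if index >= max_pages:
            break
        texts.append(page.extract_text() or "")
    return texts
```

If `has_text_layer(pdf_page_texts("my-file.pdf"))` comes back `False`, the file is probably an image-only scan and will need OCR before Tabula can do anything with it.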

Try it with this [National Senior Certificate (Matric) report from South Africa's Department of Basic Education](https://www.education.gov.za/Portals/0/Documents/Reports/2021NSCReports/2022%20School%20Performance%20Report.pdf?ver=2023-01-20-114306-343). It's a 222-page document, but almost all the important data is in identical tables from page 13 onwards. In this case, you can extract a great deal of data at once, but you may well still need to clean it a little once it has been imported into your spreadsheet app.
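Some of that cleaning can be scripted. A common problem, seen in the first example, is a single record being split over several rows. The sketch below uses only Python's standard library and folds such overflow rows back into the row above. It assumes every genuine record has a value in one key column (the first by default), which you should check against your own export. The function names and the sample values in the usage note are invented for illustration.

```python
import csv
import io


def merge_split_rows(rows, key_column=0):
    """Fold continuation rows into the row above them.

    A row whose key column is empty is treated as overflow text from the
    previous record (assumption: real records always fill that column).
    """
    merged = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are completely blank
        if merged and not cells[key_column]:
            previous = merged[-1]
            # Pad the previous row if the continuation row is wider.
            while len(previous) < len(cells):
                previous.append("")
            for i, cell in enumerate(cells):
                if cell:
                    previous[i] = f"{previous[i]} {cell}".strip()
        else:
            merged.append(cells)
    return merged


def clean_csv_text(csv_text, key_column=0):
    """Apply merge_split_rows to CSV text and return the cleaned CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(merge_split_rows(rows, key_column))
    return out.getvalue()
```

For example, a row `BID 001, Example Trading, R 100` followed by an overflow row with only `(Pty) Ltd` in the second column would be combined into a single row whose second cell reads `Example Trading (Pty) Ltd`.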
