Scraping data with Tabula


Last updated 2 years ago


Much data is published in PDF format. PDF is a highly versatile file type that can contain both text and images. Unlike with, say, Word documents, copying and pasting text and numbers from a PDF rarely works well. When a scanned document is shared as a PDF, copy and paste won't work at all.

Here is a simple example from the Department of Labour.

This PDF contains a table with the bid number, tender value and winning organisation. But how do we copy that into a spreadsheet when copy and paste does not preserve the table's rows and columns?

Using Tabula

The answer is an application such as Tabula, which is specifically designed to extract tables from PDFs.

Once you've imported the file, click on Extract Data. Your screen should look like this.

Tabula is very easy to use. Simply use your mouse to drag a selection rectangle around the table you want to extract.

Then click Preview & Export Extracted Data. Your screen should look like this.

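In this instance, you can see that the table hasn't extracted cleanly. The text is split over rows that don't appear in the original, making it hard to analyse. You can change the way Tabula performs its data extraction by clicking on the Stream/Lattice buttons. Lattice relies on the ruling lines drawn between cells, while Stream relies on the white space between columns, so it is worth trying both to see which gives the cleaner result.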
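If neither extraction method gives a clean table, the split rows can also be repaired after export. The following is a minimal Python sketch, not part of Tabula itself. It assumes that a continuation row can be recognised by an empty first column (here the bid number), and the sample data and the function name `merge_split_rows` are made up for illustration.

```python
import csv
import io


def merge_split_rows(rows, key_col=0):
    """Fold continuation rows into the row above them.

    Assumption: a row is a continuation when its key column
    (by default the first column, e.g. the bid number) is empty.
    """
    merged = []
    for row in rows:
        if not row:
            continue  # skip completely blank lines
        is_continuation = bool(merged) and not row[key_col].strip()
        if is_continuation:
            previous = merged[-1]
            for i, cell in enumerate(row):
                cell = cell.strip()
                if cell and i < len(previous):
                    # Join the wrapped text onto the cell above with a space.
                    previous[i] = f"{previous[i]} {cell}".strip()
        else:
            merged.append([cell.strip() for cell in row])
    return merged


# Made-up sample resembling an exported CSV with one wrapped cell.
sample_csv = """Bid number,Tender value,Winning organisation
B001,R 100 000,Example Cleaning
,,Services CC
B002,R 250 000,Sample Supplies
"""

rows = list(csv.reader(io.StringIO(sample_csv)))
cleaned = merge_split_rows(rows)
for row in cleaned:
    print(row)
```

This only works when the key column is filled in for every real row. If the first column itself wraps over several lines, a different rule for spotting continuation rows would be needed.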

This example uses a very simple (although somewhat poor quality) PDF, but Tabula can extract large tables too. It can even extract tables that are split over multiple pages of a PDF.
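When a table runs over several pages, the exported file may contain the header row once per page, depending on the PDF and on what was selected. As a rough sketch, assuming the first row of the export is the header and any later identical row is a repeat, the extra copies could be dropped like this in Python. The sample data and the function name `drop_repeated_headers` are hypothetical.

```python
import csv
import io


def drop_repeated_headers(rows):
    """Keep the first row as the header and drop later copies of it."""
    if not rows:
        return []
    header = [cell.strip() for cell in rows[0]]
    result = [header]
    for row in rows[1:]:
        stripped = [cell.strip() for cell in row]
        # Skip repeated header rows and rows with no content at all.
        if stripped == header or not any(stripped):
            continue
        result.append(stripped)
    return result


# Made-up export in which the header reappears at the start of page two.
multi_page_csv = """Bid number,Tender value
B001,R 100 000
B002,R 250 000
Bid number,Tender value
B003,R 75 000
"""

table = drop_repeated_headers(list(csv.reader(io.StringIO(multi_page_csv))))
print(len(table) - 1, "data rows")
```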
