The Fundamentals of Data-driven Storytelling

2.1 Turning websites and PDFs into machine readable data

The second part of the data storytelling pipeline is "Get". By getting data, we mean taking the data you have found and turning it into a machine-readable format that you can work with. If you found your dataset on an open data portal, this is easy: you should be able to download a CSV or XLS file from the source. In many cases, however, there are a few steps to go through in order to extract data from a website or from a document such as a PDF.

In this lesson we will explore how to turn websites and PDFs into machine-readable data. Machine-readable data is data in a format that a computer can process. To be machine readable, data must be structured, for example as rows and columns with consistent headings.

Data that is stored as letters and numbers in a digital format, rather than as an image of text such as a scanned page, is known as machine-readable data. Our goal when working with procurement data is to get the information we are given into this format, preferably as a spreadsheet, so that we can analyse it.
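
To see why structure matters, here is a minimal Python sketch of what a computer can do once data is machine readable. It is only an illustration: the supplier names, the amounts and the `total_by_supplier` helper are all invented for this example and are not part of the course material. Because the data has named columns and one record per row, a few lines of code can add up the amounts for each supplier.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
# Each row is one record and each column has a heading.
CSV_TEXT = """supplier,amount
Acme Ltd,1200
Bright Co,800
Acme Ltd,500
"""


def total_by_supplier(csv_text):
    """Sum the 'amount' column for each value in the 'supplier' column."""
    totals = {}
    # DictReader uses the first line as column headings,
    # so each row can be read by column name.
    for row in csv.DictReader(io.StringIO(csv_text)):
        supplier = row["supplier"]
        totals[supplier] = totals.get(supplier, 0) + int(row["amount"])
    return totals


print(total_by_supplier(CSV_TEXT))
# prints {'Acme Ltd': 1700, 'Bright Co': 800}
```

The same figures locked inside a scanned PDF or a photograph of a table could not be read this way, which is why the extraction steps in this module come before any analysis.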


Last updated 2 years ago