Bulk updates to an existing point collection

Long term maintenance to a point collection often involves the following tasks:

  1. Removing points that are no longer relevant;

  2. Adding points that are not already in the database; and

  3. Updating data about points already in the database (e.g., opening hours, services provided, correcting/improving a description).

To do this, you need to be able to compare your updated data to the data you have already uploaded to a profile collection in Wazimap NG. The easiest way to do this is to use unique identifiers.

Adding Unique Identifiers

An identifier is a value which uniquely and consistently identifies an object — in this case, a point in your point collection.

It is important to include some kind of identifier in your point data to facilitate updates to the data. Users down the line will rely on your identifiers being unique and consistent to be able to incorporate updates should they download a copy of your point data.

If your data does not have an official identifier that is consistent over time, and unique per point, it will become difficult to check for any duplicates or whether newly-provided points are already in your database (e.g., by name, address, and so on). Think about what your users will be able to provide you, and include that in what is shown to them as well.

You can use the following formula in Google Sheets to create a UUID (Copy this into the appropriate cell):

=CONCATENATE(DEC2HEX(RANDBETWEEN(0;4294967295);8);"-";DEC2HEX(RANDBETWEEN(0;42949);4);"-";DEC2HEX(RANDBETWEEN(0;42949);4);"-";DEC2HEX(RANDBETWEEN(0;42949);4);"-";DEC2HEX(RANDBETWEEN(0;4294967295);8);DEC2HEX(RANDBETWEEN(0;42949);4))

Make sure to copy and paste as text as well (not the formula), into a new column, and use the text version of it going forward. You don't want it to calculate a new random UUID for existing points.

Checking for Duplicates & Removing them

Google Sheets has a built-in function to check for and remove duplicates. In Google Sheets, select the data of interest, and click on Data > Data clean-up > Remove Duplicates (see Figure 20, below). In the box that appears (see Figure 21, below), be sure to select the option Data has header row if it indeed does.

If you are using Excel, see conditionally format duplicate values. If you are using Workbench, follow the steps below:

  1. Add a deduplication step.

  2. Select a column whose values probably ought to be unique.

  3. Look for rows where the duplicate number is greater than 1.

Merging Updates

If you have a consistent UUID (unique identifier), then:

  1. In Excel or Google Sheets, use VLOOKUP.

  2. In Workbench, Join Tabs might work.

If you have messy or dirty data with different capitalisation and potential spelling mistakes, try using OpenRefine with CSV reconciliation.

Last updated