Proposed by: Adi Eyal
Date: 2020-08-31
Status: Proposed
The data model that NG has been built around is the concept of a universe, i.e. the total number of people in a particular context. This universe can then be disaggregated by a number of attributes. For instance the country population is a universe that is divided into male and female, i.e.:
# male + # female = total population
Similarly:
# Left-handers in WC + # Right-handers in WC = total population of WC
The input file might look like this:
Geography
Preferred hand
Count
Western Cape
Left
30
Western Cape
Right
50
Eastern Cape
Left
80
Eastern Cape
Right
95
Table 1
That works well for census data where you are disaggregating a universe. It falls short when you would like to compare two unrelated datasets side-by-side. As an example:
Geography
Access to drinking water
Year
Count
Western Cape
Have access to drinking water
2016
30,000
Western Cape
Have access to drinking water
2017
35,000
Table 2
In this case, we cannot sum the two rows to get the universe, i.e. there aren’t 30,000 + 35,000 = 65,000 people with access to water in the Western Cape. We do however ever want to be able to compare these two figures.
The problem does not only apply to time-based data, e.g.
Geography
Age Group
Employment Status
Count
Western Cape
15-24
Employed
40,000
Western Cape
12-24
Unemployed
60,000
Western Cape
15-35
Employed
80,000
Western Cape
15-35
Unemployed
120,000
Table 3
In this case we have two universes which overlap but we still want to be able to compare them:
An underlying assumption of the dataset model is that there is only one count field in every file. We could change this assumption by allowing multiple counts, effectively pivoting our table.
Geography
Access to drinking water
Year - 2016
Year - 2017
Western Cape
Have access to drinking water
30,000
35,000
Table 4
This solution will require significant changes to the following components:
Data Import
Variable Creation (data is aggregated at this level)
API
Front-end data model
Front-end visualisations
We would also need to decide how the system would identify count columns. The current convention is to match by name. We could use a similar approach by requiring a standard prefix, e.g. Count: 2016. Alternatively, the administrator could identify these columns once the data has been uploaded.
An alternative approach would be to designate a pivot column, e.g. Year.
This is a robust approach that ensures compatibility between Count columns (compared with Solution 2 below). Every column available for the first Count column will be available to the second one.
A significant amount of effort is required to make this change.
It limits comparisons of datasets to those that were included in the initial upload. If data from a new year becomes available, it is not possible to include it.
Depending on implementation, may place an additional burden on the Data Administrator requiring a special naming of columns in the spreadsheet to be uploaded.
Using this approach we add the concept of a super indicator which ties together two or more indicators together. For instance, table 3 becomes two separate indicators:
Geography
Age Group
Employment Status
Count
Western Cape
15-24
Employed
40,000
Western Cape
12-24
Unemployed
60,000
Table 5
Geography
Age Group
Employment Status
Count
Western Cape
15-35
Employed
80,000
Western Cape
15-35
Unemployed
120,000
Table 6 These are then associated in the backend.
This solution would affect the following components:
A new database model would be required
The API will need to be changed
Front-end data model
Front-end visualisations
Overall fewer changes are necessary to implement this feature
Data Administrators are able to associate arbitrary indicators without pre-planning.
An almost identical workflow is used as the current approach.
It is possible to bind two, completely unrelated or incompatible indicators. For instance, the number of bankruptcies vs Child pregnancies.
Less extreme but equally problematic is two related datasets with different groups, e.g. 2016 Matric passes disaggregated by gender vs 2017 Matric passes without disaggregation. In this case, we will need to decide how this will be displayed on the frontend, especially in graph filters
In the case of two incompatible datasets, we need to decide whether filters are available for missing groups.
Whereas indicators can now be mapped using a choropleth. Superindicators cannot since a choropleth can only display one Count variable at a time. This can be addressed by another select box allowing the user to choose the Count variable of interest or mapping multi-count variables could be prevented entirely.
Another approach to addressing this issue is to recognise that the cause of this problem is the concept of a non-overlapping universe introduce by the Dataset model. This approach does not always make sense as in the examples above. To create an Indicator, a background process is fired that groups DatasetData objects appropriately. In cases like the ones described here, it might be easier to create the indicator directly and avoid datasets entirely.
When creating a new indicator, the admin is asked whether it should be created from a dataset (the current process) or whether it should be uploaded from a file. This latter approach would simply create the relevant IndicatorData directly. Since an indicator must link to a dataset, one can be created automatically. Creating new indicators from this dataset should not be allowed however.
This is by far the easiest approach to implement. Very little needs to change with the exception of the file upload mechanism when creating a new indicator. Once the IndicatorData objects are created, downstream users of this data remain unchanged.
It also allows flexibility for data administrators to shape data without too many constraints on the format.
Data re-use is limited. Datasets provide opportunities to create multiple indicators. The superindicator concept allows even more mixing and matching of indicators.
This problem was finally addressed by marking columns as aggregatable or not aggregateable. If a column is not aggregateable, different values in that column need to be shown separately. This can be implemented as a dropdown filter, e.g. you always need to choose a year. Alternatively, a grouped bar chart could be used, e.g. 2016 and 2017 bars.
The change also involved removing all aggregation from the backend and sending raw data to the frontend.