Dealing with countries — in data
Countries are one of the more common entities one comes across in data sets of all sizes, shapes and topic areas. Often found in a field or column* of its own, clearly titled “Country”. What could be simpler?
Many things, it turns out. While the field is just plain text for display, dealing with it is easy enough. But in order to use that field to color a map, associate two tables containing country information or ensure consistency within a data set or across data sets, you will want to identify the country in question. You will also want to make sure you’re counting each country once, and only once. And this is where things become hairy.
How do you spell USA?
us
US
usa
U.S.
USA
U. S.
U.S.A.
U. S. A.
US of A
U.S. of A
U. S. of A
United States
United States of America
These are 13 distinct — yet common and fully logical — ways of referring to the same country. In manual entry data sets (such as a user generated spreadsheet) you might easily find several of these within the same column, and different organizations will use different versions even if each happens to be internally consistent (which is not always the case). The US is by no means unique in this regard.
Illustrating the issue, here are differences between commonly used country labels for the same countries in data sets from the United Nations and World Bank respectively:
United Nations | World Bank
-----------------------------------------+--------------------------
Bahamas | Bahamas, The
Cape Verde | Cabo Verde
Congo | Congo, Rep.
Congo, Democratic Republic of | Congo, Dem. Rep.
Côte d'Ivoire | Cote d'Ivoire
Egypt | Egypt, Arab Rep.
Gambia | Gambia, The
Iran, Islamic Republic of | Iran, Islamic Rep.
Korea, Democratic People's Republic of | Korea, Dem. People’s Rep.
Korea, Republic of | Korea, Rep.
Kyrgyzstan | Kyrgyz Republic
Lao People's Democratic Republic | Lao PDR
Libyan Arab Jamahiriya | Libya
Micronesia, Federated States of | Micronesia, Fed. Sts.
Occupied Palestinian Territory | West Bank and Gaza
Saint Kitts and Nevis | St. Kitts and Nevis
Saint Lucia | St. Lucia
Saint Vincent and Grenadines | St. Vincent and the Grenadines
Slovakia | Slovak Republic
Tanzania, United Republic of | Tanzania
The former Yugoslav Republic of Macedonia| Macedonia, FYR
United States of America | United States
US Virgin Islands | Virgin Islands (U.S.)
Venezuela, Bolivarian Republic of | Venezuela, RB
Viet Nam | Vietnam
Yemen | Yemen, Rep.
Reconciliation of values such as these is needed before any useful analysis can be done on the data. Throw in potential typos and misspellings and you can’t even rely on a definitive list of alternatives for each country. You might need to result to fuzzy matching or edit distance calculations. And as you can imagine, multi-lingual data sets complicate the matter even further. Estados Unidos anyone?
A properly designed database schema will use a country code of some sort as a key rather than the countries’ full names. Most likely some flavor of ISO 3166-1, which defines numerical as well as 2 and 3 letter codes for countries:
Country | ISO 3166–1 | ISO 3166–1 | ISO 3166–1
| alpha-2 | alpha-3 | numeric
--------------------+------------+------------+-------------
Afghanistan | AF | AFG | 004
Albania | AL | ALB | 008
Algeria | DZ | DZA | 012
Andorra | AD | AND | 020
Angola | AO | AGO | 024
Antigua and Barbuda | AG | ATG | 028
Argentina | AR | ARG | 032
Armenia | AM | ARM | 051
Australia | AU | AUS | 036
Austria | AT | AUT | 040
Azerbaijan | AZ | AZE | 031
... | .. | ... | ...
Each has its own merits, but my favorite is the three letter code as it has the benefit of being the most legible for humans in addition to serving its purpose as a unique and consistent identifier for computer systems.
(Side note: The World Bank’s data site includes ISO 3166-1 alpha-3 codes in the data sets they offer for download, the UN’s data site does not, but assigns some other numerical country codes of its own)
When reconciling data from a manual entry data set you will want to use this kind of country codes. It is often useful to keep the original field or column and add a new one with the consolidation in form of a country code. You may still want to use the original field for presentation of the data, and at least as a way to verify the conversion to country codes. Rough implementations of this can be made in many computer systems — including spreadsheet software (hello VLOOKUP()). Many purpose-built data preparation and analytics tools provide functionality that simplifies this. Some even deal with the country use case specifically out of the box.
No matter how good the tool is, you will likely end up with at least a handful of unidentified values. As with many other entities of data, you are likely to find aggregate labels (“Total”, “All countries”, “Sum”), country-group references (“S-America”, “OECD countries”, “Developing”, “World”) and empty or ambiguous references (“N/A”, “-”, “All other countries”) in such data sets. What to do with these depends on the use case and purpose: anywhere from deleting them to leaving them as they are to separating them out or reconstructing hierarchies for display or analytical purposes.
Beware that such entries are common sources of errors such as double-counting (e.g. adding up populations for all countries as well as for all continents to find over 14 billion people in the world) so make sure you’re making a deliberate choice that meets your requirements.
With the reconciled field many software solutions will now be able to make more use of this data, e.g. to color countries by measures in the data set on a choropleth map. The standard, unique identifiers will also help associate and link the data sets together, and the cleanup will ensure there is no double counting or partial summing by country (results of e.g. counting “USA” and “United States” as two different entities).
So, how many countries are there in the world?
It depends on who you ask! The definition of “country” varies from one organization and use case to another. The United Nations currently has 193 members and 2 “observer states”: The Vatican and Palestine. The latter one draws attention to the fact that this is also a highly political issue. There are disputed areas and complicated situations like those of e.g. Kosovo, Taiwan (uh, I mean Republic of China) and Western-Sahara. There are also “countries” that are parts of other nations such as Scotland as a part of the United Kingdom, French Guiana as a part of France (but not of Europe) and Greenland as a part of Denmark. And what the heck is Antarctica when it comes to this kind of classification?
Such areas will sometimes be reported on individually and at other times as a part of the parent. In some cases you may not even be able to tell whether a number reported for one country is inclusive of such sub-divisions or not.
Illustrating this complexity is that despite the UN’s current 193+2 official state designations, there are still 249 countries, territories, or areas of geographical interest that are assigned official codes in ISO 3166-1, inclusion which in turn is determined by a presence in either of two publications:
- The terminology bulletin “Country Names”; or
- The “Country and Region Codes for Statistical Use”
…both published by — you guessed it — the UN!
As another example, there are currently 206 National Olympic Committees. These include various sub-divisions and complicated situations such as those discussed above (e.g. Puerto Rico (PUR), Chinese Taipei (TPE)and British Virgin Islands (IVB)). In the 2016 Olympics, two additional codes were used to refer to groups that aren’t really nationalities (Refugee Olympic Athletes (ROT)and Independent Olympic Athletes (IOA)). But for the purposes of covering the Olympics, this is obviously the list you want to work with.
Other organizations will have yet other classifications, but you can reasonably safely say that the number of countries in the world is somewhere between 193 and 249.
The question of how many countries there are in the world is explored in far more detail in this informative article from Political Geography Now.
Countries over time
As if the above wasn’t complicated enough, dealing with data sets that track countries over time opens up a whole new can of worms. A few years ago I worked on a visualization of the countries participating in the FIFA World Cup (that’s the kind of football played with the feet for you Americans). The visualization allowed a user to hover the mouse over a country code in the rankings for one competition to highlight the same country in the rankings for all other competitions as well. This seemingly simple task led to more reading of political history and philosophical debate than we had anticipated. Countries split, countries merged, countries ceased to exist.
- Do we highlight East-Germany and West-Germany on hovering of the united Germany? We decided yes.
- Do we still count Germany before World War II as the same country as the united Germany post the collapse of the Berlin-wall? We decided yes.
- Do we highlight all the smaller Balkan states when someone hovers on Yugoslavia? We decided yes.
- Do we highlight Yugoslavia when hovering the smaller nations? We decided no, creating what we deemed a justified inconsistency in user behavior.
- What about the former states of the Soviet Union? We decided on the same treatment as for Yugoslavia
- Do we highlight former colonies as a continuation of their colonizing countries after the got their independence? We decided no.
As you can see this isn’t simple, nor void of controversy.
Both the ISO-3166 and the International Olympic Committee have a range of obsolete codes referring to former countries, territories or even seemingly random groupings of people (such as the ZZX Olympic code that refers to teams of multiple nationalities).
You will find the same issues in various socio-economic data sets. Is it fair to add up the values for East- and West-Germany for a continuous data series?
- For population — most likely.
- For GDP — maybe.
- For GDP per capita (calculated from those two values)? Make sure you remind the audience that the drop in GDP per capita in the early nineties is a result of the reconciliation of the country, not a major recession.
And these aren’t just issues for long historical time series. How will your company track the UK sales numbers if Scotland would at some point vote for independence? How do you adjust numbers for Ukraine (and to a lesser extent Russia) given their forceful inclusion of Crimea? If you have data on birth places, what do you do with data on people that were born in Prussia, Rhodesia, Czechoslovakia or Burma? Someone born in Yugoslavia that now lives in Slovenia may have never left her home town, yet “moved” to a different country looking at the data.
And we haven’t even talked about national borders yet. Nor map projections!
Let’s leave that for another day.
- — -
* I use the word field to refer to a field in a database or a proper data model, but column to refer to a table column in a spreadsheet or other flat table structure.