
Analyzing the Bührer dataset

Which data from the available Bührer dataset actually made it onto one of the maps? A mosaic plot, created with the vcd package for the open source statistical software R (https://www.r-project.org), gives a quick overview of the relevant factors.

Mosaic plot of the Bührer dataset

The plot shows areas proportional to the number of persons, arranged by emigration status (left) and map number (top). Within each combination, the successive red, black and grey blocks denote Named, Married and Descendants persons respectively (see “The methodology – preparing genealogical data for maps” for explanations). These three categories make up roughly 4’500 persons of the original dataset; the remainder is not shown. The small circles mark combinations that did not occur in the dataset.
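The original plot was produced in R with the vcd package. As a rough illustration of what it encodes, the following Python sketch computes the share of the total area each tile would receive and lists the empty combinations. The records and category labels are made up for the example and are not taken from the Bührer dataset.

```python
from collections import Counter

# Hypothetical sample records: (emigration status, map number, person category).
# The labels are illustrative and not taken from the actual dataset.
records = [
    ("Emigrated", "Map 1", "Named"),
    ("Emigrated", "Map 1", "Named"),
    ("Emigrated", "Map 1", "Married"),
    ("US-born", "Map 2", "Named"),
    ("US-born", "Map 2", "Descendants"),
    ("US-born", "Map 2", "Descendants"),
    ("US-born", "Map 1", "Descendants"),
    ("US-born", "Map 2", "Married"),
]


def tile_areas(rows):
    """Return each tile's share of the total area, proportional to its person count."""
    counts = Counter(rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def empty_combinations(rows):
    """Return (status, map) pairs with no persons, i.e. the small circles in the plot."""
    statuses = sorted({r[0] for r in rows})
    maps = sorted({r[1] for r in rows})
    seen = {(r[0], r[1]) for r in rows}
    return [(s, m) for s in statuses for m in maps if (s, m) not in seen]


areas = tile_areas(records)
for key, share in sorted(areas.items()):
    print(key, f"{share:.3f}")
print("empty:", empty_combinations(records))
```

In a real mosaic plot the tiles are additionally laid out so that widths and heights reflect the marginal and conditional proportions; the sketch only covers the area part.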

A few observations:

  • Only a small fraction of the persons in the dataset actually show up on maps 1 and 2. This comes as no surprise, given the large number of, for example, Swiss-based Bührers, “Assumed US” persons (known descendants of emigrants with no place information) and “Undetermined” persons whose location could be neither determined nor inferred.
  • The number of Bührers emigrating in the generation born prior to 1880 (map 1) is significantly larger than the number of emigrating spouses from Switzerland, reflecting the fact that most married once overseas. A look at the category “Third country emigrated to US” indicates that a substantial part of the Bührers, at least in the first generation, preferred to marry other emigrants.
  • There’s very little Bührer emigration happening for the generations born after 1880 (map 2) – almost all Bührers in that period are America-born.

The plot featured in a small presentation, “R User Meetup Mosaic plot Thomas Roth 20160803” (which includes the R code), given at a Zurich R User Group Meetup.

Software used for the family tree/GIS mapping project

The MacFamilyTree software from Synium (http://www.syniumsoftware.com/de/macfamilytree) was used to import, modify, consolidate and analyse the genealogical data. It was also used for the normalization, completion and geocoding of places. Except for MacFamilyTree, all software mentioned here is open source.

Data was exported from MacFamilyTree’s underlying SQLite database as an SQL import script with the help of the SQLite Database Browser (http://sqlitebrowser.org) and subsequently imported into a PostgreSQL database (http://www.postgresql.org) with the PostGIS extension, which adds support for geographic objects. Unfortunately, there seems to be no high-quality GEDCOM-based parser/importer for SQL databases. Data handling and SQL scripting were done using pgAdmin3 (http://www.pgadmin.org).
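The export itself was done with the SQLite Database Browser. For readers who prefer a scripted route, a comparable SQL dump can be produced with Python's standard sqlite3 module. This is only a sketch: the table, columns and sample row below are hypothetical and do not reflect MacFamilyTree's actual schema.

```python
import sqlite3

# In-memory stand-in for the genealogy database.
# Table, column names and the sample row are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL)")
conn.execute("INSERT INTO place VALUES (1, 'Zurich', 47.3769, 8.5417)")
conn.commit()

# iterdump() yields the whole database as SQL statements in SQLite's dialect.
sql_script = "\n".join(conn.iterdump())
print(sql_script)
conn.close()
```

The resulting script is written in SQLite's SQL dialect, so data types and quoting may need adjusting before it loads cleanly into PostgreSQL.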

The very flat data structure from MacFamilyTree was subsequently transformed into a more intuitive data model (“person”, “family”, “place”, “person_event” etc.) that served as a base for the extensive coded analysis and transformation logic in PostgreSQL’s procedural language PL/pgSQL.
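The kind of transformation described here can be sketched as follows. The example uses SQLite through Python as a stand-in for PostgreSQL, and the flat source table (flat_event) with its columns is an assumption made for illustration; only the target table names “person”, “place” and “person_event” come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical flat export: one row per event, person and place names repeated.
CREATE TABLE flat_event (person_name TEXT, event_type TEXT, event_year INTEGER, place_name TEXT);
INSERT INTO flat_event VALUES
  ('Person A', 'birth', 1850, 'Place X'),
  ('Person A', 'death', 1910, 'Place Y'),
  ('Person B', 'birth', 1882, 'Place Y');

-- Target model: each person and place is stored once, events reference both.
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE person_event (
  person_id INTEGER REFERENCES person(id),
  place_id INTEGER REFERENCES place(id),
  event_type TEXT,
  event_year INTEGER
);

INSERT INTO person (name) SELECT DISTINCT person_name FROM flat_event;
INSERT INTO place (name) SELECT DISTINCT place_name FROM flat_event;
INSERT INTO person_event (person_id, place_id, event_type, event_year)
SELECT p.id, pl.id, f.event_type, f.event_year
FROM flat_event f
JOIN person p ON p.name = f.person_name
JOIN place pl ON pl.name = f.place_name;
""")

n_persons = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
n_places = conn.execute("SELECT COUNT(*) FROM place").fetchone()[0]
n_events = conn.execute("SELECT COUNT(*) FROM person_event").fetchone()[0]
print(n_persons, n_places, n_events)
```

Matching persons by name is a simplification that only works for toy data; a real genealogy needs stable identifiers, since many persons share the same name.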

All logic (and some data patching) was applied in roughly 40 sequential scripts per object. This repeatable processing proved to be a key success factor, given the large number of methodological, coding and data errors encountered along the way, each of which forced reprocessing.
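One way to keep such a script sequence repeatable is to have each script rebuild its own output from scratch and to run the scripts in a fixed order. The sketch below illustrates the idea with SQLite and Python; the script names and their SQL are hypothetical and not the project's actual scripts.

```python
import sqlite3

# Hypothetical step scripts keyed by file name; names and SQL are illustrative only.
STEPS = {
    "020_derive.sql": """
        DROP TABLE IF EXISTS person_clean;
        CREATE TABLE person_clean AS
        SELECT id, TRIM(name) AS name FROM person_raw;
    """,
    "010_load.sql": """
        DROP TABLE IF EXISTS person_raw;
        CREATE TABLE person_raw (id INTEGER, name TEXT);
        INSERT INTO person_raw VALUES (1, ' Person A '), (2, 'Person B');
    """,
}


def run_pipeline(conn):
    """Run every step in file name order and return the names in execution order."""
    executed = []
    for name in sorted(STEPS):
        conn.executescript(STEPS[name])
        executed.append(name)
    return executed


conn = sqlite3.connect(":memory:")
first = run_pipeline(conn)
second = run_pipeline(conn)  # a rerun after a fix must not duplicate any data
rows = conn.execute("SELECT id, name FROM person_clean ORDER BY id").fetchall()
print(first, rows)
```

Because every step drops and recreates its own tables, the whole chain can be rerun after any correction without manual cleanup.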

Screenshot of the QGIS project for the emigration map

All mapping and layout was done in QGIS (http://www.qgis.org), with key features for the project becoming available only with QGIS 2.2. Data came from either PostGIS layers in PostgreSQL or shapefiles from various sources. The original approach of creating a raw map that would receive its finish in a vector-based editor was abandoned in favour of end-to-end map production in QGIS. This reflects on the one hand the growing maturity of QGIS, and on the other the difficulty of processing the enormous number of paths in its vector-based output in other programs.