Visualizing spatial data with high volumes in an informative way can be tricky. One trick I have used quite a bit is that I have turned the points in QGIS very small and then used transparency to highlight frequency and color to show categories. The main problem with visualizing tiny tiny poins is that the visual outcome is dependent on the rendering order of the points. Main characteristics of this type of map is that the points are overlapping, but typically the rendering order is somewhat random. Or not exactly random, but dependent on how the data is stored in your file or database.
To highlight this problem, here are locations of data about animal collisions and visualized in the same way but just the rendering order has been manually defined in QGIS. So same data but just rendered in different ways. Looks very different:
One way to solve this problem is to create small multiples, i.e. a single map for each category. I have done this for example with GeoNames data earlier, but I knew that there were clear caveats also in this. Better way would be to bin the data into an equally sized grid and show which is the most common value in each grid cell.
Of course using binning can create a very simplified view of the data. Using just raw points answers better to a question “where” but binning enables the answer to the question “what”. So neither is a visualization method that suits all purposes. But as just rendering points and making them very small is easy and anyone with basic QGIS skills can do that, I thought writing a blog post about binning might be interesting.
Time to cook. Ingredients for this recipe include:
- Point data with categories
Like said, the idea is to group the points to a grid. Binning can of course be done with any polygons (e.g. binning pig farm points with municipality polygons), but a regular grid has a lot of advantages compared to something like administrative areas which are very different in shape and size. Doesn’t really matter if you make a regular square grid or a hexagonal grid for the bins, but I recommend hexagons if you have the choice. They are not only more pleasing to the eye, but they also have real advantages in spatial analytics especially if you are working on a single level and don’t have to worry about hierarchy.
But to start with, you can open QGIS and open the processing tool Create Grid. The tool allows you to create grids in the form of squares, hexagons or diamonds (sounds fancy, but it’s just square turned 45 degrees.). Next select a suitable extent, which you can easily do based on a layer extent or draw directly to the canvas. Finally select a suitable spatial resolution. This depends totally on the scale you are working on and the point frequency your data has. As a reference I have used grids of around 10×10 kilometers to 50×50 kilometers to cover Finland and for the U.S. my hexagons have been around 5 times bigger. But it is a compromise between readability and your machine performance and your data. Then create the grid.
It might also be possible to run the whole analysis in QGIS (at least the spatial join), but PostGIS is much more efficient with large datasets. After you have your grid, you should export your grid and points to PostGIS to their own tables. I usually use the Export to PostgreSQL tool inside QGIS but you can read more about different ways of doing this from my previous blog post.
The following SQL query is the core of this analysis. It is the one I used to create data for the maps you see later in this blog post. The data I’ve used here is GeoNames (table name geonames) and a hexgrid covering the U.S. (table name usa_gid):
CREATE TABLE naturalhexes AS SELECT DISTINCT ON (id) id, count, geom, naturalname FROM (WITH places_ct AS( SELECT geom, CASE WHEN field_2 ILIKE '%forest%' THEN 'forest' WHEN field_2 ILIKE '%rock%' THEN 'rock' WHEN field_2 ILIKE '%valley%' THEN 'valley' WHEN field_2 ILIKE '%lake%' THEN 'lake' WHEN field_2 ILIKE '%creek%' THEN 'creek' END naturalname FROM geonames WHERE field_2 ILIKE '%rock%' OR field_2 ILIKE '%valley%' OR field_2 ILIKE '%lake%' OR field_2 ILIKE '%creek%' ) SELECT count(*), p.naturalname, g.geom, g.id FROM usa_grid g JOIN places_ct p ON ST_Intersects(g.geom, p.geom) group by g.id, p.naturalname ORDER BY id, count DESC) foo
I am using a few PostgreSQL specialities here:
- WITH starts a Common Table Expression (CTE) which enables to write a statement that can be then used in a larger query. Our CTE returns the point locations (geom) of all features which contain some of our selected words. I use the case to
- To search the words I am using ILIKE, which compares strings just like normal LIKE, but is case insensitive.
- ST_Interescts joins the points to hexagons and groups them by the hexagon id and the name pattern. So the last part of the query returns a count, name pattern, hexagon id and the hexagon geometry.
- As a final step the DISTINCT ON in the beginning of the query keeps the “first” row of each group of duplicates. In this case duplicate hexagons. As the last part of the query sorts the hexagons based on the count of specific name patterns, DISTINCT ON keeps the hexagon with the highest value.
There are few tweaks which could be made to the query if it were to be optimized. Firstly the pattern matching is not very efficient in this way. Secondly this method doesnt take into account cases where there are two names which have an equal number of
As QGIS is a really strong visualization tool, having this interesting piece of data in your database allows you to make some informative visualizations out of the hexes. I used Categorized symbology and played around with the color palettes to get the most out of the data. As a general rule of thumb you shouldn’t use more than 5-6 classes in choropleth maps, but I stretched that a bit.
After publishing this map on Twitter, I learned a few interesting things. For example the word Bay probably comes from the French word bayou, which is a slow-moving creek or a swampy section of a river or a lake. It was striking how clear the spatial pattern was in this data. Places containing the word “Spring” are clearly located in the dry areas and “Creek” seem to be covering most of the country.
I also tried different types of pairs and compared e.g. the words high and low:
Which clearly had a correlation between the real elevation and “old” vs. “new” which had a difference between the east and west as I was hoping to see.
So this method is definitely not only limited to comparing names, but any type of points with categories. One way to take the analysis and visualization even further would be to create a bivariate map, where the intensity of the colors would indicate the number of names and colors would indicate the type.
Seems like with PostGIS 3.1 a new function ST_HexagonGrid for creating grids will be introduced which would make it possible to run the whole analysis with a single SQL query.