Visualising the Geographic Distribution of Charity Donors with Interactive Leaflet Maps

Python
EH
An anonymised display of voluntary work conducted for Emmanuel House Support Centre. Post 2/5 in the series.
Author

Daniel J Smith

Published

March 17, 2024

import pandas as pd
import numpy as np
import folium
from branca.colormap import LinearColormap
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

Data Imports

FakeIndividualConstituents.csv is a dataset I constructed, consisting of the donation details of 100 fictional donors to a charity in Nottingham.

This post illustrates the method I used to investigate the geographic distribution of donors to a real charity at which I volunteer: Emmanuel House in Nottingham.

Newsletter is a binary feature indicating whether the donor is subscribed to the charity's newsletter.

df = pd.read_csv('files/FakeIndividualConstituents.csv')
df.head()
  Postcode  NumberDonations  TotalDonated  AverageDonated  Newsletter
0  NG9 3WF                4            61           15.25           1
1  NG9 4WP                1            23           23.00           0
2  NG9 3EL                1            30           30.00           0
3  NG1 9FH                5            75           15.00           1
4  NG5 6QZ                1            15           15.00           0
df[5:].sample(5)
   Postcode  NumberDonations  TotalDonated  AverageDonated  Newsletter
64  NG8 3HT                9           153           17.00           1
98  NG3 1FF                3            56           18.67           1
12  NG1 3RD                1            15           15.00           0
54  NG5 3FD                3            48           16.00           1
9   NG9 4AX                1            17           17.00           0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Postcode         100 non-null    object 
 1   NumberDonations  100 non-null    int64  
 2   TotalDonated     100 non-null    int64  
 3   AverageDonated   100 non-null    float64
 4   Newsletter       100 non-null    int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 4.0+ KB
df.describe()
       NumberDonations  TotalDonated  AverageDonated  Newsletter
count       100.000000    100.000000       100.00000  100.000000
mean          4.320000    250.150000        46.51950    0.500000
std           5.454828   1657.135245       212.23961    0.502519
min           1.000000     15.000000        15.00000    0.000000
25%           1.000000     30.000000        15.00000    0.000000
50%           2.000000     45.000000        15.19000    0.500000
75%           5.000000     92.000000        16.92500    1.000000
max          37.000000  16618.000000      2077.25000    1.000000

Geocoding Postcodes with ONS data

For the details of this geocoding process, see my previous blog post:

Geocoding Postcodes in Python: pgeocode v ONS

post = pd.read_csv(r'C:\Users\Daniel\Downloads\open_postcode_geo.csv\open_postcode_geo.csv', header=None)
post = post[[0, 7, 8]]
post = post.rename({0: 'Postcode', 7: 'Latitude', 8: 'Longitude'}, axis=1)

df = df.merge(post, on='Postcode', how='left')

df.head()
  Postcode  NumberDonations  TotalDonated  AverageDonated  Newsletter   Latitude  Longitude
0  NG9 3WF                4            61           15.25           1  52.930121  -1.198353
1  NG9 4WP                1            23           23.00           0  52.921587  -1.247504
2  NG9 3EL                1            30           30.00           0  52.938985  -1.239510
3  NG1 9FH                5            75           15.00           1  52.955008  -1.141045
4  NG5 6QZ                1            15           15.00           0  52.996670  -1.106307
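
All 100 fake postcodes matched here, but if any postcode were missing from the ONS data, the left merge would leave NaN coordinates that would break the markers drawn below. A cheap defensive step (a sketch, not part of the original pipeline):

# Drop any donors whose postcode failed to geocode (NaN latitude/longitude)
df = df.dropna(subset=['Latitude', 'Longitude'])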

Visualising Donor Data with Interactive Folium Maps

Folium is a Python library for visualising data as interactive Leaflet maps, which can be saved as standalone .html files.

First Map

Producing a basic folium map with no extra features:

m = folium.Map(location=[52.9548, -1.1581], zoom_start=12)

# Add one circle marker per donor at their geocoded postcode
for index, row in df.iterrows():
    
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        color='blue', 
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
    ).add_to(m)
    
m.save('FakeDonorMap1.html')
m
[Interactive map output]

Adding Colour Labelling

We can colour each point by the value of TotalDonated using LinearColormap from branca.colormap. Capping vmax near the 99th percentile keeps the single extreme donation total (£16,618) from stretching the colour scale:

m = folium.Map(location=[52.9548, -1.1581], zoom_start=12)

colors = ['green', 'yellow', 'orange', 'red', 'purple']
linear_colormap = LinearColormap(colors=colors,
                                 index=[0, 100, 250, 500, 1000],
                                 vmin=df['TotalDonated'].min(),
                                 vmax=df['TotalDonated'].quantile(0.99025))

for index, row in df.iterrows():    
    
    total_don = row['TotalDonated']
    color = linear_colormap(total_don)
    
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        color=color, 
        fill=True,
        fill_color=color,
        fill_opacity=1,
    ).add_to(m)

linear_colormap.add_to(m)

m.save('FakeDonorMap2.html')
m
[Interactive map output]

Adding Pop-Ups

Next, I added a pop-up to each data point, showing that donor's details, by passing a popup argument to folium.CircleMarker:

m = folium.Map(location=[52.9548, -1.1581], zoom_start=12)

colors = ['green', 'yellow', 'orange', 'red', 'purple']
linear_colormap = LinearColormap(colors=colors,
                                 index=[0, 100, 250, 500, 1000],
                                 vmin=df['TotalDonated'].min(),
                                 vmax=df['TotalDonated'].quantile(0.99025))

for index, row in df.iterrows():    
    num_don = row['NumberDonations']
    total_don = row['TotalDonated']
    news = bool(row['Newsletter'])
    avg_don = row['AverageDonated']
    
    popup_text = f'<div style="width: 175px;">\
                  Total Donated: £{total_don:.2f}<br>\
                  Number of Donations: {num_don}<br>\
                  Average Donation: £{avg_don:.2f}<br>\
                  Subscribed to Newsletter: {news}\
                  </div>'
    
    color = linear_colormap(total_don)
    
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        color=color, 
        fill=True,
        fill_color=color,
        fill_opacity=1,
        popup=popup_text
    ).add_to(m)

linear_colormap.add_to(m)

m.save('FakeDonorMap_noLayerControl.html')
m
[Interactive map output]

Displaying a screenshot of the map (taken with ShareX) using Matplotlib:

plt.figure(figsize=(20,10))
img = mpimg.imread('FakeDonorMap_noLayerControl.jpg')
imgplot = plt.imshow(img)
plt.show()
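
As an aside, instead of fixing the pop-up width with an inline div, folium also accepts a folium.Popup object, whose max_width parameter achieves the same effect (a minimal sketch; 175 mirrors the div width used above):

# Equivalent to the inline-div styling: cap the pop-up width at 175px
popup = folium.Popup(popup_text, max_width=175)

This popup can then be passed to folium.CircleMarker exactly as before.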

Adding Layer Control

Finally, I added layer control to the map using folium.map.FeatureGroup and folium.LayerControl.

This adds controls on the right of the UI, under the colour bar, allowing the user to show or hide the markers for donors in specific ranges of donation totals.

I used Microsoft Copilot to assist with this step.

m = folium.Map(location=[52.9548, -1.1581], zoom_start=12)

colors = ['green', 'yellow', 'orange', 'red', 'purple']
linear_colormap = LinearColormap(colors=colors,
                                 index=[0, 100, 250, 500, 1000],
                                 vmin=df['TotalDonated'].min(),
                                 vmax=df['TotalDonated'].quantile(0.99025))

# Create FeatureGroups, one per donation-total band
bounds = [0, 100, 250, 500, 750, 1000, float('inf')]
labels = [f"Total Donated: £{lo}-£{hi}" if hi != float('inf')
          else f"Total Donated: £{lo}+"
          for lo, hi in zip(bounds[:-1], bounds[1:])]
fgroups = [folium.map.FeatureGroup(name=label) for label in labels]

for index, row in df.iterrows():    
    num_don = row['NumberDonations']
    total_don = row['TotalDonated']
    news = bool(row['Newsletter'])
    avg_don = row['AverageDonated']
    
    popup_text = f'<div style="width: 175px;">\
                  Total Donated: £{total_don:.2f}<br>\
                  Number of Donations: {num_don}<br>\
                  Average Donation: £{avg_don:.2f}<br>\
                  Subscribed to Newsletter: {news}\
                  </div>'
    
    color = linear_colormap(total_don)
    
    marker = folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        color=color, 
        fill=True,
        fill_color=color,
        fill_opacity=1,
        popup=popup_text
    )
    
    # Add the marker to the FeatureGroup for its donation-total band
    for fgroup, (lower, upper) in zip(fgroups, zip(bounds[:-1], bounds[1:])):
        if lower <= total_don < upper:
            fgroup.add_child(marker)
            break

# Add the FeatureGroups to the map
for fgroup in fgroups:
    m.add_child(fgroup)

linear_colormap.add_to(m)
m.add_child(folium.LayerControl())


m.save('FakeDonorMap_withLayerControl.html')
m
[Interactive map output]

Displaying a screenshot of the map (taken with ShareX) using Matplotlib:

plt.figure(figsize=(20,10))
img = mpimg.imread('FakeDonorMap_withLayerControl.jpg')
imgplot = plt.imshow(img)
plt.show()

Constructing the Fake Data

I started with nottm_postcodes.csv, the csv file of 100 random Nottingham postcodes I used in my previous post:

Geocoding Postcodes in Python: pgeocode v ONS

df = pd.read_csv('files/nottm_postcodes.csv')
df.head()
  Postcode
0  NG9 3WF
1  NG9 4WP
2  NG9 3EL
3  NG1 9FH
4  NG5 6QZ

I then defined NumberDonations by transforming a standard normal sample from np.random.randn, adding 1 via NumPy broadcasting so that each donor has donated at least once.

I initially tried to produce the number of donations by uniformly generating a random integer between 1 and 40. However, this resulted in a mean number of donations of ~20, which was not at all representative of the real data.

Rounding 5 raised to the power of the standard normal random variable generated by np.random.randn, then adding 1, resulted in a NumberDonations column with a more realistic distribution.

num_donations = 1 + np.round(5**np.random.randn(100))
df['NumberDonations'] = pd.Series(num_donations).astype(int)
df['NumberDonations'].describe()
count    100.000000
mean       3.920000
std        5.104326
min        1.000000
25%        1.000000
50%        2.000000
75%        3.250000
max       34.000000
Name: NumberDonations, dtype: float64

I wanted the TotalDonated column to positively correlate with the NumberDonations column but not be entirely determined by it.

I settled on multiplying the NumberDonations column by 15 and adding 20 raised to the power of another standard normal random variable. This second random variable adds noise, so TotalDonated is not entirely determined by NumberDonations.

The distribution of TotalDonated is far from perfect, but it suffices for the purposes of this post.

total_donated = np.round(np.abs(15*num_donations + 20**np.random.randn(100)))
df['TotalDonated'] = pd.Series(total_donated).astype(int)
df['TotalDonated'].describe()
count    100.000000
mean      69.340000
std       86.320196
min       15.000000
25%       30.000000
50%       33.000000
75%       61.500000
max      512.000000
Name: TotalDonated, dtype: float64
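
A quick histogram gives a feel for this right-skewed shape (a small aside; it reuses the matplotlib import from the top of the post):

plt.hist(df['TotalDonated'], bins=30)
plt.xlabel('TotalDonated (£)')
plt.ylabel('Number of donors')
plt.show()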
df.corr(numeric_only=True)
                 NumberDonations  TotalDonated
NumberDonations         1.000000      0.916614
TotalDonated            0.916614      1.000000

The AverageDonated column is simply TotalDonated divided by NumberDonations:

df['AverageDonated'] = np.round(df['TotalDonated']/df['NumberDonations'], decimals=2)
df['AverageDonated'].describe()
count    100.000000
mean      19.372000
std       16.543806
min       15.000000
25%       15.000000
50%       15.070000
75%       16.425000
max      164.500000
Name: AverageDonated, dtype: float64

My first approach to generating the binary feature Newsletter was:

newsletter = (np.random.rand(100) > 0.5).astype(int)
df['Newsletter'] = pd.Series(newsletter)
df['Newsletter'].describe()
count    100.00
mean       0.55
std        0.50
min        0.00
25%        0.00
50%        1.00
75%        1.00
max        1.00
Name: Newsletter, dtype: float64

However, I wanted the binary Newsletter feature to positively correlate with TotalDonated while, again, not being entirely determined by it. The approach above yields an entirely random Newsletter feature with approximately zero correlation to TotalDonated.

df.corr(numeric_only=True)
                 NumberDonations  TotalDonated  AverageDonated  Newsletter
NumberDonations         1.000000      0.916614       -0.078951    0.112402
TotalDonated            0.916614      1.000000        0.297428    0.061388
AverageDonated         -0.078951      0.297428        1.000000   -0.119316
Newsletter              0.112402      0.061388       -0.119316    1.000000

Upon the suggestion of Copilot, I settled on the following:

def add_newsletter(df):
    # Create a score that is a combination of 'NumberDonations' and 'TotalDonated'
    score = df['NumberDonations'] + df['TotalDonated'] + df['NumberDonations'] * df['TotalDonated']

    # Add some random noise to 'Score'
    score += np.random.normal(0, 0.1, df.shape[0])

    # Create 'Newsletter' column by thresholding 'Score' such that the mean of 'Newsletter' is about 0.5
    threshold = np.percentile(score, 50)  # 50 percentile, i.e., median
    df['Newsletter'] = (score > threshold).astype(int)

    return df
df = add_newsletter(df)
df['Newsletter'].describe()
count    100.000000
mean       0.500000
std        0.502519
min        0.000000
25%        0.000000
50%        0.500000
75%        1.000000
max        1.000000
Name: Newsletter, dtype: float64
df.corr(numeric_only=True)
                 NumberDonations  TotalDonated  AverageDonated  Newsletter
NumberDonations         1.000000      0.916614       -0.078951    0.472558
TotalDonated            0.916614      1.000000        0.297428    0.525571
AverageDonated         -0.078951      0.297428        1.000000    0.186624
Newsletter              0.472558      0.525571        0.186624    1.000000

Thus we arrive at the synthetic dataset. Note that it is not identical to the one used above, since no random seed was fixed during generation.
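
For reproducibility, one could fix NumPy's global seed before running the generation steps above (a minimal sketch; the seed value is arbitrary):

np.random.seed(42)  # makes every np.random.randn / np.random.rand draw repeatable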

df.head()
  Postcode  NumberDonations  TotalDonated  AverageDonated  Newsletter
0  NG9 3WF               21           399           19.00           1
1  NG9 4WP                1            16           16.00           0
2  NG9 3EL               16           241           15.06           1
3  NG1 9FH                2            30           15.00           0
4  NG5 6QZ                1            15           15.00           0
df.sample(5)
    Postcode  NumberDonations  TotalDonated  AverageDonated  Newsletter
11  NG13 8XS                2            30           15.00           0
9    NG9 4AX                1            17           17.00           0
52   NG2 6QB                1            19           19.00           0
34   NG1 1LF                3            85           28.33           1
85   NG7 5QX                1            15           15.00           0
df.describe()
       NumberDonations  TotalDonated  AverageDonated  Newsletter
count       100.000000    100.000000      100.000000  100.000000
mean          3.920000     69.340000       19.372000    0.500000
std           5.104326     86.320196       16.543806    0.502519
min           1.000000     15.000000       15.000000    0.000000
25%           1.000000     30.000000       15.000000    0.000000
50%           2.000000     33.000000       15.070000    0.500000
75%           3.250000     61.500000       16.425000    1.000000
max          34.000000    512.000000      164.500000    1.000000

Remarks and Further Directions

The postcodes were randomly chosen from a latitude and longitude range encompassing Nottingham city, so the geographic distribution of the fake donors carries no significance, whereas the real data contained many interesting patterns to investigate. For example, there is in fact a relationship between donor density and the social deprivation of areas of Nottingham that is not reflected in the synthetic data.

Similarly, the number of donations in the fake dataset was randomly generated, and the total donated was derived from the number of donations plus some transformed Gaussian noise. In practice there is a relationship between the amount given by donors and where in Nottingham they live.

I would be interested to investigate whether I could make the fake dataset more accurately reflect the distribution of the true donor dataset without breaching any GDPR rules.

I have ideas for attaching statistics to the dataset, including population density, median income, and IMD (Index of Multiple Deprivation). I am currently investigating this, and it may be the topic of future blog posts.

Ultimately I want to train ML models on the donor dataset for predictive purposes. I have tried some simple regression approaches, but I want to conduct more feature engineering on the dataset before prioritising this.
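
To give a flavour of what such a baseline might look like, here is a minimal sketch on the synthetic data, assuming scikit-learn is installed (the feature set and model are illustrative assumptions, not the approach used on the real data):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative features: donation behaviour only, no geography yet
X = df[['NumberDonations', 'TotalDonated', 'AverageDonated']]
y = df['Newsletter']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f'Held-out accuracy: {clf.score(X_test, y_test):.2f}')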