Motivation

Being part of a group well-known for applying U-Pb dating on detrital zircons to study sediment routing, my research methods are typically expected to focus on core description, statistical analysis, and luminescence dating. Initially, I assumed that U-Pb data wouldn’t be a major part of my work, as the samples are either from the Ganges or Brahmaputra rivers. However, when examining super sandy, thick sediment gravity flow deposits (most of which are turbidites), the lithofacies of these deposits surprisingly resemble features typical of the Brahmaputra.

This resemblance raises an interesting hypothesis: Could triggering events have occurred upstream in the Brahmaputra River, under favorable external conditions (such as climate-controlled low sea levels), to produce what we now observe as megaturbidites at the distal edge of the Bengal Fan, where our core samples were collected?

Developing a Python-Based KDE Plotting Tool

While there are great tools available for sedimentology data analysis, they are largely based on MATLAB or R. Although I have some familiarity with MATLAB, Python remains my tool of choice. To avoid switching programming environments, I decided to create my own Kernel Density Estimation (KDE) plotting functions in Python. I wanted these plots to meet publication standards, so I used visualization styles from my advisor Mike Blum’s 2018 paper as a reference. (See the paper here). The sample data I’m displaying here is also from that paper, but we’re in the process of collecting new data for future publications!

Setting Up the Data

To get started, you’ll need an Excel workbook for your data. If you plan to plot data from multiple sampling sites, arrange the age data from each site in a separate column, with each column named after the sampling site.

Building a Basic Plot

We’ll start with a very simple plot. Initially, I set the bandwidth hyperparameter to 0.01. This creates a trend that appears somewhat bumpy, so I added a smoothing function to improve the visual appeal. The goal is to use the script below to experiment with both the bandwidth and smoothing factors, finding an optimal balance. Decreasing the bandwidth provides more detail, but can result in a bumpy appearance, where smoothing can help make the plot more visually pleasing while retaining the key features of the data.

Below is a sample script to begin working with and the image is a preview of how the interface looks like. Feel free to adjust the parameters as needed:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
from scipy.ndimage import gaussian_filter1d
from ipywidgets import interact, widgets

def plot_kde(data_sample, bandwidth, sigma):
    # Generate KDE
    kde = gaussian_kde(data_sample, bw_method=bandwidth)
    x_values = np.linspace(0, 3500, 1000)
    y_values = kde(x_values)
    
    # Apply smoothing filter
    y_smooth = gaussian_filter1d(y_values, sigma=sigma)
    
    # Plotting
    plt.figure(figsize=(10, 1.8))

    plt.plot(x_values, y_smooth, color='black', linewidth=1)
    plt.xlabel("Age (Ma)")
    plt.ylabel("Normalized KDE")
    
    # Set x-axis limits and major ticks
    plt.xlim(0, 3500)
    plt.xticks(np.arange(0, 3501, 500))  # Major ticks every 500
    
    # Add minor ticks every 100
    plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(100))
    
    # Customize minor tick appearance
    plt.tick_params(axis='x', which='minor', length=4, color='black')
    
    # Remove y-axis ticks
    plt.gca().yaxis.set_ticks([])
    plt.gca().yaxis.set_ticklabels([])

    plt.show()

# Create interactive sliders for bandwidth and sigma, and add data_sample as a fixed argument
def interactive_plot(data_sample):
    interact(
        lambda bandwidth, sigma: plot_kde(data_sample, bandwidth=bandwidth, sigma=sigma),
        bandwidth=widgets.FloatSlider(value=0.008, min=0.005, max=0.5, step=0.001, 
                                      description='Bandwidth', readout_format='.3f'),
        sigma=widgets.FloatSlider(value=3, min=0, max=10, step=0.5, description='Smooth')
    )

Stacked KDE Plots of Multiple Samples

Simple Stack

When plotting stacked KDEs of multiple samples, the script will automatically use the column names and sample sizes as labels for each plot. The stacking order follows the sequence of columns in the Excel file, meaning the first column will be plotted at the bottom.

The para_dict is a dictionary where each key corresponds to a sample name, and the values are: 1) the age data, 2) the bandwidth parameter, and 3) the smoothing factor. After experimenting with one sample to determine a comfortable combination of bandwidth and smoothing factor, I found that my samples could share the same values, so I set a default_bandwidth and default_smooth_factor based on this. For cases where samples may require different settings, I recommend updating the para_dict accordingly to fine-tune each sample’s parameters.

"""
    Import Data with defaulted bandwidth and smooth factor
"""
# Load the Excel file
excel_path = "Blum_2018_Bengal_Fan.xlsx"  # Replace with the actual path to your file and put it under the same directory of the script
data = pd.read_excel(excel_path)

# Default values for bandwidth and smooth factor
default_bandwidth = 0.008
default_smooth_factor = 4

# Initialize the param_dict
param_dict = {}

# Process each column to construct the dictionary with column name and valid data count
for column in data.columns:
    # Count the number of non-NaN values in the column
    valid_data_count = data[column].count()
    # Construct the key with column name and valid data count
    key = f"{column} (n={valid_data_count})"
    # Assign the column data to the dictionary, excluding NaN values, along with default parameters
    param_dict[key] = [data[column].dropna().values, default_bandwidth, default_smooth_factor]

print(param_dict.keys())

Define the visualization function

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
from scipy.ndimage import gaussian_filter1d

def plot_stacked_kdes(params, offset, color_blocks=None):
    # Create figure
    plt.figure(figsize=(12, 5))

    # Add color blocks if provided
    if color_blocks is not None:
        for start, end, color in color_blocks:
            plt.axvspan(start, end, color=color, alpha=0.6)  # Alpha for transparency

    # Loop through each dataset and plot
    for idx, (label, (data, bandwidth, smooth_factor)) in enumerate(params.items()):
        # Clean data by removing NaNs and infinities
        data = data[~np.isnan(data)]
        data = data[np.isfinite(data)]

        # Generate KDE
        kde = gaussian_kde(data, bw_method=bandwidth)
        x_values = np.linspace(0, 3500, 1000)  # x-axis range and resolution
        y_values = kde(x_values)

        # Apply smoothing filter
        y_smooth = gaussian_filter1d(y_values, sigma=smooth_factor)

        # Normalize each curve to have a maximum height of 1
        y_smooth = y_smooth / max(y_smooth)

        # Offset the y-values to separate curves
        y_offset = y_smooth + idx * offset

        # Plot the smoothed KDE with black color and line width of 1
        plt.plot(x_values, y_offset, color='black', linewidth=1)

        # Add text label aligned to the right end of the plot, slightly above the curve
        plt.text(3500, y_offset[-1] + 0.2 * offset, label, 
                 ha='right', va='center', fontsize=10)

    # Customize the x-axis
    plt.xlim(0, 3500)
    plt.xlabel("Age (Ma)", fontsize=12)
    
    # Add major ticks every 500 and minor ticks every 100 on the x-axis
    plt.xticks(np.arange(0, 3501, 500))  # Major ticks every 500
    plt.gca().xaxis.set_minor_locator(plt.MultipleLocator(100))

    # Thicken x-axis spine and customize x-axis tick appearance
    plt.gca().spines['bottom'].set_linewidth(1.5)  # Thicker x-axis spine
    plt.gca().tick_params(axis='x', width=1.5, labelsize=10)  # Thicker x-axis major ticks and labels
    plt.tick_params(axis='x', which='minor', length=4, color='black', width=1.5)  # Thicker minor ticks

    # Remove y-axis ticks and labels for a clean stacked look
    plt.gca().yaxis.set_ticks([])
    plt.gca().yaxis.set_ticklabels([])

    # Remove only the top and right spines, keep the bottom (x-axis) spine
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.gca().spines['left'].set_visible(False)

    plt.show()

Run the following command to generate a simple stack graph. offset defines the vertical distance between each KDE plot.

plot_stacked_kdes(params=param_dict, offset=1.5, color_blocks=None)

Stacked Plot with densely populated age groups noted in different colors

These are preliminary age groups in the Bengal Fan dataset, observed visually (and therefore approximate), to create a demo visualization. Accurate ages, supported by literature, will be included in future updates.

color_blocks = [
    (0, 300, (173/255, 217/255, 152/255)),   # Greenish block (A)
    (400, 600, (140/255, 200/255, 225/255)), # Light Blue block (B)
    (700, 900, (158/255, 185/255, 225/255)), # Blue block (C)
    (910, 1250, (245/255, 220/255, 200/255)), # Adjusted Light Orange block (D)
    (1300, 1700, (255/255, 205/255, 185/255)), # Adjusted Light Pink block (E)
    (1710, 2100, (245/255, 190/255, 150/255)), # Orange block (F)
    (2300, 2750, (235/255, 140/255, 135/255))  # Red block (G)
]

plot_stacked_kdes(params=param_dict, offset=1.5, color_blocks=color_blocks)

Here’s the link to the script you can run in Jupyter Notebook: KDE Plot Script on GitHub.