How to use statistical knowledge when analyzing geological data?

Bridging the Gap Between Statistics and Geological Data: A Little Case Study Using a Histogram. Will update with data and graphs once published!

We often learn pure math and statistical concepts in class, encountering jargon like mean, median, mode, standard deviation, variance, and even more complex terms like PDF (Probability Density Function) and CDF (Cumulative Distribution Function). Understanding the statistics and the math behind these concepts, which means building a logical link between the math, the physical shape of the data distribution, and the statistical meaning behind it, is already challenging.

When it comes to real life, dealing with our own geological data, we can feel disconnected from what we learned in statistics class. Here I'm going to use my own study as a case to show how I tie my statistical knowledge and the information I want to read from my data together. The background is that numerous sandy deep-water deposits were recovered by IODP 354, a drilling expedition in the distal Bengal Fan. I'm curious about how thick these events are and how many there are. A histogram is a good way to visualize the data and show me those facts. When I analyze my data, I don't necessarily have all that statistical knowledge in the forefront of my mind. Instead, I'm driven by the geological questions I want to answer:

  • What are the most frequently occurring thicknesses?
  • What about the super thick ones?
  • Will they tail down at the end, or do they represent a peak on their own?

I first plotted a histogram on a linear scale, but there wasn't much information to be gleaned from it. From past experience, I knew that geological data often benefits from being plotted on a logarithmic scale, so I tried that as well. After a few attempts at adjusting the number of bins, the histogram finally looked both informative and aesthetically pleasing.
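If you want to try the same workflow, here is a minimal sketch in Python with numpy and matplotlib; the thickness_cm array is hypothetical, randomly generated data standing in for the real measurements, and the bin count of 25 is just one of the values I would experiment with.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical event thicknesses in centimeters (stand-in for the real data)
    thickness_cm = np.random.lognormal(mean=np.log(15), sigma=1.0, size=500)

    # Log-spaced bin edges between the smallest and largest observed thickness
    bins = np.logspace(np.log10(thickness_cm.min()), np.log10(thickness_cm.max()), 25)

    fig, ax = plt.subplots()
    ax.hist(thickness_cm, bins=bins, edgecolor="black")
    ax.set_xscale("log")                      # logarithmic thickness axis
    ax.set_xlabel("Event thickness (cm)")
    ax.set_ylabel("Count")
    ax.set_title("Thickness distribution of sandy deep-water deposits")
    plt.show()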

Observations from the Data

As I began my observations (the final graph plotted with my data will be attached here once it is published; for now, its shape is similar to the thumbnail image of this post):

  • The distribution looked skewed to the left.
  • The most frequently occurring thicknesses seemed to be those between 10 cm and 20 cm.
  • However, I noticed that if I changed the bin number, the exact range of the most frequent group would slightly shift.
  • Additionally, I observed that the super thick events, those over 100 cm, were situated at the right tail of the distribution.

These are good observations, based on intuition, but to ensure that the information is conveyed consistently to others, we need to quantify them. After all, different people may have different definitions of what qualifies as “super thick,” a term with no universal definition.

Introducing Statistical Concepts

This is where statistical knowledge becomes invaluable. The bin number is a parameter you can adjust to better visualize the data—it’s flexible and helps you see different aspects of the dataset. However, the descriptive statistics are fixed; they are inherent properties of the dataset.

  • Mode: The mode is the descriptive statistic for the most frequently occurring value. For binned thickness data like ours, it corresponds to the bin, i.e., the thickness range, with the highest count.
  • Outliers: The super thick sediment gravity flows (SGFs) at the tail are outliers. One common way to define outliers formally is the threshold mean + 2\(\sigma\) (two standard deviations above the mean); for our data, this threshold works out to around 2 meters. Rather than calling these events “super thick,” we define them as outsized to make the term more formal. (A short code sketch of these two statistics follows this list.)
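To make those two definitions concrete, here is a minimal sketch, again on hypothetical, randomly generated thickness data rather than the real measurements, of how the modal thickness range and the mean + 2\(\sigma\) threshold can be computed:

    import numpy as np

    # Hypothetical thickness data (cm) and log-spaced bins, as in the plotting sketch above
    thickness_cm = np.random.lognormal(mean=np.log(15), sigma=1.0, size=500)
    bins = np.logspace(np.log10(thickness_cm.min()), np.log10(thickness_cm.max()), 25)

    # Modal class: the histogram bin with the highest count
    counts, edges = np.histogram(thickness_cm, bins=bins)
    peak = np.argmax(counts)
    print(f"Most frequent thickness range: {edges[peak]:.1f}-{edges[peak + 1]:.1f} cm")

    # Outlier threshold: mean + 2 standard deviations of the raw thicknesses
    threshold = thickness_cm.mean() + 2 * thickness_cm.std()
    print(f"Outsized (outlier) threshold: {threshold:.1f} cm")
    print(f"Number of outsized events: {(thickness_cm > threshold).sum()}")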

Understanding the Empirical Rule

One thing to keep in mind is the empirical rule, also known as the 68-95-99.7 rule. This rule states that:

  • 68% of the data falls within one standard deviation of the mean,
  • 95% falls within two standard deviations,
  • 99.7% falls within three standard deviations.

However, this rule is derived from a symmetric normal distribution and does not apply to skewed data, like the left-skewed distribution we’re dealing with here.
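As a quick illustration of why the rule should not be taken for granted on skewed data, here is a small sketch comparing how much of a simulated normal sample versus a simulated lognormal (skewed) sample actually falls within 1, 2, and 3 standard deviations of the mean; the numbers come from synthetic data, not from the IODP 354 measurements.

    import numpy as np

    rng = np.random.default_rng(0)
    samples = {
        "normal": rng.normal(loc=50.0, scale=10.0, size=100_000),
        "lognormal (skewed)": rng.lognormal(mean=np.log(15), sigma=1.0, size=100_000),
    }

    # Fraction of each sample within k standard deviations of its mean
    for name, s in samples.items():
        coverage = [np.mean(np.abs(s - s.mean()) <= k * s.std()) for k in (1, 2, 3)]
        print(name, [f"{c:.1%}" for c in coverage])

For the skewed sample, the coverage deviates from 68-95-99.7, most noticeably at one and three standard deviations, which is exactly the caveat above.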

Quantifying the Observations

Now, with these quantifying ideas in mind, let’s revisit the graph. We still use descriptive terms like left-skewed lognormal distribution, but now we supplement them with numbers, like the mode and the threshold for outliers. What if we want to delve deeper, or simply want to be a little nerdier mathematically as geologists?

  • Fitting a Distribution: How can we use a solid equation to fit this distribution?
  • Calculating Probabilities: What’s the probability of forming these outsized SGFs?

Fitting the Lognormal Distribution

We’ve already determined that the distribution pattern we’re dealing with is a lognormal distribution. The general equation for the PDF of a lognormal distribution is:

\[f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln(x) - \mu)^2}{2\sigma^2}\right)\]

Where:

  • \(x\) is the thickness,
  • \(\mu\) is the mean of the log-transformed data,
  • \(\sigma\) is the standard deviation of the log-transformed data (both using the natural log, matching the \(\ln(x)\) in the formula above).
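If you want to evaluate this formula numerically, here is a minimal sketch in Python. It implements the PDF exactly as written above and cross-checks it against scipy.stats.lognorm, which parameterizes the same distribution with s = \(\sigma\) and scale = \(e^{\mu}\); the values of \(\mu\) and \(\sigma\) below are arbitrary placeholders, not fits to real data.

    import numpy as np
    from scipy.stats import lognorm

    def lognormal_pdf(x, mu, sigma):
        # PDF of the lognormal distribution; mu and sigma describe ln(x)
        return (1.0 / (x * sigma * np.sqrt(2.0 * np.pi))
                * np.exp(-(np.log(x) - mu) ** 2 / (2.0 * sigma ** 2)))

    mu, sigma = 3.0, 0.8                                  # placeholder parameters
    x = np.linspace(1.0, 200.0, 5)
    print(lognormal_pdf(x, mu, sigma))                    # direct evaluation of the formula
    print(lognorm.pdf(x, s=sigma, scale=np.exp(mu)))      # same values from scipy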

Customizing the Distribution Using Maximum Likelihood Estimation (MLE)

To apply the lognormal distribution equation to our data, we need to personalize it by estimating the parameters \(\mu\) and \(\sigma\) using Maximum Likelihood Estimation (MLE). Here’s how you can carry out MLE with logarithms base 10 (note that the general PDF above uses natural logarithms; estimates of \(\mu\) and \(\sigma\) obtained from base-10 logs are in \(\log_{10}\) units and can be converted to the natural-log parameters by multiplying by \(\ln(10)\)):

  1. Log-transform the data (using log base 10):
    • First, transform your thickness data by taking the logarithm base 10 of each data point, resulting in a new dataset \(\log_{10}(x)\).
  2. Write the likelihood function:
    • The likelihood function is the product of the PDF evaluated at each data point in your sample, because this is how independent observations are combined mathematically: it aggregates the information from all the individual data points into a single measure of how likely the entire observed dataset is under the given distribution parameters.
    • The likelihood function represents the probability of observing your data given specific values of the parameters \(\mu\) and \(\sigma\). For a lognormal distribution parameterized with base-10 logarithms, the likelihood function \(L(\mu, \sigma)\) based on the transformed data can be written (dropping a constant factor of \(1/\ln(10)\) per data point, which does not affect where the maximum lies) as:
    \[L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{x_i \sigma \sqrt{2\pi}} \exp \left( -\frac{(\log_{10}(x_i) - \mu)^2}{2\sigma^2} \right)\]
  3. Derive the log-likelihood function:
    • To simplify the likelihood function, take the natural logarithm of the likelihood, which converts the product into a sum. This gives the log-likelihood function:
    \[\ln L(\mu, \sigma) = -\frac{n}{2} \ln(2\pi) - n \ln(\sigma) - \sum_{i=1}^{n} \ln(x_i) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (\log_{10}(x_i) - \mu)^2\]
  4. Differentiate the log-likelihood function:

    The next step is to maximize the log-likelihood function with respect to \(\mu\) and \(\sigma\), i.e., to find the parameter values that make the observed data most probable. To do this, we take the partial derivatives of \(\ln L(\mu, \sigma)\) with respect to each parameter.

    • For \(\mu\): Differentiate the log-likelihood function with respect to \(\mu\):
    \[\frac{\partial \ln L(\mu, \sigma)}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (\log_{10}(x_i) - \mu)\]
    • For \(\sigma\): Differentiate the log-likelihood function with respect to \(\sigma\):
    \[\frac{\partial \ln L(\mu, \sigma)}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (\log_{10}(x_i) - \mu)^2\]
  5. Set the derivatives to zero to find the maximum:

    To find the maximum of the log-likelihood function, set the partial derivatives with respect to \(\mu\) and \(\sigma\) equal to zero:

    • For \(\mu\):
    \[\frac{\partial \ln L(\mu, \sigma)}{\partial \mu} = 0 \implies \sum_{i=1}^{n} (\log_{10}(x_i) - \mu) = 0\]

    This equation simplifies to:

    \[\mu = \frac{1}{n} \sum_{i=1}^{n} \log_{10}(x_i)\]

    Plugging in the dataset (each thickness measurement is an \(x_i\)) then yields \(\mu\), which is simply the mean of the log-transformed data.

    The same workflow applies for obtaining \(\sigma\).

    • For \(\sigma\):
    \[\frac{\partial \ln L(\mu, \sigma)}{\partial \sigma} = 0 \implies -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (\log_{10}(x_i) - \mu)^2 = 0\]

    This equation simplifies to:

    \[\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (\log_{10}(x_i) - \mu)^2\]

    \(\sigma\) can then be calculated by plugging in the dataset; it is the standard deviation of the log-transformed data.

    The values of \(\mu\) and \(\sigma\) that maximize the log-likelihood function are your MLE estimates. They define your personalized lognormal distribution and are used to construct the probability density function (PDF) and cumulative distribution function (CDF) specific to your data, enabling further analysis and interpretation.

    In high school, teachers just give you the equations for calculating \(\mu\) and \(\sigma\) and have you memorize them for exams. Now, Python packages like pandas can give you all the descriptive stats with just one line of code:

     df.describe()
    

    But if you take the time to work through these calculations yourself, it really helps you understand why we’re doing them and what they actually mean. Plus, there’s a sense of accomplishment when your calculated results align with what you learned from your high school teacher. We are standing on the shoulders of giants.
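For anyone who wants to check the derivation numerically, here is a minimal sketch of the closed-form MLE result in Python, again on hypothetical, randomly generated thickness data; the last line cross-checks against numpy's built-in mean and standard deviation of the log-transformed values.

    import numpy as np

    # Hypothetical thickness data (cm); substitute the real measurements here
    thickness_cm = np.random.lognormal(mean=np.log(15), sigma=1.0, size=500)

    log_x = np.log10(thickness_cm)   # step 1: log-transform (base 10)
    n = log_x.size

    mu_hat = log_x.sum() / n                                 # MLE for mu
    sigma_hat = np.sqrt(((log_x - mu_hat) ** 2).sum() / n)   # MLE for sigma (1/n, not 1/(n-1))

    print(f"mu    (log10 units): {mu_hat:.3f}")
    print(f"sigma (log10 units): {sigma_hat:.3f}")

    # Cross-check: np.mean and np.std (default ddof=0) give the same answers
    print(np.isclose(mu_hat, log_x.mean()), np.isclose(sigma_hat, log_x.std()))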

Calculating the Cumulative Distribution Function (CDF)

Once we have the equation for the PDF, we can calculate the Cumulative Distribution Function (CDF), which represents the accumulating probability, by integrating the PDF:

\[F(x) = P(X \leq x) = \int_{0}^{x} f(t) \, dt = \Phi\left(\frac{\log_{10}(x) - \mu}{\sigma}\right)\]

Here \(\Phi\) is the CDF of the standard normal distribution, and \(\mu\) and \(\sigma\) are the estimates obtained above (in \(\log_{10}\) units). The CDF gives us the cumulative probability of an event up to a certain thickness.
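A minimal sketch of this CDF in Python, using scipy's standard normal CDF; the \(\mu\) and \(\sigma\) values below are placeholders for the MLE estimates you would obtain from your own data:

    import numpy as np
    from scipy.stats import norm

    def lognormal_cdf(x_cm, mu, sigma):
        # P(X <= x_cm); mu and sigma are the base-10 log parameters
        return norm.cdf((np.log10(x_cm) - mu) / sigma)

    mu_hat, sigma_hat = 1.2, 0.45                  # placeholder MLE estimates (log10 units)
    print(lognormal_cdf(20, mu_hat, sigma_hat))    # cumulative probability up to 20 cm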

Determining the Probability of Outsized SGFs

To calculate the probability of forming outsized SGFs with thicknesses between 2 m and 10 m, we follow these steps:

  1. Calculate the CDF at 2 m and 10 m: Find the CDF values for these thicknesses using the personalized lognormal distribution.
  2. Subtract the CDF values: Subtract the CDF value at 2 m from the CDF value at 10 m to get the probability of a thickness falling within this range (a short continuation of the CDF code sketch above is shown after this list).
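Continuing the lognormal_cdf sketch from the previous section (same placeholder parameters; 2 m = 200 cm and 10 m = 1000 cm to stay in the dataset's units), the subtraction looks like this:

    # Probability that an event thickness falls between 2 m (200 cm) and 10 m (1000 cm)
    p_outsized = lognormal_cdf(1000, mu_hat, sigma_hat) - lognormal_cdf(200, mu_hat, sigma_hat)
    print(f"P(200 cm < thickness <= 1000 cm) = {p_outsized:.4f}")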

Conclusion

In this blog, we’ve explored how statistical tools can be applied to geological data, making those initially confusing statistical terms our allies. With these quantified observations and a clear understanding of their statistical meaning, we can confidently move forward with our geological interpretations, using data to back up our insights.