Tutorial

Statistical functions

Distimate can approximate common statistical functions from a histogram.

Distimate supports the following functions:

  • mean() (inefficient and imprecise, for sanity checks only)
  • Probability density function (PDF) - PDF
  • Cumulative distribution function (CDF) - CDF
  • Quantile (percentile) function - Quantile

Note

In many contexts, distribution functions approximated by Distimate should be called empirical distribution functions. They usually aggregate an empirical measure of a random sample.

For example, many libraries implement an empirical cumulative distribution function (ECDF). Distimate calls that function CDF for brevity.

Each of the above functions can be plotted using its .x and .y attributes, or called to approximate the function value at an arbitrary point.

import distimate

edges = [0, 10, 50, 100]

cdf = distimate.CDF(edges, [4, 3, 1, 0, 2])
print(cdf.x)
print(cdf.y)
[  0  10  50 100]
[0.4 0.7 0.8 0.8]

The functions accept a number or a NumPy array-like.

print(cdf(0))
print(cdf(5))
print(cdf(107))
print(cdf([0, 5, 107]))
0.4
0.55
nan
[0.4  0.55  nan]

Functions are approximated from histograms:

  • The first bucket is represented by the first edge.
  • We assume that samples are uniformly distributed in inner buckets.
  • Outliers in the last bucket cannot be approximated.

The implementation also handles various corner cases, for example ambiguous quantiles.
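
The approximation rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not Distimate's implementation: the helper name approx_cdf is hypothetical, the histogram is assumed to be non-empty, and returning nan below the first edge is a choice made for this sketch.

```python
import bisect
import math


def approx_cdf(edges, hist, x):
    """Approximate the CDF at x from a histogram with len(edges) + 1 buckets."""
    total = sum(hist)
    # Cumulative share of samples at each edge; the last (outlier) bucket
    # is left out because its samples have no known upper bound.
    cum, running = [], 0
    for count in hist[:-1]:
        running += count
        cum.append(running / total)
    # Sketch choice: nothing is approximated outside the range of the edges.
    if x < edges[0] or x > edges[-1]:
        return math.nan
    i = bisect.bisect_left(edges, x)
    if i == 0:
        # The first bucket is represented by the first edge.
        return cum[0]
    # Samples are assumed uniform inside an inner bucket: interpolate linearly.
    left, right = edges[i - 1], edges[i]
    frac = (x - left) / (right - left)
    return cum[i - 1] + frac * (cum[i] - cum[i - 1])


edges = [0, 10, 50, 100]
hist = [4, 3, 1, 0, 2]
print(approx_cdf(edges, hist, 0))            # 0.4
print(round(approx_cdf(edges, hist, 5), 2))  # 0.55
print(approx_cdf(edges, hist, 107))          # nan
```

The value at 5 lies halfway between the CDF values at edges 0 and 10, which is where 0.55 in the earlier output comes from.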

Distributions

All approximations from histograms require histogram edges and values. The Distribution class is a wrapper that holds both. It provides methods for updating or combining distributions:

dist1 = distimate.Distribution(edges)
dist1.add(7)
print(dist1.to_histogram())

dist2 = distimate.Distribution(edges)
dist2.update([0, 1, 1])
print(dist2.to_histogram())

print("----------------")
print((dist1 + dist2).to_histogram())
[0. 1. 0. 0. 0.]
[1. 2. 0. 0. 0.]
----------------
[1. 3. 0. 0. 0.]

Samples are assigned to histogram buckets as follows:

  • The first histogram bucket counts items less than or equal to the left-most edge.
  • The inner buckets count items between two edges. Intervals are left-open: each inner bucket counts items greater than its left edge and less than or equal to its right edge.
  • The last bucket counts items greater than the right-most edge.
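
The bucketing rules above map directly onto a binary search over the edges. The following sketch uses the standard bisect module; the helper names bucket_index and to_histogram are hypothetical and are not part of Distimate.

```python
import bisect


def bucket_index(edges, sample):
    # bisect_left returns the number of edges strictly less than the sample.
    # A sample equal to an edge therefore lands in the bucket that ends at
    # that edge, which matches the left-open intervals described above.
    return bisect.bisect_left(edges, sample)


def to_histogram(edges, samples):
    # One bucket more than edges: the extra last bucket collects outliers.
    hist = [0] * (len(edges) + 1)
    for sample in samples:
        hist[bucket_index(edges, sample)] += 1
    return hist


edges = [0, 10, 50, 100]
print(to_histogram(edges, [7]))        # [0, 1, 0, 0, 0]
print(to_histogram(edges, [0, 1, 1]))  # [1, 2, 0, 0, 0]
print(to_histogram(edges, [10, 107]))  # [0, 1, 0, 0, 1]
```

The first two calls reproduce the bucket counts printed for dist1 and dist2; the last one shows a sample on an edge and an outlier.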

Note

The bucketing implemented in Distimate works best with non-negative metrics.

  • The left-most edge should be zero in most cases.
  • The right-most edge should be set to the highest expected value.

With this setup, the first bucket counts zeros and the last bucket counts outliers.

If Distimate is used with negative metrics, it can return wrong approximations for values below the left-most edge.

It is common to define histogram edges once and reuse them between distributions. The DistributionType class can remember the histogram edges. It can be used as a factory for creating distributions:

dist_type = distimate.DistributionType([0, 10, 50, 100])
print(dist_type.edges)

dist = dist_type.from_samples([0, 7, 10, 107])
print(dist.edges, dist.values)
[  0  10  50 100]
[  0  10  50 100] [1. 2. 0. 0. 1.]

Pandas integration

Suppose you load a pandas.DataFrame with histogram values:

import pandas as pd

columns = ["color", "size", "hist0", "hist1", "hist2", "hist3", "hist4"]
data = [
    (  "red", "M", 0, 1, 0, 0, 0),
    ( "blue", "L", 1, 2, 0, 0, 0),
    ( "blue", "M", 3, 2, 1, 0, 1),
]
df = pd.DataFrame(data, columns=columns)
print(df)
  color size  hist0  hist1  hist2  hist3  hist4
0   red    M      0      1      0      0      0
1  blue    L      1      2      0      0      0
2  blue    M      3      2      1      0      1

The histogram data can be converted to a pandas.Series of Distribution instances:

hist_columns = df.columns[2:]
dists = pd.Series.dist.from_histogram(edges, df[hist_columns])
print(dists)
0    <Distribution: weight=1, mean=5.00>
1    <Distribution: weight=3, mean=3.33>
2     <Distribution: weight=7, mean=nan>
dtype: object
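
The mean values printed above are consistent with the approximation rules stated earlier: the first bucket is represented by the first edge, samples in an inner bucket are assumed uniform (so the bucket contributes its midpoint), and outliers in the last bucket cannot be approximated. A minimal sketch under that reading follows; approx_mean is a hypothetical helper, not Distimate's implementation, and returning nan for an empty histogram is a choice made for this sketch.

```python
import math


def approx_mean(edges, hist):
    # Any sample in the last bucket has an unknown value, so no mean is given.
    if hist[-1] > 0:
        return math.nan
    total = sum(hist)
    if total == 0:
        return math.nan  # sketch choice for an empty histogram
    # The first bucket is represented by the first edge.
    weighted = hist[0] * edges[0]
    # Each inner bucket is represented by the midpoint of its two edges.
    for i in range(1, len(edges)):
        midpoint = (edges[i - 1] + edges[i]) / 2
        weighted += hist[i] * midpoint
    return weighted / total


edges = [0, 10, 50, 100]
print(approx_mean(edges, [0, 1, 0, 0, 0]))            # 5.0
print(round(approx_mean(edges, [1, 2, 0, 0, 0]), 2))  # 3.33
print(approx_mean(edges, [3, 2, 1, 0, 1]))            # nan
```

The third row has one sample in the last bucket, which is why its mean is reported as nan.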

We can replace the histograms in the original DataFrame with the distributions:

df["qty"] = dists
df.drop(columns=hist_columns, inplace=True)
print(df)
  color size                                  qty
0   red    M  <Distribution: weight=1, mean=5.00>
1  blue    L  <Distribution: weight=3, mean=3.33>
2  blue    M   <Distribution: weight=7, mean=nan>

The advantage of the new column is that we can use it with the dist accessor to compute statistical functions for all DataFrame rows using a simple expression.

median = df["qty"].dist.quantile(0.5)
print(median)
0    5.0
1    2.5
2    2.5
Name: qty_q50, dtype: float64

See DistributionAccessor for all methods available via the dist accessor.
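
To see where the median values above come from, the quantile can be worked out by hand as the inverse of the interpolated CDF. The sketch below is an assumption-laden illustration, not Distimate's implementation: approx_quantile is a hypothetical helper, and ambiguous cases (for example empty buckets) are resolved in the simplest way, which may differ from what the library returns.

```python
import math


def approx_quantile(edges, hist, q):
    total = sum(hist)
    if total == 0:
        return math.nan  # sketch choice for an empty histogram
    target = q * total  # number of samples at or below the quantile
    running = hist[0]
    if target <= running:
        # The first bucket is represented by the first edge.
        return edges[0]
    for i in range(1, len(edges)):
        count = hist[i]
        if count > 0 and target <= running + count:
            # Samples are assumed uniform inside the bucket.
            frac = (target - running) / count
            return edges[i - 1] + frac * (edges[i] - edges[i - 1])
        running += count
    # The quantile falls among the outliers and cannot be approximated.
    return math.nan


edges = [0, 10, 50, 100]
print(approx_quantile(edges, [0, 1, 0, 0, 0], 0.5))  # 5.0
print(approx_quantile(edges, [1, 2, 0, 0, 0], 0.5))  # 2.5
print(approx_quantile(edges, [3, 2, 1, 0, 1], 0.5))  # 2.5
```

In the second row, for instance, the median corresponds to 1.5 of 3 samples: one sample sits in the first bucket, so the median lies a quarter of the way through the bucket from 0 to 10.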

A Series of Distribution instances can be aggregated:

agg = df.groupby("color")["qty"].sum()
print(agg)
color
blue    <Distribution: weight=10, mean=nan>
red     <Distribution: weight=1, mean=5.00>
Name: qty, dtype: object
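
The aggregated weights above are what one gets by adding the underlying histograms bucket by bucket, which is only meaningful when both histograms share the same edges. A minimal sketch of that idea, with a hypothetical combine helper that is not part of Distimate:

```python
def combine(hist_a, hist_b):
    # Histograms can only be added when they have the same buckets,
    # that is, when they were built from the same edges.
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of buckets")
    return [a + b for a, b in zip(hist_a, hist_b)]


# The two "blue" rows from the DataFrame above.
blue = combine([1, 2, 0, 0, 0], [3, 2, 1, 0, 1])
print(blue)       # [4, 4, 1, 0, 1]
print(sum(blue))  # 10
```

The combined histogram keeps the one outlier in its last bucket, which matches the nan mean reported for the blue group.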