Class EmpiricalDistribution

java.lang.Object
org.sm.smtools.math.statistics.EmpiricalDistribution

public final class EmpiricalDistribution
extends java.lang.Object
The EmpiricalDistribution class offers a means to calculate the empirical cumulative distribution (CDF) and probability density (PDF) functions, including percentiles. After the distribution is created, a user will typically call the analyse() method to estimate the various statistical quantities.

The distribution can contain at most Integer.MAX_VALUE samples.

Note that a valid I18NL10N database must be available!

Note that this class cannot be subclassed!

Version:
26/06/2018
Author:
Sven Maerivoet
  • Constructor Summary

    Constructors
    Constructor Description
    EmpiricalDistribution()
    Constructs an empty EmpiricalDistribution object.
    EmpiricalDistribution​(double[] x)
    Constructs an EmpiricalDistribution object for a given array of values.
    EmpiricalDistribution​(double[] x, double[] histogramBinRightEdges)
    Constructs an EmpiricalDistribution object for a given array of values and user-specified histogram bin right edges.
    EmpiricalDistribution​(double[] x, int nrOfHistogramBins)
    Constructs an EmpiricalDistribution object for a given array of values and a user-specified number of histogram bins.
  • Method Summary

    Modifier and Type Method Description
    void analyse()
    Estimates the empirical distribution and analyses its various statistical quantities.
    double calculateKDEPDFBandwidth​(MathTools.EKernelType kernelType)
    Calculates the bandwidth for kernel density estimation (KDE) based on Silverman's Rule-of-Thumb.
    void clear()
    Clears the empirical distribution.
    void estimateKDEPDF​(MathTools.EKernelType kernelType, double bandwidth, int nrOfSupportPoints, double minSupport, double maxSupport)
    Estimates the probability density function (PDF) using a specified kernel function.
    double getCDF​(double x)
    Returns the value of the cumulative distribution function (CDF) evaluated at x.
    static double getChiSquare​(double alpha, int degreesOfFreedom)
    Returns the chi-square value corresponding to a specified alpha level and number of degrees of freedom.
    double[] getData()
    Retrieves the raw data for this empirical distribution.
    double getExpectedValue()
    Returns the expected value for the first moment (population mean), which in this case is approximated by the sample mean.
    FunctionLookupTable getFullKDEPDF()
    Returns the complete, previously calculated kernel density estimation (KDE) of the probability density function (PDF).
    double getHistogramBinCentre​(int histogramBin)
    Returns the centre of a specified histogram bin.
    double[] getHistogramBinCentres()
    Returns the centres of all the histogram bins.
    double getHistogramBinCount​(int histogramBin)
    Returns the count associated with a specified histogram bin.
    double[] getHistogramBinCounts()
    Returns the counts for all the histogram bins.
    double[] getHistogramBinFrequencies()
    Returns the frequencies for all the histogram bins.
    double getHistogramBinFrequency​(int histogramBin)
    Returns the frequency associated with a specified histogram bin.
    double getHistogramBinWidth()
    Returns the width of a histogram bin.
    double getInterquartileRange()
    Returns the interquartile range (IQR) (i.e., the difference between the 75th and the 25th percentiles).
    static java.lang.String getInterquartileRangeDescription()
    Returns a descriptive label of the interquartile range (IQR).
    double getJarqueBeraTestStatistic()
    Calculates the Jarque-Bera test statistic.
    double getKDEPDF​(double x)
    Returns the value of the probability density function (PDF) evaluated at x (based on kernel density estimation, KDE).
    FunctionLookupTable getKDEPDFModes()
    Returns all modes (i.e., local maxima) for the calculated kernel density estimation (KDE) of the probability density function (PDF).
    double getKDEXMaximum()
    Returns the maximum of the values for a kernel density estimation (KDE) of the probability density function (PDF).
    double getKDEXMinimum()
    Returns the minimum of the values for a kernel density estimation (KDE) of the probability density function (PDF).
    double getKDEXRange()
    Returns the range of the values for a kernel density estimation (KDE) of the probability density function (PDF).
    double getKurtosis()
    Returns the sample kurtosis (using an unbiased estimator).
    static java.lang.String getKurtosisDescription()
    Returns a descriptive label of the kurtosis.
    java.lang.String getKurtosisInterpretation()
    Returns a qualitative description of the kurtosis based on its test statistic.
    double getKurtosisZStatistic()
    Returns a two-tailed test statistic Z of kurtosis (different from zero) with a 5% significance level.
    double getMean()
    Returns the sample mean, which in this case is an alias for the expected value.
    static java.lang.String getMeanDescription()
    Returns a descriptive label of the mean (expected value).
    double getMedian()
    Returns the median (i.e., the 50th percentile).
    static java.lang.String getMedianDescription()
    Returns a descriptive label of the median.
    int getN()
    Returns the sample size.
    int getNrOfHistogramBins()
    Returns the number of histogram bins used for estimating the probability density function (PDF).
    boolean[] getOutliers()
    Returns flags marking the outliers, which are defined as having z-scores greater than 3.
    double getPDF​(double x)
    Returns the value of the probability density function (PDF) evaluated at x (based on a histogram).
    double getPercentile​(double percentile)
    Returns the given percentile.
    double getPercentile​(int percentile)
    Returns the given percentile.
    static java.lang.String getPercentileDescription()
    Returns a descriptive label of a percentile.
    double[] getPercentiles()
    Returns all the percentiles for the range [0,100].
    double getSkewness()
    Returns the sample skewness (using an unbiased estimator).
    double getSkewnessConfidenceBounds()
    Returns the symmetric confidence bounds of the skewness for a 95% confidence interval, defined as twice the standard error of skewness (SES).
    static java.lang.String getSkewnessDescription()
    Returns a descriptive label of the skewness.
    java.lang.String getSkewnessInterpretation()
    Returns a qualitative description of the skewness based on its test statistic.
    double getSkewnessZStatistic()
    Returns a two-tailed test statistic Z of skewness (different from zero) with a 5% significance level.
    double[] getSortedData()
    Retrieves the sorted data for this empirical distribution.
    double getStandardDeviation()
    Returns the standard deviation (i.e., the positive square root of the variance).
    static java.lang.String getStandardDeviationDescription()
    Returns a descriptive label of the standard deviation.
    double getTrimmedMean​(double percentageToTrim)
    Returns the trimmed (or truncated) mean, which corresponds to the mean calculated after symmetrically discarding a certain percentage of data points at the high and low end (without interpolation).
    double getVariance()
    Returns the sample variance (using an unbiased estimator of the population variance).
    static java.lang.String getVarianceDescription()
    Returns a descriptive label of the variance.
    double getXMaximum()
    Returns the maximum of the input values.
    double getXMinimum()
    Returns the minimum of the input values.
    double getXRange()
    Returns the range of the input values.
    double[] getZScores()
    Returns the calculated z-scores, defined as (value - mean) / standard deviation.
    boolean isJarqueBeraTestAccepted​(double alpha)
    Compares the Jarque-Bera test statistic with the chi-square distribution with 2 degrees of freedom for a given alpha level.
    void recalculatePDF()
    Recalculates the probability density function (PDF).
    void recalculatePDF​(int nrOfHistogramBins)
    Recalculates the probability density function (PDF) using a user-specified number of histogram bins.
    void setData​(double[] x)
    Sets the source data for the empirical distribution.
    void setData​(double[] x, int nrOfHistogramBins)
    Sets the source data for the empirical distribution, as well as a user-specified number of histogram bins.

    Methods inherited from class java.lang.Object

    clone, equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • EmpiricalDistribution

      public EmpiricalDistribution()
      Constructs an empty EmpiricalDistribution object.
    • EmpiricalDistribution

      public EmpiricalDistribution​(double[] x)
      Constructs an EmpiricalDistribution object for a given array of values.

      The Freedman-Diaconis rule is applied for finding the optimal histogram bin width, and consequently the optimal number of histogram bins:

      bin width = 2 * IQR / n^(1/3)

      Parameters:
      x - the array of values to estimate the empirical distribution for
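
      The Freedman-Diaconis rule above can be illustrated with a minimal, self-contained sketch. It is not this class's implementation: the class name FreedmanDiaconisSketch, its helper methods, and the linearly interpolated percentile definition are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical helper, not part of org.sm.smtools.
final class FreedmanDiaconisSketch {

    // Percentile by linear interpolation between the closest ranks
    // (an assumption; the library may define percentiles differently).
    static double percentile(double[] sorted, double p) {
        double pos = (p / 100.0) * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // bin width = 2 * IQR / n^(1/3)
    static double binWidth(double[] x) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double iqr = percentile(sorted, 75.0) - percentile(sorted, 25.0);
        return 2.0 * iqr / Math.cbrt(sorted.length);
    }

    // Number of bins needed to cover the data range with that width.
    static int nrOfBins(double[] x) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double range = sorted[sorted.length - 1] - sorted[0];
        double width = binWidth(x);
        // Fall back to a single bin when the IQR (and thus the width) is zero.
        return (width > 0.0) ? Math.max(1, (int) Math.ceil(range / width)) : 1;
    }
}
```

      For the ten values 1 to 10 this gives an IQR of 4.5 and a bin width of roughly 4.18, so three bins cover the range of 9.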
    • EmpiricalDistribution

      public EmpiricalDistribution​(double[] x, int nrOfHistogramBins)
      Constructs an EmpiricalDistribution object for a given array of values and a user-specified number of histogram bins.
      Parameters:
      x - the array of values to estimate the empirical distribution for
      nrOfHistogramBins - the user-specified number of histogram bins
    • EmpiricalDistribution

      public EmpiricalDistribution​(double[] x, double[] histogramBinRightEdges)
      Constructs an EmpiricalDistribution object for a given array of values and user-specified histogram bin right edges.
      Parameters:
      x - the array of values to estimate the empirical distribution for
      histogramBinRightEdges - the array of values containing the histogram bin right edges
  • Method Details

    • getData

      public double[] getData()
      Retrieves the raw data for this empirical distribution.
      Returns:
      the raw data for this empirical distribution
      See Also:
      getSortedData()
    • getSortedData

      public double[] getSortedData()
      Retrieves the sorted data for this empirical distribution.
      Returns:
      the sorted data for this empirical distribution
      See Also:
      getData()
    • setData

      public void setData​(double[] x)
      Sets the source data for the empirical distribution.

      The Freedman-Diaconis rule is applied for finding the optimal histogram bin width, and consequently the optimal number of histogram bins:

      bin width = 2 * IQR / n^(1/3)

      Parameters:
      x - the array of values to estimate the empirical distribution for
    • setData

      public void setData​(double[] x, int nrOfHistogramBins)
      Sets the source data for the empirical distribution, as well as a user-specified number of histogram bins.
      Parameters:
      x - the array of values to estimate the empirical distribution for
      nrOfHistogramBins - the user-specified number of histogram bins
    • clear

      public void clear()
      Clears the empirical distribution.
    • analyse

      public void analyse()
      Estimates the empirical distribution and analyses its various statistical quantities.
    • getCDF

      public double getCDF​(double x)
      Returns the value of the cumulative distribution function (CDF) evaluated at x.
      Parameters:
      x - the value to evaluate the cumulative distribution function at
      Returns:
      the value of the cumulative distribution function evaluated at x
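
      As an illustration of what an empirical CDF evaluates, the sketch below returns the fraction of samples that are less than or equal to x. The class EmpiricalCdfSketch is hypothetical, and the step-function definition is an assumption; the library may interpolate or treat ties differently.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class EmpiricalCdfSketch {

    // Expects the samples in ascending order.
    static double cdf(double[] sorted, double x) {
        // Binary search for the number of samples <= x (upper bound).
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= x) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (double) lo / sorted.length;
    }
}
```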
    • getPercentile

      public double getPercentile​(int percentile)
      Returns the given percentile.
      Parameters:
      percentile - the requested percentile (in the interval [0,100])
      Returns:
      the requested percentile value
    • getPercentile

      public double getPercentile​(double percentile)
      Returns the given percentile.
      Parameters:
      percentile - the requested percentile (in the interval [0.0,100.0])
      Returns:
      the requested percentile value
    • getPercentiles

      public double[] getPercentiles()
      Returns all the percentiles for the range [0,100].
      Returns:
      an array containing all the percentiles in the range [0,100]
    • getXMinimum

      public double getXMinimum()
      Returns the minimum of the input values.
      Returns:
      the minimum of the input values
    • getXMaximum

      public double getXMaximum()
      Returns the maximum of the input values.
      Returns:
      the maximum of the input values
    • getXRange

      public double getXRange()
      Returns the range of the input values.
      Returns:
      the range of the input values
    • getKDEXMinimum

      public double getKDEXMinimum()
      Returns the minimum of the values for a kernel density estimation (KDE) of the probability density function (PDF).
      Returns:
      the minimum of the values for a kernel density estimation (KDE) of the probability density function (PDF)
    • getKDEXMaximum

      public double getKDEXMaximum()
      Returns the maximum of the values for a kernel density estimation (KDE) of the probability density function (PDF).
      Returns:
      the maximum of the values for a kernel density estimation (KDE) of the probability density function (PDF)
    • getKDEXRange

      public double getKDEXRange()
      Returns the range of the values for a kernel density estimation (KDE) of the probability density function (PDF).
      Returns:
      the range of the values for a kernel density estimation (KDE) of the probability density function (PDF)
    • getMedian

      public double getMedian()
      Returns the median (i.e., the 50th percentile).
      Returns:
      the median
    • getInterquartileRange

      public double getInterquartileRange()
      Returns the interquartile range (IQR) (i.e., the difference between the 75th and the 25th percentiles).
      Returns:
      the interquartile range (IQR)
    • recalculatePDF

      public void recalculatePDF()
      Recalculates the probability density function (PDF).

      The Freedman-Diaconis rule is applied for finding the optimal histogram bin width, and consequently the optimal number of histogram bins:

      bin width = 2 * IQR / n^(1/3)

    • recalculatePDF

      public void recalculatePDF​(int nrOfHistogramBins)
      Recalculates the probability density function (PDF) using a user-specified number of histogram bins.
      Parameters:
      nrOfHistogramBins - the user-specified number of histogram bins
    • calculateKDEPDFBandwidth

      public double calculateKDEPDFBandwidth​(MathTools.EKernelType kernelType)
      Calculates the bandwidth for kernel density estimation (KDE) based on Silverman's Rule-of-Thumb.
      Parameters:
      kernelType - the type of kernel function to use in the calculation
      Returns:
      an estimation of the bandwidth
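
      For orientation, the widely used form of Silverman's rule of thumb for a Gaussian kernel is h = 0.9 * min(sigma, IQR / 1.34) * n^(-1/5). The sketch below implements only that Gaussian variant; the class SilvermanBandwidthSketch and its parameters are hypothetical, and how the library adapts the rule to other kernel types is not documented here.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class SilvermanBandwidthSketch {

    // Gaussian-kernel rule of thumb: h = 0.9 * min(sigma, IQR / 1.34) * n^(-1/5).
    static double bandwidth(double standardDeviation, double interquartileRange, int n) {
        double spread = Math.min(standardDeviation, interquartileRange / 1.34);
        return 0.9 * spread * Math.pow(n, -0.2);
    }
}
```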
    • estimateKDEPDF

      public void estimateKDEPDF​(MathTools.EKernelType kernelType, double bandwidth, int nrOfSupportPoints, double minSupport, double maxSupport)
      Estimates the probability density function (PDF) using a specified kernel function.
      Parameters:
      kernelType - the type of kernel function to use
      bandwidth - the bandwidth of the kernel function
      nrOfSupportPoints - the number of (X,Y) values to use for the smoothed 1D function
      minSupport - the minimum value for the support
      maxSupport - the maximum value for the support
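
      To show how the parameters above interact, here is a minimal kernel density estimation sketch that evaluates a Gaussian kernel on equally spaced support points. The class GaussianKdeSketch, the fixed Gaussian kernel, and the returned array layout are assumptions for illustration; the library stores its result in a FunctionLookupTable instead.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class GaussianKdeSketch {

    // Density at x: average of Gaussian kernels centred on each sample.
    static double density(double[] data, double bandwidth, double x) {
        double sum = 0.0;
        for (double xi : data) {
            double u = (x - xi) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
        }
        return sum / (data.length * bandwidth);
    }

    // Returns {xs, ys}: the support points and the density at each of them.
    static double[][] estimate(double[] data, double bandwidth,
            int nrOfSupportPoints, double minSupport, double maxSupport) {
        double[] xs = new double[nrOfSupportPoints];
        double[] ys = new double[nrOfSupportPoints];
        double step = (nrOfSupportPoints > 1)
            ? (maxSupport - minSupport) / (nrOfSupportPoints - 1)
            : 0.0;
        for (int i = 0; i < nrOfSupportPoints; ++i) {
            xs[i] = minSupport + i * step;
            ys[i] = density(data, bandwidth, xs[i]);
        }
        return new double[][] {xs, ys};
    }
}
```

      A larger bandwidth yields a smoother curve, while more support points only refine the grid on which that curve is sampled.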
    • getNrOfHistogramBins

      public int getNrOfHistogramBins()
      Returns the number of histogram bins used for estimating the probability density function (PDF).
      Returns:
      the number of histogram bins used for estimating the probability density function (PDF)
    • getHistogramBinCount

      public double getHistogramBinCount​(int histogramBin)
      Returns the count associated with a specified histogram bin.
      Parameters:
      histogramBin - the histogram bin to look up the count for
      Returns:
      the count associated with the specified histogram bin
    • getHistogramBinCounts

      public double[] getHistogramBinCounts()
      Returns the counts for all the histogram bins.
      Returns:
      an array containing the counts for all the histogram bins
    • getHistogramBinFrequency

      public double getHistogramBinFrequency​(int histogramBin)
      Returns the frequency associated with a specified histogram bin.
      Parameters:
      histogramBin - the histogram bin to look up the frequency for
      Returns:
      the frequency associated with the specified histogram bin
    • getHistogramBinFrequencies

      public double[] getHistogramBinFrequencies()
      Returns the frequencies for all the histogram bins.
      Returns:
      an array containing the frequencies for all the histogram bins
    • getHistogramBinCentre

      public double getHistogramBinCentre​(int histogramBin)
      Returns the centre of a specified histogram bin.
      Parameters:
      histogramBin - the histogram bin to look up the centre for
      Returns:
      the centre of the specified histogram bin
    • getHistogramBinCentres

      public double[] getHistogramBinCentres()
      Returns the centres of all the histogram bins.
      Returns:
      an array containing the centres of all the histogram bins
    • getHistogramBinWidth

      public double getHistogramBinWidth()
      Returns the width of a histogram bin.
      Returns:
      the width of a histogram bin
    • getPDF

      public double getPDF​(double x)
      Returns the value of the probability density function (PDF) evaluated at x (based on a histogram).
      Parameters:
      x - the value to evaluate the probability density function at
      Returns:
      the value of the probability density function evaluated at x
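
      A histogram-based density is conventionally obtained by dividing each bin count by n times the bin width, so that the bars integrate to one. The sketch below assumes equal-width bins between a given minimum and maximum; the class HistogramPdfSketch and its method names are hypothetical and do not describe the library's internals.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class HistogramPdfSketch {

    // Counts samples per equal-width bin on [min, max]; the maximum goes in the last bin.
    static double[] counts(double[] x, double min, double max, int nrOfBins) {
        double[] counts = new double[nrOfBins];
        double width = (max - min) / nrOfBins;
        for (double v : x) {
            if (v < min || v > max) {
                continue; // ignore values outside the histogram range
            }
            int bin = Math.min((int) ((v - min) / width), nrOfBins - 1);
            counts[bin] += 1.0;
        }
        return counts;
    }

    // Density per bin: count / (n * bin width).
    static double[] densities(double[] counts, int n, double binWidth) {
        double[] densities = new double[counts.length];
        for (int i = 0; i < counts.length; ++i) {
            densities[i] = counts[i] / (n * binWidth);
        }
        return densities;
    }
}
```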
    • getKDEPDF

      public double getKDEPDF​(double x)
      Returns the value of the probability density function (PDF) evaluated at x (based on kernel density estimation, KDE).
      Parameters:
      x - the value to evaluate the probability density function at
      Returns:
      the value of the probability density function evaluated at x
    • getFullKDEPDF

      public FunctionLookupTable getFullKDEPDF()
      Returns the complete, previously calculated kernel density estimation (KDE) of the probability density function (PDF).
      Returns:
      the lookup table with X and Y values for the KDE PDF
    • getN

      public int getN()
      Returns the sample size.
      Returns:
      the sample size
    • getExpectedValue

      public double getExpectedValue()
      Returns the expected value for the first moment (population mean), which in this case is approximated by the sample mean.
      Returns:
      the expected value for the first moment
      See Also:
      getMean()
    • getMean

      public double getMean()
      Returns the sample mean, which in this case is an alias for the expected value.
      Returns:
      the sample mean
      See Also:
      getExpectedValue(), getTrimmedMean(double)
    • getTrimmedMean

      public double getTrimmedMean​(double percentageToTrim)
      Returns the trimmed (or truncated) mean, which corresponds to the mean calculated after symmetrically discarding a certain percentage of data points at the high and low end (without interpolation).
      Parameters:
      percentageToTrim - the percentage to trim (left and right combined)
      Returns:
      the trimmed mean
      See Also:
      getMean()
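
      The trimming described above can be sketched as follows. The sketch assumes that half of percentageToTrim is removed from each end and that the number of discarded points per side is rounded down; the class TrimmedMeanSketch is hypothetical and the library's rounding may differ.

```java
import java.util.Arrays;

// Hypothetical helper, not part of org.sm.smtools.
final class TrimmedMeanSketch {

    static double trimmedMean(double[] x, double percentageToTrim) {
        if (percentageToTrim < 0.0 || percentageToTrim >= 100.0) {
            throw new IllegalArgumentException("percentageToTrim must be in [0,100)");
        }
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        // Whole number of points dropped per side (no interpolation).
        int perSide = (int) Math.floor(sorted.length * (percentageToTrim / 100.0) / 2.0);
        double sum = 0.0;
        for (int i = perSide; i < sorted.length - perSide; ++i) {
            sum += sorted[i];
        }
        return sum / (sorted.length - 2 * perSide);
    }
}
```

      With ten samples and percentageToTrim = 20, one point is dropped at each end, which removes the influence of a single extreme value on either side.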
    • getKDEPDFModes

      public FunctionLookupTable getKDEPDFModes()
      Returns all modes (i.e., local maxima) for the calculated kernel density estimation (KDE) of the probability density function (PDF).
      Returns:
      all modes of the specified PDF
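
      Since a KDE is sampled on discrete support points, its modes can be located by scanning for points that are higher than both neighbours. The sketch below returns the indices of such strict local maxima; the class KdeModesSketch is hypothetical, and the handling of plateaus and end points is an assumption (both are ignored here).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of org.sm.smtools.
final class KdeModesSketch {

    // Indices i where y[i] is strictly greater than both neighbours.
    static List<Integer> localMaxima(double[] y) {
        List<Integer> modes = new ArrayList<>();
        for (int i = 1; i < y.length - 1; ++i) {
            if (y[i] > y[i - 1] && y[i] > y[i + 1]) {
                modes.add(i);
            }
        }
        return modes;
    }
}
```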
    • getVariance

      public double getVariance()
      Returns the sample variance (using an unbiased estimator of the population variance).
      Returns:
      the sample variance (using an unbiased estimator of the population variance)
    • getStandardDeviation

      public double getStandardDeviation()
      Returns the standard deviation (i.e., the positive square root of the variance).
      Returns:
      the standard deviation
    • getSkewness

      public double getSkewness()
      Returns the sample skewness (using an unbiased estimator).

      Skewness implies:

      • Positive skew: longer right tail, density mass concentrated on the left.
      • Negative skew: longer left tail, density mass concentrated on the right.

      Note that the amount of skewness is determined as follows:

      • -0.5 ≤ skewness ≤ +0.5: approximately symmetric distribution.
      • -1 ≤ skewness < -0.5, or +0.5 < skewness ≤ +1: moderately skewed distribution.
      • skewness < -1, or skewness > +1: highly skewed distribution.
      Returns:
      the sample skewness (using an unbiased estimator)
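
      The thresholds above can be applied as in the following sketch, which also computes the adjusted Fisher-Pearson sample skewness G1 = sqrt(n(n-1)) / (n-2) * m3 / m2^(3/2). That this is the estimator the library uses is an assumption, and the class SkewnessSketch and its method names are hypothetical.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class SkewnessSketch {

    // Adjusted Fisher-Pearson sample skewness (assumed estimator).
    static double sampleSkewness(double[] x) {
        int n = x.length;
        if (n < 3) {
            throw new IllegalArgumentException("at least 3 samples are required");
        }
        double mean = 0.0;
        for (double v : x) {
            mean += v;
        }
        mean /= n;
        double m2 = 0.0;
        double m3 = 0.0;
        for (double v : x) {
            double d = v - mean;
            m2 += d * d;
            m3 += d * d * d;
        }
        m2 /= n;
        m3 /= n;
        double g1 = m3 / Math.pow(m2, 1.5);
        return Math.sqrt((double) n * (n - 1)) / (n - 2) * g1;
    }

    // Maps a skewness value onto the three categories listed above.
    static String classify(double skewness) {
        double magnitude = Math.abs(skewness);
        if (magnitude <= 0.5) {
            return "approximately symmetric";
        }
        if (magnitude <= 1.0) {
            return "moderately skewed";
        }
        return "highly skewed";
    }
}
```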
    • getSkewnessConfidenceBounds

      public double getSkewnessConfidenceBounds()
      Returns the symmetric confidence bounds of the skewness for a 95% confidence interval, defined as twice the standard error of skewness (SES).
      Returns:
      the symmetric confidence bounds of the skewness for a 95% confidence interval
    • getSkewnessZStatistic

      public double getSkewnessZStatistic()
      Returns a two-tailed test statistic Z of skewness (different from zero) with a 5% significance level.
      • Z > +2: population is very likely positively skewed.
      • Z < -2: population is very likely negatively skewed.
      • -2 ≤ Z ≤ +2: inconclusive (might be symmetric, might be skewed).

      The larger |Z|, the higher the probability.

      Returns:
      the skewness Z-statistic
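
      A common way to obtain such a Z-statistic is to divide the sample skewness by its standard error, SES = sqrt(6n(n-1) / ((n-2)(n+1)(n+3))). The sketch below uses that textbook formula; whether the library computes SES exactly this way is an assumption, and the class SkewnessZSketch is hypothetical.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class SkewnessZSketch {

    // Standard error of skewness for a sample of size n (textbook formula).
    static double standardErrorOfSkewness(int n) {
        if (n < 3) {
            throw new IllegalArgumentException("at least 3 samples are required");
        }
        double nd = n; // use doubles to avoid integer overflow for large n
        return Math.sqrt(6.0 * nd * (nd - 1.0) / ((nd - 2.0) * (nd + 1.0) * (nd + 3.0)));
    }

    // Z = skewness / SES; |Z| > 2 is read as significant at roughly the 5% level.
    static double zStatistic(double skewness, int n) {
        return skewness / standardErrorOfSkewness(n);
    }
}
```

      For n = 100 the standard error is about 0.2414, so a sample skewness of 0.5 gives Z of about 2.07, just beyond the +2 threshold.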
    • getKurtosis

      public double getKurtosis()
      Returns the sample kurtosis (using an unbiased estimator).

      The value returned is the excess kurtosis, such that it is zero for a normal distribution:

      • Mesokurtic: has zero excess (e.g., normal distribution).
      • Leptokurtic: has positive excess, higher and sharper central peak, with longer and fatter tails (i.e., more extreme values).
      • Platykurtic: has negative excess, lower and broader central peak, with shorter and thinner tails (i.e., fewer extreme values).

      As the kurtosis increases, more probability mass is transferred from the distribution's shoulders to the centre and tails.

      Returns:
      the sample kurtosis (using an unbiased estimator)
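
      One standard estimator of the sample excess kurtosis is G2 = (n-1) / ((n-2)(n-3)) * ((n+1) * g2 + 6), with g2 = m4 / m2^2 - 3. The sketch below implements that formula; it is an assumption that the library uses the same estimator, and the class ExcessKurtosisSketch is hypothetical.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class ExcessKurtosisSketch {

    static double sampleExcessKurtosis(double[] x) {
        int n = x.length;
        if (n < 4) {
            throw new IllegalArgumentException("at least 4 samples are required");
        }
        double mean = 0.0;
        for (double v : x) {
            mean += v;
        }
        mean /= n;
        double m2 = 0.0;
        double m4 = 0.0;
        for (double v : x) {
            double d = v - mean;
            m2 += d * d;
            m4 += d * d * d * d;
        }
        m2 /= n;
        m4 /= n;
        // Population-style excess kurtosis, then the small-sample correction.
        double g2 = m4 / (m2 * m2) - 3.0;
        return (n - 1.0) / ((n - 2.0) * (n - 3.0)) * ((n + 1.0) * g2 + 6.0);
    }
}
```

      For the evenly spread values 1 to 5 this yields -1.2, i.e., a platykurtic sample.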
    • getKurtosisZStatistic

      public double getKurtosisZStatistic()
      Returns a two-tailed test statistic Z of kurtosis (different from zero) with a 5% significance level.
      • Z > +2: population has very likely positive kurtosis (leptokurtic).
      • Z < -2: population has very likely negative kurtosis (platykurtic).
      • -2 ≤ Z ≤ +2: inconclusive (might be negative, zero, or positive kurtosis).

      The larger |Z|, the higher the probability.

      Returns:
      the kurtosis Z-statistic
    • getJarqueBeraTestStatistic

      public double getJarqueBeraTestStatistic()
      Calculates the Jarque-Bera test statistic.

      This is a goodness-of-fit test of whether the distribution's skewness and kurtosis match those of a normal distribution.

      The test result should be compared to the values of the chi-square distribution with 2 degrees of freedom.

      Returns:
      the Jarque-Bera test statistic
      See Also:
      isJarqueBeraTestAccepted(double), getChiSquare(double,int)
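
      The Jarque-Bera statistic is usually defined as JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K the excess kurtosis. The sketch below evaluates that definition directly; the class JarqueBeraSketch and its parameters are hypothetical, and which skewness and kurtosis estimators the library feeds into the formula is not stated here.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class JarqueBeraSketch {

    // JB = n/6 * (S^2 + K^2/4), with K the excess kurtosis.
    static double statistic(int n, double skewness, double excessKurtosis) {
        return (n / 6.0) * (skewness * skewness + excessKurtosis * excessKurtosis / 4.0);
    }
}
```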
    • isJarqueBeraTestAccepted

      public boolean isJarqueBeraTestAccepted​(double alpha)
      Compares the Jarque-Bera test statistic with the chi-square distribution with 2 degrees of freedom for a given alpha level.

      Alpha levels can be 0.995, 0.99, 0.975, 0.95, 0.90, 0.10, 0.05, 0.025, 0.01, or 0.005.

      Parameters:
      alpha - the alpha level
      Returns:
      true if the test is accepted, false if it is rejected
      See Also:
      getJarqueBeraTestStatistic(), getChiSquare(double,int)
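
      For 2 degrees of freedom the chi-square distribution has the closed-form CDF 1 - exp(-x/2), so the critical value for an upper-tail probability alpha is -2 * ln(alpha). The sketch below uses that identity to show the comparison; reading alpha as the upper-tail probability is an assumption, the class JarqueBeraDecisionSketch is hypothetical, and the library itself appears to look values up for the listed alpha levels.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class JarqueBeraDecisionSketch {

    // Critical value of the chi-square distribution with 2 degrees of freedom,
    // assuming alpha is the upper-tail probability.
    static double chiSquareCritical2Dof(double alpha) {
        return -2.0 * Math.log(alpha);
    }

    // Normality is not rejected while the statistic stays at or below the critical value.
    static boolean isAccepted(double jarqueBeraStatistic, double alpha) {
        return jarqueBeraStatistic <= chiSquareCritical2Dof(alpha);
    }
}
```

      With alpha = 0.05 the critical value is about 5.99, so a statistic of 10 leads to rejection while a statistic of 2 does not.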
    • getChiSquare

      public static double getChiSquare​(double alpha, int degreesOfFreedom)
      Returns the chi-square value corresponding to a specified alpha level and number of degrees of freedom.

      Alpha levels can be 0.995, 0.99, 0.975, 0.95, 0.90, 0.10, 0.05, 0.025, 0.01, or 0.005.

      The number of degrees of freedom is clipped between 1 and 100.

      Parameters:
      alpha - the alpha level
      degreesOfFreedom - the number of degrees of freedom
      Returns:
      the chi-square value corresponding to the specified alpha level and number of degrees of freedom
      See Also:
      getJarqueBeraTestStatistic(), isJarqueBeraTestAccepted(double)
    • getZScores

      public double[] getZScores()
      Returns the calculated z-scores, defined as:

      (value - mean) / standard deviation

      Returns:
      the z-scores
      See Also:
      getOutliers()
    • getOutliers

      public boolean[] getOutliers()
      Returns flags marking the outliers, which are defined as having z-scores greater than 3.
      Returns:
      the outliers
      See Also:
      getZScores()
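
      The z-score definition and the outlier criterion can be sketched together as follows. The sketch uses the sample standard deviation (n - 1 in the denominator) and flags a value when the absolute z-score exceeds 3; both choices are assumptions, and the class ZScoreSketch is hypothetical.

```java
// Hypothetical helper, not part of org.sm.smtools.
final class ZScoreSketch {

    // z[i] = (x[i] - mean) / standard deviation
    static double[] zScores(double[] x) {
        int n = x.length;
        double mean = 0.0;
        for (double v : x) {
            mean += v;
        }
        mean /= n;
        double sumOfSquares = 0.0;
        for (double v : x) {
            sumOfSquares += (v - mean) * (v - mean);
        }
        double standardDeviation = Math.sqrt(sumOfSquares / (n - 1));
        double[] z = new double[n];
        for (int i = 0; i < n; ++i) {
            z[i] = (x[i] - mean) / standardDeviation;
        }
        return z;
    }

    // One flag per input value: true when |z| > 3 (absolute value is an assumption).
    static boolean[] outliers(double[] x) {
        double[] z = zScores(x);
        boolean[] flags = new boolean[z.length];
        for (int i = 0; i < z.length; ++i) {
            flags[i] = Math.abs(z[i]) > 3.0;
        }
        return flags;
    }
}
```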
    • getMeanDescription

      public static java.lang.String getMeanDescription()
      Returns a descriptive label of the mean (expected value).
      Returns:
      a descriptive label of the mean
    • getStandardDeviationDescription

      public static java.lang.String getStandardDeviationDescription()
      Returns a descriptive label of the standard deviation.
      Returns:
      a descriptive label of the standard deviation
    • getVarianceDescription

      public static java.lang.String getVarianceDescription()
      Returns a descriptive label of the variance.
      Returns:
      a descriptive label of the variance
    • getMedianDescription

      public static java.lang.String getMedianDescription()
      Returns a descriptive label of the median.
      Returns:
      a descriptive label of the median
    • getInterquartileRangeDescription

      public static java.lang.String getInterquartileRangeDescription()
      Returns a descriptive label of the interquartile range (IQR).
      Returns:
      a descriptive label of the interquartile range (IQR)
    • getPercentileDescription

      public static java.lang.String getPercentileDescription()
      Returns a descriptive label of a percentile.
      Returns:
      a descriptive label of a percentile
    • getSkewnessDescription

      public static java.lang.String getSkewnessDescription()
      Returns a descriptive label of the skewness.
      Returns:
      a descriptive label of the skewness
    • getKurtosisDescription

      public static java.lang.String getKurtosisDescription()
      Returns a descriptive label of the kurtosis.
      Returns:
      a descriptive label of the kurtosis
    • getSkewnessInterpretation

      public java.lang.String getSkewnessInterpretation()
      Returns a qualitative description of the skewness based on its test statistic.
      Returns:
      a qualitative description of the skewness
    • getKurtosisInterpretation

      public java.lang.String getKurtosisInterpretation()
      Returns a qualitative description of the kurtosis based on its test statistic.
      Returns:
      a qualitative description of the kurtosis