Nonparametric and Empirical Probability Distributions

Overview

In some situations, you cannot accurately describe a data sample using a parametric distribution. Instead, the probability density function (pdf) or cumulative distribution function (cdf) must be estimated from the data. Statistics and Machine Learning Toolbox™ provides several options for estimating the pdf or cdf from sample data.

Kernel Distribution

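A kernel distribution produces a nonparametric probability density estimate that adapts itself to the data, rather than selecting a density with a particular parametric form and estimating the parameters. The distribution is defined by a kernel density estimator, a smoothing function that determines the shape of the curve used to generate the pdf, and a bandwidth value that controls the smoothness of the resulting density curve.

Like a histogram, the kernel distribution builds a function that represents the probability distribution using the sample data. Unlike a histogram, which places the values into discrete bins, a kernel distribution sums the component smoothing functions for each data value to produce a smooth, continuous probability curve. The following plots show a visual comparison of a histogram and a kernel distribution generated from the same sample data.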

A histogram represents the probability distribution by establishing bins and placing each data value in the appropriate bin. Because of this bin count approach, the histogram produces a discrete probability density function. This might be unsuitable for certain applications, such as generating random numbers from a fitted distribution.

Alternatively, the kernel distribution builds the probability density function (pdf) by creating an individual probability density curve for each data value, then summing the smooth curves. This approach creates one smooth, continuous probability density function for the data set.
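The summing of per-observation curves can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the Toolbox implementation: it assumes a Gaussian smoothing function and a fixed bandwidth, and the names `gaussian_kernel` and `kernel_pdf`, the sample values, and the bandwidth are all hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing function (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_pdf(x, data, bandwidth):
    """Kernel density estimate at x: the average of one scaled kernel per data value."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

# Hypothetical sample values, chosen only for illustration.
data = [1.2, 1.9, 2.1, 2.4, 3.7, 4.0]
bandwidth = 0.5  # assumed value; larger bandwidths give smoother curves

# Evaluate the estimate on a grid and check that it integrates to about 1.
step = 0.01
grid = [-5.0 + i * step for i in range(1501)]  # covers [-5, 10]
density = [kernel_pdf(x, data, bandwidth) for x in grid]
area = sum(density) * step
print(round(area, 3))
```

Because every data value contributes its own smooth curve, the estimate is positive between observations as well as at them, which is what a bin count cannot provide.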

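For more general information about kernel distributions, see Kernel Distribution. For information on how to work with a kernel distribution, see Using KernelDistribution Objects and ksdensity.

Empirical Cumulative Distribution Function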

An empirical cumulative distribution function (ecdf) estimates the cdf of a random variable by assigning equal probability to each observation in a sample. Because of this approach, the ecdf is a discrete cumulative distribution function that creates an exact match between the ecdf and the distribution of the sample data.
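The equal-probability rule can be sketched as a step function in plain Python. This is a minimal illustration, not the Toolbox `ecdf` function; the helper name and the sample values are hypothetical.

```python
import bisect

def make_ecdf(data):
    """Return a function F where F(x) is the fraction of observations <= x."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(ordered, x) / n

    return F

# Hypothetical sample of five observations; each carries probability 1/5.
sample = [0.3, -1.1, 0.8, 2.0, -0.4]
F = make_ecdf(sample)
print(F(-2.0), F(0.3), F(2.0))  # prints: 0.0 0.6 1.0
```

The function jumps by 1/n at each observation and is flat in between, which is why the estimate is discrete.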

The following plot shows a visual comparison of the ecdf of 20 random numbers generated from a standard normal distribution, and the theoretical cdf of a standard normal distribution. The circles indicate the value of the ecdf calculated at each sample data point. The dashed line that passes through each circle visually represents the ecdf, although the ecdf is not a continuous function. The solid line shows the theoretical cdf of the standard normal distribution from which the random numbers in the sample data were drawn.

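The ecdf is similar in shape to the theoretical cdf, although it is not an exact match. Instead, the ecdf is an exact match to the sample data. The ecdf is a discrete function and is not smooth, especially in the tails where data might be sparse. You can smooth the distribution with Pareto tails, using the paretotails function.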

For more information and additional syntax options, see ecdf. To construct a continuous function based on cdf values computed from sample data, see Piecewise Linear Distribution.

Piecewise Linear Distribution

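A piecewise linear distribution estimates an overall cdf for the sample data by computing the cdf value at each individual point, and then linearly connecting these values to form a continuous curve.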

The following plot shows the cdf for a piecewise linear distribution based on a sample of hospital patients’ weight measurements. The circles represent each individual data point (weight measurement). The black line that passes through each data point represents the piecewise linear distribution cdf for the sample data.

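A piecewise linear distribution linearly connects the cdf values calculated at each sample data point to form a continuous curve. By contrast, an empirical cumulative distribution function constructed using the ecdf function produces a discrete cdf. For example, random numbers generated from the ecdf can only include x values contained in the original sample data. Random numbers generated from a piecewise linear distribution can include any x value between the lower and upper limits contained in the sample data.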
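The difference in the random numbers each approach can produce is easiest to see by inverting a piecewise linear cdf. The following plain-Python sketch is illustrative only: the weights are hypothetical, and the cdf values assume a simple convention (0 at the smallest point, 1 at the largest, equal steps in between) that is not necessarily the one the Toolbox uses.

```python
import bisect
import random

def piecewise_linear_inverse(xs, Fs, p):
    """Invert the piecewise linear cdf through the points (xs[i], Fs[i]) at probability p.

    Assumes xs and Fs are strictly increasing, with Fs[0] == 0 and Fs[-1] == 1.
    """
    # Locate the segment with Fs[i] <= p <= Fs[i + 1].
    i = min(bisect.bisect_right(Fs, p) - 1, len(xs) - 2)
    i = max(i, 0)
    t = (p - Fs[i]) / (Fs[i + 1] - Fs[i])
    return xs[i] + t * (xs[i + 1] - xs[i])

# Hypothetical weight measurements, sorted and distinct.
xs = sorted([118.0, 131.0, 142.0, 155.0, 180.0])
# Assumed cdf values at each point: 0, 0.25, 0.5, 0.75, 1.
Fs = [i / (len(xs) - 1) for i in range(len(xs))]

# Inverse-transform sampling: map uniform draws through the inverse cdf.
rng = random.Random(0)
draws = [piecewise_linear_inverse(xs, Fs, rng.random()) for _ in range(1000)]
print(min(draws) >= xs[0], max(draws) <= xs[-1])
```

Every draw falls between the smallest and largest sample value, but most draws are not equal to any of the five original measurements.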

Because the piecewise linear distribution cdf is constructed from the values contained in the sample data, the resulting curve is often not smooth, especially in the tails where data might be sparse. You can smooth the distribution with Pareto tails, using the paretotails function.

For information on how to work with a piecewise linear distribution, see Using PiecewiseLinearDistribution Objects.

Pareto Tails

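Pareto tails use a piecewise approach to improve the fit of a nonparametric cdf by smoothing the tails of the distribution. You can fit a kernel distribution, empirical cdf, or a user-defined estimator to the middle data values, then fit generalized Pareto distribution curves to the tails. This technique is especially useful when the sample data is sparse in the tails.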
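As a rough illustration of how a tail piece attaches to the middle of the distribution, the plain-Python sketch below evaluates an upper-tail cdf above a threshold. It does not fit anything and is not the paretotails implementation: the threshold, the cumulative probability at the threshold, and the generalized Pareto shape and scale values are all assumed for illustration.

```python
import math

def gpd_cdf(y, shape, scale):
    """Generalized Pareto cdf for an exceedance y over the threshold."""
    if y <= 0.0:
        return 0.0
    if shape == 0.0:
        return 1.0 - math.exp(-y / scale)
    base = 1.0 + shape * y / scale
    if base <= 0.0:  # beyond the finite upper endpoint when shape < 0
        return 1.0
    return 1.0 - base ** (-1.0 / shape)

def upper_tail_cdf(x, threshold, p_threshold, shape, scale):
    """Cdf at or above the threshold: the remaining mass follows a generalized Pareto curve."""
    return p_threshold + (1.0 - p_threshold) * gpd_cdf(x - threshold, shape, scale)

# Assumed values: the upper 10 percent of the data lies above 1.3.
threshold, p_threshold = 1.3, 0.9
shape, scale = 0.1, 0.5  # hypothetical generalized Pareto parameters

tail = [upper_tail_cdf(threshold + 0.25 * i, threshold, p_threshold, shape, scale)
        for i in range(9)]
print([round(v, 4) for v in tail])
```

The tail piece starts at the cumulative probability reached by the middle estimator at the threshold and rises smoothly toward 1, so the combined cdf stays continuous at the boundary.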

The following plot shows the empirical cdf (ecdf) of a data sample containing 20 random numbers. The solid line represents the ecdf, and the dashed line represents the empirical cdf with Pareto tails fit to the lower and upper 10 percent of the data. The circles denote the boundaries for the lower and upper 10 percent of the data.

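Fitting Pareto tails to the lower and upper 10 percent of the sample data makes the cdf smoother in the tails, where the data is sparse. For more information on working with Pareto tails, see paretotails.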

Triangular Distribution

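A triangular distribution provides a simplistic representation of the probability distribution when limited sample data is available. This continuous distribution is parameterized by a lower limit, peak location, and upper limit. These points are linearly connected to estimate the pdf of the sample data. You can use the mean, median, or mode of the data as the peak location.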
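The three-parameter form can be sketched in plain Python as follows. This is an illustration only, not the Toolbox implementation; the function name and the sample of ten integers are hypothetical, and the mode is used as the peak location.

```python
import random
import statistics

def triangular_pdf(x, lower, peak, upper):
    """Pdf of a triangular distribution with lower <= peak <= upper and lower < upper."""
    if x < lower or x > upper:
        return 0.0
    height = 2.0 / (upper - lower)  # pdf value at the peak, so the area is 1
    if x < peak:
        return height * (x - lower) / (peak - lower)
    if x > peak:
        return height * (upper - x) / (upper - peak)
    return height

# Hypothetical sample of 10 integers from 0 to 5.
sample = [0, 1, 2, 2, 3, 3, 3, 4, 5, 3]
lower, upper = min(sample), max(sample)
peak = statistics.mode(sample)  # most frequently occurring value, here 3

print(triangular_pdf(peak, lower, peak, upper))  # prints: 0.4

# The standard library can draw from the same distribution.
rng = random.Random(0)
draws = [rng.triangular(lower, upper, peak) for _ in range(1000)]
```

Only the minimum, maximum, and peak of the sample enter the model, which is why it remains usable when very little data is available.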

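The following plot shows the triangular distribution pdf of a random sample of 10 integers from 0 to 5. The lower limit is the smallest integer in the sample data, and the upper limit is the largest integer. The peak for this plot is at the mode, or most frequently occurring value, in the sample data.

Business applications such as simulation and project management sometimes use a triangular distribution to create models when limited sample data exists. For more information, see Triangular Distribution.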
