In statistics and data analysis the application software CumFreq is a tool for cumulative frequency analysis of a single variable and for probability distribution fitting.
Originally the method was developed for the analysis of hydrological measurements of spatially varying magnitudes (e.g. hydraulic conductivity of the soil) and of magnitudes varying in time (e.g. rainfall, river discharge) to find their return periods. However, it can be used for many other types of phenomena, including those that contain negative values.
Software features

CumFreq uses the plotting position approach to estimate the ''cumulative frequency'' of each of the observed magnitudes in a data series of the variable.[''Frequency and Regression Analysis''. Chapter 6 in: H.P. Ritzema (ed., 1994), ''Drainage Principles and Applications'', Publ. 16, pp. 175–224, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. Free download as PDF from the ILRI website.]
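The plotting position approach can be illustrated with a minimal sketch. It assumes the Weibull plotting position r / (n + 1), where r is the rank of a value in ascending order and n the number of observations; the formula CumFreq actually applies may differ, and the function name and data are hypothetical.

```python
def plotting_positions(values):
    """Return (value, estimated cumulative frequency) pairs.

    Assumption: Weibull plotting position r / (n + 1), with r the rank
    in ascending order. CumFreq's own formula may differ.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(x, rank / (n + 1)) for rank, x in enumerate(ordered, start=1)]


rainfall = [12.0, 30.5, 7.2, 18.9]  # hypothetical observations
for value, freq in plotting_positions(rainfall):
    print(f"{value:6.1f}  {freq:.2f}")
```

Dividing by n + 1 rather than n keeps every estimate strictly between 0 and 1, so the largest observation is not assigned a cumulative frequency of exactly 1.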
The computer program allows determination of the best fitting probability distribution. Alternatively it provides the user with the option to select the probability distribution to be fitted. The following probability distributions are included: normal, lognormal, logistic, loglogistic, exponential, Cauchy, Fréchet, Gumbel, Pareto, Weibull, generalized extreme value, Laplace, Burr (Dagum mirrored), Dagum (Burr mirrored), Gompertz, Student, and others.
Another characteristic of CumFreq is that it provides the option to use two different probability distributions, one for the lower data range, and one for the higher. The ranges are separated by a break-point. The use of such composite (discontinuous) probability distributions can be useful when the data of the phenomenon studied were obtained under different conditions.
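A composite distribution with a break-point could be evaluated along the following lines. This is only a sketch: the two normal distributions, their parameters, and the break-point are hypothetical, and how CumFreq fits each range is not described here.

```python
from statistics import NormalDist


def composite_cdf(x, break_point, lower_dist, upper_dist):
    """Evaluate the lower-range distribution at or below the break-point
    and the upper-range distribution above it."""
    dist = lower_dist if x <= break_point else upper_dist
    return dist.cdf(x)


lower = NormalDist(mu=10.0, sigma=2.0)  # hypothetical fit to the lower range
upper = NormalDist(mu=12.0, sigma=5.0)  # hypothetical fit to the upper range
break_point = 11.0                      # hypothetical break-point

for x in (8.0, 11.0, 11.5, 16.0):
    print(x, round(composite_cdf(x, break_point, lower, upper), 3))
```

Because the two parts are fitted separately, the resulting curve can jump at the break-point, which is what makes the composite distribution discontinuous.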
During the input phase, the user can select the number of intervals needed to determine the histogram. The user may also define a threshold to obtain a truncated distribution.
The output section provides a calculator to facilitate interpolation and extrapolation.
Further it gives the option to see the Q–Q plot in terms of calculated and observed cumulative frequencies.
ILRI[''Drainage research in farmers' fields: analysis of data'', 2002. Contribution to the project "Liquid Gold" of the International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands] provides examples of application to magnitudes like crop yield, watertable depth, soil salinity, hydraulic conductivity, rainfall, and river discharge.
Generalizing distributions
The program can produce generalizations of the normal, logistic, and other distributions by transforming the data using an exponent that is optimized to obtain the best fit.
This feature is uncommon in other distribution-fitting software, which normally includes only a logarithmic transformation of the data, yielding distributions like the lognormal and loglogistic.
Generalization of symmetrical distributions (like the normal and the logistic) makes them applicable to data obeying a distribution that is skewed to the right (using an exponent <1) as well as to data obeying a distribution that is skewed to the left (using an exponent >1). This enhances the versatility of symmetrical distributions.
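The idea of an optimized exponent can be sketched as follows. The sketch assumes positive data, Weibull plotting positions r / (n + 1), a normal distribution fitted by moments to the transformed data, and a simple grid search that minimizes the sum of squared differences between calculated and observed cumulative frequencies. None of these choices is confirmed as CumFreq's own procedure.

```python
from statistics import NormalDist, mean, stdev


def fit_power_normal(values, exponents):
    """Return (exponent, fitted NormalDist) for the exponent e at which a
    normal distribution fitted to x**e best matches the plotting positions.

    Assumptions: positive data, Weibull plotting positions, least squares.
    """
    ordered = sorted(values)
    n = len(ordered)
    observed = [r / (n + 1) for r in range(1, n + 1)]
    best_e, best_dist, best_sse = None, None, float("inf")
    for e in exponents:
        transformed = [x ** e for x in ordered]
        dist = NormalDist(mean(transformed), stdev(transformed))
        sse = sum((dist.cdf(t) - p) ** 2 for t, p in zip(transformed, observed))
        if sse < best_sse:
            best_e, best_dist, best_sse = e, dist, sse
    return best_e, best_dist


data = [1.0, 4.0, 9.0, 16.0, 25.0]  # hypothetical right-skewed sample
exponent, fitted = fit_power_normal(data, [0.5, 1.0, 2.0])
print(exponent)
```

In this hypothetical sample the square roots are evenly spaced, so an exponent below 1 removes the right skew, in line with the rule stated above.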
Inverting distributions
Skew distributions can be mirrored by distribution inversion (see survival function, or complementary distribution function) to change the skewness from positive to negative and vice versa. This increases the number of applicable distributions and the chance of finding a better fit. CumFreq makes use of that opportunity.
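Mirroring can be expressed through the complementary distribution function: if F is the CDF of a continuous variable X, then 1 − F(−x) is the CDF of −X, which has the opposite skewness. The sketch below applies this to a standard Gumbel distribution; the function names are illustrative, and the exact parameterization CumFreq uses is not given here.

```python
import math


def gumbel_cdf(x, mu=0.0, beta=1.0):
    """CDF of the (right-skewed) Gumbel distribution."""
    return math.exp(-math.exp(-(x - mu) / beta))


def mirrored_cdf(cdf, x):
    """CDF of -X for a continuous X with the given CDF: 1 - F(-x)."""
    return 1.0 - cdf(-x)


for x in (-2.0, 0.0, 2.0):
    print(x, round(gumbel_cdf(x), 3), round(mirrored_cdf(gumbel_cdf, x), 3))
```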
Shifting distributions
When the data contain negative values that a probability distribution does not support, the program shifts the data to the positive side before fitting and shifts the fitted distribution back afterwards.
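A shift of this kind might look as follows for a lognormal fit. The size of the shift (the hypothetical `margin` parameter) and the fit by moments of the logarithms are assumptions made for illustration only.

```python
import math
from statistics import NormalDist, mean, stdev


def fit_shifted_lognormal(values, margin=1.0):
    """Shift the data to the positive side, fit a lognormal distribution,
    and return the shift together with a CDF in the original units.

    Assumption: the smallest value is moved to `margin` (hypothetical).
    """
    smallest = min(values)
    shift = margin - smallest if smallest <= 0 else 0.0
    logs = [math.log(x + shift) for x in values]
    dist = NormalDist(mean(logs), stdev(logs))

    def cdf(x):
        shifted = x + shift
        # Below the shifted origin the lognormal has no probability mass.
        return dist.cdf(math.log(shifted)) if shifted > 0 else 0.0

    return shift, cdf


data = [-3.0, -1.0, 0.5, 2.0, 6.0]  # hypothetical data with negative values
shift, cdf = fit_shifted_lognormal(data)
print(shift, round(cdf(0.0), 3))
```

The returned CDF accepts values in the original units, so the shift stays internal to the fitting step.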
Confidence belts
The software employs the binomial distribution to determine the confidence belt of the corresponding cumulative distribution function.
The prediction of the return period, which is of interest in time series, is also accompanied by a confidence belt. The construction of confidence belts is not found in most other software.
The figure to the right shows the variation that may occur when obtaining samples of a variate that follows a certain probability distribution. The data were provided by Benson.[Benson, M.A. 1960. Characteristics of frequency curves based on a theoretical 1000 year record. In: T. Dalrymple (ed.), Flood frequency analysis. U.S. Geological Survey Water Supply Paper 1543-A, pp. 51–71]
The confidence belt around an experimental cumulative frequency or return period curve gives an impression of the region in which the true distribution may be found.
Also, it clarifies that the experimentally found best fitting probability distribution may deviate from the true distribution.
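The role of the binomial distribution can be sketched as follows. If the true cumulative frequency at some value is p, the number of observations not exceeding that value in a sample of size n follows a binomial distribution with parameters n and p. The sketch takes the lower and upper quantiles of that distribution, divided by n, as belt limits; this is one possible construction and not necessarily the one CumFreq implements.

```python
from math import comb


def binomial_quantile(n, p, q):
    """Smallest k with P(K <= k) >= q for K ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p ** k * (1 - p) ** (n - k)
        if cumulative >= q:
            return k
    return n


def confidence_belt(p, n, level=0.90):
    """Lower and upper limits of the sample cumulative frequency when the
    true cumulative frequency is p and the sample size is n."""
    alpha = 1.0 - level
    lower = binomial_quantile(n, p, alpha / 2) / n
    upper = binomial_quantile(n, p, 1 - alpha / 2) / n
    return lower, upper


print(confidence_belt(0.5, 10))    # small sample: wide belt
print(confidence_belt(0.5, 1000))  # large sample: narrow belt
```

The belt narrows as the sample size grows, which reflects the decreasing sampling variation of the experimental curve.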
Goodness of fit
CumFreq produces a list of distributions ranked by goodness of fit.
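A ranking of this kind can be sketched as below. The goodness-of-fit measure is an assumption: the sketch uses the mean absolute difference between calculated and observed cumulative frequencies (the latter from Weibull plotting positions), and the candidate distributions and data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def rank_by_fit(values, candidates):
    """Return candidate names ordered from best to worst fit.

    Assumed criterion: mean absolute difference between the candidate CDF
    and the plotting positions r / (n + 1).
    """
    ordered = sorted(values)
    n = len(ordered)
    observed = [r / (n + 1) for r in range(1, n + 1)]
    scores = []
    for name, cdf in candidates.items():
        error = sum(abs(cdf(x) - p) for x, p in zip(ordered, observed)) / n
        scores.append((error, name))
    return [name for _, name in sorted(scores)]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # hypothetical sample
normal = NormalDist(mean(data), stdev(data))
candidates = {
    "normal": normal.cdf,
    "exponential": lambda x: 1.0 - math.exp(-x / mean(data)),
}
print(rank_by_fit(data, candidates))
```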
Histogram and density function
From the cumulative distribution function (CDF) one can derive a histogram and the probability density function (PDF).
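Both derivations follow from the definition of the CDF: the expected number of observations in an interval from a to b is n (F(b) − F(a)), and the PDF is the derivative of F. The sketch below uses a standard normal distribution as a stand-in for a fitted distribution and a central difference for the derivative; the function names and step size are illustrative.

```python
from statistics import NormalDist


def expected_histogram(cdf, edges, n):
    """Expected count per interval [a, b): n * (F(b) - F(a))."""
    return [n * (cdf(b) - cdf(a)) for a, b in zip(edges, edges[1:])]


def density(cdf, x, h=1e-5):
    """Central-difference approximation of the PDF, f(x) = dF/dx."""
    return (cdf(x + h) - cdf(x - h)) / (2 * h)


fitted = NormalDist(0.0, 1.0)  # stand-in for a fitted distribution
counts = expected_histogram(fitted.cdf, [-1.0, 0.0, 1.0], n=100)
print([round(c, 2) for c in counts])
print(round(density(fitted.cdf, 0.0), 4))
```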
Calculator
The software offers the option to use a probability distribution calculator. The cumulative frequency and the return period are given as a function of the data value entered as input. In addition, the confidence intervals are shown. Conversely, the value is presented upon giving the cumulative frequency or the return period.
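The two directions of such a calculator can be sketched with a normal distribution standing in for the fitted one. The sketch assumes the usual relation T = 1 / (1 − F) between the return period T of exceedance and the cumulative frequency F; the distribution parameters and function names are hypothetical.

```python
from statistics import NormalDist

fitted = NormalDist(mu=100.0, sigma=20.0)  # hypothetical fitted distribution


def frequency_and_return_period(x):
    """Cumulative frequency F(x) and return period T = 1 / (1 - F)."""
    f = fitted.cdf(x)
    return f, 1.0 / (1.0 - f)


def value_for_return_period(t):
    """Inverse direction: the value exceeded on average once in t periods."""
    return fitted.inv_cdf(1.0 - 1.0 / t)


print(frequency_and_return_period(120.0))
print(value_for_return_period(10.0))
```

The return period is expressed in units of the observation interval of the data series, for example years when the series holds annual maxima.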
See also
* Distribution fitting