Percentage ratio of the standard deviation. Average linear deviation

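The most complete measure of variation is the root-mean-square deviation, also called the standard deviation. The standard deviation (σ) is equal to the square root of the mean squared deviation of the individual values of the feature from the arithmetic mean.

The simple (unweighted) standard deviation:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$

The weighted standard deviation is used for grouped data:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$$

Here $x_i$ are the individual values, $\bar{x}$ is their arithmetic mean, $n$ is the number of values and $f_i$ are the frequencies.

Under a normal distribution, the standard deviation and the mean linear deviation $\bar{d}$ are related approximately as $\sigma \approx 1.25\,\bar{d}$.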
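As a rough illustration of these measures, the Python sketch below computes the simple and weighted standard deviation and the mean linear deviation. The function names and the small data set are illustrative assumptions, not taken from the original text. The constant √(π/2) ≈ 1.2533 is the theoretical ratio of the two deviations for a normal distribution, which the text rounds to 1.25.

```python
import math

def simple_std(values):
    """Unweighted standard deviation: root of the mean squared deviation."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def weighted_std(variants, freqs):
    """Weighted standard deviation for grouped data (variants with frequencies)."""
    total = sum(freqs)
    mean = sum(x * f for x, f in zip(variants, freqs)) / total
    squared = sum((x - mean) ** 2 * f for x, f in zip(variants, freqs))
    return math.sqrt(squared / total)

def mean_linear_deviation(values):
    """Mean of the absolute deviations from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Hypothetical data: the same observations, ungrouped and grouped.
values = [2, 4, 4, 4, 5, 5, 7, 9]
variants = [2, 4, 5, 7, 9]
freqs = [1, 3, 2, 1, 1]

# Theoretical ratio sigma / mean linear deviation for a normal distribution.
normal_ratio = math.sqrt(math.pi / 2)

print(simple_std(values), weighted_std(variants, freqs))
print(mean_linear_deviation(values), round(normal_ratio, 4))
```

The grouped and ungrouped calculations agree because the frequencies simply count repeated values. The sample data are not normally distributed, so their own ratio differs from 1.25.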

The standard deviation, being the main absolute measure of variation, is used to determine the ordinates of the normal distribution curve, in calculations related to organizing sample observation and establishing the accuracy of sample characteristics, and when assessing the limits of variation of a feature in a homogeneous population.

Variance, its types, standard deviation.

Variance of a random variable is a measure of the spread of that random variable, i.e., of its deviation from the mathematical expectation. In statistics, the notation $\sigma_X^2$ or $\sigma^2$ is often used. The square root of the variance is called the root-mean-square deviation, the standard deviation, or the standard spread.

Total variance ($\sigma^2$) measures the variation of a trait in the whole population under the influence of all the factors that caused this variation. At the same time, the grouping method makes it possible to isolate and measure the variation due to the grouping trait and the variation arising under the influence of unaccounted factors.

Intergroup variance ($\sigma^2_{mg}$) characterizes systematic variation, i.e., differences in the value of the trait under study that arise under the influence of the factor trait underlying the grouping.

Standard deviation (synonyms: root-mean-square deviation, mean square deviation; related term: standard spread) is, in probability theory and statistics, the most common indicator of the dispersion of the values of a random variable about its mathematical expectation. With a limited sample of values, the arithmetic mean of the sample is used in place of the mathematical expectation.

The standard deviation is measured in units of measurement of the random variable itself and is used to calculate the standard error of the arithmetic mean, when constructing confidence intervals, when statistically testing hypotheses, when measuring the linear relationship between random variables. Defined as the square root of the variance of the random variable.


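Root-mean-square deviation:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Standard deviation (an estimate of the root-mean-square deviation of a random variable $x$ about its mathematical expectation, based on an unbiased estimate of its variance):

$$s = \sqrt{\frac{n}{n-1}\,\sigma^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

where $\sigma^2$ is the variance; $x_i$ is the $i$-th element of the sample; $n$ is the sample size; $\bar{x}$ is the arithmetic mean of the sample:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$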

It should be noted that both estimates are biased. In the general case, it is impossible to construct an unbiased estimate. However, the estimate based on the unbiased estimate of the variance is consistent.
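The difference between the two estimates can be checked with Python's standard `statistics` module: `pstdev` divides by n, while `stdev` divides by n - 1. The data set is a hypothetical example chosen only for illustration.

```python
import math
import statistics

# Hypothetical sample used only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

sigma = statistics.pstdev(sample)  # root-mean-square deviation, divides by n
s = statistics.stdev(sample)       # estimate based on the unbiased variance, divides by n - 1

# The two estimates differ only by the factor sqrt(n / (n - 1)).
print(sigma, s, math.sqrt(n / (n - 1)) * sigma)
```

For large n the factor sqrt(n / (n - 1)) approaches 1, so the two estimates become practically indistinguishable.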

Essence, scope and procedure for determining the mode and median.

In addition to power means, statistics uses structural averages to characterize the magnitude of a varying feature and the internal structure of a distribution series. The main structural averages are the mode and the median.

Mode is the most frequently occurring variant of the series. The mode is used, for example, to determine the sizes of clothes and shoes that are in greatest demand among customers. For a discrete series, the mode is the variant with the highest frequency. To calculate the mode for an interval variation series, first determine the modal interval (by the maximum frequency) and then the modal value of the feature by the formula:

$$Mo = x_{Mo} + i_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$$

- $Mo$ is the value of the mode

- $x_{Mo}$ is the lower bound of the modal interval

- $i_{Mo}$ is the width of the interval

- $f_{Mo}$ is the frequency of the modal interval

- $f_{Mo-1}$ is the frequency of the interval preceding the modal one

- $f_{Mo+1}$ is the frequency of the interval following the modal one
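A minimal Python sketch of the interval-mode calculation, using the quantities listed above (lower bound and width of the modal interval, and the frequencies of the modal, preceding and following intervals). The function name and the frequency table are hypothetical, and equal-width intervals are assumed.

```python
def interval_mode(lower_bounds, width, freqs):
    """Mode of an interval series with equal-width intervals."""
    # Index of the modal interval: the one with the maximum frequency.
    k = max(range(len(freqs)), key=lambda j: freqs[j])
    f_mo = freqs[k]
    f_prev = freqs[k - 1] if k > 0 else 0               # frequency before the modal interval
    f_next = freqs[k + 1] if k + 1 < len(freqs) else 0  # frequency after the modal interval
    share = (f_mo - f_prev) / ((f_mo - f_prev) + (f_mo - f_next))
    return lower_bounds[k] + width * share

# Hypothetical age groups 20-25, 25-30, 30-35, 35-40 with their frequencies.
lower_bounds = [20, 25, 30, 35]
freqs = [10, 30, 20, 5]

mode_value = interval_mode(lower_bounds, 5, freqs)
print(round(mode_value, 2))
```

The mode is pulled toward the neighbouring interval with the higher frequency: here the following interval (20) outweighs the preceding one (10), so the mode lies above the middle of the modal interval.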

Median is the value of the trait that lies in the middle of the ranked series and divides it into two equal parts.

To determine the median in a discrete series with frequencies, first calculate half the sum of the frequencies and then determine which value of the variant it falls on. If the ranked series contains an odd number of values, the position of the median is calculated by the formula

N_Me = (n + 1) / 2,

where n is the number of values in the series. If the number of values is even, the median is equal to the average of the two values in the middle of the series.

To calculate the median for an interval variation series, first determine the median interval, within which the median is located, and then the median value using the formula:

$$Me = x_{Me} + i_{Me} \cdot \frac{\frac{\sum f}{2} - S_{Me-1}}{f_{Me}}$$

- $Me$ is the required median

- $x_{Me}$ is the lower bound of the interval that contains the median

- $i_{Me}$ is the width of the interval

- $\sum f$ is the sum of the frequencies, or the number of members of the series

- $S_{Me-1}$ is the sum of the accumulated frequencies of the intervals preceding the median interval

- $f_{Me}$ is the frequency of the median interval
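The same steps can be sketched in Python: accumulate the frequencies until half of their sum is reached, then interpolate inside that interval. The function name and the frequency table are hypothetical, and equal-width intervals are assumed.

```python
def interval_median(lower_bounds, width, freqs):
    """Median of an interval series with equal-width intervals."""
    half = sum(freqs) / 2
    cumulative = 0  # accumulated frequency of the intervals before the current one
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= half:
            # The median interval is found; interpolate linearly inside it.
            return lower + width * (half - cumulative) / f
        cumulative += f
    raise ValueError("the series has no observations")

# Hypothetical age groups 20-25, 25-30, 30-35, 35-40 with their frequencies.
lower_bounds = [20, 25, 30, 35]
freqs = [10, 30, 20, 5]

median_value = interval_median(lower_bounds, 5, freqs)
print(median_value)
```

With 65 observations the half-sum is 32.5, which falls in the second interval (accumulated frequencies 10 and then 40), so the median is interpolated between 25 and 30.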

Example. Find the mode and the median.

Solution:
In this example, the modal interval is within the 25-30 age group, since this interval has the highest frequency (1054).

Let's calculate the magnitude of the mode:

This means that the modal age of the students is 27 years.

We calculate the median. The median interval is the 25-30 age group, since this interval contains the variant that divides the population into two equal parts (Σf_i / 2 = 3462 / 2 = 1731). Next, we substitute the necessary numerical data into the formula and get the median value:

This means that one half of the students are under 27.4 years old, and the other over 27.4 years old.

In addition to the mode and the median, other structural indicators can be used: quartiles, which divide the ranked series into 4 equal parts, deciles (10 parts) and percentiles (100 parts).

The concept of sample observation and its scope.

Sample observation is used when continuous (complete) observation is physically impossible because of the large amount of data, or economically impractical. Physical impossibility occurs, for example, when studying passenger flows, market prices, or family budgets. Economic impracticality occurs when assessing the quality of goods involves destroying them, for example, tasting or testing bricks for strength.

The statistical units selected for observation make up the sample population, or sample, and their entire array is the general population (GP). The number of units in the sample is denoted n, and in the whole GP, N. The ratio n / N is called the relative size, or fraction, of the sample.

The quality of the results of sample observation depends on the representativeness of the sample, that is, on how well it represents the GP. To ensure representativeness, units must be selected at random, which means that the inclusion of a GP unit in the sample cannot be influenced by any factor other than chance.

There are 4 ways of selecting units into the sample at random:

  1. Simple random selection, or the "lotto method": the statistical units are assigned serial numbers, which are written on objects (for example, lotto barrels) that are then mixed in a container (for example, a bag) and drawn at random. In practice, this method is carried out using a random number generator or mathematical tables of random numbers.
  2. Mechanical selection, in which every (N / n)-th value of the general population is selected. For example, if the population contains 100,000 values and 1,000 must be selected, then every 100,000 / 1,000 = 100th value is included in the sample. If the values are not ranked, the first one is chosen at random from the first hundred, and the number of each subsequent one is one hundred greater. For example, if unit No. 19 is the first, the next are No. 119, No. 219, No. 319, and so on. If the units of the general population are ranked, then No. 50 is selected first, then No. 150, No. 250, and so on.
  3. Stratified selection is used for a heterogeneous data set: the general population is first divided into homogeneous groups, to which random or mechanical selection is then applied.
  4. Serial selection is a special sampling method in which not individual values but whole series of them (runs of consecutive units) are selected randomly or mechanically, and continuous observation is carried out within each series.
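The four selection methods can be sketched with Python's standard `random` module. The function names, the numbered population of 100,000 units, the two groups and the series of 100 consecutive units are all assumptions made for this illustration.

```python
import random

def simple_random(population, n, rng):
    """Simple random selection without replacement (the "lotto method")."""
    return rng.sample(population, n)

def mechanical(population, n, rng):
    """Mechanical selection: every (N / n)-th unit after a random start."""
    step = len(population) // n
    start = rng.randrange(step)  # random start within the first interval
    return [population[start + i * step] for i in range(n)]

def stratified(groups, fraction, rng):
    """Stratified selection: the same fraction is drawn at random from each group."""
    sample = []
    for group in groups:
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample

def serial(series_list, r, rng):
    """Serial selection: r whole series are drawn and observed completely."""
    chosen = rng.sample(series_list, r)
    return [unit for series in chosen for unit in series]

rng = random.Random(0)                # fixed seed so that the run is repeatable
population = list(range(1, 100_001))  # unit numbers 1..100000
groups = [population[:60_000], population[60_000:]]
series_list = [population[i:i + 100] for i in range(0, len(population), 100)]

random_sample = simple_random(population, 1000, rng)
mechanical_sample = mechanical(population, 1000, rng)
stratified_sample = stratified(groups, 0.01, rng)
serial_sample = serial(series_list, 10, rng)

print(mechanical_sample[:4])
```

All four calls return 1,000 units, but the mechanical sample is spaced exactly 100 units apart, and the serial sample consists of ten complete runs of 100 consecutive units.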

The quality of sample observation also depends on the type of sample: repeated (with replacement) or non-repeated (without replacement).

In repeated selection, the statistical values or series that were included in the sample are returned to the general population after use and have a chance of being included in a new sample. All values of the general population therefore have the same probability of being included in the sample.

Non-repeated selection means that the statistical values or series included in the sample are not returned to the general population after use, and therefore the probability of being included in the next sample increases for the remaining values.

Non-repeated selection gives more accurate results, so it is used more often. But there are situations when it cannot be applied (the study of passenger flows, consumer demand, etc.), and then repeated selection is carried out.

Marginal sampling error, mean sampling error, and the procedure for calculating them.

Let us consider in detail the above methods of forming the sample population and the errors of representativeness that arise.
A simple random sample is based on the random selection of units from the general population without any systematic elements. Technically, simple random selection is carried out by drawing lots (for example, lottery draws) or using a table of random numbers.

Simple random selection "in its pure form" is rarely used in the practice of sample observation, but it is the starting point for the other types of selection and implements the basic principles of sample observation. Let us consider some questions of the theory of the sampling method and the error formula for a simple random sample.

The sampling error is the difference between the value of a parameter in the general population and its value calculated from the results of the sample observation. For the mean of a quantitative feature, the sampling error is determined as

$$\Delta_{\bar{x}} = |\bar{x} - \bar{x}_0|,$$

where $\bar{x}$ is the sample mean and $\bar{x}_0$ is the general mean.

The indicator $\Delta$ is called the marginal sampling error.
The sample mean is a random variable that can take on different values depending on which units were included in the sample. Sampling errors are therefore also random variables and can take on different values. For this reason the average of the possible errors is determined, the mean sampling error $\mu$, which depends on:

the sample size: the larger the sample, the smaller the mean error;

the degree of variation of the feature under study: the smaller the variation of the feature and, consequently, its variance, the smaller the mean sampling error.

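With simple random repeated selection, the mean error is calculated as:

$$\mu = \sqrt{\frac{\sigma_0^2}{n}},$$

where $\sigma_0^2$ is the general (population) variance and $n$ is the sample size.
In practice, the general variance is not known exactly, but probability theory proves that

$$\sigma_0^2 = \sigma^2 \cdot \frac{n}{n-1},$$

where $\sigma^2$ is the sample variance.
Since the value $n/(n-1)$ is close to 1 for sufficiently large $n$, we can assume that $\sigma_0^2 \approx \sigma^2$. Then the mean sampling error can be calculated as:

$$\mu = \sqrt{\frac{\sigma^2}{n}}.$$

But for a small sample ($n < 30$) the coefficient $n/(n-1)$ must be taken into account, and the mean error of a small sample is calculated by the formula

$$\mu = \sqrt{\frac{\sigma^2}{n-1}}.$$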

With simple random non-repeated selection, the formulas above are corrected by the factor $1 - n/N$, where $n$ is the sample size and $N$ is the size of the general population. Then the mean error of non-repeated sampling, with sample variance $\sigma^2$, is:

$$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)} \quad \text{and} \quad \mu = \sqrt{\frac{\sigma^2}{n-1}\left(1 - \frac{n}{N}\right)}.$$

Because $n$ is always less than $N$, the factor $(1 - n/N)$ is always less than 1. This means that the mean error of non-repeated selection is always smaller than that of repeated selection.

Mechanical sampling is used when the general population is ordered in some way (for example, alphabetical lists of voters, telephone numbers, numbers of houses or apartments). Units are selected at a fixed interval equal to the reciprocal of the sampling fraction. So, with a 2% sample every 50th unit is selected (1 / 0.02 = 50), and with a 5% sample every 20th unit (1 / 0.05 = 20) of the general population.

The reference point is selected in different ways: randomly, from the middle of the interval, with a change in the reference point. The main thing here is to avoid systematic error. For example, with a 5% sample, if the first unit is 13, then the next 33, 53, 73, etc.

In terms of accuracy, mechanical selection is close to simple random sampling. Therefore, the formulas for simple random selection are used to determine the mean error of mechanical sampling.

In typical (stratified) selection, the surveyed population is first divided into homogeneous groups of the same type. For example, when surveying enterprises, these can be industries or sub-sectors; when studying the population, these can be regions, social groups or age groups. Then an independent selection is made from each group, either mechanically or in a purely random way.

Typical sampling gives more accurate results than other methods. Typification of the general population ensures that each typological group is represented in the sample, which makes it possible to exclude the influence of intergroup variance on the mean sampling error. Consequently, when finding the error of a typical sample by the rule for adding variances ($\sigma^2 = \overline{\sigma_i^2} + \sigma^2_{mg}$), only the mean of the within-group variances needs to be taken into account. Then the mean sampling error is:
with repeated selection

$$\mu = \sqrt{\frac{\overline{\sigma_i^2}}{n}},$$

with non-repeated selection

$$\mu = \sqrt{\frac{\overline{\sigma_i^2}}{n}\left(1 - \frac{n}{N}\right)},$$

where $\overline{\sigma_i^2}$ is the mean of the within-group variances in the sample.

Serial (or nested) selection is used when the general population is divided into series or groups before the start of the sample survey. These series can be packages of finished products, student groups, or work teams. The series for the survey are selected mechanically or in a purely random way, and within the series a continuous survey of units is carried out. Therefore, the mean sampling error depends only on the intergroup (inter-series) variance, which is calculated by the formula:

$$\sigma^2_{mg} = \frac{\sum_{i=1}^{r} (\bar{x}_i - \bar{x})^2}{r},$$

where $r$ is the number of selected series;
$\bar{x}_i$ is the mean of the $i$-th series;
$\bar{x}$ is the overall mean of the selected series.

The mean error of serial sampling is calculated:

with repeated selection:

$$\mu = \sqrt{\frac{\sigma^2_{mg}}{r}},$$

with non-repeated selection:

$$\mu = \sqrt{\frac{\sigma^2_{mg}}{r}\left(1 - \frac{r}{R}\right)},$$

where $R$ is the total number of series.
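A short Python sketch of the serial-sampling calculation: the inter-series variance is computed from the means of the selected series, and the mean error follows from it with or without the finite-population correction. The function names and the series means are hypothetical, and series of equal size are assumed, so the overall mean is taken as the plain average of the series means.

```python
import math

def between_series_variance(series_means):
    """Inter-series variance of the selected series (equal-size series assumed)."""
    r = len(series_means)
    overall = sum(series_means) / r
    return sum((m - overall) ** 2 for m in series_means) / r

def serial_mean_error(series_means, total_series=None):
    """Mean error of serial sampling.

    total_series=None gives the repeated-selection formula; otherwise the
    non-repeated correction (1 - r / R) is applied.
    """
    r = len(series_means)
    correction = 1.0 if total_series is None else 1 - r / total_series
    return math.sqrt(between_series_variance(series_means) / r * correction)

# Hypothetical means of r = 4 selected series out of R = 20 series.
series_means = [10, 12, 14, 16]

variance_between = between_series_variance(series_means)
error_repeated = serial_mean_error(series_means)
error_non_repeated = serial_mean_error(series_means, total_series=20)

print(variance_between, round(error_repeated, 3), round(error_non_repeated, 3))
```

Because the units inside each selected series are observed completely, the within-series variation does not enter the error at all; only the spread of the series means matters.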

Combined selection is a combination of the considered selection methods.

The mean sampling error for any selection method depends mainly on the absolute size of the sample and, to a lesser extent, on the sampling percentage. Suppose that 225 observations are made in the first case from a general population of 4,500 units and in the second from 225,000 units. The variances in both cases are equal to 25. Then, in the first case, with 5% sampling, the sampling error will be:

$$\mu = \sqrt{\frac{25}{225}\left(1 - \frac{225}{4500}\right)} \approx 0.325$$

In the second case, with 0.1% sampling, it will be equal to:

$$\mu = \sqrt{\frac{25}{225}\left(1 - \frac{225}{225000}\right)} \approx 0.333$$

Thus, when the sampling percentage decreased by a factor of 50, the sampling error increased only slightly, since the sample size did not change.
Suppose the sample size is increased to 625 observations. In this case, the sampling error is:

An increase in the sample by a factor of 2.8 with the same size of the general population reduces the size of the sampling error by more than 1.6 times.
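The comparison can be reproduced with a few lines of Python, assuming the non-repeated formula for a simple random sample, μ = sqrt(σ²/n · (1 − n/N)). The function name is illustrative. The text does not state which general population the 625-observation case refers to; the sketch assumes the larger one (225,000 units).

```python
import math

def mean_error_non_repeated(variance, n, population_size):
    """Mean sampling error of a simple random sample without replacement."""
    return math.sqrt(variance / n * (1 - n / population_size))

error_5_percent = mean_error_non_repeated(25, 225, 4_500)      # 5% sample
error_01_percent = mean_error_non_repeated(25, 225, 225_000)   # 0.1% sample
# Assumption: the enlarged sample is drawn from the 225,000-unit population.
error_larger_sample = mean_error_non_repeated(25, 625, 225_000)

print(round(error_5_percent, 3), round(error_01_percent, 3))
print(round(error_larger_sample, 3), round(error_01_percent / error_larger_sample, 2))
```

Changing the sampling percentage fifty-fold barely moves the error, while enlarging the sample from 225 to 625 units cuts it by roughly the square root of 625 / 225.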

Methods and ways of forming the sample.

In statistics, various methods of forming sample populations are used; the choice is determined by the objectives of the research and depends on the specifics of the object of study.

The main condition for conducting a sample survey is to prevent the occurrence of systematic errors arising from the violation of the principle of equal opportunities for each unit of the general population to be included in the sample. Prevention of systematic errors is achieved as a result of the use of scientifically based methods of forming a sample population.

There are the following ways to select units from the general population:

1) individual selection - individual units are selected in the sample;

2) group selection - qualitatively homogeneous groups or series of the studied units fall into the sample;

3) combined selection is a combination of individual and group selection.
Selection methods are determined by the rules for the formation of the sample population.

The sample can be:

  • simple random: the sample population is formed by random (unintentional) selection of individual units from the general population. The number of units selected is usually determined from the accepted sampling fraction. The sampling fraction is the ratio of the number of units in the sample n to the number of units in the general population N, i.e. n / N;
  • mechanical: units are selected from a general population divided into equal intervals (groups). The size of the interval is equal to the reciprocal of the sampling fraction. So, with a 2% sample every 50th unit is selected (1 : 0.02), with a 5% sample every 20th unit (1 : 0.05), etc. Thus, in accordance with the accepted sampling fraction, the general population is, as it were, mechanically divided into groups of equal size, and only one unit is selected from each group;
  • typical: the general population is first divided into homogeneous typical groups. Then units are selected individually from each typical group by simple random or mechanical sampling. An important feature of the typical sample is that it gives more accurate results than other methods of selecting units;
  • serial: the general population is divided into groups of the same size, called series. Whole series are selected for the sample, and within each selected series all units are observed;
  • combined: the sample can be two-stage. In this case, the general population is first divided into groups. Then groups are selected, and individual units are selected within them.

In statistics, the following methods of selecting units in a sample population are distinguished:

  • single-stage sampling: each selected unit is immediately examined according to a given criterion (simple random and serial sampling);
  • multistage sampling: individual groups are selected from the general population, and individual units are then selected from the groups (typical sampling with mechanical selection of units into the sample population).

In addition, a distinction is made between:

  • repeated selection, following the returned-ball scheme. Each unit or series that was included in the sample returns to the general population and therefore has a chance of being selected again;
  • non-repeated selection, following the unreturned-ball scheme. It gives more accurate results with the same sample size.

Determination of the required sample size (using the Student's table).

One of the scientific principles in sampling theory is to ensure a sufficient number of sampled units. Theoretically, the need to comply with this principle is presented in the proofs of the limit theorems of probability theory, which make it possible to establish what volume of units should be selected from the general population in order for it to be sufficient and to ensure the representativeness of the sample.

A decrease in the standard error of the sample and, consequently, an increase in the accuracy of the estimate always requires an increase in the sample size. Therefore, already at the stage of organizing a sample observation, it is necessary to decide how large the sample population must be to ensure the required accuracy of the results. The required sample size is calculated using formulas derived from the formulas for the marginal sampling error (Δ) that correspond to the particular type and method of selection. So, for simple random repeated selection, the sample size (n) is:

$$n = \frac{t^2 \sigma^2}{\Delta^2}$$

The essence of this formula is that, for simple random repeated selection, the required sample size is directly proportional to the square of the confidence coefficient ($t^2$) and to the variance of the varying feature ($\sigma^2$), and inversely proportional to the square of the marginal sampling error ($\Delta^2$). In particular, if the marginal error is doubled, the required sample size can be reduced by a factor of four. Of the three parameters, two (t and Δ) are set by the researcher.
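The relationship can be illustrated with a small Python sketch of n = t² σ² / Δ² for simple random repeated selection. The function name and the input values (t = 2, variance 25, marginal errors 0.5 and 1.0) are hypothetical; the result is rounded up because a sample cannot contain a fraction of a unit.

```python
import math

def required_sample_size(t, variance, marginal_error):
    """Required size of a simple random repeated sample, rounded up."""
    return math.ceil(t ** 2 * variance / marginal_error ** 2)

# Hypothetical inputs: confidence coefficient t = 2, variance 25.
size_tight = required_sample_size(2, 25, 0.5)   # marginal error 0.5
size_loose = required_sample_size(2, 25, 1.0)   # marginal error doubled to 1.0

print(size_tight, size_loose)
```

Doubling the marginal error from 0.5 to 1.0 reduces the required sample size by a factor of four, as stated above.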

Here the researcher, proceeding from the objectives of the sample survey, must decide in what quantitative combination these parameters should be set to obtain the best option. In one case the reliability of the results (t) may matter more than the measure of accuracy (Δ); in another, the reverse. The value of the marginal sampling error is harder to settle, since the researcher does not have this indicator at the design stage of the sample observation. In practice, therefore, it is customary to set the marginal sampling error within 10% of the expected average level of the feature. The expected average can be established in different ways: using data from similar previous surveys, or using data from the sampling frame and drawing a small trial sample.

When designing a sample observation, it is most difficult to establish the third parameter in formula (5.2) - the variance of the sample population. In this case, it is necessary to use all the information available to the researcher, obtained in previous similar and pilot surveys.

The question of defining the required sample size is complicated if the sample survey involves the study of several characteristics of sampling units. In this case, the average levels of each of the characteristics and their variation are usually different, and therefore, it is possible to decide the question of which of the characteristics variance to give preference to, only taking into account the purpose and objectives of the survey.

When designing a sample observation, a predetermined value of the allowable sampling error is assumed in accordance with the tasks of a particular study and the probability of conclusions based on the results of the observation.

In general, the formula for the marginal error of the sample mean makes it possible to determine:

The amount of possible deviations of the indicators of the general population from the indicators of the sample population;

The required size of the sample, providing the required accuracy, at which the limits of possible error do not exceed a certain specified value;

The probability that the error in the sample will have a specified limit.

Student's t-distribution, in probability theory, is a one-parameter family of absolutely continuous distributions.

Series of dynamics (interval, moment), closing of series of dynamics.

Series of dynamics are the values of statistical indicators presented in chronological order.

Each time series contains two components:

1) indicators of time periods (years, quarters, months, days or dates);

2) indicators characterizing the object under study for time periods or for the corresponding dates, which are called the levels of the series.

The levels of a series can be expressed as absolute, average or relative values. Depending on the nature of the indicators, series of dynamics of absolute, relative and average values are built. Series of relative and average values are derived from series of absolute values. A distinction is made between interval and moment series of dynamics.

An interval series of dynamics contains the values of indicators for specific periods of time. In an interval series, the levels can be summed to obtain the volume of the phenomenon over a longer period, the so-called accumulated totals.

A moment series of dynamics reflects the values of indicators at a certain point in time (date). In a moment series, the researcher can only be interested in the differences that reflect the change in the level of the series between certain dates, since the sum of the levels has no real meaning here. Accumulated totals are not calculated for moment series.

The most important condition for the correct construction of time series is the comparability of the levels of the series belonging to different periods. The levels should be presented in homogeneous quantities, and the different parts of the phenomenon should be equally comprehensive.

To avoid distorting the real dynamics, preliminary calculations (closing the series of dynamics) are carried out before the statistical analysis of the time series. Closing the series of dynamics means combining into one series two or more series whose levels are calculated by different methodologies or do not correspond to the same territorial boundaries, etc. Closing the series of dynamics may also involve bringing the absolute levels of the series to a common base, which eliminates the incomparability of the levels.

The concept of comparability of the series of dynamics, growth coefficients, growth rates and rates of increase.

Series of dynamics are series of statistical indicators characterizing the development of natural and social phenomena over time. Statistical compilations published by the Goskomstat of Russia contain a large number of series of dynamics in tabular form. Series of dynamics make it possible to reveal the patterns of development of the phenomena under study.

Series of dynamics contain two types of indicators. Time indicators are periods (years, quarters, months, etc.) or points in time (the beginning of the year, the beginning of each month, etc.). Level indicators of the series can be expressed in absolute values (production of a product in tons or rubles), relative values (share of the urban population in %) and average values (average wages of workers in an industry by year, etc.). In tabular form, a series of dynamics contains two columns or two rows.

The correct construction of the series of dynamics presupposes the fulfillment of a number of requirements:

  1. all indicators of a series of dynamics must be scientifically grounded and reliable;
  2. the indicators of a series of dynamics must be comparable in time, i.e. calculated for the same periods of time or for the same dates;
  3. the indicators of a series of dynamics must be comparable across territory;
  4. the indicators of a series of dynamics must be comparable in content, i.e. calculated according to a unified methodology, in the same way;
  5. the indicators of a series of dynamics must be comparable across the range of economic units covered. All indicators of a series of dynamics must be given in the same units of measurement.

Statistical indicators can characterize either the results of the process under study over a period of time, or the state of the phenomenon under study at a certain point in time, i.e. indicators can be interval (periodic) and momentary. Accordingly, the initial series of dynamics can be either interval or momentary. The momentary series of dynamics, in turn, can be with equal and unequal time intervals.

The original series of dynamics can be transformed into a series of average values ​​and a series of relative values ​​(chain and basic). Such series of dynamics are called derived series of dynamics.

The methodology for calculating the average level of a series of dynamics differs depending on the type of series. Using examples, we will consider the types of series of dynamics and the formulas for calculating the average level.

Absolute increments (Δy) show by how many units the current level of the series has changed in comparison with the previous one (column 3, chain absolute increments) or in comparison with the initial level (column 4, basic absolute increments). The calculation formulas can be written as follows:

$$\Delta y_{\text{chain}} = y_i - y_{i-1}, \qquad \Delta y_{\text{base}} = y_i - y_1$$

If the absolute values of the series decrease, the result is a "decrease" or "decline" rather than an increase.

The absolute increments show that, for example, in 1998 the production of product "A" increased by 4 thousand tons compared with 1997 and by 34 thousand tons compared with 1994; for the other years see Table 11.5, columns 3 and 4.
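A minimal Python sketch of the chain and basic absolute increments. The levels used here (200, 210, 218, 230 and 234 thousand tons for 1994-1998) are not copied from Table 11.5, which is not reproduced in this text; they are inferred from the figures quoted in the surrounding discussion and should be treated as an assumption.

```python
# Assumed production levels of product "A", thousand tons (inferred, see above).
levels = {1994: 200, 1995: 210, 1996: 218, 1997: 230, 1998: 234}

years = sorted(levels)
initial = levels[years[0]]

# Chain increments compare each level with the previous one.
chain_increments = {
    year: levels[year] - levels[prev] for prev, year in zip(years, years[1:])
}
# Basic increments compare each level with the initial level.
base_increments = {year: levels[year] - initial for year in years[1:]}

for year in years[1:]:
    print(year, chain_increments[year], base_increments[year])
```

The sum of the chain increments equals the last basic increment, which is a convenient check on the calculation.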

Growth coefficient (K) shows by how many times the level of the series has changed in comparison with the previous one (column 5, chain growth or decline coefficients) or compared with the initial level (column 6, basic growth or decline coefficients). The calculation formulas can be written as follows:

$$K_{\text{chain}} = \frac{y_i}{y_{i-1}}, \qquad K_{\text{base}} = \frac{y_i}{y_1}$$

Growth rate ($T_p$) shows what percentage the current level of the series makes up of the previous one (column 7, chain growth rates) or of the initial level (column 8, basic growth rates). The calculation formulas can be written as follows:

$$T_p = K \cdot 100\%$$

So, for example, in 1997 the volume of production of product "A" amounted to 105.5% of the 1996 level.

Rate of increase ($T_{pr}$) shows by what percentage the level of the reporting period has increased in comparison with the previous one (column 9, chain rates of increase) or in comparison with the initial level (column 10, basic rates of increase). The calculation formulas can be written as follows:

T_pr = T_p - 100%, or T_pr = absolute increase / level of the previous period * 100%

So, for example, in 1996 the output of product "A" was 3.8% higher than in 1995 (103.8% - 100%, or (8 : 210) x 100%), and 9% higher than in 1994 (109% - 100%).

If the absolute levels in the series decrease, the growth rate will be less than 100% and, accordingly, there will be a rate of decline (a rate of increase with a minus sign).
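The three relative indicators can be computed side by side in Python. The levels (200, 210, 218, 230 and 234 thousand tons for 1994-1998) are an assumption inferred from the figures quoted in the text, since the table itself is not reproduced; the variable names are illustrative.

```python
# Assumed production levels for 1994-1998, thousand tons (inferred, see above).
levels = [200, 210, 218, 230, 234]
initial = levels[0]

# Growth coefficients: ratio to the previous level (chain) or the initial level (basic).
chain_coefficients = [cur / prev for prev, cur in zip(levels, levels[1:])]
base_coefficients = [cur / initial for cur in levels[1:]]

# Growth rates are the coefficients expressed in percent.
chain_growth_rates = [k * 100 for k in chain_coefficients]
base_growth_rates = [k * 100 for k in base_coefficients]

# Rates of increase: growth rate minus 100%.
chain_increase_rates = [t - 100 for t in chain_growth_rates]
base_increase_rates = [t - 100 for t in base_growth_rates]

for year, t_chain, t_base in zip(range(1995, 1999), chain_increase_rates, base_increase_rates):
    print(year, round(t_chain, 1), round(t_base, 1))
```

With these levels the sketch reproduces the figures quoted in the text: a chain growth rate of 105.5% for 1997, and rates of increase of 3.8% (chain) and 9% (basic) for 1996.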

Absolute value of a 1% increase (column 11) shows how many units must be produced in a given period for the level of the previous period to increase by 1%. In our example, in 1995 this required 2.0 thousand tons, and in 1998, 2.3 thousand tons, i.e. noticeably more.

There are two ways to determine the absolute value of a 1% increase:

divide the level of the previous period by 100;

divide the chain absolute increment by the corresponding chain rate of increase (in percent).

$$\text{Absolute value of a 1\% increase} = \frac{\Delta y_{\text{chain}}}{T_{pr}} = \frac{y_{i-1}}{100}$$
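Both ways of obtaining the absolute value of a 1% increase can be compared in Python. The levels (200, 210, 218, 230 and 234 thousand tons for 1994-1998) are an assumption inferred from the figures quoted in the text, and the variable names are illustrative.

```python
# Assumed production levels for 1994-1998, thousand tons (inferred, see above).
levels = [200, 210, 218, 230, 234]

one_percent_by_level = []  # previous level divided by 100
one_percent_by_ratio = []  # chain increment divided by the chain rate of increase, %

for prev, cur in zip(levels, levels[1:]):
    increment = cur - prev
    increase_rate = increment / prev * 100
    one_percent_by_level.append(prev / 100)
    one_percent_by_ratio.append(increment / increase_rate)

for year, value in zip(range(1995, 1999), one_percent_by_level):
    print(year, value)
```

The two methods are algebraically identical, so any difference between the two lists is only floating-point rounding.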

In dynamics, especially over a long period, a joint analysis of the growth rates with the content of each percentage of increase or decrease is important.

Note that this method of analyzing series of dynamics is applicable both to series whose levels are expressed in absolute values (tons, thousand rubles, number of employees, etc.) and to series whose levels are expressed as relative indicators (% of scrap, % ash content of coal, etc.) or average values (average yield in centners per hectare, average wages, etc.).

Along with the analytical indicators considered, which are calculated for each year in comparison with the previous or initial level, the analysis of series of dynamics requires average analytical indicators for the period: the average level of the series, the average annual absolute increase (decrease), and the average annual growth rate and rate of increase.

Methods for calculating the average level of a time series were discussed above. For the interval series considered here, the average level is calculated using the simple arithmetic mean formula:

$$\bar{y} = \frac{\sum y_i}{n}$$

Average annual production of the product for 1994-1998 amounted to 218.4 thousand tons.

The average annual absolute growth is also calculated using the simple arithmetic mean formula:

$$\bar{\Delta} = \frac{\sum \Delta y_i}{n - 1},$$

where $n - 1$ is the number of chain increments.

Annual absolute increments varied over the years from 4 to 12 thousand tons (see column 3), and the average annual increase in production for 1995-1998 amounted to 8.5 thousand tons.
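As a quick check of these two averages, here is a minimal sketch. The annual levels are again an assumption, reconstructed so that they reproduce the quoted results (218.4 thousand tons and 8.5 thousand tons); the table itself is not shown in the text.

```python
# ASSUMPTION: levels for 1994-1998 in thousand tons, reconstructed from the
# figures quoted in the text (the original table is not available).
levels = [200, 210, 218, 230, 234]

# Average level of an interval series: simple arithmetic mean of the levels.
mean_level = sum(levels) / len(levels)

# Chain absolute increments and their simple arithmetic mean.
increments = [cur - prev for prev, cur in zip(levels, levels[1:])]
mean_increment = sum(increments) / len(increments)

print(round(mean_level, 1))   # average annual production
print(increments)             # chain increments for 1995-1998
print(mean_increment)         # average annual absolute increase
```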

Methods for calculating the average growth rate and the average rate of increase require more detailed consideration. Let us consider them using the example of the annual levels of the series shown in the table.

The average level of a time series.

A series of dynamics (or time series) is a sequence of numerical values of a particular statistical indicator at successive moments or periods of time (i.e., arranged in chronological order).

The numerical values of the statistical indicator that make up a time series are called the levels of the series and are usually denoted by the letter $y$. The first member of the series, $y_1$, is called the initial or base level, and the last one, $y_n$, the final level. The moments or periods of time to which the levels refer are denoted by $t$.

Time series are usually presented as a table or a graph, with the time scale $t$ plotted along the abscissa and the scale of the levels $y$ along the ordinate.

Average indicators of a time series

Each time series can be viewed as a set of n time-varying indicators that can be summarized as averages. Such generalized (average) indicators are especially necessary when comparing changes in a particular indicator across different periods, different countries, etc.

The primary generalized characteristic of a time series is the average level of the series. The method of calculating the average level depends on whether the series is a moment series or an interval (period) series.

For an interval series, the average level is determined by the formula of the simple arithmetic mean of the levels of the series, i.e.

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
If there is a moment series containing n levels ($y_1, y_2, \ldots, y_n$) with equal intervals between dates (points in time), it can easily be converted into a series of averages. The indicator (level) at the beginning of each period is simultaneously the indicator at the end of the previous period. The average value of the indicator for each period (the interval between dates) can therefore be calculated as the half-sum of the values of $y$ at the beginning and end of the period, i.e. as $(y_i + y_{i+1})/2$. The number of such averages will be $n - 1$. As mentioned earlier, for a series of averages the average level is calculated as the arithmetic mean.

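Therefore, we can write:

$$\bar{y} = \frac{\frac{y_1 + y_2}{2} + \frac{y_2 + y_3}{2} + \ldots + \frac{y_{n-1} + y_n}{2}}{n - 1}$$

After transforming the numerator, we get:

$$\bar{y} = \frac{\frac{y_1}{2} + \sum_{i=2}^{n-1} y_i + \frac{y_n}{2}}{n - 1},$$

where $y_1$ and $y_n$ are the first and last levels of the series and $y_i$ are the intermediate levels.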

This average is known in statistics as the chronological average for moment series. The name comes from the Greek word "chronos" (time), since it is calculated from indicators that change over time.
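The chronological average for a moment series with equal intervals can be sketched as follows. The function name and the sample stock values are hypothetical and serve only as an illustration; the check at the end confirms that the result equals the mean of the half-sums of adjacent levels.

```python
def chronological_mean(levels):
    """Chronological average of a moment series with equal intervals.

    Half of the first and last levels plus all intermediate levels,
    divided by the number of intervals (n - 1).
    """
    if len(levels) < 2:
        raise ValueError("a moment series needs at least two levels")
    inner = sum(levels[1:-1])
    return (levels[0] / 2 + inner + levels[-1] / 2) / (len(levels) - 1)


# Hypothetical stock levels recorded on four equally spaced dates.
stock = [10, 12, 14, 16]
result = chronological_mean(stock)

# The same value obtained as the mean of the half-sums of adjacent levels.
pair_means = [(a + b) / 2 for a, b in zip(stock, stock[1:])]
check = sum(pair_means) / len(pair_means)
print(result, check)
```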

In the case of unequal intervals between dates, the chronological average for a moment series can be calculated as the arithmetic mean of the average levels for each pair of adjacent moments, weighted by the time intervals between the dates, i.e.

$$\bar{y} = \frac{\sum_{i=1}^{n-1} \frac{y_i + y_{i+1}}{2}\, t_i}{\sum_{i=1}^{n-1} t_i},$$

where $t_i$ is the length of the interval between moments $i$ and $i + 1$.
Here it is assumed that the levels took on different values in the intervals between the dates, so from each pair of known levels ($y_i$ and $y_{i+1}$) we determine an average, and from these averages we then calculate the overall average for the entire analyzed period.
If it is assumed that each value $y_i$ remains unchanged until the next, $(i+1)$-th, moment, i.e. the exact dates of the changes in level are known, the calculation can be carried out using the weighted arithmetic mean formula:

$$\bar{y} = \frac{\sum y_i t_i}{\sum t_i},$$

where $t_i$ is the time during which the level $y_i$ remained unchanged.
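The two calculations for unequal intervals differ only in what is weighted: the half-sum of adjacent levels or the level itself. A minimal sketch, with hypothetical function names and illustrative data, might look like this:

```python
def chronological_mean_unequal(levels, gaps):
    """Weighted chronological average for a moment series.

    gaps[i] is the time between moment i and moment i + 1, so there must be
    exactly one gap fewer than there are levels.
    """
    if len(gaps) != len(levels) - 1:
        raise ValueError("need exactly len(levels) - 1 gaps")
    weighted = sum((a + b) / 2 * t for a, b, t in zip(levels, levels[1:], gaps))
    return weighted / sum(gaps)


def step_mean(levels, durations):
    """Weighted arithmetic mean when each level holds for a known duration."""
    if len(durations) != len(levels):
        raise ValueError("need one duration per level")
    return sum(y * t for y, t in zip(levels, durations)) / sum(durations)


# Hypothetical data: levels on three dates separated by 1 and 3 months.
between_dates = chronological_mean_unequal([10, 20, 40], [1, 3])

# Hypothetical data: level 10 held for 1 month, then level 20 for 3 months.
held_constant = step_mean([10, 20], [1, 3])
print(between_dates, held_constant)
```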

In addition to the average level in the series of dynamics, other average indicators are calculated - the average change in the levels of the series (by basic and chain methods), the average rate of change.

The base mean absolute change is the last base absolute change divided by the number of changes, that is

$$\bar{\Delta}_{b} = \frac{y_n - y_1}{n - 1}$$

The chain mean absolute change of the levels of a series is the sum of all chain absolute changes divided by the number of changes, that is

$$\bar{\Delta}_{c} = \frac{\sum_{i=2}^{n} (y_i - y_{i-1})}{n - 1}$$

The sign of the average absolute changes is also used to judge the nature of the change in the phenomenon on average: growth, decline or stability.

Because the sum of the chain absolute changes equals the last base absolute change (the control rule for base and chain changes), the base and chain mean changes must be equal.

Along with the average absolute change, the relative average is also calculated using basic and chain methods.

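The base mean relative change is determined by the formula:

$$\bar{i}_{b} = \sqrt[n-1]{\frac{y_n}{y_1}}$$

The chain mean relative change is determined by the formula:

$$\bar{i}_{c} = \sqrt[n-1]{\prod_{k=2}^{n} \frac{y_k}{y_{k-1}}}$$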

Naturally, the base and chain mean relative changes must coincide; comparing them with the criterion value 1 shows the nature of the change in the phenomenon on average: growth, decline or stability.
Subtracting 1 from the base or chain mean relative change gives the corresponding average rate of change, whose sign likewise indicates the nature of the change in the phenomenon reflected by the given time series.
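The equality of the base and chain averages is easy to verify numerically. The sketch below uses assumed annual levels (reconstructed from figures quoted earlier in the text, since the table is not shown) and hypothetical function names.

```python
def mean_absolute_change(levels):
    """Return (base, chain) mean absolute change of a series."""
    n_changes = len(levels) - 1
    base = (levels[-1] - levels[0]) / n_changes
    chain = sum(cur - prev for prev, cur in zip(levels, levels[1:])) / n_changes
    return base, chain


def mean_relative_change(levels):
    """Return (base, chain) mean relative change (a geometric mean)."""
    n_changes = len(levels) - 1
    base = (levels[-1] / levels[0]) ** (1 / n_changes)
    product = 1.0
    for prev, cur in zip(levels, levels[1:]):
        product *= cur / prev  # chain relative changes multiply together
    chain = product ** (1 / n_changes)
    return base, chain


# ASSUMPTION: levels for 1994-1998 in thousand tons, not taken from the table.
levels = [200, 210, 218, 230, 234]
abs_base, abs_chain = mean_absolute_change(levels)
rel_base, rel_chain = mean_relative_change(levels)
average_rate = rel_base - 1  # positive sign means growth on average
print(abs_base, abs_chain, rel_base, rel_chain, average_rate)
```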

Seasonal fluctuations and seasonality indices.

Seasonal fluctuations are steady intra-annual fluctuations.

The main principle of management aimed at maximum effect is to maximize income and minimize costs. Studying seasonal fluctuations serves this goal by helping to even out, as far as possible, the levels within the year.

When studying seasonal fluctuations, two interrelated tasks are solved:

1. Revealing the specifics of the development of the phenomenon in the intra-annual dynamics;

2. Measurement of seasonal fluctuations with the construction of a seasonal wave model;

To measure seasonal fluctuations, seasonality indices are usually calculated. In general, they are determined as the ratio of the initial (empirical) levels of a time series to theoretical (calculated) levels, which serve as the basis for comparison.

Since random deviations are superimposed on seasonal fluctuations, seasonality indices are averaged to eliminate them.

In this case, for each period of the annual cycle, generalized indicators are determined in the form of average seasonal indices:

Average indices of seasonal fluctuations are free from the influence of random deviations from the main development trend.

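Depending on the nature of the trend, the formula for the average seasonality index takes one of the following forms:

1. For series of intra-annual dynamics with a pronounced main development trend:

$$\bar{I}_{s,i} = \frac{1}{n} \sum_{j=1}^{n} \frac{y_{ij}}{\hat{y}_{ij}} \cdot 100\%,$$

where $y_{ij}$ is the actual level of period $i$ in year $j$, $\hat{y}_{ij}$ is the corresponding theoretical (trend) level, and $n$ is the number of years.

2. For series of intra-annual dynamics in which an increasing or decreasing trend is absent or insignificant:

$$\bar{I}_{s,i} = \frac{\bar{y}_i}{\bar{y}} \cdot 100\%,$$

where $\bar{y}_i$ is the average level of period $i$ over the years and $\bar{y}$ is the general average.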
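For the case with no pronounced trend, the average seasonality index is the ratio of each period's average level to the general average. A minimal sketch follows; the quarterly figures and the function name are hypothetical and are used only to show the mechanics.

```python
def seasonal_indices(years):
    """Average seasonality indices (in %) for a series without a trend.

    years is a list of equally long lists, one per year, holding the levels
    of each intra-annual period (month, quarter, ...).
    """
    n_periods = len(years[0])
    if any(len(year) != n_periods for year in years):
        raise ValueError("every year must have the same number of periods")
    # Average level of each period across the years.
    period_means = [sum(year[i] for year in years) / len(years)
                    for i in range(n_periods)]
    # General average over all periods and years.
    general_mean = sum(period_means) / n_periods
    return [m / general_mean * 100 for m in period_means]


# Hypothetical quarterly sales for two years.
quarterly = [
    [70, 90, 110, 90],
    [90, 110, 130, 110],
]
indices = seasonal_indices(quarterly)
print([round(i, 1) for i in indices])
```

An index above 100% marks a period that is seasonally high; the indices average to 100% by construction.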

Methods for analyzing the main trend.

The development of phenomena in time is influenced by factors that differ in nature and strength. Some of them are random, while others act almost constantly and form a certain development trend in the time series.

An important task of statistics is to identify the trend in a time series, freed from the action of various random factors. For this purpose, time series are processed by the methods of interval enlargement, moving averages, analytical alignment, etc.

The interval enlargement method is based on enlarging the time periods to which the levels of the series refer, i.e. on replacing data for small time periods with data for larger ones. It is especially effective when the initial levels refer to short periods of time. For example, series of daily indicators are replaced by weekly, monthly, etc. series. This shows the "axis of development" of the phenomenon more clearly. The average calculated over the enlarged intervals makes it possible to identify the direction and nature (acceleration or deceleration of growth) of the main development trend.

The moving average method is similar to the previous one, but here the actual levels are replaced by average levels calculated for successively moving (sliding) enlarged intervals covering m levels of the series.

For example, if m = 3, the average of the first three levels of the series is calculated first, then the average of the same number of levels starting from the second, then starting from the third, etc. Thus the average "slides" along the time series, moving by one period each time. Moving averages calculated from m terms refer to the middle (center) of each interval.

This method only eliminates random fluctuations. If the series has a seasonal wave, then it will remain after smoothing by the moving average method.
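The sliding calculation described above can be sketched in a few lines. The function name is hypothetical, and the annual levels are an assumption reconstructed from figures quoted earlier in the text.

```python
def moving_average(levels, m=3):
    """Simple moving averages over windows of m consecutive levels.

    The i-th result refers to the middle of the window that starts at level i,
    so the smoothed series is shorter than the original by m - 1 values.
    """
    if m < 1 or m > len(levels):
        raise ValueError("window size must be between 1 and len(levels)")
    return [sum(levels[i:i + m]) / m for i in range(len(levels) - m + 1)]


# ASSUMPTION: levels for 1994-1998 in thousand tons, not taken from the table.
levels = [200, 210, 218, 230, 234]
smoothed = moving_average(levels, m=3)  # centered on 1995, 1996, 1997
print([round(value, 1) for value in smoothed])

simple = moving_average([3, 6, 9, 12], m=3)
print(simple)
```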

Analytical alignment. To eliminate random fluctuations and identify a trend, the levels of the series are aligned using analytical formulas (analytical alignment). Its essence is to replace the empirical (actual) levels with theoretical ones, calculated from an equation adopted as a mathematical model of the trend, in which the theoretical levels are a function of time: $\hat{y}_t = f(t)$. Each actual level is then considered as the sum of two components: $y_t = f(t) + \varepsilon_t$, where $f(t)$ is the systematic component expressed by the equation and $\varepsilon_t$ is a random variable causing fluctuations around the trend.

The analytical alignment task boils down to the following:

1. Determination, based on actual data, of the type of hypothetical function that can most adequately reflect the development trend of the indicator under study.

2. Finding the parameters of the specified function (equation) from empirical data.

3. Calculation according to the found equation of theoretical (aligned) levels.

The choice of a particular function is carried out, as a rule, on the basis of a graphical representation of empirical data.

Regression equations are used as models; their parameters are calculated by the least squares method.

Below are the most commonly used regression equations for leveling time series, indicating which development trends are most suitable for reflecting.

Special algorithms and computer programs exist for finding the parameters of the above equations. In particular, the parameters of a straight-line equation $\hat{y}_t = b_0 + b_1 t$ can be found as follows:

$$b_1 = \frac{n \sum t y - \sum t \sum y}{n \sum t^2 - \left(\sum t\right)^2}, \qquad b_0 = \bar{y} - b_1 \bar{t}$$

If the periods or moments of time are numbered so that $\sum t = 0$, these formulas simplify considerably and become

$$b_0 = \frac{\sum y}{n}, \qquad b_1 = \frac{\sum t y}{\sum t^2}$$

On a chart, the aligned levels lie on one straight line passing as close as possible to the actual levels of the time series. The sum of squared deviations $\sum (y - \hat{y})^2$ reflects the influence of random factors.

It is used to calculate the average (standard) error of the equation:

$$s = \sqrt{\frac{\sum (y - \hat{y})^2}{n - m}}$$

Here n is the number of observations and m is the number of parameters in the equation (here there are two, $b_1$ and $b_0$).
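A straight-line trend with time numbered so that the sum of t is zero, together with the standard error of the equation, can be sketched as below. The function names are hypothetical, and the annual levels are an assumption reconstructed from figures quoted earlier in the text, not the original table.

```python
import math


def fit_linear_trend(levels):
    """Least-squares line y = b0 + b1 * t with time centered so that sum(t) = 0.

    For an odd number of levels t runs ..., -1, 0, 1, ...; for an even number
    it runs in half steps (..., -0.5, 0.5, ...), which still sums to zero.
    """
    n = len(levels)
    t = [i - (n - 1) / 2 for i in range(n)]
    b0 = sum(levels) / n
    b1 = sum(ti * yi for ti, yi in zip(t, levels)) / sum(ti * ti for ti in t)
    fitted = [b0 + b1 * ti for ti in t]
    return b0, b1, fitted


def standard_error(levels, fitted, m=2):
    """Standard error of the equation; m is the number of fitted parameters."""
    sse = sum((y - f) ** 2 for y, f in zip(levels, fitted))
    return math.sqrt(sse / (len(levels) - m))


# ASSUMPTION: levels for 1994-1998 in thousand tons, not taken from the table.
levels = [200, 210, 218, 230, 234]
b0, b1, fitted = fit_linear_trend(levels)
error = standard_error(levels, fitted)
print(round(b0, 1), round(b1, 1), round(error, 2))
```

Centering the time index is what removes the cross terms from the normal equations, so $b_0$ is simply the mean level.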

The main tendency (trend) shows how systematic factors affect the levels of a time series, while the fluctuation of the levels around the trend (measured by the standard error of the equation) serves as a measure of the impact of residual factors.

Fisher's F test is also used to assess the quality of the time series model. It is the ratio of two variances: the variance due to the regression, i.e. to the studied factor, and the variance due to random causes, i.e. the residual variance:

$$F = \frac{\sigma^2_{regr}}{\sigma^2_{resid}}$$

In expanded form, the criterion can be written as follows:

$$F = \frac{\sum (\hat{y} - \bar{y})^2 / (m - 1)}{\sum (y - \hat{y})^2 / (n - m)},$$

where n is the number of observations, i.e. the number of levels in the series; m is the number of parameters in the equation; $y$ is the actual level of the series; $\hat{y}$ is the aligned level of the series; and $\bar{y}$ is the mean level of the series.

Even a model that is more successful than the others may not be satisfactory. It can be recognized as satisfactory only if its F criterion exceeds the known critical boundary, which is established from tables of the F distribution.
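The F ratio can be computed directly from the actual and aligned levels. In the sketch below the function name is hypothetical, the actual levels are assumed (reconstructed from figures quoted earlier in the text), and the aligned levels are those of a least-squares straight line through them. The critical value still has to be taken from an F table for (m - 1) and (n - m) degrees of freedom; it is deliberately not hard-coded here.

```python
def f_statistic(actual, fitted, m=2):
    """Ratio of regression variance to residual variance.

    m is the number of parameters in the trend equation (2 for a straight line).
    """
    n = len(actual)
    mean_level = sum(actual) / n
    explained = sum((f - mean_level) ** 2 for f in fitted)
    residual = sum((y - f) ** 2 for y, f in zip(actual, fitted))
    return (explained / (m - 1)) / (residual / (n - m))


# ASSUMPTION: illustrative levels and their straight-line fit.
actual = [200, 210, 218, 230, 234]
fitted = [200.8, 209.6, 218.4, 227.2, 236.0]
f_value = f_statistic(actual, fitted)
print(round(f_value, 1))
```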

Essence and classification of indices.

An index in statistics is understood as a relative indicator that characterizes the change in the magnitude of a phenomenon in time, space, or in comparison with any standard.

The main element of the index ratio is the indexed value. The indexed value is understood as the value of the attribute of the statistical population, the change in which is the object of study.

There are three main tasks with indexes:

1) assessment of changes in a complex phenomenon;

2) determination of the influence of individual factors on the change in a complex phenomenon;

3) comparison of the magnitude of some phenomenon with the magnitude of the past period, the magnitude for another territory, as well as with the standards, plans, forecasts.

Indices are classified according to three criteria:

1) according to the content of the indexed values;

2) according to the degree of coverage of the elements of the population;

3) according to the methods of calculating general indices.

By content of indexed values, the indices are divided into indices of quantitative (volumetric) indicators and indices of qualitative indicators. Indices of quantitative indicators - indices of the physical volume of industrial production, physical volume of sales, headcount, etc. Indices of qualitative indicators - indices of prices, production costs, labor productivity, average wages, etc.

According to the degree of coverage of the units of the population, the indices are divided into two classes: individual and general. To characterize them, we introduce the following conventions adopted in the practice of using the index method:

q - the quantity (volume) of a product in physical terms; p - unit price; z - unit cost of production; t - the time spent on producing a unit of output (labor intensity); w - output in value terms per unit of time; v - output in physical terms per unit of time; T - the total time spent or the number of employees.

To show which period or object the indexed values belong to, it is customary to put subscripts at the bottom right of the corresponding symbol. For example, in indices of dynamics the subscript 1 is normally used for the compared (current, reporting) periods, and the subscript 0 for the periods with which the comparison is made (base periods).

Individual indexes serve to characterize changes in individual elements of a complex phenomenon (for example, change in the volume of output of one type of product). They represent the relative values ​​of dynamics, fulfillment of obligations, comparison of indexed values.

The individual index of the physical volume of production is determined as $i_q = q_1 / q_0$.

From an analytical point of view, these individual indices of dynamics are similar to growth rates and characterize the change in the indexed value in the current period compared with the base period, i.e. they show by how many times it has increased (decreased) or by what percentage it has grown (declined). Index values are expressed as coefficients or percentages.

General (summary) index reflects the change in all elements of a complex phenomenon.

The aggregate index is the main form of the index. It is called aggregate because its numerator and denominator are each an "aggregate", i.e. a sum of combined elements.

Average indices, their definition.

In addition to aggregate indices, statistics use their other form - weighted average indices. They are resorted to when the available information does not allow the total aggregate index to be calculated. So, if there are no data on prices, but there is information about the cost of products in the current period and individual price indices for each product are known, then the general price index cannot be determined as an aggregate, but it is possible to calculate it as the average of individual ones. In the same way, if the quantities of individual types of products produced are not known, but individual indices and the cost of production for the base period are known, then the general index of the physical volume of production can be determined as a weighted average.

An average index is an index calculated as the average of individual indices. Since the aggregate index is the main form of the general index, the average index must be identical to the aggregate index. Two forms of averages are used to calculate average indices: arithmetic and harmonic.

The arithmetic mean index is identical to the aggregate index if the weights of the individual indices are the terms of the denominator of the aggregate index. Only in this case the value of the index, calculated according to the formula of the arithmetic mean, will be equal to the aggregate index.
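The identity between the arithmetic mean index and the aggregate index can be illustrated for the physical volume index. The sketch assumes the aggregate form with base-period prices as weights; the function names and the data for two goods are hypothetical.

```python
def aggregate_volume_index(q0, q1, p0):
    """Aggregate physical volume index with base-period prices as weights."""
    numerator = sum(q * p for q, p in zip(q1, p0))
    denominator = sum(q * p for q, p in zip(q0, p0))
    return numerator / denominator


def arithmetic_mean_volume_index(individual, base_values):
    """Mean of individual indices weighted by base-period values q0 * p0,
    i.e. by the terms of the denominator of the aggregate index."""
    weighted = sum(i * w for i, w in zip(individual, base_values))
    return weighted / sum(base_values)


# Hypothetical data for two goods: base and current quantities, base prices.
q0 = [10, 20]
q1 = [12, 18]
p0 = [5, 10]

individual = [cur / base for base, cur in zip(q0, q1)]
base_values = [q * p for q, p in zip(q0, p0)]

aggregate = aggregate_volume_index(q0, q1, p0)
mean_index = arithmetic_mean_volume_index(individual, base_values)
print(round(aggregate, 4), round(mean_index, 4))
```

With any other weights the two values would generally differ, which is the point of the condition stated above.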

Expectation and variance

Let us measure a random variable N times, for example, we measure the wind speed ten times and want to find the average value. How is the mean related to the distribution function?

We will roll a die a large number of times. The number of points that comes up on each roll is a random variable and can take any integer value from 1 to 6. The arithmetic mean of the points over all rolls is also a random value, but for large N it tends to a very specific number, the mathematical expectation $M_x$. In this case $M_x = 3.5$.

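How did this value come about? Suppose that in N trials one point came up $N_1$ times, two points $N_2$ times, and so on. Then the arithmetic mean of the points is $(1 \cdot N_1 + 2 \cdot N_2 + \ldots + 6 \cdot N_6)/N$. As $N \to \infty$, the share of outcomes in which one point came up tends to $N_1/N \to 1/6$; similarly, $N_2/N \to 1/6$, and so on. Hence $M_x = (1 + 2 + \ldots + 6)/6 = 3.5$.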

Model 4.5. Dice

Suppose now that we know the distribution law of a random variable x, that is, we know that x can take the values $x_1, x_2, \ldots, x_k$ with probabilities $p_1, p_2, \ldots, p_k$.

The expected value $M_x$ of the random variable x equals:

$$M_x = \sum_{i=1}^{k} x_i p_i$$

Answer. 2.8.

The mathematical expectation is not always a reasonable estimate of some random variable. So, to estimate the average wage, it is more reasonable to use the concept of the median, that is, such a value that the number of people receiving a salary less than the median and a higher one coincide.

The median of a random variable is the number $x_{1/2}$ such that $p(x < x_{1/2}) = 1/2$.

In other words, the probability $p_1$ that the random variable x is less than $x_{1/2}$ and the probability $p_2$ that it is greater than $x_{1/2}$ are the same and equal to 1/2. The median is not uniquely defined for every distribution.

Let's go back to a random variable x, which can take the values $x_1, x_2, \ldots, x_k$ with probabilities $p_1, p_2, \ldots, p_k$.

The variance (dispersion) of a random variable x is the mean value of the squared deviation of the random variable from its mathematical expectation:

$$D_x = M(x - M_x)^2 = \sum_{i=1}^{k} (x_i - M_x)^2 p_i$$

Example 2

Under the conditions of the previous example, calculate the variance and standard deviation of a random variable x.

Answer. 0.16, 0.4.

Model 4.6. Target shooting

Example 3

Find the probability distribution of the number of points that come up on a die in a single roll, and its median, expectation, variance and standard deviation.

Each face is equally probable, so each of the values 1, 2, ..., 6 has probability 1/6. Then $M_x = 3.5$ and $D_x = 35/12 \approx 2.92$, and any value between 3 and 4 can serve as the median.

The root-mean-square deviation is $\sigma = \sqrt{D_x} \approx 1.71$. It is seen that the deviation of the value from the mean is very large.
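The characteristics of a single die roll can be computed exactly with rational arithmetic. This is a minimal sketch of the definitions of expectation and variance given above; the variable names are illustrative.

```python
import math
from fractions import Fraction

# Distribution of a fair die: each face 1..6 has probability 1/6.
values = range(1, 7)
p = Fraction(1, 6)

expectation = sum(x * p for x in values)
variance = sum((x - expectation) ** 2 * p for x in values)
std_dev = math.sqrt(variance)

# The same variance through the mean of the squares minus the squared mean.
variance_alt = sum(x * x * p for x in values) - expectation ** 2

print(expectation, variance, round(std_dev, 2))
```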

Mathematical expectation properties:

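  • The mathematical expectation of the sum of random variables is equal to the sum of their mathematical expectations: $M_{x+y} = M_x + M_y$.

  • The mathematical expectation of the product of independent random variables is equal to the product of their mathematical expectations: $M_{xy} = M_x M_y$.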

Example 4

Find the mathematical expectation of the sum and product of the points rolled on two dice.

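In example 3, we found that for one die $M_x = 3.5$. So, for two dice the expectation of the sum is $3.5 + 3.5 = 7$, and, since the two rolls are independent, the expectation of the product is $3.5 \cdot 3.5 = 12.25$.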

Dispersion properties:

  • The variance of the sum of independent random variables is equal to the sum of variances:

$D_{x+y} = D_x + D_y$.

Suppose that N rolls of a die give a total of y points. Then $D_y = N D_x$, so the arithmetic mean $y/N$ has variance $D_x / N$ and standard deviation $\sigma_x / \sqrt{N}$.

This result is true not only for dice rolls. In many cases it determines the accuracy with which the mathematical expectation can be measured empirically. It can be seen that as the number of measurements N increases, the spread of the mean around the expectation, that is, its standard deviation, decreases in proportion to $1/\sqrt{N}$.

The variance of a random variable is related to the mathematical expectation of the square of this random variable by the following relationship:

$$D_x = M(x^2) - (M_x)^2$$

To prove it, expand the square, $(x - M_x)^2 = x^2 - 2 x M_x + (M_x)^2$, and find the mathematical expectations of both sides of this equality. By definition, the expectation of the left-hand side is $D_x$.

By the properties of mathematical expectation, the expectation of the right-hand side is equal to $M(x^2) - 2 M_x M_x + (M_x)^2 = M(x^2) - (M_x)^2$.

Standard deviation

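The standard deviation is equal to the square root of the variance: $\sigma = \sqrt{D}$.
When determining the standard deviation for a sufficiently large population (n > 30), the following formulas are used:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}, \qquad \sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$$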

The square root of the variance is called the standard deviation from the mean, which is calculated as follows:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$$

An elementary algebraic transformation of the standard deviation formula brings it to the following form:

$$\sigma = \sqrt{\overline{x^2} - (\bar{x})^2}$$

This formula is often more convenient in practical calculations.

The standard deviation, like the mean linear deviation, shows how much, on average, specific values of the attribute deviate from their mean. The standard deviation is never smaller than the mean linear deviation. For distributions close to normal, the relationship between them is approximately

$$\sigma \approx 1.25\, \bar{d}$$

Knowing this ratio, one can determine the unknown indicator from the known one, for example calculate $\sigma$ from $\bar{d}$ and vice versa. The standard deviation measures the absolute size of the variability of the attribute and is expressed in the same units of measurement as the values of the attribute (rubles, tons, years, etc.). It is an absolute measure of variation.

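For alternative attributes, for example the presence or absence of higher education or of insurance, the variance and standard deviation formulas are:

$$\sigma^2 = p q, \qquad \sigma = \sqrt{p q},$$

where p is the share of units possessing the attribute and $q = 1 - p$.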

Let us show the calculation of the standard deviation according to the data of a discrete series, characterizing the distribution of students of one of the faculties of the university by age (Table 6.2).

Table 6.2.

The results of auxiliary calculations are given in columns 2-5 of table. 6.2.

The average age of a student, years, is determined by the formula of the arithmetic weighted average (column 2):

The squares of the deviation of the student's individual age from the average are contained in columns 3-4, and the products of the squares of the deviations by the corresponding frequencies are in column 5.

The variance of the age of students, years, is found by the formula (6.2):

Then $\sigma = \sqrt{3.43} \approx 1.85$ years, i.e. each specific value of a student's age deviates from the average by 1.85 years on average.

The coefficient of variation

In terms of its absolute value, the standard deviation depends not only on the degree of variation of the trait, but also on the absolute levels of variants and the mean. Therefore, it is impossible to directly compare the standard deviations of the variational series with different mean levels. To be able to make such a comparison, you need to find the specific weight of the average deviation (linear or quadratic) in the arithmetic mean, expressed as a percentage, i.e. calculate relative indices of variation.

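The linear coefficient of variation is calculated by the formula

$$V_{\bar{d}} = \frac{\bar{d}}{\bar{x}} \cdot 100\%$$

The coefficient of variation is determined by the following formula:

$$V = \frac{\sigma}{\bar{x}} \cdot 100\%$$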

In the coefficients of variation, not only the incompatibility associated with different units of measurement of the studied attribute is eliminated, but also the incompatibility arising from differences in the value of the arithmetic means. In addition, the indicators of variation characterize the homogeneity of the population. The population is considered homogeneous if the coefficient of variation does not exceed 33%.

According to the table. 6.2 and the calculation results obtained above, we determine the coefficient of variation,%, according to the formula (6.3):

If the coefficient of variation exceeds 33%, this indicates the heterogeneity of the studied population. The value obtained in our case indicates that the set of students is homogeneous in terms of age. Thus, an important function of the generalized indicators of variation is to assess the reliability of the mean. The smaller $\bar{d}$, $\sigma^2$ and V, the more homogeneous the set of phenomena and the more reliable the obtained average. According to the "three sigma rule" of mathematical statistics, in normally distributed series or series close to normal, deviations from the arithmetic mean not exceeding ±3σ occur in 997 cases out of 1000. Thus, knowing $\bar{x}$ and σ, one can get a general initial idea of the variation series. If, for example, the average wage of an employee in a firm was 25,000 rubles and σ is 100 rubles, then with a probability close to certainty it can be asserted that the wages of the firm's employees range within 25,000 ± 3 x 100, i.e. from 24,700 to 25,300 rubles.

The simple geometric mean is calculated by the formula:

$$\bar{x}_g = \sqrt[n]{x_1 x_2 \cdots x_n}$$

Geometric weighted

The weighted geometric mean is determined by the formula:

$$\bar{x}_g = \sqrt[\sum f_i]{x_1^{f_1} x_2^{f_2} \cdots x_n^{f_n}}$$

The average diameters of wheels, pipes, the average sides of the squares are determined using the mean square.

RMS values ​​are used to calculate some indicators, for example, the coefficient of variation, which characterizes the rhythm of production. Here, the standard deviation from the planned output for a certain period is determined using the following formula:

These values ​​accurately characterize the change in economic indicators in comparison with their base value, taken in its average value.

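Quadratic simple

The simple root mean square is calculated by the formula:

$$\bar{x}_q = \sqrt{\frac{\sum x_i^2}{n}}$$

Weighted square

The weighted root mean square is:

$$\bar{x}_q = \sqrt{\frac{\sum x_i^2 f_i}{\sum f_i}}$$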

Absolute indicators of variation include:

range of variation

mean linear deviation

variance

standard deviation

Range of variation (R)

The range of variation is the difference between the maximum and minimum values of the attribute:

$$R = x_{max} - x_{min}$$

It shows the limits within which the value of the trait changes in the studied population.

The work experience of five applicants at their previous jobs is 2, 3, 4, 7 and 9 years. Solution: range of variation = 9 - 2 = 7 years.

For a generalized characteristic of differences in the values of the attribute, average indicators of variation are calculated from the deviations from the arithmetic mean. The deviation from the mean is taken to be the difference $x_i - \bar{x}$.

To prevent the sum of the deviations of the values of an attribute from the mean from turning into zero (the zero property of the mean), one must either disregard the signs of the deviations, i.e. sum their absolute values, or square the deviations.

Average linear and standard deviation

The average linear deviation is the arithmetic mean of the absolute deviations of the individual values of the attribute from the mean.

The simple average linear deviation:

$$\bar{d} = \frac{\sum |x_i - \bar{x}|}{n}$$

The work experience of five applicants at their previous jobs is 2, 3, 4, 7 and 9 years.

In our example: $\bar{x} = (2 + 3 + 4 + 7 + 9)/5 = 5$ years; $\bar{d} = (3 + 2 + 1 + 2 + 4)/5 = 2.4$ years.

Answer: 2.4 years.
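The range of variation and the average linear deviation for the five applicants can be reproduced with a short sketch. The function names are hypothetical; the data are the work experience values from the example.

```python
def variation_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


def mean_linear_deviation(values):
    """Arithmetic mean of absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


experience = [2, 3, 4, 7, 9]  # years of work experience of five applicants
spread = variation_range(experience)
linear_deviation = mean_linear_deviation(experience)
print(spread, round(linear_deviation, 1))
```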

The weighted mean linear deviation is applied to grouped data:

$$\bar{d} = \frac{\sum |x_i - \bar{x}| f_i}{\sum f_i}$$

The average linear deviation, due to its convention, is used in practice relatively rarely (in particular, to characterize the fulfillment of contractual obligations for uniformity of delivery; in the analysis of product quality, taking into account the technological features of production).

Standard deviation

The most refined characteristic of variation is the mean square deviation, also called the standard deviation. The standard deviation (σ) is equal to the square root of the mean square of the deviations of the individual values of the attribute from the arithmetic mean.

The simple standard deviation:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$

The weighted standard deviation is used for grouped data:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$$

Under normal distribution conditions, the following relationship holds between the standard deviation and the mean linear deviation: $\sigma \approx 1.25\, \bar{d}$.

The standard deviation, being the main absolute measure of variation, is used to determine the values ​​of the ordinates of the normal distribution curve, in calculations related to organizing sample observation and establishing the accuracy of sample characteristics, as well as when assessing the boundaries of variation of a feature in a homogeneous population.
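For comparison with the average linear deviation, the simple and weighted standard deviations can be sketched as follows. The function names are hypothetical; the ungrouped data are the work experience values used earlier in the text, and the grouped data are the same five values given unit frequencies purely to show that the two formulas agree.

```python
import math


def population_std(values):
    """Square root of the mean squared deviation from the arithmetic mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


def weighted_std(values, frequencies):
    """Standard deviation for grouped data with frequencies as weights."""
    total = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / total
    squared = sum((x - mean) ** 2 * f for x, f in zip(values, frequencies))
    return math.sqrt(squared / total)


experience = [2, 3, 4, 7, 9]  # years, from the example in the text
sigma = population_std(experience)
sigma_grouped = weighted_std(experience, [1, 1, 1, 1, 1])
print(round(sigma, 2), round(sigma_grouped, 2))
```

For these five values the ratio of the standard deviation to the average linear deviation (2.4 years) is noticeably below 1.25; that ratio is only expected for distributions close to normal.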

Here $x_i$ are the random (current) values and $\bar{x}$ is the mean value of the random variable over the sample, calculated by the formula:

$$\bar{x} = \frac{\sum x_i}{n}$$

So, variance is the mean square of the deviations. That is, first the mean is calculated, then the difference between each original value and the mean is squared, the squares are added up, and the sum is divided by the number of values in the given population.

The difference between the individual value and the mean reflects the measure of the deviation. It is squared so that all deviations become exclusively positive numbers and to avoid mutual destruction of positive and negative deviations when they are summed up. Then, with the squares of the deviations, we simply calculate the arithmetic mean.

The answer to the magic word "variance" lies in just these three words: mean - square - deviations.

Mean square deviation (RMS)

Taking the square root of the variance, we obtain the so-called "root-mean-square deviation". It is also called the "standard deviation" or "sigma" (after the Greek letter σ). The formula for the standard deviation is:

$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$

So, variance is sigma squared, or is standard deviation squared.

The standard deviation, obviously, also characterizes the measure of data scattering, but now (in contrast to the variance) it can be compared with the original data, since they have the same units of measurement (this is evident from the calculation formula). The range of variation is the difference between the extreme values. Standard deviation, as a measure of uncertainty, is also involved in many statistical calculations. With its help, the degree of accuracy of various estimates and forecasts is established. If the variation is very large, then the standard deviation will also turn out to be large, therefore, the forecast will be inaccurate, which will be expressed, for example, in very wide confidence intervals.

Therefore, in the methods of statistical data processing in the appraisals of real estate objects, depending on the required accuracy of the task, the rule of two or three sigma is used.

To compare the two sigma rule and the three sigma rule, we use the Laplace formula:

$$P(\alpha < X < \beta) = \Phi\!\left(\frac{\beta - a}{s}\right) - \Phi\!\left(\frac{\alpha - a}{s}\right),$$

where $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_0^x e^{-t^2/2}\, dt$ is the Laplace function;

α = minimum value

β = maximum value

s = sigma value (standard deviation)

a = mean

Here a particular form of the Laplace formula is used, in which the boundaries α and β of the values of the random variable X are equally spaced from the distribution center a = M(X) by some value d: α = a - d, β = a + d. Then

$$P(|X - a| < d) = 2\,\Phi\!\left(\frac{d}{s}\right) \qquad (1)$$

Formula (1) determines the probability of a given deviation d of a normally distributed random variable X from its mathematical expectation M(X) = a. Taking d = 2s and d = 3s in formula (1) in turn, we get:

$$P(|X - a| < 2s) = 2\,\Phi(2) \approx 0.954 \qquad (2)$$

$$P(|X - a| < 3s) = 2\,\Phi(3) \approx 0.9973 \qquad (3)$$
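The probabilities behind the two and three sigma rules can be reproduced with the error function from the Python standard library, since for a normal variable the probability of staying within k standard deviations equals erf(k / sqrt(2)). This is a sketch only; the function names are hypothetical, and the mean of 25,000 with a standard deviation of 100 is the wage illustration used elsewhere in the text.

```python
import math


def prob_within_k_sigma(k):
    """P(|X - a| < k * s) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


def sigma_interval(a, s, k):
    """Interval (a - k * s, a + k * s) around the distribution center a."""
    return a - k * s, a + k * s


p_two = prob_within_k_sigma(2)
p_three = prob_within_k_sigma(3)
print(round(p_two, 4), round(p_three, 4))

# Illustrative values: mean 25,000 and standard deviation 100.
low, high = sigma_interval(25_000, 100, 3)
print(low, high)
```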

The Two Sigma Rule

Almost reliably (with a confidence level of 0.954) it can be argued that all values ​​of a random variable X with a normal distribution law deviate from its mathematical expectation M (X) = a by an amount not greater than 2s (two standard deviations). Confidence probability (Pd) is the probability of events that are conventionally taken as reliable (their probability is close to 1).

Let us illustrate the two sigma rule geometrically. Fig. 6 shows a Gaussian curve with distribution center a. The area bounded by the entire curve and the Ox axis is 1 (100%), and the area of the curvilinear trapezoid between the abscissas a - 2s and a + 2s, according to the two sigma rule, is 0.954 (95.4% of the total area). The area of the shaded regions is 1 - 0.954 = 0.046 (about 5% of the total area). These regions are called the critical region of the values of the random variable. Values of a random variable that fall into the critical region are unlikely and in practice are conventionally taken as impossible.

The probability of conditionally impossible values is called the level of significance of the random variable. The level of significance is related to the confidence level by the formula:

$$q = (1 - P_d) \cdot 100\%,$$

where q is the level of significance, expressed as a percentage.

The Three Sigma Rule

When solving problems that require greater reliability, when the confidence probability (Pd) is taken equal to 0.997 (more precisely, 0.9973), the three sigma rule, based on formula (3), is used instead of the two sigma rule.



According to the three sigma rule with a confidence level of 0.9973, the critical area will be the range of values ​​of the feature outside the interval (a-3s, a + 3s). The significance level is 0.27%.

In other words, the probability that the absolute value of the deviation will exceed three times the standard deviation is very small, namely 0.0027 = 1 - 0.9973. This means that it can happen in only 0.27% of cases. Such events, following the principle that unlikely events are practically impossible, can be considered practically impossible. That is, the sample is highly accurate.

This is the essence of the Three Sigma Rule:

If a random variable is normally distributed, then the absolute value of its deviation from the mathematical expectation practically never exceeds three times the standard deviation (RMSD).

In practice, the three sigma rule is applied as follows: if the distribution of the studied random variable is unknown but the condition specified in the above rule is met, there is reason to assume that the studied quantity is normally distributed; otherwise, it is not normally distributed.

The level of significance is taken depending on the permissible degree of risk and the task at hand. For real estate valuation, a less precise sample is usually adopted, following the two sigma rule.