Summary Statistics – Explanation and Examples

What are Summary Statistics?


Summary statistics are numbers or words that describe a data set or data sets simply.

This includes measures of centrality, dispersion, and correlation as well as descriptions of the overall shape of the data set.

Summary statistics are used in all branches of math and science that employ statistics. These include probability, economics, biology, psychology, and astronomy.

Before moving on with this section, make sure to review measures of central tendency and standard deviation.


How to Interpret Summary Statistics

Summary statistics are numbers or words that describe a data set as succinctly as possible.

These include measures of central tendency such as mean, median, and mode. They also include measures of dispersion such as range and standard deviation. Summary statistics for multivariate data sets may also include measures of correlation such as the correlation coefficient.

Descriptions of the overall data shape such as “normally distributed” or “skewed right” are also part of summary statistics.

Summary statistics give a small “snapshot” of a data set that is more approachable than large quantities of data and more easily generalized than random data points. Like the summary of a story, they analyze and describe even large data sets in just a few numbers and words.

It is best to interpret individual components of summary statistics in light of the other components.

In general, a larger range and larger standard deviation indicate a wider dispersion. A wide range paired with a comparatively small standard deviation suggests the presence of outliers.

Similarly, when it comes to measures of central tendency, a mean that is higher than the median indicates a skew to the right. Likewise, a mean that is less than the median indicates a skew to the left. If they are about the same, the data set is likely normally distributed.
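This mean-versus-median rule of thumb is easy to demonstrate. The sketch below uses Python's built-in statistics module on a small hypothetical sample:

```python
from statistics import mean, median

# Hypothetical right-skewed sample: one large value pulls the mean
# far above the median.
skewed_right = [1, 2, 3, 4, 100]

print(mean(skewed_right))    # mean is 22
print(median(skewed_right))  # median is 3

# mean > median suggests a skew to the right
assert mean(skewed_right) > median(skewed_right)
```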

Summary statistics are measures of central tendency, dispersion, and correlation combined with descriptions of shape that provide a simple overview of a data set or data sets.

These measures can include mean, median, mode, standard deviation, range, and the correlation coefficient.
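Most of these measures are one call away in Python's built-in statistics module; the sketch below uses a small hypothetical sample, and the range is computed by hand since the module has no range function:

```python
from statistics import mean, median, mode, stdev

# Hypothetical sample for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "n": len(data),
    "mean": mean(data),                  # 5.0
    "median": median(data),              # 4.5
    "mode": mode(data),                  # 4
    "range": max(data) - min(data),      # 7
    "stdev": stdev(data),                # sample standard deviation
}
print(summary)
```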

One example of an important use for summary statistics is a census. In the United States, there are over $320$ million people. This means that a census includes a lot of data points. Since a census also usually includes information such as age, family size, address, occupation, etc., these are multivariate data points!

But, civil servants and politicians need to make decisions based on census results. The easiest way to do that is to provide decision makers with summary statistics of census results. These snapshots are easier to understand than a collection of $320$ million+ data points.

Common Examples

This section covers common examples of problems involving summary statistics and their step-by-step solutions.

A data set has a mean of $200$, a median of $50$, a mode of $40$, and a range of $1500$. What do the summary statistics say about this data set?

The summary statistics for this data set indicate a strong skew to the right. This means that there are one or more upper outliers.

How do they show this?

Outliers have a strong effect on the mean of a data set but very little effect on the median. This means that upper outliers will increase the mean while the median stays in place. In fact, outliers are often the main reason for a discrepancy between the median and the mean of a data set.

Clearly, there is a large difference between $50$ and $200$, especially in light of the fact that the mode is $40$. This means that half of the data points are more than $50$ and half are less, with $40$ being the most commonly occurring term. Saying that a typical term is $200$ certainly does not fit with that picture.

Likewise, the wide range indicates large values are possible.

Additional summary statistics that would paint a fuller picture are the highest and lowest values along with the standard deviation.

Find the summary statistics for the following data set.

$(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 11, 13, 17, 25, 33)$

Common summary statistics include mean, median, mode, range, and standard deviation.

In this case, the mean is equal to:

$\frac{1(6)+2(3)+3(2)+4(2)+5+6(2)+7+8+9+11+13+17+25+33}{24} = \frac{166}{24} = \frac{83}{12}$.

This is about equal to $6.9167$.

The median in this case is equal to the average of the twelfth and thirteenth numbers. These are both four, however, so four is the median.

Since one appears more often than any other number, it is the mode.

These are the measures of central tendency. On the other hand, the common measures of dispersion are range and standard deviation.

The range is just equal to the largest number minus the smallest number. This is equal to $33-1 = 32$.

Standard deviation, however, is more difficult to calculate. The sample standard deviation is equal to:

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$.

These calculations take a while. For larger data sets, it is often easier to use a standard deviation calculator.

Whether calculating by hand or with technology, however, the standard deviation is about $8.086.$

The total summary, then, is:

Mean: $6.9167$

Median: $4$

Mode: $1$

Range: $32$

Standard Deviation: $8.086$.

The summary statistics may also note that there are $24$ elements in the data set, with the largest value being $33$ and the smallest value being $1$.
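These figures can be double-checked in a few lines of Python with the standard statistics module (note that statistics.stdev uses the $n-1$ divisor, the sample standard deviation):

```python
from statistics import mean, median, mode, stdev

# The data set from the example above.
data = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6,
        7, 8, 9, 11, 13, 17, 25, 33]

print(round(mean(data), 4))   # 6.9167
print(median(data))           # 4.0
print(mode(data))             # 1
print(max(data) - min(data))  # 32
print(round(stdev(data), 3))  # 8.086
```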

Consider the following data set:

$(85, 86, 88, 88, 90, 91, 94, 94, 96, 97, 98, 98, 98, 99, 99, 99, 99, 100, 100, 100, 100, 100, 100, 101, 101, 101, 102, 102, 102, 103, 103, 104, 104, 105, 106, 106, 108, 109, 110, 110, 110, 113, 115)$.

What are the summary statistics for this data set? What do these statistics say about the data set?

This data set has 43 data points. The highest value is $115$, while the lowest value is $85$. This means that the range is $115-85=30$.

The median of this data set is going to be the twenty-second term, which is $100$.

Likewise, the mode of the data set is $100$ because it appears more than any other value.

The mean of this data set is equal to:

$\frac{4314}{43}$. This is about equal to $100.3$.

Plugging the data into a standard deviation calculator reveals that the standard deviation is approximately $6.9$.

Therefore, the summary statistics on this data set are:

Mean: $100.3$

Median: $100$

Mode: $100$

Range: $30$

Standard Deviation: $6.9$

Number of Terms: $43$

Highest Value: $115$

Lowest Value: $85$.

Based on these statistics, the data is probably normally distributed because all of the measures of central tendency are almost exactly equal.
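As before, the whole summary can be verified with Python's statistics module:

```python
from statistics import mean, median, mode, stdev

# The data set from the example above.
data = [85, 86, 88, 88, 90, 91, 94, 94, 96, 97, 98, 98, 98,
        99, 99, 99, 99, 100, 100, 100, 100, 100, 100, 101, 101,
        101, 102, 102, 102, 103, 103, 104, 104, 105, 106, 106,
        108, 109, 110, 110, 110, 113, 115]

assert len(data) == 43 and sum(data) == 4314

print(round(mean(data), 1))   # 100.3
print(median(data))           # 100
print(mode(data))             # 100
print(max(data) - min(data))  # 30
print(round(stdev(data), 1))  # 6.9
```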

A shipping company weighs a sample of packages before they are sent out. They get the following results.

$(0.1, 0.1, 0.3, 0.5, 0.8, 0.9, 1.1, 1.2, 1.4, 1.5, 1.5, 1.5, 1.6, 1.7, 1.7, 1.8, 1.9, 2.1, 2.9, 3.3, 4.0, 5.3, 5.5, 6.8, 9.2, 21.8)$.

What are the summary statistics for the data? What do they say about the data in context?

The summary statistics for this data set are:

Number of Terms: $26$

Mode (most common value): $1.5$

Median (average of the thirteenth and fourteenth terms): $1.65$

Mean (sum of the terms divided by $26$): About $3.096$

Highest Value: $21.8$

Lowest Value: $0.1$

Range (difference of highest and lowest values): $21.7$

Standard Deviation (roughly, the average deviation from the mean): $4.397$

In this data set, the median and mode are approximately the same, but the mean is a bit higher. It is not, however, a full standard deviation higher. This means that the data is slightly skewed to the right, but not too much. This is likely due to the presence of some outliers.

In context, this means that there are a few heavier packages that the company sends, but, for the most part, the packages weigh around $1.65$ pounds.

Practice Problems

  • A data set has a standard deviation of $1$, a mean of $0$, a median of $4$, and a mode of $3.5$. What can be said about the data set?
  • Another data set is approximately normally distributed. It has a mean of $16$ and a standard deviation of $3$. In what range do the median and mode likely fall?
  • Describe what the summary statistics would look like for a U-shaped data set.
  • Find the summary statistics for the following data set: $(-5, -4, -4, -3, -3, -3, -2, -2, -2, -1, -1, -1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5)$.
  • A charity receives donations at an event. The donation amounts in dollars are: $(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 11, 12, 15, 15, 20, 20, 20, 20, 40, 40, 45, 50, 50, 50, 100, 200)$. Find the summary statistics for the donations and interpret them in context.

Answers

  • Since the difference between the median and the mean is greater than the standard deviation and the mode is close to the median, the data is likely skewed to the left.
  • Since this data is normally distributed, the median and mode are likely within $3$ units in either direction of the mean. That is, they are likely in the range of $13$ to $19$.
  • In such a data set, the mean and median would be about the same. The standard deviation would be large relative to the range. The mode would likely be very high or very low (or both).
  • Number of terms: $28$. The mean is about $-0.3214$, the median is $0$, and the modes are $0$ and $1$. The range between the highest and lowest values of $5$ and $-5$ is $10$, and the standard deviation is about $2.405$. The data is approximately normally distributed.
  • There were $38$ donations averaging about $21.08$ dollars. The most common donation was $5$ dollars, and the median donation was $10$ dollars. The donations ran from $1$ to $200$ dollars, which means the range was $199$. In this case, the standard deviation was about $36.31$, which means that there was a lot of variance in the donation amounts. The large difference between the mean and median donation indicates an outlier to the right, namely the $200$ dollar donation.
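The answer for the $28$-term data set can be verified with Python's statistics module (statistics.multimode, available from Python 3.8, reports both tied modes):

```python
from statistics import mean, median, multimode, stdev

# The 28-term practice data set.
data = [-5, -4, -4, -3, -3, -3, -2, -2, -2, -1, -1, -1,
        0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 5]

print(len(data))              # 28
print(round(mean(data), 4))   # -0.3214
print(median(data))           # 0.0
print(multimode(data))        # [0, 1]  (bimodal)
print(max(data) - min(data))  # 10
print(round(stdev(data), 3))  # 2.405
```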

All mathematical illustrations/objects created with GeoGebra.

Summarising Data


When you want to measure something in the natural world you usually have to take several measurements. This is because things are variable, so you need several results to get an idea of the situation. Once you have these measurements you need to summarize them in some way because sets of raw numbers are not easily interpreted by most people.

There are four key areas to consider when summarizing a set of numbers:

  • Centrality – the middle value or average.
  • Dispersion – how spread out the values are from the average.
  • Replication – how many values there are in the sample.
  • Shape – the data distribution, which relates to how “evenly” the values are spread either side of the average.

You need to present the first three summary statistics in order to summarize a set of numbers adequately. There are different measures of centrality and dispersion – the measures you select are based on the last item, shape (or data distribution).

An average is a measure of the middle point of a set of values. This central tendency (centrality) is an important measure and is usually what you are comparing when looking at differences between samples for example.

There are three main kinds of average:

  • Mean – the arithmetic mean, the sum of the values divided by the replication.
  • Median – the middle value when all the numbers are ranked in order.
  • Mode – the most frequent value(s) in a sample.

Of these three, the mean and the median are most commonly used in statistical analysis. The most appropriate average depends on the shape of the data sample.

The arithmetic mean is calculated by adding together the values in the sample. The sum is then divided by the number of items in the sample (the replication).

x̄ = (∑x) / n

The formula is shown above. The ∑ symbol represents “sum of”. The n represents the replication. The final mean is indicated using an overbar. This shows that the mean is your estimate of the true mean. This is because you usually measure only some of the items in a “population”; this is called a sample. If you measured everything then you would be able to calculate the true mean, which would be indicated by giving it a µ symbol.

The mean should only be used when the shape of the sample is appropriate. When the data are normally distributed the mean is a good summary of the average. If the data are not normally distributed the mean is not a good summary and you should use the median instead.

The median is the middle value, taken when you arrange your numbers in order (rank). This measure of the average does not depend on the shape of the data. The “formula” for working out the median depends on the ranks of the values: you want the value whose rank is (n/2) + 0.5, that is, the (n + 1)/2-th value in rank order.

If you have an odd number of values in your sample, the median is simply the middle value: with seven values, for example, the median is the value ranked fourth.

When you have an even number of values, the middle falls between two items. In that case the median is the value mid-way between the two middle items: if they are 4 and 7, for example, the median is 5.5.
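Python's built-in statistics.median handles both cases automatically. In this sketch the samples are hypothetical, with the even-length one chosen so that its two middle items are 4 and 7:

```python
from statistics import median

# Odd number of values: the middle one is the median.
odd = [2, 4, 7, 8, 9]
print(median(odd))   # 7

# Even number of values: the median is mid-way between the two
# middle items (here 4 and 7).
even = [1, 3, 4, 7, 9, 12]
print(median(even))  # 5.5
```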

The median is a good general choice for an average because it is not dependent on the shape of the data. When the data are normally distributed the mean and the median are coincident (or very close).

The Mode is the most frequent value in a sample. It is calculated by working out how many there are of each value in your sample. The one with the highest frequency is the mode. It is possible to get tied frequencies, in which case you report both values. The sample is then said to be bimodal. You might get more than two modal values!

The mode is not commonly used in statistical analysis. It tends to be used most often when you have a lot of values, and where you have integer values (although it can be calculated for any sample).

The mode is not dependent on the shape of your sample. Generally speaking you would expect your mode and median to be close, regardless of the sample distribution. If the sample is normally distributed the mode will usually also be close to the mean.
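Python's statistics.multimode (Python 3.8+) returns every value tied for the highest frequency, so it reports bimodal samples directly; the samples below are hypothetical:

```python
from statistics import multimode

# A sample with a single mode.
print(multimode([1, 2, 2, 3, 3, 3, 4]))  # [3]

# Tied frequencies give a bimodal sample: both values are reported.
print(multimode([1, 1, 2, 3, 3]))        # [1, 3]
```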

The dispersion of a sample refers to how spread out the values are around the average. If the values are close to the average, then your sample has low dispersion. If the values are widely scattered about the average your sample has high dispersion.

(Figure: normally distributed samples with low and high dispersion.)

The example figure shows samples that are normally distributed, that is, they are symmetrical around the average (mean). As far as dispersion goes, the principle is the same regardless of the shape of the data. However, different measures of dispersion will be more appropriate for different data distributions.

There are various measures of dispersion, such as:

  • Standard Deviation
  • Standard Error
  • Confidence Interval
  • Inter-Quartile Range

The choice of measurement depends largely on the shape of the data and what you want to focus on. In general, with normally distributed data you use the standard deviation. If the data are not normally distributed, you use the inter-quartile range.

The standard deviation is used when the data are normally distributed. You can think of it as a sort of “average deviation” from the mean. The general formula for calculating standard deviation looks like the following:

s = √( ∑(x − x̄)² / (n − 1) )

To work out standard deviation follow these steps:

  • Subtract the mean from each value in the sample.
  • Square the results from step 1 (this removes negative values).
  • Add together the squared differences from step 2.
  • Divide the summed squared differences from step 3 by n-1, which is the number of items in the sample (replication) minus one.
  • Take the square root of the result from step 4.

The final result is called s, the standard deviation. In most cases you will have taken a sample of values from a larger “population”, so your value of s is your estimate of standard deviation (the sample standard deviation). This is also why you used n-1 as the divisor in the formula. If you measured the entire population you can use n as the divisor. You would then have σ, which is the “true” standard deviation (called the population standard deviation).

In effect the -1 is a compensation factor. As n gets larger and therefore closer to the entire population, subtracting 1 has a smaller and smaller effect on the result. In most statistical analyses you will use sample standard deviation (and so n-1).
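The five steps translate directly into Python. This sketch (using a hypothetical sample) follows them literally and checks the result against the standard library's statistics.stdev, which uses the same n − 1 divisor:

```python
from math import sqrt
from statistics import stdev

def sample_sd(values):
    """Sample standard deviation, following the five steps above."""
    n = len(values)
    m = sum(values) / n                             # the mean
    squared_diffs = [(x - m) ** 2 for x in values]  # steps 1-2
    total = sum(squared_diffs)                      # step 3
    variance = total / (n - 1)                      # step 4: n-1 divisor
    return sqrt(variance)                           # step 5

data = [17, 16, 21, 19, 21, 24, 26]  # hypothetical sample
print(sample_sd(data))  # ≈ 3.6
print(stdev(data))      # the library agrees
```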

Inter-Quartile Range

The inter-quartile range (IQR) is a useful measure of the dispersion of data that are not normally distributed (see shape). You start by working out the median; this effectively splits the data into two chunks, with an equal number of values in each part. For each half you can now work out the value that is half-way between the median and the “end” (the maximum or minimum). This gives you the lower and upper quartiles. The difference between them is the IQR, which you usually express as a single value.

The IQR essentially “knocks off” the most extreme portions of the data sample, leaving you with a core 50% of your original data. A small IQR denotes a small dispersion and a large IQR a large dispersion.

As a by-product of working out the IQR you’ll usually end up with five values:

  • Minimum – the 0th quartile (or 0% quantile).
  • Lower quartile – the 1st quartile (or 25% quantile).
  • Median – the 2nd quartile (or 50% quantile).
  • Upper quartile – the 3rd quartile (or 75% quantile).
  • Maximum – the 4th quartile (or 100% quantile).

These 5 values split the data sample into four parts, which is why they are called quartiles. You can calculate the quartiles from the ranks of the data values like so:

  • Rank the values in ascending order. Use the mean rank for tied values.
  • The median corresponds to the item that has rank 0.5n + 0.5 (where n = replication).
  • The lower quartile corresponds to the item that has rank 0.25n + 0.75.
  • The upper quartile corresponds to the item that has rank 0.75n + 0.25.

If you are using Excel you can compute the quartiles using the QUARTILE function.
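The rank formulas above can be implemented directly. The sketch below (on a hypothetical sample) interpolates linearly between ranked values when a rank is fractional; this interpolation convention should agree with Excel's QUARTILE for these quartiles:

```python
def value_at_rank(sorted_vals, rank):
    """Value at a 1-based rank, interpolating for fractional ranks."""
    lo = int(rank)      # whole part of the rank
    frac = rank - lo    # fractional part
    if frac == 0:
        return sorted_vals[lo - 1]
    return sorted_vals[lo - 1] + frac * (sorted_vals[lo] - sorted_vals[lo - 1])

def quartiles(values):
    s = sorted(values)  # rank the values in ascending order
    n = len(s)
    lq = value_at_rank(s, 0.25 * n + 0.75)  # lower quartile
    med = value_at_rank(s, 0.5 * n + 0.5)   # median
    uq = value_at_rank(s, 0.75 * n + 0.25)  # upper quartile
    return lq, med, uq

data = [1, 3, 4, 7, 9, 12, 15]  # hypothetical sample, n = 7
lq, med, uq = quartiles(data)
print(lq, med, uq)  # 3.5 7 10.5
print(uq - lq)      # inter-quartile range: 7.0
```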

The range is simply the difference between the maximum and the minimum values. It is quite a crude measure because it depends entirely on the two most extreme values, which may be outliers. The inter-quartile range is much more useful because it focuses on the central 50% of the data rather than the extremes.

Replication

This is the simplest of the summary statistics but it is still important. The replication is simply how many items there are in your sample (that is, the number of observations).

The value n, the replication, is used in calculating other summary statistics, such as standard deviation and IQR, but it is also helpful in its own right. You should look at the dispersion and replication together. A certain value for dispersion might be considered “high” if n is small but quite “low” if n is very large.

The shape of the data affects the type of summary statistics that best summarize them. The “shape” refers to how the data values are distributed across the range of values in the sample. Generally you expect there to be a “cluster” of values around the average. It is important to know if the values are more or less symmetrically arranged around the average, or if there are more values to one side than the other.

There are two main ways to explore the shape (distribution) of a sample of data values:

  • Graphically – frequency histograms or tally plots draw a picture of the sample shape.
  • Shape statistics – such as skewness and kurtosis. These give values to how central the average is and how clustered around the average the data are.

The ultimate goal is to determine what kind of distribution your data forms. If you have normal distribution you have a wide range of options when it comes to data summary and subsequent analysis.

Types of data distribution

There are many “shapes” of data; the most commonly encountered is the normal (also called Gaussian) distribution.

In general, your aim is to work out if you have normal distribution or not. If you do have normal distribution you can use mean and standard deviation for summary. If you do not have normal distribution you need to use median and IQR instead.

The normal distribution (also called Gaussian) has well-explored characteristics and such data are usually described as parametric. If data are not parametric they can be described as skewed or non-parametric.

Drawing the distribution

There are two main ways to visualize the shape of your data:

  • Tally plots
  • Histograms

In both cases the idea is to make a frequency plot. The data values are split into frequency classes, usually called bins. You then determine how many data items are in each bin. There is little difference between a tally plot and a histogram; they show the same information but are constructed in slightly different ways.

A tally plot is a kind of frequency graph that you can sketch in a notebook. This makes it a very useful tool for times when you haven’t got a computer to hand.

To draw a tally plot follow these steps:

  • Determine the size classes (bins), you want around 7 bins.
  • Draw a vertical line (axis) and write the values for the bins to the left.
  • For each datum, determine which size class it fits into and add a tally mark to the right of the axis, opposite the appropriate bin.

You will now be able to assess the shape of the data sample you’ve got.

(Figure: a tally plot of a normally distributed sample.)

The tally plot in the preceding figure shows a normal (parametric) distribution. You can see that the shape is more or less symmetrical around the middle, so here the mean and standard deviation would be good summary values to represent the data. In the original dataset, the first bin, labelled 18, contains values up to 18; there are two in the dataset (17 and 16). The next bin is 21 and therefore contains items that are >18 but not greater than 21 (there are three: 21, 19 and 21).
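This binning rule can be sketched in Python. The first two bins below reproduce the counts just described (17 and 16 in the bin labelled 18; 21, 19, and 21 in the bin labelled 21); the remaining values are hypothetical, chosen to give a roughly symmetrical shape:

```python
# A text tally plot: each bin is labelled with its upper edge, and a
# value x belongs to bin `top` when top - 3 < x <= top (bin width 3).
data = [16, 17, 19, 21, 21, 22, 23, 23, 24, 24,
        25, 26, 27, 27, 28, 29, 30, 32, 33, 35]
bins = [18, 21, 24, 27, 30, 33, 36]  # upper edge of each size class

counts = {top: sum(1 for x in data if top - 3 < x <= top) for top in bins}
for top in bins:
    print(f"{top:>3} | " + "|" * counts[top])
```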

A second dataset, tallied using the same bins, is not normally distributed. The range of both samples is 16-36, but in the second sample the tallest size class is not in the middle and there is a long “tail” towards the higher values. For these data the median and inter-quartile range would be appropriate summary statistics.

A histogram is like a bar chart. The bars represent the frequency of values in the data sample that correspond to various size classes (bins). Generally the bars are drawn without gaps between them to highlight the fact that the x-axis represents a continuous variable. There is little difference between a tally plot and a histogram but the latter can be produced easily using a computer (you can sketch one in a notebook too).

To make a histogram you follow the same general procedure as for a tally plot but with subtle differences:

  • Determine the size classes.
  • Work out the frequency for each size class.
  • Draw a bar chart using the size classes as the x-axis and the frequencies on the y-axis.

You can draw a histogram by hand or use your spreadsheet. The following histograms were drawn using the same data as for the tally plots in the preceding section. The first histogram shows normally distributed data.

(Figure: a histogram of the normally distributed sample, and a second histogram showing the non-parametric distribution.)

In both these examples the bars are shown with a small gap; more properly the bars should be touching. The x-axis shows the size classes as a range under each bar. You can also show the maximum value for each size class. Ideally your histogram should have the labels at the divisions between size classes:

(Figure: a histogram with labels at the divisions between size classes.)

Note that this histogram uses slightly different size classes to the earlier ones.

Shape statistics

Visualizing the shape of your data samples is usually your main goal. However, it is possible to characterize the shape of a data distribution using shape statistics. There are two, which are used in conjunction with each other:

  • Skewness – a measure of how central the average is in the distribution.
  • Kurtosis – a measure of how pointy the distribution is (think of it as how clustered the values are around the middle).

If you are producing a numerical data summary these two values are useful statistics.

The skewness of a sample is a measure of how central the average is in relation to the overall spread of values. The formula to calculate skewness uses the number of items in the sample (the replication, n) and the standard deviation, s.

skewness = n / ((n − 1)(n − 2)) × ∑((x − x̄) / s)³

In practice you’ll use a computer to calculate skewness; Excel has a SKEW function that will compute it for you.

A positive value indicates a long “tail” of higher values to the right of the bulk of the data (the sample is skewed right). A negative value indicates the opposite. The larger the absolute value, the more skewed the sample is.

The kurtosis of a sample is a measure of how pointed the distribution is (see drawing the distribution). It is also a way to think about how clustered the values are around the middle. The formula to calculate kurtosis uses the number of items in the sample (the replication, n) and the standard deviation, s.

kurtosis = n(n + 1) / ((n − 1)(n − 2)(n − 3)) × ∑((x − x̄) / s)⁴ − 3(n − 1)² / ((n − 2)(n − 3))

In practice you’ll use a computer to calculate kurtosis; Excel has a KURT function that will compute it for you.

A positive result indicates a pointed distribution, which will probably also have a low dispersion. A negative result indicates a flat distribution, which will probably have high dispersion. The larger the absolute value, the more extreme the pointedness or flatness of the distribution.
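Both shape statistics can be computed in Python. The sketch below implements the adjusted estimators used by Excel's SKEW and KURT (an assumption here is that the formulas pictured in the original article are these same estimators); the samples are hypothetical:

```python
from statistics import mean, stdev

def skewness(values):
    """Sample skewness, as computed by Excel's SKEW."""
    n, m, s = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def kurtosis(values):
    """Excess kurtosis, as computed by Excel's KURT (needs n > 3)."""
    n, m, s = len(values), mean(values), stdev(values)
    term = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * term
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

right_skewed = [1, 2, 2, 3, 3, 4, 5, 9, 15]
print(skewness(right_skewed) > 0)   # True: long tail of high values

flat = [1, 2, 3, 4, 5]              # uniform, so flat rather than pointed
print(kurtosis(flat) < 0)           # True: negative (platykurtic)
```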

You should always summarize a sample of data values to make them more easily understood (by you and others). At the very least you need to show:

  • Middle value – centrality, that is, an average.
  • Dispersion – how spread out the data are around the average.
  • Replication – how large the sample is.

The shape of the data (its distribution) is also important because the shape determines which summary statistics are most appropriate to describe the sample. Your data may be normally distributed (i.e. with a symmetrical, bell-shaped curve) and so parametric, or they may be skewed and therefore non-parametric.

You can explore and describe the shape of data using graphs:

  • Tally plots – a simple frequency plot.
  • Histograms – a frequency plot like a bar chart.

You can also use shape statistics:

  • Skewness – how central the average is.
  • Kurtosis – how pointed the distribution is.

The shape of the data also leads you towards the most appropriate ways of analyzing the data, that is, which statistical tests you can use.


What Is Summary Statistics: Definition and Examples 


Introduction to Summary Statistics  

What are summary statistics? A statistics summary gives information about the data in a sample. It can help you understand the values better. It may include the total number of values, the minimum value, and the maximum value, along with the mean value and the standard deviation of a data collection. With this, you can understand the trends, outliers, and distribution of values in a data set. This is especially useful when dealing with large amounts of data, as it can help in analyzing the data better. This information can be utilized to steer the rest of the analysis and derive more information about a data set. These are values that are calculated based on the sample data and do not go beyond the data on hand.

What are Summary Statistics?  

By definition, the summary statistics sum up the features of a data sample. They describe the values and provide related measurements. These work as a basis for understanding the values recorded during a study.   

Descriptive statistics can show where the mean of a set of values lies. It can also help to understand if the data is skewed. Descriptive or summary statistics include:  

  • Description of the sample size (usually denoted by N)   
  • Description of the center of the data or values (Mean value)  
  • Description of how the values are spread  
  • Plotted graphs and charts that help understand the distribution of values.   

A Few Examples of Summary Statistics   

What is the meaning of summary statistics? It can be better understood with the help of the following illustrations:   

  • Calculation of mean value: Assume a data set with 5 numbers – 20, 30, 40, 50, and 60. The sum of all these numbers is 200. 200 divided by 5 would give the mean value, which is 40.  
  • Calculation of the Grade Point Average: Many universities use this score to evaluate students’ performance over the duration of their degree programs. The university records how much a student scores in various courses. More often than not, a course carries a certain number of credits, which is also a numerical value. Letter grades A, B, and C assigned to students correspond to point values such as 4.0, 3.0, and 2.0, respectively. The point values a student earns (during a semester, term, or year) are added up and divided by the total number of corresponding credits. The resultant value is the Grade Point Average, or GPA (for that semester, term, or year). Thus, the GPA pulls together several data points created across grades, courses, and examinations and then calculates the average. This average “summarizes” a student’s typical academic performance. While this numeric value helps track the student’s progress, it is also useful for comparing the student’s performance against program or university standards. It is important to note that the GPA is a straightforward calculation based on the data collected; it does not predict future performance or draw any further conclusion. Usually, summary statistics such as this are presented in the form of a chart or a graph.
  • As-Is Report in a Pie Chart: If a 500-member audience in a particular theater were to be asked if they liked a play (yes) or disliked it (no), their responses could be captured in a data set. Also, the summary of their replies, that is, the total number of ‘yes’ and ‘no’ responses, could be represented in a pie chart. This would be another example of summary statistics as it is an as-is report of the findings of the study and does not draw upon any other conclusions.  

Every summary statistics example quoted above focuses on one of the important aspects: the mean, the variability, or the data distribution.

Categories Of Summary Statistics  

The summary or descriptive statistics can be drilled down into different types, measures, or features. With a focus on averages, the description or summary can focus on any of three main categories: 1) the measure of the average value; 2) the frequency of each value; or 3) the spread of the values.

Summary Statistics: Measures of Location

Also referred to as central tendency, this summary describes a data set’s center or average. It is measured by the mean, the median, and the mode.

  Mean : This is the most common measure of the average value, usually represented by ‘M’. The mean is found by adding the values of the responses and then dividing this sum by the total number of responses (denoted by N). Consider a person who wants to find the average number of hours they worked per day in one week. The data set contains the hours clocked on each day of that week: 8, 10, 7, 9, 8, 6, and 4. The sum of these entries is 52, and the total number of responses is 7, so M = 52 / 7 ≈ 7.4 hours.

Median : This is the middle value of the data set. Arranging the work-hours values from the lowest to the highest (4, 6, 7, 8, 8, 9, 10), we get 8 as the median, with 3 values to its left and 3 values to its right.

Mode : This represents the most frequent value in a data set. A data set may have more than one mode, or no mode at all. The mode can be found by arranging the values in ascending order and then looking for values that repeat. In the work-hours example, arranging the values from the lowest (4) to the highest (10) shows that the value 8 appears twice, so 8 is the mode.
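These three measures can be checked with Python’s standard `statistics` module, using the work-hours data from the example above:

```python
from statistics import mean, median, mode

hours = [8, 10, 7, 9, 8, 6, 4]  # hours worked each day of the week

m = mean(hours)       # (8 + 10 + 7 + 9 + 8 + 6 + 4) / 7 = 52 / 7
print(round(m, 1))    # 7.4
print(median(hours))  # 8 (middle value of the sorted list)
print(mode(hours))    # 8 (the only value that repeats)
```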


Summary Statistics: Measures of Spread  

The measure of spread is also referred to as dispersion, variability, or frequency distribution. It helps us understand how the responses are spread out. Three common measures of spread are the range, the standard deviation (SD), and the variance. Let us examine each of these:

Range : This tells us how far apart the highest and lowest values in a data set lie. It is found by subtracting the lowest value from the highest. In the earlier example of working hours, the highest entry was 10 and the lowest 4, so the range is 10 − 4 = 6.

Standard Deviation : This indicates the average amount of variability in the data set; it shows how far each value lies from the mean, M. The higher the SD, the greater the variability. There are several steps to arrive at the SD:

  1. Tabulate the values and compute the mean
  2. Subtract the mean from each score to find each deviation
  3. Square each of the deviations
  4. Find the sum of the squared values from Step 3
  5. Divide the sum from Step 4 by (N − 1), where N represents the total number of responses
  6. Take the square root of the result from Step 5
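The steps above can be sketched in plain Python, again using the work-hours data (note this is the sample standard deviation, which divides by N − 1):

```python
import math

hours = [8, 10, 7, 9, 8, 6, 4]

n = len(hours)
m = sum(hours) / n                      # Step 1: the mean
deviations = [x - m for x in hours]     # Step 2: deviation of each score
squares = [d ** 2 for d in deviations]  # Step 3: square each deviation
total = sum(squares)                    # Step 4: sum of the squares
variance = total / (n - 1)              # Step 5: divide by N - 1
sd = math.sqrt(variance)                # Step 6: square root
print(round(sd, 2))                     # 1.99
```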

Summary Statistics: Graphs and Charts  

Values of a data set and related observations can be represented graphically in many ways. Common graph and chart types include histograms, bar charts, box plots, frequency distribution tables, scatter plots, and pie charts. Each comes with its own benefits and can be chosen based on how well it represents the data and how easily a reader can grasp the summary it conveys.

Applications Of Summary Statistics  

The applications are wide-ranging, spanning fields and professions from academics to finance and investments to government organizations. Economists, for instance, may be interested in data on consumer spending, inflation, changes in GDP, and more.

Analysts in the finance domain may be interested in companies and industries, market information such as volumes and prices, consumer sentiment regarding a product or service, and many other variables.

Conclusion  

Due to their focus on the collected data, descriptive and summary statistics may seem limited at first glance. However, they help an analyst quantify the data set on hand and chalk out its basic characteristics. Because they describe only the data already collected, they involve no uncertainty, and they work well for organizing and simplifying large amounts of data. The descriptions thus obtained set the stage for further data analysis.

According to the US Bureau of Labor Statistics , job prospects in Data Science and related fields will continue to grow over the coming decade (2021 to 2031). With a 36% job outlook, it is considered a field with much faster growth than many others.

With more organizations making data-driven decisions, the prospects of a role in statistics and data analytics have never looked brighter. According to a 2022 glassdoor.com report, a Data Analyst can expect a salary of INR 6 lakhs per annum, and a Data Scientist up to INR 11 lakhs per annum. To equip yourself with the skills needed for an organization’s data analytics role, explore UNext Jigsaw’s highly recommended Integrated Program in Business Analytics . It blends key management skills with real-world scenarios related to Data Science.


Summary statistics

Summary statistics helps us summarize statistical information.

Let's consider an example to understand this better.

A school conducted a blood donation camp.

The blood groups of 30 students were recorded as follows.

[Image: the recorded blood groups of the 30 students]

We can represent this data in a tabular form.

[Table: the blood-group data in tabular form]

This table is known as a frequency distribution table.

You can observe that all the collected data is organized under two columns.

This makes it easy for us to understand the given information.

Thus, summary statistics condenses the data to a simpler form so that it is easy for us to observe its features at a glance.
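A frequency distribution table like the one described above can be sketched in Python with `collections.Counter`. The blood-group values below are illustrative, since the lesson’s actual records appear only in an image:

```python
from collections import Counter

# Hypothetical blood-group records for 30 students (the original data
# is shown in the lesson's image, so these values are illustrative).
blood_groups = ["A"] * 9 + ["B"] * 6 + ["O"] * 12 + ["AB"] * 3

freq = Counter(blood_groups)  # counts each blood group's frequency

print("Blood group | Number of students")
for group, count in freq.items():
    print(f"{group:>11} | {count}")
```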

We will learn more about summary statistics as we scroll down. Try your hand at solving some interactive questions at the end.

Lesson Plan

Summary Statistics

Let us first understand the meaning of summary statistics.

Definition of Summary Statistics:  Summary statistics is a part of descriptive statistics that summarizes and provides the gist of the information about the sample data.

Summary statistics deals with summarizing statistical information.

This indicates that we can efficiently use summary statistics to quickly get the gist of the information.

Statistics generally deals with the presentation of information quantitatively or visually.

"Summary statistics" is a part of descriptive statistics.

Descriptive statistics deals with the collection, organization, summaries, and presentation of data.

What Is a Summary Statistics Table?

Big data related to population, economy, stock prices, and unemployment needs to be summarized systematically to interpret it correctly.

It is usually done using a summary statistics table.

The summary table is a visual representation that summarizes statistical information about the data in a tabular form.

Summary Statistics Table

Here are a few summary statistics about a certain country:

  • The population of the country now stands at 1,351,800.
  • 60% of people describe their health as very good or excellent.
  • 20,800 have immigrated into the country while 21,500 people emigrated out of the country.
  • The per capita gross annual pay now stands at $21,000.
  • There were 105,023 recorded crimes.
  • Unemployment is at 2.8%.

How Do You Explain Summary Statistics?

Summary statistics is a part of descriptive statistics that summarizes and provides the gist of information about the sample data.

Statisticians commonly try to describe and characterize the observations by finding:

  • a measure of location, or central tendency, such as the arithmetic mean
  • a measure of statistical dispersion, such as the standard deviation or the mean absolute deviation
  • a measure of the shape of the distribution like skewness
  • if more than one variable is measured, a measure of statistical dependence such as a correlation coefficient

How Do You Analyze Summary Statistics?

In a class, the collection of scores obtained by 30 students is the description of data collected.

To find the mean of the data, we will need to find the average marks of 30 students.

If the average marks obtained by 30 students is 75 out of 100, then we can derive a conclusion or give judgment about the performance of the students on the basis of this result.

Important Notes to Remember

1. Summary statistics helps us get the gist of the information instantly.

2. Statisticians describe the observations using the following measures.

  • Measure of location, or central tendency: arithmetic mean
  • Measure of statistical dispersion: standard deviation or mean absolute deviation
  • Measure of the shape of the distribution: skewness
  • Measure of statistical dependence: correlation coefficient

Measures of Location

The arithmetic mean, median, mode, and interquartile mean are the common measures of location or central tendency.

Measures of Spread

Standard deviation, range, variance, absolute deviation, interquartile range, distance standard deviation, etc. are the common measures of spread/dispersion.

The coefficient of variation (CV) is a statistical measure of the relative spread of data points around the mean.
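As a quick sketch, the CV is simply the standard deviation divided by the mean, often expressed as a percentage (the data set here is illustrative):

```python
from statistics import mean, stdev

data = [8, 10, 7, 9, 8, 6, 4]  # an illustrative data set

# CV = SD / mean, expressed as a percentage of the mean
cv = stdev(data) / mean(data) * 100
print(round(cv, 1))  # 26.8
```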

Graphs / charts

Some of the graphs and charts frequently used in the statistical representation of the data are given below.

  • Scatter plot
  • Frequency distribution graph

Solved Examples on Summary Statistics

The mean monthly salary of 10 workers of a group is $1445.

One more worker whose monthly salary is $1500 has joined the group.

Find the mean monthly salary of 11 workers of the group.

Here, \(n=10, \bar{x}=1445\)

Using the formula,

\begin{align} \bar{x}&=\dfrac{\sum x_i}{n} \\ \therefore \sum x_i&= \bar{x} \times n = 1445 \times 10 = 14450\\ \text{Sum of 10 workers' salaries} &= \$14450\\ \text{Sum of 11 workers' salaries} &= 14450+1500 = \$15950\\ \text{New mean salary} &=\dfrac{15950}{11} = \$1450\end{align}
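The same calculation in Python, to check the arithmetic:

```python
n = 10
mean_salary = 1445

total = mean_salary * n     # sum of the 10 salaries: 14450
total += 1500               # add the new worker's salary: 15950
new_mean = total / (n + 1)  # mean salary of all 11 workers
print(new_mean)             # 1450.0
```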

The pie chart shows the favorite subjects of students in a class.

Using the information given in the pie chart, determine the percentage of students who chose English.

[Figure: pie chart of subject preferences]

We know that \(144^\circ +36^\circ+72^\circ+108^\circ= 360^\circ\)

The percentage of students who chose English \[\begin{align}&=\dfrac{72}{360}\times 100\\&=20\end{align}\]

On World Environment Day, 100 schools decided to plant 100 tree saplings in their gardens.


The following data shows the number of plants that survived in each school after one month.

Using this data, can you find the number of schools that were able to retain 50% of the plants or more?

We need to represent this large amount of data in such a way that a reader can understand it easily.

To include all the observations in groups, we will create various groups of equal intervals.

These intervals are called  class intervals .

[Table: surviving plants per school, grouped into class intervals]

From this table, it is clear that 50% or more of the plants survived in 8 + 18 + 10 + 23 + 12 = 71 schools.

Challenge your math skills

Which of the following data sets has the second-largest arithmetic mean?  

A = {First five whole numbers}  B = {First five natural numbers} C = {First five even numbers} D = {First five odd numbers}

A scientist is studying reaction times. She believes the 5% of scores farthest from the mean are errors. The mean reaction time is 7 and the standard deviation is 0.5. Which highest and lowest reaction times should be eliminated?

Interactive Questions on Summary Statistics

Here are a few activities for you to practice. Select/Type your answer and click the "Check Answer" button to see the result.

Let's Summarize

This mini-lesson targeted the fascinating concept of summary statistics. The math journey around summary statistics starts with what a student already knows, and goes on to creatively craft a fresh concept in young minds. It is done in a way that is not only relatable and easy to grasp, but will also stay with them forever. Here lies the magic with Cuemath.

About Cuemath

At  Cuemath , our team of math experts is dedicated to making learning fun for our favorite readers, the students!

Through an interactive and engaging learning-teaching-learning approach, the teachers explore all angles of a topic.

Be it problems, online classes, doubt sessions, or any other form of relation, it’s the logical thinking and smart learning approach that we, at Cuemath, believe in.

FAQs on Summary Statistics

What is included in summary statistics?

Summary statistics summarize and provide information about the collected data. It characterizes the values in your data set. It tells us where the average lies and whether the data is skewed.

What is the most common summary statistic?

The mean and the median are most commonly used in statistical analysis. 

What is a summary statistic table?

The summary table is a visual representation that summarizes statistical information about data in a tabular form.

What does the five-number summary tell you?

A five-number summary is useful in descriptive analyses or during the initial interpretation of a large data set. It consists of five values: the maximum and minimum values, the lower and upper quartiles, and the median.
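A five-number summary can be sketched with NumPy (the data set below is illustrative; NumPy’s default linear interpolation is used for the quartiles):

```python
import numpy as np

data = np.array([12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25])

# The five values: minimum, lower quartile, median, upper quartile, maximum
summary = {
    "minimum": np.min(data),
    "lower quartile": np.percentile(data, 25),
    "median": np.median(data),
    "upper quartile": np.percentile(data, 75),
    "maximum": np.max(data),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```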

What is the purpose of the summary table?

Summary tables are a visual representation of the data making it easier to understand.

What is a summary in math?

Summary in math is a quick and simple description of the data.

How do you describe statistics?

Summary statistics help us to condense the data in a simpler form so that it is easy for us to observe and describe its features at a glance.

What are the types of statistics?

Types of statistics are:

  • Descriptive statistics
  • Inferential statistics

Summary Statistics

The information that gives a quick and simple description of the data.

Can include mean, median, mode, minimum value, maximum value, range, standard deviation, etc.

Introduction to Data Science

Chapter 12 Summary Statistics

We start by describing a simple yet powerful data analysis technique: constructing data summaries. Although the approach does not require mathematical models or probability, the motivation for summaries we construct will later help us understand both these topics.

Numerical data is often summarized with the average value. For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation . For example, you might read a report stating that scores were 680 plus or minus 50 (50 is the standard deviation). The report has summarized an entire vector of scores with just two numbers. Is this appropriate? Is there any important piece of information that we are missing by only looking at this summary rather than the entire list? Here we answer these questions and motivate several useful summary statistics, including the average and standard deviation.

12.1 Variable types

We will be working with two types of variables: categorical and numeric. Each can be divided into two other groups: categorical can be ordinal or not, whereas numerical variables can be discrete or continuous.

When each entry in a vector comes from one of a small number of groups, we refer to the data as categorical data . Two simple examples are sex (male or female) and US regions (Northeast, South, North Central, West). Some categorical data can be ordered even if they are not numbers, such as spiciness (mild, medium, hot). In statistics textbooks, ordered categorical data are referred to as ordinal data.

Examples of numerical data are population sizes, murder rates, and heights. Some numerical data can be treated as ordered categorical. We can further divide numerical data into continuous and discrete. Continuous variables are those that can take any value, such as heights, if measured with enough precision. For example, a pair of twins may be 68.12 and 68.11 inches, respectively. Counts, such as number of gun murders per year, are discrete because they have to be round numbers.

Keep in mind that discrete numeric data can be considered ordinal. Although this is technically true, we usually reserve the term ordinal data for variables belonging to a small number of different groups, with each group having many members. In contrast, when we have many groups with few cases in each group, we typically refer to them as discrete numerical variables. So, for example, the number of packs of cigarettes a person smokes a day, rounded to the closest pack, would be considered ordinal, while the actual number of cigarettes would be considered a numerical variable. But, indeed, there are examples that can be considered both numerical and ordinal.

12.2 Distributions

The most basic statistical summary of a list of objects or numbers is its distribution . The simplest way to think of a distribution is as a compact description of a list with many entries. This concept should not be new for readers of this book. For example, with categorical data, the distribution simply describes the proportion of each unique category. Here is an example with US state regions:

When the data is numerical, the task of constructing a summary based on the distribution is more challenging. We introduce an artificial, yet illustrative, motivating problem that will help us introduce the concepts needed to understand distributions.

12.2.1 Case study: describing student heights

Pretend that we have to describe the heights of our classmates to ET, an extraterrestrial that has never seen humans. As a first step, we need to collect data. To do this, we ask students to report their heights in inches. We ask them to provide sex information because we know there are two different distributions by sex. We collect the data and save it in the heights data frame:

One way to convey the heights to ET is to simply send him this list of 1050 heights. But there are much more effective ways to convey this information, and understanding the concept of a distribution will help. To simplify the explanation, we first focus on male heights. We examine the female height data in Section 12.7.1 .

It turns out that, in some cases, the average and the standard deviation are pretty much all we need to understand the data. We will learn data visualization techniques that will help us determine when this two number summary is appropriate. These same techniques will serve as an alternative for when two numbers are not enough.

12.2.2 Empirical cumulative distribution functions

Numerical data that are not categorical also have distributions. In general, when data is not categorical, reporting the frequency of each entry is not an effective summary since most entries are unique. In our case study, while several students reported a height of 68 inches, only one student reported a height of 68.503937007874 inches and only one student reported a height of 68.8976377952756 inches. We assume that they converted from 174 and 175 centimeters, respectively.

Statistics textbooks teach us that a more useful way to define a distribution for numeric data is to define a function that reports the proportion of the data entries \(x\) that are below \(a\) , for all possible values of \(a\) . This function is called the empirical cumulative distribution function (eCDF) and often denoted with \(F\) :

\[ F(a) = \mbox{Proportion of data points that are less than or equal to }a\]

Here is a plot of \(F\) for the male height data:

[Figure: eCDF of the male height data]

Similar to what the frequency table does for categorical data, the eCDF defines the distribution for numerical data. From the plot, we can see that 16% of the values are below 66, since \(F(66)=\) 0.164, or that 84% of the values are below 72, since \(F(72)=\) 0.841, and so on. In fact, we can report the proportion of values between any two heights, say \(a\) and \(b\) , by computing \(F(b) - F(a)\) . This means that if we send this plot above to ET, he will have all the information needed to reconstruct the entire list. Paraphrasing the expression “a picture is worth a thousand words”, in this case, a picture is as informative as 812 numbers.

Note: the reason we add the word empirical is because, as we will see in 13.10.1 , the cumulative distribution function (CDF) can be defined mathematically, meaning without any data.

12.2.3 Histograms

Although the eCDF concept is widely discussed in statistics textbooks, the summary plot is actually not very popular in practice. The main reason is that it does not easily convey characteristics of interest such as: at what value is the distribution centered? Is the distribution symmetric? What ranges contain 95% of the values? Histograms are much preferred because they greatly facilitate answering such questions. Histograms sacrifice just a bit of information to produce summaries that are much easier to interpret.

The simplest way to make a histogram is to divide the span of our data into non-overlapping bins of the same size. Then, for each bin, we count the number of values that fall in that interval. The histogram plots these counts as bars with the base of the bar defined by the intervals. Here is the histogram for the height data splitting the range of values into one inch intervals: \((49.5, 50.5],(50.5, 51.5],(51.5,52.5],(52.5,53.5],...,(82.5,83.5]\)
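The binning just described can be sketched with NumPy’s `histogram`, here on a small illustrative sample rather than the book’s data (note NumPy’s bins are closed on the left, whereas the book’s intervals are closed on the right; with non-boundary values the counts agree):

```python
import numpy as np

heights = [63.2, 64.8, 65.1, 66.9, 67.0, 67.4, 68.1, 68.5,
           69.0, 69.3, 70.2, 70.8, 71.5, 72.9, 74.1]  # illustrative

# One-inch bins: 62.5 to 63.5, 63.5 to 64.5, ..., 73.5 to 74.5
edges = np.arange(62.5, 75.5, 1.0)
counts, _ = np.histogram(heights, bins=edges)

for left, count in zip(edges[:-1], counts):
    print(f"({left}, {left + 1}]: {count}")
```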

[Figure: histogram of male heights with one-inch bins]

As you can see in the figure above, a histogram is similar to a barplot, but it differs in that the x-axis is numerical, not categorical.

If we send this plot to ET, he will immediately learn some important properties about our data. First, the range of the data is from 50 to 84 with the majority (more than 95%) between 63 and 75 inches. Second, the heights are close to symmetric around 69 inches. Also, by adding up counts, ET could obtain a very good approximation of the proportion of the data in any interval. Therefore, the histogram above is not only easy to interpret, but also provides almost all the information contained in the raw list of 812 heights with about 30 bin counts.

What information do we lose? Note that all values in each interval are treated the same when computing bin heights. So, for example, the histogram does not distinguish between 64, 64.1, and 64.2 inches. Given that these differences are almost unnoticeable to the eye, the practical implications are negligible and we were able to summarize the data to just 23 numbers.

12.2.4 Smoothed density

Smooth density plots are similar to histograms, but the data is not divided into bins. Here is what a smooth density plot looks like for our heights data:

[Figure: smooth density plot of the male heights]

In this plot, we no longer have sharp edges at the interval boundaries and many of the local peaks have been removed. Also, the scale of the y-axis changed from counts to density .

To understand the smooth densities, we have to understand estimates , a topic we don’t cover until later. However, we provide a heuristic explanation to help you understand the basics.

The main new concept you must understand is that we assume that our list of observed values is a subset of a much larger list of unobserved values. In the case of heights, you can imagine that our list of 812 male students comes from a hypothetical list containing all the heights of all the male students in all the world measured very precisely. Let’s say there are 1,000,000 of these measurements. This list of values has a distribution, like any list of values, and this larger distribution is really what we want to report to ET since it is much more general. Unfortunately, we don’t get to see it.

However, we make an assumption that helps us perhaps approximate it. If we had 1,000,000 values, measured very precisely, we could make a histogram with very, very small bins. The assumption is that if we show this, the height of consecutive bins will be similar. This is what we mean by smooth: we don’t have big jumps in the heights of consecutive bins. Below we have a hypothetical histogram with bins of size 1:

[Figure: hypothetical histogram with bins of size 1]

The smaller we make the bins, the smoother the histogram gets. Here are the histograms with bin width of 1, 0.5, and 0.1:

[Figure: histograms with bin widths 1, 0.5, and 0.1]

The smooth density is basically the curve that goes through the top of the histogram bars when the bins are very, very small. To make the curve not depend on the hypothetical size of the hypothetical list, we compute the curve on frequencies rather than counts:

[Figure: smooth density computed on frequencies rather than counts]

Now, back to reality. We don’t have millions of measurements. Instead, we have 812 and we can’t make a histogram with very small bins.

We therefore make a histogram, using bin sizes appropriate for our data and computing frequencies rather than counts, and we draw a smooth curve that goes through the tops of the histogram bars. The following plots demonstrate the steps that lead to a smooth density:

[Figure: the steps leading from a histogram to a smooth density]

However, remember that smooth is a relative term. We can actually control the smoothness of the curve that defines the smooth density through an option in the function that computes the smooth density curve. Here are two examples using different degrees of smoothness on the same histogram:

[Figure: two smooth densities of the same data with different degrees of smoothness]

We need to make this choice with care as the resulting summary can change our interpretation of the data. We should select a degree of smoothness that we can defend as being representative of the underlying data. In the case of height, we really do have reason to believe that the proportion of people with similar heights should be the same. For example, the proportion that is 72 inches should be more similar to the proportion that is 71 than to the proportion that is 78 or 65. This implies that the curve should be pretty smooth; that is, the curve should look more like the example on the right than on the left.

While the histogram is an assumption-free summary, the smoothed density is based on some assumptions.

Note that interpreting the y-axis of a smooth density plot is not straightforward. It is scaled so that the area under the density curve adds up to 1. If you imagine we form a bin with a base 1 unit in length, the y-axis value tells us the proportion of values in that bin. However, this is only true for bins of size 1. For other size intervals, the best way to determine the proportion of data in that interval is by computing the proportion of the total area contained in that interval. For example, here are the proportion of values between 65 and 68:

[Figure: area under the density curve between 65 and 68 inches]

The proportion of this area is about 0.3, meaning that about 30% of male heights are between 65 and 68 inches.
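The “proportion equals area” idea can be sketched with SciPy’s kernel density estimator. The heights below are simulated rather than the book’s data, so the exact proportion will differ slightly from 0.3:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
heights = rng.normal(69, 3, size=812)  # simulated male heights (illustrative)

kde = gaussian_kde(heights)            # smooth density estimate

# Proportion of heights between 65 and 68 = area under the curve there
proportion = kde.integrate_box_1d(65, 68)
print(round(proportion, 2))
```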

By understanding this, we are ready to use the smooth density as a summary. For this dataset, we would feel quite comfortable with the smoothness assumption, and therefore with sharing this aesthetically pleasing figure with ET, which he could use to understand our male heights data:

[Figure: smooth density of the male height data]

12.3 Exercises

1. In the murders dataset, the region is a categorical variable and the following is its distribution:

[Figure: bar plot of the distribution of US state regions]

To the closest 5%, what proportion of the states are in the North Central region?

2. Which of the following is true:

  • The graph above is a histogram.
  • The graph above shows only four numbers with a bar plot.
  • Categories are not numbers, so it does not make sense to graph the distribution.
  • The colors, not the height of the bars, describe the distribution.

3. The plot below shows the eCDF for male heights:

[Figure: eCDF of male heights]

Based on the plot, what percentage of males are shorter than 75 inches?

4. To the closest inch, what height m has the property that 1/2 of the male students are taller than m and 1/2 are shorter?

5. Here is an eCDF of the murder rates across states:

[Figure: eCDF of murder rates across states]

Knowing that there are 51 states (counting DC) and based on this plot, how many states have murder rates larger than 10 per 100,000 people?

6. Based on the eCDF above, which of the following statements are true:

  • About half the states have murder rates above 7 per 100,000 and the other half below.
  • Most states have murder rates below 2 per 100,000.
  • All the states have murder rates above 2 per 100,000.
  • With the exception of 4 states, the murder rates are below 5 per 100,000.

7. Below is a histogram of male heights in our heights dataset:

[Figure: histogram of male heights]

Based on this plot, how many males are between 63.5 and 65.5?

8. About what percentage are shorter than 60 inches?

9. Based on the density plot below, about what proportion of US states have populations larger than 10 million?

[Figure: density plot of US state populations]

10. Below are three density plots. Is it possible that they are from the same dataset?

[Figure: three density plots of the same data]

Which of the following statements is true:

  • It is impossible that they are from the same dataset.
  • They are from the same dataset, but the plots are different due to code errors.
  • They are the same dataset, but the first and second plot undersmooth and the third oversmooths.
  • They are the same dataset, but the first is not in the log scale, the second undersmooths, and the third oversmooths.

12.4 The normal distribution

Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two-number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.

The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. A reason for this is that approximately normal distributions occur in many situations, including gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. There are explanations for this, but we describe these later. Here we focus on how the normal distribution helps us summarize data.

Rather than using data, the normal distribution is defined with a mathematical formula. For any interval \((a,b)\) , the proportion of values in that interval can be computed using this formula:

\[\mbox{Pr}(a < x \leq b) = \int_a^b \frac{1}{\sqrt{2\pi}s} e^{-\frac{1}{2}\left( \frac{x-m}{s} \right)^2} \, dx\]

You don’t need to memorize or understand the details of the formula. But note that it is completely defined by just two parameters: \(m\) and \(s\) . The rest of the symbols in the formula represent the interval ends, \(a\) and \(b\) , and known mathematical constants \(\pi\) and \(e\) . These two parameters, \(m\) and \(s\) , are referred to as the average (also called the mean ) and the standard deviation (SD) of the distribution, respectively.
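Since the formula describes a density, we can check numerically that the area under it matches what R's pnorm function reports. This is a minimal sketch using base R; dnorm implements the density above, with its mean and sd arguments playing the roles of \(m\) and \(s\):

```r
# Area under the standard normal density over (a, b], computed two ways
a <- -1; b <- 1
by_integration <- integrate(dnorm, lower = a, upper = b)$value
by_pnorm <- pnorm(b) - pnorm(a)
round(c(by_integration, by_pnorm), 6)  # both about 0.682689
```

The agreement illustrates that pnorm is simply this integral evaluated for us.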

The distribution is symmetric, centered at the average, and most values (about 95%) are within 2 SDs from the average. Here is what the normal distribution looks like when the average is 0 and the SD is 1:

[Figure: the normal distribution with average 0 and SD 1]

The fact that the distribution is defined by just two parameters implies that if a dataset is approximated by a normal distribution, all the information needed to describe the distribution can be encoded in just two numbers: the average and the standard deviation. We now define these values for an arbitrary list of numbers.

For a list of \(n\) numbers contained in a vector x , the average is defined as:

\[m = \frac{1}{n} \sum_{i=1}^{n} x_i\]

and the SD is defined as:

\[s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - m)^2}\]

which can be interpreted as the average distance between values and their average.

Let’s compute these values for the male heights, which we will store in the object x :

The pre-built functions mean and sd can be used here:
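As a sketch, assuming the dslabs package (which provides the heights dataset used in this chapter) is installed:

```r
library(dslabs)
data(heights)

# Male heights in inches
x <- heights$height[heights$sex == "Male"]

mu <- mean(x)  # the average m
s <- sd(x)     # the standard deviation s
c(average = mu, sd = s)  # approximately 69.3 and 3.6
```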

Advanced note: for reasons explained in statistics textbooks, sd divides by length(x)-1 rather than length(x) . When length(x) is large, sd(x) and sqrt(sum((x-mu)^2) / length(x)) are practically equal.

Here the normal distribution with average 69.3 and SD 3.6 is plotted as a black line over the smooth density of our student heights in blue:

[Figure: normal distribution (black) overlaid on the smooth density of student heights (blue)]

The normal distribution does appear to be quite a good approximation here. We now will see how well this approximation works at predicting the proportion of values within intervals.

12.4.1 Standard units

For data that is approximately normally distributed, it is convenient to think in terms of standard units . The standard unit of a value tells us how many standard deviations away from the average it is. Specifically, for a value x from a vector X , we define the value of x in standard units as z = (x - m)/s with m and s the average and standard deviation of X , respectively. Why is this convenient?

First look back at the formula for the normal distribution and note that what is being exponentiated is \(-z^2/2\) with \(z\) equivalent to \(x\) in standard units. Because the maximum of \(e^{-z^2/2}\) is when \(z=0\) , this explains why the maximum of the distribution occurs at the average. It also explains the symmetry since \(- z^2/2\) is symmetric around 0. Second, note that if we convert the normally distributed data to standard units, we can quickly know if, for example, a person is about average ( \(z=0\) ), one of the largest ( \(z \approx 2\) ), one of the smallest ( \(z \approx -2\) ), or an extremely rare occurrence ( \(z > 3\) or \(z < -3\) ). Remember that it does not matter what the original units are, these rules apply to any data that is approximately normal.

In R, we can obtain standard units using the function scale :

Now to see how many men are within 2 SDs from the average, we simply type:

The proportion is about 95%, which is what the normal distribution predicts! To further confirm that, in fact, the approximation is a good one, we can use quantile-quantile plots.
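The two computations just described can be sketched as follows (again assuming dslabs is installed; scale is base R):

```r
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]

# Convert to standard units: z = (x - mean) / sd
z <- scale(x)

# Proportion of men within 2 SDs of the average
mean(abs(z) < 2)  # about 0.95
```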

12.4.2 Quantile-quantile plots

A systematic way to assess how well the normal distribution fits the data is to check if the observed and predicted proportions match. In general, this is the approach of the quantile-quantile plot (QQ-plot).

First let’s define the theoretical quantiles for the normal distribution. In statistics books we use the symbol \(\Phi(x)\) to define the function that gives us the proportion of a standard normal distributed data that are smaller than \(x\) . So, for example, \(\Phi(-1.96) = 0.025\) and \(\Phi(1.96) = 0.975\) . In R, we can evaluate \(\Phi\) using the pnorm function:

The inverse function \(\Phi^{-1}(x)\) gives us the theoretical quantiles for the normal distribution. So, for example, \(\Phi^{-1}(0.975) = 1.96\) . In R, we can evaluate the inverse of \(\Phi\) using the qnorm function.

Note that these calculations are for the standard normal distribution by default (mean = 0, standard deviation = 1), but we can also perform them for any normal distribution using the mean and sd arguments of the pnorm and qnorm functions. For example, we can use qnorm to determine quantiles of a distribution with a specific average and standard deviation.
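For instance (the average of 69 and SD of 3 below are illustrative values, not computed from the data):

```r
pnorm(-1.96)  # Phi(-1.96), about 0.025
qnorm(0.975)  # Phi^(-1)(0.975), about 1.96

# Quantiles for a normal distribution with a specific average and SD
qnorm(0.975, mean = 69, sd = 3)  # about 74.9
```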

For the normal distribution, all the calculations related to quantiles are done without data, thus the name theoretical quantiles . But quantiles can be defined for any distribution, including an empirical one. So if we have data in a vector \(x\) , we can define the quantile associated with any proportion \(p\) as the \(q\) for which the proportion of values below \(q\) is \(p\) . Using R code, we can define q as the value for which mean(x <= q) = p . Notice that not all \(p\) have a \(q\) for which the proportion is exactly \(p\) . There are several ways of defining the best \(q\) as discussed in the help for the quantile function.

To give a quick example, for the male heights data, we have that:

So about 50% of the values are shorter than or equal to 69.5 inches. This implies that if \(p=0.50\) then \(q=69.5\) .

The idea of a QQ-plot is that if your data is well approximated by a normal distribution, then the quantiles of your data should be similar to the quantiles of a normal distribution. To construct a QQ-plot, we do the following:

  • Define a vector of \(m\) proportions \(p_1, p_2, \dots, p_m\) .
  • Define a vector of quantiles \(q_1, \dots, q_m\) for your data for the proportions \(p_1, \dots, p_m\) . We refer to these as the sample quantiles .
  • Define a vector of theoretical quantiles for the proportions \(p_1, \dots, p_m\) for a normal distribution with the same average and standard deviation as the data.
  • Plot the sample quantiles versus the theoretical quantiles.

Let’s construct a QQ-plot using R code. Start by defining the vector of proportions.

To obtain the quantiles from the data, we can use the quantile function like this:

To obtain the theoretical normal distribution quantiles with the corresponding average and SD, we use the qnorm function:

To see if they match or not, we plot them against each other and draw the identity line:
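Putting the steps together, here is a sketch of the construction in base R (assuming dslabs for the data; the 0.05 grid of proportions is one choice among many):

```r
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]

p <- seq(0.05, 0.95, 0.05)            # vector of proportions
sample_quantiles <- quantile(x, p)    # quantiles from the data
theoretical_quantiles <- qnorm(p, mean = mean(x), sd = sd(x))

plot(theoretical_quantiles, sample_quantiles)
abline(0, 1)  # identity line
```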

[Figure: sample quantiles plotted against theoretical quantiles with the identity line]

Notice that this code becomes much cleaner if we use standard units:

The above code is included to help describe QQ-plots. However, in practice it is easier to use ggplot2 code:
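One way to write this is sketched below (assuming the dplyr and ggplot2 packages; geom_qq compares against the standard normal by default, so we first convert to standard units):

```r
library(dslabs)
library(dplyr)
library(ggplot2)
data(heights)

heights |>
  filter(sex == "Male") |>
  ggplot(aes(sample = scale(height))) +
  geom_qq() +
  geom_abline()
```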

While for the illustration above we used 20 quantiles, the default from the geom_qq function is to use as many quantiles as data points.

Note that although here we used QQ-plots to compare an observed distribution to the mathematically defined normal distribution, QQ-plots can be used to compare any two distributions.

12.5 Percentiles

Before we move on, let’s define some terms that are commonly used in exploratory data analysis.

Percentiles are special cases of quantiles that are commonly used. The percentiles are the quantiles you obtain when setting the \(p\) at \(0.01, 0.02, ..., 0.99\) . We call, for example, the case of \(p=0.25\) the 25th percentile, which gives us a number for which 25% of the data is below. The most famous percentile is the 50th, also known as the median .

For the normal distribution the median and average are the same, but this is generally not the case.

Another special case that receives a name is the quartiles , which are obtained when setting \(p=0.25, 0.50\) , and \(0.75\) .

12.6 Boxplots

To introduce boxplots we will use a dataset of US murders by state. Suppose we want to summarize the murder rate distribution. Using the techniques we have learned, we can quickly see that the normal approximation does not apply here:

[Figure: histogram of US murder rates]

In this case, the histogram above or a smooth density plot would serve as a relatively succinct summary.

Now suppose those used to receiving just two numbers as summaries ask us for a more compact numerical summary.

The boxplot provides a five-number summary composed of the range along with the quartiles (the 25th, 50th, and 75th percentiles). The boxplot often ignores outliers when computing the range and instead plots them as independent points; we provide a detailed explanation of outliers later. These five numbers are plotted as a “box” with “whiskers” like this:

[Figure: boxplot of US murder rates]

with the box defined by the 25th and 75th percentiles and the whiskers showing the range. The distance between these two percentiles is called the interquartile range. The two points beyond the whiskers are considered outliers by the default R function we used. The median is shown with a horizontal line. Today, we call these boxplots .

From just this simple plot, we know that the median is about 2.5, that the distribution is not symmetric, and that the range is 0 to 5 for the great majority of states with two exceptions.

12.7 Stratification

In data analysis we often divide observations into groups based on the values of one or more variables associated with those observations. For example in the next section we divide the height values into groups based on a sex variable: females and males. We call this procedure stratification and refer to the resulting groups as strata .

Stratification is common in data visualization because we are often interested in how the distribution of variables differs across different subgroups. We will see several examples throughout this part of the book.

12.7.1 Case study: describing student heights (continued)

Using the histogram, density plots, and QQ-plots, we have become convinced that the male height data is well approximated with a normal distribution. In this case, we report back to ET a very succinct summary: male heights follow a normal distribution with an average of 69.3 inches and a SD of 3.6 inches. With this information, ET will have a good idea of what to expect when he meets our male students. However, to provide a complete picture we need to also provide a summary of the female heights.

We learned that boxplots are useful when we want to quickly compare two or more distributions. Here are the heights for men and women:

[Figure: boxplots of heights for males and females]

The plot immediately reveals that males are, on average, taller than females. The standard deviations appear to be similar. But does the normal approximation also work for the female height data collected by the survey? We expect that they will follow a normal distribution, just like males. However, exploratory plots reveal that the approximation is not as useful:

[Figure: smooth density and QQ-plot of female heights]

We see something we did not see for the males: the density plot has a second bump . Also, the QQ-plot shows that the highest points tend to be taller than expected by the normal distribution. Finally, we also see five points in the QQ-plot that suggest shorter than expected heights for a normal distribution. When reporting back to ET, we might need to provide a histogram rather than just the average and standard deviation for the female heights.

We have noticed what we didn’t expect to see. If we look at other female height distributions, we do find that they are well approximated with a normal distribution. So why are our female students different? Is our class a requirement for the female basketball team? Are small proportions of females claiming to be taller than they are? Another, perhaps more likely, explanation is that in the form students used to enter their heights, FEMALE was the default sex and some males entered their heights, but forgot to change the sex variable. In any case, data visualization has helped discover a potential flaw in our data.

Regarding the five smallest values, note that these values are:

Because these are reported heights, a possibility is that the student meant to enter 5'1" , 5'2" , 5'3" or 5'5" .

12.8 Exercises

1. Define variables containing the heights of males and females like this:

How many measurements do we have for each?

2. Suppose we can’t make a plot and want to compare the distributions side by side. We can’t just list all the numbers. Instead, we will look at the percentiles. Create a five row table showing female_percentiles and male_percentiles with the 10th, 30th, 50th, 70th, & 90th percentiles for each sex. Then create a data frame with these two as columns.

3. Study the following boxplots showing population sizes by country:

[Figure: boxplots of country population sizes by continent]

Which continent has the country with the biggest population size?

4. What continent has the largest median population size?

5. What is median population size for Africa to the nearest million?

6. What proportion of countries in Europe have populations below 14 million?

7. If we use a log transformation, which continent shown above has the largest interquartile range?

8. Load the height data set and create a vector x with just the male heights:

What proportion of the data is between 69 and 72 inches (taller than 69, but shorter or equal to 72)? Hint: use a logical operator and mean .

9. Suppose all you know about the data is the average and the standard deviation. Use the normal approximation to estimate the proportion you just calculated. Hint: start by computing the average and standard deviation. Then use the pnorm function to predict the proportions.

10. Notice that the approximation calculated in question nine is very close to the exact calculation in the first question. Now perform the same task for more extreme values. Compare the exact calculation and the normal approximation for the interval (79,81]. How many times bigger is the actual proportion than the approximation?

11. Approximate the distribution of adult men in the world as normally distributed with an average of 69 inches and a standard deviation of 3 inches. Using this approximation, estimate the proportion of adult men that are 7 feet tall or taller, referred to as seven footers . Hint: use the pnorm function.

12. There are about 1 billion men between the ages of 18 and 40 in the world. Use your answer to the previous question to estimate how many of these men (18-40 year olds) are seven feet tall or taller in the world?

13. There are about 10 National Basketball Association (NBA) players that are 7 feet tall or higher. Using the answer to the previous two questions, what proportion of the world’s 18-to-40-year-old seven footers are in the NBA?

14. Repeat the calculations performed in the previous question for Lebron James’ height: 6 feet 8 inches. There are about 150 players that are at least that tall.

15. In answering the previous questions, we found that it is not at all rare for a seven footer to become an NBA player. What would be a fair critique of our calculations:

  • Practice and talent are what make a great basketball player, not height.
  • The normal approximation is not appropriate for heights.
  • As seen in question 10, the normal approximation tends to underestimate the extreme values. It’s possible that there are more seven footers than we predicted.
  • As seen in question 10, the normal approximation tends to overestimate the extreme values. It’s possible that there are fewer seven footers than we predicted.

12.9 Robust summaries

12.9.1 Outliers

We previously described how boxplots show outliers , but we did not provide a precise definition. Here we discuss outliers, approaches that can help detect them, and summaries that take into account their presence.

Outliers are very common in real-world data analysis. Data recording can be complex and it is common to observe data points generated in error. For example, an old monitoring device may read out nonsensical measurements before completely failing. Human error is also a source of outliers, in particular when data entry is done manually. An individual, for instance, may mistakenly enter their height in centimeters instead of inches or put the decimal in the wrong place.

How do we distinguish an outlier from measurements that were too big or too small simply due to expected variability? This is not always an easy question to answer, but we try to provide some guidance. Let’s begin with a simple case.

Suppose a colleague is charged with collecting demography data for a group of males. The data report height in feet and are stored in the object:

Our colleague uses the fact that heights are usually well approximated by a normal distribution and summarizes the data with the average and standard deviation:

and writes a report on the interesting fact that this group of males is much taller than usual. The average height is over six feet tall! Using your data analysis skills, however, you notice something else that is unexpected: the standard deviation is over 7 feet. Adding and subtracting two standard deviations, you note that 95% of this population will have heights between -9.489 and 21.697 feet, which does not make sense. A quick plot reveals the problem:

[Figure: boxplot of the colleague’s height data]

There appears to be at least one value that is nonsensical, since we know that a height of 180 feet is impossible. The boxplot detects this point as an outlier.

12.9.2 Median

When we have an outlier like this, the average can become very large. Mathematically, we can make the average as large as we want by simply changing one number: with 500 data points, we can increase the average by any amount \(\Delta\) by adding \(\Delta \times\) 500 to a single number. The median, defined as the value for which half the values are smaller and the other half are bigger, is robust to such outliers. No matter how large we make the largest point, the median remains the same.

With this data the median is:

which is about 5 feet and 9 inches.

The median is what boxplots display as a horizontal line.
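The contrast can be sketched with a small hypothetical vector (the 180 mimics the nonsensical entry discussed above):

```r
# Hypothetical heights in feet; 180 is a data-entry error
x <- c(5.5, 5.7, 5.8, 5.9, 6.0, 6.1, 180)

mean(x)    # about 30.7: dragged far above any plausible height
median(x)  # 5.9: unaffected by the outlier
```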

12.9.3 The interquartile range (IQR)

The box in boxplots is defined by the first and third quartiles. These provide an idea of the variability in the data: 50% of the data is within this range. The difference between the third and first quartiles (or 75th and 25th percentiles) is referred to as the interquartile range (IQR). As with the median, this quantity is robust to outliers because large values do not affect it. We can do some math to see that for normally distributed data, IQR / 1.349 approximates the standard deviation the data would have had without the outlier. We can see that this works well in our example, since we get a standard deviation estimate of:

which is about 3 inches.
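A sketch with simulated data illustrates the point: the true SD of 3 is known by construction, so we can see how well IQR / 1.349 recovers it despite a planted outlier.

```r
set.seed(1)
x <- rnorm(10000, mean = 69, sd = 3)  # simulated heights, true SD = 3
x[1] <- 180                           # plant one outlier

sd(x)           # inflated by the outlier
IQR(x) / 1.349  # robust estimate, close to 3
```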

12.9.4 Tukey’s definition of an outlier

In R, points falling outside the whiskers of the boxplot are referred to as outliers . This definition of outlier was introduced by John Tukey. The top whisker ends at the 75th percentile plus 1.5 \(\times\) IQR. Similarly the bottom whisker ends at the 25th percentile minus 1.5 \(\times\) IQR. If we define the first and third quartiles as \(Q_1\) and \(Q_3\) , respectively, then an outlier is anything outside the range:

\[[Q_1 - 1.5 \times (Q_3 - Q1), Q_3 + 1.5 \times (Q_3 - Q1)].\]

When the data is normally distributed, the standard units of these values are:

Using the pnorm function, we see that 99.3% of the data falls in this interval.
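The 99.3% figure can be derived in a few lines, working in standard units:

```r
q1 <- qnorm(0.25)       # first quartile in standard units, about -0.674
q3 <- qnorm(0.75)       # third quartile, about 0.674
iqr <- q3 - q1          # about 1.349
low  <- q1 - 1.5 * iqr  # bottom whisker end, about -2.698
high <- q3 + 1.5 * iqr  # top whisker end, about 2.698

pnorm(high) - pnorm(low)  # about 0.993
```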

Keep in mind that this is not such an extreme event: if we have 1,000 data points that are normally distributed, we expect to see about 7 outside of this range. But these would not be outliers since we expect to see them under the typical variation.

If we want an outlier to be rarer, we can increase the 1.5 to a larger number. Tukey also used 3 and called these far out outliers. With a normal distribution, essentially 100% of the data falls in this interval: the chance of falling outside it is about 2 in a million. In the geom_boxplot function, the 1.5 multiplier can be changed with the coef argument, which defaults to 1.5.

The measurement of 180 is well beyond the range of the height data:

If we take this value out, we can see that the data is in fact normally distributed as expected:

[Figure: QQ-plot of the height data with the outlier removed]

12.9.5 Median absolute deviation

Another way to robustly estimate the standard deviation in the presence of outliers is to use the median absolute deviation (MAD). To compute the MAD, we first compute the median, and then for each value we compute the distance between that value and the median. The MAD is defined as the median of these distances. For technical reasons not discussed here, this quantity needs to be multiplied by 1.4826 to assure it approximates the actual standard deviation. The mad function already incorporates this correction. For the height data, we get a MAD of:
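As a sketch (mad is a base R function that includes the 1.4826 correction by default; dslabs is assumed for the data):

```r
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]

mad(x)  # robust estimate of the SD

# Equivalent to the definition:
1.4826 * median(abs(x - median(x)))
```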

12.10 Exercises

We are going to use the HistData package. If it is not installed you can install it like this:

Load the height data set and create a vector x with just the male heights used in Galton’s data on the heights of parents and their children from his historic research on heredity.

1. Compute the average and median of these data.

2. Compute the median and median absolute deviation of these data.

3. Now suppose Galton made a mistake when entering the first value and forgot to use the decimal point. You can imitate this error by typing:

How many inches does the average grow after this mistake?

4. How many inches does the SD grow after this mistake?

5. How many inches does the median grow after this mistake?

6. How many inches does the MAD grow after this mistake?

7. How could you use exploratory data analysis to detect that an error was made?

  • Since it is only one value out of many, we will not be able to detect this.
  • We would see an obvious shift in the distribution.
  • A boxplot, histogram, or qq-plot would reveal a clear outlier.
  • A scatterplot would show high levels of measurement error.

8. How much can the average accidentally grow with mistakes like this? Write a function called error_avg that takes a value k and returns the average of the vector x after the first entry changed to k . Show the results for k=10000 and k=-10000 .

12.10.1 Case study: self-reported student heights

The heights we have been looking at are not the original heights reported by students. The original reported heights are also included in the dslabs package and can be loaded like this:

Height is a character vector so we create a new column with the numeric version:

Note that we get a warning about NAs. This is because some of the self reported heights were not numbers. We can see why we get these:

Some students self-reported their heights using feet and inches rather than just inches. Others used centimeters and others were just trolling. For now we will remove these entries:
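One way to carry out these steps is sketched below (assuming dplyr; as.numeric produces NA, with a warning, for entries that are not plain numbers):

```r
library(dslabs)
library(dplyr)
data(reported_heights)

reported_heights <- reported_heights |>
  mutate(original = height,
         height = suppressWarnings(as.numeric(height))) |>
  filter(!is.na(height))  # drop entries that could not be parsed
```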

If we compute the average and standard deviation, we notice that we obtain strange results. The average and standard deviation are different from the median and MAD:

This suggests that we have outliers, which is confirmed by creating a boxplot:

[Figure: boxplot of self-reported student heights]

We can see some rather extreme values. To see what these values are, we can quickly look at the largest values using the arrange function:

The first seven entries look like strange errors. However, the next few look like they were entered in centimeters instead of inches. Since 184 cm is equivalent to about six feet, we suspect that 184 was actually meant to be 72 inches.

We can review all the nonsensical answers by looking at the data considered to be far out by Tukey:

Examining these heights carefully, we see two common mistakes: entries in centimeters, which turn out to be too large, and entries of the form x.y with x and y representing feet and inches, respectively, which turn out to be too small. Some of the even smaller values, such as 1.6, could be entries in meters.

An Introduction to Data Analysis

5 Summary statistics


A summary statistic is a single number that represents one aspect of a possibly much more complex chunk of data. This single number might, for example, indicate the maximum or minimum value of a vector of one billion observations. The large data set (one billion observations) is reduced to a single number which represents one aspect of that data. Summary statistics are, as a general (but violable) rule, many-to-one surjections. They compress complex information into a simpler representation.

Summary statistics are useful for understanding the data at hand, for communication about a data set, but also for subsequent statistical analyses. As we will see later on, many statistical tests look at a summary statistic \(x\) , which is a single value derived from data set \(D\) , and compare \(x\) to an expectation of what \(x\) should be like if the process that generated \(D\) really had a particular property. For the moment, however, we use summary statistics only to get comfortable with data: understanding it better and gaining competence to manipulate it.

Section 5.1 first uses the Bio-Logic Jazz-Metal data set to look at a very intuitive class of summary statistics for categorical data, namely counts and proportions. Section 5.2 introduces summary statistics for simple, one-dimensional vectors with numeric information. Section 5.3 looks at measures of the relation between two numerical vectors, namely covariance and correlation . These last two sections use the avocado data set .

The learning goals for this chapter are:

  • become able to compute counts and frequencies for categorical data
  • mean, mode, median
  • variance, standard deviation, quantiles
  • bootstrapped CI of the mean
  • Bravais-Pearson correlation


Descriptive Statistics | Definitions, Types, Examples

Published on July 9, 2020 by Pritha Bhandari . Revised on June 21, 2023.

Descriptive statistics summarize and organize characteristics of a data set. A data set is a collection of responses or observations from a sample or entire population.

In quantitative research , after collecting data, the first step of statistical analysis is to describe characteristics of the responses, such as the average of one variable (e.g., age), or the relation between two variables (e.g., age and creativity).

The next step is inferential statistics , which help you decide whether your data confirms or refutes your hypothesis and whether it is generalizable to a larger population.

Table of contents

  • Types of descriptive statistics
  • Frequency distribution
  • Measures of central tendency
  • Measures of variability
  • Univariate descriptive statistics
  • Bivariate descriptive statistics
  • Other interesting articles
  • Frequently asked questions about descriptive statistics

There are 3 main types of descriptive statistics:

  • The distribution concerns the frequency of each value.
  • The central tendency concerns the averages of the values.
  • The variability or dispersion concerns how spread out the values are.

Types of descriptive statistics

You can apply these to assess only one variable at a time, in univariate analysis, or to compare two or more, in bivariate and multivariate analysis.


A data set is made up of a distribution of values, or scores. In tables or graphs, you can summarize the frequency of every possible value of a variable in numbers or percentages. This is called a frequency distribution .

  • Simple frequency distribution table
  • Grouped frequency distribution table

From this table, you can see that more women than men or people with another gender identity took part in the study. In a grouped frequency distribution, you can group numerical response values and add up the number of responses for each group. You can also convert each of these numbers to percentages.

Measures of central tendency estimate the center, or average, of a data set. The mean, median and mode are 3 ways of finding the average.

Here we will demonstrate how to calculate the mean, median, and mode using the first 6 responses of our survey.

The mean , or M , is the most commonly used method for finding the average.

To find the mean, simply add up all response values and divide the sum by the total number of responses. The total number of responses or observations is called N .

The median is the value that’s exactly in the middle of a data set.

To find the median, order each response value from the smallest to the biggest. Then the median is the number in the middle. If there are two numbers in the middle, find their mean.

The mode is simply the most popular or most frequent response value. A data set can have no mode, one mode, or more than one mode.

To find the mode, order your data set from lowest to highest and find the response that occurs most frequently.
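These three measures can be computed in R with a small hypothetical set of responses; base R has no built-in mode function, so a short helper (here called stat_mode) is defined for illustration:

```r
x <- c(2, 3, 3, 5, 6, 11)  # hypothetical survey responses

mean(x)    # (2 + 3 + 3 + 5 + 6 + 11) / 6 = 5
median(x)  # middle of the sorted values: (3 + 5) / 2 = 4

# Most frequent value(s)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)  # 3
```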

Measures of variability give you a sense of how spread out the response values are. The range, standard deviation and variance each reflect different aspects of spread.

The range gives you an idea of how far apart the most extreme response scores are. To find the range , simply subtract the lowest value from the highest value.

Standard deviation

The standard deviation ( s or SD ) is the average amount of variability in your dataset. It tells you, on average, how far each score lies from the mean. The larger the standard deviation, the more variable the data set is.

There are six steps for finding the standard deviation:

  • List each score and find their mean.
  • Subtract the mean from each score to get the deviation from the mean.
  • Square each of these deviations.
  • Add up all of the squared deviations.
  • Divide the sum of the squared deviations by N – 1.
  • Find the square root of the number you found.

Step 5: 421.5/5 = 84.3

Step 6: √84.3 = 9.18
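The six steps can be traced in R with a hypothetical vector of scores (not the survey data used above):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical scores

m <- mean(x)                  # step 1: the mean, 5
deviations <- x - m           # step 2: deviations from the mean
squared <- deviations^2       # step 3: square each deviation
total <- sum(squared)         # step 4: sum of squared deviations, 32
v <- total / (length(x) - 1)  # step 5: divide by N - 1
s <- sqrt(v)                  # step 6: square root, about 2.14

c(s, sd(x))  # matches R's built-in sd
```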

The variance is the average of squared deviations from the mean, and it reflects the degree of spread in the data set: the more spread out the data, the larger the variance.

To find the variance, simply square the standard deviation. The symbol for variance is s 2 .



Univariate descriptive statistics focus on only one variable at a time. It’s important to examine data from each variable separately using multiple measures of distribution, central tendency and spread. Programs like SPSS and Excel can be used to easily calculate these.

If you were to only consider the mean as a measure of central tendency, your impression of the “middle” of the data set can be skewed by outliers, unlike the median or mode.

Likewise, while the range is sensitive to outliers , you should also consider the standard deviation and variance to get easily comparable measures of spread.

If you’ve collected data on more than one variable, you can use bivariate or multivariate descriptive statistics to explore whether there are relationships between them.

In bivariate analysis, you simultaneously study the frequency and variability of two variables to see if they vary together. You can also compare the central tendency of the two variables before performing further statistical tests .

Multivariate analysis is the same as bivariate analysis but with more than two variables.

Contingency table

In a contingency table, each cell represents the intersection of two variables. Usually, an independent variable (e.g., gender) appears along the vertical axis and a dependent one appears along the horizontal axis (e.g., activities). You read “across” the table to see how the independent and dependent variables relate to each other.

Interpreting a contingency table is easier when the raw data is converted to percentages. Percentages make each row comparable to the other by making it seem as if each group had only 100 observations or participants. When creating a percentage-based contingency table, you add the N for each independent variable on the end.
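Converting raw counts to row percentages can be sketched as follows. The counts and visit categories here are hypothetical, chosen only to mirror the library example:

```python
# Hypothetical raw counts: rows are an independent variable (age group),
# columns a dependent one (library visits per year).
counts = {
    "children": {"0-4": 20, "5-8": 35, "9+": 25},
    "adults":   {"0-4": 30, "5-8": 25, "9+": 25},
}

def to_row_percentages(table):
    # Convert each row to percentages so groups of different sizes become
    # directly comparable; keep each row's N at the end, as described above.
    result = {}
    for group, row in table.items():
        n = sum(row.values())
        result[group] = {col: round(100 * v / n, 1) for col, v in row.items()}
        result[group]["N"] = n
    return result

print(to_row_percentages(counts))
```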

From this table, it is more clear that similar proportions of children and adults go to the library over 17 times a year. Additionally, children most commonly went to the library between 5 and 8 times, while for adults, this number was between 13 and 16.

Scatter plots

A scatter plot is a chart that shows you the relationship between two or three variables . It’s a visual representation of the strength of a relationship.

In a scatter plot, you plot one variable along the x-axis and another one along the y-axis. Each data point is represented by a point in the chart.

From your scatter plot, you see that as the number of movies seen at movie theaters increases, the number of visits to the library decreases. Based on your visual assessment of a possible linear relationship, you perform further tests of correlation and regression.

Descriptive statistics: Scatter plot


Descriptive statistics summarize the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.

The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.

  • Distribution refers to the frequencies of different responses.
  • Measures of central tendency give you the average for each response.
  • Measures of variability show you the spread or dispersion of your dataset.
  • Univariate statistics summarize only one variable  at a time.
  • Bivariate statistics compare two variables .
  • Multivariate statistics compare more than two variables .


Bhandari, P. (2023, June 21). Descriptive Statistics | Definitions, Types, Examples. Scribbr. Retrieved February 19, 2024, from https://www.scribbr.com/statistics/descriptive-statistics/


  • Math Article

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. In other words, it is a mathematical discipline for collecting and summarizing data, and a branch of applied mathematics. Two important and basic ideas are involved in statistics: uncertainty and variation. The uncertainty and variation in different fields can be determined only through statistical analysis, and these uncertainties are in turn quantified by probability, which plays an important role in statistics.

What is Statistics?

Statistics is simply defined as the study and manipulation of data. As discussed in the introduction, statistics deals with the analysis and computation of numerical data. Let us see more definitions of statistics given by different authors.

According to Merriam-Webster dictionary , statistics is defined as “classified facts representing the conditions of a people in a state – especially the facts that can be stated in numbers or any other tabular or classified arrangement”.

According to statistician Sir Arthur Lyon Bowley, statistics is defined as “Numerical statements of facts in any department of inquiry placed in relation to each other”.


Statistics Examples

Some of the real-life examples of statistics are:

  • To find the mean of the marks obtained by each student in a class of 50 students: the average value here is a statistic of the marks obtained.
  • Suppose you need to find how many people are employed in a city. Since the city has a population of 15 lakh (1.5 million), we take a survey of 1,000 people (a sample). The employment figure computed from this sample is a statistic.

Basics of Statistics

The basics of statistics include the measure of central tendency and  the measure of dispersion. The central tendencies are  mean, median and mode  and dispersions comprise variance and standard deviation. 

Mean is the average of the observations. Median is the central value when observations are arranged in order. The mode determines the most frequent observations in a data set.

Variation is the measure of spread out of the collection of data. Standard deviation is the measure of the dispersion of data from the mean. The square of standard deviation is equal to the variance.
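These relationships can be checked numerically. A minimal sketch with hypothetical observations, confirming that the standard deviation squared equals the variance:

```python
import math

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]  # hypothetical observations

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / len(data)  # population variance
sd = math.sqrt(variance)  # standard deviation: dispersion from the mean

# The square of the standard deviation equals the variance.
assert math.isclose(sd ** 2, variance)
print(mean, variance, sd)
```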

Mathematical Statistics

Mathematical statistics is the application of Mathematics to Statistics. Statistics was initially conceived as the science of the state: the collection and analysis of facts about a country, such as its economy, military, and population.

Mathematical techniques used for different analytics include mathematical analysis, linear algebra, stochastic analysis, differential equation and measure-theoretic probability theory.

Types of Statistics

Basically, there are two types of statistics.

Descriptive Statistics

Inferential Statistics

In the case of descriptive statistics, the data or collection of data is described in summary, whereas inferential statistics is used to draw conclusions and generalizations from the described data. Both types are used on a large scale.

The data is summarised and explained in descriptive statistics. The summarization is done from a population sample using measures such as the mean and standard deviation. Descriptive statistics is a way of organising, representing, and explaining a set of data using charts, graphs, and summary measures. Histograms, pie charts, bar charts, and scatter plots are common ways to summarise data and present it in tables or graphs. Descriptive statistics are just that: descriptive; they do not generalise beyond the data at hand.

We attempt to interpret the meaning of descriptive statistics using inferential statistics. We utilise inferential statistics to convey the meaning of the collected data after it has been collected, evaluated, and summarised. The probability principle is used in inferential statistics to determine if patterns found in a study sample may be extrapolated to the wider population from which the sample was drawn. Inferential statistics are used to test hypotheses and study correlations between variables, and they can also be used to predict population sizes. Inferential statistics are used to derive conclusions and inferences from samples, i.e. to create accurate generalisations.

Statistics Formulas

The formulas that are commonly used in statistical analysis are given in the table below.

Summary Statistics

In statistics, summary statistics are a part of descriptive statistics (one of the two types of statistics), giving a list of information about sample data. We know that statistics deals with the presentation of data visually and quantitatively; summary statistics thus deal with summarizing that statistical information, condensing the data into a simpler form so that an observer can understand the information at a glance. Generally, statisticians try to describe the observations by finding:

  • A measure of central tendency or location, such as the arithmetic mean.
  • A measure of distribution shape, like skewness or kurtosis.
  • A measure of dispersion, such as the standard deviation or mean absolute deviation.
  • A measure of statistical dependence, such as the correlation coefficient.

Summary Statistics Table

The summary statistics table is the visual representation of summarized statistical information about the data in tabular form.

For example, the blood groups of 20 students in a class are O, A, B, AB, B, B, AB, O, A, B, B, AB, AB, O, O, B, A, AB, B, A.

Thus, the summary statistics table shows that 4 students in the class have blood group O, 4 have A, 7 have B, and 5 have AB. Summary statistics tables are widely used to condense big data on population, unemployment, and the economy so that accurate conclusions can be drawn at a glance.
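The frequency counts for this example can be tallied directly with a counter:

```python
from collections import Counter

# Blood groups of the 20 students from the example above.
groups = ["O", "A", "B", "AB", "B", "B", "AB", "O", "A", "B",
          "B", "AB", "AB", "O", "O", "B", "A", "AB", "B", "A"]

table = Counter(groups)
for blood_group in ["O", "A", "B", "AB"]:
    print(blood_group, table[blood_group])
# O 4, A 4, B 7, AB 5 — matching the summary in the text
```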

Scope of Statistics

Statistics is used in many sectors such as psychology, geology, sociology, weather forecasting, probability and much more. The goal of statistics is to gain understanding from data; because it focuses on applications, it is distinctively considered a mathematical science.

Methods in Statistics

The methods involve collecting, summarizing, analyzing, and interpreting variable numerical data. Here some of the methods are provided below.

  • Data collection
  • Data summarization
  • Statistical analysis

What is Data in Statistics?

Data is a collection of facts, such as numbers, words, measurements, observations etc.

Types of Data

  • Qualitative data: descriptive, non-numerical information. Example: she can run fast; he is thin.
  • Quantitative data: numerical information. Example: an octopus is an eight-legged creature.

Types of quantitative data

  • Discrete data- has a particular fixed value. It can be counted
  • Continuous data- is not fixed but has a range of data. It can be measured.

Representation of Data

There are different ways to represent data such as through graphs, charts or tables. The general representation of statistical data are:

  • Frequency Distribution

Measures of Central Tendency

In Mathematics, statistics is used to describe the central tendencies of grouped and ungrouped data. The three measures of central tendency are:

  • Mean
  • Median
  • Mode

All three measures of central tendency are used to find the central value of a set of data.

Measures of Dispersion

In statistics, the dispersion measures help interpret data variability, i.e. how homogeneous or heterogeneous the data is. In simple words, they indicate how squeezed or scattered a variable is. There are two types of dispersion measures: absolute and relative.

Skewness in Statistics

Skewness, in statistics, is a measure of the asymmetry in a probability distribution. It measures how much the distribution of a given set of data deviates from the normal distribution.

The value of skewed distribution could be positive or negative or zero. Usually, the bell curve of normal distribution has zero skewness.

ANOVA Statistics

ANOVA Stands for Analysis of Variance. It is a collection of statistical models, used to measure the mean difference for the given set of data.

Degrees of freedom

In statistical analysis, the degrees of freedom are the number of values that are free to vary. The independent pieces of information that can vary while estimating a parameter are the degrees of freedom of the data.

Applications of Statistics

Statistics have huge applications across various fields in Mathematics as well as in real life. Some of the applications of statistics are given below:

  • Applied statistics, theoretical statistics and mathematical statistics
  • Machine learning and data mining
  • Statistics in society
  • Statistical computing
  • Statistics applied to the mathematics of the arts


Hope this detailed discussion and formulas on statistics will help you to solve problems quickly and efficiently. Learn more Maths concepts at BYJU’S with the help of interactive videos.

Frequently Asked Questions on Statistics

What exactly is statistics?

Statistics is a branch that deals with the study of the collection, analysis, interpretation, organisation, and presentation of data. Mathematically, statistics is defined as the set of equations, which are used to analyse things.

What are the two types of statistics?

The two different types of statistics used for analyzing the data are:

  • Descriptive Statistics: It summarizes the data from the sample using indexes
  • Inferential Statistics: It concludes from the data which are subjected to the random variation

What is Summary Statistics?

How is statistics applicable in Maths?

Statistics is a part of Applied Mathematics that uses probability theory to generalize the collected sample data. It helps to characterize the likelihood where the generalizations of data are accurate. This is known as statistical inference.

What is the purpose of statistics?

What is the importance of statistics in real life?



Epidemiology and Biostatistics, pp. 49–52

Summary Measures in Statistics

  • Bryan Kestenbaum MD, MS 2  
  • First Online: 13 October 2018


Summary measures provide compact descriptions of one or more study variables. Summary measures include statistical properties, such as the mean and median of a distribution, and graphical presentations, such as histograms and box plots. For normally distributed data, 95% of the observations reside within two standard deviations of the mean value. The joint distribution of two continuous variables can be described graphically using scatter plots, or quantified using correlation, which indicates the tendency of larger values of one variable to match up with larger values of a second variable.



Author information

Authors and Affiliations

Division of Nephrology, Department of Medicine, University of Washington, Seattle, WA, USA

Bryan Kestenbaum MD, MS



Copyright information

© 2019 Springer Nature Switzerland AG


Kestenbaum, B. (2019). Summary Measures in Statistics. In: Epidemiology and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-319-97433-0_12




Statistics LibreTexts

1.1: Basic Definitions and Concepts


Learning Objectives

  • To learn the basic definitions used in statistics and some of its key concepts.

We begin with a simple example. There are millions of passenger automobiles in the United States. What is their average value? It is obviously impractical to attempt to solve this problem directly by assessing the value of every single car in the country, adding up all those values, and then dividing by the number of cars. In practice the best we can do is to estimate the average value. A natural way to do so would be to randomly select some of the cars, say \(200\) of them, ascertain the value of each of those cars, and find the average of those \(200\) values. The set of all those millions of vehicles is called the population of interest, and the number attached to each one, its value, is a measurement . The average value is a parameter : a number that describes a characteristic of the population, in this case monetary worth. The set of \(200\) cars selected from the population is called a sample , and the \(200\) numbers, the monetary values of the cars we selected, are the sample data . The average of the data is called a statistic : a number calculated from the sample data. This example illustrates the meaning of the following definitions.

Definitions: populations and samples

A population is any specific collection of objects of interest. A sample is any subset or subcollection of the population, including the case that the sample consists of the whole population, in which case it is termed a census.

Definitions: measurements and Sample Data

A measurement is a number or attribute computed for each member of a population or of a sample. The measurements of sample elements are collectively called the sample data .

Definition: parameters

A parameter is a number that summarizes some aspect of the population as a whole. A statistic is a number computed from the sample data.

Continuing with our example, if the average value of the cars in our sample was \($8,357\), then it seems reasonable to conclude that the average value of all cars is about \($8,357\). In reasoning this way we have drawn an inference about the population based on information obtained from the sample . In general, statistics is a study of data: describing properties of the data, which is called descriptive statistics , and drawing conclusions about a population of interest from information extracted from a sample, which is called inferential statistics . Computing the single number \($8,357\) to summarize the data was an operation of descriptive statistics; using it to make a statement about the population was an operation of inferential statistics.
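The sample-versus-population reasoning above can be sketched numerically. The values below are simulated (not actual car data), so the specific numbers are illustrative only:

```python
import random

random.seed(42)
# A simulated "population" of car values (hypothetical numbers).
population = [random.gauss(8357, 2500) for _ in range(100_000)]

sample = random.sample(population, 200)              # the sample
sample_mean = sum(sample) / len(sample)              # a statistic
population_mean = sum(population) / len(population)  # the parameter

# The statistic computed from just 200 cars approximates the parameter.
print(round(sample_mean), round(population_mean))
```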

Definition: Statistics

Statistics is a collection of methods for collecting, displaying, analyzing, and drawing conclusions from data.

Definition: Descriptive statistics

Descriptive statistics is the branch of statistics that involves organizing, displaying, and describing data.

Definition: Inferential statistics

Inferential statistics is the branch of statistics that involves drawing conclusions about a population based on information contained in a sample taken from that population.

Definition: Qualitative data

Qualitative data are measurements for which there is no natural numerical scale, but which consist of attributes, labels, or other non-numerical characteristics.

Definition: Quantitative data

Quantitative data are numerical measurements that arise from a natural numerical scale.

Qualitative data can generate numerical sample statistics. In the automobile example, for instance, we might be interested in the proportion of all cars that are less than six years old. In our same sample of \(200\) cars we could note for each car whether it is less than six years old or not, which is a qualitative measurement. If \(172\) cars in the sample are less than six years old, which is \(0.86\) or \(86\% \), then we would estimate the parameter of interest, the population proportion, to be about the same as the sample statistic, the sample proportion, that is, about \(0.86\).
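The arithmetic of the sample proportion in this example is straightforward:

```python
# Reproducing the example: 172 of the 200 sampled cars
# are less than six years old.
sample_size = 200
under_six = 172

p_hat = under_six / sample_size  # the sample proportion
print(p_hat)  # 0.86, i.e. 86%
```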

The relationship between a population of interest and a sample drawn from that population is perhaps the most important concept in statistics, since everything else rests on it. This relationship is illustrated graphically in Figure \(\PageIndex{1}\). The circles in the large box represent elements of the population. In the figure there was room for only a small number of them but in actual situations, like our automobile example, they could very well number in the millions. The solid black circles represent the elements of the population that are selected at random and that together form the sample. For each element of the sample there is a measurement of interest, denoted by a lower case \(x\) (which we have indexed as \(x_1 , \ldots, x_n\) to tell them apart); these measurements collectively form the sample data set. From the data we may calculate various statistics. To anticipate the notation that will be used later, we might compute the sample mean \(\bar{x}\) and the sample proportion \(\hat{p}\), and take them as approximations to the population mean \(\mu\) (this is the lower case Greek letter mu, the traditional symbol for this parameter) and the population proportion \(p\), respectively. The other symbols in the figure stand for other parameters and statistics that we will encounter.

(Figure \(\PageIndex{1}\): the population, a random sample drawn from it, and the resulting sample data.)

Key Takeaway

  • Statistics is a study of data: describing properties of data (descriptive statistics) and drawing conclusions about a population based on information in a sample (inferential statistics).
  • The distinction between a population together with its parameters and a sample together with its statistics is a fundamental concept in inferential statistics.
  • Information in a sample is used to make inferences about the population from which the sample was drawn.

Lecture Notes: Introduction to Data Science

21 Exploratory Data Analysis: Summary Statistics

Let’s continue our discussion of Exploratory Data Analysis. In the previous section we saw ways of visualizing attributes (variables) using plots to start understanding properties of how data is distributed, an essential and preliminary step in data analysis. In this section, we start discussing statistical, or numerical, summaries of data to quantify properties that we observed using visual summaries and representations.

Remember that one purpose of EDA is to spot problems in data (as part of data wrangling) and understand variable properties like:

  • central trends (mean)
  • spread (variance)
  • suggest possible modeling strategies (e.g., probability distributions)

We also want to use EDA to understand relationships between pairs of variables, e.g. their correlation or covariance.

One last note on EDA. John W. Tukey was an exceptional scientist/mathematician, who had profound impact on statistics and Computer Science. A lot of what we cover in EDA is based on his groundbreaking work. I highly recommend you read more about him: https://www.stat.berkeley.edu/~brill/Papers/life.pdf .

Part of our goal is to understand how variables are distributed in a given dataset. Note, again, that we are not using distributed in a formal mathematical (or probabilistic) sense. All statements we are making here are based on data at hand, so we could refer to this as the empirical distribution of data. Here, empirical is used in the sense that this is data resulting from an experiment.

Let’s use a dataset on diamond characteristics as an example.

(Here’s some help interpreting these variables: https://en.wikipedia.org/wiki/Diamond_(gemstone)#Gemological_characteristics ).

Let’s start using some notation to make talking about this a bit more efficient. We assume that we have data across \(n\) entities (or observational units) for \(p\) attributes. In this dataset \(n=53940\) and \(p=10\) . However, let’s consider a single attribute, and denote the data for that attribute (or variable) as \(x_1, x_2, \ldots, x_n\) .

Ok, so what’s the first question we want to ask about how data is distributed? Since we want to understand how data is distributed across a range , we should first define the range.

We use notation \(x_{(1)}\) and \(x_{(n)}\) to denote the minimum and maximum statistics. In general, we use notation \(x_{(q)}\) for the rank statistics, e.g., the \(q\) th smallest value in the data.

21.2 Central Tendency

Now that we know the range over which data is distributed, we can produce a first summary of how data is distributed across this range. Let’s start with the center of the data: the median is a statistic defined such that half of the data has a smaller value. We can use notation \(x_{(n/2)}\) (a rank statistic) to represent the median. Note that we can use an algorithm based on the quicksort partition scheme to compute the median in linear time (on average).
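The partition-based approach is the quickselect algorithm; a minimal sketch (recursive, with random pivots, on hypothetical data):

```python
import random

def quickselect(values, k):
    # Return the k-th smallest element (k is 0-based) using the
    # quicksort partition scheme; linear expected time.
    pivot = random.choice(values)
    below = [x for x in values if x < pivot]
    equal = [x for x in values if x == pivot]
    above = [x for x in values if x > pivot]
    if k < len(below):
        return quickselect(below, k)
    if k < len(below) + len(equal):
        return pivot
    return quickselect(above, k - len(below) - len(equal))

def median(values):
    # Odd n: the middle rank statistic; even n: average the two middle ones.
    n = len(values)
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2

print(median([7, 1, 5, 3, 9]))      # 5
print(median([7, 1, 5, 3, 9, 11]))  # 6.0
```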

21.2.1 Derivation of the mean as central tendency statistic

Of course, the best known statistic for central tendency is the mean , or average of the data: \(\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i\) . It turns out that we can be a bit more formal about what “center” means in this case. Let’s say that the center of a dataset is a point in the range of the data that is close to the data. To say that something is close we need a measure of distance .

So for two points \(x_1\) and \(x_2\) what should we use for distance? We could base it on \((x_1 - x_2)\) but that’s not enough since its sign depends on the order in which we write it. Using the absolute value solves that problem \(|x_1 - x_2|\) since now the sign doesn’t matter, but this has some issues that we will see later. So, next best thing we can do is use the square of the difference. So, in this case, the distance between data point \(x_1\) and \(x_2\) is \((x_1 - x_2)^2\) . Here is a fun question: what’s the largest distance between two points in our dataset?

So, to define the center , let’s build a criterion based on this distance by adding this distance across all points in our dataset:

\[ RSS(\mu) = \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 \]

Here RSS means residual sum of squares , and we use \(\mu\) to stand for candidate values of the center . We can plot RSS for different values of \(\mu\).

Now, what should our “center” estimate be? We want a value that is close to the data based on RSS! So we need to find the value in the range that minimizes RSS. From calculus, we know that a necessary condition for the minimizer \(\hat{\mu}\) of RSS is that the derivative of RSS is zero at that point. So, the strategy to minimize RSS is to compute its derivative, and find the value of \(\mu\) where it equals zero.

So, let’s find the derivative of RSS:

\[ \begin{eqnarray} \frac{\partial}{\partial \mu} \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 & = & \frac{1}{2} \sum_{i=1}^n \frac{\partial}{\partial \mu} (x_i - \mu)^2 \; \textrm{(sum rule)}\\ {} & = & \frac{1}{2} \sum_{i=1}^n 2(x_i - \mu) \times \frac{\partial}{\partial \mu} (x_i - \mu) \; \textrm{(power rule and chain rule)}\\ {} & = & \frac{1}{2} \sum_{i=1}^n 2(x_i - \mu) \times (-1) \; \textrm{(since } \tfrac{\partial}{\partial \mu} (x_i - \mu) = -1 \textrm{)}\\ {} & = & \sum_{i=1}^n (\mu - x_i) \; \textrm{(rearranging)}\\ {} & = & n\mu - \sum_{i=1}^n x_i \end{eqnarray} \]

Next, we set that equal to zero and find the value of \(\mu\) that solves that equation:

\[ \begin{eqnarray} \frac{\partial}{\partial \mu} RSS(\mu) & = & 0 & \Rightarrow \\ n\mu - \sum_{i=1}^n x_i & = & 0 & \Rightarrow \\ n\mu & = & \sum_{i=1}^n x_i & \Rightarrow \\ \mu & = & \frac{1}{n} \sum_{i=1}^n x_i & {} \end{eqnarray} \]

That’s the average we know and love! So the fact you should remember:

The mean is the value that minimizes RSS for a vector of attribute values

It is the value where the derivative of RSS is zero, and therefore the value that minimizes RSS, and it serves as an estimate of the central tendency of the dataset. Note that in this dataset the mean and median are not exactly equal, but they are very close.
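A quick numeric check of this fact, on a small hypothetical data vector: evaluating RSS over a grid of candidate centers, no candidate beats the mean.

```python
def rss(values, mu):
    # Residual sum of squares around a candidate center mu.
    return 0.5 * sum((x - mu) ** 2 for x in values)

data = [1.0, 2.0, 2.0, 3.0, 7.0]  # hypothetical attribute values
mean = sum(data) / len(data)       # 3.0

# Evaluate RSS on a grid of candidate centers around the mean.
candidates = [mean + d / 10 for d in range(-20, 21)]
best = min(candidates, key=lambda mu: rss(data, mu))
print(mean, best)  # both 3.0
```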

One last note, there is a similar argument to define the median as a measure of center . In this case, instead of using RSS we use a different criterion: the sum of absolute deviations

\[ SAD(m) = \sum_{i=1}^n |x_i - m|. \]

The median is the minimizer of this criterion.
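A small numeric illustration (hypothetical values): evaluating SAD at the median and at the mean shows the median attains the smaller value.

```python
def sad(values, m):
    # Sum of absolute deviations around candidate center m.
    return sum(abs(x - m) for x in values)

data = [1, 2, 2, 3, 7]  # hypothetical values; median 2, mean 3
med = sorted(data)[len(data) // 2]
mean = sum(data) / len(data)

print(sad(data, med), sad(data, mean))  # 7 vs 8.0: the median wins
```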

21.3 Spread

Now that we have a measure of center, we can now discuss how data is spread around that center.

21.3.1 Variance

For the mean, we have a convenient way of describing this: the average distance (using squared difference) from the mean. We call this the variance of the data:

\[ \mathrm{var}(x) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2 \]

You will also see it with a slightly different constant in the front for technical reasons that we may discuss later on:

\[ \mathrm{var}(x) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2 \]

Variance is a commonly used statistic for spread but it has the disadvantage that its units are not easy to conceptualize (e.g., squared diamond depth). A spread statistic that is in the same units as the data is the standard deviation , which is just the square root of the variance:

\[ \mathrm{sd}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \overline{x})^2} \]

We can also use standard deviations as an interpretable unit of how far a given data point is from the mean: the quantity \((x_i - \overline{x}) / \mathrm{sd}(x)\) measures how many standard deviations the point \(x_i\) lies from the mean.

As a rough guide, about 70% of the data falls within one standard deviation of the mean, about 95% within two standard deviations, and nearly all of it within three.

We will see later how these rough approximations are derived from a mathematical assumption about how data is distributed beyond the data we have at hand.
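The two variance conventions and the standard deviation can be sketched as follows (a Python illustration with made-up data; `pstdev` and `stdev` are the standard library's population and sample versions):

```python
# Variance (1/n and 1/(n-1) versions), standard deviation, and
# "standard deviations away from the mean" for a single point.
from math import sqrt
from statistics import mean, pstdev, stdev

x = [61.2, 61.8, 62.0, 62.4, 63.1, 64.0]  # illustrative depth-like values

n = len(x)
xbar = mean(x)

var_pop = sum((xi - xbar) ** 2 for xi in x) / n           # 1/n version
var_sample = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # 1/(n-1) version
sd_pop = sqrt(var_pop)

# The library implementations agree with the formulas above.
assert abs(sd_pop - pstdev(x)) < 1e-9
assert abs(sqrt(var_sample) - stdev(x)) < 1e-9

# How many standard deviations is x[0] from the mean?
z = (x[0] - xbar) / sd_pop
```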

21.3.2 Spread estimates using rank statistics

Just like we saw how the median is a rank statistic used to describe central tendency, we can also use rank statistics to describe spread. For this we use two more rank statistics, the first and third quartiles , \(x_{(n/4)}\) and \(x_{(3n/4)}\) respectively.

Note that the five order statistics we have seen so far (minimum, first quartile, median, third quartile, and maximum) are so frequently used that they are exactly what R reports by default as a summary of a numeric vector of data, along with the mean.

This five-number summary also provides all of the statistics used to construct a boxplot summarizing a data distribution. In particular, the inter-quartile range , defined as the difference between the third and first quartiles, \(\mathrm{IQR}(x) = x_{(3n/4)} - x_{(n/4)}\), gives a measure of spread. The interpretation here is that half the data lies within the IQR around the median.
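A sketch of the five-number summary and the IQR (a Python illustration; note that quartile conventions vary between implementations, so values can differ slightly from what R reports):

```python
# Five-number summary (min, Q1, median, Q3, max) and inter-quartile range,
# using a simple split-at-the-median quartile convention.
from statistics import median

x = [12, 15, 18, 19, 21, 22, 24, 25, 29, 37]  # illustrative data

xs = sorted(x)
q1 = median(xs[: len(xs) // 2])        # first quartile
q3 = median(xs[(len(xs) + 1) // 2 :])  # third quartile
five_num = (xs[0], q1, median(xs), q3, xs[-1])
iqr = q3 - q1
```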

21.4 Outliers

We can use estimates of spread to identify outlier values in a dataset. Given an estimate of spread based on the techniques we’ve just seen, we can identify values that are unusually far away from the center of the distribution.

One often-cited rule of thumb is based on standard deviation estimates. We can identify outliers as the set

\[ \mathrm{outliers_{sd}}(x) = \{x_j \, | \, |x_j - \overline{x}| > k \times \mathrm{sd}(x) \} \] where \(\overline{x}\) is the sample mean of the data and \(\mathrm{sd}(x)\) its standard deviation. The multiplier \(k\) determines whether we are identifying (in Tukey's nomenclature) outliers or points that are far out .

While this method works relatively well in practice, it presents a fundamental problem. Severe outliers can significantly affect spread estimates based on standard deviation. Specifically, spread estimates will be inflated in the presence of severe outliers. To circumvent this problem, we use rank-based estimates of spread to identify outliers as:

\[ \mathrm{outliers_{IQR}}(x) = \{x_j \, | \, x_j < x_{(n/4)} - k \times \mathrm{IQR}(x) \; \mathrm{ or } \; x_j > x_{(3n/4)} + k \times \mathrm{IQR}(x)\} \] This is usually referred to as the Tukey outlier rule , with the multiplier \(k\) serving the same role as before. We use the IQR here because it is less susceptible to inflation by severe outliers in the dataset. It also works better for skewed data than the method based on standard deviation.

The Tukey rule is applied in the same way.
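As an illustration of the two rules, here is a Python sketch on made-up data with one planted extreme value (the chapter's original R examples are not reproduced here):

```python
# Outlier detection: SD-based rule vs. Tukey's IQR-based rule.
from statistics import mean, median, pstdev

x = [12, 15, 18, 19, 21, 22, 24, 25, 29, 95]  # 95 is a planted outlier

def quartiles(xs):
    """First and third quartiles via a simple split-at-the-median convention."""
    s = sorted(xs)
    return median(s[: len(s) // 2]), median(s[(len(s) + 1) // 2 :])

def outliers_sd(xs, k=3):
    xbar, s = mean(xs), pstdev(xs)
    return [v for v in xs if abs(v - xbar) > k * s]

def outliers_iqr(xs, k=1.5):
    q1, q3 = quartiles(xs)
    iqr = q3 - q1
    return [v for v in xs if v < q1 - k * iqr or v > q3 + k * iqr]

# The planted outlier inflates the standard deviation so much that the
# SD rule (k = 3) misses it entirely, while the Tukey rule still flags it.
assert outliers_sd(x, k=3) == []
assert outliers_iqr(x) == [95]
```

This small example shows concretely why the IQR-based rule is preferred: the very point we want to flag inflates the standard deviation used to flag it.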

One last thought: although there are formal ways of defining this precisely, the five-number summary can be used to check whether data is skewed. How? Consider the difference between the first quartile and the median, and between the median and the third quartile.

If one of these differences is larger than the other, then that indicates that this dataset might be skewed, that is, that the range of data on one side of the median is longer (or shorter) than the range of data on the other side of the median. Do you think our diamond depth dataset is skewed?
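A sketch of that comparison (a Python illustration with made-up, right-skewed data):

```python
# Rough skewness check from the five-number summary: compare the distance
# from Q1 to the median with the distance from the median to Q3.
from statistics import median

x = [1, 2, 2, 3, 3, 4, 5, 8, 13, 21]  # right-skewed illustrative data

s = sorted(x)
q1 = median(s[: len(s) // 2])
q3 = median(s[(len(s) + 1) // 2 :])
med = median(s)

lower, upper = med - q1, q3 - med
# upper > lower suggests a longer right tail (right skew).
assert upper > lower
```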

21.6 Covariance and correlation

The scatter plot is a visual way of observing relationships between pairs of variables. Like descriptions of distributions of single variables, we would like to construct statistics that summarize the relationship between two variables quantitatively. To do this we will extend our notion of spread (variation of data around the mean) to the notion of co-variation : do pairs of variables vary around their means in the same way?

Consider now data for two variables over the same \(n\) entities: \((x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\). For example, for each diamond, we have carat and price as two variables.

We want to capture the relationship: does \(x_i\) vary in the same direction and scale away from its mean as \(y_i\)? This leads to the covariance :

\[ cov(x,y) = \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y}) \]

Think about what the covariance of \(x\) and \(y\) would be if \(x_i\) varied in the opposite direction from \(y_i\).

Just like variance, we have an issue with units and interpretation for covariance, so we introduce correlation (formally, Pearson’s correlation coefficient) to summarize this relationship in a unit-less way:

\[ cor(x,y) = \frac{cov(x,y)}{sd(x) sd(y)} \]

As before, we can also use rank statistics to define a measure of how two variables are associated. One of these, the Spearman correlation , is commonly used. It is defined as the Pearson correlation coefficient of the ranks (rather than the actual values) of the pairs of variables.
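The three statistics can be sketched as follows (a Python illustration; the carat/price values are made up, and ties are ignored in the rank computation):

```python
# Covariance, Pearson correlation, and Spearman (rank) correlation.
from statistics import mean, pstdev

x = [0.2, 0.4, 0.7, 1.0, 1.5]     # e.g., carat (illustrative)
y = [350, 900, 2200, 4500, 9000]  # e.g., price (illustrative)

def cov(a, b):
    abar, bbar = mean(a), mean(b)
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / len(a)

def pearson(a, b):
    return cov(a, b) / (pstdev(a) * pstdev(b))

def ranks(a):
    """Rank of each value, 1 = smallest (ties not handled, for simplicity)."""
    order = sorted(range(len(a)), key=lambda i: a[i])
    r = [0] * len(a)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    return pearson(ranks(a), ranks(b))
```

Because \(y\) increases monotonically with \(x\) here, the Spearman correlation is exactly 1 even though the relationship is not linear, while the Pearson correlation is slightly below 1.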

21.7 Postscript: Finding Maxima/Minima using Derivatives

The values at which a function attains its maximum value are called maxima ( maximum if unique) of the function. Similarly, the values at which a function attains its minimum value are called minima ( minimum if unique) of the function.

In a smoothly changing function, maxima or minima are found where the function flattens (its slope becomes \(0\)). The first derivative of the function tells us where the slope is \(0\). This is the first derivative test .

The derivative of the slope (the second derivative of the original function) can be used to determine whether the value found by the first derivative test is a maximum or a minimum. When a function's slope is zero at \(x\), and the second derivative at \(x\) is negative, the function has a local maximum there; if the second derivative is positive, it has a local minimum; if it is zero, the test is inconclusive.

This is called the second derivative test .

21.7.1 Steps to find Maxima/Minima of function \(f(x)\)

  • Find the value(s) at which \(f'(x)=0\) (first derivative test).
  • Evaluate the second derivative at each of the \(x\) values found in step 1 (second derivative test).
  • If the second derivative at \(x\) is less than 0, \(x\) is a local maximum; if greater than 0, a local minimum; if equal to 0, the test is inconclusive.
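As a quick worked example of both tests (using \(f(x) = x^3 - 3x\) as an illustration):

\[\begin{eqnarray*} f'(x) & = & 3x^2 - 3 = 0 \; \Rightarrow \; x = \pm 1 \\ f''(x) & = & 6x \end{eqnarray*}\]

Since \(f''(1) = 6 > 0\), the function has a local minimum at \(x = 1\); since \(f''(-1) = -6 < 0\), it has a local maximum at \(x = -1\).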

21.7.2 Notes on Finding Derivatives

The derivative of the sum of two functions is the sum of the derivatives of the two functions: \[\begin{eqnarray*} \frac{d}{dx}(f(x) + g(x)) = \frac{d}{dx}f(x) + \frac{d}{dx}g(x) \end{eqnarray*}\]

Similarly, the derivative of the difference of two functions is the difference of the derivatives of the two functions.

If we have a function f(x) of the form \(f(x)=x^{n}\) for any integer n, \[\begin{eqnarray*} \frac{d}{dx}(f(x)) = \frac{d}{dx}(x^{n}) = nx^{n-1} \end{eqnarray*}\]

If we have two functions of the form \(f(x)\) and \(g(x)\) , the chain rule can be stated as follows: \[\begin{eqnarray*} \frac{d}{dx}(f(g(x)) = f^{'}(g(x)) g^{'}(x) \end{eqnarray*}\]

Differentiate \(y=(3x+1)^{2}\) with respect to x. Applying the above equation, we have the following: \[\begin{eqnarray*} \frac{d}{dx}((3x+1)^{2}) = 2(3x+1)^{2-1} \frac{d}{dx}(3x+1) = 2(3x+1)(3) = 6(3x+1) \end{eqnarray*}\]

Product Rule

If we have two functions \(f(x)\) and \(g(x)\), the product rule states: \[\begin{eqnarray*} \frac{d}{dx}(f(x)g(x)) = f^{'}(x)g(x) + f(x)g^{'}(x) \end{eqnarray*}\]

Quotient Rule

If we have two functions \(f(x)\) and \(g(x)\) (with \(g(x)\neq 0\)), the quotient rule states: \[\begin{eqnarray*} \frac{d}{dx}\left(\frac{f(x)}{g(x)}\right) = \frac{f^{'}(x)g(x) - f(x)g^{'}(x)}{g(x)^{2}} \end{eqnarray*}\]

21.7.3 Resources:

A useful calculus cheat sheet: http://tutorial.math.lamar.edu/pdf/Calculus_Cheat_Sheet_Derivatives.pdf

Discussion on finding maxima/minima: http://www.math.psu.edu/tseng/class/Math140A/Notes-First_and_Second_Derivative_Tests.doc


Summary statistics


Summary statistics provide a quick summary of data and are particularly useful for comparing one project to another, or before and after.

There are two main types of summary statistics used in evaluation: measures of central tendency and measures of dispersion.

Measures of central tendency provide different versions of the average, including the mean, the median and the mode.

Measures of dispersion provide information about how much variation there is in the data, including the range and the standard deviation.

This tool provides step-by-step instructions, accompanied by screenshots, on how to calculate the mean and standard deviation with Excel.

Source: (Siegle)

This website offers a self-paced online module which examines alcohol and marijuana use case studies to produce descriptive statistics, including the mean and standard deviation.

  • Descriptive statistics lessons - Khan Academy



Statology

Statistics Made Easy

How to Calculate Summary Statistics in R Using dplyr

You can use functions from the dplyr package to calculate summary statistics for all numeric variables in a data frame in R.

The summarise() function comes from the dplyr package and is used to calculate summary statistics for variables.

The pivot_longer() function comes from the tidyr package and is used to format the output to make it easier to read.

This particular syntax calculates the following summary statistics for each numeric variable in a data frame:

  • Minimum value
  • Median value
  • Mean value
  • Standard deviation
  • 25th percentile
  • 75th percentile
  • Maximum value
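The original R snippet is not reproduced in this copy of the article. As a rough stand-in, the same per-column, long-format summary can be sketched in Python with only the standard library (the data and column names below are hypothetical, and the percentile convention used is one of several in common use):

```python
# Per-column summary statistics in "long" format, mirroring the
# summarise() + across() + pivot_longer() pipeline described above.
from statistics import mean, median, stdev

df = {  # hypothetical data frame
    "points": [12, 15, 19, 22, 25, 44],
    "assists": [3, 5, 5, 6, 7, 10],
}

def q(xs, p):
    """Linear-interpolation percentile (one of several common conventions)."""
    s = sorted(xs)
    k = (len(s) - 1) * p
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

summary = [
    {"variable": name, "min": min(col), "q25": q(col, 0.25),
     "median": median(col), "mean": mean(col), "sd": stdev(col),
     "q75": q(col, 0.75), "max": max(col)}
    for name, col in df.items()
]
```

Each row of `summary` corresponds to one numeric column, with one field per statistic, which is the shape the pivot_longer() step produces in the R version.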

The following example shows how to use this function in practice.

Example: Calculate Summary Statistics in R Using dplyr

Suppose we have the following data frame in R that contains information about various basketball players:

We can use the following syntax to calculate summary statistics for each numeric variable in the data frame:

  From the output we can see:

  • The minimum value in the points column is 12 .
  • The median value in the points column is 21.5 .
  • The mean value in the points column is 22.8 .

Note : In this example, we utilized the dplyr across() function. See the dplyr documentation for complete details on this function.

Additional Resources

The following tutorials explain how to perform other common functions using dplyr:

  • How to Summarise Data But Keep All Columns Using dplyr
  • How to Summarise Multiple Columns Using dplyr
  • How to Calculate Standard Deviation Using dplyr


Published by Zach



A Legal Showdown on the Border Between the U.S. and Texas: What to Know

A court in Austin heard oral arguments in the federal government’s bid to block Texas from imposing a wide-ranging new immigration law.


By J. David Goodman

Reporting from Austin

The Biden administration is suing the State of Texas over a new state law that would empower state and local police officers to arrest migrants who cross from Mexico without authorization.

On Thursday, a federal court in Austin heard three hours of arguments over whether to halt the implementation of the law, which is set to go into effect on March 5.

The case has far-reaching implications for the future of immigration law and border enforcement and has been closely watched across the country. It comes amid fierce political fighting between the parties — and within them — over how to handle illegal immigration, and follows the impeachment by House Republicans of the secretary of homeland security and the failure of a bipartisan Senate deal to bolster security at the border.

Texas has argued that its law is necessary to deter migrants from crossing illegally, as has happened in record numbers over the past year. The Biden administration argues that the law conflicts with federal law and violates the U.S. Constitution, which gives the federal government authority over immigration matters.

The judge hearing the case, David A. Ezra of the Western District of Texas, was appointed to the bench by President Ronald Reagan. He had frequent questions, particularly when the lawyer representing the Texas attorney general was speaking, and appeared skeptical of the law.

“Let’s say for the purpose of argument that I agree with you,” Judge Ezra told the state’s lawyer, Ryan Walters. California might then want to pass its own immigration and deportation law, he said. Maybe then Maine would follow, he added, and then other states.

“That turns us from the United States of America into a confederation of states,” Judge Ezra said. “What a nightmare.”

What does the Texas law say?

The law passed by the Texas Legislature, known as Senate Bill 4, makes it a crime to cross into Texas from a foreign country anywhere other than a legal port of entry, usually the international bridges from Mexico.

Under the law, known as S.B. 4, any migrant seen by the police wading across the Rio Grande could be arrested and charged in state court with a misdemeanor on the first offense. A second offense would be a felony. After being arrested, migrants could be ordered during the court process to return to Mexico or face prosecution if they don’t agree to go.

Texas lawmakers said they had designed S.B. 4 to closely follow federal law, which already bars illegal entry. The new law effectively allows state law enforcement officers all over Texas to conduct what until now has been the U.S. Border Patrol’s work.

It allows for migrants to be prosecuted for the new offense up to two years after they cross into Texas.

How does it challenge federal immigration authority?

Lawyers for the Biden administration argue that the Texas law conflicts with numerous federal laws passed by Congress that provide for a process for handling immigration proceedings and deportations.

The administration says the law interferes with the federal government’s foreign diplomacy role, pointing to complaints already lodged against Texas’ border actions by the government of Mexico. The Mexican authorities said they “rejected” any legislation that would allow the state or local authorities to send migrants, most of whom are not Mexican, back over the border to Mexico.

The fight over the law is likely to end up before the U.S. Supreme Court, legal experts have said. If so, it will give the 6-to-3 conservative majority a chance to revisit a 2012 case stemming from Arizona's attempt to take on immigration enforcement responsibilities. That case, Arizona v. United States, was narrowly decided in favor of the power of the federal government to set immigration policy.

Immigrant organizations, civil rights advocates and some Texas Democrats have criticized the law because it could make it more difficult for migrants being persecuted in their home countries to seek asylum, and it does not protect legitimate asylum seekers from prosecution in state courts.

Critics have also said that the law could lead to racial profiling because it allows law enforcement officers even far from the border to arrest anyone they suspect of having entered illegally in the previous two years. The result, they warn, could lead to improper traffic stops and arrests of anyone who looks Hispanic.

Wait, didn’t the Supreme Court already rule against Texas?

Not in this case.

Texas and the Biden administration have been battling for months over immigration enforcement on several legal fronts.

One case involves the placement by Texas of a 1,000-foot barrier of buoys in the middle of the Rio Grande, which Gov. Greg Abbott said would deter crossings. The federal government sued, arguing that the barrier violated a federal law over navigable rivers. In December, a federal appeals court sided with the Biden administration, ordering Texas to remove the barrier from the middle of the river while the case moved forward.

A second case involves Border Patrol agents’ cutting or removing of concertina wire — installed by the Texas authorities on the banks of the Rio Grande — in cases where agents need to assist migrants in the river or detain people who have crossed the border. The Texas attorney general, Ken Paxton, filed a lawsuit claiming that Border Patrol agents who removed the wire were destroying state property.

It was a fight over an injunction in that case that reached the Supreme Court on an emergency application. The justices, without giving their reasons, sided with the Biden administration , allowing border agents to cut or remove the wire when they need to while further arguments are heard in the case at the lower court level.

Why the stakes are higher now

Unlike the other cases, the battle over S.B. 4 involves a direct challenge by Texas to what courts and legal experts have said has been the federal government’s unique role: arresting, detaining and possibly deporting migrants at the nation’s borders.

“This will be a momentous decision,” said Fatma E. Marouf, a law professor and director of the Immigrant Rights Clinic at the Texas A&M University School of Law. “If they uphold this law, it will be a whole new world. It’s hard to imagine what Texas couldn’t do, if this were allowed.”

The federal government is seeking an injunction to prevent the law from going into effect next month.

“S.B. 4 is clearly invalid under settled precedent,” said Brian Boynton, who presented the Justice Department’s case.

“There is nothing in S.B. 4 that affords people the rights they have under federal law,” he said, later adding that the law would interfere with foreign affairs and the actions of the Department of Homeland Security.

Lawyers for Texas argued that the new law would not conflict with existing federal law. “This is complementary legislation,” said Mr. Walters, a lawyer for the state.

But Judge Ezra expressed concern that the law did not allow a judge to pause a prosecution for illegally entering Texas in the case of someone applying for asylum, calling that provision of the Texas law “troublesome” and “very problematic.”

“It just slaps the federal immigration law in the face,” he said.

Texas argued that the record number of migrant arrivals at the Texas border constituted an “invasion” that Texas had the power to defend itself against under Article I, Section 10 of the U.S. Constitution, which prohibits states from engaging in war on their own “unless actually invaded.”

The state has cited the same constitutional provision in the other pending cases between Texas and the federal government. But legal experts said the argument was a novel one.

And Judge Ezra appeared unconvinced on Thursday, as he had been when the same argument was presented last year in the buoy barrier case, which he decided in favor of the federal government .

“I do not see any evidence that Texas is at war,” he said on Thursday.

Before adjourning, the judge turned to Mr. Walters, the Texas lawyer, and said that he would work quickly to issue his decision so that if the state wanted to appeal before March 5, “you can.” He then turned to the federal government’s lawyers and added: “Either of you.”

J. David Goodman is the Houston bureau chief for The Times, reporting on Texas and Oklahoma. More about J. David Goodman


Russia-Ukraine war at a glance: what we know on day 726

Russia pushes for further advances after fall of Avdiivka; Denmark will give Ukraine all its artillery, says PM – ‘we don’t have to use it at the moment’


Russian troops launched multiple attacks to the west of just-captured Avdiivka in a bid for more gains, a Ukrainian army spokesperson said on Sunday. Kyiv also announced it had opened a war crimes investigation after two separate reports of Russian troops shooting captured Ukrainian soldiers emerged. Russia has said some Ukrainian troops remain at the Avdiivka coking coal plant .

Facing manpower and ammunition shortages, Ukraine was forced to withdraw from Avdiivka in the eastern Donetsk region, handing Moscow its first major territorial gain since May last year. “The enemy is trying to actively develop its offensive,” said Dmytro Lykhoviy, a spokesperson for the Ukrainian army commander leading Kyiv’s troops in the area. Ukraine’s general staff reported 14 failed Russian attacks on the village of Lastochkyne, around 2km (one mile) to the west of Avdiivka’s northern edge. “But our considerable forces are entrenched there,” Lykhoviy said.

Lykhoviy also reported failed Russian attacks near the villages of Robotyne and Verbove in the southern Zaporizhzhia region – one of the areas where Ukraine managed to regain ground during last year’s counteroffensive. He said it would be “very difficult” for Russia to break through there , given heavy Ukrainian defensive lines and natural conditions of the terrain. “The situation in the Zaporizhzhia sector is stable … No positions have been lost. The enemy was kicked in the teeth and retreated.”

Denmark has decided to donate all its artillery to Ukraine , the Danish prime minister, Mette Frederiksen, told the 60th Munich Security Conference on Saturday, pointing out that other European countries are also holding munitions they do not immediately need. “If you ask Ukrainians, they are asking us for ammunition now, artillery now,” she said. “From the Danish side, we decided to donate our entire artillery.”

Frederiksen continued: “There is still ammunition in European stocks. This is not a question of only production because we have the weapons, we have ammunition, we have air defence that we don’t have to use ourselves at the moment that we should deliver to Ukraine … Russia does not want peace with us. They are destabilising the western world from many different angles – in the Arctic region, the Balkans and Africa – with disinformation, cyber-attacks, hybrid war, and obviously in Ukraine.”

“Please, do not ask Ukraine when the war will end. Ask yourself: why is Putin still able to continue it?” Ukraine’s president, Volodymyr Zelenskiy, said as he addressed delegates at the Munich Security Conference on Saturday. Zelenskiy shared a video of the speech online and also wrote: “We can get our land back. And Putin can lose.”

China’s foreign minister, Wang Yi, has told his Ukrainian counterpart that Beijing does not sell lethal weapons to Russia for its war against Ukraine , a Chinese foreign ministry statement said on Sunday. China says it is a neutral party in the Ukraine conflict but has been criticised for refusing to condemn Moscow for its offensive.

Ukraine’s foreign minister, Dmytro Kuleba, said he had discussed the prospects for peace in Kyiv’s war against Russia with his Chinese counterpart . Kuleba said he had discussed Ukraine’s plans to hold a global peace summit, which Switzerland has agreed to help stage. The two men, he said, “agreed on the need to maintain Ukraine-China contacts at all levels and continue our dialogue”.

The UK’s shadow foreign secretary, David Lammy, has said he would support further sanctions against Russia and added he would “plug the gaps” of existing measures. In separate comments made at the Munich conference on Sunday, Lammy said: “Russia will continue to be a threat for Europe for months, years, perhaps a generation more.”

Events on the battlefield in Ukraine are a matter of “life and death” for Russia that could determine its fate, Vladimir Putin said in remarks aired on Sunday. The Russian president has repeatedly framed the almost two-year conflict as a battle for Russia’s survival in a bid to rally patriotic sentiment.

More than 100 Kremlin documents obtained by a European intelligence service and reviewed by the Washington Post reportedly show that Russia ran a disinformation campaign to undermine Volodymyr Zelenskiy . The US publication said Kremlin instructions had “resulted in thousands of social media posts and hundreds of fabricated articles” that “tried to exploit what were then rumoured tensions” between Zelenskiy and his top army commander, Valerii Zaluzhnyi.

German politician Ricarda Lang pushed back at the idea of a deal with Russia , in response to US Republican senator JD Vance’s comments, that included his belief that Putin is not “an existential threat to Europe”, on a panel at the Munich Security Conference on Sunday. Lang said: “Putin has shown over and over again – and he just showed this with the murder of Navalny on Friday – that he has no interest in peace at the moment.”

Poland’s Radek Sikorski stressed Poland’s support for Ukraine at the third day of the Munich conference, but acknowledged that Warsaw and Kyiv had two problems linked to grain and trucking . Responding to Sikorski on stage, Olha Stefanishyna, Ukraine’s deputy prime minister, said: “We have to solve it. There are legitimate messages on both sides. I think that the major contribution in resolving these issues has been done by Ukraine, because we secured the Black Sea.”

The EU’s foreign policy chief, Josep Borrell, said that “the most important security commitment for Ukraine was membership” of the EU , in comments made at the Munich conference on Sunday.

The Estonian prime minister, Kaja Kallas, dismissed a warrant issued by Russia for her arrest, saying it was just an attempt to intimidate her amid speculation she could get a top EU post. “It’s Russia’s playbook. It’s nothing surprising and we are not afraid,” she told Reuters on Sunday on the sidelines of the Munich conference. When asked by Reuters whether she was interested in any future European role, she said: “We are not there yet.”

Ukrainians who sought sanctuary in the UK after the Russian invasion will be permitted to extend their visas for an extra 18 months , the Home Office has announced.



2024 NBA All-Star Game score, highlights: East sets record by hitting 200-point mark in defense-less win

Damian Lillard was named MVP after putting up 39 points to lead the East to victory.


The 2024 NBA All-Star Game was a record-setting affair, as the Eastern Conference took down the Western Conference, 211-186, Sunday night in Indianapolis. In the process, the East became the first team to ever reach the 200-point mark in the All-Star Game, surpassing the previous record of 196, set by the West in a 2016 win.

In fitting fashion, Pacers hometown star Tyrese Haliburton got the East going with five 3-pointers in a short span during the first quarter, and was the one who broke the 200-point barrier with a 3-pointer in the closing minutes. 

After utilizing a captains' draft format in recent years, the NBA decided to return to the classic East vs. West matchup this year. In addition, the Elam Ending was scrapped in favor of a traditional 48-minute game. Neither change did anything to affect the intensity of the game, as it was once again a total shootout with basically no defense .

Karl-Anthony Towns of the Timberwolves went off in the fourth quarter to finish with 50 points, though that wasn't enough for the West to get the win. On the other side, Bucks star Damian Lillard -- who was named the game's MVP -- led the way with 39 points and hit multiple shots from halfcourt. 

Jaylen Brown added 36 of his own and Haliburton put up 32 points, seven rebounds and six assists. Unfortunately for Haliburton and his hometown fans, that wasn't enough for MVP. Those in the building had their star's back and showered Lillard with boos during the trophy ceremony. 

With another All-Star Game in the books, here's what we learned from Sunday's showcase:

Records are made to be broken

Any hope that the players might take this year's game more seriously was dashed right away. Both teams hit the 20-point mark within five minutes and never looked back. The East reaching the 200-point mark was the headline record, but it was far from the only one. Here's a rundown of some of the history that was made in Indianapolis:

  • The two teams combined for 194 points in the first half, setting a new record for the highest-scoring half in All-Star history. After the break, they promptly broke their own record by combining for 204 points in the second half. 
  • With Damian Lillard (11) and Tyrese Haliburton (10) leading the way, the East drained a stunning 42 3-pointers during the game, setting a new All-Star record. For comparison, the most 3s ever made in a normal NBA game is 29, set by the Bucks in 2020. 
  • Lillard became the first player to go back-to-back in the 3-Point Contest in over a decade on Saturday , then followed that up with the All-Star Game MVP award on Sunday. He is now the first player ever to lift both trophies in the same weekend. 

Dame was gunning for MVP

This was Lillard's eighth All-Star appearance, but his first ever start. He wasn't going to let that opportunity go to waste. "My first start, I'm going to be on the floor a lot... why not try go get an MVP?" Lillard said after the game. 

It took Lillard over six minutes to get his first basket, but once he did, he never stopped shooting. Lillard put on a show, as he launched shots from all over the gym, including multiple half-courters that caught nothing but net. In the end, he finished with 39 points and six assists on 14-of-26 from the field. He took more shots than anyone on the East team, and was second only to Karl-Anthony Towns in attempts, but when you make that many of them it's hard to argue with the volume. 

The fans in Indianapolis were not pleased with Lillard, but he knew what he was getting himself into by going for the trophy with Haliburton as his starting backcourt mate. "I expect it," Lillard said of the boos. "We're in his hometown, we're in his building. He had a great game. But it's an honor. I've been here quite a few times, to have this type of accomplishment is special." 

The format changes didn't work

During his press conference on Saturday , NBA commissioner Adam Silver said, "We returned to the East versus West format and the 48-minute game format because we thought what we were doing was not working. I'd say people uniformly were critical of last year's All-Star Game and felt it was not a competitive game."

He added that the league worked with the players on how to improve the experience and level of competitiveness, and concluded by saying, "I think we're going to see a good game tomorrow night."

Unfortunately, he was wrong. Once again, the game was a dud, made notable only by how many points the two teams were able to rack up while not playing any defense. The problem for the NBA is the players don't care to try any harder, and until that changes, nothing else matters. 

It's hard to totally fault the players when a championship and hundreds of millions of dollars worth of contracts are going to be on the line over the next few months. No one wants to get hurt during an exhibition game and potentially miss out on those opportunities. The result, though, is an event that no longer seems to be worth anyone's time. 

Final: East 211 -- West 186

The Eastern Conference never let up and cruised to a 25-point win in an all-out shootout. Their 211 points were a new All-Star Game record, as were the combined 397 points. Damian Lillard, who was named MVP, Tyrese Haliburton and Jaylen Brown all put up over 30 points to lead the way for the East, while Karl-Anthony Towns went off in the fourth quarter to get to 50 points for the West.

Silver's comments did not age well

Adam Silver was hoping for and predicted a more competitive All-Star Game in 2024. That was not the case Sunday night.

Karl-Anthony Towns dropped 50 points, with 31 of them coming in the fourth quarter. Towns ended the game with 35 field-goal attempts. LeBron James, Kevin Durant, Luka Doncic and Nikola Jokic had 40 field-goal attempts combined.

Lillard in elite company

Damian Lillard is the only player to win All-Star Game MVP and the 3-point contest in the same year. He's also one of just five players who have won both trophies. The list:

  • Damian Lillard: 2024 ASG MVP, 2023 & 2024 3-point contests
  • Stephen Curry: 2022 ASG MVP, 2015 & 2021 3-point contests
  • Kyrie Irving: 2014 ASG MVP, 2013 3-point contest
  • Glen Rice: 1997 ASG MVP, 1995 3-point contest
  • Larry Bird: 1982 ASG MVP, 1986, 1987, 1988 3-point contests

Lillard wins MVP, gets booed

Damian Lillard has had an up-and-down season in Milwaukee, but he was on top of his game this weekend. After winning the 3-Point Contest on Saturday, Lillard went off for 39 points to lead the East to a victory and was named All-Star Game MVP on Sunday. He is the first player in NBA history to lift both trophies in the same weekend. 

The fans in Indianapolis were not pleased with the selection, and showered Lillard with boos. Not only have the Pacers and Bucks had serious beef this year, but Lillard stole the honor from Tyrese Haliburton, who had a strong game himself.

Third quarter: East 160 -- West 136

The Eastern Conference is well on its way to scoring 200 points for the first time ever in the All-Star Game. They're just 40 points away, and should get there with ease given how little defense is being played this evening. The previous high is 196, set by the Western Conference in 2016.

LeBron has jokes

LeBron is playing in his record-setting 20th All-Star Game, and is by far the oldest player in the game at 39 years old. While sitting on the bench he cracked some jokes at his own expense about playing with Wilt Chamberlain and Bob Cousy in his first All-Star Game back in 1968. 

Doncic gets stuffed by the rim

Luka Doncic has never been known as a high-flyer, and even in the All-Star Game he had some trouble at the rim. Out alone on the fastbreak, he tried to set himself up with an off-the-backboard alley-oop but got stuffed by the rim. That's going on Shaqtin' a Fool.

Lillard from halfcourt

Damian Lillard reminded everyone that he truly has unlimited range by pulling up from halfcourt and draining his ninth 3-pointer of the game; no one else has even made nine shots tonight. Lillard is up to 33 points and barring a late run by someone else, he likely has the MVP locked up. 

Joker fakes everyone out

One of the best moments of the first half came late in the second quarter when Nikola Jokic picked off a pass and took off the other way on the fastbreak. He started laughing as he approached the rim, and faked like he was going to throw down a big slam before laying the ball in.

Halftime: East 104 -- West 89

The two teams combined for only 93 points in the second quarter, but that was still enough to set an All-Star Game record for most points in a half with 193. Just for comparison, the final score 30 years ago in the 1994 event was 127-114 in favor of the East.

Damian Lillard has taken over the game for the East and leads all scorers with 22 points. Tyrese Haliburton has turned into a passer after the first few minutes and is still stuck on 18. Steph Curry and Kevin Durant each have 12 points to lead the West. 

Lillard heating up

Last night, Damian Lillard became the first player in over a decade to go back-to-back in the 3-Point Contest, and he's carried that rhythm over into tonight. He's hit five 3s and suddenly has a game-high 17 points to lead the East.

LeBron's still got it

So much for LeBron taking it easy tonight. He just got loose on the fastbreak again and threw down the alley-oop off the bounce-pass feed from Paul George. Even after all these years, it's still fun to watch LeBron put on a show.

First quarter: East 53 -- West 47

If you were hoping for more effort on defense this evening, you are out of luck. There were 100 combined points in the first quarter, and the East has already made 13 3-pointers. Tyrese Haliburton leads all scorers with 15 points, while six different players on the West have put up at least six points. 

The good news is that the game is still close, and if that continues for the rest of the night the energy and effort should improve.

LeBron throws it down

LeBron said during his press conference this afternoon that he's not going to play a ton of minutes because of a sore ankle. His ankle clearly isn't feeling too bad, though, as he stormed down the lane for a thunderous slam on the fastbreak. 

Haliburton putting on a show

Tyrese Haliburton made his first All-Star start tonight, and he's putting on a show in front of his hometown fans. The Pacers star was on fire in the opening few minutes, and has already poured in five 3-pointers. His game-high 15 points have the East out to an early lead at the first timeout. 

Tatum for three from the corner gets us started

Giannis Antetokounmpo pulled down a rebound, pushed the ball in transition and whipped a pass out to Jayson Tatum in the corner. The Celtics star obliged with a 3-pointer that caught nothing but net, and we're underway in Indianapolis.

Tip-off time, finally

More than 40 minutes after tonight's scheduled start time, the players are finally on the court and the ball is ready to be tipped. This is simply too long of a delay, especially after Adam Silver said yesterday during his press conference that they were going to speed up the ceremonies in order to facilitate a more competitive game.

Player introductions starting

Reggie Miller, Oscar Robertson and Larry Bird have taken center stage with an opening message about the importance of basketball in Indiana. Now, it's time for the long-awaited player introductions. Thankfully, there's slightly less production than in previous years, so it shouldn't take too long.

Closer look at LeBron's shoes

With a Deion Sanders twist.

Two Celtics stars meet up

Larry Bird has been front and center this weekend with the All-Star Game in his home state. Prior to Sunday's main event, he met up with current Celtics star Jayson Tatum to share a few words. If everything goes to plan for Tatum and the Celtics this summer, he'll join Bird as an NBA champion.

Recent NBA All-Star Game MVP history

  • 2023: Jayson Tatum, Boston Celtics
  • 2022: Stephen Curry, Golden State Warriors
  • 2021: Giannis Antetokounmpo, Milwaukee Bucks
  • 2020: Kawhi Leonard, Los Angeles Clippers
  • 2019: Kevin Durant, Golden State Warriors
  • 2018: LeBron James, Cleveland Cavaliers
  • 2017: Anthony Davis, New Orleans Pelicans

Closer look at NBA All-Star Game jerseys

The NBA All-Star Game jerseys this year draw inspiration from Indiana's basketball heritage. Here's a closer look before the stars take the court.

LeBron comments on Lakers future

LeBron James is playing in his 20th All-Star Game this weekend. He skipped out on Saturday night's festivities to watch his son, Bronny, play for USC. LeBron made it to Indy on Sunday and addressed the media. James can be a free agent after the season, and his future was the main topic of conversation when he met with reporters on Sunday evening.

"I am a Laker and I am happy and have been very happy being a Laker the last six years and hopefully it stays that way," James said.

LeBron James opens up on Lakers future, says Warriors trade rumors 'didn't go far at all'

Haliburton is MVP favorite

Tyrese Haliburton of the hometown Pacers has become the All-Star Game MVP favorite less than an hour before tip-off. Haliburton, as of 7:10 p.m. ET, has the shortest odds of any player at +325 on Caesars Sportsbook. Giannis Antetokounmpo and Stephen Curry are among the other favorites. LeBron James had better odds earlier in the day but has dropped to a longer shot at +2000, per Caesars.

  • Tyrese Haliburton: +325
  • Giannis Antetokounmpo: +700
  • Stephen Curry: +750
  • Shai Gilgeous-Alexander: +750
  • Jayson Tatum: +900
  • Luka Doncic: +1000

Larry Bird wants to see some effort

NBA legend and Indiana native Larry Bird has been on hand at this weekend's festivities in Indianapolis. Larry Legend had some thoughts about the game, too. Specifically, he wants to see some effort out there.

"The one thing I would really like to see is they play hard tonight in this All-Star Game," Bird said. "I think it's very important when you have the best players in the world together, you've gotta compete, and you've gotta play hard, and you've gotta show the fans how good they really are."

Read more here: 

Larry Bird calls on NBA stars to 'play hard' in 2024 All-Star Game: 'You've gotta compete'

Welcome to the 2024 NBA All-Star Game

Tip-off is just about an hour away in Indianapolis, and the West enters the 73rd annual NBA All-Star Game as slight favorites over their Eastern Conference counterparts. The 2024 game marks a return to the traditional format. There was no All-Star draft this year. There is no Elam Ending. Just 48 minutes of basketball with the best players in the league on the same court.
