test 2 stats class

Measure of Center Value at the center or middle of a data set.
Arithmetic Mean / Mean The measure of center found by adding the data values and dividing the total by the number of data values
Sample size Number of data values
Median The measure of center that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude
Mode The value that occurs with the greatest frequency
Bimodal When 2 data values occur with the same greatest frequency, each one is a mode
Multimodal When more than 2 data values occur with the same greatest frequency, each is a mode
No Mode When no data value is repeated
Midrange The measure of center that is the value mid-way between the maximum and minimum values in the original data set
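The measures of center defined above can be checked on a small sample. The following is a minimal sketch using Python's standard `statistics` module; the data values and the `midrange` helper are made up for illustration and are not from the notes.

```python
from statistics import mean, median, multimode


def midrange(values):
    """Value midway between the minimum and maximum data values."""
    return (min(values) + max(values)) / 2


# Hypothetical sample, chosen only for illustration.
data = [2, 3, 3, 5, 7, 7, 8]

center = {
    "mean": mean(data),          # sum of the values / number of values
    "median": median(data),      # middle value of the sorted data
    "modes": multimode(data),    # every value tied for the greatest frequency
    "midrange": midrange(data),  # (minimum + maximum) / 2
}
print(center)
```

This sample is bimodal (3 and 7 each occur twice). Note that `multimode` returns every value when nothing repeats, whereas the cards above call that case "No Mode".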
Skewed A distribution that is not symmetric and extends more to one side than the other
Range The difference between the maximum data value and the minimum data value. Range = Max - Min
Standard Deviation Measure of variation of values about the mean
Variance Measure of variation equal to the square of the standard deviation
Range Rule of Thumb Based on the principle that for many data sets, the vast majority of sample data values lie within 2 standard deviations of the mean
Empirical Rule For data sets that are approximately bell shaped, these properties apply: about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations, and about 99.7% fall within 3 standard deviations
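As a rough illustration of the measures of variation above, the sketch below computes the range, sample standard deviation, and sample variance, then applies the range rule of thumb. The data values are hypothetical, and the sample (n - 1) formulas from the `statistics` module are assumed.

```python
from statistics import mean, stdev, variance

# Hypothetical sample, chosen only for illustration.
data = [4, 8, 6, 5, 7]

data_range = max(data) - min(data)  # maximum - minimum
s = stdev(data)                     # sample standard deviation
s_squared = variance(data)          # sample variance, the square of s

# Range rule of thumb: most values lie within 2 standard deviations of the mean.
low = mean(data) - 2 * s
high = mean(data) + 2 * s
within_two = [x for x in data if low <= x <= high]

print(data_range, round(s, 2), s_squared)
print(round(low, 2), round(high, 2), within_two)
```

For this small sample every value falls inside the mean ± 2s interval, which agrees with the rule of thumb but does not prove it in general.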
Coefficient of Variation (or CV) For a set of nonnegative sample or population data, the standard deviation relative to the mean, expressed as a percent
Z Score Number of standard deviations that a given value x is above or below the mean. Round to 2 decimal places (9.08)
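The two cards above reduce to one-line formulas. Here is a minimal sketch, assuming sample statistics; the function names and the example numbers are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Standard deviation relative to the mean, expressed as a percent."""
    return stdev(values) / mean(values) * 100


def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return round((x - center) / spread, 2)  # rounded to 2 decimal places


# Hypothetical values for illustration.
print(round(coefficient_of_variation([4, 8, 6, 5, 7]), 2))
print(z_score(85, 70, 10))  # a value 15 above a mean of 70 with s = 10
print(z_score(55, 70, 10))  # negative: the value is below the mean
```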
Percentiles Measures of location, which divide a set of data into 100 groups with about 1% of the values in each group
Quartiles Measures of location, which divide a set of data into 4 groups with about 25% of the values in each group
Boxplot Graph of a data set that consists of a line extending from the minimum to the maximum value, and a box with lines drawn at Q1, the median, and Q3
5-Number Summary Consists of the minimum value, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum value
Event Any collection of results or outcomes of a procedure
Simple Event Outcome or an event that cannot be further broken down into simpler components
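Sample Space Consists of all possible simple events
Complement The complement of event A consists of all outcomes in which event A does not occur
Actual Odds Against The ratio P(not A) / P(A), usually expressed as a:b, where a and b are integers having no common factors
Actual Odds in Favor The reciprocal of the actual odds against, expressed as b:a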
Payoff Odds The ratio of the net profit to the amount bet
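The odds cards above can be turned into a short calculation. This is a sketch under the assumption that the probability is supplied as an exact fraction strictly between 0 and 1; the function names and the 1/38 example are hypothetical.

```python
from fractions import Fraction


def odds_against(p_event):
    """Actual odds against an event as a reduced pair (a, b), read as a:b."""
    p = Fraction(p_event)
    ratio = (1 - p) / p  # P(not A) / P(A); Fraction reduces common factors
    return ratio.numerator, ratio.denominator


def odds_in_favor(p_event):
    """Actual odds in favor: the same pair reversed, read as b:a."""
    a, b = odds_against(p_event)
    return b, a


def net_profit(payoff_a, payoff_b, bet):
    """Payoff odds a:b mean a units of net profit for every b units bet."""
    return Fraction(payoff_a, payoff_b) * bet


p = Fraction(1, 38)           # hypothetical probability of winning
print(odds_against(p))        # actual odds against
print(odds_in_favor(p))       # actual odds in favor
print(net_profit(35, 1, 5))   # net profit on a bet of 5 at payoff odds 35:1
```

Using `Fraction` rather than floats keeps a and b as integers with no common factors, as the definition requires.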
Compound Event Any event combining 2 or more simple events
Disjoint Events that cannot occur at the same time
Independent Occurrence of one event does not affect the probability of the other event
Dependent Occurrence of one event affects the probability of the other event
Random Variable Variable that has a single numerical value, determined by chance, for each outcome of a procedure
Probability Distribution Description that gives the probability for each value of the random variable
Discrete Random Variable Finite number of values or a countable number of values
Continuous Random Variable Infinitely many values, and those values can be associated with measurements on a continuous scale without gaps or interruptions
Expected Value Represents the mean of the outcomes
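For a discrete random variable, the expected value is the sum of each value times its probability. A minimal sketch, using the number of heads in two fair coin tosses as the example distribution; the function name is an assumption for illustration.

```python
from fractions import Fraction

# Probability distribution: number of heads in two fair coin tosses.
distribution = {
    0: Fraction(1, 4),
    1: Fraction(1, 2),
    2: Fraction(1, 4),
}


def expected_value(dist):
    """Mean of a discrete random variable: sum of x * P(x) over all values x."""
    if sum(dist.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in dist.items())


print(expected_value(distribution))
```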

Introduction to Statistics and Data

Word Definition
Statistics A way of reasoning using calculations made from data that help understand the world.
Context Tells WHO was measured, WHAT was measured, WHERE the data were collected, WHEN and WHY the study was performed, and HOW the data were collected.
Data Systematically recorded information with its context.
Data Table An arrangement of data where rows represent a case, and columns represent a variable.
Case What the data is about.
Population What is to be represented by the sample. The whole.
Sample The few cases we can examine to understand the population.
Variable Holds information about the same characteristic for many cases.
Units Quantity or amount adopted as a standard of measurement.
Categorical Data Names or labels that place each case into a category. No units.
(Ex. religion, favorite music, favorite food, etc.)
Quantitative Data Measured numerical values. Always have units.
(Ex. distance, weight, time, age, etc.)
Two Types of Quantitative Data Discrete and Continuous.
Discrete Data Counted quantitative data. Countable.
(Ex. shoe size, etc.)
Continuous Data Quantitative data that can take on any value in an interval. Measurable.
(Ex. minutes, distance, etc.)

Displaying and Describing Categorical Data

Word Definition
Frequency Table Record totals and category names.
Distribution of categorical variables.
Pile Count the number of data values in each category of interest.
Relative Frequency Table Counts are expressed as percentages.
Distribution of categorical variables.
Area Principle Each data value should be represented by the same amount of area.
Equal width and equal spacing.
Bar Chart Show counts for each category.
Distribution of categorical variables.
Bars don't touch (Area Principle).
Vertical (Y) Axis of Bar Chart Frequency or relative frequency (100%).
Horizontal (X) Axis of Bar Chart Categories.
Relative Frequency Bar Chart Show percent for each category.
Pie Chart Slice into pieces proportional to the fraction of the whole.
In percents, whole circle adds up to 100%.
Contingency (Two-Way) Table Shows how cases are distributed along each of two categorical variables.
Contingent on the value of the other variable.
Cells give counts NOT totals.
Marginal Distribution Distribution of one variable.
From totals on margins of contingency tables.
Total per category / The whole.
Conditional Distribution Distributions of one variable for just the individuals who satisfy some condition on another variable (restricts).
Value of cell / Row or Column total.
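The marginal and conditional distributions above can be computed directly from a contingency table. The sketch below uses a hypothetical 2x2 table stored as nested dictionaries; the counts, labels, and function names are invented for illustration.

```python
# Hypothetical contingency table: rows are one variable, columns another.
table = {
    "A": {"yes": 30, "no": 20},
    "B": {"yes": 10, "no": 40},
}


def marginal_by_row(counts):
    """Marginal distribution of the row variable: row total / grand total."""
    grand_total = sum(sum(row.values()) for row in counts.values())
    return {name: sum(row.values()) / grand_total for name, row in counts.items()}


def conditional_given_row(counts, row_name):
    """Conditional distribution of the column variable: cell / row total."""
    row = counts[row_name]
    row_total = sum(row.values())
    return {col: count / row_total for col, count in row.items()}


print(marginal_by_row(table))
print(conditional_given_row(table, "A"))
print(conditional_given_row(table, "B"))
```

Here the conditional distributions for rows A and B differ (60% "yes" versus 20% "yes"), which suggests the two variables are associated.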
Independent Variables Segmented bar chart shows IDENTICAL conditional distributions.
Means two variables have NO ASSOCIATION.
Dependent Variables Segmented bar chart shows DIFFERENT conditional distributions.
Means two variables ARE ASSOCIATED.
Segmented Bar Chart Conditional distribution of categorical variable within each category of another variable.
By percents.
Each bar treated as a whole (like a pie chart).
Simpson's Paradox When averages are taken across different groups, they can appear to contradict the overall averages.
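Simpson's Paradox is easiest to see with numbers. In the sketch below, all counts are hypothetical and were invented so that option A has the higher success rate within each group while option B has the higher rate overall.

```python
# Hypothetical (successes, attempts) counts, invented to illustrate the effect.
results = {
    "A": {"group 1": (9, 10), "group 2": (30, 100)},
    "B": {"group 1": (80, 100), "group 2": (2, 10)},
}


def group_rate(option, group):
    """Success rate for one option within one group."""
    successes, attempts = results[option][group]
    return successes / attempts


def overall_rate(option):
    """Success rate for one option with both groups pooled together."""
    successes = sum(s for s, _ in results[option].values())
    attempts = sum(a for _, a in results[option].values())
    return successes / attempts


for group in ("group 1", "group 2"):
    print(group, group_rate("A", group), group_rate("B", group))
print("overall", round(overall_rate("A"), 3), round(overall_rate("B"), 3))
```

The reversal happens because the two options have very different group sizes, so the pooled rates weight the groups unequally.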

Displaying and Summarizing Quantitative Data

Word Definition
Distribution Slices up all possible values of the variable into equal-width bins.
Gives counts of values falling into each bin.
Histogram Show distribution of a quantitative variable.
Each bar represents frequency (counts) of values falling into each bin.
Bars can touch.
MUST HAVE FREQUENCY TABLE.
Relative Frequency Histogram Each bar represents frequency of values falling into each bin in PERCENTS.
Bins in Histograms Minimum of FIVE BINS
Bin Bar or class
Horizontal (X) Axis on Histograms Variable.
Vertical (Y) Axis on Histograms Frequency (counts or percents).
Stem and Leaf Display Sketches distribution of quantitative data.
Useful for small sets of data.
MUST HAVE KEY (with units).
Dotplot Graphs a dot for each case against a single axis.
Shape Modes (single, multiple, etc.); Skew and Symmetry; Outliers and Gaps.
Modes Unimodal (one mode); Bimodal (two modes); Multimodal (multiple modes); uniform (no modes).
Right Skew Mean > Median (mean is on right).
Tail (or longest whisker) is on right.
Left Skew Mean < Median (mean is on left).
Tail (or longest whisker) is on left.
Symmetry Mean = Median.
Outlier Data value that stands apart from the rest of the distribution. An outlier is not necessarily influential, and an influential datum is not necessarily an outlier.
Spread Range, IQR (Interquartile Range), and Standard Deviation (Variance).
Range Maximum – Minimum
Not resistant to outliers.
Interquartile Range (IQR) Q3 – Q1
More resistant to outliers (than range).
Middle 50%.
Standard Deviation Square root of variance.
Mean Average
Five Number Summary
(Smallest to Largest)
Minimum, Quartile 1, Median, Quartile 3, Maximum
"n" Sample size.
Minimum Smallest datum in the set of data.
Maximum Largest datum in the set of data.
Quartile 1 (Q1) "Median of lower half".
Marks 25th percentile.
Quartile 3 (Q3) "Median of upper half".
Marks 75th percentile.
Median (Quartile 2, Q2, or Med.) Middle datum (but not necessarily an actual piece of data).
Numerical Summary (10 factors) Mean, Standard Deviation, Range, Interquartile Range (IQR), "n", and Five Number Summary (Minimum, Q1, Median, Q3, and Maximum).
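The Five Number Summary can be sketched with the "median of lower half / median of upper half" rule from the Q1 and Q3 cards. This version assumes that, for an odd number of values, the median itself is left out of both halves; other textbooks and software use slightly different quartile conventions. The function name and data are hypothetical, and at least two values are required.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median position
    upper = ordered[half + n % 2:]  # values above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical sample, chosen only for illustration.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
minimum, q1, med, q3, maximum = five_number_summary(data)
print(minimum, q1, med, q3, maximum)
print("IQR =", q3 - q1)
```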
Most reliable measure of CENTER when shape is SYMMETRICAL Mean.
Most reliable measure of CENTER when shape is SKEWED Median.
Most reliable measure of SPREAD when shape is SYMMETRICAL Standard Deviation.
Most reliable measure of SPREAD when shape is SKEWED Interquartile Range (IQR).
Percentile The ith percentile is the number that falls above i% of the data.

Understanding and Comparing Distributions

Word Definition
Boxplot Displays Five Number Summary.
Central box with whiskers that extend to the non-outlying data values.
Useful for comparing groups and displaying outliers.
Lower Fence (Outlier) Q1 – 1.5(IQR)
Any datum below is lower outlier.
Upper Fence (Outlier) Q3 + 1.5(IQR)
Any datum above is upper outlier.
Far Lower Outlier Any datum below Q1 – 3(IQR)
Far Upper Outlier Any datum above Q3 + 3(IQR)
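The fence rules above amount to a small calculation once Q1 and Q3 are known. In this sketch the quartiles are passed in as arguments; the function names, the sample, and its quartiles (Q1 = 4, Q3 = 12, from the median-of-halves rule) are assumptions for illustration.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k(IQR) and Q3 + k(IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Values below the lower fence or above the upper fence."""
    low, high = outlier_fences(q1, q3, k)
    return [x for x in values if x < low or x > high]


# Hypothetical sample with Q1 = 4 and Q3 = 12.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15, 40]

print(outlier_fences(4, 12))           # ordinary fences, k = 1.5
print(find_outliers(data, 4, 12))      # outliers
print(find_outliers(data, 4, 12, k=3)) # far outliers, k = 3
```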
When COMPARING DISTRIBUTIONS consider their: Shape, center, outliers, and spread.
When COMPARING BOXPLOTS consider their: Shape, median, IQR, variation, outliers.
Timeplot Displays data that change over time.
Successive values are usually connected with lines to show trends more clearly.
Smooth curves can be added to show long-term patterns and trends.

Common Stat 151 Terms to know

What is statistics A way of reasoning and a collection of tools and methods that help us comprehend the world OR particular calculations made from data
What is Data values (numbers, record names, labels) along with their context
Who the individual cases about whom or which the data is from
Respondents Individuals who answer a survey
Subjects / Participants People on whom we experiment
Experimental Units Animals, Plants, inanimate subjects
Variables Characteristics recorded about each individual
What Determines the variable that has been measured; some variables have units. Ex. mass, time, distance
Categorical or Qualitative Variables Quality of something, uses words Ex. sex, race, ethnicity
Quantitative variables Come with a unit of measurement Ex. Income ($), Height (inches), Weight (pounds)
Ordinal variables Rank or order data in terms of degree
Identifier Variables Categorical variables with one individual in each category Ex. SIN, FedEx tracking number, ISBN
Why why the data is useful
trial sequence of events we want to investigate
component building blocks of a simulation
Sample Surveys

Correlation and Regression

Purpose of Correlation and Regression Make inferences based on sample data that come in pairs. Determine if there is a linear relationship b/w the two quantitative variables & describe it with an equation that can be used for predictions. Two dependent populations (quantitative data).
Correlation Correlation coefficient measures the strength of the linear relationship b/w two quantitative variables. Variables must be continuous/discrete. Use scatter plot. X & Y are linearly related if the scatter of points can be approximated by a straight line.
Correlation Coefficient r measures the strength of the linear relationship b/w the paired x & y values in a sample. Represents the linear correlation coefficient for a sample. Rho represents the linear correlation coefficient for a population.
Correlation $r = \frac{S_{xy}}{S_x S_y}$. Sxy: covariance of x & y. Sx: standard deviation of x. Sy: standard deviation of y.
Interpreting the Linear Correlation Coefficient r Between -1 & 1. If r close to 0, no linear correlation b/w x & y. If r close to -1 or 1, strong linear correlation. Negative value indicates negative or inverse relationship. Positive value indicates positive relationship. r measures strength & direction.
Factors That Affect the Size of r Nonlinear relationship: linear correlation only measures degree of linear relationship, so if Xs and Ys are nonlinearly related, r may be 0 even though the 2 variables are nonlinearly related. Restricted range: restrictions on range of X/Y will reduce r.
Factors That Affect the Size of r Extreme Scores: a single extreme score may produce evidence of correlation when none exists. Combining groups: there may be no correlation w/n either group, but combining them can give the illusion of a linear correlation. Can also change its direction.
Correlation Testing hypotheses about rho. A single r can be tested to determine if the corresponding rho is different from a hypothesized value. df=n-2. CORRELATION DOES NOT PROVE CAUSATION. measures how well the best-fitting straight line actually fits.
Assumptions For each value of X there is a normally dist. subpop. of Y values. For each value of Y there is a normally dist. subpop. of X values. Joint dist. of X & Y is a normal dist. The variance of Xs/Ys is same at each value of X/Y (homoscedasticity).
R-Squared Coefficient of determination. The proportion of the variation in y that is explained by the linear relationship b/w x & y. SSR/SST. Measures closeness of fit of the sample regression equation to the observed values of Y.
Regression Used to find the best-fitting straight line that relates the scores. Objective is to predict the value of one variable (the outcome) based on the value of another variable. Use scatter plot. Best-fitting line minimizes the sum of the squared differences (y - yhat)², where y - yhat is actual - predicted.
Regression SStotal:variation in obs. values of response variable. SSregression:variation in obs. values of response variable explained by regression. SSerror:variation in obs. values of response variable not explained by regression. SSR:1df SSE:n-2df SST:n-1df
Least Squares Criterion The best fitting straight line is the one that minimizes the sum of the squared deviation b/w the actual y values & the predicted values. Minimize SSE.
Beta The population parameter for b, the slope of the line.
How Can You Tell A Regression Question From A Correlation Question? Intent: Prediction=regression, Strength of relationship=correlation

vocab

model an equation or formula that simplifies and represents reality
linear model a linear model is an equation of a line, need to know the variables along with their W's and their units
predicted value the value of yhat for a given x-value in the data, a predicted value is found by substituting the x-value in the regression equation
residuals the differences between data values and the corresponding values predicted by the regression model – or more generally values predicted by any model
least squares specifies the unique line that minimizes the variance of the residuals or equivalently the sum of the squared residuals
regression to the mean because the correlation is always less than 1.0 in magnitude, each predicted yhat tends to be fewer standard deviations from its mean than its corresponding x was from its mean, which is called regression to the mean
regression line / line of best fit $\hat{y} = b_0 + b_1 x$
slope $b_1$ gives a value in "y-units per x-unit"
intercept $b_0$ gives a starting value in y-units, found from $b_0 = \bar{y} - b_1 \bar{x}$
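The slope and intercept cards above can be sketched as a short calculation. The intercept formula is the one given in the notes; for the slope, this sketch assumes the standard least squares result, the sum of cross-deviations divided by the sum of squared x-deviations. The function name and data are hypothetical.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (b0, b1) for the line yhat = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope, in y-units per x-unit
    b0 = y_bar - b1 * x_bar  # intercept, in y-units
    return b0, b1


# Hypothetical paired data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
print(round(b0, 2), round(b1, 2))
print(round(b0 + b1 * 3, 2))  # predicted value at x = 3, the mean of xs
```

Because of how the intercept is defined, the fitted line always passes through the point (xbar, ybar).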

data analysis

useful graphs scatterplot
can get a sense for the nature of the relationship
what to look for in a graph location, spread, shape, nature of relationship, outliers
location where most of the data lies
spread variability of the data, how far apart or close together it is
shape symmetric, skewed, etc.
nature of relationship existent/ non-existent
strong/ weak
increasing/ decreasing
linear/ non-linear
outliers in scatterplots represent some unexplainable anomalies in data
could reveal possible systematic structure worthy of investigation
causal relationship relationship between two variables where one variable causes changes to another
explanatory variable explains or causes the change
on x-axis
response variable is changed
on y-axis
useful numbers correlation and regression
formula for the correlation coefficient $r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
xi or yi axis values of corresponding letter
xbar or ybar mean of axis values of corresponding letter
sx or sy standard deviation of axis values of corresponding letter
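The correlation coefficient formula above translates almost term by term into code. A minimal sketch, assuming sample standard deviations (n - 1 in the denominator); the function name and data are hypothetical.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = 1/(n-1) * sum of (standardized x) * (standardized y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - x_bar) / sx) * ((y - y_bar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical paired data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(round(correlation(xs, ys), 4))
print(round(correlation([1, 2, 3], [6, 4, 2]), 4))  # points exactly on a falling line
```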
properties of r close to 1 = strong positive linear relationship
close to -1 = strong negative linear relationship
close to 0 = weak or non-existent linear relationship
cautions about the use of r only useful for describing linear relationships
sensitive to outliers
regression models general linear relationships between variables

focus negative = decrease

what regression modelling does describes behaviour of response variable (the variable of interest) in terms of a collection of predictors (related variables, i.e. explanatory variable(s))
a linear framework is used to look at what? the relationship between the response and the regressors
formula: $Y = \alpha + \beta x$
Where $\alpha$ is the intercept and $\beta$ is the slope
ideal model for linear framework in terms of responses and regressors one unique response to one given regressor
real world model for linear framework in terms of responses and regressors must approximate
statistical model relates response to physical model predictions
allows for better predictions and quantification of uncertainty concerning the response
to make decisions
what does regression analysis do? finds the best relationship between responses and regressors for a particular class of models
experimenter controls predictors, why? may be important for making inferences about the effect of predictors on response
course assumption predictors are controlled in an experiment or at least accurately measured
define a good statistical model fit, predictive performance, parsimony, interpretability
qualitative description of model response = signal + noise
$Y = \alpha + \beta x + \varepsilon$
$\varepsilon$ = noise
define signal a small number of unknown parameters
variation in response explained in terms of predictors
it is the systematic part of the model
define noise residual variation unexplained in the systematic part of the model
can be described in terms of unknown parameters
what does a good statistical model do to possibly large and complex data reduces it to a small number of parameters
a model will fit well if the systematic part of the model describes much of the variation in the response (low noise)
large number of parameters may be required to do this
define parsimony: smaller number of parameters = greater reduction of data, more useful for making a decision
there is a cycle between what? tentative model formulation, estimation of parameters and model criticism
a good model will manage balance between goodness of fit and complexity
provide a useful reduction of the data
model response variable in terms of a single predictor yn = values of the response variable

Chapter 4 Statistics

What is bivariate data? When values from two variables are collected from each individual in the population or sample, the set of data values is called bivariate data. (For example: height and weight collected from each GPC student)
What is the response variable in a bivariate set of data? The response variable is the dependent variable in the bivariate data set. It can be explained (at least in part) by the explanatory variable.
What is the explanatory or predictor variable in a bivariate set of data? The explanatory variable is the independent variable in the bivariate data set. Its value can be used to predict, though not perfectly, values of the dependent variable (the response variable).
What is a lurking variable? A lurking variable is a variable that is related to either the response variable or the predictor variable or both, but is not considered as part of the study. A lurking variable can lead to incorrect or misleading results in a study.
What is a scatter diagram? A scatter diagram (or scatter plot) is a graph of a set of points, which shows the relationship between two quantitative variables measured on the same individual with one point for each individual. The points are not connected.
Which variable is placed on which axis in a scatter diagram? The explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis.
Why are scatter diagrams useful? A scatter diagram can be used to indicate a relationship or lack of a relationship between the explanatory variable and the response variable. For example: Do the dots line up approximately along a straight line?
What is meant by a positive linear association between two variables? Two variables that are linearly related are said to be positively associated if whenever the value of the predictor variable increases, the value of the response variable also increases. (The dots lie approximately on a line that goes up from left to right.)
What is meant by a negative linear association between two variables? Two variables that are linearly related are said to be negatively associated if whenever the value of the predictor variable increases, the value of the response variable decreases. (The dots lie approximately on a line that goes down from left to right.)
What is the correlation coefficient in a bivariate study? The linear correlation coefficient is a measure of the strength and direction of a linear relationship between two quantitative variables. For a sample, it is often represented by the letter r.
What are the possible values of the correlation coefficient and what do these values indicate? The correlation coefficient can take values between -1 and 1, inclusive.
What do the values of the correlation coefficient, r, indicate? An r close to 1 shows a strong positive linear relationship between the variables. An r close to -1 indicates a strong negative linear relationship between the variables. An r close to 0 indicates that the variables are not linearly related.
What is the least-squares criterion? It requires the smallest possible sum of the squares of the differences between the predicted y values and the observed y values, that is, the smallest sum of the squares of the residuals.
What are the linear equations found using the least-squares criterion called? Linear equations found using the least-squares criterion are called linear regression equations.
What is the residual in a linear relationship? Residual = observed y - predicted y (Note the order.) The predicted y is found using the linear relationship.
How can you interpret the slope in a linear regression equation? The slope can be interpreted as the average rate of change of the response variable, y, with respect to the explanatory variable, x. Thus, when x increases by one unit, y will change by the amount of the slope.
How can you interpret the y-intercept in a linear regression equation? The y-intercept can be interpreted as the predicted value of the response variable when the predictor variable is zero.
Under what conditions does the interpretation of the y- intercept in a regression equation make sense? This makes sense only if the value of 0 for the explanatory variable makes sense and there is an observed value of the explanatory variable near 0. (Never make predictions too far from observed values.)
What does the coefficient of determination measure? The coefficient of determination, R², measures the percentage of the total variation in the response variable, y, which is explained by the least-squares regression associated with x.
What values can the coefficient of determination take on? The coefficient of determination can take values between 0 and 1, inclusive.
What do the possible values of the coefficient of determination, R², indicate about the linear regression equation? R² = 1 means that 100% of the variation in the response variable is predicted by the regression line. (The regression line fits the data exactly.) The closer R² is to zero the worse the regression line predicts the relationship between x and y.
For linear regression equations (but not all regression equations) how might you find the coefficient of determination, R²? For a linear regression model, we just need to square the correlation coefficient, r, to find the value of coefficient of determination, R².
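The last few answers can be checked numerically. The sketch below fits a least-squares line, computes the residuals (observed y minus predicted y), and compares R² found from the sums of squares with the square of r. The data and all names are hypothetical, and R² is computed here as 1 - SSE/SST, which equals SSR/SST for a least-squares line.

```python
from statistics import mean

# Hypothetical paired data, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y (SST)

b1 = sxy / sxx           # slope
b0 = y_bar - b1 * x_bar  # intercept

predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed - predicted

sse = sum(e ** 2 for e in residuals)  # variation not explained by the line
r_squared = 1 - sse / syy             # proportion of variation explained
r = sxy / (sxx * syy) ** 0.5          # correlation coefficient

print(round(r_squared, 4), round(r ** 2, 4))
print(round(sum(residuals), 10))
```

For a least-squares line the residuals sum to zero (up to rounding), and R² matches r² as the final answer above states.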