## Final Exam

Test of Statistical Significance Helps you decide whether an observed relationship b/w an IV and a DV exists in the population or whether it could have happened by chance.
Null Hypothesis (H0) States there is no relationship b/w the IV (X) and the DV (Y) in a population. Any relationship that may exist is by random sampling error or chance.
Research/Alternative Hypothesis (H1) States there is a relationship between the two variables (X and Y), not due to chance.
Type I Error Occurs when the researcher concludes there IS a relationship when, in fact, there is not one.
Type II Error Occurs when the researcher infers that there IS NOT a relationship when, in fact, there is one.
Correlation Analysis Produces a 'measure of association'; Known as Pearson's r; Gauges the direction and strength of a relationship b/w two interval-level variables.
Regression Analysis Produces a 'statistic' or 'regression coefficient'; Estimates the size of the effect of the IV on the DV.
Regression Line y = a + b(x)
Multicollinearity Occurs when the IV's are related to each other so strongly that it becomes difficult to estimate the partial effect of each IV on the DV.
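The regression-line card above (y = a + b(x)) and R squared can be sketched numerically. This is a minimal illustration with toy numbers invented for the example, not data from these notes:

```python
# Least-squares fit of y = a + b*x, plus R squared = SSR/SST (toy data).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Slope and intercept from the usual least-squares formulas
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# R squared: proportion of variance in y accountedted for by x
y_hat = [a + b * x for x in xs]
sst = sum((y - y_bar) ** 2 for y in ys)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
r_squared = ssr / sst
print(round(b, 2), round(a, 2), round(r_squared, 3))
```

For these toy points the slope comes out near 2 and R squared near 1, i.e. x explains almost all the variation in y.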

## Multiple Regression Special Topics

Statistic used to measure the proportion of variance accounted for by the nominal variable R squared
Statistic used to represent the relationship between the coded aspect of the nominal variable and Y r_yi
Statistic used to represent the relationship between the coded aspect of the nominal variable and the reference group pr_i and sr_i
Partial regression coefficients for dummy coded variables compares each group to the reference, or left out, group
Semi-partial correlations between y and dummy coded variable gives the importance, in terms of y variance, of the chosen group vs reference group distinction when controlling for covariates
Partial correlation between y and dummy coded variable point-biserial correlation between outcome variable (y) and the dichotomy formed by the chosen group vs reference group adjusted for any covariates
R squared SSR / SST
Hypotheses of multiple regression H0: R squared y.12 = 0
H1: R squared y.12 > 0
H0: Beta1 = Beta2 = . . . = 0
H1: not all betas are zero
Unstandardized regression coefficient b (the slope); SE(b) is its standard error
Standardized regression coefficient Beta
Three steps of interpreting interactions 1. Does an interaction effect exist?
2. What is the strength of the effect?
3. What is the nature of the relationship?
Bilinear moderated relationship Slope between y and x1 changes as a linear function of x2
interpretation of the partial regression coefficient for the interaction term in a multiple regression model change in slope between y and x1 for a one-unit increase in x2
What type of relationship does a product term interaction represent? A multiplicative one: x1 × x2
Does interaction term exist? Look at partial regression coefficient for the interaction term.
T = b/SE(b)
What is the strength of the effect? Check r square change when interaction term added or can square the semi-partial correlation
What is the nature of the effect? Plot the relationship between y and x1 for fixed values of x2 and interpret with 1-2 sentences
Is interaction term symmetric Yes – doesn't matter whether we designate x1 or x2 as the moderator variable. The inference is the same
VIF Variance Inflation Factor – represents how much the variances of the individual regression coefficients are inflated relative to when the IVs are not linearly related.
At what level is VIF a problem greater than 10
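As a sketch of the VIF idea: with exactly two IVs, each variable's VIF reduces to 1 / (1 − r²), where r is the correlation between the two predictors. The numbers below are toy data for illustration only:

```python
import math

# Two moderately correlated predictors (toy data)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 4, 5, 4, 5, 7]

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
r = cov / math.sqrt(sum((a - m1) ** 2 for a in x1) *
                    sum((b - m2) ** 2 for b in x2))

# VIF for either predictor in the two-IV case
vif = 1 / (1 - r ** 2)
print(round(vif, 3))  # values greater than 10 would signal a problem
```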
What type of variable selection or screening strategies are most useful when evaluating a set of potential predictor variables and interaction among the variables? Hierarchical selection strategies
Can interaction terms be interpreted w/o the main effect included in the model? No
What are the benefits of centering when an interaction term is present? Reduces intercorrelation, yielding more stable estimates of regression parameters, which reduces size of SE(b) and makes for a more powerful test of the interaction effect
Problems encountered when IVs are highly correlated 1. adding or deleting an IV changes reg coef
2. estimated se of reg coef become large, affecting stability of population estimates
3. individual reg coef may not be significant even though a statistical relationship exists b/w DV and a set of IVs
Centering a linear transformation used to help mitigate the effects of high correlation between multiplicative interaction terms and their constituent parts
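A quick numeric sketch of why centering helps: subtracting the means from x and z before forming the product term x·z lowers the correlation between a constituent variable and the interaction term. Toy data, invented for illustration:

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in xs) *
                           sum((b - my) ** 2 for b in ys))

x = [10, 11, 12, 13, 14, 15]
z = [3, 1, 4, 1, 5, 9]

raw_product = [a * b for a, b in zip(x, z)]          # uncentered x*z
xc = [a - sum(x) / len(x) for a in x]                # centered x
zc = [b - sum(z) / len(z) for b in z]                # centered z
centered_product = [a * b for a, b in zip(xc, zc)]   # centered interaction

r_raw = abs(pearson_r(x, raw_product))
r_centered = abs(pearson_r(xc, centered_product))
print(round(r_raw, 3), round(r_centered, 3))  # centering lowers the correlation
```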
Problems involved with detecting interaction effects in the multiple regression model 1. Measurement Error
2. Mis-specification of functional form of interaction
3. Levels of measurement
4. Statistical power
Formal diagnostics of multicollinearity VIF and Tolerance
Dealing with multicollinearity in multiple regression models 1. Centering works for interactions
2. Drop one or more IVs (but can cause problems with model specification)
3. Add cases to break patterns
4. Create a construct, or index, variable
Assumptions of Repeated Measures ANOVA 1. Normally distributed, continuous DV
2. Same number of repeated measures per subject
3. Same or fixed timing of measurements across subjects
4. Sphericity
Sphericity refers to the equality of covariances and is tested by evaluating the variances of differences between all possible pairs of scores
Benefits of Generalized Estimating Equations (GEEs) for repeated measures 1. Allows for discrete or continuous DV
2. Allows subjects to have varying numbers and timing of the repeated measurements
3. Allows a variety of covariance structures to be considered
Why is Multiple Regression better than ANCOVA when you add covariates? In ANCOVA you have to manually check for a treatment × covariate interaction, and if one exists, ANCOVA cannot be used.
Ordinal vs. disordinal interactions Ordinal – lines never cross
Disordinal – lines cross each other
When interaction term is present, how does it change the interpretation of the betas? When you have a higher order term (interaction), lower order terms (main effects) are evaluated at 0 (instead of means).
Non-essential collinearity refers to correlation b/w interaction term and its main effects (can be corrected w/centering)
Benefits of hierarchical variable entry 1. Causal priority including temporal order can be acknowledged.
2. control for confounding and test complex relationships b/w variables (mediation and moderation)
Types of automatic variable entry Forward
Backward
Step-wise
Limitations of automatic variable entry 1. Order of variable entry doesn't necessarily reflect the importance of the variable
2. assumes a "single" best subset of variables
3. Theoretically important variables can be removed b/c of correlation w/other variables
Assumptions of GEE 1. Clusters are independent
2. Observations within a subject (ID) are correlated

## Multiple Regression – P264B

Regression partitions y into what two components? SST = SSE + SSR
SS Total (total variation in y) = SS Regression + SS Error
What is the interpretation of the slope (b1) coefficient? For a 1-unit increase in x, y increases by (b) units.
Unstandardized regression coefficient Slope (b)
Why are standardized regression coefficients useful?

It is useful to standardize the regression coefficients (change b to Beta) so the effects of variables measured on different scales can be compared.
Standardized regression coefficient Beta
Coefficient of determination

R2 = SSR/SST gives the proportion of variance in y accounted for by x.
Only pertains to linear association
Why is the sum of the squared residuals, or errors, a minimum in OLS regression?

Using a least squares procedure guarantees that b0 and b1 produce the estimates of y with the smallest possible sum of squared residuals.
What is the standard error of estimate (SEE)?

SEE is standard deviation of the regression; Average distance of any point to the regression line
Also called root mean square error (b/c it is the square root of MSE)
How is SEE calculated? SEE = s_y/x = √(SSE / (n − p))
When correlation between x & y=1, SEE=0 (all points are on the line).
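The SEE cards can be checked numerically; a small sketch with toy data and p = 2 estimated parameters (intercept and slope):

```python
import math

xs = [1, 2, 3, 4]
ys = [2.0, 4.1, 5.9, 8.0]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)
mse = sse / (n - 2)       # df = n - p, with p = 2
see = math.sqrt(mse)      # SEE = root mean square error
print(round(see, 4))
```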
How does SEE differ from the standard error of the slope (regression coefficient)? SE(b) measures the precision of the slope estimate
SEE measures scatter about the regression line
What is the impact of non-constant error variance on the MSE? Non-constant error variance should increase the MSE
MSE = SSE/df
What is one impact of an inflated (large) MSE?

Reduces predictive power
Reduces coefficient of determination (R2)
What departures from OLS regression can be studied using residual plots?

1. Non-linear regression function
2. Non-constant variance (heteroscedasticity)
3. Correlated error terms (not independent)
4. Distribution of error terms not normal
5. Omitted an important IV from the model
6. Outliers
Departures from OLS regression ACRONYM Homoscedasticity
Independence
Linearity
Outliers
Omitted variable
Normal distribution
How is a residual (error term) calculated?

observed value – expected value based on regression equation; y – y hat
How is a standardized residual calculated?

Z = ei / SEE
(SEE = square root MSE, MSE = SSE/df)
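A minimal sketch of standardized residuals, using made-up residuals from a fit with two estimated parameters:

```python
import math

residuals = [-0.03, 0.09, -0.09, 0.03]   # toy residuals, y - y_hat
df = len(residuals) - 2                   # n - p, with p = 2
mse = sum(e ** 2 for e in residuals) / df
see = math.sqrt(mse)                      # SEE = sqrt(MSE)

z = [e / see for e in residuals]          # standardized residuals
print([round(v, 2) for v in z])
```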
What are some of the drawbacks of using ZRESID to identify “unusual” cases?

Z doesn’t account for unusual cases of x; such cases “mask” their own effects by inflating SSE (and MSE)
What three dimensions are used to characterize atypical or unusual observations?

Leverage, Discrepancy, and Influence
Leverage represents how unusual the case is in terms of its x value (extreme in predictor set)
Discrepancy how unusual a value of y is for a given value of x (conditioned on x)
Influence how much of an impact each individual observation has on the global regression analysis (DFFITS) and on estimates of the regression coefficients (DFBETA)
Conceptually, what is an externally studentized residual (SDRESID)?

It calculates the residual for a point based on that point not being included in the MSE to determine how “unusual” this case is compared to the rest of the data set
Also known as jackknifed residual, studentized deleted residual
Durbin-Watson test used to check for serial correlations;
Plot residuals against time; there must be no relationship among the residuals for time
D=2 means no serial correlation (range is 0 to 4)
Is Durbin-Watson useful for all types of designs having non-constant error variance?

No, only those where collection is spread over time or if time of collection is a factor
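The Durbin–Watson statistic itself is simple to compute from time-ordered residuals; a sketch with toy residuals that alternate in sign (which pushes D above 2, i.e. toward negative serial correlation):

```python
# D = sum over t of (e_t - e_{t-1})^2, divided by the sum of e_t^2
e = [0.5, -0.3, 0.4, -0.6, 0.2, -0.1]   # toy residuals in time order
num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
den = sum(v ** 2 for v in e)
d = num / den
print(round(d, 3))  # D near 2 would mean no serial correlation
```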
A plot of ZRESID against the IV is useful for studying which types of departures? Discrepancy
What can a plot of the residuals against a variable not included in the regression equation tell us? If we omitted a key variable (model specification)
It runs as a covariate and tells us what part of the error is associated with that variable. Including it would therefore reduce MSE.
How might one diagnose problems with non-constant error variance? Use residual plots
Will the value of the standardized residual be large for all types of outlying observations? No, the standardized residual would NOT be large for leveraged outliers.
Name a residual diagnostic that can be used to detect outlying x values.

Leverage (Hii)
What is discrepancy and how is it measured? Discrepancy is how far y is from predicted value of y for a given value of x
It is measured by comparing ZRESID and Studentized deleted residual
What are the two components of influence and what residual diagnostics are used to reflect those two components?

Influence is how much the point moves the line.
DFFITS measures influence on y (the whole regression equation); global
DFBETA (x) measures influence on the slope (DFBETA for the constant is less important); specific
What measure tells us how much the group of independent variables together estimate y?

Multiple R2
What are the limitations of R2 when used to compare between different studies?

It does not separate variables to determine the individual contribution of each variable, controlling for the others in the model
What measure tells us about the contribution of a single IV to estimating y when other variables are included in the regression equation? Semi-partial correlation (must square to explain variance)
How are these descriptive measures interpreted? Controlling for other variables, x1 accounts for n% of the variation in y.
Why are the regression coefficients in a multiple regression equation called “partial”?

Because each coefficient reflects the effect of one IV with the variation shared with the other IVs “partialled out” (held constant)
Interpret a regression coefficient in a multiple regression model?

Controlling for other variables, for every one-unit increase in x1, there is a n-unit increase in y.
What hypotheses are tested in the ANOVA summary table of a multiple regression model? H0: R2y.123…p = 0
H1: R2y.123…p > 0
H0: B1= B2= B3= . . . Bp= 0
H1: not all betas are zero
If the F-statistic is significant, will all of the individual regression coefficients be significant?

Not necessarily, depends on the beta for each variable
What test determines significance of individual regression coefficients?

T-test
What are effects of collinearity on regression?

1. Affects estimates of partial regression coefficients
2. Affects size of SE(b)
3. Makes interpretations more complex b/c estimate of effect depends upon variables included
4. When extreme, there is no unique solution to the regression problem.
What is the term for extreme cases of inter-correlation among the IVs? Multicollinearity
What factors determine the size of the standard error of a regression coefficient? 1. Specification issues – IV omitted, model doesn’t explain enough variance
2. Restricted range of x – not providing enough variation in x to show full range of y
3. Inter-correlation – high correlation > low tolerance > small denominator > big SE
semi partial correlation

Increase in R2 when x1 is added to an equation containing x2 or the percentage of variance in Y uniquely accounted for by x1 because all other variables have been statistically controlled.
How is the semi-partial correlation interpreted? Controlling for other variables, x1 accounts for n% of variation in y.
partial correlation

Correlation between y and x1 when linear effects of other variables are removed from x1 and y.
How is the partial correlation interpreted? When the variation of other variables is removed, x1 accounts for n% of variation in y.

## Formula Quiz 1

individual an object described by a set of data
variable a characteristic of an individual
quantitative variable a variable that takes on a numerical value that can be measured
quantitative data values of quantitative variables
categorical/qualitative variable a variable that places an individual into a category
distribution indicates what values a variable takes on and the frequency at which it takes these values
graphs of qual. pie chart, bar chart
bar chart qual.
pie chart qual.
graphs of quan. dotplot, histogram, stemplot
dotplot quan.
histogram quan.
stemplot quan.
outlier an individual observation that falls outside the overall pattern of the graph
relative frequency histogram has the same shape as a histogram with the exception that the vertical axis measures relative frequencies instead of frequencies
key features of a histogram the center (mean, median), the spread (range), the shape
shapes of a graph. symmetric, skewed left, skewed right
measures of center sample mean, mode, median
sample mean arithmetic average or arithmetic mean
mode element or elements that occur most often
median "the middle number"/average of two middle numbers
median position formula (n+1)/2 (n=number of numbers in the data set)
mean=median when… distribution is perfectly symmetric
when it is skewed right the mean is dragged to the right
when it is skewed left the mean is dragged to the left
measures of spread range, iqr, five number summary, the variance and the sample standard deviation
range largest #- smallest #
iqr IQR= Q3-Q1
five number summary min. q1 med q3 max
variance sum of (xi − x̄)² divided by n − 1
standard deviation (equation) square root of the variance: √(Σ(xi − x̄)² / (n − 1))
standard deviation (definition) the st. dev. is a measure of how spread out the numbers are from the mean
xi-xbar a deviation of xi from the mean
the sum of all the deviations from the mean always equals 0
st. dev. is …. to outliers nonresistant (is affected by)
n-1 is.. degrees of freedom
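The spread formulas above can be checked with Python's standard statistics module, which uses the same n − 1 (degrees of freedom) convention for the sample variance. Toy data:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(data)
var = statistics.variance(data)   # sum of squared deviations / (n - 1)
sd = statistics.stdev(data)       # square root of the variance

deviations = [x - mean for x in data]
print(var, sd, round(sum(deviations), 10))  # deviations from the mean sum to 0
```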
a datapoint is an outlier if… it lies more than 1.5 IQRs below Q1 or above Q3
boxplot is a graph which displays five num summary of a set of data
modified boxplot a graph that displays the five number summary of a data set (tests for outliers)
side-by-side boxplots can be used to compare the distributions of two data sets
within one standard deviation of the mean 68% of the data will fall
two sample standard deviations from the mean about 95% of the data will fall
three standard deviations from the mean about 99.7% of the data will fall
z-score measures how far a data point lies from the mean (using standard deviations as the unit)
equation for z-score z = (x − x̄)/s
sample mean of a z-score is 0
the sample st. dev. of the z-scores is 1
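The two z-score facts above can be verified directly (toy data):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
m, s = statistics.mean(data), statistics.stdev(data)
z = [(x - m) / s for x in data]   # z = (x - xbar) / s

# Sample mean of the z-scores is 0; sample st. dev. is 1
print(round(statistics.mean(z), 10), round(statistics.stdev(z), 10))
```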
cumulative frequency is the number of observations less than or equal to a given number
cumulative relative frequency cumulative frequency divided by the total number of observations
empirical distribution function is a graph of the cumulative relative frequency vs. the raw data in the sample
a density curve a curve that always lies on or above the horizontal axis and has an area of exactly 1 underneath
median of a density curve is the point that divides the area under the curve in half
mean of a density curve the point at which the curve would balance if it was made of a solid material
the standard normal distribution is..(mean/st. dev.) a normal distribution with mean 0 and standard deviation 1
conversion formula is used to convert normal distribution values to standard normal distribution values
conversion formula (actual form) z = (x − μ)/σ
what does a z-score measure the number of standard deviations between an observation x and the mean μ of the data set
normal quantile plot graphs raw data (horizontal) versus their z-score (y-axis)
a data set is approximately normal when its quantile plot is approximately linear
independent variable x is the explanatory variable
dependent variable y response variable
directions of scatterplots positive association, negative association or neither
scatterplots are analyzed according to: direction, form, strength of relationship, and outliers
correlation coefficient measures the direction and strength of the linear relationship between two quantitative variables
formula for r r = (1/(n − 1)) × the sum of [(xi − x̄)/sx] × [(yi − ȳ)/sy]
the correlation coefficient r is always a number between -1 and 1
if r is positive then x and y have a positive association
if r=1 then x and y have a perfect positive correlation
if r is negative then x and y have a negative association
if r=-1 then x and y have a perfect negative correlation
least squares regression line is the equation.. of the line that makes the sum of the squares of the residuals as small as possible
equation for the LSQR ŷ = b0 + b1x
b0 is.. the y intercept
b1 is… the slope of the line
equation for b1 b1 = r(sy/sx)
equation for b0 b0 = ȳ − b1x̄
ybar the mean of the y coordinates
x bar is the mean of the x coordinates
the difference between y and yhat is called an error or a residual
residual is the observed value of y minus the predicted value of y (y − ŷ)
the point xbar, ybar… is a point on every regression line
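The slope/intercept formulas above can be sketched in a few lines; note the fitted line passes through (x̄, ȳ) exactly. Toy data:

```python
import statistics

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

xbar, ybar = statistics.mean(xs), statistics.mean(ys)
sx, sy = statistics.stdev(xs), statistics.stdev(ys)
n = len(xs)
r = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

b1 = r * (sy / sx)        # slope
b0 = ybar - b1 * xbar     # intercept

# (xbar, ybar) lies on the regression line, so this difference is 0
print(round(b0 + b1 * xbar - ybar, 10))
```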
rsquared is called the coefficient of determination
rsquared measures the proportion of variation in y that is explained by y's linear association with x
a residual plot graphs.. the residuals on the vertical axis and either the explanatory, response or predicted response values on the horizontal
residuals from a LSQR always have a mean of 0
the horizontal axis of a residual plot corresponds to the regression line
an observation is influential if removing it would markedly change the position of the regression line

## Statistical Methods Flashcards 1

range length of the smallest interval which contains all the data
interquartile range (IQR) difference between the third and first quartiles
Variance amount of variation of all the scores for a variable: s² = Σ(x − x̄)²/N
Standard Deviation square root of variance
Normal Distribution two-thirds of cases within one standard deviation of the mean
approximately 95% of cases within two standard deviations of the mean
Deviation x-x(mean)
Experimental research design a scientific control is used to minimize the unintended influence of other variables
Random assignment of subjects
Quasi-experimental design Experimental method but without random assignment
When is quasi experimental design used? when randomization is impossible and/or impractical
Advantages of quasi experimental design Easier
minimizes threats to external validity
efficient in longitudinal research
Disadvantages of quasi experimental design Threats to internal validity
Causal relationships difficult to determine
Confounding variables
Internal validity extent to which we can accurately state that the independent variable produced the observed effect
Threats to internal validity regression to the mean
confounding variables
extraneous variable occurring between pre- and post-measurement
maturation
instrumentation error
investigator bias
differential attrition
External validity Generalizability
– to or across target populations
– to or across environments
Threats to external validity situation or environment
Hawthorne effect (testing effect)
Rosenthal effect
Selection bias
pre and post test effects

Hawthorne effect People act differently when they know they are being tested
Non-experimental design No control group
Observational
Used for theory development
cross tabulation A table of the frequency distribution of two or more variables
Null hypothesis hypothesis of no difference
Something happens by chance or that no difference exists between populations
Chi-square (χ²) compares a set of frequencies expected if the null hypothesis is true (fe) against a set of frequencies observed in a sample (fo)
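The chi-square statistic from that card is just a sum over cells; a sketch with toy observed and expected frequencies:

```python
observed = [18, 22, 30, 30]           # fo, toy counts
expected = [25.0, 25.0, 25.0, 25.0]   # fe, expected under the null

# chi-square = sum of (fo - fe)^2 / fe over all cells
chi_sq = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
print(chi_sq)
```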
Geometric mean used for data based on ratios, proportionate growth, percentage change.
differential attrition extent of subjects who drop out of a study
AKA mortality
reliability consistency of observations
not the same as validity (could be consistently false)
reliability is necessary BEFORE validity can be established
Social desirability effect Subjects respond in a way that appears favorable to the tester
Advantages of interviews higher response rate
more lengthy and detailed
more complex
more flexible
protectivity (confidential)
less reactivity
reactivity act of measuring changes responses
Cohort study Compare group of people who share a certain characteristic (smokers) with unexposed group(non-smokers).
Experimental trial Exposed v. unexposed groups in trial setting (drug v. placebo)
Case control study Historical,
Subjects already have a condition, study looks back to see if there are characteristics of these patients that differ from those who don’t have the disease
Odds Ratio Ratio of the odds of an event in one group to the odds in another, where the odds are the probability that something will occur divided by the probability it will not occur.
Kurtosis Measure of peakedness in frequency distributions
platykurtic lower, wider peak around the mean
leptokurtic a more acute peak around the mean (bunching toward the mean)
skewness a measure of the asymmetry of the probability distribution
positive skew Tail to the right is longer
negative skew Tail to the left is longer

## third exam with ellis. April 2011

A Type 1 error is the result of
A research article reports results of a study using a test for dependent means as "t(38) = 3.11, p < .01." This tells you:
When conducting a test for independent means, a typical research hypothesis might be:
How do you set up a hypothesis testing problem?
Other names for the test for dependent means includes all of the following EXCEPT:
Place the five steps of the hypothesis-testing process in the correct order:
The main idea of a chi-square test is that you:
When a conducting a test for independent means:
If you know the sample's variance but not the population's variance
A researcher tests

## CH 1

Statistics Is the science of data
Two branches of statistics? Descriptive statistics,Inferential statistics
Descriptive Statistics Consists of the collection, organization, summarization, and presentation of data
Inferential Statistics Consists of generalizing from samples to populations by:
Performing estimations and hypothesis tests, determining relationships among variables, making predictions
Uses probability (the chance of an event occurring)
Population Consists of all subjects (human or otherwise) that are being studied.
Sample Is a group of subjects selected from a population.
Variable Is a characteristic or attribute that can assume different values.
Data are the values (measurements or observations) that the variables can assume.
Random variable variables whose values are determined by chance.
Types of Data Variables can be classified as qualitative or quantitative
Qualitative variables that can be placed into distinct categories, according to some characteristic or attribute
Quantitative variables that are numerical and can be ordered or ranked
Qualitative Variables Categorical, Can be further classified into two groups, nominal and ordinal
Nominal Data data can be classified into mutually exclusive (non-overlapping) categories in which no order or ranking can be imposed.
Ordinal data can be classified into categories that can be ranked or ordered
Quantitative Variables Numerical,Can be further classified into two groups, discrete and continuous.
Discrete variables can be assigned values such as 0, 1, 2, 3 and are said to be countable. Obtained by counting
Continuous variables can assume an infinite number of values in an interval between two specific values. Obtained by measuring
inferential statistics goal learn about a population by using a sample
unbiased samples Each subject in the population has an equally likely chance of being selected.
Observational study the researcher observes what is happening (or what has happened in the past) and tries to draw conclusions based on these observations.
Experimental study the researcher manipulates one of the variables and tries to determine how the manipulation influences other variables.
independent variable one that is being manipulated by the researcher
dependent variable the outcome

## Analysis of Variance

Inflation of Type I Error Each t-test has a type I error of alpha, however over a series of multiple t-tests the overall type I error does not stay at alpha but grows w/ each new comparison.
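With k independent tests each run at alpha, the overall (familywise) type I error rate is 1 − (1 − alpha)^k; a quick sketch of how fast it grows:

```python
alpha = 0.05
# Familywise type I error rate for k independent tests at level alpha
rates = {k: 1 - (1 - alpha) ** k for k in (1, 3, 10)}
print({k: round(p, 3) for k, p in rates.items()})
```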
Analysis of Variance (ANOVA) Compares multiple pop. means that avoids the inflation of type I error rates. Test one null hypothesis to determine if any of the pop. mean differ from the rest. Type I error remains at alpha. Use w/ interval/ratio data.
One-Factor ANOVA "One factor" means that b/w all groups, there is only one source of variation investigated.
Independent Sample ANOVA Assumptions Population: all pops. are normally dist., all pops. have same variance. Sampling: samples are independent of one another, each sample obtained by SRS from pop.
Mean Square Within/Error Pool all the sample variances. When the sample sizes are the same the pooled variance is the average of the individual sample variances. DF=N-k N:total # of obs. k:# diff. pop. Good estimate of pop. variance when null is correct & incorrect.
Mean Square Among/Treatment Based on the sample means. DF=k-1. Good estimate of pop. variance only when null is correct.
Two Main Points MSW is based on variability w/n each sample. MSA only good when all sample means can be considered from same pop. If null is true, MSA & MSW should be similar & ratio should be close to 1.
Partitioning the Sum of Squares SSA:among groups sum of squares df=k-1. SSW:w/n groups sum of squares df=N-k. SST:total sum of squares SSW+SSA df=N-1.
One-Factor ANOVA for Dependent Samples Goal is to decide if there is a difference among the dependent pop. means. Randomized complete block (two-way) & repeated measures. MST has k-1 df & MSE has N-k-b+1 df. SSB has b-1 df. SST has N-1 df.
Dependent Samples Assumptions Population: normally dist. pops., variances are same, scores in all pairs of pops. should have same degree of relationship. Sample: samples are dependent, observations come from SRS from pop.
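The SSA/SSW/SST partition and the F ratio MSA/MSW can be sketched directly for three toy groups (numbers invented for illustration):

```python
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]   # toy samples from k = 3 groups
all_obs = [x for g in groups for x in g]
grand_mean = sum(all_obs) / len(all_obs)

# Among-groups, within-groups, and total sums of squares
ssa = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
sst = sum((x - grand_mean) ** 2 for x in all_obs)

k, n = len(groups), len(all_obs)
msa, msw = ssa / (k - 1), ssw / (n - k)   # df = k - 1 and N - k
f = msa / msw
print(round(sst - (ssa + ssw), 10), round(f, 2))  # SST = SSA + SSW
```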

## Test 2 Notes

What is a Statistical Inference? A procedure by which we use information from a sample, that is drawn from a population, to reach a conclusion about the population.
What is Estimation? Estimation uses sample data to calculate a statistic that is an approximation of the parameter of the population from which the sample was drawn.
What is the Sampled Population? The population from which you draw your sample.
What is Target Population? The population you wish to make an inference about; the population you wish to generalize your results to.
Why is Estimation useful? Workers in the health sciences field are often interested in parameters, such as proportions or means, of different populations. It is usually not feasible (due to cost and/or time limitations) to sample the entire population even if it is finite.
A ___________ is a single numerical value used to estimate the corresponding population parameter Point estimate
An ___________ consists of a range of values (with a lower and upper bound) constructed to have a specific probability (the confidence) of including the population parameter. Interval Estimate
An __________ is the single value computed. The __________ is the rule that tells us how to compute the estimate estimate, estimator
An estimate, T, is said to be an _______ of a parameter θ if the expected value of the estimator (T) equals θ: E(T) = θ unbiased estimator
One criteria for picking the best estimator is the property of what? Unbiasedness
Unbiased estimates of their corresponding parameters: difference between two sample means, sample proportion, difference between two sample proportions
What is this: In repeated sampling from a normally distributed population with a known standard deviation, 100(1 − α) percent of all intervals of the form will in the long run include the population mean μ Probabilistic Interpretation
What is this: When sampling is from a normally distributed population with a known standard deviation, we are 100(1 − α) percent confident that the single computed interval contains the population mean μ. Practical Interpretation
The ______________ is the quantity obtained by multiplying the reliability coefficient by the standard error of the mean. This quantity is also called the _______________ . precision of the estimate, margin of error
You cannot always assume the population is normally distributed. However, the ________ tells us that for large samples, the sampling dist. of xbar is approximately normally distributed regardless of the distribution of the individuals in the population. Central Limit Theorem
It is almost always the case that if you don’t know your population mean, u, (which is why we would use this estimation procedure), then you also don’t know your ______________ . population variance
The number of ________ for a statistic equals the number of observations minus the number of components in its calculation that need to be estimated. degrees of freedom (df)
How do you know if the population variances are equal? If the larger sample variance is more than twice as large as the smaller sample variance, treat the population variances as unequal and use the unequal-variance form of the CI around the difference between two population means.
n*p > 5 and n*(1-p) > 5, we can consider the sampling distribution of p-hat to be close to the ______________. normal distribution
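The rule of thumb on the card above can be written as a small helper (hypothetical numbers chosen to show one failing and one passing case):

```python
# Rule of thumb for using the normal approximation to the sampling
# distribution of p-hat: both n*p and n*(1-p) must exceed 5.
def normal_approx_ok(n, p):
    """True when both expected success and failure counts exceed 5."""
    return n * p > 5 and n * (1 - p) > 5

print(normal_approx_ok(100, 0.04))  # 100*0.04 = 4  -> False
print(normal_approx_ok(200, 0.04))  # 200*0.04 = 8, 200*0.96 = 192 -> True
```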
True or False If we fail to reject the null hypothesis then we conclude that the null hypothesis is true. False
In a hypothesis test, one way to reject the null hypothesis is to see if the p-value is less than or equal to ________. alpha (α)
When estimating the population mean, we use the ____distribution when the population variance is known and the _____ distribution when the population variance is unknown. z, t
The ______gives us the probability associated with obtaining the computed test statistic or one more extreme if the null hypothesis is true. p-value
True or False: Holding everything else constant, a 99% confidence interval is wider than a 95% confidence interval. True
The probability of rejecting a null when it is actually true is called _______; this is a Type ___ error. alpha (α), Type I
As the sample size ________ the standard error of the estimate decreases. increases
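The card above reflects the standard error formula σ/√n; a minimal sketch (σ = 10 is just an assumed population standard deviation):

```python
import math

# Standard error of the mean: sigma / sqrt(n).
sigma = 10
for n in (25, 100, 400):
    print(n, sigma / math.sqrt(n))  # 2.0, 1.0, 0.5 — quadrupling n halves the SE
```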
True or False: When evaluating a given hypothesis, a confidence interval and hypothesis test on the same data won't always give you the same conclusion. True
The ____ hypothesis is always a statement of equality. Null
Power is the term for the probability of: Rejecting a null hypothesis when it is actually false.
True or False: Using the same data, a p-value from a two-tailed test is larger than a p-value based on a one-tailed test. True
Holding everything else constant, as the sample size increases, the width of the confidence interval: Decreases
What is the general form of a CI? estimator ± (reliability coefficient × standard error); the quantity (reliability coefficient × standard error) is the margin of error
When calculating a confidence interval for the population mean and the population variance is unknown you use ____________ table. t
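The general CI form above can be sketched for the known-variance case, where the reliability coefficient comes from the z table (hypothetical numbers: x̄ = 50, σ = 8, n = 64; the unknown-variance case would swap in a t critical value, which the standard library does not provide):

```python
import math
from statistics import NormalDist

# estimator +/- reliability coefficient * standard error
xbar, sigma, n = 50.0, 8.0, 64
conf = 0.95

z = NormalDist().inv_cdf((1 + conf) / 2)    # reliability coefficient, ~1.96
margin_of_error = z * sigma / math.sqrt(n)  # reliability coef * standard error
print((xbar - margin_of_error, xbar + margin_of_error))
```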
What can we assume when p < alpha? That the variances are unequal, and we reject the null (of equal variances)
What can we assume when p > alpha? That the variances are equal, and we fail to reject the null
When you see an equal sign what can you assume? That it is a two-tailed test; with a one-tailed table you double the tail probability to get the p-value (equivalently, α is split in half across the two tails)
When you see a greater than or less than sign what can you assume? That the table is a one tailed test.
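The one-tailed vs. two-tailed relationship in the two cards above can be checked directly for a z statistic (z = 1.8 is a hypothetical test statistic):

```python
from statistics import NormalDist

# Two-tailed p-value is exactly double the one-tailed p-value for a z test.
z = 1.8
one_tailed = 1 - NormalDist().cdf(z)
two_tailed = 2 * one_tailed
print(round(one_tailed, 4), round(two_tailed, 4))
```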

## Vocab Review

science of collecting, organizing, summarizing, analyzing and making inferences from data statistics
includes collection, organization, summarizing, graphical displays descriptive statistics
includes making inferences, hypothesis testing, determining relationships, making predictions inferential statistics
the values or measurements that variables describing an event can assume data
values that are numeric quantitative data
data values that can be placed into distinct categories according to some characteristic or attribute qualitative data
assume values that can be counted discrete variables
variables that can assume all values between any two given values continuous variables
consists of all subjects that are being studied population
subset of a population sample
sample of an entire population census
characteristic or a fact of a population parameter
a characteristic or a fact of a sample statistic
an organization of raw data into tabular form using classes (or intervals) and frequencies frequency distribution
number of times the value occurs in the data set frequency count
represent data that can be placed in specific categories, such as gender, hair color, or religious affiliation categorical frequency distributions
simply lists the data values with the corresponding number of times or frequency count with which each value occurs ungrouped frequency distribution
obtained by dividing the frequency for that class by the total number of observations relative frequency
sum of the frequencies for all values at or below the given value cumulative frequency
sum of the relative frequencies for all values at or below the given value cumulative relative frequency
obtained by constructing classes (or intervals) for the data and then listing the corresponding number of values (frequency count) in each interval grouped frequency distribution
a plot that displays a dot for each value in a data set along a number line dot plot
a graph that uses vertical or horizontal bars to represent the frequencies of a category in a data set bar chart / bar graph
a graphical display of a frequency or a relative frequency distribution that uses classes and vertical bars (rectangles) of various heights to represent the frequencies histogram
a graph that displays the data using lines to connect points plotted for the frequencies. The frequencies represent the heights of the vertical bars in the histogram. frequency polygon
a data plot that uses part of a data value as the stem to form groups or classes and part of the data value as the leaf. stem and leaf plot
displays data that are observed over a given period of time time-series graph
a circle that is divided into slices according to the percentage of the data values in each category pie chart
a type of bar chart in which the horizontal axis represents categories of interest, ordered by decreasing frequency pareto chart
the average of the set of values mean
the numerical value in the middle when the data set is arranged in order median
the most frequently occurring value in the data set mode
most of the data values fall to the left of the mean, and the tail of the distribution is to the right. The mean is to the right of the median, and the mode is to the left of the median. positively skewed
most of the data values fall to the right of the mean, and the tail is to the left. Mean is to the left of the median, and the mode is to the right. negatively skewed
data values are evenly distributed on both sides of the mean. When the distribution is unimodal, the mean, median and mode are all equal to one another and are located at the center of the distribution. symmetrical distribution
the difference between the maximum and minimum data values, is affected by outliers range
the difference between the first and third quartiles. It kicks out the extremes-a nice feature for highly skewed data. interquartile range
the average of the absolute deviations from the mean mean deviation
almost the average of the squared deviations of the data from the mean variance
most commonly used statistical tool to monitor and control the quality of goods and services, such as consistency in delivery times; positive square root of the variance standard deviation
the relative amount of dispersion in a data set. Used to compare data that use different units. coefficient of variation
about 68% of the data is within 1SD of the mean, about 95% of the data is within 2SD of the mean, about 99.7% of the data is within 3SD of the mean Empirical Rule
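The Empirical Rule above can be checked against the exact normal probabilities P(|Z| < k) for k = 1, 2, 3:

```python
from statistics import NormalDist

# Exact probabilities behind the 68-95-99.7 rule.
Z = NormalDist()
for k in (1, 2, 3):
    print(k, round(Z.cdf(k) - Z.cdf(-k), 4))  # ~0.6827, ~0.9545, ~0.9973
```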
the mean is less than the median is less than the mode left-skewed distribution
the mode is less than the median is less than the mean right-skewed distribution
Used to compare two or more data sets; tells us how many SD a specific value is above or below the mean value of the data set z score
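The z score card above can be sketched by comparing values from two data sets with different scales (the exam scores below are hypothetical):

```python
import statistics

def z_score(x, data):
    """How many sample standard deviations x lies above or below the mean."""
    return (x - statistics.fmean(data)) / statistics.stdev(data)

test_a = [70, 75, 80, 85, 90]  # mean 80
test_b = [40, 50, 60, 70, 80]  # mean 60
# An 88 on test A vs. a 70 on test B: which sits more SDs above its own mean?
print(z_score(88, test_a), z_score(70, test_b))
```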
numerical values that divide an ordered data set into 100 groups of values with at most 1 percent of the data values in each group percentiles
a graphical display that involves a five number summary of a distribution of values consisting of the minimum value, the lower quartile, the median, the upper quartile, and the maximum value box plot
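The five-number summary behind a box plot can be computed directly (hypothetical data; `statistics.quantiles` uses the exclusive interpolation method by default):

```python
import statistics

# Minimum, lower quartile, median, upper quartile, maximum.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]
q1, median, q3 = statistics.quantiles(data, n=4)
five_number = (min(data), q1, median, q3, max(data))
print(five_number)
```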
the values of the dependent variable are along the vertical axis, and the values of the independent variable are along the horizontal axis scatter plot
a statistical relationship between two variables correlation
named data nominal
ordered data ordinal
uniformly spaced values with no natural zero interval
uniformly spaced values with a natural zero ratio
simple random sample (SRS) ideal method (all members of a population have an equally likely chance to be represented in the sample–no intentional bias).
convenience sample gather data in the easiest way possible
cluster sample divide the population into clusters and randomly select from the clusters
stratified sample divide the population into at least two different strata each with a shared characteristic-gender, age group and then sample from these strata
systematic sample from some beginning data value we select every nth data value