
Section A:

The levels of measurement used can be factors in
determining what measures are used in the descriptive analysis of the data.
Each measure has its merits and can be used effectively if the proper
criteria are followed. For example, when working with nominal data, since there are
no clearly defined intervals we cannot obtain a mean that would provide logical
data. However, the mode, which represents the category with the largest volume
of samples in the data, is still relevant to nominal data and can give us pertinent
information, in this case which option or category most people in a sample
choose. In order to effectively obtain a median, which is the value that separates
the lower half of the data from the upper half, the data must be ordered in a meaningful
way, where one category ranks above another. Thus the median, which is not useful
for nominal data, is a sufficient descriptive tool for ordinal data, as a way of
finding the “positional” average of a sample. Going further into interval and
ratio data, the mean of the data now becomes available due to the clearly
defined equal intervals and the ability to do arithmetic on the data. When
categories are equidistant, comparing them numerically becomes possible and
averaging them out yields the mean. The mean is a common tool of measuring an
average in data sets and is crucial in obtaining further data like the standard
deviation, which gives us a more complete analysis. It is important to note
that we can use options besides the default measures for each level of measurement.
If presenting the median for ratio data offers an important insight into a
trend, it should also be mentioned. There are also cases of instability in
these measures: the mode of a set of data can become unstable when two or more
categories are equally common. When data has a distribution that is
concentrated in both tails and relatively low in the center (a bathtub
distribution), the median becomes susceptible to drastic changes from
proportionally insignificant alterations in the tails. Lastly, the mean is also
unstable in the presence of extreme outliers; it can be skewed considerably by
just a few samples lying unusually far from the rest of the data. To avoid
using unstable means, taking trimmed means or using another measure such as the
median is often the preferred solution. One last important distinction is the
option of sometimes treating ordinal data as interval in order to obtain the
mean. This can be useful, but it only works if the distances between
categories can be estimated to be roughly the same. This is another reason why
determining the level of measurement is crucial, as it presents us with which
tools can be used to provide an accurate analysis as well as some alternatives
in cases of instability. When considering measures of dispersion for nominal sets
of data, the index of diversity tells us how likely it is that two cases drawn
at random from a sample will come from different categories. As an example, if
a random survey was taken of a class of 40 about their country of origin and
the IOD was 0.723, we would move the decimal over and say that there is a 72.3%
chance that two students drawn at random will be from different countries. The
index of qualitative variation is a similar measure for nominal data but it
holds itself to a different scale, between 0 and 1. For most small samples of
data the two measures will differ, but as samples become larger the two values
begin to converge. Neither measure ranks categories against each other, and
hence order is not important when obtaining these measures.
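As a sketch of how these two measures relate, the following Python snippet (with made-up survey counts, and assuming the common formulas IOD = 1 − Σp² and IQV = IOD rescaled by k/(k − 1), which may be stated slightly differently in the textbook) computes both for a small nominal sample:

```python
from collections import Counter

def index_of_diversity(cases):
    """Probability that two cases drawn at random (with replacement)
    fall into different categories: 1 minus the sum of squared proportions."""
    n = len(cases)
    props = [count / n for count in Counter(cases).values()]
    return 1 - sum(p * p for p in props)

def iqv(cases):
    """Index of qualitative variation: the IOD rescaled so its maximum
    (all k categories equally common) is exactly 1."""
    k = len(set(cases))
    return index_of_diversity(cases) * k / (k - 1)

# Hypothetical class of 40 students by country of origin.
sample = ["CA"] * 20 + ["US"] * 10 + ["FR"] * 10
print(round(index_of_diversity(sample), 3))  # 0.625
print(round(iqv(sample), 3))                 # 0.938
```

Note how the IQV comes out higher than the IOD for the same sample: the rescaling by k/(k − 1) is what pins its maximum at 1.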
It should also be noted that both IOD and IQV do not take into consideration
the number of categories present, and thus a third measure, entropy, is useful:
because it codes categories as strings of 1s and 0s, it keeps track of how many
categories are used in total when counting the bits it requires. Entropy itself measures “the number
of bits of information we need, on average, to eliminate uncertainty about
where cases are found” 39. The interquartile and interdecile ranges are similar
measures, typically used for ordinal data; they provide ranges in quartiles
(fourths) or deciles (tenths) to show where samples of data are concentrated.
They also help focus on key areas such as the centre of the data, without being
heavily affected by outliers. The Median Absolute Deviation is
also a sufficient tool when working with ordinal data, as it is resistant to
extreme values and provides a good substitute for standard deviations when they
are unstable. The MAD is typically used in analyzing the spread of a
distribution around its centre. Lastly, the standard deviation is the measure used for interval
and ratio data. It has many advantages compared to other measures: it considers
all cases, and it is expressed in meaningful units that can be compared
across samples. The standard deviation is, however, susceptible to extreme
outliers. Measures of association deal with reducing errors in our predictions,
based on data sets. For example, to predict people’s voting preference based on
ethnicity, we would use lambda. If the result was, say, 0.28, that would mean we could reduce
our error in predicting voting preference by ethnicity by 28%. If we used the
gamma measure and it turned out to be .34, we would then say we can reduce our
errors in predicting the direction of pairs by 34%. The Somers’ d measure would
also give us a similar answer; however, some consider it more accurate, as it assesses
the independent variable in all cases it might influence. Finally, the Q value
tells us the percentage by which we can “reduce our errors in guessing whether pairs
favour one form of association (or the other) if we know how the variables are
linked” 145.
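The outlier-resistance of the MAD mentioned earlier can be illustrated with a small Python sketch (hypothetical values; assuming the MAD is defined as the median of absolute deviations from the sample median):

```python
import statistics

def mad(values):
    """Median absolute deviation: the median distance of each value
    from the sample median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

data = [10, 11, 12, 13, 14]
with_outlier = data + [500]  # one extreme case added

print(statistics.stdev(data))          # ~1.58
print(statistics.stdev(with_outlier))  # ~199 -- the SD blows up
print(mad(data))                       # 1
print(mad(with_outlier))               # 1.5 -- the MAD barely moves
```

A single extreme case multiplies the standard deviation by more than a hundred, while the MAD shifts only slightly, which is why it is the suggested substitute when the SD is unstable.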

PRE measures help us reduce errors by determining
the way two variables are linked, which then allows us to predict more accurately
the outcome of the data. Understanding the relationships between variables will
help us make more correct predictions in their outcomes and thus lowering how
many errors we make. If we wanted to predict the party voted for by a person’s
ethnicity, we could obtain a lambda between the two variables. Hypothetically,
if this lambda came out to be 0.43, we could reduce our errors by 43% in
predicting party voted for based on a person’s ethnicity if we knew how the
variables are linked. The gamma measure focuses on the concentrations of
concordant and discordant pairs and compares them. Gamma
reduces our error in predicting the direction of pairs, if the link between
variables is known. Somers’ d is a modified version of gamma, taking into
consideration the variable T, “representing the pairs of cases tied on the
dependent variable, but not the independent” 137. Lastly, there is the measure Q,
which does not count concordant or discordant pairs but instead measures whether
pairs favour one type of association or the other. It reduces our errors in the
sense that we can predict which type of association more pairs will favour, if
we know how the two variables are linked.
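As a sketch of the lambda calculation described above (hypothetical counts; lambda here is computed as (E1 − E2)/E1, the proportional reduction in errors from predicting the modal category within each row instead of overall):

```python
# Hypothetical crosstab: rows = ethnicity (IV), columns = party voted for (DV).
table = {
    "Group A": {"Party X": 40, "Party Y": 10},
    "Group B": {"Party X": 15, "Party Y": 35},
}

# E1: errors made predicting the overall modal party for everyone.
totals = {}
for row in table.values():
    for party, count in row.items():
        totals[party] = totals.get(party, 0) + count
n_cases = sum(totals.values())
e1 = n_cases - max(totals.values())

# E2: errors made predicting the modal party within each ethnicity.
e2 = sum(sum(row.values()) - max(row.values()) for row in table.values())

lam = (e1 - e2) / e1
print(round(lam, 3))  # 0.444 -> knowing ethnicity cuts prediction errors by ~44%
```

With these made-up numbers, guessing the overall mode produces 45 errors, while guessing the mode within each group produces 25, giving the proportional reduction in error.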

Conditional tables are tables used in analyzing the
relationships between variables when there may be hidden variables present that
affect the data but are not initially known and need to be tested. Most are
used to see whether a third variable is linked with the original independent and
dependent variables. As a broad example, imagine a study looking at the link
between alcohol consumption and income. We notice there is a trend between the
two variables, that as alcoholism becomes a greater problem, income goes down.
Now it may be reasonable to say that people who drink constantly do not earn as much,
but that alone does not tell you why. By incorporating another variable (test
factor) and tables showing its association with the original variables, we may
be able to gain new insight on the relationships between the three, a process
called “specification”. Say, in our example, we started looking at education
as our third variable, and found out those who have no education are more prone
to be alcoholics and in turn make less money. This could suggest a spurious
relationship between the variables, both possibly being caused by a lack of
education. Examples like this, where many factors could be hidden, require
conditional tables to clearly demonstrate the effect third variables may have,
as well as to possibly bring to light some causal links. In conditional
tables we see that by controlling for a third variable, we may see a change in
how the relationships between variables are represented, and that the correlation
between the variables can possibly be made more accurate by analyzing hidden
variables. In our example, we may find that the better the education someone
has, the higher their income and the lower their chance at consuming high
concentrations of alcohol on average. We may also find a distorted relationship,
one where “the original table shows an effect in the opposite direction to what
we obtain after controlling for a third variable” 189.
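A small Python sketch of what conditional tables do (the counts are entirely hypothetical, chosen so that the zero-order income/drinking association disappears within education strata, as in a spurious relationship):

```python
from collections import Counter

# Hypothetical cases: (education, income, heavy_drinker) with their counts.
counts = Counter({
    ("low ed",  "low income",  True): 20, ("low ed",  "low income",  False): 20,
    ("low ed",  "high income", True): 5,  ("low ed",  "high income", False): 5,
    ("high ed", "low income",  True): 1,  ("high ed", "low income",  False): 9,
    ("high ed", "high income", True): 4,  ("high ed", "high income", False): 36,
})

def pct_heavy(income, education=None):
    """% of heavy drinkers at an income level, optionally within one
    education stratum (i.e. within a single conditional table)."""
    cells = [(key, n) for key, n in counts.items()
             if key[1] == income and (education is None or key[0] == education)]
    total = sum(n for _, n in cells)
    heavy = sum(n for key, n in cells if key[2])
    return 100 * heavy / total

print(pct_heavy("low income"))             # 42.0  (zero-order table: big gap)
print(pct_heavy("high income"))            # 18.0
print(pct_heavy("low income", "low ed"))   # 50.0  (conditional tables:
print(pct_heavy("high income", "low ed"))  # 50.0   the gap vanishes within
print(pct_heavy("low income", "high ed"))  # 10.0   each education stratum)
print(pct_heavy("high income", "high ed")) # 10.0
```

In the zero-order table, low income looks strongly tied to heavy drinking (42% vs 18%), but within each education stratum the rates are identical, which is exactly the pattern of a spurious relationship driven by education.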

Section B:

Graphs are mainly used as visual representations of
numerical data; they organize and present data in more readily digestible
images as opposed to rows of numbers. Using graphs is advantageous when trying
to show links between variables, trends or skews in curves, or when you want to
compare categories visually side by side. They provide an easy and effective
summation of data that might involve countless samples, and they can be
presented in multiple ways. Commonly used graphs like bar charts or histograms are
very concise and organized when representing categorical data. The three types
of variables shown in bar charts or histograms are discrete variables, which
have distinct columns; continuous variables, drawn as lines that can take on any
value within their range; and discrete-continuous variables, which have distinct
categories that follow along an ordered range. In these types of graphs the
mode is usually the easiest to find: it is simply the highest bar among the
categories. The median and mean cannot be seen as simply, but their general area
can be assumed through the concentration of data in the middle of the graphs.
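As a small illustration in Python (hypothetical values), the three measures of central tendency can be located in a right-skewed sample, where the long tail pulls the mean above the median:

```python
import statistics

# Hypothetical positively skewed sample (a long right tail).
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(sample)      # 2   -- the tallest bar in a histogram
median = statistics.median(sample)  # 3.0 -- the positional middle
mean = statistics.mean(sample)      # 4.5 -- dragged rightward by the tail

assert mode < median < mean  # the typical ordering under a positive skew
```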
In single-peaked continuous curves, if there is a positive skew the
order of the measures of central tendency is usually mode, median, then mean,
whereas if it is negatively skewed, it is the opposite order. The boxplot is
another visual tool used to represent data: it is a representation of the
five-number summary of a data set and contains the minimum, the first quartile,
the median, the third quartile, and the maximum. The box starts at the first
quartile and stretches up to the third quartile; it has a thick bold black line
through the centre where the median is, and usually has two lines (whiskers)
stretching out from it, up to the maximum on the upper half and down to the
minimum on the lower half. The boxplot also typically contains
circles on the outer ends signifying outliers. An advantage of boxplots is that
they can be readily compared to other boxplots: similarities or differences can
be spotted easily just by seeing whether their medians line up, or whether their
minimums and maximums differ greatly, making them an efficient tool when analyzing multiple
sets of data. Back-to-back histograms are a style of graph made for direct
comparison. Containing two sets of data, they allow us to compare the proportions of
both samples side by side according to the percentage scale on the left-hand side of
the graph, which is very useful when comparing groups like males and females.
Population pyramids are a similar type of graph, but they usually measure ages
with both sets of data on either side of a center column showing the age
ranges. For ratio or interval level data, scatterplots can be used to show how
closely correlated two variables are at every point, which is the visual
representation of r. If, for example, a set of data has an r of 1, the
variables are perfectly positively correlated, whereas if they have a
correlation of -1, the two variables are perfectly negatively correlated (one
gets higher as the other gets lower). There are also cases where r = 0, which
simply means there is no linear correlation between the two variables. Boxplots can also be used if we want to
show some significant data, but do not need to show every value of the
variables. We can also use line graphs to chart specific relationships between
variables, for example to see the average change in y when x changes by 3. Even
bar charts can be used, since they can present mean scores through the heights
of the bars. We can also use association plots to display data obtained from
crosstabulations, where importance is placed on the cases within cells. Association
plots show heavy cells (shaded), which have observed values greater than
expected, and light cells, which have expected values greater than observed.
Association plots, unsurprisingly, are used to determine the association
between two variables, and more specifically the strength of that association,
by using standardized residuals. If a value greater than plus or minus 2 SRs is
obtained, then there is evidence of an association between the two variables. Mosaic plots are another
type of visual representation of the association between two variables. Cases are
represented by rectangles arranged by rows and columns, whose sizes are
proportional to the number of samples in their category (the more samples in a
category, the larger the rectangle). The standardized residuals in mosaic plots
are read through the patterns given in the key. Mosaic plots also present
extreme values in an easily read way, with SRs greater than + or – 2. Lastly,
the double decker graph is a plot whose bar widths represent the
subcategories, while the dependent variable is the scale that measures the
height of the bars. The unique feature of the double decker is that it allows
for more than one independent variable at the bottom of the graph, showing the
changes across categories as different independent variables are considered
and highlighting the associations between multiple variables. We can also obtain
more specific data on how variables are related, as in how a unit of change in
x affects y. The double decker can show spurious relations or
distorted relations depending on which independent variables are looked at,
which can be advantageous in a situation where one independent variable does
not provide a complete picture of what is presented in the data. Scatterplots
are also used for regression results, with trend lines going through the
centroid. With multiple regression we can obtain the effect of each predictor
with the others being controlled. In multiple regression, we can even compare
the combined effect of all the independent variables with the effects of each
individual one.
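The standardized residuals used in association and mosaic plots can be sketched in Python (hypothetical 2×2 counts; the SR is computed here as (observed − expected)/√expected, with expected counts taken from the row and column totals under independence):

```python
import math

# Hypothetical 2x2 crosstab of observed counts (rows x columns).
observed = [[30, 10],
            [15, 45]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

srs = {}
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence, then the standardized residual.
        expected = row_totals[i] * col_totals[j] / n
        srs[(i, j)] = (obs - expected) / math.sqrt(expected)

for cell, sr in sorted(srs.items()):
    shading = "heavy" if sr > 0 else "light"   # observed above vs below expected
    flag = "beyond +/-2" if abs(sr) > 2 else "within +/-2"
    print(f"cell {cell}: SR = {sr:+.2f} ({shading}, {flag})")
```

With these made-up counts, every cell lands beyond the ±2 threshold, which an association or mosaic plot would show as strongly shaded heavy and light cells.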


Section C:

(An important clarification: when referring to standard
deviations of sampling distributions, I know it is more correct to call them
standard errors. I am aware of the difference, and I hope I will not be
penalized for referring to them simply as standard deviations.)

The four common sampling distributions are the normal,
the t, the chi-square, and the f distribution. The normal distribution is the
most commonly used one; it is continuous, unimodal, and symmetric, and the
standard normal curve has a mean of 0 and a standard deviation of 1, although
other normal curves can differ. The normal distribution is also mesokurtic,
meaning its peak is of moderate height, neither unusually flat nor unusually
sharp. The Central Limit Theorem tells us that “any sampling distribution of a
mean for any variable with a definable SD, will be normal if the sample size is
large enough” 90, meaning we can
identify the proportion of samples that lie within certain ranges away from the
mean, assuming that distributions are large enough to be considered normal and
they also have a definable standard deviation. Using that information, we can
determine that 68% of all cases lie within 1 standard deviation of the mean,
or 95% lie within 2 SDs of the mean. For smaller distributions we typically use
the t distribution, which is also unimodal and symmetric but has heavier tails
and a lower peak. However, as sample size increases, t distributions begin to
resemble normal distributions closely. T distributions also have degrees of
freedom which become important in assessing p-values and how significant specific
data is. P-values determine the point at which values display significant signs
of association between variables. The standard chi-square distribution will be
unimodal, with a peak at 0, a right-skewed shape, a mean of 1, and a variance
of 2. The chi-square distribution becomes more normal as its
degrees of freedom increase; that is to say, its peak grows, its skew
diminishes, and it becomes more symmetric as df goes up. The three
distributions also have another relation to each other, “the t distribution
results from dividing the normal distribution by a chi square” 101. Chi-square
distributions are efficient for two-way crosstabulations because they keep
track of the independent rows and columns used. Finally, the f distribution is
continuous and positively skewed, but as the degrees of freedom for both of its
parameters become higher, f distributions become more symmetric. The f
distribution has two typical uses, the first is significance testing for
comparing means in ANOVA, and the second is checking the significance of a set
of predictors in regression. ANOVA is usually used to compare the means of
subgroups against a null hypothesis that there is no difference. If the means
are sufficiently different, then the null will be abandoned. The f distribution shows us which results should be
treated as significant in ANOVA and regression.
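The 68% and 95% figures quoted earlier can be checked directly with Python's standard normal distribution (the area between −1 and +1, and between −2 and +2, standard deviations of the mean):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

within_1sd = z.cdf(1) - z.cdf(-1)
within_2sd = z.cdf(2) - z.cdf(-2)

print(round(within_1sd, 3))  # 0.683 -> the "68%" figure
print(round(within_2sd, 3))  # 0.954 -> the "95%" figure (1.96 SDs gives 95% exactly)
```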

The Bayesian approach to predicting probabilities is
one that asks the direct question of “how likely is it that the true difference
between groups, or the association between variables, is such-and-such, or lies
in a such-and-such range” 116, instead of the standard model of statistical
inference, which starts with a null hypothesis stating there is no
difference. The Bayesian method starts out with a set of prior probabilities:
estimates of how likely certain outcomes are, based on past information. These priors are
the first bit of information used in setting up the analysis; then, given the
data we obtain ourselves, we build a set of posterior probabilities. Posterior
probabilities are the more closely examined of the two sets, because they
incorporate the actual data, which can, for example, swamp out irrelevant
personal prior probabilities that are not significant given the data.
With the revised probabilities, “an estimate of the likely difference between
groups, or the likely association between variables” 126 can be obtained, as
well as saying “how far from our best estimates the truth is likely to be”
126. The range we estimate the true value is likely to be in is known as the
credible interval; it refers to the area in a distribution that our real figure
lies in, as well as how likely that is. For example, if we take the central 90% of our
posterior, then we could say that, given the priors, it is 90% likely the true
figure lies somewhere in that range. This is unlike the confidence interval
used in the standard approach, which would refer to 90% of all samples lying
within that range. The main methodological difference between the standard approach of
inference and the Bayesian method is that the Bayesian method uses prior
probabilities to build up information, which it then tests and revises against
the data, while the standard approach will build a null hypothesis that says
there will be no difference and then try to either support or reject it.
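A minimal sketch of a Bayesian update in Python (hypothetical priors and data: a discrete set of candidate proportions with flat priors and a binomial likelihood, to show the data reshaping the priors into posteriors):

```python
from math import comb

# Hypothetical setup: three candidate values for a true proportion, with flat
# priors, updated after observing 7 successes in 10 trials.
hypotheses = [0.3, 0.5, 0.7]
priors = [1 / 3, 1 / 3, 1 / 3]
successes, trials = 7, 10

def likelihood(p):
    """Binomial probability of the observed data given proportion p."""
    return comb(trials, successes) * p**successes * (1 - p)**(trials - successes)

# Bayes' rule: posterior is proportional to prior times likelihood.
unnormalized = [prior * likelihood(p) for p, prior in zip(hypotheses, priors)]
total = sum(unnormalized)
posteriors = [u / total for u in unnormalized]

for p, post in zip(hypotheses, posteriors):
    print(f"P(true proportion = {p} | data) = {post:.3f}")
```

After the update, nearly all the probability has shifted onto 0.7, the hypothesis closest to the observed 7-in-10 rate, illustrating how the data can swamp a flat prior.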

Section D:

In regression analysis when the link between an
independent variable (IV) and a dependent variable (DV) is not linear we must analyze
the data and choose which option is most suitable for each case. When dealing
with a trend line that flattens out, we can truncate the independent variable
where it begins to flatten: the flat categories are collapsed (summed up into a
single new category), so that the remaining trend becomes linear. Another case
is when we encounter exponential curves, which arise when the dependent
variable rises or falls at a rate that itself changes with the IV. Some
dependent variables increase by a constant percentage rate as time passes;
these graphs can be made linear by taking the logarithm of the DV. Another case, one where the DV
may have a changing slope in different parts of the graph, is also
possible; in this case a second b value is usually taken, for our linear
equation, at the point where the slope changes.
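A sketch of the logarithm trick in Python (hypothetical exponential data; fitting an ordinary least-squares line to the logged values recovers the growth rate):

```python
import math

# Hypothetical exponential trend: y = 2 * 1.5^x (constant percentage growth).
xs = list(range(6))
ys = [2 * 1.5**x for x in xs]

log_ys = [math.log(y) for y in ys]  # taking logs makes the trend linear in x

# Ordinary least-squares slope and intercept on the logged data.
n = len(xs)
mx = sum(xs) / n
my = sum(log_ys) / n
b = sum((x - mx) * (v - my) for x, v in zip(xs, log_ys)) / \
    sum((x - mx)**2 for x in xs)
a = my - b * mx

print(round(b, 4))            # 0.4055 -> equals log(1.5), the growth rate
print(round(math.exp(a), 4))  # 2.0    -> back-transformed starting value
```

Because the made-up data is exactly exponential, the fitted slope equals log(1.5) and the back-transformed intercept recovers the starting value of 2; real data would scatter around such a line.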

In multiple regressions, to estimate the effect of
an IV with the others held constant, we typically use software that conducts matrix
manipulations. These manipulations regress each independent variable onto the
others to create residuals that are independent of the other independent variables.
The residuals represent “the part of the variation in x1 that is independent of
X2” 226. We become liable to obtain biased coefficients when we leave out
relevant predictors, because if the omitted IV is associated with other IVs
in the equation, the portion of their variation that is correlated with the
missing variable will be used in predicting y. To prevent a biased result, all
relevant predictors must be included because the residuals that are produced
for the dependent variable are affected by the other independent variables. We
must also try to avoid irrelevant predictors, as they will skew our estimates. An
irrelevant predictor’s presence will cause a reduction in the residuals of the
predictors it is correlated with, which will lead to a reduction in the amount
of information used to predict the dependent variable, thus producing inaccurate
estimates.
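A minimal sketch of the residualizing step in Python (entirely hypothetical data; this is the Frisch–Waugh idea that regressing y on the residuals of x1, after removing x2, recovers x1's coefficient with x2 held constant):

```python
# Hypothetical data: y = 2*x1 + 3*x2 exactly, with x1 and x2 correlated.
x1 = [1, 2, 3, 4, 5]
x2 = [1, 1, 2, 2, 3]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (v - my) for x, v in zip(xs, ys)) / \
        sum((x - mx)**2 for x in xs)

# Step 1: regress x1 on x2 and keep the residuals -- the part of x1's
# variation that is independent of x2.
b12 = ols_slope(x2, x1)
a12 = sum(x1) / len(x1) - b12 * sum(x2) / len(x2)
resid = [a - (a12 + b12 * b) for a, b in zip(x1, x2)]

# Step 2: regressing y on those residuals recovers x1's coefficient
# with x2 held constant.
print(round(ols_slope(resid, y), 6))  # 2.0
```

Even though x1 and x2 move together, the residualized regression returns exactly the 2.0 that x1 contributes on its own, which is what the software's matrix manipulations accomplish for every predictor at once.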