Friday, February 26, 2016

Using PROC ANOVA For One-Way Analysis of Variance

This analysis uses two variables from the Gapminder codebook: female employment rate (femaleemployrate) and urban rate. The output was generated using SAS Studio.


Hypothesis formulation
Based on a review of the research, more women have joined the labor force as a result of urbanization, but they have encountered cultural and social obstacles that have negatively affected their employment rates. This analysis examines the relationship between urbanization and the female employment rate in the Gapminder dataset. Hypothesis: Higher urbanization rates are associated with lower female employment rates.

Summary of findings:
  • The “Class Level Information” table lists the variables that appear in the CLASS statement, their levels, and the number of observations in the data set. UrbanizationRateGroup has 6 levels; 213 observations were read and 173 were used (observations with a missing value for either variable are excluded).
  • The model degrees of freedom (DF) for the one-way analysis of variance are the number of levels minus one; in this case, 6 – 1 = 5.
  • The Corrected Total degrees of freedom are the number of observations used minus one; in this case, 173 – 1 = 172. The Model and Error degrees of freedom sum to the Corrected Total (5 + 167 = 172).
  • The overall F test is significant (F = 6.29; p < 0.0001), indicating that the model as a whole accounts for a significant portion of the variability in the female employ rate.
  • The F test for the urbanization rate groups is significant, indicating that some contrast between the means for the different urbanization rate groups is different from zero. The null hypothesis that all group means are equal is rejected.
  • The Model and Urbanization Rate Groups F tests are identical, since “UrbanizationRateGroup” is the only term in the model.
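The arithmetic behind these bullets can be checked with a short DATA step. This is a minimal sketch rather than part of the original analysis: the variable names (ss_model, ss_error, and so on) are made up for illustration, and the sums of squares are copied from the ANOVA table in the Results section.

```sas
/* Sketch: recompute the ANOVA summary statistics from the reported sums of squares. */
/* All variable names here are illustrative assumptions, not part of the original.   */
DATA _NULL_;
    ss_model = 5948.15795;    /* Model sum of squares, from the output */
    ss_error = 31599.31337;   /* Error sum of squares, from the output */
    n_levels = 6;             /* levels of UrbanizationRateGroup       */
    n_used   = 173;           /* observations used                     */

    df_model = n_levels - 1;          /* 6 - 1   */
    df_total = n_used - 1;            /* 173 - 1 */
    df_error = df_total - df_model;   /* remaining degrees of freedom */

    ms_model = ss_model / df_model;   /* mean square = SS / DF */
    ms_error = ss_error / df_error;
    f_value  = ms_model / ms_error;   /* overall F statistic */

    /* Upper-tail probability of the F distribution */
    p_value  = 1 - PROBF(f_value, df_model, df_error);

    /* Share of total variability accounted for by the model */
    r_square = ss_model / (ss_model + ss_error);

    PUT df_model= df_error= df_total=;
    PUT ms_model= ms_error= f_value= p_value= r_square=;
RUN;
```

The values written to the log should agree, within rounding, with the DF, Mean Square, F Value, and R-Square columns of the output.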
Further analysis:
  • The F test for Urbanization Rate Groups (F = 6.29; p < 0.0001) suggests that the mean female employ rate differs among the urbanization rate groups, but it does not reveal the nature of the differences. A means comparison method, Tukey’s multiple comparisons test, is used to examine the pairwise differences between the urbanization rate group means.
  • Significant differences in the female employ rate are observed between:
    • Urbanization rate group 1 and groups 6, 4, and 5
    • Urbanization rate group 2 and groups 6, 4, and 5
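A comparison is flagged as significant when its simultaneous 95% confidence interval excludes zero. The sketch below illustrates that rule on four rows copied from the Tukey table in the Results section. The data set name tukey_check and its variable names are hypothetical. The second step assumes that the PROBMC function with the 'RANGE' distribution returns the same studentized range quantile that the procedure reports as its critical value, which should be verified against the output.

```sas
/* Sketch: flag comparisons whose confidence interval excludes zero. */
/* tukey_check and its variables are illustrative names only.        */
DATA tukey_check;
    INPUT comparison $ 1-5 diff lower upper;
    /* 1 when both limits lie on the same side of zero, otherwise 0 */
    significant = (lower > 0 OR upper < 0);
    DATALINES;
1 - 2 13.615 -7.742 34.972
1 - 6 25.280  4.459 46.102
2 - 3  6.847 -3.893 17.586
2 - 5 13.922  4.030 23.815
;
RUN;

PROC PRINT DATA=tukey_check;
RUN;

/* Assumption: studentized range quantile for alpha = 0.05, */
/* 167 error degrees of freedom, and 6 group means.         */
DATA _NULL_;
    q_crit = PROBMC('RANGE', ., 0.95, 167, 6);
    PUT q_crit=;
RUN;
```

In the printed data set, the rows for 1 - 6 and 2 - 5 have significant = 1, matching the *** markers in the output, while 1 - 2 and 2 - 3 have significant = 0.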
Code:


/** USING PROC ANOVA FOR ONE-WAY ANALYSIS OF VARIANCE **/
/** PROC ANOVA is used when the independent variable is categorical and the dependent variable is continuous.
    Independent variable: UrbanizationRateGroup (categorized in the Binning/Grouping assignment)
    Dependent variable: femaleemployrate (continuous variable)
**/
PROC ANOVA  DATA=NEW;
CLASS UrbanizationRateGroup; /** defines the categorical variable **/
MODEL femaleemployrate = UrbanizationRateGroup; /** defines the dependent variable  & effects **/  
RUN;


/** Using the TUKEY option of the MEANS statement to further understand differences between the levels of the categorical variable **/
PROC ANOVA  DATA=NEW;
CLASS UrbanizationRateGroup; 
MODEL femaleemployrate = UrbanizationRateGroup; 
MEANS UrbanizationRateGroup / TUKEY;    /** studentized range test **/
RUN;
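The Tukey output reports differences between group means but not the means themselves. As a supplementary sketch that is not part of the original analysis, PROC MEANS can list the number of observations, mean, and standard deviation of femaleemployrate for each urbanization rate group, which shows the direction of the differences. It assumes the same NEW data set used above.

```sas
/* Sketch: descriptive statistics per group, assuming the NEW data set from above. */
PROC MEANS DATA=NEW N MEAN STD;
    CLASS UrbanizationRateGroup;   /* one row of statistics per group */
    VAR femaleemployrate;          /* dependent variable              */
RUN;
```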



Results:

Figure: Urbanization Rate Vertical Bar Chart


The ANOVA Procedure



Class Level Information

Class                    Levels    Values
UrbanizationRateGroup    6         1 2 3 4 5 6

Number of Observations Read    213
Number of Observations Used    173


The ANOVA Procedure
Dependent Variable: femaleemployrate



Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5        5948.15795     1189.63159       6.29    <.0001
Error             167       31599.31337      189.21745
Corrected Total   172       37547.47132

R-Square    Coeff Var    Root MSE    femaleemployrate Mean
0.158417    28.80604     13.75563    47.75260

Source                  DF       Anova SS    Mean Square    F Value    Pr > F
UrbanizationRateGrou     5    5948.157954    1189.631591       6.29    <.0001

Figure: Distribution of femaleemployrate by UrbanizationRateGroup


The ANOVA Procedure
Tukey's Studentized Range (HSD) Test for femaleemployrate
Note: This test controls the Type I experimentwise error rate.

Alpha                                  0.05
Error Degrees of Freedom               167
Error Mean Square                      189.2174
Critical Value of Studentized Range    4.07731

Comparisons significant at the 0.05 level are indicated by ***.
UrbanizationRateGroup    Difference       Simultaneous 95%
Comparison               Between Means    Confidence Limits
1 - 2                     13.615           -7.742    34.972
1 - 3                     20.462           -0.648    41.572
1 - 6                     25.280            4.459    46.102    ***
1 - 4                     26.208            5.098    47.318    ***
1 - 5                     27.537            6.845    48.229    ***
2 - 1                    -13.615          -34.972     7.742
2 - 3                      6.847           -3.893    17.586
2 - 6                     11.665            1.504    21.826    ***
2 - 4                     12.593            1.854    23.333    ***
2 - 5                     13.922            4.030    23.815    ***
3 - 1                    -20.462          -41.572     0.648
3 - 2                     -6.847          -17.586     3.893
3 - 6                      4.818           -4.813    14.449
3 - 4                      5.747           -4.493    15.987
3 - 5                      7.076           -2.272    16.423
6 - 1                    -25.280          -46.102    -4.459    ***
6 - 2                    -11.665          -21.826    -1.504    ***
6 - 3                     -4.818          -14.449     4.813
6 - 4                      0.928           -8.703    10.559
6 - 5                      2.257           -6.419    10.934
4 - 1                    -26.208          -47.318    -5.098    ***
4 - 2                    -12.593          -23.333    -1.854    ***
4 - 3                     -5.747          -15.987     4.493
4 - 6                     -0.928          -10.559     8.703
4 - 5                      1.329           -8.019    10.677
5 - 1                    -27.537          -48.229    -6.845    ***
5 - 2                    -13.922          -23.815    -4.030    ***
5 - 3                     -7.076          -16.423     2.272
5 - 6                     -2.257          -10.934     6.419
5 - 4                     -1.329          -10.677     8.019