Saturday, March 19, 2016

Testing Statistical Interaction

This analysis uses two variables from the Gapminder data set: female employment rate (femaleemployrate) and urban rate (urbanrate). The output was generated using SAS Studio.


Hypothesis formulation
Based on a review of the literature, it appears that while urbanization has brought more women into the labor force, they have encountered cultural and social obstacles that have negatively affected their employment rates. It would be interesting to study the relationship between urbanization and the female employment rate in the Gapminder data set. Hypothesis: Higher urbanization rates lead to lower female employment rates.

Testing Statistical Interaction or Moderation In The Context of ANOVA
The goal of this exercise is to determine whether the association between the explanatory variable (urbanization rate) and the response variable (female employment rate) differs across the levels of a third, categorical variable, in this case income per person. The 'income per person' values were divided into 6 levels. The urbanization rate values were grouped into 6 levels in a previous assignment.

/** Grouping income per person **/
DATA NEW; /** Creates the output data set NEW **/
SET          work.newdata;  /** Reads observations from the input data set **/
KEEP        femaleemployrate employrate urbanrate incomeperperson 
         FemaleEmploymentRateGroup EmploymentRateGroup 
         UrbanizationRateGroup IncomePerPersonGroup; 

/** Missing incomeperperson values are left ungrouped (blank) **/
if (incomeperperson ^= . & incomeperperson <= 800) then
IncomePerPersonGroup = "1";
else if (incomeperperson > 800 & incomeperperson <= 2000) then
IncomePerPersonGroup = "2";
else if (incomeperperson > 2000 & incomeperperson <= 8000) then
IncomePerPersonGroup = "3";
else if (incomeperperson > 8000 & incomeperperson <= 18000) then
IncomePerPersonGroup = "4";
else if (incomeperperson > 18000 & incomeperperson <= 24000) then
IncomePerPersonGroup = "5";
else if (incomeperperson > 24000) then
IncomePerPersonGroup ="6";
RUN;
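
Before running the grouped tests, it can help to confirm how many countries fall into each income level. The following is a minimal sketch, assuming the NEW data set created above; the MISSING option is added here so that countries with no income value are counted as their own row instead of being dropped silently.

```sas
/* Sketch: count observations in each income group.            */
/* MISSING shows countries with a blank IncomePerPersonGroup.  */
PROC FREQ DATA=NEW;
    TABLES IncomePerPersonGroup / MISSING;
RUN;
```

Small groups are worth noting, because a test run within a level that has only a handful of countries has little power to detect an association.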

An ANOVA was then run separately for each level of IncomePerPersonGroup to test for moderation:

/** TESTING MODERATION IN THE CONTEXT OF ANOVA **/
PROC SORT DATA=NEW;
   BY IncomePerPersonGroup;
RUN;

PROC ANOVA  DATA=NEW;
          CLASS UrbanizationRateGroup; 
          MODEL femaleemployrate = UrbanizationRateGroup;
          MEANS UrbanizationRateGroup;
          BY IncomePerPersonGroup;
RUN;


Results:
The results show an inverse association between the female employment rate and the urbanization rate in the first (lowest) income level, where income per person is $800 or lower. At this income level the association is statistically significant (p = 0.0107), with the highest F value among the 6 levels. The means table shows that urbanization rate group 1 (the lowest level) has the highest mean female employment rate (77.4) compared to groups 2 - 6.
Income levels 2 - 6 have p-values higher than 0.05, so the association is not statistically significant at those levels.
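
The stratified ANOVA above tests the association within each income level separately. A complementary way to look at moderation, shown here only as a sketch and not as part of the original analysis, is to fit a single two-way model with an explicit interaction term. PROC GLM is used instead of PROC ANOVA on the assumption that the cell sizes are unequal, which is the situation PROC GLM is designed for.

```sas
/* Sketch: two-way model with an interaction term.                  */
/* A significant UrbanizationRateGroup*IncomePerPersonGroup effect  */
/* would suggest that income level moderates the association.       */
PROC GLM DATA=NEW;
    CLASS UrbanizationRateGroup IncomePerPersonGroup;
    MODEL femaleemployrate = UrbanizationRateGroup
                             IncomePerPersonGroup
                             UrbanizationRateGroup*IncomePerPersonGroup;
RUN;
QUIT;
```

With 6 x 6 = 36 group combinations and a limited number of countries, some combinations may be empty, so the interaction estimate should be read with caution.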

Testing Statistical Interaction or Moderation In The Context of CHI SQUARE

PROC GCHART DATA=NEW;
VBAR UrbanizationRateGroup / DISCRETE TYPE=MEAN SUMVAR=femaleemployrate;
RUN;

The bar chart of the mean female employment rate for each urbanization rate group suggests an inverse association between the female employment rate and the urban rate.


/** TESTING MODERATION IN THE CONTEXT OF CHI SQUARE TEST OF INDEPENDENCE **/
PROC SORT DATA=NEW;
BY IncomePerPersonGroup;
RUN;

PROC FREQ DATA=NEW;    
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
BY IncomePerPersonGroup;
RUN;

Results:
The results show an inverse association between the female employment rate and the urbanization rate in the first and third income levels, where income per person is $800 or lower and between $2,000 and $8,000, respectively. At these income levels the association is statistically significant (p = 0.0015 for income level 1 and p = 0.0087 for income level 3), with the highest Chi-Square values: 44.1 for income level 1 and 26.6 for income level 3. Income levels 2, 4, 5 and 6 have p-values higher than 0.05, so the association is not statistically significant at those levels.
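
Because each income level is split into a 6 x 6 table, many cells may contain very few countries. The sketch below repeats the stratified test with the EXPECTED option added, so that the expected cell counts can be inspected next to the observed ones. It assumes that NEW is already sorted by IncomePerPersonGroup, as in the code above.

```sas
/* Sketch: same stratified chi-square test, with expected counts printed. */
/* Assumes NEW has already been sorted BY IncomePerPersonGroup.           */
PROC FREQ DATA=NEW;
    TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ EXPECTED;
    BY IncomePerPersonGroup;
RUN;
```

When many expected counts fall below 5, PROC FREQ prints a warning that the chi-square test may not be valid, and the p-values for that income level should be interpreted cautiously.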

Testing Statistical Interaction or Moderation In The Context of the Pearson Correlation Coefficient

/** TESTING MODERATION IN THE CONTEXT OF CORRELATION **/
PROC SORT DATA=NEW;
BY IncomePerPersonGroup;
RUN;

PROC CORR DATA=NEW;
VAR urbanrate femaleemployrate;
BY IncomePerPersonGroup;
RUN;

Results:
The results show an inverse association between the female employment rate and the urbanization rate in the first income level, where income per person is $800 or lower. At this income level the correlation is statistically significant (p = 0.0006), with a correlation coefficient of -0.46785. Income levels 2, 3, 4, 5 and 6 have p-values higher than 0.05, so the correlation is not statistically significant at those levels.
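
The per-level correlations can also be inspected visually. The following is a sketch, not part of the original output: it draws one scatter plot with a fitted regression line per income level using PROC SGPANEL. The WHERE statement and the COLUMNS=3 layout are choices made here for illustration.

```sas
/* Sketch: one regression scatter plot per income level.     */
/* The WHERE statement drops countries with no income group. */
PROC SGPANEL DATA=NEW;
    WHERE IncomePerPersonGroup NE ' ';
    PANELBY IncomePerPersonGroup / COLUMNS=3;
    REG X=urbanrate Y=femaleemployrate;
RUN;
```

A clearly downward-sloping line in the panel for income level 1, and flatter lines elsewhere, would be consistent with the correlations reported above.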


Monday, March 14, 2016

Pearson Correlation

This analysis uses two variables from the Gapminder data set: female employment rate (femaleemployrate) and urban rate (urbanrate). The output was generated using SAS Studio.


Hypothesis formulation
Based on a review of the literature, it appears that while urbanization has brought more women into the labor force, they have encountered cultural and social obstacles that have negatively affected their employment rates. It would be interesting to study the relationship between urbanization and the female employment rate in the Gapminder data set. Hypothesis: Higher urbanization rates lead to lower female employment rates.

Calculating the Pearson Correlation

code
/** DETERMINING THE CORRELATION COEFFICIENT **/
PROC CORR DATA=work.newdata;
          VAR urbanrate femaleemployrate;
RUN;

results:

For the association between urban rate and female employment rate, the correlation coefficient is approximately -0.303, with a p-value < 0.0001. The association between urban rate and female employment rate is modestly negative and statistically significant. It is therefore unlikely that the association is due to chance alone.

Squaring the correlation coefficient gives the fraction of the variability in one variable that can be predicted from the other.

R^2 = (-0.303)^2 ≈ 0.0918

Therefore, if we know the urban rate we can predict approximately 9.2% of the variability we see in the female employment rate. This means that about 90.8% of the variability is due to factors other than the urban rate.
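
One way to cross-check the squared coefficient, offered as a sketch and not as part of the original analysis: in a simple linear regression with a single predictor, the R-Square value reported by PROC REG equals the square of the Pearson correlation coefficient, so it should agree with the figure above up to rounding.

```sas
/* Sketch: simple linear regression of female employment rate on urban rate. */
/* The R-Square in the fit statistics should match the squared correlation.  */
PROC REG DATA=work.newdata;
    MODEL femaleemployrate = urbanrate;
RUN;
QUIT;
```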

This scatter plot shows the correlation between female employment rate and urban rate:

/** SCATTER PLOT **/
PROC SGPLOT DATA=WORK.NEWDATA;
REG X=urbanrate Y=femaleemployrate;
RUN;




Monday, March 7, 2016

Chi-Square Test of independence

This analysis uses two variables from the Gapminder data set: female employment rate (femaleemployrate) and urban rate (urbanrate). The output was generated using SAS Studio.


Hypothesis formulation
Based on a review of the literature, it appears that while urbanization has brought more women into the labor force, they have encountered cultural and social obstacles that have negatively affected their employment rates. It would be interesting to study the relationship between urbanization and the female employment rate in the Gapminder data set.

Hypothesis: Higher urbanization rates lead to lower female employment rates.

Chi-Square Test
A Chi-Square test of independence was used to test for an association between female employment rates and urbanization rates, based on the hypothesis above. Because both variables are continuous, the values of each variable were grouped into categories in a previous assignment. These categories were then used in the Chi-Square test.

SAS Code
/** CHI SQUARE TEST OF INDEPENDENCE **/
PROC FREQ DATA=NEW;   
TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
TITLE 'Chi-Squared Test of Independence: Female Employment Rate and Urbanization Rate';
RUN;



Results
The contingency table shows:
The row percentages (Female Employment Rate Group, the response variable) appear to be higher than the column percentages (Urbanization Rate Group, the explanatory variable) for Female Employment Rate Groups 1, 2, 5 and 6. The reverse is true for rows 3, 4 and 5. The data suggest an inverse relationship between the two variables. The higher urbanization rate groups appear to have lower female employment rates.

The Statistics table shows a Chi-Square value of 52.58, which is significant at the 0.001 level. Because the p-value is less than 0.001, there is strong evidence of an association between these two variables.

Post-hoc Tests for Chi Square Tests of Independence
Post-hoc pairwise comparisons with the Bonferroni adjustment are used to protect against an inflated Type I error rate when many tests are run. With 6 urbanization rate groups there are 15 pairwise comparisons, so the null hypothesis is rejected only when the p-value is below 0.05/15 = 0.003333.
Below are the Chi-Square p-values generated for each pair of the 6 levels in the Urbanization Rate Group. Level 1 has the largest number of statistically significant results (below 0.0033), for the pairwise comparisons (1,4), (1,5) and (1,6).
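
The adjusted threshold follows from the number of ways to choose 2 groups out of 6. The short DATA step below is a sketch of that arithmetic; the variable names (k, comparisons, alpha) are chosen here for illustration only.

```sas
/* Sketch: Bonferroni-adjusted significance threshold for 6 groups. */
DATA _NULL_;
    k = 6;                        /* number of urbanization rate groups   */
    comparisons = COMB(k, 2);     /* number of pairwise comparisons (15)  */
    alpha = 0.05 / comparisons;   /* adjusted threshold, about 0.0033     */
    PUT comparisons= alpha=;      /* writes both values to the SAS log    */
RUN;
```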



SAS Code:

DATA COMPARISON1; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '1' OR UrbanizationRateGroup = '2';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON2; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '1' OR UrbanizationRateGroup = '3';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON3; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '1' OR UrbanizationRateGroup = '4';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON4; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '1' OR UrbanizationRateGroup = '5';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON5; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '1' OR UrbanizationRateGroup = '6';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON6; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '2' OR UrbanizationRateGroup = '3';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON7; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '2' OR UrbanizationRateGroup = '4';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON8; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '2' OR UrbanizationRateGroup = '5';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON9; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '2' OR UrbanizationRateGroup = '6';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON10; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '3' OR UrbanizationRateGroup = '4';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON11; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '3' OR UrbanizationRateGroup = '5';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON12; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '3' OR UrbanizationRateGroup = '6';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON13; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '4' OR UrbanizationRateGroup = '5';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON14; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '4' OR UrbanizationRateGroup = '6';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;

DATA COMPARISON15; 
SET WORK.NEW;    
IF UrbanizationRateGroup = '5' OR UrbanizationRateGroup = '6';
PROC FREQ; 
     TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ; 
RUN;
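
The 15 blocks above differ only in the pair of groups being compared, so they could be generated with a macro. The following is a sketch under the assumption that UrbanizationRateGroup is a character variable with values '1' to '6', as in the code above. The macro names pairwise_chisq and all_pairs are hypothetical, and a WHERE statement replaces the intermediate COMPARISON data sets.

```sas
/* Sketch: chi-square test for one pair of urbanization rate groups. */
%MACRO pairwise_chisq(g1, g2);
    PROC FREQ DATA=WORK.NEW;
        WHERE UrbanizationRateGroup IN ("&g1", "&g2");
        TABLES FemaleEmploymentRateGroup * UrbanizationRateGroup / CHISQ;
        TITLE "Post hoc comparison: urbanization rate groups &g1 and &g2";
    RUN;
%MEND pairwise_chisq;

/* Sketch: loop over every pair (i, j) with i < j; 15 pairs when n=6. */
%MACRO all_pairs(n=6);
    %LOCAL i j;
    %DO i = 1 %TO %EVAL(&n - 1);
        %DO j = %EVAL(&i + 1) %TO &n;
            %pairwise_chisq(&i, &j)
        %END;
    %END;
%MEND all_pairs;

%all_pairs(n=6)
TITLE;  /* clear the title set by the last comparison */
```

Each p-value in the resulting output would still be compared against the Bonferroni-adjusted threshold described earlier, not against 0.05.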