Saturday, February 6, 2016

Making Data Management Decisions

This post uses three variables from the Gapminder dataset: female employ rate, employ rate and urban rate. The output was generated using SAS Studio.


Hypothesis formulation
Based on a review of the research, it appears that while more women have joined the labor force thanks to urbanization, they have encountered several cultural and social obstacles that have negatively impacted their employment rates. It would be interesting to study the correlation between urbanization and the female employment rate in the Gapminder dataset. Hypothesis: Higher urbanization rates lead to lower female employment rates.


The program below uses PROC MEANS to compute the following descriptive statistics:
N: number of non-missing values
NMISS: number of missing values
MEAN: arithmetic average
MEDIAN: middle value
MODE: most frequent value
MIN: minimum value
MAX: maximum value
STD: standard deviation
RANGE: maximum value minus minimum value

Each of the three variables takes many distinct values spread over a wide range. To make these variables easier to analyze, their values are grouped into bins using IF-THEN statements. Each variable's minimum, maximum and standard deviation serve as a guide when selecting the bin boundaries used in the IF-THEN statements. The first IF-THEN statement for each variable also tests for missing data, so that observations with missing values are not assigned to a group.
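The reason for the missing-data test is that SAS treats a missing numeric value (.) as smaller than any number, so a bare "rate <= 15" test would be true for it. A minimal sketch of that behavior is shown here; the variable name rate is made up for illustration and is not part of the program in this post.

```sas
/** Sketch: a missing numeric value (.) compares as smaller than any number in SAS **/
DATA _NULL_;
    rate = .;    /** hypothetical missing rate **/
    /** Unguarded test: true for the missing value, so this message is written to the log **/
    IF rate <= 15 THEN
        PUT "Unguarded test: the missing value would be placed in group 1";
    /** Guarded test: false for the missing value, so nothing is written **/
    IF (rate ^= . & rate <= 15) THEN
        PUT "Guarded test: the missing value is excluded";
RUN;
```

Only the first message should appear in the log, which is why each variable's first IF-THEN statement includes the '^= .' condition.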


/** USING SAS STUDIO **/
/** Import the saved csv file from folder **/
PROC IMPORT DATAFILE="\\fsp5800\users\maryt\Desktop\gapminder.csv"
 DBMS=CSV             /** data source identifier **/
 OUT=work.newdata;    /** the output SAS data set **/
 GETNAMES=YES;        /** generates SAS variable names from the first record in the import file **/
RUN;



/** Exploring the data using PROC MEANS to produce statistics **/
/** Run on the imported data set before the binning DATA step: a PROC step placed inside a DATA step would end it **/
PROC MEANS DATA=work.newdata N NMISS MEAN MODE MEDIAN MIN MAX STD RANGE;
     VAR femaleemployrate employrate urbanrate;  /** Limit the output to the three variables of interest **/
RUN;

The MEANS Procedure

Variable            N   N Miss         Mean         Mode       Median      Minimum      Maximum      Std Dev        Range
femaleemployrate  178       35   47.5494381   42.0999985   47.5499992   11.3000002   83.3000031   14.6257425   72.0000029
employrate        178       35   58.6359551   47.2999992   58.6999989   32.0000000   83.1999969   10.5194545   51.1999969
urbanrate         203       10   56.7693596  100.0000000   57.9400000   10.4000000  100.0000000   23.8449326   89.6000000


/** BINNING VARIABLES: Execute statements for observations that meet certain conditions **/
DATA NEW;  /** Creates the output SAS data set **/
SET   work.newdata;  /** Reads observations from the imported SAS data set **/
KEEP  femaleemployrate employrate urbanrate  /** Variables to include in the output data set **/
      FemaleEmploymentRateGroup
      EmploymentRateGroup
      UrbanizationRateGroup; /** Secondary (group) variables created below **/




/** FemaleEmploymentRateGroup **/

/** '^= .' excludes missing data, which SAS would otherwise treat as smaller than any number and place in group 1 **/
if (femaleemployrate ^= . & femaleemployrate <= 15) then 
     FemaleEmploymentRateGroup = "1";
if (femaleemployrate > 15 & femaleemployrate <= 30) then
     FemaleEmploymentRateGroup = "2";
if (femaleemployrate > 30 & femaleemployrate <= 45) then
     FemaleEmploymentRateGroup = "3";
if (femaleemployrate > 45 & femaleemployrate <= 60) then
     FemaleEmploymentRateGroup = "4";
if (femaleemployrate > 60 & femaleemployrate <= 75) then
     FemaleEmploymentRateGroup = "5";
if (femaleemployrate > 75) then
     FemaleEmploymentRateGroup = "6";


/** EmploymentRateGroup **/

if (employrate ^= . & employrate <= 35) then
     EmploymentRateGroup = "1";
if (employrate > 35 & employrate <= 45) then
     EmploymentRateGroup = "2";
if (employrate > 45 & employrate <= 55) then
     EmploymentRateGroup = "3";
if (employrate > 55 & employrate <= 65) then
     EmploymentRateGroup = "4";
if (employrate > 65 & employrate <= 75) then
     EmploymentRateGroup = "5";
if (employrate > 75 ) then
     EmploymentRateGroup = "6";


/** UrbanizationRateGroup **/

if (urbanrate ^= . & urbanrate <= 15) then
     UrbanizationRateGroup = "1";
if (urbanrate > 15 & urbanrate <= 30) then
     UrbanizationRateGroup = "2";
if (urbanrate > 30 & urbanrate <= 45) then
     UrbanizationRateGroup = "3";
if (urbanrate > 45 & urbanrate <= 60) then
     UrbanizationRateGroup = "4";
if (urbanrate > 60 & urbanrate <= 75) then
     UrbanizationRateGroup = "5";
if (urbanrate > 75) then
     UrbanizationRateGroup = "6";
RUN;

/** Display Frequency tables **/
PROC FREQ DATA=NEW;
     TABLES FemaleEmploymentRateGroup EmploymentRateGroup UrbanizationRateGroup;
RUN;
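The IF-THEN blocks for femaleemployrate and urbanrate use the same 15-point boundaries, so the same grouping could also be expressed with a user-defined format. The sketch below is an alternative, not the method used in this post; the format name rategrp is hypothetical. In a numeric format the LOW keyword does not cover missing values, so they should still be reported as missing.

```sas
/** Sketch only: RATEGRP is a hypothetical format name, not part of the original program **/
PROC FORMAT;
    VALUE rategrp
        low -  15   = "1"    /** up to and including 15 **/
        15 <-  30   = "2"    /** above 15, up to and including 30 **/
        30 <-  45   = "3"
        45 <-  60   = "4"
        60 <-  75   = "5"
        75 <-  high = "6";   /** above 75 **/
RUN;

/** PROC FREQ groups observations by their formatted values **/
PROC FREQ DATA=NEW;
    TABLES femaleemployrate urbanrate;
    FORMAT femaleemployrate urbanrate rategrp.;
RUN;
```

The employrate variable uses different boundaries (35, 45, 55, 65, 75), so it would need its own format.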


Findings:


Female employ rate:
  • 178 of the 213 countries have female employment statistics listed. The remaining 35 countries do not.
  • The highest female employment rate listed is 83.3%, from Burundi. The lowest female employment rate listed is 11.3%, from the West Bank and Gaza. The resulting range among the observations is 72.0 percentage points and the standard deviation is 14.6.
  • The average female employment rate among the 178 observations is 47.5%.
  • The female employment rate with the highest frequency (mode) is 42.1%.
  • The highest number of observations is in ‘Female Employment Rate Group’ = 4, which includes 75 observations (countries) or 42.1% of the non-missing observations. This is the largest group but not a majority: the most common female employment rates are above 45% and up to 60%.
  • The second highest frequency of female employment rate is in group 3, above 30% and up to 45%. This accounts for 30.9% of the observations.


Employ rate:
  • 178 of the 213 countries have overall employment statistics listed. The remaining 35 countries do not.
  • The highest employment rate listed is 83.2%, from both Burundi and Uganda. The lowest employment rate listed is 32%, from the West Bank and Gaza. The resulting range among the observations is 51.2 percentage points and the standard deviation is 10.5.
  • The average employment rate among the 178 observations is 58.6%.
  • The employment rate with the highest frequency (mode) is 47.3%.
  • The highest number of observations is in ‘Employment Rate Group’ = 4, which includes 76 observations (countries) or 42.7% of the non-missing observations. This is the largest group but not a majority: the most common employment rates are above 55% and up to 65%.
  • The second highest frequency of employment rate is in group 3, above 45% and up to 55%. This accounts for 23.0% of the observations.


Urban rate:
  • 203 of the 213 countries have urbanization rate statistics listed. The remaining 10 countries do not.
  • The highest urban rate listed is 100%, from 6 countries: Hong Kong / China, Singapore, Macao / China, Cayman Islands, Monaco and Bermuda. The lowest urban rate listed is 10.4%, from Burundi. The resulting range among the observations is 89.6 percentage points and the standard deviation is 23.8.
  • The average urban rate among the 203 observations is 56.8%.
  • The urban rate with the highest frequency (mode) is 100%.
  • The highest number of observations is in ‘Urbanization Rate Group’ = 5, which includes 50 observations (countries) or 24.6% of the non-missing observations. This is the largest group but far from a majority: the most common urban rates are above 60% and up to 75%.
  • The second highest frequency of urban rate is in group 6, greater than 75%. This accounts for 23.6% of the observations.
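
The percentages quoted in these findings are each group's count divided by the number of non-missing observations for that variable (178 or 203), not by all 213 countries. A small sketch of that arithmetic is shown here, using counts taken from the PROC FREQ output; the variable names are made up for illustration.

```sas
/** Sketch: recompute group percentages from the PROC FREQ counts **/
DATA _NULL_;
    female_grp4 = 100 * 75 / 178;   /** PROC FREQ reports 42.13 for this group **/
    employ_grp4 = 100 * 76 / 178;   /** PROC FREQ reports 42.70 **/
    urban_grp5  = 100 * 50 / 203;   /** PROC FREQ reports 24.63 **/
    urban_grp6  = 100 * 48 / 203;   /** PROC FREQ reports 23.65 **/
    /** Write each value to the log with two decimal places **/
    PUT female_grp4= 6.2 employ_grp4= 6.2 urban_grp5= 6.2 urban_grp6= 6.2;
RUN;
```

Dividing by 213 instead would understate each share, because the countries with missing values are left out of the frequency tables.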



The FREQ Procedure


FemaleEmploymentRateGroup   Frequency   Percent   Cumulative Frequency   Cumulative Percent
1                                   3      1.69                      3                 1.69
2                                  15      8.43                     18                10.11
3                                  55     30.90                     73                41.01
4                                  75     42.13                    148                83.15
5                                  21     11.80                    169                94.94
6                                   9      5.06                    178               100.00
Frequency Missing = 35


EmploymentRateGroup         Frequency   Percent   Cumulative Frequency   Cumulative Percent
1                                   2      1.12                      2                 1.12
2                                  17      9.55                     19                10.67
3                                  41     23.03                     60                33.71
4                                  76     42.70                    136                76.40
5                                  28     15.73                    164                92.13
6                                  14      7.87                    178               100.00
Frequency Missing = 35


UrbanizationRateGroup       Frequency   Percent   Cumulative Frequency   Cumulative Percent
1                                   5      2.46                      5                 2.46
2                                  30     14.78                     35                17.24
3                                  35     17.24                     70                34.48
4                                  35     17.24                    105                51.72
5                                  50     24.63                    155                76.35
6                                  48     23.65                    203               100.00
Frequency Missing = 10




