## Background and Data Source

We’ve heard for quite some time that the middle class is in crisis and shrinking. But what do the data say? Is the middle class in crisis? Where is the middle class going? What does that crisis look like in the data?

Data for this exploration come from the Current Population Survey (CPS) Annual Social and Economic Supplement (ASEC). The CPS ASEC is a longitudinal survey of 50,000 households conducted by the US Census Bureau [1]. Despite the relatively small sample size (<1.5% of the population), this dataset is regularly used for income analysis and other demographic trends. Data were downloaded and pulled into R using code from Anthony Damico’s All Survey Data Free [2].

## Data Scrubbing

Data were processed using the same steps as Pew Research Center’s 2015 report, The American Middle Class Is Losing Ground [6]. First, CPS ASEC data for the years 2000, 2005, 2010, and 2015 were downloaded into MonetDB. Second, total income (variable htotval) [3] was adjusted for inflation to 2015 USD, using the getSymbols function from the quantmod package [4] to get Consumer Price Index (CPI) data from the Federal Reserve Bank of St. Louis.
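The inflation adjustment itself is just a ratio of index values. Here is a minimal sketch, assuming a small lookup table of CPI values by year. The numbers in `cpi` are made-up placeholders rather than actual FRED data, and `to_2015_usd` is a hypothetical helper name, not a function from the repository.

```r
# Illustrative CPI index values by year.
# ASSUMPTION: these are placeholder numbers, NOT real CPI data.
cpi <- c("2000" = 170, "2005" = 195, "2010" = 220, "2015" = 240)

# Convert nominal dollars from a given year into 2015 dollars
# by scaling with the ratio of the 2015 index to that year's index.
to_2015_usd <- function(income, year, cpi_table = cpi) {
  income * cpi_table[["2015"]] / unname(cpi_table[as.character(year)])
}

incomes <- data.frame(year = c(2000, 2010, 2015),
                      h_income = c(50000, 50000, 50000))
incomes$h_income_2015 <- to_2015_usd(incomes$h_income, incomes$year)
print(incomes)
```

The same nominal income is worth more in 2015 dollars the earlier it was earned, which is why the adjustment has to happen before any year-to-year comparison.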

Finally, household income was adjusted for household size. Intuitively, we know that when two people live together, their household income is the sum of each individual’s earnings. However, many expenses (most notably housing and utilities) are shared, so a household of two does not have double the expenses of a household of one. To adjust for this fact, the standard procedure is to divide total household income by the square root of household size [5]. In other words, a two-person household is assumed to have 1.41 times, rather than 2 times, the expenses of a one-person household.
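As a quick illustration of the square-root adjustment, here is a minimal sketch. The argument names follow the `h_income` and `h_size` variables listed in Appendix B; `adjust_for_size` is a hypothetical helper name used only for this example.

```r
# Square-root equivalence scale: divide household income by sqrt(household size).
adjust_for_size <- function(h_income, h_size) {
  h_income / sqrt(h_size)
}

# The same $60,000 household income for households of 1, 2, and 4 people.
adjusted <- adjust_for_size(60000, c(1, 2, 4))
print(round(adjusted))
```

A four-person household needs twice the income of a single person to reach the same adjusted income, not four times.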

After income was standardized across all households in the dataset, income class was calculated using the definition used by Pew Research Center [6]: middle income is two-thirds to two times the median adjusted income. Anything above two times median income is considered “upper” income, and anything below two-thirds of median income is “lower” income. This is distinct from lower, middle, and upper class, which has a wealth component in addition to income as well as connotations of lifestyle [6]. This is still a relatively crude measure since it does not account for cost of living, but it is useful for a broad-strokes analysis.
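The classification rule can be sketched in a few lines. This is a minimal example on toy numbers, assuming the median is taken over whatever vector is passed in (for instance, one survey year at a time); `classify_income` is a hypothetical name, not a function from the repository.

```r
# Assign an income class from size-adjusted income:
# below 2/3 of the median is "lower", above 2x the median is "upper",
# everything in between is "middle".
classify_income <- function(adj_income) {
  med <- median(adj_income)
  lo <- (2 / 3) * med
  hi <- 2 * med
  seclass <- ifelse(adj_income < lo, "lower",
                    ifelse(adj_income > hi, "upper", "middle"))
  factor(seclass, levels = c("lower", "middle", "upper"))
}

# Toy adjusted incomes; the median is 45,000, so the cutoffs are 30,000 and 90,000.
adj_income <- c(10000, 35000, 45000, 60000, 150000)
classes <- classify_income(adj_income)
print(table(classes))
```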

## Exploratory Analysis

First, like Pew [6], I found that the middle class is shrinking, at least in terms of people living in each income class.

These proportions are slightly different from those found by Pew [6], but they show the same general trend and are therefore close enough for my purposes. My larger interest is to look in more detail at how the distribution of income is changing. We can see that income is, as it always is, heavily skewed to the right, but the distributions are not identical year-to-year.

## Difference in Density Analysis

### Theoretical Construction and Example

The question I want to answer is where the middle class is going. To some degree, this is answered by the above graphs showing the proportions of adults living in each income category, but I’m asking a further question: which income levels are more, and less, prevalent than they used to be? Conceptually, I want to zoom in to see the gaps between the density curves. Since this is something of an unconventional way of visualizing data, let’s start with a proof-of-concept example. First, we will take random samples from a uniform and a Gaussian distribution and graph the density functions of these two samples on top of each other.

library(ggplot2) #Needed for the plots
set.seed(100) #Set the seed
#Get two random samples
sample1 <- data.frame(x=runif(100,0,2))
sample2 <- data.frame(x=rnorm(100,1,.33))
#Convert these random samples to density distributions
dist1 <- density(sample1$x,from=0,to=2)
dist2 <- density(sample2$x,from=0,to=2)
#Put these density functions into a dataframe
df1 <- data.frame(x=dist1$x,y=dist1$y)
df2 <- data.frame(x=dist2$x,y=dist2$y)

#Plot these density curves
ggplot() +
  scale_x_continuous(limits=c(0,2)) +
  geom_line(data=df1,aes(x,y),size=1.4,colour="blue") +
  geom_line(data=df2,aes(x,y),size=1.4,colour="green") +
  geom_hline(yintercept=0)

As expected, we can see that the sample from the uniform distribution (blue) is more densely populated at the ends of the range and less prevalent around the mean. We can quantify these differences by looking at the differences in the density curves. For example, at x = 1 the normal density curve (green) is 1.214 and the blue curve is 0.57, making the green population 2.128 times as dense as the blue population at x = 1.
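One way to read such point values off the estimated curves is to interpolate each density at the x of interest. The sketch below is a minimal version using only base R; `at_point` is a hypothetical helper name, and the exact numbers you get may differ slightly from those quoted above depending on the R version and density defaults.

```r
set.seed(100)
sample_unif <- runif(100, 0, 2)
sample_norm <- rnorm(100, 1, 0.33)

dens_unif <- density(sample_unif, from = 0, to = 2)
dens_norm <- density(sample_norm, from = 0, to = 2)

# density() evaluates the curve on a grid (512 points by default),
# so interpolate linearly to get the height at an arbitrary x.
at_point <- function(d, x0) approx(d$x, d$y, xout = x0)$y

height_norm <- at_point(dens_norm, 1)
height_unif <- at_point(dens_unif, 1)
ratio <- height_norm / height_unif
print(c(normal = height_norm, uniform = height_unif, ratio = ratio))
```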

To visualize this and see these differences in clearer relief, we can create a difference-in-density curve (no longer a density curve, because the difference can be negative). When we graph this difference curve, we can highlight the sign of the difference in density to see which population is higher or lower, and easily visualize the magnitude of these differences. As this is a comparison, we must have a base population and a comparison population, where the resulting difference curve is density(base) - density(comparison). In this case, blue (uniform distribution) will be our base population and green (normal distribution) is the comparison population.

#Build a dataframe with the difference in densities
df3 <- data.frame(x=dist1$x,y=dist1$y-dist2$y)
#Graph it
ggplot() +
  scale_x_continuous(limits=c(0,2)) +
  geom_line(data=df3,aes(x,y),size=1.4,colour="black") +
  geom_area(data=subset(df3,y>0),aes(x=x,y=y),alpha=0.5,fill="green") +
  geom_area(data=subset(df3,y<0),aes(x=x,y=y),alpha=0.5,fill="red") +
  geom_hline(yintercept=0)

Now, we can see not just where in our x range the blue population is more prominent than the green population, but also the magnitude of these differences, the latter of which is harder to see when the two density curves are merely juxtaposed. As we’ll see later, we can calculate the area under the curve for different sections to quantify the relative magnitude of each difference between the base and comparison densities.

### Difference in Density in Income

Let’s apply this same methodology of differences in density to income distributions over time. Since this methodology necessarily requires a base year and can only compare two distributions at a time, we will use 2015 as our comparison year, and look at where household income has increased or decreased relative to the base years of 2000 and 2010. Note that for the income graphs the curve is computed as density(2015) minus density(base year), as in the getYearComparison function in Appendix B, so positive (green) values mark incomes that are more common in 2015.

As expected, the proportion of middle income households is smaller (read: negative relative density) in 2015 than in 2010 or 2000. But where are those households going? As seen by the green on the far left, we can see more households living with $0 or ultra-low income. But the $100,000 to $200,000 range is far more common in 2015 than in either base year, indicating that we also have more households in the upper income echelons. Calculating the areas under these curves, we can compare the size of the shifts to upper and lower echelons.
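A minimal sketch of such an area calculation follows. It is an assumption-laden stand-in rather than the code behind the table: the two income samples are synthetic log-normal draws, the trapezoid rule is my choice of integration method, and `positive_area` is a hypothetical helper. The $50K and $200K cutoffs are the ones used for the lower and upper ranges in this section.

```r
# Trapezoid-rule area under the positive part of a curve, restricted to [lo, hi).
positive_area <- function(x, y, lo, hi) {
  keep <- x >= lo & x < hi
  xs <- x[keep]
  ys <- pmax(y[keep], 0)  # ignore negative differences
  n <- length(xs)
  if (n < 2) return(0)
  sum(diff(xs) * (ys[-1] + ys[-n]) / 2)
}

set.seed(1)
# ASSUMPTION: synthetic incomes standing in for a base year and a comparison year.
base_income <- rlnorm(5000, meanlog = log(50000), sdlog = 0.6)
comp_income <- rlnorm(5000, meanlog = log(55000), sdlog = 0.8)

# Same from/to (and default grid size), so both densities share one x grid.
d_base <- density(base_income, from = 0, to = 1e6)
d_comp <- density(comp_income, from = 0, to = 1e6)
diff_y <- d_comp$y - d_base$y  # comparison minus base

lowerGain <- positive_area(d_base$x, diff_y, 0, 50000)
upperGain <- positive_area(d_base$x, diff_y, 50000, 200000)
timesGain <- upperGain / lowerGain
print(c(lowerGain = lowerGain, upperGain = upperGain, timesGain = timesGain))
```

Because this sketch integrates over dollars, its areas are shares of households and will not be on the same scale as the values reported in the table.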

Comparison of relative increases to lower and upper echelons from base year to 2015

| baseyear | lowerGain | upperGain | timesGain |
|---|---|---|---|
| 2000 | 1.3e-06 | 1.51e-05 | 11.737 |
| 2010 | 6.3e-06 | 7.90e-06 | 1.258 |

The lowerGain and upperGain columns are the areas under the positive portion of the difference-in-density curves where income is less than $50K (lower) or between $50K and $200K (upper). timesGain is the ratio of upper echelon increases to lower echelon increases.

## Economic Interpretation and Explanation

In the final column of the above table, we see that since 2010, roughly 25% more households moved to upper income than moved into lower income. Note that this is a cross-sectional analysis, so we cannot comment directly on which households moved or other characteristics about those moves. There is some evidence that part of the gains to the $100,000 to $200,000 income levels came not from the middle income group, but from declines in the highest income earners. That said, particularly when we look at the last 15 years (base year 2000), we can see the relative shifts are far more (11 times) in favor of greater opulence than poverty.

What do these graphs tell us? First, I think the hyperbolic notion that those evil, billionaire CEOs are taking all the money away from the middle class is solidly debunked by the substantial growth in the proportion of upper middle income households. Are the ultra-rich becoming richer while the mega-rich retire and don’t get replaced by burgeoning young professionals? That conclusion could be supported by the data (note the blocks of red above $250,000 annual income, which indicate that household-adjusted incomes in this range are less prevalent in 2015 than in the base years), but it is certainly not the only explanation.

But what else is going on to explain these findings? Looking at the demographics, we can see some notable differences, most notably in the number of workers per household.

Pairwise t-tests show statistically significant differences between all the socioeconomic income levels. However, the substantive difference seems to be the number of workers. Lower income households tend to have only one worker, though the average number of adults across income classes suggests that these are not, on average, single parents acting as single earners (though that arrangement is significantly and substantively more common in lower-income households than in middle or upper-income households). On the other hand, upper income households have, on average, 2 or more earners. In other words, both the average and median households are better off in 2014 (the income year reported on in the 2015 survey) than they were in 1999 or 2009. However, a large part of that is due to the wider trend of two-income households and other demographic shifts [5]. This larger trend explains some of the shift between 2000 and 2015, but we can see that this cannot be the full story, since the average workers per household (and all other measures of household size) actually fall between 2010 and 2015. Unfortunately, I need to end my investigation here, so I will delve into that question at a later date.
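For readers who want to try the pairwise comparison themselves, here is a minimal sketch using base R’s `pairwise.t.test`. The data are synthetic Poisson draws, not CPS ASEC records, and the Holm correction is my assumption since the multiple-comparison adjustment is not stated above. Column names follow the `seclass` and `h_num_earners` variables in Appendix B.

```r
set.seed(42)
# ASSUMPTION: synthetic earner counts per household, by income class.
households <- data.frame(
  seclass = factor(rep(c("lower", "middle", "upper"), each = 200),
                   levels = c("lower", "middle", "upper")),
  h_num_earners = c(rpois(200, 1.0), rpois(200, 1.6), rpois(200, 2.1))
)

# Mean number of earners in each income class.
class_means <- aggregate(h_num_earners ~ seclass, data = households, FUN = mean)
print(class_means)

# Pairwise t-tests between every pair of classes, with Holm-adjusted p-values.
tests <- pairwise.t.test(households$h_num_earners, households$seclass,
                         p.adjust.method = "holm")
print(tests$p.value)
```

The result is a lower-triangular matrix of adjusted p-values, one for each pair of income classes.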

## Conclusion

Yes, the middle class is shrinking. Inequality is real, as is poverty. CEO pay has exploded since the mid-1990s [7]. This is all true. But whatever the populists say, this does not mean that everyone is suddenly going to be subjugated to the ultra-rich. The median and average households are still doing okay. There are demographic shifts (lower fertility rates, higher cohabitation, higher divorce rates, lower/later marriage, higher rates of children living with parents longer, etc.), and these demographic shifts go a long way to explaining a growing proportion of upper middle income households. Whether these are good shifts or bad shifts depends on your values and worldview, but economically, they are preventing household inequality from rising.

This all implies that maybe, just maybe, free market capitalism is doing what free markets do best: expand opulence for more people than are harmed by free trade and free markets. This is not to say that everything is perfect and rosy. Obviously, there are serious social challenges facing America today, and inequality has complicated and real ethical and moral concerns. However, the populist nightmare that the middle class is becoming poorer to the benefit of the upper classes of society does not appear to be one of those challenges.

# Appendix A: Future Directions

For better or for worse, I’m a busy guy, and this is, for now, just a hobby. I hope to add to this investigation as I have time, but in the meantime, I encourage readers to fork my repo and add some of the following adjustments and considerations I wish I had time to include:

• Adjustment for Cost of Living, or at least calculating median (and therefore class) by region or FIPS code. This will almost certainly require a new, larger dataset, but is worth exploring given the urbanization of millennials.
• More robust ways of accounting for household size generally.
• Identifying the same households over time to track changes to income class.

# Appendix B: Code

All code is available on GitHub at github.com/dannhek/income_distribution. Below is a scattering of key pieces of code.

#### SQL Query used to get a subset of the data from MonetDB

#From BuildHHCSV.r
#Retrieve data from the MonetDB SQL database using dbGetQuery
#(db is assumed to be an open database connection)
query2015 <- "select h_idnum1
    ,h_year
    ,max(hwsval)
    ,max(htotval)
    ,max(h_numper)
    ,max(h_numper-hunder18)
    ,sum(case when earner=1 then 1 else 0 end)
  from asec15 where htotval > 0 group by h_idnum1,h_year"
df2015 <- dbGetQuery(db, query2015)

#### Variables used in analysis

| R variable name | ASEC source variable | Description |
|---|---|---|
| X | N/A | Observation counter |
| year | h_year | Survey year (data reflect previous calendar year earnings) |
| h_id | h_idnum1 | Household identifier |
| h_wages | hwsval | Household income from wages or salary in the previous calendar year |
| h_income | htotval | Total household income in the previous calendar year |
| h_size | h_numper | Number of people (all ages) living in household |
| h_num_earners | earner | Number of people earning some income in household |
| h_num_fams | hnumfam | Number of families living in household |
| seclass | [Calculated value] | Socioeconomic class, as defined by Pew |

#### Difference In Density Function

#Building the Difference in Density Graphs
library(ggplot2)
library(scales) #Provides comma() for the axis labels

getYearComparison <- function(df,year1=2000,year2=2015) {
  #Estimate each year's density of adjusted household income on the same grid
  dist1 <- density(subset(df,year == year1, adj_h_income)$adj_h_income,from=0,to=1000000)
  dist2 <- density(subset(df,year == year2, adj_h_income)$adj_h_income,from=0,to=1000000)
  #Difference is comparison year (year2) minus base year (year1)
  df1 <- data.frame(x=dist1$x,y=dist2$y-dist1$y)
  df1$pos <- df1$y>0

  compareYears <- ggplot(data=df1) +
    geom_line(aes(x=x,y=y),size=1,colour="black") +
    geom_area(aes(x=x,y=ifelse(y>0,y,0)),alpha=0.5,fill="lightgreen") +
    geom_area(aes(x=x,y=ifelse(y<0,y,0)),alpha=0.5,fill="pink") +
    geom_hline(yintercept=0) +
    ggtitle(paste0("Changes in Income Distribution Between ",year1," and ",year2)) +
    xlab("Adjusted Household Income (2015 USD)") +
    ylab("Difference in Distribution Density") +
    scale_x_continuous(labels=comma,limits=c(0,500000),
                       breaks=c(0,50000,100000,200000,300000,400000,500000)) +
    scale_y_continuous(labels=comma)

  #Return both a graph and the new dataframe
  list(graph=compareYears,dataframe=df1)
}

## References

1: US Census Bureau. (n.d.). Small Area Income and Poverty Estimates. Retrieved from https://www.census.gov/did/www/saipe/data/model/info/cpsasec.html

2: Damico, A. J. (2016). Current Population Survey. ASDFree. GitHub Repository. https://github.com/ajdamico/asdfree/tree/master/Current%20Population%20Survey; commit c680eec92cbba64512d756e533696dedaa3d415e

3: Variable htotval from the CPS Data Dictionary

4: Ryan, J. A.; Ulrich, J. M.; Thielen, W. (2015). Quantmod. CRAN Package. https://cran.r-project.org/web/packages/quantmod/quantmod.pdf

5: Burkhauser, R. (2012). Podcast interview with Russ Roberts. Retrieved from http://www.econtalk.org/archives/2012/04/burkhauser_on_t.html

6: Kochhar, R.; Fry, R.; Rohal, M. (2015). The American Middle Class is Losing Ground. Pew Research Center. Retrieved from http://www.pewsocialtrends.org/files/2015/12/2015-12-09_middle-class_FINAL-report.pdf

7: Planet Money. (2016). Episode 682: When CEO Pay Exploded. Retrieved from http://www.npr.org/sections/money/2016/02/05/465747726/-682-when-ceo-pay-exploded