VolSweep


Texas voter clusters

tl;dr This analysis of the Texas voter file proceeds stepwise from clustering voters based on issue scores to identifying gerrymandering and examining anomalous election outcomes from the 2020 general election.

Texas received a lot of attention during the 2020 election cycle. Democrats hoped to flip ten congressional seats and turn the state blue, while Republicans defended said seats and kept the state red. Political analysts pored over exit polls, particularly in the Río Grande Valley, which appeared to show a conservative shift among Latinx voters.

Beyond the election cycle itself, the U.S. Census Bureau announced congressional reapportionment based on the 2020 census, under which Texas will receive two additional representatives. The state must complete the accompanying redistricting process, about which there are continuing concerns of illegal gerrymandering by Republicans.

This piece is a statistical analysis of the Texas voter file. As data director of a 2020 U.S. Senate campaign (not in Texas), I clustered voters on a set of issue scores with promising results. That work happened to coincide with the publication of Professor Philip Waggoner’s article urging more widespread use of unsupervised machine learning techniques in political and social research. Suffice it to say I think there is wide open space for unsupervised methods in political analysis. Here I would like to review the eight major ideological voter clusters in Texas, the types of current congressional districts based on relative proportions of those voter clusters, and signs of gerrymandering based on geographic distribution of the voter clusters.

I used a combination of features from L2 and HaystaqDNA which included donation information and other third party data for the overall analysis. Since I wanted to find ideological clusters of voters, I clustered only on selected issue scores. I generally prefer to use hierarchical agglomerative clustering because it is deterministic and I can manually examine clusters at different levels of the dendrogram to decide where to make slices for the final cluster count. All analysis is done in Python using Matplotlib for visualization. Many thanks to Blake Silberberg and HaystaqDNA for access to the data and helpful feedback.
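To make the "slice the dendrogram" idea concrete, here is a minimal sketch using SciPy on synthetic data. The issue scores, the Ward linkage, and the candidate cluster counts are all assumptions for illustration; the post does not say which linkage or distance was used on the real voter file.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for issue scores: 3 groups of 20 "voters" x 4 issues.
rng = np.random.default_rng(0)
centers = np.array([[20, 25, 30, 15], [50, 55, 45, 50], [85, 80, 90, 75]], dtype=float)
scores = np.vstack([c + rng.normal(0, 3, size=(20, 4)) for c in centers])

# Build the full merge tree once (Ward linkage is an assumed choice).
tree = linkage(scores, method="ward")

# Slice the same tree at several cluster counts without refitting.
labels_by_k = {k: fcluster(tree, t=k, criterion="maxclust") for k in (2, 3, 4)}
for k, labels in labels_by_k.items():
    print(k, "clusters, sizes:", np.bincount(labels)[1:])
```

Because the tree is built once and is deterministic, comparing the cuts at different values of `k` is just a matter of relabeling, which is what makes manual inspection at several levels practical.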

So what are the eight major voter clusters in Texas? Figure 1 shows pairwise biplots of the clustering features’ first three principal components. These first three principal components explain 70.4% of the variance in the data set. The heads of the yellow arrows are located at the projection of each original clustering feature into the new plane described by principal components (the six farthest from the origin shown per plot). The more parallel two arrows are pointing in the same [opposite] directions, the more positively [negatively] correlated those variables are; likewise regarding the relationship between an arrow and the PCA axes themselves. The longer an arrow extends from the origin, the more variance it explains. In the leftmost plot, for example, “opposes school choice” is a clustering feature whose PC1 coordinate is positive and relatively small compared to its positive PC2 coordinate, and the arrow itself is longer than the arrow for “supports gun control.” From these observations we know that “opposes school choice” explains more variance along (and is positively correlated to) the PC2 axis than the PC1 axis, with which it is also positively correlated. The PC2 coordinate for “opposes school choice” is larger in magnitude than the PC2 coordinate for “supports gun control,” so we also know that the former explains more variance along the PC2 axis. Following the figure is a cluster-by-cluster description in decreasing order of general conservativeness.

Fig. 1 Pairwise biplots for the three principal components explaining the most variance in the data set. The blue cluster is the most conservative overall and the grey cluster is the least conservative overall. Figure 18 in the appendix is the same plot colored by age instead of cluster membership.
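For readers who want to see where the biplot arrows come from, the following is a rough sketch of PCA via NumPy's SVD on made-up scores. The feature names and data are hypothetical, and the features are only centered here (not scaled), which may differ from the preprocessing behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scores: 500 voters x 4 issues; a, b track each other, c runs opposite.
base = rng.normal(size=(500, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(500, 1)),
    base + 0.3 * rng.normal(size=(500, 1)),
    -base + 0.3 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])
feature_names = ["issue_a", "issue_b", "issue_c", "issue_d"]

Xc = X - X.mean(axis=0)                      # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)              # share of variance per component
scores = Xc @ Vt.T                           # voter coordinates (the dots)
loadings = Vt.T                              # rows: features, cols: PCs (arrow heads, up to scaling)

print("share of variance in first 3 PCs:", explained[:3].sum())
for name, (pc1, pc2) in zip(feature_names, loadings[:, :2]):
    print(f"{name}: PC1 {pc1:+.2f}, PC2 {pc2:+.2f}")
```

In this toy data the arrows for `issue_a` and `issue_b` point the same way along PC1 while `issue_c` points the opposite way, which is the positive and negative correlation reading described above.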

Fox & Friends

  • 960k voters (6.0%)

  • average age 66.9 years

  • 90% European ethnicity

  • 73% male

  • very conservative overall

Picket Fence Betty Draper

  • 3.27m voters (20.6%)

  • average age 56.6 years

  • 79% European ethnicity

  • 47.5% male

  • conservative overall, slightly more supportive of environmental & citizenship issues than Fox & Friends

Young & Curious Conservatives

  • 1.21m voters (7.6%)

  • average age 39.9 years

  • 66% European, 15% Latinx ethnicity

  • 43.4% male

  • pretty conservative, less so than previous two clusters; supportive of renewable energy and gun control

Reddish Independent

  • 1.44m voters (9.1%)

  • average age 46.9 years

  • 71.1% European, 12.1% Latinx ethnicity

  • 58.2% male

  • cautiously liberal but not strong on any one issue, strongly opposes environmental issues and abortion rights

Blueish Independent

  • 4.36m voters (27.4%)

  • average age 42.4 years

  • 44.3% European, 36.4% Latinx ethnicity

  • 49.7% male

  • very supportive of renewables, pretty strong on Medicare for All, pathway to citizenship, $15 minimum wage, gun control, climate change action, jobs guarantee; against immigration crackdown

Sleeping “Giant”

  • 840k voters (5.3%)

  • average age 43.1 years

  • 79.1% Latinx, 14.1% European ethnicity

  • 55.7% male

  • generally liberal, not super strong on any one issue on average; supports Medicare for All, renewable energy, jobs guarantee, pathway to citizenship, $15 minimum wage; not strong on opioid treatment, #MeToo, Green New Deal, abortion rights

Vanilla Libs

  • 1.37m voters (8.6%)

  • average age 58.4 years

  • 53.5% European, 31.5% Latinx ethnicity

  • 47.7% male

  • liberal overall, not very strong on any one issue on average; strongest support is renewable energy

Prog (W)OC

  • 2.45m voters (15.4%)

  • average age 43.1 years

  • 46.2% Latinx, 29.6% Black, 18.8% European ethnicity

  • 24.4% male

  • liberal overall; extremely strong average support for Medicare for All and renewable energy


How are these clusters geographically distributed across the state? Figures 2 & 3 show that counties along the Río Grande Valley tend to have higher proportions of Blueish Independents than the rest of the state, while the same goes for the southern group of those counties with respect to Sleeping “Giant”s. Fox & Friends and Picket Fence Betty Draper are fairly evenly distributed outside of the Río Grande Valley, with Fox & Friends not as dense anywhere as other clusters. Young & Curious Conservatives are not proportionately dense anywhere in particular and are represented most strongly in metro areas.

Fig. 2 Log of cluster population in each county. White corresponds to the per county frequency of a theoretical cluster having 1/8 share of statewide registered voters and uniform distribution across all counties.

Fig. 3 Proportion of each county made up of each cluster. The lower Río Grande Valley counties are mostly Sleeping “Giant,” whereas the upper Valley has some but is more Blueish Independent. Vanilla Libs and Prog (W)OC are also present in larger proportion along the upper Río Grande Valley than the remaining clusters.

Figure 4 shows that Fox & Friends, Picket Fence Betty Draper, and Vanilla Libs had the highest turnout rates in the 2020 general election (93.3%, 84.1%, and 86.9%, respectively). We also know from the cluster descriptions that they are the whitest (non-Latinx) and oldest on average. The bulk of currently registered voters who are available for turnout targeting are Blueish Independent and Prog (W)OC.

Fig. 4 Age histograms per cluster for 2020 general election voteds (green) vs non-voteds (purple). Legend totals include out-of-state registered voters, whereas county- and congressional district-based plots in this post do not.

OK that’s great, we have some idea of turnout for these clusters. We also know that in Figure 1 they’re oriented from the political left to the political right, left to right. Do we have some idea how they voted in the 2020 general election? We can try using candidate donations as a proxy, to start. Figure 5 is not a random sample of the voter file; it shows one dot per donor, colored by the recipient (blue for Biden, red for Trump). We see that the left is almost all blue, the right is mostly red with some blue, and the center is sort of a muddy mix.

Fig. 5 Axes are the same principal components as Figure 1 (the redder direction is toward Fox & Friends and the bluer direction is toward Prog (W)OC). One dot per donor, colored by candidate recipient’s party.

Since we definitely see a pattern in Figure 5 but the cluster boundaries are not apparent, let’s look at a plot of percent of cluster donating to Trump versus percent of cluster donating to Biden (Figure 6). Using donations as proxies for votes is confounded by the fact that not everyone has disposable income to donate, which is why we did not take into account the amount of the donation from the data we had available. The highest percentage of any cluster donating to a candidate was just under 6% of Vanilla Libs donating to Biden. It seems safe to assume that Fox & Friends and Picket Fence Betty Draper are solidly Republican and Vanilla Libs is solidly Democrat. Among the clusters who donated in lower percentages, Prog (W)OC had barely anyone donating to Trump while about 1% donated to Biden. This appears to be a function of disposable income and not ambivalence on partisanship. Of the final four, Blueish Independent and Sleeping “Giant” appear to be Democrat-leaning and Reddish Independent and Young & Curious Conservatives appear to be Republican-leaning. They seem to be the clusters best targeted for persuasion in a partisan race.

Fig. 6 Scatterplot of percentage of each cluster donating to each candidate (dashed line is equal percentage donated to each candidate).
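The quantity plotted in Figure 6 is a simple rate: donors to a candidate divided by cluster size, ignoring donation amounts. A small sketch with hypothetical cluster names and counts (not figures from the voter file):

```python
# Hypothetical counts for illustration only.
cluster_sizes = {"Cluster A": 1_000_000, "Cluster B": 2_500_000}
donors = {
    "Cluster A": {"Biden": 55_000, "Trump": 4_000},
    "Cluster B": {"Biden": 5_000, "Trump": 30_000},
}

def donor_rates(sizes, donor_counts):
    """Percent of each cluster that donated to each candidate (amounts ignored)."""
    return {
        cluster: {cand: 100 * n / sizes[cluster] for cand, n in by_cand.items()}
        for cluster, by_cand in donor_counts.items()
    }

rates = donor_rates(cluster_sizes, donors)
for cluster, r in rates.items():
    # A point off the dashed equal-percentage line leans toward the larger rate.
    side = "Biden" if r["Biden"] > r["Trump"] else "Trump"
    print(f"{cluster}: Biden {r['Biden']:.2f}%, Trump {r['Trump']:.2f}% ({side} side)")
```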

Before going into plots of 2020 general election activity, a small caveat about L2 data compared to Texas Secretary of State data in Figure 7. Essentially, L2 took the voter file once the TX SOS updated the file with 2020 election results. Both L2 and the TX SOS are constantly updating their own databases to account for national change-of-address requests (NCOAs), deaths, late submissions, etc., and the SOS has accurate information available on its website (special thank you to Kate Fisher at the Elections Division!). So we see that L2’s data set reports > 95% of voters for most counties and those that are outside of that range tend to have had smaller turnouts.

Fig. 7 Comparison of L2’s turnout figures and Texas Secretary of State’s turnout figures for the 2020 general election.

Let’s take a look at which clusters had the biggest changes in turnout between the 2016 and 2020 general elections. We’ll start by looking at where the highest raw frequencies of voters by cluster are (Figure 8a). The highest frequency of any cluster per county is Blueish Independent in Harris County; second highest is Prog (W)OC in Harris County. We see that Picket Fence Betty Draper, Blueish Independent, and Prog (W)OC have high frequencies in the top four or so most-populous counties. (Note: This plot should give you a sense why Figure 2 used a log scale!) Figure 8b shows that in the 2016 general election Fox & Friends had consistently high turnout across counties, followed by Picket Fence Betty Draper and Vanilla Libs (NB 2020 totals in the denominator and y-axis labels for 2016 voting data). Reddish Independent and Blueish Independent had low turnout across counties. Figure 8c shows that all clusters experienced an average increase in turnout across counties for 2020, with Sleeping “Giant” lagging under 50% in some of the large counties. Figure 8d shows that from 2016 to 2020, Reddish Independent and Blueish Independent experienced the largest change in turnout, with counties like Hays nearly tripling turnout in those clusters.

Fig. 8a Frequency of each voter cluster per county for the top 30 most populous counties. Sorted left to right in order of decreasing total population.

Fig. 8b Proportion of each voter cluster voting in the 2016 general election per county for the top 30 most populous counties. White corresponds to half so anything green is > 50% and anything purple is < 50%.

Fig. 8c Proportion of each cluster voting in the 2020 general election per county for the top 30 most populous counties. Same as above; white corresponds to half so anything green is > 50% and anything purple is < 50%.

Fig. 8d Proportion change from 2016 to 2020 in voter cluster turnout per county. White corresponds to 1 which is no change in turnout frequency (could be different people but same number overall) and green[/purple] shows the multiple that turned out in 2020 compared to 2016. Purple is fractional turnout. For example, Reddish Independent had over 2.5x turnout in Collin County in 2020 compared to 2016. Webb County obviously has anomalous results for Picket Fence Betty Draper and Vanilla Libs, which may be due to a discrepancy between Texas Secretary of State data and L2 data.
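The multiple in Figure 8d can be sketched as a plain ratio. Since both years' rates use the 2020 totals as the denominator, the ratio of rates equals the ratio of raw voter counts. The counts below are hypothetical.

```python
import math

def turnout_multiple(voted_2016, voted_2020):
    """2020 voters divided by 2016 voters for one cluster in one county."""
    if voted_2016 == 0:
        return math.nan  # undefined when nobody in the group voted in 2016
    return voted_2020 / voted_2016

registered_2020 = 50_000                 # hypothetical cluster size in one county
voted_2016, voted_2020 = 10_000, 26_000  # hypothetical voter counts

rate_2016 = voted_2016 / registered_2020
rate_2020 = voted_2020 / registered_2020
multiple = turnout_multiple(voted_2016, voted_2020)
print(f"2016 rate {rate_2016:.0%}, 2020 rate {rate_2020:.0%}, multiple {multiple:.1f}x")
```

A multiple of 1 is the white midpoint (no change), anything above 1 is green, and anything between 0 and 1 is purple.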

Notice in Figure 8d that the proportion change is lower across all clusters in El Paso, Hidalgo, Cameron, and Webb Counties. Is there some way to figure out why groups of voters behave a certain way en masse in different geographic regions? County borders do not get redrawn nearly as often as congressional district boundaries do. As mentioned in the intro, there has long been discussion about congressional gerrymandering in Texas. One of the issues with gerrymandering is wasted votes; let’s take a look at the voter cluster makeup of the 116th Congress districts to see if we can suss out any reasons for district-wide voting patterns.

Let’s look at a map of voter cluster proportions per district to get us oriented. The congressional districts around Dallas-Fort Worth, Houston, and Austin are a bit small area-wise for discernment here but area does not factor into our analysis so not to worry.

Fig. 9 Proportion of each congressional district belonging to each cluster. White corresponds to 0.125 (1/8, hypothetical uniform proportion of eight voter clusters).

As we saw in the original cluster descriptions, the statewide proportions per cluster are as follows: Fox & Friends 0.060, Picket Fence Betty Draper 0.206, Young & Curious Conservatives 0.076, Reddish Independent 0.091, Blueish Independent 0.274, Sleeping “Giant” 0.053, Vanilla Libs 0.086, and Prog (W)OC 0.154. Since ideologically similar voters are not randomly distributed geographically (in fact, very much to the contrary), different areas of the state have higher and lower local densities of each voter cluster than the state average.

The same way I did PCA on the voter clustering features in order to visualize in Figure 1, I did it again on a congressional district (henceforth, “CD”) feature set of voter cluster proportions per district to visualize what they look like when colored by the winner’s political party (CD06 is vacant but between two Republicans in a special election, so it’s red). In layperson’s terms, I wanted to see a plot of CDs’ cluster proportions in three dimensions since we can’t see eight dimensions (at least I can’t).

Figure 10 starkly divides Democratic- and Republican-led CDs. To characterize the crossover region between red and blue on those axes, I “typed” CDs using k-means clustering on the original voter cluster proportion features used for the PCA. Figure 11 shows the CD cluster types with district numbers overlaid.

Fig. 10 Pairwise biplots for the first three principal components of voter cluster proportion per congressional district features. The plotting symbol is the district number and color is the current seat holder’s party (except for CD06 which has a special election between two Republicans, so it’s red anyway).

Fig. 11 Again, one point per congressional district; size is proportionate to district population. Colored by k-means clustering results. The axes are the same principal components from Figure 10.
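As a rough illustration of typing districts by their cluster mix, here is a NumPy-only sketch: k-means on the proportion vectors, plus a PCA projection used only for plotting coordinates. The six districts, their proportions, `k = 3`, and the farthest-first initialization are all assumptions made for this toy example; the post uses five types and does not describe its initialization.

```python
import numpy as np

def farthest_first(X, k):
    """Deterministic start: row 0, then repeatedly the row farthest from those chosen."""
    idx = [0]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(X[:, None] - X[idx][None], axis=2), axis=1)
        idx.append(int(d.argmax()))
    return X[idx].copy()

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = farthest_first(X, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# Hypothetical districts (rows) x 8 voter-cluster proportions (each row sums to 1).
props = np.array([
    [0.10, 0.35, 0.10, 0.12, 0.18, 0.02, 0.08, 0.05],
    [0.09, 0.33, 0.11, 0.12, 0.20, 0.02, 0.08, 0.05],
    [0.02, 0.08, 0.05, 0.05, 0.30, 0.08, 0.07, 0.35],
    [0.02, 0.09, 0.05, 0.05, 0.29, 0.09, 0.07, 0.34],
    [0.02, 0.08, 0.04, 0.06, 0.35, 0.30, 0.07, 0.08],
    [0.02, 0.07, 0.04, 0.06, 0.36, 0.31, 0.07, 0.07],
])

labels, centers = kmeans(props, k=3)   # typing is done on the original features

# PCA is used only to get 2-D coordinates for a scatterplot of the districts.
Xc = props - props.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T
print(labels)
```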

And this is how they map out geographically:

Fig. 12 Map of 116th Congress boundaries, colored by district types from Figure 11.

Based on the 2020 results, it looks like CD types 3 and 5 will usually go blue, type 4 will nearly certainly go red, type 1 will go mostly if not all red, and type 2 will go mostly if not all blue. Let’s look at the actual cluster proportions per CD. Figure 13 looks a little hectic but the important thing to note is that I made 0.125 (1/8) white on the colormap so that any cluster that is disproportionately over[/under]represented from an even theoretical distribution in a CD (i.e., all clusters 1/8 of total) will appear green[/purple]. Again, remember that the statewide cluster proportions are not all 0.125 so we’re not expecting to see a uniformly colored plot here. We’re looking for where and how the differences are happening.

Fig. 13 The proportion of each cluster per congressional district (columns sum to 1, see colorbar). White represents 0.125 (1/8, or equal proportion for all clusters). The x-axis is sorted left to right by decreasing CD total population within each CD cluster type.
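One way to pin white to 0.125 on a purple-to-green colormap in Matplotlib is `TwoSlopeNorm`. This is a sketch of that approach on random, hypothetical proportions; I am assuming the `PRGn` colormap here, and the original figures may have been built differently.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

# Hypothetical data: 8 voter clusters (rows) x 4 districts (columns sum to 1).
rng = np.random.default_rng(2)
props = rng.dirichlet(np.ones(8), size=4).T

# Map 1/8 to the middle of the colormap; below is purple, above is green.
norm = TwoSlopeNorm(vmin=0.0, vcenter=1 / 8, vmax=props.max())

fig, ax = plt.subplots()
im = ax.imshow(props, cmap="PRGn", norm=norm, aspect="auto")
ax.set_xlabel("district")
ax.set_ylabel("voter cluster")
fig.colorbar(im, ax=ax, label="proportion of district")
```

The same pattern works for the turnout plots by setting `vcenter=0.5`.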

We’ll go top to bottom by cluster using 0.125 as “normal” representation just for the sake of descriptions. Fox & Friends has slightly above normal representation in CD type 4 and is very underrepresented in CD types 2 and 3. Picket Fence Betty Draper has very strong representation in CD types 1 and 4 and decent representation in CD23, which may be why it flipped from the expected winning party (hang on for Figures 14 and 15 since this one doesn’t address turnout). Young & Curious Conservatives essentially shadows the representation pattern of Picket Fence Betty Draper but doesn’t quite have the numbers yet; same with Reddish Independent, except it is stronger than Young & Curious Conservatives in CD type 4. Blueish Independent is the only voter cluster that is overrepresented across all CDs, which is interesting because we’ve already seen that different CD types are more likely to go to one party or the other. Remember in Figure 6 we saw that Blueish Independent was near the midline for partisan donations; this appears to be a large group of potential swing voters. Sleeping “Giant” has very strong representation in CD type 5 and medium representation in CD type 2 and part of CD type 3 (CD29 and CD33). Vanilla Libs are sort of spread out with medium representation across the state, with stronger representation in CD type 1 (NB Vanilla Libs have the highest likely-Dem voter turnout rate and they have strongest representation in a CD type that is likely to go Republican… CD07 and CD32 addressed shortly). Prog (W)OC has extremely strong representation in CD type 3 (in fact so strong it looks like packing) and pretty strong representation in CD type 2.

OK, so how did these CD types turn out by cluster in 2020? We know something unexpected probably happened in at least a few places because not all CD type 1 went Republican and not all CD type 2 went Democrat. Let’s take a look in Figures 14 and 15.

Fig. 14 Same idea as Figure 8c except with CDs instead of counties. It shows the proportion of each voter cluster per CD that turned out to vote. White is set to 0.50 so anything green is above 50% turnout and anything purple is below 50% turnout.

Fig. 15 This plot shows the results of two-sample tests comparing a voter cluster’s turnout proportion in each CD with the same cluster’s turnout proportion statewide. In other words, at the given significance level, did a cluster have a higher (green) or lower (purple) turnout rate in a CD than it did statewide? Big picture takeaway is that races in CDs of type 1 drove above average turnout rates in almost every voter cluster; the exceptions were CD14 and CD17 which had very average turnout and CD27 which was anomalously low for CD type 1. CD types 4 and 5 in particular do not seem to draw turnout enthusiasm, probably because they are safe Republican and Democrat (refer to Figures 10 & 11), respectively, and people feel like their votes are wasted.
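For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of a two-sided, two-proportion z-test using only the standard library. The counts are hypothetical, and comparing a district against the rest of the state (so the two samples do not overlap) is my assumption for the sketch, not necessarily how the test behind Figure 15 was set up.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Hypothetical: one cluster's voters in a district vs. the same cluster elsewhere.
z, p = two_proportion_ztest(x1=31_000, n1=50_000, x2=330_000, n2=600_000)
alpha = 0.01  # assumed significance level
if p < alpha:
    print("higher (green)" if z > 0 else "lower (purple)")
else:
    print("no significant difference")
```

With voter-file-sized samples even small differences in rate come out significant, so the choice of significance level matters a lot for how much of the plot is colored.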

Figure 14 shows that the four most conservative clusters (top four rows) had fairly consistent turnout across CDs, and the more conservative the higher the turnout rate. The turnout rate does appear to dissipate in CD type 4 after the first two rows, which may be because Fox & Friends and Picket Fence Betty Draper have such strong representation that the CDs are safe Republican. Vanilla Libs turned out pretty consistently across all CDs. Blueish Independent doesn’t seem like it was very convinced to turn out in CD types 3 and 4. Prog (W)OC and Sleeping “Giant” appear to have turned out lower than their respective average rates in several CDs across types (CD27, CD16, CD19, CD13, CD11, etc.). Sleeping “Giant” had a very low turnout rate overall across CD types 2-5, in spite of having very solid representation in CD type 5 in particular (again, could be an issue of wasted votes).

Regarding the anomalous outcomes with respect to CD type expectations, Figure 15 suggests that in the cases of CD07 and CD32 the above-average turnout rate of Prog (W)OC, Sleeping “Giant”, and perhaps Blueish Independent was enough to overcome average turnout by Fox & Friends. Figure 15 also shows us that CD23 went Republican likely because of strong turnout by Picket Fence Betty Draper.

All right, so we’ve looked at a statewide map of CDs and a bunch of voting numbers, but what do the CDs actually look like up close? The following is a density plot of the Houston metro area population by voter cluster with 116th Congress boundaries.

Fig. 16 Houston metro area population density plots by voter cluster with 116th Congress boundaries. Check here for a map with current district numbering (that’s the article link, image itself is here).

Let’s not even talk about partisan associations for the time being. Figure 16 suggests that CD29 was drawn to include the highest density areas of Sleeping “Giant”, even scooping out the center part where Prog (W)OC is dense. CD29 has no other cluster density to speak of besides a tad of Blueish Independent and some spillover Prog (W)OC. CD09 and CD18 appear to be drawn to include the highest density areas of Prog (W)OC, which again have not much other cluster density to speak of besides Blueish Independent and Sleeping “Giant”.

Vanilla Libs, Blueish Independent, and Young & Curious Conservatives are each individually densest in the same central area. Vanilla Libs is the densest and its density gets cut up into five different CDs. The three we already mentioned (CD29, CD09, CD18) will go blue anyway, which seems like a further accumulation of wasted votes. Rep. Lizzie Fletcher (D) won CD07 over a nine-term Republican incumbent in 2018 and then again over challenger Wesley Hunt in 2020; from Figure 16 I’d say a growing population of Prog (W)OC plus Vanilla Libs and some Blueish Independent helped to flip the district (possibly also some Young & Curious Conservatives). Before the flip I would have argued that Vanilla Libs were getting cracked in Houston, which I suppose I would still argue but with the additional comment that it is increasingly unsuccessful. Notice that a blue CD07 is one of the anomalies of CD type 1.

The remainder of the Vanilla Libs density falls in CD02, where Rep. Daniel Crenshaw (R) won reelection for a second term in 2020 (and which is also CD type 1). CD02 looks like the lower portion has the most density from Young & Curious Conservatives, Blueish Independent, and Vanilla Libs, with a little Reddish Independent and Picket Fence Betty Draper thrown in. The northeast chunk is far more conservative. Figure 16 suggests that Vanilla Libs could have turned out more in CD02 for the 2020 general.

Figure 17 shows more or less the same geographic area as Figure 16 but only includes data points for the city of Houston itself. You can very clearly see that Fox & Friends is nowhere dense, for example, and that the average age of Picket Fence Betty Draper and Vanilla Libs is higher than other clusters. Older Prog (W)OC voters appear to live in CD09 and CD18. Blueish Independent seems to be the most evenly distributed geographically and age-wise.

Fig. 17 Houston city scatterplot by cluster colored by voter age with 116th Congress boundaries.

New congressional boundaries are being drawn as I type this. If you have any questions about the work shown here or you think I can help in some way for your redistricting/campaign/voter outreach initiative/etc., please let me know. And of course, if you are in a different state and would like to discuss how to apply this work where you are, please get in touch as well. You can email me at rebecca at volsweep dot com. I would love to hear comments. Thank you very much for your interest!

APPENDIX

Fig. 18 Same plot as Figure 1 but colored by voter age instead of cluster membership.

General elections

The plots in the spotlight gallery immediately below show, in order:

  1. Log frequency of a given voter cluster per county, 2016 general election;

  2. Log frequency of a given voter cluster per county, 2020 general election;

  3. The proportion of a county’s voters in a particular cluster actually voting, 2016 general election;

  4. The proportion of a county’s voters in a particular cluster actually voting, 2020 general election;

  5. The proportion change in turnout frequency from 2016 general election to 2020 general election.

In the first two plots, white corresponds to the log frequency if all voteds were evenly distributed across voter clusters and across counties. In the next two plots, white corresponds to 50% voter turnout. In the final plot, white corresponds to no change in frequency between 2016 and 2020; purple is lower turnout and green is higher turnout from 2016 to 2020.

Republican primary

The plots in the spotlight gallery immediately below show the same as for the general elections above, but for the GOP primaries 2016 & 2020. (See descriptions above.)

Democratic primary

The plots in the spotlight gallery immediately below show the same as for the general elections above, but for the Democratic primaries 2016 & 2020. (See descriptions above.)

Two-tailed test for equality of proportion for two-way county results

This plot shows which counties went solidly for Biden (blue), solidly for Trump (red), and which were statistically close enough to be considered a tie at a very small significance level (white). The counties in white should be the ones parties pay the most attention to in presidential races moving forward.

Fig. 22 Statistical test for higher vote share per county between major party presidential candidates.

Ratio of third party vote share to two-way difference

This plot shows the proportion of third-party votes in a county to the directional, two-way difference between votes received by Trump and Biden. “Directional” means the ratio is positive (green) if Trump received more votes and negative (purple) if Biden received more votes. All the green counties happen to have 0 < proportion < 1, which means Trump won the county and the third party vote frequency was not enough to make up the two-way difference.

Fig. 23
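The ratio described above can be written out in a few lines. The vote counts below are hypothetical, and returning NaN for an exact two-way tie is my own choice for the sketch.

```python
import math

def third_party_ratio(trump, biden, third_party):
    """Third-party votes divided by the signed Trump-minus-Biden margin."""
    margin = trump - biden
    if margin == 0:
        return math.nan  # ratio is undefined for an exact two-way tie
    return third_party / margin

# Hypothetical county totals.
print(third_party_ratio(12_000, 9_000, 600))    # positive and below 1: Trump won, third party < margin
print(third_party_ratio(9_000, 12_000, 600))    # negative: Biden won
print(third_party_ratio(10_100, 10_000, 250))   # magnitude above 1: third party exceeds the margin
```

A magnitude above 1 marks a county where third-party votes outnumbered the two-way margin, so they could in principle have changed the county's winner.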