National Voter Clusters
tl;dr We analyzed data appended to the national voter file to find six ideological superclusters of voters, which we split further into 21 final clusters. Georgia’s voter cluster distribution is presented as an example. Voter clusters have distinct partisan preference and turnout distributions, which we can use to run generic simulations.
After achieving promising cross-validation and trial results with our Texas voter clusters, we collaborated again with HaystaqDNA to identify ideological clusters of voters at the national level. We used the same general process as before, hierarchical agglomerative clustering, but this time in two tiers. Because the national voter file is so large, we ran a preliminary round to identify the superclusters, or overarching groups, followed by a second round on each individual supercluster to break it up into subclusters (henceforth in this post simply referred to as “clusters”). The final set of national clusters does not correspond directly to the Texas results. The distributions of the issue scores we used as clustering features pertain to the national electorate instead of just Texas voters and, as you can imagine, the ideological landscape of Texas does not represent that of the entire country.
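To make the two-tier idea concrete, here is a minimal sketch in plain Python. It is not the pipeline we ran with HaystaqDNA: the average-linkage rule, the Euclidean distance, the function names (`agglomerate`, `two_tier`), and the tiny two-feature data set are all assumptions made for illustration. In practice a library implementation would replace the naive loop.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters):
    """Naive average-linkage agglomerative clustering (illustrative only).

    Returns a list of clusters, each a sorted list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters (a < b by construction).
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


def two_tier(points, n_super, n_sub):
    """Tier 1: find superclusters. Tier 2: re-cluster each supercluster alone."""
    result = {}
    for s, members in enumerate(agglomerate(points, n_super)):
        sub_points = [points[i] for i in members]
        k = min(n_sub, len(members))
        for c, local in enumerate(agglomerate(sub_points, k)):
            # Map indices within the supercluster back to the full data set.
            result[(s, c)] = [members[i] for i in local]
    return result


# Hypothetical issue scores: (support for social issues, support for capitalism).
scores = [
    (0.10, 0.90), (0.12, 0.88), (0.11, 0.30), (0.13, 0.28),
    (0.90, 0.80), (0.92, 0.82), (0.91, 0.20), (0.93, 0.18),
]
labels = two_tier(scores, n_super=2, n_sub=2)
for (supercluster, cluster), voter_ids in sorted(labels.items()):
    print(supercluster, cluster, voter_ids)
```

In this toy data the first tier separates voters on social issues and the second tier splits each group on support for capitalism, which mirrors the supercluster structure described in the next paragraph. Running the second round within each supercluster also keeps each clustering problem much smaller than one pass over the full file.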
The six superclusters broadly match standard voter characterizations most people recognize, namely: Conservative, Center Right, Center, Center Left, Neoliberal, and DemSoc/Socialist (the last one representing any ideological profile to the “left” of neoliberalism). The main differentiating factors among the superclusters are support for social issues, which increases overall in the order listed above, and a split on support for capitalism. Each supercluster has a more pro-capitalism pole and a more anti-capitalism pole, except for Neoliberal and DemSoc/Socialist, which together have the shape of one of the other superclusters and split from each other on support for capitalism at the equator between them. We provide a summary of each individual cluster below Figure 1 if you would like to skip the methods and go straight to the clusters themselves.
METHODS/INTERPRETATION
This paragraph is a bit technical and is not essential reading, but it provides background on how these clusters orient to each other statistically. Figure 1 shows pairwise biplots of the clustering features’ first three principal components, which together explain 75.3% of the variance in the data set. The head of each arrow pointing outward from the center of a plot is located at the projection of an original clustering feature onto the plane defined by that plot’s two principal components (only the ten features farthest from the origin are shown per plot; apologies for the overlapping labels, whose locations are determined by the data). The closer to parallel two arrows are when pointing in the same [opposite] direction, the more positively [negatively] correlated the corresponding variables are; the same holds for the relationship between an arrow and the PCA axes themselves. The longer an arrow extends from the origin, the more variance it explains. In the leftmost plot, for example, “capitalism sound” (i.e., support for capitalism) is a clustering feature whose PC1 coordinate is positive and relatively small compared to its positive PC2 coordinate, and the arrow itself is longer than the arrow for “gay marriage oppose.” From these observations we know that “capitalism sound” explains more variance along, and is more positively correlated with, the PC2 axis than the PC1 axis, with which it is also positively correlated. The PC2 coordinate for “gay marriage oppose” is larger in magnitude than the PC2 coordinate for “capitalism sound,” so we also know that the former explains more variance along the PC2 axis. To give a concrete example of how to interpret this information, Coming Up Right is located below and to the right of They’re With Her, so we know that members of the former oppose gay marriage more on average and support capitalism less on average than members of the latter.
Fig. 1 Pairwise biplots for the first three principal components in the issue scores clustering feature space. Less opaque patch colors in legend correspond to clusters that support capitalism less on average than the more opaque counterparts. Provided to show general cluster orientation in the clustering feature space.
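For readers who want to see where biplot arrows and the "variance explained" figure come from, here is a small NumPy sketch. It runs on synthetic data, not on our issue scores: the mixing matrix, the sample size, and the placeholder names "feature c" and "feature d" are assumptions, and the loading scaling shown is one common biplot convention rather than necessarily the one used for Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for issue scores: 200 voters x 4 features.
# Only the first two names appear in the post; the others are placeholders.
features = ["capitalism sound", "gay marriage oppose", "feature c", "feature d"]
latent = rng.normal(size=(200, 2))
mixing = np.array([[0.2, 0.9], [0.9, -0.3], [-0.8, 0.1], [0.1, 0.7]])
X = latent @ mixing.T + 0.1 * rng.normal(size=(200, 4))

# Center the data, then take the SVD; rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Share of total variance explained by each principal component.
explained = S**2 / np.sum(S**2)

# Arrow heads for the PC1-vs-PC2 plot: each feature's loading on the two
# components, scaled by the component standard deviations. With this scaling,
# squared arrow length equals the feature variance captured in that plane.
component_sd = S / np.sqrt(len(X) - 1)
arrows = Vt[:2].T * component_sd[:2]  # shape: (n_features, 2)
lengths = np.linalg.norm(arrows, axis=1)

for name, (pc1, pc2), length in zip(features, arrows, lengths):
    print(f"{name:22s} PC1={pc1:+.2f}  PC2={pc2:+.2f}  length={length:.2f}")
print("share of variance in first two PCs:", round(float(explained[:2].sum()), 3))
```

The sign of each principal axis is arbitrary, so a whole plot can come out mirrored from one run or library to another. Only the relative directions and lengths of the arrows carry meaning.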
CLUSTER SUMMARIES
Now we will go over cluster-by-cluster summaries in decreasing order of supercluster conservativeness. Underneath each cluster name you can see the top donation recipients from 2020 to 2022 chosen by members of that cluster.
Conservative supercluster: 65 years old on average, 41% female, {85% White, 3% Latinx, 1% AAPI}
Center Right supercluster: 48 years old on average, 51% female, {77% White, 7% Latinx, 3% AAPI, 2% Black}
Center supercluster: 47 years old on average, 51% female, {73% White, 9% Latinx, 3% AAPI, 2% Black}
Center Left supercluster: 44 years old on average, 49% female, {56% White, 18% Latinx, 8% Black, 5% AAPI}
Neoliberal supercluster: 67 years old on average, 63% female, {50% White, 25% Black, 13% Latinx, 3% AAPI}
DemSoc/Socialist supercluster: 34 years old on average, 63% female, {40% White, 24% Black, 20% Latinx, 5% AAPI}
CLUSTER PARTISANSHIP COMPOSITION
Now that we’ve gone over the cluster summaries, let’s take a look at partisanship distributions. This heat map shows, for each cluster, the proportion of its voters with a given political orientation (determined by explicit party registration, primary participation, or data modeling; rows sum to 1).
Fig. 2 Heat map showing the proportion of each national voter cluster belonging to each partisanship category.
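The row normalization behind a heat map like Figure 2 can be sketched in a few lines. The voter records and the three partisanship category names below are hypothetical (the post does not list the actual categories); only the two cluster names are taken from this post.

```python
from collections import Counter, defaultdict

# Hypothetical voter records: (cluster name, partisanship category).
voters = [
    ("Fox & Friends", "Republican"), ("Fox & Friends", "Republican"),
    ("Fox & Friends", "Independent"), ("Fox & Friends", "Republican"),
    ("People Power", "Democrat"), ("People Power", "Democrat"),
    ("People Power", "Independent"), ("People Power", "Republican"),
]

# Count voters per category within each cluster.
counts = defaultdict(Counter)
for cluster, party in voters:
    counts[cluster][party] += 1

# Fixed column order shared by every row of the heat map.
categories = sorted({party for _, party in voters})

# One row per cluster; each cell is that category's share of the cluster,
# so every row sums to 1.
heat_map = {
    cluster: [tally[cat] / sum(tally.values()) for cat in categories]
    for cluster, tally in counts.items()
}

for cluster, row in heat_map.items():
    print(cluster, dict(zip(categories, row)))
```

Because each row is normalized by its own cluster's size, the heat map compares partisan mix between clusters and says nothing about how large each cluster is.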
GEORGIA EXAMPLE
Now let’s take a quick look at Georgia as an example to see how these clusters are distributed both geographically (Figure 3) and age-wise by turnout status (Figure 4).
Fig. 3 This plot shows, by voter cluster, the proportion of each Georgia county’s voters that belongs to the given cluster. White represents 1/21 ≈ 0.048, the proportion each cluster would represent per county if the clusters were uniformly distributed; green indicates overrepresentation relative to a uniform distribution and purple indicates underrepresentation.
The more conservative clusters (the top-row clusters plus New Indys and The Disillusioned) appear to have the highest proportional representation in broad areas outside of the Black Belt region. Winning in the Middle, Working Week, Vanilla Libs, They’re With Her, People Power, Professional Left, and Unburdened Left all show very low proportional representation outside of the Atlanta metro area, where they are concentrated. La Lucha, Golden Years (L), Rainbow Sunset (L), and Out With the Old all have their highest proportional representation in the Black Belt region.
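The color scale in Figure 3 is centered on the uniform share of 1/21. As a rough sketch of that comparison, the snippet below classifies a cluster's share of one county against the baseline. The county counts and the helper name `representation` are hypothetical; only the cluster names and the 21-cluster baseline come from this post.

```python
N_CLUSTERS = 21
UNIFORM_SHARE = 1 / N_CLUSTERS  # about 0.048, the white midpoint of the scale

# Hypothetical voter counts for a single county, keyed by cluster name.
county_counts = {"Fox & Friends": 300, "La Lucha": 20, "Working Week": 48}
county_total = 1000  # hypothetical total across all 21 clusters


def representation(count, total):
    """Return a cluster's share of the county and its color relative to uniform."""
    share = count / total
    if share > UNIFORM_SHARE:
        color = "green (overrepresented)"
    elif share < UNIFORM_SHARE:
        color = "purple (underrepresented)"
    else:
        color = "white (uniform)"
    return share, color


for cluster, count in county_counts.items():
    share, color = representation(count, county_total)
    print(f"{cluster:15s} share={share:.3f}  {color}")
```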
Fig. 4 Two age histograms per voter cluster; a green one for voters who turned out in the 2020 general election, and a purple one for those who did not turn out.
In Figure 4, both the x- and y-axes are shared by all subplots, so we can compare total histogram area across subplots and relative histogram area within subplots. For example, Fox & Friends and Fox Lite appear to be responsible for the most raw turnout of any clusters (i.e., their green histogram areas are large compared to those of other clusters). Additionally, the purple histogram area in those two subplots is very small, particularly in proportion to the green area, which means the turnout rate is very high for Fox & Friends and Fox Lite. Turnout rates are shown in each subplot legend as percentages in parentheses.
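The link between histogram area and turnout rate is simple arithmetic: with shared bins, each histogram's area is proportional to its voter count, so the rate is the green count divided by the green plus purple counts. Here is a minimal sketch for one cluster, using hypothetical records and an assumed 10-year bin width.

```python
from collections import Counter

# Hypothetical (age, voted in the 2020 general election) records for one cluster.
records = [
    (72, True), (68, True), (55, True), (41, False),
    (63, True), (29, False), (77, True), (58, True),
]


def age_bin(age, width=10):
    """Lower edge of the age bin containing `age` (bin width is an assumption)."""
    return (age // width) * width


# Two histograms over the same bins: voters who turned out and those who did not.
voted = Counter(age_bin(age) for age, turned_out in records if turned_out)
not_voted = Counter(age_bin(age) for age, turned_out in records if not turned_out)

# With equal bin widths, total bar height stands in for histogram area.
green_area = sum(voted.values())
purple_area = sum(not_voted.values())
turnout_rate = green_area / (green_area + purple_area)

print(dict(sorted(voted.items())), dict(sorted(not_voted.items())))
print(f"turnout rate: {turnout_rate:.0%}")
```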
As we showed in the previous blog post with the 2022 midterm simulations, the turnout rates for each cluster across geographic regions (e.g., counties) are quite nicely normally distributed. We will do more in-depth example analyses in future posts. If you think voter cluster labels can assist in a targeting project, please get in touch at contact@volsweep.com! We’d love to hear what you’re working on!