We describe all of this in more detail below, but the key point for applied researchers is that, for now, we recommend using the jackknife bias correction, which is now built into our main conText() function.
We are hugely thankful to Will Hobbs and Breanna Green for bringing to our attention clear examples where the bias was larger than we had anticipated. We are actively collaborating with them to propose additional fixes.
conText’s regression framework allows users to estimate the “semantic” shift in the usage of a term between two or more groups, while controlling for other group covariates. The resulting coefficients will be D-dimensional, where D equals the dimensionality of the pre-trained embeddings used (say, 300). In Rodriguez et al. (2023) we propose using the Euclidean norm to summarize the magnitude of these coefficients, noting however that:
“…estimates of the norms have a finite-sample bias for rare words. Thus care is needed when comparing words or groups with substantially different amounts of available data.”
We now discuss the potential problems more precisely and describe a way to mitigate them.
While the D-dimensional OLS coefficients resulting from a conText regression are approximately unbiased, functions of these coefficients, such as the norm, are not. The magnitude of the bias, and hence the potential for mistaken inferences, increases with the variance of the embedding position. Thus, the magnitude of the bias is a function of sample size.
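To build intuition, consider a minimal sketch (ours, not part of the conText code): even when the true mean-difference vector is exactly zero, the norm of an unbiased but noisy estimate of it is strictly positive.

# minimal sketch: the norm of an unbiased estimate is biased upward;
# the true mean vector is zero, so the true norm is zero
set.seed(42)
D <- 100 # embedding dimensionality
n <- 50  # number of observations
est <- rowMeans(replicate(n, rnorm(D))) # unbiased estimate of the zero vector
sqrt(sum(est^2)) # Euclidean norm: roughly 1.4, not 0

The same holds for the scaled (root-mean-square) version of the norm used in the code below: squaring, averaging, and taking a square root cannot return zero unless every coordinate of the estimate is exactly zero.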
As one might expect: the smaller the sample (for any of the slices of the data resulting from the set of covariates), the higher the variance, and the larger the bias. In the case of conText, a multivariate multiple regression, the problem is more acute the larger the dimensionality of the embedding space, with higher dimensionality (say, vectors of length 500 instead of 100) resulting in higher variance in small samples, all else equal.
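Both scaling claims are easy to check in a few lines. In this sketch (ours; norm_bias is a hypothetical helper), the truth is zero throughout, yet the average estimated Euclidean norm shrinks with the sample size and grows with the dimensionality.

# sketch: average estimated norm when the true norm is 0,
# varying sample size n and dimensionality D
set.seed(2023)
norm_bias <- function(D, n, reps = 100) {
  # each replicate: norm of the mean of n draws from N(0, I_D)
  mean(replicate(reps, sqrt(sum(rowMeans(matrix(rnorm(D * n), nrow = D))^2))))
}
norm_bias(D = 100, n = 50)   # ~ sqrt(100/50)   = 1.41
norm_bias(D = 100, n = 5000) # ~ sqrt(100/5000) = 0.14: bias shrinks with n
norm_bias(D = 500, n = 50)   # ~ sqrt(500/50)   = 3.16: bias grows with D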
Re-sampling methods offer one way to quantify and correct for (or at least mitigate) this bias. These include:
- permutation
- bootstrapping
- jackknife cross-validation
Re-sampling approaches to bias correction all work by leveraging sub-samples of the observed data (contexts, in our case) to estimate the magnitude of the bias; the methods differ in how these sub-samples are generated. In light of the experimental evaluations we report below, we currently recommend and implement the jackknife in our software. We emphasize, though, that with small samples and highly imbalanced groups, non-trivial amounts of bias will remain.
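The jackknife has a simple generic form: combine the full-sample estimate with the average of the leave-one-out estimates, as n * theta_hat - (n - 1) * mean(theta_hat_loo). Below is a schematic sketch (ours; the norm-specific implementation used in our simulations appears in the code at the end of this vignette).

# generic jackknife bias correction for any statistic of a sample
# (schematic; jackknife_correct is illustrative, not part of conText)
jackknife_correct <- function(data, statistic) {
  n <- length(data)
  theta_hat <- statistic(data) # full-sample estimate
  # leave-one-out estimates, dropping one observation at a time
  loo <- sapply(seq_len(n), function(i) statistic(data[-i]))
  n * theta_hat - (n - 1) * mean(loo) # bias-corrected estimate
}

For the norm of a difference in means, each leave-one-out estimate can be updated in closed form rather than recomputed from scratch, which is what the jk1() helper in the code below does.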
For our evaluations we employ simulated data representative of the type of data (numeric vectors, i.e. embeddings) we feed into conText. In particular, we proceed as follows:
We sample D-dimensional (D = 100) vectors from two multivariate normal distributions, each representing a hypothetical population of interest (covariate groups in the data, say Republicans and Democrats). Our interest is in estimating the norm of the difference in (row) means of the two populations.
Within this sampling arrangement we then vary the total sample size, the class ratio between the two groups, and the true difference in means, θ (see the parameter grid in the code at the end of this vignette).
For more details on this scheme, see the code at the end of this vignette.
Figure 1 plots our results. The plots on the left capture results for when θ, the true difference in means, equals 0; if the estimator were unbiased, the point estimates would lie on the red “zero” line. The plots on the right capture results for when θ equals 1; if the estimator were unbiased, the point estimates would lie on the red “one” line.
There are several takeaways:
The bias, with no correction (top row), is particularly problematic in the case of no differences (θ = 0); and the bias is worse given small, highly unbalanced samples (e.g. n=150, class ratio 1:100). Specifically, we are more likely to incorrectly reject the null (commit a Type 1 error). That is, the point estimate is sometimes far from zero, and the confidence interval does not cover zero. As expected, the bias decreases as we increase sample size and/or both groups are more equally balanced.
When comparing the various re-sampling-based bias-correction methods, we observe that bootstrapping performs poorly (compared to permutation and the jackknife) in the absence of a true difference between the two population means. That is, given a true θ of 0, bootstrapping shows a strong remaining positive bias post-correction for small, highly unbalanced samples.
The permutation-based correction, on the other hand, performs relatively poorly in the presence of a true difference between the two population means (θ = 1): it shows a strong negative bias for small, highly unbalanced samples, and, surprisingly, this persists even for larger, more balanced samples. It is, however, the best performer in the case of no true differences (θ = 0). So while the permutation-based correction does not perform well as an estimator in general, it still correctly controls the Type 1 error rate when used as a test.
The jackknife performs relatively well in both scenarios, although it does not entirely eliminate the bias in the case of θ = 0 (underperforming the permutation-based correction there).
Figure 1: Simulated estimates of θ, by correction method (rows), class ratio (columns), and sample size (x-axis). Left panels: θ = 0; right panels: θ = 1. The red dashed line marks the true θ.
Given the above results, we updated conText to allow users to implement the jackknife bias correction; see the Quick Start Guide for examples. Specifically:
- The permutation argument in conText remains unchanged, implementing a permutation test to estimate empirical p-values.
- We recommend users combine the jackknife bias correction with the permutation test when evaluating the significance of the norm of the coefficients.
- We further recommend users avoid relying on results with small samples (for any of the slices of the data implicit in the regression model) and always follow up any regression-based results with a qualitative comparison of nearest neighbors.
Note: the current version of conText continues to allow for bootstrapping; this argument, however, does not implement any bias correction and will eventually be deprecated.
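For illustration only, a call combining the jackknife correction with the permutation test might look like the sketch below. The data objects are the package’s bundled Quick Start examples, and the jackknife argument name reflects our reading of the updated documentation; check ?conText in your installed version. The remainder of this vignette is the simulation code behind Figure 1.

library(conText)
library(quanteda)
# 'cr_sample_corpus', 'cr_glove_subset' and 'cr_transform' ship with conText;
# the 'jackknife' argument is an assumption based on the updated docs
toks <- tokens(cr_sample_corpus)
model <- conText(formula = immigration ~ party + gender,
                 data = toks,
                 pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform,
                 jackknife = TRUE, # jackknife bias correction
                 permute = TRUE, num_permutations = 100)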
# load libraries
library(dplyr)     # data wrangling and the pipe
library(purrr)     # pmap() (not attached by furrr)
library(progress)  # progress bars
library(ggplot2)   # plotting
library(gridExtra) # grid.arrange()
library(furrr)     # parallelized purrr (also attaches future)
# ---------------------
# Functions
# ---------------------
# root mean square of a vector: the Euclidean norm scaled by 1/sqrt(D)
rmse <- function(x) {sqrt(mean(x^2))}

# plug-in estimator: norm of the difference in group means
# (columns are observations, rows are embedding dimensions)
plugin <- function(x,y) {
  rmse(rowMeans(x) - rowMeans(y))
}
# one permutation draw: randomly re-assign observations to groups
# and recompute the plug-in statistic
permute <- function(z,nx,ny) {
  ind <- sample(1:(nx+ny),nx)
  xtilde <- z[,ind]
  ytilde <- z[,-ind]
  plugin(xtilde,ytilde)
}

# permutation-based correction: subtract the average permuted statistic
# (an estimate of the statistic's inflation under no true difference)
permutation <- function(x,y) {
  z <- cbind(x,y)
  plugin(x,y) - mean(replicate(100,permute(z,ncol(x),ncol(y))))
}
# jackknife bias correction: n*theta_hat - (n-1)*mean(leave-one-out estimates)
jackknife <- function(x,y) {
  cov <- c(rep(1,ncol(x)),rep(0,ncol(y))) # group indicator per observation
  z <- t(cbind(x,y))                      # rows are observations
  xbar <- rowMeans(x)
  ybar <- rowMeans(y)
  # average of the leave-one-out estimates
  # (sapply passes each i to jk1's first unmatched argument, i)
  jhat <- mean(sapply(1:nrow(z),jk1, z=z,cov=cov,xbar=xbar,ybar=ybar,nx=ncol(x),ny=ncol(y)))
  thetahat <- plugin(x,y)
  return(nrow(z)*thetahat - (nrow(z)-1)*jhat)
}

# leave-one-out estimate with observation i removed; the affected group
# mean is updated in closed form rather than recomputed from scratch
jk1 <- function(z,cov,i,xbar,ybar,nx,ny) {
  if(cov[i]==0) {
    # y case: drop observation i from the y-group mean
    ytilde <- (ybar*ny - z[i,])/(ny-1)
    return(rmse(xbar - ytilde))
  } else {
    # x case: drop observation i from the x-group mean
    xtilde <- (xbar*nx - z[i,])/(nx-1)
    return(rmse(xtilde - ybar))
  }
}
# one bootstrap draw of the plug-in statistic; stratified resampling
# (within groups) by default, otherwise pooled resampling
boot <- function(x,y,strat=TRUE) {
  if(strat) {
    indx <- sample(1:ncol(x), ncol(x), replace=TRUE)
    indy <- sample(1:ncol(y), ncol(y), replace=TRUE)
    return(rmse(rowMeans(x[,indx]) - rowMeans(y[,indy])))
  } else {
    cov <- c(rep(1,ncol(x)),rep(0,ncol(y)))
    z <- cbind(x,y)
    covb <- c(1,1)
    # redraw until both groups appear in the resample
    while(length(unique(covb))==1) {
      ind <- sample(1:ncol(z),ncol(z),replace=TRUE)
      covb <- cov[ind]
    }
    zb <- z[,ind]
    return(rmse(rowMeans(zb[,covb==1, drop=FALSE]) - rowMeans(zb[,covb==0, drop=FALSE])))
  }
}

# bootstrap bias correction: 2*theta_hat - mean of the bootstrap replicates
bootstrap <- function(x,y,strat=TRUE) {
  2*plugin(x,y) - mean(replicate(500, boot(x,y,strat=strat)))
}
# ---------------------
# Simulations
# ---------------------
# one simulation draw: sample both groups, then compute the uncorrected
# and bias-corrected estimates of the norm of the difference in means
sim_function <- function(theta, D, samp_size, class_ratio) {
  mux <- rep(0,D)
  muy <- rep(theta,D) # true difference is theta in every dimension
  # set group sizes from the total sample size and the class ratio
  nx <- ceiling(class_ratio*samp_size/(1+class_ratio))
  ny <- samp_size-nx
  # sample: columns are observations, rows are dimensions
  x <- replicate(nx,rnorm(D,mean=mux))
  y <- replicate(ny,rnorm(D,mean=muy))
  out <- data.frame(
    D = D,
    theta = theta,
    samp_size = samp_size,
    class_ratio = class_ratio,
    method = c("none", "Jackknife", "permutation", "bootstrap"),
    value = c(
      plugin(x,y),
      jackknife(x,y),
      permutation(x,y),
      bootstrap(x,y)
    )
  )
  return(out)
}
# simulation grid: sample sizes, class ratios, true differences, dimensions
params <- expand.grid(
  samp_size=c(150, 250, 500, 1000, 2500),
  class_ratio=c(0.01, 0.1, 0.5, 1),
  theta=c(0, 1),
  D = c(10,100)
)

# 100 Monte Carlo repetitions of the full grid, run in parallel
plan(multisession, workers = 12)
sim_out_df <- 1:100 %>%
  future_map_dfr(
    ~params %>%
      pmap(
        sim_function
      ),
    .progress = TRUE,
    .options = furrr_options(seed = 2023L)
  )
# ---------------------
# Plot
# ---------------------
# summarize across Monte Carlo repetitions: mean estimate and
# the 2.5th/97.5th percentiles per parameter cell
plot_df <- sim_out_df %>%
  mutate(method = factor(method, levels = c("none", "bootstrap", "permutation", "Jackknife"))) %>%
  group_by(D, theta, samp_size, class_ratio, method) %>%
  summarize(avg = mean(value), p025 = quantile(value, 0.025), p975 = quantile(value, 0.975), .groups = "drop_last")
# new facet label names
ratio.labs <- c("class ratio\n1:100", "class ratio\n1:10", "class ratio\n1:2", "class ratio\n1:1")
names(ratio.labs) <- c("0.01", "0.1", "0.5", "1")
# plot w. theta = 0
plot1 <- ggplot(subset(plot_df, theta == 0 & D == 100), aes(x = factor(samp_size), y = avg, group =1)) +
geom_point() +
geom_errorbar(aes(ymin = p025 , ymax = p975), width = 0.2) +
geom_hline(aes(yintercept = 0), color = 'red', linetype = 'dashed') +
facet_grid(method~class_ratio, labeller = labeller(class_ratio = ratio.labs)) +
labs(title = paste0("\u03b8"," = 0\n"), y = expression(hat(theta)), x = "Sample Size", caption = paste0("Note: red dashed line = ", "\u03b8.")) +
theme(
strip.text.x = element_text(size=16, face = 'bold'),
strip.text.y = element_text(size=16, face = 'bold'),
axis.text.x = element_text(angle=90, vjust = 0.5, hjust = 1, size = 14),
axis.text.y = element_text(size = 14),
axis.title.y = element_text(size=16, margin = margin(t = 0, r = 15, b = 0, l = 15), face = 'bold'),
axis.title.x = element_text(size=16, margin = margin(t = 15, r = 0, b = 15, l = 0), face = 'bold'),
plot.caption = element_text(hjust = 0, size = 16),
plot.title = element_text(face = "bold", hjust = 0.5, size = 18))
# plot w. theta = 1
plot2 <- ggplot(subset(plot_df, theta == 1 & D == 100), aes(x = factor(samp_size), y = avg, group =1)) +
geom_point() +
geom_errorbar(aes(ymin = p025 , ymax = p975), width = 0.2) +
geom_hline(aes(yintercept = 1), color = 'red', linetype = 'dashed') +
facet_grid(method~class_ratio, labeller = labeller(class_ratio = ratio.labs)) +
labs(title = paste0("\u03b8"," = 1"), y = expression(hat(theta)), x = "Sample Size") +
theme(
strip.text.x = element_text(size=16, face = 'bold'),
strip.text.y = element_text(size=16, face = 'bold'),
axis.text.x = element_text(angle=90, vjust = 0.5, hjust = 1, size = 14),
axis.text.y = element_text(size = 14),
axis.title.y = element_text(size=16, margin = margin(t = 0, r = 15, b = 0, l = 15), face = 'bold'),
axis.title.x = element_text(size=16, margin = margin(t = 15, r = 0, b = 15, l = 0), face = 'bold'),
plot.title = element_text(face = "bold", hjust = 0.5, size = 18))
grid.arrange(plot1, plot2, ncol=2)