Title: 'a la Carte' on Text (ConText) Embedding Regression
Description: A fast, flexible and transparent framework to estimate context-specific word and short document embeddings using the 'a la carte' embeddings approach developed by Khodak et al. (2018) <arXiv:1805.05388>, and to evaluate hypotheses about covariate effects on embeddings using the regression framework developed by Rodriguez et al. (2021) <https://github.com/prodriguezsosa/EmbeddingRegression>.
Authors: Pedro L. Rodriguez [aut, cre, cph]
Maintainer: Pedro L. Rodriguez <[email protected]>
License: GPL-3
Version: 2.0.0
Built: 2025-01-26 04:41:57 UTC
Source: https://github.com/prodriguezsosa/context
Bootstrap similarity and ratio computations
bootstrap_contrast(target_embeddings1 = NULL, target_embeddings2 = NULL, pre_trained = NULL, candidates = NULL, norm = NULL)
target_embeddings1: ALC embeddings for group 1
target_embeddings2: ALC embeddings for group 2
pre_trained: a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions
candidates: character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab()
norm: character = c("l2", "none") - set to "l2" for cosine similarity and to "none" for inner product (see ?text2vec::sim2)
a list with three elements: nns for group 1, nns for group 2, and nns_ratio, the ratios of similarities between the two groups
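No example accompanies this function in this extract; the sketch below is illustrative, assuming per-instance ALC embeddings (the target_embedding element returned by embed_target() with aggregate = FALSE) are valid inputs for the two groups:

library(quanteda)
# split the sample corpus by party and find contexts around "immigration"
corpus_D <- corpus_subset(cr_sample_corpus, party == "D")
corpus_R <- corpus_subset(cr_sample_corpus, party == "R")
context_D <- get_context(x = corpus_D, target = 'immigration',
                         window = 6, valuetype = "fixed", verbose = FALSE)
context_R <- get_context(x = corpus_R, target = 'immigration',
                         window = 6, valuetype = "fixed", verbose = FALSE)
# one ALC embedding per instance (aggregate = FALSE)
embeds_D <- embed_target(context = context_D$context, pre_trained = cr_glove_subset,
                         transform = TRUE, transform_matrix = cr_transform,
                         aggregate = FALSE, verbose = FALSE)
embeds_R <- embed_target(context = context_R$context, pre_trained = cr_glove_subset,
                         transform = TRUE, transform_matrix = cr_transform,
                         aggregate = FALSE, verbose = FALSE)
# candidates: vocabulary shared by the contexts and the pretrained embeddings
local_vocab <- get_local_vocab(c(context_D$context, context_R$context),
                               pre_trained = cr_glove_subset)
set.seed(42L)
contrast_out <- bootstrap_contrast(target_embeddings1 = embeds_D$target_embedding,
                                   target_embeddings2 = embeds_R$target_embedding,
                                   pre_trained = cr_glove_subset,
                                   candidates = local_vocab, norm = "l2")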
Uses bootstrapping – sampling texts with replacement – to identify the top N nearest neighbors based on cosine or inner product similarity.
bootstrap_nns(context = NULL, pre_trained = NULL, transform = TRUE, transform_matrix = NULL,
              candidates = NULL, bootstrap = TRUE, num_bootstraps = 100,
              confidence_level = 0.95, N = 50, norm = "l2")
context: (character) vector of texts - generally the context column in get_context() output
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embedding.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
candidates: (character) vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab()
bootstrap: (logical) if TRUE, bootstrap similarity values - sample from texts with replacement. Required to get std. errors.
num_bootstraps: (numeric) number of bootstraps to use.
confidence_level: (numeric in (0,1)) confidence level e.g. 0.95
N: (numeric) number of nearest neighbors to return.
norm: (character) how to compute the similarity (see ?text2vec::sim2): "l2" for cosine similarity, "none" for inner product.
a data.frame with the following columns:
feature: (character) vector of feature terms corresponding to the nearest neighbors.
value: (numeric) cosine/inner product similarity between texts and feature. Average over bootstrapped samples if bootstrap = TRUE.
std.error: (numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
# find contexts of immigration
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed", case_insensitive = TRUE,
                                   hard_cut = FALSE, verbose = FALSE)
# find local vocab (use it to define the candidates for nearest neighbors)
local_vocab <- get_local_vocab(context_immigration$context, pre_trained = cr_glove_subset)
set.seed(42L)
nns_immigration <- bootstrap_nns(context = context_immigration$context,
                                 pre_trained = cr_glove_subset,
                                 transform_matrix = cr_transform, transform = TRUE,
                                 candidates = local_vocab, bootstrap = TRUE,
                                 num_bootstraps = 100, confidence_level = 0.95,
                                 N = 50, norm = "l2")
head(nns_immigration)
Bootstrap model coefficients and standard errors
bootstrap_ols(Y = NULL, X = NULL, stratify = NULL)
Y: vector of regression model's dependent variable (embedded context)
X: data.frame of model independent variables (covariates)
stratify: covariates to stratify when bootstrapping
a list with two elements: betas = list of beta coefficients (D dimensional vectors); normed_betas = tibble with the norm of the non-intercept coefficients
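No example ships with this helper (it is normally called internally by conText()). A minimal sketch with simulated stand-ins: the argument description says Y is a "vector", but the D-dimensional betas in the return value suggest the full N x D matrix of embedded contexts; that interpretation, and the already-dummified covariate, are assumptions here.

set.seed(42L)
# stand-in for embedded contexts: 100 observations in a 50-dimensional embedding space
Y <- matrix(rnorm(100 * 50), nrow = 100)
# stand-in covariate, already dummified (1 = Republican, 0 = Democrat)
X <- data.frame(partyR = sample(c(0, 1), 100, replace = TRUE))
ols_out <- bootstrap_ols(Y = Y, X = X, stratify = NULL)
ols_out$betas          # list of D-dimensional beta coefficient vectors
ols_out$normed_betas   # tibble with the norm of the non-intercept coefficients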
Bootstrap similarity vector
bootstrap_similarity(target_embeddings = NULL, pre_trained = NULL, candidates = NULL, norm = NULL)
target_embeddings: the target embeddings (embeddings of context)
pre_trained: a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions
candidates: character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab()
norm: character = c("l2", "none") - set to "l2" for cosine similarity and to "none" for inner product (see ?text2vec::sim2)
vector(s) of cosine similarities between the ALC embedding and nearest neighbor candidates
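No example accompanies this function in this extract; the sketch below is illustrative, assuming per-instance ALC embeddings from embed_target() with aggregate = FALSE as the target embeddings:

# contexts around "immigration" and their per-instance ALC embeddings
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed", verbose = FALSE)
embeds <- embed_target(context = context_immigration$context,
                       pre_trained = cr_glove_subset,
                       transform = TRUE, transform_matrix = cr_transform,
                       aggregate = FALSE, verbose = FALSE)
local_vocab <- get_local_vocab(context_immigration$context, pre_trained = cr_glove_subset)
set.seed(42L)
sims <- bootstrap_similarity(target_embeddings = embeds$target_embedding,
                             pre_trained = cr_glove_subset,
                             candidates = local_vocab, norm = "l2")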
conText-class object
Build a conText-class object.
build_conText(Class = "conText", x_conText, normed_coefficients = data.frame(), features = character(), Dimnames = list())
Class: defines the class of this object (fixed)
x_conText: the underlying matrix of embedding regression coefficients
normed_coefficients: a data.frame with the normed coefficients and other statistics
features: features used in computing the embeddings
Dimnames: row (features) and column (NULL) names
dem-class object
Build a dem-class object.
build_dem(Class = "dem", x_dem, docvars = data.frame(), features = character(), Dimnames = list())
Class: defines the class of this object (fixed)
x_dem: the underlying matrix of document embeddings
docvars: document covariates, inherited from dfm and corpus, subset to embeddable documents
features: features used in computing the embeddings
Dimnames: row (documents) and column (NULL) names
fem-class object
Build a fem-class object.
build_fem(Class = "fem", x_fem, features = character(), counts = numeric(), Dimnames = list())
Class: defines the class of this object (fixed)
x_fem: the underlying matrix of feature embeddings
features: features used in computing the embeddings
counts: counts of features used in computing embeddings
Dimnames: row (features) and column (NULL) names
Compute similarity and similarity ratios
compute_contrast(target_embeddings1 = NULL, target_embeddings2 = NULL, pre_trained = NULL, candidates = NULL, norm = NULL)
target_embeddings1: ALC embeddings for group 1
target_embeddings2: ALC embeddings for group 2
pre_trained: a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions
candidates: character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab()
norm: character = c("l2", "none") - set to "l2" for cosine similarity and to "none" for inner product (see ?text2vec::sim2)
a list with three elements: nns for group 1, nns for group 2, and nns_ratio, the ratios of similarities between the two groups
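The inputs mirror bootstrap_contrast() above; a one-call sketch reusing embeds_D, embeds_R and local_vocab from that earlier (illustrative) example:

# non-bootstrapped counterpart of the bootstrap_contrast() sketch above
contrast_out <- compute_contrast(target_embeddings1 = embeds_D$target_embedding,
                                 target_embeddings2 = embeds_R$target_embedding,
                                 pre_trained = cr_glove_subset,
                                 candidates = local_vocab, norm = "l2")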
Compute similarity vector (sub-function of bootstrap_similarity)
compute_similarity(target_embeddings = NULL, pre_trained = NULL, candidates = NULL, norm = NULL)
target_embeddings: the target embeddings (embeddings of context)
pre_trained: a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions
candidates: character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab()
norm: character = c("l2", "none") - set to "l2" for cosine similarity and to "none" for inner product (see ?text2vec::sim2)
vector of cosine similarities between the ALC embedding and nearest neighbor candidates
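As the non-bootstrapped core of bootstrap_similarity(), a one-call sketch reusing embeds and local_vocab from that earlier (illustrative) example:

# assuming embeds and local_vocab from the bootstrap_similarity() sketch above
sims <- compute_similarity(target_embeddings = embeds$target_embedding,
                           pre_trained = cr_glove_subset,
                           candidates = local_vocab, norm = "l2")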
Computes a transformation matrix, given a feature-co-occurrence matrix and corresponding pre-trained embeddings.
compute_transform(x, pre_trained, weighting = 500)
x: a (quanteda) fcm-class feature-co-occurrence matrix
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings, usually trained on the same corpus as that used for x
weighting: (character or numeric) weighting options: a numeric feature-count threshold (the default is 500) or "log" to weight by the log of feature counts. Recommended: use "log" with smaller corpora, as in the example below.
a dgTMatrix-class D x D non-symmetrical matrix (D = dimensions of pre-trained embedding space) corresponding to an 'a la carte' transformation matrix. This matrix is optimized for the corpus and pre-trained embeddings employed.
library(quanteda)
# note, cr_sample_corpus is too small to produce meaningful word vectors
# tokenize
toks <- tokens(cr_sample_corpus)
# construct feature-co-occurrence matrix
toks_fcm <- fcm(toks, context = "window", window = 6,
                count = "weighted", weights = 1 / (1:6), tri = FALSE)
# you will generally want to estimate a new (corpus-specific) GloVe model;
# we will use cr_glove_subset instead (see the Quick Start Guide for a full example)
# estimate transform
local_transform <- compute_transform(x = toks_fcm, pre_trained = cr_glove_subset,
                                     weighting = 'log')
Estimates an embedding regression model, with options to use bootstrapping (to be deprecated) or jackknife debiasing to estimate confidence intervals, and a permutation test for inference (see https://github.com/prodriguezsosa/conText for details).
conText(formula, data, pre_trained, transform = TRUE, transform_matrix,
        bootstrap = FALSE, num_bootstraps = 100, stratify = FALSE,
        jackknife = TRUE, confidence_level = 0.95, permute = TRUE,
        num_permutations = 100, window = 6L,
        valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
        hard_cut = FALSE, verbose = TRUE)
formula: a symbolic description of the model to be fitted with a target word as a DV, e.g. immigration ~ party + gender
data: a quanteda tokens-class object
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
bootstrap: (logical) if TRUE, use bootstrapping – sample from texts with replacement and re-run regression on each sample.
num_bootstraps: (numeric) number of bootstraps to use (at least 100). Ignored if bootstrap = FALSE.
stratify: (logical) if TRUE, stratify by discrete covariates when bootstrapping.
jackknife: (logical) if TRUE (default), jackknife (leave-one-out) debiasing is implemented. Implies n resamples.
confidence_level: (numeric in (0,1)) confidence level e.g. 0.95
permute: (logical) if TRUE, compute empirical p-values using a permutation test
num_permutations: (numeric) number of permutations to use
window: the number of context words to be displayed around the keyword
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching
case_insensitive: logical; if TRUE, ignore case when matching the target pattern
hard_cut: (logical) if TRUE then a context must have window x 2 tokens to be included; if FALSE, contexts with fewer tokens are also included
verbose: (logical) if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.
a conText-class object - a D x M matrix with D = dimensions of the pre-trained feature embeddings provided and M = number of covariates including the intercept. These represent the estimated regression coefficients. These can be combined to compute ALC embeddings for different combinations of covariates. The object also includes various informative attributes, importantly @normed_coefficients, a data.frame with the following columns:
coefficient: (character) name of (covariate) coefficient.
value: (numeric) norm of the corresponding beta coefficient (debiased if jackknife = TRUE).
std.error: (numeric) (if bootstrap = TRUE or jackknife = TRUE) std. error of the (debiased if jackknife = TRUE) norm of the beta coefficient.
lower.ci: (numeric) (if bootstrap = TRUE or jackknife = TRUE) lower bound of the (debiased if jackknife = TRUE) confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE or jackknife = TRUE) upper bound of the (debiased if jackknife = TRUE) confidence interval.
p.value: (numeric) (if permute = TRUE) empirical p-value of the norm of the coefficient.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
## given the target word "immigration"
set.seed(2021L)
model1 <- conText(formula = immigration ~ party + gender,
                  data = toks,
                  pre_trained = cr_glove_subset,
                  transform = TRUE, transform_matrix = cr_transform,
                  bootstrap = FALSE,
                  jackknife = TRUE, confidence_level = 0.95,
                  permute = TRUE, num_permutations = 10,
                  window = 6, case_insensitive = TRUE,
                  verbose = FALSE)
# notice, character/factor covariates are automatically "dummified"
rownames(model1)
# the beta coefficient 'partyR' in this case corresponds to the ALC embedding
# of "immigration" for Republican party speeches
# (normed) coefficient table
model1@normed_coefficients
Computes the ratio of cosine similarities between group embeddings and features – that is, for any given feature it first computes the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group.
contrast_nns(x, groups = NULL, pre_trained = NULL, transform = TRUE,
             transform_matrix = NULL, bootstrap = TRUE, num_bootstraps = 100,
             confidence_level = 0.95, permute = TRUE, num_permutations = 100,
             candidates = NULL, N = 20, verbose = TRUE)
x: a (quanteda) tokens-class object
groups: (numeric, factor, character) a binary variable of the same length as x
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
bootstrap: (logical) if TRUE, use bootstrapping – sample from texts with replacement and re-estimate cosine ratios for each sample. Required to get std. errors.
num_bootstraps: (numeric) number of bootstraps to use
confidence_level: (numeric in (0,1)) confidence level e.g. 0.95
permute: (logical) if TRUE, compute empirical p-values using a permutation test
num_permutations: (numeric) number of permutations to use
candidates: (character) vector of candidate features for nearest neighbors
N: (numeric) nearest neighbors are subset to the union of the N neighbors of each group (if NULL, the ratio is computed for all features)
verbose: (logical) if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.
a data.frame with the following columns:
feature: (character) vector of feature terms corresponding to the nearest neighbors.
value: (numeric) ratio of cosine similarities. Average over bootstrapped samples if bootstrap = TRUE.
std.error: (numeric) std. error of the ratio of cosine similarities. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
p.value: (numeric) empirical p-value. Column is dropped if permute = FALSE.
library(quanteda)
cr_toks <- tokens(cr_sample_corpus)
immig_toks <- tokens_context(x = cr_toks, pattern = "immigration",
                             window = 6L, hard_cut = FALSE, verbose = TRUE)
# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))
set.seed(42L)
party_nns <- contrast_nns(x = immig_toks, groups = docvars(immig_toks, 'party'),
                          pre_trained = cr_glove_subset,
                          transform = TRUE, transform_matrix = cr_transform,
                          bootstrap = TRUE, num_bootstraps = 100,
                          confidence_level = 0.95,
                          permute = TRUE, num_permutations = 10,
                          candidates = NULL, N = 20, verbose = FALSE)
head(party_nns)
Compute the cosine similarity between one or more ALC embeddings and a set of features.
cos_sim(x, pre_trained, features = NULL, stem = FALSE, language = "porter",
        as_list = TRUE, show_language = TRUE)
x: a (quanteda) dem-class or fem-class object
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
features: (character) features of interest.
stem: (logical) if TRUE, both features and rownames(pre_trained) are stemmed and average cosine similarities are reported.
language: the name of a recognized language, as returned by SnowballC::getStemLanguages()
as_list: (logical) if FALSE all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per feature.
show_language: (logical) if TRUE print out message with language used for stemming.
a data.frame or list of data.frames (one for each target) with the following columns:
target: (character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).
feature: (character) feature terms defined in the features argument.
value: (numeric) cosine similarity between x and feature.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
# build document-feature matrix
immig_dfm <- dfm(immig_toks)
# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)
# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)
# compute the cosine similarity between each party's embedding and a specific set of features
cos_sim(x = immig_wv_party, pre_trained = cr_glove_subset,
        features = c('reform', 'enforcement'), as_list = FALSE)
A subset of a GloVe embeddings model trained on the top 5000 features in the Congressional Record corpus covering the 111th - 114th Congresses, limited to speeches by Democratic and Republican representatives.
cr_glove_subset
A matrix with 500 rows and 300 columns:
each row corresponds to a word
each column corresponds to a dimension in the embedding space
https://www.dropbox.com/s/p84wzv8bdmziog8/cr_glove.R?dl=0
A (quanteda) corpus containing a sample of the United States Congressional Record (daily transcripts) covering the 111th to 114th Congresses. The raw corpus is first subset to speeches containing the regular expression "immig*". Then 100 docs from each party-gender pair are randomly sampled. For the full data and pre-processing file, see: https://www.dropbox.com/sh/jsyrag7opfo7l7i/AAB1z7tumLuKihGu2-FDmhmKa?dl=0 For nominate scores see: https://voteview.com/data
cr_sample_corpus
A quanteda corpus with 200 documents and 3 docvars:
party: party of speaker, (D)emocrat or (R)epublican
gender: gender of speaker, (F)emale or (M)ale
nominate_dim1: dimension 1 of the nominate score
https://data.stanford.edu/congress_text
A square matrix corresponding to the transformation matrix computed using the cr_glove_subset embeddings and corresponding corpus.
cr_transform
A 300 by 300 matrix.
https://www.dropbox.com/s/p84wzv8bdmziog8/cr_glove.R?dl=0
Given a document-feature matrix, for each document multiply its feature counts (columns) by their corresponding pretrained word embeddings and average (usually referred to as averaged or additive document embeddings). If transform = TRUE and a transformation matrix is provided, multiply the document embeddings by the transformation matrix to obtain the corresponding 'a la carte' document embeddings (see eq. 2: https://arxiv.org/pdf/1805.05388.pdf).
dem(x, pre_trained, transform = TRUE, transform_matrix, verbose = TRUE)
x: a quanteda (dfm-class) document-feature-matrix
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
verbose: (logical) if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.
a N x D (dem-class) document-embedding-matrix corresponding to the ALC embeddings for each document. N = number of documents (that could be embedded), D = dimensions of pretrained embeddings. This object inherits the document variables in x, the dfm used. These can be accessed calling the attribute: @docvars. Note, documents with no overlapping features with the pretrained embeddings provided are automatically dropped. For a list of the documents that were embedded call the attribute: @Dimnames$docs.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
# construct document-feature-matrix
immig_dfm <- dfm(immig_toks)
# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)
Average embeddings in a dem by a grouping variable - within each group, averaging over documents (i.e. taking column means) and creating new "documents" with the group labels. Similar in essence to dfm_group.
dem_group(x, groups = NULL)
x: a (dem-class) document-embedding-matrix
groups: a character or factor variable equal in length to the number of documents
a G x D (dem-class) document-embedding-matrix corresponding to the ALC embeddings for each group. G = number of unique groups defined in the groups variable, D = dimensions of pretrained embeddings.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
# build document-feature matrix
immig_dfm <- dfm(immig_toks)
# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)
# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)
Take a random sample of documents from a dem, with or without replacement, and with the option to group by a variable in dem@docvars. Note: dem_sample uses dplyr::sample_frac under the hood; as such, size refers to the fraction of total observations.
dem_sample(x, size = NULL, replace = FALSE, weight = NULL, by = NULL)
x: a (dem-class) document-embedding-matrix
size: (numeric) the size of the sample to draw (see the note above on dplyr::sample_frac)
replace: (logical) sample with or without replacement?
weight: (numeric) sampling weights. Vector of non-negative numbers equal in length to the number of documents.
by: (character or factor vector) either of length 1 with the name of a grouping variable in dem@docvars (refer to the variable WITH QUOTATIONS, e.g. by = "party"), or a grouping vector equal in length to the number of documents (see the examples)
a size x D (dem-class) document-embedding-matrix corresponding to the sampled ALC embeddings. Note, @features in the resulting object will correspond to the original @features, that is, they are not subsetted to the sampled documents. For a list of the documents that were sampled call the attribute: @Dimnames$docs.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
# build document-feature matrix
immig_dfm <- dfm(immig_toks)
# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)
# to get a random sample
immig_wv_party <- dem_sample(immig_dem, size = 10, replace = TRUE, by = "party")
# also works
immig_wv_party <- dem_sample(immig_dem, size = 10, replace = TRUE,
                             by = immig_dem@docvars$party)
For a vector of contexts (generally the context variable in get_context output), return the transformed (or untransformed) additive embeddings, aggregated or by instance, along with the local vocabulary. Keep track of which contexts were embedded and which were excluded.
embed_target(context, pre_trained, transform = TRUE, transform_matrix, aggregate = TRUE, verbose = TRUE)
context: (character) vector of texts - generally the context variable in get_context() output
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
aggregate: (logical) if TRUE (default) output will return one embedding (i.e. averaged over all instances of target); if FALSE output will return one embedding per instance
verbose: (logical) report the observations that had no overlap with the provided pre-trained embeddings
required packages: quanteda
a list with three elements:
target_embedding: the target embedding(s). Values and dimensions will vary with the above settings.
local_vocab: (character) vocabulary that appears in the set of contexts provided.
obs_included: (integer) rows of the context vector that were included in the computation. A row (context) is excluded when none of the words in the context are present in the pre-trained embeddings provided.
# find contexts for the term "immigration"
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed", case_insensitive = TRUE,
                                   hard_cut = FALSE, verbose = FALSE)
contexts_vectors <- embed_target(context = context_immigration$context,
                                 pre_trained = cr_glove_subset,
                                 transform = TRUE, transform_matrix = cr_transform,
                                 aggregate = FALSE, verbose = FALSE)
Efficient way of comparing two corpora along many features simultaneously.
feature_sim(x, y, features = character(0))
x: a (fem-class) feature-embedding-matrix
y: a (fem-class) feature-embedding-matrix
features: (character) vector of features for which to compute similarity scores. If not defined then all overlapping features will be used.
a data.frame with the following columns:
feature: (character) overlapping features
value: (numeric) cosine similarity between overlapping features.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# create feature co-occurrence matrix for each party (set tri = FALSE to work with fem)
fcm_D <- fcm(toks[docvars(toks, 'party') == "D",], context = "window", window = 6,
             count = "frequency", tri = FALSE)
fcm_R <- fcm(toks[docvars(toks, 'party') == "R",], context = "window", window = 6,
             count = "frequency", tri = FALSE)
# compute feature-embedding matrix
fem_D <- fem(fcm_D, pre_trained = cr_glove_subset, transform = TRUE,
             transform_matrix = cr_transform, verbose = FALSE)
fem_R <- fem(fcm_R, pre_trained = cr_glove_subset, transform = TRUE,
             transform_matrix = cr_transform, verbose = FALSE)
# compare "horizontal" cosine similarity
feat_comp <- feature_sim(x = fem_R, y = fem_D)
Given a feature-co-occurrence matrix, for each feature multiply its co-occurrence counts (columns) by their corresponding pre-trained embeddings and average (usually referred to as averaged or additive embeddings). If transform = TRUE and a transformation matrix is provided, multiply the feature embeddings by the transformation matrix to obtain the corresponding 'a la carte' embeddings (see eq. 2: https://arxiv.org/pdf/1805.05388.pdf).
fem(x, pre_trained, transform = TRUE, transform_matrix, verbose = TRUE)
x: a quanteda (fcm-class) feature-co-occurrence-matrix
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
verbose: (logical) if TRUE, report the features that had no overlapping (co-occurring) features with the pretrained embeddings provided.
a fem-class object
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# create feature co-occurrence matrix (set tri = FALSE to work with fem)
toks_fcm <- fcm(toks, context = "window", window = 6, count = "frequency", tri = FALSE)
# compute feature-embedding matrix
toks_fem <- fem(toks_fcm, pre_trained = cr_glove_subset, transform = TRUE,
                transform_matrix = cr_transform, verbose = FALSE)
Find cosine similarities between target and candidate words
find_cos_sim(target_embedding, pre_trained, candidates, norm = "l2")
target_embedding: matrix of numeric values
pre_trained: matrix of numeric values - pretrained embeddings
candidates: character vector defining vocabulary to subset comparison to
norm: character = c("l2", "none") - how to scale input matrices. If they are already scaled, use "none" (see ?text2vec::sim2)
a vector of cosine similarities equal in length to candidates
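No example accompanies this function in this extract; a usage sketch using a row of the bundled GloVe subset as the target embedding (the candidate words here are illustrative):

# similarity between the pretrained "immigration" vector and two candidate words
find_cos_sim(target_embedding = cr_glove_subset['immigration', , drop = FALSE],
             pre_trained = cr_glove_subset,
             candidates = c('reform', 'enforcement'),
             norm = "l2")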
Return nearest neighbors based on cosine similarity
find_nns(target_embedding, pre_trained, N = 5, candidates = NULL, norm = "l2",
         stem = FALSE, language = "porter")
target_embedding: (numeric) 1 x D matrix. D = dimensions of pretrained embeddings.
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
N: (numeric) number of nearest neighbors to return.
candidates: (character) vector of candidate features for nearest neighbors
norm: (character) how to compute similarity (see ?text2vec::sim2): "l2" for cosine similarity, "none" for inner product.
stem: (logical) whether to stem candidates when evaluating nns. Default is FALSE. If TRUE, candidate stems are ranked by their average cosine similarity to the target. We recommend you remove misspelled words from the candidate set, as these can influence the average similarity of a given stem.
language: the name of a recognized language, as returned by SnowballC::getStemLanguages()
(character) vector of nearest neighbors to target
find_nns(target_embedding = cr_glove_subset['immigration',],
         pre_trained = cr_glove_subset, N = 5,
         candidates = NULL, norm = "l2", stem = FALSE)
A wrapper function for quanteda's kwic() function that subsets documents to those where the target is present before tokenizing (to speed up processing), and concatenates kwic's pre/post variables into a single context column.
get_context(x, target, window = 6L, valuetype = "fixed", case_insensitive = TRUE,
            hard_cut = FALSE, what = "word", verbose = TRUE)
x: (character) vector - this is the set of documents (corpus) of interest.
target: (character) vector - these are the target words whose contexts we want to evaluate. This vector may include a single token, a phrase, or multiple tokens and/or phrases.
window: (numeric) defines the size of a context (words around the target).
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching
case_insensitive: logical; if TRUE, ignore case when matching the target pattern
hard_cut: (logical) if TRUE then a context must have window x 2 tokens to be included; if FALSE, contexts with fewer tokens are also included
what: (character) defines which quanteda tokenizer to use. You will rarely want to change this. For Chinese text you may want to change this default.
verbose: (logical) if TRUE, report the total number of target instances found.
a data.frame with the following columns:
docname: (character) name of the document each instance belongs to.
target: (character) targets.
context: (character) kwic()'s pre/post variables concatenated.
Note: target in the return data.frame is equivalent to kwic()'s keyword output variable, so it may not match the user-defined target exactly if valuetype is not "fixed".
# get context words surrounding the term "immigration"
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed", case_insensitive = FALSE,
                                   hard_cut = FALSE, verbose = FALSE)
This is a wrapper function for cos_sim()
that allows users to go from a
tokenized corpus to results with the option to bootstrap cosine similarities
and get the corresponding std. errors.
get_cos_sim(x, groups = NULL, features = character(0), pre_trained, transform = TRUE,
            transform_matrix, bootstrap = TRUE, num_bootstraps = 100,
            confidence_level = 0.95, stem = FALSE, language = "porter", as_list = TRUE)
x: a (quanteda) tokens-class object
groups: (numeric, factor, character) a binary variable of the same length as x
features: (character) features of interest
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
bootstrap: (logical) if TRUE, use bootstrapping – sample from texts with replacement and re-estimate cosine similarities for each sample. Required to get std. errors. If groups are defined, sampling is stratified by group.
num_bootstraps: (integer) number of bootstraps to use.
confidence_level: (numeric in (0,1)) confidence level e.g. 0.95
stem: (logical) if TRUE, both features and rownames(pre_trained) are stemmed and average cosine similarities are reported.
language: the name of a recognized language, as returned by SnowballC::getStemLanguages()
as_list: (logical) if FALSE all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per feature.
a data.frame or list of data.frames (one for each target) with the following columns:
target: (character) rownames of x, the labels of the ALC embeddings.
feature: (character) feature terms defined in the features argument.
value: (numeric) cosine similarity between x and feature. Average over bootstrapped samples if bootstrap = TRUE.
std.error: (numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)
# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))
# compute the cosine similarity between each group's embedding
# and a specific set of features
set.seed(2021L)
get_cos_sim(x = immig_toks, groups = docvars(immig_toks, 'party'),
            features = c("reform", "enforce"),
            pre_trained = cr_glove_subset,
            transform = TRUE, transform_matrix = cr_transform,
            bootstrap = TRUE, num_bootstraps = 100, confidence_level = 0.95,
            stem = TRUE, as_list = FALSE)
Get similarity scores between a target word or words and a comparison vector of one or more candidate words. When two vectors of candidate words are provided (second_vec is not NULL), the function calculates the cosine similarity between the target and a composite index of the two vectors. This is operationalized as the mean similarity of the target word to the first vector of terms minus the mean similarity to the second vector of terms.
get_grouped_similarity(x, target, first_vec, second_vec, pre_trained, transform_matrix,
                       group_var, window = window, norm = "l2", remove_punct = FALSE,
                       remove_symbols = FALSE, remove_numbers = FALSE,
                       remove_separators = FALSE, valuetype = "fixed",
                       hard_cut = FALSE, case_insensitive = TRUE)
x: a (quanteda) corpus-class object
target: (character) vector of words
first_vec: (character) vector of words
second_vec: (character) vector of words
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings, usually trained on the same corpus as that used for x
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
group_var: (character) variable name in corpus object defining grouping variable
window: (numeric) defines the size of a context (words around the target)
norm: (character) "l2" for l2-normalized cosine similarity and "none" for dot product
remove_punct: (logical) if TRUE remove punctuation when tokenizing
remove_symbols: (logical) if TRUE remove symbols when tokenizing
remove_numbers: (logical) if TRUE remove numbers when tokenizing
remove_separators: (logical) if TRUE remove separators when tokenizing
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching
hard_cut: (logical) if TRUE then a context must have window x 2 tokens to be included
case_insensitive: (logical) if TRUE, ignore case when matching the target pattern
a data.frame with the following columns:
group: the grouping variable specified for the analysis
val: (numeric) cosine similarity scores
quanteda::docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)
cos_simsdf <- get_grouped_similarity(cr_sample_corpus, group_var = "year",
                                     target = "immigration",
                                     first_vec = c("left", "lefty"),
                                     second_vec = c("right", "rightwinger"),
                                     pre_trained = cr_glove_subset,
                                     transform_matrix = cr_transform,
                                     window = 12L, norm = "l2")
The local vocab consists of the intersection between the set of pretrained embeddings and the collection of texts.
get_local_vocab(context, pre_trained)
context: (character) vector of contexts (usually the context column in get_context() output)
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
(character) vector of words common to the texts and pretrained embeddings.
# find local vocab (use it to define the candidates for nearest neighbors)
local_vocab <- get_local_vocab(cr_sample_corpus, pre_trained = cr_glove_subset)
This is a wrapper function for ncs()
that allows users to go from a
tokenized corpus to results with the option to bootstrap cosine similarities
and get the corresponding std. errors.
get_ncs(x, N = 5, groups = NULL, pre_trained, transform = TRUE, transform_matrix,
        bootstrap = TRUE, num_bootstraps = 100, confidence_level = 0.95, as_list = TRUE)
x: a (quanteda) tokens-class object
N: (numeric) number of nearest contexts to return
groups: a character or factor variable equal in length to the number of documents
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
bootstrap: (logical) if TRUE, use bootstrapping – sample from x with replacement and re-estimate cosine similarities for each sample. Required to get std. errors. If groups are defined, sampling is stratified by group.
num_bootstraps: (integer) number of bootstraps to use.
confidence_level: (numeric in (0,1)) confidence level e.g. 0.95
as_list: (logical) if FALSE all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per embedding.
a data.frame or list of data.frames (one for each target) with the following columns:
target: (character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).
context: (character) contexts collapsed into single documents (i.e. untokenized).
rank: (character) rank of context in terms of similarity with x.
value: (numeric) cosine similarity between x and context.
std.error: (numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L,
                             rm_keyword = FALSE)
# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))
# compare nearest contexts between groups
set.seed(2021L)
immig_party_ncs <- get_ncs(x = immig_toks, N = 10,
                           groups = docvars(immig_toks, 'party'),
                           pre_trained = cr_glove_subset,
                           transform = TRUE, transform_matrix = cr_transform,
                           bootstrap = TRUE, num_bootstraps = 100,
                           confidence_level = 0.95, as_list = TRUE)
# nearest contexts of "immigration" for the Democratic party
immig_party_ncs[["D"]]
This is a wrapper function for nns()
that allows users to go from a
tokenized corpus to results with the option to bootstrap cosine similarities
and get the corresponding std. errors.
get_nns(x, N = 10, groups = NULL, candidates = character(0), pre_trained,
        transform = TRUE, transform_matrix, bootstrap = TRUE, num_bootstraps = 100,
        confidence_level = 0.95, stem = FALSE, language = "porter", as_list = TRUE)
x: a (quanteda) tokens-class object
N: (numeric) number of nearest neighbors to return
groups: a character or factor variable equal in length to the number of documents
candidates: (character) vector of features to consider as candidates to be nearest neighbors. You may for example want to only consider features that meet a certain count threshold or exclude stop words etc. To do so you can simply identify the set of features you want to consider and supply these as a character vector in the candidates argument.
pre_trained: (numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.
transform: (logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.
transform_matrix: (numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.
bootstrap: (logical) if TRUE, use bootstrapping – sample from x with replacement and re-estimate cosine similarities for each sample. Required to get std. errors. If groups are defined, sampling is stratified by group.
num_bootstraps: (integer) number of bootstraps to use.
confidence_level: (numeric in (0,1)) confidence level e.g. 0.95
stem: (logical) whether to stem candidates when evaluating nns. Default is FALSE. If TRUE, candidate stems are ranked by their average cosine similarity to the target. We recommend you remove misspelled words from the candidate set, as these can influence the average similarity of a given stem.
language: the name of a recognized language, as returned by SnowballC::getStemLanguages()
as_list: (logical) if FALSE all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per group.
a data.frame or list of data.frames (one for each target) with the following columns:
target: (character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).
feature: (character) features identified as nearest neighbors.
rank: (character) rank of feature in terms of similarity with x.
value: (numeric) cosine similarity between x and feature. Average over bootstrapped samples if bootstrap = TRUE.
std.error: (numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci: (numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci: (numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
library(quanteda)
# tokenize corpus
toks <- tokens(cr_sample_corpus)
# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)
# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))
# we limit candidates to features in our corpus
feats <- featnames(dfm(immig_toks))
# compare nearest neighbors between groups
set.seed(2021L)
immig_party_nns <- get_nns(x = immig_toks, N = 10,
                           groups = docvars(immig_toks, 'party'),
                           candidates = feats,
                           pre_trained = cr_glove_subset,
                           transform = TRUE, transform_matrix = cr_transform,
                           bootstrap = TRUE, num_bootstraps = 100,
                           stem = TRUE, as_list = TRUE)
# nearest neighbors of "immigration" for the Republican party
immig_party_nns[["R"]]
This is a wrapper function for nns_ratio() that allows users to go from a
tokenized corpus to results, with the option to: (1) bootstrap cosine similarity
ratios and get the corresponding std. errors, and (2) use a permutation test to
get empirical p-values for inference.
get_nns_ratio(
  x,
  N = 10,
  groups,
  numerator = NULL,
  candidates = character(0),
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  stem = FALSE,
  language = "porter",
  verbose = TRUE,
  show_language = TRUE
)
x |
a (quanteda) tokens object |
N |
(numeric) number of nearest neighbors to return. Nearest neighbors
consist of the union of the top N nearest neighbors of the embeddings in x. |
groups |
a character or factor variable equal in length to the number of documents |
numerator |
(character) defines which group is the numerator in the ratio. |
candidates |
(character) vector of features to consider as candidates to be nearest neighbor.
You may for example want to only consider features that meet a certain count threshold
or exclude stop words etc. To do so you can simply identify the set of features you
want to consider and supply these as a character vector in the candidates argument. |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
transform |
(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings. |
transform_matrix |
(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings. |
bootstrap |
(logical) if TRUE, use bootstrapping – sample from texts with replacement and
re-estimate cosine similarity ratios for each sample. Required to get std. errors.
If groups is defined, sampling is stratified by group. |
num_bootstraps |
(integer) number of bootstraps to use. |
confidence_level |
(numeric in (0,1)) confidence level e.g. 0.95 |
permute |
(logical) if TRUE, compute empirical p-values using permutation test |
num_permutations |
(numeric) number of permutations to use. |
stem |
(logical) - whether to stem candidates when evaluating nns. Default is FALSE.
If TRUE, candidate stems are ranked by their average cosine similarity to the target.
We recommend you remove misspelled words from the candidate set. |
language |
the name of a recognized language, as returned by SnowballC::getStemLanguages(). |
verbose |
provide information on which group is the numerator |
show_language |
(logical) if TRUE print out message with language used for stemming. |
a data.frame with the following columns:
feature
(character) features in candidates (or all features if candidates is not defined), one instance for each embedding in x.
value
(numeric) cosine similarity ratio between x and feature. Average over bootstrapped samples if bootstrap = TRUE.
std.error
(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.
lower.ci
(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.
upper.ci
(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.
p.value
(numeric) empirical p-value of the bootstrapped ratio of cosine similarities if permute = TRUE; if FALSE, column is dropped.
group
(character) the group in groups for which the feature is a top N nearest neighbor. If "shared", the feature appeared as a top nearest neighbor for both groups.
library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# we limit candidates to features in our corpus
feats <- featnames(dfm(immig_toks))

# compute ratio
set.seed(2021L)
immig_nns_ratio <- get_nns_ratio(x = immig_toks,
                                 N = 10,
                                 groups = docvars(immig_toks, 'party'),
                                 numerator = "R",
                                 candidates = feats,
                                 pre_trained = cr_glove_subset,
                                 transform = TRUE,
                                 transform_matrix = cr_transform,
                                 bootstrap = TRUE,
                                 num_bootstraps = 100,
                                 permute = FALSE,
                                 num_permutations = 5,
                                 verbose = FALSE)

head(immig_nns_ratio)
Calculate cosine similarities between a target word and candidate words over a sequenced variable (e.g. dates) using the ALC embedding approach.
get_seq_cos_sim(
  x,
  seqvar,
  target,
  candidates,
  pre_trained,
  transform_matrix,
  window = 6,
  valuetype = "fixed",
  case_insensitive = TRUE,
  hard_cut = FALSE,
  verbose = TRUE
)
x |
(character) vector - this is the set of documents (corpus) of interest |
seqvar |
ordered variable such as a list of dates or ordered ideology scores |
target |
(character) vector - target word |
candidates |
(character) vector of features of interest |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
transform_matrix |
(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings. |
window |
(numeric) - defines the size of a context (words around the target). |
valuetype |
the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. |
case_insensitive |
logical; if TRUE, ignore case when matching. |
hard_cut |
(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if the target word is at the beginning of a text). |
verbose |
(logical) - if TRUE, report the total number of target instances found. |
a data.frame with one column for each candidate term with corresponding cosine similarity values and one column for seqvar.
library(quanteda)

# gen sequence var (here: year)
docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)

cos_simsdf <- get_seq_cos_sim(x = cr_sample_corpus,
                              seqvar = docvars(cr_sample_corpus, 'year'),
                              target = "equal",
                              candidates = c("immigration", "immigrants"),
                              pre_trained = cr_glove_subset,
                              transform_matrix = cr_transform)
Given a set of embeddings and a set of tokenized contexts, find the top N nearest contexts.
ncs(x, contexts_dem, contexts = NULL, N = 5, as_list = TRUE)
x |
a (quanteda) dem-class or fem-class object (the ALC embeddings to match contexts against) |
contexts_dem |
a dem-class object - the document-embedding matrix of the candidate contexts |
contexts |
a (quanteda) tokens-class object of the tokenized candidate contexts; if NULL, only context (document) ids are returned (see the context column below) |
N |
(numeric) number of nearest contexts to return |
as_list |
(logical) if FALSE, all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per embedding. |
a data.frame or list of data.frames (one for each target) with the following columns:
target
(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).
context
(character) contexts collapsed into single documents (i.e. untokenized). If contexts is NULL then this variable will show the context (document) ids which you can use to merge.
rank
(character) rank of context in terms of similarity with x.
value
(numeric) cosine similarity between x and context.
library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L, rm_keyword = FALSE)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# find nearest contexts by party
# setting as_list = FALSE combines each group's
# results into a single data.frame (useful for joint plotting)
ncs(x = immig_wv_party, contexts_dem = immig_dem, contexts = immig_toks,
    N = 5, as_list = TRUE)
Given a set of embeddings and a set of candidate neighbors, find the top N nearest neighbors.
nns(
  x,
  N = 10,
  candidates = character(0),
  pre_trained,
  stem = FALSE,
  language = "porter",
  as_list = TRUE,
  show_language = TRUE
)
x |
a dem-class or fem-class object |
N |
(numeric) number of nearest neighbors to return |
candidates |
(character) vector of features to consider as candidates to be nearest neighbor.
You may for example want to only consider features that meet a certain count threshold
or exclude stop words etc. To do so you can simply identify the set of features you
want to consider and supply these as a character vector in the candidates argument. |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
stem |
(logical) - whether to stem candidates when evaluating nns. Default is FALSE.
If TRUE, candidate stems are ranked by their average cosine similarity to the target.
We recommend you remove misspelled words from the candidate set. |
language |
the name of a recognized language, as returned by SnowballC::getStemLanguages(). |
as_list |
(logical) if FALSE, all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per group. |
show_language |
(logical) if TRUE print out message with language used for stemming. |
a data.frame or list of data.frames (one for each target) with the following columns:
target
(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).
feature
(character) features identified as nearest neighbors.
rank
(character) rank of feature in terms of similarity with x.
value
(numeric) cosine similarity between x and feature.
library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# find nearest neighbors by party
# setting as_list = FALSE combines each group's
# results into a single tibble (useful for joint plotting)
immig_nns <- nns(immig_wv_party, pre_trained = cr_glove_subset,
                 N = 5, candidates = immig_wv_party@features,
                 stem = FALSE, as_list = TRUE)
Computes the ratio of cosine similarities between group embeddings and features – that is, for any given feature it first computes the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group. Values larger (smaller) than 1 mean the feature is more (less) discriminant of the group in the numerator (denominator).
nns_ratio(
  x,
  N = 10,
  numerator = NULL,
  candidates = character(0),
  pre_trained,
  stem = FALSE,
  language = "porter",
  verbose = TRUE,
  show_language = TRUE
)
x |
a (quanteda) dem-class or fem-class object |
N |
(numeric) number of nearest neighbors to return. Nearest neighbors
consist of the union of the top N nearest neighbors of the embeddings in x. |
numerator |
(character) defines which group is the numerator in the ratio. |
candidates |
(character) vector of features to consider as candidates to be nearest neighbor.
You may for example want to only consider features that meet a certain count threshold
or exclude stop words etc. To do so you can simply identify the set of features you
want to consider and supply these as a character vector in the candidates argument. |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
stem |
(logical) - whether to stem candidates when evaluating nns. Default is FALSE.
If TRUE, candidate stems are ranked by their average cosine similarity to the target.
We recommend you remove misspelled words from the candidate set. |
language |
the name of a recognized language, as returned by SnowballC::getStemLanguages(). |
verbose |
report which group is the numerator and which group is the denominator. |
show_language |
(logical) if TRUE print out message with language used for stemming. |
a data.frame with the following columns:
feature
(character) features in candidates (or all features if candidates is not defined), one instance for each embedding in x.
value
(numeric) ratio of cosine similarities.
library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# compute the cosine similarity between each party's
# embedding and a specific set of features
nns_ratio(x = immig_wv_party, N = 10, numerator = "R",
          candidates = immig_wv_party@features,
          pre_trained = cr_glove_subset, verbose = FALSE)

# with stemming
nns_ratio(x = immig_wv_party, N = 10, numerator = "R",
          candidates = immig_wv_party@features,
          pre_trained = cr_glove_subset, stem = TRUE, verbose = FALSE)
Permute similarity and ratio computations
permute_contrast(
  target_embeddings1 = NULL,
  target_embeddings2 = NULL,
  pre_trained = NULL,
  candidates = NULL,
  norm = NULL
)
target_embeddings1 |
ALC embeddings for group 1 |
target_embeddings2 |
ALC embeddings for group 2 |
pre_trained |
a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions |
candidates |
character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab(). |
norm |
character = c("l2", "none") - set to 'l2' for cosine similarity and to 'none' for inner product (see ?sim2 in text2vec) |
a list with three elements: nns for group 1, nns for group 2, and nns_ratio with the ratios of similarities between the two groups.
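The documentation ships no example for permute_contrast(). Below is a minimal, hypothetical sketch (not from the package): it reuses the document-embedding matrix built in the ncs() example further down, and assumes the two target embedding sets can be passed as plain matrices of embedded contexts, one per group.

library(quanteda)

# build a document-embedding matrix as in the ncs() example below
toks <- tokens(cr_sample_corpus)
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
immig_dem <- dem(dfm(immig_toks), pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# split embedded contexts by party into plain matrices (assumption:
# permute_contrast accepts one matrix of ALC embeddings per group)
emb_D <- as.matrix(immig_dem)[immig_dem@docvars$party == "D", ]
emb_R <- as.matrix(immig_dem)[immig_dem@docvars$party == "R", ]

# one permuted draw of the nns and nns_ratio computations
contrast_out <- permute_contrast(target_embeddings1 = emb_D,
                                 target_embeddings2 = emb_R,
                                 pre_trained = cr_glove_subset,
                                 candidates = immig_dem@features,
                                 norm = "l2")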
Estimate empirical p-value using permuted regression
permute_ols(Y = NULL, X = NULL)
Y |
vector of regression model's dependent variable (embedded context) |
X |
data.frame of model independent variables (covariates) |
a list with two elements: betas, a list of beta coefficients (D-dimensional vectors), and normed_betas, a tibble with the norm of the non-intercept coefficients.
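No example accompanies permute_ols(); the following minimal sketch uses simulated inputs (both Y and X below are hypothetical), assuming Y holds one embedded context per row and X one covariate per column.

set.seed(42L)
# 100 embedded contexts in a 50-dimensional embedding space (simulated)
Y <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)
# one binary covariate, e.g. a group indicator (simulated)
X <- data.frame(group = sample(c(0, 1), size = 100, replace = TRUE))

# one permuted regression draw (assumption: rows are shuffled internally
# before estimation, yielding a draw from the null distribution)
perm_out <- permute_ols(Y = Y, X = X)
perm_out$normed_betas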
See also: get_nns_ratio().
A way of visualizing the top nearest neighbors of a pair of ALC embeddings that captures how "discriminant" each feature is of each embedding (group).
plot_nns_ratio(x, alpha = 0.01, horizontal = TRUE)
x |
output of get_nns_ratio |
alpha |
(numeric) between 0 and 1. Significance threshold to identify significant values.
These are denoted by a * on the plot. |
horizontal |
(logical) defines the type of plot. If TRUE, results are plotted on 1 dimension. If FALSE, results are plotted on 2 dimensions, with the second dimension capturing the ranking of cosine similarity ratios. |
a ggplot-class object.
library(ggplot2)
library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# we limit candidates to features in our corpus
feats <- featnames(dfm(immig_toks))

# compute ratio
set.seed(2022L)
immig_nns_ratio <- get_nns_ratio(x = immig_toks,
                                 N = 10,
                                 groups = docvars(immig_toks, 'party'),
                                 numerator = "R",
                                 candidates = feats,
                                 pre_trained = cr_glove_subset,
                                 transform = TRUE,
                                 transform_matrix = cr_transform,
                                 bootstrap = TRUE,
                                 # num_bootstraps should be at least 100
                                 num_bootstraps = 100,
                                 permute = FALSE,
                                 num_permutations = 10,
                                 verbose = FALSE)

plot_nns_ratio(x = immig_nns_ratio, alpha = 0.01, horizontal = TRUE)
Contexts most similar on average to the full set of contexts.
prototypical_context( context, pre_trained, transform = TRUE, transform_matrix, N = 3, norm = "l2" )
context |
(character) vector of texts - context variable in get_context output |
pre_trained |
(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding. |
transform |
(logical) - if TRUE (default) apply the a la carte transformation, if FALSE output untransformed averaged embedding. |
transform_matrix |
(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings. |
N |
(numeric) number of most "prototypical" contexts to return. |
norm |
(character) - how to compute similarity (see ?text2vec::sim2):
set to 'l2' for cosine similarity and to 'none' for inner product. |
a data.frame with the following columns:
doc_id
(integer) document id.
typicality_score
(numeric) average similarity score to all other contexts.
context
(character) contexts.
# find contexts of immigration
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed",
                                   case_insensitive = TRUE, hard_cut = FALSE,
                                   verbose = FALSE)

# identify top N prototypical contexts and compute typicality score
pt_context <- prototypical_context(context = context_immigration$context,
                                   pre_trained = cr_glove_subset,
                                   transform = TRUE,
                                   transform_matrix = cr_transform,
                                   N = 3, norm = 'l2')
Run jackknife debiased OLS
run_jack_ols(X, Y, confidence_level = 0.95)
X |
data.frame of model independent variables (covariates) |
Y |
vector of regression model's dependent variable (embedded context) |
confidence_level |
(numeric in (0,1)) confidence level e.g. 0.95 |
a list with two elements: betas, a list of beta coefficients (D-dimensional vectors), and normed_betas, a tibble with the norm and CIs of the non-intercept coefficients.
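As with permute_ols(), no example is provided; below is a minimal sketch with simulated, hypothetical inputs (Y and X are made up for illustration).

set.seed(42L)
Y <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)                  # simulated embedded contexts
X <- data.frame(group = sample(c(0, 1), size = 100, replace = TRUE)) # simulated covariate

jack_out <- run_jack_ols(X = X, Y = Y, confidence_level = 0.95)
jack_out$normed_betas  # norms of non-intercept coefficients with jackknife CIs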
Run OLS
run_ols(Y = NULL, X = NULL)
Y |
vector of regression model's dependent variable (embedded context) |
X |
data.frame of model independent variables (covariates) |
a list with two elements: betas, a list of beta coefficients (D-dimensional vectors), and normed_betas, a tibble with the norm of the non-intercept coefficients.
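Again no example is provided; a minimal sketch with simulated, hypothetical inputs:

set.seed(42L)
Y <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)                  # simulated embedded contexts
X <- data.frame(group = sample(c(0, 1), size = 100, replace = TRUE)) # simulated covariate

ols_out <- run_ols(Y = Y, X = X)
ols_out$betas         # list of D-dimensional coefficient vectors
ols_out$normed_betas  # tibble with the norm of the non-intercept coefficients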
This function uses quanteda's kwic() function to find the contexts
around user-defined patterns (i.e. target words/phrases) and returns a tokens object
with the tokenized contexts and corresponding document variables.
tokens_context(
  x,
  pattern,
  window = 6L,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  hard_cut = FALSE,
  rm_keyword = TRUE,
  verbose = TRUE
)
x |
a (quanteda) tokens-class object |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
window |
the number of context words to be displayed around the keyword |
valuetype |
the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. |
case_insensitive |
logical; if TRUE, ignore case when matching the pattern. |
hard_cut |
(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if the pattern appears at the beginning of a text). |
rm_keyword |
(logical) if FALSE, keyword matching pattern is included in the tokenized contexts |
verbose |
(logical) if TRUE, report the total number of instances per pattern found |
a (quanteda) tokens-class object. Each document in the output tokens object inherits the document variables (docvars) of the document from whence it came, along with a column registering the corresponding pattern used. This information can be retrieved using docvars().
library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)