Package 'conText'

Title: 'a la Carte' on Text (ConText) Embedding Regression
Description: A fast, flexible and transparent framework to estimate context-specific word and short document embeddings using the 'a la carte' embeddings approach developed by Khodak et al. (2018) <arXiv:1805.05388> and evaluate hypotheses about covariate effects on embeddings using the regression framework developed by Rodriguez et al. (2021) <https://github.com/prodriguezsosa/EmbeddingRegression>.
Authors: Pedro L. Rodriguez [aut, cre, cph], Arthur Spirling [aut], Brandon Stewart [aut], Christopher Barrie [ctb]
Maintainer: Pedro L. Rodriguez <[email protected]>
License: GPL-3
Version: 2.0.0
Built: 2025-01-26 04:41:57 UTC
Source: https://github.com/prodriguezsosa/context

Help Index


Bootstrap similarity and ratio computations

Description

Bootstrap similarity and ratio computations

Usage

bootstrap_contrast(
  target_embeddings1 = NULL,
  target_embeddings2 = NULL,
  pre_trained = NULL,
  candidates = NULL,
  norm = NULL
)

Arguments

target_embeddings1

ALC embeddings for group 1

target_embeddings2

ALC embeddings for group 2

pre_trained

a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions

candidates

character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab

norm

character = c("l2", "none") - set to 'l2' for cosine similarity and to 'none' for inner product (see ?sim2 in text2vec)

Value

a list with three elements: nns for group 1, nns for group 2, and nns_ratio with the ratios of similarities between the two groups
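
Examples

A minimal sketch, not taken from the package's own examples: it assembles per-instance ALC embeddings for two groups using get_context() and embed_target() (both documented below). That bootstrap_contrast() expects per-instance embeddings for each group is an assumption based on the argument descriptions above; the helper may also be internal in recent versions.

library(quanteda)

# split the sample corpus by party
docs_D <- corpus_subset(cr_sample_corpus, party == "D")
docs_R <- corpus_subset(cr_sample_corpus, party == "R")

# contexts around "immigration" for each group
context_D <- get_context(x = docs_D, target = 'immigration', verbose = FALSE)
context_R <- get_context(x = docs_R, target = 'immigration', verbose = FALSE)

# per-instance ALC embeddings (aggregate = FALSE returns one row per instance)
emb_D <- embed_target(context = context_D$context, pre_trained = cr_glove_subset,
                      transform = TRUE, transform_matrix = cr_transform,
                      aggregate = FALSE, verbose = FALSE)
emb_R <- embed_target(context = context_R$context, pre_trained = cr_glove_subset,
                      transform = TRUE, transform_matrix = cr_transform,
                      aggregate = FALSE, verbose = FALSE)

# candidate nearest neighbors present in both sets of contexts
candidates <- intersect(emb_D$local_vocab, emb_R$local_vocab)

set.seed(42L)
contrast_boot <- bootstrap_contrast(target_embeddings1 = emb_D$target_embedding,
                                    target_embeddings2 = emb_R$target_embedding,
                                    pre_trained = cr_glove_subset,
                                    candidates = candidates,
                                    norm = "l2")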


Bootstrap nearest neighbors

Description

Uses bootstrapping – sampling of texts with replacement – to identify the top N nearest neighbors based on cosine or inner product similarity.

Usage

bootstrap_nns(
  context = NULL,
  pre_trained = NULL,
  transform = TRUE,
  transform_matrix = NULL,
  candidates = NULL,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  N = 50,
  norm = "l2"
)

Arguments

context

(character) vector of texts - context variable in get_context output

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) - if TRUE (default) apply the a la carte transformation, if FALSE output untransformed averaged embedding.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

candidates

(character) vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab.

bootstrap

(logical) if TRUE, bootstrap similarity values - sample from texts with replacement. Required to get std. errors.

num_bootstraps

(numeric) - number of bootstraps to use.

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

N

(numeric) number of nearest neighbors to return.

norm

(character) - how to compute the similarity (see ?text2vec::sim2):

"l2"

cosine similarity

"none"

inner product

Value

a data.frame with the following columns:

feature

(character) vector of feature terms corresponding to the nearest neighbors.

value

(numeric) cosine/inner product similarity between texts and feature. Average over bootstrapped samples if bootstrap = TRUE.

std.error

(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

Examples

# find contexts of immigration
context_immigration <- get_context(x = cr_sample_corpus,
                                   target = 'immigration',
                                   window = 6,
                                   valuetype = "fixed",
                                   case_insensitive = TRUE,
                                   hard_cut = FALSE, verbose = FALSE)

# find local vocab (use it to define the candidate of nearest neighbors)
local_vocab <- get_local_vocab(context_immigration$context, pre_trained = cr_glove_subset)

set.seed(42L)
nns_immigration <- bootstrap_nns(context = context_immigration$context,
                                 pre_trained = cr_glove_subset,
                                 transform_matrix = cr_transform,
                                 transform = TRUE,
                                 candidates = local_vocab,
                                 bootstrap = TRUE,
                                 num_bootstraps = 100,
                                 confidence_level = 0.95,
                                 N = 50,
                                 norm = "l2")

head(nns_immigration)

Bootstrap OLS

Description

Bootstrap model coefficients and standard errors

Usage

bootstrap_ols(Y = NULL, X = NULL, stratify = NULL)

Arguments

Y

vector of regression model's dependent variable (embedded context)

X

data.frame of model independent variables (covariates)

stratify

covariates to stratify when bootstrapping

Value

a list with two elements: betas = list of beta coefficients (D-dimensional vectors); normed_betas = tibble with the norms of the non-intercept coefficients
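
Examples

A minimal sketch that mirrors how conText() plausibly uses this helper internally (an assumption, as is the exact shape of Y and X): Y is an N x D matrix of embedded contexts and X a data.frame of binary covariates.

library(quanteda)

# embed contexts around "immigration"
toks <- tokens(cr_sample_corpus)
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)
immig_dfm <- dfm(immig_toks)
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# dependent variable: N x D matrix of ALC embeddings
Y <- as.matrix(immig_dem)

# independent variable: binary party indicator
X <- data.frame(party_R = as.numeric(immig_dem@docvars$party == "R"))

set.seed(42L)
ols_boot <- bootstrap_ols(Y = Y, X = X, stratify = NULL)
ols_boot$normed_betas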


Bootstrap similarity vector

Description

Bootstrap similarity vector

Usage

bootstrap_similarity(
  target_embeddings = NULL,
  pre_trained = NULL,
  candidates = NULL,
  norm = NULL
)

Arguments

target_embeddings

the target embeddings (embeddings of context)

pre_trained

a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions

candidates

character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab

norm

character = c("l2", "none") - set to 'l2' for cosine similarity and to 'none' for inner product (see ?sim2 in text2vec)

Value

vector(s) of cosine similarities between the ALC embedding(s) and nearest neighbor candidates
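
Examples

A minimal sketch under the same assumptions as the bootstrap_contrast example above: target_embeddings is taken to be the per-instance ALC embeddings returned by embed_target().

context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed",
                                   case_insensitive = TRUE,
                                   hard_cut = FALSE, verbose = FALSE)

immig_emb <- embed_target(context = context_immigration$context,
                          pre_trained = cr_glove_subset,
                          transform = TRUE, transform_matrix = cr_transform,
                          aggregate = FALSE, verbose = FALSE)

set.seed(42L)
sim_boot <- bootstrap_similarity(target_embeddings = immig_emb$target_embedding,
                                 pre_trained = cr_glove_subset,
                                 candidates = immig_emb$local_vocab,
                                 norm = "l2")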


build a conText-class object

Description

build a conText-class object

Usage

build_conText(
  Class = "conText",
  x_conText,
  normed_coefficients = data.frame(),
  features = character(),
  Dimnames = list()
)

Arguments

Class

defines the class of this object (fixed)

x_conText

a dgCMatrix-class matrix

normed_coefficients

a data.frame with the normed coefficients and other statistics

features

features used in computing the embeddings

Dimnames

row (features) and column (NULL) names


build a dem-class object

Description

build a dem-class object

Usage

build_dem(
  Class = "em",
  x_dem,
  docvars = data.frame(),
  features = character(),
  Dimnames = list()
)

Arguments

Class

defines the class of this object (fixed)

x_dem

a dgCMatrix-class matrix

docvars

document covariates, inherited from dfm and corpus, subset to embeddable documents

features

features used in computing the embeddings

Dimnames

row (documents) and column (NULL) names


build a fem-class object

Description

build a fem-class object

Usage

build_fem(
  Class = "fem",
  x_fem,
  features = character(),
  counts = numeric(),
  Dimnames = list()
)

Arguments

Class

defines the class of this object (fixed)

x_fem

a dgCMatrix-class matrix

features

features used in computing the embeddings

counts

counts of features used in computing embeddings

Dimnames

row (features) and column (NULL) names


Compute similarity and similarity ratios

Description

Compute similarity and similarity ratios

Usage

compute_contrast(
  target_embeddings1 = NULL,
  target_embeddings2 = NULL,
  pre_trained = NULL,
  candidates = NULL,
  norm = NULL
)

Arguments

target_embeddings1

ALC embeddings for group 1

target_embeddings2

ALC embeddings for group 2

pre_trained

a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions

candidates

character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab

norm

character = c("l2", "none") - set to 'l2' for cosine similarity and to 'none' for inner product (see ?sim2 in text2vec)

Value

a list with three elements: nns for group 1, nns for group 2, and nns_ratio with the ratios of similarities between the two groups
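
Examples

compute_contrast() takes the same inputs as bootstrap_contrast() but returns a single, non-bootstrapped set of similarities and ratios. The snippet below reuses the objects emb_D, emb_R and candidates built in the bootstrap_contrast sketch above (same assumptions apply).

compute_contrast(target_embeddings1 = emb_D$target_embedding,
                 target_embeddings2 = emb_R$target_embedding,
                 pre_trained = cr_glove_subset,
                 candidates = candidates,
                 norm = "l2")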


Compute similarity vector (sub-function of bootstrap_similarity)

Description

Compute similarity vector (sub-function of bootstrap_similarity)

Usage

compute_similarity(
  target_embeddings = NULL,
  pre_trained = NULL,
  candidates = NULL,
  norm = NULL
)

Arguments

target_embeddings

the target embeddings (embeddings of context)

pre_trained

a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions

candidates

character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab

norm

character = c("l2", "none") - set to 'l2' for cosine similarity and to 'none' for inner product (see ?sim2 in text2vec)

Value

vector of cosine similarities between the ALC embedding and nearest neighbor candidates
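
Examples

A minimal sketch; this sub-function takes the same inputs as bootstrap_similarity() but returns a plain (non-bootstrapped) similarity vector. Using a single aggregated ALC embedding here is an assumption about the expected input shape.

context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, verbose = FALSE)

immig_emb <- embed_target(context = context_immigration$context,
                          pre_trained = cr_glove_subset,
                          transform = TRUE, transform_matrix = cr_transform,
                          aggregate = TRUE, verbose = FALSE)

compute_similarity(target_embeddings = immig_emb$target_embedding,
                   pre_trained = cr_glove_subset,
                   candidates = immig_emb$local_vocab,
                   norm = "l2")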


Compute transformation matrix A

Description

Computes a transformation matrix, given a feature-co-occurrence matrix and corresponding pre-trained embeddings.

Usage

compute_transform(x, pre_trained, weighting = 500)

Arguments

x

a (quanteda) fcm-class object.

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings, usually trained on the same corpus as that used for x. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding

weighting

(character or numeric) weighting options:

1

no weighting.

"log"

weight by the log of the frequency count.

numeric

threshold based weighting (= 1 if token count meets threshold, 0 otherwise).

Recommended: use log for small corpora, a numeric threshold for larger corpora.

Value

a dgTMatrix-class D x D non-symmetrical matrix (D = dimensions of pre-trained embedding space) corresponding to an 'a la carte' transformation matrix. This matrix is optimized for the corpus and pre-trained embeddings employed.

Examples

library(quanteda)

# note, cr_sample_corpus is too small to produce meaningful word vectors

# tokenize
toks <- tokens(cr_sample_corpus)

# construct feature-co-occurrence matrix
toks_fcm <- fcm(toks, context = "window", window = 6,
count = "weighted", weights = 1 / (1:6), tri = FALSE)

# you will generally want to estimate a new (corpus-specific)
# GloVe model; we will use cr_glove_subset instead
# (see the Quick Start Guide for a full example)

# estimate transform
local_transform <- compute_transform(x = toks_fcm,
pre_trained = cr_glove_subset, weighting = 'log')

Embedding regression

Description

Estimates an embedding regression model with options to use bootstrapping (to be deprecated) or jackknife debiasing to estimate confidence intervals, and a permutation test for inference (see https://github.com/prodriguezsosa/conText for details).

Usage

conText(
  formula,
  data,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = FALSE,
  num_bootstraps = 100,
  stratify = FALSE,
  jackknife = TRUE,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  window = 6L,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  hard_cut = FALSE,
  verbose = TRUE
)

Arguments

formula

a symbolic description of the model to be fitted with a target word as the DV, e.g. immigrant ~ party + gender. To use a phrase as the DV, place it in quotations, e.g. "immigrant refugees" ~ party + gender. To use all covariates included in the data, use . on the RHS, e.g. immigrant ~ .. To treat the full document, rather than a single target word, as the DV, use . on the LHS, e.g. . ~ party + gender. Any character or factor covariates will automatically be converted to a set of binary (0/1) indicator variables for each group, leaving the first level out of the regression.

data

a quanteda tokens-class object with the necessary document variables. Covariates must be either binary indicator variables or "transformable" into binary indicator variables. conText will automatically transform any non-indicator variables into binary indicator variables (multiple if more than 2 classes), leaving out a "base" category.

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping – sample from texts with replacement and re-run regression on each sample.

num_bootstraps

(numeric) number of bootstraps to use (at least 100). Ignored if bootstrap = FALSE.

stratify

(logical) if TRUE, stratify by discrete covariates when bootstrapping.

jackknife

(logical) if TRUE (default), jackknife (leave one out) debiasing is implemented. Implies n resamples.

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

permute

(logical) if TRUE, compute empirical p-values using permutation test

num_permutations

(numeric) number of permutations to use

window

the number of context words to be displayed around the keyword

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

hard_cut

(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)

verbose

(logical) - if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.

Value

a conText-class object - a D x M matrix with D = dimensions of the pre-trained feature embeddings provided and M = number of covariates including the intercept. These represent the estimated regression coefficients. These can be combined to compute ALC embeddings for different combinations of covariates. The object also includes various informative attributes, importantly a data.frame with the following columns:

coefficient

(character) name of (covariate) coefficient.

value

(numeric) norm of the corresponding beta coefficient (debiased if jackknife = TRUE).

std.error

(numeric) (if bootstrap = TRUE or jackknife = TRUE) std. error of the (debiased if jackknife = TRUE) norm of the beta coefficient.

lower.ci

(numeric) (if bootstrap = TRUE or jackknife = TRUE) lower bound of the (debiased if jackknife = TRUE) confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE or jackknife = TRUE) upper bound of the (debiased if jackknife = TRUE) confidence interval.

p.value

(numeric) (if permute = TRUE) empirical p.value of the norm of the coefficient.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

## given the target word "immigration"
set.seed(2021L)
model1 <- conText(formula = immigration ~ party + gender,
                 data = toks,
                 pre_trained = cr_glove_subset,
                 transform = TRUE,
                 transform_matrix = cr_transform,
                 bootstrap = FALSE,
                 jackknife = TRUE,
                 confidence_level = 0.95,
                 permute = TRUE,
                 num_permutations = 10,
                 window = 6,
                 case_insensitive = TRUE,
                 verbose = FALSE)

# notice, character/factor covariates are automatically "dummified"
rownames(model1)

# the beta coefficient 'partyR' in this case corresponds to the alc embedding
# of "immigration" for Republican party speeches

# (normed) coefficient table
model1@normed_coefficients

Contrast nearest neighbors

Description

Computes the ratio of cosine similarities between group embeddings and features – that is, for any given feature it first computes the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group.

Usage

contrast_nns(
  x,
  groups = NULL,
  pre_trained = NULL,
  transform = TRUE,
  transform_matrix = NULL,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  candidates = NULL,
  N = 20,
  verbose = TRUE
)

Arguments

x

(quanteda) tokens-class object

groups

(numeric, factor, character) a binary variable of the same length as x

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping – sample from texts with replacement and re-estimate cosine ratios for each sample. Required to get std. errors.

num_bootstraps

(numeric) - number of bootstraps to use

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

permute

(logical) - if TRUE, compute empirical p-values using a permutation test

num_permutations

(numeric) - number of permutations to use

candidates

(character) vector of candidate features for nearest neighbors

N

(numeric) - nearest neighbors are subset to the union of the N neighbors of each group (if NULL, ratio is computed for all features)

verbose

(logical) - if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.

Value

a data.frame with the following columns:

feature

(character) vector of feature terms corresponding to the nearest neighbors.

value

(numeric) ratio of cosine similarities. Average over bootstrapped samples if bootstrap = TRUE.

std.error

(numeric) std. error of the ratio of cosine similarities. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

p.value

(numeric) empirical p-value. Column is dropped if permute = FALSE.

Examples

library(quanteda)

cr_toks <- tokens(cr_sample_corpus)

immig_toks <- tokens_context(x = cr_toks,
pattern = "immigration", window = 6L, hard_cut = FALSE, verbose = TRUE)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

set.seed(42L)
party_nns <- contrast_nns(x = immig_toks,
groups = docvars(immig_toks, 'party'),
pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform,
bootstrap = TRUE,
num_bootstraps = 100,
confidence_level = 0.95,
permute = TRUE, num_permutations = 10,
candidates = NULL, N = 20,
verbose = FALSE)

head(party_nns)

Compute the cosine similarity between one or more ALC embeddings and a set of features.

Description

Compute the cosine similarity between one or more ALC embeddings and a set of features.

Usage

cos_sim(
  x,
  pre_trained,
  features = NULL,
  stem = FALSE,
  language = "porter",
  as_list = TRUE,
  show_language = TRUE
)

Arguments

x

a (quanteda) dem-class or fem-class object.

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

features

(character) features of interest.

stem

(logical) - If TRUE, both features and rownames(pre_trained) are stemmed and average cosine similarities are reported. We recommend you remove misspelled words from pre_trained as these can significantly influence the average.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).

as_list

(logical) if FALSE all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per feature.

show_language

(logical) if TRUE print out message with language used for stemming.

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target

(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).

feature

(character) feature terms defined in the features argument.

value

(numeric) cosine similarity between x and feature.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# compute the cosine similarity between each party's embedding and a specific set of features
cos_sim(x = immig_wv_party, pre_trained = cr_glove_subset,
features = c('reform', 'enforcement'), as_list = FALSE)

GloVe subset

Description

A subset of a GloVe embeddings model trained on the top 5000 features in the Congressional Record corpus covering the 111th to 114th Congresses, and limited to speeches by Democrat and Republican representatives.

Usage

cr_glove_subset

Format

A matrix with 500 rows and 300 columns:

row

each row corresponds to a word

column

each column corresponds to a dimension in the embedding space

...

Source

https://www.dropbox.com/s/p84wzv8bdmziog8/cr_glove.R?dl=0
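
Examples

A quick inspection sketch (not part of the original documentation):

dim(cr_glove_subset)                  # 500 features x 300 dimensions
cr_glove_subset["immigration", 1:5]   # first 5 dimensions of one embedding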


Congressional Record sample corpus

Description

A (quanteda) corpus containing a sample of the United States Congressional Record (daily transcripts) covering the 111th to 114th Congresses. The raw corpus is first subset to speeches containing the regular expression "immig*". Then 100 docs from each party-gender pair are randomly sampled. For the full data and pre-processing file, see: https://www.dropbox.com/sh/jsyrag7opfo7l7i/AAB1z7tumLuKihGu2-FDmhmKa?dl=0 For nominate scores, see: https://voteview.com/data

Usage

cr_sample_corpus

Format

A quanteda corpus with 200 documents and 3 docvars:

party

party of speaker, (D)emocrat or (R)epublican

gender

gender of speaker, (F)emale or (M)ale

nominate_dim1

dimension 1 of the nominate score

...

Source

https://data.stanford.edu/congress_text
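
Examples

A quick inspection sketch (not part of the original documentation):

library(quanteda)

head(docvars(cr_sample_corpus))
table(docvars(cr_sample_corpus, "party"),
      docvars(cr_sample_corpus, "gender"))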


Transformation matrix

Description

A square matrix corresponding to the transformation matrix computed using the cr_glove_subset embeddings and corresponding corpus.

Usage

cr_transform

Format

A 300 by 300 matrix.

Source

https://www.dropbox.com/s/p84wzv8bdmziog8/cr_glove.R?dl=0


Build a document-embedding matrix

Description

Given a document-feature-matrix, for each document, multiply its feature counts (columns) with their corresponding pretrained word embeddings and average (usually referred to as averaged or additive document embeddings). If transform = TRUE and a transformation matrix is provided, multiply the document embeddings by the transformation matrix to obtain the corresponding 'a la carte' document embeddings (see eq. 2: https://arxiv.org/pdf/1805.05388.pdf).

Usage

dem(x, pre_trained, transform = TRUE, transform_matrix, verbose = TRUE)

Arguments

x

a quanteda (dfm-class) document-feature-matrix

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

verbose

(logical) - if TRUE, report the documents that had no overlapping features with the pretrained embeddings provided.

Value

an N x D (dem-class) document-embedding-matrix corresponding to the ALC embeddings for each document. N = number of documents (that could be embedded), D = dimensions of pretrained embeddings. This object inherits the document variables in x, the dfm used. These can be accessed by calling the attribute: @docvars. Note, documents with no overlapping features with the pretrained embeddings provided are automatically dropped. For a list of the documents that were embedded call the attribute: @Dimnames$docs.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# construct document-feature-matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

Average document-embeddings in a dem by a grouping variable

Description

Average embeddings in a dem by a grouping variable: within each group, the document embeddings (rows) are averaged and a new "document" is created with the group label. Similar in essence to dfm_group.

Usage

dem_group(x, groups = NULL)

Arguments

x

a (dem-class) document-embedding-matrix

groups

a character or factor variable equal in length to the number of documents

Value

a G x D (dem-class) document-embedding-matrix corresponding to the ALC embeddings for each group. G = number of unique groups defined in the groups variable, D = dimensions of pretrained embeddings.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem,
groups = immig_dem@docvars$party)

Randomly sample documents from a dem

Description

Take a random sample of documents from a dem, with or without replacement, and with the option to group by a variable in dem@docvars. Note: dem_sample uses dplyr::sample_frac under the hood; as such, size refers to the fraction of total obs.

Usage

dem_sample(x, size = NULL, replace = FALSE, weight = NULL, by = NULL)

Arguments

x

a (dem-class) document-embedding-matrix

size

<tidy-select> For sample_n(), the number of rows to select. For sample_frac(), the fraction of rows to select. If tbl is grouped, size applies to each group.

replace

Sample with or without replacement?

weight

(numeric) Sampling weights. Vector of non-negative numbers of length nrow(x). Weights are automatically standardised to sum to 1 (see dplyr::sample_frac). May not be applied when by is used.

by

(character or factor vector) either of length 1 with the name of the grouping variable for sampling (refer to the variable with quotations, e.g. "party"; it must be a variable in dem@docvars), or of length nrow(x).

Value

a size x D (dem-class) document-embedding-matrix corresponding to the sampled ALC embeddings. Note, @features in the resulting object will correspond to the original @features, that is, they are not subsetted to the sampled documents. For a list of the documents that were sampled call the attribute: @Dimnames$docs.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get a random sample
immig_wv_party <- dem_sample(immig_dem, size = 10,
replace = TRUE, by = "party")

# also works
immig_wv_party <- dem_sample(immig_dem, size = 10,
replace = TRUE, by = immig_dem@docvars$party)

Embed target using either: (a) a la carte OR (b) simple (untransformed) averaging of context embeddings

Description

For a vector of contexts (generally the context variable in get_context output), return the transformed (or untransformed) additive embeddings, aggregated or by instance, along with the local vocabulary. Keep track of which contexts were embedded and which were excluded.

Usage

embed_target(
  context,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  aggregate = TRUE,
  verbose = TRUE
)

Arguments

context

(character) vector of texts - context variable in get_context output

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

aggregate

(logical) - if TRUE (default) output will return one embedding (i.e. averaged over all instances of target) if FALSE output will return one embedding per instance

verbose

(logical) - report the observations that had no overlap with the provided pre-trained embeddings

Details

required packages: quanteda

Value

list with three elements:

target_embedding

the target embedding(s). Values and dimensions will vary with the above settings.

local_vocab

(character) vocabulary that appears in the set of contexts provided.

obs_included

(integer) rows of the context vector that were included in the computation. A row (context) is excluded when none of the words in the context are present in the pre-trained embeddings provided.

Examples

# find contexts for term immigration
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                        window = 6, valuetype = "fixed", case_insensitive = TRUE,
                        hard_cut = FALSE, verbose = FALSE)

contexts_vectors <- embed_target(context = context_immigration$context,
pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform,
aggregate = FALSE, verbose = FALSE)

Given two feature-embedding-matrices, compute "parallel" cosine similarities between overlapping features.

Description

Efficient way of comparing two corpora along many features simultaneously.

Usage

feature_sim(x, y, features = character(0))

Arguments

x

a (fem-class) feature embedding matrix.

y

a (fem-class) feature embedding matrix.

features

(character) vector of features for which to compute similarity scores. If not defined then all overlapping features will be used.

Value

a data.frame with the following columns:

feature

(character) overlapping features

value

(numeric) cosine similarity between overlapping features.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# create feature co-occurrence matrix for each party (set tri = FALSE to work with fem)
fcm_D <- fcm(toks[docvars(toks, 'party') == "D",],
context = "window", window = 6, count = "frequency", tri = FALSE)
fcm_R <- fcm(toks[docvars(toks, 'party') == "R",],
context = "window", window = 6, count = "frequency", tri = FALSE)

# compute feature-embedding matrix
fem_D <- fem(fcm_D, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)
fem_R <- fem(fcm_R, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# compare "horizontal" cosine similarity
feat_comp <- feature_sim(x = fem_R, y = fem_D)

Create a feature-embedding matrix

Description

Given a feature-co-occurrence matrix, for each feature, multiply its feature counts (columns) with their corresponding pre-trained embeddings and average (usually referred to as averaged or additive embeddings). If transform = TRUE and a transformation matrix is provided, multiply the feature embeddings by the transformation matrix to obtain the corresponding 'a la carte' embeddings (see eq. 2: https://arxiv.org/pdf/1805.05388.pdf).

Usage

fem(x, pre_trained, transform = TRUE, transform_matrix, verbose = TRUE)

Arguments

x

a quanteda (fcm-class) feature-co-occurrence-matrix

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

verbose

(logical) - if TRUE, report the features that had no overlapping (co-occurring) features with the pretrained embeddings provided.

Value

a fem-class object

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# create feature co-occurrence matrix (set tri = FALSE to work with fem)
toks_fcm <- fcm(toks, context = "window", window = 6,
count = "frequency", tri = FALSE)

# compute feature-embedding matrix
toks_fem <- fem(toks_fcm, pre_trained = cr_glove_subset,
transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

Find cosine similarities between target and candidate words

Description

Find cosine similarities between target and candidate words

Usage

find_cos_sim(target_embedding, pre_trained, candidates, norm = "l2")

Arguments

target_embedding

matrix of numeric values

pre_trained

matrix of numeric values - pretrained embeddings

candidates

character vector defining vocabulary to subset comparison to

norm

character = c("l2", "none") - how to scale input matrices. If they are already scaled - use "none" (see ?sim2)

Value

a vector of cosine similarities, one for each candidate
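
Examples

A minimal sketch, modeled on the find_nns() example below: a pre-trained vector serves as the target embedding, and the candidate terms are known to be in cr_glove_subset (they appear in the cos_sim() example).

target <- matrix(cr_glove_subset['immigration',], nrow = 1)
find_cos_sim(target_embedding = target,
             pre_trained = cr_glove_subset,
             candidates = c('reform', 'enforcement'),
             norm = "l2")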


Return nearest neighbors based on cosine similarity

Description

Return nearest neighbors based on cosine similarity

Usage

find_nns(
  target_embedding,
  pre_trained,
  N = 5,
  candidates = NULL,
  norm = "l2",
  stem = FALSE,
  language = "porter"
)

Arguments

target_embedding

(numeric) 1 x D matrix. D = dimensions of pretrained embeddings.

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

N

(numeric) number of nearest neighbors to return.

candidates

(character) vector of candidate features for nearest neighbors

norm

(character) - how to compute similarity (see ?text2vec::sim2):

"l2"

cosine similarity

"none"

inner product

stem

(logical) - whether to stem candidates when evaluating nns. Default is FALSE. If TRUE, candidate stems are ranked by their average cosine similarity to the target. We recommend you remove misspelled words from the candidate set as these can significantly influence the average.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).

Value

(character) vector of nearest neighbors to target

Examples

find_nns(target_embedding = cr_glove_subset['immigration',],
         pre_trained = cr_glove_subset, N = 5,
         candidates = NULL, norm = "l2", stem = FALSE)

Get context words (words within a symmetric window around the target word/phrase) surrounding a user-defined target.

Description

A wrapper function for quanteda's kwic() function that subsets documents to where target is present before tokenizing to speed up processing, and concatenates kwic's pre/post variables into a context column.

Usage

get_context(
  x,
  target,
  window = 6L,
  valuetype = "fixed",
  case_insensitive = TRUE,
  hard_cut = FALSE,
  what = "word",
  verbose = TRUE
)

Arguments

x

(character) vector - this is the set of documents (corpus) of interest.

target

(character) vector - these are the target words whose contexts we want to evaluate. This vector may include a single token, a phrase, or multiple tokens and/or phrases.

window

(numeric) - defines the size of a context (words around the target).

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

hard_cut

(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)

what

(character) defines which quanteda tokenizer to use. You will rarely want to change this. For Chinese text you may want to set what = 'fastestword'.

verbose

(logical) - if TRUE, report the total number of target instances found.

Value

a data.frame with the following columns:

docname

(character) document name to which instances belong.

target

(character) targets.

context

(character) pre/post variables in kwic() output concatenated.

Note

target in the return data.frame is equivalent to kwic()'s keyword output variable, so it may not match the user-defined target exactly if valuetype is not fixed.

Examples

# get context words surrounding the term immigration
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed", case_insensitive = FALSE,
                                   hard_cut = FALSE, verbose = FALSE)

Given a tokenized corpus, compute the cosine similarities of the resulting ALC embeddings and a defined set of features.

Description

This is a wrapper function for cos_sim() that allows users to go from a tokenized corpus to results with the option to bootstrap cosine similarities and get the corresponding std. errors.

Usage

get_cos_sim(
  x,
  groups = NULL,
  features = character(0),
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  stem = FALSE,
  language = "porter",
  as_list = TRUE
)

Arguments

x

a (quanteda) tokens-class object

groups

(numeric, factor, character) a binary variable of the same length as x

features

(character) features of interest

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping – sample from texts with replacement and re-estimate cosine similarities for each sample. Required to get std. errors. If groups defined, sampling is automatically stratified.

num_bootstraps

(integer) number of bootstraps to use.

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

stem

(logical) - If TRUE, both features and rownames(pre_trained) are stemmed and average cosine similarities are reported. We recommend you remove misspelled words from pre_trained as these can significantly influence the average.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).

as_list

(logical) if FALSE all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per feature.

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target

(character) rownames of x, the labels of the ALC embeddings.

feature

(character) feature terms defined in the features argument.

value

(numeric) cosine similarity between x and feature. Average over bootstrapped samples if bootstrap = TRUE.

std.error

(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# compute the cosine similarity between each group's embedding
# and a specific set of features
set.seed(2021L)
get_cos_sim(x = immig_toks,
            groups = docvars(immig_toks, 'party'),
            features = c("reform", "enforce"),
            pre_trained = cr_glove_subset,
            transform = TRUE,
            transform_matrix = cr_transform,
            bootstrap = TRUE,
            num_bootstraps = 100,
            confidence_level = 0.95,
            stem = TRUE,
            as_list = FALSE)

Get averaged similarity scores between target word(s) and one or two vectors of candidate words.

Description

Get similarity scores between a target word or words and a comparison vector of one or more candidate words. When two vectors of candidate words are provided (second_vec is not NULL), the function calculates the cosine similarity between the target and a composite index of the two vectors. This is operationalized as the mean similarity of the target word to the first vector of terms plus negative one multiplied by the mean similarity to the second vector of terms.

Usage

get_grouped_similarity(
  x,
  target,
  first_vec,
  second_vec,
  pre_trained,
  transform_matrix,
  group_var,
  window = window,
  norm = "l2",
  remove_punct = FALSE,
  remove_symbols = FALSE,
  remove_numbers = FALSE,
  remove_separators = FALSE,
  valuetype = "fixed",
  hard_cut = FALSE,
  case_insensitive = TRUE
)

Arguments

x

a (quanteda) corpus object

target

(character) vector of words

first_vec

(character) vector of words

second_vec

(character) vector of words

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings, usually trained on the same corpus as that used for x. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

group_var

(character) variable name in corpus object defining grouping variable

window

(numeric) - defines the size of a context (words around the target)

norm

(character) - "l2" for l2 normalized cosine similarity and "none" for dot product

remove_punct

(logical) - if TRUE remove all characters in the Unicode "Punctuation" [P] class

remove_symbols

(logical) - if TRUE remove all characters in the Unicode "Symbol" [S] class

remove_numbers

(logical) - if TRUE remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day

remove_separators

(logical) - if TRUE remove separators and separator characters (Unicode "Separator" [Z] and "Control" [C] categories)

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching

hard_cut

(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)

case_insensitive

(logical) - if TRUE, ignore case when matching a target pattern

Value

a data.frame with the following columns:

group

the grouping variable specified for the analysis

val

(numeric) cosine similarity scores

Examples

quanteda::docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)
cos_simsdf <- get_grouped_similarity(cr_sample_corpus,
                                    group_var = "year",
                                    target = "immigration",
                                    first_vec = c("left", "lefty"),
                                    second_vec = c("right", "rightwinger"),
                                    pre_trained = cr_glove_subset,
                                    transform_matrix = cr_transform,
                                    window = 12L,
                                    norm = "l2")

Identify words common to a collection of texts and a set of pretrained embeddings.

Description

The local vocab consists of the intersection of the set of pretrained embeddings and the collection of texts.

Usage

get_local_vocab(context, pre_trained)

Arguments

context

(character) vector of contexts (usually context in get_context() output)

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

Value

(character) vector of words common to the texts and pretrained embeddings.

Examples

# find local vocab (use it to define the candidate of nearest neighbors)
local_vocab <- get_local_vocab(cr_sample_corpus, pre_trained = cr_glove_subset)

Given a set of tokenized contexts, find the top N nearest contexts.

Description

This is a wrapper function for ncs() that allows users to go from a tokenized corpus to results with the option to bootstrap cosine similarities and get the corresponding std. errors.

Usage

get_ncs(
  x,
  N = 5,
  groups = NULL,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  as_list = TRUE
)

Arguments

x

a (quanteda) tokens-class object

N

(numeric) number of nearest contexts to return

groups

a character or factor variable equal in length to the number of documents

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping – sample from x with replacement and re-estimate cosine similarities for each sample. Required to get std. errors. If groups defined, sampling is automatically stratified.

num_bootstraps

(integer) number of bootstraps to use.

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

as_list

(logical) if FALSE all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per embedding.

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target

(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).

context

(character) contexts collapsed into single documents (i.e. untokenized).

rank

(character) rank of context in terms of similarity with x.

value

(numeric) cosine similarity between x and context.

std.error

(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration",
window = 6L, rm_keyword = FALSE)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# compare nearest contexts between groups
set.seed(2021L)
immig_party_ncs <- get_ncs(x = immig_toks,
                           N = 10,
                           groups = docvars(immig_toks, 'party'),
                           pre_trained = cr_glove_subset,
                           transform = TRUE,
                           transform_matrix = cr_transform,
                           bootstrap = TRUE,
                           num_bootstraps = 100,
                           confidence_level = 0.95,
                           as_list = TRUE)

# nearest contexts of "immigration" for the Democratic party
immig_party_ncs[["D"]]

Given a tokenized corpus and a set of candidate neighbors, find the top N nearest neighbors.

Description

This is a wrapper function for nns() that allows users to go from a tokenized corpus to results with the option to bootstrap cosine similarities and get the corresponding std. errors.

Usage

get_nns(
  x,
  N = 10,
  groups = NULL,
  candidates = character(0),
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  stem = FALSE,
  language = "porter",
  as_list = TRUE
)

Arguments

x

a (quanteda) tokens-class object

N

(numeric) number of nearest neighbors to return

groups

a character or factor variable equal in length to the number of documents

candidates

(character) vector of features to consider as candidates to be nearest neighbors. You may, for example, want to only consider features that meet a certain count threshold or exclude stop words. To do so, simply identify the set of features you want to consider and supply these as a character vector in the candidates argument.

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping – sample from x with replacement and re-estimate cosine similarities for each sample. Required to get std. errors. If groups defined, sampling is automatically stratified.

num_bootstraps

(integer) number of bootstraps to use.

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

stem

(logical) - whether to stem candidates when evaluating nns. Default is FALSE. If TRUE, candidate stems are ranked by their average cosine similarity to the target. We recommend you remove misspelled words from the candidate set as these can significantly influence the average.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).

as_list

(logical) if FALSE all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per group.

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target

(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).

feature

(character) features identified as nearest neighbors.

rank

(character) rank of feature in terms of similarity with x.

value

(numeric) cosine similarity between x and feature. Average over bootstrapped samples if bootstrap = TRUE.

std.error

(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# we limit candidates to features in our corpus
feats <- featnames(dfm(immig_toks))

# compare nearest neighbors between groups
set.seed(2021L)
immig_party_nns <- get_nns(x = immig_toks, N = 10,
                           groups = docvars(immig_toks, 'party'),
                           candidates = feats,
                           pre_trained = cr_glove_subset,
                           transform = TRUE,
                           transform_matrix = cr_transform,
                           bootstrap = TRUE,
                           num_bootstraps = 100,
                           stem = TRUE,
                           as_list = TRUE)

# nearest neighbors of "immigration" for Republican party
immig_party_nns[["R"]]

Given a corpus and a binary grouping variable, computes the ratio of cosine similarities over the union of their respective N nearest neighbors.

Description

This is a wrapper function for nns_ratio() that allows users to go from a tokenized corpus to results, with the option to: (1) bootstrap cosine similarity ratios and get the corresponding std. errors; (2) use a permutation test to get empirical p-values for inference.

Usage

get_nns_ratio(
  x,
  N = 10,
  groups,
  numerator = NULL,
  candidates = character(0),
  pre_trained,
  transform = TRUE,
  transform_matrix,
  bootstrap = TRUE,
  num_bootstraps = 100,
  confidence_level = 0.95,
  permute = TRUE,
  num_permutations = 100,
  stem = FALSE,
  language = "porter",
  verbose = TRUE,
  show_language = TRUE
)

Arguments

x

a (quanteda) tokens object

N

(numeric) number of nearest neighbors to return. Nearest neighbors consist of the union of the top N nearest neighbors of the embeddings in x. If these overlap, the resulting set will be smaller than 2*N.

groups

a character or factor variable equal in length to the number of documents

numerator

(character) defines which group is the numerator in the ratio.

candidates

(character) vector of features to consider as candidates to be nearest neighbors. You may, for example, want to only consider features that meet a certain count threshold or exclude stop words. To do so, simply identify the set of features you want to consider and supply these as a character vector in the candidates argument.

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) if TRUE (default) apply the 'a la carte' transformation, if FALSE output untransformed averaged embeddings.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

bootstrap

(logical) if TRUE, use bootstrapping – sample from texts with replacement and re-estimate cosine similarity ratios for each sample. Required to get std. errors. If groups defined, sampling is automatically stratified.

num_bootstraps

(integer) number of bootstraps to use.

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

permute

(logical) if TRUE, compute empirical p-values using permutation test

num_permutations

(numeric) number of permutations to use.

stem

(logical) - whether to stem candidates when evaluating nns. Default is FALSE. If TRUE, candidate stems are ranked by their average cosine similarity to the target. We recommend you remove misspelled words from the candidate set as these can significantly influence the average.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).

verbose

(logical) if TRUE, report which group is the numerator and which is the denominator.

show_language

(logical) if TRUE print out message with language used for stemming.

Value

a data.frame with the following columns:

feature

(character) features in candidates (or all features if candidates not defined), one instance for each embedding in x.

value

(numeric) cosine similarity ratio between x and feature. Average over bootstrapped samples if bootstrap = TRUE.

std.error

(numeric) std. error of the similarity value. Column is dropped if bootstrap = FALSE.

lower.ci

(numeric) (if bootstrap = TRUE) lower bound of the confidence interval.

upper.ci

(numeric) (if bootstrap = TRUE) upper bound of the confidence interval.

p.value

(numeric) empirical p-value of bootstrapped ratio of cosine similarities if permute = TRUE, if FALSE, column is dropped.

group

(character) the group in groups for which the feature is among the top N nearest neighbors. If "shared", the feature is a top nearest neighbor for both groups.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# we limit candidates to features in our corpus
feats <- featnames(dfm(immig_toks))

# compute ratio
set.seed(2021L)
immig_nns_ratio <- get_nns_ratio(x = immig_toks,
                                 N = 10,
                                 groups = docvars(immig_toks, 'party'),
                                 numerator = "R",
                                 candidates = feats,
                                 pre_trained = cr_glove_subset,
                                 transform = TRUE,
                                 transform_matrix = cr_transform,
                                 bootstrap = TRUE,
                                 num_bootstraps = 100,
                                 permute = FALSE,
                                 num_permutations = 5,
                                 verbose = FALSE)

head(immig_nns_ratio)
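
The group column documented under Value makes it easy to separate shared from group-specific nearest neighbors. A minimal follow-up sketch using base R, relying only on the documented columns:

# features among the top N nearest neighbors for both parties
subset(immig_nns_ratio, group == "shared")

# features more discriminant of the numerator group ("R"), i.e. ratios above 1
subset(immig_nns_ratio, value > 1)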

Calculate cosine similarities between a target word and candidate words over a sequenced variable using the ALC embedding approach

Description

Calculate cosine similarities between a target word and candidate words over a sequenced variable using the ALC embedding approach.

Usage

get_seq_cos_sim(
  x,
  seqvar,
  target,
  candidates,
  pre_trained,
  transform_matrix,
  window = 6,
  valuetype = "fixed",
  case_insensitive = TRUE,
  hard_cut = FALSE,
  verbose = TRUE
)

Arguments

x

(character) vector - this is the set of documents (corpus) of interest

seqvar

ordered variable such as a list of dates or ordered ideology scores

target

(character) vector - target word

candidates

(character) vector of features of interest

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

window

(numeric) - defines the size of a context (words around the target).

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

hard_cut

(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)

verbose

(logical) - if TRUE, report the total number of target instances found.

Value

a data.frame with one column for each candidate term with corresponding cosine similarity values and one column for seqvar.

Examples

library(quanteda)

# gen sequence var (here: year)
docvars(cr_sample_corpus, 'year') <- rep(2011:2014, each = 50)
cos_simsdf <- get_seq_cos_sim(x = cr_sample_corpus,
                              seqvar = docvars(cr_sample_corpus, 'year'),
                              target = "equal",
                              candidates = c("immigration", "immigrants"),
                              pre_trained = cr_glove_subset,
                              transform_matrix = cr_transform)
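
Because the output contains one similarity column per candidate plus the sequence variable, it can be plotted directly. A minimal sketch, assuming the sequence column in the output is named seqvar:

# plot the similarity between "equal" and "immigration" over time
plot(cos_simsdf$seqvar, cos_simsdf$immigration, type = "b",
     xlab = "year", ylab = "cosine similarity")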

Given a set of embeddings and a set of tokenized contexts, find the top N nearest contexts.

Description

Given a set of embeddings and a set of tokenized contexts, find the top N nearest contexts.

Usage

ncs(x, contexts_dem, contexts = NULL, N = 5, as_list = TRUE)

Arguments

x

a (quanteda) dem-class or fem-class object.

contexts_dem

a dem-class object corresponding to the ALC embeddings of candidate contexts.

contexts

a (quanteda) tokens-class object of tokenized candidate contexts. Note, these must correspond to the same contexts in contexts_dem. If NULL, then the context (document) ids will be output instead of the text.

N

(numeric) number of nearest contexts to return

as_list

(logical) if FALSE, all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per embedding.

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target

(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).

context

(character) contexts collapsed into single documents (i.e. untokenized). If contexts is NULL then this variable will show the context (document) ids which you can use to merge.

rank

(character) rank of context in terms of similarity with x.

value

(numeric) cosine similarity between x and context.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*",
                             window = 6L, rm_keyword = FALSE)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# find nearest contexts by party
# (as_list = TRUE returns a list of data.frames, one per party;
# set as_list = FALSE to combine results into a single data.frame,
# useful for joint plotting)
ncs(x = immig_wv_party, contexts_dem = immig_dem,
    contexts = immig_toks, N = 5, as_list = TRUE)
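
Setting as_list = FALSE stacks the per-party results into one data.frame, which can then be filtered on the documented rank column. A minimal follow-up sketch:

# single data.frame with a target column identifying the party
immig_ncs <- ncs(x = immig_wv_party, contexts_dem = immig_dem,
                 contexts = immig_toks, N = 5, as_list = FALSE)

# most similar context for each party (rank 1)
subset(immig_ncs, rank == 1)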

Given a set of embeddings and a set of candidate neighbors, find the top N nearest neighbors.

Description

Given a set of embeddings and a set of candidate neighbors, find the top N nearest neighbors.

Usage

nns(
  x,
  N = 10,
  candidates = character(0),
  pre_trained,
  stem = FALSE,
  language = "porter",
  as_list = TRUE,
  show_language = TRUE
)

Arguments

x

a dem-class or fem-class object.

N

(numeric) number of nearest neighbors to return

candidates

(character) vector of features to consider as candidates to be nearest neighbors. You may, for example, want to only consider features that meet a certain count threshold or exclude stop words, etc. To do so, simply identify the set of features you want to consider and supply these as a character vector in the candidates argument.

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

stem

(logical) - whether to stem candidates when evaluating nns. Default is FALSE. If TRUE, candidate stems are ranked by their average cosine similarity to the target. We recommend removing misspelled words from the candidate set as these can significantly influence the average.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).

as_list

(logical) if FALSE, all results are combined into a single data.frame. If TRUE, a list of data.frames is returned with one data.frame per group.

show_language

(logical) if TRUE print out message with language used for stemming.

Value

a data.frame or list of data.frames (one for each target) with the following columns:

target

(character) rownames of x, the labels of the ALC embeddings. NA if is.null(rownames(x)).

feature

(character) features identified as nearest neighbors.

rank

(character) rank of feature in terms of similarity with x.

value

(numeric) cosine similarity between x and feature.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# find nearest neighbors by party
# (as_list = TRUE returns a list of data.frames, one per party;
# set as_list = FALSE to combine results into a single data.frame,
# useful for joint plotting)
immig_nns <- nns(immig_wv_party, pre_trained = cr_glove_subset,
                 N = 5, candidates = immig_wv_party@features,
                 stem = FALSE, as_list = TRUE)
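
The list output can be indexed by group label; a short follow-up sketch, assuming the party labels "D" and "R" used elsewhere in these examples:

# nearest neighbors for the Republican party
immig_nns[["R"]]

# or combine both groups into a single data.frame
nns(immig_wv_party, pre_trained = cr_glove_subset,
    N = 5, candidates = immig_wv_party@features, as_list = FALSE)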

Computes the ratio of cosine similarities for two embeddings over the union of their respective top N nearest neighbors.

Description

Computes the ratio of cosine similarities between group embeddings and features –that is, for any given feature it first computes the similarity between that feature and each group embedding, and then takes the ratio of these two similarities. This ratio captures how "discriminant" a feature is of a given group. Values larger (smaller) than 1 mean the feature is more (less) discriminant of the group in the numerator (denominator).

Usage

nns_ratio(
  x,
  N = 10,
  numerator = NULL,
  candidates = character(0),
  pre_trained,
  stem = FALSE,
  language = "porter",
  verbose = TRUE,
  show_language = TRUE
)

Arguments

x

a (quanteda) dem-class or fem-class object.

N

(numeric) number of nearest neighbors to return. Nearest neighbors consist of the union of the top N nearest neighbors of the embeddings in x. If these overlap, the resulting set will be smaller than 2*N.

numerator

(character) defines which group is the numerator in the ratio.

candidates

(character) vector of features to consider as candidates to be nearest neighbors. You may, for example, want to only consider features that meet a certain count threshold or exclude stop words, etc. To do so, simply identify the set of features you want to consider and supply these as a character vector in the candidates argument.

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

stem

(logical) - whether to stem candidates when evaluating nns. Default is FALSE. If TRUE, candidate stems are ranked by their average cosine similarity to the target. We recommend removing misspelled words from the candidate set as these can significantly influence the average.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes).

verbose

(logical) if TRUE, report which group is the numerator and which group is the denominator.

show_language

(logical) if TRUE print out message with language used for stemming.

Value

a data.frame with the following columns:

feature

(character) features in candidates (or all features if candidates not defined), one instance for each embedding in x.

value

(numeric) ratio of cosine similarities.

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)

# build document-feature matrix
immig_dfm <- dfm(immig_toks)

# construct document-embedding-matrix
immig_dem <- dem(immig_dfm, pre_trained = cr_glove_subset,
                 transform = TRUE, transform_matrix = cr_transform, verbose = FALSE)

# to get group-specific embeddings, average within party
immig_wv_party <- dem_group(immig_dem, groups = immig_dem@docvars$party)

# compute the ratio of cosine similarities between each party's
# embedding and the set of candidate features
nns_ratio(x = immig_wv_party, N = 10, numerator = "R",
          candidates = immig_wv_party@features,
          pre_trained = cr_glove_subset, verbose = FALSE)

# with stemming
nns_ratio(x = immig_wv_party, N = 10, numerator = "R",
          candidates = immig_wv_party@features,
          pre_trained = cr_glove_subset, stem = TRUE, verbose = FALSE)

Permute similarity and ratio computations

Description

Permute similarity and ratio computations

Usage

permute_contrast(
  target_embeddings1 = NULL,
  target_embeddings2 = NULL,
  pre_trained = NULL,
  candidates = NULL,
  norm = NULL
)

Arguments

target_embeddings1

ALC embeddings for group 1

target_embeddings2

ALC embeddings for group 2

pre_trained

a V x D matrix of numeric values - pretrained embeddings with V = size of vocabulary and D = embedding dimensions

candidates

character vector defining the candidates for nearest neighbors - e.g. output from get_local_vocab

norm

character = c("l2", "none") - set to 'l2' for cosine similarity and to 'none' for inner product (see ?sim2 in text2vec)

Value

a list with three elements: nns for group 1, nns for group 2, and nns_ratio with the ratios of similarities between the two groups


Permute OLS

Description

Estimate empirical p-value using permuted regression

Usage

permute_ols(Y = NULL, X = NULL)

Arguments

Y

vector of regression model's dependent variable (embedded context)

X

data.frame of model independent variables (covariates)

Value

list with two elements, betas = list of beta_coefficients (D dimensional vectors); normed_betas = tibble with the norm of the non-intercept coefficients
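
permute_ols() underpins the permutation tests above: shuffling the rows of X breaks any systematic association with Y, so re-estimating the norm of the coefficients over many shuffles yields a null distribution against which the observed norm can be compared. A conceptual sketch of this logic with toy data (not the package's exact implementation):

# conceptual sketch: empirical p-value for the norm of a coefficient vector
set.seed(42L)
Y <- matrix(rnorm(100 * 5), nrow = 100)          # toy embedded contexts (D = 5)
X <- data.frame(group = rbinom(100, 1, 0.5))     # toy binary covariate

# observed norm of the non-intercept coefficient vector
observed_norm <- sqrt(sum(coef(lm(Y ~ group, data = X))["group", ]^2))

# re-estimate the norm over permuted (shuffled) covariates
permuted_norms <- replicate(100, {
  X_perm <- data.frame(group = sample(X$group))  # shuffle rows of X
  sqrt(sum(coef(lm(Y ~ group, data = X_perm))["group", ]^2))
})

# empirical p-value: share of permuted norms at least as large as observed
mean(permuted_norms >= observed_norm)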


Plot output of get_nns_ratio()

Description

A way of visualizing the top nearest neighbors of a pair of ALC embeddings that captures how "discriminant" each feature is of each embedding (group).

Usage

plot_nns_ratio(x, alpha = 0.01, horizontal = TRUE)

Arguments

x

output of get_nns_ratio

alpha

(numeric) between 0 and 1. Significance threshold used to identify significant values; these are denoted by a * on the plot.

horizontal

(logical) defines the type of plot. If TRUE, results are plotted on 1 dimension. If FALSE, results are plotted on 2 dimensions, with the second dimension capturing the ranking of cosine similarity ratios.

Value

a ggplot-class object.

Examples

library(ggplot2)
library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigration", window = 6L)

# sample 100 instances of the target term, stratifying by party (only for example purposes)
set.seed(2022L)
immig_toks <- tokens_sample(immig_toks, size = 100, by = docvars(immig_toks, 'party'))

# we limit candidates to features in our corpus
feats <- featnames(dfm(immig_toks))

# compute ratio
set.seed(2022L)
immig_nns_ratio <- get_nns_ratio(x = immig_toks,
                                 N = 10,
                                 groups = docvars(immig_toks, 'party'),
                                 numerator = "R",
                                 candidates = feats,
                                 pre_trained = cr_glove_subset,
                                 transform = TRUE,
                                 transform_matrix = cr_transform,
                                 bootstrap = TRUE,
                                 # num_bootstraps should be at least 100
                                 num_bootstraps = 100,
                                 permute = FALSE,
                                 num_permutations = 10,
                                 verbose = FALSE)

plot_nns_ratio(x = immig_nns_ratio, alpha = 0.01, horizontal = TRUE)

Find most "prototypical" contexts.

Description

Contexts most similar on average to the full set of contexts.

Usage

prototypical_context(
  context,
  pre_trained,
  transform = TRUE,
  transform_matrix,
  N = 3,
  norm = "l2"
)

Arguments

context

(character) vector of texts - context variable in get_context output

pre_trained

(numeric) a F x D matrix corresponding to pretrained embeddings. F = number of features and D = embedding dimensions. rownames(pre_trained) = set of features for which there is a pre-trained embedding.

transform

(logical) - if TRUE (default) apply the a la carte transformation, if FALSE output untransformed averaged embedding.

transform_matrix

(numeric) a D x D 'a la carte' transformation matrix. D = dimensions of pretrained embeddings.

N

(numeric) number of most "prototypical" contexts to return.

norm

(character) - how to compute similarity (see ?text2vec::sim2):

"l2"

cosine similarity

"none"

inner product

Value

a data.frame with the following columns:

doc_id

(integer) document id.

typicality_score

(numeric) average similarity score to all other contexts

context

(character) contexts

Examples

# find contexts of immigration
context_immigration <- get_context(x = cr_sample_corpus, target = 'immigration',
                                   window = 6, valuetype = "fixed", case_insensitive = TRUE,
                                   hard_cut = FALSE, verbose = FALSE)

# identify top N prototypical contexts and compute typicality score
pt_context <- prototypical_context(context = context_immigration$context,
                                   pre_trained = cr_glove_subset,
                                   transform = TRUE,
                                   transform_matrix = cr_transform,
                                   N = 3, norm = 'l2')

Run jackknife debiased OLS

Description

Run jackknife debiased OLS

Usage

run_jack_ols(X, Y, confidence_level = 0.95)

Arguments

X

data.frame of model independent variables (covariates)

Y

vector of regression model's dependent variable (embedded context)

confidence_level

(numeric in (0,1)) confidence level e.g. 0.95

Value

list with two elements, betas = list of beta_coefficients (D dimensional vectors); normed_betas = tibble with the norm and CIs of the non-intercept coefficients
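
run_jack_ols() relies on the leave-one-out jackknife: the model is re-estimated once per observation with that observation dropped, and the spread of the leave-one-out estimates yields a debiased point estimate and standard error. A conceptual sketch of the jackknife for a generic statistic (not the package's exact implementation):

# conceptual sketch: jackknife bias correction for a generic statistic
jackknife <- function(data, statistic) {
  n <- length(data)
  theta_hat <- statistic(data)
  # leave-one-out estimates
  theta_loo <- vapply(seq_len(n),
                      function(i) statistic(data[-i]), numeric(1))
  # bias-corrected estimate and jackknife std. error
  list(estimate = n * theta_hat - (n - 1) * mean(theta_loo),
       std.error = sqrt((n - 1) / n * sum((theta_loo - mean(theta_loo))^2)))
}

jackknife(rnorm(50, mean = 2), statistic = mean)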


Run OLS

Description

Run OLS

Usage

run_ols(Y = NULL, X = NULL)

Arguments

Y

vector of regression model's dependent variable (embedded context)

X

data.frame of model independent variables (covariates)

Value

list with two elements, betas = list of beta_coefficients (D dimensional vectors); normed_betas = tibble with the norm of the non-intercept coefficients
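
The normed_betas element reported above is the Euclidean norm of each non-intercept coefficient vector across the D embedding dimensions. A minimal sketch of that computation with toy data (not the package's exact implementation):

# toy data: 50 embedded contexts in D = 5 dimensions, one binary covariate
set.seed(1L)
Y <- matrix(rnorm(50 * 5), nrow = 50)
X <- data.frame(group = rbinom(50, 1, 0.5))

# closed-form OLS: one D-dimensional coefficient vector per covariate
Xm <- cbind(`(Intercept)` = 1, group = X$group)
betas <- solve(crossprod(Xm), crossprod(Xm, Y))
sqrt(sum(betas["group", ]^2))   # norm of the non-intercept coefficient vector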


Get the tokens of contexts surrounding user-defined patterns

Description

This function uses quanteda's kwic() function to find the contexts around user-defined patterns (i.e. target words/phrases) and returns a tokens object with the tokenized contexts and corresponding document variables.

Usage

tokens_context(
  x,
  pattern,
  window = 6L,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  hard_cut = FALSE,
  rm_keyword = TRUE,
  verbose = TRUE
)

Arguments

x

a (quanteda) tokens-class object

pattern

a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

window

the number of context words to be displayed around the keyword

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

hard_cut

(logical) - if TRUE then a context must have window x 2 tokens, if FALSE it can have window x 2 or fewer (e.g. if a doc begins with a target word, then context will have window tokens rather than window x 2)

rm_keyword

(logical) if FALSE, keyword matching pattern is included in the tokenized contexts

verbose

(logical) if TRUE, report the total number of instances per pattern found

Value

a (quanteda) tokens-class object. Each document in the output tokens object inherits the document variables (docvars) of the document from whence it came, along with a column registering the corresponding pattern used. This information can be retrieved using docvars().

Examples

library(quanteda)

# tokenize corpus
toks <- tokens(cr_sample_corpus)

# build a tokenized corpus of contexts surrounding a target term
immig_toks <- tokens_context(x = toks, pattern = "immigr*", window = 6L)
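
As noted under Value, each context document inherits the docvars of its source document plus a column registering the matched pattern, which can be inspected directly:

# inherited document variables plus the matched pattern
head(docvars(immig_toks))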