Six of the endpoints in HRAF API 1 (3D PCA, Ethnoword PCA, Hypergraph, Wordsnetwork, Average of Similar Words, and Combination) are based on a model of ethnographic vocabulary trained using Word2vec. When trained on the text of a corpus such as eHRAF World Cultures, Word2vec can produce lists of similar terms. For example, the five most similar terms to magic in the eHRAF World Cultures corpus are magical, black magic, magical power, charm, and sorcery. This can be further modified by including the word bad. The five most similar terms to both magic and bad are evil, black magic, harm, dangerous, and harmful.
Word2vec is a means of creating and retrieving word embeddings, by representing words as numbers, thus making it possible for computers to identify patterns in large collections of text. Each word (or phrase, i.e., bigram, trigram, etc.) in a corpus corresponds to a unique vector: an ordered list of numbers. (More technically, word embeddings represent words numerically by embedding them in a space with many dimensions, placing the word at a given point in each dimension, thereby establishing a location for the word. Each location will have a "coordinate" expressed as a value for a given dimension. Words can be compared to other words based on their "coordinates." A given set of coordinates can be expressed as a vector by listing the position in each dimension in a standardized order: [x,y,z,...].) Each number in the list represents a value (coordinate) in a dimension, corresponding to one of many different aspects of a word's meaning based on how that word is used in the original text. The vector for magic looks like this:
[-4.546, -3.959, 0.618, 3.881, 2.855, 3.384, -0.756, 0.966, 1.250, 3.391, 0.677, 0.957, 3.770, -1.168, -4.550, -1.585, -2.048, -3.464, -2.135, -3.441, 3.890, 1.956, 0.154, -2.346, -1.714, -0.795, -4.980, 1.445, -4.600, -0.804, -0.134, -1.655, 3.292, 1.067, -3.677, -3.734, -0.605, -0.289, 1.814, 1.765, 1.729, -2.634, 0.584, -0.262, 6.619, 1.586, 7.053, 0.274, -1.700, 1.319, 3.728, 0.659, -0.335, -3.900, -4.376, 0.461, 0.346, -2.335, 3.349, 0.360, 0.760, -3.518, 0.967, 1.685, -1.471, 1.493, 3.065, -0.776, 1.162, -0.383, -3.836, 0.721, 0.416, 1.018, 1.229, 0.863, -1.504, -0.250, 2.079, 0.186, 1.406, 1.350, 0.474, 2.215, 0.262, -1.305, -1.801, 0.572, -2.977, 0.271, -0.458, 1.453, -2.017, -0.470, -2.785, -1.157, -2.955, -1.374, 0.339, 0.878]
This might not seem meaningful at first, but because these numerical representations of words allow us to measure how similar words are to each other, we can uncover relationships hidden in a large collection of texts. We didn't need to tell Word2vec that black magic is bad magic. The model "learned" that itself from the texts because of the way the two phrases are used in the corpus. While this may seem like a simple association, the underlying mathematical structure of word embeddings enables far more complex and nuanced relationships to be discovered within a corpus. The following sections will introduce key concepts helpful in making sense of word embeddings and how to use them in the API. Now that we have an idea of what Word2vec and word embeddings are, let's explore how Word2vec captures the relationships between words.
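If a comparable model were available locally, similarity queries like the ones described above could be reproduced with a library such as gensim. The following is a minimal sketch, assuming a hypothetical saved model file; in practice, the HRAF API endpoints perform these lookups for you.

```python
from gensim.models import Word2Vec

# Hypothetical path to a Word2vec model trained on an ethnographic corpus;
# the actual eHRAF model is accessed through the HRAF API endpoints.
model = Word2Vec.load("ehraf_word2vec.model")

# Five most similar terms to "magic"
print(model.wv.most_similar("magic", topn=5))

# Five most similar terms to the combination of "magic" and "bad"
print(model.wv.most_similar(positive=["magic", "bad"], topn=5))
```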
Words tend to appear in predictable contexts. For example, the word hunt might often appear near words like hunter, game, or fishing in a piece of text. When words appear near each other, they often have related meanings or associations. This idea, known as distributional similarity, refers to the principle that words occurring in similar linguistic contexts tend to have similar meanings or associations. Word2vec captures this information by placing words in a virtual "space" where similar words are positioned "close" to each other. This space typically has between 100 and 1,000 different "directions" or "dimensions" (each corresponding to a number in the long, ordered list), allowing Word2vec to record complex relationships between words based on how they are used in sentences. (More technically, Word2vec models this idea by representing words as vectors in a high-dimensional space, such that words appearing in similar linguistic contexts point in the same direction and have similar magnitudes on common dimensions.)
While this method of representation does an especially good job of capturing semantic relationships, it also includes syntactic information as well as other contextual associations. This is because Word2vec creates a representation of a word by using all of the contextual information from the environment in which a word appears. This brings us to the first powerful feature of word embeddings: the ability to calculate similarity between words. For example, asking the model to provide the most similar words to hunt results in the following ranked list of terms, along with their similarity to hunt (on a scale from -1 to 1, with 1 indicating that two words' vectors point in exactly the same direction):
| Term (hunt) | Similarity |
| --- | --- |
| hunting | 0.913 |
| hunting_expedition | 0.797 |
| go_hunte | 0.780 |
| hunter | 0.778 |
| hunted | 0.746 |
| fishing | 0.741 |
| deer_hunte | 0.741 |
| hunting_trip | 0.722 |
| big_game | 0.719 |
| fishing_expedition | 0.716 |
Clearly, the list extends beyond simple synonyms. While hunting, hunter, and hunted are just different forms of hunt and exemplify semantic similarity, the rest of the terms have a more complex relationship with our search term. Some of the terms are phrases containing a form of hunt, such as hunting expedition or deer hunte (a shortened form that covers both phrases like deer hunter and constructions such as deer is hunted). Still others seem only conceptually or contextually related to hunt, such as fishing, fishing expedition, or big game. This ability of word embeddings to reveal contextual associations beyond strict synonymy is a powerful tool that can be further utilized.
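The similarity scores in this table are cosine similarities between word vectors. As a minimal sketch, assuming the two vectors have already been retrieved (for example, `model.wv["hunt"]` and `model.wv["hunting"]` from a local gensim model), the calculation looks like this:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors (1.0 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example with two short illustrative vectors (real embeddings have 100 dimensions):
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]))  # close to 1: similar direction
```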
One of Word2vec's most powerful features is vector arithmetic, the ability to add or subtract representations of words from one another. While adding or subtracting "words" can be unintuitive at first, HRAF API 1 makes it easier than it sounds. Continuing with our previous example, the vector for the word hunt in our corpus looks like this:
[-0.508, 0.006, -1.802, 3.072, -1.478, 1.833, -3.093, -0.915, 2.129, 3.205, -0.825, 0.502, -0.140, -0.002, -1.220, 1.221, -0.341, -3.022, -3.053, -1.091, -1.867, 0.554, -1.301, -2.495, -1.277, -0.170, -1.094, -0.177, -4.950, 4.114, -0.689, -1.109, 5.211, -0.365, -0.732, -1.170, -2.215, -1.063, 0.485, 3.397, 6.177, -1.276, 2.687, -0.164, -2.452, -1.773, 1.213, -2.550, 2.339, -0.175, 3.012, -0.930, 1.561, 2.550, -2.452, 0.490, 2.774, 2.818, -4.225, 0.491, -0.071, -1.219, -3.282, 1.477, -1.029, -0.169, -1.209, -2.278, 3.383, 0.451, -4.767, -0.634, -2.900, -5.092, -0.961, 0.304, -3.204, 4.380, 1.233, -1.968, -1.709, -0.158, 2.391, 2.981, 1.602, -0.454, -0.954, 0.593, -1.597, 1.215, 1.199, 1.625, -1.638, 5.664, -4.427, -2.351, -1.833, -0.081, -0.102, 3.194]
Now that we have a list of numbers, it is relatively straightforward to combine this list with another. We only need to come up with a question we would like to answer. Let's say we are curious about gender-related divisions of labor as they relate to hunting. If we wanted to find out what roles women might take in hunting-related activities, we can try combining the vectors for hunt and woman (it can be helpful to choose lemmas – dictionary forms) and see what sort of overlap of contexts generally occurred in the corpus. This is the vector representation of woman:
[-5.968, 0.469, -1.951, 1.852, 1.325, -1.538, 0.881, 0.881, 1.078, 1.600, 1.072, 2.622, -1.025, 1.701, -1.869, 0.117, 4.442, -3.123, -5.041, 0.612, -0.856, 0.274, -0.501, 1.090, 0.069, -0.420, -3.112, -0.820, -6.101, 3.404, 4.208, -0.596, 5.861, -3.583, 0.440, -1.436, 1.486, 0.403, 1.146, 1.687, 5.458, -3.992, -2.134, -2.061, 3.004, -0.110, 0.321, -1.442, 1.086, 1.560, 1.967, -0.308, 1.808, 0.292, -1.645, -1.115, 0.202, -0.821, -0.525, 1.880, 2.138, 0.023, -4.982, -1.705, 0.047, -3.946, 0.466, -1.662, -0.646, -2.419, 2.286, 1.220, -4.349, -1.422, 4.543, 0.061, -4.233, -2.292, 0.366, -1.470, -3.287, -0.204, -0.110, -1.090, -0.306, -5.383, -1.225, -0.636, -4.884, 1.296, -2.736, -1.838, -4.327, 0.339, -0.343, 3.246, 3.322, 0.073, -0.671, 3.240]
If there were no API for us to use, we would be forced to perform the element-wise addition (-0.508 + (-5.968), 0.006 + 0.469, ...) ourselves. Instead, by simply searching for both terms at once, the calculation is done for us.
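As a minimal sketch of that arithmetic (the vectors are truncated to their first five coordinates for illustration, and the commented gensim call is hypothetical), the addition looks like this:

```python
import numpy as np

# First five coordinates of the vectors for hunt and woman shown above.
vec_hunt = np.array([-0.508, 0.006, -1.802, 3.072, -1.478])
vec_woman = np.array([-5.968, 0.469, -1.951, 1.852, 1.325])

combined = vec_hunt + vec_woman   # element-wise addition
print(combined)                   # [-6.476  0.475 -3.753  4.924 -0.153]

# With a local gensim model, the full query could be written as:
# model.wv.most_similar(positive=["hunt", "woman"], topn=10)
```

Here are the most similar terms in the corpus to the combined contexts of hunt and woman: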
| Term (hunt + woman) | Similarity |
| --- | --- |
| man | 0.830 |
| hunting | 0.746 |
| their_wive | 0.729 |
| hunter | 0.726 |
| young_man | 0.706 |
| their_husband | 0.699 |
| their_wives | 0.674 |
| womenfolk | 0.673 |
| wife | 0.670 |
| older_man | 0.664 |
What you might conclude from this ranked list of similar terms is that there appears to have been little combining of contexts. Two of the most similar terms from the previous list are present (hunting and hunter), but otherwise the terms seem mostly to do with identity rather than anything like what we were interested in. Why might that be? There are two obvious hypotheses to propose: the first is that the result conveys what someone might conclude after reading the ethnography in the corpus—the dominant linguistic context in which hunt and woman co-occur involves discussion of social relationships; the second hypothesis is that, due to an unequal number of occurrences in the corpus, the true intersection of hunt and woman is obscured by the more frequent term, woman.
Upon reflection, you might wonder if both of these hypotheses could be true at once. Let's first compare the most similar terms to woman with the most similar terms to our combined query, hunt + woman:
| Term (hunt + woman) | Similarity | Term (woman) | Similarity |
| --- | --- | --- | --- |
| man | 0.830 | man | 0.898 |
| hunting | 0.746 | young_woman | 0.830 |
| their_wive | 0.729 | their_husband | 0.828 |
| hunter | 0.726 | married_woman | 0.815 |
| young_man | 0.706 | girl | 0.808 |
| their_husband | 0.699 | married_women | 0.804 |
| their_wives | 0.674 | wife | 0.803 |
| womenfolk | 0.673 | older_women | 0.802 |
| wife | 0.670 | child | 0.802 |
| older_man | 0.664 | their_wive | 0.800 |
The most noticeable effect of combining the terms might have been the removal of most of the direct references to hunt, but the relationships in the result sets are different as well. In addition to a general re-ordering of the same or analogous terminology, there appears to have been both a shift towards men (older_women → older_man, young_woman → young_man) and a shift away from childhood (both girl and child have dropped out of the list entirely).
eHRAF World Cultures does not provide a direct way of looking at word frequencies; however, we can estimate a term's frequency from the number of paragraph results returned by a keyword search. While a keyword search for hunt returns only about 20,000 paragraphs, a keyword search for woman returns more than 120,000. Since woman appears far more frequently than hunt, the focus on social relationships in our list of similar terms now makes much more sense. As you will recall, a word's vector representation in a Word2vec model draws on all of the contexts in which that word appears, so the more often a word occurs in a corpus, the more likely its representation is to become generalized. The combined context was therefore pulled toward the more generalized representation of woman (common gendered terms), rather than toward the specific domain of hunting-related, female-dominated activities. This might seem like an intractable problem, but with some careful observation and subtle adjustment we can still steer our exploration in the direction we are interested in.
By simply adding together the terms hunt and woman, we moved too much in the direction of woman, perhaps largely due to the relatively high frequency of woman in the corpus (word embeddings are typically influenced by word frequency). If we wanted to somehow counteract this frequency while maintaining the direction of movement within the vector space, what might we do? Looking back at our list of similar terms, we already have a clue. The term within the corpus that is the most similar to woman is man. What might happen if we added hunt and woman, and also subtracted man?
| Term (hunt + woman) | Similarity | Term (hunt + woman - man) | Similarity |
| --- | --- | --- | --- |
| man | 0.830 | hunting | 0.877 |
| hunting | 0.746 | fishing | 0.724 |
| their_wive | 0.729 | hunting_expedition | 0.717 |
| hunter | 0.726 | hunted | 0.714 |
| young_man | 0.706 | small_game | 0.710 |
| their_husband | 0.699 | hunting_fishing | 0.705 |
| their_wives | 0.674 | deer_hunte | 0.704 |
| womenfolk | 0.673 | food_gathering | 0.704 |
| wife | 0.670 | go_hunte | 0.694 |
| older_man | 0.664 | berry_picking | 0.668 |
The difference made by subtracting man—the most similar term in the corpus to woman—is striking. The set of similar terms has only one term in common with the query hunt + woman, but looks remarkably similar to our original query for the single term hunt. Let's compare the sets of most similar terms for hunt and hunt + woman - man:
| Term (hunt) | Similarity | Term (hunt + woman - man) | Similarity |
| --- | --- | --- | --- |
| hunting | 0.913 | hunting | 0.877 |
| hunting_expedition | 0.797 | fishing | 0.724 |
| go_hunte | 0.780 | hunting_expedition | 0.717 |
| hunter | 0.778 | hunted | 0.714 |
| hunted | 0.746 | small_game | 0.710 |
| fishing | 0.741 | hunting_fishing | 0.705 |
| deer_hunte | 0.741 | deer_hunte | 0.704 |
| hunting_trip | 0.722 | food_gathering | 0.704 |
| big_game | 0.719 | go_hunte | 0.694 |
| fishing_expedition | 0.716 | berry_picking | 0.668 |
There are also interesting shifts of similar terms in the list, perhaps signaling progress towards answering our original question about the roles women take in hunting-related activities: big game has been replaced by small game, both food gathering and berry picking are now present, and the number of mentions of trip or expedition has gone from three to one. Why does the query appear to have moved back in the direction of hunt, and why has its composition shifted towards a more domain-specific representation of woman? To answer these questions, you might first consider what is left over when subtracting man from woman, and vice versa:
| Term (woman - man) | Similarity | Term (man - woman) | Similarity |
| --- | --- | --- | --- |
| postpartum_women | 0.526 | kokwal | 0.503 |
| child_reare | 0.515 | bravest_man | 0.496 |
| marital_pattern | 0.508 | captain | 0.489 |
| modern_contraception | 0.505 | avenger | 0.488 |
| reproductive_health | 0.499 | his_follower | 0.488 |
| sex_role | 0.498 | mighty | 0.488 |
| gender_difference | 0.498 | bravest_warrior | 0.485 |
| maternity | 0.497 | horseman | 0.483 |
| maternal_health | 0.496 | kill_him | 0.482 |
| lactating_mother | 0.495 | duellist | 0.480 |
These equations work something like set difference, stripping out all the commonality between the two groupings of people. While what is left has the danger of amplifying biases in the corpus, with careful use it can provide an effective way of navigating the vector space. Word embeddings have a well-demonstrated ability to support analogical reasoning through vector arithmetic: words with a clear relationship can be used like an "axis" along which contextual associations can be explored. Just like the familiar analogies from standardized testing (A is to B, as X is to...), adding and subtracting analogously related terms (synonym/antonym, type/kind, component/part, etc.) allows one to focus on a particular vector subspace. While the discovery of the "gender" axis brought us closer to finding the gender-related divisions of labor we might expect in hunting activities, there is still another step we can take towards refining our query: weighting words.
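To make the idea of an axis concrete, one can take the difference between a word pair's vectors and measure how strongly other terms align with it. Here is a minimal sketch, assuming vectors retrieved from a model; the function and variable names are illustrative, not part of the API:

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1 so that only its direction matters."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def axis_score(candidate, positive, negative):
    """Cosine of a candidate vector with the (positive - negative) axis.
    Scores above 0 lean toward `positive`, below 0 toward `negative`."""
    axis = unit(np.asarray(positive, dtype=float) - np.asarray(negative, dtype=float))
    return float(np.dot(unit(candidate), axis))

# Hypothetical usage with a loaded gensim model:
# for term in ["berry_picking", "big_game", "child_care"]:
#     print(term, axis_score(model.wv[term], model.wv["woman"], model.wv["man"]))
```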
"Weighting," or multiplying a vector representation of a word by a single number, can provide us with additional precision and control when exploring the vector space, preserving the focus (direction) of a query, but allowing us to increase the relative importance (magnitude) of different words. Thus far, we have been searching using neutral weights (+1 and -1). Increasing or decreasing the weight of a term can compensate for a term's relative frequency or infrequency within the corpus, similar to balancing the addition of woman by subtracting man. Let's see what happens when we increase the weight of both woman and man.
| Term (hunt *1 + woman *1 - man *1) | Similarity | Term (hunt *1 + woman *2 - man *2) | Similarity |
| --- | --- | --- | --- |
| hunting | 0.877 | hunting | 0.904 |
| fishing | 0.724 | gathering | 0.801 |
| hunting_expedition | 0.717 | foraging | 0.793 |
| hunted | 0.714 | food_gathering | 0.782 |
| small_game | 0.710 | herbal_medicine | 0.769 |
| hunting_fishing | 0.705 | wild_food | 0.758 |
| deer_hunte | 0.704 | plant_gathering | 0.751 |
| food_gathering | 0.704 | nut_harvesting | 0.742 |
| go_hunte | 0.694 | berry_picking | 0.736 |
| berry_picking | 0.668 | wild_herbs | 0.730 |
Clearly, increasing the weight on the word pair that forms our axis has further moved the vector in the desired direction. While hunting remains the most similar word to our query, the new list of terms is now almost entirely related to gathering.
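In code, weighting is simply scalar multiplication applied before the vectors are summed. The following is a minimal sketch with illustrative names and a hypothetical gensim call; the API performs the equivalent computation when weights are supplied with a query:

```python
import numpy as np

def weighted_query(terms_and_weights, lookup):
    """Build a query vector as the sum of weight * vector for each term."""
    return sum(w * np.asarray(lookup(t), dtype=float) for t, w in terms_and_weights)

# Hypothetical usage with a loaded gensim model:
# query = weighted_query([("hunt", 1.0), ("woman", 2.0), ("man", -2.0)],
#                        lookup=lambda t: model.wv[t])
# model.wv.similar_by_vector(query, topn=10)
```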
In addition to exploring different combinations of equally weighted axis word pairs (+/- 2.75, +/- 1.25, etc.), there is another facet we have yet to consider: the frequency of the word man in the corpus. The word man has more than 200,000 paragraph results within eHRAF World Cultures, compared to roughly 120,000 for woman. Since man appears considerably more often than woman, we might compensate by increasing or decreasing its weighting. Consider the following results:
| Term (hunt * 1 + woman * 2.75 - man * 2.75) | Similarity | Term (hunt * 1 + woman * 2.75 - man * 3) | Similarity |
| --- | --- | --- | --- |
| food_preparation | 0.601 | food_preparation | 0.574 |
| domestic_work | 0.592 | horticultural_activitie | 0.554 |
| child_care | 0.586 | child_care | 0.550 |
| horticultural_activitie | 0.585 | berry_picking | 0.542 |
| berry_picking | 0.579 | foraging | 0.539 |
| hunting | 0.570 | food_gathering | 0.536 |
| housework | 0.564 | child_reare | 0.531 |
| foraging | 0.559 | domestic_work | 0.531 |
| food_gathering | 0.558 | diet | 0.531 |
| child_reare | 0.557 | subsistence_activitie | 0.527 |
From the columns on the left, where the weighting on the axis pair is raised to 2.75, we can see that hunting has dropped into the middle of the list. By subtracting slightly more of the vector for man from the query, as in the right columns, hunting disappears from the list of most similar terms.
Although looking at tables of terms and their similarities might be useful, a picture can be much easier to understand. But how can that possibly be achieved? A two-dimensional drawing or a three-dimensional object is easy to imagine, but what does something with 100 dimensions look like? There are methods for visualizing high-dimensional data, but they become increasingly difficult to comprehend as the number of dimensions increases. For example, here is a heatmap that compares the vectors for woman and man, with the dimensions with the greatest difference between them ranked (red numbers):
At the risk of stating the obvious, the heatmap comparing man and woman is somewhat overwhelming. Even without recalling the tables from the previous section, we can make the following observation: the vectors appear quite similar. Additionally, while the ranked differences in dimensions do help, they don't provide any insight into which of these dimensions might be responsible for the differences we saw in the tables when subtracting man from woman and vice versa.
While looking at the absolute difference between vectors is sufficient with just two, comparing more than two vectors requires a more sophisticated method: variance, or how spread out the data is compared to its average value. Here is another heatmap that depicts the vectors of the top ten most similar terms to the query hunt + woman - man. The dimensions with the greatest variance in this image are ranked with red numbers:
With this method of visualization, additional information again makes the data more difficult to compare. For all the information this image can provide us, recall that Word2vec "learns" its representations of words by looking at the usage environment. As there is no strict one-to-one relationship between dimensions in a vector and types of linguistic or contextual information, at least some of these dimensions are related to each other. In other words, what we want to investigate is the variance within related groups of dimensions (covariance). To achieve this, we now turn to dimensionality reduction and principal component analysis (PCA).
Dimensionality reduction is a technique used to simplify a complex space while attempting to preserve its structure. It transforms data in a manner similar to a measure we have already used: the single numerical value "similarity" in the tables above collapses the relationship between two vectors into one number (cosine similarity: if a value is close to 1 the vectors point in the same direction, if it is close to -1 they point in opposite directions, and if it is close to 0 they are unrelated). A key difference between our "similarity" measure and dimensionality reduction is that "similarity" only takes into account focus (direction) and ignores importance (magnitude). Dimensionality reduction attempts to take both focus and importance into account, but must simplify the space in order to do it. What we really want is to reduce the dimensionality of our data so that we can more easily visualize it, while also identifying the most significant patterns of variation across it. One method for achieving this is PCA.
PCA is a dimensionality reduction technique that projects data into a lower dimensional space by identifying the directions (principal components) that capture the most variance. PCA first measures variation across individual dimensions, as dimensions that show more variability have a greater likelihood of being more meaningful. For example, a dimension capturing subsistence-related terms might show high variability across words like hunt, gather, fish, and farm. Second, PCA examines how pairs of dimensions vary together. For example, if some of the dimensions of the vectors for woman and gather tend to vary together, while some of the dimensions of the vectors for man and hunt tend to vary together, PCA would identify this as a potential pattern of variation. Finally, after examining all possible pairs, PCA determines the most significant (principal) patterns of variation and projects these onto a simpler space that we can easily visualize. The patterns it selects are not only independent of each other (orthogonal), but also ranked based on how much of the variance within the data they are able to explain. This allows us to maintain clarity both when visualizing word embeddings as well as interpreting them. Now that we understand how PCA provides a clearer means of exploring the embedding space created by Word2vec, let's continue with our example of hunt + woman - man.
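As a minimal sketch of this step, here is how a set of word vectors might be reduced to three principal components with scikit-learn. The vectors below are random stand-ins for the real 100-dimensional embeddings; in practice the 3D PCA endpoint performs this computation:

```python
import numpy as np
from sklearn.decomposition import PCA

terms = ["hunt", "hunting", "hunting_expedition", "go_hunte", "hunter", "fishing"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(terms), 100))   # stand-ins for model.wv[term]

pca = PCA(n_components=3)
coords = pca.fit_transform(vectors)            # one (PCA1, PCA2, PCA3) triple per term

for term, (x, y, z) in zip(terms, coords):
    print(f"{term:20s} {x:7.3f} {y:7.3f} {z:7.3f}")

# Fraction of the total variance explained by each component, in rank order:
print(pca.explained_variance_ratio_)
```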
These are the results of the first stage of our query hunt + woman - man: the 15 most similar terms to hunt as plotted in a 3D space based on their PCA values. The query term hunt, colored light blue, appears in the center of the space, surrounded by the terms most similar to it, colored dark blue.
[Note: hunt has been artificially included in the PCA (it is not included in its own list of similar terms). Some of HRAF API 1's Word2vec-based endpoints (3D PCA, Ethnoword PCA, and Hypergraph) allow query terms to be artificially included in analyses. This feature can be useful for understanding how the most similar terms relate to query terms. However, including terms like this does affect the PCA, so this feature should be used advisedly.]
The order of the terms most similar to hunt in the legend preserves the ordering we saw in the table of most similar terms above. What we might notice after some examination is that the closest term to hunt within the 3D graph is not hunting but rather deer_hunte. This seems confusing at first, but recall that the terms are plotted in this space using their PCA values, while the similarity score above is cosine similarity. Again, cosine similarity does not consider magnitude, but only direction within the vector space. This means that two words with highly similar meanings can have high cosine similarity even if one has a much stronger representation in certain dimensions than the other. PCA, on the other hand, reveals patterns of variance in the data, helping us see which features contribute the most to distinguishing words in the space. While cosine similarity remains valuable for comparing words in terms of their immediate semantic proximity, PCA helps uncover the deeper structure of word relationships, revealing how different groups of words vary together in ways that may not be immediately obvious. Let's consider some of that structure now.
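A tiny numerical example may help illustrate why cosine similarity and PCA coordinates can disagree: two vectors that point in the same direction have a cosine similarity of 1 regardless of their magnitudes, yet they occupy very different positions in the space. This sketch uses made-up three-dimensional vectors purely for illustration:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = 5.0 * v                                    # same direction, five times the magnitude

cosine = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(cosine)                                  # 1.0: identical by cosine similarity
print(np.linalg.norm(v - w))                   # ~14.97: far apart once magnitude matters
```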
Looking at the first principal component, we can see that the terms with the highest values for PCA1 are (descending order) fishing_expedition, hunting_expedition, hunting_trip, hunting_partie, and go_hunte. The terms with the lowest values for PCA1 are (ascending order) wild_pig, hunted, big_game, small_game, and hunter. What does this tell us about the results of our query? Recall that principal components are ranked according to the variance they explain within the data: principal component one explains more than principal component two, principal component two explains more than principal component three, etc. Therefore, this first principal component is the most important one for understanding our query. Additionally, the terms with the highest (positive) and lowest (negative) scores align the most and least with each principal component (pattern of variance). This means that we should more carefully consider words with the highest (and lowest) PCA scores, or combinations of them, while it is safe to ignore terms with scores close to zero. Finally, while PCA is able to establish which terms are most (or least) closely aligned with the patterns found in the data, the interpretation is left to us. Principal components are mathematical constructs without inherent meaning, and their interpretation depends on domain knowledge and context. There can be multiple, valid conclusions drawn from the same analysis. So what do our lists of terms have in common? One possible interpretation of the first principal component for hunt is that it is most aligned with going on trips and expeditions (fishing_expedition, hunting_expedition, hunting_trip, and go_hunte), especially in groups (hunting_partie), while it is least aligned with animals that are hunted (wild_pig, hunted, big_game, small_game), or hunters themselves (hunter).
The second principal component for hunt is most and least aligned with a different subset of the most similar terms. The terms with the highest values for PCA2 are hunt_deer, go_hunte, wild_pig, and hunting_deer, while the terms with the lowest values are fishing, hunting, big_game, small_game, hunting_partie, and fishing_expedition. Again, judgment is required when deciding which terms contribute significantly to a principal component. As hunt_deer and go_hunte have relatively higher PCA2 values than wild_pig, and wild_pig and hunting_deer have relatively similar values, we could consider ignoring the term wild_pig. This would allow for a clearer possible interpretation of PCA2 as being associated with actions, or specifically the action of hunting. We could also consider doing something similar for the terms with the lowest PCA2 values, and only use the terms with the two lowest values, fishing and hunting. This would provide for a clear contrast between the action of hunting and general descriptions of hunting.
With the third principal component, there is again a shift in the subset of most-aligned terms. The following terms are those with the highest values: fishing, hunt_deer, hunting, deer_hunte, and hunting_deer. These are the terms with the lowest values: hunting_partie, big_game, hunting_expedition, hunting_trip, and hunter. A simple interpretation is that the third principal component seems to be a mirror image of the first principal component. While that is plausible, there are some subtle differences. First, while both the positive PCA3 terms and the negative PCA1 terms contain references to prey animals (fish and deer vs. pig and game), terms positively aligned with PCA3 can be interpreted as more specific than general. Additionally, while the negative PCA3 terms and the positive PCA1 terms both relate to group hunting, negative PCA3 terms seem to focus more on the organization and objective of that group hunting, rather than the travel aspect, especially considering the changed ranking of hunting_partie.
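If one wanted to reproduce this kind of reading programmatically, the terms can simply be sorted by their scores on each principal component. Here is a minimal sketch, again using random stand-in vectors in place of the real embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

terms = ["fishing_expedition", "hunting_expedition", "hunting_trip", "hunting_partie",
         "go_hunte", "wild_pig", "hunted", "big_game", "small_game", "hunter"]
rng = np.random.default_rng(1)
vectors = rng.normal(size=(len(terms), 100))   # stand-ins for the real word vectors

coords = PCA(n_components=3).fit_transform(vectors)

for i in range(3):
    order = np.argsort(coords[:, i])
    lowest = [terms[j] for j in order[:3]]
    highest = [terms[j] for j in order[-3:][::-1]]
    print(f"PCA{i + 1} highest: {highest}  lowest: {lowest}")
```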
Now that you have some sense of how principal components may be interpreted, for the next few iterations of the hunt + woman - man query we will focus on how plotting the results of PCA can help us understand the word embedding space. Consider the following PCA visualization for the query hunt + woman:
Terms that originally appeared in the list of similar terms for hunt have retained their dark blue color, while new terms have been colored maroon. As in the previous visualization, hunt is colored light blue, while woman is colored orange. While the ranked list of terms that are most similar to hunt + woman (the legend) alternates between new additions and those originally found in the query for hunt, the plotting of the PCA scores reveals something interesting. Rather than the list of similar terms being grouped around a single point, there appear to be two centers: a larger cluster around woman and a smaller one around hunt. What might this suggest? While we saw in the previous section that the list of similar terms to the query hunt + woman is different from the list of terms most similar to just woman, the clustering of all of the new terms relatively closer to woman appears to support our hypothesis that the addition of woman has overpowered hunt. What effect might the subtraction of man have on the positions of the terms within the PCA plot?
The following graph plots the PCA scores for the unweighted query hunt *1 + woman *1 - man *1 . Terms that originally appeared in the list of similar terms to hunt have retained their dark blue color. New terms have been colored turquoise, while man has been colored light green.
There are two striking features of this new graph. First, after subtracting man, none of the new terms from the hunt + woman query (colored maroon) remain in the list of most similar terms. Second, the proximity of the terms has dramatically shifted back towards hunt. Rather than the most similar terms being clustered around two centers, all of the most similar terms are now centered around hunt. Furthermore, many of the new terms appear to have shifted in the direction of woman and away from man. Did the subtraction of man properly balance our query and answer our question? We will approach that question by adjusting the weights of our query terms.
The final PCA graph we will consider is for the weighted query hunt *1 + woman *2.75 - man *3 . The graph follows, with new terms colored purple.
We can see the same two interesting features in this graph as the previous one. First, there is a dramatic shift in the set of most similar terms (all the terms that originally appeared in the query for hunt [colored dark blue] have disappeared). Second, there is a significant movement in the clustering of the terms. While potentially still centered around hunt, the most similar terms to our query have moved into the space in between the two centers of hunt and woman - man . While the unweighted query seemed to contrast concepts that might be traditionally male- or female-related, the weighted query seems to contrast concepts in different spheres of women's lives. While this visual shift would seem to be the sort of result we are interested in, how has that affected the results of our PCA?
Comparing the first principal component of both the weighted and unweighted query hunt + woman - man provides additional insight into the relationship between these terms. For the unweighted query hunt *1 + woman *1 - man *1, the terms most associated with the first principal component are go_hunte, berry_picking, and hunting_trip, while the terms least associated with the first principal component are small_game, hunting_fishing, large_game, and big_game. Concerning the weighted query hunt *1 + woman *2.75 - man *3, the terms most associated with the first principal component are child_care, child_reare, postpartum_women, domestic_work, and egg_collecting. The terms least associated with the weighted query's first principal component are hunting_fishing, food_gathering, forage, diet, collecting_wild, foraging, and subsistence_activities. How might we interpret the differences in these first principal components? After a cursory inspection of the results of the unweighted query, intuition told us that the terms are divided between male- and female-dominated concepts related to hunting. However, the first feature that PCA has identified seems to be more related to group activity or acquisition (go_hunte, berry_picking, and hunting_trip) vs. the objective of hunting (small_game, large_game, and big_game) rather than anything like a gender-based division of labor. In comparison, the first component for the weighted query appears to sustain our hypothesis, differentiating child care and home-centered labor (child_care, child_reare, postpartum_women, domestic_work, egg_collecting) from foraging (hunting_fishing, food_gathering, forage, diet, collecting_wild, foraging, subsistence_activities).
Whether or not this is a good result depends on the question or questions we are interested in answering. Both the weighted and unweighted queries of hunt + woman - man provide interesting insights into which hunting-related activities tend to be female-dominated. The most important idea to keep in mind is how to use the different endpoints to interpret the results.
Latent Dirichlet Allocation (LDA) helps identify topics within a corpus by grouping related words together.