Source: MachineLearningMastery.com
Context vectors are a powerful tool for advanced NLP tasks. They allow you to capture the contextual meaning of words, such as identifying the correct sense of a word in a sentence when it has multiple meanings. In this post, we will explore some example applications of context vectors. Specifically:
- You will learn how to extract contextual keywords from a document
- You will learn how to generate a summary of a document using context vectors
Let’s get started.
Applications with Context Vectors
Photo by Erik Karits. Some rights reserved.
Overview
This post is divided into two parts; they are:
- Contextual Keyword Extraction
- Contextual Text Summarization
Contextual Keyword Extraction
Contextual keyword extraction is a technique for identifying the most important words in a document based on their contextual relevance. Imagine that you have a document and want to highlight the most representative words. One way to do this is by finding the words that are most semantically similar to the document. This technique is useful for a wide range of NLP tasks, such as information retrieval, document clustering, and text summarization.
Let’s implement a simple contextual keyword extraction system by comparing each word in the document to the document as a whole:
```python
import numpy as np
import torch
from transformers import BertTokenizer, BertModel


def get_context_vectors(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=True)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Get the tokens (for reference)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

    # Forward pass, get all hidden states from each layer
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, output_hidden_states=True)
    hidden_states = outputs.hidden_states

    # Each element in hidden_states has shape (batch_size, sequence_length, hidden_size)
    # Take the first element in the batch from the last layer
    last_layer_vectors = hidden_states[-1][0].numpy()  # Shape: (sequence_length, hidden_size)

    return tokens, last_layer_vectors


def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


def extract_contextual_keywords(document, model, tokenizer, top_n=5):
    """Extract contextual keywords from a document"""
    # Split the document into sentences (simple split by period)
    sentences = [s.strip() for s in document.split(".") if s.strip()]

    # Process each sentence to get context vectors
    all_tokens = []
    all_vectors = []
    for sentence in sentences:
        if not sentence:
            continue  # Skip empty sentences

        # Get context vectors
        tokens, vectors = get_context_vectors(sentence, model, tokenizer)

        # Store tokens and vectors (excluding special tokens [CLS] and [SEP])
        all_tokens.extend(tokens[1:-1])
        all_vectors.extend(vectors[1:-1])

    # Convert to numpy arrays, then calculate the document vector as average of all token vectors
    all_vectors = np.array(all_vectors)
    doc_vector = np.mean(all_vectors, axis=0)

    # Calculate similarity between each token vector and the document vector
    similarities = []
    for token, vec in zip(all_tokens, all_vectors):
        # Skip special tokens, punctuation, and common words
        if token in ["[CLS]", "[SEP]", ".", ",", "!", "?", "the", "a", "an", "is", "are", "was", "were"]:
            continue
        # Compute similarity, then remember it with the token
        sim = cosine_similarity(vec, doc_vector)
        similarities.append((sim, token))

    # Sort the similarities and get the top N
    top_similarities = sorted(similarities, reverse=True)[:top_n]
    return top_similarities


# Example document
document = """
Artificial intelligence is transforming industries around the world.
Machine learning algorithms can analyze vast amounts of data to identify patterns and make predictions.
Natural language processing enables computers to understand and generate human language.
Computer vision systems can recognize objects and interpret visual information.
These technologies are driving innovation in healthcare, finance, transportation, and many other sectors.
"""

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Extract contextual keywords and print the result
top_keywords = extract_contextual_keywords(document, model, tokenizer, top_n=10)
print("Top contextual keywords:")
for similarity, token in top_keywords:
    print(f"{token}: {similarity:.4f}")
```
In this example, the BERT model is used to generate context vectors for each word in the document. The document vector is computed as the average of all token vectors. Alternatively, you could obtain the document vector from the embedding of the [CLS] token after feeding the entire document into the model. However, that approach is not used here because the input document may be too long for the model to process at once (BERT accepts at most 512 tokens). Instead, the document is split into sentences, and each sentence is processed separately.
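For a short document that fits within that limit, the [CLS]-based alternative might look like the following minimal sketch. The helper name get_document_vector_cls() is made up for illustration, truncation is enabled as a safeguard, and the model and tokenizer are the ones loaded in the listing above.

```python
# A minimal sketch (not used in the listing above) of a [CLS]-based
# document vector. Truncation guards against BERT's 512-token input limit.
import torch

def get_document_vector_cls(document, model, tokenizer):
    inputs = tokenizer(document, return_tensors="pt", add_special_tokens=True,
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # The [CLS] embedding sits at position 0 of the last hidden state
    return outputs.last_hidden_state[0, 0].numpy()
```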
With a vector for each word and a vector for the whole document, you compute the cosine similarity between each word and the document. The function extract_contextual_keywords() returns the top N words with the highest similarity scores, and these results are then printed.
Cosine similarity measures how close two vectors are to each other. In this case, if a word vector is close to the document vector, it is assumed to be a good representative of the document. This works because the word vectors are context-aware, as generated by the transformer model. Unlike traditional keyword extraction methods that rely on frequency (such as TF-IDF) or predefined rules (such as RAKE), this approach leverages the semantic understanding captured by the transformer model.
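To build a little intuition for the similarity measure itself, here is a tiny standalone check that uses the same cosine_similarity() function as the listing above. The vectors a, b, and c are toy values, not model outputs.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

a = np.array([1.0, 0.0])   # toy "word" vector
b = np.array([0.9, 0.1])   # points in nearly the same direction as a
c = np.array([0.0, 1.0])   # orthogonal to a

print(cosine_similarity(a, b))  # roughly 0.99: very similar
print(cosine_similarity(a, c))  # 0.0: no similarity
```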
When you run the keyword extraction code above, you will get:
```
Top contextual keywords:
to: 0.7961
can: 0.7909
can: 0.7804
of: 0.7551
human: 0.7365
analyze: 0.7354
enables: 0.7345
computers: 0.7310
in: 0.7282
systems: 0.7153
```
To improve the result, you may consider implementing stop word removal so that common words such as “to”, “of”, and “in” are excluded from the output, as sketched below.
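One way to do this, assuming NLTK is available (it is not used in the original listing), is to filter tokens against a stop word list before computing similarities. The helper name is_keyword_candidate() is hypothetical; the check could replace the small hard-coded token list inside extract_contextual_keywords().

```python
# A sketch of stop word filtering, assuming NLTK is installed (pip install nltk)
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def is_keyword_candidate(token):
    # Keep alphabetic, non-stop-word tokens and skip WordPiece
    # continuation pieces such as "##ing"
    return token.isalpha() and token not in stop_words and not token.startswith("##")
```

With a filter like this applied before the top-N selection, function words such as “to”, “of”, and “in” would no longer dominate the keyword list.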
Contextual Text Summarization
Summarizing a document can be done in different ways. One of the most common approaches is to select the most representative sentences from the document—a method known as extractive summarization.
One way to perform extractive summarization is by generating a vector for each sentence and a vector for the entire document. The sentences most similar to the document are then selected. With context vectors, it is straightforward to implement this approach. Let’s do this:
```python
import numpy as np
import torch
from transformers import BertTokenizer, BertModel


def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


def get_sentence_embedding(sentence, model, tokenizer):
    """Sentence embedding extracted from the [CLS] prefix token"""
    # Tokenize the input
    inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=True,
                       truncation=True, max_length=512)

    # Forward pass, get hidden states
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the [CLS] token embedding at position 0 from the last layer
    cls_embedding = outputs.last_hidden_state[0, 0].numpy()
    return cls_embedding


def extractive_summarize(document, model, tokenizer, num_sentences=3):
    # Split the document into sentences
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    if len(sentences) <= num_sentences:
        return document

    # Get embeddings for all sentences
    sentence_embeddings = []
    for sentence in sentences:
        embedding = get_sentence_embedding(sentence, model, tokenizer)
        sentence_embeddings.append(embedding)

    # Calculate the document embedding (average of all sentence embeddings)
    # then find the most similar sentences
    document_embedding = np.mean(sentence_embeddings, axis=0)
    similarities = []
    for idx, embedding in enumerate(sentence_embeddings):
        sim = cosine_similarity(embedding, document_embedding)
        similarities.append((sim, idx))
    top_sentences = sorted(similarities, reverse=True)[:num_sentences]

    # Extract the sentences, preserving the original order
    top_indices = sorted([x[1] for x in top_sentences])
    summary_sentences = [sentences[i] for i in top_indices]

    # Join the sentences to form the summary
    summary = ". ".join(summary_sentences) + "."
    return summary


# Example document
document = """
Transformer models have revolutionized natural language processing by introducing
mechanisms that can effectively capture contextual relationships in text. One of the
most powerful aspects of transformers is their ability to generate context-aware vector
representations, often referred to as context vectors. Unlike traditional word embeddings
that assign a fixed vector to each word regardless of context, transformer models generate
dynamic representations that depend on the surrounding words. This allows them to capture
the nuanced meanings of words in different contexts. For example, in the sentences
"I'm going to the bank to deposit money" and "I'm going to sit by the river bank,"
the word "bank" has different meanings. A traditional word embedding would assign the
same vector to "bank" in both sentences, but a transformer model generates different
context vectors that capture the distinct meanings based on the surrounding words.
This contextual understanding enables transformers to excel at a wide range of NLP tasks,
from question answering and sentiment analysis to machine translation and text summarization.
"""

# Generate a summary
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
summary = extractive_summarize(document, model, tokenizer, num_sentences=3)

# Print the original document and the summary
print("Original Document:")
print(document)
print("Summary:")
print(summary)
```
If you run this code, you will get:
```
Original Document:

Transformer models have revolutionized natural language processing by introducing
mechanisms that can effectively capture contextual relationships in text. One of the
most powerful aspects of transformers is their ability to generate context-aware vector
representations, often referred to as context vectors. Unlike traditional word embeddings
that assign a fixed vector to each word regardless of context, transformer models generate
dynamic representations that depend on the surrounding words. This allows them to capture
the nuanced meanings of words in different contexts. For example, in the sentences
“I’m going to the bank to deposit money” and “I’m going to sit by the river bank,”
the word “bank” has different meanings. A traditional word embedding would assign the
same vector to “bank” in both sentences, but a transformer model generates different
context vectors that capture the distinct meanings based on the surrounding words.
This contextual understanding enables transformers to excel at a wide range of NLP tasks,
from question answering and sentiment analysis to machine translation and text summarization.

Summary:
One of the most powerful aspects of transformers is their ability to generate context-aware
vector representations, often referred to as context vectors. Unlike traditional word
embeddings that assign a fixed vector to each word regardless of context, transformer models
generate dynamic representations that depend on the surrounding words. A traditional word
embedding would assign the same vector to “bank” in both sentences, but a transformer model
generates different context vectors that capture the distinct meanings based on the
surrounding words.
```
In this example, the function get_sentence_embedding()
is used to generate an embedding for an entire sentence by using the [CLS]
token embedding from the last layer of the transformer. The [CLS]
token is a special token prepended to the sentence, and the transformer is trained to produce an embedding that represents the entire input.
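If you want to see this for yourself, you can inspect the tokenizer output directly. The short check below is illustrative only and reuses the tokenizer loaded in the listing above.

```python
# Illustrative check (not in the original listing): with add_special_tokens=True,
# the tokenizer prepends [CLS] and appends [SEP] to the input.
ids = tokenizer("The cat sat on the mat.", add_special_tokens=True)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# The first token should be '[CLS]' and the last '[SEP]'
```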
In the function extractive_summarize()
, you generate sentence embeddings for each sentence in the document and compute the document embedding as the average of all sentence embeddings. Then, you calculate the cosine similarity between the document embedding and each sentence embedding, selecting the top N sentences with the highest similarity scores.
The summary is formed by joining these top N sentences in their original order within the document. This assumes that the most semantically similar sentences are the most representative of the document.
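As a variation you might experiment with (it is not the approach used in this post), sentence embeddings can also be obtained by mean-pooling the last-layer token vectors instead of taking the [CLS] embedding. The helper name get_sentence_embedding_mean() is made up for illustration; swapping it in for get_sentence_embedding() leaves the rest of extractive_summarize() unchanged.

```python
# A hedged alternative sketch: mean-pool the last-layer token vectors
# instead of using the [CLS] embedding as the sentence vector.
import torch

def get_sentence_embedding_mean(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=True,
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Average over the sequence dimension after dropping the batch axis
    return outputs.last_hidden_state[0].mean(dim=0).numpy()
```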
Further Reading
Below are some further readings that you may find useful:
- Rose et al. (2010), “Automatic Keyword Extraction from Individual Documents” (the RAKE algorithm paper)
- Mihalcea and Tarau (2004), “TextRank: Bringing Order into Text”
- Wikipedia: TF-IDF
- Wikipedia: BM25 algorithm
- Introduction to Extractive and Abstractive Summarization
Summary
In this post, you saw how context vectors can be used in various applications. In particular, you learned:
- How to generate context vectors for a document, sentence, or word
- How to perform contextual keyword extraction to find important keywords in a document
- How to perform extractive summarization
These applications demonstrate the power and versatility of context vectors for advanced NLP tasks. By understanding and leveraging these vectors, you can build sophisticated NLP systems that capture rich semantic relationships in text.