Further Applications with Context Vectors


Source: MachineLearningMastery.com

Context vectors are powerful representations generated by transformer models that capture the meaning of words in their specific contexts. In our previous tutorials, we explored how to generate these vectors and some basic applications. Now, we’ll focus on building practical applications that leverage context vectors to solve real-world problems.

In this tutorial, we’ll implement several applications to demonstrate the power and versatility of context vectors. We’ll use the Hugging Face transformers library to extract context vectors from pre-trained models and build applications around them. Specifically, you will learn:

  • Building a semantic search engine with context vectors
  • Creating a document clustering and topic modeling application
  • Developing a document classification system

Let’s get started.

Further Applications with Context Vectors
Photo by Matheus Bertelli. Some rights reserved.

Overview

This post is divided into three parts; they are:

  • Building a Semantic Search Engine
  • Document Clustering
  • Document Classification

Building a Semantic Search Engine

If you want to find a specific document within a collection, you might use a simple keyword search. However, this approach is limited by the precision of keyword matching. You might not remember the exact wording used in the document, only what it was about. In such cases, semantic search is more effective.

Semantic search allows you to search by meaning rather than by keywords. Each document is represented by a context vector that captures its meaning, and the query is also represented as a context vector. The search engine then finds the documents most similar to the query, using a similarity measure such as L2 distance or cosine similarity.

Since you’ve already learned how to generate context vectors using a transformer model, let’s implement a simple semantic search engine:


import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

def get_context_vector(text, model, tokenizer):
    """Get context vector by mean pooling"""
    # Tokenize input, get model output
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling: take average across sequence length of the output
    pooled_vector = torch.mean(outputs.last_hidden_state, dim=1)
    return pooled_vector[0]

def semantic_search(query, documents, document_vectors, top_k=2):
    """Search the corpus"""
    # Calculate similarity between query and all documents
    query_vector = get_context_vector(query, model, tokenizer)
    similarities = cosine_similarity([query_vector], document_vectors)[0]
    # Get indices of top-k most similar documents (descending similarity)
    top_indices = np.argsort(similarities)[::-1][:top_k]
    # Return top-k documents and their similarity scores
    results = []
    for idx in top_indices:
        results.append({
            "document": documents[idx],
            "similarity": similarities[idx]
        })
    return results

# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Create a document corpus and convert them into context vectors
documents = [
    "Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.",
    "Deep learning is a subset of machine learning that uses neural networks with many layers.",
    "Natural language processing is a field of AI that focuses on the interaction between computers and human language.",
    "Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images or videos.",
    "Reinforcement learning is about taking suitable actions to maximize reward in a particular situation."
]
document_vectors = [get_context_vector(doc, model, tokenizer) for doc in documents]

# Example search
query = "How do computers learn from data?"
results = semantic_search(query, documents, document_vectors)

# Print results
print(f"Query: {query}\n")
for i, result in enumerate(results):
    print(f"Result {i+1} (Similarity: {result['similarity']:.4f}):")
    print(result["document"])
    print()

In this example, the context vector is created by the get_context_vector() function. You pass in the text as a string, and the tokenizer and model produce a tensor of shape (batch size, sequence length, hidden size). Because each text is encoded on its own here, without padding, every token in the sequence is valid; if you encoded a padded batch instead, you would use the attention mask produced by the tokenizer to identify the valid tokens.

Each input string’s context vector is computed as the mean of its token embeddings. Other methods of creating context vectors are possible, such as using the [CLS] token embedding or a different pooling strategy, as sketched below.
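If you want to try a different strategy, the following is a minimal sketch of two common alternatives: taking the [CLS] token embedding, and a mean pooling that uses the attention mask so padding tokens are ignored when several texts are encoded as one padded batch. The function names here are illustrative and not part of the code above.

import torch
from transformers import AutoTokenizer, AutoModel

def get_cls_vector(text, model, tokenizer):
    """Use the [CLS] token embedding as the context vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] is the first token in BERT's input sequence
    return outputs.last_hidden_state[0, 0]

def get_masked_mean_vectors(texts, model, tokenizer):
    """Mean pooling over valid tokens only, for a padded batch of texts."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Attention mask: 1 for real tokens, 0 for padding; shape (batch, seq_len, 1)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts  # shape: (batch, hidden_size)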

In this example, you begin with a collection of documents and a query string. You generate context vectors for both, and in semantic_search(), compare the query vector with all document vectors using cosine similarity to find the top-k most similar documents.

The output of the above code is:

Query: How do computers learn from data?

Result 1 (Similarity: 0.7573):

Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.

Result 2 (Similarity: 0.7342):

Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images or videos.

You can see that the semantic search engine understands the meaning behind queries, rather than just matching keywords. However, the quality of results depends on how well the context vectors represent the documents and queries, as well as the similarity metric used.
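If you want to compare similarity metrics, a small variant of semantic_search() that ranks by L2 (Euclidean) distance instead of cosine similarity could look like the sketch below. It reuses get_context_vector(), model, tokenizer, and document_vectors from the code above; with L2 distance, smaller values mean closer matches, so results are sorted in ascending order.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def semantic_search_l2(query, documents, document_vectors, top_k=2):
    """Variant of semantic_search() that ranks by L2 distance (smaller is closer)."""
    query_vector = get_context_vector(query, model, tokenizer)
    distances = euclidean_distances([query_vector], document_vectors)[0]
    top_indices = np.argsort(distances)[:top_k]  # ascending: closest documents first
    return [{"document": documents[idx], "distance": distances[idx]} for idx in top_indices]

results = semantic_search_l2("How do computers learn from data?", documents, document_vectors)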

Document Clustering

Document clustering groups similar documents together. It is useful when organizing a large collection of documents. While you could classify documents manually, that approach is time-consuming. Clustering is an automatic, unsupervised process—you don’t need to provide any labels. The algorithm groups documents into clusters based on their similarity.

With context vectors for each document, you can use any standard clustering algorithm. Below, we use K-means clustering:


import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from transformers import AutoTokenizer, AutoModel

def get_context_vector(text, model, tokenizer):
    """Get context vector by mean pooling"""
    # Tokenize input, get model output
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling: take average across sequence length of the output
    pooled_vector = torch.mean(outputs.last_hidden_state, dim=1)
    return pooled_vector[0]

# Create a document corpus (more documents for clustering)
documents = [
    "Machine learning algorithms build models based on sample data to make predictions without being explicitly programmed.",
    "Deep learning uses neural networks with many layers to learn representations of data with multiple levels of abstraction.",
    "Neural networks are computing systems inspired by the biological neural networks that constitute animal brains.",
    "Convolutional neural networks are deep neural networks most commonly applied to analyzing visual imagery.",
    "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.",
    "Sentiment analysis uses NLP to identify and extract opinions within text to determine writer's attitude.",
    "Named entity recognition is a subtask of information extraction that seeks to locate and classify named entities in text.",
    "Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images.",
    "Image recognition is the ability of software to identify objects, places, people, writing and actions in images.",
    "Object detection is a computer technology related to computer vision and image processing."
]

# Generate context vectors for all documents
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
document_vectors = np.array([get_context_vector(doc, model, tokenizer) for doc in documents])

# Perform K-means clustering on documents
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(document_vectors)

# Print documents in each cluster
for i in range(num_clusters):
    print(f"\nCluster {i+1}:")
    cluster_docs = [documents[j] for j in range(len(documents)) if cluster_labels[j] == i]
    for doc in cluster_docs:
        print(f"- {doc}")

# Visualize the clusters in reduced dimensionality
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(document_vectors)

plt.figure(figsize=(10, 6))
colors = ["red", "blue", "green"]
for i in range(num_clusters):
    # Plot points in each cluster
    cluster_points = reduced_vectors[cluster_labels == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], c=colors[i], label=f"Cluster {i+1}")
plt.title("Document Clusters")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend()
plt.grid(True)
plt.show()

In this example, the same get_context_vector() function is used to generate context vectors for a corpus of documents. Each document is transformed into a fixed-size context vector. Then, the K-means clustering algorithm groups the documents. The number of clusters is set to 3, but you can experiment with other values to see what makes the most sense.

The output of the above code is:

Cluster 1:
- Deep learning uses neural networks with many layers to learn representations of data with multiple levels of abstraction.
- Neural networks are computing systems inspired by the biological neural networks that constitute animal brains.
- Convolutional neural networks are deep neural networks most commonly applied to analyzing visual imagery.
- Sentiment analysis uses NLP to identify and extract opinions within text to determine writer's attitude.

Cluster 2:
- Natural language processing is a subfield of linguistics, computer science, and artificial intelligence.
- Named entity recognition is a subtask of information extraction that seeks to locate and classify named entities in text.
- Computer vision is an interdisciplinary field that deals with how computers can gain high-level understanding from digital images.
- Image recognition is the ability of software to identify objects, places, people, writing and actions in images.
- Object detection is a computer technology related to computer vision and image processing.

Cluster 3:
- Machine learning algorithms build models based on sample data to make predictions without being explicitly programmed.

The quality of clustering depends on the context vectors and the clustering algorithm. To evaluate the results, you can visualize the clusters in 2D using Principal Component Analysis (PCA). PCA reduces the vectors to their first two principal components, which the code above plots in a scatter plot with matplotlib.

If you don’t see clear clusters—as in this case—it suggests the clustering isn’t ideal. You may need to adjust how you generate context vectors. However, the issue might also be that all the documents are related to machine learning, so forcing them into three distinct clusters may not be meaningful.
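One rough way to check how many clusters the corpus supports is the silhouette score from scikit-learn: higher scores suggest better-separated clusters. A minimal sketch, assuming the document_vectors array from the code above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare a few candidate cluster counts on the same document vectors
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(document_vectors)
    score = silhouette_score(document_vectors, labels)
    print(f"k={k}: silhouette score = {score:.3f}")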

In general, document clustering helps automatically discover topics in a collection. For good results, you need a moderately large and diverse corpus with clear topic distinctions.
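One simple way to put a rough label on each discovered group is to look at the document closest to each cluster centroid. The following is only a sketch, reusing kmeans, cluster_labels, documents, and document_vectors from the code above:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Use the document nearest to each centroid as a rough summary of the cluster's topic
for i in range(num_clusters):
    centroid = kmeans.cluster_centers_[i].reshape(1, -1)
    member_idx = np.where(cluster_labels == i)[0]
    sims = cosine_similarity(centroid, document_vectors[member_idx])[0]
    print(f"Cluster {i+1} representative: {documents[member_idx[np.argmax(sims)]]}")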

Document Classification

If you happen to have labels for the documents, you can use them to train a classifier. This goes one step beyond clustering. With labels, you control how documents are grouped.

You may need more data to train a reliable classifier. Below, we’ll use a logistic regression classifier to categorize documents.


from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def get_context_vector(text, model, tokenizer):
    """Get context vector by mean pooling"""
    # Tokenize input, get model output
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling: take average across sequence length of the output
    pooled_vector = torch.mean(outputs.last_hidden_state, dim=1)
    return pooled_vector[0]

# Create a dataset of texts with labels
texts = [
    "The stock market reached a new high today, with technology stocks leading the gains.",
    "The company reported strong quarterly earnings, exceeding analysts' expectations.",
    "Investors are optimistic about the economy despite recent inflation concerns.",
    "The new vaccine has shown high efficacy in clinical trials against all variants.",
    "Researchers have discovered a potential treatment for a previously incurable disease.",
    "The hospital announced expanded capacity to handle the increasing number of patients.",
    "The latest smartphone features a better camera and longer battery life.",
    "The software update includes new security features and performance improvements.",
    "The tech company unveiled its newest artificial intelligence system yesterday."
]
labels = [
    "Business",
    "Business",
    "Business",
    "Health",
    "Health",
    "Health",
    "Technology",
    "Technology",
    "Technology"
]

# Generate context vectors for all texts
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
text_vectors = np.array([get_context_vector(text, model, tokenizer) for text in texts])

# Split into training and testing sets, train a classifier, then evaluate
X_train, X_test, y_train, y_test = train_test_split(text_vectors, labels, test_size=0.3, random_state=42)
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))

# Classify new texts
new_texts = [
    "The central bank has decided to keep interest rates unchanged.",
    "A new study shows that regular exercise can reduce the risk of heart disease.",
    "The new laptop has a faster processor and more memory than previous models."
]
new_vectors = np.array([get_context_vector(text, model, tokenizer) for text in new_texts])
predictions = classifier.predict(new_vectors)

# Print predictions
for text, prediction in zip(new_texts, predictions):
    print(f"Text: {text}")
    print(f"Category: {prediction}\n")

The context vectors are generated the same way as in the previous example. Instead of clustering or manually comparing similarities, you provide a list of labels (one per document) to a logistic regression classifier. Using the implementation from scikit-learn, we train the model on the training set and evaluate it on the test set.

The classification_report() function from scikit-learn provides metrics like precision, recall, F1 score, and accuracy. The result looks like this:

              precision    recall  f1-score   support

    Business       0.50      1.00      0.67         1
      Health       0.00      0.00      0.00         1
  Technology       1.00      1.00      1.00         1

    accuracy                           0.67         3
   macro avg       0.50      0.67      0.56         3
weighted avg       0.50      0.67      0.56         3

Note that the test set here contains only three examples (one per class), so these metrics are illustrative rather than a reliable evaluation. To use the trained classifier, follow the same workflow: convert new text into context vectors with the get_context_vector() function, then pass them to the classifier to predict categories. When you run the above code, you should see:

Text: The central bank has decided to keep interest rates unchanged.

Category: Business

Text: A new study shows that regular exercise can reduce the risk of heart disease.

Category: Health

Text: The new laptop has a faster processor and more memory than previous models.

Category: Technology

Note that the classifier is trained on context vectors, which ideally capture the meaning of the text rather than just surface keywords. As a result, it should more effectively generalize to new inputs, even those with unseen keywords.
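If you also want a confidence estimate for each prediction, logistic regression exposes class probabilities through predict_proba(). A short sketch, reusing classifier, new_texts, and new_vectors from the code above:

# Print the most likely category and its probability for each new text
probabilities = classifier.predict_proba(new_vectors)
for text, probs in zip(new_texts, probabilities):
    best = probs.argmax()
    print(f"{classifier.classes_[best]} ({probs[best]:.2f}): {text}")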

Summary

In this post, you’ve explored how to build practical applications using context vectors generated by transformer models. Specifically, you’ve implemented:

  • A semantic search engine to find documents most similar to a query
  • A document clustering application to group documents into meaningful categories
  • A document classification system to categorize documents into predefined categories

These applications highlight the power and versatility of context vectors for understanding and processing text. By leveraging the semantic capabilities of transformer models, you can build sophisticated NLP systems that go beyond simple keyword matching or rule-based methods.
