Hybrid Chunking combining Fixed-size and Semantic Chunking¶
- Create sample investment analysis text.
- Implement both fixed-size and semantic chunking using the CharacterTextSplitter and spaCy's sentence tokenizer.
- Store the embedded chunks in FAISS, using separate indices for fixed-size and semantic chunks.
- Create a Retrieval-Augmented Generation (RAG) agent that retrieves relevant chunks based on a user query: the first part of the query is handled by the fixed-size index, and the second part by the semantic index.
- Display the fixed-size and semantic chunks visually and retrieve results in tables with appropriate formatting.
Hybrid Chunking Strategy:¶
- Fixed-size chunks are created with a character-based splitter (CharacterTextSplitter), which ensures even-sized chunks.
- Semantic chunks are created using spaCy's sentence tokenizer for context-preserving chunks.
Embedding and Storage:¶
- Each chunk is embedded using spaCy's vector and stored in FAISS indices: one for fixed-size chunks and one for semantic chunks.
User Query and Hybrid Retrieval:¶
- The query is split into two parts, each part tailored for retrieval from a specific index.
- The top k results (defined by the variable k) are retrieved from each index.
- The first part of the query retrieves from the fixed-size chunk index, and the second part retrieves from the semantic chunk index.
Visualization and Display:¶
- Both fixed-size and semantic chunks are displayed in a bar graph for chunk visualization.
- Retrieval results are displayed in tables, with wrapped text for readability and bold headers to differentiate fields.
In [ ]:
%pip install -q spacy sentence-transformers faiss-cpu langchain matplotlib prettytable
In [ ]:
%python -m spacy download en_core_web_sm
In [ ]:
import faiss
import spacy
from langchain.text_splitter import CharacterTextSplitter
import matplotlib.pyplot as plt
from prettytable import PrettyTable
import numpy as np
# Sample Investment Analysis Text
investment_text = """
The financial services sector has experienced robust growth due to the adoption of digital banking and financial technology.
Our firm's investment banking division saw a revenue increase of 20% year-over-year, driven by higher client acquisition and new advisory services.
For the fiscal year ending in 2023, the company reported revenue of $30 million with an EBITDA margin of 30%.
The debt-to-equity ratio remains low at 0.4, providing a stable foundation for future investments and expansion in asset management.
Additionally, the company's market share in wealth management grew by 7%, attributed to expanded service offerings and improved client retention.
"""
# Load spaCy for NLP
nlp = spacy.load("en_core_web_sm")
# Step 1: Hybrid Chunking Strategy
# Fixed-size chunking
#fixed_size_splitter = CharacterTextSplitter(chunk_size=115, chunk_overlap=0)
fixed_size_splitter = CharacterTextSplitter(separator="\n", chunk_size=150, chunk_overlap=0)
fixed_chunks = fixed_size_splitter.split_text(investment_text)
# Semantic chunking
semantic_chunks = [sent.text for sent in nlp(investment_text).sents]
# Step 2: Embedding and Storing in FAISS
# Embedding function using spaCy
def get_embedding(text):
return nlp(text).vector
# Get vector dimension for FAISS index
vector_dim = nlp("test").vector.shape[0]
# Fixed-size Index
fixed_index = faiss.IndexFlatL2(vector_dim)
fixed_vectors = [get_embedding(chunk) for chunk in fixed_chunks]
fixed_index.add(np.array(fixed_vectors).astype('float32'))
# Semantic Index
semantic_index = faiss.IndexFlatL2(vector_dim)
semantic_vectors = [get_embedding(chunk) for chunk in semantic_chunks]
semantic_index.add(np.array(semantic_vectors).astype('float32'))
# Step 3: User Query and Hybrid Retrieval
# Split user query into two parts
#query = "Show revenue growth in investment banking and details on market share in wealth management."
query = "Show debt-to-equity for future investements and details on market share in wealth management."
query_parts = query.split(" and ")
# Define k for top-k retrievals globally for both fixed-length and semantic chunking
k = 1
# Embed each part of the query
query_vector_1 = get_embedding(query_parts[0]).reshape(1, -1)
query_vector_2 = get_embedding(query_parts[1]).reshape(1, -1)
# Retrieve top-k for Fixed-Size Index (first part of query)
k=2 # Define k for top-k retrieval for fixed
_, fixed_distances = fixed_index.search(query_vector_1, k)
fixed_retrieved = [(fixed_chunks[i], i, fixed_distances[0][j]) for j, i in enumerate(fixed_index.search(query_vector_1, k)[1][0])]
# Retrieve top-k for Semantic Index (second part of query)
k=1 # Define k for top-k retrieval for fixed
_, semantic_distances = semantic_index.search(query_vector_2, k)
semantic_retrieved = [(semantic_chunks[i], i, semantic_distances[0][j]) for j, i in enumerate(semantic_index.search(query_vector_2, k)[1][0])]
# Step 4: Visualize Chunks in a Bar Graph
fig, ax = plt.subplots(figsize=(12, 8))
# Display bar graph for fixed-size chunks
for i, chunk in enumerate(fixed_chunks):
ax.barh(f"Fixed Chunk {i}", len(chunk), color='lightblue', label="Fixed-size Chunks" if i == 0 else "")
# Display bar graph for semantic chunks
for i, chunk in enumerate(semantic_chunks):
ax.barh(f"Semantic Chunk {i}", len(chunk), color='lightgreen', label="Semantic Chunks" if i == 0 else "")
ax.set_xlabel("Chunk Length")
ax.set_ylabel("Chunk Type")
ax.set_title("Chunk Visualization")
ax.legend()
plt.tight_layout()
plt.show()
# Step 5: Display User Query and Retrieval Results
# Display user query
print("User Query:\n", query)
# Display first part of user query
print("\nFirst Part of User Query:\n", query_parts[0])
# Display Retrieval Results for Fixed-Size Query (First Part)
fixed_table = PrettyTable()
fixed_table.field_names = ["Chunk Text", "Index", "Distance"]
fixed_table.align = "l"
for text, idx, distance in fixed_retrieved:
wrapped_text = "\n".join([text[j:j + 80] for j in range(0, len(text), 80)])
fixed_table.add_row([wrapped_text, idx, distance])
print("\nFixed-Size RAG Retrieval Results (First Part of Query):")
print(fixed_table)
# Display second part of user query
print("\nSecond Part of User Query:\n", query_parts[1])
# Display Retrieval Results for Semantic Query (Second Part)
semantic_table = PrettyTable()
semantic_table.field_names = ["Chunk Text", "Index", "Distance"]
semantic_table.align = "l"
for text, idx, distance in semantic_retrieved:
wrapped_text = "\n".join([text[j:j + 80] for j in range(0, len(text), 80)])
semantic_table.add_row([wrapped_text, idx, distance])
print("\nSemantic RAG Retrieval Results (Second Part of Query):")
print(semantic_table)
User Query: Show debt-to-equity for future investements and details on market share in wealth management. First Part of User Query: Show debt-to-equity for future investements Fixed-Size RAG Retrieval Results (First Part of Query): +----------------------------------------------------------------------------------+-------+----------+ | Chunk Text | Index | Distance | +----------------------------------------------------------------------------------+-------+----------+ | The debt-to-equity ratio remains low at 0.4, providing a stable foundation for f | 3 | 3 | | uture investments and expansion in asset management. | | | | Our firm's investment banking division saw a revenue increase of 20% year-over-y | 1 | 1 | | ear, driven by higher client acquisition and new advisory services. | | | +----------------------------------------------------------------------------------+-------+----------+ Second Part of User Query: details on market share in wealth management. Semantic RAG Retrieval Results (Second Part of Query): +----------------------------------------------------------------------------------+-------+----------+ | Chunk Text | Index | Distance | +----------------------------------------------------------------------------------+-------+----------+ | Additionally, the company's market share in wealth management grew by 7%, attrib | 4 | 4 | | uted to expanded service offerings and improved client retention. | | | | | | | +----------------------------------------------------------------------------------+-------+----------+