Hybrid Chunking combining Fixed-size and Semantic Chunking¶

  • Create sample investment analysis text.
  • Implement both fixed-size and semantic chunking using the CharacterTextSplitter and spaCy's sentence tokenizer.
  • Store the embedded chunks in FAISS, using separate indices for fixed-size and semantic chunks.
  • Create a Retrieval-Augmented Generation (RAG) agent that retrieves relevant chunks based on a user query: the first part of the query is handled by the fixed-size index, and the second part by the semantic index.
  • Display the fixed-size and semantic chunks visually and retrieve results in tables with appropriate formatting.
Hybrid Chunking Strategy:¶
  • Fixed-size chunks are created with a character-based splitter (CharacterTextSplitter), which ensures even-sized chunks.
  • Semantic chunks are created using spaCy's sentence tokenizer for context-preserving chunks.
Embedding and Storage:¶
  • Each chunk is embedded using spaCy's vector and stored in FAISS indices: one for fixed-size chunks and one for semantic chunks.
User Query and Hybrid Retrieval:¶
  • The query is split into two parts, each part tailored for retrieval from a specific index.
  • The top k results (defined by the variable k) are retrieved from each index.
  • The first part of the query retrieves from the fixed-size chunk index, and the second part retrieves from the semantic chunk index.
Visualization and Display:¶
  • Both fixed-size and semantic chunks are displayed in a bar graph for chunk visualization.
  • Retrieval results are displayed in tables, with wrapped text for readability and bold headers to differentiate fields.
In [ ]:
%pip install -q spacy sentence-transformers faiss-cpu langchain matplotlib prettytable
In [ ]:
%python -m spacy download en_core_web_sm
In [ ]:
import faiss
import spacy
from langchain.text_splitter import CharacterTextSplitter
import matplotlib.pyplot as plt
from prettytable import PrettyTable
import numpy as np

# Sample Investment Analysis Text
investment_text = """
The financial services sector has experienced robust growth due to the adoption of digital banking and financial technology. 
Our firm's investment banking division saw a revenue increase of 20% year-over-year, driven by higher client acquisition and new advisory services.
For the fiscal year ending in 2023, the company reported revenue of $30 million with an EBITDA margin of 30%.
The debt-to-equity ratio remains low at 0.4, providing a stable foundation for future investments and expansion in asset management.
Additionally, the company's market share in wealth management grew by 7%, attributed to expanded service offerings and improved client retention.
"""

# Load spaCy for NLP
nlp = spacy.load("en_core_web_sm")

# Step 1: Hybrid Chunking Strategy
# Fixed-size chunking
#fixed_size_splitter = CharacterTextSplitter(chunk_size=115, chunk_overlap=0)
fixed_size_splitter = CharacterTextSplitter(separator="\n", chunk_size=150, chunk_overlap=0)
fixed_chunks = fixed_size_splitter.split_text(investment_text)


# Semantic chunking
semantic_chunks = [sent.text for sent in nlp(investment_text).sents]

# Step 2: Embedding and Storing in FAISS
# Embedding function using spaCy
def get_embedding(text):
    return nlp(text).vector

# Get vector dimension for FAISS index
vector_dim = nlp("test").vector.shape[0]

# Fixed-size Index
fixed_index = faiss.IndexFlatL2(vector_dim)
fixed_vectors = [get_embedding(chunk) for chunk in fixed_chunks]
fixed_index.add(np.array(fixed_vectors).astype('float32'))

# Semantic Index
semantic_index = faiss.IndexFlatL2(vector_dim)
semantic_vectors = [get_embedding(chunk) for chunk in semantic_chunks]
semantic_index.add(np.array(semantic_vectors).astype('float32'))

# Step 3: User Query and Hybrid Retrieval
# Split user query into two parts
#query = "Show revenue growth in investment banking and details on market share in wealth management."
query = "Show debt-to-equity for future investements and details on market share in wealth management."
query_parts = query.split(" and ")

# Define k for top-k retrievals globally for both fixed-length and semantic chunking
k = 1

# Embed each part of the query
query_vector_1 = get_embedding(query_parts[0]).reshape(1, -1)
query_vector_2 = get_embedding(query_parts[1]).reshape(1, -1)

# Retrieve top-k for Fixed-Size Index (first part of query)
k=2 # Define k for top-k retrieval for fixed
_, fixed_distances = fixed_index.search(query_vector_1, k)
fixed_retrieved = [(fixed_chunks[i], i, fixed_distances[0][j]) for j, i in enumerate(fixed_index.search(query_vector_1, k)[1][0])]

# Retrieve top-k for Semantic Index (second part of query)
k=1  # Define k for top-k retrieval for fixed
_, semantic_distances = semantic_index.search(query_vector_2, k)
semantic_retrieved = [(semantic_chunks[i], i, semantic_distances[0][j]) for j, i in enumerate(semantic_index.search(query_vector_2, k)[1][0])]

# Step 4: Visualize Chunks in a Bar Graph
fig, ax = plt.subplots(figsize=(12, 8))

# Display bar graph for fixed-size chunks
for i, chunk in enumerate(fixed_chunks):
    ax.barh(f"Fixed Chunk {i}", len(chunk), color='lightblue', label="Fixed-size Chunks" if i == 0 else "")

# Display bar graph for semantic chunks
for i, chunk in enumerate(semantic_chunks):
    ax.barh(f"Semantic Chunk {i}", len(chunk), color='lightgreen', label="Semantic Chunks" if i == 0 else "")

ax.set_xlabel("Chunk Length")
ax.set_ylabel("Chunk Type")
ax.set_title("Chunk Visualization")
ax.legend()
plt.tight_layout()
plt.show()

# Step 5: Display User Query and Retrieval Results

# Display user query
print("User Query:\n", query)

# Display first part of user query
print("\nFirst Part of User Query:\n", query_parts[0])

# Display Retrieval Results for Fixed-Size Query (First Part)
fixed_table = PrettyTable()
fixed_table.field_names = ["Chunk Text", "Index", "Distance"]
fixed_table.align = "l"

for text, idx, distance in fixed_retrieved:
    wrapped_text = "\n".join([text[j:j + 80] for j in range(0, len(text), 80)])
    fixed_table.add_row([wrapped_text, idx, distance])

print("\nFixed-Size RAG Retrieval Results (First Part of Query):")
print(fixed_table)

# Display second part of user query
print("\nSecond Part of User Query:\n", query_parts[1])

# Display Retrieval Results for Semantic Query (Second Part)
semantic_table = PrettyTable()
semantic_table.field_names = ["Chunk Text", "Index", "Distance"]
semantic_table.align = "l"

for text, idx, distance in semantic_retrieved:
    wrapped_text = "\n".join([text[j:j + 80] for j in range(0, len(text), 80)])
    semantic_table.add_row([wrapped_text, idx, distance])

print("\nSemantic RAG Retrieval Results (Second Part of Query):")
print(semantic_table)
No description has been provided for this image
User Query:
 Show debt-to-equity for future investements and details on market share in wealth management.

First Part of User Query:
 Show debt-to-equity for future investements

Fixed-Size RAG Retrieval Results (First Part of Query):
+----------------------------------------------------------------------------------+-------+----------+
| Chunk Text                                                                       | Index | Distance |
+----------------------------------------------------------------------------------+-------+----------+
| The debt-to-equity ratio remains low at 0.4, providing a stable foundation for f | 3     | 3        |
| uture investments and expansion in asset management.                             |       |          |
| Our firm's investment banking division saw a revenue increase of 20% year-over-y | 1     | 1        |
| ear, driven by higher client acquisition and new advisory services.              |       |          |
+----------------------------------------------------------------------------------+-------+----------+

Second Part of User Query:
 details on market share in wealth management.

Semantic RAG Retrieval Results (Second Part of Query):
+----------------------------------------------------------------------------------+-------+----------+
| Chunk Text                                                                       | Index | Distance |
+----------------------------------------------------------------------------------+-------+----------+
| Additionally, the company's market share in wealth management grew by 7%, attrib | 4     | 4        |
| uted to expanded service offerings and improved client retention.                |       |          |
|                                                                                  |       |          |
+----------------------------------------------------------------------------------+-------+----------+