Knowledge Graphs for RAG using Neo4j — chat with SEC data — Part 1

Arshdeep Kaur
8 min read · May 9, 2024


CHAT WITH SEC — PLAN OF ACTION

  1. We will work with a single 10-K form for a company called NetApp.
    (Sections/items 1, 1a, 7, and 7a were extracted from the 10-K filing and converted to a JSON object.)
  2. Split the form sections (i.e. items 1, 1a, 7, 7a) into chunks using a LangChain text splitter.
  3. Create a graph where each chunk is a node. Add chunk metadata as properties.
  4. Create a vector index.
  5. Create a text embedding vector for each chunk and populate the index.
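
Throughout the steps below, kg is a LangChain Neo4jGraph connection, and several constants are referenced before they are defined. Here is a minimal setup sketch: the connection details and the OpenAI endpoint are placeholders, and the constants are inferred from the steps that follow, so adapt them to your own environment.

import json
import os
import textwrap

from langchain_community.graphs import Neo4jGraph

# Placeholder connection and API details; substitute your own.
NEO4J_URI = 'bolt://localhost:7687'
NEO4J_USERNAME = 'neo4j'
NEO4J_PASSWORD = 'password'
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_ENDPOINT = 'https://api.openai.com/v1/embeddings'  # assumed embeddings endpoint

# Constants referenced throughout, inferred from the steps below
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'

kg = Neo4jGraph(url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD)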

CONSTRUCTING A KNOWLEDGE GRAPH FROM TEXT

  1. For each specified item (section) in the Form 10-K filing (item1, item1a, item7, item7a), the function performs the following steps:
  • Retrieves the text data for the item from the JSON object.
  • Splits the text data into chunks using text_splitter.split_text() (a sketch of the splitter setup follows this list).
  • Initializes a sequence ID (chunk_seq_id) for the chunks.
  • Iterates through the first 20 chunks of the item’s text data. (We limit the number of chunks in each section to 20 to speed things up.)
  • Constructs a dictionary representing each chunk along with its associated metadata.
  • To summarize: each item ('item1', 'item1a', 'item7', 'item7a') is split into 2,000-character chunks, of which we keep the first 20, and each chunk is assigned a unique chunkId. That is, a chunkId is uniquely identified by form_id, item, and chunk_seq_id within the item.
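
The text_splitter used by the function below is not shown in the original; here is a minimal sketch assuming LangChain's RecursiveCharacterTextSplitter with the 2,000-character chunks described above (the overlap value is an illustrative assumption, not from the article):

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,      # 2,000-character chunks, as described above
    chunk_overlap=200,    # assumed overlap; not specified in the article
    length_function=len,
)
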
def split_form10k_data_from_file(file):
    chunks_with_metadata = []  # use this to accumulate chunk records
    file_as_object = json.load(open(file))  # open the json file
    for item in ['item1', 'item1a', 'item7', 'item7a']:  # pull these keys from the json
        print(f'Processing {item} from {file}')
        item_text = file_as_object[item]  # grab the text of the item
        item_text_chunks = text_splitter.split_text(item_text)  # split the text into chunks
        chunk_seq_id = 0
        for chunk in item_text_chunks[:20]:  # only take the first 20 chunks
            form_id = file[file.rindex('/') + 1:file.rindex('.')]  # extract form id from file name
            # finally, construct a record with metadata and the chunk text
            chunks_with_metadata.append({
                'text': chunk,
                # metadata from looping...
                'f10kItem': item,
                'chunkSeqId': chunk_seq_id,
                # constructed metadata...
                'formId': f'{form_id}',  # pulled from the filename
                'chunkId': f'{form_id}-{item}-chunk{chunk_seq_id:04d}',
                # metadata from file...
                'names': file_as_object['names'],
                'cik': file_as_object['cik'],
                'cusip6': file_as_object['cusip6'],
                'source': file_as_object['source'],
            })
            chunk_seq_id += 1
        print(f'\tSplit into {chunk_seq_id} chunks')
    return chunks_with_metadata
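
Step 3 below references first_file_chunks; it is assumed to come from a call like this (the file path is a hypothetical example):

first_file = './data/form10k/0000950170-23-027948.json'  # hypothetical path
first_file_chunks = split_form10k_data_from_file(first_file)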

2. Next, we create graph nodes from the chunks. MERGE performs an upsert: the properties are set only when no node with that chunkId already exists, so re-running the query will not create duplicates.

merge_chunk_node_query = """
MERGE (mergedChunk:Chunk {chunkId: $chunkParam.chunkId})
ON CREATE SET
    mergedChunk.names = $chunkParam.names,
    mergedChunk.formId = $chunkParam.formId,
    mergedChunk.cik = $chunkParam.cik,
    mergedChunk.cusip6 = $chunkParam.cusip6,
    mergedChunk.source = $chunkParam.source,
    mergedChunk.f10kItem = $chunkParam.f10kItem,
    mergedChunk.chunkSeqId = $chunkParam.chunkSeqId,
    mergedChunk.text = $chunkParam.text
RETURN mergedChunk
"""

3. Loop through the chunks, running merge_chunk_node_query for each one to create the nodes.

node_count = 0
for chunk in first_file_chunks:
    print(f"Creating `:Chunk` node for chunk ID {chunk['chunkId']}")
    kg.query(merge_chunk_node_query,
             params={'chunkParam': chunk})
    node_count += 1
print(f"Created {node_count} nodes")

4. 23 nodes get created for these chunkIds:

  • 20 nodes for item1: 0000950170-23-027948-item1-chunk0000 through 0000950170-23-027948-item1-chunk0019
  • 1 node for each of the other items (item1a, item7, item7a):
    0000950170-23-027948-item1a-chunk0000, 0000950170-23-027948-item7-chunk0000, 0000950170-23-027948-item7a-chunk0000.

5. Sanity check: verify 23 nodes:

kg.query("""
MATCH (n)
RETURN count(n) as nodeCount
""")

Output

[{'nodeCount': 23}]

So far so good. ✅

VECTOR INDEX

  1. Create a vector index. (1536 is the dimensionality of vectors from OpenAI's text-embedding-ada-002, the default OpenAI embedding model.)
kg.query("""
CREATE VECTOR INDEX `form_10k_chunks` IF NOT EXISTS
FOR (c:Chunk) ON (c.textEmbedding)
OPTIONS { indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
}}
""")

Make sure the vector index is online

kg.query("SHOW INDEXES")

Output shows 'state': 'ONLINE':

[{'id': 5,
'name': 'form_10k_chunks',
'state': 'ONLINE',
'populationPercent': 100.0,
'type': 'VECTOR',
'entityType': 'NODE',
'labelsOrTypes': ['Chunk'],
'properties': ['textEmbedding'],
'indexProvider': 'vector-1.0',
'owningConstraint': None,
'lastRead': None,
'readCount': None}]

2. Calculate embedding vectors for the chunks and populate the index. This query computes the embedding vector and stores it as a property called textEmbedding on each Chunk node.

kg.query("""
MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NULL
WITH chunk, genai.vector.encode(
    chunk.text,
    "OpenAI",
    {
        token: $openAiApiKey,
        endpoint: $openAiEndpoint
    }) AS vector
CALL db.create.setNodeVectorProperty(chunk, "textEmbedding", vector)
""",
params={"openAiApiKey": OPENAI_API_KEY, "openAiEndpoint": OPENAI_ENDPOINT})
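
To confirm the embeddings were written (a quick check, not in the original walkthrough), pull the dimensionality of one stored vector; it should match the 1536 dimensions declared on the index:

kg.query("""
MATCH (chunk:Chunk) WHERE chunk.textEmbedding IS NOT NULL
RETURN size(chunk.textEmbedding) AS dimensions LIMIT 1
""")
# expected: [{'dimensions': 1536}]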

3. Create a generic function that takes a question as input and performs a similarity search:

def neo4j_vector_search(question):
    """Search for similar nodes using the Neo4j vector index"""
    vector_search_query = """
    WITH genai.vector.encode(
        $question,
        "OpenAI",
        {
            token: $openAiApiKey,
            endpoint: $openAiEndpoint
        }) AS question_embedding
    CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding)
    YIELD node, score
    RETURN score, node.text AS text
    """
    similar = kg.query(vector_search_query,
                       params={
                           'question': question,
                           'openAiApiKey': OPENAI_API_KEY,
                           'openAiEndpoint': OPENAI_ENDPOINT,
                           'index_name': VECTOR_INDEX_NAME,
                           'top_k': 10})
    return similar

4. Perform similarity search

search_results = neo4j_vector_search(
    'In a single sentence, tell me about Netapp.'
)

Output:

search_results[0]

{'score': 0.9358915090560913,
'text': '>Item 1. \nBusiness\n\n\nOverview\n\n\nNetApp, Inc. (NetApp, we, us or the Company) is a global cloud-led, data-centric software company. We were incorporated in 1992 and are headquartered in San Jose, California. Building on more than three decades of innovation, we give customers the freedom to manage applications and data across hybrid multicloud environments. Our portfolio of cloud services, and storage infrastructure, powered by intelligent data management software, enables applications to run faster, more reliably, and more securely, all at a lower cost.\n\n\nOur opportunity is defined by the durable megatrends of data-driven digital and cloud transformations. NetApp helps organizations meet the complexities created by rapid data and cloud growth, multi-cloud management, and the adoption of next-generation technologies, such as AI, Kubernetes, and modern databases. Our modern approach to hybrid, multicloud infrastructure and data management, which we term ‘evolved cloud’, provides customers the ability to leverage data across their entire estate with simplicity, security, and sustainability which increases our relevance and value to our customers.\n\n\nIn an evolved cloud state, the cloud is fully integrated into an organization’s architecture and operations. Data centers and clouds are seamlessly united and hybrid multicloud operations are simplified, with consistency and observability across environments. The key benefits NetApp brings to an organization’s hybrid multicloud environment are:\n\n\n•\nOperational simplicity: NetApp’s use of open source, open architectures and APIs, microservices, and common capabilities and data services facilitate the creation of applications that can run anywhere.\n\n\n•\nFlexibility and consistency: NetApp makes moving data and applications between environments seamless through a common storage foundation across on-premises and multicloud environments.'}

So far we have only performed vector search. To create a chatbot (question answering), we will build a RAG system using LangChain. 👇

5. Set up a LangChain RAG workflow to chat with the form, using RetrievalQAWithSourcesChain to carry out question answering. Under the hood, Neo4j uses Cypher to perform the vector similarity search.
chain_type="stuff" means we are doing prompt stuffing, i.e. all retrieved chunks are combined into a single prompt and sent to the LLM.

neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)

retriever = neo4j_vector_store.as_retriever()

chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever
)

def prettychain(question: str) -> str:
    """Pretty print the chain's response to a question"""
    response = chain({"question": question},
                     return_only_outputs=True)
    print(textwrap.fill(response['answer'], 60))

6. Ask a question

question = "What is Netapp's primary business?"
prettychain(question)

Output

NetApp's primary business is enterprise storage and data
management, cloud storage, and cloud operations.

ADDING RELATIONSHIPS TO THE SEC KNOWLEDGE GRAPH

Now we will add relationships to preserve the structure of the documents.

  1. Create a Form node for the filing:
cypher = """
MERGE (f:Form {formId: $formInfoParam.formId})
ON CREATE
    SET f.names = $formInfoParam.names
    SET f.source = $formInfoParam.source
    SET f.cik = $formInfoParam.cik
    SET f.cusip6 = $formInfoParam.cusip6
"""

kg.query(cypher, params={'formInfoParam': form_info})
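
form_info is not defined in the snippet above; it is assumed to hold the form-level metadata that every chunk shares, which can be pulled from any one Chunk node. A sketch:

form_info_list = kg.query("""
MATCH (anyChunk:Chunk)
WITH anyChunk LIMIT 1
RETURN anyChunk { .names, .source, .formId, .cik, .cusip6 } AS formInfo
""")
form_info = form_info_list[0]['formInfo']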

Sanity check

kg.query("MATCH (f:Form) RETURN count(f) as formCount")

Output

[{'formCount': 1}]

Now we will connect everything together. Our goal is to add relationships to improve the context around each chunk. The result will reflect the original structure of the document.

Connect form and chunk nodes

2. Write a query that (1) retrieves all chunks from the same form and the same section (e.g. item1), ordered by chunkSeqId, and (2) uses collect to gather them into a list.
Then loop through all the sections, calling this query to link each section's chunks in order with a NEXT relationship.

cypher = """
MATCH (from_same_section:Chunk)
WHERE from_same_section.formId = $formIdParam
AND from_same_section.f10kItem = $f10kItemParam
WITH from_same_section
ORDER BY from_same_section.chunkSeqId ASC
WITH collect(from_same_section) as section_chunk_list
CALL apoc.nodes.link(
section_chunk_list,
"NEXT",
{avoidDuplicates: true}
)
RETURN size(section_chunk_list)
"""

for form10kItemName in ['item1', 'item1a', 'item7', 'item7a']:
kg.query(cypher, params={'formIdParam':form_info['formId'],
'f10kItemParam': form10kItemName})
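
As a quick sanity check (not in the original walkthrough), count the NEXT relationships: the 20 linked item1 chunks yield 19, and the single-chunk sections yield none:

kg.query("""
MATCH (:Chunk)-[r:NEXT]->(:Chunk)
RETURN count(r) AS nextCount
""")
# expected: [{'nextCount': 19}]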

3. Connect chunks to their parent form with a PART_OF relationship.

cypher = """
MATCH (c:Chunk), (f:Form)
WHERE c.formId = f.formId
MERGE (c)-[newRelationship:PART_OF]->(f)
RETURN count(newRelationship)
"""

kg.query(cypher)

4. Create a SECTION relationship from the form to the first chunk of each section.

cypher = """
MATCH (first:Chunk), (f:Form)
WHERE first.formId = f.formId
AND first.chunkSeqId = 0
WITH first, f
MERGE (f)-[r:SECTION {f10kItem: first.f10kItem}]->(first)
RETURN count(r)
"""

kg.query(cypher)
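
The window queries in the next steps reference first_chunk_info and next_chunk_info, which the original snippets do not define. They are assumed to be chunk records fetched roughly like this (a sketch using the relationships just created):

# First chunk of item1, reached via the SECTION relationship
first_chunk_info = kg.query("""
MATCH (f:Form)-[r:SECTION]->(first:Chunk)
WHERE f.formId = $formIdParam AND r.f10kItem = 'item1'
RETURN first.chunkId AS chunkId, first.text AS text
""", params={'formIdParam': form_info['formId']})[0]

# Its successor, reached via the NEXT relationship
next_chunk_info = kg.query("""
MATCH (first:Chunk)-[:NEXT]->(nextChunk:Chunk)
WHERE first.chunkId = $chunkIdParam
RETURN nextChunk.chunkId AS chunkId, nextChunk.text AS text
""", params={'chunkIdParam': first_chunk_info['chunkId']})[0]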

5. Play with queries, e.g. return a window of three consecutive chunks:

cypher = """
MATCH (c1:Chunk)-[:NEXT]->(c2:Chunk)-[:NEXT]->(c3:Chunk)
WHERE c2.chunkId = $chunkIdParam
RETURN c1.chunkId, c2.chunkId, c3.chunkId
"""

kg.query(cypher,
params={'chunkIdParam': next_chunk_info['chunkId']})
[{'c1.chunkId': '0000950170-23-027948-item1-chunk0000',
'c2.chunkId': '0000950170-23-027948-item1-chunk0001',
'c3.chunkId': '0000950170-23-027948-item1-chunk0002'}]

6. Modify the pattern so that NEXT has variable length, and return the window with the longest path. For the first chunk of a section there is no preceding chunk, so the longest window found here has length 1:

cypher = """
MATCH window=
(:Chunk)-[:NEXT*0..1]->(c:Chunk)-[:NEXT*0..1]->(:Chunk)
WHERE c.chunkId = $chunkIdParam
WITH window as longestChunkWindow
ORDER BY length(window) DESC LIMIT 1
RETURN length(longestChunkWindow)
"""

kg.query(cypher,
params={'chunkIdParam': first_chunk_info['chunkId']})
[{'length(longestChunkWindow)': 1}]

Now we have learnt how to expand the context around a chunk using a window.

7. Create the chatbot: define a window retrieval query that fetches consecutive chunks, then create a vector store that retrieves via retrieval_query_window. (LangChain's Neo4jVector runs the vector search first and passes each hit into the retrieval query as the variables node and score.)

retrieval_query_window = """
MATCH window=
    (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window AS longestWindow
    ORDER BY length(window) DESC LIMIT 1
WITH nodes(longestWindow) AS chunkList, node, score
UNWIND chunkList AS chunkRows
WITH collect(chunkRows.text) AS textList, node, score
RETURN apoc.text.join(textList, " \n ") AS text,
    score,
    node {.source} AS metadata
"""

# Create a vector store that retrieves from retrieval_query_window
vector_store_window = Neo4jVector.from_existing_index(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database="neo4j",
    index_name=VECTOR_INDEX_NAME,
    text_node_property=VECTOR_SOURCE_PROPERTY,
    retrieval_query=retrieval_query_window
)

# Create a retriever from the vector store
retriever_window = vector_store_window.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
chain_window = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever_window
)

For comparison, create a regular vector store that retrieves a single node:

# Create a regular vector store that retrieves a single node
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)
# Create a retriever from the vector store
windowless_retriever = neo4j_vector_store.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
windowless_chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=windowless_retriever
)

Compare the two chains, windowless_chain and chain_window:

question = "In a single sentence, tell me about Netapp's business."

answer = windowless_chain(
    {"question": question},
    return_only_outputs=True,
)
print(textwrap.fill(answer["answer"]))

Output

NetApp is a global cloud-led, data-centric software company that
provides customers with the freedom to manage applications and data
across hybrid multicloud environments, focusing primarily on
enterprise storage and data management, cloud storage, and cloud
operations markets.

answer = chain_window(
    {"question": question},
    return_only_outputs=True,
)
print(textwrap.fill(answer["answer"]))

Output

NetApp's business focuses on providing storage-as-a-service solutions
and global support for hybrid cloud environments through a
multichannel distribution strategy and partnerships with leading cloud
providers and other industry partners.

Acknowledgements & References

  1. Andrew Ng at DeepLearning.AI and Andreas Kollegger at Neo4j,
     short course "Knowledge Graphs for RAG".
     https://github.com/neo4j-examples/sec-edgar-notebooks
