AI Resume Analysis: Knowledgebase Module

Knowledgebase Module Design and Implementation

This note records how I implemented the Knowledgebase module in the interview-guide project. The goal is to tie document upload, vectorization, RAG queries, and session association together into a single, maintainable knowledge service workflow.

Module Capability Overview

Document management: supports upload, download, deletion, categorization, keyword search, and statistics.
Vectorization capability: stores vectors with pgvector, and processes chunking/storage through async tasks.
RAG Q&A: supports both non-streaming and streaming (SSE) multi-knowledgebase query.
Session coordination: automatically removes associated session references when deleting a knowledgebase to reduce inconsistency risk.

State Transitions

Diagram 1: KnowledgeBase Main State Machine

flowchart TD
A["Call POST /api/knowledgebase/upload to upload file"] --> B["File validation + type detection + dedup check"]

B --> C{"Is file duplicated (fileHash exists)?"}
C -->|Yes| D["Return existing knowledgebase record\nduplicate=true\nno vectorization triggered"]
C -->|No| E["Parse text content + upload file to storage"]

E --> F["Save KnowledgeBaseEntity\ninitial vectorStatus=PENDING"]
F --> G["Send vectorization task to Redis Stream"]

G --> H["VectorizeStreamConsumer consumes task"]
H --> I["markProcessing\nvectorStatus=PROCESSING"]

I --> J["vectorizeAndStore\nchunk text and write to pgvector"]
J --> K{"Did vectorization succeed?"}

K -->|Yes| L["markCompleted\nvectorStatus=COMPLETED\nvectorError=null"]
K -->|No| M{"retryCount < 3 ?"}
M -->|Yes| N["Requeue task (retry+1)"]
N --> H
M -->|No| O["markFailed\nvectorStatus=FAILED\nwrite vectorError"]

P["Call POST /api/knowledgebase/{id}/revectorize"] --> Q["Reset status to PENDING\nclear vectorError"]
Q --> G

R["Call DELETE /api/knowledgebase/{id} to delete knowledgebase"] --> S["Remove RAG session associations"]
S --> T["Delete vector data (best effort) + delete storage file (best effort)"]
T --> U["Delete knowledgebase DB record\nlifecycle ends"]
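The status changes in Diagram 1 can be read as a small state machine. The following is a minimal sketch, assuming an enum named VectorStatus with a helper canMoveTo (both names are hypothetical; the project may not enforce transitions this way). A retry is modeled as PROCESSING to PROCESSING, because markProcessing runs again when the requeued task is consumed.

```java
/** Hypothetical sketch of the vectorStatus lifecycle shown in Diagram 1. */
enum VectorStatus {
    PENDING, PROCESSING, COMPLETED, FAILED;

    /** Returns true if the diagram allows moving from this status to next. */
    boolean canMoveTo(VectorStatus next) {
        switch (this) {
            case PENDING:
                // The consumer picks up the task and calls markProcessing.
                return next == PROCESSING;
            case PROCESSING:
                // Success, final failure, or a retry that re-enters PROCESSING.
                return next == PROCESSING || next == COMPLETED || next == FAILED;
            case COMPLETED:
            case FAILED:
                // Only a manual revectorize resets the record to PENDING.
                return next == PENDING;
            default:
                return false;
        }
    }
}
```

A guard like this would reject, for example, a jump from PENDING straight to COMPLETED, which the diagram never shows.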

Diagram 2: Chunked Knowledgebase Vectorization Flow

flowchart TD
A["Knowledgebase upload succeeds"] --> B["Save knowledgebase record vectorStatus=PENDING"]
B --> C["Send vectorization task to Redis Stream"]
C --> D["VectorizeStreamConsumer starts polling"]
D --> E["Read one message: kbId + content + retryCount"]
E --> F["Set status to PROCESSING"]
F --> G["Execute vectorizeAndStore"]

G --> H["Delete old vectors for this kbId"]
H --> I["Text chunking via TokenTextSplitter"]
I --> J["Add metadata kb_id to each chunk"]
J --> K["Batch call vectorStore.add to write vectors"]

K --> L["Set status to COMPLETED"]
L --> M["ACK message"]

G --> N{"Processing exception?"}
N -->|Yes| O{"retryCount < 3"}
O -->|Yes| P["retryCount+1 and requeue"]
P --> M
O -->|No| Q["Set status to FAILED and record error"]
Q --> M
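The retry branch of Diagram 2 can be sketched in plain Java. This is an illustration only: the in-memory Deque stands in for the Redis Stream, and the class, the Task type, and the drain method are hypothetical names. It shows that a task whose processing keeps failing is attempted four times in total (retryCount 0 through 3) before the status becomes FAILED.

```java
import java.util.Deque;
import java.util.function.Consumer;

/** Hypothetical in-memory model of the consume / retry / ACK loop in Diagram 2. */
final class VectorizeRetrySketch {
    static final int MAX_RETRY = 3;

    static final class Task {
        final long kbId;
        final String content;
        final int retryCount;

        Task(long kbId, String content, int retryCount) {
            this.kbId = kbId;
            this.content = content;
            this.retryCount = retryCount;
        }
    }

    /** Processes tasks until the queue is empty and returns the final status. */
    static String drain(Deque<Task> queue, Consumer<Task> vectorizeAndStore) {
        String status = "PENDING";
        while (!queue.isEmpty()) {
            Task task = queue.poll();
            status = "PROCESSING";
            try {
                vectorizeAndStore.accept(task);
                status = "COMPLETED";
            } catch (RuntimeException e) {
                if (task.retryCount < MAX_RETRY) {
                    // Requeue a copy with retryCount + 1; the status stays PROCESSING.
                    queue.add(new Task(task.kbId, task.content, task.retryCount + 1));
                } else {
                    status = "FAILED"; // the real flow also records the error message
                }
            }
            // In every branch the original message is ACKed at this point.
        }
        return status;
    }
}
```

Acknowledging the message in every branch keeps a failing task from being redelivered by the stream itself; retries happen only through the explicit requeue.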

Key API Design

`GET /api/knowledgebase/list` Get Knowledgebase List (Status Filter + Sorting)

Call chain:

Result.success(listService.listKnowledgeBases(status, sortBy));
// With a status filter:
knowledgeBaseRepository.findByVectorStatusOrderByUploadedAtDesc(vectorStatus);
// Without a status filter:
knowledgeBaseRepository.findAllByOrderByUploadedAtDesc();
// Optional re-sort according to sortBy:
entities = sortEntities(entities, sortBy);
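As a rough illustration of the filter-then-sort behavior, here is a plain-Java sketch over an in-memory list. The Item class and the "name" sort key are assumptions made for the example; the note does not list the sortBy values the real sortEntities accepts.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory version of the status filter plus sorting. */
final class KbListSketch {
    static final class Item {
        final String name;
        final String vectorStatus;
        final long uploadedAt; // epoch millis, assumed for the example

        Item(String name, String vectorStatus, long uploadedAt) {
            this.name = name;
            this.vectorStatus = vectorStatus;
            this.uploadedAt = uploadedAt;
        }
    }

    static List<Item> list(List<Item> all, String status, String sortBy) {
        List<Item> result = new ArrayList<>();
        for (Item item : all) {
            // A null status means "no filter", mirroring the findAll path.
            if (status == null || status.equals(item.vectorStatus)) {
                result.add(item);
            }
        }
        if ("name".equals(sortBy)) {
            // Assumed sort key; the real sortBy options may differ.
            result.sort(Comparator.comparing((Item i) -> i.name));
        } else {
            // Default order: newest upload first, as in OrderByUploadedAtDesc.
            result.sort(Comparator.comparingLong((Item i) -> i.uploadedAt).reversed());
        }
        return result;
    }
}
```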

`GET /api/knowledgebase/{id}` Get Knowledgebase Detail

Call chain:

listService.getKnowledgeBase(id);
knowledgeBaseRepository.findById(id);

`DELETE /api/knowledgebase/{id}` Delete Knowledgebase

Core flow:

deleteService.deleteKnowledgeBase(id);
knowledgeBaseRepository.findById(id);
sessionRepository.findByKnowledgeBaseIds(List.of(id));
vectorService.deleteByKnowledgeBaseId(id);
storageService.deleteKnowledgeBase(kb.getStorageKey());
knowledgeBaseRepository.deleteById(id);

Notes:

RAG session associations are removed first, then vectors and storage files, and finally the DB record.
Failures while deleting vectors or storage files are logged at WARN level and do not block the main delete flow.
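The ordering and the best-effort steps can be sketched as follows. This is a simplified model, not the project's deleteService: each step is passed in as a Runnable, and warnings are collected in a list instead of being written to a logger. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the delete order with best-effort cleanup steps. */
final class BestEffortDeleteSketch {

    /** Runs a cleanup step; a failure is recorded as a warning and never rethrown. */
    static void bestEffort(String stepName, Runnable step, List<String> warnings) {
        try {
            step.run();
        } catch (RuntimeException e) {
            warnings.add(stepName + " failed: " + e.getMessage());
        }
    }

    /** Returns the warnings collected during the delete. */
    static List<String> delete(Runnable removeSessionLinks,
                               Runnable deleteVectors,
                               Runnable deleteStorageFile,
                               Runnable deleteRecord) {
        List<String> warnings = new ArrayList<>();
        removeSessionLinks.run();                                    // must succeed
        bestEffort("vector deletion", deleteVectors, warnings);      // best effort
        bestEffort("storage deletion", deleteStorageFile, warnings); // best effort
        deleteRecord.run();                                          // must succeed
        return warnings;
    }
}
```

The trade-off is that a failed best-effort step can leave orphaned vectors or files behind, so the warnings are worth monitoring.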

`POST /api/knowledgebase/query` Non-Streaming Q&A (Multi-Knowledgebase)

Rate limits:

GLOBAL/IP: 10 each

Call chain:

queryService.queryKnowledgeBase(request);
answerQuestion(...);
countService.updateQuestionCounts(...);
vectorService.similaritySearch(...);

Processing highlights:

knowledgeBaseIds and question are required.
If the similarity search returns no hits, the fixed fallback text “No information retrieved” is returned.
If there are hits, the service builds the context and prompts, then calls the default ChatClient to generate the answer.
Returns QueryResponse(answer, primaryKbId, kbNamesStr).
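The fallback and context-building steps above can be sketched without any framework. The prompt wording, the separator, and the class name are assumptions; the model call is abstracted as a Function so the example stays self-contained rather than depending on ChatClient.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of the "no hit -> fallback, hit -> prompt" decision. */
final class RagPromptSketch {
    static final String FALLBACK = "No information retrieved";

    /** Joins retrieved chunks into one context block (separator is an assumption). */
    static String buildContext(List<String> hits) {
        return String.join("\n\n---\n\n", hits);
    }

    /** Returns the fallback text when nothing was retrieved, else asks the model. */
    static String answer(String question, List<String> hits, Function<String, String> model) {
        if (hits.isEmpty()) {
            return FALLBACK; // the model is never called without context
        }
        String prompt = "Answer using only the context below.\n\nContext:\n"
                + buildContext(hits)
                + "\n\nQuestion: " + question;
        return model.apply(prompt);
    }
}
```

Returning the fallback before calling the model avoids answers that are not grounded in any retrieved chunk.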

`POST /api/knowledgebase/query/stream` Streaming Q&A (SSE, Multi-Knowledgebase)

Rate limits:

GLOBAL/IP: 5 each

Call chain:

queryService.answerQuestionStream(kbIds, question);
countService.updateQuestionCounts(...);
vectorService.similaritySearch(...);
chatClient.prompt().stream().content();
normalizeStreamOutput(...);

Processing highlights:

Returns Flux<String> (text/event-stream).
Empty input or an empty search result returns the fallback text as a stream directly.
Exceptions raised while streaming, as well as exceptions raised before the stream starts, are degraded to a safe fallback output.

`GET /api/knowledgebase/categories` Get All Category Names

Call chain:

listService.getAllCategories();

Return:

Result<List<String>>

`GET /api/knowledgebase/category/{category}` Get Knowledgebase List by Category

Call chain:

listService.listByCategory(category);

Return:

Result<List<KnowledgeBaseListItemDTO>>

`GET /api/knowledgebase/uncategorized` Get Uncategorized Knowledgebase List

Call chain:

listService.listByCategory(category);

Notes:

The current implementation reuses the category query path and identifies uncategorized knowledgebases by a specific category value.

`PUT /api/knowledgebase/{id}/category` Update Knowledgebase Category

Call chain:

listService.updateCategory(id, body.get("category"));

Processing highlights:

Looks up the record by id first and throws a business exception if it is not found.
When the record exists, updates its category and persists it.

`POST /api/knowledgebase/upload` Upload Knowledgebase File (multipart)

Parameters:

file (required)
name (optional)
category (optional)

Rate limits:

GLOBAL/IP: 3 each

Call chain:

uploadService.uploadKnowledgeBase(file, name, category);
findByFileHash(fileHash);

Processing flow:

Validate file presence and size (max 50MB).
Validate type by MIME + extension whitelist (PDF/DOCX/DOC/TXT/MD).
Compute SHA-256 for dedup check.
Parse text content; fail directly on empty text.
Upload file to RustFS (S3-compatible), generate fileKey/fileUrl.
Save KnowledgeBaseEntity with initial vector status PENDING.
Enqueue async vectorization task to Redis Stream (knowledgebase:vectorize:stream).
Return knowledgeBase + storage + duplicate=false.
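The size check, extension whitelist, and SHA-256 dedup hash from the flow above can be sketched with the JDK alone. This is a simplified illustration: it checks only the extension (the MIME check is omitted), it assumes "50MB" means 50 * 1024 * 1024 bytes, and the class and method names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the pre-upload checks and the dedup hash. */
final class UploadCheckSketch {
    static final long MAX_BYTES = 50L * 1024 * 1024; // assumed meaning of "50MB"
    static final Set<String> ALLOWED_EXT = Set.of("pdf", "docx", "doc", "txt", "md");

    /** Extension and size check only; the real flow also inspects the MIME type. */
    static boolean isAllowed(String fileName, long sizeBytes) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || sizeBytes <= 0 || sizeBytes > MAX_BYTES) {
            return false;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ALLOWED_EXT.contains(ext);
    }

    /** Lowercase hex SHA-256 of the file bytes, usable as a fileHash lookup key. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Because the hash depends only on the file bytes, re-uploading the same file under a different name still maps to the same fileHash and takes the duplicate path.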

`GET /api/knowledgebase/{id}/download` Download Original Knowledgebase File

Call chain:

listService.getEntityForDownload(id);
listService.downloadFile(id);

Return:

ResponseEntity<byte[]> (with Content-Disposition and Content-Type)
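One detail that often matters for the Content-Disposition header is non-ASCII or space-containing file names. The sketch below shows one common way to build the header value with a percent-encoded UTF-8 name; it is an assumption about how such a header could be built, not a description of what downloadFile actually does, and the class name is hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for building an attachment Content-Disposition value. */
final class DownloadHeaderSketch {

    /** Percent-encodes the name as UTF-8 and uses the filename* parameter. */
    static String contentDisposition(String fileName) {
        // URLEncoder targets form encoding, so spaces become '+'; convert them to %20.
        String encoded = URLEncoder.encode(fileName, StandardCharsets.UTF_8).replace("+", "%20");
        return "attachment; filename*=UTF-8''" + encoded;
    }
}
```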

`GET /api/knowledgebase/search?keyword=...` Keyword Search Knowledgebase

Call chain:

listService.search(keyword);

`GET /api/knowledgebase/stats` Get Knowledgebase Statistics

Call chain:

listService.getStatistics();

Return:

KnowledgeBaseStatsDTO

`POST /api/knowledgebase/{id}/revectorize` Manual Re-Vectorization

Rate limits:

GLOBAL/IP: 2 each

Call chain:

uploadService.revectorize(id);

Processing flow:

Query knowledgebase by id, throw exception if missing.
Download source file from object storage and re-parse text.
Fail directly if parsing fails or returns empty text.
Reset vector status to PENDING.
Enqueue vectorization task to Redis Stream.
Return success immediately; frontend polls status afterward.

Async Vectorization Processing Flow (Core Implementation)

// 1) Delete old vectors
deleteByKnowledgeBaseId(knowledgeBaseId);

// 2) Text chunking (default no overlap)
List<Document> chunks = textSplitter.apply(List.of(new Document(content)));

// 3) Add metadata (kb_id)
chunks.forEach(chunk -> chunk.getMetadata().put("kb_id", knowledgeBaseId.toString()));

// 4) Batch vector write (DashScope batch <= 10)
int totalChunks = chunks.size();
// Ceiling division: number of batches needed to cover all chunks
int batchCount = (totalChunks + MAX_BATCH_SIZE - 1) / MAX_BATCH_SIZE;
for (int i = 0; i < batchCount; i++) {
    int start = i * MAX_BATCH_SIZE;
    int end = Math.min(start + MAX_BATCH_SIZE, totalChunks);
    List<Document> batch = chunks.subList(start, end);
    vectorStore.add(batch);
}
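To make the batching arithmetic concrete, here is a standalone sketch of the same partitioning written as a generic helper. The class and method names are hypothetical; MAX_BATCH_SIZE is set to 10 to match the DashScope limit noted in the listing. With 23 chunks it produces three batches of 10, 10, and 3 items.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a list into batches of at most size items. */
final class BatchSketch {
    static final int MAX_BATCH_SIZE = 10;

    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            int end = Math.min(start + size, items.size());
            // subList returns a view, so no elements are copied here.
            batches.add(items.subList(start, end));
        }
        return batches;
    }
}
```

An empty input yields zero batches, so no write call would be issued for a document that produced no chunks.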

Summary

The core value of the Knowledgebase module is connecting file asset management with retrieval-augmented Q&A. For me, a successful upload is only the starting point: what matters is that documents reliably enter the vectorization pipeline and end up providing reusable, traceable knowledge support in Q&A scenarios.
