The flow is straightforward:
Document Ingestion: Upload PDFs or Markdown files.
Chunking & Embedding: Break documents into chunks, generate embeddings.
Vector Storage: Store embeddings in PGVector.
Query Processing: User asks a question.
Retrieval: Find relevant chunks using vector similarity.
Generation: LLM generates answer based on retrieved context.
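Compressed into a few lines of Kotlin, the whole flow looks roughly like this. The sketch reuses the same Spring AI types (VectorStore, TokenTextSplitter, OllamaChatModel) as the full implementation below; the function name ragAnswer is purely illustrative.

// Illustrative compression of the six steps above; the full, annotated
// implementation is in Code Blocks 1-3 further down.
fun ragAnswer(
    vectorStore: VectorStore,    // steps 2-4: chunks + embeddings live here
    chatModel: OllamaChatModel,  // step 6: local generation via Ollama
    documents: List<Document>,   // step 1: already-read PDFs / Markdown files
    question: String             // step 5: the user's query
): String {
    // Chunk and store; the vector store calls the embedding model for us
    vectorStore.add(TokenTextSplitter().apply(documents))

    // Retrieve the chunks most similar to the question
    val context = vectorStore.similaritySearch(question)
        ?.joinToString("\n") { it.getFormattedContent() }
        .orEmpty()

    // Generate an answer grounded in the retrieved context
    val prompt = Prompt("Answer using only this context:\n$context\n\nQuestion: $question")
    return chatModel.call(prompt).result.output.text ?: ""
}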
Tech stack overview
Here’s what I used and why:
Spring AI + Kotlin: Because not everything needs to be Python. Spring AI’s 1.0 GA release proved the point by delivering a modular, powerful abstraction layer over LLMs and vector stores.
Ollama: The “Docker for AI models.” It makes LLMs portable, accessible, and easy to deploy. Runs on port 11434 with a REST API ready to receive requests.
PGVector: PostgreSQL with vector extensions. Familiar database, vector capabilities.
Local LLMs: Gemma3, Mistral, Phi-2, and others that run efficiently on consumer hardware.
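Because Ollama listens on port 11434, you can sanity-check that it is up before touching Spring AI. The probe below uses the JDK’s HttpClient against Ollama’s /api/generate endpoint; the endpoint and JSON fields are Ollama’s, the helper itself is just an illustration and assumes gemma3 has already been pulled.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Quick probe: ask the locally running Ollama daemon for a one-off completion.
fun main() {
    val body = """{"model": "gemma3", "prompt": "Say hello in one word.", "stream": false}"""
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:11434/api/generate"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())

    // A 200 with a JSON payload containing a "response" field means the model is ready.
    println("HTTP ${response.statusCode()}: ${response.body()}")
}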
Coding — Live Demo Flow
Here’s the core RAG implementation in Kotlin:
Code Block 1: Vector Database Configuration
This is the foundation for storing and retrieving embeddings, which is essential for RAG (Retrieval-Augmented Generation) functionality. All configuration is done programmatically using the builder pattern.
@Configuration
@EnableJpaRepositories
class VectorDatabaseConfig {

    @Bean
    @Primary
    fun dataSource(): DataSource {
        val config = HikariConfig()
        config.jdbcUrl = "jdbc:postgresql://localhost:5432/ssc_agent_db"
        config.username = "postgres"
        config.password = "password"
        config.driverClassName = "org.postgresql.Driver"
        // Enable pgvector extension support
        config.addDataSourceProperty("stringtype", "unspecified")
        return HikariDataSource(config)
    }

    @Bean
    fun vectorStore(
        dataSource: DataSource,
        embeddingModel: EmbeddingModel
    ): PgVectorStore {
        return PgVectorStore.builder(dataSource, embeddingModel)
            .withSchemaName("public")
            .withTableName("document_embeddings")
            // Configure vector dimensions - must match embedding model output
            .withDimensions(1024)
            // Set index type to HNSW for fast similarity search
            .withIndexType(PgVectorStore.PgIndexType.HNSW)
            // Use cosine distance for semantic similarity measurement
            .withDistanceType(PgVectorStore.PgDistanceType.COSINE_DISTANCE)
            // Configure batch processing for better performance
            .withMaxDocumentBatchSize(10000)
            // Enable automatic schema initialization
            .withInitializeSchema(true)
            // Enable vector table validations for data integrity
            .withVectorTableValidationsEnabled(true)
            // Optional: Configure HNSW-specific parameters for performance tuning
            .withHnswEfConstruction(200) // Higher = better recall, slower build
            .withHnswM(16) // Higher = better recall, more memory
            .build()
    }
}
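The one value here that can silently bite is withDimensions(1024): mxbai-embed-large emits 1024-dimensional vectors, but nomic-embed-text emits 768, so the table schema must match whichever embedding model Spring injects. As an optional sanity check (my addition, not part of the demo repo, assuming Spring AI’s EmbeddingModel exposes its dimensions() helper), you could add a small startup bean to the configuration class above:

@Bean
fun embeddingDimensionCheck(embeddingModel: EmbeddingModel) = CommandLineRunner {
    // dimensions() reports the size of the vectors the injected embedding model produces
    val actual = embeddingModel.dimensions()
    check(actual == 1024) {
        "Vector store is configured for 1024 dimensions but the embedding model returns $actual"
    }
}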
Code Block 2: Document Processing
This is the core business logic that transforms raw business documents into structured, searchable knowledge. It handles key tasks such as document ingestion, intelligent text chunking, embedding generation, and efficient storage using vector databases. These processes form the foundation of the knowledge base that empowers the AI agent to reason about your specific business context.
In this setup, models like mxbai-embed-large and nomic-embed-text are initialized to generate high-quality embeddings from diverse formats, while gemma3 is used for controlled and cost-efficient language generation with a tuned temperature of 0.4 (lower values make answers more focused and predictable, while higher values increase randomness and variability). The entire flow is orchestrated locally using Ollama, ensuring privacy and low latency. This enables the AI agent to deliver smart, fast, and context-aware responses grounded in your internal data.
spring.ai.ollama.base-url=http://localhost:11434
spring.ai.ollama.init.embedding.additional-models=mxbai-embed-large,nomic-embed-text
spring.ai.ollama.chat.options.temperature=0.4
spring.ai.ollama.chat.options.model=gemma3
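These properties set the application-wide defaults. If you want to override them for a single call, a minimal sketch (assuming Spring AI’s OllamaOptions builder; older milestones named the methods withModel(...) and withTemperature(...) instead) looks like this:

// Hypothetical per-request override: keep the global default of 0.4,
// but ask for a more deterministic answer for this one prompt.
val options = OllamaOptions.builder()
    .model("gemma3")
    .temperature(0.1) // lower = more focused, higher = more varied
    .build()

val prompt = Prompt("Summarize the refund policy in two sentences.", options)
val answer = ollamaChatModel.call(prompt).result.output.text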
Code Block 3: IngestionService RAG Implementation
This service demonstrates the core RAG (Retrieval-Augmented Generation) pipeline in its simplest, most effective form. The queryRAGKnowledge method performs vector similarity search to find relevant documents, joins the content into a context string, and creates a straightforward system prompt that instructs the AI to use only the retrieved information or respond “IDK :(” when uncertain. Using Ollama as the local LLM, it generates responses grounded in the business’s own data. The service also handles multi-format document ingestion (PDF, Markdown, images) with consistent token splitting and vector storage, making it a complete knowledge management solution.
@Service
class IngestionService(
    private val vectorStore: VectorStore,
    private val pdfDocumentReader: PdfDocumentReader,
    private val markdownReader: MarkdownReader,
    private val imageReader: ImageReader,
    private val ollamaChatModel: OllamaChatModel
) {
    private val logger = LoggerFactory.getLogger(IngestionService::class.java)

    fun ingest(type: IngestionType) {
        when (type) {
            IngestionType.PDF -> ingestPdf()
            IngestionType.MARKDOWN -> ingestMarkdown()
            IngestionType.IMG -> ingestImage()
        }
    }

    private fun ingestPdf() {
        logger.info("Ingesting PDF using PdfDocumentReader component")
        pdfDocumentReader.getDocsFromPdfWithCatalog()
            .let { TokenTextSplitter().apply(it) }
            .let { vectorStore.add(it) }
        logger.info("PDF loaded into vector store")
    }

    // ...
    // ...
    // ...

    /**
     * Main RAG query method - retrieves similar documents and generates response
     * This is the core of the local AI agent's intelligence
     */
    fun queryRAGKnowledge(query: String): ResponseEntity<String> {
        // Step 1: Find similar documents from vector store
        val information = vectorStore.similaritySearch(query)
            ?.joinToString(System.lineSeparator()) { it.getFormattedContent() }
            .orEmpty()

        // Step 2: Create system prompt with retrieved information
        val systemPromptTemplate = SystemPromptTemplate(
            """
            You are a helpful assistant.
            Use only the following information to answer the question.
            Do not use any other information. If you do not know, simply answer: IDK :(
            {information}
            """.trimIndent()
        )

        // Step 3: Build prompt with context and user query
        val systemMessage = systemPromptTemplate.createMessage(mapOf("information" to information))
        val userMessage = PromptTemplate("{query}").createMessage(mapOf("query" to query))
        val prompt = Prompt(listOf(systemMessage, userMessage))

        // Step 4: Generate response using Ollama chat model
        return ollamaChatModel.call(prompt)
            .result
            .output
            .text
            .let { ResponseEntity.ok(it) }
    }
}
// Supporting Data Classes and Enums
data class ChatRequest(val message: String)

data class ChatResponse(
    val message: String,
    val sources: List<String>,
    val timestamp: LocalDateTime
)

enum class IngestionType {
    PDF, MARKDOWN, IMG
}
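ChatRequest and ChatResponse are not wired up in the snippet above; a hypothetical controller (my sketch, not from the repo) that exposes queryRAGKnowledge over HTTP and fills them in could look like this:

@RestController
@RequestMapping("/api/chat")
class ChatController(private val ingestionService: IngestionService) {

    // POST /api/chat with body {"message": "What is our refund policy?"}
    @PostMapping
    fun chat(@RequestBody request: ChatRequest): ChatResponse {
        val answer = ingestionService.queryRAGKnowledge(request.message).body.orEmpty()
        // Sources are left empty here; filling them would mean returning document
        // metadata from the similarity search alongside the generated text.
        return ChatResponse(
            message = answer,
            sources = emptyList(),
            timestamp = LocalDateTime.now()
        )
    }
}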
Lessons learned / Dev tips
AI Is No Longer the Future, It’s Your localhost
The performance gap between local and cloud models is shrinking rapidly. For many use cases, local models are “good enough” and come with significant advantages.
Privacy + Control Are Superpowers
Being able to guarantee data privacy opens doors with clients who previously wouldn’t consider AI solutions.
Spring AI + Kotlin = Clean Dev Experience
The Spring ecosystem’s maturity combined with Kotlin’s expressiveness creates a development experience that rivals Python for AI applications.
Start Simple
Don’t try to build GPT-5 on day one. Start with basic RAG, get it working, then add complexity.
Resources & What’s Next
All the code, configuration files, and sample documents are ready for you in my GitHub repository.
➡️ Official GitHub Repo: https://github.com/AlejoJamC/ssc-local-agent
Here’s what you need to get started:
Install Ollama (curl -fsSL https://ollama.ai/install.sh | sh)
Pull a model (ollama pull gemma3)
Clone the repository
Run ./gradlew bootRun
That’s it. No cloud setup, no API keys, no credit card required.
🎥 Watch the Live Demo
SSC Meetup Talk: Your First RAG with Spring AI — See the complete implementation in action with real-time Q&A.
📚 Official Documentation & Examples
Spring AI 1.0 GA Release — The official announcement with key features.
Awesome Spring AI Community Samples — Curated collection of Spring AI implementations.
🔍 Model Selection Tools
Artificial Analysis — Compare model performance, cost, and speed across providers.
LMArena Leaderboard — Community-driven model rankings and comparisons.
🤗 Open LLM Leaderboard — Hugging Face’s comprehensive model evaluation.
👉 What’s next?
Try building your own RAG system. Start with the GitHub repo, experiment with different models, and see what works for your use case. The barrier to entry has never been lower.
Have questions about the implementation? Want to discuss model selection strategies? Drop a comment below or connect with me on LinkedIn.
Remember: the best AI is the one you actually use. Sometimes that means going local.
References:
Liu, F., Kang, Z. and Han, X. (2024) ‘Optimizing RAG Techniques for Automotive Industry PDF Chatbots: A Case Study with Locally Deployed Ollama Models’, in Proceedings of the 2024 3rd International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP 2024). New York, NY, USA: ACM, pp. 152–159. Available at: https://doi.org/10.1145/3707292.3707358.
Johnson, R. (2025) ‘Why you should use local models’, Medium, 30 May. Available at: https://medium.com/@springrod/why-you-should-use-local-models-a3fce1124c94 (Accessed: 6 July 2025).
Spring.io (2025) ‘Spring AI 1.0 GA Released’ [online image]. Available at: https://spring.io/blog/2025/05/20/spring-ai-1-0-GA-released (Accessed: 20 May 2025).
Mantilla Celis, J.A. (2025) ‘Build a Local AI Agent for Small Businesses’ [local RAG implementation diagram], inspired by the work of Bijit, B. (2024) ‘Advanced RAG for LLMs & SLMs’, Medium, 21 April. Available at: https://medium.com/@bijit211987/advanced-rag-for-llms-slms-5bcc6fbba411 (Accessed: 18 May 2025).