
Server Side Chat: Building a Local AI Agent that actually works (No Cloud required)

Build your first RAG system with Spring AI, Ollama, and Kotlin — process documents, answer questions, and keep your data private.

The problem that started it all

Picture this: you’re a business owner or tech leader who needs to process documents, answer customer questions, and extract insights from your data. But every AI solution you find either:

  • Costs a fortune in API calls

  • Sends sensitive data to the cloud, or

  • Feels like a foreign stack, far removed from your team’s expertise.

Sound familiar?

Three months ago, I faced the same challenge. As a software engineer with 10+ years in the JVM ecosystem (Java, Kotlin), I realized something: while Python dominates AI discussions, millions of developers work daily with JVM technologies, and they shouldn’t need to switch stacks to harness AI’s power.

That’s how I ended up building a local RAG (Retrieval-Augmented Generation) system using Spring AI + Kotlin + Ollama: private, cost-effective, and native to the tools JVM developers already know.

The project was founded with the belief that the next wave of Generative AI applications will not be only for Python developers but will be ubiquitous across many programming languages. — Spring AI engineering team

What you’ll learn

By the end of this post, you’ll understand:

  • Why local AI isn’t just a privacy gimmick. It’s a game-changer.

  • How to choose the right LLM for your use case.

  • The architecture behind a RAG system.

  • Code examples using PDF and Markdown files as knowledge sources.

  • The challenges I faced and how to avoid them.

Understanding the foundation: Key concepts for Local AI

Before diving into the implementation, let’s establish the core concepts that make this demo possible. Think of this as your AI vocabulary crash course. Understanding these six elements will help you follow along and make informed decisions about your own implementation.

Large Language Models (LLMs) are powerful neural networks trained on massive text corpora. They can generate, summarize, translate, and answer questions using Natural Language Processing, acting as advanced pattern recognizers that understand context and intent.

AI Agents go further: they are LLMs that reason, act, and use tools such as APIs, databases, or file systems. They don’t just chat; they get things done.

Retrieval-Augmented Generation (RAG) bridges LLMs with your own data — PDFs, CSVs, or internal documents. Instead of guessing, the model first retrieves relevant chunks, then generates answers grounded in that specific context. This turns a generic LLM into a domain expert.

Vector Stores and Embeddings make this possible. By converting text into semantic vectors, RAG enables similarity-based retrieval, not just matching words but understanding meaning.
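
To make this concrete, here is a minimal sketch of what “understanding meaning” looks like in code, assuming Spring AI’s EmbeddingModel (whose embed(String) method returns a float array in the 1.0 line). The semanticSimilarity helper is hypothetical, not part of the demo project:

import org.springframework.ai.embedding.EmbeddingModel
import kotlin.math.sqrt

// Hypothetical helper: compares two texts by the angle between their embedding vectors.
// A score close to 1.0 means "semantically close"; a score near 0.0 means unrelated.
fun semanticSimilarity(embeddingModel: EmbeddingModel, a: String, b: String): Double {
    val va = embeddingModel.embed(a) // FloatArray produced by the local embedding model
    val vb = embeddingModel.embed(b)
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in va.indices) {
        dot += va[i] * vb[i]
        normA += va[i] * va[i]
        normB += vb[i] * vb[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

“Invoice payment terms” and “When is the bill due?” share almost no words, yet their vectors land close together; that is exactly what RAG retrieval relies on.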

GGUF (GPT-Generated Unified Format) makes local AI feasible. These optimized, quantized models can run efficiently on laptops or edge devices, democratizing access to advanced AI.

Finally, Spring AI and Ollama bring it all together: Spring AI integrates LLMs into JVM apps effortlessly, while Ollama runs quantized models locally via CLI or REST — empowering developers to build private, local-first AI systems.

Why go local? (Spoiler: It’s not just about privacy)

Running LLMs locally isn’t just a developer flex — it’s often the most secure, cost-effective, and flexible way to deploy GenAI solutions in production environments.

The case for local AI models has never been stronger. As Rod Johnson (Creator of Spring) points out, “local models are the future of AI development” — they allow tighter integration, transparent behavior, and complete customization without external dependencies (Johnson, 2025).
In practice, local-first LLMs unlock several key advantages:

  1. Privacy first, always: Your sensitive documents remain secure within your environment, ensuring GDPR compliance.

  2. Cost-efficient development: Local models eliminate continuous API fees, allowing endless iteration on existing hardware.

  3. Eco-friendly: Right-size your model — smaller AI = smaller carbon footprint.

  4. Faster prototyping: Enjoy faster, iteration-friendly development without cloud infrastructure delays.

  5. Regulatory compliance: Local models provide essential data control and compliance for strict geographic requirements.

Architecture: What you’ll build

Figure 1: Local RAG implementation by Alejandro Mantilla, inspired by Bijit Ghosh

The flow is straightforward:

  1. Document Ingestion: Upload PDFs or Markdown files.

  2. Chunking & Embedding: Break documents into chunks, generate embeddings.

  3. Vector Storage: Store embeddings in PGVector.

  4. Query Processing: User asks a question.

  5. Retrieval: Find relevant chunks using vector similarity.

  6. Generation: LLM generates answer based on retrieved context.

Tech stack overview

Here’s what I used and why:

Spring AI + Kotlin: Because not everything needs to be Python. Spring AI’s 1.0 GA release delivers a modular, powerful abstraction layer over LLMs and vector stores.

Ollama: The “Docker for AI models.” It makes LLMs portable, accessible, and easy to deploy. Runs on port 11434 with a REST API ready to receive requests.
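
Before wiring Ollama into Spring, it helps to confirm it is actually up. The sketch below (plain Kotlin, no Spring involved) calls Ollama’s GET /api/tags endpoint, which lists the models you have pulled locally:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    // Ollama serves its REST API on localhost:11434 by default.
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:11434/api/tags"))
        .GET()
        .build()

    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println("Ollama responded with HTTP ${response.statusCode()}")
    println(response.body()) // JSON list of locally available models, e.g. gemma3
}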

PGVector: PostgreSQL with vector extensions. Familiar database, vector capabilities.

Local LLMs: Gemma3, Mistral, Phi-2, and others that run efficiently on consumer hardware.
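
For reference, this is roughly how those pieces show up in a Gradle (Kotlin DSL) build. The artifact names below follow the Spring AI 1.0 GA naming as I understand it, so treat them as a sketch and double-check them against the official docs and the repository’s own build file:

// build.gradle.kts (sketch; verify coordinates against the repo and the Spring AI docs)
dependencies {
    implementation(platform("org.springframework.ai:spring-ai-bom:1.0.0"))
    implementation("org.springframework.ai:spring-ai-starter-model-ollama")          // Ollama chat + embedding models
    implementation("org.springframework.ai:spring-ai-starter-vector-store-pgvector") // PGVector vector store
    implementation("org.springframework.boot:spring-boot-starter-web")
}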

Coding — Live Demo Flow

Here’s the core RAG implementation in Kotlin:

Code Block 1: Vector Database Configuration

This is the foundation for storing and retrieving the embeddings that RAG (Retrieval-Augmented Generation) depends on. All configuration is done programmatically using the builder pattern.

@Configuration
@EnableJpaRepositories
class VectorDatabaseConfig {
    
    @Bean
    @Primary
    fun dataSource(): DataSource {
        val config = HikariConfig()
        config.jdbcUrl = "jdbc:postgresql://localhost:5432/ssc_agent_db"
        config.username = "postgres"
        config.password = "password"
        config.driverClassName = "org.postgresql.Driver"
        
        // Enable pgvector extension support
        config.addDataSourceProperty("stringtype", "unspecified")
        
        return HikariDataSource(config)
    }
    
    @Bean
    fun vectorStore(
        dataSource: DataSource,
        embeddingModel: EmbeddingModel
    ): PgVectorStore {
        return PgVectorStore.builder(dataSource, embeddingModel)
            .withSchemaName("public")
            .withTableName("document_embeddings")
            // Configure vector dimensions - must match embedding model output
            .withDimensions(1024)
            // Set index type to HNSW for fast similarity search
            .withIndexType(PgVectorStore.PgIndexType.HNSW)
            // Use cosine distance for semantic similarity measurement
            .withDistanceType(PgVectorStore.PgDistanceType.COSINE_DISTANCE)
            // Configure batch processing for better performance
            .withMaxDocumentBatchSize(10000)
            // Enable automatic schema initialization
            .withInitializeSchema(true)
            // Enable vector table validations for data integrity
            .withVectorTableValidationsEnabled(true)
            // Optional: Configure HNSW-specific parameters for performance tuning
            .withHnswEfConstruction(200)  // Higher = better recall, slower build
            .withHnswM(16)                // Higher = better recall, more memory
            .build()
    }
}
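
One detail worth calling out: withDimensions(1024) must match the embedding model you actually use (mxbai-embed-large produces 1024-dimensional vectors, nomic-embed-text produces 768). A small fail-fast check like the hypothetical one below catches a mismatch at startup; it assumes EmbeddingModel exposes dimensions(), which recent Spring AI versions do:

import org.springframework.ai.embedding.EmbeddingModel
import org.springframework.boot.ApplicationRunner
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class EmbeddingDimensionCheck {

    // Hypothetical fail-fast guard: the vector table is created with 1024 dimensions,
    // so the active embedding model must produce vectors of the same size.
    @Bean
    fun verifyEmbeddingDimensions(embeddingModel: EmbeddingModel) = ApplicationRunner {
        val actual = embeddingModel.dimensions()
        check(actual == 1024) {
            "Embedding model returns $actual-dimensional vectors, but the PGVector table expects 1024"
        }
    }
}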

Code Block 2: Document Processing

Document processing is the core business logic that transforms raw business documents into structured, searchable knowledge: ingestion, intelligent text chunking, embedding generation, and storage in a vector database. These steps form the knowledge base that lets the AI agent reason about your specific business context. The properties below configure the models behind that pipeline: mxbai-embed-large and nomic-embed-text generate high-quality embeddings from diverse formats, while gemma3 handles controlled, cost-efficient language generation with a temperature of 0.4 (lower values make answers more focused and predictable; higher values increase randomness and variability). The entire flow runs locally through Ollama, keeping data private and latency low, so the agent can deliver fast, context-aware responses grounded in your internal data.

spring.ai.ollama.base-url=http://localhost:11434
spring.ai.ollama.init.embedding.additional-models=mxbai-embed-large, nomic-embed-text
spring.ai.ollama.chat.options.temperature=0.4
spring.ai.ollama.chat.options.model=gemma3
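
These properties set project-wide defaults. If a single call needs different behavior (say, an even lower temperature for strictly factual answers), Spring AI also accepts options on the prompt itself. The sketch below assumes the OllamaOptions builder from the 1.0 line, so adjust the method names to your version:

import org.springframework.ai.chat.prompt.Prompt
import org.springframework.ai.ollama.OllamaChatModel
import org.springframework.ai.ollama.api.OllamaOptions

// Hypothetical per-request override: the property file keeps the defaults,
// while the options passed with the prompt win for this one call.
fun strictAnswer(chatModel: OllamaChatModel, question: String): String {
    val options = OllamaOptions.builder()
        .model("gemma3")
        .temperature(0.1) // stricter and more deterministic than the global 0.4
        .build()

    return chatModel.call(Prompt(question, options))
        .result.output.text
}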

Code Block 3: IngestionService RAG Implementation

This service demonstrates the core RAG (Retrieval-Augmented Generation) pipeline in its simplest, most effective form. The queryRAGKnowledge method performs vector similarity search to find relevant documents, joins the content into a context string, and creates a straightforward system prompt that instructs the AI to use only the retrieved information or respond “IDK :(” when uncertain. Using Ollama as the local LLM, it generates responses grounded in the business’s own data. The service also handles multi-format document ingestion (PDF, Markdown, Images) with consistent token splitting and vector storage, making it a complete knowledge management solution.

@Service
class IngestionService(
    private val vectorStore: VectorStore,
    private val pdfDocumentReader: PdfDocumentReader,
    private val markdownReader: MarkdownReader,
    private val imageReader: ImageReader,
    private val ollamaChatModel: OllamaChatModel
) {
    private val logger = LoggerFactory.getLogger(IngestionService::class.java)

    fun ingest(type: IngestionType) {
        when (type) {
            IngestionType.PDF -> ingestPdf()
            IngestionType.MARKDOWN -> ingestMarkdown()
            IngestionType.IMG -> ingestImage()
        }
    }

    private fun ingestPdf() {
        logger.info("Ingesting PDF using PdfDocumentReader component")
        pdfDocumentReader.getDocsFromPdfWithCatalog()
            .let { TokenTextSplitter().apply(it) }
            .let { vectorStore.add(it) }
        logger.info("PDF loaded into vector store")
    }

    // ...
    // ...
    // ...

    /**
     * Main RAG query method - retrieves similar documents and generates response
     * This is the core of the local AI agent's intelligence
     */
    fun queryRAGKnowledge(query: String): ResponseEntity<String> {
        // Step 1: Find similar documents from vector store
        val information = vectorStore.similaritySearch(query)
            ?.joinToString(System.lineSeparator()) { it.getFormattedContent() }
            .orEmpty()

        // Step 2: Create system prompt with retrieved information
        val systemPromptTemplate = SystemPromptTemplate(
            """
        You are a helpful assistant.
        Use only the following information to answer the question.
        Do not use any other information. If you do not know, simply answer: IDK :(

        {information}
        """.trimIndent()
        )

        // Step 3: Build prompt with context and user query
        val systemMessage = systemPromptTemplate.createMessage(mapOf("information" to information))
        val userMessage = PromptTemplate("{query}").createMessage(mapOf("query" to query))
        val prompt = Prompt(listOf(systemMessage, userMessage))

        // Step 4: Generate response using Ollama chat model
        return ollamaChatModel.call(prompt)
            .result
            .output
            .text
            .let { ResponseEntity.ok(it) }
    }
}

// Supporting Data Classes and Enums
data class ChatRequest(val message: String)
data class ChatResponse(
    val message: String,
    val sources: List<String>,
    val timestamp: LocalDateTime
)

enum class IngestionType {
    PDF, MARKDOWN, IMG
}
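
To expose the service over HTTP, a thin controller is enough. The wiring below is a hypothetical sketch (the repository may structure it differently) that reuses the ChatRequest data class above and delegates straight to queryRAGKnowledge:

import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.PostMapping
import org.springframework.web.bind.annotation.RequestBody
import org.springframework.web.bind.annotation.RequestMapping
import org.springframework.web.bind.annotation.RestController

@RestController
@RequestMapping("/api/chat")
class ChatController(private val ingestionService: IngestionService) {

    // Accepts {"message": "..."} and returns an answer grounded in the ingested documents.
    @PostMapping
    fun chat(@RequestBody request: ChatRequest): ResponseEntity<String> =
        ingestionService.queryRAGKnowledge(request.message)
}

From there, any HTTP client can post a question and get back an answer generated entirely on your own machine.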

Lessons learned / Dev tips

AI Is No Longer the Future, It’s on Your localhost

The performance gap between local and cloud models is shrinking rapidly. For many use cases, local models are “good enough” and come with significant advantages.

Privacy + Control Are Superpowers

Being able to guarantee data privacy opens doors with clients who previously wouldn’t consider AI solutions.

Spring AI + Kotlin = Clean Dev Experience

The Spring ecosystem’s maturity combined with Kotlin’s expressiveness creates a development experience that rivals Python for AI applications.

Start Simple

Don’t try to build GPT-5 on day one. Start with basic RAG, get it working, then add complexity.

Resources & What’s Next

All the code, configuration files, and sample documents are ready for you in my GitHub repository.

➡️ Official GitHub Repo: https://github.com/AlejoJamC/ssc-local-agent

Here’s what you need to get started:

  1. Install Ollama (curl -fsSL https://ollama.ai/install.sh | sh)

  2. Pull a model (ollama pull gemma3)

  3. Clone the repository

  4. Run ./gradlew bootRun

That’s it. No cloud setup, no API keys, no credit card required.

🎥 Watch the Live Demo
SSC Meetup Talk: Your First RAG with Spring AI — See the complete implementation in action with real-time Q&A.

📚 Official Documentation & Examples
Spring AI 1.0 GA Release — The official announcement with key features.
Awesome Spring AI Community Samples — Curated collection of Spring AI implementations.

🔍 Model Selection Tools
Artificial Analysis — Compare model performance, cost, and speed across providers.
LMArena Leaderboard — Community-driven model rankings and comparisons.
🤗 Open LLM Leaderboard — Hugging Face’s comprehensive model evaluation.

👉 What’s next?

Try building your own RAG system. Start with the GitHub repo, experiment with different models, and see what works for your use case. The barrier to entry has never been lower.

Have questions about the implementation? Want to discuss model selection strategies? Drop a comment below or connect with me on LinkedIn.

Remember: the best AI is the one you actually use. Sometimes that means going local.

References:

Liu, F., Kang, Z. and Han, X. (2024) ‘Optimizing RAG Techniques for Automotive Industry PDF Chatbots: A Case Study with Locally Deployed Ollama Models’, in Proceedings of 2024 3rd International Conference on Artificial Intelligence and Intelligent Information Processing, AIIIP 2024. New York, NY, USA: ACM, pp. 152–159. Available at: https://doi.org/10.1145/3707292.3707358.

Johnson, R. (2025) ‘Why you should use local models’, Medium, 30 May. Available at: https://medium.com/@springrod/why-you-should-use-local-models-a3fce1124c94 (Accessed: 6 July 2025).

Spring.io (2025) ‘Spring AI 1.0 GA Released’ [online image]. Available at: https://spring.io/blog/2025/05/20/spring-ai-1-0-GA-released (Accessed: 20 May 2025).

Mantilla Celis, J.A. (2025) ‘Build a Local AI Agent for Small Businesses’ [Local RAG implementation diagram], inspired by Ghosh, B. (2024) ‘Advanced RAG for LLMs & SLMs’, Medium, 21 April. Available at: https://medium.com/@bijit211987/advanced-rag-for-llms-slms-5bcc6fbba411 (Accessed: 18 May 2025).

Thanks for reading.