AI-powered insurance workflows: Operationalizing LLMs with EXL Insurance LLM™

Abstract

This white paper delves into the process of fine-tuning a Large Language Model (LLM) specifically for the Insurance industry, utilizing EXL’s Data, Domain, and Advanced AI capabilities.

It emphasizes the significance of adapting pre-trained LLMs to specialized fields that require deep reasoning, such as claims adjudication in insurance, which involves medical records. Fine-tuning enhances the ability of LLMs to perform tasks that require deep, domain-specific expertise. This white paper outlines the comprehensive steps involved: the collection, curation, and de-identification of data, with particular attention to handling both structured and unstructured data sources, the fine-tuning process, and implementation.

A key focus of the article is the use of Parameter-Efficient Fine-Tuning (PEFT) methods, particularly LoRA (Low-Rank Adaptation), which enable efficient model adaptation without compromising performance. The challenges involved in integrating domain language while maintaining model efficiency are also addressed. The article describes the implementation process, detailing the environment setup and the model training using AWS and NVIDIA resources. By leveraging multi-GPU configurations and advanced parallelism, the training process was optimized for performance and scalability.

The results presented show that the fine-tuned model significantly outperforms many general-purpose models across a variety of natural language processing tasks, including question answering (Q&A), tagging, summarization and reasoning. The article concludes by offering valuable insights into the intricacies of fine-tuning domain-specific models, illustrating the effectiveness of AI tools like NVIDIA NIM and NeMo for building scalable, high-performance LLMs tailored to specialized industries.

The EXL Insurance LLM™ is embedded into the MedConnection claims adjudication workflow, enabling AI-powered operations and demonstrating how operations can use AI solutions to drive real ROI. This paper showcases how EXL operations platforms are increasingly AI-powered by proprietary LLMs designed for domain-specific outcomes.

Introduction

Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP). These models, such as ChatGPT, Claude, and Gemini, are trained on vast open-source datasets available on the internet. They possess a broad understanding of language, making them suitable for general NLP tasks [6], [21]. However, these models often lack the specialized domain knowledge necessary for specific industries, such as insurance, healthcare, and finance. These sectors require an understanding of unique terminology, context, and regulatory requirements that general-purpose models may not capture effectively, as they do not have access to industry-specific use case data [5], [16], [22].

In contrast, industries like insurance have access to rich sources of private, domain-specific data that are not available on the open internet. Fine-tuning pre-trained LLMs offers a solution to this challenge. By adapting these models to domain-specific tasks, fine-tuning enhances their performance and enables them to better understand the intricacies of industry-specific domain knowledge [17], [18]. In the insurance sector, for example, fine-tuned LLMs can significantly improve tasks such as claims analysis, summarization, and question answering by aligning the model’s understanding with specialized insurance terminology, processes, legal language, and customer profiles [8], [26].

This white paper also presents a step-by-step guide through the fine-tuning process, from data curation and model selection to training and evaluation, illustrating the methodologies and challenges involved in adapting a pre-trained LLM for the insurance sector. The results of our research demonstrate that fine-tuning a model with domain-specific data leads to substantial improvements in performance, particularly in complex NLP tasks like summarization, reasoning, and Q&A. Furthermore, the use of advanced tools and techniques enables the optimization of LLMs for specialized applications, ensuring not only higher accuracy and relevance but also compliance with industry standards and regulations based on domain knowledge.

This work reflects a practical implementation that combines AI model customization with real-world insurance operations. Unlike theoretical LLM applications, this initiative was deployed, evaluated against live data, and benchmarked against general-purpose models to validate ROI and business impact.

This white paper aligns with EXL’s broader strategy of embedding AI into core workflows to power insurance operations with measurable ROI, using proprietary models customized for operational impact. It aims to provide practical insights into the fine-tuning process for domain-specific tasks, highlighting how LLMs can be adapted to meet the unique challenges of industries like insurance. It also highlights the benefits of leveraging innovative machine learning frameworks to create efficient and scalable solutions tailored to specific domain knowledge and industry requirements.

1. Literature review



2. Project initiation

a. Objectives and goals

The primary objective of the EXL Insurance LLM™ is to re-imagine the medical record injury evaluation process essential for claims adjudication. Traditionally, this task has been time-consuming and dependent on medical professionals’ manual intervention, which increases operational costs and introduces inconsistencies and potential errors. The goal is to leverage AI to automate the labor-intensive aspects of reviewing medical records, reducing reliance on human expertise while increasing efficiency, accuracy, and cost-effectiveness. By doing so, the EXL Insurance LLM™ aims to simplify and expedite the claims process, ensuring that decisions are made faster and more precisely.

The expected outcomes of implementing the EXL Insurance LLM™ are multifaceted:

  • Efficiency improvement: Automating routine tasks, such as summarizing medical records, tagging relevant data, and generating Q&A responses, will significantly reduce the time spent on claims assessments. This allows claim adjusters to focus on higher-level decision-making.
  • Enhanced accuracy: The AI-driven solution will ensure consistency and thoroughness in the claim review process, reducing potential errors or omissions that could compromise decision-making.
  • Cost reduction: By reducing the need for expensive medical professionals to review medical records manually, the solution will lower the overall cost of claims processing, helping insurance companies save money while improving operational outcomes.
  • Improved decision-making: The process provides actionable insights, such as negotiation guidance and clear summaries of medical records, helping claim adjusters make better and more informed decisions quickly and effectively.
  • Scalability: The automated nature of the solution ensures that it can manage increasing volumes of claims without a corresponding increase in costs or delays, making it highly scalable for insurance companies of all sizes.

b. Use Case

In a typical insurance claims process, claim adjusters and medical professionals must carefully review medical records to assess the validity of a claim and determine the appropriate payout. This review is often slow, manual, and prone to inconsistencies due to the diverse ways professionals interpret medical data. EXL Insurance LLM™ addresses these challenges by automating many tasks that traditionally require human involvement.

For instance, the AI solution automatically tags relevant data points, generates essential summaries from medical records, and provides Q&A capabilities that assist claim adjusters in quickly answering critical questions. Additionally, the system offers negotiation guidance to help adjusters effectively engage with claimants and determine fair settlements.

By automating these processes, the EXL Insurance LLM™ enhances the claims adjudication workflow, enabling faster and more consistent decision-making. This not only speeds up claim settlements but also improves the overall customer experience, while reducing the operational costs and risks associated with manual adjudication. This AI-powered approach is especially beneficial in high-volume environments, where human resources alone cannot keep pace with the demands of modern claims processing.

This embedded approach ensures the EXL Insurance LLM™ not only automates tasks but becomes an integral part of AI-powered insurance operations—driving both efficiency and real-time decision support across the claims lifecycle.



3. Data preparation

a. Data discovery

The claim-handling process has traditionally been entirely manual, requiring significant human effort. Before the development of the EXL Insurance LLM™, it depended heavily on the intensive work of the EXL operations team, working with a refined process and platform.

Current - human-intensive process overview

  • Sorting & indexing: The first stage of the workflow is uploading a claim to the portal by the client for a specific date of loss. Usually, a claim can have any number of pages depending on the complexity and length of the treatment. The EXL expert assesses all the pages and classifies the medical documents based on date of service, record type, provider (doctor), and facility (hospital) names to provide a chronology of records. They also deduplicate and pull out all the blank and irrelevant pages in this process.
  • Tagging: In the next stage, EXL medical professionals go through all the medical documents page by page and extract the relevant information in the form of tags with tag citations. Labeling the information and using the correct tag type is critical in this process. Missing any critical information can impact the claim adjusters’ decision-making process at a later stage. Clear and consistent tagging reduces the risk of overlooking vital details in medical records and can have a significant impact. For instance, if the insurer mistakenly compensates the claimant for treatment related to a pre-existing condition, they may overpay. Alternatively, if the pre-existing condition should not have been factored in, the claimant may receive less than they are entitled to.
  • Summary: In the next stage, the most experienced experts analyze the medical information and produce a clear, concise, and easy-to-understand summary utilizing the most significant tags. This condenses the information spread across hundreds or thousands of pages in a claim into a 10-20-page summary document. In the summary, the medical facts are organized and presented in an easily digestible format, e.g., with details of the injury, all the injuries claimed by the injured party, and the claim highlights. The accuracy of medical summaries is foundational to insurance adjusters’ efficient and effective decision-making. It makes the claims settlement process more accurate and fair for all parties involved.
  • Negotiation guidance: The final processing stage for medical professionals is to create the negotiation guidance, which provides the key information critical to the negotiation stage, in contrast to the existing summary, which offers a comprehensive overview of all medical documents related to the case. The claim is converted into a 1-2-page document highlighting the highly relevant information. It is separated into two main categories: Economic Damages, which represent the financial loss portion of the claim (past and future out-of-pocket expenses), and Non-economic Damages, which represent pain and suffering and are difficult to translate into dollars. The negotiation guidance is the reference document for claim adjusters and legal staff from the insurance company, backed by facts, to negotiate with a counterparty lawyer for a fair-value claim indemnity settlement.

The future state of the EXL Insurance LLM™ process overview

The initial stages of the pipeline remain unchanged, beginning with the uploading of claims, followed by human-assisted indexing and classification of each page according to record type, date of service, provider, and facility name. Subsequently, the medical records are meticulously processed page by page through the EXL Insurance LLM™.

  • Tags: Tags represent labeled pieces of information within a page, with each page potentially containing zero or multiple tags. Pages are classified based on record type and date of service, with each record type being associated with a predefined set of tags. These tags are instrumental in extracting relevant data required for generating summaries. There are forty-one distinct tags, each of which may appear multiple times within a claim. Furthermore, each tag has tag citations, which refer to the specific details associated with the tag name.
  • Summary: The summary is the second stage of the process, designed to expedite the claim process by assisting claim handlers and adjusters in making quick decisions. It highlights the most relevant facts from the claim standpoint. Multiple summary headers are linked to a specific set of tags. The summary takes input from the most significant tags and their corresponding tag citations. These dates help construct a narrative of the claim and its progression. This approach ensures the creation of summaries with highly specific and pertinent information for each claim.
  • Negotiation guidance: The third and final stage of the pipeline is Negotiation Guidance (NG). The EXL Insurance LLM™ analyzes the summary document to pull out the most pertinent information in a 1-2 page NG document. The NG is separated into economic and non-economic factors that focus on negotiating and settling a claim. The NG takes input from summary headers, and every summary header is mapped to at least one NG header.

Human intervention is incorporated at critical stages to avoid potential issues, as every small detail matters and each downstream stage depends on the one before it. Tag extraction is a challenging task, and no model can achieve 100% accuracy. Therefore, human involvement is introduced during the summary stage, where adjustments can be made with minimal effort. Experienced SMEs can identify gaps in the EXL Insurance LLM™ output by reviewing the summary and assessing the claims effectively. Since the NG is the final document, human expertise is also utilized to evaluate it as the outcome of the process. While the process used to take days when entirely manual, the integration of the EXL Insurance LLM™ has reduced the time to mere hours.

b. Data collection

During the data discovery phase of the EXL Insurance LLM™ project, the focus was on collecting and organizing data to tailor the model for the insurance industry. The process began with gathering structured data, organized into predefined tables and fields, facilitating efficient storage, searching, and analysis. Key datasets included insurance claims, which encompassed policy numbers, claim amounts, incident dates, and claim statuses, as well as customer profiles containing information such as names, addresses, policy types, and premium amounts. This structured data, spanning nine years and comprising over 13,500 records from various schema tables, was stored in databases, enabling easy querying and report generation.

In addition to structured data, unstructured data was also utilized. Unlike structured data, unstructured data does not adhere to a predefined format, presenting more complexity in interpretation. This category included handwritten documents, patient questionnaires, reviews, complaints, and free-form claims descriptions where policyholders provided detailed narratives about incidents or damages. Processing and interpreting this data required advanced techniques due to its varied formats, such as PDFs and text documents.

Dummy data was also created to ensure the model’s reliability and performance. This simulated dataset, developed with the collaboration of Subject Matter Experts (SMEs), served as a benchmark for evaluating the model’s performance against competitors and identifying areas for improvement. Integrating structured and unstructured data and leveraging benchmarking through dummy data established a robust foundation for training and refining the model within the insurance domain.

In addition to extracting structured data from databases, summary headers and negotiation guidance were evaluated within HTML files, and data was then extracted from these documents. This task proved challenging because each HTML file used varied header, table, and body formats. Furthermore, it was carried out under a tight deadline, adding a layer of complexity.

c. Data curation and preparation

The data used for the insurance domain was saved in PDF format, containing both machine-readable and scanned documents. To efficiently process these documents and extract text while preserving the context for the EXL Insurance LLM™ to understand, a systematic approach was followed:

Text Extraction Using OCR: Optical Character Recognition (OCR) technology was employed to extract text from scanned documents and PDFs. AWS Textract was used for its ability to accurately retrieve text and tables from documents while preserving positional coordinates at the line and word level [2], [20]. This preservation ensures that the context of the information is maintained, allowing the EXL Insurance LLM™ to interpret the text correctly. The extracted data and associated metadata were then prepared for further processing.
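As an illustration of this step, the following is a minimal sketch of line-level extraction with AWS Textract via boto3, assuming pages are staged in S3; the bucket, object key, and helper name are hypothetical and do not represent the production pipeline.

```python
# Minimal sketch only: line-level OCR extraction with AWS Textract via boto3.
import boto3

textract = boto3.client("textract", region_name="us-east-1")

def extract_lines(bucket: str, key: str) -> list[dict]:
    """Return LINE blocks with text and bounding-box coordinates preserved."""
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = []
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            box = block["Geometry"]["BoundingBox"]
            lines.append({
                "text": block["Text"],
                "left": box["Left"], "top": box["Top"],
                "width": box["Width"], "height": box["Height"],
            })
    return lines

# Example (hypothetical object): extract_lines("claims-bucket", "claim-123/page-001.png")
```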

Junk Page Detection and Removal: As OCR accuracy was not perfect, several junk pages were present in the data. To address this, a junk detection tool was developed using the madhurjindal/autonlp-Gibberish-Detector-492513457 model [10]. The tool identifies irrelevant pages by checking for the presence of tag citation words in the text. If fewer than 50% of the expected tag citation words were found on a page, it was flagged as junk. This step is crucial for optimizing the de-identification process and improving model performance by ensuring the removal of irrelevant or incorrect information.

Data Cleaning and Duplicate Removal: Standard data cleaning procedures were applied after identifying and removing junk pages. This included the removal of duplicate entries and entries with “N.A.” values, ensuring that the dataset remained clean and consistent for training purposes.
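The 50% citation-coverage heuristic described above can be sketched as follows; the word normalization and threshold handling are assumptions, and in practice this check was combined with the gibberish-detector model.

```python
import re

def is_junk_page(page_text: str, expected_citation_words: set[str],
                 min_coverage: float = 0.5) -> bool:
    """Flag a page as junk when fewer than `min_coverage` of the expected
    tag-citation words appear in the OCR text (threshold per the paper)."""
    page_words = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    if not expected_citation_words:
        return True  # nothing ties this page to the claim
    found = sum(1 for w in expected_citation_words if w.lower() in page_words)
    return (found / len(expected_citation_words)) < min_coverage

# Example: is_junk_page(ocr_text, {"cervical", "strain", "mri", "2019"})
```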

Handling Multiple Tags and Citations: A challenge arising from OCR processing was the occurrence of multiple tag citations for the same tag on a single page. Training data with a one-to-one mapping between tag citations and pages could confuse the model, resulting in inconsistent outputs. Additionally, a single page may contain multiple tags. To address this, the data was grouped by page, tag citation, and tags, combining them into a single row item. This approach ensures that each page is uniquely represented in the database, preventing inconsistencies and enhancing model accuracy.
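A minimal sketch of this page-level grouping, assuming the cleaned OCR output sits in a pandas DataFrame with hypothetical columns page_id, tag, and citation:

```python
import pandas as pd

def group_tags_by_page(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse multiple tag/citation rows into a single row per page so each
    page is uniquely represented in the training data."""
    return df.groupby("page_id", as_index=False).agg(
        tags=("tag", lambda s: ", ".join(sorted(set(s)))),
        citations=("citation", lambda s: " | ".join(sorted(set(s)))),
    )
```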

By following this process, the challenges posed by OCR data were effectively addressed, and the dataset was prepared for efficient training, ensuring that the model learned accurate and relevant information.

4. Data de-identification

Due to privacy concerns, the preprocessed data could not be directly used for training purposes. To prevent the EXL Insurance LLM™ from inadvertently learning sensitive health information and to minimize privacy and security risks, strict de-identification procedures were followed, ensuring compliance with HIPAA (Health Insurance Portability and Accountability Act of 1996) safe harbor guidelines [1]. De-identification posed one of the most significant challenges in this process.

The initial step in the de-identification process involved removing junk values from the data using a junk detector, which optimized the identification of Personally Identifiable Information (PII) and Protected Health Information (PHI). Several de-identification tools were evaluated based on their PII and PHI extraction capabilities, each assessed on its ability to effectively identify and remove sensitive health information in compliance with HIPAA standards [15]. After thorough evaluation, the tool that demonstrated the best overall performance and alignment with regulatory requirements was selected as the primary de-identification solution. After de-identification, the sensitive information was converted into hash values using the SHA-256 hashing algorithm [11], ensuring that sensitive data remained protected.

Challenges arose during this process, particularly with overmasking and inconsistency when converting names into hash values. Since the dataset contained full names, partial names, and surnames, maintaining consistency in hashing was essential to ensure the model could recognize references to the same individual despite the masking. To address this, a sequential masking approach was implemented for names. If a name had already been hashed, the system checked the database and applied the same hash value to subsequent instances, ensuring consistency in how the individual was represented.
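A minimal sketch of this consistent-masking step, assuming an in-memory lookup in place of the production database; the pseudonym format and truncation are illustrative only.

```python
import hashlib

# In-memory stand-in for the production lookup database (assumption).
_name_registry: dict[str, str] = {}

def mask_name(name: str) -> str:
    """Return a stable SHA-256-based pseudonym: the same individual always
    receives the same token, keeping references consistent after masking."""
    key = " ".join(name.lower().split())                 # normalize case/spacing
    if key not in _name_registry:
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        _name_registry[key] = f"PERSON_{digest[:12]}"    # truncated for readability
    return _name_registry[key]

# mask_name("John A. Smith") and mask_name("john a. smith") yield the same token.
```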

Another challenge involved ensuring that doctor names, which do not qualify as PII/PHI, were not mistakenly de-identified. Identifying doctor names proved difficult without specific prefixes like “Dr.” To solve this, a list of common name prefixes (e.g., “Dr.”) was created and stored in the database. When a name matched one of these prefixes, a fuzzy matching process compared it with doctor names already in the database. If the match score was sufficiently high, the doctor’s name was retained without hashing, avoiding unnecessary de-identification.
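The prefix-plus-fuzzy-match check might look like the following sketch, using Python’s standard difflib in place of whichever fuzzy-matching library was actually used; the prefix list and match threshold are assumptions.

```python
from difflib import SequenceMatcher

DOCTOR_PREFIXES = ("dr.", "dr ")          # assumed prefix list

def is_known_doctor(name: str, known_doctors: list[str],
                    threshold: float = 0.85) -> bool:
    """Return True when a prefixed name fuzzy-matches a doctor already in the
    reference list, so it can be left unmasked."""
    lowered = name.lower().strip()
    if not lowered.startswith(DOCTOR_PREFIXES):
        return False
    candidate = lowered.split(" ", 1)[-1]  # drop the prefix
    return any(
        SequenceMatcher(None, candidate, doc.lower()).ratio() >= threshold
        for doc in known_doctors
    )

# is_known_doctor("Dr. Jane Doe", ["jane doe", "john roe"]) -> True
```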

Special care was also required for address-related information linked to doctors. If an address was associated with a doctor’s name, it was crucial not to mask it, as such information was essential for tasks like generating summaries. Masking everything would have obscured more than 50% of the page, impairing the model’s ability to process and understand the data. Therefore, doctor-related addresses were kept visible while still adhering to HIPAA standards.

Additionally, fields like dates of visits and doctor email addresses, which do not fall under HIPAA guidelines, were being masked by AWS Comprehend. To avoid over-masking, an end-to-end pipeline was developed to filter out non-sensitive information, ensuring that only actual PII/PHI was de-identified.

Random sampling was conducted to validate the accuracy of the de-identification process, and subject matter experts (SMEs) were consulted. SMEs reviewed the randomly selected samples to identify any entries that may have been overlooked during de-identification. This review process helped ensure that no sensitive information was missed, and that the de-identification was thorough and accurate, while preserving the integrity and usefulness of the data for model training.




5. Infrastructure

The training infrastructure is designed to address unique needs based on the number of nodes available and specific use cases. Two setups are provided: one for single-node training using a custom SageMaker container with NVIDIA technology and another for multi-node training using an Amazon EKS cluster with powerful NVIDIA H100 GPUs. Both configurations ensure high performance, scalability, and efficient data management [14].

a. Single-node training setup

For scenarios where single-node training is adequate, a custom SageMaker container has been developed that integrates the NVIDIA NeMo Framework for model fine-tuning [29]. This container is optimized to work seamlessly with the NVIDIA AI stack, leveraging CUDA, NCCL, and cuDNN for high-performance computation [19]. The setup integrates smoothly with AWS S3, enabling efficient data access and management of model checkpoints throughout the training cycles. Using AWS GPU instances achieves optimal hardware performance, leading to faster and more efficient training. Additionally, advanced parallelism techniques, such as Tensor Parallelism and Data Parallelism, are employed to enhance training efficiency and scalability [4].

b. Key features of the single-node training setup:

  • Custom SageMaker Container: Integrates NVIDIA NeMo and the NVIDIA AI Stack (CUDA, NCCL, cuDNN) for model fine-tuning.
  • AWS S3 Integration: Ensures smooth data access and efficient management of model checkpoints.
  • AWS GPU Instances: Utilizes NVIDIA GPU instances (p4d, p5) to optimize hardware performance.
  • Advanced Parallelism Techniques: Implements Tensor Parallelism and Data Parallelism [30] to enhance training efficiency and scalability. Optimized for NVIDIA Tensor Cores and Mixed Precision Training for faster convergence.

c. Multi-node training setup:

For large-scale distributed training, an Amazon EKS cluster [13] is provisioned with two p5 instances, each equipped with 8 NVIDIA H100 GPUs [31]. This setup is ideal for handling extensive datasets and complex models that require significant computational power. An Elastic File System (EFS) volume is established to provide shared access to datasets, checkpoints, and training states, ensuring consistency and reliability across the training nodes. A Helm chart [27] is deployed to manage and orchestrate distributed training jobs, streamlining resource allocation and job scheduling. This approach leverages NVIDIA’s Deep Learning Accelerator (DLA) and NVIDIA NVLink to optimize inter-GPU communication.

d. Key features of the multi-node training setup:

  • Amazon EKS cluster: Equipped with two p5 instances, each featuring 8 NVIDIA H100 GPUs for large-scale distributed training.
  • Elastic File System (EFS): Provides shared access to datasets, checkpoints, and training states for consistency and reliability.
  • Helm chart deployment: Manages and orchestrates distributed training jobs efficiently, streamlining resource allocation and job scheduling.
  • Advanced parallelism: Utilizes Pipeline Parallelism, Tensor Parallelism, and Data Parallelism with NVIDIA NCCL for high-efficiency communication.
  • NVIDIA NVLink: Enables high-bandwidth inter-GPU connectivity.
  • High-demand scenarios: Ideal for handling extensive datasets and complex models, supporting rapid processing and scalability.
  • Sophisticated model training: Suited for training complex models with large datasets.

Both setups are designed to leverage the strengths of AWS infrastructure, ensuring high performance, reliability, and scalability. The training pipeline is optimized for precision and efficiency, whether for smaller tasks using a single-node solution or for more demanding projects requiring a multi-node setup.

e. Custom enhancements and bug fixes:

During integration, the NVIDIA NeMo framework was modified to resolve compatibility issues and optimize performance within the AWS environment. MLflow was integrated for real-time monitoring of training metrics, and multi-run capabilities were enabled to allow flexible and scalable fine-tuning.

6. Data ingestion and preprocessing pipeline

a. Data preparation and transformation for fine-tuning domain-specific large language models

The data preparation process involved several key steps to ensure that the fine-tuned Large Language Model (EXL Insurance LLM™) could effectively handle specific tasks within the insurance domain. At this stage, a set of unique pages was mapped with tags and tag citations, and all pages were de-identified to prevent data leakage, which was verified by subject matter experts (SMEs). However, this data could not be directly used for training as it required a clearly defined task. The decision was made to focus on the first task: Question and Answer (Q&A).

Q&A task:

For the Q&A task, SMEs were engaged to create questions for each tag, with the corresponding tag citation as the answer. The assumption behind this task was that the model would learn patterns from the page text and subsequently extract key information. As all tags represent essential details within the text, generic questions were constructed based on the definitions of these tags. These questions provided a foundation for the model to learn from.

Tag and tag citation extraction:

Given that claims could span 500+ pages, processing the entire claim in a single pass was not feasible. Therefore, a page-by-page inference approach was adopted. To prepare data for tag extraction, a prompt was designed that combined tag definitions with general-purpose instructions. Some categories contained as many as thirty-six tags mapped to a specific record type, so static buckets of 5-6 tags were created to simplify processing. This approach ensured efficient processing of each page while capturing all relevant information.
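A minimal sketch of the bucketing and prompt construction described above; the prompt wording and the tag-definition structure are illustrative assumptions, not the production prompts.

```python
def make_buckets(tags: list[str], bucket_size: int = 6) -> list[list[str]]:
    """Split a record type's tag set into static buckets of at most `bucket_size`."""
    return [tags[i:i + bucket_size] for i in range(0, len(tags), bucket_size)]

def build_prompt(page_text: str, bucket: list[str],
                 tag_definitions: dict[str, str]) -> str:
    """Combine general instructions with the definitions of one tag bucket."""
    defs = "\n".join(f"- {t}: {tag_definitions[t]}" for t in bucket)
    return (
        "Extract the following tags from the medical-record page. "
        "Return 'N.A.' for any tag that is not present.\n"
        f"Tag definitions:\n{defs}\n\nPage:\n{page_text}"
    )
```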

Summary:

The summary task was a critical step in the data pipeline. SMEs were tasked with creating clear instructions for each summary header, guiding the EXL Insurance LLM™ to extract relevant information from the tags associated with each header. Each summary header was derived from a combination of different tags, and the instructions were structured to ensure the model would correctly summarize the information.

Negotiation guidance:

The final task in the pipeline was Negotiation Guidance, which was similar to the summary task in terms of data preparation. This task involved a combination of different summary headers, and SMEs were again engaged to craft specific instructions to guide the model’s output. Since this was the final stage, output formatting was emphasized so that downstream scripts could consume the output with minimal human intervention.

b. Data transformation for training

At this stage, all the data required for training was prepared. The Alpaca format was chosen for data structuring, which organizes the data in JSON format with system prompts, user prompts, and assistant prompts, similar to question-answer pairs. The data preparation process varied slightly for each use case.
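A minimal sketch of one training record in this Alpaca-style layout; the field names follow the common Alpaca convention (instruction/input/output) and are an assumption rather than the exact EXL schema.

```python
import json

def to_alpaca_record(system_prompt: str, user_prompt: str, answer: str) -> str:
    """Serialize one instruction-response pair as a JSON line."""
    record = {
        "instruction": system_prompt,  # task-level guidance
        "input": user_prompt,          # page text, tags, or summary headers
        "output": answer,              # expected tag citation, summary, or NG text
    }
    return json.dumps(record, ensure_ascii=False)

# Example:
# to_alpaca_record("Answer using only the supplied page text.",
#                  "Page: ...\nQuestion: What is the date of the incident?",
#                  "03/14/2018")
```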

Q&A data transformation:

For the Q&A task, the data consisted of one question paired with a corresponding answer. Each question was based on a tag, with the tag citation as the answer. However, some tags appeared multiple times across different pages, meaning a single tag could have multiple tag citations. To prevent confusion during training, all tag citations for each tag were combined into a single answer. The question was kept generic to ensure it could address all variations of answers. To further enhance model performance, variations in the questions were introduced, helping prevent bias from the model learning to answer only the exact questions on which it was trained.

Tag extraction data transformation:

Preparing the data for tag extraction was one of the more challenging tasks. Human annotators typically mark information at the claim level, but the data needed to be processed at the page level for inference. A reverse-engineering approach was employed to address this challenge.

Each tag citation was iterated across all pages to determine if the citation appeared on a given page. This method allowed for mapping each tag and its corresponding citation to the appropriate page. Additionally, SMEs defined and verified each tag.

Summary data transformation:

For the summary task, tags were initially mapped at the claim level. SMEs were engaged in mapping summary headers to groups of tags. Each summary header was designed to include information such as the provider and facility names and the Date of the Incident. Including provider and facility names ensured the output adhered to the correct format. At the same time, the Date of the Incident helped the model extract relevant information based on time references in the summary headers. Variations in the summary header instructions were created to ensure the model could generate outputs in the desired format.

Negotiation guidance data transformation:

The dataset for Negotiation Guidance (NG) was smaller, only five hundred claims compared to over 14,000 for other tasks. The NG task involved ten distinct NG headers, each a combination of summary headers. To prepare the data, the order of the summary headers was varied to create different combinations. Variations in the instructions were also generated to ensure the model could handle different sequences of summary headers. Since this was the final stage in the pipeline, a strong focus was placed on providing consistent output formatting to prevent issues with processing scripts. The goal was to minimize the need for multi-turn conversations or inference loops, which could increase cost and complexity.
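A minimal sketch of generating header-order variants; sampling a handful of random shuffles rather than enumerating all permutations is an assumption made to keep the example simple.

```python
import random

def header_order_variants(headers: list[str], max_variants: int = 3,
                          seed: int = 0) -> list[list[str]]:
    """Return the original header order plus up to `max_variants` distinct shuffles."""
    rng = random.Random(seed)
    variants, seen = [list(headers)], {tuple(headers)}
    for _ in range(max_variants * 10):          # bounded number of attempts
        shuffled = list(headers)
        rng.shuffle(shuffled)
        if tuple(shuffled) not in seen:
            seen.add(tuple(shuffled))
            variants.append(shuffled)
        if len(variants) > max_variants:
            break
    return variants
```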

7. EXL Insurance LLM™ training

The model fine-tuning process began with the NeMo framework, chosen for its ability to utilize tensor and pipeline parallelism, accelerating training [12]. The infrastructure included two H100 GPUs, each with 80GB of memory, to expedite fine-tuning. The NeMo framework served as a wrapper, allowing configuration updates without focusing on minor details, thus enabling attention to the broader task. Model logs were stored in MLflow, as it integrates well with NeMo and offers ease of use.
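The MLflow logging pattern can be sketched as follows; the experiment name, run name, and parameter values shown are placeholders reflecting this section’s description, not the actual training code.

```python
import mlflow

mlflow.set_experiment("exl-insurance-llm-finetuning")        # placeholder name

with mlflow.start_run(run_name="lora-negotiation-guidance"): # placeholder name
    # Parameter values reflect this section's description; exact values assumed.
    mlflow.log_params({"base_model": "llama-3.1-8b-instruct",
                       "lora_rank": 16, "lora_alpha": 32})
    for step, loss in enumerate([2.1, 1.7, 1.4]):            # placeholder losses
        mlflow.log_metric("train_loss", loss, step=step)
```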

Given the well-defined task and labeled data, Supervised Fine-Tuning (SFT) [9] with Parameter-Efficient Fine-Tuning (PEFT) [28] was selected. Instead of continued pre-training, the decision was made to fine-tune LoRA weights, preserving the base model’s reasoning abilities [7]. The open-source model chosen for this task was Llama 3.1-8B-Instruct, which was small, cost-effective for inference, and featured a larger vocabulary and higher context window, contributing to improved results [25].

The fine-tuning process began with the primary use case of Negotiation Guidance. Due to a limited dataset, 10% was supplemented with Q&A and partial summary data for variety. After experimenting with different LoRA parameters, a 16-dimensional, 32-rank configuration was selected, focusing on training the Q, K, and V layers. System, user, and assistant prompts were designed to guide the training. The inclusion of Q&A and summary data helped finalize the LoRA parameters.
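For illustration, the following sketch reproduces an equivalent LoRA configuration with the Hugging Face PEFT library rather than NeMo, which the project actually used. Interpreting the configuration as rank 16 with scaling factor 32, and mapping the Q, K, and V layers to Llama’s q_proj, k_proj, and v_proj modules, are assumptions.

```python
# Equivalent LoRA setup sketched with Hugging Face PEFT (the project itself
# used NVIDIA NeMo). Rank/alpha interpretation and module names are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct"      # gated model; requires HF access
)
lora_cfg = LoraConfig(
    r=16,                                   # low-rank dimension (assumed)
    lora_alpha=32,                          # scaling factor (assumed)
    target_modules=["q_proj", "k_proj", "v_proj"],  # Q, K, V attention layers
    lora_dropout=0.05,                      # assumed value
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)      # only adapter weights are trainable
model.print_trainable_parameters()
```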

Separate experiments were conducted with Q&A data due to its size and complexity, providing valuable insights for tag extraction training. After refining the LoRA parameters, training proceeded with improved instructions. For the Negotiation Guidance use case, it became clear that focusing on instructions after the first round of training was the best approach, given the small dataset. Root cause analysis (RCA) was performed with the assistance of subject matter experts (SMEs) to enhance the instructions. Inference tests showed that better results were achieved in some cases, prompting further LoRA training to refine accuracy. A similar process was followed for tag extraction, summaries, and the Negotiation Guidance use case.

8. EXL Insurance LLM™ inference

a. Optimized inference pipeline for scalable model deployment

NVIDIA’s Inference Platform (NIM) was integrated for model deployment with AWS GPU instances to provide high-performance, low-latency predictions. The infrastructure was designed to be flexible and adapt to varying workload demands to ensure optimal performance across different use cases.

b. Single-node inference setup

AWS EC2 g5.24xlarge instances were utilized for smaller-scale inference requirements, each equipped with four A10 GPUs. This setup leverages TensorRT-LLM [32], the NVIDIA Triton Inference Server, and CUDA to ensure that inference is executed with high throughput and minimal latency for real-time applications; a client-side sketch of calling the deployed endpoint follows the key features below.

Key features:

  • Infrastructure: Single-node inference on EC2 g5.24xlarge instances with 4 A10 GPUs.
  • Performance: High throughput and low-latency predictions utilizing NVIDIA TensorRT-LLM and Triton [24]
  • Scalability: AWS infrastructure enabled seamless scaling per workload demands, ensuring flexibility in resource management.
  • CUDA and Tensor Cores for efficient computation
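The client-side sketch referenced above, assuming the deployed NIM endpoint exposes an OpenAI-compatible chat completions API; the URL, model name, and prompt are hypothetical placeholders.

```python
import requests

NIM_URL = "http://nim-gateway.internal:8000/v1/chat/completions"  # placeholder

def ask_model(page_text: str, question: str) -> str:
    """Send one page-level question to the deployed model and return the answer."""
    payload = {
        "model": "exl-insurance-llm",        # placeholder model name
        "messages": [
            {"role": "system", "content": "Answer only from the supplied page."},
            {"role": "user",
             "content": f"Page:\n{page_text}\n\nQuestion: {question}"},
        ],
        "temperature": 0.0,
        "max_tokens": 256,
    }
    resp = requests.post(NIM_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```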

c. Multi-node inference setup

For larger-scale inference, a multi-node deployment was implemented using Amazon Elastic Kubernetes Service (EKS) [33]. This setup involved two EC2 g5.24xlarge instances, each with four A10 GPUs, providing a robust solution for distributed inference. The multi-node architecture facilitated efficient load balancing and reduced latency under high-demand conditions, ensuring scalability and seamless integration with NVIDIA NIM and AWS GPU instances.

Key features:

  • Infrastructure: Multi-node deployment on EKS with two g5.24xlarge EC2 instances, each equipped with 4 A10 GPUs.
  • Performance: Optimized for scalable, distributed inference with low-latency performance, even under heavy load.
  • Integration: Seamless integration of NVIDIA NIM with AWS GPU instances, ensuring high availability and efficient resource utilization.

d. Custom enhancements and bug fixes

In the inference phase, a significant focus was placed on integrating the NVIDIA stack with the AWS infrastructure. This integration ensured scalability and peak performance were achieved across multiple nodes. Optimization efforts focused on improving GPU resource utilization, managing real-time workloads, and addressing system-level bugs to enhance stability and performance.

e. Optimized NIM profiles

Multiple NIM profiles were explored and tested as part of the research and optimization strategy. These profiles were selected based on specific workload characteristics, ensuring the infrastructure was optimized for the training and inference phases. This approach allowed for fine-tuning configurations, providing the best balance between performance and cost-efficiency for each stage of the model lifecycle.

f. Conclusion

The optimized inference pipeline, utilizing NVIDIA’s Inference Platform and AWS GPU instances, delivers high-performance, low-latency, scalable predictions that meet diverse workload demands. Integrating single-node and multi-node setups ensures flexibility for varying use cases, while custom enhancements and optimized NIM profiles ensure peak performance across distinct phases of model deployment. This infrastructure setup provides an effective, cost-efficient solution for deploying machine learning models at scale.

9. Evaluation

a. Comparison of results from fine-tuned model vs various base models

The evaluation and benchmarking of the EXL Insurance LLM™ was conducted in two thorough phases, where it was compared to some of the most advanced models in the industry. For the initial phase, Llama 3.1 Instruct 8B was used as the base model, serving as a benchmark for performance. Claude Sonnet 3.5 was also included in the comparison since it had shown robust performance in the existing processes within EXL. Several larger and more sophisticated models were tested to provide a more challenging comparison, including Llama 3.1 Instruct 70B, Gemini 1.5 Pro, and OpenAI O1. This diverse selection of models ensured the benchmarking process was comprehensive and robust.

To eliminate any potential biases and ensure a fair and objective comparison, the evaluation was carried out using two distinct approaches: a manual, SME-driven evaluation and an automated evaluation that focused on key performance metrics.

The evaluation process was carefully designed to ensure accuracy and fairness. A sample relevant to the insurance use case was selected for testing. To protect privacy, any Personally Identifiable Information (PII) or Protected Health Information (PHI) within the sample was initially masked using a specific technique. Once masked, the PII/PHI was replaced with dummy data, and human reviewers performed checks to confirm that no real PII/PHI was overlooked during this process.

Following this, three Subject Matter Experts (SMEs) independently evaluated the models. Each SME rated the models on a scale of 1 to 5, where 1 represented the lowest performance, and 5 represented the highest. In addition to their ratings, the SMEs provided valuable, tactical feedback that helped identify areas for improvement and contributed to enhancing the model’s performance. To ensure unbiased assessments, the SMEs were unaware of which model they were reviewing, allowing their ratings and feedback to reflect only the models’ performance without any external influence.


This benchmarking effort was not part of a public leaderboard or external academic publication. It was conducted internally to validate the operational performance of the EXL Insurance LLM™ against industry benchmarks and to inform future deployment decisions based on real-world ROI and workflow integration.

The automated evaluation measured important metrics like BLEU, ROUGE, BERT, and METEOR scores [3], [23], providing a quantitative analysis of the model’s performance. This combination of manual and automated methods ensured that the evaluation was thorough and reliable, providing a well-rounded picture of the EXL Insurance LLM™ capabilities and performance compared to other leading models.
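A minimal sketch of computing these metrics, assuming the nltk, rouge-score, and bert-score Python packages; the reference and candidate strings are invented placeholders, not project data.

```python
# Requires: pip install nltk rouge-score bert-score
# plus nltk data: nltk.download("wordnet") for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Claimant sustained a cervical strain on 03/14/2018."    # placeholder
candidate = "The claimant suffered a cervical strain on 03/14/2018." # placeholder

bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True).score(
    reference, candidate
)
meteor = meteor_score([reference.split()], candidate.split())
_, _, f1 = bert_score([candidate], [reference], lang="en")

print({"bleu": bleu, "rougeL_f": rouge["rougeL"].fmeasure,
       "meteor": meteor, "bert_f1": float(f1.mean())})
```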



10. Conclusion

This paper explores the process and importance of fine-tuning a Large Language Model (LLM) to tackle challenges specific to the insurance industry using advanced tools such as NVIDIA NeMo and NVIDIA’s Inference Platform (NIM). It demonstrates how pre-trained LLMs can be adapted for specialized industries through careful data handling, de-identification, and the application of efficient fine-tuning techniques like Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation). The study emphasizes how a well-prepared and diverse dataset, along with the right choice of model and fine-tuning strategy, can significantly improve the performance of LLMs in handling tasks like question answering (Q&A) and summarization. The findings show that fine-tuning a model with domain-specific data enhances its accuracy and relevance for tasks in the insurance sector.

Our results demonstrate that the fine-tuned model outperformed several leading industry models, including Claude 3.5 Sonnet, GPT-4, Gemini 1.5 Pro, and non-fine-tuned models like Mistral 7B and Llama 3.1. The fine-tuned model showed superior performance in key evaluation metrics such as BLEU, ROUGE, BERT, METEOR, and subject matter expert (SME) assessments. These improvements were particularly noticeable in complex tasks like summarization and Q&A, where the domain-specific model provided much higher accuracy and relevance.

The insights from this research also highlight the importance of utilizing advanced technologies such as multi-GPU parallelism, cloud-native services, and specialized training techniques to ensure that fine-tuning is scalable, efficient, and secure. Additionally, data management best practices, such as HIPAA-compliant de-identification processes, are essential for maintaining privacy and security in sensitive sectors like insurance and healthcare.

With the infrastructure, training, and fine-tuning pipeline now mature, this solution is ready for scaled deployment within EXL workflows and sets the stage for additional domain-specific models across healthcare, finance, and other regulated sectors.

The EXL Insurance LLM™ is not a standalone tool but a solution embedded into core insurance workflows. Its ability to power Negotiation Guidance at scale is already demonstrating ROI by improving outcomes in claims settlement and accelerating the decision-making process.

In conclusion, this study demonstrates the significant potential of fine-tuning LLMs for specialized applications. It shows how leveraging modern tools and techniques can optimize model performance while adhering to industry standards. The findings set a precedent for future advancements in domain-specific LLMs and offer a roadmap for their application in other specialized fields.

 

Prepared by

EXL

Authors and contributors:

Anand Logani 
Executive Vice President and Chief Digital and AI Officer

Dr. Raunak N Rathi 
Sr. Manager - Data Scientist, EXL

Tanisha Rao 
Data Scientist, EXL

Dr. Solmaz Torabi Ardakani 
Head of R&D, EXL

Gaurav Iyer 
SVP, Global Leader for AI Solutions & Digital Strategy

Dr. Kiran Thakur 
Subject Matter Expert (Manager), EXL

Parul Tripathi 
AI Solutions Architect, EXL

Deepam Parmar 
Senior AI Engineer, EXL

Manish Singh Mahra 
Project Manager

Praroop Bhatt 
Database Specialist

Ashish Kudaisya 
Product Owner

Arturo Devesa 
VP, Chief AI Architect, Insurance LLM project lead

Adolfo E Canovi 
CTO Insurance Platform Services