The Coalition Against Insurance Fraud (CAIF) estimates that insurance fraud costs insurers $80B annually. In the Property & Casualty (P&C) sector alone, fraud-related losses are estimated to be $30B annually. While fraud is constantly evolving and affects all types of insurance, the most common types in terms of frequency and average cost are automobile insurance, workers’ compensation, and health and medical insurance.

With advancements in the field of computing, analytics is now playing a greater role in helping insurance companies identify fraudulent claims; in some cases identifying areas of concern even before the fraud occurs. The role data plays in today’s market varies for each insurer, as each weighs the cost of improving information systems against losses caused by fraud.

Using advanced analytics, it is possible to implement efficient fraud detection strategies. This paper demonstrates how and why advanced analytics can assist in identifying and decreasing the number of fraudulent claims.

Advanced analytics in fraud detection

Insurance fraud poses a major threat to both insurers and policyholders. Because fraud creates significant financial costs, insurers pass these costs on to policyholders in the form of increased premiums. As a result, some consumers and businesses that could earlier purchase some kind of insurance coverage are now driven out of the market.

Who is committing fraud? Anyone, including applicants, policyholders, third-party claimants and professionals who provide services and equipment to claimants, could be committing fraud.

Fraudulent claims come in many forms. These include:

  • Requests for coverage of injuries or damage that never occurred, services never rendered or equipment never delivered

  • Inflated claims

  • Staged accidents

  • Misrepresentation of facts on an insurance application

With fraudsters using sophisticated techniques that make it difficult for insurers to tell fraudulent claims from honest ones, there has recently been a rapid upsurge in fraudulent activities. According to a survey released in September 2013 by FICO, 35 percent of insurers estimated that insurance fraud costs represent 5-10 percent of their total claims, while 31 percent said the cost is as high as 20 percent. Ever-changing economic circumstances pose yet another challenge for insurers trying to deal with fraud. Businesses and individuals are showing an increased investment appetite, and this money is now falling into the hands of fraudsters.

To combat fraud, it is imperative for companies to invest in fraud monitoring programs. As a result, insurers are relying heavily on data analytics for fraud prevention, detection and management. Traditionally, fraud detection strategies focused on identifying fraud after the claim was paid. However, it is easier to reduce losses if the fraud is identified before the claim is paid.

The first step towards fraud prevention is to employ techniques such as predictive modeling and claims scoring that uncover fraud before a payment is made. Since insurers have large amounts of data, it makes sense to use advanced analytics techniques to evaluate internal and external data and identify the claims with a higher propensity of being fraudulent. By detecting patterns and anomalies in a large database, analytical tools can determine the probable characteristics of a fraudster and whether a claim needs further investigation. Advanced analytics techniques like logistic regression and Gradient Boosting Model (GBM) can often identify potential cases of fraud at the First Notice of Loss (FNOL), instead of waiting weeks or months for an adjuster to review a claim. This is just one of the ways in which advanced analytics can be used. Since it is capable of assessing the propensity of an individual to engage in fraud before the fraud has occurred, it can also be used in the sales and underwriting process before a policy is sold. The use of advanced analytics has reportedly decreased losses incurred due to fraud for a few carriers by 20 to 50 percent.

One of the biggest challenges in limiting fraud loss has been the need for significant resources to examine the high volume of claims. In our experience, insurers only have the capability of reviewing 5 to 10 percent of all the claims in a financial year, and these are typically pulled based on less than selective criteria. But what if you were able to pull the right claims? That’s where advanced analytics comes in.

Analytics framework for detection of fraud claims

Data analytics can add incremental value in identifying fraud at each stage of the claim process. These stages include:

  • First Notice of Loss (FNOL) is the stage at which the claimant first notifies the insurer that a loss has occurred.

  • First Contact (FC) is the stage at which the insurer contacts the claimant, after FNOL, asking for more information about the loss that has occurred.

  • Ongoing (OG) is the continuous back and forth of information between the claimant and insurer after FC until the claim is closed.

This flow of information makes it more relevant to carry out a robust identification of potential fraud right at the first stage; the next two stages can be used to appropriately lead investigations in a particular direction.

To demonstrate how this works, we present a case study of a U.S.-based insurer’s plausible fraud prediction project, along with the stair-step approach to data analytics that we propose. Please note that the insurer data for the case study spans from 2005 to mid-2013 and pertains to property in commercial and personal lines of business.

Step 1: Collating the right data

Regardless of the insurance product, there are multiple facets in play, which makes it important to find the “right data”. To uncover the factors indicating fraudulent behavior, an exhaustive data sourcing exercise was undertaken, considering both internal and external data. Internal data comprised information on attributes around claims, policies, customers, claimants, etc., while external data (i.e., data not captured by the insurer) was gathered from different sources. This included data such as regional demographics, industry-accepted standard scores, information on the weather conditions that prevailed when the loss occurred, as well as information on catastrophes that may have occurred during the time period of interest.

In this particular case, there were 21 internal data tables and 13 external data tables, covering more than 1,500 variables. These data tables were woven together at a claim level to create a ‘master dataset’. The variables were classified depending on their availability at the individual claim stages and made available accordingly for the three stages of model building. Of the 1,500 variables, nearly 100 were created using text-mining techniques, which can be a cumbersome process as it generates volumes of data, mostly in the form of text fields. Text-mining tools are used to convert unstructured data into structured fields, which can be used along with conventional data fields. A 3-gram text-mining approach, which uses three-word phrases, was used to identify the variables because word phrases have higher predictive power than individual words. It was found that all variables showing up as significant in the First Contact (second) stage of the model building exercise had their origins in unstructured data.
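As a rough illustration of how three-word phrases in a text field can be turned into structured variables, the following minimal sketch builds one 0/1 indicator per frequent 3-gram. The function names, the `min_count` cut-off and the sample loss descriptions are assumptions made for illustration; they are not taken from the case study, which used dedicated text-mining tools.

```python
import re
from collections import Counter

def trigrams(text):
    """Return the three-word phrases (3-grams) found in a text field."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

def trigram_indicators(descriptions, min_count=2):
    """Build one 0/1 column per 3-gram seen in at least `min_count` texts."""
    counts = Counter(g for d in descriptions for g in set(trigrams(d)))
    vocab = sorted(g for g, c in counts.items() if c >= min_count)
    rows = []
    for d in descriptions:
        present = set(trigrams(d))
        rows.append({g: int(g in present) for g in vocab})
    return vocab, rows

# Hypothetical loss descriptions, invented for illustration only.
notes = [
    "Insured reports pipe burst in kitchen",
    "Pipe burst in kitchen flooded the floor",
    "Laptop lost under unknown circumstances",
]
vocab, rows = trigram_indicators(notes)
```

Each resulting column can then sit alongside conventional data fields in the master dataset.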

Step 2: Applying analytics techniques

Once the variables were identified, we chose two techniques, logistic regression and GBM, to identify fraudulent claims. While both had their pros and cons, they provided valuable insights into the data. Eventually, we compared both techniques to arrive at a conclusion as to which one worked better for the project at hand.

A. Logistic regression

Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine an outcome. This predictive analytics technique produces an outcome that is measured with a dichotomous variable (one that has only two possible values). Plausible fraudulent claims are a rare event, accounting for less than 1 percent of all claims. Because logistic regression underestimates the probability score for rare events, an oversampled dataset with an event rate of at least 5 percent was created to counter this bias.

Since information flowed in three stages (FNOL, FC and OG), a residual modeling technique was employed for logistic regression: the logistic score from one stage appears as an offset variable in the subsequent stage. The advantage of this method is that the information gained in one stage is carried forward to the subsequent stage. The insurer’s heads of claims, underwriting and customer service count on achieving a certain percentage of straight-through processing (STP), which decreases the time from FNOL to claim resolution. Claims that don’t pass the FNOL stage can be flagged as risky and sent for further assessment. As claims move across the stages, we therefore have more information on whether they are fraudulent or genuine.
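The paper does not describe how the oversampled dataset was built. One common construction, shown here purely as an assumption, keeps every flagged claim and draws a random subset of the remaining claims so that the event rate reaches the target. The function name, the `fraud` field and the synthetic claims are hypothetical.

```python
import random

def oversample(claims, target_rate=0.05, seed=0):
    """Keep every flagged claim and sample the rest so that flagged
    claims make up at least `target_rate` of the returned dataset."""
    events = [c for c in claims if c["fraud"] == 1]
    non_events = [c for c in claims if c["fraud"] == 0]
    if not events:
        raise ValueError("no flagged claims to build the sample around")
    # Largest number of non-events that keeps the event rate >= target_rate.
    max_non_events = int(len(events) * (1 - target_rate) / target_rate)
    rng = random.Random(seed)
    kept = rng.sample(non_events, min(max_non_events, len(non_events)))
    sample = events + kept
    rng.shuffle(sample)
    return sample

# Synthetic data: 10 flagged claims out of 2,000 (0.5 percent event rate).
claims = [{"id": i, "fraud": int(i < 10)} for i in range(2000)]
sample = oversample(claims)
event_rate = sum(c["fraud"] for c in sample) / len(sample)
```

Here `event_rate` comes out at or just above the 5 percent target while all flagged claims are retained.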
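One way to write the staged offset structure down is sketched below. The notation is an assumption introduced here, not the authors' own: the offset is taken to be the fitted log-odds of the previous stage, entering with its coefficient fixed at 1, and each stage estimates coefficients only for the variables that become available at that stage.

```latex
\begin{aligned}
\eta_{\mathrm{FNOL}} &= \beta_0 + \mathbf{x}_{\mathrm{FNOL}}^{\top}\boldsymbol{\beta}, \\
\eta_{\mathrm{FC}}   &= \hat{\eta}_{\mathrm{FNOL}} + \gamma_0 + \mathbf{x}_{\mathrm{FC}}^{\top}\boldsymbol{\gamma}, \\
\eta_{\mathrm{OG}}   &= \hat{\eta}_{\mathrm{FC}} + \delta_0 + \mathbf{x}_{\mathrm{OG}}^{\top}\boldsymbol{\delta}, \\
p_s &= \frac{1}{1 + e^{-\eta_s}}, \qquad s \in \{\mathrm{FNOL}, \mathrm{FC}, \mathrm{OG}\}.
\end{aligned}
```

Under this reading, a later stage models only what the earlier score leaves unexplained, which is how the information gained in one stage carries forward.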

B. Gradient Boosting Model (GBM)

GBM is a machine learning technique that aims to improve on a single model by fitting many models and combining them for prediction. With GBM, there is no need to create an oversampled dataset, and the modeling exercise can be performed by gradient boosting of classification trees. Since GBM doesn’t support sequential modeling, a parallel model development approach was followed, with a separate model at each of the three stages: FNOL, FC and OG.
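To make the idea of "fitting many models and combining them" concrete, here is a deliberately small sketch of gradient boosting for a two-class outcome. It is not the implementation used in the case study: it boosts one-split trees (stumps) on the gradient of the log-loss, and the names `fit_stump`, `fit_gbm` and `predict`, the settings and the synthetic data are all assumptions. It also assumes both outcome classes are present in the data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Least-squares fit of a one-split tree (a stump) to the residuals."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), 0, float("inf"), mean, mean)  # fallback: no split
    for j in range(len(xs[0])):
        for t in sorted({x[j] for x in xs}):
            left = [r for x, r in zip(xs, residuals) if x[j] <= t]
            right = [r for x, r in zip(xs, residuals) if x[j] > t]
            if not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]  # (feature index, threshold, left value, right value)

def fit_gbm(xs, ys, n_trees=200, learning_rate=0.5):
    """Each stump is fitted to y - p, the negative gradient of the log-loss."""
    base_rate = sum(ys) / len(ys)
    f0 = math.log(base_rate / (1 - base_rate))  # starting log-odds
    scores = [f0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        j, t, lv, rv = fit_stump(xs, residuals)
        stumps.append((j, t, lv, rv))
        scores = [s + learning_rate * (lv if x[j] <= t else rv)
                  for x, s in zip(xs, scores)]
    return f0, learning_rate, stumps

def predict(model, x):
    """Predicted probability: sigmoid of the summed stump contributions."""
    f0, lr, stumps = model
    return sigmoid(f0 + sum(lr * (lv if x[j] <= t else rv)
                            for j, t, lv, rv in stumps))

# Synthetic, non-monotonic pattern: only mid-range values are flagged.
xs = [[v] for v in range(10)]
ys = [1 if 3 <= v <= 5 else 0 for v in range(10)]
model = fit_gbm(xs, ys)
```

On this toy data the combined stumps score the mid-range values above 0.5 and the values on either side below it, a pattern that a single linear term cannot reproduce. In practice an established GBM library would be used instead of hand-written code.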

Step 3: Running the analysis and analyzing results

Before running the analysis for logistic regression, a standard approach to variable selection was carried out: elimination of variables based on fill rates, then correlation and clustering analysis, followed by stepwise selection in the logistic procedure using SAS. Further shortlisting was done to get rid of multicollinearity, if any. For GBM, however, no such treatment was required. The table given below shows a comparative analysis of both the techniques:
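The first two screening steps described above (fill rate, then correlation) can be sketched as follows. This is an illustrative approximation only: the thresholds, function names and sample columns are assumptions, and the clustering and stepwise steps performed in SAS are not reproduced here.

```python
import math

def fill_rate(values):
    """Share of non-missing entries in a column."""
    return sum(v is not None for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def screen_variables(columns, min_fill=0.7, max_corr=0.9):
    """Drop sparsely filled columns, then one of each highly correlated pair."""
    kept = [n for n, vals in columns.items() if fill_rate(vals) >= min_fill]
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[other])) < max_corr
               for other in selected):
            selected.append(name)
    return selected

# Hypothetical columns, invented for illustration only.
columns = {
    "claim_amount": [100, 250, 400, 800, 1200],
    "claim_amount_k": [0.1, 0.25, 0.4, 0.8, 1.2],  # rescaled duplicate
    "policy_age_months": [30, 2, 18, 1, 7],
    "prior_claims": [None, None, None, 1, None],   # mostly missing
}
selected = screen_variables(columns)
```

In this example the mostly missing column is removed by the fill-rate rule and the rescaled duplicate by the correlation rule.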

The three stage-wise logistic models for FNOL, FC and OG underperformed in comparison to GBM in terms of lift and K-S statistics. However, the precision of all three logistic models was better than that of GBM. Higher precision (fewer false positives) takes priority over other performance measures in evaluating claim fraud, as it helps insurance companies optimize fraud detection, decrease costs and increase recovery. Therefore, in this particular case, logistic regression scores over GBM.

Which fraud identification approach is better - logistic regression or GBM?

The two techniques applied have different algorithms running in the background. Logistic regression involves human intervention at multiple stages. GBM, on the other hand, is based purely on a machine learning algorithm, which requires minimal human intervention. Owing to this, a logistic regression analysis can be controlled to a fair degree, but it has to follow certain assumptions.
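For reference, the measures used to compare the two techniques (the share of fraud captured in the top-scored claims, which underlies lift, along with precision and the K-S statistic) can be computed from any scored dataset. The sketch below uses common textbook definitions, which may differ in detail from those applied in the case study; the function names and the synthetic scores are assumptions.

```python
def rank_by_score(scores, labels):
    """Pair scores with labels, highest score first."""
    return sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)

def capture_rate(scores, labels, top_fraction=0.10):
    """Share of all fraud claims found in the top-scored fraction."""
    ranked = rank_by_score(scores, labels)
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(label for _, label in ranked[:n_top]) / sum(labels)

def precision_at(scores, labels, top_fraction=0.10):
    """Share of the top-scored fraction that is actually fraud."""
    ranked = rank_by_score(scores, labels)
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(label for _, label in ranked[:n_top]) / n_top

def ks_statistic(scores, labels):
    """Largest gap between the cumulative shares of fraud and non-fraud
    claims when claims are read from the highest score downwards."""
    ranked = rank_by_score(scores, labels)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    for i, (score, label) in enumerate(ranked):
        cum_pos += label
        cum_neg += 1 - label
        # Compare only after the last claim of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best

# Synthetic scored claims: 20 claims, 4 of them fraudulent.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 1] + [0] * 11
scores = [1 - i / 20 for i in range(20)]

top_capture = capture_rate(scores, labels)    # 0.5
top_precision = precision_at(scores, labels)  # 1.0
ks = ks_statistic(scores, labels)             # 0.6875
```

In this toy example, reviewing the top 10 percent of claims finds half of all fraud, and every reviewed claim is fraudulent.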

In terms of output, GBM produces a scored dataset. This scored dataset has a probability value for every observation in the training/modeling dataset, along with a variable importance list (sorted in descending order, with the most important variable at the top). GBM reveals nothing about the direction or magnitude of the effect of any predictor variable on the predicted variable. However, this technique offers the advantage of being able to handle both non-linearity and non-monotonicity, which are quite common in fraud identification problems. Logistic regression delivers a mathematical equation as output. This equation can explain the effect of every predictor variable on the predicted variable, which is an advantage of using logistic regression.

On the flip side, regression drops observations with missing variable values from the modeling exercise, or requires these values to be imputed. GBM requires no such treatment and is also equipped to handle multicollinearity on its own, unlike logistic regression, where multicollinearity has to be identified and dealt with by introducing interactions or removing certain variables. The interaction terms and variable transformations for logistic regression are left to the discretion of the person building the model, whereas GBM itself introduces and tests for interactions and variable transformations. GBM performs variable selection for fraud identification at run time, which takes a lot more processing time than logistic regression. Logistic regression, on the other hand, requires an extensive variable selection/elimination process involving steps like elimination on the basis of correlation value, variable clustering, information value and a stepwise procedure.

It is apparent that GBM holds an edge over logistic regression in that, in fraud detection, linearity is not expected in the data. That said, businesses rely heavily on insights from the mathematical equation generated by logistic regression. When a firm needs to implement changes in the way it carries out its business, logistic regression offers a clear understanding of the cause-and-effect relationship between the dependent and independent variables. But this doesn’t imply that logistic regression fares better than GBM in every situation.
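For reference, the equation delivered by logistic regression has the standard form below. The symbols are generic and introduced here for illustration: p is the predicted probability of fraud and x_1, ..., x_k are the predictor variables.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,
\qquad
\frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} = e^{\beta_j}
```

The sign of each coefficient gives the direction of the effect, and e^{\beta_j} gives the factor by which the odds of fraud change for a one-unit increase in x_j with the other variables held fixed.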

Based on what a business aims to achieve from an analytics exercise, an appropriate technique out of the two can be chosen, as both of these techniques have their own merits and limitations.

Ways to manage fraudulent claims in your organization

Organizations should not go by predefined fraud variable lists based on historical understanding of the insurance industry, but should leverage predictive modeling for identifying fraudulent claims specific to their organization. In the case discussed above, logistic regression revealed some new variables influencing fraud claims from both internal as well as external databases.

In general, in the P&C industry, a fraudulent claim tends to show the following characteristics:

  • The presence of a large number of contents damaged/missing

  • Contents missing/stolen typically include electronic items like mobile phones, laptops, cameras, etc.

  • Issues pertaining to coverage of the reported loss

  • Proximity of the claimant to a service center or claim vendor

  • Increased chances of claim arising from a new policy as compared to an old policy

  • Presence of keywords like “mysterious”, “disappearance”, “lost”, “unknown”, “strange” or “undetermined” in the loss description field

  • The number of times a claimant had been arrested in the past for any criminal activity

The logistic-based analysis confirmed that the above factors drive fraudulent claims. Apart from these, several other factors turned out to be significant in the model we developed. These new fraud indicator factors were:

  • Number of utility cards that a claimant has - this shows a negative impact on the propensity of filing a fraudulent claim

  • Presence of terms like “divorced”, “separated” or “domestic dispute” has a very strong positive relation with a claim being fraudulent

  • If the description contains words like “water leak” or “pipe burst”, the chances of it turning out to be a fraudulent claim go down significantly

  • Presence of terms such as “theft”, “broken” or “police report” tends to increase the chances of a claim being fraudulent


An effective and efficient notice of loss organization, combined with fast-track claim handling during FNOL, can allow insurers to realign claim resources with more complex claim-handling activities. In this whitepaper we explored two approaches, logistic regression and GBM, separately, to understand their effectiveness and interactions with the data at hand. However, the two techniques are not mutually exclusive and need not be used independently; analysts can use them in tandem. For example, stacking logistic regression on top of GBM might beat the results of using GBM alone on the dataset. Logistic regression can help identify trends and causal relationships, while machine learning models like GBM can be used to gain greater performance from a particular dataset. Various other approaches can be considered, but they are beyond the scope of this whitepaper.

In the end, based on the methods used, the client experienced results equivalent to the improvement showcased in the model, which indicated a lift at 10 percent for FNOL of 61.47 percent. This means that 61.47 percent of all fraudulent claims can be identified by reviewing only the 10 percent of claims with the highest scores. Exploring such varied advanced approaches based on business objectives can therefore enable greater success in claim fraud identification; in this case it did, and it likely saved the company from loss. Not only did the insurer stop payment of fraudulent claims, it also did not have to incur the additional cost of litigation necessary to recover lost funds. The roadmap for successful transformation is clear. Insurers must move on from today’s outdated fraud strategies to advanced data analytics.


