Using AI to Detect Data Anomalies

Business decision makers rarely crave more reams of data. What they want are well-defined, actionable insights they can evaluate and act on. They also want confidence that they aren't missing something critical, whether because the data is not being analyzed thoroughly or correctly, or worse, because the data itself is incomplete or wrong.

One of the most practical applications of artificial intelligence (AI) to regularly processed data sources is the detection of data anomalies. Once expectations about the data and the relevant types of anomalies are configured, AI can run in the background of a data processing system and autonomously deliver a prioritized list or streaming queue of the insights and anomalies its algorithms deem most likely to be of interest. A Data Anomaly Detector tracking all ingested and processed data automates a formerly tedious, labor-heavy QA and troubleshooting process, reduces the risk of making business decisions based on faulty data, and accelerates data-driven decision making.

Defining a Data Anomaly Detector

Simply put, a Data Anomaly Detector ensures data quality within data pipelines. Organizations are ingesting increasingly large quantities of data from external and internal sources, from both streaming feeds and routine batch jobs. While it has become common, if tedious, for operations teams to write basic rule-based sanity checks for their extract, transform, and load (ETL) jobs, more nuanced issues with data feeds are likely to go unnoticed, such as gradual shifts in data ranges over time or a slowly changing percentage of missing values. With companies relying more and more on data to drive their critical models and processes, it is essential for this data to be properly vetted.
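To make that distinction concrete, the sketch below tracks the per-batch missing-value rate for a single field and flags batches whose rate drifts well outside the recent norm, the kind of slow-moving issue a fixed rule rarely catches. The class name, window size, and z-score threshold are illustrative assumptions, not part of any particular product.

```python
# Minimal sketch: a rolling statistical check on the missing-value rate for
# one field. A hard-coded ETL rule ("rate must be below 5%") would miss a
# rate that creeps from 0.5% to 3% over months; this check would not.
from collections import deque
import statistics

class MissingRateMonitor:
    """Tracks the per-batch missing-value rate for one field and flags drift."""

    def __init__(self, window: int = 30, z_threshold: float = 3.0):
        self.history = deque(maxlen=window)   # recent per-batch missing rates
        self.z_threshold = z_threshold

    def observe(self, records: list, field: str) -> bool:
        """Return True if this batch's missing rate looks anomalous."""
        rate = sum(r.get(field) is None for r in records) / max(len(records), 1)
        anomalous = False
        if len(self.history) >= 10:           # wait for some history first
            mean = statistics.mean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            anomalous = abs(rate - mean) / stdev > self.z_threshold
        self.history.append(rate)
        return anomalous
```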

Depiction of data set with anomalies being flagged

A Data Anomaly Detector identifies unusual patterns in live data feeds with minimal user input or oversight. It blends user-supplied context with machine learning methodology to automatically home in on outliers as well as less conspicuous long-term trends or patterns in the data. A central design principle for a Data Anomaly Detector is to let users customize as little or as much as they want for each of their data feeds. Such a tool should work with minimal user input, and become considerably more effective at spotting anomalies as the user provides more guidance on what the data should look like.

How a Data Anomaly Detector works

A Data Anomaly Detector streams new data record by record into a platform that leverages parallel processing to incrementally update Bayesian machine learning models for the various facets of each new record. For each facet, the tool runs backend predictive checks to quantify how far the observed data deviates from model expectations. Any deviation deemed statistically significant is recorded as an anomaly along with its level of significance. Depending on the detector's configuration, anomalous records above a certain threshold are then passed along to the user, in prioritized order, to evaluate and act on.
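As a rough illustration of that flow (and not a description of Fulcrum's actual implementation), the sketch below maintains a simple conjugate Normal model per numeric facet, runs a predictive check on each incoming value, and only then folds the value into the model. The class and function names, priors, and significance threshold `alpha` are all assumptions made for the example.

```python
# Hedged sketch of record-by-record Bayesian anomaly scoring: check each
# value against the posterior predictive, then update the model with it.
import math

class NumericFieldModel:
    def __init__(self, prior_mean=0.0, prior_var=1e6, obs_var=1.0):
        self.mean, self.var = prior_mean, prior_var   # posterior over the field mean
        self.obs_var = obs_var                        # assumed observation noise

    def predictive_check(self, x: float) -> float:
        """Two-sided tail probability of x under the posterior predictive."""
        pred_var = self.var + self.obs_var
        z = (x - self.mean) / math.sqrt(pred_var)
        return math.erfc(abs(z) / math.sqrt(2.0))

    def update(self, x: float) -> None:
        """Incremental conjugate update after observing x."""
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mean = (self.mean / self.var + x / self.obs_var) / precision
        self.var = 1.0 / precision

def score_record(record: dict, models: dict, alpha: float = 0.001) -> list:
    """Run a predictive check per numeric facet, then fold the record in."""
    anomalies = []
    for field, model in models.items():
        value = record.get(field)
        if value is None:
            continue
        p = model.predictive_check(value)
        if p < alpha:                                  # deviation deemed significant
            anomalies.append((field, value, p))
        model.update(value)
    return anomalies
```

A real detector would maintain richer models (counts, categorical distributions, seasonality) and distribute the updates across workers, but the check-then-update ordering is the essential part: each record is scored against expectations formed before it arrived.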

In Fulcrum's version of a Data Anomaly Detector, an intuitive front-end graphical user interface (GUI) collects this contextual knowledge. The experience is similar to preparing taxes with web-based tax filing software: the GUI guides the user through a series of relatively simple prompts, and the user-provided knowledge is then incorporated into the underlying models. This knowledge base may include basic information, such as the expected regularity and file size of a feed, as well as more niche concepts, such as distributional assumptions about individual data fields.
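Purely as an illustration, the snippet below suggests the kind of structure such user-provided expectations might take once collected; the field names and schema are assumptions for the example, not Fulcrum's actual configuration format.

```python
# Illustrative shape of user-supplied feed expectations gathered via a GUI.
feed_expectations = {
    "feed_name": "daily_orders",                      # hypothetical feed
    "schedule": {"frequency": "daily", "expected_by": "06:00 UTC"},
    "file_size_mb": {"min": 50, "max": 500},          # expected file size range
    "fields": {
        "order_total": {"distribution": "lognormal", "min": 0},
        "customer_id": {"max_missing_pct": 0.5},
        "region": {"allowed_values": ["NA", "EMEA", "APAC"]},
    },
}
```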

Depiction of user input through a GUI defining initial expected data model

Problems a Data Anomaly Detector solves

A Data Anomaly Detector complements the rules currently deployed in ETL processes, using intelligent models to catch problem data early that logic-based rules may miss. This extra layer of QA alerts the user that something may be wrong early in the data process, which lets the user go back to the source, identify where data collection went wrong, and determine whether any action is needed.
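As a hedged sketch of how the two layers can sit side by side: a fixed ETL rule catches outright violations, while a learned model (here reusing the illustrative `NumericFieldModel` from the earlier sketch) flags values a static rule would happily accept.

```python
# Illustrative layering of an existing ETL rule and a model-based check.
def validate_batch(records, order_total_model):
    issues = []
    for i, rec in enumerate(records):
        value = rec.get("order_total")
        if value is None:
            continue
        # Existing logic-based ETL rule: hard violation, always caught.
        if value < 0:
            issues.append((i, "rule", "negative order_total"))
            continue
        # Model-based check: flags values the static rule would accept.
        p = order_total_model.predictive_check(value)
        if p < 0.001:
            issues.append((i, "model", f"order_total unlikely (p={p:.2e})"))
        order_total_model.update(value)
    return issues
```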

In addition to flagging violations of expected data-quality thresholds for investigation, a Data Anomaly Detector also surfaces valuable and possibly unexpected insights from the data, such as gradual shifts over time. Big data is everywhere, and everyone is looking for ways to leverage its value. Its sheer size can make it cumbersome to work with, particularly for less technical users whose primary means of working with data has been spreadsheets. A popular way to empower non-technical personnel is to build custom dashboards with the Data Anomaly Detector's trends displayed in the forefront. This provides a means to explore the data in an intuitive, structured format and increases the accessibility of big data insights.

Depiction of big data set being cleaned using AI

Moreover, for certain repetitive visualization tasks, a Data Anomaly Detector's dashboards can be a quick and convenient shortcut, even for experienced technical users, who can refine their configuration as the tool is exposed to more data. A good Data Anomaly Detector will adjust its algorithms to produce increasingly valuable insights with greater regularity, in addition to optimizing the frequency at which its models are updated. How results and notifications are delivered depends on the needs of the end user and can be configured for each use case. For instance, some business users may prefer an email report every morning summarizing the top insights, while operations teams are likely to prefer real-time logging and flagging of suspicious records and batches, along with a summary report at regular intervals.
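As an illustration of how differently those two audiences might be served, the configuration below sketches one possible shape for per-audience notification settings; the keys and options are assumptions, not a documented interface.

```python
# Illustrative per-audience notification settings for the two use cases above.
notification_config = {
    "business_users": {
        "channel": "email",
        "schedule": "daily 07:00",
        "content": {"top_insights": 10, "include_trend_charts": True},
    },
    "operations_team": {
        "channel": "log_stream",
        "mode": "real_time",                      # flag suspicious records as they arrive
        "summary": {"interval": "hourly"},        # plus a periodic roll-up report
    },
}
```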

Incorporating Data Anomaly Detection into everyday processes

A Data Anomaly Detector should be more than just software. In Fulcrum’s case, it is a platform that will break up and run machine learning models simultaneously across a cluster of servers. The most practical way to incorporate a Data Anomaly Detector into a company’s existing processes is to have it work in parallel with existing ETL jobs. This can happen at any point or even at multiple points in the ETL process. For instance, as data is ingested from a feed to undergo ETL, these data can simultaneously be passed into a Data Anomaly Detector for processing. As a self-contained data platform, it should work in parallel to identify anomalies without blocking or delaying existing ETL jobs.
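One lightweight way to achieve that, sketched below under the assumption of a hypothetical `run_etl` job and a detector client with a `submit` method, is simply to hand each ingested batch to the detector on a background thread while the ETL step proceeds as before. A production setup would more likely tee the feed at the message-queue or orchestrator level.

```python
# Minimal sketch of scoring batches alongside ETL without delaying it.
from concurrent.futures import ThreadPoolExecutor

# A small pool dedicated to detector submissions so ETL is never blocked.
_detector_pool = ThreadPoolExecutor(max_workers=2)

def process_batch(batch, run_etl, detector):
    """Run the ETL step as usual while the detector scores the same batch."""
    _detector_pool.submit(detector.submit, batch)   # fire-and-forget anomaly scoring
    return run_etl(batch)                           # ETL output is unchanged
```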

To learn more about how Fulcrum can help you with Data Anomaly Detection to increase confidence in your data integrity and detect insights yet to be generated from your data streams, contact us.