INPUT VALIDATION MODULE
When I first joined Optum, I worked as a developer implementing a data validation solution called IVM (Input Validation Module). This Apache Spark-based data validation engine was created for a specific department, but word spread and I was tasked with leading our risk department through a POC (proof of concept).
Client: internal risk department
Languages used: Python, JSON, Linux command line
Technology Stack: Apache Spark, Apache Drill, Tableau
Overview - In order to validate, I needed to do the following:
1) Flatten the data (this client had normalized data, and I needed a flat table for each file type)
2) Configure rules for validation - a plain word example: "Check if Field One has 5 characters"
3) Create a way to use and visualize the output
4) Diagram the System
line by line validation
ability to toggle rules
rules match and go beyond current-state validation tool
aggregation and statistics on validated data
scalable to large data sets
data highly normalized but needed to be flattened
accommodate large and small data sets
Because this department services external clients who submit data to the government, the data needed to be flattened prior to validation. The data in question contained personal health information in the form of claims, as well as personal identifiers for both patients and providers.
To flatten the data, I created a Python script that read two inputs: an existing Excel document describing the data loops in a given file type, and the segmented data file itself. With these two files, the Python script was able to flatten 6 different types of segmented data files to prepare them for validation.
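To give a feel for the flattening step, here is a minimal sketch, not the actual script. It assumes a pipe-delimited segmented file and uses a hard-coded dictionary (LOOP_SPEC) in place of the Excel loop document; the segment IDs, field names, and delimiter are all hypothetical.

```python
# Hypothetical loop specification, standing in for the Excel document.
# Each segment ID maps to its loop level and the column names of its fields.
LOOP_SPEC = {
    "MBR": {"level": 1, "fields": ["member_id", "state"]},
    "CLM": {"level": 2, "fields": ["claim_id", "service_date"]},
}


def flatten(lines, spec, delimiter="|"):
    """Return one flat row per deepest-level segment, carrying parent fields down."""
    max_level = max(entry["level"] for entry in spec.values())
    context = {}  # loop level -> column values of the current segment at that level
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seg_id, *values = line.split(delimiter)
        entry = spec[seg_id]  # raises KeyError for a segment missing from the spec
        level = entry["level"]
        # A new segment at this level ends any loops at the same or deeper levels.
        for stale in [lvl for lvl in context if lvl >= level]:
            del context[stale]
        context[level] = dict(zip(entry["fields"], values))
        if level == max_level:
            row = {}
            for lvl in sorted(context):
                row.update(context[lvl])
            rows.append(row)
    return rows


sample = [
    "MBR|M1|MN",
    "CLM|C1|2018-01-05",
    "CLM|C2|2018-02-10",
    "MBR|M2|WI",
    "CLM|C3|2018-03-01",
]
flat_rows = flatten(sample, LOOP_SPEC)
```

Each claim line becomes one flat row that repeats its parent member's fields, which is the shape a line-by-line validator needs.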
Using documentation about the contents of the client's existing validation engine, I recreated over 80% of the current-state rules in JSON (which configures the validation engine). The remaining rules, just under 20%, were referential integrity rules, which required security work that would be handled post-POC.
For example, a rule might tell the engine to check whether a given field is null or conforms to a certain format (e.g., a date check).
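As an illustration only, the sketch below shows how JSON-configured rules of this kind could be applied to one flattened row. The JSON layout, the rule names, the "enabled" toggle, and the CHECKS table are my own assumptions for this example; they are not IVM's actual configuration schema or engine code.

```python
import json
import re

# Hypothetical rule configuration; the real IVM JSON schema is not shown here.
RULES_JSON = r"""
[
  {"name": "field_one_length", "field": "field_one", "check": "length", "value": 5, "enabled": true},
  {"name": "member_id_not_null", "field": "member_id", "check": "not_null", "enabled": true},
  {"name": "service_date_format", "field": "service_date", "check": "regex",
   "value": "^\\d{4}-\\d{2}-\\d{2}$", "enabled": false}
]
"""

# Each check takes the field value and the rule's optional argument.
CHECKS = {
    "not_null": lambda value, _arg: value not in (None, ""),
    "length": lambda value, arg: value is not None and len(value) == arg,
    "regex": lambda value, arg: value is not None and re.fullmatch(arg, value) is not None,
}


def validate_row(row, rules):
    """Return the names of the enabled rules that this row fails."""
    failures = []
    for rule in rules:
        if not rule.get("enabled", True):
            continue  # rule is toggled off
        check = CHECKS[rule["check"]]
        if not check(row.get(rule["field"]), rule.get("value")):
            failures.append(rule["name"])
    return failures


rules = json.loads(RULES_JSON)
failed = validate_row({"field_one": "ABCDE", "member_id": "", "service_date": "bad"}, rules)
```

Here the date rule is toggled off, so only the null check on member_id fails. Keeping rules as data rather than code is what lets a non-developer turn checks on and off.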
As I worked on the rule configuration, I met regularly with the client to describe how the data might flow in the POC.
For the POC, the process was rather manual. Here, "client" refers to an external client who sends data to a DA (data analyst) within our risk department.
All data flows through the DA, who moves the data files on a virtual machine.
With the JSON rule set, the Python flattening script, and minimal training, the client was able to feed the validation engine and receive results in the form of the IVM Report (overall validation totals), a parquet file (detailed validation results), and run timings (for tuning the time the engine takes).
This took care of 4 of the 6 goals - we had not met the entirety of the current system in terms of rule content, but there were performance gains.
The last goal was to visualize. Using Apache Drill, I loaded the parquet file and created queries for different parts of the validation. Then, I created a connection from Drill to Tableau Desktop and created a schema within Tableau.
A pain point was that the current state system did not show aggregations by state. I created an example dashboard to show member totals by state, member totals by health plan, and the number of active primary care physicians.
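The dashboard itself was built in Tableau on top of Drill queries, but the kind of aggregation behind "member totals by state" can be sketched in plain Python. The column names (member_id, state, health_plan) and the sample rows are hypothetical.

```python
from collections import defaultdict


def member_totals(rows, key):
    """Count distinct member IDs for each value of the grouping column `key`."""
    members = defaultdict(set)
    for row in rows:
        members[row[key]].add(row["member_id"])
    return {group: len(ids) for group, ids in members.items()}


# Hypothetical validated rows; a member can appear on several lines.
validated = [
    {"member_id": "M1", "state": "MN", "health_plan": "A"},
    {"member_id": "M1", "state": "MN", "health_plan": "A"},
    {"member_id": "M2", "state": "WI", "health_plan": "A"},
    {"member_id": "M3", "state": "MN", "health_plan": "B"},
]
by_state = member_totals(validated, "state")
by_plan = member_totals(validated, "health_plan")
```

Counting distinct IDs rather than rows matters because flattened data repeats a member once per claim.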
This project progressed from proof of concept in August 2018, generating inter-department revenue for my department. I am still working with this team to implement a permanent solution - removing some of the manual processes and housing the software on its own server.
Other teams have taken note, and we are implementing this validation engine in multiple places across the Optum enterprise.
For my work on this project, which lasted over a year, I received a Bravo Award (internal accolade bonus given by a peer or manager), 2 distinct raises, as well as a promotion. As part of the promotion, I moved from developer to Product Manager for IVM.
In addition to helping my career, this project made me comfortable with testing system components and with not striving for a perfectly orchestrated system the first time. Technology is amazing, but it can also be slow and iterative, especially when building a solution to last.
During this POC, I worked on code with two other developers - either retooling their code, referencing it, or adding to it. This fine-tuned my attention to detail and reinforced the importance of documenting and commenting on ANY project, but particularly code, in order to make collaboration easier.