One of the courses required to graduate from UTD with a Computer Science degree is Senior Design. In this course, each student is responsible for finding a group of peers to work with. Each group is then partnered with a company that gives the group a problem to solve. The company usually mentors the group through the process and points them in the right direction.
I started this class while everyone was still quarantining as much as possible because of COVID-19, so the class was completely remote. I used GroupMe to find a group of my peers to work with. We were then given a list of potential projects, submitted our preferences, and were matched with a company called Launch Consulting Group. Launch then contacted us and sent us the information we needed about the project we were to complete.
For our project, Launch tasked us with building a machine learning model that could predict the pricing of new healthcare claims based on the pricing of historic claims. Nobody in our group had much experience with machine learning or data science, so we began with a lot of research.
After wrapping our heads around the problem, we needed to find a data set that we could use to build the model. Most of the information required to build the model would be considered sensitive, so we needed either anonymized data or synthetic data. We eventually settled on the synthetic data set available from cms.gov. This data set came as multiple CSV files and had columns for deductible, pricing, diagnosis codes, and procedure codes, along with many others.
To make this data set easier to work with, we needed to import it into some kind of database. Launch guided us to use a product called Snowflake. We used Snowflake to merge all of our CSV files into one and to run our SQL queries.
We then moved our organized data into S3 on Amazon Web Services and read it into a Jupyter Notebook within AWS. After some trial and error, we realized that we could not support the entire data set, so we decided to limit the scope of the project to the twenty most common values from the columns labeled Admitting Diagnosis and Diagnosis Codes (ICD9_DGNS_CD_) one through four. We then filtered the data in Snowflake to include only these values and imported it once again into S3.
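The filtering step above was done in Snowflake, and the exact query is not shown here. As a rough illustration of the same idea in pandas, here is a minimal sketch. The helper `keep_most_common`, the toy rows, and the column name `ADMTNG_ICD9_DGNS_CD` are assumptions made for the example, not the project's actual code.

```python
import pandas as pd


def keep_most_common(df, column, n):
    """Keep only the rows whose value in `column` is among the n most frequent."""
    # value_counts() sorts by frequency, so head(n) gives the n most common values.
    top_values = df[column].value_counts().head(n).index
    return df[df[column].isin(top_values)]


# Hypothetical toy data; the real filter ran in Snowflake on far more rows.
claims = pd.DataFrame(
    {"ADMTNG_ICD9_DGNS_CD": ["4019", "4019", "25000", "25000", "4280"]}
)

# n=2 keeps the toy example small; the project kept the twenty most common values.
filtered = keep_most_common(claims, "ADMTNG_ICD9_DGNS_CD", n=2)
print(filtered["ADMTNG_ICD9_DGNS_CD"].tolist())
```

In this toy example the code "4280" appears only once, so its row is dropped and the four rows with the two most common codes remain.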
To build our model, we first read our data into a dataframe using pandas. We then had to convert all of our data into numeric values. The codes were converted using one-hot encoding, where each unique diagnosis code becomes its own column and each row has a 1 under that column if it originally contained the code or a 0 if it didn’t. This encoding is the main reason that we had to limit the scope of the project. If we had encoded every code like this, we would have ended up with at least 100,000 columns. Since we were all relatively new to this kind of project, that was more than we could handle.
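To make the encoding concrete, here is a small sketch using `pandas.get_dummies`. The toy rows and the column names (`ADMTNG_ICD9_DGNS_CD`, `ICD9_DGNS_CD_1`, `CLM_PMT_AMT`) are assumptions for illustration, and this version encodes each code column separately, which may differ from how our notebook actually did it.

```python
import pandas as pd

# Hypothetical toy claims; column names and values are assumptions.
claims = pd.DataFrame(
    {
        "ADMTNG_ICD9_DGNS_CD": ["4019", "25000", "4019"],
        "ICD9_DGNS_CD_1": ["25000", "4280", "4280"],
        "CLM_PMT_AMT": [4000.0, 7000.0, 5000.0],
    }
)

code_columns = ["ADMTNG_ICD9_DGNS_CD", "ICD9_DGNS_CD_1"]

# Each distinct code in each listed column becomes its own 0/1 column.
encoded = pd.get_dummies(claims, columns=code_columns, dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

Two code columns with two distinct values each already turn into four new columns, which shows how quickly the column count grows when there are thousands of distinct codes.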
After moving our filtered data into a dataframe, we used the sklearn library to train a linear regression model. The model was trained on 80 percent of our data and tested on the remaining 20 percent. After some attempts to handle outliers and increase accuracy, we ended up with an R-squared of 0.46 and a mean squared error of 8323341.72. These results were not as accurate as we would have liked, but we believe that they could be better with support for more features as well as actual data. We believe that actual data would be more linear than synthetic data.
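The training and evaluation steps can be sketched roughly as follows. This is a minimal example on randomly generated data, not our actual notebook. The feature matrix, the weights, and the noise level are all made up, so the scores it prints have nothing to do with the results reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Made-up data: 200 rows of 0/1 features (like one-hot columns) and a noisy price.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
assumed_weights = np.array([1000.0, 2500.0, 500.0, 4000.0, 1500.0])
y = X @ assumed_weights + rng.normal(0, 300, size=200)

# Hold out 20 percent of the rows for testing and train on the other 80 percent.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# R-squared and mean squared error are computed on the held-out rows only.
r2 = r2_score(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
print(f"R-squared: {r2:.2f}, MSE: {mse:.2f}")
```

Fixing `random_state` keeps the split reproducible, which makes it easier to compare runs while experimenting with outlier handling.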
This model was then stored in a pickle file. We built an interface with tkinter that could load the model and use it to predict the price of a new claim. Through testing the model with claims, we found that 58% of its predictions were within $2000 of the actual price and 33% were within $1000.
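Here is a minimal sketch of the two ideas above: saving and reloading a model with pickle, and measuring what share of predictions land within a dollar tolerance of the actual price. The helper `share_within`, the file name `model.pkl`, and all of the numbers are hypothetical, and the tkinter interface is left out.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression


def share_within(actual, predicted, tolerance):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted) <= tolerance))


# Tiny made-up model so there is something to save and reload.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1000.0, 3000.0, 5000.0, 7000.0])
model = LinearRegression().fit(X, y)

# Write the model to a pickle file, then read it back as the interface would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Made-up actual and predicted prices to show the tolerance check.
actual_prices = [1000.0, 3000.0, 5000.0, 7000.0]
predicted_prices = [1500.0, 5500.0, 3500.0, 7000.0]
within_2000 = share_within(actual_prices, predicted_prices, 2000)
within_1000 = share_within(actual_prices, predicted_prices, 1000)
print(within_2000, within_1000)
```

One caution: only unpickle files you created yourself, since loading a pickle file can run arbitrary code.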
I had a lot of fun building this model and am grateful for the opportunity to learn these new technologies. I was also ecstatic that I was able to be successful with them. If you have any questions, feel free to leave a comment. Have a great day.