
DATA PORTFOLIO

Welcome to my data portfolio! Here you can find a selection of my projects and skills in data science and analytics.

Table of Contents

  1. Achievements
  2. Projects
  3. Micro Projects
  4. Skills
  5. Contact

Achievements

I maintain a website where you can see some of my projects and blog posts as an MS Data Science student.

Projects

Projects I have worked on:

AI Discourse at Scale: Topic, Sentiment, and Community Dynamics Across Reddit

This project analyzed 4.2 billion Reddit comments and submissions from June 2023 to July 2024 using Apache Spark. A scalable ETL pipeline filtered 445 GB of Parquet archives into AI-focused subreddit clusters, and Spark NLP performed sentiment analysis and topic modeling over the 14-month window. Aggregating by subreddit and month produced summaries of comment volume, unique users, and activity spikes, revealing what drove the discourse. The result is a reproducible snapshot of online conversation during a tech inflection point and a demonstration of distributed data engineering for behavioral analytics: an efficient pipeline for billion-row datasets that can support real-time community insights. Tools: PySpark, Spark NLP, Parquet I/O, distributed aggregations, and an interactive website.
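The per-subreddit, per-month summaries described above can be illustrated with a minimal sketch in plain Python. This is not the project's Spark code: the field names (`subreddit`, `author`, `created_utc`), the helper `ts`, and the toy records are assumptions made for illustration. In PySpark the same idea would typically be a `groupBy` on subreddit and month with `count` and `countDistinct` aggregations.

```python
from collections import defaultdict
from datetime import datetime, timezone


def monthly_summary(records):
    """Count comments and unique authors per (subreddit, YYYY-MM) bucket."""
    comment_counts = defaultdict(int)
    authors = defaultdict(set)
    for rec in records:
        # created_utc is assumed to be a Unix timestamp in seconds.
        month = datetime.fromtimestamp(
            rec["created_utc"], tz=timezone.utc
        ).strftime("%Y-%m")
        key = (rec["subreddit"], month)
        comment_counts[key] += 1
        authors[key].add(rec["author"])
    return {
        key: {"comments": comment_counts[key], "unique_users": len(authors[key])}
        for key in sorted(comment_counts)
    }


def ts(year, month, day):
    """Helper for the toy data: midnight UTC as a Unix timestamp."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Hypothetical toy records; not data from the project.
records = [
    {"subreddit": "ChatGPT", "author": "alice", "created_utc": ts(2023, 6, 3)},
    {"subreddit": "ChatGPT", "author": "bob", "created_utc": ts(2023, 6, 17)},
    {"subreddit": "ChatGPT", "author": "alice", "created_utc": ts(2023, 6, 20)},
    {"subreddit": "ChatGPT", "author": "alice", "created_utc": ts(2023, 7, 1)},
    {"subreddit": "MachineLearning", "author": "carol", "created_utc": ts(2023, 6, 9)},
]

summary = monthly_summary(records)
for (subreddit, month), stats in summary.items():
    print(subreddit, month, stats)
```

A month whose comment count jumps well above its neighbors would be a candidate activity spike; how spikes were actually defined in the project is not stated here.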

AWS Public Blockchain Analysis with Athena & PySpark

This project explores Ethereum’s on-chain activity using AWS Athena, S3, and PySpark for large-scale SQL querying, data transformation, and visualization. Transaction-level data from the public blockchain lake was queried via boto3, processed in distributed Spark DataFrames, and visualized in Python. The final analysis quantifies consistent daily throughput (~1.6 M tx/day) and examines how gas markets, congestion, and wallet networks interact, demonstrating the power of cloud-based analytics pipelines for real-time blockchain intelligence.

Scrollytelling with Quarto: Close Read Prize Contest

This project analyzes the financial risk of semiconductor stocks using O-GARCH and Value-at-Risk (VaR) models in R. The findings are presented on a personal website built with the Quarto extension ‘qmd-lab/closeread’, HTML, and CSS. The analysis of stock volatility and investment risk highlights key factors influencing market fluctuations and helps investors make informed decisions.
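For readers unfamiliar with Value-at-Risk, here is a minimal sketch of the simplest variant, historical simulation, written in Python. It is not the project's O-GARCH model (which was built in R); the function name `historical_var` and the return series are hypothetical and serve only to show what a VaR number means.

```python
import math


def historical_var(returns, confidence=0.95):
    """Historical-simulation VaR, reported as a positive loss fraction.

    Returns the smallest observed loss L such that at least `confidence`
    of the observed losses are less than or equal to L.
    """
    if not returns:
        raise ValueError("returns must not be empty")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    losses = sorted(-r for r in returns)  # losses in ascending order
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round(confidence * len(losses), 9))
    return losses[rank - 1]


# Hypothetical daily returns; not data from the project.
daily_returns = [0.01, -0.02, 0.005, -0.035, 0.012,
                 -0.008, 0.02, -0.015, 0.003, -0.05]

var_90 = historical_var(daily_returns, confidence=0.90)
print(f"1-day 90% VaR: {var_90:.1%} of position value")
```

Read as: on 90% of the sampled days the loss did not exceed this fraction. A GARCH-family model such as O-GARCH replaces the raw empirical distribution with a time-varying volatility forecast, which this sketch does not attempt.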



Assessing Bias in Mortgage Lending Using Supervised Machine Learning Methods

I applied supervised machine learning algorithms in Python to assess bias in mortgage lending decisions. The project predicts loan approval outcomes from applicant data, with the aim of mitigating bias and improving decision-making in mortgage lending through data-driven insights.



SmartRetail: Customer Segmentation for Micro-Targeting

I implemented customer segmentation techniques in R to enhance marketing strategies. By analyzing consumer data, the project identifies distinct customer groups, enabling more targeted and effective marketing campaigns tailored to specific audience segments.



Airbnb Housing Factors Influencing Prices Project

The objective of this Python project is to analyze various factors affecting Airbnb pricing. By examining data on property features, locations, and host attributes, the study identifies key determinants that influence rental prices, providing insights for hosts to optimize their listings.



Capital One Fictional Company Credit Card Customer Churn

This project develops a machine learning framework using AdaBoost and feature engineering to predict credit card customer churn with 78.61% accuracy, achieving an 89.2% ROI through targeted retention campaigns. The solution combines SMOTEENN resampling for class imbalance, behavioral analysis, and an interactive Streamlit dashboard that provides real-time risk assessment and actionable insights to prevent customer attrition, generating $384,750 in annual net business benefit.



Predicting Medical Insurance Premiums with Ensemble and Gradient Boosting ML Methods

This class project develops an ML framework using XGBoost regression and K-means clustering to predict medical insurance premiums. The model reaches an R² of 0.803 and outperforms traditional statistical models through feature engineering of health profiles, including age, BMI, chronic diseases, and surgical history. The solution combines predictive modeling with risk stratification into four tiers (Low to Very High Risk). As personal future work, an interactive R Shiny dashboard is planned to enable real-time premium calculations and personalized pricing strategies for insurance underwriters.
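Because the premium model is scored with R², a short sketch of how that metric is computed may help. R² is the share of variance in the target that the predictions explain, which is different from classification accuracy. The function name `r_squared` and the premium values below are hypothetical and not taken from the project.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Hypothetical premiums (in thousands); not data from the project.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

score = r_squared(actual, predicted)
print(f"R² = {score:.3f}")
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean premium.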



Micro Projects / Job Simulations

Micro projects I have worked on:

Certifications

Skills

Contact

You can reach out to me via email or connect with me on LinkedIn.