Utilizing the Apache Spark ecosystem to build scalable machine learning pipelines for large datasets, demonstrating the full cycle from preprocessing to model evaluation across multiple ML tasks.
This project leverages Apache Spark's distributed computing capabilities to implement scalable machine learning pipelines. It demonstrates end-to-end data processing: raw data ingestion, distributed preprocessing, model training, and evaluation on large-scale datasets.
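As a rough illustration of what such a pipeline looks like, the sketch below chains feature assembly, scaling, and a classifier with `pyspark.ml.Pipeline`. The input path, column names, and choice of logistic regression are placeholders, not the project's actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("ml-pipeline").getOrCreate()

# Hypothetical dataset: numeric feature columns plus a "label" column.
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)
train, test = df.randomSplit([0.8, 0.2], seed=42)

feature_cols = [c for c in df.columns if c != "label"]

# Pipeline stages: assemble raw columns into a vector, scale, then classify.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, scaler, lr])
model = pipeline.fit(train)

# Evaluate held-out predictions with a standard metric (F1 as an example).
predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1"
)
print("F1 score:", evaluator.evaluate(predictions))
```

Because the stages run as Spark transformations, the resulting execution plan can be inspected (e.g. via `predictions.explain()`) and visualized as a DAG in the Spark UI.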
Execution plans were visualized using Spark DAGs, and models were evaluated based on: