Distributed Systems • Java • Spring Boot

Distributed File Storage System

A distributed file storage system that provides scalable, fault-tolerant storage across multiple nodes, using consistent hashing, data replication, and a REST API.

Project Goals

Create a highly available distributed storage system
Implement efficient file distribution across multiple nodes
Ensure data integrity through replication mechanisms
Develop a user-friendly API for file operations
Build fault tolerance for node failures
Design a scalable architecture for future growth

System Architecture

The system follows a distributed architecture with the following key components:

API Gateway: Central entry point for all client requests
Metadata Service: Tracks file locations and node status
Storage Nodes: Distributed servers that store file chunks
Replication Manager: Ensures data redundancy across nodes
Client Application: Interface for end-users

Key Features

Consistent Hashing

Distributes files across nodes based on content hash
Minimizes data redistribution when nodes join or leave
Keeps load approximately balanced across storage nodes
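
The placement idea above can be sketched as a hash ring backed by a sorted map. This is a minimal illustration, not the project's actual code: the class name `ConsistentHashRing`, the use of SHA-256 for ring positions, and the use of virtual nodes to even out load are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative consistent-hash ring; all names here are hypothetical. */
final class ConsistentHashRing {
    // Ring position -> owning node, kept sorted so lookups can walk clockwise.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    // Assumption: each node is placed at several positions to smooth out load.
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        if (virtualNodes < 1) {
            throw new IllegalArgumentException("virtualNodes must be >= 1");
        }
        this.virtualNodes = virtualNodes;
    }

    void addNode(String nodeId) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(nodeId + "#" + i), nodeId);
        }
    }

    void removeNode(String nodeId) {
        for (int i = 0; i < virtualNodes; i++) {
            // Two-argument remove only deletes positions still owned by this node.
            ring.remove(hash(nodeId + "#" + i), nodeId);
        }
    }

    /** Returns the node owning the first ring position at or after the key's hash. */
    String nodeFor(String contentHash) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no storage nodes registered");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(contentHash));
        // Wrap around to the start of the ring when the key hashes past the last position.
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** Uses the first 8 bytes of a SHA-256 digest as the ring position. */
    static long hash(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is unavailable", e);
        }
    }
}
```

With this layout, removing a node only reassigns the keys that node owned; every other key keeps its previous owner, which is the property that limits redistribution when membership changes.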

Data Replication

Maintains multiple copies of each file across different nodes
Configurable replication factor based on data importance
Automatic repair mechanism when nodes recover
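
One way to restore the replication factor after a failure is to compare the nodes that currently hold a file against the set of healthy nodes and copy the file to as many additional nodes as are missing. The sketch below only computes the repair plan; `ReplicaRepairPlanner` and `planRepair` are hypothetical names, and how the project actually selects repair targets is not stated in the source.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative repair planner; names and selection policy are assumptions. */
final class ReplicaRepairPlanner {

    /**
     * Returns the healthy nodes that should receive a new copy so the file
     * reaches replicationFactor live replicas (or as close as possible).
     */
    static List<String> planRepair(Set<String> currentHolders,
                                   List<String> healthyNodes,
                                   int replicationFactor) {
        // De-duplicate candidates while keeping the caller's preference order.
        Set<String> candidates = new LinkedHashSet<>(healthyNodes);

        // Only replicas on healthy nodes count toward the replication factor.
        Set<String> liveHolders = new LinkedHashSet<>(currentHolders);
        liveHolders.retainAll(candidates);

        int missing = replicationFactor - liveHolders.size();
        List<String> targets = new ArrayList<>();
        for (String node : candidates) {
            if (missing <= 0) {
                break;
            }
            if (!liveHolders.contains(node)) {
                targets.add(node);
                missing--;
            }
        }
        return targets;
    }
}
```

If fewer healthy nodes are available than the replication factor requires, the plan simply lists every eligible node, leaving the file under-replicated until more nodes return.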

Fault Tolerance

Graceful handling of node failures without data loss
Automatic redistribution of workload to healthy nodes
Health monitoring and automatic recovery procedures
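
Health monitoring of this kind is often built on heartbeats: each node reports in periodically, and a node that stays silent longer than a timeout is treated as failed. The following is a minimal sketch under that assumption; `HeartbeatMonitor` is a hypothetical name, and the source does not say whether the project uses heartbeats, ZooKeeper ephemeral nodes, or another mechanism. Timestamps are passed in explicitly so the logic can be tested without a real clock.

```java
import java.time.Duration;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative timeout-based failure detector; names are hypothetical. */
final class HeartbeatMonitor {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatMonitor(Duration timeout) {
        this.timeoutMillis = timeout.toMillis();
    }

    /** Records that a node reported in at the given time. */
    void recordHeartbeat(String nodeId, long nowMillis) {
        lastSeenMillis.put(nodeId, nowMillis);
    }

    /** A node is healthy if it has reported in within the timeout window. */
    boolean isHealthy(String nodeId, long nowMillis) {
        Long lastSeen = lastSeenMillis.get(nodeId);
        return lastSeen != null && nowMillis - lastSeen <= timeoutMillis;
    }

    /** Returns all nodes currently considered healthy, sorted by id. */
    Set<String> healthyNodes(long nowMillis) {
        Set<String> healthy = new TreeSet<>();
        for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
            if (nowMillis - entry.getValue() <= timeoutMillis) {
                healthy.add(entry.getKey());
            }
        }
        return healthy;
    }
}
```

A node that resumes sending heartbeats is counted as healthy again on its next report, which gives a simple form of automatic recovery.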

REST API Interface

Comprehensive API for file operations (upload, download, delete)
File versioning and metadata management
Authentication and access control mechanisms
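
The upload, download, delete, and versioning operations could sit on top of a store like the one sketched below. This is an in-memory illustration only: `VersionedFileStore` and its methods are hypothetical, and the HTTP mapping (for example a Spring controller), authentication, and persistence are deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Illustrative in-memory versioned store; not the project's actual API. */
final class VersionedFileStore {
    // File id -> ordered history of contents; index 0 holds version 1.
    private final Map<String, List<byte[]>> versions = new HashMap<>();

    /** Stores a new version and returns its 1-based version number. */
    synchronized int upload(String fileId, byte[] content) {
        List<byte[]> history = versions.computeIfAbsent(fileId, id -> new ArrayList<>());
        history.add(content.clone()); // defensive copy of the caller's array
        return history.size();
    }

    /** Returns the requested version, or empty if the file or version is unknown. */
    synchronized Optional<byte[]> download(String fileId, int version) {
        List<byte[]> history = versions.get(fileId);
        if (history == null || version < 1 || version > history.size()) {
            return Optional.empty();
        }
        return Optional.of(history.get(version - 1).clone());
    }

    /** Returns the most recent version, or empty if the file is unknown. */
    synchronized Optional<byte[]> downloadLatest(String fileId) {
        List<byte[]> history = versions.get(fileId);
        return history == null ? Optional.empty() : download(fileId, history.size());
    }

    /** Deletes every version of the file; returns false if it did not exist. */
    synchronized boolean delete(String fileId) {
        return versions.remove(fileId) != null;
    }
}
```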

Technical Implementation

Backend Development

Implemented in Java with the Spring Boot framework
Used Apache ZooKeeper for distributed coordination
Implemented custom serialization for efficient data transfer
Created asynchronous processing for large file operations
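
Asynchronous handling of a large file can be sketched with `CompletableFuture`: each chunk is processed on a thread pool, and the results are gathered once all chunks finish. The class `AsyncChunkProcessor` and the shape of `processAll` are assumptions for illustration; the source does not describe the project's actual concurrency model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

/** Illustrative async fan-out over file chunks; names are hypothetical. */
final class AsyncChunkProcessor {

    /**
     * Applies the task to every chunk on the given pool and completes with
     * the results in the same order as the input chunks.
     */
    static <T> CompletableFuture<List<T>> processAll(List<byte[]> chunks,
                                                     Function<byte[], T> task,
                                                     ExecutorService pool) {
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (byte[] chunk : chunks) {
            futures.add(CompletableFuture.supplyAsync(() -> task.apply(chunk), pool));
        }
        // allOf completes when every chunk has finished (or one has failed).
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    List<T> results = new ArrayList<>();
                    for (CompletableFuture<T> future : futures) {
                        results.add(future.join()); // already complete, does not block
                    }
                    return results;
                });
    }
}
```

The caller's thread is free while chunks are processed; it can attach further stages to the returned future or call `join()` only when the result is needed.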

Client Application

Developed a command-line client with an intuitive interface
Implemented chunking for large file uploads
Added retry mechanisms for network failures
Created progress tracking for long-running operations
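
Chunking and retries on the client side might look like the sketch below: the file is split into fixed-size chunks, and each network call is retried with exponential backoff when it fails with an `IOException`. The names `ChunkedUploader`, `split`, and `withRetry`, the backoff policy, and the choice to retry only on `IOException` are assumptions for the example, not details taken from the project.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

/** Illustrative client-side helpers; names and policies are hypothetical. */
final class ChunkedUploader {

    /** Splits data into chunks of at most chunkSize bytes, preserving order. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        int offset = 0;
        while (offset < data.length) {
            int length = Math.min(chunkSize, data.length - offset);
            chunks.add(Arrays.copyOfRange(data, offset, offset + length));
            offset += length;
        }
        return chunks;
    }

    /**
     * Runs the operation, retrying on IOException with a backoff that doubles
     * after each failed attempt. Rethrows the last failure once attempts run out.
     */
    static <T> T withRetry(Callable<T> operation, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoff);
                backoff *= 2;
            }
        }
    }
}
```

A client would typically call `split` once and wrap each chunk's upload call in `withRetry`, so a transient network failure only repeats the affected chunk rather than the whole file.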

Testing Strategy

Comprehensive unit tests for core components
Integration tests for system-wide behavior
Chaos testing to simulate node failures
Performance testing under various load conditions

Performance Metrics

Throughput: Up to 500 MB/s with 10 storage nodes
Latency: Average 120ms response time for file retrieval
Replication: 98% of nodes synchronized within 2 seconds
Recovery: Complete node recovery in under 5 minutes
Scalability: Linear performance scaling in tests with up to 50 nodes

Technologies Used

Backend: Java, Spring Boot, Apache ZooKeeper
Data Storage: Custom file system implementation
Communication: gRPC, REST APIs
Deployment: Docker, Kubernetes
Monitoring: Prometheus, Grafana
Testing: JUnit, Mockito, Chaos Monkey

Lessons Learned

Importance of proper failure detection mechanisms
Benefits of asynchronous operations for high throughput
Challenges in maintaining consistency across distributed systems
Value of comprehensive logging for troubleshooting
Tradeoffs between consistency, availability, and partition tolerance

Future Enhancements

Implementation of erasure coding for storage efficiency
Geographic distribution for disaster recovery
Addition of data compression algorithms
Support for an S3-compatible API
Web-based administration interface