Overview
Introduction
At Samsung Research, I initiated this project as part of the Health Research Platform. A key feature of the platform was to provide visualized and analyzed data collected from patients and health experiment volunteers. However, the existing data processing architecture could not keep up with the growing data volume as the project expanded. Since millions of health-related data points were being generated by mobile and wearable devices, a more robust data pipeline was required. Through this project, the Health Research Platform successfully delivered its core features: health data analysis and visualization.
Task
- System Design
    - Decomposition into Microservices (module separation by tasks)
    - Resource Allocation
    - Cost Optimization
    - Deployment Design
- Data Processing
    - Data Transfer
    - Data Compression
    - Data Formatting
    - Data Visualization
- Data Repository Management
    - MongoDB
    - AWS S3
    - AWS Redshift
    - PostgreSQL
Approach
- System Design
    - Decomposition into Microservices (module separation by tasks)
        - Before this project, the backend server handled all data processing.
        - Separated resource-heavy tasks from the backend server to minimize performance degradation in business logic computation.
    - Resource Allocation
        - Utilized AWS EFS for data processing jobs to prevent node disk-full errors.
        - Allocated appropriate CPU, RAM, and disk resources per task.
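Per-task resource allocation of this kind is typically expressed as a Kubernetes resource stanza on each job's pod spec; the figures below are illustrative assumptions, not the platform's actual values.

```yaml
# Illustrative resources for one data-processing job (values are assumptions).
# Large working files live on the shared EFS mount, so node-local
# ephemeral-storage requests can stay small.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
    ephemeral-storage: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
    ephemeral-storage: "2Gi"
```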
    - Cost Optimization
        - Moved data not required for business logic out of the database and into the storage system.
    - Deployment Design
        - Designed the microservices deployment.
        - Parameterized Helm charts so deployments adapt to new data sources.
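Helm parameterization of this kind can be sketched as a values file keyed by data source, so onboarding a new source becomes a values change rather than a chart change. All names, schedules, and images below are hypothetical, not the platform's actual chart.

```yaml
# values.yaml (hypothetical): one parameter set per data source.
dataSource:
  name: wearable-sensor            # assumption: source identifier
  s3Bucket: raw-health-data        # assumption: source bucket
  schedule: "0 * * * *"            # CronJob schedule for the processing job
  image:
    repository: registry.example/formatter
    tag: "1.4.2"
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
```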
- Data Processing
    - Data Transfer
        - Executed parallel and concurrent Rclone transfers to and from S3 using Bash scripts.
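The production transfers were driven by Bash, but the same fan-out pattern can be sketched in Python: several `rclone copy` processes run concurrently, and each rclone process itself moves files in parallel via its `--transfers` flag. Remote and path names below are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def rclone_copy_cmd(src: str, dst: str, transfers: int = 8) -> list:
    """Build an `rclone copy` command that moves `transfers` files in parallel."""
    return ["rclone", "copy", src, dst, f"--transfers={transfers}"]

def transfer_all(pairs, max_workers: int = 4) -> None:
    """Run one rclone process per (src, dst) pair, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(subprocess.run, rclone_copy_cmd(src, dst), check=True)
            for src, dst in pairs
        ]
        for future in futures:
            future.result()  # surface any failed transfer

# Hypothetical usage (the S3 remote must already be configured in rclone.conf):
# transfer_all([
#     ("s3:raw-health-data/2023-01-01", "/mnt/efs/raw/2023-01-01"),
#     ("/mnt/efs/compressed", "s3:processed-health-data/compressed"),
# ])
```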
    - Data Compression
        - Implemented parallel and concurrent compression of raw data on the shared file system (AWS EFS).
        - Scheduled the compression Bash scripts with a Kubernetes CronJob.
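A minimal Python sketch of the same idea: compress every matching file under a directory concurrently, keeping only the compressed copies. The production jobs used Bash scripts; the mount path and file pattern here are assumptions.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def compress_file(path: Path) -> Path:
    """gzip one file and remove the uncompressed original."""
    out = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(out, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, never loads the whole file
    path.unlink()
    return out

def compress_tree(root: Path, pattern: str = "*.csv", workers: int = 4) -> list:
    """Compress all files matching `pattern` under `root` in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_file, sorted(root.rglob(pattern))))

# On the platform this would run against the shared EFS mount, e.g.:
# compress_tree(Path("/mnt/efs/raw"), pattern="*.csv", workers=8)
```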
    - Data Formatting
        - Airflow
            - Formatted unstructured data from S3 and stored it in Redshift for visualization.
            - Refactored legacy code to reduce in-memory load, preventing OOM errors.
        - Kubernetes CronJob & Python scripts
            - Reformatted unstructured data from MongoDB and stored it in Redshift for visualization.
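The formatting step can be sketched as a streaming transform: each nested record is flattened and emitted as one CSV line for a Redshift `COPY`, so the full dataset is never materialized in memory (the approach behind the OOM fix). The field names below are hypothetical, not the platform's schema.

```python
import csv
import io
from typing import Iterable, Iterator

def flatten(record: dict) -> list:
    """Flatten one nested health record into a flat row (hypothetical schema)."""
    return [
        record["user"]["id"],
        record["metric"],
        record["payload"]["value"],
        record["payload"]["ts"],
    ]

def to_csv_lines(records: Iterable[dict]) -> Iterator[str]:
    """Yield CSV lines lazily instead of building the whole file in RAM."""
    for record in records:
        buf = io.StringIO()
        csv.writer(buf).writerow(flatten(record))
        yield buf.getvalue()

# Downstream, the lines would be staged to S3 and bulk-loaded with
# Redshift's COPY command rather than inserted row by row.
```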
    - Data Visualization
        - Wrote SQL queries for visualization in Apache Superset.
        - Embedded Apache Superset dashboards into the frontend.
- Data Repository Management
    - MongoDB
        - Configured sharding and replication.
        - Stored non-relational data required for service business logic (e.g., sensor data).
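The sharding setup can be sketched as the admin commands a configuration script would issue; database, collection, and shard-key names are hypothetical, and running them requires a `mongos` router (e.g., via PyMongo).

```python
def shard_commands(db: str, coll: str, key: dict) -> list:
    """MongoDB admin commands that enable sharding for one collection."""
    return [
        {"enableSharding": db},                           # shard the database
        {"shardCollection": f"{db}.{coll}", "key": key},  # then the collection
    ]

# Hypothetical usage with PyMongo against a mongos router:
# from pymongo import MongoClient
# client = MongoClient("mongodb://mongos:27017")
# for cmd in shard_commands("health", "sensor_data", {"device_id": "hashed"}):
#     client.admin.command(cmd)
```

A hashed shard key spreads high-volume sensor writes evenly across shards, at the cost of efficient range queries on that field.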
    - AWS S3
    - AWS Redshift
        - Stored formatted data for visualization.
    - PostgreSQL
        - Stored relational data required for service business logic (e.g., user info).