Data Engineering
Data Analytics & Activity Tracking System
Objective
Build a real-time activity tracking platform that consolidates metadata from diverse sources, enabling semantic search and AI-powered duplicate detection through an analytics dashboard.
About the Project
A platform that tracks and analyzes file activity across multiple data sources, including Azure File Storage, Microsoft Teams, and other systems, with automated workflows and real-time insights.
Automated metadata processing from multiple regional data sources using scheduled Airflow DAGs. Designed a scalable queuing system to handle data pipelines efficiently across diverse sources, and consolidated the metadata into a structured DuckDB store for high-performance querying and analysis.
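The queue-and-consolidate idea can be illustrated with a minimal sketch. It is not the project's actual pipeline: it uses only the Python standard library, with an in-memory dictionary standing in for the DuckDB table and a plain function call standing in for an Airflow task. The names `FileEvent`, `normalize`, and `run_pipeline`, the source labels, and the raw record fields (`path`, `action`, `ts`) are all assumptions made for illustration.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEvent:
    """One normalized file-activity record (hypothetical schema)."""
    source: str       # e.g. "azure_files" or "teams" (illustrative labels)
    path: str
    action: str
    timestamp: float


def normalize(raw: dict, source: str) -> FileEvent:
    """Map a raw per-source record onto the shared schema."""
    return FileEvent(
        source=source,
        path=raw["path"].strip().lower(),
        action=raw.get("action", "unknown"),
        timestamp=float(raw["ts"]),
    )


def run_pipeline(batches: dict, workers: int = 2) -> list:
    """Process one batch per source via a work queue and keep the
    most recent event for each (source, path) pair."""
    tasks = queue.Queue()
    for source, rows in batches.items():
        tasks.put((source, rows))

    latest = {}                 # stand-in for the consolidated table
    lock = threading.Lock()     # guards the shared dictionary

    def worker() -> None:
        while True:
            try:
                source, rows = tasks.get_nowait()
            except queue.Empty:
                return          # no batches left for this worker
            for raw in rows:
                event = normalize(raw, source)
                key = (event.source, event.path)
                with lock:
                    current = latest.get(key)
                    if current is None or event.timestamp > current.timestamp:
                        latest[key] = event

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(latest.values(), key=lambda e: (e.source, e.path))
```

In a real deployment each batch would more likely be produced by a scheduled DAG run, and the consolidated rows written to a database table rather than held in memory.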
Built an intuitive Streamlit analytics dashboard for real-time visualization and reporting of file activity. Integrated AI tools to enable semantic file search, detect duplicate projects, and surface related content using embeddings from fine-tuned pre-trained models.
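Embedding-based search and duplicate detection both reduce to comparing vectors by cosine similarity. The sketch below shows that mechanism under a loud assumption: simple token-count vectors stand in for the model embeddings, so it captures word overlap rather than true semantic similarity. The function names and the 0.8 threshold are hypothetical, not taken from the project.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a model embedding: a sparse token-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, docs: dict, top_k: int = 3) -> list:
    """Rank documents by similarity to the query, best first."""
    q = embed(query)
    scored = [(name, cosine(q, embed(text))) for name, text in docs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def find_duplicates(docs: dict, threshold: float = 0.8) -> list:
    """Return (name_a, name_b, score) for every pair at or above threshold."""
    names = sorted(docs)
    vectors = {name: embed(docs[name]) for name in names}
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = cosine(vectors[a], vectors[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs
```

Swapping `embed` for a call to an actual embedding model would leave `search` and `find_duplicates` structurally the same, provided `cosine` is adapted to dense vectors. The pairwise loop is quadratic in the number of documents, so a larger corpus would typically need an approximate nearest-neighbor index instead.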
