Charles Lee


I am a graduate of the data analytics boot camp offered by the continuing education division of the University of California, Irvine. The program covered a broad range of topics in data analytics, including machine learning, big data analytics, data visualization, web scraping, social media mining, front-end visualizations, SQL and NoSQL databases, full-stack development, and business intelligence software (Tableau). We also covered the programming languages Python, R, and JavaScript.

Currently, I work full-time as a design verification engineer on a chip design team at Broadcom, but I am transitioning into the data analytics realm. I also freelance part-time as a data scientist for Mixcord. During this transition I have continued to develop my skills through courses at Business Science University, DataCamp, and Udemy, and I am currently enrolled in Udacity's Data Engineering Nanodegree program. Overall, I love data analytics and hope to keep growing in this field.

View My LinkedIn Profile

View My GitHub Profile

Data Engineering Projects

Project 1: Data Modeling with Postgres (ETL Script with Postgres): Created an ETL script that reads music data from JSON sources and stores it in a Postgres database. The JSON files consist of song metadata and user-event logs from a music streaming service. The script locates every JSON source file, parses each one, and loads the data into a Postgres database designed with a star schema optimized for song play analysis.
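A minimal sketch of this pattern, assuming a local Postgres instance; the database name, credentials, file layout, and songs-table schema here are illustrative, not the project's actual schema:

```python
# Illustrative ETL sketch: locate JSON files, parse them, load a star-schema
# dimension table. Connection string and schema are placeholder assumptions.
import glob
import json
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=musicdb user=student password=student")
cur = conn.cursor()

# One dimension table of the star schema, shown for illustration.
cur.execute(
    "CREATE TABLE IF NOT EXISTS songs ("
    "  song_id varchar PRIMARY KEY, title varchar, artist_id varchar,"
    "  year int, duration numeric)"
)

# Locate every JSON source file under the data directory and parse each one.
for path in glob.glob("data/song_data/**/*.json", recursive=True):
    with open(path) as f:
        record = json.load(f)
    cur.execute(
        "INSERT INTO songs (song_id, title, artist_id, year, duration) "
        "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING",
        (record["song_id"], record["title"], record["artist_id"],
         record["year"], record["duration"]),
    )

conn.commit()
conn.close()
```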

Project 2: Data Modeling with Apache Cassandra (ETL Script with Apache Cassandra): Created an ETL script that reads data from CSV source files, then cleans and transforms it for storage in an Apache Cassandra backend. Tables are partitioned according to the queries they serve. Also created a test script that queries the Cassandra backend for testing and validation.
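A minimal sketch of query-driven partitioning, assuming a local Cassandra node; the keyspace, table, and CSV column names are illustrative assumptions:

```python
# Illustrative sketch: create a table partitioned for its target query,
# then insert cleaned CSV rows. Names are placeholders, not the real schema.
import csv
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS music "
    "WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.set_keyspace("music")

# Partition on session_id so a lookup by session touches a single partition;
# item_in_session orders rows within that partition.
session.execute(
    "CREATE TABLE IF NOT EXISTS song_plays ("
    "  session_id int, item_in_session int, artist text, song text, length float,"
    "  PRIMARY KEY (session_id, item_in_session))"
)

with open("event_data.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        session.execute(
            "INSERT INTO song_plays (session_id, item_in_session, artist, song, length) "
            "VALUES (%s, %s, %s, %s, %s)",
            (int(row["sessionId"]), int(row["itemInSession"]),
             row["artist"], row["song"], float(row["length"])),
        )
```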

Project 3: Data Warehouse with Amazon Redshift (ETL Script with Amazon Redshift): Built an ETL pipeline in Python for a database hosted on Redshift. The pipeline loads data from S3 into staging tables on Redshift, then runs SQL statements against the staging tables to build analytics tables in a star-schema design. Test queries confirm the pipeline ran successfully.
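A minimal sketch of the staging-then-transform pattern; the cluster endpoint, credentials, IAM role, bucket paths, and table names below are all placeholder assumptions:

```python
# Illustrative sketch: COPY raw data from S3 into a staging table, then
# transform staged rows into a star-schema fact table with SQL.
import psycopg2

COPY_STAGING_EVENTS = """
    COPY staging_events
    FROM 's3://my-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::000000000000:role/myRedshiftRole'
    FORMAT AS JSON 'auto';
"""

# Build the songplays fact table from the staged events and songs.
INSERT_SONGPLAYS = """
    INSERT INTO songplays (start_time, user_id, song_id, artist_id, session_id)
    SELECT e.ts, e.user_id, s.song_id, s.artist_id, e.session_id
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""

conn = psycopg2.connect(
    "host=my-cluster.abc123.us-west-2.redshift.amazonaws.com "
    "dbname=dev user=awsuser password=changeme port=5439"
)
cur = conn.cursor()
for statement in (COPY_STAGING_EVENTS, INSERT_SONGPLAYS):
    cur.execute(statement)
    conn.commit()
conn.close()
```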

Project 4: Data Lake ETL Pipeline using Apache Spark (ETL Script Data Lake): Created an ETL script, written in PySpark, that loads data from S3, cleans and formats it into star-schema tables optimized for analytical queries, and writes the results back to S3 for storage.
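A minimal sketch of one table in that flow, assuming S3 credentials and the hadoop-aws connector are configured; bucket paths and column names are illustrative:

```python
# Illustrative sketch: read raw JSON from S3 with PySpark, shape a star-schema
# dimension table, and write it back to S3 as partitioned Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-lake-etl").getOrCreate()

# Read the raw song data from the lake's landing zone.
song_data = spark.read.json("s3a://my-bucket/song_data/*/*/*/*.json")

# Select the columns for the songs dimension and drop duplicates.
songs_table = (
    song_data
    .select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Partition by year and artist so downstream queries can prune files.
(
    songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://my-bucket/analytics/songs/")
)
```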

Project 5: Data Pipeline with Apache Airflow: Built an ETL pipeline using Apache Airflow that copies data from an S3 bucket into staging tables in Amazon Redshift. From the staging tables, fact and dimension tables are created for song play analysis using a star-schema design. Finally, data quality checks verify that the data was read and transformed correctly.
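A minimal sketch of the DAG's shape, assuming Airflow 2.x; EmptyOperator tasks stand in for the project's custom stage, load, and quality-check operators, and the DAG id and task ids are illustrative:

```python
# Illustrative DAG skeleton: stage from S3 to Redshift, build the fact table,
# build dimensions, then run data-quality checks. Operators are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow 2.3+

with DAG(
    "songplay_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # "schedule_interval" on Airflow < 2.4
    catchup=False,
) as dag:
    stage_events = EmptyOperator(task_id="stage_events_to_redshift")
    stage_songs = EmptyOperator(task_id="stage_songs_to_redshift")
    load_songplays = EmptyOperator(task_id="load_songplays_fact_table")
    load_dimensions = EmptyOperator(task_id="load_dimension_tables")
    quality_checks = EmptyOperator(task_id="run_data_quality_checks")

    # Both staging tasks must finish before the fact table is built.
    [stage_events, stage_songs] >> load_songplays
    load_songplays >> load_dimensions >> quality_checks
```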