DSC 190 C00 Algorithms for Data Science

Instructor

Arya Mazumdar arya@ucsd.edu

Class

MoWe 11:00 - 12:20

Due to ongoing Covid-19 pandemic, the lectures will be delivered over zoom: https://ucsd.zoom.us/j/94864021235 Email instructor for password.

Discussion/Office Hour

Tu 12:00 - 12:50

Description

With the advent of large-scale machine learning, online social networks, and computationally intensive models, data scientists must deal with data that is massive in size, arrives fast, and must be processed within interactive or online manner. This course studies the mathematical foundations of massive data processing, developing algorithms and analyzing them. We explore methods for sampling, sketching, and distributed processing of large scale databases, clustering, dimensionality reduction, and methods of optimization for the purpose of scalable statistical description, querying, pattern mining, and learning from data.

Textbook

Mainly the following book will be followed.

  1. Avrim Blum, John Hopcroft, and Ravindran Kannan: Foundations of Data Science. Link.

We will also refer to the following book.

  1. Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets. Link.

Both the books are available freely, follow the links.

Grading Plan

There will be 4 home assignments, 1 midterm exam and 1 project.

Homework Plan

Homeworks are individual effort. Each of the four homeworks will account for 10% of the grade. You will have one week to complete the homework. Homework are due 11 am Pacific time on the specified day.

Schedule

Lecture Date Topics Notes
1 Mo Jan 4 Course logistics, Singular Value Decomposition and Dimensionality Reduction Lecture 1
2 We Jan 6 SVD Lecture 2
3 Mo Jan 11 PCA, Power Method, HITS Algorithm, PageRank Homework 1 Out Lecture 3
4 We Jan 13 Markov Chain, PageRank Lecture 4
5 We Jan 20 Pagerank, Clustering, k-Center, Farthest Traversal Homework 1 Submission Lecture 5
6 Mo Jan 25 k-means, Hierarchical Clustering, CURE Lecture 6
7 We Jan 27 Spectral Clustering Homework 2 Out Lecture 7
8 Mo Feb 1 Similarity Queries, Min-Hashing, Signatures Lecture 8
9 We Feb 3 Locality Sensitive Hashing (LSH) Lecture 9
10 Mo Feb 8 Data Streaming, Bloom Filter Homework 2 Submission Lecture 10
Midterm We Feb 10 In-class exam: Open Book. Midterm
11 We Feb 17 Bloom Filter, Probability Review Lecture 11
12 Mo Feb 22 Probability Review, Count-Distinct: the Flajolet-Martin Algorithm Lecture 12
13 We Feb 24 Analysis of FM, Count-Min Sketch, Heavy Hitters Homework 3 (combines 4) Out Lecture 13
14 Mo Mar 1 Count-Min Sketch, Heavy Hitters; Supervised Learning: the Perceptron Lecture 14
15 We Mar 3 Perceptron, Linear Separability of Data and SVM Lecture 15
16 Mo Mar 8 Soft-SVM and Stochastic Gradient Descent Homework 3 Submission Lecture 16
17 We Mar 10 Convex Optimization and Gradient Descent, Recap Project Report Submission Lecture 17

Midterm

The exam (open book, open notes) will be uploaded here 11 am Feb 10. The completed exam should be submitted by 11am the next day.