10605: Machine Learning with Large Datasets
Large datasets pose difficulties across the machine learning pipeline. They are difficult to visualize, and it can be hard to determine what sorts of errors and biases may be present in them. They are computationally expensive to process, and the cost of learning is often hard to predict---for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
This course is intended to provide students with practical knowledge of, and experience with, the issues involved in working with large datasets. Among the topics considered are: data cleaning, visualization, and pre-processing at scale; principles of parallel and distributed computing for machine learning; techniques for scalable deep learning; analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity; and methods for low-latency inference. Through the coursework, we gained experience with common large-scale computing libraries and infrastructure, including Databricks, Apache Spark, and TensorFlow.
One of the major projects submitted for the coursework can be seen here.
Major Assignments (an illustrative code sketch for each follows the list):
1. Building a word count application using Spark
2. Entity resolution via text similarity
3. Linear regression model to predict the release year of a song from a set of audio features, using MLlib from PySpark
4. Click-through rate (CTR) prediction pipeline
5. Neural style transfer in TensorFlow
6. Autodiff & MLP to classify the DB media dataset
7. Model compression and optimization methods
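
For assignment 1, the canonical Spark word count maps each line to (word, 1) pairs and reduces by key. A minimal PySpark sketch, assuming a local SparkSession and a hypothetical plain-text file `input.txt`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.sparkContext.textFile("input.txt")  # hypothetical input path
counts = (lines.flatMap(lambda line: line.lower().split())  # tokenize each line
               .map(lambda word: (word, 1))                 # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))            # sum counts per word

# Print the ten most frequent words.
for word, count in counts.takeOrdered(10, key=lambda wc: -wc[1]):
    print(word, count)

spark.stop()
```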
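
For assignment 2, one standard formulation of entity resolution as text similarity is to embed each record as a TF-IDF vector and declare pairs above a cosine-similarity threshold to be matches. A small sketch with scikit-learn; the records and the 0.5 threshold are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two illustrative record sets describing (possibly) the same entities.
records_a = ["Apple iPhone 13 128GB black", "Sony WH-1000XM4 headphones"]
records_b = ["iPhone 13 (black, 128 GB) by Apple", "Bose QC45 headphones"]

# Fit one TF-IDF vocabulary over both sets so the vectors are comparable.
vec = TfidfVectorizer().fit(records_a + records_b)
sims = cosine_similarity(vec.transform(records_a), vec.transform(records_b))

for i, row in enumerate(sims):
    for j, s in enumerate(row):
        if s > 0.5:  # illustrative match threshold
            print(f"match: {records_a[i]!r} <-> {records_b[j]!r} (sim={s:.2f})")
```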
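
For assignment 3, the release-year regression can be expressed as a short MLlib pipeline: assemble the audio features into a single vector column, then fit a regularized linear model. The toy schema below (a `year` label plus features `f1`..`f3`) merely stands in for the real audio features:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("SongYear").getOrCreate()

# Toy rows standing in for (release year, audio features).
df = spark.createDataFrame(
    [(2001.0, 0.5, 0.1, 0.9), (1987.0, 0.3, 0.7, 0.2), (1994.0, 0.6, 0.4, 0.5)],
    ["year", "f1", "f2", "f3"],
)

# MLlib estimators expect the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

lr = LinearRegression(featuresCol="features", labelCol="year", regParam=0.01)
model = lr.fit(train)
print("training RMSE:", model.summary.rootMeanSquaredError)

spark.stop()
```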
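
For assignment 4, a CTR pipeline typically encodes high-cardinality categorical features with the hashing trick and trains logistic regression on the result. A hedged sketch using scikit-learn's FeatureHasher; the events and click labels are made up:

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Made-up ad-impression events with categorical features and click labels.
events = [
    {"site": "news",   "ad": "shoes", "device": "mobile"},
    {"site": "sports", "ad": "cars",  "device": "desktop"},
    {"site": "news",   "ad": "cars",  "device": "mobile"},
    {"site": "sports", "ad": "shoes", "device": "desktop"},
]
clicks = [1, 0, 1, 0]

# The hashing trick maps arbitrary feature strings into a fixed-width
# sparse vector, avoiding a precomputed one-hot vocabulary.
hasher = FeatureHasher(n_features=2**10, input_type="dict")
X = hasher.transform(events)

clf = LogisticRegression().fit(X, clicks)
print("P(click):", clf.predict_proba(X)[:, 1])
```

Hashing keeps the feature dimension fixed no matter how many distinct sites or ads appear, which is what makes this approach practical at web scale.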
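
For assignment 5, neural style transfer optimizes the pixels of a generated image so its deep features match the content image while its Gram matrices match the style image. A compact TensorFlow sketch; the VGG19 layer choices, step count, and loss weight are illustrative, and `content.jpg`/`style.jpg` are hypothetical inputs:

```python
import tensorflow as tf

def load_image(path):
    img = tf.image.decode_image(tf.io.read_file(path), channels=3,
                                expand_animations=False)
    img = tf.image.resize(tf.cast(img, tf.float32), (224, 224))
    return img[tf.newaxis] / 255.0  # batch of one, pixels in [0, 1]

# Frozen VGG19 as a feature extractor: one content layer, two style layers.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False
layer_names = ["block4_conv2", "block1_conv1", "block2_conv1"]
extractor = tf.keras.Model(vgg.input,
                           [vgg.get_layer(n).output for n in layer_names])

def features(img):
    return extractor(tf.keras.applications.vgg19.preprocess_input(img * 255.0))

def gram(x):  # style statistics: correlations between feature channels
    f = tf.reshape(x, (-1, x.shape[-1]))
    return tf.matmul(f, f, transpose_a=True) / tf.cast(tf.shape(f)[0], tf.float32)

content_img, style_img = load_image("content.jpg"), load_image("style.jpg")
content_target = features(content_img)[0]
style_targets = [gram(f) for f in features(style_img)[1:]]

image = tf.Variable(content_img)  # optimize pixels, starting from the content
opt = tf.keras.optimizers.Adam(learning_rate=0.02)
for step in range(200):
    with tf.GradientTape() as tape:
        feats = features(image)
        content_loss = tf.reduce_mean((feats[0] - content_target) ** 2)
        style_loss = tf.add_n([tf.reduce_mean((gram(f) - t) ** 2)
                               for f, t in zip(feats[1:], style_targets)])
        loss = content_loss + 1e-2 * style_loss  # illustrative style weight
    opt.apply_gradients([(tape.gradient(loss, image), image)])
    image.assign(tf.clip_by_value(image, 0.0, 1.0))
```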
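
For assignment 6, the heart of the autodiff exercise is reverse-mode differentiation: each node records its parents and a local backward rule, and `backward()` applies the chain rule in reverse topological order. A scalar-only toy sketch in that spirit (not the assignment's actual code):

```python
import math

class Value:
    """A scalar that remembers how it was computed."""
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents = parents
        self._backward = lambda: None  # local chain-rule step, set by each op

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def back():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = back
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def back():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = back
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def back():
            self.grad += (1.0 - t * t) * out.grad
        out._backward = back
        return out

    def backward(self):
        # Topologically order the graph, then apply chain rules in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# One tanh neuron, y = tanh(w*x + b); backward() fills in dy/dw, dy/dx, dy/db.
x, w, b = Value(0.5), Value(-2.0), Value(1.0)
y = (w * x + b).tanh()
y.backward()
print(y.data, w.grad, x.grad, b.grad)
```

Stacking such neurons gives the MLP; the same `backward()` traversal trains it.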
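
For assignment 7, one compression method in this family is magnitude pruning: zero out the smallest-magnitude weights so the matrix can be stored and served sparsely. A minimal NumPy sketch; the 90% sparsity level is illustrative:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude entries so `sparsity` of them are zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))          # stand-in for a trained weight matrix
w_pruned = prune_by_magnitude(w, 0.9)
print("nonzero fraction:", np.count_nonzero(w_pruned) / w.size)
```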