# Data Mining and Big Data Analytics

**Course code:** CNSC 6006

**Office:** N11 308

**Course schedule:** This course will take place during the first half of the winter term, starting on January 10th, 2018.

**Office hours:** by appointment

**Teaching Assistant:** Milán Janosov

## Prerequisites

You need to be proficient with Python to take this course – read the “to satisfy the prerequisite” section below. Basic programming skills and basic skills in statistics and linear algebra are required.

**IMPORTANT:** During the first class, we will hand out a test to check that students meet the prerequisites. Students who do not reach the minimum threshold will not be able to take the course, even if they are regularly registered. Both students currently registered and those on the waiting list need to take the test during the first class. The test does not count toward the final grade of the course.

## Course Description

Data mining and big data analytics is the process of examining data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. This course is an introduction to the concepts of data mining, machine learning and big data analytics. We will cover the key data mining methods of clustering, classification and pattern mining, together with practical tools for their execution. We will apply these tools to a number of datasets, showing how theory and digital traces of human activities at societal scale can help us understand and forecast many complex socio-economic phenomena. The course will take a *practical* approach, with homework assignments, hands-on classes and the development of a project.

Students are free to work in any computer language/network software they feel most comfortable with. However, during the class all examples and sample code will be provided in Python and Jupyter notebooks, and use of Python is strongly encouraged.

## Course Organization

Lectures: theory classes and hands-on sessions. Use of a computer will be required during some lectures. Students can use their own laptops. Instructions on the required software will be provided during the first class.

Topics and tentative calendar:

- Class 1: Test to check the prerequisites for the course. Introduction to the course. Introduction to data mining and the knowledge discovery process. Examples of application domains. Data types and formats.

- Class 2: Types of learning (e.g., supervised, unsupervised, semi-supervised, reinforcement learning). Data mining tasks (e.g., classification, regression, probability estimation, clustering). Exploratory data analysis and data understanding. Explanation vs. prediction.

- Class 3: Basic machine learning models: K-Nearest Neighbors. Decision Trees. Naïve Bayes classifiers.

- Class 4: Generalization, overfitting and underfitting. Cross-validation. Model evaluation and comparison (e.g., metrics for classification, metrics for regression, confusion matrix, precision-recall curves, ROC curves).

- Class 5: Hands-on session: application of concepts on data and real-world situations.

- Class 6: Alternative machine learning models: Support Vector Machines, Linear Discriminant Analysis, Ensemble methods.

- Class 7: Preprocessing and feature engineering (e.g., imputation, scaling, dealing with categorical variables). Feature selection. Dimensionality reduction. Learning from imbalanced data.

- Class 8: Clustering. Taxonomy of clustering concepts: distance-based (separation, centroids, contiguity), density-based, partitional vs. hierarchical. Methods for centroid-based clustering (k-means), hierarchical clustering (single, complete and average linkage), density-based clustering (DBSCAN).

- Class 9: Introduction to frequent itemset mining. Applications for finding association rules. Level-wise algorithms, Apriori. Introduction to recommender systems.

- Class 10: Final project presentation.
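As a rough illustration of two topics from the calendar above (K-Nearest Neighbors from Class 3 and cross-validation from Class 4), here is a minimal Python sketch. It is not course material: the function names `knn_predict` and `k_fold_accuracy`, the toy data and the parameter defaults are all assumptions made for this example. It uses only the standard library and assumes Python 3.8 or later for `math.dist`.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to x.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def k_fold_accuracy(X, y, n_folds=5, k=3, seed=0):
    """Estimate classification accuracy with n_folds-fold cross-validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    # Fold f holds every n_folds-th shuffled index, starting at position f.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        # Train on everything outside the held-out fold.
        train_X = [X[i] for i in indices if i not in held]
        train_y = [y[i] for i in indices if i not in held]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Toy data (made up for illustration): two well-separated 2-D clusters.
rng = random.Random(42)
X = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
X += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
y = ["a"] * 30 + ["b"] * 30

accuracy = k_fold_accuracy(X, y)
print(f"Mean 5-fold accuracy: {accuracy:.2f}")
```

Each point is scored only by a model that never saw it during training, which is what makes the averaged accuracy an estimate of generalization rather than of fit to the training data.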

## Cheating

In short: don't do it! You may work with friends to help guide problem solving or consult Stack Overflow (or similar) to work out a solution, but copying from friends, previous students, or the Internet is strictly prohibited. NEVER blindly copy blocks of code: we can tell immediately.

If caught cheating, you will fail this course. Ask questions in recitation and at office hours. If you're really stuck and can't get help, write as much code as you can and write comments within your code explaining where you're stuck.

## Textbooks and reading

- Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2006.

- Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman: Mining of Massive Datasets, Cambridge University Press.

- A list of papers and online resources will be provided during classes

Further information, such as the course website, assessment deadlines, office hours, contact details etc. will be given during the course.

**The instructor reserves the right to modify this syllabus as deemed necessary at any time during the term.** Any modifications to the syllabus will be discussed with students during a class period. Students are responsible for information given in class.

The aim of the course is to provide a basic but comprehensive introduction to data mining. By the end of the course, students will be able to:

- Choose the right algorithms for data science problems

- Demonstrate knowledge of statistical data analysis techniques used in decision making

- Apply principles of Data Science to the analysis of large-scale problems

- Implement and use data mining software to solve real-world problems

**What you will NOT learn in this course**: Advanced coding and data visualization. This course is about the methods and algorithms to find information in the data. However, we will not discuss details of implementation or data handling. For learning to code, consider attending “Scientific Python” (MATH 5027). For learning to visualize data, consider attending “Data and Network Visualization” (CNSC 6012).

Students are expected to attend lectures and hands-on sessions, to hand in one to three assignments during the course, and to develop a project over the entire term.

Grading:

- Attendance at classes and hands-on sessions: 30% of the final grade

- Assignments: 30% of the final grade

- Final project: 40% of the final grade

To satisfy the prerequisites:

This course focuses on data mining and big data analytics. As such, we use a programming language, Python, to solve real-world learning problems and extract knowledge from real datasets. Since we need to pick one programming language for the course, we require students to prove proficiency with Python before the course starts, in one of the following ways:

a) Have passed the course MATH 5027 “Scientific Python”.

b) Take for grade or audit the course “Data Management with Python” CNSC 6016, given during the first six weeks of the winter term.

c) Take a MOOC on programming with Python and show the certificate. I recommend the course on Codecademy, but other courses are also fine. Please bring the syllabus of the course together with the certificate.

d) Show and discuss a project you developed in Python. Projects written by someone else (from the web, a friend, previous students) will not be considered.

If you use option c) or d) and there is a waiting list for the course, the certificate or the project must be shown before the beginning of the term to hold a place among the regular attendees. If there is no waiting list, it is fine to provide the certificate or show your previous project before the course begins (January 10th). However, the instructor takes no responsibility if you do not satisfy the prerequisite and need to drop the course.