Statistical Methods in Network Science and Data Analysis
- Proven proficiency with Python – read the “To satisfy the prerequisites” section at the bottom
- Basic skills in statistics and linear algebra
The increasing volume and nature of big data sets in business, economics, social and political sciences call for more complex and sophisticated mathematical and data mining tools. The complex systems monitored by big data bases are successfully described in terms of networks. In this course we will present and discuss mathematical and data mining tools used to characterize large empirical or model networks. Large data sets will be computationally investigated and the limits of the used algorithms will be discussed. The assessment of the statistical validity of the observed results will be analyzed and, when possible, quantitatively evaluated. Besides the mathematical theory, the course will have a practical approach with homeworks, hands-on classes and with the development of a project. During the class all examples and sample codes will be provided in Python and Jupyter notebooks.
Lectures: 24 classes of 100 min. Around half of the classes will be theory only. The other half will include programming exercises or evaluation of data sets. Therefore, use of a computer will be required during some lectures. Students can form groups and use their own laptops. Instructions on the required software will be provided during the first class.
Overview of the course. Basic concepts that will be treated in the course. Large data sets. Data mining approaches. Network science approach to large sets of data. Computational tools.
2 Networks and the need for new statistics
Fundamental concepts and indicators of a network: degree, diameter, walk. Network organizing principles: sparsity, scale-free networks, small world networks, clustering. Statistical motivation for focusing on distribution tails.
3 Computational complexity, probability and statistics
Basic concepts of algorithm run-time analysis. Basic concepts of probability. Heavy-tailed distributions and generative processes. Central limit theorem.
4 Statistical inference
Hypothesis testing and null models, regression, maximum likelihood estimation, especially in networks.
5 Distribution fitting and generation, degree-degree correlations
Fitting power laws and other distributions. Kolmogorov Smirnov test. Assortative networks, joint degree distributions.
6 Python: Statistics and plotting
Using scientific Python for statistics and plotting: histograms, distributions, central limit theorem.
7 Python: NetworkX
Using scientific Python for numerical network analysis: network models, null models, layouts.
8 Fundamental concepts in data mining
Types of data. Different types of attributes. Asymmetric attributes. Dimensionality, sparsity and resolution. Data pre-processing. Measures of similarity and dissimilarity.
9 Classification in data mining
Classification and regression for predictive analysis. Classification model. Target function. Training set and test set. Decision trees. Sensitivity and specificity. Model overfitting.
10 Principal Component Analysis
Properties of the PCA. Selection of the most relevant principal components. PCA with covariance and correlation matrices. Selection of the principal components.
11 Cluster analysis
Similarity measures. Agglomerative and divisive clustering techniques. k-means clustering algorithm. Single linkage, average linkage and complete linkage. Comparison of outputs of different methods.
12 Proximity based networks
Similarity based networks. Correlation networks. Partial correlation networks. Planar maximally graph networks.
13 Midterm test
14 Sampling in networks
Basic elements of statistical sampling theory. Induced and incident subgraph sampling. Star and snowball sampling. Estimation of totals network graphs.
15 Estimating in networks
Mapping to the problems of species detection or population detection. The case of random and scale free graphs. The case of Internet. Estimating in subgraphs sampled by induced and incident subgraph sampling. Estimation of original network graphs.
16 Statistical paradoxes on networks
Friendship paradox, Simpon’s paradox, Braess’s paradox.
17 Motif detection and triad census
Motifs in a directed graph. Basic motifs. Efficient algorithms to find motifs. Optimality of the triangle-finding algorithm. Shuffling, methods in directed networks.
18 Signed networks and social balance
Signed networks, structural balance and triadic closure, shuffling methods, null models.
19 Statistical aspects of community detection algorithms
Overview of community detection algorithms. Stochastic and deterministic algorithms. Indicators for the comparison of different partitions. Landscape of the fitness of a partition.
20 Link prediction
Problem description and evaluation metrics. Local similarity indices. Global similarity indices. Applications of the problem of link prediction. Classification of partially labeled nodes.
21 Statistical limits in data mining
Total information awareness. Multiple hypothesis test corrections. The Bonferroni correction. The False Discovery Rate corrections. False positive, False Negative, True positive and True negative outcomes.
22 Network flows
Internet traffic analysis. Diagnosing volume anomalies. Traffic matrix estimation. Gravity and radiation model.
23 Applications of maximum matching
Shareability networks, network control theory, kidney exchange matching.
24 Project presentations
Textbooks and reading
- Latora, Nicosia, Russo. Complex Networks. Cambridge University Press, 2017
- Kolaczyk. Statistical Analysis of Network Data. Springer, 2010
- Easley, Kleinberg. Networks, crowds, and markets, Cambridge University Press, 2010
- Han, Kamber, Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2012
- Leskovec, Rajaraman, Ullman. Mining of Massive Datasets. Cambridge University Press
- A list of papers and online resources will be provided during classes
In short: don't do it. You may work with others to help guide problem solving or consult stack overflow (or similar) to work out a solution, but copying—from friends, previous students, or the Internet—is strictly prohibited. NEVER copy blindly blocks of code – we can tell immediately.
If caught cheating, you will fail this course. Ask questions in recitation and at office hours. If you are stuck with a programming task and cannot get help, write code as far as you can and explain in the code comments where and why you are stuck.
Further information, such as the course website, assessment deadlines, office hours, contact details etc. will be given during the course.
The instructor reserves the right to modify this syllabus as deemed necessary any time during the term. Any modifications to the syllabus will be discussed with students during a class period. Students are responsible for information given in class.
By successfully completing the course the students will be able to:
- Learn how to perform data analysis and use statistical methods in the investigation of networks.
- Evaluate the statistical reliability of empirical estimations against an appropriate null hypothesis.
- Learn how to make use of large sets of data for investigating networks observed in the fields of social sciences.
- Perform empirical analyses and statistical validation of large datasets obtained from the Internet or from other business and scientific sources.
What you will NOT learn in this course
Fundamentals of coding and data visualization. This course has the prerequisite that you already have a basic proficiency with Python and will be able to develop and apply your skills towards data analysis and statistical visualization. For learning the basics to code (for-loops, lists, functions, reading and writing data from/to files, etc.), consider attending “MATH5016: Scientific Python” or “Data Management with Python”. For learning to visualize data, consider attending “CNSC6012: Data and Network Visualization”.
(1) Assessment type 1 (30% of the final grade). Attendance in at least 80% of classes, active cooperation, and homework: Students will get home assignments consisting of statistical analyses, simple problems or data processing, which they will have to complete individually and submit electronically.
(2) Assessment type 2 (30% of the final grade). The midterm test will be after week 6, consisting of questions related to the course that can be answered or solved by hand. The use of materials, including calculators or computers, is not permitted during the midterm test.
(3) Assessment type 3 (40% of the final grade). In the final project work students will have to perform a research project involving the investigation of a large network individually. The investigation and characterization of the network might include one or more of the following aspects: (i) developing of a network model, (ii) running network simulations, (iii) comparing network metrics with null hypotheses, (iv) critically analyzing results of a large scale empirical investigations. Students will have to prepare a project presentation and a written report.
Requirements for audit
Attendance in at least 80% of classes, active cooperation, and completing the home assignments.
To satisfy the prerequisites
Part of this course focuses on applying scientific programming with Python for research. We make no use of programs with a Graphical User Interface, like those available with spreadsheets. Since we need to pick one programming language for the course, we require students to prove proficiency with Python before the course starts, in one of the following ways:
a) Having taken for grade or audit the course “MATH 5016: Scientific Python”.
b) Take a MOOC course on programming with Python and show the certificate.
c) Show and discuss a project you developed in Python. Projects from someone else (web, friend, previous students) are not considered.
The instructor holds no responsibility in case you do not satisfy the prerequisite and need to drop the course.