Big Data Analysis and Mining
Key points of the Big Data A&M final exam.
Question types
 Single-choice questions
 Multiple-choice questions
 Short-answer questions
Chapter 1: Introduction to big data mining

Driving factors
Data storage, computation power, data availability 
* What's big data?
A massive volume of structured or unstructured data that is so large that traditional databases or software cannot process it.
* 4V
Volume, Velocity, Variety, Veracity 
* What's big data mining?
Extracting meaningful patterns from big data. 
KDD (Knowledge Discovery in Databases) process
Selection, preprocessing, transformation, data mining, interpretation/evaluation
Main tasks of data mining
Association rule mining, cluster analysis, classification, outlier detection
Chapter 2: Foundation of data mining

Concepts
Supervised learning versus unsupervised learning (and semi-supervised learning in between)
Overfitting
Strategies: Increase the sample size
 Remove outliers
 Decrease the model complexity
 Cross-validation
 Add a regularization term (L2 norm: smoothness / L1 norm: sparsity)

Typical classification algorithms

KNN
Lazy learning (no training step)
Benefits: no training step
Drawbacks: high computation cost and low accuracy 
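A minimal KNN sketch in numpy (the toy data and k=3 are illustrative assumptions):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Lazy learning: all the work happens at query time, there is no training step.
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority label

# Toy data (illustrative): two groups around (0,0) and (5,5)
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.2])))    # -> 1
```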
Naive Bayes
Assumption: class conditional independence
Benefits: probability prediction 
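A minimal Naive Bayes sketch for categorical features, assuming every test value also appears in training; the Laplace smoothing scheme is an illustrative choice:

```python
import numpy as np

def nb_train(X, y):
    """Estimate P(c) and P(x_j = v | c) for categorical features (Laplace smoothing)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    likelihoods = {}                       # (class, feature index, value) -> probability
    for c in classes:
        Xc = X[y == c]
        for j in range(X.shape[1]):
            values = np.unique(X[:, j])
            for v in values:
                # +1 smoothing avoids zero probabilities for unseen combinations
                likelihoods[(c, j, v)] = (np.sum(Xc[:, j] == v) + 1) / (len(Xc) + len(values))
    return classes, priors, likelihoods

def nb_predict(x, classes, priors, likelihoods):
    # Class-conditional independence: the joint likelihood factorizes per feature
    score = {c: priors[c] * np.prod([likelihoods[(c, j, v)] for j, v in enumerate(x)])
             for c in classes}
    return max(score, key=score.get)       # assumes every test value was seen in training
```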
SVM
 Linearly separable problem
Find a hyperplane that maximizes the margin (maximum marginal hyperplane)
* Benefits: works with few training samples, high generalization (the solution depends only on a few support vectors, which decreases structural risk)
 Nonlinear problem
Project low-dimensional data into a high-dimensional space using \(\Phi(x)\)
Kernel function (reduces computation): \(K(x,z)=\Phi(x)\cdot\Phi(z)\)
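A small numeric check (illustrative vectors) that the degree-2 polynomial kernel equals the inner product under the explicit map \(\Phi\):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, z):
    """Kernel trick: the same inner product, computed in the low-dimensional space."""
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))   # 16.0
print(K(x, z))                  # 16.0 -- identical, without ever forming phi
```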


Ensemble learning
Criteria: good performance and diversity w.r.t. each base classifier
Strategies:
 Bagging \(\rightarrow\) Random forest (sampling data with replacement)
Sample bootstrap datasets to train the base learners.
 Boosting \(\rightarrow\) AdaBoost / XGBoost
Give more weight to wrongly classified data.
 Stacking: use the base learners' outputs as new features
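A bagging sketch; `train_fn` is a hypothetical callable that fits a base learner and returns a predictor:

```python
import numpy as np

def bagging_train(X, y, train_fn, n_learners=10, rng=np.random.default_rng(0)):
    """Train each base learner on a bootstrap resample (sampling with replacement)."""
    n = len(X)
    models = []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)     # bootstrap: n draws with replacement
        models.append(train_fn(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Aggregate by majority vote over the base learners."""
    votes = [m(x) for m in models]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```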

Clustering

KMeans
Limitations: The value of K must be chosen in advance
 Sensitive to initializations
 Not compatible with arbitrarily shaped clusters
 Not compatible with non-Gaussian clusters
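A minimal sketch of Lloyd's KMeans (random initialization, one of the weaknesses listed above; assumes no cluster goes empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=np.random.default_rng(0)):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    centers = X[rng.choice(len(X), k, replace=False)]   # random init
    for _ in range(n_iters):
        # Assign each point to its nearest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # Recompute centroids (sketch assumes every cluster keeps at least one point)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```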

Hierarchical clustering
Single-Link: minimum distance
Complete-Link: maximum distance
Average-Link: average distance
Benefits: data structure visualization (dendrogram), arbitrary number of clusters
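A short example using SciPy's hierarchical clustering (the toy data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (illustrative): two well-separated groups
X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])

# 'single' = minimum distance between clusters; 'complete' and 'average' also work
Z = linkage(X, method='single')

# Cut the dendrogram afterwards into any desired number of clusters
print(fcluster(Z, t=2, criterion='maxclust'))   # e.g. [1 1 1 2 2 2]
```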

Density-based clustering: DBSCAN
Benefits: Arbitrarily shaped clusters
 Non-Gaussian clusters
 No need to specify the number of clusters
Drawbacks: sensitive to hyperparameters (minPts, \(\epsilon\))
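A compact DBSCAN sketch (brute-force neighborhood computation, suitable for toy data only):

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=4):
    """Minimal DBSCAN: grow clusters from core points, leave the rest as noise (-1)."""
    n = len(X)
    labels = np.full(n, -1)                              # -1 = noise / unvisited
    dist = np.linalg.norm(X[:, None] - X[None], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                                     # already assigned, or not core
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:                                     # expand via density-reachability
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:         # j is itself a core point
                    queue.extend(neighbors[j])
                # border points get the label but are not expanded further
        cluster += 1
    return labels
```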


Association rule mining: Apriori (prune candidates with the anti-monotone property: every subset of a frequent itemset must itself be frequent)
Chapter 3: Hashing

* MinHash
 What's MinHash?
A technique for approximating the Jaccard similarity of sets: under a random permutation/hash, \(\Pr[h_{\min}(A)=h_{\min}(B)]=J(A,B)\)
 ** How to compute the signature matrix using MinHash (see the sketch below)
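A MinHash sketch using a universal hash family \((a x + b) \bmod p\); set elements are assumed to be integer ids, and the parameters are illustrative:

```python
import numpy as np

def minhash_signatures(sets, n_hashes=100, rng=np.random.default_rng(0)):
    """Signature matrix: row i = min over a set's elements of the i-th hash."""
    p = 2**31 - 1                             # a large prime (illustrative choice)
    a = rng.integers(1, p, size=n_hashes)
    b = rng.integers(0, p, size=n_hashes)
    sig = np.full((n_hashes, len(sets)), np.inf)
    for col, s in enumerate(sets):
        for x in s:                           # x must be an integer element id
            sig[:, col] = np.minimum(sig[:, col], (a * x + b) % p)
    return sig

# Fraction of agreeing min-hashes approximates the Jaccard similarity
A, B = {1, 2, 3, 4}, {2, 3, 4, 5}
sig = minhash_signatures([A, B], n_hashes=200)
print(np.mean(sig[:, 0] == sig[:, 1]))       # close to J(A,B) = 3/5
```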

* Locality-Sensitive Hashing (LSH)
Trick: divide the signature (input) matrix into many bands, and hash each band into buckets
Note:
 Results depend on the parameters: b (number of bands), r (rows per band); a pair with signature similarity \(s\) becomes a candidate with probability \(1-(1-s^r)^b\)
 Both MinHash and LSH are approximate computations
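A banding sketch that reports documents whose bands collide as candidate pairs; it can reuse the signature matrix from the MinHash sketch above:

```python
from collections import defaultdict

def lsh_candidate_pairs(sig, b, r):
    """Band the signature matrix (n_hashes x n_docs) and bucket identical bands."""
    assert sig.shape[0] == b * r, "need n_hashes = b * r"
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col in range(sig.shape[1]):
            # Documents whose band hashes identically land in the same bucket
            key = tuple(sig[band * r:(band + 1) * r, col])
            buckets[key].append(col)
        for cols in buckets.values():
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates
```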

\(\diamond\) Learn to hash
Chapter 4: Sampling
Given a p.d.f., we want to draw samples from it.

Inverse sampling
Basic idea:
\[ \text{p.d.f.} \Longrightarrow \text{c.d.f.} \Longrightarrow \{x_i=F^{-1}(p) \mid p \sim U(0,1) \} \]
Drawbacks: difficult to compute the \(c.d.f.\) and \(F^{-1}(p)\)
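A sketch for an Exponential(\(\lambda\)) target, where the inverse c.d.f. exists in closed form (\(\lambda = 2\) is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                        # rate of the Exponential(lambda) target

# For Exp(lambda): F(x) = 1 - exp(-lam*x), so F^{-1}(p) = -ln(1 - p) / lam
p = rng.uniform(0.0, 1.0, size=100_000)
samples = -np.log(1.0 - p) / lam

print(samples.mean())            # ~ 1/lam = 0.5
```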

Rejection sampling
Basic idea: introduce a proposal distribution \(q(x)\) and an envelope constant \(M\) with \(M q(x) \ge p(x)\), then draw samples from \(q(x)\). For each \(x_i \sim q(x)\), draw \(a \sim U(0, M q(x_i))\); if \(a \le p(x_i)\), accept the sample, otherwise reject it.
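A sketch with a Beta(2,2) target, a uniform proposal, and envelope constant \(M = 1.5\) (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                       # target: Beta(2,2) density on [0,1]
    return 6.0 * x * (1.0 - x)

M = 1.5                         # envelope: M * q(x) >= p(x) with q = Uniform(0,1)

samples = []
while len(samples) < 10_000:
    x = rng.uniform(0.0, 1.0)   # draw from the proposal q
    a = rng.uniform(0.0, M)     # height a ~ U(0, M*q(x)), and q(x) = 1 here
    if a <= p(x):               # accept if the point falls under the target curve
        samples.append(x)

print(np.mean(samples))         # ~ 0.5, the mean of Beta(2,2)
```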

Importance sampling
Basic idea: do not reject samples, but assign a weight \(\frac{p(x_i)}{q(x_i)}\) to each sample.
** Importance sampling vs rejection sampling:
 RS assigns the same weight (1) to all samples, and only some samples are kept.
 IS assigns a different weight to each sample, and all samples are kept.
 IS is more robust to the choice of proposal distribution.
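A sketch estimating \(E_p[x^2]\) for \(p = N(0,1)\) using samples from a proposal \(q = N(1, 2^2)\) (both distributions are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Sample from the proposal q = N(1, 2^2) instead of the target p = N(0,1)
xs = rng.normal(loc=1.0, scale=2.0, size=100_000)
w = norm.pdf(xs, 0.0, 1.0) / norm.pdf(xs, 1.0, 2.0)   # importance weights p/q

print(np.mean(w * xs**2))   # ~ E_p[x^2] = 1, and no sample was thrown away
```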

Markov Chain Monte Carlo (MCMC) sampling
** Basic idea: a class of algorithms for sampling from a p.d.f. by constructing a Markov chain that has the desired distribution as its equilibrium distribution.
Detailed Balance Condition (D.B.C.):
\[ p(x_i)\cdot q(x_j \mid x_i)=p(x_j) \cdot q(x_i \mid x_j) \]
An arbitrary proposal rarely satisfies the D.B.C., so add an acceptance ratio term \(\alpha(x_j \mid x_i)=p(x_j) \cdot q(x_i \mid x_j)\) to each side:
\[ p(x_i)\cdot q(x_j \mid x_i) \cdot \alpha(x_j \mid x_i)=p(x_j) \cdot q(x_i \mid x_j) \cdot \alpha(x_i \mid x_j) \]
Then use \(q'(x_j\mid x_i)= q(x_j \mid x_i) \cdot \alpha(x_j \mid x_i)\) as the new proposal distribution. \(q(x' \mid x)\) is usually set to a Gaussian centered on \(x\).
Procedure:
 Initialization \(X_0 = x_0\)
 While \(t=0,1,2,\dots\) do
at the \(t\)-th state \(X_t=x_t\), sample \(y \sim q(x \mid x_t)\) and \(u \sim U[0,1]\)
if \(u<\alpha(y \mid x_t)=p(y)\,q(x_t \mid y)\) then
\(X_{t+1}=y\)
else
reject \(y\) and keep the current state: \(X_{t+1}=x_t\)

* Metropolis-Hastings (MH) sampling
The acceptance ratio in plain MCMC sampling is so small that sampling is inefficient. While still holding the D.B.C., MH enlarges the acceptance ratio by setting:
\[ \alpha(x_j \mid x_i)=\min \left( 1,\frac{p(x_j)\,q(x_i \mid x_j)}{p(x_i)\,q(x_j \mid x_i)} \right) \]
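An MH sketch with a symmetric Gaussian proposal, so the ratio reduces to \(p(y)/p(x)\) (Metropolis case); the bimodal target is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_unnorm(x):
    """Unnormalized target density (illustrative): a mixture of two Gaussians."""
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_samples=50_000, step=1.0):
    x = 0.0
    samples = []
    for _ in range(n_samples):
        y = rng.normal(x, step)                   # symmetric Gaussian proposal q(y|x)
        # q cancels out in the ratio because it is symmetric
        alpha = min(1.0, p_unnorm(y) / p_unnorm(x))
        if rng.uniform() < alpha:
            x = y                                 # accept the proposal
        samples.append(x)                         # on rejection, keep the current state
    return np.array(samples)

s = metropolis_hastings()
print(s.mean())   # ~ 0 for this symmetric bimodal target
```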
* Gibbs sampling
MH sampling has a large acceptance ratio, while Gibbs sampling pushes the acceptance ratio all the way to 100%.
\[ \begin{gather*} \left\{\begin{aligned} &p(x_1,y_1)p(y_2 \mid x_1) = p(x_1)p(y_1 \mid x_1)p(y_2 \mid x_1)\\ &p(x_1,y_2)p(y_1 \mid x_1) = p(x_1)p(y_2 \mid x_1)p(y_1 \mid x_1) \end{aligned}\right. \\ \downarrow \\ p(x_1,y_1)p(y_2 \mid x_1)=p(x_1,y_2)p(y_1 \mid x_1)\\ p(A)p(y_2 \mid x_1)=p(B)p(y_1 \mid x_1) \end{gather*} \]
Choose one dimension and resample it from its conditional probability while fixing the other dimensions.
Example: \(X=\{e,t,w\}\)
 Step 1: Initialize \(e\), \(t\) and \(w\), so that we have \(X_0 = \left( e_0,t_0,w_0\right)\)
 Step 2: generate \(e_1\) based on \(p(e \mid t=t_0,w=w_0)\)
 Step 3: generate \(t_1\) based on \(p(t \mid e=e_1,w=w_0)\)
 Step 4: generate \(w_1\) based on \(p(w \mid e=e_1,t=t_1)\)
 Step 5: repeat steps 2-4 \(n\) times to obtain a Markov chain
** Gibbs sampling versus MH sampling:
 Gibbs accepts 100% of proposals, while MH accepts < 100%
 Gibbs needs all the conditional distributions, while MH only needs the joint p.d.f.
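A Gibbs sketch for a standard bivariate Gaussian with correlation \(\rho\) (illustrative), whose conditionals are Gaussian in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                 # correlation of the standard bivariate Gaussian target

def gibbs(n_samples=50_000):
    x, y = 0.0, 0.0
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        # Each conditional of a bivariate Gaussian is itself Gaussian:
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # sample x | y
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # sample y | x
        out[t] = x, y
    return out

s = gibbs()
print(np.corrcoef(s.T)[0, 1])   # ~ rho = 0.8
```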

* Reservoir sampling
Reservoir sampling is a random sampling algorithm for data streams.
Procedure:
 Put the first \(k\) items of the stream into the reservoir
 When item \(n > k\) arrives:
  Pick \(j\) uniformly from \([1,n]\)
  If \(j \le k\), replace the \(j\)-th element of the reservoir with the new item
Every element seen so far is in the reservoir with the same probability \(\frac{k}{n}\) at any time.
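A direct sketch of the procedure above (the stream and k are illustrative):

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Keep a uniform sample of size k from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(1, n)           # uniform in [1, n], inclusive
            if j <= k:
                reservoir[j - 1] = item     # new item is kept with probability k/n
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))   # 5 uniformly chosen items
```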
Chapter 5: Data stream

Unique properties: infinite length / evolving nature

Challenges: single pass, memory limitation, low time complexity, concept drift

What is concept drift?
\(Pr(C \mid X)\) keeps changing over time as the data stream evolves.

Concept drift detection
 Distribution based: ADWIN
 Error rate based: DDM

Classification:

VFDT
Key point: ** Hoeffding bound \(\epsilon=\sqrt{\frac{R^2 \ln(1/\delta)}{2n}}\); only a few samples are needed to decide each split when growing the decision tree 
CVFDT
Concept adaptation 
SyncStream: prototype-based


Data stream clustering

Basic framework:
 Online: summarize data \(\rightarrow\) micro-clusters
 Offline: clustering

Data structure:
 Prototypes

**Cluster feature \(CF =(N,LS,SS)\): number of points, linear sum, and squared sum of the points
It is well suited to online updates thanks to its additivity property (see the sketch after this list):
\[ \overrightarrow{CF}(D_1)+\overrightarrow{CF}(D_2)=\overrightarrow{CF}(D_1 \cup D_2) \] 
Grids
 Others
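A minimal sketch of the CF additivity property (the class and method names are illustrative):

```python
import numpy as np

class CF:
    """Cluster feature (N, LS, SS): point count, linear sum, squared sum."""
    def __init__(self, N, LS, SS):
        self.N, self.LS, self.SS = N, LS, SS

    @classmethod
    def of(cls, D):
        D = np.asarray(D, dtype=float)
        return cls(len(D), D.sum(axis=0), (D ** 2).sum(axis=0))

    def __add__(self, other):
        # Additivity: CF(D1) + CF(D2) = CF(D1 u D2), so merging is O(1) online
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.N

D1, D2 = [[0, 0], [2, 2]], [[4, 4]]
merged = CF.of(D1) + CF.of(D2)
print(merged.centroid())            # [2. 2.], the centroid of D1 u D2
```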

Clustering algorithms:
 CluStream (KMeans)
 DenStream (DBSCAN)

Chapter 6: Graph mining

Key node identification
 Centrality: degree / betweenness / closeness

** K-shell decomposition
Iteratively remove nodes of degree \(\le k\) for \(k=0,1,2,\dots\); the nodes removed in round \(k\) form the \(k\)-shell (see the sketch below)
Benefits: intuitive, low time complexity, reveals hierarchy
Drawbacks: not informative for star or tree networks; the ranking is coarse (many nodes share a shell)
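A K-shell sketch over an adjacency-set representation (the data structure is an illustrative choice):

```python
def k_shell(adj):
    """K-shell decomposition: adj maps node -> set of neighbors (undirected graph)."""
    degree = {v: len(ns) for v, ns in adj.items()}
    shell = {}
    k = 0
    remaining = set(adj)
    while remaining:
        # Peel every node whose (current) degree is <= k before raising k
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        for v in peel:
            shell[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1      # removing v lowers its neighbors' degrees
    return shell

# Toy graph: a triangle, so every node ends up in shell 2
print(k_shell({'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}))
```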
** PageRank
Basic idea: a page is important if the pages linking to it are also important
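A power-iteration PageRank sketch with damping factor \(d\) (the toy graph is illustrative):

```python
import numpy as np

def pagerank(out_links, d=0.85, n_iters=100):
    """Power iteration; out_links[i] = list of pages that page i links to."""
    n = len(out_links)
    pr = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        new = np.full(n, (1.0 - d) / n)          # random-jump (teleport) term
        for i, outs in enumerate(out_links):
            if outs:
                for j in outs:                   # i passes its importance to its links
                    new[j] += d * pr[i] / len(outs)
            else:
                new += d * pr[i] / n             # dangling page: spread uniformly
        pr = new
    return pr

# Toy 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
print(pagerank([[1, 2], [2], [0]]))
```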

Community detection
 Cut-based:
  Minimum cut
  Ratio cut
  Normalized cut
 Modularity (expected cut): \(Q=\frac{1}{2m}\sum_{ij}\left(A_{ij}-\frac{k_i k_j}{2m}\right)\delta(c_i,c_j)\)
 Distance dynamics
Chapter 7: Hadoop and Spark

** What is Hadoop and Spark?
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Spark's goal was to generalize MapReduce to support new applications within the same engine, making it more efficient and much simpler for end users.
* Design principles of Hadoop
 Automatic parallelization
 Fault tolerance & automatic recovery
 Clean and simple programming abstraction

Hadoop ecosystem
 ** Hadoop Distributed File System (HDFS)
Used for storage: NameNode (metadata), DataNode (actual data), heartbeat
 * MapReduce
Used for distributed execution: JobTracker, TaskTracker
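A rough single-machine simulation of the MapReduce word-count flow (map, shuffle by key, reduce); a real Hadoop job expresses the same three phases, distributed across nodes:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    """Map: emit a (key, value) pair for every word."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)

lines = ["big data mining", "big data analysis"]

# Shuffle: group intermediate values by key
groups = defaultdict(list)
for k, v in chain.from_iterable(map_fn(l) for l in lines):
    groups[k].append(v)

print(dict(reduce_fn(k, vs) for k, vs in groups.items()))
# {'big': 2, 'data': 2, 'mining': 1, 'analysis': 1}
```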

Spark
 Resilient Distributed Dataset (RDD), can be cached in main memory
 Operations: transformations (lazy) and actions
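A short PySpark sketch of lazy transformations vs. actions (assumes a local Spark installation):

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")
rdd = sc.parallelize([1, 2, 3, 4])

doubled = rdd.map(lambda x: x * 2)   # transformation: lazy, nothing runs yet
print(doubled.collect())             # action: triggers execution -> [2, 4, 6, 8]
sc.stop()
```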

**** MapReduce vs Spark

MapReduce
 Great for one-pass computation, not for multi-pass computation
 No efficient primitives/APIs for data sharing (intermediate results are written to disk)

Spark
 Extends a programming language (Scala) with RDDs
 Provides clean APIs for Java, Python, Scala, R
