Big Data Analysis and Mining
Key points for the Big Data A&M final exam.
Question types
- Single-choice questions
- Multiple-choice questions
- Short-answer questions
Chapter 1: Introduction to big data mining
Driving factors
Data storage, computation power, data availability
* What's big data?
A massive volume of structured or unstructured data that is so large that traditional databases and software cannot process it
* 4V
Volume, Velocity, Variety, Veracity
* What's big data mining?
Extracting meaningful patterns from big data.
KDD (Knowledge Discovery in Databases) process
Selection, preprocessing, transformation, data mining, data interpretation
Main tasks of data mining
Association rule mining, cluster analysis, classification, outlier detection
Chapter 2: Foundation of data mining
Concepts
Supervised learning versus unsupervised learning (with semi-supervised learning in between)
Overfitting
Strategies:
- Increase sample size
- Remove outliers
- Decrease the model complexity
- Cross-validation
- Add regularization term (l2-norm: smoothness / l1-norm: sparsity)
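As an illustration of the last strategy, a minimal scikit-learn sketch on synthetic data (`Ridge` uses the l2 penalty, `Lasso` the l1 penalty; the data and parameter values here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: only the first 3 of 20 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # l2-norm: shrinks all weights smoothly
lasso = Lasso(alpha=0.1).fit(X, y)   # l1-norm: drives irrelevant weights to exactly 0

print("non-zero ridge weights:", np.sum(ridge.coef_ != 0))  # typically all 20
print("non-zero lasso weights:", np.sum(lasso.coef_ != 0))  # typically around 3
```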
Typical classification algorithms
KNN
Lazy learning (no training step)
Benefits: no training step
Drawbacks: high computation cost and low accuracy
Naive Bayes
Assumption: class conditional independence
Benefits: probability prediction
SVM
- Linearly separable problem
Find a hyperplane that maximizes the margin (maximum marginal hyperplane)
* Benefits: works with few training samples, high generalization (affected only by the support vectors, which decreases structural risk)
- Nonlinear problem
Project low-dimensional data into a high-dimensional space using \(\Phi(x)\)
Kernel function (reduces computation): \(K(x,z)=\Phi(x)\cdot\Phi(z)\)
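A minimal scikit-learn sketch of both cases on a synthetic dataset that is not linearly separable; the RBF kernel handles the nonlinear case through \(K(x,z)\) without ever computing \(\Phi(x)\) explicitly:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)       # kernel trick: implicit high-dimensional mapping

print("linear accuracy:", linear_svm.score(X, y))  # poor, around 0.5
print("rbf accuracy:   ", rbf_svm.score(X, y))     # near 1.0
print("support vectors per class:", rbf_svm.n_support_)  # only these points define the margin
```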
Ensemble learning
Criteria: each base classifier should perform well, and the classifiers should be diverse from one another
Strategies:
- Bagging -> Random forest (sampling data with replacement)
Sample bootstrap datasets to train base learners.
- Boosting -> AdaBoost / XGBoost
Give more weight to wrongly classified data.
- Stacking: use base-learner outputs as new features
Clustering
K-Means
Limitations:
- The value of K must be specified in advance
- Sensitive to initializations
- Not compatible with arbitrarily shaped clusters
- Not compatible with non-Gaussian clusters
Hierarchical clustering
Single-Link: minimum distance
Complete-Link: maximum distance
Average-Link: average distance
Benefits: data structure visualization (dendrogram), arbitrary number of clusters
Density-based clustering: DBSCAN
Benefits:
- Arbitrarily shaped clusters
- Non-Gaussian clusters
- No need to specify the number of clusters
Drawbacks: sensitive to hyperparameters (MinPts, \(\varepsilon\))
Association rule mining: Apriori
Chapter 3: Hashing
* Min-Hash
- What's Min-Hash?
- ** How to compute signature matrix using Min-Hash
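A minimal NumPy sketch of computing a signature matrix, assuming sets are given as a binary element-by-set matrix and each hash function is simulated by a random permutation of the rows:

```python
import numpy as np

def minhash_signature(binary_matrix, num_hashes=100, seed=0):
    """binary_matrix: shape (n_elements, n_sets), entries 0/1."""
    rng = np.random.default_rng(seed)
    n_elements, n_sets = binary_matrix.shape
    signature = np.full((num_hashes, n_sets), np.inf)
    for h in range(num_hashes):
        perm = rng.permutation(n_elements)                 # one "hash function"
        for col in range(n_sets):
            rows_with_one = np.where(binary_matrix[:, col] == 1)[0]
            signature[h, col] = perm[rows_with_one].min()  # first 1 under the permutation
    return signature.astype(int)

# Two similar sets (columns 0, 1) and one different set (column 2).
M = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
sig = minhash_signature(M)
# Fraction of equal signature rows approximates the Jaccard similarity.
print((sig[:, 0] == sig[:, 1]).mean(), (sig[:, 0] == sig[:, 2]).mean())
```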
* Locality-Sensitive Hashing (LSH)
Trick: divide the signature (input) matrix into many bands, and hash each band of every column into buckets
Note:
- Results depend on the parameters: b (number of bands), r (rows per band)
- Both Min-Hash and LSH are approximate computations
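A minimal sketch of the banding trick, assuming the signature matrix has \(b \cdot r\) rows; columns whose slice in any band hashes to the same bucket become candidate pairs:

```python
import numpy as np
from collections import defaultdict

def lsh_candidate_pairs(signature, b, r):
    """signature: (b*r, n_sets) signature matrix; returns candidate column pairs."""
    n_sets = signature.shape[1]
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col in range(n_sets):
            # Hash this column's slice of the band into a bucket.
            key = tuple(signature[band * r:(band + 1) * r, col])
            buckets[key].append(col)
        for cols in buckets.values():                 # columns sharing a bucket are candidates
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates

# Tiny example: columns 0 and 1 agree on every band, column 2 differs.
sig = np.array([[1, 1, 7],
                [4, 4, 2],
                [3, 3, 9],
                [5, 5, 5]])
print(lsh_candidate_pairs(sig, b=2, r=2))             # expected: {(0, 1)}
```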
\(\diamond\) Learn to hash
Chapter 4: Sampling
Given a p.d.f., we want to draw samples from it.
Inverse sampling
Basic idea:
\[ \text{p.d.f.}\Longrightarrow \text{c.d.f.}\Longrightarrow \{x_i=F^{-1}(p) \mid p \sim U(0,1) \} \]
Drawbacks: difficult to compute the \(c.d.f.\) and \(F^{-1}(p)\)
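A minimal sketch for a case where \(F^{-1}\) is available in closed form, the exponential distribution with \(F^{-1}(p) = -\ln(1-p)/\lambda\); this is exactly the situation the drawback above says is rare:

```python
import numpy as np

def inverse_sample_exponential(lam, n, seed=0):
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.0, 1.0, size=n)        # p ~ U(0, 1)
    return -np.log(1.0 - p) / lam            # x = F^{-1}(p) for Exp(lam)

samples = inverse_sample_exponential(lam=2.0, n=100_000)
print(samples.mean())                        # should be close to 1/lam = 0.5
```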
Rejection sampling
Basic idea: introduce a proposal distribution \(q(x)\) and a constant \(k\) such that \(k\,q(x) \ge p(x)\) everywhere, and draw samples from \(q(x)\). For each \(x_i \sim q(x)\), draw \(a \sim U(0, k\,q(x_i))\). If \(a \le p(x_i)\), accept the sample; otherwise reject it.
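A minimal sketch for an illustrative bimodal target \(p(x)\), using a uniform proposal whose scaled envelope dominates \(p(x)\) everywhere:

```python
import numpy as np

def p(x):                                    # target density (unnormalized is fine)
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def rejection_sample(n, seed=0):
    rng = np.random.default_rng(seed)
    samples = []
    envelope = 1.1                           # constant k*q(x) for q = U(-6, 6); >= max p(x)
    while len(samples) < n:
        x = rng.uniform(-6, 6)               # x_i ~ q(x)
        a = rng.uniform(0, envelope)         # a ~ U(0, k*q(x_i))
        if a <= p(x):                        # accept if the point falls under the target curve
            samples.append(x)
    return np.array(samples)

print(rejection_sample(10_000).mean())       # symmetric target, mean close to 0
```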
Importance sampling
Basic idea: do not reject samples, but assign a weight \(\frac{P(x_i)}{Q(x_i)}\) to each sample.
** Importance sampling vs Rejection sampling:
- RS assigns the same weight (1) to all samples, and only some samples are kept.
- IS assigns a different weight to each sample, and all samples are kept.
- IS is more robust to the choice of proposal distribution.
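A minimal sketch estimating \(E_p[f(x)]\) for a standard normal target \(p\) using samples from a wider proposal \(q\); every sample is kept, weighted by \(p(x_i)/q(x_i)\):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
f = lambda x: x ** 2                            # target quantity: E_p[x^2] = 1 for p = N(0, 1)

x = rng.normal(0.0, 3.0, size=100_000)          # samples from the proposal q = N(0, 3^2)
w = normal_pdf(x, 0, 1) / normal_pdf(x, 0, 3)   # importance weights p(x_i) / q(x_i)

print(np.sum(w * f(x)) / np.sum(w))             # self-normalized estimate, close to 1.0
```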
Markov Chain Monte Carlo(MCMC) sampling
** Basic idea: a class of algorithms for sampling from a p.d.f. by constructing a Markov chain that has the desired distribution as its equilibrium distribution.
Detailed Balance Condition (D.B.C.):
\[ P(x_i)\cdot Q(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j) \]
It is difficult to meet the D.B.C. directly, so add acceptance ratio terms \(\alpha(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j)\) to both sides:
\[ P(x_i)\cdot Q(x_j \mid x_i) \cdot \alpha(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j) \cdot \alpha(x_i \mid x_j) \]
Then use \(Q'(x_j\mid x_i)= Q(x_j \mid x_i) \cdot \alpha(x_j \mid x_i)\) as the new proposal distribution. The proposal \(Q(x' \mid x)\) is usually a Gaussian centered on \(x\).
Procedure:
- Initialization: \(X_0 = x_0\)
- For \(t=0,1,2,\dots\) do
at the \(t\)-th state \(X_t=x_t\), sample \(y \sim q(x \mid x_t)\) and \(u \sim U[0,1]\)
if \(u<\alpha(y \mid x_t)=p(y)\,q(x_t \mid y)\), set \(X_{t+1}=y\)
else reject the proposal and keep \(X_{t+1}=x_t\)
* Metropolis-Hastings(M-H) sampling
The acceptance ratio in basic MCMC sampling is often so small that sampling is inefficient. While still satisfying the D.B.C., M-H enlarges the acceptance ratio by setting:
\[ \alpha(X_j \mid X_i)=\min \left( 1,\frac{p(X_j)q(X_i \mid X_j)}{p(X_i)q(X_j \mid X_i)} \right) \]
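A minimal M-H sketch for an illustrative bimodal target, with a Gaussian proposal centered on the current state (the proposal is symmetric, so the \(q\) terms cancel in the ratio):

```python
import numpy as np

def p(x):                                           # unnormalized target density
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    chain = []
    for _ in range(n):
        y = rng.normal(x, step)                     # proposal q(y | x): Gaussian centered on x
        alpha = min(1.0, p(y) / p(x))               # symmetric q, so the q terms cancel
        if rng.uniform() < alpha:                   # accept with probability alpha
            x = y
        chain.append(x)                             # on rejection, the chain repeats x
    return np.array(chain)

chain = metropolis_hastings(50_000)
print(chain.mean())                                 # symmetric target, mean close to 0
```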
* Gibbs sampling
M-H sampling has a larger acceptance ratio, while Gibbs sampling goes further and makes the acceptance ratio exactly 100%.
\[ \begin{gather*} \left\{\begin{aligned} &p(x_1,y_1)\,p(y_2 \mid x_1) = p(x_1)\,p(y_1 \mid x_1)\,p(y_2 \mid x_1)\\ &p(x_1,y_2)\,p(y_1 \mid x_1) = p(x_1)\,p(y_2 \mid x_1)\,p(y_1 \mid x_1) \end{aligned}\right. \\ \downarrow \\ p(x_1,y_1)\,p(y_2 \mid x_1)=p(x_1,y_2)\,p(y_1 \mid x_1)\\ p(A)\,p(y_2 \mid x_1)=p(B)\,p(y_1 \mid x_1) \end{gather*} \]
Here \(A=(x_1,y_1)\) and \(B=(x_1,y_2)\) are two points that differ only in the \(y\) coordinate. Choose one dimension and resample it from its conditional distribution while fixing the other dimensions; every such move is accepted.
Example: \(X=\{e,t,w\}\)
- Step 1: Initialize e, t, and w, such that we have \(X_0 = \left( e_0,t_0,w_0\right)\)
- Step 2: generate \(e_1\) based on \(p(e \mid t=t_0,w=w_0)\)
- Step 3: generate \(t_1\) based on \(p(t \mid e=e_1,w=w_0)\)
- Step 4: generate \(w_1\) based on \(p(w \mid e=e_1,t=t_1)\)
- Step 5: repeat steps 2-4 n times, and we obtain a Markov chain
** Gibbs sampling versus M-H sampling:
- Gibbs 100% accept while M-H < 100%
- Gibbs needs all the conditional distributions, while M-H needs the joint p.d.f.
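A minimal Gibbs sketch for a bivariate Gaussian with correlation \(\rho\), where both conditionals are known normals; each step resamples one coordinate from its conditional while the other stays fixed, and every draw is accepted:

```python
import numpy as np

def gibbs_bivariate_normal(n, rho=0.8, seed=0):
    """Sample from a 2-D standard normal with correlation rho."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    samples = []
    cond_std = np.sqrt(1 - rho ** 2)
    for _ in range(n):
        x = rng.normal(rho * y, cond_std)    # x ~ p(x | y), a known normal
        y = rng.normal(rho * x, cond_std)    # y ~ p(y | x), conditioned on the new x
        samples.append((x, y))
    return np.array(samples)

s = gibbs_bivariate_normal(50_000)
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])   # close to rho = 0.8
```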
* Reservoir sampling
Reservoir sampling is a random sampling algorithm used for data streams.
Procedure:
- Include the first \(k\) items from the data stream in the reservoir
- For the \(n\)-th item when \(n > k\):
- Pick \(j\) uniformly from \(\{1,\dots,n\}\)
- If \(j \le k\), replace the \(j\)-th element of the reservoir with the new item
At any time, every element seen so far is in the reservoir with the same probability \(\frac{k}{n}\).
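A minimal sketch of the procedure above (Algorithm R), assuming the stream is any Python iterable:

```python
import random

def reservoir_sample(stream, k, seed=0):
    random.seed(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)           # fill the reservoir with the first k items
        else:
            j = random.randint(1, n)         # pick j uniformly from {1, ..., n}
            if j <= k:
                reservoir[j - 1] = item      # replace the j-th element with the new item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```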
Chapter 5: Data stream
Unique properties: infinite length / evolving nature
Challenges: single pass, memory limitation, low-time complexity, concept drift
What is concept drift?
\(Pr(C \mid X)\) changes over time as the data stream evolves.
Concept drift detection
- Distribution based: ADWIN
- Error rate based: DDM
Classification:
VFDT
Key point: **Hoeffding bound — only a small number of samples is needed to decide each split of the decision tree (bound given after this list)
CVFDT
Concept adaptation
SyncStream: prototype based
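For reference, the Hoeffding bound usually quoted for VFDT: with probability \(1-\delta\), the true mean of a random variable with range \(R\) lies within \(\epsilon\) of the mean observed over \(n\) samples, where
\[ \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \]
so a split can be made as soon as the observed gain difference between the two best attributes exceeds \(\epsilon\).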
Data stream clustering
Basic framework:
- Online: summarize data \(\rightarrow\) microclusters
- Offline: clustering
Data structure:
- Prototypes
**Cluster feature \(CF =(N,LS,SS)\): \(N\) = number of points, \(LS\) = linear sum, \(SS\) = squared sum
It is well suited to online updates thanks to its additivity property (see the sketch after this list):
\[ \overrightarrow{CF}(D_1)+\overrightarrow{CF}(D_2)=\overrightarrow{CF}(D_1 \cup D_2) \]
Grids
- Others
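A minimal sketch of the CF additivity mentioned above, assuming points arrive as NumPy rows; merging two cluster features is just element-wise addition of \(N\), \(LS\), and \(SS\), which is what makes the online phase cheap:

```python
import numpy as np

def cluster_feature(points):
    """CF = (N, LS, SS) for a batch of points (one point per row)."""
    points = np.asarray(points, dtype=float)
    return points.shape[0], points.sum(axis=0), (points ** 2).sum(axis=0)

def merge_cf(cf1, cf2):                      # additivity: CF(D1) + CF(D2) = CF(D1 ∪ D2)
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

D1 = np.array([[1.0, 2.0], [2.0, 3.0]])
D2 = np.array([[0.0, 1.0]])
merged = merge_cf(cluster_feature(D1), cluster_feature(D2))
direct = cluster_feature(np.vstack([D1, D2]))
print(merged, direct)                        # identical results
```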
Clustering algorithms:
- CluStream (K-Means based)
- DenStream (DBSCAN based)
Chapter 6: Graph mining
Key node identification
- Centrality: degree / betweenness / closeness
** K-shell decomposition
Gradually peel off nodes: first those with degree \(\le 1\), then \(\le 2\), and so on; the stage at which a node is removed gives its shell index
Benefits: intuitive, low time complexity, reveals the hierarchical structure of the network
Drawbacks: not suitable for star or tree networks; the ranking is coarse (many nodes fall into the same shell)
** Page rank
Basic idea: a page is important if the pages linking to it are also important
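A minimal power-iteration sketch on a tiny hand-made graph, assuming a column-stochastic link matrix and the usual damping factor \(d=0.85\):

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """adj[i, j] = 1 if page j links to page i."""
    n = adj.shape[0]
    M = adj / adj.sum(axis=0)                # column-stochastic: each page splits its vote over out-links
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r          # random jump + following links
    return r

# Links: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = np.array([[0, 0, 1],
                [1, 0, 0],
                [1, 1, 0]], dtype=float)
print(pagerank(adj))                         # page 2 gets the highest score
```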
Community detection
- Cut-based
- Minimum cut
- Ratio Cut
- Normalized cut
- Modularity(Expected Cut)
- Distance dynamics
Chapter 7: Hadoop and Spark
** What is Hadoop and Spark?
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Spark's goal was to generalize MapReduce to support new applications within the same engine, making it more efficient for the system and much simpler for end users.
* Design principles of Hadoop
- Automatic parallelization
- Fault tolerance & automatic recovery
- Clean and simple programming abstraction
Hadoop ecosystems
- ** Hadoop Distributed File System(HDFS)
Used for storage: NameNode (meta information), DataNode (actual data), heartbeat mechanism
- * MapReduce
Used for distributed execution: JobTracker, TaskTracker
Spark
- Resilient Distributed Dataset(RDD), can be stored in main memory
- Operations: transformations (lazy) and actions
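A minimal PySpark word-count sketch of the transformation/action distinction, assuming a local pyspark installation; the transformations only build the RDD lineage, and execution is triggered by the action:

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

lines = sc.parallelize(["big data mining", "big data analysis"])
counts = (lines.flatMap(lambda line: line.split())   # transformation (lazy)
               .map(lambda w: (w, 1))                # transformation (lazy)
               .reduceByKey(lambda a, b: a + b))     # transformation (lazy)

print(counts.collect())                              # action: execution happens here
sc.stop()
```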
**** MapReduce vs Spark
MapReduce
- Great for one-pass computation, not for multi-pass computation
- No efficient primitives/APIs for data sharing
Spark
- Extends a programming language(Scala) with RDD
- Provides clean APIs for Java, Python, Scala, R