Big Data Analysis and Mining
Key points for the Big Data A&M final exam.
Question types
- Single-choice questions
- Multiple-choice questions
- Short-answer questions
Chapter 1: Introduction to big data mining
Driving factors
Data storage, computation power, data availability
* What's big data?
A massive volume of structured or unstructured data that is so large that traditional databases and software cannot process it
* 4V
Volume, Velocity, Variety, Veracity
* What's big data mining?
Extracting meaningful patterns from big data.
KDD (Knowledge Discovery in Databases) process
Selection, preprocessing, transformation, data mining, data interpretation
Main tasks of data mining
Association rule mining, cluster analysis, classification, outlier detection
Chapter 2: Foundation of data mining
Concepts
Supervised learning versus unsupervised learning (with semi-supervised learning in between)
Overfitting
Strategies:
- Increase sample size
- Remove outliers
- Decrease the model complexity
- Cross-validation
- Add regularization term (l2-norm: smoothness / l1-norm: sparsity)
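As an illustration of the last strategy, a minimal scikit-learn sketch on synthetic data (`Ridge` uses the l2 penalty, `Lasso` the l1 penalty; the data and parameter values here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: only the first 3 of 20 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # l2-norm: shrinks all weights smoothly
lasso = Lasso(alpha=0.1).fit(X, y)   # l1-norm: drives irrelevant weights to exactly 0

print("non-zero ridge weights:", np.sum(ridge.coef_ != 0))  # typically all 20
print("non-zero lasso weights:", np.sum(lasso.coef_ != 0))  # typically around 3
```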
Typical classification algorithms
KNN
Lazy learning (no training step)
Benefits: no training step
Drawbacks: high computation cost and low accuracy
Naive Bayes
Assumption: class conditional independence
Benefits: probability prediction
SVM
- Linearly separable problem
Find a hyperplane that maximizes the margin (maximum marginal hyperplane)
* Benefits: works with few training samples, high generalization (affected only by the support vectors, which decreases structural risk)
- Nonlinear problem
Project low-dimensional data into a high-dimensional space using \(\Phi(x)\)
Kernel function (reduces computation): \(K(x,z)=\Phi(x)\cdot\Phi(z)\)
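A minimal scikit-learn sketch of both cases on a synthetic dataset that is not linearly separable; the RBF kernel handles the nonlinear case through \(K(x,z)\) without ever computing \(\Phi(x)\) explicitly:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)       # kernel trick: implicit high-dimensional mapping

print("linear accuracy:", linear_svm.score(X, y))  # poor, around 0.5
print("rbf accuracy:   ", rbf_svm.score(X, y))     # near 1.0
print("support vectors per class:", rbf_svm.n_support_)  # only these points define the margin
```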
Ensemble learning
Criteria: each base classifier should perform well, and the classifiers should be diverse from one another
Strategies:
- Bagging -> Random forest (sampling data with replacement)
Sample bootstrap datasets to train base learners.
- Boosting -> AdaBoost / XGBoost
Give more weight to wrongly classified data.
- Stacking: use base-learner outputs as new features
Clustering
K-Means
Limitations:
- The value of K must be specified in advance
- Sensitive to initializations
- Not compatible with arbitrarily shaped clusters
- Not compatible with non-Gaussian clusters
Hierarchical clustering
Single-Link: minimum distance
Complete-Link: maximum distance
Average-Link: average distance
Benefits: data structure visualization (dendrogram), arbitrary number of clusters
Density-based clustering: DBSCAN
Benefits:
- Arbitrarily shaped clusters
- Non-Gaussian clusters
- No need to specify the number of clusters
Drawbacks: sensitive to hyperparameters (MinPts, \(\varepsilon\))
Association rule mining: Apriori
Chapter 3: Hashing
* Min-Hash
- What's Min-Hash?
- ** How to compute signature matrix using Min-Hash
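A minimal NumPy sketch of computing a signature matrix, assuming sets are given as a binary element-by-set matrix and each hash function is simulated by a random permutation of the rows:

```python
import numpy as np

def minhash_signature(binary_matrix, num_hashes=100, seed=0):
    """binary_matrix: shape (n_elements, n_sets), entries 0/1."""
    rng = np.random.default_rng(seed)
    n_elements, n_sets = binary_matrix.shape
    signature = np.full((num_hashes, n_sets), np.inf)
    for h in range(num_hashes):
        perm = rng.permutation(n_elements)                 # one "hash function"
        for col in range(n_sets):
            rows_with_one = np.where(binary_matrix[:, col] == 1)[0]
            signature[h, col] = perm[rows_with_one].min()  # first 1 under the permutation
    return signature.astype(int)

# Two similar sets (columns 0, 1) and one different set (column 2).
M = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
sig = minhash_signature(M)
# Fraction of equal signature rows approximates the Jaccard similarity.
print((sig[:, 0] == sig[:, 1]).mean(), (sig[:, 0] == sig[:, 2]).mean())
```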
* Locality-Sensitive Hashing (LSH)
Trick: divide the signature (input) matrix into many bands, and hash each band of every column into buckets
Note:
- Results depend on the parameters: b (number of bands), r (rows per band)
- Both Min-Hash and LSH are approximate computations
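A minimal sketch of the banding trick, assuming the signature matrix has \(b \cdot r\) rows; columns whose slice in any band hashes to the same bucket become candidate pairs:

```python
import numpy as np
from collections import defaultdict

def lsh_candidate_pairs(signature, b, r):
    """signature: (b*r, n_sets) signature matrix; returns candidate column pairs."""
    n_sets = signature.shape[1]
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col in range(n_sets):
            # Hash this column's slice of the band into a bucket.
            key = tuple(signature[band * r:(band + 1) * r, col])
            buckets[key].append(col)
        for cols in buckets.values():                 # columns sharing a bucket are candidates
            for i in range(len(cols)):
                for j in range(i + 1, len(cols)):
                    candidates.add((cols[i], cols[j]))
    return candidates

# Tiny example: columns 0 and 1 agree on every band, column 2 differs.
sig = np.array([[1, 1, 7],
                [4, 4, 2],
                [3, 3, 9],
                [5, 5, 5]])
print(lsh_candidate_pairs(sig, b=2, r=2))             # expected: {(0, 1)}
```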
\(\diamond\) Learn to hash
Chapter 4: Sampling
Given a p.d.f., we want to draw samples from it.
Inverse sampling
Basic idea:
\[ \text{p.d.f.}\Longrightarrow \text{c.d.f.}\Longrightarrow \{x_i=F^{-1}(p) \mid p \sim U(0,1) \} \]
Drawbacks: difficult to compute the \(c.d.f.\) and \(F^{-1}(p)\)
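A minimal sketch for a case where \(F^{-1}\) is available in closed form, the exponential distribution with \(F^{-1}(p) = -\ln(1-p)/\lambda\); this is exactly the situation the drawback above says is rare:

```python
import numpy as np

def inverse_sample_exponential(lam, n, seed=0):
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.0, 1.0, size=n)        # p ~ U(0, 1)
    return -np.log(1.0 - p) / lam            # x = F^{-1}(p) for Exp(lam)

samples = inverse_sample_exponential(lam=2.0, n=100_000)
print(samples.mean())                        # should be close to 1/lam = 0.5
```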
Rejection sampling
Basic idea: introduce a proposal distribution \(q(x)\) and a constant \(k\) such that \(k\,q(x) \ge p(x)\) everywhere, and draw samples from \(q(x)\). For each \(x_i \sim q(x)\), draw \(a \sim U(0, k\,q(x_i))\). If \(a \le p(x_i)\), accept the sample; otherwise reject it.
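A minimal sketch for an illustrative bimodal target \(p(x)\), using a uniform proposal whose scaled envelope dominates \(p(x)\) everywhere:

```python
import numpy as np

def p(x):                                    # target density (unnormalized is fine)
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def rejection_sample(n, seed=0):
    rng = np.random.default_rng(seed)
    samples = []
    envelope = 1.1                           # constant k*q(x) for q = U(-6, 6); >= max p(x)
    while len(samples) < n:
        x = rng.uniform(-6, 6)               # x_i ~ q(x)
        a = rng.uniform(0, envelope)         # a ~ U(0, k*q(x_i))
        if a <= p(x):                        # accept if the point falls under the target curve
            samples.append(x)
    return np.array(samples)

print(rejection_sample(10_000).mean())       # symmetric target, mean close to 0
```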
Importance sampling
Basic idea: do not reject samples, but assign a weight \(\frac{P(x_i)}{Q(x_i)}\) to each sample.
** Importance sampling vs Rejection sampling:
- RS assigns the same weight (1) to all samples, and only some samples are kept.
- IS assigns a different weight to each sample, and all samples are kept.
- IS is more robust to the choice of proposal distribution.
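A minimal sketch estimating \(E_p[f(x)]\) for a standard normal target \(p\) using samples from a wider proposal \(q\); every sample is kept, weighted by \(p(x_i)/q(x_i)\):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
f = lambda x: x ** 2                            # target quantity: E_p[x^2] = 1 for p = N(0, 1)

x = rng.normal(0.0, 3.0, size=100_000)          # samples from the proposal q = N(0, 3^2)
w = normal_pdf(x, 0, 1) / normal_pdf(x, 0, 3)   # importance weights p(x_i) / q(x_i)

print(np.sum(w * f(x)) / np.sum(w))             # self-normalized estimate, close to 1.0
```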
Markov Chain Monte Carlo(MCMC) sampling
** Basic idea: a class of algorithms for sampling from a p.d.f. by constructing a Markov chain that has the desired distribution as its equilibrium distribution.
Detailed Balance Condition (D.B.C.):
\[ P(x_i)\cdot Q(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j) \]
It is difficult to meet the D.B.C. directly, so add acceptance ratio terms \(\alpha(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j)\) to both sides:
\[ P(x_i)\cdot Q(x_j \mid x_i) \cdot \alpha(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j) \cdot \alpha(x_i \mid x_j) \]
Then use \(Q'(x_j\mid x_i)= Q(x_j \mid x_i) \cdot \alpha(x_j \mid x_i)\) as the new proposal distribution. The proposal \(Q(x' \mid x)\) is usually a Gaussian centered on \(x\).
Procedure:
- Initialization: \(X_0 = x_0\)
- For \(t=0,1,2,\dots\) do
at the \(t\)-th state \(X_t=x_t\), sample \(y \sim q(x \mid x_t)\) and \(u \sim U[0,1]\)
if \(u<\alpha(y \mid x_t)=p(y)\,q(x_t \mid y)\), set \(X_{t+1}=y\)
else reject the proposal and keep \(X_{t+1}=x_t\)
* Metropolis-Hastings(M-H) sampling
The acceptance ratio in basic MCMC sampling is often so small that sampling is inefficient. While still satisfying the D.B.C., M-H enlarges the acceptance ratio by setting:
\[ \alpha(X_j \mid X_i)=\min \left( 1,\frac{p(X_j)q(X_i \mid X_j)}{p(X_i)q(X_j \mid X_i)} \right) \]
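A minimal M-H sketch for an illustrative bimodal target, with a Gaussian proposal centered on the current state (the proposal is symmetric, so the \(q\) terms cancel in the ratio):

```python
import numpy as np

def p(x):                                           # unnormalized target density
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    chain = []
    for _ in range(n):
        y = rng.normal(x, step)                     # proposal q(y | x): Gaussian centered on x
        alpha = min(1.0, p(y) / p(x))               # symmetric q, so the q terms cancel
        if rng.uniform() < alpha:                   # accept with probability alpha
            x = y
        chain.append(x)                             # on rejection, the chain repeats x
    return np.array(chain)

chain = metropolis_hastings(50_000)
print(chain.mean())                                 # symmetric target, mean close to 0
```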
* Gibbs sampling
M-H sampling has a larger acceptance ratio, while Gibbs sampling goes further and makes the acceptance ratio exactly 100%.
\[ \begin{gather*} \left\{\begin{aligned} &p(x_1,y_1)\,p(y_2 \mid x_1) = p(x_1)\,p(y_1 \mid x_1)\,p(y_2 \mid x_1)\\ &p(x_1,y_2)\,p(y_1 \mid x_1) = p(x_1)\,p(y_2 \mid x_1)\,p(y_1 \mid x_1) \end{aligned}\right. \\ \downarrow \\ p(x_1,y_1)\,p(y_2 \mid x_1)=p(x_1,y_2)\,p(y_1 \mid x_1)\\ p(A)\,p(y_2 \mid x_1)=p(B)\,p(y_1 \mid x_1) \end{gather*} \]
Here \(A=(x_1,y_1)\) and \(B=(x_1,y_2)\) are two points that differ only in the \(y\) coordinate. Choose one dimension and resample it from its conditional distribution while fixing the other dimensions; every such move is accepted.
Example: \(X=\{e,t,w\}\)
- Step 1: Initialize e, t, and w, such that we have \(X_0 = \left( e_0,t_0,w_0\right)\)
- Step 2: generate \(e_1\) based on \(p(e \mid t=t_0,w=w_0)\)
- Step 3: generate \(t_1\) based on \(p(t \mid e=e_1,w=w_0)\)
- Step 4: generate \(w_1\) based on \(p(w \mid e=e_1,t=t_1)\)
- Step 5: repeat steps 2-4 n times, and we obtain a Markov chain
** Gibbs sampling versus M-H sampling:
- Gibbs 100% accept while M-H < 100%
- Gibbs needs all the conditional distributions, while M-H needs the joint p.d.f.
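A minimal Gibbs sketch for a bivariate Gaussian with correlation \(\rho\), where both conditionals are known normals; each step resamples one coordinate from its conditional while the other stays fixed, and every draw is accepted:

```python
import numpy as np

def gibbs_bivariate_normal(n, rho=0.8, seed=0):
    """Sample from a 2-D standard normal with correlation rho."""
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    samples = []
    cond_std = np.sqrt(1 - rho ** 2)
    for _ in range(n):
        x = rng.normal(rho * y, cond_std)    # x ~ p(x | y), a known normal
        y = rng.normal(rho * x, cond_std)    # y ~ p(y | x), conditioned on the new x
        samples.append((x, y))
    return np.array(samples)

s = gibbs_bivariate_normal(50_000)
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])   # close to rho = 0.8
```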
* Reservoir sampling
Reservoir sampling is a random sampling algorithm used for data streams.
Procedure:
- Include the first \(k\) items from the data stream in the reservoir
- For the \(n\)-th item when \(n > k\):
- Pick \(j\) uniformly from \(\{1,\dots,n\}\)
- If \(j \le k\), replace the \(j\)-th element of the reservoir with the new item
At any time, every element seen so far is in the reservoir with the same probability \(\frac{k}{n}\).
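A minimal sketch of the procedure above (Algorithm R), assuming the stream is any Python iterable:

```python
import random

def reservoir_sample(stream, k, seed=0):
    random.seed(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)           # fill the reservoir with the first k items
        else:
            j = random.randint(1, n)         # pick j uniformly from {1, ..., n}
            if j <= k:
                reservoir[j - 1] = item      # replace the j-th element with the new item
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```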
Chapter 5: Data stream
Unique properties: infinite length / evolving nature
Challenges: single pass, memory limitation, low-time complexity, concept drift
What is concept drift?
\(Pr(C \mid X)\) changes over time as the data stream evolves.
Concept drift detection
- Distribution based: ADWIN
- Error rate based: DDM
Classification:
VFDT
Key point: **Hoeffding bound — only a small number of samples is needed to decide each split of the decision tree (bound given after this list)
CVFDT
Concept adaptation
SyncStream: prototype based
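For reference, the Hoeffding bound usually quoted for VFDT: with probability \(1-\delta\), the true mean of a random variable with range \(R\) lies within \(\epsilon\) of the mean observed over \(n\) samples, where
\[ \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \]
so a split can be made as soon as the observed gain difference between the two best attributes exceeds \(\epsilon\).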
Data stream clustering
Basic framework:
- Online: summarize data \(\rightarrow\) microclusters
- Offline: clustering
Data structure:
- Prototypes
**Cluster feature \(CF =(N,LS,SS)\): \(N\) = number of points, \(LS\) = linear sum, \(SS\) = squared sum
It is well suited to online updates thanks to its additivity property (see the sketch after this list):
\[ \overrightarrow{CF}(D_1)+\overrightarrow{CF}(D_2)=\overrightarrow{CF}(D_1 \cup D_2) \]
Grids
- Others
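A minimal sketch of the CF additivity mentioned above, assuming points arrive as NumPy rows; merging two cluster features is just element-wise addition of \(N\), \(LS\), and \(SS\), which is what makes the online phase cheap:

```python
import numpy as np

def cluster_feature(points):
    """CF = (N, LS, SS) for a batch of points (one point per row)."""
    points = np.asarray(points, dtype=float)
    return points.shape[0], points.sum(axis=0), (points ** 2).sum(axis=0)

def merge_cf(cf1, cf2):                      # additivity: CF(D1) + CF(D2) = CF(D1 ∪ D2)
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

D1 = np.array([[1.0, 2.0], [2.0, 3.0]])
D2 = np.array([[0.0, 1.0]])
merged = merge_cf(cluster_feature(D1), cluster_feature(D2))
direct = cluster_feature(np.vstack([D1, D2]))
print(merged, direct)                        # identical results
```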
Clustering algorithms:
- CluStream (K-Means based)
- DenStream (DBSCAN based)
Chapter 6: Graph mining
Key node identification
- Centrality: degree / betweenness / closeness
** K-shell decomposition
Gradually peel off nodes: first those with degree \(\le 1\), then \(\le 2\), and so on; the stage at which a node is removed gives its shell index
Benefits: intuitive, low time complexity, reveals the hierarchical structure of the network
Drawbacks: not suitable for star or tree networks; the ranking is coarse (many nodes fall into the same shell)
** Page rank
Basic idea: a page is important if the pages linking to it are also important
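A minimal power-iteration sketch on a tiny hand-made graph, assuming a column-stochastic link matrix and the usual damping factor \(d=0.85\):

```python
import numpy as np

def pagerank(adj, d=0.85, iters=100):
    """adj[i, j] = 1 if page j links to page i."""
    n = adj.shape[0]
    M = adj / adj.sum(axis=0)                # column-stochastic: each page splits its vote over out-links
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r          # random jump + following links
    return r

# Links: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = np.array([[0, 0, 1],
                [1, 0, 0],
                [1, 1, 0]], dtype=float)
print(pagerank(adj))                         # page 2 gets the highest score
```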
Community detection
- Cut-based
- Minimum cut
- Ratio Cut
- Normalized cut
- Modularity(Expected Cut)
- Distance dynamics
Chapter 7: Hadoop and Spark
** What is Hadoop and Spark?
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Spark's goal was to generalize MapReduce to support new applications within the same engine, making it more efficient for the system and much simpler for end users.
* Design principles of Hadoop
- Automatic parallelization
- Fault tolerance & automatic recovery
- Clean and simple programming abstraction
Hadoop ecosystems
- ** Hadoop Distributed File System(HDFS)
Used for storage: NameNode (meta information), DataNode (actual data), heartbeat mechanism
- * MapReduce
Used for distributed execution: JobTracker, TaskTracker
Spark
- Resilient Distributed Dataset(RDD), can be stored in main memory
- Operations: transformations (lazy) and actions
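A minimal PySpark word-count sketch of the transformation/action distinction, assuming a local pyspark installation; the transformations only build the RDD lineage, and execution is triggered by the action:

```python
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

lines = sc.parallelize(["big data mining", "big data analysis"])
counts = (lines.flatMap(lambda line: line.split())   # transformation (lazy)
               .map(lambda w: (w, 1))                # transformation (lazy)
               .reduceByKey(lambda a, b: a + b))     # transformation (lazy)

print(counts.collect())                              # action: execution happens here
sc.stop()
```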
**** MapReduce vs Spark
MapReduce
- Great for one-pass computation, not for multi-pass computation
- No efficient primitives/APIs for data sharing
Spark
- Extends a programming language(Scala) with RDD
- Provides clean APIs for Java, Python, Scala, R