Machine Learning
August 24, 2020

Big Data Analysis and Mining

Key points for the Big Data A&M final exam.

Question types

  1. Single-choice questions
  2. Multiple-choice questions
  3. Short-answer questions

Chapter 1: Introduction to big data mining

  1. Driving factors
    Data storage, computation power, data availability

  2. * What's big data?
    A massive volume of structured or unstructured data that is so large that traditional databases or software can't process it

  3. * 4V
    Volume, Velocity, Variety, Veracity

  4. * What's big data mining?
    Extracting meaningful patterns from big data.

  5. KDD(Knowledge Discovery in Databases) process
    Selection, preprocessing, transformation, data mining, interpretation/evaluation

  6. Main tasks of data mining
    Association rule mining, cluster analysis, classification, outlier detection

Chapter 2: Foundation of data mining

  1. Concepts
    Supervised learning versus unsupervised learning (plus semi-supervised learning)

  2. Overfitting
    Strategies:

    1. Increase sample size
    2. Remove outliers
    3. Decrease the model complexity
    4. Cross-validation
    5. Add a regularization term (l2-norm: smoothness / l1-norm: sparsity)
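
    A minimal scikit-learn sketch of strategy 5 (my own toy example, assuming scikit-learn and NumPy are available): the l2 penalty (Ridge) shrinks all weights smoothly, while the l1 penalty (Lasso) drives most weights exactly to zero.

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 20))              # 50 samples, 20 features
    y = 3.0 * X[:, 0] + rng.normal(size=50)    # only feature 0 is informative

    ridge = Ridge(alpha=1.0).fit(X, y)         # l2-norm: smoothness
    lasso = Lasso(alpha=0.1).fit(X, y)         # l1-norm: sparsity

    print("nonzero Ridge weights:", int(np.sum(ridge.coef_ != 0)))  # all 20
    print("nonzero Lasso weights:", int(np.sum(lasso.coef_ != 0)))  # only a few
    ```
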
  3. Typical classification algorithms

    1. KNN
      Lazy learning (no training step)
      Benefits: no training step
      Drawbacks: high computation cost at prediction time and relatively low accuracy

    2. Naive Bayes
      Assumption: class conditional independence
      Benefits: outputs probabilistic predictions

    3. SVM

      • Linear separate problem
        Find a hyperplane to maximize the margin (maximum marginal hyperplane)
        * Benefits: works with few training samples, generalizes well (the solution depends only on the support vectors, which decreases the structural risk)
      • Nonlinear problem
        Project low-dimensional data into a high-dimensional space using \(\Phi(x)\)
        Kernel function (reduces computation): \(K(x,z)=\Phi(x)\cdot\Phi(z)\)
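
      As a quick numerical check of the kernel identity (my own example, not from the lecture): for the degree-2 polynomial kernel \(K(x,z)=(x \cdot z)^2\) in 2-D, the explicit map is \(\Phi(x)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)\), and both sides agree.

      ```python
      import numpy as np

      def phi(x):
          # explicit feature map of the degree-2 polynomial kernel in 2-D
          return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

      def K(x, z):
          # kernel trick: same value without ever computing phi
          return np.dot(x, z) ** 2

      x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
      print(np.dot(phi(x), phi(z)))  # 121.0
      print(K(x, z))                 # 121.0
      ```
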
  4. Ensemble learning

    Criteria: each base classifier should perform well, and the classifiers should be diverse

    Strategies:

    • Bagging -> Random forest (sampling data with replacement)
      Sample several bootstrap datasets to train the base learners.
    • Boosting -> AdaBoost / XGBoost
      Give more weight to misclassified samples.
    • Stacking: use the base learners' outputs as new features for a meta-learner
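
    A hedged scikit-learn sketch of the three strategies (dataset and hyperparameters are illustrative only):

    ```python
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    models = {
        "bagging (random forest)": RandomForestClassifier(n_estimators=100),
        "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=100),
        "stacking": StackingClassifier(
            estimators=[("tree", DecisionTreeClassifier()),
                        ("lr", LogisticRegression(max_iter=5000))],
            final_estimator=LogisticRegression(max_iter=5000)),
    }
    for name, model in models.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())
    ```
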
  5. Clustering

    1. K-Means
      Limitations:

      • The value of K must be chosen in advance
      • Sensitive to initialization
      • Cannot handle arbitrarily shaped clusters
      • Cannot handle non-Gaussian clusters
    2. Hierarchical clustering
      Single-Link: minimum distance
      Complete-Link: maximum distance
      Average-Link: average distance

      Benefits: data structure visualization (dendrogram), arbitrary number of clusters

    3. Density-based clustering: DBSCAN
      Benefits:

      • Arbitrarily shaped clusters
      • Non-Gaussian clusters
      • No need to specify the number of clusters

      Drawbacks: sensitive to hyperparameters (MinPts, \(\varepsilon\))

  6. Association rule mining: Apriori

Chapter 3: Hashing

  1. * Min-Hash

    • What's Min-Hash?
    • ** How to compute signature matrix using Min-Hash
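
    A minimal sketch of the signature computation (my own toy example: documents are sets of shingle ids, and random permutations are simulated by hash functions \(h(x)=(ax+b) \bmod p\)):

    ```python
    import random

    # 4 documents, each a set of shingle ids in {0, ..., 9}
    docs = [{0, 3, 6}, {1, 3, 7, 9}, {0, 2, 6, 8}, {3, 5, 9}]
    n_hashes, p = 4, 11          # p: a prime no smaller than the number of rows

    random.seed(0)
    hashes = [(random.randint(1, p - 1), random.randint(0, p - 1))
              for _ in range(n_hashes)]

    # signature[i][d] = minimum of the i-th hash over document d's rows
    signature = [[min((a * r + b) % p for r in doc) for doc in docs]
                 for (a, b) in hashes]
    for row in signature:
        print(row)
    ```

    The fraction of rows on which two signature columns agree estimates the Jaccard similarity of the two documents.
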
  2. * Locality-Sensitive Hashing (LSH)

    Trick: divide the signature (input) matrix into many bands, and hash each band into buckets

    Note:

    • Results depend on the parameters: b (number of bands) and r (rows per band)
    • Both Min-Hash and LSH are approximate methods
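
    A sketch of the banding trick on a toy signature matrix (my own example; two documents become a candidate pair if they hash to the same bucket in at least one band, which happens with probability \(1-(1-s^r)^b\) for Jaccard similarity \(s\)):

    ```python
    from collections import defaultdict

    # toy signature matrix: signature[i][d] = i-th min-hash of document d
    signature = [[1, 1, 5, 2],
                 [0, 0, 3, 0],
                 [2, 2, 2, 6],
                 [4, 4, 1, 4]]
    b, r = 2, 2                  # b bands of r rows each (b * r = #hash functions)

    buckets = defaultdict(set)
    for band in range(b):
        for d in range(len(signature[0])):
            key = (band, tuple(signature[band * r + i][d] for i in range(r)))
            buckets[key].add(d)

    # documents sharing a bucket in any band are candidate pairs
    candidates = {tuple(sorted((d1, d2)))
                  for ds in buckets.values() for d1 in ds for d2 in ds if d1 != d2}
    print(candidates)            # {(0, 1)}: docs 0 and 1 agree on every band here
    ```
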
  3. \(\diamond\) Learn to hash

Chapter 4: Sampling

Given a p.d.f., we want to draw samples from it.

  1. Inverse sampling

    Basic idea:

    \[ p.d.f.\Longrightarrow c.d.f.\Longrightarrow \{x_i=F^{-1}(p) \mid p \sim U(0,1) \} \]

    Drawbacks: difficult to compute \(c.d.f\) and \(F^{-1}(p)\)
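
    A minimal sketch (my own example) for the exponential distribution, whose c.d.f. \(F(x)=1-e^{-\lambda x}\) has the closed-form inverse \(F^{-1}(p)=-\ln(1-p)/\lambda\):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.0
    p = rng.uniform(0.0, 1.0, size=100_000)   # p ~ U(0, 1)
    x = -np.log(1.0 - p) / lam                # x_i = F^{-1}(p)

    print(x.mean())                           # close to 1/lambda = 0.5
    ```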

  2. Rejection sampling

    Basic idea: introduce a proposal distribution \(q(x)\) together with a constant \(k\) such that \(k \cdot q(x) \ge p(x)\), and draw samples from it: \(x_i \sim q(x)\), \(a \sim U(0, k \cdot q(x_i))\). If \(a \le p(x_i)\), accept the sample; otherwise reject it.
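
    A sketch under simple assumptions (my own example): the target is the Beta(2, 2) density \(p(x)=6x(1-x)\) on \([0,1]\), the proposal is \(q=U(0,1)\), and the envelope constant is \(k=1.5\) so that \(k \cdot q(x) \ge p(x)\).

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def p(x):                        # target density: Beta(2, 2) = 6x(1 - x)
        return 6.0 * x * (1.0 - x)

    k = 1.5                          # envelope: k * q(x) >= p(x) for q = U(0, 1)
    samples = []
    while len(samples) < 10_000:
        x = rng.uniform(0.0, 1.0)    # x ~ q(x)
        a = rng.uniform(0.0, k)      # a ~ U(0, k * q(x))
        if a <= p(x):                # accept if the point falls under p
            samples.append(x)

    print(np.mean(samples))          # close to 0.5, the mean of Beta(2, 2)
    ```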

  3. Importance sampling

    Basic idea: do not reject samples; instead, assign a weight \(\frac{P(x_i)}{Q(x_i)}\) to each sample.

    ** Importance sampling vs Rejection sampling:

    1. RS assigns the same weight (1) to every sample, and only some of the samples are kept.
    2. IS assigns a different weight to each sample, and all samples are kept.
    3. IS is more robust to the choice of proposal distribution.
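
    A sketch (my own example): estimating \(E_p[x^2]=1\) for a standard normal target \(p\) with samples drawn from a wider normal proposal \(q\), using the weights \(\frac{P(x_i)}{Q(x_i)}\).

    ```python
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=3.0, size=100_000)   # x ~ q = N(0, 3^2)
    w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 0.0, 3.0)  # weights p(x)/q(x)

    # self-normalized importance-sampling estimate of E_p[x^2] (true value: 1)
    print(np.sum(w * x ** 2) / np.sum(w))
    ```
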
  4. Markov Chain Monte Carlo(MCMC) sampling

    ** Basic idea: it is a class of algorithms for sampling from a p.d.f. by constructing a Markov chain that has the desired distribution as its equilibrium distribution.

    Detailed Balance Condition (D.B.C.):

    \[ P(x_i)\cdot Q(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j) \]

    A generic proposal \(Q\) rarely satisfies the D.B.C., so we add acceptance ratio terms \(\alpha(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j)\) to both sides of the equation:

    \[ P(x_i)\cdot Q(x_j \mid x_i) \cdot \alpha(x_j \mid x_i)=P(x_j) \cdot Q(x_i \mid x_j) \cdot \alpha(x_i \mid x_j) \]

    Then use \(Q'(x_j\mid x_i)= Q(x_j \mid x_i) \cdot \alpha(x_j \mid x_i)\) as the new proposal distribution. A common choice is to set \(Q(x' \mid x)\) to a Gaussian centered on \(x\).

    Procedure:

    1. Initialize \(X_0 = x_0\)
    2. For \(t=0,1,2,\dots\):
        given the \(t^{th}\) state \(X_t=x_t\), sample \(y \sim q(x \mid x_t)\) and \(u \sim U[0,1]\)
        if \(u<\alpha(y \mid x_t)=p(y)\,q(x_t \mid y)\), accept: \(X_{t+1}=y\)
        else reject, and sample another \(y\) and \(u\)
  5. * Metropolis-Hastings(M-H) sampling

    The acceptance ratio in plain MCMC sampling can be so small that the sampling efficiency is low. While still satisfying the D.B.C., M-H enlarges the acceptance ratio by setting:

    \[ \alpha(X_j \mid X_i)=\min \left( 1,\frac{p(X_j)q(X_i \mid X_j)}{p(X_i)q(X_j \mid X_i)} \right) \]
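
    A minimal M-H sketch (my own example): the target is a standard normal known only up to a constant, and the proposal is a Gaussian random walk centered on the current state, so \(q\) is symmetric and cancels in the ratio.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def p_tilde(x):                        # unnormalized target: standard normal
        return np.exp(-0.5 * x * x)

    x, chain = 0.0, []
    for t in range(50_000):
        y = rng.normal(loc=x, scale=1.0)   # y ~ q(. | x), symmetric proposal
        alpha = min(1.0, p_tilde(y) / p_tilde(x))
        if rng.uniform() < alpha:          # accept with probability alpha
            x = y                          # otherwise the chain stays at x
        chain.append(x)

    samples = np.array(chain[5_000:])      # drop burn-in
    print(samples.mean(), samples.std())   # close to 0 and 1
    ```
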
  6. * Gibbs sampling

    M-H sampling has a larger acceptance ratio, while Gibbs sampling pushes the acceptance ratio all the way to 100%.

    \[ \begin{gather*} \left\{\begin{aligned} &p(x_1,y_1)p(y_2 \mid x_1) = p(x_1)p(y_1 \mid x_1)p(y_2 \mid x_1)\\ &p(x_1,y_2)p(y_1 \mid x_1) = p(x_1)p(y_2 \mid x_1)p(y_1 \mid x_1) \end{aligned}\right. \\ \downarrow \\ p(x_1,y_1)p(y_2 \mid x_1)=p(x_1,y_2)p(y_1 \mid x_1)\\ p(A)p(y_2 \mid x_1)=p(B)p(y_1 \mid x_1) \end{gather*} \]

    Here \(A=(x_1,y_1)\) and \(B=(x_1,y_2)\) are two points on the same line \(x=x_1\), so detailed balance holds automatically when we move along one coordinate using its conditional distribution.

    Choose one dimension and change it by conditional probability while fixing other dimensions.

    Example: \(X=\{e,t,w\}\)

    • Step 1: Initialize e, t and w, so that we have \(X_0 = \left( e_0,t_0,w_0\right)\)
    • Step 2: generate \(e_1\) based on \(p(e \mid t=t_0,w=w_0)\)
    • Step 3: generate \(t_1\) based on \(p(t \mid e=e_1,w=w_0)\)
    • Step 4: generate \(w_1\) based on \(p(w \mid e=e_1,t=t_1)\)
    • Step 5: repeat steps 2-4 n times, and we obtain a Markov chain

    ** Gibbs sampling versus M-H sampling:

    1. Gibbs accepts 100% of proposals, while M-H accepts < 100%
    2. Gibbs requires all the conditional distributions, while M-H requires the joint p.d.f.
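
    A minimal Gibbs sketch (my own example): a bivariate normal with correlation \(\rho\), where both conditionals are one-dimensional Gaussians.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8                                            # target correlation
    x, y, chain = 0.0, 0.0, []

    for t in range(50_000):
        # update one coordinate at a time from its conditional distribution
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # x ~ p(x | y)
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # y ~ p(y | x)
        chain.append((x, y))

    samples = np.array(chain[5_000:])                    # drop burn-in
    print(np.corrcoef(samples.T)[0, 1])                  # close to rho = 0.8
    ```
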
  7. * Reservoir sampling

    Reservoir sampling is a random sampling algorithm for data streams.

    Procedure:

    1. Put the first \(k\) items of the data stream into the reservoir
    2. For the \(n\)-th item, where \(n > k\):
      • Pick \(j\) uniformly from \(\{1,\dots,n\}\)
      • If \(j \le k\), replace the \(j\)-th element of the reservoir with the latest item

    At any time, every item seen so far is in the reservoir with the same probability \(\frac{k}{n}\).
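
    A direct sketch of the procedure above (the stream is simulated by a Python iterator):

    ```python
    import random

    def reservoir_sample(stream, k):
        reservoir = []
        for n, item in enumerate(stream, start=1):
            if n <= k:
                reservoir.append(item)       # keep the first k items
            else:
                j = random.randint(1, n)     # pick j uniformly from 1..n
                if j <= k:
                    reservoir[j - 1] = item  # replace the j-th reservoir element
        return reservoir

    random.seed(0)
    print(reservoir_sample(range(1_000_000), k=10))
    ```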

Chapter 5: Data stream

  1. Unique properties: infinite length / evolving nature

  2. Challenges: single pass, limited memory, low time complexity, concept drift

  3. What is concept drift?

    The underlying distribution \(Pr(C \mid X)\) changes over time as the data stream evolves.

  4. Concept drift detection

    1. Distribution based: ADWIN
    2. Error rate based: DDM
  5. Classification:

    1. VFDT
      Key point: **Hoeffding bound, which guarantees that few samples are needed to decide each split of the decision tree
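
      For reference (the standard form of the bound, not copied from the slides): with probability \(1-\delta\), the true mean of a variable with range \(R\) differs from its average over \(n\) samples by at most

      \[ \epsilon = \sqrt{\frac{R^2 \ln (1/\delta)}{2n}} \]

      so VFDT splits a node as soon as the gap between the best two attributes' criterion values exceeds \(\epsilon\).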

    2. CVFDT
      Concept adaptation

    3. SyncStream: prototype-based

  6. Data stream clustering

    • Basic framework:

      1. Online: summarize data \(\rightarrow\) micro-clusters
      2. Offline: clustering
    • Data structure:

      1. Prototypes
      2. **Cluster feature \(CF =(N,LS,SS)\)

        It supports efficient online updates thanks to its additivity property:

        \[ \overrightarrow{CF}(D_1)+\overrightarrow{CF}(D_2)=\overrightarrow{CF}(D_1 \cup D_2) \]
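
        A tiny sketch of the additivity (my own example), with \(N\) the count, \(LS=\sum x_i\) and \(SS=\sum x_i^2\):

        ```python
        import numpy as np

        def cluster_feature(points):
            pts = np.asarray(points, dtype=float)
            return pts.shape[0], pts.sum(axis=0), (pts ** 2).sum(axis=0)  # (N, LS, SS)

        d1 = [[1.0, 2.0], [2.0, 3.0]]
        d2 = [[4.0, 0.0]]

        n1, ls1, ss1 = cluster_feature(d1)
        n2, ls2, ss2 = cluster_feature(d2)
        n12, ls12, ss12 = cluster_feature(d1 + d2)

        # CF(D1) + CF(D2) equals CF(D1 ∪ D2): online updates are just additions
        print(n1 + n2 == n12,
              np.allclose(ls1 + ls2, ls12),
              np.allclose(ss1 + ss2, ss12))
        ```
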
      3. Grids

      4. Others
    • Clustering algorithms:

      1. CluStream (K-Means)
      2. DenStream (DBSCAN)

Chapter 6: Graph mining

  1. Key node identification

    1. Centrality: degree / betweenness / closeness
    2. ** K-shell decomposition

      Gradually remove nodes of degree 0, 1, 2, …, assigning each removed node to the corresponding shell
      Benefits: intuitive, low time complexity, reveals the hierarchy
      Drawbacks: not suitable for star or tree networks; the shell index is coarse (many nodes share the same value)
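
      A hedged sketch with networkx (assuming networkx is installed; the shell index of each node is its core number):

      ```python
      import networkx as nx

      # toy graph: a triangle (2-shell) with a pendant path attached (1-shell)
      G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)])

      print(nx.core_number(G))           # {1: 2, 2: 2, 3: 2, 4: 1, 5: 1}
      print(list(nx.k_shell(G, k=2)))    # nodes whose shell index is exactly 2
      ```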

    3. ** PageRank
      Basic idea: a page is important if the pages that link to it are also important
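
      A minimal power-iteration sketch of this idea (my own example; damping factor 0.85 is the usual choice):

      ```python
      import numpy as np

      # column-stochastic link matrix: M[i, j] = 1/outdegree(j) if j links to i
      M = np.array([[0.0, 0.5, 1.0],
                    [0.5, 0.0, 0.0],
                    [0.5, 0.5, 0.0]])
      n, d = 3, 0.85

      r = np.full(n, 1.0 / n)            # start from a uniform rank vector
      for _ in range(100):
          r = (1 - d) / n + d * M @ r    # redistribute rank along the links
      print(r)                           # PageRank scores, summing to 1
      ```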

  2. Community detection

    • Cut-based
      1. Minimum cut
      2. Ratio Cut
      3. Normalized cut
      4. Modularity(Expected Cut)
    • Distance dynamics

Chapter 7: Hadoop and Spark

  1. ** What is Hadoop and Spark?

    Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
    Spark's goal is to generalize MapReduce to support new applications within the same engine, making it both more efficient and much simpler for end users.

  2. * Design principles of Hadoop

    • Automatic parallelization
    • Fault tolerance & automatic recovery
    • Clean and simple programming abstraction
  3. Hadoop ecosystems

    • ** Hadoop Distributed File System(HDFS)
      Used for storage: NameNode (meta information), DataNode (actual data), heartbeat messages
    • * Map-Reduce
      Used for distributed execution: JobTracker, TaskTracker
  4. Spark

    • Resilient Distributed Dataset(RDD), can be stored in main memory
    • Operations: transformations (lazy), actions
  5. **** MapReduce vs Spark

    • MapReduce

      • Great for one-pass computation, not for multi-pass computation
      • No efficient primitives/APIs for data sharing
    • Spark

      • Extends a programming language (Scala) with RDDs
      • Provides clean APIs for Java, Python, Scala, R
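
    A hedged PySpark sketch of the RDD model described above (word count; assumes a local Spark installation, and `input.txt` is a placeholder path):

    ```python
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")

    counts = (sc.textFile("input.txt")                  # transformation (lazy)
                .flatMap(lambda line: line.split())     # transformation (lazy)
                .map(lambda word: (word, 1))            # transformation (lazy)
                .reduceByKey(lambda a, b: a + b))       # transformation (lazy)

    print(counts.take(10))                              # action: triggers execution
    sc.stop()
    ```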
