Machine Learning
August 24, 2020

Big Data A&M Keypoints

Key points for the Big Data Analysis and Mining final exam.

Question types

  1. Single-choice questions
  2. Multiple-choice questions
  3. Short-answer questions

Chapter 1: Introduction to big data mining

  1. Driving factors
    Data storage, computation power, data availability

  2. * What's big data?
    A massive volume of structured or unstructured data that is so large that traditional databases or software cannot process it

  3. * 4V
    Volume, Velocity, Variety, Veracity

  4. * What's big data mining?
    Extracting meaningful patterns from big data.

  5. KDD(Knowledge Discovery in Databases) process
    Selection, preprocessing, transformation, data mining, interpretation/evaluation

  6. Main tasks of data mining
    Association rule mining, cluster analysis, classification, outlier detection

Chapter 2: Foundation of data mining

  1. Concepts
    Supervised learning versus unsupervised learning (semi-supervised in between)

  2. Overfitting
    Basic reason: different distributions of training and test data

    Strategies:

    1. Increase sample size
    2. Remove outliers
    3. Decrease the model complexity
    4. Cross-validation
    5. Add a regularization term (L2-norm: smoothness / L1-norm: sparsity)
  3. Typical classification algorithms

    1. KNN
      Lazy learning (no training step); a query sample is classified by majority vote among its k nearest training samples (see the sketch after this list).

      Benefits: no training step
      Drawbacks: high computation cost at prediction time and low accuracy

    2. Naive Bayes
      Assumption: class conditional independence

      Benefits: probability prediction

    3. SVM

      • Linearly separable problem
        Find a hyperplane that maximizes the margin (maximum marginal hyperplane)
        * Benefits: works with few training samples, good generalization (the solution is determined only by the support vectors, which decreases the structural risk)

      • Nonlinear problem
        Project the low-dimensional data into a high-dimensional space where it becomes linearly separable; a kernel function computes the inner products in that space directly, which reduces computation.
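
To make the lazy-learning point above concrete, here is a minimal KNN sketch in NumPy; the toy data, k = 3, and the Euclidean distance are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Lazy learning: no model is built; distances are computed only at prediction time.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]                      # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # majority vote

# Toy example (assumed data): two clusters around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_predict(X_train, y_train, np.array([4.5, 5.2]), k=3))  # -> 1
```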

  4. Ensemble learning
    Criteria: each base classifier should perform well, and the classifiers should be diverse

    Strategies:

    • Bagging -> Random forest (sampling data with replacement)
      Draw bootstrap datasets to train the base learners, then aggregate their predictions by voting (a minimal sketch follows this list).
    • Boosting -> AdaBoost / XGBoost
      Give more weight to misclassified data.
    • Stacking: use the base learners' outputs as new features for a meta-learner
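
A minimal bagging sketch for the bullet above, assuming scikit-learn decision trees as base learners and majority voting; the number of learners and the integer class labels are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit_predict(X_train, y_train, X_test, n_learners=25, seed=0):
    """Train base learners on bootstrap samples and combine them by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)          # bootstrap: sample with replacement
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.array(votes)                        # shape: (n_learners, n_test)
    # Majority vote per test sample; class labels assumed to be small non-negative ints.
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```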
  5. Clustering

    1. K-Means (a minimal sketch appears at the end of this chapter)
      Limitations:

      • The value of K must be specified in advance
      • Sensitive to initializations
      • Not compatible with arbitrarily shaped clusters
      • Not compatible with non-Gaussian clusters
    2. Hierarchical clustering

      Single-Link: minimum distance
      Complete-Link: maximum distance
      Average-Link: average distance

      Benefits: data structure visualization, arbitrary number of clusters

    3. Density-based clustering: DBSCAN
      Benefits:

      • Arbitrarily shaped clusters
      • Non-Gaussian clusters
      • No need to specify the number of clusters

      Drawbacks: sensitive to hyperparameters (MinPts, Eps)

  6. Association rule mining: Apriori
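
As referenced in the K-Means item above, here is a minimal sketch of Lloyd's algorithm in NumPy; K, the iteration cap, and the random initialization are assumptions for illustration (empty-cluster handling is omitted for brevity). Varying the seed shows the sensitivity to initialization.

```python
import numpy as np

def kmeans(X, k=2, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```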

Chapter 3: Hashing

  1. * Min-Hash

    • What's Min-Hash? The min-hash of a set (a column of the 0/1 characteristic matrix) under a random row permutation is the index of the first row in which that column has a 1; the probability that two columns share the same min-hash value equals their Jaccard similarity.
    • ** How to compute the signature matrix using Min-Hash: simulate many row permutations with hash functions; each column's signature is the vector of its min-hash values (see the sketch after this list).
  2. * Locality-Sensitive Hashing (LSH)
    Trick: divide the signature (input) matrix into many bands, and hash each band into buckets; columns that land in the same bucket for at least one band become candidate pairs

    Note:

    • Results depend on the parameters: b (number of bands), r (rows per band)
    • Both Min-Hash and LSH are approximate computations
  3. Learn to hash
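
A minimal sketch of the two starred points above, under illustrative assumptions: row permutations are simulated with random linear hash functions, and the toy sets, b = 5 bands, and r = 4 rows per band are made up for the example.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def minhash_signatures(sets, n_rows, n_hashes=100, seed=0):
    """Signature matrix: per simulated permutation, the first row where a column has a 1."""
    rng = np.random.default_rng(seed)
    p = 2_147_483_647                                   # large prime for the hash family
    a = rng.integers(1, p, size=n_hashes)
    b = rng.integers(0, p, size=n_hashes)
    sig = np.full((n_hashes, len(sets)), np.inf)
    for col, s in enumerate(sets):
        for r in s:                                     # rows where this column has a 1
            sig[:, col] = np.minimum(sig[:, col], (a * r + b) % p % n_rows)
    return sig.astype(int)

def lsh_candidate_pairs(sig, bands, rows_per_band):
    """Hash each band of each column to a bucket; columns sharing a bucket are candidates."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        chunk = sig[band * rows_per_band:(band + 1) * rows_per_band]
        for col in range(sig.shape[1]):
            buckets[tuple(chunk[:, col])].append(col)
        for cols in buckets.values():
            candidates.update(combinations(sorted(cols), 2))
    return candidates

# Toy sets of row indices (assumed data); S0 and S1 are similar, S2 is different.
sets = [{0, 2, 3, 5}, {0, 2, 3, 6}, {1, 4, 7}]
sig = minhash_signatures(sets, n_rows=8, n_hashes=20)
print(lsh_candidate_pairs(sig, bands=5, rows_per_band=4))  # likely {(0, 1)}
```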

Chapter 4: Sampling

Given a p.d.f., we want to draw samples from it.

  1. Inverse sampling
    Basic idea: draw u ~ Uniform(0, 1) and return x = F^{-1}(u), where F is the CDF of the target distribution.

    Drawbacks: the CDF and its inverse are often difficult or impossible to compute.

  2. Rejection sampling
    Basic idea: introduce a proposal distribution q(x) and a constant k such that k·q(x) ≥ p(x) everywhere, and draw samples from q(x). For a sample x0 ~ q(x), draw u ~ Uniform(0, k·q(x0)). If u ≤ p(x0), accept this sample; otherwise reject it.
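
A minimal rejection-sampling sketch; the Beta(2, 5) target, the Uniform(0, 1) proposal, and k = 2.5 are illustrative assumptions.

```python
import numpy as np

def rejection_sample(p, k, n_samples, seed=0):
    """Sample from p(x) on [0, 1] with a Uniform(0, 1) proposal q and k*q(x) >= p(x)."""
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n_samples:
        x0 = rng.uniform(0, 1)            # draw from the proposal q
        u = rng.uniform(0, k)             # u ~ Uniform(0, k*q(x0)); here q(x0) = 1
        if u <= p(x0):                    # accept iff u falls under the target density
            samples.append(x0)
    return np.array(samples)

# Target: Beta(2, 5) density (illustrative); its maximum is below 2.5, so k = 2.5 works.
beta_pdf = lambda x: 30 * x * (1 - x) ** 4
xs = rejection_sample(beta_pdf, k=2.5, n_samples=1000)
print(xs.mean())  # should be close to the Beta(2, 5) mean, 2/7 ≈ 0.286
```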

  3. Importance sampling
    Basic idea: do not reject samples; instead assign a weight w(x) = p(x)/q(x) to each sample drawn from the proposal q(x).

    ** Importance sampling vs Rejection sampling:

    1. RS assigns the same weight (1) to all samples, and only some samples are kept.
    2. IS assigns a different weight to each sample, and all samples are kept.
    3. IS is more robust to the choice of proposal distribution.
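
A minimal (self-normalized) importance-sampling sketch; the standard-normal target and the N(1, 2) proposal are illustrative assumptions.

```python
import numpy as np

def importance_estimate(f, p, q_sample, q_pdf, n=100_000, seed=0):
    """Estimate E_p[f(X)] with samples from q, weighting each by w = p(x)/q(x)."""
    rng = np.random.default_rng(seed)
    xs = q_sample(rng, n)                 # all samples come from the proposal q
    w = p(xs) / q_pdf(xs)                 # importance weights; no sample is rejected
    return np.sum(w * f(xs)) / np.sum(w)  # self-normalized importance sampling

# Illustrative target: standard normal p; proposal: normal with mean 1, std 2.
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q_pdf = lambda x: np.exp(-(x - 1)**2 / 8) / np.sqrt(8 * np.pi)
q_sample = lambda rng, n: rng.normal(1.0, 2.0, size=n)
print(importance_estimate(lambda x: x, p, q_sample, q_pdf))  # close to E_p[X] = 0
```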
  4. Markov Chain Monte Carlo(MCMC) sampling
    ** Basic idea: a class of algorithms for sampling from a p.d.f. by constructing a Markov chain that has the desired distribution as its equilibrium distribution.

    Detailed Balance Condition (D.B.C.): π(i)P(i, j) = π(j)P(j, i). An arbitrary proposal q(i, j) rarely meets the D.B.C., so add acceptance-ratio terms to the equation: π(i)q(i, j)α(i, j) = π(j)q(j, i)α(j, i), which holds if we set α(i, j) = π(j)q(j, i). Then use q(i, j)α(i, j) as the new proposal (transition) probability. q is usually set as a Gaussian centered on the current state.

    Procedure:

    1. Initialization: choose an initial state x_0
    2. While t < T do
        given the current state x_t, sample a candidate x* ~ q(x | x_t) and u ~ Uniform(0, 1)
        if u < α(x_t, x*) = π(x*)q(x*, x_t)
          accept the move: x_{t+1} = x*
        else
          reject it, sample another candidate x* and u, and test again
  5. * Metropolis-Hastings(M-H) sampling
    The acceptance ratio α(i, j) = π(j)q(j, i) in plain MCMC sampling is so small that the efficiency of sampling is low. While still holding the D.B.C., M-H enlarges the acceptance ratio by setting: α(i, j) = min{1, π(j)q(j, i) / (π(i)q(i, j))}.
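
A minimal Metropolis-Hastings sketch; the unnormalized standard-normal target and the symmetric Gaussian random-walk proposal (so the q terms cancel in the acceptance ratio) are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings(log_pi, n_samples=10_000, step=1.0, x0=0.0, seed=0):
    """Random-walk M-H: propose x* ~ N(x_t, step), accept with prob min(1, pi(x*)/pi(x_t))."""
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(n_samples):
        x_star = rng.normal(x, step)                     # symmetric proposal q
        # Symmetric q cancels, so alpha = min(1, pi(x*) / pi(x)); compare in log space.
        if np.log(rng.uniform()) < log_pi(x_star) - log_pi(x):
            x = x_star                                   # accept the candidate
        chain.append(x)                                  # otherwise keep the old state
    return np.array(chain)

# Illustrative target: unnormalized standard normal density (log form for stability).
chain = metropolis_hastings(lambda x: -0.5 * x**2)
print(chain.mean(), chain.std())  # roughly 0 and 1 after burn-in
```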

  6. * Gibbs sampling
    M-H sampling has a larger acceptance ratio, while Gibbs sampling goes further and makes the acceptance ratio 100%.

    Choose one dimension and resample it from its conditional distribution while fixing the other dimensions.

    Example:

    • Step 1: Initialize e, t and w, so that we have the initial state (e^(0), t^(0), w^(0))
    • Step 2: generate e^(i+1) based on p(e | t^(i), w^(i))
    • Step 3: generate t^(i+1) based on p(t | e^(i+1), w^(i))
    • Step 4: generate w^(i+1) based on p(w | e^(i+1), t^(i+1))
    • Step 5: repeat steps 2-4 n times, and we obtain a Markov chain (e^(i), t^(i), w^(i))

    ** Gibbs sampling versus M-H sampling:

    1. Gibbs accepts 100% of proposals while M-H accepts < 100%.
    2. Gibbs needs all the conditional distributions while M-H only needs the joint p.d.f.
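
A minimal Gibbs-sampling sketch for a 2D Gaussian with correlation rho = 0.8 (an illustrative assumption), where both conditional distributions are known normals and every draw is accepted.

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=10_000, seed=0):
    """Gibbs sampling for a 2D standard normal with correlation rho.

    Each step resamples one coordinate from its exact conditional,
    p(x | y) = N(rho*y, 1 - rho^2) and p(y | x) = N(rho*x, 1 - rho^2),
    so every proposal is accepted (acceptance ratio 100%).
    """
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    chain = np.empty((n_samples, 2))
    sd = np.sqrt(1 - rho ** 2)
    for i in range(n_samples):
        x = rng.normal(rho * y, sd)   # fix y, resample x from p(x | y)
        y = rng.normal(rho * x, sd)   # fix x, resample y from p(y | x)
        chain[i] = (x, y)
    return chain

chain = gibbs_bivariate_normal()
print(np.corrcoef(chain.T)[0, 1])  # close to rho = 0.8
```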
  7. * Reservoir sampling
    Reservoir sampling is a random sampling algorithm used for data streams.

    Procedure:

    1. Put the first k items from the data stream into the reservoir
    2. When the i-th item arrives (i > k):
      • Pick j uniformly from {1, ..., i}
      • If j ≤ k, replace the j-th element in the reservoir with the latest item

    The probability of each element of the stream being in the reservoir is the same, k/i, at any time.
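
A minimal reservoir-sampling sketch; k = 10 and the toy stream are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    random.seed(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)            # fill the reservoir with the first k items
        else:
            j = random.randint(1, i)          # pick j uniformly from {1, ..., i}
            if j <= k:
                reservoir[j - 1] = item       # replace the j-th slot with the new item
    return reservoir                           # each seen item survives with probability k/i

print(reservoir_sample(range(1, 1001), k=10))
```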

Chapter 5: Data stream

  1. Unique properties: infinite length / evolving nature

  2. Challenges: single pass, memory limitation, low time complexity, concept drift

  3. What is concept drift?
    The underlying data distribution p(x, y) changes over time as the data stream evolves.

  4. Concept drift detection

    1. Distribution based: ADWIN
    2. Error rate based: DDM
  5. Classification:

    1. VFDT
      Key point: ** Hoeffding bound; only a few samples are needed to decide each split of the decision tree (with probability 1 − δ, the empirical mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean)

    2. CVFDT
      Concept adaptation

    3. SyncStream: prototype-based

  6. Data stream clustering

    Basic framework:

    1. Online: summarize the data into micro-clusters
    2. Offline: clustering

    Data structure:

    1. Prototypes
    2. ** Cluster feature: CF = (N, LS, SS), i.e., the number of points, their linear sum, and their squared sum

      It's good for online updates because of its additivity property: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2) (a minimal sketch follows this chapter's list)
    3. Grids
    4. Others

    Clustering algorithms:

    1. CluStream (K-Means)
    2. DenStream (DBSCAN)
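
A minimal sketch of the cluster feature and its additivity, as mentioned in item 6 above; the dimensionality and toy points are illustrative assumptions.

```python
import numpy as np

class ClusterFeature:
    """Cluster feature CF = (N, LS, SS): count, linear sum, and squared sum of points."""

    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)   # linear sum of points
        self.ss = np.zeros(dim)   # element-wise squared sum of points

    def add(self, x):
        """Online update: absorb one new point from the stream."""
        self.n += 1
        self.ls += x
        self.ss += x ** 2

    def merge(self, other):
        """Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

cf = ClusterFeature(dim=2)
for point in np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]):
    cf.add(point)
print(cf.centroid())  # [3. 4.]
```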

Chapter 6: Graph mining

  1. Key node identification

    1. Centrality: degree / betweenness / closeness
    2. ** K-shell decomposition
      Gradually remove nodes of degree ≤ 1, then ≤ 2, ..., assigning removed nodes to the corresponding shells
      Benefits: intuitive, low time complexity, reveals the hierarchical structure of the network
      Drawbacks: not applicable to star or tree networks; the importance assigned to each node is coarse-grained
    3. ** Page rank
      Basic idea: a page is important if the pages linking to it are also important (see the sketch after this list)
  2. Community detection

    • Cut-based
      1. Minimum cut
      2. Ratio Cut
      3. Normalized cut
      4. Modularity (expected cut)
    • Distance dynamics
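
A minimal power-iteration PageRank sketch for the item above; the toy adjacency matrix, the damping factor 0.85, and the absence of dangling-node handling are illustrative assumptions.

```python
import numpy as np

def pagerank(adj, damping=0.85, n_iters=100, tol=1e-8):
    """Power iteration for PageRank on a column-stochastic transition matrix."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=0)
    # Column-stochastic transition matrix; dangling nodes would need extra handling.
    M = adj / out_degree
    rank = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        new_rank = (1 - damping) / n + damping * M @ rank
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# Toy directed graph (illustrative): adj[i, j] = 1 means an edge j -> i.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 1, 0]], dtype=float)
print(pagerank(adj))  # node 0, pointed to by both other nodes, gets the highest score
```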

Chapter 7: Hadoop and Spark

  1. ** What is Hadoop and Spark?
    Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
    Spark's goal was to generalize MapReduce to support new applications within the same engine, making it more efficient and much simpler for end users.

  2. * Design principles of Hadoop

    • Automatic parallelization
    • Fault tolerance & automatic recovery
    • Clean and simple programming abstraction
  3. Hadoop ecosystems

    • ** Hadoop Distributed File System(HDFS)
      Used for storage: NameNode (metadata), DataNode (actual data blocks), heartbeat messages between them
    • * Map-Reduce
      Used for distributed execution: JobTracker, TaskTracker
  4. Spark

    • Resilient Distributed Dataset (RDD), which can be cached in main memory
    • Operations: transformations (lazy) and actions (see the word-count sketch after this list)
  5. **** MapReduce vs Spark

    • MapReduce

      • Great for one-pass computation, not for multi-pass computation
      • No efficient primitives/APIs for data sharing
    • Spark

      • Extends a programming language(Scala) with RDD
      • Provides clean APIs for Java, Python, Scala, R
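
A minimal PySpark word-count sketch contrasting lazy transformations with actions, assuming a local Spark installation; the input path "input.txt" is a hypothetical placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

# Transformations only build up the RDD lineage lazily; no job runs yet.
counts = (
    sc.textFile("input.txt")                    # assumed local input file
      .flatMap(lambda line: line.split())       # map phase: emit words
      .map(lambda word: (word, 1))              # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)          # reduce phase: sum counts per word
)

# The action triggers the actual computation.
print(counts.collect())
sc.stop()
```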
