Big Data A&M Keypoints
Key points for the Big Data Analysis and Mining final exam.
Question types
- Single-choice questions
- Multiple-choice questions
- Short-answer questions
Chapter 1: Introduction to big data mining
Driving factors
Data storage, computation power, data availability
* What's big data?
A massive volume of structured or unstructured data that is so large that traditional databases or software can't process it
* 4V
Volume, Velocity, Variety, Veracity
* What's big data mining?
Extracting meaningful patterns from big data.
KDD(Knowledge Discovery in Databases) process
Selection, preprocessing, transformation, data mining, data interpretation
Main tasks of data mining
Association rule mining, cluster analysis, classification, outlier detection
Chapter 2: Foundations of data mining
Concepts
Supervised learning versus unsupervised learning (with semi-supervised learning in between)
Overfitting
Basic reason: different distributions of training and test data.
Strategies:
- Increase sample size
- Remove outliers
- Decrease the model complexity
- Cross-validation
- Add a regularization term (L2-norm: smoothness / L1-norm: sparsity)
Typical classification algorithms

KNN
Lazy learning: there is no training step; all work happens at query time.
Benefits: no training step
Drawbacks: high computation cost and low accuracy
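A minimal pure-Python sketch of the idea (toy 2-D data of my own, not from the lecture): the "training" step only stores the data, and all distance computation happens at prediction time, which is where the high computation cost comes from.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of (features, label) pairs; there is no training step."""
    # Lazy learning: all work happens here, at prediction time (O(n) per query).
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D dataset (hypothetical): two well-separated classes.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_predict(train, (0.5, 0.5)))  # lands near the "A" cluster
```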
Naive Bayes
Assumption: class conditional independence.
Benefits: probability prediction
SVM

Linearly separable problem
Find a hyperplane that maximizes the margin (maximum marginal hyperplane)
* Benefits: works with few training samples, high generalization (the solution is affected only by the support vectors, which decreases structural risk)
Nonlinear problem
Project low-dimensional data into a high-dimensional space using a mapping φ(x).
Kernel function (reduces computation): K(x_i, x_j) = φ(x_i) · φ(x_j), evaluated without constructing φ explicitly.
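A small sanity check of the kernel trick (toy numbers of my own; the degree-2 polynomial kernel is just one choice): K(x, z) = (x·z)² equals the inner product of the explicit degree-2 feature maps, so the high-dimensional φ never needs to be built.

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # <phi(x), phi(z)>
implicit = kernel(x, z)                                # same value, cheaper
print(explicit, implicit)
```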
Ensemble learning
Criteria: good performance and diversity w.r.t. each base classifier.
Strategies:
- Bagging -> Random forest (sampling data with replacement): sample bootstrap datasets to train base learners.
- Boosting -> AdaBoost / XGBoost: give more weight to wrongly classified data.
- Stacking: use the base learners' outputs as new features.
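A minimal bagging sketch (assumed toy 1-D data and a 1-nearest-neighbour base learner; random forests use decision trees instead): each base learner is trained on its own bootstrap sample and the ensemble takes a majority vote.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample |data| points with replacement (bagging's resampling step)."""
    return [rng.choice(data) for _ in data]

def nn_predict(train, query):
    """A deliberately weak base learner: 1-nearest-neighbour on 1-D data."""
    return min(train, key=lambda p: abs(p[0] - query))[1]

def bagging_predict(data, query, n_learners=15, seed=0):
    """Train each base learner on its own bootstrap sample, then majority-vote."""
    rng = random.Random(seed)
    votes = Counter(nn_predict(bootstrap(data, rng), query)
                    for _ in range(n_learners))
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data: two labeled value ranges.
data = [(x, "low") for x in range(10)] + [(x, "high") for x in range(20, 30)]
print(bagging_predict(data, 3))
```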
Clustering

K-Means
Limitations:
- The value of K must be specified in advance
- Sensitive to initialization
- Not compatible with arbitrarily shaped clusters
- Not compatible with non-Gaussian clusters
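A sketch of Lloyd's iteration for K-Means on toy 1-D points (my own example): alternate between assigning points to the nearest center and moving each center to its cluster mean. Note the random initialization, which is exactly the sensitivity mentioned in the limitations.

```python
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain Lloyd's algorithm on 1-D points; sensitive to the initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # random initialization
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]  # two obvious 1-D clusters
centers = kmeans(points, k=2)
print(centers)
```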
Hierarchical clustering
Single-Link: minimum distance
Complete-Link: maximum distance
Average-Link: average distance
Benefits: visualizes the data structure (dendrogram), supports an arbitrary number of clusters
Density-based clustering: DBSCAN
Benefits:
- Arbitrarily shaped clusters
- Non-Gaussian clusters
- No need to specify the number of clusters
Drawbacks: sensitive to the hyperparameters (MinPts, ε)
Association rule mining: Apriori
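A frequent-itemset sketch of Apriori (hedged: toy transactions of my own, support counted as absolute frequency): a level-wise search in which size-(k+1) candidates are generated only from frequent size-k itemsets, using the Apriori property that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: candidates of size k+1 are joined
    from frequent k-itemsets and pruned via the anti-monotone Apriori property."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    frequent = {}
    k_sets = [frozenset([i]) for i in sorted(items)]
    while k_sets:
        # Count the support of the current candidates in one pass.
        counts = {c: sum(c <= t for t in transactions) for c in k_sets}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step: merge frequent k-itemsets into (k+1)-candidates,
        # pruning any candidate that has an infrequent k-subset.
        prev = list(survivors)
        k_sets = list({a | b for a, b in combinations(prev, 2)
                       if len(a | b) == len(a) + 1
                       and all(frozenset(s) in survivors
                               for s in combinations(a | b, len(a)))})
    return frequent

tx = [{"milk", "bread"}, {"milk", "bread", "egg"}, {"bread", "egg"}, {"milk"}]
freq = apriori(tx, min_support=2)
print(sorted(tuple(sorted(s)) for s in freq))
```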
Chapter 3: Hashing
* Min-Hash
- What's Min-Hash? A hashing scheme in which the probability that two sets receive the same hash value equals their Jaccard similarity.
- ** How to compute the signature matrix using Min-Hash
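A sketch of the signature computation (toy characteristic matrix and hash functions of my choosing): each hash h(r) = (a·r + b) mod p simulates a random permutation of the rows, and a set's signature entry is the minimum hash value over the rows it contains.

```python
def minhash_signatures(sets, hash_funcs):
    """Compute the Min-Hash signature matrix: one row per hash function,
    one column per set; each entry is the minimum hash over the set's rows."""
    return [[min(h(r) for r in s) for s in sets] for h in hash_funcs]

# Columns of the characteristic matrix, given as sets of row indices (toy data).
sets = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]
# Hash functions h(r) = (a*r + b) % p simulating row permutations (5 rows assumed).
h1 = lambda r: (r + 1) % 5
h2 = lambda r: (3 * r + 1) % 5
signature = minhash_signatures(sets, [h1, h2])
print(signature)
```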
* Locality-Sensitive Hashing (LSH)
Trick: divide the signature (input) matrix into many bands, and hash each band into buckets; items sharing a bucket in any band become candidate pairs.
Note:
- Results depend on the parameters: b (number of bands), r (rows per band)
- Both Min-Hash and LSH are approximate computations
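A banding sketch (toy signatures; the bucket key is the band slice itself rather than a real hash, which is enough for illustration): columns that agree on all r rows of some band become candidate pairs.

```python
from collections import defaultdict

def lsh_candidates(signature_columns, b, r):
    """Band the signatures (b bands of r rows each) and bucket each band;
    any two columns sharing a bucket in some band become a candidate pair."""
    assert all(len(col) == b * r for col in signature_columns)
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col_id, col in enumerate(signature_columns):
            # Identical band slices land in the same bucket.
            buckets[tuple(col[band * r:(band + 1) * r])].append(col_id)
        for ids in buckets.values():
            candidates.update((i, j) for i in ids for j in ids if i < j)
    return candidates

# Toy signature columns (length 4), split into b=2 bands of r=2 rows.
cols = [[1, 2, 3, 4], [1, 2, 9, 9], [5, 6, 3, 4], [7, 7, 7, 7]]
pairs = lsh_candidates(cols, b=2, r=2)
print(pairs)
```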
Learning to hash
Chapter 4: Sampling
Given a p.d.f., we want to draw samples from it.
Inverse sampling
Basic idea: compute the CDF F(x), draw u ~ Uniform(0, 1), and return x = F⁻¹(u).
Drawbacks: the inverse CDF F⁻¹ is often difficult or impossible to compute in closed form.
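A worked example where the inverse CDF is available (my own choice of target, Exp(λ)): F(x) = 1 − e^(−λx), so F⁻¹(u) = −ln(1 − u)/λ.

```python
import math
import random

def sample_exponential(lam, rng):
    """Inverse sampling for Exp(lam): F(x) = 1 - exp(-lam*x),
    so x = F^{-1}(u) = -ln(1 - u)/lam with u ~ Uniform(0, 1)."""
    u = rng.random()
    return -math.log(1.0 - u) / lam

rng = random.Random(42)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # should be close to 1/lam = 0.5
```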
Rejection sampling
Basic idea: introduce a proposal distribution q(x), and draw samples from q(x). Choose a constant k such that k·q(x) ≥ p(x) for all x. For each draw, accept the sample with probability p(x) / (k·q(x)), otherwise reject it.
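A sketch with a toy target of my own, p(x) = 2x on [0, 1], and a Uniform(0, 1) proposal, where k = 2 guarantees k·q(x) ≥ p(x):

```python
import random

def rejection_sample(p, q_sample, q_pdf, k, rng):
    """Draw one sample from p via rejection: propose x ~ q, accept with
    probability p(x) / (k * q(x)), where k*q(x) >= p(x) everywhere."""
    while True:
        x = q_sample(rng)
        if rng.random() <= p(x) / (k * q_pdf(x)):
            return x  # accepted; otherwise loop and propose again

# Target: p(x) = 2x on [0, 1]; proposal: Uniform(0, 1) with k = 2.
rng = random.Random(0)
p = lambda x: 2.0 * x
samples = [rejection_sample(p, lambda r: r.random(), lambda x: 1.0, 2.0, rng)
           for _ in range(50_000)]
mean = sum(samples) / len(samples)
print(mean)  # E[x] under p is 2/3
```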
Importance sampling
Basic idea: do not reject samples, but assign a weight w(x) = p(x)/q(x) to each sample drawn from the proposal q(x).
** Importance sampling vs rejection sampling:
- RS assigns the same weight (1) to all samples, and only some samples are kept.
- IS assigns a different weight to each sample, and all samples are kept.
- IS is more robust to the choice of proposal distribution.
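A sketch of the weighted estimate (toy target of my own, p(x) = 2x on [0, 1], with a uniform proposal): every draw is kept and contributes f(x)·p(x)/q(x) to the average.

```python
import random

def importance_estimate(f, p, q_pdf, q_sample, n, rng):
    """Estimate E_p[f(x)] using samples from the proposal q: every sample is
    kept and weighted by w(x) = p(x)/q(x) instead of being rejected."""
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += f(x) * p(x) / q_pdf(x)   # the weight compensates for sampling from q
    return total / n

# Toy target: p(x) = 2x on [0, 1]; proposal: Uniform(0, 1).
rng = random.Random(7)
est = importance_estimate(lambda x: x, lambda x: 2.0 * x, lambda x: 1.0,
                          lambda r: r.random(), 50_000, rng)
print(est)  # E_p[x] = 2/3
```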
Markov Chain Monte Carlo (MCMC) sampling
** Basic idea: a class of algorithms for sampling from a p.d.f. based on constructing a Markov chain that has the desired distribution as its equilibrium distribution.
Detailed Balance Condition (D.B.C.): π(i) P(i, j) = π(j) P(j, i).
An arbitrary proposal Q rarely meets the D.B.C., so just add acceptance ratio terms to the equation: π(i) Q(i, j) α(i, j) = π(j) Q(j, i) α(j, i), which holds if we set α(i, j) = π(j) Q(j, i). Then use Q(i, j) α(i, j) as a new proposal distribution. Q is usually set as a Gaussian centered on the current state.
Procedure:
- Initialize x_0
- While t < T do:
  From the state x_t, sample x* ~ Q(x | x_t) and u ~ Uniform(0, 1)
  If u < α(x_t, x*), accept: x_{t+1} = x*
  Else reject: x_{t+1} = x_t, and sample again in the next step
* Metropolis-Hastings (M-H) sampling
The acceptance ratio α(i, j) = π(j) Q(j, i) in plain MCMC sampling is so small that the efficiency of sampling is low. While still holding the D.B.C., M-H just enlarges the acceptance ratio by setting: α(i, j) = min{ π(j) Q(j, i) / (π(i) Q(i, j)), 1 }.
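A sketch of M-H for a standard normal target (my own toy setup; with a symmetric Gaussian proposal the Q terms cancel, so α reduces to min(π(x*)/π(x_t), 1)):

```python
import math
import random

def metropolis_hastings(log_target, n, step=1.0, x0=0.0, seed=0):
    """M-H with a symmetric Gaussian proposal centered on the current state;
    symmetry makes Q(i, j) = Q(j, i), so alpha = min(pi(x*)/pi(x_t), 1)."""
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n):
        x_new = rng.gauss(x, step)        # propose from Q(. | x_t)
        # Accept with probability min(pi(x_new)/pi(x), 1); compare in log space.
        if math.log(rng.random()) < log_target(x_new) - log_target(x):
            x = x_new                     # accept the proposal
        chain.append(x)                   # on rejection the chain repeats x_t
    return chain

# Target: standard normal, log pi(x) = -x^2/2 up to a constant.
chain = metropolis_hastings(lambda x: -0.5 * x * x, n=100_000)
burned = chain[10_000:]                   # discard burn-in
mean = sum(burned) / len(burned)
print(mean)                               # should be near the target mean 0
```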
* Gibbs sampling
M-H sampling has a large acceptance ratio, while Gibbs sampling further makes the acceptance ratio 100%. Choose one dimension and update it from its conditional probability while fixing the other dimensions.
Example (three variables e, t and w):
- Step 1: Initialize e^(0), t^(0) and w^(0), such that we have an initial state (e^(0), t^(0), w^(0))
- Step 2: generate e^(i+1) based on p(e | t^(i), w^(i))
- Step 3: generate t^(i+1) based on p(t | e^(i+1), w^(i))
- Step 4: generate w^(i+1) based on p(w | e^(i+1), t^(i+1))
- Step 5: repeat steps 2-4 n times, and we obtain a Markov chain
** Gibbs sampling versus M-H sampling:
- Gibbs 100% accept while M-H < 100%
- Gibbs needs all conditional distributions while M-H needs joint p.d.f.
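A concrete two-variable sketch (my own choice of target: a standard bivariate normal with correlation ρ, chosen because its full conditionals are known in closed form):

```python
import math
import random

def gibbs_bivariate_normal(rho, n, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho.
    The full conditionals are x | y ~ N(rho*y, 1 - rho^2) and symmetrically
    for y, so every draw is accepted -- there is no rejection step at all."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x = y = 0.0                        # Step 1: initialize the state
    chain = []
    for _ in range(n):
        x = rng.gauss(rho * y, sd)     # Step 2: sample x from p(x | y)
        y = rng.gauss(rho * x, sd)     # Step 3: sample y from p(y | x)
        chain.append((x, y))           # repeating steps 2-3 yields a Markov chain
    return chain

chain = gibbs_bivariate_normal(rho=0.8, n=100_000)[5_000:]
corr = sum(x * y for x, y in chain) / len(chain)   # E[xy] equals rho here
print(corr)
```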
* Reservoir sampling
Reservoir sampling is a random sampling algorithm used for data streams.
Procedure (reservoir size k):
- Include the first k items from the data stream in the reservoir
- When the i-th item arrives (i > k):
  - Pick j uniformly from {1, ..., i}
  - If j ≤ k, replace the j-th element in the reservoir with the latest item
The probability of each element being in the reservoir is the same: k/i at any time.
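The procedure above as a sketch (toy stream of integers; a single pass, O(k) memory):

```python
import random

def reservoir_sample(stream, k, rng):
    """Maintain a uniform random sample of size k over a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # keep the first k items
        else:
            j = rng.randint(1, i)        # pick j uniformly from {1, ..., i}
            if j <= k:
                reservoir[j - 1] = item  # replace the j-th slot with the new item
    return reservoir

rng = random.Random(0)
sample = reservoir_sample(range(1_000), k=10, rng=rng)
print(sample)   # 10 items; each stream element survives with probability 10/1000
```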
Chapter 5: Data stream
Unique properties: infinite length / evolving nature

Challenges: single pass, memory limitation, low time complexity, concept drift

What is concept drift?
The underlying distribution P(x, y) changes over time with the data stream.
Concept drift detection
- Distribution based: ADWIN
- Error rate based: DDM
Classification:

VFDT
Key point: ** Hoeffding bound; only a small number of samples is needed to decide each split of the decision tree

CVFDT
Concept adaptation

SyncStream: prototype based
Data stream clustering
Basic framework:
- Online: summarize data into micro-clusters
- Offline: clustering on the stored summaries
Data structure:
- Prototypes
- ** Cluster feature: CF = (N, LS, SS). It is good for online updates because of its additivity property: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
- Grids
- Others
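A sketch of the additivity property for 1-D points (toy data; real cluster features keep LS and SS per dimension): merging two micro-clusters is just component-wise addition of their CF vectors, which is what makes online updates cheap.

```python
def cf(points):
    """Cluster feature of a set of 1-D points: CF = (N, LS, SS)."""
    return (len(points),
            sum(points),                    # LS: linear sum
            sum(p * p for p in points))     # SS: squared sum

def cf_merge(cf1, cf2):
    """Additivity: CF(A union B) = CF(A) + CF(B), component-wise."""
    return tuple(a + b for a, b in zip(cf1, cf2))

a, b = [1.0, 2.0, 3.0], [10.0, 20.0]
merged = cf_merge(cf(a), cf(b))
print(merged, cf(a + b))  # identical tuples
```

The centroid (LS/N) and radius (from SS) are derivable from the CF alone, so the raw points never need to be stored.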
Clustering algorithms:
- CluStream (K-Means)
- DenStream (DBSCAN)
Chapter 6: Graph mining
Key node identification
- Centrality: degree / betweenness / closeness
- ** K-shell decomposition
  Gradually remove nodes whose degree = 0, 1, 2, ...
  Benefits: intuitive, low time complexity, reveals hierarchy
  Drawbacks: not suitable for star or tree networks; the importance it assigns to each node is coarse
- ** PageRank
  Basic idea: a page is important if the pages that link to it are also important
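A power-iteration sketch on a toy three-page graph (the graph is my own example; d = 0.85 is the conventional damping factor): each page shares its score with the pages it links to, so a page linked from important pages becomes important.

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank: a node's score is the damped sum of the
    scores of the nodes linking to it, each divided by that node's out-degree."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - d) / n for u in nodes}     # random-jump term
        for u, outs in links.items():
            for v in outs:
                new[v] += d * rank[u] / len(outs)   # u shares its score with v
        rank = new
    return rank

# Toy link graph (hypothetical): A and B both link to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C collects the most score
```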
Community detection
- Cut-based
  - Minimum cut
  - Ratio cut
  - Normalized cut
- Modularity (expected cut)
- Distance dynamics
Chapter 7: Hadoop and Spark
** What are Hadoop and Spark?
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Spark's goal was to generalize MapReduce to support new applications within the same engine, making it more efficient and much simpler for end users.
* Design principles of Hadoop
- Automatic parallelization
- Fault tolerance & automatic recovery
- Clean and simple programming abstraction
Hadoop ecosystem
- ** Hadoop Distributed File System (HDFS)
  Used for storage: NameNode (metadata), DataNode (actual data), heartbeat messages
- * Map-Reduce
  Used for distributed execution: JobTracker, TaskTracker
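The canonical MapReduce example is word count; a single-process sketch of the three phases (map, shuffle, reduce) on toy input, with no Hadoop involved:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: each mapper turns its input split into (key, value) pairs."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle phase: the framework groups all values by key across mappers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: each reducer aggregates the values of one key."""
    return key, sum(values)

splits = ["big data big mining", "data mining"]   # two toy input splits
mapped = chain.from_iterable(mapper(s) for s in splits)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)
```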
Spark
- Resilient Distributed Dataset (RDD), which can be stored in main memory
- Operations: transformations (lazy) and actions
**** MapReduce vs Spark

MapReduce
- Great for one-pass computation, not for multi-pass computation
- No efficient primitives/APIs for data sharing
Spark
- Extends a programming language (Scala) with RDDs
- Provides clean APIs for Java, Python, Scala, R