Terms In Data Science (A-Z)
A:
- Accuracy: Correct predictions divided by total predictions (see the sketch below).
- Area Under Curve (AUC): Area under the ROC curve; summarizes classifier performance across all decision thresholds.
- ARIMA: Autoregressive integrated moving average, a classical time-series forecasting method.
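A minimal sketch of the first two terms using scikit-learn metrics; the labels and scores below are hypothetical:

```python
# Accuracy and ROC AUC on hypothetical labels/scores (scikit-learn).
from sklearn.metrics import accuracy_score, roc_auc_score

y_true  = [0, 1, 1, 0, 1]            # ground-truth labels
y_pred  = [0, 1, 0, 0, 1]            # hard class predictions
y_score = [0.2, 0.9, 0.4, 0.3, 0.8]  # predicted probabilities

print(accuracy_score(y_true, y_pred))  # 4 correct / 5 total = 0.8
print(roc_auc_score(y_true, y_score))  # area under the ROC curve
```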
B:
- Bias: Systematic error; the difference between a model's average prediction and the true value.
- Bayes' Theorem: Updates an event's probability from prior knowledge and new evidence (worked example below).
- Binomial Distribution: Models the number of successes in a fixed number of independent trials.
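A worked Bayes' Theorem example with made-up numbers, showing how a small prior tempers a positive test result:

```python
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
p_d = 0.01          # prior: P(disease) -- hypothetical
p_pos_d = 0.95      # sensitivity: P(positive | disease)
p_pos_not_d = 0.05  # false-positive rate: P(positive | no disease)

p_pos = p_pos_d * p_d + p_pos_not_d * (1 - p_d)  # total probability
print(p_pos_d * p_d / p_pos)                     # posterior ~0.161
```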
C:
- Clustering: Grouping data points based on similarities.
- Confusion Matrix: Table of true versus predicted labels used to evaluate a classification model (sketch below).
- Cross-validation: Assesses model performance by training and testing on different data subsets.
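A minimal sketch of both ideas on scikit-learn's built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))  # rows: true, cols: predicted
print(cross_val_score(model, X, y, cv=5).mean())    # mean accuracy over 5 folds
```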
D:
- Decision Trees: Tree-like models for classification and regression (sketch below).
- Dimensionality Reduction: Reducing dataset features while preserving the important information.
- Discriminative Models: Learn the boundaries between classes.
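A minimal decision-tree sketch on the iris data; export_text prints the learned splits:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # the learned if/else splits, one per line
```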
E:
- Ensemble Learning: Combines multiple models for better performance.
- EDA (Exploratory Data Analysis): Analyzing and visualizing data patterns.
- Entropy: Measure of randomness in information.
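Entropy in a few lines of NumPy; a fair coin carries one full bit of uncertainty, a biased one less:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.0 bit (fair coin)
print(entropy([0.9, 0.1]))  # ~0.47 bits (biased coin)
```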
F:
- Feature Engineering: Creating new features to improve model performance.
- F-score: Harmonic mean of precision and recall in binary classification (worked example below).
- Feature Extraction: Automatically deriving meaningful features from raw data.
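The F1 score worked out from hypothetical confusion-matrix counts:

```python
tp, fp, fn = 40, 10, 20                 # hypothetical counts
precision = tp / (tp + fp)              # 0.8
recall = tp / (tp + fn)                 # ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(f1)                               # ~0.727, the harmonic mean
```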
G:
- Gradient Descent: Optimization algorithm that minimizes a function by adjusting parameters iteratively (sketch below).
- Gaussian Distribution: Normal distribution with a bell-shaped curve.
- Gradient Boosting: Sequentially builds weak learners for improved performance.
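A toy gradient-descent run on f(w) = (w - 3)^2, whose gradient is 2(w - 3):

```python
w, lr = 0.0, 0.1           # start far from the minimum; learning rate 0.1
for _ in range(100):
    w -= lr * 2 * (w - 3)  # step against the gradient
print(w)                   # converges to the minimum at w = 3
```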
H:
- Hypothesis: Testable statement in statistical inference.
- Hierarchical Clustering: Organizes data into a tree-like structure of nested clusters (sketch below).
- Heteroscedasticity: Unequal variance of errors in a regression model.
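A minimal hierarchical-clustering sketch with SciPy on synthetic two-blob data: linkage builds the merge tree, fcluster cuts it into flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),    # one blob near (0, 0)
               rng.normal(5, 0.5, (10, 2))])   # another near (5, 5)
Z = linkage(X, method="ward")                  # the merge tree
print(fcluster(Z, t=2, criterion="maxclust"))  # cut into 2 flat clusters
```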
I:
- Information Gain: Measures feature importance in decision trees.
- Independent Variable: Variable manipulated to observe effects on the dependent variable.
- Imbalance: Unequal distribution of classes in a dataset.
J:
- Jupyter: Interactive computing environment for data analysis.
- Joint Probability: Probability of events occurring together.
- Jaccard Index: Measures similarity between two sets (sketch below).
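The Jaccard index computed directly from two hypothetical Python sets:

```python
a, b = {1, 2, 3, 4}, {3, 4, 5}
print(len(a & b) / len(a | b))  # |A ∩ B| / |A ∪ B| = 2/5 = 0.4
```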
K:
- Kernel Density Estimation: Estimates the probability density of a continuous variable.
- KS Test (Kolmogorov-Smirnov): Compares two probability distributions, or a sample against a reference distribution.
- KMeans Clustering: Partitions data into K clusters based on similarity (sketch below).
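A minimal K-means sketch on synthetic two-blob data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # blob near (0, 0)
               rng.normal(6, 1, (50, 2))])  # blob near (6, 6)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centroid per recovered blob
```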
L:
- Likelihood: Probability of the observed data viewed as a function of model parameters.
- Linear Regression: Models a linear relationship between dependent and independent variables (sketch below).
- L1/L2 Regularization: Prevents overfitting by adding a penalty on the absolute (L1) or squared (L2) weights.
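Plain linear regression next to its L2-regularized variant (ridge); the penalty shrinks coefficients toward zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

print(LinearRegression().fit(X, y).coef_)  # ~[2, -1, 0.5]
print(Ridge(alpha=10.0).fit(X, y).coef_)   # same pattern, shrunk toward 0
```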
M:
- Maximum Likelihood Estimation: Estimates statistical model parameters by maximizing the likelihood of the data (sketch below).
- Multicollinearity: High correlation between independent variables in a regression.
- Mutual Information: Amount of information shared between variables.
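For i.i.d. Gaussian data, the maximum-likelihood estimates are simply the sample mean and the biased sample variance:

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=1000)
print(x.mean())  # MLE of the mean, close to 5
print(x.var())   # MLE of the variance (ddof=0), close to 4
```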
N:
- Naive Bayes: Probabilistic classifier assuming feature independence.
- Normalization: Rescales features to a fixed range, typically [0, 1] via min-max scaling (sketch below).
- Null Hypothesis: Assumption of no significant difference or effect in statistical testing.
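Min-max normalization versus z-score standardization on the same toy array:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])
print((x - x.min()) / (x.max() - x.min()))  # min-max: [0, 0.25, 0.5, 1]
print((x - x.mean()) / x.std())             # z-scores: mean 0, std-dev 1
```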
O:
- Overfitting: Model performs well on training data but poorly on new data.
- Outliers: Data points significantly different from the others.
- One-hot encoding: Converts categorical variables into binary vectors (sketch below).
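One-hot encoding with pandas on toy data; each category becomes its own binary column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
print(pd.get_dummies(df, columns=["color"]))  # color_blue, color_green, color_red
```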
P:
- PCA (Principal Component Analysis): Reduces dimensionality by projecting data onto orthogonal components of maximal variance (sketch below).
- Precision: True positives among all positive predictions.
- p-value: Probability of a result at least as extreme as the one observed, assuming the null hypothesis is true.
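A minimal PCA sketch projecting the 4-D iris features onto two principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured per component
```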
Q:
- QQ-plot: Compares two distributions graphically by plotting their quantiles against each other.
- QR decomposition: Factorizes a matrix into an orthogonal matrix and an upper-triangular matrix.
R:
- Random Forest: Ensemble method built from many decision trees (sketch below).
- Recall: True positives among all actual positives.
- ROC Curve: Shows binary classifier performance at different thresholds.
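A minimal random-forest sketch with cross-validated accuracy on iris:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())  # mean accuracy over 5 folds
```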
S:
- SVM (Support Vector Machine): Margin-maximizing algorithm for classification and regression.
- Standardization: Scales data to mean 0 and standard deviation 1 (z-scores); contrast with normalization above.
- Sampling: Selecting a subset of data points from a larger dataset.
T:
- t-SNE: Visualizes high-dimensional data in lower dimensions (sketch below).
- t-distribution: Used in hypothesis testing with small sample sizes.
- Type I/II Error: False positive/negative in hypothesis testing.
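A minimal t-SNE sketch embedding the iris features into two dimensions:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)
emb = TSNE(n_components=2, random_state=0).fit_transform(X)
print(emb.shape)  # (150, 2): ready for a scatter plot
```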
U:
- Underfitting: Model too simple to capture the data's patterns.
- UMAP (Uniform Manifold Approximation and Projection): Nonlinear dimensionality reduction, often used to visualize high-dimensional data.
- Uniform Distribution: All outcomes equally likely.
V:
- Variance: Spread of data points around the mean.
- Validation Curve: Shows model performance across hyperparameter values.
- Vanishing Gradient: Gradients become extremely small during deep-network training, stalling learning.
W:
- Word embedding: Represents words as dense vectors in NLP.
- Word cloud: Visualizes word frequency by size.
- Weights: Parameters learned during model training.
X:
- XGBoost: Gradient boosting library.
- XLNet: Language model for NLP.
Y:
- YOLO: Real-time object detection system.
- Yellowbrick: Python library for machine learning visualization.
Z:
- Z-score: Standardized value showing a data point's deviation from the mean.
- Z-test: Compares a sample mean to a population mean (sketch below).
- Zero-shot learning: Model recognizes new classes without prior examples.
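A z-score and its two-sided z-test p-value, with made-up numbers:

```python
from scipy.stats import norm

x, mu, sigma = 130.0, 100.0, 15.0
z = (x - mu) / sigma
print(z)                      # 2.0 standard deviations above the mean
print(2 * (1 - norm.cdf(z)))  # two-sided p-value, ~0.0455
```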
Thanks for sharing! Sachin M