Terms In Data Science (A-Z)

A:

• Accuracy: Correct predictions divided by total predictions (see the sketch after this list).

• AUC (Area Under the Curve): Area under the ROC curve; summarizes classifier performance across all thresholds.

• ARIMA: AutoRegressive Integrated Moving Average, a classical time series forecasting method.

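A minimal NumPy sketch of the accuracy formula; the label arrays are made-up example data:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])  # actual labels (made-up example)
y_pred = np.array([1, 0, 0, 1, 0, 1])  # model predictions (made-up example)

# Accuracy = correct predictions / total predictions
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 5 of 6 correct -> 0.8333...
```
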
B:

• Bias: Systematic difference between a model's average prediction and the true value.

• Bayes' Theorem: Updates the probability of an event based on prior knowledge: P(A|B) = P(B|A) P(A) / P(B) (worked example after this list).

• Binomial Distribution: Models the number of successes in a fixed number of independent trials.

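A worked Bayes' theorem example in plain Python; the prevalence and test rates below are invented for illustration:

```python
# Hypothetical diagnostic test: P(disease) = 0.01,
# sensitivity P(+|disease) = 0.95, false-positive rate P(+|healthy) = 0.05.
p_d, p_pos_given_d, p_pos_given_h = 0.01, 0.95, 0.05

# Law of total probability: overall chance of a positive test
p_pos = p_pos_given_d * p_d + p_pos_given_h * (1 - p_d)

# Bayes' theorem: P(disease | positive) = P(+|disease) P(disease) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # 0.161: still unlikely despite a positive test
```
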
C:

• Clustering: Grouping data points based on similarity.

• Confusion Matrix: Table of true vs. predicted labels used to evaluate classification performance.

• Cross-validation: Assesses model performance by repeatedly training and testing on different data subsets (see the sketch after this list).

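A minimal 5-fold cross-validation sketch, assuming scikit-learn is available; the iris dataset is just a convenient built-in:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Train on 4 folds, test on the held-out fold, repeat 5 times
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```
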
D:

• Decision Trees: Tree-like model that splits data on feature thresholds for classification and regression (sketched below).

• Dimensionality Reduction: Reducing the number of dataset features while preserving the important information.

• Discriminative Models: Learn the boundaries between classes rather than how the data was generated.

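A small decision tree sketch with scikit-learn (assumed available):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Capping the depth keeps the tree interpretable and curbs overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict(X[:5]))
```
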
E:

• Ensemble Learning: Combines multiple models for better performance than any single model.

• EDA (Exploratory Data Analysis): Analyzing and visualizing data to surface patterns before modeling.

• Entropy: Measure of randomness in information, H = -Σ p_i log2(p_i) (see the sketch after this list).

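A sketch of Shannon entropy with NumPy; the label lists are toy examples:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits: H = -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 1, 1]))  # 1.0 bit: a 50/50 split is maximally uncertain
print(entropy([0, 1, 1, 1]))  # ~0.811 bits: a lopsided split is more predictable
```
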
F:

• Feature Engineering: Creating new features to improve model performance.

• F-score: Harmonic mean of precision and recall; balances the two in binary classification (worked example after this list).

• Feature Extraction: Automatically deriving meaningful features from raw data.

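A worked F1 example; the confusion-matrix counts are invented for illustration:

```python
# F1 = 2 * precision * recall / (precision + recall)
tp, fp, fn = 40, 10, 20  # made-up counts

precision = tp / (tp + fp)  # 0.8
recall = tp / (tp + fn)     # 0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))         # 0.727
```
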
G:

• Gradient Descent: Optimization algorithm that minimizes a function by iteratively adjusting parameters in the direction of steepest descent (sketched below).

• Gaussian Distribution: Normal distribution with a bell-shaped curve.

• Gradient Boosting: Sequentially builds weak learners, each correcting the errors of the ones before it.

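A minimal gradient descent sketch on a one-dimensional quadratic; the starting point and learning rate are arbitrary choices:

```python
# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x = 0.0   # arbitrary starting point
lr = 0.1  # learning rate

for _ in range(100):
    grad = 2 * (x - 3)
    x -= lr * grad  # step against the gradient

print(round(x, 4))  # converges toward the minimum at x = 3
```
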
H:

• Hypothesis: Testable statement evaluated through statistical inference.

• Hierarchical Clustering: Organizes data into a tree-like structure (dendrogram) of nested clusters.

• Heteroscedasticity: Unequal variance of the errors across the range of a regression model.

I:

• Information Gain: Reduction in entropy from splitting on a feature; used to rank splits in decision trees (see the sketch after this list).

• Independent Variable: Variable manipulated to observe its effect on the dependent variable.

• Imbalance: Unequal distribution of classes in a dataset.

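A sketch of information gain, reusing the entropy helper from the E section (redefined here so the block runs standalone); the split below is a toy example:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

# A split that perfectly separates the classes earns the maximum gain (1 bit)
print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))
```
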
J:

• Jupyter: Interactive computing environment widely used for data analysis.

• Joint Probability: Probability of two or more events occurring together.

• Jaccard Index: Measures similarity between two sets as intersection size over union size (sketched below).

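A minimal Jaccard index sketch in plain Python:

```python
def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared of 4 total -> 0.5
```
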
K:

• Kernel Density Estimation: Estimates the probability density of a continuous variable from samples.

• KS Test (Kolmogorov-Smirnov): Compares two probability distributions.

• K-Means Clustering: Partitions data into K clusters by assigning each point to its nearest centroid (see the sketch after this list).

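A K-Means sketch with scikit-learn (assumed available); the two Gaussian blobs are synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # blob near (0, 0)
               rng.normal(5, 1, (50, 2))])  # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centroid per blob
```
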
L:

• Likelihood: Probability of the observed data under a given set of model parameters.

• Linear Regression: Models a linear relationship between a dependent variable and one or more independent variables (sketched below).

• L1/L2 Regularization: Prevents overfitting by penalizing coefficient size (absolute values for L1, squares for L2).

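A linear regression sketch on synthetic data generated as y = 2x + 1 plus noise; scikit-learn is assumed available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # recovers roughly [2.0] and 1.0
```
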
M:

• Maximum Likelihood Estimation: Estimates model parameters by maximizing the likelihood of the observed data (see the sketch after this list).

• Multicollinearity: High correlation between independent variables in a regression.

• Mutual Information: Amount of information shared between two variables.

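A sketch of maximum likelihood estimation for the mean of Gaussian data with known spread, assuming SciPy is available; it recovers the sample mean, as theory predicts:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(4.0, 1.0, 500)  # synthetic data, true mean 4.0

# Minimize the negative log-likelihood over the mean (sigma fixed at 1.0)
neg_log_lik = lambda mu: -norm.logpdf(data, loc=mu, scale=1.0).sum()
mle = minimize_scalar(neg_log_lik).x
print(mle, data.mean())  # both close to 4.0
```
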
N:

• Naive Bayes: Probabilistic classifier that assumes features are independent given the class.

• Normalization: Rescales data into a fixed range, typically [0, 1]; not to be confused with standardisation (see S and the sketch after this list).

• Null Hypothesis: The assumption of no significant difference or effect in statistical testing.

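A side-by-side sketch of min-max normalization and standardisation with NumPy; the values are a toy example:

```python
import numpy as np

x = np.array([3.0, 7.0, 10.0, 15.0])  # toy values

# Min-max normalization: rescale into [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

# Standardisation (see S): rescale to mean 0, std-dev 1
standardized = (x - x.mean()) / x.std()

print(normalized)    # [0.    0.333 0.583 1.   ]
print(standardized)
```
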
O:

• Overfitting: Model performs well on training data but poorly on new data.

• Outliers: Data points significantly different from the rest.

• One-hot Encoding: Converts categorical variables into binary indicator vectors (sketched below).

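A one-hot encoding sketch with pandas (assumed available):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
# Each category becomes its own 0/1 indicator column
print(pd.get_dummies(df, columns=["color"], dtype=int))
```
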
P:

• PCA (Principal Component Analysis): Reduces dimensionality by projecting data onto the orthogonal components that capture the most variance (see the sketch after this list).

• Precision: Proportion of positive predictions that are actually positive, TP / (TP + FP).

• p-value: Probability of a result at least as extreme as the one observed, assuming the null hypothesis is true.

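A PCA sketch with scikit-learn (assumed available), projecting the 4-feature iris data onto two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)             # keep the top two components
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured per component
```
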
Q:

• QQ-plot: Graphically compares two distributions by plotting their quantiles against each other.

• QR Decomposition: Factorizes a matrix into an orthogonal matrix Q and an upper triangular matrix R (sketched below).

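A QR decomposition sketch with NumPy; the matrix is a toy example:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]])
Q, R = np.linalg.qr(A)
print(np.allclose(Q @ R, A))            # True: A = QR
print(np.allclose(Q.T @ Q, np.eye(2)))  # True: Q's columns are orthonormal
```
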
R:

• Random Forest: Ensemble method that combines many decision trees trained on random subsets of the data and features.

• Recall: Proportion of actual positives correctly identified, TP / (TP + FN).

• ROC Curve: Plots the true positive rate against the false positive rate at different classification thresholds.

S:

• SVM (Support Vector Machine): Algorithm that finds a maximum-margin boundary for classification and regression.

• Standardisation: Scales data to mean 0 and std-dev 1 (z-scores); contrast with normalization under N.

• Sampling: Selecting a subset of data points from a larger dataset.

T:

• t-SNE: Non-linear technique for visualizing high-dimensional data in two or three dimensions.

• t-distribution: Used in hypothesis testing with small sample sizes.

• Type I/II Error: False positive / false negative in hypothesis testing.

U:

• Underfitting: Model too simple to capture the patterns in the data.

• UMAP: Non-linear technique for visualizing high-dimensional data, similar in spirit to t-SNE.

• Uniform Distribution: All outcomes are equally likely.

V:

• Variance: Spread of data points around the mean.

• Validation Curve: Shows model performance across a range of hyperparameter values.

• Vanishing Gradient: Gradients shrink toward zero during deep network training, stalling learning in early layers.

W:

• Word Embedding: Represents words as dense vectors in NLP.

• Word Cloud: Visualizes word frequency through font size.

• Weights: Parameters learned during model training.

X:

• XGBoost: Optimized gradient boosting library.

• XLNet: Transformer-based language model for NLP.

Y:

• YOLO (You Only Look Once): Real-time object detection system.

• Yellowbrick: Python library for machine learning visualization.

Z:

• Z-score: Standardized value showing how many standard deviations a data point lies from the mean.

• Z-test: Compares a sample mean to a population mean when the population variance is known.

• Zero-shot learning: Model recognizes new classes without having seen examples of them.

Thanks for Sharing! Sachin M
