Interview Tips for Data Analyst
1) What are the responsibilities of a data analyst?
The responsibilities of a data analyst include:
- Collecting and interpreting data from primary and secondary sources
- Filtering and cleaning data, and resolving data-quality problems
- Analyzing results using statistical techniques and reporting the findings to management
- Identifying trends and patterns in complex data sets
- Working with management to prioritize business and information needs
2) What is required to become a data analyst?
To become a data analyst, you need:
- Strong knowledge of reporting packages, databases (SQL), and programming languages such as Python or R
- The ability to collect, organize, analyze, and disseminate large data sets accurately
- Technical knowledge of database design, data models, data mining, and segmentation techniques
- Knowledge of statistical packages for analyzing large data sets (SAS, SPSS, Excel, etc.)
3) Mention the various steps in an analytics project.
Various steps in an analytics project include:
- Problem definition
- Data exploration
- Data preparation
- Modelling
- Validation of data
- Implementation and tracking
4) What is data cleansing?
Data cleaning, also referred to as data cleansing, deals with identifying and removing errors and inconsistencies from data in order to enhance the quality of data.
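As a small illustration, a few common cleansing steps (trimming whitespace, normalizing case, dropping empty and duplicate records) can be sketched in Python. The record fields ("name", "city") and the rules here are made-up examples, not a prescribed pipeline:

```python
def clean_records(records):
    """Trim whitespace, normalize case, and drop duplicate/empty records."""
    seen = set()
    cleaned = []
    for rec in records:
        name = rec.get("name", "").strip().title()
        city = rec.get("city", "").strip().title()
        if not name:            # drop records missing the key field
            continue
        key = (name, city)
        if key in seen:         # drop exact duplicates after normalization
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned

raw = [
    {"name": "  alice ", "city": "paris"},
    {"name": "Alice", "city": "Paris"},   # duplicate after normalization
    {"name": "", "city": "Rome"},         # missing name -> dropped
    {"name": "bob", "city": " london "},
]
print(clean_records(raw))
```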
5) List out some of the best practices for data cleaning.
Some of the best practices for data cleaning include:
- Sort data by different attributes before cleaning
- For large data sets, clean stepwise and improve the data with each step
- Break large data sets into smaller ones to increase iteration speed
- Create a set of utility functions or scripts to handle common cleaning tasks
- Keep track of every cleaning operation so changes can be audited or reverted
- Analyze summary statistics (mean, standard deviation, number of missing values) for each column
6) Define logistic regression.
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine a binary outcome; it models the probability of that outcome using the logistic (sigmoid) function.
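A minimal sketch of the idea: fitting a one-feature logistic model by gradient descent on the log-loss. The toy data, learning rate, and epoch count are illustrative choices, not a reference implementation:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1|x) = sigmoid(w*x + b) by stochastic gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # gradient of the log-loss for a single example
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# Toy data: the binary outcome flips from 0 to 1 as x grows
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0,   0,   0,   1,   1,   1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 0.0 + b), sigmoid(w * 5.0 + b))
```

After fitting, the predicted probability is low for small x and high for large x, which is the behavior the definition above describes.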
7) List some of the best tools that can be useful for data analysis.
Useful data-analysis tools include Tableau, RapidMiner, OpenRefine, KNIME, Solver, NodeXL, and Wolfram Alpha.
8) Mention what is the difference between data mining and data profiling?
The difference between data mining and data profiling is:
Data profiling: Focuses on instance-level analysis of individual attributes. It provides information on attributes such as value range, discrete values and their frequencies, occurrence of null values, data type, and length.
Data mining: Focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relations between several attributes, etc.
9) List out some common problems faced by data analysts.
Common problems faced by data analysts include:
- Common misspellings and duplicate entries
- Missing values
- Illegal values
- Varying value representations
- Identifying overlapping data
10) Mention the name of the framework developed by Apache for processing large data sets for an application in a distributed computing environment.
Hadoop and MapReduce are programming frameworks developed by Apache for processing large data sets for an application in a distributed computing environment.
11) What are the missing-data patterns that are generally observed?
Commonly observed missing-data patterns include:
- Missing completely at random
- Missing at random
- Missingness that depends on the missing value itself
- Missingness that depends on an unobserved input variable
12) Explain the KNN imputation method.
In KNN imputation, missing attribute values are imputed using the values from the k records most similar to the record with the missing value. The similarity of two records is determined by a distance function.
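A minimal sketch of KNN imputation, assuming numeric records stored as tuples with missing values marked as None (the data and k are illustrative):

```python
def knn_impute(rows, target_col, k=2):
    """Fill missing values (None) in target_col using the mean of that
    column among the k rows closest on the remaining columns."""
    complete = [r for r in rows if r[target_col] is not None]
    other = [i for i in range(len(rows[0])) if i != target_col]

    def dist(a, b):
        # Euclidean distance over the non-target columns
        return sum((a[i] - b[i]) ** 2 for i in other) ** 0.5

    filled = []
    for r in rows:
        if r[target_col] is None:
            neighbors = sorted(complete, key=lambda c: dist(r, c))[:k]
            value = sum(n[target_col] for n in neighbors) / k
            r = r[:target_col] + (value,) + r[target_col + 1:]
        filled.append(r)
    return filled

rows = [(1.0, 2.0, 10.0),
        (1.1, 2.1, 12.0),
        (8.0, 9.0, 50.0),
        (1.05, 2.05, None)]   # missing value; row resembles the first two
print(knn_impute(rows, target_col=2))
```

The last row's two nearest neighbors are the first two rows, so the missing value is imputed as their mean.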
13) What data validation methods are used by data analysts?
Methods commonly used by data analysts for data validation are:
- Data screening
- Data verification
14) What should be done with suspected or missing data?
- Prepare a validation report that lists the suspect data and the criteria it failed
- Have experienced personnel examine suspicious data and determine its acceptability
- Replace invalid data with a validation code and investigate it
- For missing data, use the best analysis strategy, such as deletion or imputation
15) How do you deal with multi-source problems?
To deal with multi-source problems:
- Restructure the schemas so they can be integrated into a single schema
- Identify similar records and merge them into a single record containing all relevant attributes without redundancy
16) What is an Outlier?
An outlier is a term commonly used by analysts for a value that appears far away from, and diverges from, the overall pattern in a sample. There are two types of outliers: univariate and multivariate.
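As a small illustration of flagging univariate outliers, here is the common 1.5×IQR rule in Python (the sample data is made up, and this is only one of several possible outlier criteria):

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

data = [10, 12, 11, 13, 12, 95]   # 95 diverges from the overall pattern
print(iqr_outliers(data))
```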
17) What is a hierarchical clustering algorithm?
A hierarchical clustering algorithm combines and divides existing groups, creating a hierarchical structure that shows the order in which groups are divided or merged.
18) What is the K-means algorithm?
K-means is a well-known partitioning method. Objects are classified as belonging to one of K groups, with K chosen a priori.
In the K-means algorithm:
- The clusters are spherical: the data points in a cluster are centered around that cluster's centroid
- The variance or spread of the clusters is similar: each data point belongs to the closest centroid
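A minimal one-dimensional sketch of k-means; the initialization strategy and data here are illustrative simplifications:

```python
def kmeans_1d(points, k, iters=20):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    pts = sorted(points)
    # naive init: spread the initial centroids across the data range
    centroids = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[j].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centroids, clusters = kmeans_1d(data, k=2)
print(sorted(centroids))
```

With this toy data the centroids converge near 1.0 and 10.0, one per natural group.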
19) What are the key skills required for a data analyst?
Key skills for a data analyst include:
- Database knowledge: SQL, data management, blending and querying data
- Data analytics: statistics, predictive analysis
- Communication and presentation of results
- Proficiency with tools such as Excel, Python or R, and visualization software
20) What is collaborative filtering?
Collaborative filtering is a simple algorithm for building a recommendation system from user behavioral data. Its most important components are users, items, and interests.
A good example of collaborative filtering is a "recommended for you" statement on online shopping sites that pops up based on your browsing history.
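A toy sketch of user-based collaborative filtering, assuming each user's interests are stored as a set of items; the users and items are made up:

```python
def recommend(ratings, user):
    """Find the user with the most overlapping liked items,
    then suggest that user's other liked items."""
    liked = ratings[user]
    best, best_overlap = None, -1
    for other, items in ratings.items():
        if other == user:
            continue
        overlap = len(liked & items)   # shared interests
        if overlap > best_overlap:
            best, best_overlap = other, overlap
    # recommend items the similar user liked but this user has not seen
    return sorted(ratings[best] - liked)

# Hypothetical browsing/purchase data: user -> set of items
ratings = {
    "alice": {"book", "laptop", "mouse"},
    "bob":   {"book", "laptop", "keyboard"},
    "carol": {"garden hose"},
}
print(recommend(ratings, "alice"))
```

Here "bob" shares the most interests with "alice", so his remaining item is recommended to her.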
21) What are the tools used in Big Data?
Tools used in Big Data include Hadoop, Hive, Pig, Flume, Mahout, and Sqoop.
22) What are KPIs, design of experiments, and the 80/20 rule?
KPI: Stands for Key Performance Indicator, a metric consisting of any combination of spreadsheets, reports, or charts about a business process.
Design of experiments: The initial process used to split your data, sample it, and set it up for statistical analysis.
80/20 rule: The observation that 80 percent of your income comes from 20 percent of your clients.
23) What is MapReduce?
MapReduce is a framework for processing large data sets: it splits them into subsets, processes each subset on a different server, and then blends the results obtained from each.
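The split/process/blend flow can be imitated in-process with the classic word-count example. This is a sketch of the programming model only, not Hadoop's actual API:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy in-process MapReduce: map each record to (key, value) pairs,
    group values by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word-count: map emits (word, 1), reduce sums the ones
lines = ["big data tools", "big data frameworks"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)
```

In a real cluster, the map and reduce calls run on different servers; the blending step corresponds to the shuffle that groups values by key.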
24) What is clustering? What are the properties of clustering algorithms?
Clustering is a classification method applied to data: a clustering algorithm divides a data set into natural groups, or clusters.
Properties of clustering algorithms include being:
- Hierarchical or flat
- Iterative
- Hard or soft (disjunctive)
25) What statistical methods are useful for data analysis?
Statistical methods useful for data analysts include:
- Bayesian methods
- Markov processes
- Spatial and cluster processes
- Rank statistics, percentiles, and outlier detection
- Imputation techniques
- The simplex algorithm
- Mathematical optimization
26) What is time series analysis?
Time series analysis can be done in two domains: the frequency domain and the time domain. In time series analysis, the output of a particular process can be forecast by analyzing previous data with methods such as exponential smoothing, the log-linear regression method, etc.
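As a small illustration, simple exponential smoothing can be written in a few lines; the series and smoothing factor here are illustrative:

```python
def exp_smooth(series, alpha=0.5):
    """Simple exponential smoothing: each smoothed value is a weighted
    average of the current observation and the previous smoothed value.
    The last smoothed value serves as the next-step forecast."""
    s = series[0]
    smoothed = [s]
    for x in series[1:]:
        s = alpha * x + (1 - alpha) * s
        smoothed.append(s)
    return smoothed

series = [10.0, 12.0, 11.0, 13.0]
print(exp_smooth(series, alpha=0.5))
```

A larger alpha weights recent observations more heavily; a smaller alpha smooths more aggressively.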
27) What is correlogram analysis?
A correlogram analysis is a common form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients calculated for different spatial relationships. It can be used to construct a correlogram for distance-based data when the raw data is expressed as distances rather than values at individual points.
28) What is a hash table?
In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array: it uses a hash function to compute an index into an array of slots, from which the desired value can be fetched.
29) What are hash table collisions? How are they avoided?
A hash table collision happens when two different keys hash to the same value. Two items cannot be stored in the same slot of the array.
Here are two techniques to avoid hash table collisions:
Separate chaining: uses a data structure (such as a linked list) to store multiple items that hash to the same slot.
Open addressing: searches for other slots using a second function and stores the item in the first empty slot found.
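A minimal sketch of the first technique, separate chaining, in Python; the tiny table size is chosen deliberately to force collisions:

```python
class ChainedHashTable:
    """Hash table resolving collisions by separate chaining:
    each slot holds a list of (key, value) pairs."""
    def __init__(self, size=8):
        self.slots = [[] for _ in range(size)]

    def _bucket(self, key):
        # hash function computes an index into the array of slots
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # update an existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # collision -> append to chain

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(size=2)   # with 3 keys and 2 slots, a collision is guaranteed
table.put("apple", 1)
table.put("banana", 2)
table.put("cherry", 3)
print(table.get("banana"))
```

Even when two keys land in the same slot, both survive because the slot's chain stores them side by side.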
30) Explain what imputation is. List different types of imputation techniques.
Imputation replaces missing data with substituted values. Imputation techniques include:
- Single imputation: hot-deck, cold-deck, mean, regression, and stochastic regression imputation
- Multiple imputation
31) Which imputation method is more favorable?
Although single imputation is widely used, it does not reflect the uncertainty created by data missing at random. So, multiple imputation is more favorable than single imputation when data are missing at random.
32) What is an n-gram?
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. It is the basis of a probabilistic language model that predicts the next item in such a sequence from the previous (n-1) items.
33) What are the criteria for a good data model?
Criteria for a good data model include:
- It can be easily consumed
- Large data changes are scalable
- It offers predictable performance
- It can adapt as requirements change