#### Data Science Interview Questions

Data Science Interview Questions Set 1

**#1 What do you understand by selection bias? What are its various types?**

Selection bias is typically associated with research that doesn’t have a random selection of participants. It is a type of error that occurs when a researcher decides who is going to be studied. On some occasions, selection bias is also referred to as the selection effect.

In other words, selection bias is a distortion of statistical analysis that results from the sample collecting method. When selection bias is not taken into account, some conclusions made by a research study might not be accurate. Following are the various types of selection bias:

- **Sampling Bias** – A systematic error resulting from a non-random sample of a population, causing some members to be less likely to be included than others, which yields a biased sample.
- **Time Interval** – A trial may be ended early at an extreme value, usually for ethical reasons, but the extreme value is most likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
- **Data** – Results from arbitrarily selecting specific data subsets to support a conclusion, or arbitrarily rejecting "bad" data.
- **Attrition** – Caused by loss of participants, i.e. discounting trial subjects or tests that did not run to completion.

**#2 Please explain the goal of A/B Testing.**

A/B Testing is a statistical hypothesis testing meant for a randomized experiment with two variables, A and B. The goal of A/B Testing is to maximize the likelihood of an outcome of some interest by identifying any changes to a webpage.

A highly reliable method for finding out the best online marketing and promotional strategies for a business, A/B Testing can be employed for testing everything, ranging from sales emails to search ads and website copy.
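As a concrete illustration, the outcome of an A/B test on conversion rates is often evaluated with a two-proportion z-test. The sketch below (the function name and the visitor counts are invented for illustration) uses only the standard library:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing the conversion rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided, via normal CDF
    return z, p_value

# Variant B converts 150/2400 visitors vs. 120/2400 for A
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=150, n_b=2400)
```

A small p-value suggests the observed difference between the two variants is unlikely to be due to chance alone.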

**#3 Between Python and R, which one would you pick for text analytics, and why?**

For text analytics, Python has the upper hand over R for these reasons:

- The Pandas library in Python offers easy-to-use data structures as well as high-performance data analysis tools
- Python generally performs faster on text-processing workloads
- R is better suited to statistical modeling and machine learning than to raw text analysis

**#4 Please explain the role of data cleaning in data analysis.**

Data cleaning can be a daunting task because, as the number of data sources increases, the time required to clean the data grows rapidly.

This is due to the vast volume of data generated by the additional sources. Data cleaning alone can take up to 80% of the total time required to carry out a data analysis task.

Nevertheless, there are several reasons for using data cleaning in data analysis. Two of the most important ones are:

- Cleaning data from different sources helps in transforming the data into a format that is easy to work with
- Data cleaning increases the accuracy of a machine learning model
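A minimal cleaning sketch with pandas, using a toy table with invented column names and values, might look like:

```python
import pandas as pd

# Hypothetical raw data exhibiting common quality problems
df = pd.DataFrame({
    "age":  [25, None, 31, 25, 200],          # a missing value and an impossible age
    "city": ["NY", "ny ", "LA", "NY", "SF"],  # inconsistent casing and whitespace
})

df["city"] = df["city"].str.strip().str.upper()        # normalize text to one format
df = df.drop_duplicates()                              # remove exact duplicate rows
df = df[df["age"].between(0, 120) | df["age"].isna()]  # drop physically impossible ages
df["age"] = df["age"].fillna(df["age"].median())       # impute missing values
```

Each step transforms the data toward a consistent, analysis-ready format, illustrating the two benefits above.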

**#5 What are the feature selection methods used to select the right variables?**

There are two main methods for feature selection:

**Filter Methods**

This involves:

- Linear discriminant analysis
- ANOVA
- Chi-Square

The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in.
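As a sketch of how a filter method scores each feature independently of any model, the chi-square statistic between a categorical feature and the label can be computed by hand (the helper name and toy data below are invented for illustration):

```python
def chi_square_score(feature, label):
    """Chi-square statistic between a categorical feature and a label;
    a higher score suggests a more informative feature (filter method)."""
    n = len(feature)
    score = 0.0
    for c in set(feature):
        for k in set(label):
            observed = sum(1 for f, y in zip(feature, label) if f == c and y == k)
            expected = feature.count(c) * label.count(k) / n
            score += (observed - expected) ** 2 / expected
    return score

# A feature aligned with the label scores higher than an uninformative one
informative = ["a", "a", "b", "b"]
useless     = ["a", "b", "a", "b"]
label       = [0, 0, 1, 1]
```

Features would then be ranked by score and the lowest-scoring ones filtered out before any model is trained.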

**Wrapper Methods**

This involves:

- **Forward Selection**: We test one feature at a time and keep adding features until we get a good fit
- **Backward Selection**: We start with all the features and remove them one at a time to see what works better
- **Recursive Feature Elimination**: Recursively examines all the different features and how they work together

Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
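Forward selection, the first wrapper method above, can be sketched in a few lines; the scoring function here is a stand-in for training and evaluating a real model on each candidate subset (the names and toy scores are invented):

```python
def forward_selection(features, score_fn):
    """Greedily add the feature that most improves the score
    until no remaining feature helps (a wrapper method)."""
    selected, remaining = [], list(features)
    best_score = score_fn(selected)
    while remaining:
        scores = {f: score_fn(selected + [f]) for f in remaining}
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] <= best_score:
            break                      # no candidate improves the model
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = scores[best_feature]
    return selected

# Toy stand-in for model quality: only "x1" and "x3" carry signal
def toy_score(subset):
    return ("x1" in subset) * 2 + ("x3" in subset) * 1

chosen = forward_selection(["x1", "x2", "x3"], toy_score)  # ["x1", "x3"]
```

Because every candidate subset means training and evaluating the model again, this illustrates exactly why wrapper methods are so labor-intensive.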

**For the given points, how will you calculate the Euclidean distance in Python?**

The Euclidean distance can be calculated as follows:

```python
from math import sqrt

plot1 = [1, 3]
plot2 = [2, 5]

euclidean_distance = sqrt((plot1[0] - plot2[0])**2 + (plot1[1] - plot2[1])**2)

```

**#6 What is dimensionality reduction, and what are its benefits?**

Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) to convey similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces computation time, as fewer dimensions lead to less computing. It removes redundant features; for example, there is no point in storing a value in two different units (meters and inches).
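Principal component analysis (PCA) is a standard way to perform this reduction. Here is a minimal NumPy sketch, with invented toy points that lie close to a single line in 3-D:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)     # covariance of the features
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest-variance directions first
    return X_centered @ eigvecs[:, order[:n_components]]

# Four 3-D points that nearly lie on one line: one component keeps most information
X = np.array([[1.0, 2.0, 3.0], [2.0, 4.1, 6.0], [3.0, 5.9, 9.1], [4.0, 8.0, 12.0]])
X_reduced = pca_reduce(X, n_components=1)
```

The three original fields collapse into one field while preserving most of the variation in the data.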

**#7 How can you select k for k-means?**

We use the elbow method to select k for k-means clustering. The idea is to run k-means on the data set for a range of values of k and, for each k, compute the within-cluster sum of squares (WSS), defined as the sum of the squared distances between each member of a cluster and its centroid.

As k increases, WSS keeps falling; the value of k at the "elbow", where the rate of decrease sharply levels off, is a good choice.
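The WSS quantity can be computed directly. In this hand-made 1-D example (toy numbers invented for illustration), the large drop from k=1 to k=2, followed by only small further gains, is the "elbow":

```python
def wss(points, centroids, assignment):
    """Within-cluster sum of squares: squared distance from each point
    to the centroid of the cluster it is assigned to."""
    return sum((p - centroids[c]) ** 2 for p, c in zip(points, assignment))

# 1-D toy data with two visible groups
points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]

# k = 1: a single centroid at the overall mean
mean = sum(points) / len(points)
wss_k1 = wss(points, [mean], [0] * len(points))

# k = 2: one centroid per visible group
wss_k2 = wss(points, [1.0, 9.0], [0, 0, 0, 1, 1, 1])
```

In practice a library such as scikit-learn would compute the cluster assignments; this sketch only illustrates the quantity being plotted against k.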

**#8 What is the significance of p-value?**

The p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.

**p-value ≤ 0.05**

This indicates strong evidence against the null hypothesis, so you reject the null hypothesis.

**p-value > 0.05**

This indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.

**p-value at the 0.05 cutoff**

This is considered marginal, meaning it could go either way.
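As a concrete sketch, an exact two-sided p-value for a coin-fairness test can be computed from the binomial distribution with only the standard library (the function name and flip counts are invented for illustration):

```python
from math import comb

def binomial_two_sided_p(heads, n, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes at least
    as unlikely as the observed one, under the null hypothesis of a fair coin."""
    prob = lambda k: comb(n, k) * p**k * (1 - p)**(n - k)
    observed = prob(heads)
    return sum(prob(k) for k in range(n + 1) if prob(k) <= observed)

p_fair   = binomial_two_sided_p(12, 20)  # 12 heads in 20 flips: unremarkable
p_biased = binomial_two_sided_p(18, 20)  # 18 heads in 20 flips: strong evidence
```

Here `p_biased` falls below 0.05, so the null hypothesis of a fair coin would be rejected, while `p_fair` does not.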

**#9 What do you understand by Deep Learning?**

Deep Learning is a paradigm of machine learning that displays a great degree of analogy with the functioning of the human brain. It is based on artificial neural networks with many layers; convolutional neural networks (CNNs) are one widely used deep architecture.

Deep learning has a wide array of uses, ranging from social network filtering to medical image analysis and speech recognition. Although Deep Learning has been present for a long time, it’s only recently that it has gained worldwide acclaim. This is mainly due to:

- An increase in the amount of data generation via various sources
- The growth in hardware resources required for running Deep Learning models

Caffe, Chainer, Keras, Microsoft Cognitive Toolkit, PyTorch, and TensorFlow are some of the most popular Deep Learning frameworks today.

**#10 Please explain Gradient Descent.**

The degree of change in the output of a function in response to changes in its inputs is known as a gradient. In a neural network, it measures the change in the error with respect to a change in the weights. A gradient can also be understood as the slope of a function.

Gradient Descent can be pictured as descending to the bottom of a valley, as opposed to climbing up a hill. It is a minimization algorithm used to minimize a given cost function.
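The idea can be sketched in a few lines: starting from an initial guess, repeatedly step against the gradient until the minimum is reached (the toy function below is invented for illustration):

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to walk downhill toward a minimum."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is f'(x) = 2(x - 3); the minimum is at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Each step moves a fraction (the learning rate) of the way down the slope, so the iterates converge toward x = 3.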

**#11 How does Backpropagation work? Also, state its various variants.**

Backpropagation refers to a training algorithm used for multilayer neural networks. Following the backpropagation algorithm, the error is moved from an end of the network to all weights inside the network. Doing so allows for efficient computation of the gradient.

Backpropagation works in the following way:

- Forward propagation of the training data through the network
- The output and the target are used to compute the error and its derivatives
- Backpropagate to compute the derivative of the error with respect to the output activation of each layer
- Use the previously calculated derivatives to compute the gradient of the error with respect to each weight
- Update the weights
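The steps above can be sketched end to end for a tiny two-layer network in NumPy; the toy data, layer sizes, and learning rate are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                 # toy training inputs
y = (X[:, :1] + X[:, 1:]) * 0.5             # toy target: mean of the two inputs

W1 = rng.normal(scale=0.5, size=(2, 4))     # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1))     # hidden -> output weights
lr = 0.1

losses = []
for _ in range(200):
    # 1. forward propagation of the training data
    h = np.tanh(X @ W1)
    out = h @ W2
    # 2. output and target give the error
    err = out - y
    losses.append(float((err ** 2).mean()))
    # 3-4. backpropagate: chain rule from the output activation back to each weight
    d_out = 2 * err / len(X)                # dLoss/d_out
    dW2 = h.T @ d_out                       # dLoss/dW2
    d_h = (d_out @ W2.T) * (1 - h ** 2)     # gradient through the tanh activation
    dW1 = X.T @ d_h                         # dLoss/dW1
    # 5. update the weights
    W2 -= lr * dW2
    W1 -= lr * dW1
```

The mean squared error recorded in `losses` falls as the weight updates accumulate, which is the whole point of the procedure.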

Following are the various variants of Backpropagation:

- **Batch Gradient Descent** – The gradient is calculated over the complete dataset, and an update is performed on each iteration
- **Stochastic Gradient Descent** – Only a single training example is used to calculate the gradient and update the parameters
- **Mini-batch Gradient Descent** – Small batches of samples are used to calculate the gradient and update the parameters (a compromise between the two approaches above)

**#12 What do you know about Autoencoders?**

Autoencoders are simple learning networks that transform inputs into outputs with the minimum possible error, meaning the outputs produced are very close to the inputs.

A few layers are added between the input and the output, each smaller than the input layer. An autoencoder receives unlabeled input, which it encodes and then decodes to reconstruct the output.

**#13 What is the full form of GAN? Explain GAN?**

The full form of GAN is Generative Adversarial Network. A GAN feeds a noise vector to its Generator, which produces fake samples; the Discriminator then tries to distinguish these fake samples from unique (real) ones.

**#14 What are the vital components of GAN?**

There are two vital components of GAN. These include the following:

- **Generator**: The Generator acts as a forger, creating fake copies
- **Discriminator**: The Discriminator acts as a recognizer, distinguishing fake copies from unique (real) ones

**#15 What is a Computational Graph?**

A computational graph is a graphical representation of a computation and is the core abstraction in TensorFlow. It is a network of nodes, where each node represents a particular mathematical operation and the edges between nodes carry tensors. Because data flows through the graph, it is also called a dataflow graph.

**#16 What are tensors?**

Tensors are mathematical objects that generalize scalars, vectors, and matrices to higher dimensions; a tensor's rank is its number of dimensions. They are the form in which data inputs are fed to a neural network.

**#17 Why is TensorFlow considered a high priority in learning Data Science?**

TensorFlow is considered a high priority in learning Data Science because it supports programming languages such as C++ and Python, allowing data science workloads to compile and complete faster than with conventional libraries such as Keras and Torch. TensorFlow also supports both CPU and GPU computing devices for faster input, editing, and analysis of data.

**#18 What is Dropout in Data Science?**

Dropout is a technique in Data Science used to drop hidden and visible units of a network at random during training. It helps prevent overfitting by dropping a fraction of the nodes (often around 20%), which forces the network not to rely on any single unit.
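The mechanism can be sketched as "inverted dropout" in NumPy: randomly zero a fraction of the units and rescale the survivors so the expected activation is unchanged (the function name below is invented):

```python
import numpy as np

def dropout(activations, rate=0.2, rng=None):
    """Zero out roughly `rate` of the units at random and rescale the rest
    ("inverted dropout"), keeping the expected activation unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) >= rate   # True for units that survive
    return activations * mask / (1 - rate)

a = np.ones(1000)
dropped = dropout(a, rate=0.2, rng=np.random.default_rng(0))
```

At test time dropout is switched off; the rescaling during training is what makes that switch seamless.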

**#19 What is Batch normalization in Data Science?**

Batch Normalization in Data Science is a technique for improving the performance and stability of a neural network. It works by normalizing the inputs to each layer so that the mean output activation is 0 and the standard deviation is 1.
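The per-layer normalization step can be sketched in NumPy; `gamma` and `beta` stand in for the learnable scale and shift parameters, and the toy batch is invented:

```python
import numpy as np

def batch_normalize(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature across the batch to mean 0 and std 1,
    then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A toy batch whose two features live on very different scales
batch = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 70.0]])
normalized = batch_normalize(batch)
```

After normalization, both features are on the same scale, which keeps the downstream layers' inputs well behaved during training.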

**#20 How can outlier values be treated?**

You can drop an outlier only if it is a garbage value.

Example: height of an adult = "ABC ft". This cannot be true, as height cannot be a string value. In this case, the outlier can be removed.

Outliers with extreme values can also be removed. For example, if all the data points are clustered between zero and 10, but one point lies at 100, then we can remove this point.

If you cannot drop outliers, you can try the following:

- Try a different model. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model.
- Try normalizing the data. This way, the extreme data points are pulled to a similar range.
- You can use algorithms that are less affected by outliers; an example would be random forests.
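A common, simple rule for flagging extreme values is the interquartile-range (IQR) rule. The sketch below uses a crude index-based quartile approximation for brevity, and the toy points are invented:

```python
def iqr_outliers(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
    (using a simple index-based quartile approximation)."""
    data = sorted(values)
    n = len(data)
    q1, q3 = data[n // 4], data[(3 * n) // 4]
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

points = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = iqr_outliers(points)  # [100]
```

Flagged points should then be inspected, not blindly deleted, in line with the advice above.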
