Data science is an interdisciplinary field that mines raw data, analyzes it, and uncovers patterns that can be used to extract valuable insights. Its core foundations are statistics, computer science, machine learning, deep learning, data analysis, data visualization, and various other technologies.
Because of the importance of data, data science has grown in popularity over the years. Data is often called the new oil of the future, which, when correctly examined and used, can be extremely valuable to stakeholders. Beyond that, a data scientist gets to work in a variety of fields, solving real-world practical challenges with cutting-edge technologies. One common real-world application is food delivery in apps like Uber Eats, which assist the delivery worker by showing the fastest feasible route from the restaurant to the destination.
Data science is also behind the item recommendation systems on e-commerce sites such as Amazon, Flipkart, and others, which suggest what items a customer might buy based on their search history. Beyond recommendation systems, data science is increasingly used in fraud detection to spot fraud in credit-based financial applications. A skilled data scientist can interpret data, innovate, and be creative while solving problems that support business and strategic objectives. As a result, it is one of the most lucrative careers of the twenty-first century.
In this post, we will look at the most frequently asked data science technical interview questions, which will be useful for both aspiring and seasoned data scientists.
Data Science Interview Questions for New Graduates
1. What exactly is meant by the term "Data Science"?
Data Science is an interdisciplinary field comprising numerous scientific procedures, algorithms, tools, and machine learning approaches that aim to uncover common patterns and extract meaningful insights from raw input data using statistical and mathematical analysis.
- It starts with obtaining the business needs and related data.
- After acquiring data, it is maintained through data cleansing, data warehousing, data staging, and data architecture.
- Data processing is the work of examining, mining, and analyzing data in order to provide a summary of the insights collected from the data.
- Following the completion of the exploratory steps, the cleansed data is passed to various algorithms such as predictive analysis, regression, text mining, pattern recognition, and so on, depending on the requirements.
- In the final stage, the outcomes are communicated to the business in a visually appealing form. This is where data visualization, reporting, and various business intelligence tools come into play.
2. What exactly is the distinction between data analytics and data science?
Data science is the task of transforming data using various technical analysis methods in order to derive useful insights that a data analyst can apply to their business scenarios.
Data analytics is concerned with testing existing hypotheses and facts and answering questions in order to make better and more effective business decisions.
Data science drives innovation by answering questions that build new connections and solve problems of the future. Data analytics focuses on extracting present meaning from existing historical context, whereas data science focuses on predictive modelling.
Data science is a broad subject that uses diverse mathematical and scientific tools and methods to solve complex problems, whereas data analytics is a narrower profession that deals with specific, focused problems using fewer statistical and visualization techniques.
3. What are some of the sampling techniques? What is the primary benefit of sampling?
Data analysis cannot be performed on an entire volume of data at once, especially when it involves very large datasets. It becomes critical to take data samples that can represent the whole population and then analyze them. While doing so, it is essential to carefully select samples out of the massive dataset that truly represent the complete dataset.
Based on the use of statistics, there are primarily two types of sampling techniques:
- Probability sampling techniques: simple random sampling, stratified sampling, and clustered sampling.
- Non-probability sampling techniques: quota sampling, convenience sampling, snowball sampling, and others.
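As a quick illustration, here is a minimal sketch of two probability sampling approaches (simple random and stratified) using only Python's standard library; the population, group labels, and 10% sampling fraction are made up for the example:

```python
import random
from collections import defaultdict

# Hypothetical population of 300 people split into two groups.
population = [{"id": i, "group": "A" if i % 3 else "B"} for i in range(300)]

random.seed(42)

# Simple random sampling: every member has an equal chance of selection.
simple_sample = random.sample(population, k=30)

# Stratified sampling: sample the same fraction from each subgroup
# (stratum), so every stratum is represented proportionally.
strata = defaultdict(list)
for person in population:
    strata[person["group"]].append(person)

stratified_sample = []
for members in strata.values():
    k = round(len(members) * 0.10)  # keep ~10% of each stratum
    stratified_sample.extend(random.sample(members, k=k))

print(len(simple_sample), len(stratified_sample))  # 30 30
```

Stratified sampling guarantees that even the smaller group contributes its proportional share of the sample, which simple random sampling cannot promise.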
4. Make a list of the conditions that cause overfitting and underfitting.
Overfitting occurs when a model performs well on the training data but fails to generalize: when new data is fed into the model, its predictions are poor. This situation arises when the model has low bias and high variance. Decision trees are particularly prone to overfitting.
Underfitting occurs when the model is so simplistic that it cannot capture the true relationship in the data and hence performs poorly even on the training data. This happens as a result of high bias and low variance. Linear regression is more prone to underfitting.
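To make the contrast concrete, here is a small illustrative sketch using NumPy: a straight line (high bias) underfits noisy sine data, while a high-degree polynomial (high variance) overfits it. The data, random seed, and polynomial degrees are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a sine curve: 12 training points, a clean test grid.
x_train = np.linspace(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=12)
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)

def mse(degree, x, y):
    """Fit a polynomial of the given degree to the training set and
    return its mean squared error on (x, y)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Underfitting: a straight line (high bias) misses the sine shape.
train_under, test_under = mse(1, x_train, y_train), mse(1, x_test, y_test)
# Overfitting: a degree-9 polynomial (high variance) memorises the noise.
train_over, test_over = mse(9, x_train, y_train), mse(9, x_test, y_test)

print(f"underfit: train={train_under:.3f} test={test_under:.3f}")
print(f"overfit:  train={train_over:.3f} test={test_over:.3f}")
```

The overfit model's training error is far lower than the underfit model's, yet its test error is much larger than its own training error, which is the telltale gap between the two regimes.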
5. Distinguish between long and wide format data.
Data in Long Formats
- Each row of the data holds one point-in-time observation for a subject, so each subject's data occupies multiple rows.
- The data can be recognized by treating rows as groups.
- This format is most commonly used in R analyses and in log files, where a row is appended at the end of each experiment.
Wide Formats Data
- The repeated responses of a subject are placed in separate columns.
- The data can be recognized by treating columns as groups.
- This format is rarely used in R analyses, but it is extensively used in statistical packages for repeated-measures ANOVAs.
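The two formats are easy to convert between. The sketch below uses pandas (`melt` for wide-to-long, `pivot` for long-to-wide) on a made-up two-subject dataset:

```python
import pandas as pd

# Wide format: one row per subject, repeated measurements as columns.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "week1": [7.1, 6.4],
    "week2": [7.5, 6.9],
})

# Wide -> long: each measurement now gets its own row.
long = wide.melt(id_vars="subject", var_name="week", value_name="score")

# Long -> wide: pivot the measurement labels back into columns.
wide_again = long.pivot(index="subject", columns="week",
                        values="score").reset_index()

print(long.shape, wide_again.shape)  # (4, 3) (2, 3)
```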
6. What is the difference between Eigenvectors and Eigenvalues?
Eigenvectors are column vectors (unit vectors) whose length or magnitude equals 1. They are also known as right vectors. Eigenvalues are the coefficients applied to eigenvectors that give these vectors their variable length or magnitude.
Eigen decomposition is the process of breaking down a matrix into Eigenvectors and Eigenvalues. These are then employed in machine learning approaches such as PCA (Principal Component Analysis) to extract useful insights from the given matrix.
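For instance, the eigen decomposition of a small matrix can be computed with NumPy; the matrix below is an arbitrary symmetric example of the kind PCA would decompose:

```python
import numpy as np

# An arbitrary symmetric matrix, like a covariance matrix in PCA.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column v of `eigenvectors` is a unit vector with A @ v = lam * v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
    assert np.isclose(np.linalg.norm(v), 1.0)

print(np.sort(eigenvalues))  # [1. 3.]
```

In PCA, the eigenvectors of the covariance matrix give the principal directions and the eigenvalues give the variance explained along each of them.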
7. What does it signify when the p-values are high and low?
A p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. It indicates how likely it is that the observed discrepancy arose purely by chance.
- A low p-value, i.e. a value less than 0.05, indicates that the null hypothesis can be rejected: the observed data would be unlikely if the null hypothesis were true.
- A high p-value, i.e. a value of 0.05 or more, suggests that the evidence is consistent with the null hypothesis: data like the observed data are likely under the null.
- A p-value of exactly 0.05 is marginal, and the hypothesis is open to interpretation either way.
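As an illustration of what a p-value measures, here is a small permutation-test sketch in NumPy; the group sizes, means, and permutation count are arbitrary demo choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two synthetic samples; group_b's true mean is shifted by 0.8,
# so we expect a very small p-value.
group_a = rng.normal(0.0, 1.0, 100)
group_b = rng.normal(0.8, 1.0, 100)
observed = abs(group_b.mean() - group_a.mean())

# Permutation test: under the null hypothesis the group labels are
# interchangeable, so shuffle them repeatedly and count how often a
# difference at least as extreme as the observed one arises by chance.
pooled = np.concatenate([group_a, group_b])
n_perm = 2000
extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    if abs(pooled[100:].mean() - pooled[:100].mean()) >= observed:
        extreme += 1

p_value = extreme / n_perm
print(p_value)  # near 0: reject the null at the 0.05 level
```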
8. When is resampling performed?
Resampling is a sampling technique used to improve accuracy and quantify the uncertainty of population parameters. It is done to ensure that the model is good enough by training it on different patterns in a dataset, so that variations are handled. It is also done when models need to be validated using random subsets, or when labels are substituted on data points while performing significance tests.
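One common resampling method is the bootstrap. The sketch below uses synthetic data and an arbitrary seed; it resamples with replacement to estimate a confidence interval for the sample mean:

```python
import numpy as np

rng = np.random.default_rng(7)

# A synthetic skewed sample (the true mean of this exponential is 2.0).
sample = rng.exponential(scale=2.0, size=200)

# Bootstrap resampling: draw with replacement from the sample itself to
# quantify the uncertainty of the mean without collecting new data.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean={sample.mean():.2f}, "
      f"95% CI=({ci_low:.2f}, {ci_high:.2f})")
```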
9. What exactly do you mean by Imbalanced Data?
When data is distributed unequally across categories, it is said to be highly imbalanced. Such datasets lead to poor model performance and inaccurate results.
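A minimal sketch of one simple remedy, random oversampling of the minority class, using only the standard library (the 95/5 class split is a made-up example):

```python
import random
from collections import Counter

random.seed(0)

# A made-up, highly imbalanced label set: 95% negatives, 5% positives.
labels = [0] * 950 + [1] * 50

# Random oversampling: duplicate minority-class examples until the two
# classes are balanced (undersampling the majority class, or using
# class weights in the model, are common alternatives).
minority = [y for y in labels if y == 1]
balanced = labels + random.choices(minority, k=900)

print(Counter(labels), Counter(balanced))
```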
10. Are there any disparities in the expected and mean values?
There are not many differences between the two, but it is worth noting that they are used in different contexts. The mean value generally refers to the probability distribution of observed data, while the expected value is used in contexts involving random variables.
11. How do you define Survivorship Bias?
This bias refers to the logical fallacy of focusing on the items that survived some process while overlooking those that did not because of their lack of visibility. This bias can lead to incorrect conclusions.
12. Define the terms key performance indicators (KPI), lift, model fitting, robustness, and DOE.
- KPI: KPI stands for Key Performance Indicator, and it monitors how successfully a company fulfils its goals.
- Lift is a measure of the target model’s performance when compared to a random choice model. Lift represents how well the model predicts compared to the absence of a model.
- Model fitting: How well the model under examination fits the provided observations.
- Robustness: This shows the system’s ability to properly handle differences and variances.
- DOE: an abbreviation for Design of Experiments, the design of a task that aims to describe and explain the variation of information under conditions hypothesized to reflect the variables of interest.
13. Identify and define confounding variables.
Confounders are another term for confounding variables. These variables are a form of extraneous variable that influences both the independent and dependent variables, resulting in spurious associations and mathematical correlations between variables that are correlated but not causally related to one another.
14. What is the definition and explanation of selection bias?
When a researcher must choose which person to study, he or she is subject to selection bias. Selection bias is connected with studies in which the participant selection is not random. The selection effect is another name for selection bias. The manner of sample collection contributes to the selection bias.
The following are four types of selection bias:
- Sampling bias: because of a non-random sample of a population, some members have a lower chance of being included than others, resulting in a biased sample. This causes a systematic error known as sampling bias.
- Time interval: trials may be terminated early when an extreme value is reached; if all variables have similar means, the variable with the largest variance has the greatest chance of reaching that extreme value.
- Data: It occurs when specific data is arbitrarily chosen and the generally agreed-upon criteria are not followed.
- Attrition: In this context, attrition refers to the loss of participants. It is the exclusion of subjects who did not complete the study.
15. What is the bias-variance trade-off?
Let us first learn the meaning of bias and variance in detail:
Bias: a type of error in a machine learning model that occurs when the ML algorithm is oversimplified. When such a model is trained, it makes simplified assumptions in order to approximate the target function. Decision trees and SVMs are examples of low-bias algorithms; the logistic and linear regression algorithms, on the other hand, have high bias.
Variance: also a type of error, introduced into an ML model when the algorithm is made extremely complex. Such a model additionally learns the noise in the training data set and therefore performs poorly on the test data set. This leads to overfitting and high sensitivity to the training data.
When the complexity of a model is increased, the error initially decreases because the model's bias drops. This only continues until we reach an optimal point; if we keep increasing the model's complexity beyond it, the model becomes overfitted and suffers from high variance. Balancing these two sources of error is the bias-variance trade-off.
16. What exactly is logistic regression? Give an example of a time when you employed logistic regression.
The logit model is another name for logistic regression. It is a method for forecasting the binary outcome of a linear combination of variables (called the predictor variables).
Assume we want to forecast the outcome of an election for a specific political leader, i.e. whether or not this leader will win. The outcome is therefore binary: win (1) or lose (0). The input is a linear combination of predictor variables, such as the advertising budget, the previous work done by the leader and the party, and so on.
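To show the mechanics, here is a minimal from-scratch sketch of logistic regression fitted by gradient descent on synthetic data; in practice one would use a library such as scikit-learn, and the data, true weights, learning rate, and iteration count here are arbitrary demo choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "election" data: two standardised predictors (say, advertising
# budget and prior approval) and a binary win/lose outcome drawn from a
# true logistic model with weights (2, -1).
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = (1 / (1 + np.exp(-(X @ true_w))) > rng.random(200)).astype(float)

# Fit the weights by gradient descent on the log-loss.
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))       # predicted probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)  # gradient step

preds = (1 / (1 + np.exp(-(X @ w))) >= 0.5).astype(float)
accuracy = float((preds == y).mean())
print(f"training accuracy: {accuracy:.2f}")
```

The sigmoid maps the linear combination of predictors to a probability in (0, 1), which is then thresholded at 0.5 to obtain the binary win/lose prediction.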
17. What exactly is deep learning? What is the distinction between deep and machine learning?
- Deep learning is a paradigm within machine learning. In deep learning, multiple layers of processing are used to extract high-level features from data. The neural networks are designed in a way that tries to mimic the human brain.
- Deep learning has demonstrated extraordinary performance in recent years due to its strong parallel with the human brain.
- The distinction between machine learning and deep learning is that deep learning is a paradigm or subset of machine learning inspired by the structure and operations of the human brain, known as artificial neural networks.
18. Why is data cleaning so important? How are the data cleaned?
To get good insights while running an algorithm on any data, it is critical to have correct and clean data that contains only essential information. Dirty data frequently leads to poor or erroneous insights and predictions, which might have negative consequences.
For example, when launching any large campaign to market a product, if our data analysis tells us to target a product that has no demand in reality, the campaign is doomed to fail. As a result, the company’s revenue is lost. This is where the value of having accurate and clean data comes into play.
- Cleaning data coming from diverse sources transforms it into a format that data scientists can work with.
- Properly cleaned data improves model accuracy and yields very good predictions.
- When the dataset is very large, it becomes difficult to run models on it, and data cleanup alone can consume a large share of the effort (roughly 80% of an analyst's time, by common estimates). Cleansing the data before running the model therefore increases the model's speed and efficiency.
- Data cleaning assists in identifying and correcting any structural flaws in the data. It also aids in the removal of duplicates and the maintenance of consistency.
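A small pandas sketch of typical cleaning steps; the records, column names, and rules are invented for the example:

```python
import pandas as pd

# Messy made-up sales records: a missing value, inconsistent casing,
# and exact duplicate rows.
raw = pd.DataFrame({
    "product": ["Widget", "widget", "Gadget", "Gadget", None],
    "price": [9.99, 9.99, 24.50, 24.50, 5.00],
})

cleaned = (
    raw
    .dropna(subset=["product"])                          # drop rows missing key fields
    .assign(product=lambda d: d["product"].str.lower())  # normalise casing
    .drop_duplicates()                                   # remove duplicate rows
    .reset_index(drop=True)
)

print(cleaned)  # two rows remain: widget/9.99 and gadget/24.50
```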
19. How do you handle missing values during analysis?
To determine the extent of missing values, we must first identify the variables that have them. If a pattern is discovered in the missingness, the analyst should focus on it, because it may lead to interesting and significant findings. If no pattern is found, we can replace the missing values with the median or mean of the column, or simply ignore the missing data.
If the variable is categorical, the missing field is assigned a default value, typically the mode (the most frequent category). If the data follows a normal distribution, missing values are given the mean value.
If 80% of the values for a variable are missing, we would drop the variable rather than treat the missing values.
20. How will you handle missing values in your data analysis?
After determining which variables contain missing values, the impact of missing values can be determined.
If the data analyst discovers a pattern in these missing variables, there is a potential that important insights will emerge.
If no such patterns are identified, the missing values can either be disregarded or replaced with default values such as the mean, minimum, maximum, or median.
If the missing values belong to categorical variables, they are given a default value such as the mode. If the data has a normal distribution, missing values are assigned the mean. If 80% of the values are missing, the analyst must decide whether to replace them with default values or drop the variable.
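The strategies above can be sketched with pandas on a made-up DataFrame; the column names and the 80%-missing threshold are illustrative:

```python
import numpy as np
import pandas as pd

# A made-up DataFrame with numeric, categorical, and mostly-missing columns.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 29.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

# Numeric column: impute with the mean (the median is more robust to outliers).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: impute with the mode (the most frequent category).
df["city"] = df["city"].fillna(df["city"].mode()[0])

# A column with ~80% missing values carries little signal, so drop it.
df = df.drop(columns=["mostly_missing"])

print(df)
```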