SQL Interview Questions and Answers – 2021

1. What is DBMS?

A Database Management System (DBMS) is a program that controls the creation, maintenance and use of a database. A DBMS can be thought of as a file manager for the data in a database, as opposed to saving data in separate file systems.

A DBMS is a software application that interacts with the user, applications, and the database itself to capture and analyze data. A database is a structured collection of data.

A DBMS allows a user to interact with the database. The data stored in the database can be modified, retrieved and deleted, and can be of any type: strings, numbers, images, etc.

There are two types of DBMS:

  • Relational Database Management System: The data is stored in relations (tables). Example – MySQL.
  • Non-Relational Database Management System: There is no concept of relations, tuples and attributes. Example – MongoDB.

2. What is RDBMS?

RDBMS stands for Relational Database Management System. An RDBMS stores data in a collection of tables, which are related to one another through common fields (columns). It also provides relational operators to manipulate the data stored in the tables.

3. What is SQL?

SQL stands for Structured Query Language. SQL is a standard language used for maintaining relational databases and performing many different data manipulation operations on the data. SQL originated in the early 1970s. It is a database language used for database creation and deletion, fetching rows, modifying rows, etc. It is sometimes pronounced 'sequel.'

4. What are the usages of SQL?

  • SQL is responsible for maintaining the relational data and the data structures present in the database.
  • To execute queries against a database
  • To retrieve data from a database
  • To insert records into a database
  • To update records in a database
  • To delete records from a database
  • To create new databases
  • To create new tables in a database
  • To create views in a database
  • To perform complex operations on the database.

5. Does SQL support programming?

SQL stands for Structured Query Language, which is not a general-purpose programming language. Standard SQL has no loops, conditional statements or procedural logic of its own (those come from procedural extensions such as PL/SQL and T-SQL), so it cannot be used for much beyond data definition and manipulation. It is used as a command (query) language to access databases. The primary purpose of SQL is to retrieve, manipulate and update data, and to perform complex operations like joins on the data present in the database.

6. What do you mean by table and field in SQL?

A table refers to a collection of data organised in rows and columns. A field refers to a column in a table.

For example:

Table: StudentInformation
Field: Stu Id, Stu Name, Stu Marks

7. What are joins in SQL?

A JOIN clause is used to combine rows from two or more tables based on a related column between them. It is used to merge tables or to retrieve data from them. There are four types of joins, described below:

Inner join: Inner Join in SQL is the most common type of join. It is used to return all the rows from multiple tables where the join condition is satisfied. 

Left Join:  Left Join in SQL is used to return all the rows from the left table but only the matching rows from the right table where the join condition is fulfilled.

Right Join: Right Join in SQL is used to return all the rows from the right table but only the matching rows from the left table where the join condition is fulfilled.

Full Join: Full join returns all the records from both tables, matching rows where the join condition is satisfied. Therefore, it returns all the rows from the left-hand side table and all the rows from the right-hand side table.
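
For illustration, a sketch assuming hypothetical employee and department tables related through dept_id:

SELECT e.emp_name, d.dept_name
FROM employee e
INNER JOIN department d
    ON e.dept_id = d.dept_id;   /* only rows with a matching dept_id on both sides */

/* Replacing INNER JOIN with LEFT JOIN, RIGHT JOIN or FULL JOIN (FULL OUTER JOIN)
   gives the other three variants described above. */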

8. What is a Database?

A database is an organized collection of data that allows easy access, storage, retrieval and management of data. It is also known as a structured form of data which can be accessed in many ways.

Example: School Management Database, Bank Management Database.

9. What is a primary key?

A primary key is a field, or combination of fields, that uniquely identifies a row. It is a special kind of unique key and has an implicit NOT NULL constraint, meaning primary key values cannot be NULL.
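
For example (table and column names are illustrative):

CREATE TABLE student (
    stu_id INT PRIMARY KEY,    /* unique and implicitly NOT NULL */
    stu_name VARCHAR(50)
);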

10. What is a unique key?

A unique key constraint uniquely identifies each record in a table. It provides uniqueness for the column or set of columns.

A primary key constraint automatically has a unique constraint defined on it, along with an implicit NOT NULL; a unique key, by contrast, allows NULL values.

There can be many unique constraints defined per table, but only one primary key constraint per table.
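
A minimal sketch, reusing the illustrative student table above (most databases support this syntax):

ALTER TABLE student
ADD CONSTRAINT uq_stu_name UNIQUE (stu_name);   /* several UNIQUE constraints may exist per table */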

11. What is a foreign key?

A foreign key is a column (or set of columns) in one table that refers to the primary key of another table. The relationship between the two tables is created by referencing the foreign key to the primary key of the other table.
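
For example, a hypothetical marks table referencing the student table sketched earlier:

CREATE TABLE marks (
    mark_id INT PRIMARY KEY,
    stu_id INT,
    FOREIGN KEY (stu_id) REFERENCES student (stu_id)   /* values must match an existing student */
);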

12. What is a Self-Join?

A self JOIN is a case of regular join where a table is joined to itself based on some relation between its own column(s). Self-join uses the INNER JOIN or LEFT JOIN clause and a table alias is used to assign different names to the table within the query.

SELECT A.emp_id AS "Emp_ID", A.emp_name AS "Employee",
       B.emp_id AS "Sup_ID", B.emp_name AS "Supervisor"
FROM employee A
INNER JOIN employee B          /* the table is joined to itself via aliases A and B */
    ON A.emp_sup = B.emp_id;

13. What is a Cross-Join?

Cross join can be defined as the Cartesian product of the two tables included in the join. The result contains one row for every combination of rows from the two tables, i.e. the number of rows in the first table multiplied by the number of rows in the second. If a WHERE clause is used with a cross join, the query works like an INNER JOIN.

SELECT stu.name, sub.subject 
FROM students AS stu
CROSS JOIN subjects AS sub;

14. What is the difference between CHAR and VARCHAR2 datatype in SQL?

Both CHAR and VARCHAR2 are character datatypes, but VARCHAR2 stores character strings of variable length whereas CHAR stores strings of fixed length (padding shorter values with spaces). For example, CHAR(10) always occupies exactly 10 characters, whereas VARCHAR2(10) can store any string of up to 10 characters, e.g. of length 6, 8 or 2.

15. What are Constraints?

Constraints in SQL are used to specify rules that the data in a table must follow. They can be specified when the table is created (CREATE TABLE) or afterwards (ALTER TABLE). Common constraints are listed below; a combined example follows the list:

  • NOT NULL
  • CHECK
  • DEFAULT
  • UNIQUE
  • PRIMARY KEY
  • FOREIGN KEY
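
For instance, a hypothetical table combining several of these constraints (PRIMARY KEY and FOREIGN KEY are illustrated under questions 9 and 11):

CREATE TABLE account (
    acc_id INT PRIMARY KEY,
    acc_name VARCHAR(50) NOT NULL,        /* a value is required */
    balance DECIMAL(10,2) DEFAULT 0.00,   /* used when no value is supplied */
    CHECK (balance >= 0)                  /* every row must satisfy this condition */
);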

16. What is the difference between DELETE and TRUNCATE statements?

  • DELETE is used to remove one or more rows from a table (optionally filtered with a WHERE clause); TRUNCATE is used to delete all the rows from a table.
  • You can roll back data after using a DELETE statement; you generally cannot roll back a TRUNCATE.
  • DELETE is a DML command; TRUNCATE is a DDL command.
  • DELETE is slower than TRUNCATE; TRUNCATE is faster.
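
For example (table name is illustrative):

DELETE FROM employee WHERE emp_id = 101;   /* removes selected rows; can be rolled back */

TRUNCATE TABLE employee;                   /* removes all rows */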

17. What do you mean by data integrity? 

Data Integrity defines the accuracy as well as the consistency of the data stored in a database. It also defines integrity constraints to enforce business rules on the data when it is entered into an application or a database.

18. What is an Index?

An index refers to a performance tuning method for allowing faster retrieval of records from a table. An index creates an entry for each value, and hence it is faster to retrieve data.

CREATE INDEX index_name 	 /* Create Index */
ON table_name (column_1, column_2);

DROP INDEX index_name; 	 /* Drop Index */

19. Explain different types of index in SQL.

There are three types of indexes in SQL, namely:

Unique Index:

This index does not allow the indexed column to have duplicate values. When a primary key is defined, a unique index is typically created on it automatically.

CREATE UNIQUE INDEX myIndex
ON students (enroll_no);

Clustered Index:

This index reorders the physical order of the table and performs searches based on the key values. Each table can have only one clustered index.

Non-Clustered Index:

Non-Clustered Index does not alter the physical order of the table and maintains a logical order of the data. Each table can have many nonclustered indexes.

20. What is a Data Definition Language?

Data definition language (DDL) is the subset of SQL that defines the data structures of the database when it is being created. It consists of commands such as CREATE, ALTER, DROP and TRUNCATE for database objects such as schemas, tables, views, sequences, etc.

21. What is a Data Manipulation Language?

Data manipulation language (DML) enables the user to retrieve and manipulate data. It is used to perform the following operations (see the sketch after this list).

  • Insert data into database through INSERT command.
  • Retrieve data from the database through SELECT command.
  • Update data in the database through UPDATE command.
  • Delete data from the database through DELETE command.
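
A minimal sketch of the four commands against a hypothetical student table:

INSERT INTO student (stu_id, stu_name) VALUES (1, 'Asha');     /* insert */
SELECT stu_name FROM student WHERE stu_id = 1;                 /* retrieve */
UPDATE student SET stu_name = 'Asha K' WHERE stu_id = 1;       /* update */
DELETE FROM student WHERE stu_id = 1;                          /* delete */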

22. What is Data Control Language?

Data control language (DCL) allows you to control access to the database. DCL is the subset of SQL which decides what part of the database can be accessed by which user at what point in time. It includes two commands, GRANT and REVOKE.

GRANT: gives a specific user permission to perform particular tasks.

REVOKE: withdraws previously granted permissions.
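
For example (user and table names are illustrative):

GRANT SELECT, INSERT ON student TO analyst_user;   /* allow reading and inserting */
REVOKE INSERT ON student FROM analyst_user;        /* withdraw the insert permission */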

23. What is Normalization in SQL?

Normalization is used to decompose a larger, complex table into simple and smaller ones. This helps us in removing all the redundant data.

Generally, in a table we have a lot of redundant information which is not required, so it is better to divide this complex table into multiple smaller tables which contain only unique information (a decomposition sketch follows the normal-form rules below).

Let’s look at the rules for a table to be in first normal form, second normal form and third normal form:

First normal form:

A relation schema is in 1NF if and only if:

  • All attributes in the relation are atomic (indivisible values).
  • There are no repeating elements or groups of elements.

Second normal form:

A relation is said to be in 2NF if and only if:

  • It is in first normal form (1NF).
  • No partial dependency exists between non-key attributes and key attributes.

Third Normal form:

A relation R is said to be in 3NF if and only if:

  • It is in 2NF.
  • No transitive dependency exists between non-key attributes and key attributes through another non-key attribute.
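
As a sketch (table and column names are illustrative), a single table student(stu_id, stu_name, course_id, course_name) repeats course_name for every student enrolled in a course; splitting it into two tables removes that redundancy:

CREATE TABLE course (
    course_id INT PRIMARY KEY,
    course_name VARCHAR(50)
);

CREATE TABLE student (
    stu_id INT PRIMARY KEY,
    stu_name VARCHAR(50),
    course_id INT REFERENCES course (course_id)   /* course details stored once, referenced here */
);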

24. What is PL/SQL?

PL/SQL stands for Procedural Language/SQL: procedural language constructs on top of Structured Query Language. PL/SQL was introduced by Oracle to overcome the limitations of plain SQL, adding a procedural-language approach to plain vanilla SQL.

While plain SQL lets us run DDL and DML queries, PL/SQL lets us create functions, triggers and other procedural constructs.
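
For instance, a minimal PL/SQL anonymous block, assuming a hypothetical employee table (DBMS_OUTPUT.PUT_LINE prints to the console in Oracle):

DECLARE
    v_count NUMBER;
BEGIN
    SELECT COUNT(*) INTO v_count FROM employee;
    IF v_count > 0 THEN    /* conditional logic is not available in plain SQL */
        DBMS_OUTPUT.PUT_LINE('Employees: ' || v_count);
    END IF;
END;
/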

25. What is ETL in SQL?

ETL stands for Extract, Transform and Load. It is a three-step process: we start by extracting the data from its sources. Once we collate the data from the different sources, we have raw data. This raw data has to be transformed into a tidy format, which happens in the second phase. Finally, we load this tidy data into tools that help us find insights.

26. What is OLAP?

OLAP stands for Online Analytical Processing. It is a class of software programs characterized by a relatively low frequency of online transactions. Queries are often complex and involve many aggregations.

27. What is a “TRIGGER” in SQL?

  • A trigger allows you to execute a batch of SQL code when an INSERT, UPDATE or DELETE command is run against a specific table; in other words, a trigger is a set of actions performed whenever such commands are issued through queries.
  • The trigger is said to be activated when these commands are given to the system.
  • Triggers are a particular type of stored procedure defined to execute automatically in place of, or after, data modifications.
  • Triggers are created using the CREATE TRIGGER statement (see the sketch below).
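
For instance, a sketch in MySQL syntax (trigger syntax varies across databases; table names are illustrative):

CREATE TRIGGER trg_employee_insert
AFTER INSERT ON employee
FOR EACH ROW
INSERT INTO employee_audit (emp_id, action_time)
VALUES (NEW.emp_id, NOW());   /* runs automatically after every INSERT on employee */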

28. What is ACID property in a database?

ACID property is used to ensure that the data transactions are processed reliably in a database system.

A single logical unit of work on the data is called a transaction.

ACID is an acronym for Atomicity, Consistency, Isolation, Durability.

Atomicity: requires that each transaction is all or nothing. If one part of the transaction fails, the entire transaction fails and the database state is left unchanged.

Consistency: ensures that the data must meet all validation rules; in simple words, a transaction never leaves the database in a half-completed state.

Isolation: ensures that concurrently executing transactions do not interfere with each other, so each appears to run as if it were alone. The main goal of isolation is concurrency control.

Durability: once a transaction has been committed, it remains committed come what may, even power loss, crashes or errors.
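
For example, atomicity for a funds transfer can be sketched as follows (transaction syntax varies slightly across databases; table names are illustrative):

BEGIN TRANSACTION;

UPDATE account SET balance = balance - 100 WHERE acc_id = 1;
UPDATE account SET balance = balance + 100 WHERE acc_id = 2;

COMMIT;   /* both updates become permanent together; ROLLBACK instead would leave the database unchanged */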

29. What are the syntax and use of the COALESCE function?

The syntax of COALESCE function:

COALESCE(exp1, exp2, ..., expn)

The COALESCE function is used to return the first non-null expression given in the parameter list.
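
For example:

SELECT COALESCE(NULL, NULL, 'third value', 'fourth value');   /* returns 'third value' */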

30. What is the SQL query to display the current date?

There is a built-in function in SQL Server called GETDATE() which is used to return the current timestamp.
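
For example (CURRENT_TIMESTAMP is the ANSI-standard alternative supported by most databases):

SELECT GETDATE();            /* current date and time in SQL Server */
SELECT CURRENT_TIMESTAMP;    /* portable equivalent */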

31. Are NULL values the same as zero or a blank space?

A NULL value is not at all the same as zero or a blank space. A NULL value represents a value which is unavailable, unknown, unassigned or not applicable, whereas a zero is a number and a blank space is a character.

32. What is subquery in SQL?

A subquery is a query inside another query, defined to retrieve data or information from the database. The outer query is called the main query, whereas the inner query is called the subquery. The subquery is typically executed first and its result is passed on to the main query. It can be nested inside a SELECT, UPDATE or any other query. A subquery can also use comparison operators such as >, < or =.
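
For example, selecting students who scored above the average mark (table and column names are illustrative):

SELECT stu_name
FROM student
WHERE stu_marks > (SELECT AVG(stu_marks) FROM student);   /* the inner query is evaluated first */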

33. What is the difference between ‘HAVING’ CLAUSE and a ‘WHERE’ CLAUSE?

The HAVING clause can be used only with the SELECT statement. It is usually used with a GROUP BY clause, and whenever GROUP BY is not used, HAVING behaves like a WHERE clause.
The HAVING clause is applied to the groups produced by GROUP BY, whereas the WHERE clause is applied to individual rows before they become part of the GROUP BY operation.
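
A sketch of the difference (names are illustrative): WHERE filters individual rows before grouping, while HAVING filters the groups after aggregation.

SELECT dept_id, COUNT(*) AS emp_count
FROM employee
WHERE salary > 30000        /* row-level filter, applied before grouping */
GROUP BY dept_id
HAVING COUNT(*) > 5;        /* group-level filter, applied after aggregation */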

34. How can you fetch the first 5 characters of a string?

There are a lot of ways to fetch characters from a string. For example:

SELECT SUBSTRING(StudentName, 1, 5) AS StudentName FROM Student;

35. What is a View?

A view is a virtual table consisting of a subset of the data contained in one or more tables. Since a view is not physically stored, it takes less space. A view can combine data from one or more tables, depending on the relationship.

36. What are Views used for?

A view refers to a logical snapshot based on a table or another view. It is used for the following reasons:

  • Restricting access to data.
  • Making complex queries simple.
  • Ensuring data independence.
  • Providing different views of same data.

37. What is a Stored Procedure?

A stored procedure is a named group of SQL statements, stored in the database, that can access the database system. Several SQL statements are consolidated into a stored procedure and executed whenever and wherever required, which saves time and avoids writing the same code again and again.
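
A minimal sketch in SQL Server (T-SQL) syntax, assuming a hypothetical employee table:

CREATE PROCEDURE GetEmployeesByDept @dept_id INT
AS
BEGIN
    SELECT emp_id, emp_name
    FROM employee
    WHERE dept_id = @dept_id;
END;

/* Called as: EXEC GetEmployeesByDept @dept_id = 10; */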

38. List some advantages and disadvantages of Stored Procedure?

Advantages:

A stored procedure supports modular programming: create it once, store it, and call it as many times as required. This supports faster execution, reduces network traffic and provides better security for the data.

Disadvantage:

The main disadvantage of a stored procedure is that it can be executed only in the database and it utilizes more memory on the database server.

39. What is Auto Increment in SQL?

The auto-increment keyword allows the user to have a unique number generated whenever a new record is inserted into a table.
It is commonly used together with the PRIMARY KEY in SQL.
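
For example, in MySQL the keyword is AUTO_INCREMENT (SQL Server uses the IDENTITY property and Oracle uses identity columns or sequences for the same purpose):

CREATE TABLE student (
    stu_id INT AUTO_INCREMENT PRIMARY KEY,   /* generated automatically */
    stu_name VARCHAR(50)
);

INSERT INTO student (stu_name) VALUES ('Asha');   /* stu_id is assigned by the database */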

40. What is a Datawarehouse?

A data warehouse refers to a central repository of data where the data is assembled from multiple sources of information. That data is consolidated, transformed and made available for mining as well as online analytical processing. Warehouse data also has subsets of data called data marts.

41. What are the RANK, DENSE_RANK and ROW_NUMBER functions?

Let’s understand all three functions with an example. Suppose we have an Employee table with columns EmpName and EmpSalary.

ROW_NUMBER() Function with Partition By clause:

ROW_NUMBER assigns a sequential number to each row in the result set. Used with the OVER clause (part of the SQL standard), it supports more complex ordering and numbering of rows in report formats.

Syntax

ROW_NUMBER() OVER ( [PARTITION BY value_expression, ... [n]] order_by_clause )

If we want to add a row number within each group, PARTITION BY helps to reset the numbering for every group. Let’s take a look at the Employee table:

SELECT *, ROW_NUMBER() OVER (PARTITION BY EmpName ORDER BY EmpName) AS rownumber
FROM Employee;

Rank() Function in SQL Server:

This function assigns a unique rank value to each distinct row, but it leaves gaps in the ranking when there are ties (groups of equal values).

SELECT EmpName, EmpSalary, RANK() OVER (ORDER BY EmpName) AS rankID
FROM Employee;

The gap equals the number of tied occurrences. For example, if EmpName = 'atul' is repeated 3 times and has rank 1, the next rank will be 1 + 3 = 4, and the same applies to the next value.

Dense_Rank() Function in SQL Server:

DENSE_RANK() is similar to RANK() with one difference: it does not leave gaps between ranks.

SELECT EmpName, EmpSalary, DENSE_RANK() OVER (ORDER BY EmpName) AS DrankID
FROM Employee;

Summary: ROW_NUMBER() gives every row a unique sequential number; RANK() gives tied rows the same rank and skips the following ranks; DENSE_RANK() gives tied rows the same rank without skipping any.
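
The three functions can be contrasted in one query. A minimal sketch, assuming a hypothetical Employee table containing three rows with EmpName = 'atul' and one with EmpName = 'ravi':

SELECT EmpName,
       ROW_NUMBER() OVER (ORDER BY EmpName) AS row_num,    /* 1, 2, 3, 4 */
       RANK()       OVER (ORDER BY EmpName) AS rank_id,    /* 1, 1, 1, 4 */
       DENSE_RANK() OVER (ORDER BY EmpName) AS dense_id    /* 1, 1, 1, 2 */
FROM Employee;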

42. Explain Normalization and De-Normalization.

Normalization is the process of removing redundant data from the database by splitting the table in a well-defined manner in order to maintain data integrity. This process saves much of the storage space.

De-normalization is the process of adding up redundant data on the table in order to speed up the complex queries and thus achieve better performance.

43. What is a deadlock?

It is an unwanted situation where two or more transactions are waiting indefinitely for one another to release the locks. 

44. What is ALIAS command?

This command provides another name (an alias) for a table or a column. It is typically written with the AS keyword in the SELECT or FROM clause of a SQL query.
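
For example (names are illustrative):

SELECT e.emp_name AS employee_name   /* column alias */
FROM employee AS e;                  /* table alias */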

45. What is OLTP?

OLTP stands for Online Transaction Processing; it is a class of software applications capable of supporting transaction-oriented programs. An essential attribute of an OLTP system is its ability to maintain concurrency. To avoid single points of failure, OLTP systems are often decentralized. These systems are usually designed for a large number of users who conduct short transactions. Database queries are usually simple, require sub-second response times and return relatively few records.

46. What are the differences between OLTP and OLAP?

OLTP stands for Online Transaction Processing; it is a class of software applications capable of supporting transaction-oriented programs. An important attribute of an OLTP system is its ability to maintain concurrency. OLTP systems often follow a decentralized architecture to avoid single points of failure. These systems are generally designed for a large audience of end users who conduct short transactions. Queries involved in such databases are generally simple, need fast response times and return relatively few records. The number of transactions per second acts as an effective measure for such systems.

OLAP stands for Online Analytical Processing; it is a class of software programs characterized by a relatively low frequency of online transactions. Queries are often complex and involve many aggregations. For OLAP systems, the effectiveness measure relies heavily on response time. Such systems are widely used for data mining and for maintaining aggregated, historical data, usually in multi-dimensional schemas.

Questions and answers on dimensionality reduction

1. What is dimensionality reduction?

When a dataset has many input features, the model is prone to overfitting. To reduce the input feature space, we can either drop features or extract new ones from them; this is, at its core, dimensionality reduction.

Now let’s discuss more about both techniques.

  • Drop irrelevant or redundant features, as they do not contribute to the accuracy of the predictive problem. When we drop such input variables, we lose the information stored in them.
  • Alternatively, we can create new independent variables from the existing input variables, so we do not lose the information in the variables. This is feature extraction.

2. Explain Principal Component Analysis?

Suppose we have a large dataset of correlated input variables and we want to reduce the number of input variables to a smaller feature space while still retaining the critical information. We can solve this by using Principal Component Analysis (PCA).

Now let’s understand the features of PCA in a bit more detail.

PCA reduces the dimensionality of the data using feature extraction. It does this by constructing new variables that explain most of the variability in the dataset.

PCA removes redundant information by removing correlated features. PCA creates new variables that are independent of each other, which takes care of the multicollinearity issue.

PCA is an unsupervised technique. It only looks at the input features and does not take into account the output or the target variable.

3. What are the importance and limitations of Principal Component Analysis?

Following are the advantages of PCA:

  • Removes correlated features – Real-world datasets often have many correlated features, and finding correlations manually among thousands of features is nearly impossible, frustrating and time-consuming. PCA removes this redundancy efficiently.
  • Improves algorithm performance – With many features, the performance of your algorithm can degrade drastically. PCA is a very common way to speed up your machine learning algorithm by getting rid of correlated variables which don’t contribute to any decision making.
  • Improves visualization – It is very hard to visualize and understand data in high dimensions. PCA transforms high-dimensional data to low-dimensional data (2 dimensions) so that it can be visualized easily.

Following are the limitations of PCA:

  • Independent variables become less interpretable – After applying PCA to the dataset, your original features turn into principal components. Principal components are linear combinations of your original features and are not as readable and interpretable as the original features.
  • Data standardization is a must before PCA – You must standardize your data before applying PCA; otherwise PCA will not be able to find the optimal principal components.
  • Information loss – Although principal components try to cover the maximum variance among the features in a dataset, if we don’t select the number of principal components with care, we may miss some information compared to the original list of features.

4. What is t-SNE?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. In simpler terms, t-SNE gives you a feel or intuition of how the data is arranged in a high-dimensional space.

Let’s understand each term in detail.

Stochastic – not definite, but random probability.
Neighbourhood – concerned only with retaining the structure of neighbouring points.
Embedding – picking up a point from a high-dimensional space and placing it into a lower dimension.

5. How to apply t-SNE?

Basically, t-SNE measures similarities between points in the high-dimensional space.

Let’s use the figure below to understand the algorithm.

[Figure: t-SNE example – d-dimensional points x1…x5 mapped to 2-dimensional points x’1…x’5]

Suppose we are reducing d-dimensional data into 2-dimensional data using t-SNE.
From the above picture we can see that x2 and x3 are in the neighborhood of x1 [N(x1) = {x2, x3}] and x5 is in the neighborhood of x4 [N(x4) = {x5}].

As t-SNE preserves the distances in a neighborhood,

d(x1, x2) ≈ d(x’1, x’2)
d(x1, x3) ≈ d(x’1, x’3)
d(x4, x5) ≈ d(x’4, x’5)

For every point, t-SNE constructs a notion of which other points are its ‘neighbours’, and then tries to embed the points so that each point keeps those same neighbours in the lower-dimensional space.

6. What is the crowding problem?

Sometimes it is impossible to preserve the distances in all the neighbourhoods; this is called the crowding problem. When we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate nearby datapoints from moderately distant datapoints, and gaps cannot form between natural clusters.

For example, when a data point, ‘x’ is a neighbor to 2 data points that are not neighboring to each other, this may result in losing the neighborhood of ‘x’ with one of the data points as t-SNE is concerned only within the neighborhood zone.

7. How to interpret t-SNE output?

There are 3 main parameters:
a) Steps: the number of iterations.
b) Perplexity: can be thought of as the effective number of neighbouring points considered for each point.
c) Epsilon: the learning rate, which determines the speed at which the embedding is updated.

Statistics interview questions and answers for data scientist

How do you assess the statistical significance of an insight?

We need to perform hypothesis testing to determine statistical significance, taking the following steps.

  • First, define the null hypothesis and the alternate hypothesis.
  • Next, calculate the p-value.
  • Last, set the level of significance (alpha); if the p-value is less than alpha, reject the null hypothesis; in other words, the result is statistically significant (see the example below).
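
For instance, with α = 0.05, an observed p-value of 0.03 falls below α, so we would reject the null hypothesis; a p-value of 0.20 would not, so we would fail to reject it.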

What is the Central Limit Theorem and why is it important?

  • The central limit theorem (CLT) is a very important concept in statistics. It states that, no matter what the underlying distribution of the data is, the sampling distribution of the mean has mean equal to the mean of the original distribution and variance n times smaller, where n is the sample size.
  • The CLT states that the distribution of sample means approximates a normal distribution as the sample size gets larger.
  • Sample sizes equal to or greater than 30 are generally considered sufficient for the CLT to hold.
  • A key aspect of the CLT is that the average of the sample means will equal the population mean, and the standard deviation of the sample means (the standard error) equals the population standard deviation divided by √n.

Example-

Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we can’t obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes: what can we say about the average height of the entire population given a single sample? The Central Limit Theorem addresses this question.

What is sampling? How many sampling methods do you know?

Data sampling is a statistical analysis technique used to select, manipulate and analyse a subset of data points to identify patterns and trends in the larger data set. It enables data scientists and other data analysts to work with a small, manageable amount of data about a statistical population to build and run analytical models more quickly, while still producing accurate findings.

  • Simple random sampling: Software is used to randomly select subjects from the whole population.
  • Stratified sampling: Subsets of the data sets or population are created based on a common factor, and samples are randomly collected from each subgroup.
  • Cluster sampling: The larger data set is divided into subsets (clusters) based on a defined factor, then a random sampling of clusters is analyzed.
  • Multistage sampling: A more complicated form of cluster sampling, this method also involves dividing the larger population into a number of clusters. Second-stage clusters are then broken out based on a secondary factor, and those clusters are then sampled and analyzed. This staging could continue as multiple subsets are identified, clustered and analyzed.
  • Systematic sampling: A sample is created by setting an interval at which to extract data from the larger population — for example, selecting every 10th row in a spreadsheet of 200 items to create a sample size of 20 rows to analyze.

Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

Selection bias is the phenomenon of selecting individuals, groups or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.

Types of selection bias include:

  • sampling bias: a biased sample caused by non-random sampling
  • time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
  • attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where only those that ‘failed’ are included
  • observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it. 

Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you are assuming the data is not as spread out as it might actually be.

What is the difference between type I vs type II error?

Anytime we make a decision using statistics there are four possible outcomes, with two representing correct decisions and two representing errors.

Type I Error:

A type I error, also known as a false positive, occurs when a researcher incorrectly rejects a true null hypothesis: you report that your findings are significant when in fact they occurred by chance.

The probability of making a type I error is represented by your alpha level (α), the threshold to which you compare the p-value. An alpha of 0.05 indicates that you are willing to accept a 5% chance of being wrong when you reject the null hypothesis.

Type II Error:

A type II error, also known as a false negative, occurs when a researcher fails to reject a null hypothesis that is really false. Here a researcher concludes there is no significant effect when actually there is.

The probability of making a type II error is called beta (β), and it is related to the power of the statistical test (power = 1 − β). You can decrease your risk of committing a type II error by ensuring your test has enough power.

What are the four main things we should know before studying data analysis?

Following are the key points that we should know:

  • Descriptive statistics
  • Inferential statistics
  • Distributions (normal distribution / sampling distribution)
  • Hypothesis testing

What is the difference between inferential statistics and descriptive statistics?

Descriptive statistics – use the data to provide descriptions of the population, either through numerical calculations or graphs or tables.

Inferential statistics – use information from a sample to reach conclusions about the wider population.

How to calculate range and interquartile range?

Range = Maximum value − Minimum value

IQR = Q3 − Q1

where Q3 is the third quartile (75th percentile) and Q1 is the first quartile (25th percentile).
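
For example, for the hypothetical values 2, 4, 6, 8, 10, 12, 14, 16 (using the common median-of-halves method): Q1 = 5 and Q3 = 13, so IQR = 13 − 5 = 8, and Range = 16 − 2 = 14.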

What is the benefit of using box plot?

A box plot, also known as a box and whisker plot, is a type of graph that displays a summary of a large amount of data in five numbers. These numbers include the median, upper quartile, lower quartile, minimum and maximum data values.

Following are the advantages of Box-plot:

  • Handle Large data easily – Due to the five-number data summary, a box plot can handle and present a summary of a large amount of data. Organizing data in a box plot by using five key concepts is an efficient way of dealing with large data too unmanageable for other graphs, such as line plots or stem and leaf plots.
  • A box plot shows only a simple summary of the distribution of results, so that you can quickly view it and compare it with other data.
  • A box plot is a highly visually effective way of viewing a clear summary of one or more sets of data. 
  • A box plot is one of very few statistical graph methods that show outliers. Any results of data that fall outside of the minimum and maximum values known as outliers are easy to determine on a box plot graph.

What is the meaning of standard deviation?

It represents how far the data points are, on average, from the mean.

σ = √( ∑(x − µ)² / n )

Variance is the square of the standard deviation.
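
For example, for the hypothetical values 2, 4 and 6: µ = 4, so σ = √((4 + 0 + 4) / 3) = √(8/3) ≈ 1.63 using the population formula above.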

What is left skewed distribution and right skewed distribution?

Left skewed

  • The left tail is longer than the right side.
  • Mean < median < mode

Right skewed

  • The right tail is longer than the left side.
  • Mode < median < mean

What does symmetric distribution mean?

The part of the distribution on the left side of the median is a mirror image of the part on the right side of the median.

A few examples are the uniform distribution, the binomial distribution with p = 0.5, and the normal distribution.

What is the relationship between mean and median in normal distribution?

In the normal distribution, the mean is equal to the median (and to the mode).

What does it mean by bell curve distribution and Gaussian distribution?

The normal distribution is called the bell curve distribution / Gaussian distribution. It is called a bell curve because it has the shape of a bell. It is called the Gaussian distribution as it is named after Carl Gauss.

How to convert normal distribution to standard normal distribution?

The standard normal distribution has mean = 0 and standard deviation = 1. To convert a normal distribution to the standard normal distribution, we use the formula

z = (x − µ) / σ
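
For instance, a value x = 70 from a distribution with µ = 50 and σ = 10 standardizes to z = (70 − 50) / 10 = 2, i.e. it lies two standard deviations above the mean.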

What is an outlier? What can we do with outliers?

An outlier is an abnormal value, lying at an abnormal distance from the rest of the data points.

The following are things we can do with outliers:

Remove outliers:

  • When we know the data point is wrong (e.g. a negative age for a person)
  • When we have lots of data
  • We should provide two analyses: one with outliers and another without outliers.

Keep outliers:

  • When there are a lot of outliers (skewed data)
  • When results are critical
  • When outliers have meaning (e.g. fraud data)

What is the difference between population parameters and sample statistics?

Population parameters are:

  • Mean = µ
  • Standard deviation = σ

Sample statistics are:

  • Mean = x̄
  • Standard deviation = s

How to find the mean length of all fishes in the sea?

Define the confidence level (the most common is 95%). Take a sample of fish from the sea (for better results the sample size should be greater than 30). Calculate the mean length and the standard deviation of the lengths. Calculate the t-statistic. From these, construct the confidence interval in which the mean length of all the fish should lie.

What are the effects of the width of confidence interval?

  • Confidence interval is used for decision making
  • As the confidence level increases the width of the confidence interval also increases
  • As the width of the confidence interval increases, the interval becomes less precise and therefore less useful for decision making.

Mention the relationship between standard error and margin of error?

As the standard error increases the margin of error also increases.

What is p-value and what does it signify?

The p-value reflects the strength of evidence against the null hypothesis. p-value is defined as the probability that the data would be at least as extreme as those observed, if the null hypothesis were true.

  • p-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
  • p-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
  • p-value = 0.05 is the marginal value, indicating it is possible to go either way.

How to calculate p-value using manual method?

  • Find H0 and H1.
  • Find n, x̄ and s.
  • Find the degrees of freedom (DF) for the t-distribution.
  • Determine the type of distribution to use – t or z distribution.
  • Find the t or z value of the test statistic (using the look-up table).
  • Compare the test statistic with the critical value (or, equivalently, the p-value with alpha).

What is the difference between one tail and two tail hypothesis testing?

  • Two-tailed test – used when the null hypothesis contains only an equality (=), so the alternative is 'not equal' (≠).
  • One-tailed test – used when the alternative specifies a direction (< or >), i.e. the null hypothesis contains an inequality (<=, >=).

What is A/B testing?

A/B testing is a form of two-sample hypothesis testing used to compare two versions, the control and the variant, of a single variable. It is commonly used to improve and optimize user experience and marketing.

What is R-squared and Adjusted R-square?

R-squared (R²) is a value that measures how much of the variation in your output / predicted variable is explained by your input variables. So, if R-squared is 0.8, it means 80% of the variation in the output variable is explained by the input variables. In simple terms, the higher the R-squared, the more variation is explained by your input variables and hence the better your model is.

However, the problem with R-squared is that it will either stay the same or increase with addition of more variables, even if they do not have any relationship with the output variables. This is where “Adjusted R square” comes to help. Adjusted R-square penalizes you for adding variables which do not improve your existing model.

Hence, if you are building a linear regression on multiple variables, it is always suggested that you use adjusted R-squared to judge the goodness of the model. In case you only have one input variable, R-squared and adjusted R-squared would be exactly the same.

Typically, the more non-significant variables you add to the model, the larger the gap between R-squared and adjusted R-squared becomes.
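
For reference, the usual formula is Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of observations and p is the number of predictors; the penalty grows as more predictors are added.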

Explain ANOVA and its applications?

Analysis of Variance (abbreviated as ANOVA) is an extremely useful technique used to compare the means of multiple samples. Whether there is a significant difference between the means of 2 samples can be evaluated using a z-test or t-test, but with more than 2 samples the t-test cannot be applied, as it accumulates error and becomes cumbersome as the number of samples increases (for example, for 4 samples, 6 pairwise t-tests would have to be performed). The ANOVA technique enables us to perform this comparison in a single simultaneous test.

Let’s see an example: imagine we want to compare the salary of Data Scientists across 3 cities of India – Bengaluru, Delhi and Mumbai. To do so, we would collect salary samples from each city and run an ANOVA to test whether the city means differ.

A limitation of ANOVA is that it does not tell us which pair differs. In the above example, ANOVA can tell us whether there is a significant difference between the mean Data Scientist salaries among these 3 cities, but it does not provide any information on which pair is significantly different.

What is the difference between Correlation and Covariance?

Correlation and covariance are statistical concepts generally used to determine the relationship between, and measure the dependency of, two random variables. Correlation is a special case of covariance obtained when the variables are standardized: Corr(X, Y) = Cov(X, Y) / (σX σY).

Key differences: covariance is scale-dependent and can take any value, whereas correlation is dimensionless and always lies between −1 and +1.
