Azure Data Factory Interview Questions and Answers

1.    What is Azure Data Factory?

Azure Data Factory is a cloud-based ETL/ELT tool.

Azure Data Factory is a cloud-based data integration service offered by Microsoft that lets you create data-driven workflows for orchestrating and automating data movement and data transformation in the cloud. Data Factory also lets you create and run data pipelines that move and transform data, and run those pipelines on a specified schedule.

2.    What is Linked Service?

A linked service is a connection to a specific service or data store that can either be a source of data, or a destination (also called target or sink).

A linked service stores the connection string, as well as the method used to authenticate with the service. Once a linked service is created, you can reuse it everywhere. For example, if you have a data warehouse in an Azure SQL Database, you only need to define this connection once.

Linked services can be found in the Manage section of ADF Studio (the lowest icon in the left menu bar).

There you can see all the linked services that have been created in the factory.

3. What is the difference between Linked Service and Dataset?

Dataset:

Datasets can be considered the sources and targets of a pipeline. A pipeline can use multiple datasets, for example extracting a file, transforming it, and then writing it to a different folder within the same storage account. A dataset is therefore simply a named reference to data such as a table, a file or folder in storage, or an ERP system object.

Linked Services:

Linked services can be thought of as the connection string, or the authorization, that the pipeline uses to connect to a dataset. A linked service can hold the account key of a blob storage account, the username and password of a database, or a service account that is used to log into a BW system. While it is possible to have more than one linked service attached to the same dataset, you will usually see a one-to-one relationship.
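To make the distinction concrete, here is a rough sketch of how the two artifacts are defined in JSON. The names, storage account, and file below are purely illustrative; the point is that the dataset describes the data (container, file, format) while referencing the linked service, which holds the connection and authentication details:

{
  "name": "SalesBlobStorageLS",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=mystorageacct;AccountKey=<secret>"
    }
  }
}

{
  "name": "SalesOrdersCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "SalesBlobStorageLS", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": { "type": "AzureBlobStorageLocation", "container": "sales", "fileName": "orders.csv" },
      "firstRowAsHeader": true
    }
  }
}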

4. What is Integration Runtime? What are the different types?

The integration runtime (IR) is the compute infrastructure that Azure Data Factory uses to provide its integration capabilities: data flows, data movement, activity dispatch, and SSIS package execution.

The most important integration runtime is the Azure-IR. Every installation of ADF has a default one: the AutoResolveIntegrationRuntime. You can find it when you go to the Manage section of ADF and then click on Integration Runtimes.

There are three types of integration runtime available in Azure Data Factory; choose the one that best fits your scenario. The three types are:

  • Azure IR
  • Self-hosted
  • Azure-SSIS

Azure IR:

The Azure-IR is a fully managed, serverless compute service. You don't have to manage anything; you only pay for the duration it runs compute. You can always use the default Azure-IR, but you can also create a new one.

In the new window, choose the option with “Azure, Self-Hosted”

In the next step, choose Azure again.

In the following screen, enter a name for the new IR. Also choose your closest region.

You can also configure the IR to use a virtual network, but this is an advanced setting.

Keep in mind that billing for pipeline durations is several orders of magnitude higher when you're using a virtual network. In the third pane, we can configure the compute power for data flows.

There are two main reasons to create your own Azure-IR:

  • You want to specify a specific region for your compute. For example, if regulations specify that your data can never leave a certain region, you need to create your own Azure-IR located in that region.
  • You want to specify a data flow runtime with different settings than the default one. The Time To Live setting in particular is worth changing (shorter if you want to save on costs, longer if you don't want to restart your cluster too often during development/debugging).

The Self-hosted IR

Suppose you have data on-premises that you need to access from ADF. How can ADF, which runs in the Azure cloud, reach this data store? The self-hosted IR provides a solution: you install the self-hosted IR on one of your local machines, and it then acts as a gateway through which ADF can reach the on-premises data.

Another use case for the self-hosted IR is when you want to run compute on your own machines instead of in the Azure cloud. This might be an option if you want to save costs (billing for pipeline durations is lower on the self-hosted IR than on the Azure-IR) or if you want to control everything yourself. ADF then acts as an orchestrator, while all of the compute runs on your own local servers.

After clicking on New, select Self-Hosted.

Specify the name of the Integration Runtime and click on create.

After the creation of IR open it and copy key 1 or key 2.

Paste that key into the Microsoft Integration Runtime application installed on your machine.

After pasting the key click on Register.

The Azure-SSIS IR

ADF provides the opportunity to run Integration Services packages inside the ADF environment. This can be useful if you want to quickly migrate SSIS projects to the Azure cloud without a complete rewrite. The Azure-SSIS IR provides a scale-out cluster of virtual machines that can run SSIS packages. You create an SSIS catalog in either Azure SQL Database or Azure SQL Managed Instance.

5. What are the different types of triggers in ADF?

There are three types of triggers available in Azure Data Factory:

  • Schedule triggers
  • Tumbling window triggers
  • Event triggers

Schedule Trigger:

Schedule triggers are the most common triggers and can execute a pipeline on the time schedule we set. They offer flexibility by giving many scheduling options: minute, hour, day(s), week(s), or month(s). We can define the start and end date for when the trigger should be active, and it only starts firing once it is published and started. The schedule can also be set to run on future calendar dates and times, such as every 15th and the last Saturday of the month, or every month's first and fourth Monday.

The New Trigger pane will open. The default trigger type is Schedule, but you can also choose Tumbling Window and Event:

Choose the start date, optionally an end date, and whether or not to activate the trigger immediately after you publish it:

Choose the recurrence, either minutes, hours, days, weeks, or months:

Depending on the recurrence you choose, you can also configure the advanced settings.

If you choose days, you can configure the times:

If you choose weeks, you can configure both the days and times:

Months has two options: you can either configure month days and times, such as the 15th day and the last day, or week days and times, like the first Monday or the last Sunday.
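Behind the UI, a schedule trigger is stored as JSON. A minimal sketch, with an illustrative name, schedule, and pipeline reference:

{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T06:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "CopySalesPipeline", "type": "PipelineReference" } }
    ]
  }
}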

Tumbling Window Trigger:

Tumbling window triggers fire at a periodic interval from a specified start time, and each run is tied to a fixed, non-overlapping time window. Compared to schedule triggers they are very useful when you are dealing with historical data to copy or move, because the trigger can run for windows in the past. For instance, if you want data copied from a SQL database, the pipeline can run for past windows and copy that slice of data. With a one-hour window, the trigger passes each window's start and end time into the SQL query, and the data between the start and end time is returned and saved into a destination of your choice.

Tumbling window triggers have the same settings as schedule triggers for start date, end date, and activation. However, the recurrence setting is different: you can only choose minutes or hours:

You can also specify several advanced settings:
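A useful detail of tumbling windows is that the window boundaries are exposed to the pipeline through trigger outputs. A minimal sketch of the relevant part of a tumbling window trigger definition (the pipeline and parameter names are illustrative):

{
  "name": "HourlyWindowTrigger",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Hour",
      "interval": 1,
      "startTime": "2024-01-01T00:00:00Z",
      "maxConcurrency": 4
    },
    "pipeline": {
      "pipelineReference": { "referenceName": "CopyHourlySlice", "type": "PipelineReference" },
      "parameters": {
        "windowStart": "@trigger().outputs.windowStartTime",
        "windowEnd": "@trigger().outputs.windowEndTime"
      }
    }
  }
}

The pipeline can then use windowStart and windowEnd in its source query to copy only the data that belongs to that window.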

Event Triggers:

Using event-based triggers we can run pipelines in response to an event in Azure Blob Storage. The most common use case is reacting to events such as file arrival or deletion in blob storage. Azure Data Factory relies on its integration with Azure Event Grid for this. Event-based triggers work not only with Blob Storage but with ADLS too. Event triggers support a many-to-many relationship: a single trigger can start multiple pipelines, and multiple triggers can start a single pipeline.

Event triggers do not have settings for start date and end date, but you can choose whether or not to activate the trigger immediately after you publish it. The main settings for event triggers are container and blob path. Blob path can begin with a folder path and/or end with a file name or extension:

Trigger Now

Trigger now isn't really a trigger type, it's more like a trigger action. You can manually trigger a pipeline, just like debugging pipelines. After you have triggered a pipeline, you can open the Monitor page to check the status and see the output.

Adding triggers to pipelines

Once you have created your triggers, open the pipeline that you want to trigger. From here, you can trigger now or click add trigger, then New/Edit:

This opens the add triggers pane, where you can select the trigger:

In the triggers tab, you can now see that the trigger has a pipeline attached to it, and you can click to activate it:

6. What are ARM templates in ADF? What are they used for?

ARM templates are a form of infrastructure as code, a concept where you define the infrastructure you need to be deployed. You no longer need to click around the portal creating virtual machines or writing scripts to deploy a storage account. Instead, the template defines the resources, and the Azure ARM management layer is responsible for creating the infrastructure.

These templates use a declarative syntax to let you define your deployment in the form of JSON (JavaScript Object Notation) files.

The primary benefits of using ARM are:

  • Consistency: ARM provides a consistent way to create, deploy, and manage Azure resources.
  • Modularity: ARM enables you to define your infrastructure as a set of reusable templates, making it easier to manage and maintain.
  • Automation: ARM enables you to automate the deployment and management of your infrastructure, saving time and reducing errors. If you have deployed pipelines to UAT and want to do the same in production, you can use the exported template.
  • Tagging: ARM enables you to tag your resources, making it easier to organize, track, and manage them.
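For illustration, the ARM template that ADF exports is a JSON document describing the factory's resources. A heavily trimmed sketch (the factory name parameter and pipeline name are illustrative):

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": { "type": "string", "defaultValue": "my-adf-dev" }
  },
  "resources": [
    {
      "type": "Microsoft.DataFactory/factories/pipelines",
      "apiVersion": "2018-06-01",
      "name": "[concat(parameters('factoryName'), '/CopySalesPipeline')]",
      "properties": { "activities": [] }
    }
  ]
}

At deployment time, the parameters (factory name, connection strings, and so on) are overridden with the values of the target environment.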

7. How do you deploy Data Factory code to higher environments?

  1. Create a feature branch.
  2. Create a PR to merge the code into the Dev (collaboration) branch.
  3. Publish the code from Dev to generate the ARM templates.
  4. This can trigger an automated CI/CD Azure DevOps pipeline to promote the code to higher environments.

For more details, see:

https://www.mssqltips.com/sqlservertip/6510/using-azure-devops-ci-cd-to-deploy-azure-data-factory-environments/

8. What difficulties have you faced while copying data from on-premises to the cloud, and how did you resolve them?

The following challenges are common:

Throughput/Speed:

We may hit throughput limits during the copy, which makes large copies slow.

When you copy binary files and other files as-is, the copy data process is fast and simple, like copying files on your computer. You take one file, and copy it into a different location:

However, the copy data activity is powerful. You are not limited to copying files as-is; the copy data activity can do some pretty useful things during copying:

Serialization and deserialization can be explained as converting file formats during copying. For example, we can use a CSV file as a source and copy it into an Azure SQL Database destination. To do that, we need to read the source data in CSV format and transfer it as bits and bytes over the network (serialization), then convert that stream of bits and bytes to a SQL format so we can load it into our database (deserialization).

The following settings on the copy activity can improve performance:

Enable Staging:

With staged copy, the data is first written to an interim staging store (for example Blob storage) and then loaded into the sink. A common use is to compress the data at the source, move the compressed data over the network, and decompress it near the destination; you don't have to copy the file first and then zip or unzip it in a separate step. Used appropriately, this improves copy performance over slow connections.

Degree of Copy Parallelism:

The degree of copy parallelism specifies the maximum number of connections that can read from your source or write to your sink in parallel. Increasing it can improve the copy activity's throughput.

Data Integration Units (DIUs):

A Data Integration Unit is a measure that represents the combination of CPU, memory, and network resources allocated to a single copy operation. The more DIUs you specify, the more power you throw at the copy data activity, and the more power you use, the more you pay.
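As a hedged sketch, this is roughly how those knobs appear in a Copy activity definition; the activity name, source/sink types, staging linked service, and the numbers are illustrative:

{
  "name": "CopyOnPremToLake",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "SqlServerSource" },
    "sink": { "type": "ParquetSink" },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": { "referenceName": "StagingBlobLS", "type": "LinkedServiceReference" },
      "path": "staging"
    },
    "parallelCopies": 8,
    "dataIntegrationUnits": 16
  }
}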

9. What are the different activities you have used in data factory?

Here are a few of the most used activities:

  • Copy Data Activity to copy the data between datasets.
  • ForEach Activity for looping.
  • Get Metadata Activity that can provide metadata about any data source.
  • Set Variable Activity to define and initiate variables within pipelines.
  • Lookup Activity to do a lookup to get some values from a table/file.
  • Wait Activity to wait for a specified amount of time before/in between the pipeline run.
  • Validation Activity will validate the presence of files within the dataset.
  • Web Activity to call a custom REST endpoint from an ADF pipeline.

10. Can you execute For-each inside another for-each activity?

No, a ForEach activity cannot be nested directly inside another ForEach activity. A common workaround is to put the inner loop in a separate pipeline and call that pipeline with an Execute Pipeline activity inside the outer ForEach.

11. What are the different Metadata options that you get in Get Metadata activity?

You can specify the following metadata types in the Get Metadata activity field list to retrieve the corresponding information:

  • itemName: Name of the file or folder.
  • itemType: Type of the item. Returned value is File or Folder.
  • size: Size of the file, in bytes. Applicable only to files.
  • created: Created datetime of the file or folder.
  • lastModified: Last modified datetime of the file or folder.
  • childItems: List of subfolders and files in the given folder. Applicable only to folders. Returned value is a list of the name and type of each child item.
  • contentMD5: MD5 of the file. Applicable only to files.
  • structure: Data structure of the file or relational database table. Returned value is a list of column names and column types.
  • columnCount: Number of columns in the file or relational table.
  • exists: Whether a file, folder, or table exists. If exists is specified in the Get Metadata field list, the activity won't fail even if the file, folder, or table doesn't exist. Instead, exists: false is returned in the output.

To demonstrate the Get Metadata activity at work, I will create a pipeline with the following logic:

  • Read the list of the files available in the source folder, using Get Metadata activity and pass this data to ForEach activity
  • Within the ForEach activity, read the properties of each file, using another Get Metadata activity and pass it to conditional activity, to determine if the file has been modified within the last 7 days
  • Copy each recently changed file into the destination database.

Here are the steps to create this pipeline:

Next, click the '+New' button to add a metadata field and select Child Items from the dropdown list; this field will produce the names of the files in the csvfiles container:

Next, let's add a ForEach activity to our pipeline (I've named it ForEach_AC), link it to the Success end of the Get_Folder_Metadata_AC activity, and add the following expression to its Items text box: @activity('Get_Folder_Metadata_AC').output.childItems. This expression extracts the childItems value from the output of the Get_Folder_Metadata_AC activity:

Next, let us switch to Activities tab and click the Add activity button to start adding child tasks to the ForEach activity:

Let us switch to the Dataset tab to specify dataset details. Because this activity is going to receive its file name from the parent activity, its source needs to be a parameterized dataset. I have created a parameterized blob storage dataset (I have named it as blobSTG_DS3) with the FileName parameter.

Next, let’s return to Get_File_Metadata_AC activity, select dataset BlobSTG_DS3 dataset we just created and enter an expression @item().name into its FileName parameter text box. This expression is going to pass the next file name value from ForEach activity’s item collection to the BlobSTG_DS3 dataset:

Next, let's add the Last modified and Item Name fields using the New button; these fields will be required for the subsequent activities:

Next, let’s add If Condition activity (I’ve named it as CheckDate_AC) and link it to the Success end of the Get_File_Metadata_AC activity:

Let's switch to the Settings tab and enter the following expression: @greaterOrEquals(activity('Get_File_Metadata_AC').output.lastModified, adddays(utcnow(), -7)). This expression checks whether or not the Last modified field obtained from the previous activity falls within the last 7 days:

Next, switch to Activities tab and click Add If True Activity button, to start building tasks which will be executed when the condition evaluates to true:

Within the child activities window, add a Copy activity (I've named it Copy_Data_AC), select the BlobSTG_DS3 dataset as its source and assign the expression @activity('Get_File_Metadata_AC').output.itemName to its FileName parameter. This ensures that the next file name, extracted by the Get_File_Metadata_AC activity, is passed as the input file name for the copy activity. Here's the screenshot:

Next, switch to the Sink tab, select the FactInternetSales_DS dataset we created earlier and enter the following purge query for the destination table: DELETE FROM FactInternetSales. This ensures that the activity doesn't fail with duplicate key errors in case the rows we're transferring already exist in the destination table:

Finally, let’s click the navigation link at the top of the screen and return to the parent pipeline’s design screen:

Now that we have completed building the pipeline, we can test it. I will execute it twice to test the following scenarios:

  • The first execution will be done with older files in the source container
  • The second execution will be done with a newly modified file added to the source folder.

Let us open the blob storage page and ensure that all the files existing in its csvfiles container are dated more than 7 days from the execution date:

Let us start the pipeline in the debug mode and examine execution logs in the Output window:

As you can see from the logs, all the activities except the copy activity executed successfully. The copy activity did not run because the files in the source container are older than 7 days.

Next, I will upload a new file, using Upload button from Azure portal’s blob storage page:

Let us execute the pipeline in a debug mode again and examine execution logs. As you can see from the logs, this execution included the copy activity as well, which is what we expected:

As usual, we will need to publish the changes, to ensure that they are permanent.

12. If you want to use the output of a query in your pipeline, which activity would you use?

  • Lookup activity helps in retrieving lookup datasets from any Azure data factory-supported resources.
  • From a database, we can get the dataset using a table, a query, or a stored procedure.

In the case of files stored in Azure storage:

In the case of data stored in the database:
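For the database case, a rough sketch of a Lookup activity that runs a query (activity, dataset, and column names are illustrative):

{
  "name": "LookupMaxDate",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "sqlReaderQuery": "SELECT MAX(added_date) AS MaxDate FROM dbo.Sales"
    },
    "dataset": { "referenceName": "AzureSqlSalesDataset", "type": "DatasetReference" },
    "firstRowOnly": true
  }
}

A downstream activity can then reference the result with an expression such as @activity('LookupMaxDate').output.firstRow.MaxDate (or .output.value when firstRowOnly is false).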

13. How can you verify the presence of a file in storage?

Validation activity can be used to check the existence of a file.

The configuration involves three short steps in the ADF Studio UI (the original screenshots are omitted here).

Make sure that you pass the correct file name to the @dataset().FileName parameter. Passing an incorrect file name will cause the Validation activity to wait until its timeout expires.
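A rough sketch of a Validation activity definition, assuming an illustrative parameterized dataset and timeout values:

{
  "name": "CheckFileExists",
  "type": "Validation",
  "typeProperties": {
    "dataset": {
      "referenceName": "SourceFileDataset",
      "type": "DatasetReference",
      "parameters": { "FileName": "@pipeline().parameters.FileName" }
    },
    "timeout": "0.00:10:00",
    "sleep": 30
  }
}

Here the activity checks for the file every 30 seconds and gives up after 10 minutes.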

14. Have you used the Execute Notebook activity? How do you pass parameters to your notebooks in the Execute Notebook activity?

Using the Base parameters property we can pass parameters to the notebook; if a parameter is not supplied by the activity, the default value defined in the notebook is used.
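A rough sketch of a Databricks Notebook activity with base parameters (the notebook path, parameter names, and linked service are illustrative); inside the notebook, the values are typically read with dbutils.widgets.get:

{
  "name": "RunTransformNotebook",
  "type": "DatabricksNotebook",
  "linkedServiceName": { "referenceName": "AzureDatabricksLS", "type": "LinkedServiceReference" },
  "typeProperties": {
    "notebookPath": "/Shared/transform_sales",
    "baseParameters": {
      "run_date": "@formatDateTime(pipeline().TriggerTime, 'yyyy-MM-dd')",
      "env": "dev"
    }
  }
}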

15. How do you handle incremental Data in Data factory?

There are a couple of ways to handle incremental data:

Delta data loading from database by using a watermark:

  1. Define a watermark in your source database. A watermark is a column that has the last updated time stamp or an incrementing key.
  2. The delta loading solution loads the changed data between an old watermark and a new watermark.

The workflow for this approach is depicted in the following diagram:

Example –

https://learn.microsoft.com/en-us/azure/data-factory/solution-template-delta-copy-with-control-table
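The core of the pattern is a Copy activity source query bounded by the old and new watermark values, which typically come from two Lookup activities. A rough sketch (activity, table, and column names are illustrative):

"source": {
  "type": "AzureSqlSource",
  "sqlReaderQuery": "SELECT * FROM dbo.Sales WHERE LastModifiedTime > '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}' AND LastModifiedTime <= '@{activity('LookupNewWatermark').output.firstRow.WatermarkValue}'"
}

After the copy succeeds, a final activity writes the new watermark value back to the watermark table.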

Delta data loading from SQL DB by using the Change Tracking technology:

Change Tracking: A lightweight solution in SQL Server and Azure SQL Database, providing an efficient change tracking mechanism for applications.

It enables an application to easily identify data that was inserted, updated, or deleted.

The workflow for this approach is depicted in the following diagram:

Example  –

https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-change-tracking-feature-powershell

Loading new and changed files only by using LastModifiedDate:

One can copy new and changed files only by using LastModifiedDate to the destination store.

  1. ADF will scan all the files from the source store,
  2. Apply the file filter by their LastModifiedDate,
  3. Only copy the new and updated file since last time to the destination store.

Please be aware that if you let ADF scan huge amounts of files but you only copy a few files to the destination, this will still take a long time because of the file scanning process.

Example –

https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-lastmodified-copy-data-tool

Using the Qlik Replicate tool:

Qlik Replicate empowers organizations to accelerate data replication, ingestion and streaming across a wide variety of heterogeneous databases, data warehouses, and big data platforms.

To know more just explore this tool.

16. How do you send email notifications on pipeline failure?

There are a couple of ways to send email notifications:

  1. Using Logic Apps with Web/Web hook activity
  2. Alert and Metrics

Triggering the Logic App from ADF:

Suppose you have already created a pipeline with some activities in ADF.

Now we want to send an email if an error occurs within an activity. Add a web activity to the canvas and connect another activity to this new activity using the arrow. When the connection has been made, right-click on the connection to change it to a Failure precedence constraint.

This will change the color of the connector to red.

Now we need to go to the Azure Logic App and copy the HTTP POST URL:

Paste this URL into the URL field in the settings pane of the Web activity. The method needs to be set to POST.

We also need to add a header, where we will set the Content-Type to application/json. In the body, we enter the following JSON (following the structure mentioned before):
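The exact property names must match the schema defined in the Logic App's HTTP trigger; the ones below, and the referenced activity name, are illustrative:

{
  "DataFactoryName": "@{pipeline().DataFactory}",
  "PipelineName": "@{pipeline().Pipeline}",
  "ErrorMessage": "@{activity('Copy Data').Error.message}",
  "EmailTo": "dataops@contoso.com"
}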

We’re using system parameters to retrieve the name of the data factory and the name of the pipeline. All the other fields in the settings pane can be left as-is.

At the end of the pipeline, we can add a similar Web activity to send an email notifying users that the pipeline has finished successfully. This time we use the following body:
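For instance, a success body along these lines (again, the property names are whatever your Logic App expects):

{
  "DataFactoryName": "@{pipeline().DataFactory}",
  "PipelineName": "@{pipeline().Pipeline}",
  "Message": "Pipeline finished successfully.",
  "EmailTo": "dataops@contoso.com"
}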

Now we can run the pipeline and wait to see if any emails come in:

Configuring Multiple Notifications

Suppose we have more than one activity for which we want to send a notification if something goes wrong. Simply connecting the failure outputs of several activities to a single Web activity won't work in ADF, because multiple dependencies on an activity are evaluated with AND logic: the Web activity would only run if all of the connected activities failed.

Alert and Metrics:

Azure Data Factory pipeline run history can be accessed in the Studio of the respective Factory. The runs, inputs, outputs, and failure details will be available for each run.

Failed runs in a Data Factory Pipeline can be monitored by navigating to Monitor -> Alerts & metrics. You will need to identify the monitoring criteria to define the alert logic and evaluation period.

You can set up notification channels to get alerts and stay informed on violations. Azure Action groups enable you to group several notification channels and instantly notify failure alerts to multiple channels.

The image shown below displays a sample alert triggered using the alert rule configured in the Data Factory:

17. Describe the use of ADF in your project?

There are a couple of things you can mention to the interviewer:

  1. How you copy data from source
  2. How you do data transformation (either using ADF or by calling Databricks notebooks)
    1. If you mention ADF, list the different activities you used for the transformations.
  3. How you store the final data.
  4. What you have done for performance improvement (refer to Question 8)

Note – For heavy data transformation and cleansing, ADF on its own is often not the best fit; in many organizations that transformation work is done in Databricks notebooks, with ADF acting as the orchestrator.

In short, describe the data sources you had (CSV, SQL, and others), how many pipelines you used in the project, and which transformation activities you used in them.

18. How do you test your pipeline when you have created code in feature branch?

Use debug mode – whenever you are working in a feature branch, you test your changes with the Debug option, which runs the pipeline using the code in your current branch without publishing it.

Monitor Section –> Pipeline Runs –> Debug

19. Have you used Get metadata Activity? Can you give one use case for ‘child items’ property in get metadata activity?

Get the list of the files or folders from a specific location in Azure blob storage:

  • Go to the data factory and create one pipeline. 
  • Create a linked service which will point to the Azure blob storage location.
  • Create a dataset for Azure blob storage using the linked service created in above steps.
  • Select the Get Metadata activity and go to the Dataset tab. In the Dataset field, select the dataset created in the step above to connect to the Azure blob storage.
  • Scroll down to the Field list attribute and click the plus sign in front of it. A dropdown box with multiple properties appears. Select Child Items, since we want the list of all file and folder names available inside the folder location. This returns the list of all the folders and files available inside the pointed-to location, along with their type.

  • Our pipeline is now ready to run. Just go to the debug tab and click on the debug. This will execute the pipeline. It will take a few seconds based on the size of your folder and after that at the bottom pane you will see that your activity has been completed successfully.

Click on the output of the Get Metadata activity. You will see the output JSON with the childItems property, which contains an array of file and folder names along with their type: for files the type is File, and for folders the type is Folder.
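The childItems output has roughly this shape (file and folder names are illustrative):

{
  "childItems": [
    { "name": "sales_2024_01.csv", "type": "File" },
    { "name": "sales_2024_02.csv", "type": "File" },
    { "name": "archive", "type": "Folder" }
  ]
}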

Iterate over list of files available inside folder in Azure storage location:

  • For iterating over the file names you have to use the ForEach activity. Connect the output of the Get Metadata activity to the input of the ForEach activity so you can iterate over the Get Metadata output, which contains the array of file names.
  • Under the ForEach activity's Settings tab you will see the Items field. Provide the output of the Get Metadata activity there (for example, an expression like @activity('Get Metadata1').output.childItems, where the activity name is whatever you called yours). This is the array of file names over which you want to iterate.

Let’s see how you can get the last modified date of a file using getmetadata activity with an example.

  • Under the Dataset tab you will see another property named Field list. Click the plus sign in front of this property to add a field.
  • Select the last modified date property from the dropdown.

  • In this case we want the last modified date of a file or folder, which is why we selected the Last modified property; it pulls out the last updated timestamp for a file or folder.
  • Our pipeline is almost ready to run. Go to the Debug option and click Debug; it will execute the pipeline. For this demo we have added just one activity, so the Get Metadata activity output appears under the Output tab.

Go to the output tab and open the output for getmetadata activity. You will see the JSON and here you can see that the last modified date property is giving you the last modified timestamp for your files or folder which you have selected in the dataset.

20. What are mapping and wrangling data flows in Data Factory?

Mapping Data Flow –

  • Mapping data flows are visually designed data transformations in Azure Data Factory.
  • When there is a situation like you need to perform transformations using two or more datasets then you use a Mapping data flow.
  • You can perform several transformations such as Filter, JOIN, Aggregate, Union, Lookup, Sort, etc using mapping data flows.
  • Mapping data flows can be executed within ADF pipelines using data flow activities.
  • Azure Data Factory handles the code transformation and execution of Mapping Data Flow behind the scenes.
  • Mapping Data Flow activities can be created individually or within an Azure Data Factory pipeline.

Wrangling Data Flow –

  • Wrangling data flows (surfaced in ADF as Power Query) provide code-free, interactive data preparation using the familiar Power Query Online mashup editor.
  • They are aimed at data exploration and preparation, whereas mapping data flows are aimed at visually building transformation logic; at run time, both are executed at scale on ADF-managed Spark infrastructure.

Steps to create a Mapping Data Flow:

  1. Open the Azure Data Factory using the Azure portal, then click on Author & Monitor

Click on the Author button, then on the Data flows option. Click the three dots and select the New data flow option.

2. Select Mapping Data Flow, then confirm.

Steps to build transformation logic in the data flow canvas:

Once you create your Data Flow, you’ll be automatically sent to the data flow canvas.

Note: The assumption is that you are already aware of the basic building blocks of a data factory- like creating linked services, Pipelines, etc.

  1. In the data flow canvas, add a source by clicking on the Add Source box.

2. Name your source. Click on New to create a new source dataset.

3. Choose Azure Blob Storage. Click Continue.

4. Choose DelimitedText. Click Continue.

5. Name your dataset; let's say empDataDataset. In the linked service dropdown, choose an existing linked service if you have already created one, or click the +New button to create a new linked service.

6. Once you’re back at the dataset creation window choose your File path. As the CSV file has headers, check First row as header. Select From connection/store to import the header schema directly from the file in storage. Click OK when done.

7. To add a transformation, click on the + icon on the data source which is next to your source node on the data flow canvas as specified in the screenshot below.



21. Consider that you have to load data from 100 tables into another target; how would you design a pipeline for this?

To achieve this, get the list of tables (for example from the system information schema or a metadata lookup file) and use a ForEach loop to copy the data from each table to the target.

Here are the important steps to keep in mind in order to design this process well.

Understanding your data source.

i. Check the source data tables that have been asked to be copied.

ii. Determine the number of rows and columns (yes, columns matter) in each table to assess their size.

iii. Understand how the source tables change, their delta load logic.

iv. Know regular loading times for these source tables.

In this example, let source database = Oracle DB 1.

i. In this source, a daily truncate insert operation is performed on 47 tables. 42 of these tables have a size that is under 40 MB, making them relatively tiny. In contrast, 5 tables totaling more than 100 MB in size are considerably bigger.

ii. Each day, 8 large tables are loaded incrementally from the source.

iii. They all get updated somewhere between 1:50 and 2:20 AM.

Deciding how many pipelines to create.

When developing new pipelines, it is crucial to prioritize their long-term sustainability, especially when considering the possibility of having a large number of pipelines in production, potentially reaching thousands.

To maintain this, we will design pipelines such that each pipeline serves a single purpose. This does not imply that you should build a new pipeline for each table (avoid this) since there may be hundreds of them and you would not be able to maintain it. So what is a single purpose for a pipeline?

In this example, I will create 3 pipelines.

i. A pipeline that loads the 42 small daily full load tables.
ii. A pipeline that loads the 5 large daily full load tables.
iii. A pipeline that loads the 8 large incremental load tables.

This separation is based on the size of the tables and the logic used to update them.

How to create a good lookup source file.

The first task in each pipeline is a Lookup activity that reads a lookup file from your storage account to obtain the list of tables it needs to load. Make your lookup files as detailed as necessary, because they dynamically drive the rest of your activities.

At least for me, the following fields must be filled out in the file.

i. source table name: the source table that you are copying.
ii. sink file name: the sink file name to be stored in your landing zone.
iii. delta load condition: the SQL WHERE condition to copy only modified data from the source. For truncate-and-load tables it could be WHERE 1=1 or just blank; for delta tables it could be anything, for example WHERE updated_at >= current_date() - 1.
iv. source job name: the source job in charge of populating the source table.
v. target table name: name of the target table in Databricks.

Three lookup files, one for each pipeline, will be made. Here is an example of the lookup file created for the 5 large full load tables.
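Since the original screenshot is not reproduced here, an illustrative lookup file in JSON form (every value below is made up):

[
  {
    "source_table_name": "SALES.ORDERS",
    "sink_file_name": "orders.parquet",
    "delta_load_condition": "WHERE 1=1",
    "source_job_name": "ORACLE_ORDERS_DAILY",
    "target_table_name": "bronze.orders"
  },
  {
    "source_table_name": "SALES.CUSTOMERS",
    "sink_file_name": "customers.parquet",
    "delta_load_condition": "WHERE 1=1",
    "source_job_name": "ORACLE_CUSTOMERS_DAILY",
    "target_table_name": "bronze.customers"
  }
]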

Creating good data sets.

Like pipelines, you need to create datasets that are sustainable too.

In this example, we will create 3 datasets. Assume the linked services are already created.

i. One to read lookup files in storage account.
ii. One to define landing zone sink for copy activity.
iii. One to define source database for copy activity.

Do not create a unique dataset for each lookup file. It is better to have all of your lookup files in one folder inside a container and create a dataset where you can pass the filename dynamically from a pipeline. I am keeping all of my lookup files in a directory /meta/

Example, creating dataset:

Example, using dataset in pipeline:
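Since the screenshots aren't reproduced here, a rough sketch of such a parameterized lookup dataset (linked service, container, and parameter names are illustrative):

{
  "name": "GenericLookupFile",
  "properties": {
    "type": "Json",
    "linkedServiceName": { "referenceName": "MetadataStorageLS", "type": "LinkedServiceReference" },
    "parameters": { "fileName": { "type": "string" } },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "meta",
        "fileName": "@dataset().fileName"
      }
    }
  }
}

In the pipeline, the Lookup activity then passes a concrete file name (for example large_full_load_tables.json) to the fileName parameter.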

How to set dependency the right way.

It becomes a challenge to set dependencies in a pipeline that loads multiple tables at once, because we cannot use event-based triggers. So how can you ensure that the data you are copying has been updated when using time-based triggers? Let me explain using this example:

As you can see from the preceding text, the source tables are changed between 1:50 and 2:20 AM, so I'll create a time-based trigger for 2:00 AM.

The data source owners will build a batch status table in which they will log a status each day after each table is updated. In our scenario, a vendor completes this.

I'll construct a dynamic pipeline that checks this configuration table to determine whether a table's status has been updated for today, and waits if it has not yet been updated.

By entering the tablename and batch name from my lookup activity for today, I will use this pipeline inside my foreach loop activity to determine if the source table has already been refreshed and wait if it has not.

Batch Check pipeline:

Usage :

[Optimization] Copying data the right way.

Although copying data is one of the easiest tasks in ADF, a lot can go wrong in this situation. Make sure that you are efficiently querying the source table and that you are just copying the necessary data.

After the batch check pipeline in this example, I’ll add a delete activity and a copy activity inside the foreach activity. If the sink file already exists, the delete activity will remove it, and the copy activity will copy the source file when I pass a dynamic delta load condition from my lookup file.

Two things to keep in mind:

Here, I'm leaving the ForEach loop's Sequential option unchecked. This guarantees that the copy operations run concurrently. Avoid checking this option: if it takes 30 minutes for each of your five tables to be copied to the landing zone, the total running time when done sequentially would be 5 * 30 minutes.

Do not maintain old sink files, rather overwrite them. You don’t want to be logging into your storage account after a year and see 365*55 old sink files that have no use.

[Optimization] Working with Databricks notebook activity.

I used to believe that I would add a databricks notebook step within the for loop when a copy activity was complete to read the files from the landing zone and load my databricks table. My pipeline would be completed rapidly because these tasks would likewise be carried out concurrently.

NOTE – However, this was not the case. Doing this can backfire and make your pipeline run longer, so you have to do it in an optimized manner.

For small tables I will create a notebook that reads the lookup files and creates a dataframe, then iterates through the rows in the dataframe and loads each table from landing zone to databricks table. (this does not take as long as copy activity).

Let's say you run 40 notebooks in parallel, but each time only 4 notebook requests succeed (for example because of workspace or cluster concurrency limits) and the rest retry. If your retry interval is set to 3 minutes, this adds N retries * 3 minutes to your pipeline runtime.

Can you do the same for large tables? Probably not. Iterating through a dataframe and loading large tables one by one can take very long, so a single notebook would run much longer. In this case, you might be better off running notebook calls in parallel inside the ForEach activity. What you can do is batch large tables into different pipelines and then run a single notebook to load the tables in each batch at once.

Reference – https://medium.com/@ishanpradhan/constructing-and-improving-adf-pipelines-to-effectively-copy-multiple-tables-823a06caae14

22. How can you handle full and incremental data loading from source to destination using Azure Data Factory?

Objective: Our objective is to load data incrementally or fully from a source table to a destination table using Azure Data Factory Pipeline. First, we will go through the concept and design, and later the implementation steps using Azure Data Factory.

During the first run, the load is a full load of the tables. Then, insert a few records and perform the load once again; this time it should be an incremental load. The configuration table is updated automatically after the first (full) run, so the second run automatically performs an incremental load.

Development Process using Azure Data Factory:
1. First, create two Azure SQL DB resources: a. Source DB and b: Target DB
2. Create the Azure Data Factory resource.
3. Create the configuration Table.
4. Create two tables in the source DB e.g. customer table and transaction table.
5. Load data into the configuration table for the customer table and the transaction table, plus an extra row for the promotion table. After loading, the data should look like this:

6. Load some data into the customer table and the transaction table.
7. Go to the Azure Data Factory > Manage > Linked services > New and create two new linked services one for source and another for target.

8. Then, create two parameterized datasets (Azure Data Factory > Author > Datasets > New dataset), one for the source and another for the target, so that we can pass the table names at runtime from the configuration table. This lets a single pipeline handle multiple tables.

9. Go to the Azure Data Factory > Author > Pipeline > New Pipeline
10. First, create a lookup activity with query:
select * from [schema].[configuration table name] where Active_Indicator=1;
11. Create a ForEach activity to iterate over each table and perform the appropriate activity (full or incremental load) for it.

12. Then, create an If Condition activity: if 'Incremental_Full_Load' is 1, go to the 'Incremental Load Activity' (associated with the True branch of the condition), else go to the 'Full Load Activity' (associated with the False branch).

13. Within the Incremental Load Activity,
a. First create a lookup to get the 'Max_Last_Updated_Date' from the configuration table for each desired table.
b. Then, using a Copy Data activity, move the data from source to target.
c. After that, using a lookup activity, get the max value of 'added_date' from the target table and update that value in the configuration table for the row with the matching 'Table_Name' value.

14. Within the Full Load Activity, we perform steps 'b' and 'c' only, with one small addition: in step 'c' we also set the 'Incremental_Full_Load' indicator to 1, so that the fully loaded tables automatically perform an incremental load from the second run onward.

Reference – https://apu-nandi88.medium.com/full-and-incremental-data-loading-concept-from-source-to-destination-using-azure-data-factory-dd39627d6f21

23. How do you handle duplicate rows in Data flows?

This is one of the most frequently asked questions in data engineering interviews. Let's see how we can remove duplicate records using a mapping data flow.

Consider an employee table with duplicate records of an employee.

Select a Data flow activity to remove duplicate records.

Select a dataset pointing to the Azure storage location where the data is placed.

Preview all the records from the dataset.

Select the specific records and check the dataset columns.

Check the selected specific records.

Add a Window transformation (window function) that assigns a row number within each group of duplicate records.

Preview the mapped data from the dataset.

Now filter the records where rownum == 1; since the duplicates are sorted with the latest entry first, this keeps only the most recent record.

Preview the records where rownum=1.

Now add the nonduplicate records to the sink.

Preview the non-duplicate/unique records.

Now check the destination folder in the storage account.

Reference – https://nidhig631.medium.com/removal-of-duplicate-records-in-adf-azure-data-factory-e7da613d1c1c

24. If your input has column ‘Title’ and you get bad records in that column, how do you handle it?

A Conditional Split transformation defines the maximum length of 'Title' as five characters: any row whose value is five characters or fewer (for example, length(Title) <= 5) goes into the GoodRows stream, and any row longer than five goes into the BadRows stream.

Now we need to log the rows that failed. Add a sink transformation to the BadRows stream for logging. Here, we’ll “auto-map” all of the fields so that we have logging of the complete transaction record. This is a text-delimited CSV file output to a single file in Blob Storage. We’ll call the log file “badrows.csv”.

The completed data flow is shown below. We are now able to split off error rows to avoid the SQL truncation errors and put those entries into a log file. Meanwhile, successful rows can continue to write to our target database.

If you choose the error row handling option in the sink transformation and set “Output error rows”, ADF will automatically generate a CSV file output of your row data along with the driver-reported error messages. You do not need to add that logic manually to your data flow with that alternative option. There will be a small performance penalty incurred with this option so that ADF can implement a 2-phase methodology to trap errors and log them.

25. What is the difference between pipeline and Dataflow?

A pipeline is an orchestrator and does not transform data itself. It manages a series of one or more activities, such as Copy Data or Execute Stored Procedure. A data flow is one of those activity types: it actually transforms the data and runs on Spark, which makes it very different from the pipeline that orchestrates it.

26. How can you open and read the contents of a file in ADF?

We can use a Web activity and provide the URL of the file that we want to read from the storage account. The output of the Web activity contains the response, i.e. the file contents. We can then use a Set Variable activity to capture the contents of the file from that output.
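A rough sketch of such a Web activity; the URL is illustrative, and for a private storage account you would typically authenticate with the factory's managed identity (extra headers such as x-ms-version may also be required depending on the endpoint):

{
  "name": "ReadConfigFile",
  "type": "WebActivity",
  "typeProperties": {
    "url": "https://mystorageacct.blob.core.windows.net/config/settings.json",
    "method": "GET",
    "authentication": { "type": "MSI", "resource": "https://storage.azure.com/" }
  }
}

The contents can then be referenced downstream with an expression like @activity('ReadConfigFile').output (the exact shape of the output depends on the file type).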

27. If you want to delete files in Blob storage, how can you do that?

We can use the Delete activity in a pipeline and configure it for the files that we want to delete. We can also attach a trigger if regular or event-based deletion is needed.
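A rough sketch of a Delete activity (the dataset name and settings are illustrative):

{
  "name": "DeleteStagedFiles",
  "type": "Delete",
  "typeProperties": {
    "dataset": { "referenceName": "StagingFolderDataset", "type": "DatasetReference" },
    "storeSettings": { "type": "AzureBlobStorageReadSettings", "recursive": true },
    "enableLogging": false
  }
}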

Transform data using mapping data flows

In this post, we’ll use the Azure Data Factory user interface (UX) to create a pipeline that copies and transforms data from an Azure Data Lake Storage (ADLS) Gen2 source to an ADLS Gen2 sink using mapping data flow. The configuration pattern in this tutorial can be expanded upon when transforming data using mapping data flow.

In this post, we'll do the following steps:

  • Create a data factory.
  • Create a pipeline with a Data Flow activity.
  • Build a mapping data flow with four transformations.
  • Test run the pipeline.
  • Monitor a Data Flow activity

Prerequisites

  • Azure subscription. If you don’t have an Azure subscription, create a free Azure account before you begin.
  • Azure storage account. You use ADLS Gen2 storage as both the source and sink data stores.

Create a data factory

In this step, you create a data factory and open the Data Factory UX to create a pipeline in the data factory.

  1. Open Microsoft Edge or Google Chrome. Currently, Data Factory UI is supported only in the Microsoft Edge and Google Chrome web browsers.
  2. On the left menu, select Create a resource > Integration > Data Factory:

3. On the New data factory page, under Name, enter ADFTutorialDataFactory.

The name of the Azure data factory must be globally unique. If you receive an error message about the name value, enter a different name for the data factory. (for example, yournameADFTutorialDataFactory). 

4. Select the Azure subscription in which you want to create the data factory.

5. For Resource Group, take one of the following steps:

a. Select Use existing, and select an existing resource group from the drop-down list.

b. Select Create new, and enter the name of a resource group.

6. Under Version, select V2.

7. Under Location, select a location for the data factory. Only locations that are supported are displayed in the drop-down list. Data stores (for example, Azure Storage and SQL Database) and computes (for example, Azure HDInsight) used by the data factory can be in other regions.

8. Select Create.

9. After the creation is finished, you see the notice in Notifications center. Select Go to resource to navigate to the Data factory page.

10. Select Author & Monitor to launch the Data Factory UI in a separate tab.

Create a pipeline with a Data Flow activity

In this step, you’ll create a pipeline that contains a Data Flow activity.

  1. On the home page of Azure Data Factory, select Orchestrate.

2. In the General tab for the pipeline, enter TransformMovies for Name of the pipeline.

3. In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow activity from the pane to the pipeline canvas.

4. In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow TransformMovies. Click Finish when done.

5. In the top bar of the pipeline canvas, slide the Data Flow debug slider on. Debug mode allows for interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes to warm up, so it is recommended to turn on debug first if you plan to do Data Flow development.


Build transformation logic in the data flow canvas

Once you create your Data Flow, you’ll be automatically sent to the data flow canvas. In case you are not redirected to data flow canvas, in the panel below the canvas, go to Settings and select Open, located beside data flow field. This will open the dataflow canvas.

In this step, you’ll build a data flow that takes the moviesDB.csv in ADLS storage and aggregates the average rating of comedies from 1910 to 2000. You’ll then write this file back to the ADLS storage.

  1. In the data flow canvas, add a source by clicking on the Add Source box.

2. Name your source MoviesDB. Click on New to create a new source dataset.

3. Choose Azure Data Lake Storage Gen2. Click Continue.

4. Choose DelimitedText. Click Continue.

5. Name your dataset MoviesDB. In the linked service dropdown, choose New.

6. In the linked service creation screen, name your ADLS gen2 linked service ADLSGen2 and specify your authentication method. Then enter your connection credentials. In this tutorial, we’re using Account key to connect to our storage account. You can click Test connection to verify your credentials were entered correctly.

7. Once you’re back at the dataset creation screen, enter where your file is located under the File path field. In this tutorial, the file moviesDB.csv is located in container sample-data. As the file has headers, check First row as header. Select From connection/store to import the header schema directly from the file in storage. Click OK when done.

8. If your debug cluster has started, go to the Data Preview tab of the source transformation and click Refresh to get a snapshot of the data. You can use data preview to verify your transformation is configured correctly.

9. Next to your source node on the data flow canvas, click on the plus icon to add a new transformation. The first transformation you’re adding is a Filter.

10. Name your filter transformation FilterYears. Click on the expression box next to Filter on to open the expression builder. Here you’ll specify your filtering condition.

11. The data flow expression builder lets you interactively build expressions to use in various transformations. Expressions can include built-in functions, columns from the input schema, and user-defined parameters.

In this tutorial, you want to filter movies of genre comedy that came out between the years 1910 and 2000. As year is currently a string, you need to convert it to an integer using the toInteger() function. Use the greater than or equals to (>=) and less than or equals to (<=) operators to compare against literal year values 1910 and 2000. Union these expressions together with the and (&&) operator. The expression comes out as:

toInteger(year) >= 1910 && toInteger(year) <= 2000

To find which movies are comedies, you can use the rlike() function to find pattern ‘Comedy’ in the column genres. Union the rlike expression with the year comparison to get:

toInteger(year) >= 1910 && toInteger(year) <= 2000 && rlike(genres, 'Comedy')

If you’ve a debug cluster active, you can verify your logic by clicking Refresh to see expression output compared to the inputs used. There’s more than one right answer on how you can accomplish this logic using the data flow expression language.

Click Save and Finish once you’re done with your expression.

12. Fetch a Data Preview to verify the filter is working correctly.

13. The next transformation you’ll add is an Aggregate transformation under Schema modifier.

14. Name your aggregate transformation AggregateComedyRatings. In the Group by tab, select year from the dropdown to group the aggregations by the year the movie came out.

15. Go to the Aggregates tab. In the left text box, name the aggregate column AverageComedyRating. Click on the right expression box to enter the aggregate expression via the expression builder.

16. To get the average of the column Rating, use the avg() aggregate function. As Rating is a string and avg() takes a numerical input, we must convert the value to a number via the toInteger() function. The expression looks like:

avg(toInteger(Rating))

Click Save and Finish when done.

17. Go to the Data Preview tab to view the transformation output. Notice only two columns are there, year and AverageComedyRating.

18. Next, you want to add a Sink transformation under Destination.

19. Name your sink Sink. Click New to create your sink dataset.

20. Choose Azure Data Lake Storage Gen2. Click Continue.

21. Choose DelimitedText. Click Continue.

22. Name your sink dataset MoviesSink. For linked service, choose the ADLS gen2 linked service you created in step 6. Enter an output folder to write your data to. In this tutorial, we’re writing to folder ‘output’ in container ‘sample-data’. The folder doesn’t need to exist beforehand and can be dynamically created. Set First row as header as true and select None for Import schema. Click Finish.

Now you’ve finished building your data flow. You’re ready to run it in your pipeline.

Running and monitoring the Data Flow

You can debug a pipeline before you publish it. In this step, you’re going to trigger a debug run of the data flow pipeline. While data preview doesn’t write data, a debug run will write data to your sink destination.

  1. Go to the pipeline canvas. Click Debug to trigger a debug run.

2. Pipeline debug of Data Flow activities uses the active debug cluster, but still takes at least a minute to initialize. You can track the progress via the Output tab. Once the run is successful, click on the eyeglasses icon to open the monitoring pane.

3. In the monitoring pane, you can see the number of rows and time spent in each transformation step.

4. Click on a transformation to get detailed information about the columns and partitioning of the data.

Reference – https://learn.microsoft.com/

Mapping data flows in Azure Data Factory

What are mapping data flows?

Mapping data flows are visually designed data transformations in Azure Data Factory. Data flows allow data engineers to develop data transformation logic without writing code. The resulting data flows are executed as activities within Azure Data Factory pipelines that use scaled-out Apache Spark clusters. Data flow activities can be operationalized using existing Azure Data Factory scheduling, control, flow, and monitoring capabilities.

Mapping data flows provide an entirely visual experience with no coding required. Your data flows run on ADF-managed execution clusters for scaled-out data processing. Azure Data Factory handles all the code translation, path optimization, and execution of your data flow jobs.

Getting started:

Data flows are created from the factory resources pane like pipelines and datasets. To create a data flow, select the plus sign next to Factory Resources, and then select Data Flow.

This action takes you to the data flow canvas, where you can create your transformation logic. Select Add source to start configuring your source transformation.

Authoring data flows

Mapping data flow has a unique authoring canvas designed to make building transformation logic easy. The data flow canvas is separated into three parts: the top bar, the graph, and the configuration panel.

Graph

The graph displays the transformation stream. It shows the lineage of source data as it flows into one or more sinks. To add a new source, select Add source. To add a new transformation, select the plus sign on the lower right of an existing transformation.

Configuration panel

The configuration panel shows the settings specific to the currently selected transformation. If no transformation is selected, it shows the data flow. In the overall data flow configuration, you can add parameters via the Parameters tab.

Each transformation contains at least four configuration tabs.

Transformation settings

The first tab in each transformation’s configuration pane contains the settings specific to that transformation. For more information, see that transformation’s documentation page.

Optimize

The Optimize tab contains settings to configure partitioning schemes, which you can use to tune the performance of your data flows.

Inspect

The Inspect tab provides a view into the metadata of the data stream that you’re transforming. You can see column counts, the columns changed, the columns added, data types, the column order, and column references. Inspect is a read-only view of your metadata. You don’t need to have debug mode enabled to see metadata in the Inspect pane.

As you change the shape of your data through transformations, you’ll see the metadata changes flow in the Inspect pane. If there isn’t a defined schema in your source transformation, then metadata won’t be visible in the Inspect pane. Lack of metadata is common in schema drift scenarios.

Data preview

If debug mode is on, the Data Preview tab gives you an interactive snapshot of the data at each transformation.

Debug mode

Debug mode allows you to interactively see the results of each transformation step while you build and debug your data flows. The debug session can be used both when building your data flow logic and when running pipeline debug runs that contain data flow activities.

Data Factory – Move files to time folder structure

Scenario:

I have a folder full of CSV files. Each file name is a date for which the file contains data (e.g., 2021-10-01T00:00:00Z). I want to organize these files into folders with a time hierarchy. Meaning, the top-level folders will be years (2022, 2023…), the second level will be months in that year (1, 2, 3…12), and the next level will be folders for a day in the month (1, 2…28/30/31). Each day folder will contain one or many files with data referencing that day.

This structure is important because it will enable you to use partitions when reading the data with tools like a serverless SQL pool or Spark pool in Synapse. You can read only the data required for the current query and save time and resources.
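
As a rough illustration of that benefit (this query is not part of the copy pipeline), a Synapse serverless SQL pool could read just one month of data by pointing at the year/month folders. The storage account and container names below are placeholders, and the query assumes the CSV files have a header row:

    -- Read only the files under the 2021/10 folder (one month of data)
    SELECT COUNT(*) AS october_rows
    FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/2021/10/*/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) AS rows;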

In this post, I’ll use Azure data factory to copy the files to the new structure. I’m assuming the files are already on an Azure storage account, in one container, and we want to build the new folder structure on another container.

Let’s launch data factory and create a new pipeline.

First, we’ll add 2 parameters to hold the container names (source and destination).

We’ll use a metadata activity to get a list of files in our source folder, so drag this activity into the canvas.

In the new metadata activity, on the settings tab, under “field list” click on new and add “child items”.

Create a dataset to your source folder of type blob storage or data lake storage. Set your container to the parameter we added. Leave the directory and file name blank. The file type can be binary since we are only copying the files, and not accessing their content in this pipeline.

Now add a foreach activity, and connect it with a success (green) link from the metadata activity.

On the items field, type in:

@activity('Get file names').output.childItems

This will make the ForEach activity loop over each result (one result per file in the source folder).
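
For reference, the childItems output of the Get Metadata activity has roughly the following shape (the file names are illustrative):

    {
        "childItems": [
            { "name": "2021-10-01T00:00:00Z", "type": "File" },
            { "name": "2021-10-02T00:00:00Z", "type": "File" }
        ]
    }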

Click on the + sign inside the foreach activity to add an activity that will run for each file, and then click “copy data”.

Click on the copy data activity you just added to open its settings.

On the source tab, create a new dataset, with the same properties as the one we created for the metadata activity, except now we’ll add a parameter to the file name.

Use this dataset in the source tab. A new parameter appears under the dataset for the file name. Click on “dynamic content” and enter @item().name. This expression will take the file name in each loop iteration.

You can check “delete files after completion” if you want the source files to be removed after successful copy.

Now on the sink tab, we need to create a new dataset to our destination folder. I am using Azure data lake storage. This time we are creating a parameter on the container and folder parts, leaving the file name blank.

Back on the sink tab, you now can insert a value for the dataset parameters. For the folder value, Click on “dynamic content” and paste in this:

@concat(
    formatDateTime(item().name, 'yyyy'),
    '/',
    formatDateTime(item().name, 'MM'),
    '/',
    formatDateTime(item().name, 'dd')
)

Explaining the expression:

item().name is the file name in the current loop iteration, something like 2021-10-01T00:00:00Z.

formatDateTime extracts part of the timestamp string,

and concat joins the parts into the full folder path.

In our example, that will be 2021/10/01 (the / symbol indicates moving down a folder level).

Our pipeline is ready; let’s click on Debug to test it.

We’ll need to supply the parameters, which are the names of our source and destination containers.

Our pipeline worked and copied the files to the right folders (creating those folders along the way).

Reference – https://www.madeiradata.com/

Which Data Factory activities can be used to make sure that files smaller than 1 KB are deleted from the source storage account?

The following activities will be used (a sketch of the size-check expression follows the list):

  • ForEach activity for iteration
  • Get Metadata to get the size of all files in the source storage
  • If Condition to check the size of the files
  • Delete activity to delete all files smaller than 1KB
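
For example, the Boolean expression inside the If Condition could look like the following (the Get Metadata activity name 'Get File Size' is illustrative); files for which it evaluates to False, i.e. files smaller than 1 KB (1024 bytes), are then handled by the Delete activity:

    @greaterOrEquals(activity('Get File Size').output.size, 1024)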

Demo Overview

In the demo that we will discuss in this article, we will create an Azure Data Factory pipeline that will read data stored in CSV files located in an Azure Blob Storage container, making sure that the file extension is CSV and that the size of the file is larger than or equal to 1 KB, and write the data to an Azure SQL Database table.

Prerequisites

In order to create that pipeline, make sure that you have an Azure Data Factory and an Azure Storage account where the CSV files are stored, as shown below:

And an Azure SQL Database where the data will be written. We need to add the current client IP address to the Azure SQL Database firewall settings in order to be able to connect using SSMS from your machine, and enable the Allow Azure Services and Resources to access this Server firewall setting to allow Azure Data Factory to access it, as shown below:

Linked Services and DataSets

The first step in creating the Azure Data Factory pipeline is creating the source and sink linked services and datasets. To achieve that, open the Azure Data Factory, click on Author & Monitor to launch the Data Factory UI.

From the opened Azure Data Factory UI in the Azure Portal, click on the Manage button to create the Linked Services, as shown below:

Now, we need to create the linked service that points to the Azure Blob Storage container where the CSV files are stored. To create the linked service, click on the Linked Services option under the Connections list, then click on the New button, as below:

From the New Linked Service window, choose the Azure Blob Storage data store, then click Continue to proceed:

In the new Azure Blob Storage Linked service window, provide a meaningful name for the linked service, the Integration Runtime that will be used to connect to the Azure Blob Storage, which is Azure IR in our case, the authentication method that will be used to connect to that storage account, the Azure Subscription where this storage account is created and the name of that storage account.

After providing all required information, click on Test Connection to verify whether we are able to connect to that storage account using the provided information or not then click Create if the connection is tested successfully, as shown below:

And the created Azure Blob Storage linked service that will act as the data source in our pipeline will be shown in the linked services list, as below:

With the source linked service created, we need to create the source dataset, which points to the container, folder or file under the source storage account that contains the source data.

To create a new dataset, click on the Author button, choose Datasets under the Factory Resources list, choose to create a New dataset, as shown below:

In the New Dataset window, choose Azure Blob Storage data store, then click Continue to proceed:

In the Select Format window, choose DelimitedText format as we will read from CSV files, as shown below:

From the DataSet Set Properties window, provide a meaningful name for the source dataset, the linked service that contains the connection information for that dataset, browse to identify the container and the folder where the data source files are located and check the “First row as header” option if the CSV files contain the names of the columns in the first row, then click OK to proceed:

Before saving all dataset settings, review the delimiter and path settings then click Save to create the dataset, as below:

You can also click on the Preview option to review the shape of the source data, as shown below:

Let us move to the next step, in which we need to create the linked service and dataset for the sink data store, which is the Azure SQL Database in this demo.

We will follow the same steps, but this time, from the New Linked Service data stores, we will choose the Azure SQL Database store then click Continue to proceed:

In the New Linked Service window, provide a meaningful name for the Azure SQL DB linked service, the Integration Runtime that will be used to connect to that database, which is the Azure IR in this scenario, the subscription where this database is created, the name of the Azure SQL server and database and finally the credentials that will be used to connect to that Azure SQL Database.

After providing all required information, click on Test Connection to verify whether we are able to connect to that Azure SQL Database using the provided information or not then click Create if the connection is tested successfully, as shown below:

And the created Azure SQL Database linked service that will act as the data sink in our pipeline will be shown in the linked services list, as below:

With the sink linked service created successfully, we will create the dataset that points to the database table where the data will be written in the Azure Data Factory pipeline.

Following the same previous steps, clicking on the New Dataset option from the Author window, select the Azure SQL Database as the data store type for that dataset then click Continue, as shown below:

In the dataset properties window, provide a meaningful name for the dataset, the linked service that contains the connection information, the name of the table where the data will be written, then click OK to proceed:

Now, review the dataset information then click Save to create the dataset, as shown below:

Well done! The linked services and datasets that will be used to connect to the source and the sink of our pipeline are created and configured successfully. Let us move to the next step in which we will create the Azure Data Factory pipeline.

Create a New Pipeline

To create a new pipeline, click on the New pipeline option under the Factory Resources of the Author page, as shown below:

In the new pipeline page, expand the General Activities and drag and drop the Get Metadata activity to the design surface. We will use this Get Metadata activity to get the list of all source files and retrieve the name of these files, so that it will be easier for us to filter the types of these files.

In the General tab of the Get Metadata activity, provide a meaningful name of the activity that reflects the purpose of that activity, as shown below:

In the Dataset tab of the Get Metadata activity, provide the name of the source dataset, which points to the Azure Blob Storage container, where the source files are stored.

In the Field List, click New to add a new argument and choose the Child Item as an argument to get the names of the source files, as below:

To test that activity, click on the Debug option, to execute that activity within the Azure Data Factory pipeline in the debug mode, then click to check the output of the activity execution, where it will return the list of files located in the source container and the names of these files, as shown below:

The next activity that we will add to the pipeline is the Filter activity, which will be used to take the name of the files from the Get Metadata activity and pass only the files with CSV extension.

The Filter activity can be found under the Iterations and Conditions activity list then dragged and dropped to the pipeline designer surface. After adding the Filter activity, drag the output of the Get Metadata activity to be the input of the Filter activity, then provide a meaningful name for the Filter activity, as shown below:

In the Settings tab of the Filter activity, specify the Items option value as @activity('<Get Metadata activity name>').output.childItems to return the list of files from the Get Metadata activity, and the Conditions option value as @endswith(item().name, 'csv') to pass only the files with the CSV extension, as shown below:

Let us test the Azure Data Factory pipeline until that point. Click on the Debug button to execute the pipeline under debug mode and confirm that both activities are executed successfully, and that the output of the Filter activity is the files with CSV extension, as shown below:

Now, we will add to the Azure Data Factory pipeline the ForEach activity that will help in iterating through all files passed from the Filter activity and check the size of these input files, if the size of the file is equal to or larger than 1KB, the file will be written to the Azure SQL Database table, otherwise, the file will be deleted from the source container.

Expand the Iterations and Conditional activities then drag and drop the ForEach activity to the designer surface. Once added, drag the output arrow of the Filter activity and drop it as an input to the ForEach activity and provide a meaningful name for that activity, as shown below:

In the Settings tab of the ForEach activity, check the option to loop over the input items Sequentially and provide @activity('filterInput').output.Value in the Items option to take the value returned from the Filter activity as an input to the ForEach activity, as shown below:

Inside the ForEach activity icon, click on the pencil icon to add a new sub-activity inside the ForEach activity. Here we will add a new Get Metadata activity that will check all the input files and return the size of each file. First, provide a meaningful name for that activity, as shown below:

Under the Dataset settings, we will create a new Azure Blob Storage dataset with DelimitedText format and add “Filename” as a parameter for that dataset with “CSV” as a default value, as shown below:

And set @dataset().filename as the file name value in the File Path setting of the dataset, as below:

Once the Dataset is created, go back to the Dataset settings of the Get Metadata activity and set the filename parameter value as @item().name to parse the name of the files that are returned from the dataset, then from the Field List, add a new argument and select the Size as an argument, to return the size of the files that are pointed by the source dataset, as shown below:

Let us test the Azure Data Factory pipeline execution to that point by clicking on Debug to execute the pipeline under the debug mode, and check the output of the Get Metadata activity, where you can see that it is executed inside the ForEach activity a number of times equal to the number of files in the source container and returns the size of each file, as shown below:

Now we will add a new activity to the Azure Data Factory pipeline inside the ForEach activity that will be used to act based on the source file size, where we will copy the data inside the file if the file size is larger than or equal to 1KB and delete the file that is smaller than 1KB.

The best activity that will help in that scenario is the If Condition activity, under the Iterations and Conditional activities list, which will take the input of the Get Metadata activity and check the returned size. After dragging the If Condition activity and dropping it inside the ForEach activity, drag the output arrow of the Get Metadata activity and use it as an input for the If Condition, then provide a meaningful name for the If Condition activity, as below:

In the Activities tab of the If Condition activity, provide this Expression: @greaterOrEquals(activity('GetFileSize').output.size, 1024) as the Boolean condition of the If Condition that will be used to evaluate the size of each file and ensure that it is equal to or larger than 1 KB, as below:

Now, click on the pencil icon to add a Copy activity when the condition evaluates True, where you need to provide a meaningful name for that copy activity as below:

And select the previously created Dataset that points to the Azure Blob Storage container as a source for the Copy activity, making sure to provide the @item().name as a value for the filename dataset parameter, as shown below:

Then select the Azure SQL Database dataset as a sink for the copy activity, as below:

Again, we need to add a Delete activity for the False If Condition evaluation, to delete the files that are smaller than 1KB, where you need to provide first a meaningful name for that activity as shown below:

And select the previously created Dataset that points to the Azure Blob Storage container as a source for the Delete activity, making sure to provide @item().name as a value for the filename dataset parameter, as shown below:

Then select the Azure Storage account that will be used for the Delete activity logging, as below:

Now, the Azure Data Factory pipeline is ready with all loops and conditions. Let us execute it under debug mode, using the Debug button, to confirm that it is working as expected and you can later create a trigger to schedule it.

After executing it, you will see that, as we have 10 files in the source container that meet both the CSV and 1 KB size conditions, the files are copied to the Azure SQL Database successfully, as shown below:

Also, we can confirm that the data is copied successfully to the Azure SQL Database by connecting to the database using SSMS, and querying the data, as shown below:

How can we load multiple (e.g., 50) tables at a time using Azure Data Factory?

Here are the important steps to create this solution:

  1. Select the watermark column: Select one column for each table in the source data store, which can be used to identify the new or updated records for every run. Normally, the data in this selected column (for example, last_modify_time or ID) keeps increasing when rows are created or updated. The maximum value in this column is used as a watermark.
  2. Prepare a data store to store the watermark value: In this tutorial, you store the watermark value in a SQL database.
  3. Create a pipeline with the following activities:
    • Create a ForEach activity that iterates through a list of source table names that is passed as a parameter to the pipeline. For each source table, it invokes the following activities to perform delta loading for that table.
    • Create two lookup activities. Use the first Lookup activity to retrieve the last watermark value. Use the second Lookup activity to retrieve the new watermark value. These watermark values are passed to the Copy activity.
    • Create a Copy activity that copies rows from the source data store with the value of the watermark column greater than the old watermark value and less than the new watermark value. Then, it copies the delta data from the source data store to Azure Blob storage as a new file.
    • Create a StoredProcedure activity that updates the watermark value, to be used by the next pipeline run.

Here is the high-level solution diagram:

Prerequisites
SQL Server. You use a SQL Server database as the source data store in this tutorial.
Azure SQL Database. You use a database in Azure SQL Database as the sink data store. If you don’t have a database in SQL Database, see Create a database in Azure SQL Database for steps to create one.

Create a data factory

  1. Launch Microsoft Edge or Google Chrome web browser. Currently, Data Factory UI is supported only in Microsoft Edge and Google Chrome web browsers.
  2. On the left menu, select Create a resource > Integration > Data Factory:

3. In the New data factory page, enter ADFMultiIncCopyTutorialDF for the name.

The name of the Azure Data Factory must be globally unique. If you see a red exclamation mark with the following error, change the name of the data factory (for example, yournameADFIncCopyTutorialDF) and try creating again.

4. Select your Azure subscription in which you want to create the data factory.

5. For the Resource Group, do one of the following steps:

  • Select Use existing, and select an existing resource group from the drop-down list.
  • Select Create new, and enter the name of a resource group

6. Select V2 for the version.

7. Select the location for the data factory. Only locations that are supported are displayed in the drop-down list. The data stores (Azure Storage, Azure SQL Database, etc.) and computes (HDInsight, etc.) used by data factory can be in other regions.

8. Click Create.

9. After the creation is complete, you see the Data Factory page as shown in the image.

10. Select Open on the Open Azure Data Factory Studio tile to launch the Azure Data Factory user interface (UI) in a separate tab.

Create self-hosted integration runtime:

As you are moving data from a data store in a private network (on-premises) to an Azure data store, install a self-hosted integration runtime (IR) in your on-premises environment. The self-hosted IR moves data between your private network and Azure.

  1. On the home page of Azure Data Factory UI, select the Manage tab from the leftmost pane.

2. Select Integration runtimes on the left pane, and then select +New.

3. In the Integration Runtime Setup window, select Perform data movement and dispatch activities to external computes, and click Continue.

4. Select Self-Hosted, and click Continue.

5. Enter MySelfHostedIR for Name, and click Create.

6. Click Click here to launch the express setup for this computer in the Option 1: Express setup section.

7. In the Integration Runtime (Self-hosted) Express Setup window, click Close.

8. In the Web browser, in the Integration Runtime Setup window, click Finish.

9. Confirm that you see MySelfHostedIR in the list of integration runtimes.

Create linked services

You create linked services in a data factory to link your data stores and compute services to the data factory. In this section, you create linked services to your SQL Server database and your database in Azure SQL Database.


Create the SQL Server linked service

In this step, you link your SQL Server database to the data factory.

  1. In the Connections window, switch from Integration Runtimes tab to the Linked Services tab, and click + New.

2. In the New Linked Service window, select SQL Server, and click Continue.

3. In the New Linked Service window, do the following steps:

  1. Enter SqlServerLinkedService for Name.
  2. Select MySelfHostedIR for Connect via integration runtime. This is an important step. The default integration runtime cannot connect to an on-premises data store. Use the self-hosted integration runtime you created earlier.
  3. For Server name, enter the name of your computer that has the SQL Server database.
  4. For Database name, enter the name of the database in your SQL Server that has the source data. You created a table and inserted data into this database as part of the prerequisites.
  5. For Authentication type, select the type of the authentication you want to use to connect to the database.
  6. For User name, enter the name of a user that has access to the SQL Server database. If you need to use a backslash character (\) in your user account or server name, escape it with another backslash (\). An example is mydomain\\myuser.
  7. For Password, enter the password for the user.
  8. To test whether Data Factory can connect to your SQL Server database, click Test connection. Fix any errors until the connection succeeds.
  9. To save the linked service, click Finish.

Create the Azure SQL Database linked service

In the previous step, you created a linked service to link your source SQL Server database to the data factory. In this step, you link your destination/sink database to the data factory.

  1. In the Connections window, switch from Integration Runtimes tab to the Linked Services tab, and click + New.
  2. In the New Linked Service window, select Azure SQL Database, and click Continue.
  3. In the New Linked Service window, do the following steps:
    • Enter AzureSqlDatabaseLinkedService for Name.
    • For Server name, select the name of your server from the drop-down list.
    • For Database name, select the database in which you created customer_table and project_table as part of the prerequisites.
    • For User name, enter the name of user that has access to the database.
    • For Password, enter the password for the user.
    • To test whether Data Factory can connect to your SQL Server database, click Test connection. Fix any errors until the connection succeeds.
    • To save the linked service, click Finish.

4. Confirm that you see two linked services in the list.

Create datasets

In this step, you create datasets to represent the data source, the data destination, and the place to store the watermark.

Create a source dataset

  • In the left pane, click + (plus), and click Dataset.
  • In the New Dataset window, select SQL Server, click Continue.
  • You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the tree view. In the General tab of the Properties window at the bottom, enter SourceDataset for Name.
  • Switch to the Connection tab in the Properties window, and select SqlServerLinkedService for Linked service. You do not select a table here. The Copy activity in the pipeline uses a SQL query to load the data rather than load the entire table.

Create a sink dataset

  1. In the left pane, click + (plus), and click Dataset.
  2. In the New Dataset window, select Azure SQL Database, and click Continue.
  3. You see a new tab opened in the Web browser for configuring the dataset. You also see a dataset in the tree view. In the General tab of the Properties window at the bottom, enter SinkDataset for Name.
  4. Switch to the Parameters tab in the Properties window, and do the following steps:
    • Click New in the Create/update parameters section.
    • Enter SinkTableName for the name, and String for the type. This dataset takes SinkTableName as a parameter. The SinkTableName parameter is set by the pipeline dynamically at runtime. The ForEach activity in the pipeline iterates through a list of table names and passes the table name to this dataset in each iteration.

5. Switch to the Connection tab in the Properties window, and select AzureSqlDatabaseLinkedService for Linked service. For Table property, click Add dynamic content.

6. In the Add Dynamic Content window, select SinkTableName in the Parameters section.

7. After clicking Finish, you see “@dataset().SinkTableName” as the table name

Create a pipeline

The pipeline takes a list of table names as a parameter. The ForEach activity iterates through the list of table names and performs the following operations:

  1. Use the Lookup activity to retrieve the old watermark value (the initial value or the one that was used in the last iteration).
  2. Use the Lookup activity to retrieve the new watermark value (the maximum value of the watermark column in the source table).
  3. Use the Copy activity to copy data between these two watermark values from the source database to the destination database.
  4. Use the StoredProcedure activity to update the old watermark value to be used in the first step of the next iteration.

Create the pipeline

  1. In the left pane, click + (plus), and click Pipeline.
  2. In the General panel under Properties, specify IncrementalCopyPipeline for Name. Then collapse the panel by clicking the Properties icon in the top-right corner.
  3. In the Parameters tab, do the following steps:
    1. Click + New.
    2. Enter tableList for the parameter name.
    3. Select Array for the parameter type.
  4. In the Activities toolbox, expand Iteration & Conditionals, and drag-drop the ForEach activity to the pipeline designer surface. In the General tab of the Properties window, enter IterateSQLTables.
  5. Switch to the Settings tab, and enter @pipeline().parameters.tableList for Items. The ForEach activity iterates through a list of tables and performs the incremental copy operation.

6. Select the ForEach activity in the pipeline if it isn’t already selected. Click the Edit (Pencil icon) button.

7. In the Activities toolbox, expand General, drag-drop the Lookup activity to the pipeline designer surface, and enter LookupOldWaterMarkActivity for Name.

8. Switch to the Settings tab of the Properties window, and do the following steps:

  1. Select WatermarkDataset for Source Dataset.
  2. Select Query for Use Query.
  3. Enter the following SQL query for Query.
    select * from watermarktable where TableName = '@{item().TABLE_NAME}'
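
The watermarktable referenced by this query is a small control table. A minimal sketch of its shape, consistent with the columns used in this walkthrough (exact data types may differ in your environment):

    CREATE TABLE watermarktable
    (
        TableName      VARCHAR(255),
        WatermarkValue DATETIME
    );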

9. Drag-drop the Lookup activity from the Activities toolbox, and enter LookupNewWaterMarkActivity for Name.

10. Switch to the Settings tab.

  1. Select SourceDataset for Source Dataset.
  2. Select Query for Use Query.
  3. Enter the following SQL query for Query.
    select MAX(@{item().WaterMark_Column}) as NewWatermarkvalue from @{item().TABLE_NAME}

11. Drag-drop the Copy activity from the Activities toolbox, and enter IncrementalCopyActivity for Name.

12. Connect Lookup activities to the Copy activity one by one. To connect, start dragging at the green box attached to the Lookup activity and drop it on the Copy activity. Release the mouse button when the border color of the Copy activity changes to blue.

13. Select the Copy activity in the pipeline. Switch to the Source tab in the Properties window.

  1. Select SourceDataset for Source Dataset.
  2. Select Query for Use Query.
  3. Enter the following SQL query for Query.
select * from @{item().TABLE_NAME} where @{item().WaterMark_Column} > '@{activity('LookupOldWaterMarkActivity').output.firstRow.WatermarkValue}' and @{item().WaterMark_Column} <= '@{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}'

14. Switch to the Sink tab, and select SinkDataset for Sink Dataset.

15. Do the following steps:

  1. In the Dataset properties, for SinkTableName parameter, enter @{item().TABLE_NAME}.
  2. For Stored Procedure Name property, enter @{item().StoredProcedureNameForMergeOperation}.
  3. For Table type property, enter @{item().TableType}.
  4. For Table type parameter name, enter @{item().TABLE_NAME}.

16. Drag-and-drop the Stored Procedure activity from the Activities toolbox to the pipeline designer surface. Connect the Copy activity to the Stored Procedure activity.

17. Select the Stored Procedure activity in the pipeline, and enter StoredProceduretoWriteWatermarkActivity for Name in the General tab of the Properties window.

18. Switch to the SQL Account tab, and select AzureSqlDatabaseLinkedService for Linked Service.

19. Switch to the Stored Procedure tab, and do the following steps:

  1. For Stored procedure name, select [dbo].[usp_write_watermark].
  2. Select Import parameter.
  3. Specify the following values for the parameters:
Name | Type | Value
LastModifiedtime | DateTime | @{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
TableName | String | @{activity('LookupOldWaterMarkActivity').output.firstRow.TableName}
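
A minimal sketch of what [dbo].[usp_write_watermark] typically looks like, given the two parameters above (parameter types and lengths may differ in your environment):

    CREATE PROCEDURE usp_write_watermark
        @LastModifiedtime DATETIME,
        @TableName VARCHAR(50)
    AS
    BEGIN
        -- Persist the new watermark so the next run only picks up newer rows
        UPDATE watermarktable
        SET WatermarkValue = @LastModifiedtime
        WHERE TableName = @TableName;
    END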

20. Select Publish All to publish the entities you created to the Data Factory service.

21. Wait until you see the Successfully published message. To see the notifications, click the Show Notifications link. Close the notifications window by clicking X.

Run the pipeline

  1. On the toolbar for the pipeline, click Add trigger, and click Trigger Now.
  2. In the Pipeline Run window, enter the following value for the tableList parameter, and click Finish.
[
    {
        "TABLE_NAME": "customer_table",
        "WaterMark_Column": "LastModifytime",
        "TableType": "DataTypeforCustomerTable",
        "StoredProcedureNameForMergeOperation": "usp_upsert_customer_table"
    },
    {
        "TABLE_NAME": "project_table",
        "WaterMark_Column": "Creationtime",
        "TableType": "DataTypeforProjectTable",
        "StoredProcedureNameForMergeOperation": "usp_upsert_project_table"
    }
]

Monitor the pipeline:

  1. Switch to the Monitor tab on the left. You see the pipeline run triggered by the manual trigger. You can use links under the PIPELINE NAME column to view activity details and to rerun the pipeline.
  2. To see activity runs associated with the pipeline run, select the link under the PIPELINE NAME column. For details about the activity runs, select the Details link (eyeglasses icon) under the ACTIVITY NAME column.
  3. Select All pipeline runs at the top to go back to the Pipeline Runs view. To refresh the view, select Refresh.

Azure Data Engineering Questions and Answers – 2023

1.    What is Data Engineering?

Data Engineering is a field within the broader domain of data management that focuses on designing, building, and maintaining systems and infrastructure to support the collection, storage, processing, and analysis of large volumes of data. It plays a crucial role in the data lifecycle, ensuring that data is properly ingested, transformed, and made available for various data-driven applications and analytical processes.

The primary goal of data engineering is to create a robust and scalable data infrastructure that enables organizations to efficiently manage their data and extract meaningful insights from it. Data engineers are responsible for implementing data pipelines, integrating different data sources, and ensuring data quality and consistency. Here are some key aspects of data engineering:

1. Data Ingestion: Data engineers collect and ingest data from various sources, including databases, APIs, log files, sensors, and external systems. They design the data ingestion process to handle real-time or batch data streams, depending on the specific requirements of the organization.

2. Data Transformation: Raw data often needs to be processed and transformed into a usable format before it can be analyzed. Data engineers perform data transformations, such as data cleaning, enrichment, aggregation, and normalization, to make it suitable for analysis.

3. Data Storage: Data engineers choose appropriate storage solutions based on the volume, velocity, and variety of data. Common data storage technologies include relational databases, data warehouses, data lakes, and NoSQL databases.

4. Data Modeling: Data engineers create data models that define the structure and relationships of data within the organization. They may use data modeling techniques like entity-relationship diagrams or schema-on-read approaches to represent the data.

5. Data Governance and Security: Data engineers implement security measures and data governance policies to ensure data privacy, compliance with regulations, and data integrity.

6. ETL/ELT: ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes are crucial components of data engineering. ETL involves extracting data from various sources, transforming it, and then loading it into the target data storage. ELT involves loading raw data first and then performing transformations within the data storage system itself.

7. Big Data Technologies: In dealing with large-scale and high-velocity data, data engineers often work with big data technologies like Apache Hadoop, Apache Spark, and cloud-based data services.

Data engineering is fundamental for data-driven decision-making, business intelligence, machine learning, and various data analytics applications. It bridges the gap between raw data and meaningful insights, making it a crucial discipline in today’s data-driven world.

2.    What is Azure Synapse analytics?

Azure Synapse Analytics is an integrated analytics service provided by Microsoft as part of the Azure cloud platform. It combines big data and data warehousing capabilities into a single solution, enabling organizations to ingest, prepare, manage, and serve data for business intelligence, machine learning, and real-time analytics purposes. Azure Synapse Analytics was formerly known as Azure SQL Data Warehouse but was rebranded and expanded to offer more comprehensive analytics capabilities.

Key features of Azure Synapse Analytics include:

1. Unified Data Platform: It brings together data integration, data warehousing, big data analytics, and data exploration capabilities into a single unified platform, allowing users to work with diverse data sources seamlessly.

2. Data Integration: Azure Synapse Analytics provides built-in connectors and integration with various data sources, both on-premises and in the cloud, making it easier to ingest and process data from different systems.

3. Data Warehousing: It offers a distributed, scalable, and fully managed data warehousing solution to store and manage large volumes of structured and semi-structured data.

4. Big Data Analytics: The service allows users to perform big data analytics using Apache Spark, enabling data engineers and data scientists to analyze and process large datasets efficiently.

5. Real-time Analytics: Azure Synapse Analytics can be integrated with Azure Stream Analytics, enabling real-time data ingestion and analytics for streaming data scenarios.

6. Machine Learning Integration: It integrates with Azure Machine Learning, facilitating the deployment and operationalization of machine learning models on large datasets.

7. Serverless On-Demand Queries: Users can run serverless SQL queries on data residing in various locations, making it easier to analyze data without requiring a dedicated data warehouse.

8. Power BI Integration: Azure Synapse Analytics seamlessly integrates with Power BI, Microsoft’s business analytics service, enabling users to create interactive reports and dashboards.

9. Security and Governance: The service provides robust security features, including role-based access control, encryption, and data masking, ensuring data privacy and compliance with industry standards.

Overall, Azure Synapse Analytics is designed to provide an end-to-end analytics solution for modern data-driven businesses, allowing them to gain insights from large and diverse datasets efficiently and effectively. Refer to the official Azure documentation for the most current list of features.

3. Explain the data masking feature of Azure?

Data masking in Azure is a security feature that helps protect sensitive information by obfuscating or hiding sensitive data from unauthorized users or applications. It is an essential part of data security and compliance strategies, ensuring that only authorized individuals can access the actual sensitive data while others are provided with masked or scrambled representations.

Azure provides various tools and services to implement data masking, and one of the primary services used for this purpose is Azure SQL Database. Here’s an overview of how data masking works in Azure:

1. Sensitive Data Identification: The first step in data masking is to identify the sensitive data that needs protection. This includes personally identifiable information (PII), financial data, healthcare data, or any other information that could pose a risk if accessed by unauthorized users.

2. Data Masking Rules: Once sensitive data is identified, data masking rules are defined to specify how the data should be masked. These rules can vary based on the type of data and the level of security required. For example, data might be partially masked, fully masked, or substituted with random characters.

3. Data Masking Operations: When data masking is enabled, the data in the specified columns is automatically masked based on the defined rules. This process happens at the database level, ensuring that applications and users interacting with the database only see the masked data.

4. Access Control: Access control mechanisms, such as role-based access control (RBAC) and permissions, are used to ensure that only authorized users have access to the original (unmasked) data. Access to the masked data is typically provided to users who do not require access to the actual sensitive information.

5. Dynamic Data Masking: In some cases, dynamic data masking is used, which allows administrators to define masking rules dynamically at query runtime based on the user’s privileges. This means that different users may see different masked versions of the data based on their access rights.
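
As a hedged illustration, dynamic data masking in Azure SQL Database is configured with T-SQL along these lines (the dbo.Customers table and the DataAnalyst user are hypothetical):

    -- Mask the email column for non-privileged users
    ALTER TABLE dbo.Customers
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

    -- Show only the last four digits of the phone number
    ALTER TABLE dbo.Customers
    ALTER COLUMN Phone ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XXX-",4)');

    -- Users granted UNMASK still see the original values
    GRANT UNMASK TO DataAnalyst;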

Data masking is particularly beneficial in scenarios where developers, testers, or support personnel need access to production data for testing or troubleshooting purposes but should not be exposed to actual sensitive information. By using data masking, organizations can strike a balance between data privacy and the practical need to use realistic datasets for various purposes.

It’s important to note that while data masking provides an additional layer of security, it is not a substitute for robust access controls, encryption, or other security measures. A comprehensive data security strategy should include multiple layers of protection to safeguard sensitive information throughout its lifecycle.

4. Difference between Azure Synapse Analytics and Azure Data Lake Storage?

Feature | Azure Synapse Analytics | Azure Data Lake Storage
Primary Purpose | Unified analytics service that combines big data and data warehousing capabilities. | Scalable and secure data lake storage for storing and managing large volumes of structured and unstructured data.
Data Processing | Provides data warehousing, big data analytics (using Apache Spark), and real-time data processing. | Primarily focused on storing and managing data; data processing is typically performed using separate services or tools.
Data Integration | Offers built-in connectors for integrating data from various sources for analysis. | Primarily focused on data storage, but data can be ingested and processed using other Azure services like Data Factory or Databricks.
Query Language | Supports T-SQL (Transact-SQL) for querying structured data and Apache Spark SQL for big data analytics. | No built-in query language; data is typically accessed and processed through other Azure services or tools.
Data Organization | Organizes data into dedicated SQL pools and Apache Spark pools for different types of analytics. | Organizes data into hierarchical directories and folders, suitable for storing raw and processed data.
Data Security | Provides robust security measures for data protection, access control, and auditing. | Offers granular access control through Azure Active Directory, and data can be encrypted at rest and in transit.
Schema Management | Requires structured data with a predefined schema for SQL-based analytics. | Supports both structured and unstructured data, allowing schema-on-read for flexible data processing.
Use Cases | Suitable for data warehousing, big data analytics, and real-time analytics scenarios. | Ideal for big data storage, data exploration, data archiving, and serving as a data source for various data processing workloads.
Integration with Other Services | Integrates with other Azure services like Power BI, Azure Machine Learning, and Azure Stream Analytics. | Can be integrated with various Azure services like Data Factory, Databricks, and Azure HDInsight for data processing.

5. Describe various windowing functions of Azure Stream Analytics?

Azure Stream Analytics provides several windowing functions that help you perform calculations on data streams within specific time or event intervals. These functions enable you to perform time-based aggregations, sliding window computations, and more. Here are some of the key windowing functions available in Azure Stream Analytics:

Tumbling Windows:
  • Tumbling windows divide the data stream into fixed-size, non-overlapping time intervals.
  • Each event belongs to one and only one tumbling window based on its timestamp.
  • Useful for performing time-based aggregations over distinct time intervals.

Hopping window
  • Hopping windows are similar to tumbling windows but allow overlapping time intervals.
  • You can specify the hop size and window size, determining how much overlap exists between adjacent windows.
  • Useful for calculating aggregates over sliding time intervals.

Sliding window
  • Sliding windows divide the data stream into fixed-size, overlapping time intervals.
  • Unlike hopping windows, sliding windows always have an overlap between adjacent windows.
  • You can specify the window size and the slide size.
  • Useful for analyzing recent data while considering historical context.

Session Windows:
  • Session windows group events that are close together in time based on a specified gap duration.
  • The gap duration is the maximum time interval between events that belong to the same session window.
  • Useful for detecting and analyzing bursts or periods of activity in data streams.

Snapshot window

Snapshot windows group events that have the same timestamp. Unlike other windowing types, which require a specific window function (such as SessionWindow()), you can apply a snapshot window by adding System.Timestamp() to the GROUP BY clause.
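
As a short illustration, a tumbling-window aggregation in the Stream Analytics query language looks roughly like this (the Input stream and EventTime column are hypothetical):

    -- Count events in fixed, non-overlapping 10-second windows
    SELECT
        System.Timestamp() AS WindowEnd,
        COUNT(*) AS EventCount
    FROM Input TIMESTAMP BY EventTime
    GROUP BY TumblingWindow(second, 10)

Swapping TumblingWindow for, say, HoppingWindow(second, 10, 5), SlidingWindow(second, 10), or SessionWindow(second, 5, 10) switches the query to the other window types described above.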

6. What are the different storage types in Azure?

The following are the main storage types available in Azure:

Storage Type | Description
Files | Azure Files is an organized way of storing data in the cloud. The main advantage of using Azure Files over Azure Blobs is that Azure Files allows organizing the data in a folder structure. Also, Azure Files is SMB (Server Message Block) protocol compliant, i.e., it can be used as a file share.
Blobs | Blob stands for binary large object. This storage solution supports all kinds of files, including text files, videos, images, documents, binary data, etc.
Queues | Azure Queue is a cloud-based messaging store for establishing and brokering communication between various applications and components.
Disks | The Azure disk is used as a storage solution for Azure VMs (Virtual Machines).
Tables | Tables are NoSQL storage structures for storing structured data that does not fit the standard RDBMS (relational database) schema.

7. What are the different security options available in the Azure SQL database?

Azure SQL Database provides a range of security options to ensure the confidentiality, integrity, and availability of your data. These options are designed to protect your database from unauthorized access, data breaches, and other security threats. Here are some of the key security features available in Azure SQL Database:

1. Firewall Rules:
  • Azure SQL Database allows you to configure firewall rules to control which IP addresses or IP address ranges can access your database.
  • By default, all access from outside Azure’s datacenters is blocked until you define the necessary firewall rules.
2. Authentication:
  • Azure SQL Database supports two types of authentication: SQL Authentication and Azure Active Directory Authentication.
  • SQL Authentication uses usernames and passwords to authenticate users, while Azure Active Directory Authentication enables users to sign in with their Azure AD credentials.
3. Transparent Data Encryption (TDE):
  • TDE automatically encrypts data at rest, providing an additional layer of security for your database.
  • The data and log files are encrypted using a database encryption key, which is further protected by a service-managed certificate.
4. Always Encrypted:
  • Always Encrypted is a feature that allows you to encrypt sensitive data in the database while keeping the encryption keys within your application.
  • This ensures that even database administrators cannot access the plaintext data.
5. Auditing and Threat Detection:
  • Azure SQL Database offers built-in auditing that allows you to track database events and store audit logs in Azure Storage.
  • Threat Detection provides continuous monitoring and anomaly detection to identify potential security threats and suspicious activities.
6. Row-Level Security (RLS):
  • RLS enables you to control access to rows in a database table based on the characteristics of the user executing a query.
  • This feature is useful for implementing fine-grained access controls on a per-row basis.
7. Virtual Network Service Endpoints:
  • With Virtual Network Service Endpoints, you can extend your virtual network’s private IP address space and route traffic securely to Azure SQL Database over the Azure backbone network.
  • This helps to ensure that your database can only be accessed from specific virtual networks, improving network security.
8. Data Masking:
  • Data Masking enables you to obfuscate sensitive data in query results, making it possible to share the data with non-privileged users without revealing the actual values.
9. Threat Protection:
  • Azure SQL Database’s Threat Protection feature helps identify and mitigate potential database vulnerabilities and security issues.
10. Advanced Data Security (ADS):
  • Advanced Data Security is a unified package that includes features like Vulnerability Assessment, Data Discovery & Classification, and SQL Injection Protection to enhance the overall security of your database.

By leveraging these security options, you can ensure that your Azure SQL Database remains protected against various security threats and maintain the confidentiality and integrity of your data. It is essential to configure and monitor these security features based on your specific requirements and compliance standards.
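
As a concrete illustration of Row-Level Security (item 6 above), here is a minimal T-SQL sketch, assuming a dbo.Orders table with a SalesRep column that stores the owning user name:

    -- Predicate function: a row is visible only to the user named in SalesRep
    CREATE FUNCTION dbo.fn_securitypredicate(@SalesRep AS sysname)
        RETURNS TABLE
    WITH SCHEMABINDING
    AS
        RETURN SELECT 1 AS fn_result
        WHERE @SalesRep = USER_NAME();
    GO

    -- Security policy that applies the predicate to the table
    CREATE SECURITY POLICY SalesFilter
    ADD FILTER PREDICATE dbo.fn_securitypredicate(SalesRep)
    ON dbo.Orders
    WITH (STATE = ON);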

8. How data security is implemented in Azure Data Lake Storage(ADLS) Gen2?

Azure Data Lake Storage Gen2 (ADLS Gen2) provides robust data security features to protect data at rest and in transit. It builds on the security features of Azure Blob Storage and adds hierarchical namespace support, enabling it to function as both an object store and a file system. Here’s how data security is implemented in Azure Data Lake Storage Gen2:

1. Role-Based Access Control (RBAC):
  • ADLS Gen2 leverages Azure RBAC to control access to resources at the Azure subscription and resource group levels.
  • You can assign roles such as Storage Account Contributor, Storage Account Owner, or Custom roles with specific permissions to users, groups, or applications.
2. POSIX Access Control Lists (ACLs):
  • ADLS Gen2 supports POSIX-like ACLs, allowing you to grant fine-grained access control to individual files and directories within the data lake.
  • This enables you to define access permissions for specific users or groups, controlling read, write, and execute operations.
3. Shared Access Signatures (SAS):
  • SAS tokens provide limited and time-bound access to specific resources in ADLS Gen2.
  • You can generate SAS tokens with custom permissions and time constraints, which are useful for granting temporary access to external entities without exposing storage account keys.
4. Data Encryption:
  • Data at rest is automatically encrypted using Microsoft-managed keys (SSE, Server-Side Encryption).
  • Optionally, you can bring your encryption keys using customer-managed keys (CMEK) for an added layer of control.
5. Azure Private Link:
  • ADLS Gen2 can be integrated with Azure Private Link, which allows you to access the service over a private, dedicated network connection (Azure Virtual Network).
  • This helps to prevent data exposure to the public internet and enhances network security.
6. Firewall and Virtual Network Service Endpoints:
  • ADLS Gen2 allows you to configure firewall rules to control which IP addresses or IP address ranges can access the data lake.
  • Virtual Network Service Endpoints enable secure access to the data lake from within an Azure Virtual Network.
7. Data Classification and Sensitivity Labels:
  • ADLS Gen2 supports Azure Data Classification and Sensitivity Labels, allowing you to classify and label data based on its sensitivity level.
  • These labels can be used to enforce policies, auditing, and access control based on data classification.
8. Auditing and Monitoring:
  • ADLS Gen2 offers auditing capabilities that allow you to track access to the data lake and log events for compliance and security monitoring.
  • You can integrate ADLS Gen2 with Azure Monitor to get insights into the health and performance of the service.

By combining these security features, Azure Data Lake Storage Gen2 ensures that your data is protected from unauthorized access, tampering, and data breaches. It is crucial to configure these security options based on your specific data security requirements and compliance standards to safeguard your data effectively.

9. What are the various data flow partition schemes available in Azure?

Partition Scheme | Explanation | Usage
Round Robin | The most straightforward partition scheme; it spreads data evenly across partitions. | Use when no good key candidates are available in the data.
Hash | A hash of columns creates uniform partitions such that rows with similar values fall in the same partition. | When using it, check for possible partition skew.
Dynamic Range | Uses Spark dynamic ranges based on the provided columns or expressions. | Select the columns that will be used for partitioning.
Fixed Range | A fixed range of values based on a user-created expression for distributing data across partitions. | A good understanding of the data is required to avoid partition skew.
Key | Creates a partition for each unique value in the selected column. | A good understanding of data cardinality is required.

10. Why is the Azure data factory needed?

Azure Data Factory is a cloud-based data integration service provided by Microsoft Azure. It is designed to address the challenges of data movement and data orchestration in modern data-centric environments. The primary reasons why Azure Data Factory is needed are as follows:

1. Data Integration and Orchestration: In today’s data landscape, organizations often deal with data spread across various sources and formats, both on-premises and in the cloud. Azure Data Factory enables seamless integration and orchestration of data from diverse sources, making it easier to collect, transform, and move data between systems.

2. ETL (Extract, Transform, Load) Workflows: ETL processes are fundamental in data warehousing and analytics. Azure Data Factory allows you to create complex data workflows, where you can extract data from different sources, apply transformations, and load it into the target destination efficiently.

3. Serverless and Scalable: Azure Data Factory is a serverless service, meaning you don’t need to manage the underlying infrastructure. It automatically scales up or down based on demand, ensuring that you can process data of any volume without worrying about infrastructure limitations.

4. Integration with Azure Services: As part of the Azure ecosystem, Data Factory seamlessly integrates with other Azure services like Azure Blob Storage, Azure SQL Database, Azure Data Lake, Azure Databricks, etc. This integration enhances data processing capabilities and enables users to take advantage of Azure’s broader suite of analytics and storage solutions.

5. Data Transformation and Data Flow: Azure Data Factory provides data wrangling capabilities through data flows, allowing users to build data transformation logic visually. This simplifies the process of data cleansing, enrichment, and preparation for analytics or reporting.

6. Monitoring and Management: Azure Data Factory comes with built-in monitoring and management tools that allow you to track the performance of your data pipelines, troubleshoot issues, and set up alerts for critical events.

7. Hybrid Data Movement: For organizations with a hybrid cloud strategy or data stored on-premises, Azure Data Factory offers connectivity to on-premises data sources using a secure data gateway. This enables smooth integration of cloud and on-premises data.

8. Time Efficiency: Data Factory enables you to automate data pipelines and schedule data movements and transformations, reducing manual intervention and saving time in the data integration process.

Overall, Azure Data Factory plays a crucial role in simplifying data integration, enabling organizations to gain insights from their data, and facilitating the development of robust data-driven solutions in the Azure cloud environment.

11. What is Azure Data Factory?

Azure Data Factory is a cloud-based integration service offered by Microsoft that lets you create data-driven workflows for orchestrating and automating data movement and data transformation in the cloud. Data Factory also lets you create and run data pipelines that move and transform data, and then run the pipelines on a specified schedule.

12. What is Integration Runtime?

Integration runtime is the compute infrastructure used by Azure Data Factory to provide integration capabilities across different network environments.

Types of Integration Runtimes:
  • Azure Integration Runtime – It can copy data between cloud data stores and dispatch activities to a variety of computing services such as SQL Server and Azure HDInsight.
  • Self-Hosted Integration Runtime – It is software with essentially the same code as the Azure Integration Runtime, but it is installed on on-premises systems or on virtual machines inside virtual networks.
  • Azure-SSIS Integration Runtime – It helps execute SSIS packages in a managed environment. So when we lift and shift SSIS packages to the Data Factory, we use the Azure-SSIS Integration Runtime.

13. What are the different components used in Azure Data Factory?

Azure Data Factory consists of several numbers of components. Some components are as follows:

  • Pipeline: The pipeline is the logical container of the activities.
  • Activity: It specifies the execution step in the Data Factory pipeline and is mainly used for data ingestion and transformation.
  • Dataset: A dataset is a pointer to the data used in the pipeline activities.
  • Mapping Data Flow: It specifies the data transformation logic built in the visual (UI) designer.
  • Linked Service: It specifies the connection information for the data sources used in the pipeline activities. For example, for a SQL Server source we need a connection string to connect to that external system, and in it we mention the source and the destination.
  • Trigger: It specifies the time when the pipeline will be executed.
  • Control flow: It is used to control the execution flow of the pipeline activities.

14. What is the key difference between the Dataset and Linked Service in Azure Data Factory?

A dataset specifies a pointer to the data within the data store described by the linked service. When we load data from a SQL Server instance, the dataset indicates the name of the table that contains the target data, or the query that returns data from different tables.

A linked service specifies the definition of the connection string used to connect to the data store. For example, when we load data from a SQL Server instance, the linked service contains the name of the SQL Server instance and the credentials used to connect to that instance.

15. What is the difference between Azure Data Lake and Azure Data Warehouse?

Azure Data Lake vs. Azure Data Warehouse:

  • A Data Lake is capable of storing data of any type, size, and shape; a Data Warehouse acts as a repository for already filtered data from a specific source.
  • A Data Lake is mainly used by data scientists; a Data Warehouse is more frequently used by business professionals.
  • A Data Lake is highly accessible with quicker updates; making changes in a Data Warehouse is a fairly rigid and costly task.
  • A Data Lake defines the schema after the data is stored; a Data Warehouse defines the schema before storing the data.
  • A Data Lake uses the ELT (Extract, Load, Transform) process; a Data Warehouse uses the ETL (Extract, Transform, Load) process.
  • A Data Lake is an ideal platform for in-depth analysis; a Data Warehouse is the best platform for operational users.

16. Difference between Data Lake Storage and Blob Storage.

Data Lake Storage vs. Blob Storage:

  • Data Lake Storage is an optimized storage solution for big data analytics workloads; Blob Storage is general-purpose storage for a wide variety of scenarios (it can also be used for big data analytics).
  • Data Lake Storage follows a hierarchical file system; Blob Storage follows an object store with a flat namespace.
  • In Data Lake Storage, data is stored as files inside folders; in Blob Storage, you create a storage account, and the storage account has containers that store the data.
  • Data Lake Storage can be used to store batch, interactive, and stream analytics data as well as machine learning data; Blob Storage can be used to store text files, binary data, media for streaming, and general-purpose data.

16. What are the steps to create an ETL process in Azure Data Factory?

The ETL (Extract, Transform, Load) process follows four main steps:

i) Connect and Collect: connect to the required data source(s) and move the data to a centralized data store for subsequent processing.

ii) Transform: transform the data using computing services such as HDInsight, Hadoop, Spark, etc.

iii) Publish: load the transformed data into Azure Data Lake Storage, Azure SQL Data Warehouse, Azure SQL Database, Azure Cosmos DB, etc.

iv) Monitor: Azure Data Factory has built-in support for pipeline monitoring via Azure Monitor, API, PowerShell, Azure Monitor logs, and health panels on the Azure portal.

17. What are the key differences between the Mapping data flow and Wrangling data flow transformation activities in Azure Data Factory?

In Azure Data Factory, the main difference between the Mapping data flow and the Wrangling data flow transformation activities is as follows:

The Mapping data flow activity is a visually designed data transformation activity that lets users build graphical data transformation logic without needing to be expert developers. It is executed as an activity within the ADF pipeline on a fully managed, scaled-out Spark cluster.

On the other hand, the Wrangling data flow activity is a code-free data preparation activity. It is integrated with Power Query Online to make the Power Query M functions available for data wrangling using Spark execution.

18. Can we pass parameters to a pipeline run?

Yes definitely, we can very easily pass parameters to a pipeline run. Pipeline runs are the first-class, top-level concepts in Azure Data Factory. We can define parameters at the pipeline level, and then we can pass the arguments to run a pipeline.

You can also define default values for the parameters in the pipeline.
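
As an illustration only (not the only approach), a run with parameter values can also be started programmatically through the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, pipeline, and parameter names below are all placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# All names below are illustrative placeholders.
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Arguments supplied here override the pipeline's default parameter values.
run_response = adf_client.pipelines.create_run(
    resource_group_name="my-rg",
    factory_name="my-data-factory",
    pipeline_name="CopySalesPipeline",
    parameters={"sourceFolder": "sales/2023", "targetTable": "stg_sales"},
)
print("Started run:", run_response.run_id)

The same arguments can equally be supplied from the Trigger Now dialog or inside a trigger definition in ADF Studio.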

19. Can an activity in a pipeline consume arguments that are passed to a pipeline run?

Each activity within the pipeline can consume the parameter value that’s passed to the pipeline and run with the @parameter construct.

Parameters are a first-class, top-level concept in Data Factory. We can define parameters at the pipeline level and pass arguments as you execute the pipeline run on demand or using a trigger. 

20. Can an activity output property be consumed in another activity?

An activity output can be consumed in a subsequent activity with the @activity construct.

21. How do I gracefully handle null values in an activity output?

You can use the @coalesce construct in the expressions to handle the null values gracefully.

23. What has changed from private preview to limited public preview in regard to data flows?

A couple of things have changed, as mentioned below:

  • You are no longer required to bring your own Azure Databricks clusters.
  • Data Factory will manage the cluster creation and tear-down process.
  • We can still use Data Lake Storage Gen2 and Blob Storage to store those files, using the appropriate linked services for those storage engines.
  • Blob and Azure Data Lake Storage Gen2 datasets are separated into delimited text and Apache Parquet datasets.

24. What is Azure SSIS Integration Runtime?

The Azure-SSIS Integration Runtime is a fully managed cluster of virtual machines hosted in Azure and dedicated to running SSIS packages in the Data Factory. We can scale up the SSIS nodes by configuring the node size, or scale out by configuring the number of nodes in the virtual machine cluster.

We must create an SSIS integration runtime and an SSISDB catalog hosted in Azure SQL Database or Azure SQL Managed Instance before executing an SSIS package.

25. An Azure Data Factory Pipeline can be executed using three methods. Mention these methods.

Methods to execute Azure Data Factory Pipeline:

  • Debug Mode
  • Manual execution using trigger now
  • Adding schedule, tumbling window/event trigger

26. Can we monitor and manage Azure Data Factory Pipelines?

Yes, we can monitor and manage ADF Pipelines using the following steps:

  • Click on the monitor and manage on the data factory tab.
  • Click on the resource manager.
  • Here, you will find- pipelines, datasets, and linked services in a tree format.

27. What are the steps involved in the ETL process?

ETL (Extract, Transform, Load) process follows four main steps:

  • Connect and Collect – helps in moving data from on-premises and cloud source data stores
  • Transform – lets users process the collected data using compute services such as HDInsight Hadoop, Spark, etc.
  • Publish – helps in loading the data into Azure Data Warehouse, Azure SQL Database, Azure Cosmos DB, etc.
  • Monitor – supports pipeline monitoring via Azure Monitor, API and PowerShell, Log Analytics, and health panels on the Azure Portal.

28. What are the different ways to execute pipelines in Azure Data Factory?

There are three ways in which we can execute a pipeline in Data Factory:

  • Debug mode can be helpful when trying out pipeline code and acts as a tool to test and troubleshoot our code.
  • Manual Execution is what we do by clicking on the ‘Trigger now’ option in a pipeline. This is useful if you want to run your pipelines on an ad-hoc basis.
  • We can schedule our pipelines at predefined times and intervals via a Trigger. As we will see later in this article, there are three types of triggers available in Data Factory. 

29. What is the purpose of Linked services in Azure Data Factory?

Linked services are used majorly for two purposes in Data Factory:

  1. For a Data Store representation, i.e., any storage system like Azure Blob storage account, a file share, or an Oracle DB/ SQL Server instance.
  2.  For Compute representation, i.e., the underlying VM will execute the activity defined in the pipeline. 

30. Can you Elaborate more on Data Factory Integration Runtime?

The Integration Runtime, or IR, is the compute infrastructure for Azure Data Factory pipelines. It is the bridge between activities and linked services. The linked Service or Activity references it and provides the computing environment where the activity is run directly or dispatched. This allows the activity to be performed in the closest region to the target data stores or computing Services.

Azure Data Factory supports three types of integration runtime, and one should choose based on their data integration capabilities and network environment requirements.

  1. Azure Integration Runtime: To copy data between cloud data stores and send activity to various computing services such as SQL Server, Azure HDInsight, etc.
  2. Self-Hosted Integration Runtime: Used for running copy activity between cloud data stores and data stores in private networks. Self-hosted integration runtime is software with the same code as the Azure Integration Runtime but installed on your local system or machine over a virtual network. 
  3. Azure SSIS Integration Runtime: You can run SSIS packages in a managed environment. So, when we lift and shift SSIS packages to the data factory, we use Azure SSIS Integration Runtime. 

31. What are ARM Templates in Azure Data Factory? What are they used for?

An ARM template is a JSON (JavaScript Object Notation) file that defines the infrastructure and configuration for the data factory pipeline, including pipeline activities, linked services, datasets, etc. The template will contain essentially the same code as our pipeline.

ARM templates are helpful when we want to migrate our pipeline code to higher environments, say Production or Staging from Development, after we are convinced that the code is working correctly.

32. What are the two types of compute environments supported by Data Factory to execute the transform activities?

Below are the types of computing environments that Data Factory supports for executing transformation activities: –

i) On-Demand Computing Environment: This is a fully managed environment provided by ADF. This type of compute creates a cluster to perform the transformation activity and automatically deletes it when the activity is complete.

ii) Bring Your Own Environment: In this option, you use ADF to manage an existing computing environment, for example infrastructure you already run for on-premises services.

33. If you want to use the output by executing a query, which activity shall you use? 

Look-up activity can return the result of executing a query or stored procedure.

The output can be a singleton value or an array of attributes, which can be consumed in subsequent copy data activity, or any transformation or control flow activity like ForEach activity.

34. What are some useful constructs available in Data Factory?

  • parameter: Each activity within the pipeline can consume the parameter value passed to the pipeline and run with the @parameter construct.
  • coalesce: We can use the @coalesce construct in the expressions to handle null values gracefully.
  • activity: An activity output can be consumed in a subsequent activity with the @activity construct.

35. Can we push code and have CI/CD (Continuous Integration and Continuous Delivery) in ADF?

Data Factory fully supports CI/CD of your data pipelines using Azure DevOps and GitHub. This allows you to develop and deliver your ETL processes incrementally before publishing the finished product. After the raw data has been refined into a business-ready consumable form, load it into Azure Data Warehouse, Azure SQL Database, Azure Data Lake, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools.

36. What do you mean by variables in the Azure Data Factory?

Variables in the Azure Data Factory pipeline provide the functionality to hold the values. They are used for a similar reason as we use variables in any programming language and are available inside the pipeline.

Set variables and append variables are two activities used for setting or manipulating the values of the variables. There are two types of variables in a data factory: –

i) System variables:  These are fixed variables from the Azure pipeline. For example, pipeline name, pipeline id, trigger name, etc. You need these to get the system information required in your use case.

ii) User variable: A user variable is declared manually in your code based on your pipeline logic.

37. What are the different activities you have used in Azure Data Factory?

Here you can share some of the significant activities if you have used them in your career, whether your work or college project. Here are a few of the most used activities :

  1. Copy Data Activity to copy the data between datasets.
  2. ForEach Activity for looping.
  3. Get Metadata Activity that can provide metadata about any data source.
  4. Set Variable Activity to define and initiate variables within pipelines.
  5. Lookup Activity to do a lookup to get some values from a table/file.
  6. Wait Activity to wait for a specified amount of time before/in between the pipeline run.
  7. Validation Activity will validate the presence of files within the dataset.
  8. Web Activity to call a custom REST endpoint from an ADF pipeline.

38. How can I schedule a pipeline?

You can use the time window or scheduler trigger to schedule a pipeline. The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurrent patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).

 Currently, the service supports three types of triggers:

  • Tumbling window trigger: A trigger that operates on a periodic interval while retaining a state.
  • Schedule Trigger: A trigger that invokes a pipeline on a wall-clock schedule.
  • Event-Based Trigger: A trigger that responds to an event. e.g., a file getting placed inside a blob.
    Pipelines and triggers have a many-to-many relationship (except for the tumbling window trigger). Multiple triggers can kick off a single pipeline, or a single trigger can kick off numerous pipelines.

39. How can you access data using the other 90 dataset types in Data Factory?

The mapping data flow feature allows Azure SQL Database, Azure Synapse Analytics, delimited text files from Azure storage account or Azure Data Lake Storage Gen2, and Parquet files from blob storage or Data Lake Storage Gen2 natively for source and sink data source. Use the Copy activity to stage data from any other connectors and then execute a Data Flow activity to transform data after it’s been staged.

40. Can a value be calculated for a new column from the existing column from mapping in ADF?

We can derive transformations in the mapping data flow to generate a new column based on our desired logic. We can create a new derived column or update an existing one when developing a derived one. Enter the name of the column you’re making in the Column textbox.

You can use the column dropdown to override an existing column in your schema. Click the Enter expression textbox to start creating the derived column’s expression. You can input or use the expression builder to build your logic.

41. How is the lookup activity useful in the Azure Data Factory?

In the ADF pipeline, the Lookup activity is commonly used for configuration lookup purposes, and the source dataset is available. Moreover, it retrieves the data from the source dataset and then sends it as the activity output. Generally, the output of the lookup activity is further used in the pipeline for making decisions or presenting any configuration as a result.

Simply put, lookup activity is used for data fetching in the ADF pipeline. The way you would use it entirely relies on your pipeline logic. Obtaining only the first row is possible, or you can retrieve the complete rows depending on your dataset or query.

42. Elaborate more on the Get Metadata activity in Azure Data Factory.

The Get Metadata activity is used to retrieve the metadata of any data in the Azure Data Factory or a Synapse pipeline. We can use the output from the Get Metadata activity in conditional expressions to perform validation or consume the metadata in subsequent activities.

It takes a dataset as input and returns metadata information as output. Only certain connectors and their corresponding metadata fields are supported, and the maximum size of returned metadata is 4 MB.

43. What does it mean by the breakpoint in the ADF pipeline?

To understand better, for example, you are using three activities in the pipeline, and now you want to debug up to the second activity only. You can do this by placing the breakpoint at the second activity. To add a breakpoint, click the circle present at the top of the activity.

44. Can you share any difficulties you faced while getting data from on-premises to Azure cloud using Data Factory?

One of the significant challenges we face while migrating from on-prem to the cloud is throughput and speed. When we copy the data using the Copy activity from on-prem, the transfer rate can be slower than expected, and hence we may not get the desired throughput.

There are some configuration options for a copy activity, which can help in tuning this process and can give desired results.

i) We should use the compression option to get the data in a compressed mode while loading from on-prem servers, which is then de-compressed while writing on the cloud storage.

ii) A staging area should be the first destination of our data after we have enabled compression. The copy activity can then decompress the data before writing it to the final cloud storage.

iii) Degree of Copy Parallelism is another option to help improve the migration process. This is identical to having multiple threads processing data and can speed up the data copy process.

There is no right fit-for-all here, so we must try different numbers like 8, 16, or 32 to see which performs well.

iv) Data Integration Unit is loosely the number of CPUs used, and increasing it may improve the performance of the copy process. 

45. How to copy multiple sheet data from an Excel file?

When using an Excel connector within a data factory, we must provide the sheet name from which we have to load data. This approach is manageable when we are dealing with a single sheet or a handful of sheets, but when there are many sheets (say 10+), it becomes a tedious task, as we have to change the hard-coded sheet name every time.

However, we can use a data factory binary data format connector for this and point it to the Excel file and need not provide the sheet name/s. We’ll be able to use copy activity to copy the data from all the sheets present in the file.

46. Is it possible to have nested looping in Azure Data Factory?

There is no direct support for nested looping in the data factory for any looping activity (for each / until). However, we can use one for each/until loop activity which will contain an execute pipeline activity that can have a loop activity. This way, when we call the looping activity, it will indirectly call another loop activity, and we’ll be able to achieve nested looping.

47. How to copy multiple tables from one datastore to another datastore?

An efficient approach to complete this task would be:

  • Maintain a lookup table/ file containing the list of tables and their source, which needs to be copied.
  • Then, we can use the lookup activity and each loop activity to scan through the list.
  • Inside the for each loop activity, we can use a copy activity or a mapping dataflow to copy multiple tables to the destination datastore.

48. What are some performance-tuning techniques for Mapping Data Flow activity?

We could consider the below set of parameters for tuning the performance of a Mapping Data Flow activity we have in a pipeline.

i) We should leverage partitioning in the source, sink, or transformation whenever possible. Microsoft, however, recommends using the default partition (size 128 MB) selected by the Data Factory as it intelligently chooses one based on our pipeline configuration.

Still, one should try out different partitions and see if they can have improved performance.

ii) We should not use a data flow activity inside a ForEach loop when we have multiple files that are similar in structure and processing needs. In that case, we should use a wildcard path inside the data flow activity, enabling the processing of all the files within a folder.

iii) The recommended file format is Parquet, because the pipeline executes by spinning up Spark clusters, and Parquet is a native file format for Apache Spark, so it generally gives good performance.

iv) Multiple logging modes are available: Basic, Verbose, and None.

We should only use verbose mode if essential, as it will log all the details about each operation the activity performs. e.g., It will log all the details of the operations performed for all our partitions. This one is useful when troubleshooting issues with the data flow.

The basic mode will give out all the necessary basic details in the log, so try to use this one whenever possible.

v)  Try to break down a complex data flow activity into multiple data flow activities. Let’s say we have several transformations between source and sink, and by adding more, we think the design has become complex. In this case, try to have it in multiple such activities, which will give two advantages:

  • All activities will run on separate spark clusters, decreasing the run time for the whole task.
  • The whole pipeline will be easy to understand and maintain in the future. 

49. What are some of the limitations of ADF?

Azure Data Factory provides great functionalities for data movement and transformations. However, there are some limitations as well.

i) We can’t have nested looping activities in the data factory, and we must use a workaround if we have that sort of structure in our pipeline. This applies to the iteration and conditional activities: If, ForEach, Switch, and Until.

ii) The Lookup activity can retrieve only 5,000 rows at a time and not more than that. To work with larger result sets, we need to combine it with a loop activity and a SQL query that pages through the data (for example, with a limit).

iii) We can have 40 activities in a single pipeline, including inner activity, containers, etc. To overcome this, we should modularize the pipelines regarding the number of datasets, activities, etc.

50. How are all the components of Azure Data Factory combined to complete an ADF task?

A pipeline groups the activities that perform the work; each activity reads from or writes to datasets, which point to the data in the stores defined by linked services; the integration runtime supplies the compute on which activities run or are dispatched; and triggers determine when the pipeline is executed. Together, these components combine to fulfill an Azure Data Factory task.

51. How do you send email notifications on pipeline failure?

There are multiple ways to do this:

  1. Using Logic Apps with Web/Webhook activity.
    Configure a logic app that, upon getting an HTTP request, can send an email to the required set of people for failure. In the pipeline, configure the failure option to hit the URL generated by the logic app.
  2. Using Alerts and Metrics from pipeline options.
    We can set up this from the pipeline itself, where we get numerous options for email on any activity failure within the pipeline.

52. Imagine you need to process streaming data in real time and store the results in an Azure Cosmos DB database. How would you design a pipeline in Azure Data Factory to efficiently handle the continuous data stream and ensure it is correctly stored and indexed in the destination database? 

Here are the steps to design a pipeline in Azure Data Factory to efficiently handle streaming data and store it in an Azure Cosmos DB database. 

  1. Set up an Azure Event Hub or Azure IoT Hub as the data source to receive the streaming data. 
  2. Use Azure Stream Analytics to process and transform the data in real time using Stream Analytics queries.
  3. Write the transformed data to a Cosmos DB collection as an output of the Stream Analytics job.
  4. Optimize query performance by configuring appropriate indexing policies for the Cosmos DB collection.
  5. Monitor the pipeline for issues using Azure Data Factory’s monitoring and diagnostic features, such as alerts and logs.

54. How can one combine or merge several rows into one row in ADF? Can you explain the process?

In Azure Data Factory (ADF), you can merge or combine several rows into a single row using the “Aggregate” transformation. 

55. How do you copy data as per file size in ADF?

You can copy data based on file size by using the “FileFilter” property in the Azure Data Factory. This property allows you to specify a file pattern to filter the files based on size. 

Here are the steps you can follow to copy data based on the file size: 

  • Create a dataset for the source and destination data stores.
  • Now, set the “FileFilter” property to filter the files based on their size in the source dataset. 
  • In the copy activity, select the source and destination datasets and configure the copy behavior per your requirement.
  • Run the pipeline to copy the data based on the file size filter.

56. How can you insert folder name and file count from blob into SQL table?

You can follow these steps to insert a folder name and file count from blob into the SQL table: 

  • Create an ADF pipeline with a “Get Metadata” activity to retrieve the folder and file details from the blob storage.
  • Add a “ForEach” activity to loop through each folder in the blob storage.
  • Inside the “ForEach” activity, add a “Get Metadata” activity to retrieve the file count for each folder.
  • Add a “Copy Data” activity to insert the folder name and file count into the SQL table.
  • Configure the “Copy Data” activity to use the folder name and file count as source data and insert them into the appropriate columns in the SQL table.
  • Run the ADF pipeline to insert the folder name and file count into the SQL table.

57. What are the various types of loops in ADF?

Loops in Azure Data Factory are used to iterate over a collection of items to perform a specific action repeatedly. There are three major types of loops in Azure Data Factory: 

  • For Each Loop: This loop is used to iterate over a collection of items and perform a specific action for each item in the collection. For example, if you have a list of files in a folder and want to copy each file to another location, you can use a For Each Loop to iterate over the list of files and copy each file to the target location.
  • Until Loop: This loop repeats a set of activities until a specific condition is met. For example, you could use an Until Loop to keep retrying an operation until it succeeds or until a certain number of attempts have been made.
  • While Loop: This loop repeats a specific action while a condition is true. For example, if you want to keep processing data until a specific condition is met, you can use a While Loop to repeat the processing until the condition is no longer true.

58. What is Data factory parameterization?

Data factory allows for parameterization of pipelines via three elements:

  • Parameters
  • Variables
  • Expression

Parameter:

Parameters are simply input values for operations in data factory. Each action has a set of predefined parameters that need to be supplied.

Additionally, some blocks like pipeline and datasets allow you to define custom parameters.

Variables:

Variables are set inside pipelines using the Set Variable activity. They are temporary values used within the pipeline and workflow to control the execution of the workflow.

A variable can be modified through an expression using the Set Variable activity during the execution of the workflow.

Variables support three data types: String, Boolean, and Array. We refer to a user variable as @variables(‘variableName’).

Expression:

An expression is a JSON-based formula that allows modification of a variable or of any parameter of a pipeline, activity, or connection in the data factory.

Typical scenario:

Most common scenarios that mandate parameterization are:

  • Dynamic input file name coming from external service
  • Dynamic output table name
  • Appending date to output
  • Changing connection parameter name like database name
  • Conditional programming
  • And many more…

59. What are the different Slowly changing dimension types?

Star schema design theory refers to common SCD types. The most common are Type 1 and Type 2. In practice a dimension table may support a combination of history tracking methods, including Type 3 and Type 6.

Type 1 SCD:

Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten. This design approach is common for columns that store supplementary values, like the email address or phone number of a customer. When a customer email address or phone number changes, the dimension table updates the customer row with the new values. It’s as if the customer always had this contact information. The key field, such as CustomerID, would stay the same so the records in the fact table automatically link to the updated customer record.

Type 2 SCD:

Type 2 SCD supports versioning of dimension members. Often the source system doesn’t store versions, so the data warehouse load process detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.
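
For illustration (with made-up values), a customer who moves from the East region to the West region would be represented by two rows: one with CustomerKey 101, CustomerID C1, Region East, StartDate 2021-01-01, EndDate 2023-05-31, IsCurrent = 0, and a new current row with CustomerKey 102, CustomerID C1, Region West, StartDate 2023-06-01, EndDate 9999-12-31, IsCurrent = 1. Fact rows continue to reference the surrogate key that was current when they occurred.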

Type 3 SCD:

Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history, rather than storing additional rows to track each change like in a Type 2 SCD.

This type of tracking may be used for one or two columns in a dimension table. It is not common to use it for many members of the same table. It is often used in combination with Type 1 or Type 2 members.

Type 6 SCD:

Type 6 SCD combines Type 1, 2, and 3. When a change happens to a Type 2 member you create a new row with appropriate StartDate and EndDate. In Type 6 design you also store the current value in all versions of that entity so you can easily report on the current value or the historical value.

Using the sales region example, you split the Region column into CurrentRegion and HistoricalRegion. The CurrentRegion always shows the latest value and the HistoricalRegion shows the region that was valid between the StartDate and EndDate. So for the same salesperson, every record would have the latest region populated in CurrentRegion while HistoricalRegion works exactly like the region field in the Type 2 SCD example.

60. What is the difference between the Mapping data flow and Wrangling data flow transformation activities in Data Factory?

Mapping data flow activity is a visually designed data transformation activity that allows us to design a graphical data transformation logic without the need to be an expert developer, and executed as an activity within the ADF pipeline on an ADF fully managed scaled-out Spark cluster.

Wrangling data flow activity is a code-free data preparation activity that integrates with Power Query Online in order to make the Power Query M functions available for data wrangling using spark execution.

61. When copying data from or to an Azure SQL Database using Data Factory, what is the firewall option that we should enable to allow the Data Factory to access that database?

The “Allow Azure services and resources to access this server” firewall option.

62. Which activity should you use if you need to use the results from running a query?

A Lookup activity can give you the results of running a query or a stored procedure. The result can be a single value or an array of attributes, and it can then be consumed in a subsequent Copy Data activity or other control-flow activity.

63. How are the remaining 90 dataset types in Data Factory used for data access?

You can easily use Azure Synapse Analytics, Azure SQL Database, and certain storage accounts for your data in the mapping data flow feature. It also supports Parquet files from blob storage or Data Lake Storage Gen2. However, for data from other sources, you’ll need to first copy it using the Copy activity before you can work with it in a Data Flow activity.

64. Give more details about the Azure Data Factory’s Get Metadata activity

The Azure Data Factory’s “Get Metadata” activity is a fundamental component of Azure Data Factory (ADF) that allows you to retrieve metadata information about various data-related objects within your data factory or linked services. Metadata includes information about datasets, tables, files, folders, or other data-related entities. Here are some key details about the Get Metadata activity:

Purpose:

  • Metadata Retrieval: It is used to retrieve metadata information without moving or processing the actual data. This information can be used for control flow decisions, dynamic parameterization, or for designing complex data workflows.

Use Cases:

  • Dynamic Execution: You can use the metadata retrieved by this activity to dynamically control which datasets or files to process or copy in subsequent activities.
  • Validation: It can be used to check the existence of datasets or files before attempting to use them, ensuring your pipeline only processes existing resources.

Output:

  • The Get Metadata activity outputs the metadata as structured JSON data, which includes details like file names, sizes, column names, table schemas, and more, depending on the metadata type and source.

Dynamic Expressions:

  • You can use dynamic expressions within the activity configuration to parameterize your metadata retrieval. For example, you can use expressions to construct folder paths or table names at runtime.

In summary, the Get Metadata activity in Azure Data Factory is a powerful tool for gathering information about your data sources and structures, enabling dynamic and data-driven ETL (Extract, Transform, Load) workflows within your data factory pipelines. It helps you make informed decisions and perform data operations based on the state of your data sources.

Databricks Interview Questions and Answers – 2023

1. What is Databricks?

Databricks provides a collaborative and interactive workspace that allows data engineers, data scientists, and analysts to work together on big data projects. It supports various programming languages, including Scala, Python, R, and SQL, making it accessible to a wide range of users.

The platform’s core component is Apache Spark, which enables distributed data processing and analysis across large clusters of computers. Databricks simplifies the deployment and management of Spark clusters, abstracting away much of the underlying infrastructure complexities.

Key features of Databricks include:

Unified Analytics: Databricks offers a collaborative workspace where users can write and execute code, explore data, visualize results, and share insights with their teams.

Apache Spark Integration: Databricks integrates with Apache Spark, providing a high-level interface to manage and scale Spark clusters for big data processing.

Delta Lake: This is an additional feature offered by Databricks, providing a versioned, transactional data lake that improves data reliability and simplifies data management.

Machine Learning: Databricks supports building and deploying machine learning models using popular libraries like MLlib, Scikit-learn, and TensorFlow.

Data Integration: Databricks can connect to various data sources, such as data lakes, databases, and data warehouses, allowing users to analyze and process data from multiple sources.

Job Scheduling: Users can schedule and automate data processing jobs, making it easier to run repetitive tasks at specific times or intervals.

Security and Collaboration: Databricks provides role-based access control and allows teams to collaborate securely on projects.

2. What is the difference between Databricks and Data Factory?

In simple terms, Databricks is a platform for big data analysis and machine learning, while Data Factory is a cloud-based service used for data integration, movement, and orchestration. Databricks is more suitable for data analysis and advanced analytics tasks, while Data Factory is ideal for data movement and transformation between different data sources and destinations.

Aspect-by-aspect comparison – Databricks vs. Data Factory:

  • Purpose – Databricks: unified data analytics platform; Data Factory: cloud-based data integration and orchestration service.
  • Function – Databricks: analyzing and processing big data; Data Factory: moving, transforming, and orchestrating data between various sources and destinations.
  • Primary Use Case – Databricks: data analysis, data engineering, and machine learning; Data Factory: ETL (Extract, Transform, Load) workflows, data integration, and data migration.
  • Core Technology – Databricks: Apache Spark (big data processing); Data Factory: data integration service with connectors to various data sources.
  • Programming – Databricks: supports Scala, Python, R, SQL; Data Factory: no direct programming required, mostly configuration-based.
  • Collaboration – Databricks: collaborative workspace for data teams; Data Factory: collaboration features for development and monitoring of data workflows.
  • Data Transformation – Databricks: provides tools for data transformation within the platform; Data Factory: data transformation happens within the data pipelines using Data Factory’s activities.
  • Data Source/Sink Connectivity – Databricks: connects to various data sources and destinations for analysis; Data Factory: connects to various data stores and cloud services for data movement.
  • Real-Time Processing – Databricks: provides support for real-time data processing; Data Factory: offers some real-time data movement and transformation capabilities.
  • Batch Processing – Databricks: supports batch processing for large-scale data; Data Factory: primarily designed for batch-oriented data integration tasks.
  • Advanced Analytics – Databricks: allows building and deploying machine learning models; Data Factory: focused on data movement and orchestration, not advanced analytics.
  • Use Case Example – Databricks: analyzing customer data for insights; Data Factory: moving data from an on-premises database to a cloud data warehouse.

3. Describe Spark Architecture.

The architecture of an Apache Spark application is based on a distributed computing model that allows it to process large-scale data across multiple nodes in a cluster. Spark applications are designed to be fault-tolerant, scalable, and efficient. Here’s an overview of the key components and the flow of data in a typical Apache Spark application:

1. Driver Program:
  • The Driver Program is the entry point of a Spark application. It runs on the master node and is responsible for creating the SparkContext, which is the entry point to any Spark functionality.
  • The Driver Program defines the high-level control flow of the application, including tasks, stages, and dependencies between transformations and actions.

2. SparkContext (SC):
  • SparkContext is the entry point to any Spark functionality. It establishes a connection to the cluster manager (e.g., YARN, Mesos, or Spark’s standalone cluster manager) to acquire resources for the application.
  • The SparkContext coordinates with the Cluster Manager to allocate and manage executors across the worker nodes.

3. Cluster Manager:
  • The Cluster Manager is responsible for managing the resources across the cluster, allocating tasks to workers, and ensuring fault tolerance.
  • It can be YARN, Mesos, or Spark’s standalone cluster manager.

4. Worker Nodes:
  • Worker Nodes are the machines in the cluster where actual data processing takes place.
  • Each worker node hosts one or more executors, which are separate JVM processes responsible for executing tasks.

5. Executors:
  • Executors are processes on the worker nodes responsible for executing individual tasks and storing data in memory or disk storage.
  • They manage the data that is cached and reused across multiple Spark operations, reducing the need to read data from disk repeatedly.

6. Resilient Distributed Dataset (RDD):
  • RDD is the fundamental data abstraction in Spark. It represents an immutable, fault-tolerant collection of objects that can be processed in parallel.
  • RDDs are divided into partitions, and each partition is processed on a separate executor.

7. Transformations:
  • Transformations are operations on RDDs that create new RDDs from existing ones. Examples include map, filter, reduceByKey, etc.
  • Transformations are lazily evaluated, meaning they are not executed until an action is called.

8. Actions:
  • Actions are operations that trigger the execution of transformations and return results to the Driver Program or write data to external storage.
  • Examples of actions include collect, count, save, etc.
  • Actions trigger the actual computation and materialize the RDD lineage.

9. Lineage Graph (DAG):
  • Spark constructs a Directed Acyclic Graph (DAG) of the transformations (logical execution plan) and optimizes it before execution to minimize data shuffling and improve performance.

10. Data Sources and Sinks:
  • Spark applications can read data from various sources (e.g., HDFS, S3, databases) and write data to external storage systems (e.g., HDFS, databases, file systems).

The architecture of Spark allows it to distribute data and computation across the cluster efficiently, providing fault tolerance, data locality, and high-performance data processing.

4. What is the maximum size of worker node you have used?

A worker node with about 64 GB of memory.

Note – If you quote a much larger node size in an interview, follow-up questions about garbage collection overhead on large JVM heaps tend to come up, so it is safer to stick to a maximum of around 64 GB.

5. How do you choose the cluster configuration in Databricks?

  • For normal ETL/ELT work – Memory optimized
  • For quick dev/test work – General purpose
  • For shuffle-intensive work (e.g., many joins) – Storage optimized

6. What is the difference between repartition and coalesce?

Both repartition and coalesce are Spark transformations used to control the distribution of data across partitions in a DataFrame or an RDD (Resilient Distributed Dataset). They affect how data is physically stored and distributed across the nodes of a cluster.

In summary, the key differences are:

  • repartition can increase or decrease the number of partitions and performs a full data shuffle, while coalesce only reduces the number of partitions and tries to minimize data movement.
  • Use repartition when you need to achieve a specific number of partitions or evenly distribute data for better parallelism. Use coalesce when you want to reduce the number of partitions efficiently without triggering a full shuffle.
# DataFrame repartition: redistribute the data into numPartitions partitions (full shuffle)
df = df.repartition(numPartitions)
# DataFrame coalesce: reduce the data to numPartitions partitions while minimizing data movement
df = df.coalesce(numPartitions)

7. What is the drawback of coalesce?

The coalesce transformation in Spark is a powerful tool to reduce the number of partitions in a DataFrame or RDD without a full shuffle, which can be beneficial for certain use cases. However, it also has some drawbacks that you should be aware of:

Data Skew:
  • When reducing the number of partitions with coalesce, Spark tries to minimize data movement and merges partitions into a smaller number of partitions. This approach can lead to data skew, where some partitions may end up with significantly more data than others.
  • Data skew can negatively impact performance since a few partitions may become “hotspots” that are processed much slower than others, leading to inefficient resource utilization.
Uneven Data Distribution:
  • Similar to data skew, coalesce can result in uneven data distribution across the reduced partitions. If the data has an uneven distribution or if certain keys have significantly more data than others, this can lead to suboptimal performance for subsequent operations like joins or aggregations.

Limited to Reduce Partitions:

  • Unlike repartition, which allows you to increase or decrease the number of partitions, coalesce can only reduce the number of partitions. You cannot use coalesce to increase the number of partitions, which might be needed for certain operations that benefit from increased parallelism.

Less Parallelism:
  • By reducing the number of partitions, you may reduce the level of parallelism in subsequent processing. This can lead to underutilization of resources, particularly if you have a large cluster with many available cores.

Optimal Partition Size:
  • Choosing the right number of partitions for coalesce is crucial. If you set the number of partitions too low, you might end up with partitions that are too large to fit into memory, leading to performance issues.

8. If we have a .gz file, how is it distributed in Spark?

It will not be distributed: a gzip-compressed file is not splittable, so it is read by a single task (effectively single-threaded).
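
A minimal PySpark sketch of this behaviour (the file path is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-read").getOrCreate()

# A gzip-compressed file is not splittable, so Spark reads it with a single task.
df = spark.read.option("header", True).csv("/data/events.csv.gz")
print(df.rdd.getNumPartitions())  # typically 1 for a single .gz file

# Repartition after reading to restore parallelism for downstream processing.
df = df.repartition(8)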

9. What is the difference between “from pyspark.sql.types import *” and “from pyspark.sql.types import xyz”?

Performance and clarity.

If we write from pyspark.sql.types import *, every public name in the module is imported into the namespace, which adds overhead and pollutes the namespace; importing only the names the job needs is cleaner and marginally faster.
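
For example:

# Wildcard import: brings every public name from pyspark.sql.types into the namespace.
from pyspark.sql.types import *

# Explicit import: only the types the job actually needs.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])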

10. What is the default partition size?

The default partition size when reading files is 128 MB (controlled by spark.sql.files.maxPartitionBytes), and the default number of shuffle partitions is 200 (spark.sql.shuffle.partitions).

11. Have you worked on Databricks SQL? What is it all about?

Databricks SQL provides a simple experience for SQL users who want to run quick ad-hoc queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards.

12. What are the SQL endpoints? How are they different from cluster?

SQL endpoints in Databricks are secure entry points that allow users to execute SQL queries against their data. They provide a familiar SQL interface for querying and analyzing data stored in Databricks tables or external data sources like Delta Lake or Apache Spark tables.

Difference from cluster:

  • A cluster in Databricks is a set of computing resources (e.g., virtual machines) used to run data processing and analytics workloads, such as Spark jobs and notebooks.
  • SQL endpoints, on the other hand, are query interfaces specifically designed for executing SQL queries against data without the need to start a separate cluster.
  • While clusters are general-purpose compute resources for various data processing tasks, SQL endpoints focus specifically on SQL query execution and can handle multiple concurrent queries efficiently.

13. What do you mean by interactive and job cluster?

Interactive Cluster: An interactive cluster in Databricks is designed for interactive data exploration and analysis. It provides a collaborative workspace where users can run notebooks and execute code interactively. Interactive clusters allow users to iteratively explore data, visualize results, and experiment with different analyses in a shared environment.

Job Cluster: A job cluster in Databricks is used for running batch jobs and automated data processing tasks. Unlike interactive clusters, job clusters are not continuously running and are launched on-demand to execute specific jobs at scheduled intervals or upon request. Job clusters are ideal for running ETL (Extract, Transform, Load) workflows, machine learning training, and other automated data processing tasks.

14. What is Auto Scaling in a Databricks cluster?

  • Auto Scaling in Databricks cluster automatically adjusts the cluster resources (number of worker nodes) based on workload demand, optimizing resource utilization and cost efficiency.
  • Users can configure the Auto Scaling behaviour, setting minimum and maximum limits for the number of worker nodes and defining scaling policies based on metrics like CPU utilization or memory usage.

15. What are managed and unmanaged tables in Databricks?

In Databricks, managed and unmanaged tables are two different ways of organizing and managing data in the Databricks Delta Lake or Apache Spark table format:

Managed Tables:
  • Managed tables, also known as Delta tables, are tables whose metadata and data files are managed and controlled by Databricks.
  • When you create a managed table, Databricks handles the storage, organization, and cleanup of data files and metadata automatically.
  • This means that when you delete a managed table, Databricks will also delete all the associated data files from storage.
  • Managed tables are easy to work with since Databricks takes care of most of the underlying management tasks.
Unmanaged Tables:
  • Unmanaged tables are tables where you have direct control over the data files, and Databricks does not manage the data or metadata.
  • When you create an unmanaged table, you need to specify the data files’ location, and Databricks relies on you to handle file organization and cleanup.
  • If you delete an unmanaged table, Databricks only removes the metadata, but the data files remain intact in the specified location.
  • Unmanaged tables provide more control over the data, making them suitable for scenarios where you have existing data files in a specific location.

In summary, managed tables are automatically managed by Databricks, including data file storage and cleanup, while unmanaged tables require you to handle the data files’ management and organization manually.
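
A small sketch of both variants using Spark SQL from a notebook; the table names and the storage location are illustrative:

# Managed (Delta) table: Databricks controls both the metadata and the data files.
spark.sql("CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE) USING DELTA")

# Unmanaged (external) table: the data files live at a location we control.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
    USING DELTA
    LOCATION '/mnt/datalake/sales/'
""")

# DROP TABLE sales_managed deletes its data files as well;
# DROP TABLE sales_external removes only the metadata and leaves the files in place.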

16. How do you configure the number of cores in a worker?

Generally, the number of cores is set equal to the number of partitions, so that each core processes one partition at a time.

17. How do you handle BAD records in Databricks?

There are two ways to handle bad records in Databricks:

MODE:

  1. PERMISSIVE (the corrupt value is stored as null and the entire malformed row is captured in a separate corrupt-record column)
  2. DROPMALFORMED (whole records that contain bad values are dropped)
  3. FAILFAST (execution stops as soon as a bad record is recognized)

BAD RECORDS PATH (redirects bad records to a separate path)
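
A hedged sketch of both approaches in PySpark, assuming a SparkSession named spark and a schema are already defined; the paths are placeholders:

# Mode-based handling: choose how malformed rows are treated while reading.
df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")   # or "DROPMALFORMED" / "FAILFAST"
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("/data/input/"))

# Databricks-specific alternative: redirect bad records to a separate location.
df = (spark.read
      .schema(schema)
      .option("badRecordsPath", "/data/bad_records/")
      .json("/data/input/"))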

18. Given a list of JSON strings, count the number of unique IP addresses (the order of output does not matter).

To count the number of unique IP addresses from a list of JSON strings, you can follow these steps:

  1. Parse the JSON strings and extract the IP addresses.
  2. Use a set to store the unique IP addresses.
  3. Count the number of elements in the set, which will give you the count of unique IP addresses.

Here’s the Python code to achieve this:
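
A minimal sketch, assuming each JSON string contains an "ip" field (the sample records are made up):

import json

json_strings = [
    '{"ip": "10.0.0.1", "ts": "2023-01-01T00:00:00"}',
    '{"ip": "10.0.0.2", "ts": "2023-01-01T00:01:00"}',
    '{"ip": "10.0.0.1", "ts": "2023-01-01T00:02:00"}',
]

# Parse each string, collect the IPs into a set, and count the unique values.
unique_ips = {json.loads(record)["ip"] for record in json_strings}
print(len(unique_ips))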

Output (for the sample list above): 2

19. What is difference between narrow and wide transformation?

In Databricks (and Apache Spark), transformations are operations performed on a distributed dataset (RDD or DataFrame). These transformations are broadly classified into two categories: narrow transformations and wide transformations.

1. Narrow Transformations:
  • Narrow transformations are transformations where each input partition contributes to at most one output partition.
  • These transformations do not require data to be shuffled across the partitions, and they can be performed in parallel on each partition independently.
  • Examples of narrow transformations include `map`, `filter`, and `union`.
2. Wide Transformations:
  • Wide transformations are transformations where each input partition can contribute to multiple output partitions.
  • These transformations require data to be shuffled and redistributed across the partitions, which involves a data exchange between the worker nodes.
  • Wide transformations usually result in a stage boundary, which involves a physical data exchange and a shuffle operation.
  • Examples of wide transformations include `groupBy`, `join`, and `sortByKey` (for RDDs).

The main difference between narrow and wide transformations lies in how they handle data distribution and shuffling. Narrow transformations can be executed efficiently in a parallel and distributed manner without the need for shuffling data between partitions. On the other hand, wide transformations require a data shuffle, which can be a more expensive operation as it involves network communication and data redistribution.

To optimize Spark job performance, it’s essential to minimize the usage of wide transformations and use them judiciously. When designing data processing pipelines, try to keep as many transformations as possible as narrow transformations to reduce data shuffling and improve execution efficiency.
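
A small PySpark illustration (the sample data is made up): filter and withColumn are narrow, while groupBy forces a shuffle:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
df = spark.createDataFrame([("US", 10), ("US", 20), ("IN", 5)], ["country", "amount"])

# Narrow transformations: each input partition feeds at most one output partition.
filtered = df.filter(F.col("amount") > 5).withColumn("amount_x2", F.col("amount") * 2)

# Wide transformation: rows with the same key must be shuffled onto the same partition.
totals = filtered.groupBy("country").agg(F.sum("amount_x2").alias("total"))

totals.show()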

20. What is shuffle? why shuffle occur? What causes the shuffle?

In the context of Apache Spark (and Databricks), a “shuffle” refers to the process of redistributing data across partitions in a distributed computing environment. Shuffle occurs when data needs to be rearranged to satisfy a specific operation, such as aggregations, joins, or sorting, where data from multiple partitions needs to be brought together to perform the operation effectively.

Shuffle occurs for the following reasons:

1. Data Reorganization for Operations:
  • Certain operations, like grouping elements by a key in a `groupBy` or performing a `join` between two datasets, require data with the same keys to be brought together into the same partition to perform the operation.
  • For example, when you perform a `groupBy` on a DataFrame or RDD based on a particular column, Spark needs to gather all the rows with the same key from various partitions and organize them together.
2. Data Skew:
  • Data skew occurs when the distribution of data across partitions is uneven, causing some partitions to have significantly more data than others.
  • In scenarios with data skew, a few partitions can become overloaded with data, leading to slower processing times for those partitions and suboptimal resource utilization.
  • To handle data skew, Spark may need to redistribute data through shuffle to achieve a more balanced data distribution across partitions.
3. Performance Optimization:
  • In some cases, shuffle can be used to optimize performance by co-locating related data on the same partitions, reducing the amount of data exchange during certain operations.
  • For example, Spark can leverage shuffle to bring together data that will be accessed together, reducing the need to read data from various partitions separately.

It’s essential to be aware of the cost of shuffle operations since they involve data movement across the network, which can be expensive in terms of computation and time. Minimizing shuffle whenever possible is a key performance consideration when designing and optimizing Apache Spark or Databricks jobs. Proper partitioning of data and careful selection of appropriate operations can help reduce the frequency and impact of shuffling.

21. What is the difference between sortBy and orderBy?

We can use either the sort() or the orderBy() built-in function to sort a DataFrame in ascending or descending order over at least one column.

Sort()

The sort() method sorts the records within each partition and then returns the final output, which means that the order of the output data is not guaranteed, because the data is ordered at the partition level and your DataFrame may have thousands of partitions distributed across the cluster.

Since the data is not collected into a single executor, the sort() method is efficient and thus more suitable when a total ordering is not critical for your use case.

Orderby()

Unlike Sort(), the orderBy() function guarantee a total order in the output. This happen because the data will be collected into a single executor in order to be sorted.

This means that orderBy() is more inefficient compared to sort()
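A minimal PySpark sketch of the two behaviours described above (the column values are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("value", (F.col("id") * 37) % 100).repartition(4)

# Partition-level sort: each of the 4 partitions is sorted independently; no global order, no shuffle.
partition_sorted = df.sortWithinPartitions("value")

# Global sort: Spark range-partitions the data (a shuffle) so the output is ordered end to end.
globally_sorted = df.orderBy("value")

globally_sorted.explain()  # the plan contains an Exchange rangepartitioning step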

22. What is the difference between map and mapPartitions?

map()

map is a transformation that applies a function to each element of the RDD or DataFrame, processing the elements one at a time and applying whatever transformation you specify to each of them. It has a one-to-one mapping between input and output elements.

mapPartitions()

mapPartitions is a transformation that applies a function to each partition of the RDD or DataFrame, processing all the elements within a partition at once. Each input partition maps to one output partition, but the function receives an iterator over the whole partition, so the number of output elements can differ from the number of input elements.

In summary, map processes data element-wise, while mapPartitions processes data partition-wise, which can lead to more efficient data processing when dealing with expensive operations or when you need to maintain some state across multiple elements within a partition.
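A small RDD sketch of the difference (values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(range(10), numSlices=2)

# map: the function is called once per element.
squared = rdd.map(lambda x: x * x)

# mapPartitions: the function is called once per partition and receives an iterator of elements,
# which is handy when there is expensive per-partition setup (e.g. opening a connection).
def add_one(iterator):
    # any one-time setup for the partition would go here
    return (x + 1 for x in iterator)

incremented = rdd.mapPartitions(add_one)

print(squared.collect())      # [0, 1, 4, ..., 81]
print(incremented.collect())  # [1, 2, ..., 10]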

23. What is lazy evaluation in spark?

Lazy evaluation in Spark refers to the evaluation strategy where transformations on RDDs (Resilient Distributed Datasets) or DataFrames are not executed immediately when called. Instead, Spark builds up a logical execution plan (DAG – Directed Acyclic Graph) representing the sequence of transformations. The actual execution is deferred until an action is called.

Key points about lazy evaluation in Spark:

Transformations are Deferred:

When you apply a transformation on an RDD or DataFrame (e.g., `map`, `filter`, `groupBy`, etc.), Spark does not perform the actual computation right away. Instead, it records the transformation in the execution plan.

Logical Execution Plan (DAG):

Spark builds a directed acyclic graph (DAG) that represents the series of transformations in the order they were applied. Each node in the DAG represents a transformation, and the edges represent data dependencies.

Efficient Optimization:

Since Spark has access to the entire transformation sequence before executing any action, it can optimize the execution plan for performance, by combining and reordering transformations to reduce data shuffling and improve parallelism.

Lazy Execution with Actions:

Spark only performs the actual computation when an action is called (e.g., `collect`, `count`, `save`, etc.). Actions trigger the execution of the entire DAG of transformations, starting from the original RDD or DataFrame.

Benefits of lazy evaluation:
  • It allows Spark to optimize the execution plan before actually running the job, leading to more efficient data processing.
  • Lazy evaluation reduces unnecessary computation and data shuffling, as Spark can skip intermediate transformations that are not needed to compute the final result.
  • It provides an opportunity for developers to chain multiple transformations together before triggering the actual computation, which can lead to more concise and modular code.

Overall, lazy evaluation is a fundamental optimization technique in Spark that improves performance by deferring the execution of transformations until the results are required.
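A brief PySpark sketch of lazy evaluation (illustrative data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# These transformations only build up the logical plan (DAG); nothing runs yet.
transformed = df.filter(F.col("id") % 2 == 0).withColumn("squared", F.col("id") * F.col("id"))

# The action below triggers execution of the whole plan.
print(transformed.count())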

24. What are the different levels of persistence in spark?

In Apache Spark, persistence (also known as caching) is a technique to store intermediate or final computation results in memory or disk. By persisting data, Spark can reuse it across multiple Spark operations without re-computing it from the original data source. This can significantly improve the performance of iterative algorithms or when the same data is accessed multiple times.

There are different levels of persistence available in Spark:

1. MEMORY_ONLY (or MEMORY_ONLY_SER):
  • This is the simplest level of persistence, where the RDD or DataFrame is stored in memory as deserialized Java objects (MEMORY_ONLY) or serialized bytes (MEMORY_ONLY_SER).
  • MEMORY_ONLY is suitable when the data fits comfortably in memory and can be quickly accessed without deserialization overhead.
  • MEMORY_ONLY_SER is useful when the data size is larger than the available memory, as it saves memory space by keeping the data in a serialized format.

2. MEMORY_AND_DISK:
  • In this level of persistence, if an RDD or DataFrame does not fit entirely in memory, Spark spills the excess partitions to disk while keeping the remaining partitions in memory.
  • MEMORY_AND_DISK provides a balance between in-memory access speed and disk space utilization. However, accessing data from disk is slower than accessing it from memory, so it may lead to performance degradation for frequently accessed data that spills to disk.

3. MEMORY_AND_DISK_SER:
  • Similar to MEMORY_AND_DISK, but it stores data in serialized format to save memory space.
  • MEMORY_AND_DISK_SER can be beneficial when the dataset is large and cannot fit entirely in memory.

4. DISK_ONLY:
  • In this level of persistence, data is stored only on disk and not kept in memory.
  • DISK_ONLY is suitable when memory is limited, and the data size is much larger than the available memory.
  • Accessing data from disk is slower than accessing it from memory, so performance might be affected.

5. OFF_HEAP:
  • This is an experimental feature in Spark that stores data off-heap (outside the JVM heap space).
  • OFF_HEAP is useful for very large datasets that require a significant amount of memory, as it can help avoid Java garbage collection overhead.

Each persistence level offers different trade-offs between memory utilization and access speed. The choice of persistence level depends on factors like the size of data, available memory, and the frequency of data access in your Spark application. Properly selecting the persistence level can greatly impact the overall performance and efficiency of Spark jobs.

import org.apache.spark.storage.StorageLevel
val rdd2 = rdd.persist(StorageLevel.MEMORY_ONLY)
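A minimal PySpark equivalent (the DataFrame is illustrative):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

df.persist(StorageLevel.MEMORY_ONLY)   # or MEMORY_AND_DISK, DISK_ONLY, ...
df.count()                             # the first action materializes the cache
df.unpersist()                         # release the cached data when it is no longer needed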

25. How to check data skew, i.e. how many rows are present in each partition?
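One way to do this, as a minimal sketch (the DataFrame here is illustrative), is to group by the built-in partition id and count the rows:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).repartition(8)

# Count how many rows land in each partition; a very uneven distribution indicates skew.
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy("partition_id")
   .show())

# Alternative at the RDD level:
print(df.rdd.glom().map(len).collect())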

26. What are the different planes in Databricks?

In cloud computing and distributed systems, the concept of “planes” is used to categorize different aspects of the system’s architecture and management. The three main planes are:

Data Plane:
  • The Data Plane is responsible for handling and processing the actual user data and performing the intended operations on the data.
  • In Databricks, the Data Plane is where data processing tasks take place, such as executing Spark jobs, running notebooks, and performing data transformations.

Control Plane:
  • The Control Plane is responsible for managing and coordinating the overall system’s operation and behavior.
  • In Databricks, the Control Plane manages the cluster lifecycle, monitors resource allocation, schedules jobs, and handles administrative tasks such as authentication, access control, and cluster management.

Management Plane:
  • The Management Plane provides tools and interfaces for users and administrators to configure, monitor, and manage the system and its components.
  • In Databricks, the Management Plane includes the Databricks Workspace, which provides a collaborative environment for managing notebooks, libraries, dashboards, and other resources.

Each plane plays a crucial role in the operation and management of cloud-based distributed systems like Databricks, helping to separate concerns and provide a clear and organized way to design and manage complex systems.

27. How to drop duplicates from the following dataframe?

STATION_ID     DATE_CHANGED
1              6/7/2006 6:00
1              9/26/2000 6:00
2              7/29/2005 6:00
2              6/6/2001 6:00
4              6/8/2001 6:00
4              11/25/2003 7:00
7              6/12/2001 6:00
8              6/4/2001 6:00
8              4/3/2017 18:36
df.sort_values('DATE_CHANGED').drop_duplicates('STATION_ID',keep='last')

28. How to improve the performance of a notebook in Databricks?

To improve notebook performance in Databricks, you can follow these best practices and optimization techniques:

1. Cluster Configuration: Use an appropriately sized cluster based on your workload. Scaling up the cluster can significantly improve performance, especially for resource-intensive tasks. Databricks provides various cluster types with different CPU, memory, and GPU configurations. You can monitor cluster performance to identify bottlenecks.

 2. Autoscaling: Enable autoscaling to automatically adjust the number of workers based on the workload. This helps optimize resource allocation and reduce costs during idle periods.

 3. Caching and Persistence: Utilize caching and persistence to store intermediate results or dataframes that are frequently accessed. This can speed up subsequent operations by avoiding redundant computations.

 4. Optimize Data Format: Choose appropriate data formats to store and read data. For example, using Parquet format for storage is more efficient than using CSV or JSON due to its columnar storage and compression capabilities.

5. Limit Data Shuffling: Minimize data shuffling as it can be a resource-intensive operation. Data shuffling occurs when the data needs to be redistributed across partitions, and it can impact performance significantly.

 6. Tune Spark Configurations: Adjust Spark configurations like `spark.sql.shuffle.partitions`, `spark.executor.memory`, and `spark.driver.memory` to optimize performance based on your specific workload and data size.

7. Use Delta Lake: If possible, leverage Delta Lake for transactional and versioned data storage. It offers significant performance improvements for analytics workloads.

 8. Spark UI Optimization: Use the Spark UI to monitor job progress, identify performance bottlenecks, and optimize query plans. Look for stages with long execution times or high shuffle data, and try to optimize them.

 9. Use Databricks Runtime: Databricks Runtime is optimized by default and includes various performance improvements. Always use the latest stable version for the best performance and new features.

 10. Notebook Code Optimization: Review and optimize your notebook code regularly. Avoid unnecessary data transformations and use efficient Spark APIs and DataFrame operations when possible.

 11. Resource Management: Be mindful of resource utilization in notebooks. Avoid running multiple long-running notebooks simultaneously to prevent resource contention.

 12. Data Pruning: For large datasets, consider pruning the data to limit the amount of data processed, especially if you only need a specific time range or subset of data.

 13. External Data Cache: Use the Databricks external data cache to cache data across multiple notebooks or clusters, reducing data transfer time and computation overhead.

Remember that the effectiveness of these optimizations depends on your specific use case and data, so it’s essential to monitor and fine-tune your configurations regularly to achieve the best performance.
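As a small, hedged illustration of points 3, 5 and 6 above (the values and path are only examples and should be tuned for your workload):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reduce the default number of shuffle partitions (200) for a modest data volume.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Cache a DataFrame that several downstream queries will reuse.
lookup_df = spark.read.format("parquet").load("/mnt/data/lookup/")  # illustrative path
lookup_df.cache()
lookup_df.count()  # materialize the cache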

29. How to pass parameters from a parent notebook to a child notebook?

In Databricks, you can pass parameters from a parent notebook to a child notebook using the dbutils.notebook.run function. This function allows you to call a child notebook and pass parameters as input to the child notebook. Here’s a step-by-step guide on how to do this.

Step 1: Define the Parent Notebook:

In the parent notebook, you can define the parameters you want to pass to the child notebook and then call the child notebook using the dbutils.notebook.run function.

# Parent Notebook

param1 = "value1"
param2 = 42

# Run the child notebook (timeout 0 = no timeout); argument values are passed as strings
result = dbutils.notebook.run("ChildNotebook", 0, {"param1": param1, "param2": str(param2)})

Step 2: Access Parameters in the Child Notebook:

In the child notebook, you can access the parameters passed from the parent notebook using the dbutils.widgets module. You need to define widgets with the same names as the parameters used in the parent notebook.

# ChildNotebook

# Define widgets with the same names as the parameters; the defaults apply when the notebook runs standalone
dbutils.widgets.text("param1", "")
dbutils.widgets.text("param2", "")

param1 = dbutils.widgets.get("param1")
param2 = dbutils.widgets.get("param2")

print("Received parameters in child notebook:")
print("param1:", param1)
print("param2:", param2)

By using the dbutils.widgets.get function with the same name as the parameter, you can retrieve the values passed from the parent notebook.

30. What are the dbutils functions available in Databricks?

In Databricks, the `dbutils` module provides utility functions that help you interact with the Databricks environment, access files, manage libraries, and more. These functions are designed to simplify common tasks when working with Databricks notebooks and clusters. Here are some of the commonly used `dbutils` functions available in Databricks:

1. File System Operations:
  •    dbutils.fs.ls(path): List files in a directory.
  •    dbutils.fs.cp(source, destination): Copy a file or directory.
  •    dbutils.fs.mv(source, destination): Move or rename a file or directory.
  •    dbutils.fs.rm(path, recurse=True): Remove a file or directory.
  •    dbutils.fs.mkdirs(path): Create directories recursively.
2. Notebook Utilities:
  •    dbutils.notebook.run(notebook_path, timeout_seconds, arguments): Run a different notebook and pass parameters.
  •    dbutils.notebook.exit(result): Stop execution of the current notebook and return a result.
3. Widgets:
  •    dbutils.widgets.text(name, defaultValue): Create a text widget.
  •    dbutils.widgets.get(name): Get the value of a widget.
  •    dbutils.widgets.dropdown(name, defaultValue, choices): Create a dropdown widget.
  •    dbutils.widgets.combobox(name, defaultValue, choices): Create a combo box widget.
  •    dbutils.widgets.multiselect(name, defaultValue, choices): Create a multiselect widget.
4. Library Management:
  •    dbutils.library.installPyPI(package): Install a PyPI package as a notebook-scoped library.
  •    dbutils.library.restartPython(): Restart the Python interpreter to apply library changes.
  •    dbutils.library.remove(library_path): Remove a library.
5. Secret Management:
  •    dbutils.secrets.get(scope, key): Retrieve a secret from a Databricks secret scope.
6. Azure Blob Storage Utilities:
  •    dbutils.fs.mount(source, mount_point): Mount an Azure Blob Storage container.
  •    dbutils.fs.unmount(mount_point): Unmount an Azure Blob Storage container.
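A few of these utilities in action inside a Databricks notebook, where dbutils is available by default (the paths, widget name, and secret scope below are illustrative):

# List files under a DBFS path
for f in dbutils.fs.ls("dbfs:/databricks-datasets/")[:5]:
    print(f.path, f.size)

# Create a text widget and read its value
dbutils.widgets.text("env", "dev")
env = dbutils.widgets.get("env")

# Retrieve a secret (scope and key names are illustrative)
storage_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")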

31. What is broadcast join?

In Databricks, a broadcast join is a type of join operation used in Spark SQL that optimizes the performance of joining a small DataFrame (or table) with a larger DataFrame (or table). It is based on the concept of broadcasting the smaller DataFrame to all the nodes of the cluster, allowing each node to perform the join locally without shuffling the data. This approach is beneficial when one of the DataFrames is small enough to fit in memory and the other DataFrame is significantly larger.

When performing a broadcast join, Spark automatically broadcasts the smaller DataFrame to all the worker nodes in the cluster, making it available in memory. Then, the worker nodes use this broadcasted copy to perform the join locally with their respective partitions of the larger DataFrame. By doing so, Spark can avoid costly shuffling operations, which can lead to significant performance improvements.

To leverage the benefits of a broadcast join in Spark SQL, you can explicitly hint the optimizer to use it for a specific join with the BROADCAST hint. By adding a /*+ BROADCAST(small_df) */ hint to the SQL query, you tell Spark to use a broadcast join for the specified small_df DataFrame or table.
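A minimal sketch of both ways to request a broadcast join (the DataFrames are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
large_df = spark.range(1_000_000).withColumnRenamed("id", "key")
small_df = spark.createDataFrame([(i, f"name_{i}") for i in range(100)], ["key", "name"])

# DataFrame API: explicitly broadcast the small side.
joined = large_df.join(broadcast(small_df), on="key", how="inner")

# Spark SQL: the same idea expressed with a hint.
large_df.createOrReplaceTempView("large_df")
small_df.createOrReplaceTempView("small_df")
joined_sql = spark.sql("""
    SELECT /*+ BROADCAST(small_df) */ l.key, s.name
    FROM large_df l
    JOIN small_df s ON l.key = s.key
""")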

32. What is the struct function?

In Databricks, the struct function is a Spark SQL function used to create a new struct column by combining multiple columns together. A struct column is similar to a struct or namedtuple in programming languages, allowing you to group multiple values together into a single column. This is especially useful when you want to create complex data types or nest multiple fields within a single DataFrame column.

The struct function takes multiple column expressions as arguments and returns a new column containing the combined values as a struct. Each input column becomes a field within the resulting struct column. The struct columns can then be used like any other DataFrame column, allowing you to access nested data in a structured manner.

The struct function takes the following arguments:

  • column1, column2, …, columnN: The input columns that you want to combine into the struct column.
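A minimal PySpark sketch (the DataFrame and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", 30, "HR"), ("Bob", 25, "IT")],
    ["name", "age", "department"],
)

# Combine several columns into a single nested struct column.
df_struct = df.withColumn("details", F.struct("name", "age", "department"))

df_struct.printSchema()   # details appears as a struct with name, age and department fields
df_struct.select("details.name", "details.age").show()   # nested fields are accessed with dot notation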


33. What does the explode function do?

In Databricks, the explode function is a Spark SQL function used to transform an array or map column into multiple rows, effectively “exploding” the elements of the array or map into separate rows. This is particularly useful when you have data with nested arrays or maps, and you want to flatten the structure to perform operations on individual elements.

The explode function works on columns that contain arrays or maps and produces a new DataFrame with each element of the array or map expanded into separate rows. For arrays, each element of the array becomes a separate row, and for maps, the key-value pairs of the map become separate rows.

The explode function takes the following arguments:

  • column: The column containing the array or map that you want to explode.
  • alias: An optional alias to give a name to the resulting column after the explosion.
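A minimal sketch with an illustrative fruits array column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Alice", ["apple", "banana"]), ("Bob", ["orange"])],
    ["name", "fruits"],
)

exploded = df.select("name", F.explode("fruits").alias("fruit"))
exploded.show()
# +-----+------+
# | name| fruit|
# +-----+------+
# |Alice| apple|
# |Alice|banana|
# |  Bob|orange|
# +-----+------+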


As you can see, the fruits array column is exploded into separate rows, where each fruit from the array becomes an individual row in the resulting DataFrame.

The explode function is a powerful tool when working with nested data structures, and it allows you to perform operations on individual elements within arrays or maps efficiently in Databricks. It can be combined with other Spark SQL functions to perform complex data manipulations on nested data.

34. Is it possible to combine Databricks and Azure Notebooks?

They operate similarly, but data transfer to the cluster has to be coded manually. This integration is now straightforward thanks to Databricks Connect. Compared with plain Jupyter, Databricks also adds a number of improvements that are specific to the Databricks platform.

35. What exactly does caching entail?

A cache is a temporary holding area for data, and caching is the process of temporarily storing information so it can be served more quickly later. For example, when you return to a frequently visited website, the browser retrieves the data from its cache rather than from the server, which saves time and lessens the load on the server.

36. What are the different types of caching?

There are four types of caching that stand out:

  • Information caching
  • Web page caching
  • Distributed caching
  • Output or application caching.

37. Should you ever remove and clean up any leftover Data Frames?

Cleaning up DataFrames is not usually necessary unless you use cache(), since cached data consumes memory and storage on the cluster. You should, however, clean up (unpersist) any large datasets that are cached but are no longer being used.

38. What different ETL operations does Azure Databricks carry out on data?

The various ETL processes carried out on data in Azure Databricks are listed below:

  • The data is transformed and moved from Databricks to the data warehouse.
  • The data is loaded into Blob storage.
  • Blob storage is used to stage the data temporarily.

39. Does Azure Key Vault work well as a substitute for Secret Scopes?

That is certainly doable, but some setup is necessary, and it is in fact the preferred approach. Rather than changing a secret that is defined directly in the scope, create a secret scope backed by Azure Key Vault, so that when a secret's value needs to change it is updated in Key Vault.

40. How should Databricks code be handled when using TFS or Git for collaborative projects?

To start, TFS (TFVC) is not supported; your only choices are Git and distributed Git repository systems. Although it would be ideal to integrate Databricks with a Git directory of notebooks, it works much like a separate project clone: you start by creating a notebook, committing it to version control, and afterwards updating it.

41. Does Databricks have to be run on a public cloud like AWS or Azure, or can it also run on on-premises infrastructure?

No. At present your only options are public clouds such as AWS and Azure. Databricks is built on Spark, which is open source, so you could build your own Spark cluster and run it on private infrastructure, but you would be giving up access to Databricks’ robust features and administration.

42. How is a Databricks personal access token created?

  • In the Databricks workspace, click the “user profile” icon in the top right corner.
  • Choose “User Settings.”
  • Open the “Access Tokens” tab.
  • A “Generate New Token” button will then show up; click it.

43. Is code reuse possible in the Azure notebook?

To reuse code from another notebook, we first import it into our notebook. There are two ways to do this:

  • If the code is in a different workspace, we must first package it as a module/component before importing it into our notebook.
  • If it is in the same workspace, we can import and use the code right away.

44. What is a Databricks cluster?

A Databricks cluster is the set of configuration settings and computing resources that allows us to run data science, big data, and powerful analytic workloads such as production ETL, workflows, deep learning, and stream processing.

45. Does Databricks provide database services? If not, where does it save Delta tables?

Azure Databricks does not directly provide traditional database services like SQL Server or Azure SQL Database. Instead, it is an Apache Spark-based analytics platform designed to process and analyze big data workloads. However, it can work with various data sources and databases to perform data processing and analytics tasks.

One of the popular data storage options in Azure Databricks is Delta Lake, which is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to data lakes. Delta Lake provides some database-like functionalities on top of cloud storage systems like Azure Data Lake Storage or Azure Blob Storage.

When you save a Delta table in Azure Databricks, the data is typically stored in your specified cloud storage (e.g., Azure Data Lake Storage or Azure Blob Storage). Delta Lake keeps track of the changes made to the data, allowing for versioning and transactional capabilities.

46. Is Databricks only available in the cloud and does not have an on-premises option?

Yes. Databricks’ foundational software, Apache Spark, was made available as an on-premises solution, allowing internal engineers to manage both the data and the application locally. However, Databricks itself is a cloud-native application, so users who access it with data held on local servers will encounter network problems. An on-premises alternative to Databricks also has to be weighed against workflow inefficiencies and the risk of inconsistent data.

47. What does it mean to set the inferSchema parameter to True?

In the context of data processing frameworks like Apache Spark and PySpark (used in Azure Databricks), the inferSchema parameter is used when reading data from external sources like CSV, JSON, Parquet, etc. It is a configuration option that determines whether Spark should automatically infer the data types of the columns in the DataFrame based on the data content.

When inferSchema is set to True, Spark will automatically examine a sample of the data to infer the data types for each column. This process involves analyzing the data values in each column and making educated guesses about the appropriate data types. For example, if a column predominantly contains numeric values, Spark might infer it as an integer or a double data type. Similarly, if a column primarily contains strings, it might infer it as a string data type.

Setting inferSchema to True can be convenient and save time, especially when working with large datasets with many columns. Spark’s ability to automatically infer data types allows you to read data without explicitly specifying the schema, making the code shorter and more flexible.

However, it’s essential to be cautious when using inferSchema, as it relies on a sample of the data and might not always make accurate decisions, especially when the data is sparse or there are outliers. In such cases, it’s recommended to provide an explicit schema to ensure the correct data types are used.
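A brief illustration (the file path and schema are only examples; spark is the session provided in a Databricks notebook):

# Schema inference enabled: types are guessed from a sample of the data.
df = spark.read.csv("dbfs:/mnt/raw/sales.csv", header=True, inferSchema=True)
df.printSchema()

# Safer alternative: provide an explicit schema.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("product", StringType(), True),
    StructField("amount", DoubleType(), True),
])
df_explicit = spark.read.csv("dbfs:/mnt/raw/sales.csv", header=True, schema=schema)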

48. What are the differences between Auto Loader and COPY INTO?

Auto Loader:

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.

When should you use Auto Loader instead of COPY INTO?

  • If you’re going to ingest files in the order of thousands, you can use COPY INTO. If you are expecting files in the order of millions or more over time, use Auto Loader. Auto Loader requires fewer total operations to discover files compared to COPY INTO and can split the processing into multiple batches, meaning that Auto Loader is less expensive and more efficient at scale.
  • If your data schema is going to evolve frequently, Auto Loader provides better primitives around schema inference and evolution. See Configure schema inference and evolution in Auto Loader for more details.
  • Loading a subset of re-uploaded files can be a bit easier to manage with COPY INTO. With Auto Loader, it’s harder to reprocess a select subset of files. However, you can use COPY INTO to reload the subset of files while an Auto Loader stream is running simultaneously.
(spark.readStream.format("cloudFiles")  # Returns a streaming source that reads data as it arrives, based on the trigger
  .option("cloudFiles.format", "csv")  # Format of the incoming files
  .option("cloudFiles.schemaLocation", "dbfs:/location/checkpoint/")  # The location to store the inferred schema and subsequent changes
  .load(data_source)
  .writeStream
  .option("checkpointLocation", "dbfs:/location/checkpoint/")  # The location of the stream's checkpoint
  .option("mergeSchema", "true")  # Infer the schema across multiple files and merge the schema of each file; enabled by default for Auto Loader when inferring the schema
  .table(table_name))  # Target table

49. How to partition data in Databricks?

Partitioning data in Databricks, or any other data processing platform, can greatly improve query performance and overall data organization. Partitioning involves dividing your data into smaller, more manageable parts based on specific criteria, such as a particular column’s values. This helps reduce the amount of data that needs to be scanned when executing queries, leading to faster query execution times. Here’s how you can partition data in Databricks:

1. Choose a Partition Column: Select a column in your dataset that you want to use for partitioning. Common choices are date columns, country codes, or any other attribute that can be used to logically group your data.

2. Organize Data: Arrange your data in the storage system (like a data lake) in a way that reflects the partitioning scheme you’ve chosen. This usually involves creating subdirectories for each partition, where the name of the subdirectory corresponds to the partition key’s value. For example, if you’re partitioning by year and month, your directory structure might contain subdirectories such as year=2023/month=07/ under the table’s root folder.

3. Write Partitioned Data: When writing data into Databricks, make sure to specify the partition column and value appropriately. Databricks has integrations with various data sources like Parquet, Delta Lake, and more, which support partitioning.

For example, if you’re using Apache Spark to write partitioned Parquet data, you can do something like this:
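A minimal sketch of such a write (the DataFrame, path, and partition columns are illustrative):

# Write partitioned Parquet data; Spark creates Hive-style subdirectories such as
#   .../sales/year=2023/month=07/part-....parquet
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet("dbfs:/mnt/curated/sales/"))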

4. Querying Partitioned Data: When querying partitioned data, Databricks and Spark will automatically take advantage of the partitioning scheme to optimize query performance. You can filter your queries based on the partition column, and Spark will only scan the relevant partitions, reducing the amount of data scanned.

50. How to rename a column in a dataframe?
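A minimal PySpark sketch, assuming a DataFrame df with columns old_name and other_col (names are illustrative):

from pyspark.sql import functions as F

# Rename a single column.
df_renamed = df.withColumnRenamed("old_name", "new_name")

# Rename while selecting, using aliases.
df_renamed_many = df.select(F.col("old_name").alias("new_name"), F.col("other_col"))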

51. If we have an employee data frame and we want to add 1000 to the salary of employees who earn more than 10,000, how do we do that?

You can achieve this using PySpark by applying a transformation to the DataFrame to update the salary values based on a condition. Here’s a sample code to add 1000 to the salary of employees who have a salary greater than 10,000:
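A minimal sketch using when()/otherwise() (the data is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
employee_df = spark.createDataFrame(
    [(1, 9500), (2, 12000), (3, 15000)],
    ["emp_id", "salary"],
)

updated_df = employee_df.withColumn(
    "salary",
    F.when(F.col("salary") > 10000, F.col("salary") + 1000).otherwise(F.col("salary")),
)
updated_df.show()   # 9500 stays the same; 12000 -> 13000; 15000 -> 16000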

The same pattern extends to updates based on multiple conditions by chaining additional when() clauses or combining predicates with & and |.

52. What is a surrogate key in databases?

A surrogate key is a unique identifier that is added to a database table to serve as a primary key. Unlike natural keys, which are based on existing data attributes (such as names, dates, or other real-world information), surrogate keys are typically artificially created and have no inherent meaning. They are used primarily to improve the efficiency and performance of database operations.

Surrogate keys offer several benefits:

1. Uniqueness: Surrogate keys are generated in a way that ensures their uniqueness across the entire table. This eliminates the risk of duplicate entries that might occur with natural keys.

2. Stability: Since surrogate keys are not based on changing attributes like names or addresses, they remain stable over time. This makes them ideal for linking data and maintaining relationships even when other attributes change.

3. Performance: Surrogate keys can improve database performance by making indexes and joins more efficient. They tend to be shorter than meaningful attributes, leading to faster data retrieval and better query performance.

4. Data Privacy: Surrogate keys help maintain data privacy by not exposing sensitive or personal information in the key itself. This can be important for compliance with data protection regulations.

5. Simplifying Joins: When tables are linked through relationships, using surrogate keys can simplify the process of joining tables, as the keys are consistent and predictable.

6. Data Warehousing: In data warehousing scenarios, surrogate keys are often used to create fact and dimension tables, facilitating efficient reporting and analysis.

An example of a surrogate key might be an auto-incrementing integer that is assigned to each new record added to a table. This key is unique and has no inherent meaning, making it an ideal primary key for efficient database operations.

It’s worth noting that while surrogate keys offer various advantages, there are cases where natural keys (based on meaningful attributes) might be more appropriate, especially when the data itself carries important semantic meaning. The choice between surrogate and natural keys depends on the specific requirements of the database schema and the overall data model.

53. If I have a dataframe called student with Name, Email, Department and ActivityDate columns, and it contains duplicate entries, how can I remove the duplicates while keeping only the latest record for each name, using a window function?

In the sketch below, a window specification is defined, partitioned by the Name column and ordered by ActivityDate in descending order. row_number().over(window_spec) assigns a rank within each partition (each name), and we then keep only the rows where the rank equals 1, which correspond to the latest record within each partition.
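A minimal sketch of that approach (the sample rows are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
student_df = spark.createDataFrame(
    [
        ("Asha", "asha@example.com", "IT", "2023-05-01"),
        ("Asha", "asha@example.com", "IT", "2023-07-15"),
        ("Ravi", "ravi@example.com", "HR", "2023-06-10"),
    ],
    ["Name", "Email", "Department", "ActivityDate"],
)

window_spec = Window.partitionBy("Name").orderBy(F.col("ActivityDate").desc())

latest_df = (student_df
             .withColumn("rn", F.row_number().over(window_spec))
             .filter(F.col("rn") == 1)
             .drop("rn"))
latest_df.show()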

SQL Interview Questions and answers – 2021

1. What is DBMS?

A Database Management System (DBMS) is a program that controls creation, maintenance and use of a database. DBMS can be termed as File Manager that manages data in a database rather than saving it in file systems.

Database Management System (DBMS) is a software application that interacts with the user, applications, and the database itself to capture and analyze data. A database is a structured collection of data. 

A DBMS allows a user to interact with the database. The data stored in the database can be modified, retrieved and deleted and can be of any type like strings, numbers, images, etc.

There are two types of DBMS:

  • Relational Database Management System: The data is stored in relations (tables). Example – MySQL.
  • Non-Relational Database Management System: There is no concept of relations, tuples and attributes.  Example – MongoDB

2. What is RDBMS?

RDBMS stands for Relational Database Management System. RDBMS store the data into the collection of tables, which is related by common fields between the columns of the table. It also provides relational operators to manipulate the data stored into the tables.

3. What is SQL?

SQL stands for the Structured Query Language. SQL is a standard query language used for maintaining the relational database and perform many different operations of data manipulation on the data. SQL initially was invented in 1970. It is a database language used for database creation, deletion, fetching rows and modifying rows, etc. sometimes it is pronounced as ‘sequel.’

4. What are the usages of SQL?

  • SQL is responsible for maintaining the relational data and the data structures present in the database.
  • To execute queries against a database
  • To retrieve data from a database
  • To insert records in a database
  • To update records in a database
  • To delete records from a database
  • To create new databases
  • To create new tables in a database
  • To create views in a database
  • To perform complex operations on the database.

5. Does SQL support programming?

SQL stands for Structured Query Language, and on its own it is not a full programming language. Standard SQL does not have loops, conditional statements, or general-purpose logical operations (those come from procedural extensions such as PL/SQL or T-SQL), so it cannot be used for much beyond working with data. It is used as a command (query) language to access databases. The primary purpose of SQL is to retrieve, manipulate, and update data and to perform complex operations like joins on the data present in the database.

6. What do you mean by table and field in SQL?

A table refers to a collection of data organised in rows and columns. A field refers to a column in a table.

For example:

Table:StudentInformation
Field: Stu Id, Stu Name, Stu Marks

7. What are joins in SQL?

A JOIN clause is used to combine rows from two or more tables, based on a related column between them. It is used to merge two tables or retrieve data from there. There are 4 types of joins, as you can refer to below:

Inner join: Inner Join in SQL is the most common type of join. It is used to return all the rows from multiple tables where the join condition is satisfied. 

Left Join:  Left Join in SQL is used to return all the rows from the left table but only the matching rows from the right table where the join condition is fulfilled.

Right Join: Right Join in SQL is used to return all the rows from the right table but only the matching rows from the left table where the join condition is fulfilled.

Full Join: Full join returns all the records when there is a match in any of the tables. Therefore, it returns all the rows from the left-hand side table and all the rows from the right-hand side table.

8. What is a Database?

Database is nothing but an organized form of data for easy access, storing, retrieval and managing of data. This is also known as structured form of data which can be accessed in many ways.

Example: School Management Database, Bank Management Database.

9. What is a primary key?

A primary key is a combination of fields which uniquely specify a row. This is a special kind of unique key, and it has implicit NOT NULL constraint. It means, Primary key values cannot be NULL.

10. What is a unique key?

A unique key constraint uniquely identifies each record in the database. This provides uniqueness for the column or set of columns.

A Primary key constraint has automatic unique constraint defined on it. But not, in the case of Unique Key.

There can be many unique constraints defined per table, but only one primary key constraint per table.

11. What is a foreign key?

A foreign key is a field (or set of fields) in one table that refers to the primary key of another table. The relationship between two tables is created by referencing the foreign key to the primary key of the other table.

12. What is a Self-Join?

A self JOIN is a case of regular join where a table is joined to itself based on some relation between its own column(s). Self-join uses the INNER JOIN or LEFT JOIN clause and a table alias is used to assign different names to the table within the query.

SELECT A.emp_id AS "Emp_ID",A.emp_name AS "Employee",
B.emp_id AS "Sup_ID",B.emp_name AS "Supervisor"
FROM employee A, employee B
WHERE A.emp_sup = B.emp_id;

13. What is a Cross-Join?

Cross join can be defined as a cartesian product of the two tables included in the join. The table after join contains the same number of rows as in the cross-product of number of rows in the two tables. If a WHERE clause is used in cross join then the query will work like an INNER JOIN.

SELECT stu.name, sub.subject 
FROM students AS stu
CROSS JOIN subjects AS sub;

14. What is the difference between CHAR and VARCHAR2 datatype in SQL?

Both CHAR and VARCHAR2 are character datatypes, but VARCHAR2 is used for character strings of variable length whereas CHAR is used for strings of fixed length. For example, CHAR(10) always stores exactly 10 characters (shorter values are padded), whereas VARCHAR2(10) can store strings of any length up to 10 characters, e.g. 6, 8, or 2 characters.

15. What are Constraints?

Constraints in SQL are used to specify the limit on the data type of the table. It can be specified while creating or altering the table statement. The sample of constraints are:

  • NOT NULL
  • CHECK
  • DEFAULT
  • UNIQUE
  • PRIMARY KEY
  • FOREIGN KEY

16. What is the difference between DELETE and TRUNCATE statements?

DELETE vs TRUNCATE:

  • DELETE is used to delete one or more rows from a table, whereas TRUNCATE deletes all the rows from a table.
  • You can roll back data after using a DELETE statement; you cannot roll back a TRUNCATE.
  • DELETE is a DML command, whereas TRUNCATE is a DDL command.
  • DELETE is slower than TRUNCATE; TRUNCATE is faster.

17. What do you mean by data integrity? 

Data Integrity defines the accuracy as well as the consistency of the data stored in a database. It also defines integrity constraints to enforce business rules on the data when it is entered into an application or a database.

18. What is an Index?

An index refers to a performance tuning method of allowing faster retrieval of records from the table. An index creates an entry for each value and hence it will be faster to retrieve data.

CREATE INDEX index_name 	 /* Create Index */
ON table_name (column_1, column_2);

DROP INDEX index_name; 	 /* Drop Index */

19. Explain different types of index in SQL.

There are three types of index in SQL namely:

Unique Index:

This index does not allow the field to have duplicate values if the column is unique indexed. If a primary key is defined, a unique index can be applied automatically.

CREATE UNIQUE INDEX myIndex
ON students (enroll_no);

Clustered Index:

This index reorders the physical order of the table and searches based on the basis of key values. Each table can only have one clustered index.

Non-Clustered Index:

Non-Clustered Index does not alter the physical order of the table and maintains a logical order of the data. Each table can have many nonclustered indexes.

20. What is a Data Definition Language?

Data definition language (DDL) is the subset of SQL that defines the data structures of the database when it is first created. It consists of commands such as CREATE, ALTER and DROP for database objects such as schemas, tables, views and sequences.

21. What is a Data Manipulation Language?

Data manipulation language makes the user able to retrieve and manipulate data. It is used to perform the following operations.

  • Insert data into database through INSERT command.
  • Retrieve data from the database through SELECT command.
  • Update data in the database through UPDATE command.
  • Delete data from the database through DELETE command.

22. What is Data Control Language?

Data control language allows you to control access to the database. DCL is the only subset of the database which decides that what part of the database should be accessed by which user at what point of time. It includes two commands GRANT and REVOKE.

GRANT: to grant the specific user to perform a particular task

REVOKE: to cancel previously denied or granted permissions.

23. What is Normalization in SQL?

Normalization is used to decompose a larger, complex table into simple and smaller ones. This helps us in removing all the redundant data.

Generally, in a table, we will have a lot of redundant information which is not required, so it is better to divide this complex table into multiple smaller tables which contains only unique information.

Let’s look at the rules for a table to be in first normal form, second normal form and third normal form:

First normal form:

  • A relation schema is in 1NF, if and only if:
  • All attributes in the relation are atomic(indivisible value)
  • There are no repeating elements or group of elements.

Second normal form:

  • A relation is said to be in 2NF, if and only if:
  • It is in 1st Normal Form.
  • No partial dependency exists between non-key attributes and key attributes.

Third Normal form:

  • A relation R is said to be in 3NF if and only if:
  • It is in 2NF.
  • No transitive dependency exists between non-key attributes and key attributes through another non-key attribute.

24. What is pl sql?

PL SQL stands for Procedural language constructs for Structured Query Language. PL SQL was introduced by Oracle to overcome the limitations of plain sql. So, pl sql adds in procedural language approach to the plain vanilla sql.

While plain SQL lets us write DDL and DML queries, PL/SQL lets us create functions, triggers, and other procedural constructs.

25. What is ETL in SQL?

ETL stands for Extract, Transform and Load. It is a three step process, where we would have to start off by extracting the data from sources. Once we collate the data from different sources, what we have is raw data. This raw data has to be transformed into tidy format, which will come in the second phase. Finally, we would have to load this tidy data into tools which would help us to find insights.

26. What is OLAP?

OLAP stands for Online Analytical Processing. It is a class of software programs characterized by a relatively low frequency of online transactions. Queries are often complex and involve a number of aggregations.

27. What is a “TRIGGER” in SQL?

  • A trigger allows you to execute a batch of SQL code when an insert, update or delete command is run against a specific table as TRIGGER is said to be the set of actions that are performed whenever commands like insert, update or delete are given through queries.
  • The trigger is said to be activated when these commands are given to the system.
  • Triggers are the particular type of stored procedures that are defined to execute automatically in place or after data modifications.
  • Triggers are generated using CREATE TRIGGER statement.

28. What is ACID property in a database?

ACID property is used to ensure that the data transactions are processed reliably in a database system.

A single logical operation of a data is called transaction.

ACID is an acronym for Atomicity, Consistency, Isolation, Durability.

Atomicity: it requires that each transaction is all or nothing. It means if one part of the transaction fails, the entire transaction fails and the database state is left unchanged.

Consistency: the consistency property ensures that the data must meet all validation rules. In simple words, a transaction never leaves the database without completing its state.

Isolation: this property ensures that concurrent transactions execute without interfering with each other. The main goal of providing isolation is concurrency control.

Durability: durability simply means that once a transaction has been committed, it will remain so, come what may even power loss, crashes or errors.

29. What are the syntax and use of the COALESCE function?

The syntax of COALESCE function:

COALESCE(exp1, exp2, …. expn) 

The COALESCE function is used to return the first non-null expression given in the parameter list.

30. What is the SQL query to display the current date?

There is a built-in function in SQL called GetDate() which is used to return the current timestamp.

31. Are NULL values same as that of zero or a blank space?

A NULL value is not at all the same as zero or a blank space. A NULL value represents a value that is unavailable, unknown, unassigned, or not applicable, whereas zero is a number and a blank space is a character.

32. What is subquery in SQL?

A subquery is a query inside another query where a query is defined to retrieve data or information back from the database. In a subquery, the outer query is called as the main query whereas the inner query is called subquery. Subqueries are always executed first and the result of the subquery is passed on to the main query. It can be nested inside a SELECT, UPDATE or any other query. A subquery can also use any comparison operators such as >,< or =.

33. What is the difference between ‘HAVING’ CLAUSE and a ‘WHERE’ CLAUSE?

HAVING clause can be used only with SELECT statement. It is usually used in a GROUP BY clause and whenever GROUP BY is not used, HAVING behaves like a WHERE clause.
Having Clause is only used with the GROUP BY function in a query whereas WHERE Clause is applied to each row before they are a part of the GROUP BY function in a query.

34. How can you fetch first 5 characters of the string?

There are a lot of ways to fetch characters from a string. For example:

SELECT SUBSTRING(StudentName, 1, 5) AS StudentName FROM Student;

35. What is a View?

A view is a virtual table which consists of a subset of the data contained in one or more tables. Since a view is not physically stored, it takes less space. A view can combine data from one or more tables, depending on the relationship.

36. What are Views used for?

A view refers to a logical snapshot based on a table or another view. It is used for the following reasons:

  • Restricting access to data.
  • Making complex queries simple.
  • Ensuring data independence.
  • Providing different views of same data.

37. What is a Stored Procedure?

A Stored Procedure is a function which consists of many SQL statements to access the database system. Several SQL statements are consolidated into a stored procedure and execute them whenever and wherever required which saves time and avoid writing code again and again.

38. List some advantages and disadvantages of Stored Procedure?

Advantages:

A Stored Procedure can be used as a modular programming which means create once, store and call for several times whenever it is required. This supports faster execution. It also reduces network traffic and provides better security to the data.

Disadvantage:

The only disadvantage of Stored Procedure is that it can be executed only in the database and utilizes more memory in the database server.

39. What is Auto Increment in SQL?

Autoincrement keyword allows the user to create a unique number to get generated whenever a new record is inserted into the table.
This keyword is usually required whenever PRIMARY KEY in SQL is used.

40. What is a Datawarehouse?

Datawarehouse refers to a central repository of data where the data is assembled from multiple sources of information. Those data are consolidated, transformed and made available for the mining as well as online processing. Warehouse data also have a subset of data called Data Marts.

41. What are the RANK, DENSE_RANK and ROW_NUMBER functions?

Lets understand all terms with following example:

Here I have an Employee table; the following examples use sample data from this table.

ROW_NUMBER() Function with Partition By clause:

ROW_NUMBER plays a very important role in SQL Server. The ROW_NUMBER function assigns a sequential number to each row and, together with the OVER clause, supports more complex ordering of rows in report output.

Syntax

ROW_NUMBER() OVER ( [PARTITION BY value_expression , ... [ n ]] ORDER BY order_by_clause )

If we want to add a row number within each group, PARTITION BY helps reset the numbering for every group. Let’s take a look at the Employee table.

SELECT *, ROW_NUMBER() OVER (PARTITION BY EmpName ORDER BY EmpName) AS RowNumber FROM Employee;

The following is the OUTPUT of the above query

Rank() Function in SQL Server:

This function assigns a rank to each row; rows with the same value share the same rank, but it leaves gaps between the groups.

SELECT EmpName, EmpSalary, RANK() OVER (ORDER BY EmpName) AS RankID FROM Employee;

The following is the OUTPUT of the above query.

The gap equals the number of occurrences: for example, if EmpName = "atul" is repeated 3 times and has rank 1, the next rank will be 1 + 3 = 4, and the same applies to the following values.

Dense_Rank() Function in SQL Server:

DENSE_RANK() is similar to RANK(), with the only difference that it does not leave gaps between groups.

SELECT EmpName, EmpSalary, DENSE_RANK() OVER (ORDER BY EmpName) AS DRankID FROM Employee;

The following is the OUTPUT of the above query:

Summary: ROW_NUMBER assigns a unique sequential number to every row within a partition; RANK gives tied rows the same rank and leaves gaps after ties; DENSE_RANK gives tied rows the same rank without leaving gaps.

42. Explain Normalization and De-Normalization.

Normalization is the process of removing redundant data from the database by splitting the table in a well-defined manner in order to maintain data integrity. This process saves much of the storage space.

De-normalization is the process of adding up redundant data on the table in order to speed up the complex queries and thus achieve better performance.

43. What is a deadlock?

It is an unwanted situation where two or more transactions are waiting indefinitely for one another to release the locks. 

44. What is ALIAS command?

This command provides another (alias) name to a table or a column. It can be applied to a table or a column in a SQL query using the "AS" keyword.

45. What is Datawarehouse?

Datawarehouse is a central repository of data from multiple sources of information. Those data are consolidated, transformed and made available for the mining and online processing. Warehouse data have a subset of data called Data Marts.

46. What is OLTP?

OLTP stands for Online Transaction Processing, is a class of software applications capable of supporting transaction-oriented programs. An essential attribute of an OLTP system is its ability to maintain concurrency. To avoid single points of failure, OLTP systems are often decentralized. These systems are usually designed for a large number of users who conduct short transactions. Database queries are usually simple, require sub-second response times and return relatively few records. Here is an insight into the working of an OLTP system.

47. What are the differences between OLTP and OLAP?

OLTP stands for Online Transaction Processing, is a class of software applications capable of supporting transaction-oriented programs. An important attribute of an OLTP system is its ability to maintain concurrency. OLTP systems often follow a decentralized architecture to avoid single points of failure. These systems are generally designed for a large audience of end users who conduct short transactions. Queries involved in such databases are generally simple, need fast response times and return relatively few records. Number of transactions per second acts as an effective measure for such systems.

OLAP stands for Online Analytical Processing, a class of software programs which are characterized by relatively low frequency of online transactions. Queries are often too complex and involve a bunch of aggregations. For OLAP systems, the effectiveness measure relies highly on response time. Such systems are widely used for data mining or maintaining aggregated, historical data, usually in multi-dimensional schemas.


Questions and answers for dimensionality reduction

1. What is dimensionality reduction?

When we have a dataset with many input features, the model is likely to overfit. To reduce the input feature space, we can either drop features or extract new ones; this is basically dimensionality reduction.

Now let’s discuss more about both techniques.

  • Drop irrelevant or redundant features, as they do not contribute to the accuracy of the predictive problem. When we drop such input variables, we lose the information stored in them.
  • Alternatively, we can create new independent variables from the existing input variables. This way we do not lose the information in the original variables. This is feature extraction.

2. Explain Principal Component Analysis?

When we have a large dataset of correlated input variables, we often want to reduce the number of input variables to a smaller feature space while still maintaining the critical information. We can do this with Principal Component Analysis (PCA).

Now let’s understand the PCA features in little bit more details.

PCA reduces the dimensionality of the data using feature extraction. It does this by constructing new variables (principal components) that explain most of the variability in the dataset.

PCA removes redundant information by removing correlated features. PCA creates new independent variables that are independent from each other. This takes care of multicollinearity issue.

PCA is an unsupervised technique. It only looks at the input features and does not take into account the output or the target variable.
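A minimal scikit-learn sketch on the Iris dataset (illustrative; note that the data is standardized before fitting PCA):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Standardize first: PCA is sensitive to the scale of the input features.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component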

3. Importance and limitation of Principal Component Analysis?

Following are the advantages of PCA

  • Removes correlated features – To visualize all of our features, we must reduce the dimensionality of the data; to do that we need to find the correlations among the features (correlated variables). Finding correlations manually across thousands of features is nearly impossible, frustrating and time-consuming. PCA does this for you efficiently.
  • Improves algorithm performance – With too many features, the performance of your algorithm can degrade drastically. PCA is a very common way to speed up a machine learning algorithm by getting rid of correlated variables which do not contribute to any decision making.
  • Improves visualization – It is very hard to visualize and understand data in high dimensions. PCA transforms high-dimensional data to low-dimensional data (e.g. 2 dimensions) so that it can be visualized easily.

Following are the limitation of PCA

  • Independent variable become less interpretable – After implementing PCA on the dataset, your original features will turn into Principal Components. Principal Components are the linear combination of your original features. Principal Components are not as readable and interpretable as original features.
  • Data standardization is must before PCA – You must standardize your data before implementing PCA, otherwise PCA will not be able to find the optimal Principal Components.
  • Information loss – Although Principal Components try to cover maximum variance among the features in a dataset, if we don’t select the number of Principal Components with care, it may miss some information as compared to the original list of features.

4. What is t-SNE and How to apply t-SNE ?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. In simpler terms, t-SNE gives you a feel or intuition of how the data is arranged in a high-dimensional space.

Let’s understand each and every term in details.

Stochastic – not definite, but based on random probability.
Neighbourhood – concerned only with retaining the structure of neighbouring points.
Embedding – picking up a point from the high-dimensional space and placing it into a lower dimension.

5. How to apply t-SNE ?

Basically, t-SNE measures similarities between points in the high-dimensional space.

Let’s look at the image below and try to understand the algorithm.

t-SNE example

Suppose we are reducing d-dimensional data into 2-dimensional data using t-SNE.
From the above picture we can see that x2 and x3 are in the neighborhood of x1 [N(x1) = {x2, x3}] and x5 is in the neighborhood of x4 [N(x4) = {x5}].

As t-SNE preserves the distances in a neighborhood,

d(x1, x2) ≈ d(x’1, x’2)
d(x1, x3) ≈ d(x’1, x’3)
d(x4, x5) ≈ d(x’4, x’5)

For every point, it constructs a notion of which other points are its ‘neighbors,’ trying to make all points have roughly the same number of neighbors. Then it tries to embed the points in the lower-dimensional space so that each point keeps approximately the same neighbors.
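A minimal scikit-learn sketch (the dataset and parameter values are illustrative):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# Embed into 2 dimensions; perplexity roughly corresponds to the number of neighbours considered.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)   # (1797, 2)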

6. What is Crowding problem?

Sometimes it is impossible to preserve the distances in all the neighbourhoods. This is called the crowding problem: when we model a high-dimensional dataset in 2 (or 3) dimensions, it is difficult to segregate nearby datapoints from moderately distant datapoints, and gaps cannot form between natural clusters.

For example, when a data point, ‘x’ is a neighbor to 2 data points that are not neighboring to each other, this may result in losing the neighborhood of ‘x’ with one of the data points as t-SNE is concerned only within the neighborhood zone.

7. How to interpret t-SNE output?

There are 3 parameters
a) Steps: number of iterations.
b) Perplexity: can be thought of as the number of neighboring points.
c) Epsilon: the learning rate, which determines how quickly the embedding is adjusted at each iteration.