Implementing Data Warehouse with SQL Server 2012 Interview Questions - Part 2 of 2 (On Applications from TK 70-463)

70-463 Application Questions

Case Scenario 1 (Chapter 1 - Case Scenario 1): A Quick POC Project

You are hired to implement a quick POC data warehousing project. You have to prepare the schema for sales data. Your customer’s SME would like to analyze sales data over customers, products, and time. Before creating a DW and tables, you need to make a couple of decisions and answer a couple of questions:

What kind of schema would you use? Answer.

For a quick POC project, you should use the Snowflake schema.

Hide the Answer.

What would the dimensions of your schema be? Answer.

You would have customer, product, and date dimensions.

Hide the Answer.
Do you expect additive measures only? Answer.

No, you should expect some non-additive measures as well. For example, prices and various percentages, such as discount percentage, are non-additive.

Hide the Answer.

Case Scenario 2 (Chapter 1 - Case Scenario 2): Extending the POC Project

After you implemented the POC sales data warehouse in Case Scenario 1, your customer was very satisfied. In fact, the business would like to extend the project to a real, long-term data warehouse. However, when interviewing analysts, you also discovered some points of dissatisfaction.

Interviews

Here’s a list of company personnel who expressed some dissatisfaction during their inter- views, along with their statements:

■ Sales SME “I don’t see correct aggregates over regions for historical data.”
■ DBA Who creates reports “My queries are still complicated, with many joins.”

You need to solve these issues.

How would you address the Sales SME issue? Answer.

You should implement a Type 2 solution for the slowly changing customer dimension.

Hide the Answer.

What kind of schema would you implement for a long-term DW? Answer.

For a long-term DW, you should choose a Star schema.

Hide the Answer.
How would you address the DBA’s issue? Answer.

With Star schema design, you would address the DBA’s issue automatically.

Hide the Answer.

Case Scenario 3 (Chapter 2 - Case Scenario 1): Slow DW Reports

You have created a data warehouse and populated it. End users have started using it for reports. However, they have also begun to complain about the performance of the reports. Some of the very slow reports calculate running totals. You need to answer the following questions.

What changes can you implement in your DW to speed up the reports? Answer.

You should consider using columnstore indexes, indexed views, data compression, and table partitioning.

Hide the Answer.

Does it make sense to check the source queries of the reports with running totals? Answer.

Yes, it is definitely worth checking the queries of the running totals reports. The queries probably use joins or subqueries to calculate the running totals. Consider using window functions for these calculations.

Hide the Answer.

Case Scenario 4 (Chapter 2 - Case Scenario 2): DW Administration Problems

Your end users are happy with the DW reporting performance. However, when talking with a DBA, you were notified of potential problems. The DW transaction log grows by more than 10 GB per night. In addition, end users have started to create reports from staging tables, and these reports show messy data. End users complain to the DBA that they cannot trust your DW if they get such messy data in a report.

How can you address the runaway log problem? Answer.

You should check the DW database recovery model and change it to Simple. In addition, you could use the DBCC SHRINKFILE command to shrink the transaction log to a reasonable size.

Hide the Answer.

What can you do to prevent end users from using the staging tables? Answer.

End users apparently have permissions, at least the SELECT permission, on the staging tables. Advise the DBA to revoke permissions from end users on staging tables. In addition, to speed up the security administration, you should put all staging tables in a separate schema, thus allowing the DBA to administer them as a group.

Hide the Answer.

Case Scenario 5 (Chapter 3 - Case Scenario 1): Copying production Data to Development

Your IT solution has been deployed to production, version one is complete, and it is now time to start the work on the next version. To keep the data in the development and testing environment as up to date as possible, your manager has asked you to design a data movement solution to be used on a regular basis to copy a subset of production data from the production data store into the development data store.

What method would you use to transfer the data on demand? Answer.

An SSIS package stored in the file system, in the database, or in an unscheduled SQL Server Agent Job would be appropriate.

Hide the Answer.

How would you maximize the reusability of the solution? Answer.

At the very least, the SSIS package would have to be parameterized so that it can be configured appropriately for the specific environment in which it is going to be used. Additionally, the programmatic logic should account for merging new or modified data with existing data.

Hide the Answer.

Case Scenario 6 (Chapter 3 - Case Scenario 2): Connection Manager parameterization

Data warehousing maintenance solutions have outgrown your company’s existing infrastructure, so new servers had to be installed, and this time all data warehousing applications will use a dedicated network. In phase 1, all of your SSIS solutions must be redeployed to new servers, and the system administrator has decided that SSIS projects deserve more network bandwidth, so in phase 2 you will be allowed to dedicate a greater share of the network bandwidth to your data movement processes.

How much additional development work will you have to do to complete phase 1? Answer.

A properly parameterized SSIS package can be redeployed and reconfigured as many times as needed, without the need for any additional development activities.

Hide the Answer.

What will you have to do to reconfigure all of the connection managers to use larger network packets for phase 2? Answer.

The network packet size property of OLE DB connections is not exposed to parameterization; therefore, the entire connection string would have to be parameterized.

Hide the Answer.

Case Scenario 7 (Chapter 4 - Case Scenario 1): Creating a Cleanup Process

In your data management system, there are two data warehouses (that is, in addition to the operational data store); the principal data warehouse contains all the data, including all historical data, and the secondary data warehouse is used by web applications, exposing your data to customers, and should therefore only contain current data. Your data warehousing solution is already in place, moving data into both data warehouses.

You have been tasked with creating an additional process to determine which data is no longer current and must therefore be removed from the secondary data warehouse.

How will you determine which rows need to be removed? Answer.

There are several techniques that could be used, but probably the simplest one would be to identify current rows first, and then use an anti-join (for example, by using EXCEPT or NOT EXISTS) against the entire row set, to identify the rows that are not current.

Hide the Answer.

What technique would you use to perform the removal as efficiently as possible? Answer.

To use SQL Server for maximum efficiency when removing a set of rows, you could use a staging table in the same database, fill it with primary key values identifying the rows to be removed, and then issue a set-oriented DELETE statement (such as DELETE… WHERE EXISTS or MERGE…WHEN MATCHED THEN DELETE).

Hide the Answer.
You identify performance issues with joins. You realize that there are no indexes created explicitly in the system; there are only the ones created by default through primary key and unique constraints. What can you recommend? Answer.

The customer should examine foreign key relationships and evaluate creating indexes on the foreign key columns.

Hide the Answer.

Case Scenario 8 (Chapter 4 - Case Scenario 2): Integrating External Processes

In its data management scenarios, your company uses a mix of proprietary solutions that your team has developed and third party, off-the-shelf solutions that to you, a seasoned developer, appear just like black boxes—you can trust these solutions to work as expected without even a faint clue about how this is actually done.

In your data warehousing solution, you need to consolidate data from a solution of your own and from two diverse black boxes, one of which has a special data extraction tool (a stand-alone application) that retrieves data from the internal store and saves it to files in the file system, while the other one exposes its API layer, providing access to its internal data retrieval functionalities.

What functionalities does SSIS provide that can be used to integrate such diverse solutions into a single SSIS process? Answer.

To integrate external processes, implemented as individual applications, into an SSIS solution, you could use the Execute Process task or a Script task with the appropriate business logic to control the execution of the external applications. To improve reusability and simplify deployment, a Custom task could also be developed to replace the Script task.

Hide the Answer.

How would you use the SQL Server platform to solve this problem? Answer.
The three diverse solutions would probably be integrated by using three different techniques, as follows:
- To retrieve the data from the data store of the proprietary solution, you could simply create a connection manager to connect to this data store and use standard SSIS tasks to extract the data.
- To retrieve the data from the second application, the External Process task could be used to execute data retrieval, followed by a Foreach Loop task to process the files and a Bulk Insert task or a data flow task to extract the data from them by using the Flat File connection manager.
- To retrieve the data from the third application, a Script task could be used to access the application’s API, retrieve the data from the application’s internal data store, and perhaps save it to a file (for example, an SSIS File or a Flat File) to be processed later.
To simplify data transformation and loading, all three solutions would only focus on data extraction, loading the data into a staging table, and thus providing a single, unified data source to be consumed by a single transformation and loading process.

Hide the Answer.

Case Scenario 9 (Chapter 5 - Case Scenario): New Source System

The marketing department has requested that you add the market share by product informa- tion into the data warehouse. They have purchased a yearly subscription for this information feed, and the data is available monthly on an FTP server. The data is in a flat file. It contains information about the product group in the first field, and the market share information is in 12 fields (a separate field for each month).

You need to integrate this new source to an existing DW. Answer the following questions:

How would you model the fact table to store market share results? Answer.

You would need to have three columns (two dimensions and one measure)—product group, reporting month, and a market share value. The fact table would be semi-additive (not additive over time).

Hide the Answer.

Which tasks and transformations in SSIS would you use to populate this fact table? Answer.

First, you would use an FTP task to copy the file to the machine on which SSIS is installed. You can then import the file by using a data flow task configured with a Flat File source adapter. Then you would need to unpivot the columns for months to get a row for each month, and use the Lookup task to get appropriate surrogate keys from existing the dimension, and use the ODBC Destination adapter to write the data to the new fact table.

Hide the Answer.

Case Scenario 10 (Chapter 6 - Case Scenario 1): Complete Solutions

Your manager has asked you to develop a complete data warehousing solution—the ETL as well as multidimensional processing, and even report caching. He suggested using a single SSIS package for this particular solution.

The manager has used the following claims about his proposal:

Using a single SSIS package simplifies deployment and maintenance.

Placing all the elements into a single package will also allow any additional develop- ment activities to be performed more quickly.

Using a single SSIS package is the only way of ensuring the correct order of execution for such a high number of operations.

SSIS development does not require any division of labor; all developmental tasks can be performed by a single developer.

Variables in one package are not accessible to other packages; therefore, using a single package ensures proper variable reusability.

How would you approach this proposal? Do you agree with the manager’s claims? Answer.

When discussing the proposed plan with your manager, you could start by addressing the concerns (and some misconceptions) that he has regarding SSIS development.

In previous versions of SQL Server, it might have been more difficult to deploy and maintain an SSIS solution distributed across multiple SSIS packages, but in SQL Server 2012, the Project Deployment model addresses many of these issues.

With the introduction of project-scoped parameters and connection managers, it becomes possible for individual SSIS packages to share resources. New additions to the solution will allow you to leave existing elements unchanged.

The Execute Package task has been available in previous versions, so this statement is completely unfounded.

It is true that in the majority of cases a BI developer would be capable of developing the complete data warehousing solution on his or her own, but there really is no need for this. And furthermore, even if only one BI developer is available, he or she might still benefit from dividing conceptually different tasks into individual SSIS packages— both for improved manageability as well as improved reusability.

This is true. Variables are not exposed outside of SSIS packages and cannot be accessed from other packages. This shortcoming is reduced with the introduction of parameters, which can be used to pass scalar values to child packages. On the other hand, variables never were the only method available for exchanging data between the elements of an SSIS solution. (Hint: consider files and staging tables.)

Hide the Answer.

Case Scenario 11 (Chapter 6 - Case Scenario 2): Data-Driven Execution

You have been tasked with consolidating existing data warehousing solutions that are currently implemented as individual SSIS packages performing data extraction, transformations, and loading. A part of the solution depends on two external services that provide the data from remote data stores, saving it to files on your local network.

All data processing should be performed during a nightly maintenance window. Your system communicates to the external services using special tools that are installed locally. The services are not completely reliable—occasionally one or both of them are unresponsive, but the data they provide is essential to your company. If any of the services fails to produce the expected files, you want the administrators at the remote locations to be notified so that they can remove any obstacles.

All available data should, of course, still be processed and loaded into the data warehouse; however, if external data is not available, or is not complete, the changes must not be propagated to the Analysis Services database used for reporting. The middle management in your company prefers the analytical data to be complete, rather than current and incomplete, and they are prepared to wait a day or two for it to be updated accordingly.

What data warehousing concepts can you use to accomplish your task? Answer.

You could consolidate the solution by using a master package; all of the requirements suggest that this would be the right approach: there are multiple SSIS packages that need to be executed in a particular order; two external processes should also be integrated into the solution; and you also need to automate the communication between your local system and the two remote systems to respond to circumstances detected during execution.

Hide the Answer.

What solutions, provided by the SSIS toolset, can you use? Answer.

Obviously you will use the Execute Package and Execute Process tasks to control the execution of existing solutions, as well as the Send Mail task to notify the remote administrators of failure.

Hide the Answer.
How can you prevent incomplete data from being loaded into the SSAS database? Answer.

To prevent incomplete data from being available to the consumers, you will need to detect issues with the data extraction, transformation, and data loading processes, and only invoke the propagation of data into the SSAS database if no errors were detected. You could use variables and precedence constraint expressions to guide the execution of the master package, based on available information about the state of the environment and the child processes.

Hide the Answer.

Case Scenario 12 (Chapter 7 - Case Scenario): Loading Large Dimension and Fact Tables

You work for a telecommunication provider and you have to load the customer dimension and the call detail records (CDR) fact table. Your company has 2 million subscribers and you receive 5 million rows daily regarding CDR data.

You need to integrate this new source to an existing DW. Answer the following questions:

How would you load the Customer dimension table—should you use Slowly Changing Dimension task? Answer.

Based on the size of the data, you should not use the Slowly Changing Dimension task. It would be wise to implement some sort of CDC on the source data to get only changed rows from the source system. If this is impossible, then you should add an additional column in your dimension to store the hash value of the columns to allow you to compare what has changed without comparing each column. Based on the size of the data, also adding a MERGE statement using an Execute SQL task to apply the necessary INSERT and UPDATE operations to the dimension would be beneficial.

Hide the Answer.

How would you create your fact table to store CDRs? How would you implement your daily loading of CDRs? Answer.

Create a fact table with multiple partitions—probably on a daily basis would be most useful if you have to reload the data for a specific day. Apply compression or, if using SQL Server 2012, use the column store index, because on a monthly basis you will get around 150 million rows and on a yearly basis 1.8 billion. The loading process should load the data into a heap table and then, by switching partitions and applying either compression or a column store index, include the new partition in the existing fact table.

Hide the Answer.

Case Scenario 13 (Chapter 8 - Case Scenario): Auditing and Notifications in SSIS Packages

You are creating a set of SSIS packages to move data from a data warehouse to data mart tables. You need to ensure that your database is in a consistent state and not in an intermedi- ate state when an error occurs. In addition, you need to prepare an audit trail of information and build alerts into your package design.

How would you handle the following requirements?

Each destination table in you data mart must have its inserts and updates fully complete and committed; otherwise you need to roll back the changes so that the table is in a consistent state. You also need a way to restart the packages from the point of failure. Answer.

Because the commit level is configured on a table-by-table basis, all the data flow and control flow tasks that operate on a single task need to be grouped together in a container, and the TransactionOption property must be set to Required for each container. You should also implement checkpoints on the containers, which will let you restart the packages at the point of failure after you have resolved the problems. You can simplify this implementation by creating a master package that has checkpoints turned on and that uses the Execute Package task to call child packages for each destination table that has transactions enabled.

Hide the Answer.

You need to capture the count of rows that are inserted into the destination tables. Answer.

To capture the destination row count, you add several Row Count transformations to your package. Place a Row Count transformation in the pipeline before each destination and store the number of rows inside a variable. To capture the variable values, set the RaiseChangeEvent property to True and add the OnVariableValueChange event handler. This event will fire each time the variable value has changed, and you can then use an SQL statement to store the number of rows to a tracking table.You could validate the XML against an XML schema collection.

Hide the Answer.
When a package fails, you must immediately send email messages that identify the task that failed and describe the error. Answer.

Using an OnError event, you can create a new event handler on the package level. You can then add a Send Mail task to the event handler and configure it to send an email message when any tasks fail. You can use the SourceName and ErrorDescription system variables to identify which task failed and to get the error description.

Hide the Answer.

Case Scenario 14 (Chapter 9 - Case Scenario): Making SSIS Packages Dynamic

You are creating a set of SSIS packages to move data from flat files and different databases to a data warehouse. Because of strict development, test, and production procedures, your SSIS project must support these possibilities.

How would you address these issues? What would your solution look like?

You need to test different deployment options for your SSIS packages in the development phase by using SSDT Answer.

Add parameters for the connection strings. Create development, test, and production build configurations and bind parameters to each of them with different data source values. This will allow you to execute the package from SSDT against different configurations without manually changing the value of parameters.

Hide the Answer.

You have a development, test, and production environment and would like to minimize any changes to packages when they are ready for the test phase. Answer.

Add parameters and parameterize all needed connection managers.

Hide the Answer.
You need to parameterize the location of the flat files but still be able to dynamically set the correct file name when using the Foreach Loop container to read all the files inside the specified folder. Answer.

Create a parameter that will hold the value of the file location folder. Create a new variable to store the current file name. Using SSIS expression language, set the new variable value to combine the value from the parameter with the value you get from the Foreach Loop container. Use the new variable as the property expression for the Foreach Loop to dynamically change the fully qualified file name while the package is running.

Hide the Answer.

Case Scenario 15 (Chapter 10 - Case Scenario 1): Implementing SSIS Logging at Multiple Levels of the SSIS Object Hierarchy

You have been tasked with configuring logging in an existing SSIS project. The project imple- ments a single master SSIS package, controlling the execution of multiple child packages. The members of the development team and the administrators, tasked with the mainte- nance of deployed SSIS solutions, have sent you the following list of requirements:

The start and the end of the execution for each package must be logged to the Windows Application Log.

The start and the end of the execution of each control flow task must be logged to a comma-separated text file that will be copied, after all child packages have completed, from the server at which the SSIS project will be deployed to a secondary location where the files will be loaded into the process management and analysis database.

Warnings and errors must be logged to the Windows Application Log and to the text files.

In Foreach Loop containers using Foreach File Enumerators, the values of the variables holding file names must be logged at each iteration to the text files.

An SSIS log provider for Windows Event Log and an SSIS log provider for text files must be assigned to all packages and enabled at the package level exclusively.
To prevent duplicates in the log, every Execute Package task in the master package should have logging disabled.
Package level logs must capture OnPreExecute, OnPostExecute, OnWarning, and OnError events.

In every child package, the SSIS log provider for text files must be enabled at the task level and not enabled at package or container levels.

Task-level logs must capture OnPreExecute, OnPostExecute, OnWarning, and OnError events.

To reduce development time, you can configure one task and save its settings to a file, and then you can apply it to all other tasks.

The SSIS log provider for text files should be enabled at every Foreach Loop container using the Foreach File Enumerator.
Container-level logs must capture the OnVariableValueChanged event, and the variables used to hold the file names should have the RaiseChangedEvent property enabled (set to True). However, note that as a side effect, all other container-scoped variables of the Foreach Loop container with this property enabled will have their changes logged as well.

Hide the Answer.

Case Scenario 16 (Chapter 10 - Case Scenario 2): Implementing SSIS auditing at Different Levels of the SSIS Object Hierarchy

You have been tasked with adding appropriate auditing to an existing SSIS project. The project implements a single master SSIS package controlling the execution of multiple child packages.

The members of the development team and the analysts, tasked with analyzing the execu- tion characteristics of deployed SSIS solutions, have prepared the following list of require- ments for you:

At the project, package, and task level, audit the duration of the execution with millisecond accuracy; the numbers of rows inserted, updated, and deleted; and the size of inserted, updated, and deleted data in bytes. Answer.

At the task level, row counts and data sizes must be collected for deleted data during execution; for instance, by using the Row Count and Aggregate transformations.

At the package level, values can be aggregated from task-level data, and at the project level, values can be aggregated from package-level data.

Hide the Answer.

At the row level, each row must contain reference information allowing the row to be associated with corresponding audit data at higher levels and with the corresponding log entries. Answer.

At the row level, all insert and all update operations must save data into the data- base—either in the same table with primary data or in separate tables.
Row numbers and data sizes can then be computed at the row level and aggregated to higher levels.

Hide the Answer.

Case Scenario 17 (Chapter 11 - Case Scenario 1): Using Strictly Structured Deployments

In your organization, SSIS development and SSIS operations are strictly separated; not only is development restricted to its own dedicated environment with copies of production data stores, the production environment cannot be accessed by any member of the development team. As a consequence, SSIS deployment can only be performed by production administrators.

How can you use your development tools to deploy SSIS solutions despite these restrictions? Answer.

Typically, developers are not locked out completely from production environments; however, there are cases in which production data stores, even if used exclusively to host SSIS processes, are off limits to members of development teams.

Even under circumstances such as these, SSIS deployment is still possible:

The development team will have to configure SSDT to place the project deployment files in a location in the file system that is accessible to the production administrators. This must be set for each project, in the project properties, under Configuration Prop- erties. In the Build section of the properties, you can locate the OutputPath property and replace the default value, bin, with the path to the common location. When the projects are ready for deployment, you will have to notify the administrators to proceed with the deployment.

Hide the Answer.

What tools will be required in the production environment to facilitate SSIS deployments? Answer.

The administrators will either use SSMS or the Integration Services Deployment Wizard to perform the deployment.
Even if they have not installed SSMS to the production server, they can connect to it remotely; however, the Integration Services Deployment Wizard will also be available at the destination server.

Hide the Answer.

Case Scenario 18 (Chapter 11 - Case Scenario 2): Installing an SSIS Server

You have been tasked with installing SQL Server 2012 on a new server, to be used as the SSIS server.

Which SQL Server features will you have to install? How would you configure the instance of SQL Server to serve as the SSIS server? Answer.

To install an SSIS server, you must install the SQL Server 2012 Database Engine and Integration Services on the destination server and also create and configure an SSISDB catalog on the newly installed instance of SQL Server. To provide automated SSIS execution, you must also install SQL Server Agent.

Hide the Answer.

What other features could you also install, without turning the server into a develpment machine? Answer.

You could also install the Management Tools, and the Client Tools Connectivity feature. Without SQL Server Data Tools, the server cannot be used for SSIS development, so you would not install SSDT.

Hide the Answer.

Case Scenario 19 (Chapter 12 - Case Scenario 1): Deploying SSIS packages to Multiple Environments

You have been asked by your manager to prepare a deployment plan for an SSIS data warehousing solution. The plan is to allow the solution to be executed on six different servers, each of them targeting the same environment but processing only a specific subset of source data.

Of course, your boss also expects you to find a way to minimize the maintenance, and possibly centralize monitoring and troubleshooting of the solution.t scenario hires you again. They ask for your advice regarding the following update processes:

Can this be achieved at all? If so, how? Answer.

To allow the same SSIS solution to be executed at different servers, it would have to be deployed multiple times.

You would start by parameterizing individual SSIS packages so that the queries used for data extraction would use parameters or would be created dynamically, so that only a well-defined subset of data could be extracted when the package is executed.

You could then associate these parameters with environment variables and prepare six different environments for six servers.

Hide the Answer.

Which SQL Server 2012 features could you use to distribute SSIS processing to meet the requirements? What would you still have to develop yourself? Answer.

You would create an SSISDB catalog at each server and deploy the solution multiple times.
SQL Server Integration Services would also have to be installed at each of the six serv- ers. You would have to create a SQL Server Agent job at each of the six servers, and configure the relevant job steps, executing SSIS packages, to use the corresponding environment created earlier.
To centralize SSIS execution monitoring you would have to consolidate information from six SSIS servers in a single location, for instance, by implementing partitioned views.

Hide the Answer.

Case Scenario 20 (Chapter 12 - Case Scenario 2): Remote Executions

Your organization would like to minimize total cost of ownership, and so SQL Server Integra- tion Services will only be installed on a single production server. This server will host the SSISDB catalog where all the company’s SSIS solutions will be stored, maintained, and executed.

The data managed by your organization should be available in a distributed environment, where subsets of data are managed at different servers and in different applications. Some subsets of data need to be downloaded to remote locations and occasionally uploaded from them to the central location.

It has been decided that all data movement operations will be implemented as SSIS packages.

Is it possible to execute SSIS packages remotely? If so, how? Answer.

Remote SSIS execution is supported. You could execute SSIS packages by using Transact-SQL, execute SQL Server Agent jobs by using Transact-SQL, or even go as far as designing your own service (a Windows service or a web service) to invoke SSIS processes on the central server.

Hide the Answer.

What features of the SQL Server platform could you use to enable remote SSIS execution? Answer.

As long as all machines used in this distributed environment belong to the same domain or appropriate authentication can be used to access the SSIS server, you would not have to design any part of this solution yourself. In a heterogeneous environment, you would have to resort to a custom service, deployed to the same domain as the SSIS server but also accessible from outside the domain.

Hide the Answer.

Case Scenario 21 (Chapter 13 - Case Scenario): Tuning an SSIS Package

You have been tasked with optimizing a package created by one of your colleagues. When you open the package, you notice that it is very large and complex—it has multiple data flow transformations, merging data from multiple sources and writing the data to your SQL Server instance.

How would you approach the optimization? Answer.

First you should measure the package’s memory consumption to see if all the operations are kept in the memory and no disk swapping occurs. Then you should measure each task execution to evaluate which task takes most of the time. Then start decomosing the task. If it is a data flow task, you should measure the execution time for reading the data (to determine how fast the source adapters are), for applying transformations, and for writing data to the destination. With this information, you can then focus on applying different optimization techniques for transformations or looking at the possible enhancements at the source or the destination.

Hide the Answer.

What would be your preferences when redesigning the package? Answer.

Because the package contains multiple data flows, consider splitting multiple data flows to multiple packages and analyzing each data flow in terms of transformations used. Observe whether some row-by-row operations are done against the database and, if possible, push down some processing logic (such as sorting or joining data) to the underlying database. Check whether the Lookup transformation is fully cached and that the query contains only the needed columns. The last step is to optimize the buf- fer settings of the data flow.

Hide the Answer.

Case Scenario 22 (Chapter 14 - Case Scenario): Data Warehouse Not Used

You are in a conversation with a DBA of a company you know. The DBA has told you that the company has already implemented a data warehouse and an analytical system based on the data from the data warehouse. However, the DBA complains that end users do not use this analytical system. Although it is a much more complex process, the end users prefer to create their own analyses in Excel from LOB database data. Answer the following questions:

What do you think went wrong with the data warehouse? Answer.

End users probably do not trust the data warehouse (DW) data.

Hide the Answer.

How could you help this company? Answer.

You should suggest that the DBA do data profiling of the DW data in order to assess the quality of its data. If the quality is good, then the DBA should educate end users and prove to them they can rely on the DW data. If the quality is not good, then the DBA should suggest to managers that they implement a data quality solution and define explicit data stewardship roles in the company.

Hide the Answer.

Case Scenario 23 (Chapter 15 - Case Scenario 1): Introducing an MDM Solution

A company you know complains about data quality. The company acquired three smaller companies in the past, and now they have four different LOB systems. They have big prob- lems with inaccurate and duplicate customer data. In addition, they would like to implement a data warehouse. You need to help this company. Answer the following questions:

Is this company a candidate for a master data management solution? Answer.

Yes, this company is prepared for an MDM solution.

Hide the Answer.

What data would you immediately define as master data? Answer.

Customer data is definitely part of master data.

Hide the Answer.
Would an MDM solution provide a reasonable ROI for this company? Answer.

An MDM solution would give a reasonable ROI to this company. With an MDM solution, you would resolve data conflicts. Data is reused in many places and will be reused even more with a data warehouse implemented.

Hide the Answer.

Case Scenario 24 (Chapter 15 - Case Scenario 2): Extending the POC Project

Assume that you have implemented Master Data Services for the company in Case Scenario 1. However, your customer was still not satisfied. When interviewing department managers, you discovered some points of dissatisfaction.

Interviews

Here’s a list of company personnel who expressed some dissatisfaction during their inter- views, along with their statements:

Sales Manager “Information about marketing campaign discounts for specific customers frequently comes from the Marketing department after the sales are already done.”

Marketing Manager “In the Sales department, they want to update data about discounts, whereas our department clearly owns this data.”

You need to solve these issues.

What do you think is missing in the MDM solution you implemented? Answer.

You didn’t resolve authority problems. In addition, you didn’t define workflow properly.

Hide the Answer.

Would you suggest any specific role for customer and campaign data? Answer.

You should suggest to your customer that they define an explicit data stewardship role. Currently each department is responsible for its own part of the master data; data stewards should be responsible for the complete master data.

Hide the Answer.

Case Scenario 25 (Chapter 16 - Case Scenario): Editing Batches of MDS Data

A company that uses MDS has to update data for 4,000 customers in four different countries. They have dedicated four people to this task. However, they have realized that using the Mas- ter Data Manager application is not the most effective way for such mass updates. In addition, they have realized that two of their users actually occasionally updated the same members, and thus they spent some wasted effort. The company has asked you for a better solution. Answer the following questions:

How can you help them speed up the editing process? Answer.

You should introduce Excel 2010 with MDS Add-in for Excel.

Hide the Answer.

How would you prevent users from updating the same members? Answer.

You should use hierarchy member permissions to limit each user to members of a single country.

Hide the Answer.

Case Scenario 26 (Chapter 17 - Case Scenario): Improving Data Quality

A company that you are consulting for complains that their customer data quality is low. Addresses in particular are often wrong and inconsistent. However, they have heard that there is a third-party provider that provides accurate addresses for all of the countries in which your client’s customers reside. Answer the following questions:

What tool would you suggest the client use to solve this problem? Answer.

You should recommend that the company use Data Quality Services.

Hide the Answer.

How can you use the third-party data? Answer.

When creating a knowledge base, you can specify a reference data provider for the address domains.

Hide the Answer.

Case Scenario 27 (Chapter 18 - Case Scenario): Preparing Data for Data Mining

A company that you are consulting with uses a predictive data mining model to predict customers’ income. They prepared the data themselves. They complain that the predicted values are nearly always too high when they test the model on the test set. They want you to help them improve the predictions.

What do you suspect is the reason for the incorrect predictions? Answer.

You should suspect that something is wrong with their sample. They probably selected richer customers only for the training set. Maybe they created a pattern by incorrectly splitting the data into the training and the test set and included richer customers in the training set and poorer customers in the test set.

Hide the Answer.

How could you improve the predictions? Answer.

You should re-check the sample and the split and correct them as needed.

Hide the Answer.

Case Scenario 28 (Chapter 19 - Case Scenario): Data Cleansing

You have been asked to implement data extraction by using Regular Expressions in SSIS.

Can this even be achieved by using SSIS? If so, how? Answer.

Yes. SSIS can be used to implement data extraction by using Regular Expressions. The most appropriate method would be to design a custom data flow transformation component, using the .NET Framework Regular Expressions implementation.

Hide the Answer.

What SSIS features would you use for this operation? Considering the fact that more than one substring of the same input string can be matched by a particular regular expression pattern, how will you handle this in your data flow? Answer.

The transformation component would have to read the string from an appropriate column, pass it to the Regular Expressions engine, and receive the results, which should then be stored in the appropriate columns. To allow multiple substrings to be extracted from a single row of source data, the component would have to implement an asynchronous output for the results of the transformation. After all, it is possible that the Regular Expression pattern in question could not find any matches in the source string at all

Hide the Answer.
How will you allow SSIS package developers to use this component? Answer.

The SSIS package developers should be allowed to pass the Regular Expression pattern to the transformation via a custom component property; they could, for instance, store the pattern in a package variable, or use the component to connect to an external data source and use Regular Expressions patterns stored in a table in a SQL Server database.

Hide the Answer.

Case Scenario 29 (Chapter 20 - Case Scenario): Improving Data Quality

A company for whom you are consulting needs to match a huge amount of customer data with existing MDS data. MDS data serves as the authoritative source. In addition, the company complains that sometimes even the MDS data is duplicated, especially when advanced end users use Excel as an MDS client tool. Answer the following questions:

What would you suggest to use for identity mapping of large amounts of data? Answer.

You should suggest that the company use the SSIS Fuzzy Lookup transformation.

Hide the Answer.

How would you prevent the insertion of duplicate data in MDS through Excel? Answer.

You should suggest creating a DQS matching policy KB and using it in Excel.

Hide the Answer.