In this blog, Lars will briefly touch upon the most recent outlook for Data Intelligence and the strategic shift of its toolset towards Datasphere. Next, my colleague Dirk-Jan Kloezeman and I put the new replication flows to the test, by following in the footsteps of SAP’s recently published technology blogs. So, if you’re curious how far Datasphere’s flows can take you, be sure to read on!
What will we cover in this blog?
- How will Data Intelligence transition to Datasphere?
- Our thoughts on Datasphere Replication Flows
- Connecting SAP Datasphere to SAP S/4HANA (on-premise)
- Connecting SAP Datasphere to Google BigQuery
- Creating a Replication Flow from S/4HANA to BigQuery
- Creating a Replication Flow from S/4HANA to Azure Data Lake
From SAP Data Intelligence to SAP Datasphere?
Along with the transformation of SAP Data Warehouse Cloud to SAP Datasphere last year, SAP subtly implied that SAP Data Intelligence (DI) would ‘increasingly move towards’ Datasphere in the near future. As stated in the introduction, recent updates to Datasphere have brought more options for data ingestion and replication, different pipeline configuration options and increased support for non-standard operators in data flows within the solution.
More recently, Datasphere’s popular Replication Flow functionality has been extended to use Datasphere as a pipeline between SAP and non-SAP systems, with new nested chains and data previews aimed at improving these flows. We see this as a good step towards incorporating SAP Data Intelligence’s toolset into Datasphere, even though we also still recognize a very apparent gap in functionality between the solutions.
Although more of DI’s functionalities will undoubtedly find their way to Datasphere in the months to come, DI itself will be removed from CPEA contracts as of July 1st 2024. Existing DI customers will continue to receive (maintenance) support until December 31st 2028, after which they will be fully dependent on Datasphere for DI-like functionalities. This sounds far away, but as stated before, we still see that a lot of ground will need to be covered to bring Datasphere to the level that Data Intelligence is on now. Datasphere’s Replication Flows, and in particular its premium outbound flows that are capable of connecting to non-SAP systems, could prove a valuable stepping stone in this process (see the below image for a brief overview, courtesy of SAP).
Our thoughts on Datasphere Replication Flows
Before we take a closer, step-by-step look at Datasphere’s Replication Flows, we want to start off by saying that they offer a convenient method for moving data from an SAP environment to a non-SAP environment. Replication flows are relatively simple to set up and use; however, this functionality does come with costs if you decide to move your data to non-SAP targets.
While using Datasphere in this manner does not utilize any storage on the actual system (and thus no costs are incurred there), customers will have to pay one thousand euros per month per 20 gigabytes of data traffic passing through (premium) replication flows. These costs come on top of the standard Datasphere costs, which you can check in SAP’s new capacity unit estimator here.
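To make this pricing concrete, here is a minimal sketch of the cost calculation, assuming the one-thousand-euro figure applies per started block of 20 GB per month (the exact billing granularity is our assumption; use SAP’s capacity unit estimator for definitive numbers):

```python
import math

# Assumption: premium outbound traffic is billed per started block of
# 20 GB at EUR 1000 per month, on top of standard Datasphere costs.
BLOCK_SIZE_GB = 20
BLOCK_PRICE_EUR = 1000

def premium_outbound_cost_eur(monthly_traffic_gb: float) -> int:
    """Monthly premium outbound cost for a given replication volume."""
    if monthly_traffic_gb <= 0:
        return 0
    blocks = math.ceil(monthly_traffic_gb / BLOCK_SIZE_GB)
    return blocks * BLOCK_PRICE_EUR

# 50 GB of monthly traffic -> 3 started blocks of 20 GB -> EUR 3000
print(premium_outbound_cost_eur(50))
```

In other words, even a modest replication volume to a non-SAP target adds a fixed, stepwise cost, so it pays to estimate your monthly traffic before committing to premium outbound flows.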
As a possible alternative, we also tried to perform the initial data load from the same SAP Datasphere local table to both Google BigQuery and a Kafka broker, but this unfortunately did not work. Of course, we were not able to find an answer to all of our questions; some of the important ones we still have at the moment include:
- What happens if multiple Spaces try to replicate data from the same S/4HANA source?
- To what degree is the metadata of your source (e.g. in terms of lineage) carried over to the target system?
- Is it possible to replicate data to multiple targets at once?
- When is the support for REST APIs coming to SAP Datasphere? (as this is a requirement for even some SAP-to-SAP connections)
- Apart from Tables, will it be possible to use Datasphere’s Analytic Models or Views as a source for replication flows? If so, when? (this is currently not specified on the SAP roadmap)
Despite these open topics, for which we expect to find answers soon, we hope this blog gives you some insight into both the outlook of Data Intelligence functionalities in Datasphere and the possibilities that the (premium) replication flows offer.
Now that we have a bit more clarity on the future of SAP Data Intelligence and SAP Datasphere, I will give the floor to my colleagues Dennis and Dirk-Jan as they dive into the latter solution’s replication flows.
Setting up the connection towards S/4HANA
It stands to reason that you will require an operational SAP Datasphere system if you want to follow along with this guide.
1. First, connect and configure the SAP Cloud Connector to your S/4HANA on-premise system (we will not go into this specific configuration in this blog).
2. Next, connect and configure the DP Agent (whose configuration is also outside of this blog’s scope).
3. Once these steps have been completed, go to the SAP Datasphere -> Connections menu, create a new Local Connection (make sure to select S/4HANA on-premise when prompted) and configure it as follows:
4. Next, provide the optional advanced properties and the correct connection information for your objects (e.g. a business name, which is not obligatory but quite handy, and an obligatory technical name) and validate the connection:
If everything has been configured correctly, a message toast should pop up similar to the one below:
If, for example, something is wrong with the virtual hostname used in the Cloud Connector settings, it will show a popup with additional information like this one:
Setting up the connection to Google BigQuery
Here, we will illustrate the required steps to set up a test environment and connection between SAP Datasphere and Google BigQuery:
1. Create a Google BigQuery (trial) account via the ‘Try BigQuery free’-button: https://cloud.google.com/bigquery
2. Now log in to your Google account and add your address and credit card information.
3. After the account creation, you will land on an overview page with a project that is created by default (you can optionally create your own).
4. Go to Google BigQuery, either through the ‘Products & solutions’ menu or via the direct URL: https://console.cloud.google.com/bigquery
5. Use ‘Create SQL Query’ to create a dataset/schema. The ‘project-id’ prefix is optional.
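The dataset-creation statement for this step has roughly the shape below; the project id and dataset name are placeholders (the dataset name matches the one we use later in this blog). A small Python sketch that builds the DDL string:

```python
# Placeholder names: substitute your own project id and dataset name.
project_id = "my-project"
dataset = "A_TEST_SCHEMA_NAME"

# With the optional project prefix:
ddl = f"CREATE SCHEMA `{project_id}.{dataset}`;"
# Without the prefix, BigQuery uses the currently selected project:
ddl_short = f"CREATE SCHEMA `{dataset}`;"

print(ddl)
```

Paste the resulting statement into the BigQuery SQL editor and run it to create the dataset/schema.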
6. Go to Service Accounts either through the ‘IAM & Admin -> Service Accounts’ menu or with the direct URL: https://console.cloud.google.com/iam-admin/serviceaccounts
7. Create a Service Account and grant it the required roles:
- Note that a BigQuery Job User is necessary for ‘Data flows’ and ‘Replication flows’.
- The BigQuery Data Owner role is needed to read from and write to the dataset/schemas. Note that a less privileged role might also suffice here (we have not been able to validate this yet). Later on, in the connection validation from SAP Datasphere, this falls under ‘Remote tables’.
8. Open the newly created service account:
9. Go to the ‘Keys’-tab and create a new private key of the type ‘JSON’. The download will start automatically; save the file to your computer.
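If you want to verify the downloaded key before uploading it to Datasphere, a quick sanity check like the sketch below can help (the field names follow Google’s standard service-account key layout; the function name is our own):

```python
import json

# Fields every Google service-account key file should contain.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def check_key_file(path: str) -> dict:
    """Load a service-account key and fail fast if fields are missing."""
    with open(path) as fh:
        key = json.load(fh)
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"key file is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']}")
    return key
```

A malformed or truncated key file will then fail immediately, instead of surfacing as a vague connection-validation error in Datasphere later on.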
10. Now go to the SAP Datasphere documentation and check the supported drivers:
11. You can download the driver from the URL below. Make sure to use the correct filename from the documentation.
https://storage.googleapis.com/simba-bq-release/odbc/SimbaODBCDriverforGoogleBigQuery_3.0.0.1001-Linux.tar.gz
12. Open your SAP Datasphere tenant, go to ‘System’ > ‘Configuration’ and, in the tab ‘Data Integration’ under ‘Third-Party Drivers’, upload the previously downloaded file through the ‘Driver File:’-upload box. The system will also remind you that you need a valid license for driver files before proceeding. Depending on the connection speed, the upload may take some time.
13. Once the driver is added make sure to select it and press the ‘Sync’-button under the ‘Third-Party Drivers’-submenu.
14. Go to the SAP Datasphere Connections and create a new Local Connection (like we did before for S/4HANA). However, this time of course select Google BigQuery:
Set up the connection as follows, providing once again the obligatory technical and optional business name of the connection.
15. Select the created connection and validate it (please refer to the image below).
You will once again receive either a success message or an error popup (an example of each below):
Creating the Replication Flows
Now that the connections for both S/4HANA and BigQuery have been set up, we can create a sample replication task from S/4HANA to Google BigQuery:
1. Create a new Replication Flow in the Datasphere Space of your choice (through the Data Builder):
2. Select the Source Connection, in this case S4H_TEST, Source Container, in this case CDS, and add some Source Objects, in this case ZDV_CDS_VIEW:
3. Next select the Target Connection, in this case GBQ_TEST, and select a container, in this case the created dataset/schema A_TEST_SCHEMA_NAME:
4. Give the Replication Flow a name, save it and finally deploy it.
5. Now run the Replication Flow, optionally check the Data Integration Monitor:
6. Finally, return to check Google BigQuery; it should contain the replicated data:
SAP Datasphere Replication Flow from S/4HANA to Azure Data Lake
Now that we have seen how connections are set up and how a replication flow to BigQuery is created, let’s take a short look at how this works for other targets, for example Azure Data Lake.
First, you will need to have configured S/4HANA and an Azure storage account with Data Lake Storage Gen2 enabled. The next step is to create connections in Datasphere to S/4HANA and, of course, Azure Data Lake. Given that this blog has already gone in-depth on how to do this, we will not go through this (almost identical) setup process again here, and instead start from the point where the relevant connections have already been established:
1. Navigate to the Data Builder -> New Replication Flow.
You will once again land on the ‘You haven’t added any data yet’-screen, so press ‘Select Source Connection’ and select your S/4HANA connection. Now press Select Source Container -> CDS (which is the container we use for both the BigQuery and Azure examples):
2. Select Add Source objects, for example I_BUSINESSPARTNER.
3. Subsequently, press Add Selection and select your Source Object. Now we will add the target connection, so select your Azure data lake connection and the required container:
4. Configure the connection to your needs, for example:
- Initial and Delta
- With a Delta load interval of thirty minutes
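With these settings, the flow keeps picking up changes on a fixed cadence after the initial load. A tiny sketch of what a 30-minute delta interval means in practice (the start time is purely illustrative):

```python
from datetime import datetime, timedelta

interval = timedelta(minutes=30)           # the delta load interval chosen above
initial_load = datetime(2024, 1, 1, 8, 0)  # illustrative initial-load time

# The next three delta runs after the initial load:
next_runs = [initial_load + i * interval for i in range(1, 4)]
print([t.strftime("%H:%M") for t in next_runs])  # ['08:30', '09:00', '09:30']
```

Choose the interval based on how fresh the data in the target needs to be; a shorter interval means more frequent (and thus more) premium outbound traffic.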
5. Save, activate and run your replication flow like we did before for the S/4HANA – BigQuery connection:
6. Now we can once again check the Data integration monitor for the run status:
7. In Azure you can subsequently check your files:
8. Finally, you might encounter some errors:
To tackle this issue, implement the respective SAP note in the S/4HANA system via SNOTE.
For the above issue, truncate the initial data load to circumvent it.
This concludes our blog on the state of SAP Data Intelligence/Datasphere and its replication flows. If you want to know more or have other Datasphere or Data Intelligence related questions, please do not hesitate to contact us. For the official SAP Documentation on Datasphere’s Replication Flows, click here.
Credits
This blog was written by our experts Dennis van Velzen, Lars van der Goes and Dirk-Jan Kloezeman.