Replicating data from SAP S/4HANA to Databricks using Premium Outbound

 

SAP has dialled up its partnership with Databricks with last month’s announcement of SAP Business Data Cloud. All the more reason to investigate the possibilities between SAP Datasphere and Databricks’ Data Lakehouse. Hence, in this blog I describe how to create a data flow from SAP S/4HANA to Databricks using Datasphere’s Premium Outbound functionality.

In such a scenario we do not need to store any data in Datasphere; we only use it for data replication. The full data flow looks as follows:

SAP S/4HANA -> Datasphere Premium Outbound -> AWS S3 -> Databricks.

Prerequisites:

  1. Datasphere connection to SAP S/4HANA.

Datasphere’s Premium Outbound is basically a Replication Flow with a target connection outside of Datasphere itself. Replication into Datasphere is free, whereas replication to external targets is charged per MB. You cannot extract a delta without doing an initial load first, so there is no way to save credits on that. For more information on Premium Outbound pricing and the way it is measured, make sure to check out one of our previous blog posts.

Sergii Afbeelding1
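
Before setting up the IAM user, you need an S3 bucket that will receive the replicated files. If you prefer to script the AWS side, here is a minimal sketch with boto3; the bucket name and region are placeholders, so adjust them to your own environment.

```python
import boto3

# Placeholder bucket name and region -- replace with your own values.
s3 = boto3.client("s3", region_name="eu-central-1")
s3.create_bucket(
    Bucket="datasphere-premium-outbound-demo",
    CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
)
```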

Next, we need to create an IAM user with full access to the bucket; this is the user SAP Datasphere will use:

Sergii Afbeelding2

Let’s give it the following permissions to access our bucket:

Sergii Afbeelding3
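
If you manage AWS with code rather than through the console, a bucket-scoped policy along these lines can be attached to the user with boto3. The bucket and user names are placeholders, and the exact set of actions is an assumption on my part; treat the screenshot above as the reference for your own setup.

```python
import json

import boto3

BUCKET = "datasphere-premium-outbound-demo"  # placeholder bucket name

# Assumed minimal permission set: list the bucket and read/write/delete objects in it.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BucketAccess",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
        {
            "Sid": "ObjectAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
    ],
}

iam = boto3.client("iam")
iam.put_user_policy(
    UserName="datasphere-s3-user",            # placeholder IAM user name
    PolicyName="datasphere-bucket-access",
    PolicyDocument=json.dumps(policy),
)
```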

Now we need to create an access key for SAP Datasphere. Go to Security Credentials -> Access Keys and create a new one:

Sergii Afbeelding 4

Save both values, as we will need them later.
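
This step can also be scripted; a short boto3 sketch, again with a placeholder user name. Note that the secret access key is only returned once, at creation time.

```python
import boto3

iam = boto3.client("iam")
response = iam.create_access_key(UserName="datasphere-s3-user")  # placeholder user name

# The secret is returned only once -- store both values somewhere safe.
access_key_id = response["AccessKey"]["AccessKeyId"]
secret_access_key = response["AccessKey"]["SecretAccessKey"]
```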

The next step is to go to Datasphere and create a connection to S3, so that we can use it as a target in a Replication Flow. To do this, go to the Connections section of your Space and create a new one:

Sergii Afbeelding 6

To access your S3 bucket, Datasphere needs the IAM user access key and secret you saved in the previous step:

Sergii Afbeelding 7

Now, save and validate your setup. You should see the following:

Sergii Afbeelding 8

As initially stated, we assume that the connection between S/4HANA and SAP Datasphere has already been created, hence this setup will not be covered in this blog. Now, let’s go and create a new Replication Flow in SAP Datasphere:

Sergii Afbeelding 9

Subsequently, select S/4HANA as the Source Connection and add ‘I_SALESORDERITEM’ as a source object. Then use our new S3 connection as the target and select the S3 bucket we created as the container:

Sergii Afbeelding10

Click on the added object and make sure that, in the Settings section, the Load Type is set to Initial, Delta is enabled and the File Type is set to Parquet, and that Enable Apache Spark Compatibility is switched ON in the S3 Target Settings:

Sergii Afbeelding 11

As you can see, the Replication Flow is now ready, and we can deploy and run it:

Sergii Afbeelding13

If we now go to our S3 bucket, we should see a new folder with both initial and delta Parquet files:

Sergii Afbeelding14
Sergii Afbeelding15
Sergii Afbeelding16
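
The same check can be done from a small script, for example with boto3. The bucket name is a placeholder and the prefix is an assumption based on the source object name; check your own bucket for the exact folder name.

```python
import boto3

BUCKET = "datasphere-premium-outbound-demo"  # placeholder bucket name
PREFIX = "I_SALESORDERITEM/"                 # assumed folder created by the Replication Flow

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# List the Parquet files written by the initial load and subsequent delta runs.
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        print(f"{obj['LastModified']}  {obj['Size']:>10}  {obj['Key']}")
```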

Our last step is to import the data into Databricks. We need to create an external location in the Catalog to give access to our S3 bucket. Go to Catalog -> External Data -> Create external location -> Follow AWS Quickstart template.

Sergii Afbeelding17
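
If you would rather define the external location yourself instead of using the Quickstart template, it can also be created with SQL from a Databricks notebook. This is only a sketch: it assumes a Unity Catalog storage credential named datasphere_s3_credential already exists, and the location name and S3 URL are placeholders.

```python
# Run in a Databricks notebook; `spark` is available there by default.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS datasphere_premium_outbound
    URL 's3://datasphere-premium-outbound-demo/'
    WITH (STORAGE CREDENTIAL datasphere_s3_credential)
""")
```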

Now you can select New -> Add or upload data -> Create table from Amazon S3.

Sergii Afbeelding18
Sergii Afbeelding19
Sergii Afbeelding20
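
The table can also be created from a notebook instead of the UI. A minimal sketch: the S3 path and the catalog and schema names are placeholders, and the recursive lookup is there because the Replication Flow may write the Parquet files into subfolders.

```python
# Run in a Databricks notebook. Adjust the path and the target table name to your setup.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .parquet("s3://datasphere-premium-outbound-demo/I_SALESORDERITEM/"))

(df.write
   .mode("overwrite")
   .saveAsTable("main.sap.i_salesorderitem_raw"))  # placeholder catalog.schema.table
```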

Finally, check the result:

Sergii Afbeelding21

As you can see, Datasphere adds the ‘__operation_type’, ‘__sequence_number’ and ‘__timestamp’ fields, so that the replication can be processed as a CDC feed.
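
These technical columns make it possible to reduce the raw change records to the current state of each sales order item. Below is a minimal sketch using a window over the key columns; the key column names and the ‘D’ delete marker are assumptions, so verify them against your own files before relying on this.

```python
from pyspark.sql import Window, functions as F

# Read everything the Replication Flow has written so far (path is a placeholder).
raw = (spark.read
       .option("recursiveFileLookup", "true")
       .parquet("s3://datasphere-premium-outbound-demo/I_SALESORDERITEM/"))

# Assumed key columns of I_SALESORDERITEM -- check the actual column names and casing.
key_cols = ["SALESORDER", "SALESORDERITEM"]

# Keep only the most recent change record per key, based on the sequence number.
w = Window.partitionBy(*key_cols).orderBy(F.col("__sequence_number").desc())

latest = (raw
          .withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn"))

# Drop keys whose latest change is a delete (assuming 'D' marks deletions).
current_state = latest.filter(F.col("__operation_type") != "D")
```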

Thank you for reading my blogpost; I hope you learned something new. In the next blog in this series, I will be looking at importing data from Datasphere Data Lake into Databricks using JDBC. Should you have any questions and/or remarks in the meantime, please do not hesitate to contact us.

About the author

Sergii Dyzhyn
