In such a scenario, we do not need to store data in Datasphere; we only use it for data replication. The full data flow looks as follows:
SAP S/4HANA -> Datasphere Premium Outbound -> AWS S3 -> Databricks.
Prerequisites:
- A Datasphere connection to SAP S/4HANA.
Datasphere’s Premium Outbound is essentially a replication flow with a target connection outside of Datasphere itself. Replication into Datasphere is free, whereas replication to external targets is charged per MB. You cannot extract a delta without an initial load, so there is no way to save credits on that. For more information on Premium Outbound pricing and the way it is measured, make sure to check out one of our previous blog posts.
Next, we need to create an IAM user with full access to the bucket; this user will be used by SAP Datasphere:
Let’s give it the following permissions to access our bucket:
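If you prefer to script this instead of clicking through the AWS console, a minimal boto3 sketch could look as follows. The bucket and user names are placeholders, and the inline policy simply grants s3:* on that one bucket:

```python
# Sketch: create the IAM user and scope its permissions to a single bucket
# (bucket, user and policy names are placeholders -- replace with your own).
import json
import boto3

iam = boto3.client("iam")

BUCKET = "datasphere-replication-demo"   # placeholder bucket name
USER = "datasphere-replication-user"     # placeholder IAM user name

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:*",  # full access, limited to this bucket only
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }
    ],
}

iam.create_user(UserName=USER)
iam.put_user_policy(
    UserName=USER,
    PolicyName="datasphere-s3-access",
    PolicyDocument=json.dumps(policy),
)
```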
Now we need to create an access key for SAP Datasphere. Go to Security Credentials -> Access Keys and create a new one:
Save both values, as we will need them later.
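This step can be scripted as well; a short sketch, again using the placeholder user name from above:

```python
# Sketch: create an access key for the (placeholder) IAM user via boto3.
import boto3

iam = boto3.client("iam")
resp = iam.create_access_key(UserName="datasphere-replication-user")

# The secret is only returned once, so store both values somewhere safe.
print(resp["AccessKey"]["AccessKeyId"])
print(resp["AccessKey"]["SecretAccessKey"])
```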
The next step is to go to Datasphere and create a connection to S3, so we can use it as a destination in a Replication Flow. To do this, go to the Connections section of your Space and create a new one:
To access your S3 bucket, Datasphere needs the IAM user access key and secret you saved in the previous step:
Now, save and validate your setup. You should see the following:

As initially stated, we assume that the connection between S/4HANA and SAP Datasphere has already been created, hence this setup will not be covered in this blog. Now, let’s go and create a new Replication Flow in SAP Datasphere:
Subsequently, select S/4HANA as the Source Connection and add 'I_SALESORDERITEM' as a source object. Now, use our new S3 connection as the target and select the S3 bucket we created as the container:
Click on the added object and make sure that Load Type is set to Initial, Delta is enabled in the Settings section, File Type is set to Parquet, and finally that Enable Apache Spark Compatibility is switched ON in S3: Target Settings:
As you can see, the Replication Flow is now ready, and we can deploy and run it:
If we now go to our S3 bucket, we should see a new folder with both the Initial and Delta parquet files:
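You can also verify this programmatically. A quick boto3 sketch, where the bucket name and prefix are placeholders (Datasphere creates a folder named after the replicated object):

```python
# Sketch: list the replicated files in S3 (placeholder bucket name and prefix).
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="datasphere-replication-demo",   # placeholder bucket
    Prefix="I_SALESORDERITEM/",             # folder created by the Replication Flow
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```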
Our last step is to import the data into Databricks. We need to create an external location in the Catalog to give Databricks access to our S3 bucket. Go to Catalog -> External Data -> Create external location -> Follow AWS Quickstart template.
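If you would rather script this step than follow the Quickstart template, the external location can also be created with a SQL statement from a notebook. Below is a sketch, assuming a Unity Catalog storage credential for the bucket already exists (location, credential and bucket names are placeholders):

```python
# Sketch: create the external location from a Databricks Python notebook
# (the `spark` session is predefined in Databricks notebooks;
#  location, credential and bucket names are placeholders).
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS datasphere_s3
    URL 's3://datasphere-replication-demo/'
    WITH (STORAGE CREDENTIAL datasphere_aws_credential)
""")
```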
Now you can select New -> Add or upload data -> Create table from Amazon S3.
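The wizard essentially registers an external table over the replicated folder; a rough SQL equivalent is sketched below (catalog, schema, table names and the S3 path are placeholders, and the exact folder layout may differ in your setup):

```python
# Sketch: register the replicated parquet files as an external table
# (placeholder catalog/schema/table names and S3 path).
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sap.i_salesorderitem_raw
    USING PARQUET
    LOCATION 's3://datasphere-replication-demo/I_SALESORDERITEM/'
""")
```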
Finally, check the result:
As you can see, Datasphere adds the __operation_type, __sequence_number and __timestamp fields, so that the replication can be processed as a CDC feed.
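As a final illustration, here is one way to collapse that CDC feed into the current state of each sales order item in PySpark. This is only a sketch under a few assumptions: the S3 path and the key columns (SalesOrder, SalesOrderItem) are placeholders for your own object, and the filter assumes 'D' marks deletions in __operation_type; check the actual values in your feed.

```python
# Sketch: reduce the CDC feed to the latest state per key
# (path, key columns and the delete marker 'D' are assumptions; adjust to your feed).
from pyspark.sql import Window, functions as F

raw = (
    spark.read
    .option("recursiveFileLookup", "true")   # read initial and delta files together
    .parquet("s3://datasphere-replication-demo/I_SALESORDERITEM/")
)

key_cols = ["SalesOrder", "SalesOrderItem"]                  # assumed business key
w = Window.partitionBy(*key_cols).orderBy(F.col("__sequence_number").desc())

current_state = (
    raw.withColumn("rn", F.row_number().over(w))             # latest change per key
       .filter("rn = 1")
       .filter(F.col("__operation_type") != "D")             # drop deleted records
       .drop("rn")
)
current_state.show()
```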
Thank you for reading my blog post; I hope you learned something new. In the next blog in this series, I will look at importing data from Datasphere Data Lake into Databricks using JDBC. Should you have any questions or remarks in the meantime, please do not hesitate to contact us.