DP 203

9 minute read

Azure Synapse

PolyBase => access to external tables

Azure Data Factory

  • is a cloud ETL

  • Linked Service => connection used to ingest data

  • Dataset is a data structure

  • activity => transformation logic

Data Warehouse (dedicated SQL pool)

  1. Indexing, distribution, & partition

Indexing: by default, SQL Pool creates a clustered columnstore index (a DDL sketch combining these options with the distribution choices follows the distribution table below).

| Indexing option | Characteristics |
|---|---|
| Clustered columnstore | To be used when we don't know how to index the table. Applies to all the table's data. High level of compression. For large tables (more than 60M rows). Do not use for transient data and small tables. |
| Clustered (rowstore) and nonclustered | Clustered is to be used on tables when a single row needs to be quickly retrieved. To improve filters on other columns, nonclustered indexes can be added to those columns. B-tree structure. Not for large tables (less than 60M rows). |
| Heap | Fast loading. For small, staging, and temp tables (less than 60M rows). |

Distribution: distributes data across different nodes. When a query runs in SQL Pool, the work is divided into 60 smaller queries that run in parallel.

| Distribution type | Characteristics |
|---|---|
| Hash | Assigns each row to one distribution. Deterministic distribution. The distribution column should be defined as NOT NULL and should not be a date column (otherwise, rows with the same date land in the same distribution). It is recommended to use columns that will be used for JOIN, ORDER BY, DISTINCT, HAVING, OVER ==> high query performance. Improves query performance on large fact tables. |
| Round Robin | Assigns the rows evenly (equally) across all the distributions. Not deterministic. To be chosen for staging and temp tables. High performance when loading data into staging tables, but query performance is better with hash. To be chosen by default if there is no idea about the queries, when there are no join keys, when there is no eligible column for hash, or when the table is less than 2 GB. |
| Replicated | Assigns all the rows to each node. Good query performance on small tables and dimension tables. To be used for tables of less than 2 GB. |
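To make the two tables above concrete, here is a minimal DDL sketch for a dedicated SQL pool; all table and column names are invented for the example:

```sql
-- Large fact table: hash-distributed on the join key, clustered columnstore (the default index)
CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT NOT NULL,
    CustomerId  INT    NOT NULL,   -- hash column: NOT NULL, used in JOINs, not a date
    SaleDate    DATE   NOT NULL,
    Amount      DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension table (< 2 GB): replicated to every compute node, rowstore index
CREATE TABLE dbo.DimCustomer
(
    CustomerId  INT NOT NULL,
    Name        NVARCHAR(100)
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (CustomerId)
);

-- Staging table for fast loads: round-robin distribution, heap
CREATE TABLE dbo.StageSales
(
    SaleId      BIGINT,
    CustomerId  INT,
    SaleDate    DATE,
    Amount      DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
```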

Partition: partitioning is supported with all index types and all distribution types.

Partitioning by date. Can be done only on one column.

Improves both data loading and query performance.

PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,…n] ] ))

Example:

PARTITION ( id RANGE LEFT FOR VALUES (10, 20, 30, 40 ))

LEFT: Partition 1: col <= 10, Partition 2: 10 < col <= 20, Partition 3: 20 < col <= 30, Partition 4: 30 < col <= 40, Partition 5: 40 < col

RIGHT:

Partition 1: col < 10, Partition 2: 10 <= col < 20, Partition 3: 20 <= col < 30, Partition 4: 30 <= col < 40, Partition 5: 40 <= col
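As a hedged illustration, the partition clause is combined with the distribution and index options in the table DDL (hypothetical table; monthly boundaries chosen arbitrarily):

```sql
-- Partition a fact table by month; RANGE RIGHT means each boundary value
-- starts a new partition (the boundary date belongs to the partition on its right).
CREATE TABLE dbo.FactOrders
(
    OrderId    BIGINT NOT NULL,
    OrderDate  DATE   NOT NULL,
    Amount     DECIMAL(18,2)
)
WITH
(
    DISTRIBUTION = HASH(OrderId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( OrderDate RANGE RIGHT FOR VALUES
        ('2023-01-01', '2023-02-01', '2023-03-01') )
);
```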

  2. PolyBase

PolyBase allows querying and copying data from external sources; the data stays in its original location and format. It is used with the T-SQL language.

  • Many connectors: Oracle, MongoDB, Hadoop, Azure Storage, SQL Server, S3, Teradata.
  • Scalable and enables parallel data transfer. Computations can be pushed down to Hadoop for better performance.
  • To find data skew, run DBCC PDW_SHOWSPACEUSED, or query sys.dm_pdw_nodes_db_partition_stats.
  • PolyBase does not support JSON.
  • PolyBase is not possible for some text files; BULK INSERT has to be used instead.
  • To use PolyBase, we first need to create external tables that reference the external data.

Steps to create external tables in Synapse SQL pools:

1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage account and specify the credential that should be used to access it.
2. CREATE EXTERNAL FILE FORMAT to describe the format of the CSV or Parquet files.
3. CREATE EXTERNAL TABLE on top of the files placed on the data source, using that file format.
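A minimal sketch of the three statements, assuming ADLS Gen2 and CSV files. Every name, the location URL, and the credential secret are placeholders, and the exact credential / TYPE options depend on the pool type and the authentication method:

```sql
-- 1) Credential + data source pointing at the storage account
CREATE DATABASE SCOPED CREDENTIAL AdlsCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET   = '<sas-token>';          -- placeholder; auth method varies (SAS, key, managed identity)

CREATE EXTERNAL DATA SOURCE AdlsSource
WITH (
    LOCATION   = 'abfss://container@account.dfs.core.windows.net',
    CREDENTIAL = AdlsCredential
    -- a dedicated SQL pool may also require TYPE = HADOOP here
);

-- 2) File format describing the CSV layout
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- 3) External table on top of the files
CREATE EXTERNAL TABLE dbo.ExtSales
(
    SaleId  BIGINT,
    Amount  DECIMAL(18,2)
)
WITH (
    LOCATION    = '/sales/2023/',
    DATA_SOURCE = AdlsSource,
    FILE_FORMAT = CsvFormat
);
```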

Steps to copy data from ADLS to an Azure Synapse DW:

1. Create an external data source that uses the abfs location: CREATE EXTERNAL DATA SOURCE to reference Azure Data Lake Storage Gen1 or Gen2.
2. Create an external file format and set the FIRST_ROW option: CREATE EXTERNAL FILE FORMAT.
3. Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages.
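And a sketch of the CETAS step itself, reusing the hypothetical data source and file format from the previous example; the reject options shown are an assumption about how they would be configured here:

```sql
-- Writes the query result to the external location and registers it as an external table.
CREATE EXTERNAL TABLE dbo.CuratedSales
WITH (
    LOCATION     = '/curated/sales/',
    DATA_SOURCE  = AdlsSource,
    FILE_FORMAT  = CsvFormat,
    REJECT_TYPE  = VALUE,   -- or PERCENTAGE
    REJECT_VALUE = 0        -- number (or %) of dirty rows tolerated before the load fails
)
AS
SELECT SaleId, Amount
FROM dbo.ExtSales
WHERE Amount IS NOT NULL;
```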

  3. SCD (Slowly Changing Dimensions)

| Type | Behaviour |
|---|---|
| Type 1 | Overwrite and update the existing row. |
| Type 2 | Full history: duplicate the row, specifying start and end dates. It needs a surrogate key (Id). |
| Type 3 | Limited history: add a new column (current value). In total: original value, current value, and effective date. |
| Type 6 | Combination of 1+2+3: duplicated columns (original + current) and duplicated rows (surrogate key). |
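A simplified Type 2 sketch in T-SQL (hypothetical DimCustomer / staging tables; in practice the surrogate key would typically come from an IDENTITY column, and the whole step is often done with a MERGE or a mapping data flow):

```sql
-- 1) Close out the current version of rows whose tracked attribute changed
UPDATE dim
SET EndDate   = GETDATE(),
    IsCurrent = 0
FROM dbo.DimCustomer AS dim
WHERE dim.IsCurrent = 1
  AND EXISTS (SELECT 1
              FROM staging.Customer AS src
              WHERE src.CustomerBusinessKey = dim.CustomerBusinessKey
                AND src.City <> dim.City);

-- 2) Insert a new current row for every business key that no longer has one
--    (changed rows just closed above, plus brand-new customers)
INSERT INTO dbo.DimCustomer (CustomerBusinessKey, City, StartDate, EndDate, IsCurrent)
SELECT src.CustomerBusinessKey, src.City, GETDATE(), NULL, 1
FROM staging.Customer AS src
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimCustomer AS dim
                  WHERE dim.CustomerBusinessKey = src.CustomerBusinessKey
                    AND dim.IsCurrent = 1);
```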

  4. Encryption

TDE: Transparent Data Encryption. Encrypts data at rest; no application changes are required. Always Encrypted: protects sensitive data in specific columns, even while in use and from administrators.

Data Lake Storage automatically encrypts data at rest, protecting data privacy.

| | Always Encrypted | TDE |
|---|---|---|
| Requires SQL Server Enterprise Edition | No (starting with SQL Server 2016 SP1) | Yes |
| Free in Azure SQL Database | Yes | Yes |
| Protects data at rest | Yes | Yes |
| Protects data in use | Yes | No |
| Protects data from SQL administrators and admins | Yes | No |
| Data is encrypted/decrypted on the client side | Yes | No |
| Data is encrypted/decrypted on the server side | No | Yes |
| Encrypts at column level | Yes | No (encrypts the entire database) |
| Transparent to the application | Partially | Yes |
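For reference, TDE is a per-database switch (the database name below is a placeholder), while Always Encrypted requires defining column master keys / column encryption keys and configuring individual columns:

```sql
-- Transparent Data Encryption: encrypts the whole database at rest,
-- no application changes required.
ALTER DATABASE [MyDatabase] SET ENCRYPTION ON;

-- Check which databases are encrypted
SELECT name, is_encrypted
FROM sys.databases;
```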

Static VS Dynamic data masking: static data masking permanently replaces sensitive data in a copy of the database, while dynamic data masking masks the data at query time for non-privileged users without changing the stored data.
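A hedged sketch of dynamic data masking in T-SQL (table, column, and user names are invented):

```sql
-- Dynamic data masking: data is unchanged on disk, but masked at query time
-- for users who do not have the UNMASK permission.
ALTER TABLE dbo.Customer
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

ALTER TABLE dbo.Customer
ALTER COLUMN Phone ADD MASKED WITH (FUNCTION = 'partial(0,"XXX-XXX-",4)');

-- Allow a specific user to see unmasked values
GRANT UNMASK TO ReportingUser;
```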

SQL Pool

Dedicated SQL Pool VS Serverless SQL Pool

| Dedicated SQL Pool | Serverless SQL Pool |
|---|---|
| To be used to analyze data in a database or a data warehouse | To be used to query data in ADLS |
| Cost for compute and storage (provisioned) | Cost only when querying data |
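A minimal sketch of the serverless pattern, querying Parquet files in the lake directly with OPENROWSET (the storage URL is a placeholder):

```sql
-- Serverless SQL pool: ad-hoc query over files in the data lake, no table or load needed;
-- you pay only for the data processed by the query.
SELECT TOP 100 *
FROM OPENROWSET(
        BULK 'https://account.dfs.core.windows.net/container/sales/*.parquet',
        FORMAT = 'PARQUET'
     ) AS rows;
```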

ADLS gen2

The folder structure format recommended to keep the number of folders low and to secure them easily is: {Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

| File format | Characteristics |
|---|---|
| CSV | Not compressed ==> more disk storage. |
| JSON | |
| Avro | Stores data in a row-based format ==> optimized for write-heavy transactional workloads. High performance when querying data based on a timestamp. Supports batch and is very relevant for streaming (Event Hubs, Kafka). |
| Parquet | Stores data in columns ==> optimized for read-heavy analytical workloads. High performance when working with, and querying only, some columns. Faster than CSV. Hadoop ecosystem. |
| ORC | Stored as columns and compressed, for read-heavy workloads. |

→ Mnemonic: C (column) = A (analytical), R (row) = T (transactional)

Storage type

C. Redundancy & Outage

| | LRS | ZRS | GRS / RA-GRS | GZRS / RA-GZRS |
|---|---|---|---|---|
| Copies | 3 copies in the same data center, hence the same region | 3 copies across separate zones, but in the same region | 6 copies: 3 in the primary region, 3 in the secondary region | 6 copies: 3 copies across separate zones in the primary region, 3 copies locally in the secondary region |
| Failover | Manual | Manual | GRS: manual; RA: automatic read access to the secondary | GZRS: manual; RA: automatic read access to the secondary |
| Cost | Least expensive | | | |

RA option is recommended.

D. Copy / Migration patterns

Azure Stream Analytics

/!\ Azure Stream Analytics doesn’t support queries from a python script /!\

Declarative SQL is used

  • End-to-end analytics pipeline for data streaming; extensible with JavaScript and C#.
  • Uses T-SQL: the Stream Analytics query language.
  • Rapid scalability, robust streaming; analyzes millions of events at sub-second latencies for near-real-time insights.
  • Performs aggregations over windows of time.
  • A stream pipeline is called a job, a unit of execution.
  • It is composed of: publishers, events, Event Hubs, data streams, and subscribers.

A. Query

DATEDIFF (datepart, startdate, enddate) → returns bigint. Returns the count (signed big integer) of the specified datepart boundaries crossed between the specified startdate and enddate.

Datepart: the time unit; arguments: year, quarter, month, day, second, …

DATEADD (datepart, number, date) → returns a date. Returns the specified date with the specified number interval (signed integer) added to the specified datepart of that date.

Number: bigint type. If it is a fraction, it is truncated, not rounded. The value is added to the datepart of the date.

DATEPART (datepart, date) → returns an int. Returns an integer that represents the specified datepart of the specified date.
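A small hedged Stream Analytics sketch using these functions together (input names, columns, and the 15-minute window are assumptions):

```sql
-- Join orders to payments that arrive within 15 minutes of the order,
-- and derive an hour bucket and an expiry time from the order timestamp.
SELECT
    o.OrderId,
    DATEPART(hour, o.EventTime)       AS OrderHour,
    DATEADD(minute, 10, o.EventTime)  AS ExpiryTime
FROM Orders o TIMESTAMP BY EventTime
JOIN Payments p TIMESTAMP BY EventTime
  ON o.OrderId = p.OrderId
 AND DATEDIFF(minute, o, p) BETWEEN 0 AND 15
```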

B. Positional functions: they are a type of window function.

  • LAG (column or expression): used to access previous rows' data as per a defined offset value. Useful to compute the rate of growth of a variable, or to detect when a variable crosses a threshold, or when a condition starts or stops being true.
  • LAST: returns the last value of the selected column.
  • LEAD ( scalar_expression [ , offset ] , [ default ] ): used for comparisons. Offset: number of rows. Default: value to be returned if the offset is beyond the partition.
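A hedged Stream Analytics sketch using LAG to detect a threshold crossing (input name, columns, and the 30° threshold are assumptions):

```sql
-- Emit an event only when the temperature crosses 30° upwards:
-- the previous reading (LAG) was at or below the threshold and the current one is above it.
SELECT
    deviceId,
    temperature,
    LAG(temperature) OVER (PARTITION BY deviceId LIMIT DURATION(minute, 5)) AS previousTemperature
INTO Output
FROM SensorInput TIMESTAMP BY eventTime
WHERE temperature > 30
  AND LAG(temperature) OVER (PARTITION BY deviceId LIMIT DURATION(minute, 5)) <= 30
```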

C. Temporal window functions

| Window | Characteristics |
|---|---|
| Tumbling | Segments a data stream into distinct time segments. Windows don't overlap. |
| Hopping | Segments the data and hops forward in time by a fixed period (the hop size) ==> windows overlap. To make it behave like a tumbling window, set the hop size equal to the window size. |
| Sliding | Outputs events only for points in time when the content of the window actually changes. |
| Session | Groups events that arrive at similar times, filtering out periods of time where there is no data. |
| Snapshot | Groups events that have the same timestamp. Uses the System.Timestamp() function. |
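A minimal sketch of a windowed aggregation with a tumbling window (names and the 10-second window are placeholders):

```sql
-- Count events per device every 10 seconds; windows are contiguous and do not overlap.
SELECT
    deviceId,
    COUNT(*)           AS eventCount,
    System.Timestamp() AS windowEnd
INTO Output
FROM SensorInput TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(second, 10)
```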

D. Geospatial functions:

| Function | Characteristics |
|---|---|
| CreateLineString | A minimum of two points is needed to create a linestring. Returns a GeoJSON linestring. |
| CreatePoint | Accepts a latitude and a longitude and returns a GeoJSON point. |
| CreatePolygon | Accepts points and returns a GeoJSON polygon record. The order of points must follow the right-hand ring orientation, i.e. counter-clockwise. |
| ST_DISTANCE | Returns the distance between two geometries in meters. |
| ST_OVERLAPS | Compares two geometries ==> 1 if they overlap, 0 otherwise. |
| ST_INTERSECTS | Compares two geometries ==> 1 if they intersect, 0 otherwise. |
| ST_WITHIN | Whether a geometry is within another geometry ==> 1 if the first geometry is contained in the second, 0 otherwise. |
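A short hedged example combining CreatePoint and ST_DISTANCE in a Stream Analytics query (input name, columns, and coordinates are invented):

```sql
-- Flag vehicles that are within 500 meters of a fixed reference point.
SELECT
    vehicleId,
    ST_DISTANCE(CreatePoint(latitude, longitude), CreatePoint(47.64, -122.13)) AS distanceMeters
INTO Output
FROM VehicleInput TIMESTAMP BY eventTime
WHERE ST_DISTANCE(CreatePoint(latitude, longitude), CreatePoint(47.64, -122.13)) < 500
```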

ADF

A. Copy behavior

If the recursive option is enabled, files in subfolders will be copied; if it is not enabled, they will not.

| preserveHierarchy | flattenHierarchy | mergeFiles |
|---|---|---|
| Preserves the same file hierarchy | All the files are copied into the same folder | All the files are merged into a single file |

If a binary source is chosen, the ADF service will not parse the data.

B. Split data during the copy

First matching condition (a row goes only to the first stream whose condition it matches) ==> disjoint == False. All matching conditions (a row goes to every stream whose condition it matches) ==> disjoint == True.

C. Integration runtime

| Azure integration runtime | Azure-SSIS integration runtime | Self-hosted integration runtime |
|---|---|---|
| Copy data between cloud stores. Transform data between cloud stores using data flows. Execute activities using cloud stores and services. | To be used when executing SSIS packages through ADF. | Copy data between cloud and on-premises stores. Copy data between on-premises stores. Execute activities using on-premises stores and services. |

Security Capabilities

Azure Databricks

Cluster mode

| Standard cluster | High concurrency cluster | Single node cluster |
|---|---|---|
| Not appropriate for minimum query latency. Can't be used for fine-grained sharing of resources. Ideal for a single user and for processing large amounts of data with Spark. Any language: Python, R, Scala, and SQL. | Good for minimum query latency. Supports fine-grained sharing of resources. Ideal for groups of users who need to share resources or run ad-hoc jobs. Administrators usually create high concurrency clusters, and Databricks recommends enabling autoscaling for them. Scala is not available. | No workers; Spark jobs execute on the driver node. Ideal for small amounts of data or non-distributed workloads. |

Autoscaling: standard autoscaling scales up in many steps to reach the maximum and scales down only when the cluster is completely idle and under-utilized for the last 10 minutes. Optimized autoscaling (Premium) scales up from min to max in 2 steps and can scale down even if the cluster is not completely idle, after 40 seconds of under-utilization for job clusters and 150 seconds for all-purpose clusters.

| Job cluster | All-purpose cluster |
|---|---|
| To be used to run fast and robust automated jobs | To be used to analyze data collaboratively using interactive notebooks |

Scheduled jobs should run on a standard cluster.

Automated Databricks clusters are the best for jobs and automated batch processing. Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with interactive notebooks. You use automated clusters to run fast and robust automated jobs.

Init script: a shell script that runs during startup of each cluster node, before the Apache Spark driver or worker JVM starts. Its principal tasks:

  • Install packages and libraries not included in the Databricks Runtime.
  • Modify the JVM system classpath in special cases.
  • Set system properties and environment variables used by the JVM.
  • Modify Spark configuration parameters.

There are two kinds of init scripts:

  • Cluster-scoped: run on every cluster configured with the script. This is the recommended way to run an init script.
  • Global: run on every cluster in the workspace. They can help you enforce consistent cluster configurations across your workspace. Use them carefully because they can cause unanticipated impacts, like library conflicts.