ORC + ADLS = Polybase
Most people know that the fastest way to load Azure SQL Data Warehouse is through Polybase. And most people use delimited text files as their input.
That works well until you need to move data that delimited text handles badly: strings where a null must be kept distinct from an empty string, strings containing embedded carriage returns, and binary values.
Polybase also supports some Hadoop-based file formats - RCFile, ORC and Parquet. That's great if your data already lives in a Hadoop world, but what if you're moving data from an on-premises application? You might find yourself creating a delimited file, moving that file to Hadoop, and rewriting it in one of these formats before loading it to ASDW. Not exactly a fast way to handle the data.
Recently I needed to address this problem, and I looked for ways to dynamically create ORC or Parquet files. There wasn't much information around, so I'm sharing my findings.
There were two challenges in this POC - the ORC and Parquet writers are buried in a mess of Hadoop dependencies, and the documentation on how to use them is virtually non-existent. In the end, I was able to succeed with ORC, but I've got some outstanding issues with Parquet that I'll address in another article.
I'm pleased with the outcome. ORC is a great format for Polybase ingestion. It compresses nicely, and a single file can be loaded in parallel by Polybase. This means that you get a file which uploads fast to ADLS _and_ loads quickly through Polybase.
So here it is - a Groovy code example that will run a query against a database, then write the results of that query to an ORC file.
// Get dependencies.
// WARNING: these pull in a lot of transitive Hadoop dependencies - the first run may take a VERY long time.
@GrabConfig(systemClassLoader=true)
@Grab(group='com.microsoft.sqlserver', module='mssql-jdbc', version='6.4.0.jre8')
@Grab(group='org.apache.hive', module='hive-storage-api', version='2.5.0')
@Grab(group='org.apache.orc', module='orc-core', version='1.4.3')
// Import library definitions
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.common.type.HiveDecimal
import org.apache.hadoop.hive.ql.exec.vector.*
import org.apache.orc.*
import groovy.sql.Sql
// Set the connection parameters we'll use for the query.
def sqlServer = 'localhost'
def sqlDatabase = 'AdventureWorks2017'
def sqlUser = 'sa'
def sqlPassword = 'pw'
def sqlQuery = 'select ProductID, Name, MakeFlag, StandardCost, Weight, ModifiedDate from Production.Product'
// Set up the ORC file schema.
Configuration conf = new Configuration()
TypeDescription schema = TypeDescription.createStruct()
    .addField("ProductID", TypeDescription.createInt())
    .addField("Name", TypeDescription.createString())
    .addField("MakeFlag", TypeDescription.createBoolean())
    .addField("StandardCost", TypeDescription.createFloat())
    .addField("Weight", TypeDescription.createDecimal().withPrecision(10).withScale(2))
    .addField("ModifiedDate", TypeDescription.createTimestamp())
// Open an ORC file based on the schema.
Writer writer = OrcFile.createWriter(new Path("./product.orc"),
    OrcFile.writerOptions(conf)
        .setSchema(schema)
        .compress(CompressionKind.ZLIB)
);
// Create a batch of ORC vectors to receive the row data.
VectorizedRowBatch batch = schema.createRowBatch()
LongColumnVector vProductId = (LongColumnVector) batch.cols[0]
BytesColumnVector vName = (BytesColumnVector) batch.cols[1]
LongColumnVector vMakeFlag = (LongColumnVector) batch.cols[2]
DoubleColumnVector vStandardCost = (DoubleColumnVector) batch.cols[3]
DecimalColumnVector vWeight = (DecimalColumnVector) batch.cols[4]
TimestampColumnVector vModifiedDate = (TimestampColumnVector) batch.cols[5]
// Connect to the source database
def sql = Sql.newInstance ("jdbc:sqlserver://$sqlServer;databaseName=$sqlDatabase;user=$sqlUser;password=$sqlPassword;")
// Execute SQL query and iterate over rows to create ORC file.
sql.eachRow (sqlQuery) { row ->
    // Shorthand for the next free row slot in the batch
    int v = batch.size
    // Set an integer value
    vProductId.vector[v] = row.getInt('ProductID')
    // Set a string value that may be empty or null. ORC strings are UTF-8, and a null is
    // flagged through isNull/noNulls, which keeps it distinct from an empty string.
    String name = row.getString('Name')
    if (name != null) {
        vName.setVal (v, name.getBytes('UTF-8'))
    }
    else {
        vName.isNull[v] = true
        vName.noNulls = false
    }
    // Bit values are written as integers
    vMakeFlag.vector[v] = row.getInt('MakeFlag')
    // Float example
    vStandardCost.vector[v] = row.getFloat('StandardCost')
    // Decimal. A bit of fiddling here to handle NULLs - set() marks the value as null when passed a null HiveDecimal.
    BigDecimal weight = row.getBigDecimal ('Weight')
    if (weight == null) {
        vWeight.set (v, (HiveDecimal) null)
    }
    else {
        HiveDecimal hWeight = HiveDecimal.create(weight)
        vWeight.set (v, hWeight)
    }
    // DateTime
    vModifiedDate.set (v, row.getTimestamp('ModifiedDate'))
    // Count the row, and write the batch to the file each time it fills.
    ++batch.size
    if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
    }
}
// Write final batch to disk
if (batch.size != 0) {
    writer.addRowBatch(batch);
    batch.reset();
}
// Finished
writer.close();
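Before pushing the file to the cloud it's worth reading it back, if only to confirm that null strings stayed null rather than collapsing to empty strings. Here's a minimal sketch that reuses the same orc-core library and the imports above (it ignores the isRepeating optimisation flag for brevity):
// Read the ORC file back and print a couple of columns, distinguishing null from empty strings.
def reader = OrcFile.createReader(new Path("./product.orc"), OrcFile.readerOptions(conf))
VectorizedRowBatch readBatch = reader.getSchema().createRowBatch()
def rows = reader.rows()
while (rows.nextBatch(readBatch)) {
    LongColumnVector rProductId = (LongColumnVector) readBatch.cols[0]
    BytesColumnVector rName = (BytesColumnVector) readBatch.cols[1]
    for (int r = 0; r < readBatch.size; r++) {
        // A null string comes back with isNull set; an empty string comes back as a zero-length value.
        String name = (!rName.noNulls && rName.isNull[r]) ? null :
            new String(rName.vector[r], rName.start[r], rName.length[r], "UTF-8")
        println "${rProductId.vector[r]}\t${name}"
    }
}
rows.close()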
Last month I shared some code to transfer the resulting file to Azure Data Lake Store, ready for query through Polybase. You'll find it here:
https://www.dhirubhai.net/pulse/groovy-data-lake-store-ron-dunn/
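If you just want the shape of that upload step, here's a rough sketch using the Azure Data Lake Store Java SDK - the account name, tenant and service principal credentials are placeholders, and the article above covers the details:
// Upload product.orc to Azure Data Lake Store with the ADLS Java SDK.
@Grab(group='com.microsoft.azure', module='azure-data-lake-store-sdk', version='2.3.3')
import com.microsoft.azure.datalake.store.ADLStoreClient
import com.microsoft.azure.datalake.store.IfExists
import com.microsoft.azure.datalake.store.oauth2.ClientCredsTokenProvider

// Service principal credentials - placeholders, supply your own.
def tokenProvider = new ClientCredsTokenProvider(
    'https://login.microsoftonline.com/<tenant-id>/oauth2/token',
    '<application-id>',
    '<application-secret>')
def adls = ADLStoreClient.createClient('me.azuredatalakestore.net', tokenProvider)

// Stream the local ORC file into the lake, overwriting any earlier copy.
new File('./product.orc').withInputStream { input ->
    adls.createFile('/product.orc', IfExists.OVERWRITE).withStream { output ->
        output << input
    }
}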
To wrap up the load, here's the Polybase setup you'll need in Azure SQL Data Warehouse.
The external data source connection:
create external data source adls
with (
    type = HADOOP,
    location = 'adl://me.azuredatalakestore.net',
    credential = adlsCredential
);
Next, the external file type:
create external file format adls_orc
with (
    format_type = ORC,
    data_compression = 'org.apache.hadoop.io.compress.DefaultCodec'
);
And finally, the external table definition:
drop external table product_orc;
create external table product_orc (
    ProductId int
    ,ProductName nvarchar(50)
    ,MakeFlag bit
    ,StandardCost real
    ,Weight decimal(10,2) null
    ,ModifiedDate datetime
)
with (
    location = '/product.orc',   -- Location of file I uploaded
    data_source = adls,          -- My external data source
    file_format = adls_orc       -- My external file format (see above)
);
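The external table only gives you a window over the file; to land the data in a warehouse table you still run a CTAS over it. If you're scripting the whole pipeline in Groovy, that final step can be issued the same way as the extract - the server, credentials, target table and distribution below are placeholders to adapt:
// Run the Polybase load: CTAS from the external table into an internal warehouse table.
@GrabConfig(systemClassLoader=true)
@Grab(group='com.microsoft.sqlserver', module='mssql-jdbc', version='6.4.0.jre8')
import groovy.sql.Sql

def dw = Sql.newInstance('jdbc:sqlserver://mydw.database.windows.net;databaseName=mydw;user=loader;password=pw;')
dw.execute('''
    create table dbo.Product
    with (distribution = round_robin, clustered columnstore index)
    as select * from product_orc
''')
dw.close()
And that's the whole path - query to ORC, ORC to ADLS, and ADLS into the warehouse through Polybase.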