
Microsoft Azure Storage Best Practices (Part 3)

Andrew Gravell

This is the third part of my Azure Storage Best Practices series. In this article, I will focus on best practices for blobs, tables, and queues.

Best Practices for Blob Service

How do I upload a folder the fastest?

If you want to upload a folder as fast as possible, upload multiple blobs simultaneously. Remember, the blob name is your PartitionKey, so spreading uploads across many blobs spreads the traffic across partitions. Your account limit is about 5 gigabits per second of ingress with GRS turned on, but a single blob tops out at 60 megabytes per second. So rather than using multiple threads to upload a single blob, use those threads to upload the various blobs in parallel, as in the sketch below.
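Here is a minimal sketch of that pattern with the Python azure-storage-blob package; the connection-string environment variable, the "myfiles" container, and the "./data" folder are illustrative assumptions:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "myfiles")

def upload_one(path):
    # Each blob name is its own partition, so uploading different
    # blobs in parallel spreads traffic across partitions.
    with open(path, "rb") as f:
        container.upload_blob(name=os.path.basename(path), data=f,
                              overwrite=True)

folder = "./data"  # illustrative local folder
paths = [os.path.join(folder, n) for n in os.listdir(folder)]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload_one, paths))
```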

How do I upload a blob the fastest?

To get up to that 60 megabytes per second on one blob, use multiple threads together with the block blob model: split the data into blocks, upload the blocks in parallel, and then commit the block list.

Distribute the load across the namespace. Don't use append/prepend naming patterns when uploading blobs, such as ever-increasing key prefixes, because they funnel all traffic to one end of the key range.

Prefer block sizes in the 1 to 4 MB range: the more data you send per round trip, the better your round-trip time is amortized, and larger blocks also suit the way the service handles data. The exception is streaming scenarios, for example content played in Windows Media Player, where applications read small chunks of data; there it makes sense to upload in smaller blocks, under 1 MB. If you're not reading small ranges, use 1 to 4 MB blocks, as in the sketch below.
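A minimal sketch of the parallel block upload with the Python azure-storage-blob package; the container, blob, and file names are illustrative assumptions:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobClient, BlobBlock

blob = BlobClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "myfiles", "big.bin")

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks: fewer, larger round trips

# Split the file into blocks with fixed-length ids (block ids within
# one blob must all be the same length).
with open("big.bin", "rb") as f:
    data = f.read()
blocks = [(f"{i:08d}", data[off:off + BLOCK_SIZE])
          for i, off in enumerate(range(0, len(data), BLOCK_SIZE))]

# Stage the blocks in parallel, then commit the ordered block list,
# which assembles the final blob in one atomic step.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda b: blob.stage_block(block_id=b[0], data=b[1]),
                  blocks))
blob.commit_block_list([BlobBlock(block_id=bid) for bid, _ in blocks])
```

In practice the SDK can do this staging for you: upload_blob accepts a max_concurrency argument that parallelizes the block uploads for large blobs.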

Best Practices for Table Service

JSON over AtomPub. This one is pretty obvious: JSON gives you up to a 70 percent reduction in bandwidth usage. JSON also offers a no-metadata response, so your query results are just the raw data, names and values, with the types inferred by your application.

Critical queries. When designing your application, identify the set of critical queries that need to be performed over and over again, and design your keys so those queries hit the clustered index. Microsoft Azure Tables provides exactly one clustered index: {PartitionKey, RowKey}. Make sure those queries address entities by PartitionKey and RowKey instead of scanning the table or partition ranges, and spread them to avoid hot spots. A point query looks like the sketch below.
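A minimal point-query sketch with the Python azure-data-tables package; the table name and key values are illustrative assumptions:

```python
import os
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="Orders")

# Point query: both halves of the clustered index are known, so the
# service reads exactly one entity, with no table or partition scan.
entity = table.get_entity(partition_key="customer-42", row_key="order-0001")
```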

Batch. Microsoft Azure Storage has entity group transactions: a bunch of entities that can be grouped together and sent in a single request with all-or-nothing semantics, so you get a transaction on that particular request. This gives you more throughput because you reduce the number of round trips instead of sending each entity separately. You can send up to a hundred entities in a single batch, all within the same partition, as sketched below.
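A sketch of an entity group transaction with the Python azure-data-tables package; the table name and entity shape are illustrative assumptions:

```python
import os
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="Orders")

# Up to 100 entities, all sharing one PartitionKey, committed
# all-or-nothing in a single round trip.
operations = [
    ("upsert", {"PartitionKey": "customer-42",
                "RowKey": f"order-{i:04d}",
                "Total": 19.99})
    for i in range(100)
]
table.submit_transaction(operations)
```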

Utilize the schema-less property of Azure Tables by storing multiple types of entities that need locality together. For example, take customer and order entities. This is a NoSQL store; there are no joins like a relational store provides. So you can store customers and their orders in the same table and use the RowKey to indicate what kind of entity each row is. Whenever you need just the customers, select only those; when you need a customer along with all of its orders, select by PartitionKey, which might be your customer ID. That gives you control over the locality of these entities. Remember, {PartitionKey, RowKey} is the clustered index, so that is also how the data is stored: sorted by that combination of keys. A sketch follows.
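A sketch of that customer-and-orders layout, assuming a "Customers" table and the Python azure-data-tables package:

```python
import os
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="Customers")

# The customer row and its order rows share a PartitionKey, so they
# are stored and sorted together; the RowKey prefix marks the type.
table.upsert_entity({"PartitionKey": "customer-42", "RowKey": "customer",
                     "Name": "Contoso"})
table.upsert_entity({"PartitionKey": "customer-42", "RowKey": "order-0001",
                     "Total": 19.99})

# One partition-local query returns the customer plus all its orders.
for e in table.query_entities("PartitionKey eq 'customer-42'"):
    print(e["RowKey"])
```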

Single index, {PartitionKey, RowKey}: if you need a composite index in a NoSQL store, you can concatenate fields into the key. The prefix is the most significant part of the key, so whenever you query, make sure you at least know the key's prefix so you can limit the range of the scan. For most scalable applications you should avoid scans entirely, but background workers that don't need that level of performance may sometimes have to scan; that's fine as long as they use good enough backoff policies on retries. A prefix-bounded range query looks like the sketch below.
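For example, with RowKeys shaped like "order-date-id" (an illustrative composite key), knowing the prefix bounds the scan to one narrow, sorted key range:

```python
import os
from azure.data.tables import TableClient

table = TableClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="Customers")

# The fixed "order-" prefix plus a date range limits the scan to
# January's orders inside a single partition.
f = ("PartitionKey eq 'customer-42' and "
     "RowKey ge 'order-20240101' and RowKey lt 'order-20240201'")
january_orders = list(table.query_entities(f))
```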

Entity locality: {PartitionKey, RowKey} determines the sort order. Store related entities together to reduce I/O and improve performance.

Table Service Client Layer in 3.0+. Microsoft introduced a new way of writing your table client. Azure Storage used to provide a client built on WCF Data Services, but WCF was designed with a relational store in mind, so something was needed that is really performant and understands the nuances of a NoSQL store. That's why Azure Storage provided the Table Service layer, and it dramatically improves performance. It is one of the main reasons I could saturate the storage account limits from a single VM; with the WCF-based client that was not possible at all, and this, together with JSON, enabled me to saturate my storage account from a single VM.

Best Practices for Queue Service

Make message processing idempotent. The queue's semantics are: retrieve a message, process it, and then delete it. If you don't delete the message, it becomes visible again after the visibility timeout you set when retrieving it, and some worker will process it again. So make sure your message processing is idempotent, as in the sketch below.
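A minimal sketch with the Python azure-storage-queue package; the "work-items" queue name and the process handler are illustrative assumptions:

```python
import os
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "work-items")

for msg in queue.receive_messages(visibility_timeout=60):
    # If the worker dies before delete_message, the message reappears
    # after 60 seconds and gets processed again, so the handler must
    # tolerate duplicates.
    process(msg.content)       # hypothetical idempotent handler
    queue.delete_message(msg)  # delete only after successful processing
```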

Leverage update message, so that you can extend the visibility timeout based on how much work you still need to do, as sketched below.
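A sketch of lease renewal for long-running work; done_processing and do_some_work are hypothetical placeholders:

```python
import os
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "work-items")

for msg in queue.receive_messages(visibility_timeout=60):
    lease = msg
    while not done_processing():  # hypothetical progress check
        do_some_work()            # hypothetical unit of work
        # Renew before the current visibility window expires;
        # update_message returns the message with a fresh pop receipt.
        lease = queue.update_message(lease, visibility_timeout=60)
    queue.delete_message(lease)
```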

Message count: use this to scale your workers. You can fetch the approximate message count of the queue and, based on that, scale out or scale down the number of workers that process these messages, for example:
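In this sketch the messages-per-worker ratio and the cap are illustrative assumptions:

```python
import math
import os
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"], "work-items")

# The count is approximate, but it is good enough to drive scaling.
depth = queue.get_queue_properties().approximate_message_count
workers = max(1, min(32, math.ceil(depth / 500)))  # hypothetical ratio/cap
```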

Dequeue count: this gives you an opportunity to learn whether you're setting a long enough visibility timeout, or whether there are poison messages making your workers crash during processing. You can use the dequeue count to alert when it reaches a certain threshold, for example, and also remove the message and store it in a separate location:
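A sketch of a poison-message check, assuming a separate "work-items-poison" queue and a hypothetical process handler:

```python
import os
from azure.storage.queue import QueueClient

conn = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
queue = QueueClient.from_connection_string(conn, "work-items")
poison_queue = QueueClient.from_connection_string(conn, "work-items-poison")

MAX_DEQUEUE = 5  # hypothetical alert threshold

for msg in queue.receive_messages(visibility_timeout=60):
    if msg.dequeue_count > MAX_DEQUEUE:
        # Likely poison: park it in a separate queue and remove it so
        # it stops crashing workers.
        poison_queue.send_message(msg.content)
        queue.delete_message(msg)
        continue
    process(msg.content)  # hypothetical handler
    queue.delete_message(msg)
```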

Use blobs to store large messages: each message can be at most 64 KB in size, so for anything bigger, or to batch multiple messages together, upload the payload (or the whole batch of messages) into a blob and store the blob URL in the queue message. While processing, the consumer downloads the blob. You get more throughput than a single queue message can carry by doing that:
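A sketch of the blob-pointer pattern; the container and queue names and the batch shape are illustrative assumptions:

```python
import json
import os
import uuid
from azure.storage.blob import ContainerClient
from azure.storage.queue import QueueClient

conn = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
blobs = ContainerClient.from_connection_string(conn, "message-payloads")
queue = QueueClient.from_connection_string(conn, "work-items")

# A batch of messages whose total size far exceeds the 64 KB limit.
batch = [{"job": i, "payload": "x" * 4096} for i in range(100)]

# Upload the batch as one blob and enqueue only a pointer to it; the
# consumer reads the URL from the message and downloads the blob.
blob = blobs.upload_blob(name=uuid.uuid4().hex,
                         data=json.dumps(batch).encode())
queue.send_message(blob.url)
```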

Multiple queues: a single queue is a single partition and can take up to 2,000 messages per second at 1 KB each. If you need more throughput, use multiple queues, for example:
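A round-robin fan-out sketch; the queue naming scheme and the number of queues are illustrative assumptions:

```python
import itertools
import os
from azure.storage.queue import QueueClient

conn = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
# Each queue is its own partition, good for roughly 2,000 msgs/sec,
# so four queues raise the ceiling to roughly 8,000 msgs/sec.
queues = [QueueClient.from_connection_string(conn, f"work-items-{i}")
          for i in range(4)]

rr = itertools.cycle(queues)
for job in ("job-a", "job-b", "job-c"):  # stand-in payloads
    next(rr).send_message(job)
```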

 

