August 31, 2015

Data Ingestion and Access Using Object Storage

The massive growth in unstructured data (documents, images, videos, and so on) is one of the greatest problems facing today’s IT personnel. The challenge is to store all of that data on a platform that can scale as quickly as the data itself grows. Object storage is a cost-effective, scale-out solution well suited to storing very large amounts of unstructured data.

SoftLayer offers object storage based on the OpenStack Swift platform. Object storage provides a fully distributed, scalable, API-accessible storage platform that can be integrated directly into applications. It can be used for storing static data, such as virtual machine (VM) images, photos, emails, and so on.

There are two important use cases when working with object storage: data ingestion and data access.

Data ingestion use case
A large medical research company needs to upload a large amount of data into its SoftLayer environment. The requirement is for a multi-hundred-terabyte image repository that contains hundreds of millions of images. Researchers will then upload code to run on bare metal servers with GPUs to process the images in the repository. The images range from 512KB CT images to 30MB to 50MB mammograms and are logically grouped into 12 million “studies.” The client wants to onboard the data as quickly as possible.

Recommendations

  • Evenly distribute the objects into approximately 1,000 containers for the initial upload. For the number of objects the client needs to store, our tests have shown that having a much larger number of containers, or too few objects per container, would incur significant performance penalties. The proposed 1,000 containers strike a good balance: they allow enough parallelism in object creation while keeping the container sizes manageable.
  • Concurrently add new objects to all containers using 400 worker threads for small objects (e.g., 512KB CT images) and 40 worker threads for large objects (e.g., 30MB to 50MB mammograms). The ideal number of worker threads depends on the workload size. Using too few threads results in better response times but lower throughput. Using significantly more threads may worsen both latency and throughput because the threads start competing for resources.
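
The two recommendations above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the tooling used in our tests: the container naming scheme and the helpers `container_for` and `ingest` are hypothetical, and the actual upload call is passed in as a function so that the sketch does not assume any particular Swift client API.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_CONTAINERS = 1000        # approximate container count recommended above
SMALL_OBJECT_WORKERS = 400   # e.g., 512KB CT images
LARGE_OBJECT_WORKERS = 40    # e.g., 30MB to 50MB mammograms


def container_for(object_name, num_containers=NUM_CONTAINERS):
    """Map an object name to a container using a stable hash.

    The "images-NNNN" naming scheme is an assumption made for this sketch.
    """
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    return "images-%04d" % (int(digest, 16) % num_containers)


def ingest(object_names, upload, workers):
    """Upload objects concurrently with a fixed pool of worker threads.

    `upload(container, name)` is supplied by the caller (hypothetical);
    it would wrap whatever client performs the real PUT request.
    Returns the container chosen for each object, in input order.
    """
    def task(name):
        container = container_for(name)
        upload(container, name)
        return container

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, object_names))
```

A caller would run `ingest(ct_image_names, upload, SMALL_OBJECT_WORKERS)` for the small objects and `ingest(mammogram_names, upload, LARGE_OBJECT_WORKERS)` for the large ones. Hashing the object name keeps the distribution roughly even without any shared counter between threads; the right worker counts for a different workload would still need to be found by testing.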

Data access use case
A large technology company has a mix of GET, PUT, and DELETE operations for which it needs object storage capable of holding billions of small objects (15KB or less). They also want consistent latencies for their operation mix (GET 54%, PUT 33%, and DELETE 13%), which requires optimal tuning for consistent performance. The client’s benchmarking calls for 1,400 operations per second.

Recommendations

  • Use multiple containers (at least 40) to improve the latency of PUT and DELETE operations. As long as the objects are distributed over at least 40 containers with a sufficient number of worker threads, the average latencies for PUT and DELETE operations were well below 100ms in our tests. There may be occasional latency spikes, which are not surprising on shared storage systems, but overall, the latencies should be relatively consistent.
    • The read latency for a GET is very low: less than 20ms on average for small objects.
  • Use multiple containers if very high throughput is needed. In our tests, we could drive more than 6,000 transactions per second on the production cluster with at least 40 containers.
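
To check recommendations like these against your own workload, a small harness along the following lines can help. It is a rough sketch under stated assumptions, not the benchmark we ran: `run_benchmark`, `pick_operation`, the "bench-NN" container names, and the caller-supplied `do_request` function are all hypothetical, and the loop is single-threaded for brevity, so it measures latency per operation type rather than peak throughput.

```python
import random
import time
from collections import defaultdict

OPERATION_MIX = {"GET": 0.54, "PUT": 0.33, "DELETE": 0.13}  # mix from the use case
NUM_CONTAINERS = 40  # minimum number of containers recommended above


def pick_operation(rng):
    """Choose an operation at random, weighted by the client's operation mix."""
    ops = list(OPERATION_MIX)
    weights = [OPERATION_MIX[op] for op in ops]
    return rng.choices(ops, weights=weights, k=1)[0]


def run_benchmark(do_request, num_requests, seed=0):
    """Issue a mixed workload and return the average latency (ms) per operation.

    `do_request(op, container)` is supplied by the caller (hypothetical);
    it would perform the real GET, PUT, or DELETE against object storage.
    """
    rng = random.Random(seed)
    latencies = defaultdict(list)
    for _ in range(num_requests):
        op = pick_operation(rng)
        # Spread requests uniformly over the containers.
        container = "bench-%02d" % rng.randrange(NUM_CONTAINERS)
        start = time.perf_counter()
        do_request(op, container)
        latencies[op].append((time.perf_counter() - start) * 1000.0)
    return {op: sum(values) / len(values) for op, values in latencies.items()}
```

Running the same harness with `NUM_CONTAINERS` set to 1 and then to 40 gives a simple way to compare PUT and DELETE latencies between a single container and the recommended spread.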

-Naeem Altaf & Khoa Huynh
