
Think About It: Loop Iteration Performance (Part 2)

3 min read
Florian Horn
Business Analyst Digital Sales

This article is the second in a three-part series and describes how we optimized our data processing and achieved performance improvements by tweaking our code. Make sure you have read the first article about how we tweaked PHPExcel to read Excel and CSV files faster.

Our performance optimization sprint covered reading file data, processing it, and persisting it. While the files themselves are relatively small in terms of file size, the number of data sets they contain varies between 5,000 and more than 40,000 entities on average, and may be a lot more in some cases.

The following code examples are shortened and simplified to focus on the issue we encountered and how we solved it. In the first example, we fetch the data from the file and store it in the variable $dataProductSet as an array:

/* @var $dataProductSet \Foo\Bar\File\Data\Products[] */
$dataProductSet = array(...);

foreach ($dataProductSet as $dataProduct) {
    /* @var $dataProduct \Foo\Bar\File\Data\Products */
    // One data storage request per row to find an existing product
    $relatedProduct = $this->productManager->getProductBySku($dataProduct->getSku());

    if ($relatedProduct) {
        // Already existing product: a second request fetches its product group
        $productGroup = $this->productManager->getProductGroupById(
            $relatedProduct->getProductGroupId()
        );
        // ...
    } else {
        // New product
        // ...
    }
}

The data is already mapped to the corresponding class entity (e.g. \Foo\Bar\File\Data\Products), which represents the entity's file structure. Now the data must be mapped to the internally used data structure so it can be persisted. We iterate over the set of product data and, for each raw data product, check whether the product already exists by asking the ProductManager, which in turn fetches the data from the data storage. We then process further data the same way, for example fetching the product group, and so on.
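To make the cost of this pattern visible, here is a minimal sketch of what such a lookup typically does under the hood. The SQL query, the table name, and the \PDO connection are assumptions for illustration only, not our actual ProductManager implementation:

class ProductManager
{
    /** @var \PDO */
    private $connection;

    /**
     * Hypothetical implementation: every call issues its own query
     * against the data storage, so N rows cause N round trips.
     */
    public function getProductBySku($sku)
    {
        $statement = $this->connection->prepare(
            'SELECT * FROM products WHERE sku = :sku'
        );
        $statement->execute(array('sku' => $sku));

        return $statement->fetch(\PDO::FETCH_ASSOC) ?: null;
    }
}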

The given example works fine, but it is far from performance-optimized. This may be no issue for a handful of entries in the $dataProductSet collection, but with 40,000 entries you have to process 80,000 data storage requests (two lookups per entry) in this example alone, and that takes a lot of time.

The solution is simple and really fast: The Index-Cached Set.

The Index-Cached Set is nothing more than a rearranged array whose keys are the values you access your list by, for example the entity ID or, in our case, the SKU. A method building such a set can look like this:

/**
 * Rearranges a list of products into an array indexed by SKU,
 * allowing constant-time lookups instead of one storage request per row.
 */
protected function getProductsIndexedBySku(array $products)
{
    $indexedSet = array();

    foreach ($products as $product) {
        $indexedSet[$product->getSku()] = $product;
    }

    return $indexedSet;
}

So the example from the beginning will look like this after some small refactoring:

/* @var $dataProductSet \Foo\Bar\File\Data\Products[] */
$dataProductSet = array(...);

// A single data storage request fetches all products up front
$indexedProducts = $this->getProductsIndexedBySku($this->productManager->getProducts());

foreach ($dataProductSet as $dataProduct) {
    /* @var $dataProduct \Foo\Bar\File\Data\Products */
    if (isset($indexedProducts[$dataProduct->getSku()])) {
        // Already existing product
        // ...
    } else {
        // New product
        // ...
    }
}

Now you hit the data storage only once, avoiding additional per-row overhead (such as opening a database transaction). Applying the same pattern to the product group data as well, we reduced the number of data storage requests from 80,000 to 2, and that number no longer grows linearly (or worse) with the amount of incoming data.
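For completeness, a product group index can be built the same way. This is a sketch mirroring the SKU example above; the method names getProductGroups() and getId() are assumptions, not confirmed parts of our ProductManager API:

/**
 * Hypothetical counterpart to getProductsIndexedBySku(): indexes
 * product groups by their ID for constant-time lookups.
 */
protected function getProductGroupsIndexedById(array $productGroups)
{
    $indexedSet = array();

    foreach ($productGroups as $productGroup) {
        $indexedSet[$productGroup->getId()] = $productGroup;
    }

    return $indexedSet;
}

// Second and last storage request: fetch all product groups once
$indexedGroups = $this->getProductGroupsIndexedById(
    $this->productManager->getProductGroups()
);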

Using the Index-Cached Set helps reduce the number of data storage requests and avoids unwanted overhead that slows down the process. By optimizing our code like this, we reduced the processing duration from double-digit minutes down to a few seconds or less.

The next article of this series will be about how we optimized the persistence process of bulk data in our code in combination with PostgreSQL.