Think About It: PHPExcel Performance Tweaks (Part 1)
A few weeks back I wrote a short article about a CSV tool optimized for memory usage and, additionally, for performance.
Our performance optimization sprint covered improving how we read, process and persist file data. While the files themselves are relatively small in terms of file size, the number of data sets varies between 5,000 and more than 40,000 entities on average, and may be a lot higher in some cases.
This article is the first of a three-part series and describes how we tweaked PHPExcel to run faster while reading Excel and CSV files. Originally we read files uploaded by our users with a single statement like this:
$sheetData = $objPHPExcel->getActiveSheet()->toArray();
This leads to heavy memory trouble on files with a lot of cells. By default, PHPExcel uses about 1 KB of memory per cell. Our users uploaded files with almost a thousand cells per row, resulting in an average memory usage of about 1 MB per row. As stated above, the files contain between 5,000 and 40,000 entities, each represented as a single row, which adds up to several gigabytes for a single upload and ended in memory exhaustion. So this was going nowhere for us. We applied the following tweaks to the PHPExcel configuration and usage:
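You can see the problem for yourself by measuring the peak memory of the naive approach. The snippet below is only an illustration; the file name is made up:
// illustration only: load an uploaded file (hypothetical name) the naive way
$objPHPExcel = \PHPExcel_IOFactory::load('customer-upload.xlsx');
$sheetData = $objPHPExcel->getActiveSheet()->toArray();

// print the peak memory consumed by the full in-memory representation
echo round(memory_get_peak_usage(true) / 1024 / 1024) . " MB\n";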
I. Cache Cell Index in Memory
"This method holds cells in PHP memory as an array of serialized objects, but gzipped to reduce the memory usage still further, although access to read or write a cell is slightly slower." (Source: PHPExcel Developer Documentation, Page 11)
// set caching configuration
$cacheMethod = \PHPExcel_CachedObjectStorageFactory::cache_in_memory_gzip;
\PHPExcel_Settings::setCacheStorageMethod($cacheMethod);
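If the gzipped in-memory cache is still too large for a given workload, PHPExcel also offers a cache backed by php://temp that spills to disk once a memory threshold is exceeded; access is slower again, but the footprint stays small. A minimal sketch, with the 8 MB threshold being just an example value:
// alternative: cache cells in php://temp, keeping at most ~8 MB in memory
$cacheMethod = \PHPExcel_CachedObjectStorageFactory::cache_to_phpTemp;
$cacheSettings = array('memoryCacheSize' => '8MB');
\PHPExcel_Settings::setCacheStorageMethod($cacheMethod, $cacheSettings);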
II. Iterators and GC Optimization
Instead of accessing the file data directly, you can iterate over the meta structure, which is built from worksheets, rows and cells. Using the iterators you can access the cell data with a minimal amount of memory and avoid the runaway memory consumption on bigger files. Additionally, make use of the worksheet disconnector and unset the PHPExcel variable to help the garbage collector.
/* @var $phpExcel \PHPExcel */
$phpExcel = $this->getPhpExcel($fileName);
$sheetIter = $phpExcel->getWorksheetIterator();

// iterate over sheets
foreach ($sheetIter as $currentSheet) {
    // get the row iterator and start at $startRow (defined by the caller)
    $rowIter = $currentSheet->getRowIterator();
    $rowIter->resetStart($startRow);

    // iterate over rows
    foreach ($rowIter as $row) {
        /* @var $row \PHPExcel_Worksheet_Row */
        // get the cell iterator, including empty cells
        $cellIter = $row->getCellIterator();
        $cellIter->setIterateOnlyExistingCells(false);

        // iterate over cells
        foreach ($cellIter as $cell) {
            /* @var $cell \PHPExcel_Cell */
            // process the cell, e.g. $cell->getValue()
        }
    }
}

// optimized garbage collection
$phpExcel->disconnectWorksheets();
unset($phpExcel);
III. Use Custom Read Filters
As an alternative to the iterators, you can define a custom read filter, which lets you completely control which rows and cells are read. The benefit of the read filter is that you can define it index-based, for example to read only the second and fourth column of every second row (see the sketch after the following filter class).
class CustomFilter implements \PHPExcel_Reader_IReadFilter
{
    public function readCell($column, $row, $worksheetName = '')
    {
        // read rows 1 to 7 and columns A to E only
        if ($row >= 1 && $row <= 7) {
            if (\in_array($column, range('A', 'E'))) {
                return true;
            }
        }

        return false;
    }
}
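The index-based example mentioned above, reading only the second (B) and fourth (D) column of every second row, could look like the following sketch; the class name EveryOtherRowFilter is made up for illustration:
// hypothetical filter: only columns B and D of every second row pass
class EveryOtherRowFilter implements \PHPExcel_Reader_IReadFilter
{
    public function readCell($column, $row, $worksheetName = '')
    {
        // only every second row ...
        if ($row % 2 === 0) {
            // ... and within those rows only the second and fourth column
            return \in_array($column, array('B', 'D'));
        }

        return false;
    }
}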
The custom filter must be passed to the reader via its setReadFilter() method.
$cm = \PHPExcel_CachedObjectStorageFactory::cache_in_memory_gzip;
\PHPExcel_Settings::setCacheStorageMethod($cm);
$inputFileType = ...
$readFilter = new \CustomFilter();
$objReader = \PHPExcel_IOFactory::createReader($inputFileType);
$objReader->setReadFilter($readFilter);
$objPHPExcel = $objReader->load($inputFileName);
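Read filters also make chunked loading possible: a filter that only lets a sliding window of rows pass keeps the memory footprint constant, no matter how large the file is. A rough sketch of that pattern, assuming a hypothetical ChunkReadFilter, a chunk size of 1,000 rows and a known $lastRow:
// hypothetical filter: only the heading row and the current chunk pass
class ChunkReadFilter implements \PHPExcel_Reader_IReadFilter
{
    private $startRow = 0;
    private $endRow = 0;

    public function setRows($startRow, $chunkSize)
    {
        $this->startRow = $startRow;
        $this->endRow = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '')
    {
        return $row === 1 || ($row >= $this->startRow && $row < $this->endRow);
    }
}

$objReader = \PHPExcel_IOFactory::createReader($inputFileType);
$chunkFilter = new \ChunkReadFilter();
$objReader->setReadFilter($chunkFilter);

// read 1,000 rows at a time; $lastRow has to be known or estimated upfront
for ($startRow = 2; $startRow <= $lastRow; $startRow += 1000) {
    $chunkFilter->setRows($startRow, 1000);
    $objPHPExcel = $objReader->load($inputFileName);

    // ... process the chunk, then free it before loading the next one
    $objPHPExcel->disconnectWorksheets();
    unset($objPHPExcel);
}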
After we ran into heavy memory exhaustion on our customers' staging server and in our integration tests, this tweaking may well have saved our lives ;)
The greatest benefit may be avoiding the exponentially growing processing time on bigger files.
The next article in this series will cover how we optimized our data processing and achieved further performance improvements by tweaking our own code.