08.19.11

Optimizing database queries in CakePHP

Posted in CakePHP, PHP, Web Development at 12:01 pm

We’ve been using CakePHP for a while now. I’ve found it to be very helpful for quick prototyping and development of core site functionality, which leaves more time for tweaking a site’s user interface, functionality, and performance. That last one, performance, can occasionally be a challenge. One way to improve a site’s performance is to craft database queries that take indexes and conditions into account. But because CakePHP builds the SQL, we have to keep in mind how it does so as we construct our query parameters.

Let’s review one of those situations. Say we have the following (simplified) models:

class Item extends AppModel {
  var $hasMany = array('Answer');
  var $belongsTo = array('Packet');
}
 
class Student extends AppModel {
  var $hasMany = array('Answer');
}
 
class Answer extends AppModel {
  var $belongsTo = array('Student','Item','Packet');
}
 
class Packet extends AppModel {
  var $hasMany = array('Answer','Student');
  var $hasAndBelongsToMany = array('Item');
}

Let me explain the model relationships. We have test items that are assigned to packets. Packets are sent out for testing to a group of students, so each answer is unique based on the packet, item, and student. The (somewhat circular) relationships between these models provide easy access to specific data. For example, let’s say we want all the answers from a particular subset of packets for a handful of our items. Here’s an example query that presents some optimization issues:

$queryOpts = array(
  'fields' => array('Item.*'),
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972)),
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*'),
      'conditions' => array(
        'Answer.packet_id IN (SELECT id FROM packets WHERE packet_type = "F")'
      )
    )
  )
);
$items = $this->Item->find('all', $queryOpts);

You’ll note that I’m not using CakePHP syntax for the Answer model condition. We had been using this condition in our web application prior to moving to CakePHP. Since there is not, so far as I can tell, an easy way to work with SQL subqueries in CakePHP, we decided to start by forcing a query similar to the original.

Running the above creates the following SQL statement when querying the answers table:

SELECT `Answer`.* FROM `answers` AS `Answer` WHERE `Answer`.`packet_id` IN (SELECT id FROM packets WHERE packet_type = "F") AND `Answer`.`item_id` IN (1995, 1726, 1971, 1707, 1972)

We have a fairly large data set, so the above query takes ~32 seconds to run on my system. But that query is not optimized; reordering the conditions in the WHERE clause so that the item_id condition comes first produces the following query:

SELECT `Answer`.* FROM `answers` AS `Answer` WHERE `Answer`.`item_id` IN (1995, 1726, 1971, 1707, 1972) AND `Answer`.`packet_id` IN (SELECT id FROM packets WHERE packet_type = "F")

The optimized query performs significantly better, <1 second.

Since we know which items we want in this example, we can update the query to manually force the desired ordering:

$queryOpts = array(
  'fields' => array('Item.*'),
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972)),
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*'),
      'conditions' => array(
        // Listed first so the item condition precedes the subquery in the WHERE clause
        'Answer.item_id' => array(1995, 1726, 1971, 1707, 1972),
        'Answer.packet_id IN (SELECT id FROM packets WHERE packet_type = "F")'
      )
    )
  )
);
$items = $this->Item->find('all', $queryOpts);

But what if we don’t know which items we want? Say we’re pulling items (and answers) as part of another containable query. Then we have no way to specify the items in the Answer model conditions. But we can improve the query so that we’re not using a subquery each time (i.e. we’re doing it the “CakePHP” way):

$fields = array('Packet.id');
$conditions = array('Packet.packet_type' => 'F');
$packet_list = $this->Item->Packet->find('list', array('fields' => $fields, 'conditions' => $conditions));
 
$queryOpts = array(
  'fields' => array('Item.*'),
  'conditions' => array('Item.id' => array(1995, 1726, 1971, 1707, 1972)),
  'contain' => array(
    'Answer' => array(
      'fields' => array('Answer.*'),
      'conditions' => array(
        'Answer.packet_id' => $packet_list
      )
    )
  )
);
$items = $this->Item->find('all', $queryOpts);

Which produces the following two queries:

SELECT `Packet`.`id` FROM `packets` AS `Packet` WHERE `Packet`.`packet_type` = 'F';
SELECT `Answer`.*, `Answer`.`item_id` FROM `answers` AS `Answer` WHERE `Answer`.`packet_id` IN (1278, 1277, 1276, 1274,...,1726, 1971, 1972, 1995);

Again we’re seeing the kind of performance we want, <1 second for both queries. As good as when we knew all our conditions in advance. The main drawback to doing it this way is the size of the query string sent to your database. If your first query is pulling a large number of results then your second query will be quite large. And most databases have limits on how large a query can be.
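One way to keep the second query from growing too large is to split the id list into batches and run the query once per batch. The following is only a sketch in plain PHP, not something we run in production: findInBatches() is a hypothetical helper, the batch size of 1000 is an arbitrary assumption, and the callback stands in for the real find() call on the Answer model.

```php
<?php
// Hypothetical helper: call $runQuery once per batch of ids and merge the rows.
// $runQuery stands in for a real query, e.g. a find() on the Answer model
// with 'Answer.packet_id' => $batch as its condition.
function findInBatches(array $ids, int $batchSize, callable $runQuery): array
{
    $rows = array();
    foreach (array_chunk($ids, $batchSize) as $batch) {
        // Each call sees at most $batchSize ids for its IN (...) list.
        $rows = array_merge($rows, $runQuery($batch));
    }
    return $rows;
}

// Stand-in "query" for demonstration: returns one fake row per id.
$fakeQuery = function (array $batch): array {
    return array_map(function ($id) {
        return array('Answer' => array('packet_id' => $id));
    }, $batch);
};

$rows = findInBatches(range(1, 2500), 1000, $fakeQuery);
echo count($rows), "\n"; // prints 2500 (three batches: 1000, 1000, 500)
```

The trade-off is more round trips to the database in exchange for a bounded query size, so the batch size is worth tuning against your own data.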

This isn’t the only way to get to the end point either. As with any programming task there are a variety of ways to arrive at the desired result. For example, you could break up the query into multiple queries that build off of the previous query. Sticking with our items/answers query you could first get the items in one query. Then foreach through the results and run a query to get the answers, integrating the results manually. It’s more work, and likely would decrease performance since you’re increasing the number of database queries, but if you can’t seem to optimize your all-in-one queries it’s worthwhile to try other options to see if you can get better performance.
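Whether the answers come from one query per item or from a single separate query, the manual integration step might look something like the sketch below. It uses plain arrays shaped like Cake results; attachAnswers() and the sample rows are hypothetical and only illustrate the idea.

```php
<?php
// Hypothetical helper: attach each answer row to its parent item.
// The row shapes mimic Cake-style results but are plain arrays here.
function attachAnswers(array $items, array $answers): array
{
    // Group the answers by their item_id foreign key.
    $byItem = array();
    foreach ($answers as $row) {
        $byItem[$row['Answer']['item_id']][] = $row['Answer'];
    }
    // Give every item an 'Answer' key, empty when nothing matched.
    foreach ($items as $i => $item) {
        $id = $item['Item']['id'];
        $items[$i]['Answer'] = isset($byItem[$id]) ? $byItem[$id] : array();
    }
    return $items;
}

$items = array(
    array('Item' => array('id' => 1995)),
    array('Item' => array('id' => 1726)),
);
$answers = array(
    array('Answer' => array('id' => 1, 'item_id' => 1995)),
    array('Answer' => array('id' => 2, 'item_id' => 1995)),
);
$merged = attachAnswers($items, $answers);
echo count($merged[0]['Answer']), ' ', count($merged[1]['Answer']), "\n"; // prints "2 0"
```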

Aside: performance problems due to poorly optimized queries can be somewhat mitigated by query caching. Even so, you’re better off with an optimized query to begin with since it saves you from a performance hit every time the relevant cached result is flushed.

06.24.11

ASP goes on a diet

Posted in Web Development at 10:35 am

I recently modified our custom mass mailer script to accept attachments. After the modifications the new functionality worked nicely, except when it didn’t. Sometimes the script would error out when it accessed the Request object. When this happened I would receive the following message:

Request object error 'ASP 0104 : 80004005' Operation not Allowed

Though the error messages produced by ASP tend to be a bit cryptic, Microsoft sometimes does a decent job of documenting them on their web site. A quick search revealed the problem: IIS 5.x and 6 have a default request limit of 200K. That’s sufficient in most situations where you would attach something to an email, but once you start attaching multiple documents there’s a chance you’ll need a higher limit. The fix is easy enough, but does require editing the IIS metabase. Setting the AspMaxRequestEntityAllowed property to an appropriate limit (in bytes, up to 1GB max) addresses the issue. I like to use metaedit to edit the metabase, but you can also use the command-line metabase editing script:

cscript adsutil.vbs set w3svc/ASPMaxRequestEntityAllowed 20971520

20971520 is the request size limit in bytes (20 MB in this case, i.e. 20 × 1024 × 1024).

05.27.11

WP redirects confuse IE

Posted in System Administration, Web Development, WordPress at 11:34 am

Some users were having problems on a community site that was implemented using WordPress+BuddyPress. After some testing, it turned out that the issue affected only users visiting the site with Internet Explorer (IE). These users were receiving error pages instead of site content. Other web browsers did not have similar issues.

The cause of the problem turns out to be a confluence of issues:

  • The site is hosted on a Windows+IIS server, a rare platform for WordPress (and even more so for BuddyPress) and one that probably doesn’t receive full attention during quality assurance testing.
  • How WordPress performs redirects on IIS is a bit quirky. BuddyPress issues a lot of redirects so this redirect quirk comes into play quite often. The issue is that when WordPress needs to perform a redirect on IIS it returns a “refresh” header pointing to the new page rather than a “location” header.
  • IE’s attempt to make the Internet more friendly; specifically IE’s use of “friendly error pages.” These friendly error pages replace the content delivered by the server (if that content falls below a certain size in KB).

Normally none of these issues is a problem by itself, and a web browser (including IE) will load the page indicated by the redirect. However, all three taken together result in a situation where IE never sees the refresh header and so doesn’t redirect the user to the correct location.

The fix is fairly simple: change the headers that WordPress sends to include the standard “location” header. To do this, modify wp_redirect() in wordpress/wp-includes/pluggable.php so that it reads as follows (line 14 of the function, the Location header in the $is_IIS branch, is new):

function wp_redirect($location, $status = 302) {
	global $is_IIS;
 
	$location = apply_filters('wp_redirect', $location, $status);
	$status = apply_filters('wp_redirect_status', $status, $location);
 
	if ( !$location ) // allows the wp_redirect filter to cancel a redirect
		return false;
 
	$location = wp_sanitize_redirect($location);
 
	if ( $is_IIS ) {
		header("Refresh: 0;url=$location");
		header("Location: $location", true, $status);
	} else {
		if ( php_sapi_name() != 'cgi-fcgi' )
			status_header($status); // This causes problems on IIS and some FastCGI setups
		header("Location: $location", true, $status);
	}
}

05.13.11

Integrating calculated fields and model data in CakePHP

Posted in PHP, Web Development at 3:48 pm

(This is mostly a summary of “Dealing with calculated fields in CakePHP’s find()”.)

One of the great things about CakePHP is that if it doesn’t have some core functionality you want/need there are easy ways to add it. More and more I’m taking advantage of this ability. This all comes about because I wanted to have calculated fields available inline with the model data. By calculated fields I mean results that are not data columns (e.g. SELECT *, CURDATE() AS current_date FROM users … yes, that’s a fairly contrived example).

By default Cake places results from calculated fields outside the model data, like this:

Array
(
    [0] => Array
        (
            [User] => Array
                (
                    [id] => 1
                    [username] => aaas
                )
 
            [0] => Array
                (
                    [current_date] => '2011-05-13'
                )
 
        )
)

What we want is to place the “current_date” calculated field inside the User model, so it’s more naturally accessed with $users[0]['User']['current_date'] instead of $users[0][0]['current_date']. That’s easy enough to do through a model’s afterFind() callback method (to make it widely available, place the function in app_model.php).

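function afterFind($results, $primary=false) {
	// Only touch results for the model that was queried directly.
	if($primary == true) {
		// Proceed only if the first row has both calculated fields ([0]) and model data.
		if(Set::check($results, '0.0') && Set::check($results, "0." . $this->alias)) {
			$fields = array_keys( $results[0][0] );
			foreach($results as $key=>$value) {
				// Copy each calculated field into the model's array, then drop the [0] entry.
				foreach( $fields as $fieldName ) {
					$results[$key][$this->alias][$fieldName] = $value[0][$fieldName];
				}
				unset($results[$key][0]);
			}
		}
	}
	return $results;
}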

And yet, even as I add such enhanced functionality to my web apps I’m finding limits. The above logic is complicated a little because you don’t know what type of results you’re getting. The results you get from calling, for example, find('all') versus find('count') are not the same. But Cake doesn’t give any hints to the afterFind() callback, so additional logic is needed to try and guess at our data structure. The above adds a quick but incomplete hack by a) checking that the results are for the primary model queried (e.g. not from a contained model), b) checking for the presence of nested numerical keys, and c) checking that there is model data to integrate with.

The takeaway is that while the code produces the desired data structure, in its current form it does so only for specific results.
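To see the merge step in isolation, here is a rough plain-PHP sketch that works on ordinary arrays outside of Cake. mergeCalculatedFields() is a hypothetical stand-in for the callback, with $alias playing the role of $this->alias; it is not part of CakePHP.

```php
<?php
// Hypothetical stand-alone version of the merge step (not a CakePHP API).
function mergeCalculatedFields(array $results, string $alias): array
{
    // Guess at the structure: act only when the first row has both
    // a numeric [0] key (calculated fields) and the model's own key.
    if (!isset($results[0][0], $results[0][$alias])) {
        return $results;
    }
    foreach ($results as $key => $row) {
        if (!isset($row[0])) {
            continue; // nothing to merge for this row
        }
        // Move every calculated field into the model's array.
        foreach ($row[0] as $field => $value) {
            $results[$key][$alias][$field] = $value;
        }
        unset($results[$key][0]);
    }
    return $results;
}

$users = array(
    array(
        'User' => array('id' => 1, 'username' => 'aaas'),
        0 => array('current_date' => '2011-05-13'),
    ),
);
$merged = mergeCalculatedFields($users, 'User');
echo $merged[0]['User']['current_date'], "\n"; // prints 2011-05-13
```

Results that lack the model key in the first row are returned untouched, which is the same kind of guesswork described above.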

05.06.11

Align two columns in Excel

Posted in Web Development at 2:45 pm

I recently had two sets of data, one a full list of records and the other a list of identifiers for the records that needed to be extracted. Extracting the relevant records from the full list would be a fairly easy programming task, but the data was in Excel and I wanted to try and accomplish the task in that environment. Thankfully this problem has already been solved and the answer posted to the web.

If I have one column of identifiers for the records that need to be extracted (column A, 100 records) and one column with the full list of identifiers (column B, 1000 records) the following formula will indicate which identifiers from column B match an identifier from column A:

=IF(ISNA(MATCH(B1,$A$1:$A$100,0)),"","X")

Just place the formula in its own column and copy down for the length of the full data set. The columns do not need to be in any particular order and you can create a separate worksheet that contains your filter list, keeping your data and your filter separate. You can then AutoFilter on your search column to see the results (and copy/paste to another worksheet if necessary).
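For comparison, the “fairly easy programming task” version of the same match could be sketched in PHP as follows. markMatches() and the sample identifiers are made up for illustration, and unlike Excel’s MATCH this comparison is case-sensitive.

```php
<?php
// Hypothetical helper mirroring =IF(ISNA(MATCH(B1,$A$1:$A$100,0)),"","X"):
// returns "X" for each identifier in the full list that is in the filter list.
function markMatches(array $fullList, array $filterList): array
{
    // Flip the filter list so membership checks are simple key lookups.
    $wanted = array_flip($filterList);
    $marks = array();
    foreach ($fullList as $id) {
        $marks[] = isset($wanted[$id]) ? 'X' : '';
    }
    return $marks;
}

$columnA = array('r-002', 'r-004');                   // identifiers to extract
$columnB = array('r-001', 'r-002', 'r-003', 'r-004'); // full list
print_r(markMatches($columnB, $columnA));
```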

What file was that in?

Posted in System Administration, Web Development at 2:13 pm

One of the more annoying things about coding is finding the right file. Or, even worse, finding a file you didn’t know you needed to look at. Especially when the number of files you’re parsing is in the thousands. If you’re on Windows the built-in search can help, but you never know if all the files in the target directory have been indexed.

Luckily, one of the nice things about coding is that you’re often dealing with plain text files. And you are typically searching for a particular string. As they say, there’s an app for that. Both *nix and Windows are capable of searching through the contents of an entire hierarchy of files using command-line programs. Each OS has a variety of commands that can do the job, but I’ll highlight the two I use most often here.

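In Windows you would use findstr.exe, and it’s as simple as running the following from the containing directory (/M prints only the names of matching files, /I ignores case, /S searches subdirectories; filetype is a pattern such as *.php):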

findstr.exe /MIS "searchtext" filetype

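On *nix systems the grep command is your friend. Because a bare file pattern is not applied inside subdirectories when recursing, GNU grep’s --include option is the safer way to limit the search by file type:

grep -lr --include="filetype" "searchtext" .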

Of course, a quick Google search will get you all the help you need in refining your search.

12.23.10

Executable blocking on Windows 2003

Posted in System Administration at 1:40 pm

I was recently attempting to install PHP using an installer (MSI) I downloaded to my local workstation and copied to one of our servers. When I attempted to run the installer I received the following inscrutable warning:

Windows cannot access the specified device, path, or file. You may not have the appropriate permissions to access the item.

Thanks, Windows, that’s helpful. I had no idea what was happening, especially since I was logged in as an administrator with full control of the file.

As usual, the web was my friend in solving the problem. Windows 2003 Server has a security feature where executables copied from remote systems are blocked for execution unless allowed by an administrator. The feature appears to have been present for a while (since SP2?), but I don’t recall running into it before. As mentioned above, the file in my case was copied from my workstation over SMB; I don’t know if other transfer methods are also affected. I don’t have a problem with the feature, but I do have a problem with how it’s implemented. An indication in the pop-up of why the file could not be accessed would have resulted in a resolution of seconds not … er … minutes.

The resolution is simple enough: go into the properties for the file and you’ll see at the bottom a security warning and a button that allows you to unblock the file.

Click the “Unblock” button and you can execute the file as you normally would.

12.03.10

Careful with Google Analytics filters

Posted in Web Development at 12:23 pm

I made a change to the filters attached to our profiles in Google Analytics (GA) so that we would capture only relevant traffic. Namely, I was attempting to avoid capturing traffic to our development servers. Unfortunately, instead of a domain-based filter I created one that resulted in GA capturing zero traffic. Fail.

The problem stems from my attempt to use a predefined filter to block out the unwanted traffic. I tried to set up an “include only traffic from the domains that include” filter, thinking this was a filter on the server’s domain. What this filter actually checks against is the visitor’s domain. I guess a little more attention to the “from/to” part of the filter would have made this apparent, but Google doesn’t provide much documentation (or even better, examples) on their filters.

After a few days I noticed the sudden lack of data and realized my error. Turns out what I really wanted was a custom filter with the following settings:

  • filter type: include
  • filter field: hostname
  • filter pattern: regex for the domain (e.g. www\.project2061\.org)
  • case sensitive: no

Even after looking at this again, it’s still not clear to me why these two are different. The descriptive language is almost identical. Google needs to do some work here.

Of course, there are also other solutions to this problem, listed below:

  • Use an include to pull in the GA code and only populate the file on the servers where tracking should occur. The only significant problem here would be forgetting to set up the include.
  • Create advanced segments in GA that filter out traffic to anything other than the production domains. This would have the benefit of being able to track both production and development server use in the same report. But if there are any significant differences between the two sites then this wouldn’t be all that helpful.

suhosin to [internal web app]: you talk too much

Posted in PHP, System Administration at 11:13 am

Following up on my earlier post, I’ve had to make some further configuration adjustments to avoid suhosin-related restrictions in one of our custom web applications. This particular application has a function that generates a summary of data from student assessments. The summaries can be generated based on groupings of packets and items. Depending on the filtering parameters selected there can be a fairly large number of packets and items. Not all of the packets necessarily contain the items of interest, but it’s always easier to select all if you want an overall summary of item performance.

I recently noticed the following alert in the system log:

ALERT - configured POST variable limit exceeded - dropped variable 'included_packet_ids[]' (attacker USER_IP_ADDRESS, file REPORT_FILTERING_PAGE)

One of the reasons I use POST variables on this page is to avoid the relatively small data size limit of GET. Suhosin adds limits of its own, including one on the number of variables a POST request may contain. Our limit was set at 1000, meaning there were over 1000 packets selected. This points to a need to adjust how the filter “selects all” … but for now I’ve adjusted the suhosin limit upward by modifying the suhosin.post.max_vars setting.
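One possible adjustment, sketched here as an idea rather than what our application does, is to send the selected ids as a single comma-separated value so that they count as one POST variable. The helper parseIdList() and the sample input are hypothetical. Suhosin also limits the length of individual values (suhosin.post.max_value_length), so very large selections may still need a different approach.

```php
<?php
// Hypothetical helper: turn a comma-separated id string into a clean list of ints.
function parseIdList(string $raw): array
{
    $ids = array();
    foreach (explode(',', $raw) as $part) {
        $part = trim($part);
        // Keep only non-empty, purely numeric pieces; anything else is dropped.
        if ($part !== '' && ctype_digit($part)) {
            $ids[] = (int) $part;
        }
    }
    // Remove duplicates and reindex from zero.
    return array_values(array_unique($ids));
}

// Sample input as it might arrive in a single POST field.
$ids = parseIdList('1278, 1277,,abc,1277');
print_r($ids);
```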

11.05.10

suhosin to WordPress: go on a diet

Posted in PHP, System Administration, WordPress at 3:45 pm

We were seeing a lot of suhosin alerts in the system messages log of the type:

ALERT - script tried to increase memory_limit to 268435456 bytes which is above the allowed value (attacker SERVER_IP_ADDRESS, file WP_MAIN_ADMIN_PAGE, line 96)

The source of the issue is WordPress. The application is trying to raise the memory limit and suhosin won’t let it. Apparently WordPress will try to set a 256MB memory limit before executing certain functions. The necessity of adjusting this setting seems questionable to me, but I also understand that it’s often better to play it safe when developing software for public consumption.

I don’t particularly like applications attempting to specify their own resource usage in a web environment. In my mind applications should specify a required/recommended memory limit in the system requirements and stay away from adjusting this setting behind-the-scenes. Tell me during setup if the current setting may result in non-optimal performance or even a halt in script execution. That’s not how it’s done here, but really no harm is done in the long run beyond the annoyance of suhosin throwing errors at the system logs.

There are two easy fixes to the problem:

  1. Set the PHP memory limit to 256MB
  2. Modify the suhosin.memory_limit parameter to 256MB

In our particular situation it’s just as easy to set the PHP memory limit. There’s always a risk of overloading the physical resources, but this site receives little enough traffic that I’m not concerned about the right confluence of requests occurring to cause a crash.
