I nearly burned up my entire afternoon today troubleshooting a few annoying inconveniences with the ADODB database layer, plus the design decisions of MySQL. Here’s what I’ve found out in a few ‘fun’ facts. This post is reasonably heavy on the technical content, so be warned.
You may be tempted to use the MySQL “REPLACE INTO” function for your own code. A sort of hybrid between INSERT or UPDATE operations, MySQL will seek the primary key in your query and check if it matches a database row. A match results in an UPDATE operation, while no match results in an INSERT query being run. Unfortunately, reading the documentation reveals an enraging drawback:
…except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.
This is a poor showing from an optimization perspective, but consider a typical business application: we’re not going for any kind of ridiculous efficiency, and the overhead of a DELETE operation is minimal. The real gain achieved by going through a REPLACE statement is on the development side, where no longer are two code paths or separate queries required for add and edit operations.
Except this isn’t the case.
My specific condition was the standard implementation of a users table: a PRIMARY KEY with an auto-incrementing user identifier, username, a few other fields, and a password storage column. Unlike the chumps at MySpace, I don’t store passwords in reversible encryption/plaintext or make a habit of sending the field contents for account reset emails.
By default, an UPDATE operation only changes the requested fields in a table – so if I left out the password column, the contents would stay the same for that user. Yet somehow, even when I didn’t specify a password change for a user, their password field became blank.
Lo and behold: because of the ridiculous DELETE behaviour present behind the scenes in a REPLACE query, you lose all other contents of that row if you don’t store and specify their values again.
Strike one for REPLACE. The fact that silent data loss can occur is what I’d consider a severity one problem, and if you’re planning on using REPLACE in one of your own programs, ensure that you’ve retrieved all columns from that table first. Not only is it weak from a performance standpoint, but its ability to nuke perfectly legitimate hashed data makes it dangerous.
The second such entertaining point was what happened when I tried to use ADODB’s “AutoExecute” function. This method provides a database-agnostic call to run an insert or update statement based simply on a key/value array:
AutoExecute($table, $arrFields, $mode, $where=false, $forceUpdate=true,$magicq=false)
AutoExecute() inserts or updates $table given an array of $arrFields, where the keys are the field names and the array values are the field values to store. Note that there is some overhead because the table is first queried to extract key information before the SQL is generated. We generate an INSERT or UPDATE based on $mode (see below).
This seems like an incredibly lazy and awesome feature when you think about it: the advantages of the REPLACE INTO statement, with only one minor if/then code block based on insertion or update. However, in the infinite wisdom of ADODB, all strings end up being quoted automatically. This means that you can’t use CURRENT_TIMESTAMP to update columns, nor can you easily use DBTimeStamp since the result of that function has already been escaped. For some reason, ”2008-09-24 20:36” won’t parse inside SQL.
As a result, I’ve dropped back to writing my own parameterized queries for this application. This provides the sanitization features of escaping potentially malicious user input, while preserving the ability to run timestamp updates for rows.
Thoughts? How would you go about ensuring an optimal database interaction?