Taskboy http://taskboy.com/blog/
Joe Johnston blogs about Perl, PHP, web technologies and tech industry thoughts.
Copyright Joe Johnston

Small Hiatus http://taskboy.com/blog/static/1221.html&rss=1

Due to a lingering cold, I'm taking a break from blogging for a bit. This is a great opportunity to use my RSS feed to keep up with my posts.

2010-03-16T07:49:38-7:00
Comment Spam returns http://taskboy.com/blog/static/1220.html&rss=1

Comment spam has returned to Taskboy, but this time, it's personal.

The first version of comment spam appeared to be from a program that simply crawled sites and looked to hook into the comment systems of most blogs. The content of these comments was clearly mechanically produced and made little sense. Technologies like CAPTCHA have mostly eliminated that sort of noise.

I now get messages that are written either by the world's greatest ELIZA bot or by poorly paid folks with computers. The content of these human-generated messages is often on point and could be an attempt to participate in a dialog. However, there is always a URL in the comment form that points to a business associated with the post.

Some of these comments I post. Some I do not. I did not want to unfairly reject a comment. However, I'm pretty sure I see the pattern now and will be less lenient in the future.

2010-02-24T03:52:57-7:00
Google and The Web 2.0 Monoculture http://taskboy.com/blog/static/1219.html&rss=1

Doc Searls points out our growing dependency on Google. In brief, he equates Google with a kind of free public utility that provides the following functionality:

  • Maps/Satellite data
  • Search
  • Mobile phones (via Android)

To that list, I would add the following features on which many people and companies have come to depend:

  • Advertising (via AdWords)
  • Ad-based revenue (via AdSense)
  • Email and Instant Messaging (via Gmail)
  • Voice Mail/Phone routing (via Talk)
  • News aggregation (via News)

Searls, with a "me too" from Dave Winer, declaims that Google has become "too big to fail." They worry that Google, fattened and dependent on its advertising engine, is vulnerable to economic bubbles in advertising. They worry that Google is in a bubble right now.

I do not think this is the case. Google no doubt enjoys quite a bit of revenue from advertising right now. However, no one outside of Google has a complete accounting of the company's revenue. If Google survived the recessions of 2001-2003 and 2008-2010, there is good reason to believe that they will weather future economic storms.

However, Doc Searls points to a more immediate danger that Google presents to consumers: that of the digital monoculture. Google rarely extracts money from users directly. There are only a handful of subscription services offered by them. However, the user base for Gmail is enormous. The same can be said for their advertising services. What would our online life be like if Google went offline tomorrow?

Just looking at one service, Gmail, is illustrative of the scope of the problem. Many companies have outsourced their mail handling entirely to Google. Recall that in the 1990s, IT staffs spent a considerable amount of time and money setting up corporate email systems. Although the largest companies still do this, many simply outsource this task and reap significant savings and better reliability than in-house mail systems offer. Without Google, a very large number of companies and people would not be able to conduct business. Sure, there would be work-arounds: alternate email accounts, telephones, etc. However, this disruption would cost real and measurable dollars.

Perhaps the most immediate effect would be the loss of Google's wonderful, if easily forgotten, search function. To remind those readers recently recovered from a coma, the current neologism for searching online for something is "googling." Imagine the sort of day you would have if your browser returned a 404 missing page error when accessing http://www.google.com/. That would not be a salad day at all.

The open source community, with which both Searls and Winer are associated, has long battled against digital monocultures (e.g. IBM, Microsoft, Apple, etc.). Consumers usually benefit from choice (although not always [remember the mess of the home computer market in the 1980s]). Healthy competition promotes innovation and cost-savings. It also creates a healthier ecosystem in which the failure of one entity does not threaten the survival of everyone.

To this end, consumers of free digital products ought to consider how much they depend on these services. I practice what I preach. I pay for Yahoo Mail Plus, which is $20 a year. It's a fair deal: I see no ads and I can use POP mail. That's pretty short money for a service that has yet to have an outage of more than 5 minutes in three years. The same goes for my blog hosted on bluehost.com. For $7 a month, I get shell access to a very reasonable Linux environment. Of course, there are plenty of free choices for blog hosting these days, but I need to control my content and the context in which it appears. Free services can close shop without notice and there is little consumers can do to retrieve their content.

The dangers of monoculture become readily apparent after a failure. In groups, humans are not noted for their ability to successfully anticipate future disasters. It seems that now, we don't even recall past calamities all that well. One would think that the near fatal collapse of traditional lending institutions who participated in rank speculation would produce a rapid and perhaps onerous regulatory response. However, that has not yet proven to be the case nearly two years after the shock.

From a strictly selfish perspective, I would love to see Google fail completely tomorrow. Business opportunities abound in chaos and fortune favors the bold.

UPDATE: It looks like no one at Yale reads my blog.

2010-02-12T02:46:27-7:00
Linux sound problem resolved with a symlink http://taskboy.com/blog/static/1218.html&rss=1

I tweeted yesterday about my centos 5 linux box losing sound (and frankly X, but that's another story). I tried to use the sound card detector (system-config-soundcard) to find the card again, but it silently failed (typical). The Linux sound howto is over nine years old, so that's pretty useless. What to do?

Well, I turned to my old friends lsmod, dmesg and strace. I could see that there were kernel sound modules loaded. I knew that I had a configuration that had been working. So I issued the following command:

# strace mpg123 /path/to/mp3file

I got quite a bit of output, but the relevant line was this:

open("/dev/snd/pcmC0D0p", O_RDWR|O_NONBLOCK) = -1 ENOENT (No such file or directory)

And sure enough, this is what was in my /dev/snd directory:

[root@durgan snd]# ls
controlC0  hwC1D2    pcmC1D0c  pcmC1D1p  timer
controlC1  pcmC0D0c  pcmC1D0p  seq

Oh look! There's no pcmC0D0p file. What to do? I could have tried to create the missing device file, but I just symlinked pcmC1D0p to pcmC0D0p (ln -s pcmC1D0p pcmC0D0p, run from inside /dev/snd).

And yes, this horrible, horrible hack worked like a charm.

What happened to delete this device file? Why couldn't I make the system re-detect and re-initialize the sound system? It's stupid, stupid user issues like this that have plagued linux since 1995 and show no signs of getting better today. This is why you don't see a lot of linux notebooks or games.

2010-02-08T03:21:38-7:00
Microformats, RDFa and their semantic uses http://taskboy.com/blog/static/1217.html&rss=1

Increasingly, search engines like Google and Yahoo are supporting a kind of HTML markup schema called microformats. A microformat is a kind of embedded document inside a standard web page (which can be thought of as a sort of macroformat, if you will).

There are two flavors of microformats: one for HTML (which introduces no new elements or attributes) and one for XHTML (which adds a few new attributes, but no new tags). The proposed RDFa standard can be thought of as the flavor of microformats for XHTML, since it is based on the RDF XML format. The reality is that there are two groups of people (RDFa, microformats) working in the same space, so there is overlap. However, for now, the HTML vs. XHTML partitioning works well to categorize these two embedded document efforts.

There are a few things you can do with these embedded syntaxes. If you regularly review businesses or products (like yelp.com does), you might consider using the hReview microformat to identify parts of your review, as shown in the following snippet:

<div class="hreview">
   <span class="item">
      <span class="fn">Taskboy Feed Bag</span>
   </span>
   Reviewed by <span class="reviewer">Joe Johnston</span>
   on 
   <span class="dtreviewed">
      <span class="value-title" title="2010-02-07"></span>Feb 7, 2010
   </span>
   <span class="summary">Terrific news source for free</span>
   <span class="description">The Feed Bag RSS aggregator on Taskboy 
has replaced Google News as a one-stop shop for the news of what's 
going on in tech and politics now.</span>
   <span class="rating">4.5</span>
</div>

As you can see, that's a lot of semantic markup for so little visible text. Google can pull this code apart and display it under the link for the product or service discussed.
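To give a feel for how a consumer might pull such markup apart, here is a rough sketch in PHP. It is only an illustration, not how Google or Yahoo actually do it: the extractHReviews() and classTest() helpers are hypothetical names, the sample markup is a trimmed copy of the snippet above, and it assumes PHP's DOM extension is installed.

```php
<?php
// Build an XPath test for "the class attribute contains this whole word".
function classTest($class) {
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

// Hypothetical helper: pull a few hReview fields out of an HTML string.
function extractHReviews($html) {
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $reviews = array();
    // Each element carrying the "hreview" class is the root of one review.
    foreach ($xpath->query('//*[' . classTest('hreview') . ']') as $root) {
        $review = array();
        foreach (array('fn', 'reviewer', 'summary', 'rating') as $field) {
            $nodes = $xpath->query('.//*[' . classTest($field) . ']', $root);
            $review[$field] = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
        }
        $reviews[] = $review;
    }
    return $reviews;
}

$html = <<<'HTML'
<div class="hreview">
   <span class="item"><span class="fn">Taskboy Feed Bag</span></span>
   Reviewed by <span class="reviewer">Joe Johnston</span>
   <span class="summary">Terrific news source for free</span>
   <span class="rating">4.5</span>
</div>
HTML;

print_r(extractHReviews($html));
```

The class names do all the work here: the parser never looks at the visible text layout, only at which element carries which class.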

In addition to reviews, Google and Yahoo understand at least four other species of embedded documents: people/businesses, products, events, and video. Because there are two standards bodies at work, there are microformats and RDFa specs for each. The following table summarizes this with links to give examples of these embedded documents.

To test your embedded document, try Google's microformat tool.

I have a few concerns about microformats. The first is that they require a lot of additional markup. I understand that a blogging system can create a form to collect these discrete features, but it still seems like a lot of work for casual use. The second concern I have is that microformats reuse the class attribute that is normally used for CSS. This creates a whole bunch of reserved words to avoid for class names in your site's CSS. Perhaps it's not that big a deal, but I do not like namespace conflicts. I prefer the RDFa spec, which simply introduces new attributes (typeof, property, etc.) specific to its purpose. That seems a lot cleaner to me. However, the various RDFa formats are not as well documented as their microformat counterparts. As in all things, "good enough" often trumps "clean design."

There is no doubt that embedded documents are a bit of a moving target. I don't expect the formats for things already defined to change, but more objects will be described by new specifications.

2010-02-07T08:31:24-7:00
Common problems when using SQLite and PHP http://taskboy.com/blog/static/1216.html&rss=1

I'm currently developing a social media application using PHP and sqlite. I don't know if I'll deploy with sqlite, but for development, it works well. I have two CVS sandboxes that I work in for this project. One of these is on my macbook (which comes with sqlite-enabled PHP) and the other is on an Ubuntu virtual machine. There are a few gotchas to be aware of when using sqlite in this kind of environment.

Write-protected directories

In other RDBMS products like mysql or postgres, there is a server process that is responsible for reading and writing to the database disk files. With sqlite, this isn't the case. If you're using sqlite through PHP, then the process owner running the PHP must be able to read and write to the location of the sqlite database file. This requirement gets a little more complex when you are running PHP through apache, which has its own ideas about directory security.

In most apache setups, your PHP scripts will not be able to write in the web-accessible "document root". You will not be able to keep your sqlite database file in the document root. If you try, you will find errors on INSERTs and UPDATEs about "database file is locked." However, it is foolish from a security point of view to keep your database file in a web-accessible directory anyway.

This particular problem hit me hard on Mac OS X. Once user directories are enabled in the apache configuration, your Sites directory becomes a document root. You won't be able to keep your sqlite files in that directory or any subdirectory under it. Instead, create a ~/tmp directory and keep the sqlite database file there.
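A quick way to catch this before the first INSERT fails is to check the location from the same process that will open the database. The following is a minimal sketch: sqliteWriteProblems() is a made-up helper name and the path is only an example. Both the file and its directory are checked because sqlite creates its journal file alongside the database.

```php
<?php
// Return a list of reasons why sqlite may fail to write to $dbFile.
function sqliteWriteProblems($dbFile) {
    $problems = array();
    $dir = dirname($dbFile);
    if (!is_dir($dir)) {
        $problems[] = "directory does not exist: $dir";
    } elseif (!is_writable($dir)) {
        // sqlite needs to create its journal file next to the database.
        $problems[] = "directory is not writable: $dir";
    }
    if (file_exists($dbFile) && !is_writable($dbFile)) {
        $problems[] = "database file is not writable: $dbFile";
    }
    return $problems;
}

$dbFile = sys_get_temp_dir() . '/example.sqlite';  // hypothetical location
$problems = sqliteWriteProblems($dbFile);
echo $problems ? implode("\n", $problems) . "\n" : "OK: $dbFile looks writable\n";
```

Run it through apache rather than from the shell, since the answer depends on which user owns the process.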

SQLite version skew: apache/shell

Because sqlite is an embedded system, it is compiled into the program you are using. If you are using PHP, you can run into the following issue. I use PHP from the command line all the time. When I create the database, I run a PHP script from the shell to do this. Unfortunately, the command line PHP and the version of PHP compiled into apache may not be the same. Further, the apache PHP may not be compiled with the same version of sqlite. This is the case on Mac OS X. What a mess!

To get around this, always create your sqlite databases through apache/PHP. You will run into far fewer issues this way.
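If you want to see whether you are affected, one option is to have each PHP build report what it was compiled with. This sketch assumes the PDO sqlite driver (pdo_sqlite) is available, which may not match the sqlite extension you are actually using; sqliteBuildInfo() is a hypothetical helper name.

```php
<?php
// Report which PHP build is running and which sqlite library it is linked against.
function sqliteBuildInfo() {
    $info = array(
        'sapi'   => PHP_SAPI,     // "cli" from the shell, something else under apache
        'php'    => PHP_VERSION,
        'sqlite' => null,         // stays null if pdo_sqlite is not loaded
    );
    if (extension_loaded('pdo_sqlite')) {
        $pdo = new PDO('sqlite::memory:');
        $info['sqlite'] = $pdo->query('SELECT sqlite_version()')->fetchColumn();
    }
    return $info;
}

print_r(sqliteBuildInfo());
```

Run the same script once from the shell and once through apache. If the php or sqlite values differ, you have version skew.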

Changing the schema requires an apache restart

Recall that apache is a pre-forking server. If you change the schema of your sqlite database while apache's running, you could get an error in PHP that "schema has changed." Whatever SQL statement you were attempting to run will fail.

From the "don't do that" school of medicine comes this technical advice. If you need to change the schema of an sqlite database, shut down apache first, update the database and restart apache.

I hope this post helps others avoid the mistakes I made.

2010-02-05T09:25:52-7:00
Facebook's HipHop optimizes the wrong thing http://taskboy.com/blog/static/1215.html&rss=1

Facebook will shortly release a tool called HipHop for enhancing the performance of PHP. My understanding of the tool is that it compiles PHP code into C++ which is then compiled into a system native executable. While I have no doubt that this tool does produce significant speed gains over apache/PHP, I do think one needs to be aware of the trade-offs of this kind of system. After all, this isn't the first time a trick like this has been used for a dynamic language.

C++ and PHP are very different languages. I'm not talking about syntax, but how source code is handled. In PHP, the source code is turned into op codes that the PHP interpreter understands. The interpreter knows how to make the operating system perform these op codes. In C++, source code is compiled into assembler which is then linked into a system executable which can be run from the shell. Compiled code runs faster than interpreted code for a number of reasons, but the most important is that compiled code is closest to native assembler, which essentially is the op code system that the host CPU uses to make stuff happen.

The problem with compiling PHP into C++ is that you lose all the wonderful dynamic features of PHP since these cannot be easily or efficiently translated automatically into C++ source code. The very dynamic nature of PHP (or Perl or Ruby or Python, etc) is what makes these languages accelerate programmer productivity. I think facebook will see this performance hit later.
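To make "dynamic features" concrete, here is a small sketch of the kind of code I have in mind. All of the function names are invented for illustration, and I make no claim about which of these constructs any particular compiler supports. The point is only that the function being called, the variable being set and the code being run are all decided by data at run time.

```php
<?php
function greet($name) { return "Hello, $name"; }
function shout($name) { return strtoupper($name) . '!'; }

// Variable function: the callee is chosen by a string known only at run time.
function dispatch($functionName, $arg) {
    return $functionName($arg);
}

// Variable variable: the name of the variable being assigned is itself data.
function setByName($name, $value) {
    $$name = $value;
    return get_defined_vars();
}

// eval(): source code is assembled and executed at run time.
function compute($expression) {
    return eval("return $expression;");
}

echo dispatch('greet', 'world'), "\n";
echo dispatch('shout', 'world'), "\n";
print_r(setByName('counter', 42));
echo compute('6 * 7'), "\n";
```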

Let's not forget that Moore's law of CPU power often solves a great deal of performance issues. Hardware is always cheaper than developer time and less prone to bugs.

I favor architectures that take advantage of Moore's law and use horizontal scaling and commodity solutions over fancier tricks that require specialized talent (like erlang). I might suggest caching the opcodes that the PHP interpreter generates and simply running those. This is the essence of the Zend server and how apache/mod_perl/Apache::Registry work. Sure, you don't get quite the performance of compiled code, but you'll still see a noticeable boost. I believe PHP does some level of this kind of caching right now.

It's true that one can do amazing feats by being clever, but clever doesn't scale (unless you're Google).

2010-02-04T08:13:22-7:00
On PHP frameworks http://taskboy.com/blog/static/1214.html&rss=1

An interesting chart comparing various PHP frameworks. I'm not sure that I can read it correctly. It seems to imply that Zend and CakePHP are the most popular frameworks.

Both frameworks are free, but Zend is clearly optimized for the Zend server platform, which isn't free. Also, I can't help thinking that the audience is somewhat different for these two. CakePHP seems aimed at the more opensource, DIY crowd while Zend is clearly pointed to the enterprise IT crowd. While there is overlap, you can see that Zend is a commercial venture.

I have very mixed emotions about using frameworks. On the one hand, frameworks deliver huge dollops of functionality right out of the box. This accelerates the completion of many IT projects. On the other, you get locked into another group's development schedule and, to some extent, the architectural choices they make. Projects built with these tools also expose themselves to bugs and security holes originating in the frameworks. Finally, you end up having to trust or vet the code in the framework.

For an inward-facing intranet product, I think frameworks are great. I'm not sure I'd want to launch something like twitter or facebook with one.

2010-02-03T08:46:03-7:00
The Observer pattern and Action Queues http://taskboy.com/blog/static/1213.html&rss=1

There is a well-known design pattern called publish-subscribe or the Observer model. The problem this model attempts to solve is one in which an object requires one or more parties to act on it when it changes to a particular state. A concrete example of this is event handlers in GUIs, including the DOM: actions may be associated with button presses.

The Observer model separates the subject from the parties (observers) that act on it. Observers register their interest with the subject. When the subject changes to the desired state, it notifies each of the registered observers.

In PHP, a subject class might be modeled like this:

class Subject {
  private $Q = array();
  function Attach($O) {
     array_push($this->Q, $O);
  }

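  function Detach($O) { 
    // Use $this->Q here; a bare $Q would be an undefined local variable.
    for($i=0; $i < count($this->Q); $i++) {
      // === matches this exact observer, not merely one with equal properties
      if ($this->Q[$i] === $O) {
        array_splice($this->Q, $i, 1);
        break;
      }      
    }
  } 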
  
  function Notify() {
    foreach($this->Q as $O) {
      $O->Update($this);
    }
  }  
}

The Attach() and Detach() methods are the API by which Observers register or unregister their interest in a Subject object. When the subject changes into an interesting state, its Notify() method is called. This method in turn calls the Observer Update() method with a reference to the current Subject object. An Observer class might look like this:

class Observer {
  function Update($S) {
     // Do something interesting
  }	   
}

As you can see, an Observer need only implement one well-known method, Update(). This arrangement nicely decouples the Subject from the Observers.
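PHP's standard library (SPL) also ships SplSubject and SplObserver interfaces with this same attach/detach/notify/update shape. Here is a minimal sketch built on them. The Thermostat and LogObserver classes are invented for illustration, and storing observers keyed by spl_object_hash() is just one way to make detaching easy.

```php
<?php
// Hypothetical subject: notifies observers whenever the temperature changes.
class Thermostat implements SplSubject {
    private $observers = array();
    public $temperature = 20;

    public function attach(SplObserver $observer): void {
        $this->observers[spl_object_hash($observer)] = $observer;
    }

    public function detach(SplObserver $observer): void {
        unset($this->observers[spl_object_hash($observer)]);
    }

    public function notify(): void {
        foreach ($this->observers as $observer) {
            $observer->update($this);
        }
    }

    public function setTemperature($t) {
        $this->temperature = $t;
        $this->notify();   // the "interesting state" is any change
    }
}

// Hypothetical observer: records every temperature it is told about.
class LogObserver implements SplObserver {
    public $seen = array();

    public function update(SplSubject $subject): void {
        $this->seen[] = $subject->temperature;
    }
}

$thermostat = new Thermostat();
$a = new LogObserver();
$b = new LogObserver();
$thermostat->attach($a);
$thermostat->attach($b);

$thermostat->setTemperature(22);  // both observers are notified
$thermostat->detach($b);
$thermostat->setTemperature(25);  // only $a is still listening

print_r($a->seen);
print_r($b->seen);
```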

There may be times when a less formal, more functional mechanism is desirable. What if you just want certain actions to happen on an object when an interesting state obtains? You might use what I call an Action queue to do this. An Action Queue is simply an array of function references that are called by an object at an interesting time. Here's what an Action queue might look like:

class MyClass {
   private $Q = array();

   function Attach($name, $func) {
      $this->Q[$name] = $func; 
   }

   function Detach($name) {
      unset($this->Q[$name]);
   }
  
   function Notify() { 
     foreach ($this->Q as $n=>$func) {
        $func($this);
     }
   }
} 

As you can see, there is no need for an Observer class. Bits of functionality created with create_function() can be attached ad hoc to this class, as the following snippet shows:

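  // get_class() is used because a MyClass object cannot be concatenated into a string directly
  $appender = create_function('$obj', 'return "Got => " . get_class($obj);');
  $MyClassObj->Attach("append", $appender);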

Because these code bits are anonymous, an arbitrary name is required during the Attach phase in the event that you might want to remove the behavior later.
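Note that create_function() was deprecated in PHP 7.2 and removed in PHP 8.0. Anonymous functions (closures), available since PHP 5.3, do the same job. Here is a sketch of the Action queue using them; the ActionQueue class name and its public $log property are made up for this example so that the attached actions have something visible to do.

```php
<?php
// The same Action queue idea, restated so this snippet runs on its own.
class ActionQueue {
    private $Q = array();
    public $log = array();   // hypothetical: somewhere for actions to write

    function Attach($name, $func) {
        $this->Q[$name] = $func;
    }

    function Detach($name) {
        unset($this->Q[$name]);
    }

    function Notify() {
        foreach ($this->Q as $name => $func) {
            $func($this);
        }
    }
}

$queue = new ActionQueue();

// Closures replace the strings of code that create_function() needed.
$queue->Attach('append', function ($obj) {
    $obj->log[] = 'Got => ' . get_class($obj);
});
$queue->Attach('count', function ($obj) {
    $obj->log[] = 'entries so far: ' . count($obj->log);
});

$queue->Notify();          // runs both actions, in the order attached
$queue->Detach('count');   // the name is what makes removal possible
$queue->Notify();          // runs only 'append'

print_r($queue->log);
```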

2010-02-02T09:38:48-7:00
Emacs search and replace of unprintable characters http://taskboy.com/blog/static/1212.html&rss=1

A common problem facing emacs users is to replace a certain sequence of characters in a buffer with either a newline or tab or something else equally awkward. The solution is relatively simple.

Invoke search and replace as normal (M-%), enter the text to be replaced, then enter the replacing character using the following chart:

  • Newline(\n): C-qC-j
  • Tab(\t): C-qC-i
  • Carriage return(\r): C-qC-m

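See the pattern? C-q (quoted-insert) tells emacs to insert the next keystroke literally, and C-[char] produces the control character whose ASCII code is that letter's position in the alphabet: j is the 10th letter and newline is ASCII 10, i is the 9th and tab is ASCII 9, m is the 13th and carriage return is ASCII 13. OK, perhaps that's a little tangled. But now you know.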
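If you would rather compute the mapping than memorize it, the arithmetic is short. This is just a sketch, and controlChar() is a made-up helper: keeping only the low five bits of an uppercase letter's ASCII code gives its position in the alphabet, which is the code of the matching control character.

```php
<?php
// Map a Control-key letter (e.g. 'j' for C-j) to the character it produces.
function controlChar($letter) {
    // Keeping only the low five bits maps 'A'..'Z' (65..90) to 1..26.
    return chr(ord(strtoupper($letter)) & 0x1F);
}

foreach (array('j', 'i', 'm') as $letter) {
    printf("C-%s => ASCII %d\n", $letter, ord(controlChar($letter)));
}
```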

2010-01-31T09:09:13-7:00