Markov Blogger code

Posted:

Today’s experiement with Markov chains has been really entertaining. I have added a few heuristics to the basic two word markov chain described in The Practice of Programming to handle the vageries of blog entries a bit better. Also, I made walking the chain more random, so that the generated entries are different every time. The markov program presented in the K&P book always begins with the first word found in the data set.
This makes for boring blog entries.

The first program presented snarfs blog entries of particular userID from use.perl using the SOAP interface. It creates simple data files that can be easily consumed by the markov chain program.

get_blog

!/usr/bin/perl —

#

Get blog entries from use.perl.org of given UID

#

use strict; use SOAP::Lite; use Getopt::Std; use HTTP::Cookies; use Digest::MD5 ‘md5_hex’; use Data::Dumper; use File::Basename;

use constant URI => “http://use.perl.org/Slash/Journal/SOAP”; use constant PROXY => “http://use.perl.org/journal.pl”; use constant UID => 777; use constant PW => “s3cr3t”; use constant DOWNLOAD_DIR => ‘./entries’;

my $opts = {}; getopts(‘?hP:U:d:l:u:v’, $opts);

if ($opts->{‘?’} || $opts->{‘h’}) { print usage(); exit; }

unless ($opts->{‘u’}) { die “ERROR Missing -u \n”, usage(), “\n”; }

Everything is ready for the SOAP call now

my $cookie_jar = HTTP::Cookies->new; $cookie_jar->set_cookie(0, user => bakeUserCookie(($opts->{‘U’} || UID), ($opts->{‘P’} || PW), ), “/”, “use.perl.org”);

my $c = SOAP::Lite->uri(URI)-> proxy(PROXY, cookie_jar => $cookie_jar);

print “listing entries for UID $opts->{u}\n” if $opts->{v}; my $ret = $c->get_entries($opts->{‘u’}, $opts->{‘l’}); exit if had_transport_error($ret); my $rows = $ret->result;

my $tdir = $opts->{d} || DOWNLOAD_DIR; unless (-d $tdir) { print “making download directory ($tdir)\n” if $opts->{v}; unless (mkdir($tdir, 0700)) { die “Can’t make target director ‘$tdir’: $!\n”; } }

cut’n’paste code? never

$tdir .= “/$opts->{u}”; unless (-d $tdir) { print “making download directory ($tdir)\n” if $opts->{v}; unless (mkdir($tdir, 0700)) { die “Can’t make target director ‘$tdir’: $!\n”; } }

print “fetching ” . @{$rows} . ” entries\n” if $opts->{v}; for my $r (@{$rows}) { if (-e “$tdir/$r->{id}”) { print “Already fetched $r->{id}\n” if $opts->{v}; next; }

print “fetching entry: $r->{id}\n” if $opts->{v}; my $ret = $c->get_entry($r->{id}); if (had_transport_error($ret)) { warn (“**couldn’t fetch entry ‘$r->{id}’\n”) if $opts->{v}; next; }

my $rec = $ret->result; if (open my $out, “>$tdir/$r->{id}”) { print $out “subject: $rec->{subject}\n”; print $out “$rec->{body}\n”; close $out; } else { warn(“**couldn’t create file ‘$tdir/$r->{id}’. Skipping.\n”) if $opts->{v}; next; } }

print “done\n”;

—————————

subs

—————————

sub usage { my $base = basename($0); return <

USAGE:

$base -u 22 # fetch all entries for UID 22 $base -d ./raw -u 22 -l 5 # fetch last 5 entries, put them in 'raw'

OPTIONS:

? this screen h this screen P use.perl password (for AUTH) U use.perl UID (for AUTH) d target directory to hold fetched entries l limit (-1 for all entries) u target UID v verbose messages EOT }

sub had_transport_error { my ($ret) = @_;

if ($ret->fault) { warn “Oops: “, $ret->faultstring, “\n”; return 1; }

return; }

Thanks Pudge

sub bakeUserCookie { my($uid, $passwd) = @; my $cookie = $uid . ‘::’ . md5hex($passwd); $cookie =~ s/(.)/sprintf(“%%%02x”, ord($1))/ge; $cookie =~ s/%/%25/g; return $cookie; }

The really interesting work is done by the next program, which consumes the data and generates a output suitable for posting back to use.perl via the perl scripts I described in my article. Notice that I keep hyperlinks together. Also blockquotes. Some may call this cheating, but I think the effect is more pleasant. I experimented with three work chains, but this wasn’t random enough for me. Perhaps I need a larger data set. A future version of this program could use both chains and somehow blend them together for the output. It would be great to see popular three word combinations appear in the generated text, but perhaps I already get that with two word chains.
Also note that this script marks my first use of the qr operator. I guess compiled regexes aren’t so scary after all.

markov.pl

!/usr/bin/perl —

ripped off from Practice of Programming, Kernighan & Pike, p. 80

tweaked about for my nefarious purposes.

use strict; use Text::Wrap;

use constant CHAIN_SIZE => 3; # or 3 use constant NONWORD => “\n”; use constant DEBUG => $ENV{MARKOV_DEBUG} || 0;

my ($subject, $body) = ({}, {}); my ($s_in, $b_in) = ({},{});

expand directories as needed

my @tmp; for my $file (@ARGV) { if (-d $file) { push @tmp, glob(“$file/*”); warn (“expanding directory ‘$file’\n”) if DEBUG; } else { push @tmp,$file; } } @ARGV=();

warn(“Randomizing the order of the input files\n”) if DEBUG;

mix up the order of the files a bit

do { push(@ARGV, (splice @tmp, rand(@tmp), 1)); # warn(“\@ARGV size: ” . @ARGV . ” \@tmp size: ” . @tmp . “\n”) if DEBUG; } while (@tmp);

warn(“Reading input files\n”) if DEBUG; my (@last_subject_keys, @last_body_keys); for my $file (@ARGV) {

while (<>) { if (/^subject:/) { s/^subject://; fill_table(table => $subject, state => $s_in, line => $, ‘keys’ => \@lastsubject_keys ); } else { fill_table(table => $body, state => $s_in, line => $, ‘keys’ => \@lastbody_keys, ); } } } end_table($subject, \@last_subject_keys); end_table($body, \@last_body_keys);

warn(“body table: ” . (keys %{$body}) . “\n”) if DEBUG;

generate subject

my $subj; do { my $max = rand(12) + 1; $subj = make_chain($subject, $max) } while (length($subj) > 64);

print “subject: [MarkovBlogger] $subj\n”;

my $max = rand(250) + 120; print “body: “, make_chain($body, $max), “\n”;

————————————————

subs

————————————————

sub fill_table { my (%args) = @_; my ($tbl, $in, $line, $keys) = @args{qw(table state line keys)};

my ($w1,$w2,$w3) = @{$keys}; unless (defined $w1) { $w1 = $w2 = $w3 = NONWORD; }

my $ecode = ‘ecode’; my %delims = ( href => [ qr!!i ], ecode => [ qr!<$ecode>!i, qr!!i ], ul => [ qr!

    !i, qr!
!i ], blockquote => [ qr!
!i, qr!
!i ], );

WORD: for my $word (split /\s+/, $line) {

# am I in a special block?
# can't start a new special until I find the end of previous one
for my $el (qw(href ecode ul blockquote)) {
  if ($in->{$el}) {
if ($word =~ /$delims{$el}[0]/) { # end?
  $word = "$in->{$el} $word";
  $in->{$el} = "";
} else {
  $in->{$el} .= " $word";
  next WORD;
}
  } elsif ($word =~ /$delims{$el}[0]/) { # start?
$in->{$el} = $word;
next WORD;
  } 
}

if (CHAIN_SIZE > 2) {
  push @{$tbl->{$w1}->{$w2}->{$w3}}, $word;
  ($w1, $w2, $w3) = ($w2, $w3, $word); # pull the chain along
} else {
  push @{$tbl->{$w1}->{$w2}}, $word;
  ($w1, $w2) = ($w2, $word); # pull the chain along
}

}

# assign these keys back into the passed in ref return @{$keys} = ($w1, $w2, $w3); }

sub end_table { my ($tbl) = shift(@); my ($w1, $w2, $w3) = @{shift(@)};

if (CHAIN_SIZE > 2) { push @{$tbl->{$w1}->{$w2}->{$w3}}, NONWORD; } else { push @{$tbl->{$w1}->{$w2}}, NONWORD; }

}

sub make_chain { my ($tbl, $size) = @_;

my $text = “”;

# let’s start in a rand point on the chain my @w1 = ((keys %{$tbl}), NONWORD); my $w1 = $w1[rand(@w1)];

my @w2 = ((keys %{$tbl->{$w1}}), NONWORD); my $w2 = $w2[rand(@w2)];

my @w3 = (NONWORD); if (CHAIN_SIZE > 2) { @w3 = ((keys %{$tbl->{$w1}->{$w2}}), NONWORD); } my $w3 = $w3[rand(@w3)];

for my $i (0..$size) { my $suf; if (CHAIN_SIZE > 2) { $suf = $tbl->{$w1}->{$w2}->{$w3}; } else { $suf = $tbl->{$w1}->{$w2}; }

warn("word1: '$w1'\n\tword2: '$w2'\n") if DEBUG;

unless (ref $suf) {
  $w1 = $w1[rand(@w1)];
  $w2 = $w2[rand(@w2)];
  $w3 = $w3[rand(@w3)];
  redo;
}

my $r = int(rand @{$suf});
my $t = $suf->[$r];

if ($t eq NONWORD) {
  warn ("detected the end of the chain (" 
         . (keys %{$tbl})
     .  ") at $i.  reseting keys\n") if DEBUG;
  $w1 = $w1[rand(@w1)];
  $w2 = $w2[rand(@w2)];
  $w3 = $w3[rand(@w3)];
  next;
} 

# there are "unbalanced" braces (close nuff for me)
if ($t !~ m!\([^\)]*\)!) {  
  $t  =~ s!^\(!!;
  $t =~ s!\)$!!;
  $text .= "$t ";
}

if (CHAIN_SIZE > 2) {
  ($w1, $w2, $w3) = ($w2, $w3, $t);
} else {
  ($w1, $w2) = ($w2, $t);
}

}

# do some goofy clean up $text = ucfirst $text; chop $text; # final space

# remove stray punctuation $text =~ s/[,:]$//; if (substr($text, -1, 1) ne ‘.’) { $text .= “.”; }

$Text::Wrap::columns = 60; return wrap(“”, “”, $text); }

Let’s see how annoying this gets. I have a feeling after a week, I’ll make this quietly go away. Or expand it into a something truly monstrous. I made several tweaks to the code just trying to post this blog entry.

UPDATE: Thanks to the wonders of Soviet-style revisionism, I have updated this code a bit to remove the uses variables $w1,$w2,$w3 from the main line. Also, I randomize the input file order to make the output less likely to come from the some person’s blog. Perhaps I’ll bundle this up for CPAN or taskboy.com or something.

[Original use.perl.org post and comments.]