05-16-2004, 12:08 AM   #1
JWSmythe
Green Mole
 
Join Date: May 2004
Posts: 4
Automatic Webpage Thumbnails In Search Result Page

I posted this to the wrong forum first. Sorry, I'm not very familiar with the forum's layout here (yet). I apologize in advance for any lines that may have wrapped incorrectly; I spotted a few and corrected them below, but there could be more.


-----

web page thumbnails in result page

I made a little addition to one of our sites using PhpDig, which others may find interesting. It's a combination of PHP, Perl, and a C application, so any PHP purists will have to forgive me. You could probably skip the Perl part, but the C application is the only tool I found that does what I wanted.

Warning, this *IS* a quick hack. I did it in just a few hours, with plenty of beer powering the work. There may be bugs, there may be exploitable code. It may not even work on your system. On the other hand, you may love it, and your users may be amazed that you have thumbnails of all the pages shown.

Now for the executive summary.

I changed one of the default templates quite a bit, adding all the features I wanted from other templates, plus one little piece that displays a thumbnail of each resulting web page, so users will have at least an idea of what page they've hit. I think it looks very good.

The thumbnail generation must be done from a *nix machine with X running, and KDE installed. I use Gnome as my desktop, but have KDE installed, and that works fine.

Now for the gory details.

To do exactly what I've done, you need PHP, Perl, a C compiler (like GCC), and khtml2png (available at http://www.babysimon.co.uk/khtml2png/index.html ). Despite the name, khtml2png can produce formats other than PNG. I'm using JPEGs because they're usually smaller, and some versions of some stupid browsers (MSIE) don't always handle PNGs correctly. khtml2png can also include captures of Flash pages in the thumbnail. Read the documentation.

Perl needs DBI.pm and MIME::Base64 installed.

The thumbnails are created after the first person sees a result in the search engine referencing a particular page. That means at least one person will see just a white spot where the thumbnail should be, but within the next minute or so a thumbnail will be generated, and it will be there from then on.

I added an extra table to my search database. This is the create statement:

---
CREATE TABLE `request_thumb` (
`url` varchar(255) NOT NULL default '',
PRIMARY KEY (`url`)
) TYPE=MyISAM;
---

There's no real need for anything more than that. The URL is the primary key, so you don't end up with duplicate requests (a second INSERT for the same URL simply fails on the key, which is harmless here). Once the image is made and put up on the web server, we remove the record from the database, and it shouldn't be recreated until the thumbnail goes stale.

I'm going on the assumption that no URL will be more than 255 characters. Longer URLs *COULD* happen, and they *COULD* create constant requests to recreate the images, so if you use this on a production site with URLs longer than 255 characters, you'll want to make changes. site_url in the sites table is varchar(127), so this shouldn't be a problem, and you could probably shrink my url column from varchar(255) to varchar(127). I usually treat 255 as the practical maximum for a URL. The HTTP spec doesn't actually set a hard limit, but I have plenty of other things to do than read RFCs all day.

In my phpdig.html template, in the phpdig:results block, I laid out the entire record in its own table (your site, your option), and then put this image tag in:

<IMG SRC="showthumbs.php?thumbrequest=<phpdig:complete_path/>" WIDTH="240">

So, every time the page is loaded, it'll run showthumbs.php.

This is showthumbs.php:

-----
<?php
// See if there's a thumbnail.
// If so, send it.
// If it's missing, or not recent, request an update in the database.

$basepath = "/host/users/search.proadult.com/htdocs/thumbnails";
$baseurl = "http://search.proadult.com/thumbnails";
$max_cache_live = 604800; // seconds; 604800 = 7 days

// Encode the raw URL; escape it only when it goes into SQL, so the
// filename here matches the one the Perl script generates.
$plain_url = $_GET['thumbrequest'];
$encoded_name = base64_encode($plain_url);
// base64_encode() doesn't emit newlines, but strip any just in case.
$encoded_name = str_replace("\n", "", $encoded_name);
$encoded_name = "$encoded_name.jpg";

$requestupdate = 0;

if (file_exists("$basepath/$encoded_name")) {
    $curtime = time();
    $ctime = filectime("$basepath/$encoded_name");

    $result_url = "$baseurl/$encoded_name";
    // Age of the file = now minus its change time.
    if (($curtime - $ctime) > $max_cache_live) {
        $requestupdate = 1;
    }
} else {
    $requestupdate = 1;
    trigger_error("Thumbnail: Couldn't find $encoded_name for $plain_url");
    $result_url = "http://search.proadult.com/whitedot.png";
}

if ($requestupdate == 1) {
    // I'd like an updated thumbnail please.
    $link = mysql_connect('server', 'user', 'pass');
    mysql_select_db('database');
    mysql_query("INSERT INTO request_thumb (url) VALUES ('" . addslashes($plain_url) . "')");
}

header("Location: $result_url");
exit;
-----

Basically, if the thumbnail isn't there, it stretches a white dot out to the size of the image, to keep from showing a broken image. If the thumbnail does exist, it shows it to the user. If the thumbnail exists but is more than $max_cache_live seconds old, it still shows it to the user, but also sends a request to rebuild the thumbnail.
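The whitedot.png placeholder is just a tiny all-white image. If you don't have one handy, ImageMagick (assuming the `convert` utility is installed) can make it, and the browser will stretch it to whatever WIDTH you set on the IMG tag:

```shell
# Create a 1x1 white PNG to serve as the placeholder image.
convert -size 1x1 xc:white whitedot.png
```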

The logic behind doing this is, I know perfectly well that not *EVERY* page that's spidered will necessarily ever be viewed. With a theoretical minimum of 50K*3 pages for our site, it was unrealistic to think we should make *ALL* those thumbnails in advance and do the housekeeping to keep them up to date. There have been a few people searching, with plenty of searches by myself, and I've only generated 267 thumbnails, which (kinda) proves my point. Of course, what people search for on your site, and how many result pages they look at, will change your usage.

At worst, the thumbnails for pages that show up in popular searches will get regenerated once per week. The pages that rank poorly will never get hit, so who cares.

The thumbnails are stored under a base64-encoded version of the URL, so as not to give anyone the ability to send funky requests and potentially do bad things on my perfectly happy filesystem. Be warned, you'll have plenty of filenames that are rather long. The PHP page on base64_encode() says base64-encoded data is about 33% longer than the original, and the output is padded to a multiple of 4 characters, so a 20-character URL becomes a 28-character name, plus the .jpg extension. If your filesystem is going to have a problem with this, pick a different way to encode your thumbnail names. I considered just using MD5 sums instead, and even though everything I know about MD5 says a collision can't happen, I'd worry that the wrong thumbnail would show up through some twist of fate. If you're a web host, that means a church's web site would end up with a hard-core porn picture, which I know would happen to me somehow. Well, it would if we hosted church web sites.
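You can see the filename scheme, and the length overhead, straight from the shell; `base64` here stands in for PHP's base64_encode(), and the example URL is made up. One thing to watch: standard base64 output can contain '/' and '+' characters, and a '/' in a filename will be taken as a directory separator, so keep an eye out for URLs that encode that way.

```shell
# Encode a sample URL the same way showthumbs.php names its thumbnails.
url='http://example.com/page'
name="$(printf '%s' "$url" | base64 | tr -d '\n').jpg"
echo "$name"
# 23 input bytes -> ceil(23/3)*4 = 32 base64 characters, then ".jpg":
printf '%s' "$url" | base64 | tr -d '\n' | wc -c
```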

On the machine that has X running (another server, in this case), I have this Perl script running. It could have been done in PHP, but I don't have PHP installed on that workstation. It's in my cron to run once per minute.
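For reference, the crontab entry might look something like this (the script path and log location are examples, and DISPLAY must point at your running X session):

```shell
# m h dom mon dow  command
* * * * * DISPLAY=:0 /home/you/bin/build.searchimages.pl >> /var/log/searchimages.log 2>&1
```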

--- begin build.searchimages.pl
#!/usr/bin/perl

use DBI;
use MIME::Base64;

$basepath = "/host/users/searchimages/thumbnails";

$db = DBI->connect("DBI:mysql:database:server", 'user', 'pass') || die "$!";

$source_query = "SELECT url FROM request_thumb";

$source = $db->prepare($source_query) || die "$!, error on source prepare\n";
$source->execute || print "Error on source execute\n";

while (@curarray = $source->fetchrow_array){
    $req_url = $curarray[0];
    $outfile = encode_base64($req_url);
    $outfile =~ s/\n//g;   # encode_base64() appends a trailing newline
    $outfile = "$outfile.jpg";
    print "Creating $req_url -> $outfile\n";
    # List-form system() so the URL never touches a shell.
    system("/opt/kde/bin/khtml2png", "-display", ":0",
           "--width=800", "--height=600",
           "--scaled-width", "240", "--scaled-height", "180",
           $req_url, "$basepath/$outfile");
    print "SCP: $basepath/$outfile\n";
    system("/usr/local/bin/scp", "$basepath/$outfile",
           "search_server:/your_path/thumbnails");
    $db->do("DELETE FROM request_thumb WHERE url = " . $db->quote($req_url));
}
--- end build.searchimages.pl


khtml2png is *NOT* fast enough to run directly from the page, by any stretch of the imagination. It has to download the HTML and related images, parse the HTML, and generate the thumbnail. It takes a good 20 seconds or so per thumbnail.

I use scp to move the images over to the web server; you could change that to anything you want. I like scp. It does require that you have SSH keys set up for automagic login.
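Setting up the key is a one-time job. These commands are just a sketch (the key path is an example, and `search_server` is whatever your web server's hostname is):

```shell
# Generate a passwordless key for the cron job to use.
mkdir -p "$HOME/.ssh"
ssh-keygen -t rsa -N '' -f "$HOME/.ssh/id_rsa_thumbs"
# Then install the public key on the web server (hostname is an example)
# and point scp at the key with -i:
# ssh-copy-id -i "$HOME/.ssh/id_rsa_thumbs.pub" you@search_server
# scp -i "$HOME/.ssh/id_rsa_thumbs" file you@search_server:/your_path/thumbnails
```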

If anyone gets use from this, cool. If it gets included with a future PhpDig, please give me credit somewhere.

This could probably use a cron to blow away thumbnails that are more than x days (or weeks) old, but that's a housekeeping thing that I haven't really put too much thought into yet. Small sites may never need it. Larger sites may need to do it more frequently, or even implement a more extensive directory structure, so you don't end up with too many thumbnails in one directory. But hey, that's the magic of open source, right? Tune it the way *YOU* want/need it.
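A housekeeping pass could be as simple as a daily find in cron (the path and the 30-day cutoff are examples; tune both to taste):

```shell
# Daily at 4am: delete thumbnails untouched for 30 days (path is an example).
# They'll be rebuilt on demand the next time a search result needs them.
0 4 * * * find /your_path/thumbnails -name '*.jpg' -mtime +30 -exec rm -f {} \;
```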