Automatic Webpage Thumbnails In Search Result Page


JWSmythe
05-16-2004, 01:08 AM
I posted this to the wrong forum first. Sorry, I'm not very familiar with the forum layout here (yet). I apologize in advance for any lines that may have wrapped incorrectly. I spotted a few and corrected them below, but there could be more.


-----

web page thumbnails in result page

I made a little addition to one of our sites using PhpDig, which others may find interesting. It's a combination of PHP, Perl, and a C application, so any PHP purists, you'll have to forgive me. You could probably skip the Perl part, but the C application is the only one I found that does what I wanted.

Warning, this *IS* a quick hack. I did it in just a few hours, with plenty of beer powering the work. There may be bugs, there may be exploitable code. It may not even work on your system. On the other hand, you may love it, and your users may be amazed that you have thumbnails of all the pages shown.

Now for the executive summary.

I changed one of the default templates quite a bit, adding in all the features I wanted from other templates, plus one little piece that displays a thumbnail of each resulting web page, so the users will have at least an idea of what page they've hit. I think it looks very good.

The thumbnail generation must be done from a *nix machine with X running, and KDE installed. I use Gnome as my desktop, but have KDE installed, and that works fine.

Now for the gory details.

To do exactly what I've done, you need PHP, Perl, a C compiler (like GCC), and khtml2png (available at http://www.babysimon.co.uk/khtml2png/index.html ). Despite the name, khtml2png can produce formats other than PNG. I'm using JPEGs, because they're usually smaller, and some versions of some stupid browsers (MSIE) don't always handle PNGs correctly. khtml2png can also include captures of Flash content in the thumbnail. Read the documentation.

Perl needs DBI.pm and MIME::Base64 installed.

The thumbnails are created after the first person sees a result in the search engine referencing a particular page. This means, at least one person will see just a white spot where there should be a thumbnail, but sometime in the next minute or so after that happens, there will be a thumbnail there from then on.

I added an extra table to my search database. This is the CREATE statement:

---
CREATE TABLE `request_thumb` (
`url` varchar(255) NOT NULL default '',
PRIMARY KEY (`url`)
) ENGINE=MyISAM;
---

There's no real need for anything more than that. We use the url as the primary key, so you don't end up with duplicated requests. Once the image is made and put up on the web server, we remove the record from the database, and it shouldn't be recreated until the thumbnail goes stale.

I'm going on the assumption that no URL will be more than 255 characters. It *COULD* happen, and this *COULD* create constant requests to recreate the images, so if you use this on a production site with URLs longer than 255 characters, you'll want to make changes. site_url in table sites is varchar(127), so this shouldn't be a problem, and you could probably change my url varchar(255) to url varchar(127). I usually treat 255 as the practical maximum for URLs. The HTTP/1.1 RFC (RFC 2616) doesn't set a hard limit, but it does caution against depending on URIs longer than 255 bytes because some older clients and proxies may not handle them.
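
For anyone who wants to see the queue behaviour in isolation, here is a minimal sketch in Python rather than the PHP/Perl used in this post. It assumes an in-memory SQLite database as a stand-in for the MySQL search database, and the function names (open_queue, request_thumbnail, finish_thumbnail, pending) are made up for illustration. It only demonstrates the idea described above: the url primary key collapses repeated requests, and the row is removed once the thumbnail is built.

```python
import sqlite3


def open_queue():
    # Assumption: in-memory SQLite stands in for the MySQL search database.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE request_thumb (url VARCHAR(255) NOT NULL PRIMARY KEY)"
    )
    return db


def request_thumbnail(db, url):
    # INSERT OR IGNORE is SQLite's spelling; MySQL's equivalent is INSERT IGNORE.
    # A second request for the same url is silently dropped by the primary key.
    db.execute("INSERT OR IGNORE INTO request_thumb (url) VALUES (?)", (url,))
    db.commit()


def finish_thumbnail(db, url):
    # Called once the image has been generated and copied to the web server.
    db.execute("DELETE FROM request_thumb WHERE url = ?", (url,))
    db.commit()


def pending(db):
    # List the URLs that are still waiting for a thumbnail.
    rows = db.execute("SELECT url FROM request_thumb ORDER BY url")
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = open_queue()
    request_thumbnail(db, "http://example.com/")
    request_thumbnail(db, "http://example.com/")  # duplicate, ignored
    print(pending(db))
    finish_thumbnail(db, "http://example.com/")
    print(pending(db))
```

The placeholder parameters (?) also sidestep the quoting problems that come with pasting a URL straight into the SQL string.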

In my phpdig.html template, in the phpdig:results block, I laid out the entire record in its own table (your site, your option), and then put this image in:

<IMG SRC="showthumbs.php?thumbrequest=<phpdig:complete_path/>" WIDTH="240">

So, every time the page is loaded, it'll run showthumbs.php.

This is showthumbs.php:

-----
<?php
# See if there's a thumbnail.
# If so, send it.
# If it's not recent, or it's missing, request an update in the database.

$basepath = "/host/users/search.proadult.com/htdocs/thumbnails";
$baseurl = "http://search.proadult.com/thumbnails";
$max_cache_live = 604800; // 604800 seconds = 7 days
$requestupdate = 0;

# Escaped here because it goes into the SQL INSERT further down.
$plain_url = addslashes($_GET['thumbrequest']);
$encoded_name = base64_encode($plain_url);
# Strip any newlines from the encoded name. strtr() with an empty
# replacement string leaves the input unchanged, so use str_replace().
$encoded_name = str_replace("\n", "", $encoded_name);
$encoded_name = "$encoded_name.jpg";

if (file_exists("$basepath/$encoded_name")) {
    $curtime = time();
    $ctime = filectime("$basepath/$encoded_name");

    $result_url = "$baseurl/$encoded_name";
    # The age of the file is "now minus file time", not the other way around.
    if (($curtime - $ctime) > $max_cache_live) {
        $requestupdate = 1;
    }
} else {
    $requestupdate = 1;
    trigger_error("Thumbnail: Couldn't find $encoded_name for $plain_url");
    $result_url = "http://search.proadult.com/whitedot.png";
}

if ($requestupdate == 1) {
    # I'd like an updated thumbnail please.
    # INSERT IGNORE: url is the primary key, so a repeated request is a no-op.
    $link = mysql_connect('server', 'user', 'pass');
    mysql_select_db('database');
    mysql_query("INSERT IGNORE INTO request_thumb (url) VALUES ('$plain_url')");
}

header("Location: $result_url");
-----

Basically, if the thumbnail isn't there, it'll stretch a white dot out to the size of the image, to keep from showing a broken image. If the thumbnail does exist, it shows it to the user. If the thumbnail does exist, but is more than $max_cache_live seconds old, it still shows it to the user, but also sends a request to rebuild the thumbnail.
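
The three cases above boil down to a small decision function. This is only a sketch in Python (not the PHP used on the site), and the name thumbnail_action and its return values are invented for illustration. It takes the file's timestamp (or None when there is no file) and returns what to show and whether to queue a rebuild.

```python
MAX_CACHE_LIVE = 604800  # 7 days in seconds, same value as in showthumbs.php


def thumbnail_action(file_time, now, max_age=MAX_CACHE_LIVE):
    """Return (what_to_show, request_rebuild).

    file_time is the thumbnail's timestamp in seconds, or None if the
    thumbnail does not exist yet. now is the current time in seconds.
    """
    if file_time is None:
        # Missing: show the stretched white dot and ask for a thumbnail.
        return ("placeholder", True)
    if now - file_time > max_age:
        # Stale: still show the old thumbnail, but ask for a fresh one.
        return ("thumbnail", True)
    # Fresh: just show it.
    return ("thumbnail", False)


if __name__ == "__main__":
    now = 1_000_000_000
    print(thumbnail_action(None, now))
    print(thumbnail_action(now - 8 * 86400, now))
    print(thumbnail_action(now - 3600, now))
```

Note that the age is computed as now minus the file time; with the operands the other way around the difference is never positive and nothing would ever be treated as stale.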

The logic behind doing this is, I know perfectly well that not *EVERY* page that's spidered will necessarily ever be viewed. With a theoretical minimum number of pages for our site of 50K*3, it was unrealistic to think that we should make *ALL* those thumbnails in advance, and try to do housekeeping to keep them up to date. There have been a few people searching, with plenty of searches by myself, and I've only generated 267 thumbnails, which (kinda) proves I'm right. Of course, what people search for on your site, and how many results pages they look at, will change your usage.

At worst, the resulting pages for popular searches will get one hit per week. The pages that rank poorly will never get hit, so who cares.

The thumbnails are stored under a base64-encoded version of the URL, so as not to give anyone the ability to send funky requests and potentially do bad things to my perfectly happy filesystem. Be warned, you'll have plenty of filenames that are rather long. The PHP page on base64_encode() says that base64 encoding makes strings about 33% longer than the original. So if your original URL was 20 characters, the thumbnail filename will be at least 27 characters. The output is padded to a multiple of 4 characters, so it'll be 28 characters plus the .jpg extension. If your filesystem is going to have a problem with this, pick a different way to encode your thumbnail name. I considered just using md5sums of the URL, and even though everything I know about it says it can't happen, I'd just be worried that the wrong thumbnail would show up through some twist of fate. If you're a web host, that means a church's web site would end up with a hard-core porn picture, which I know would happen to me somehow. Well, that would be if we hosted churches' web sites.
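
To check the length arithmetic, here is a small Python sketch (again, not the PHP/Perl used on the site; the function names are hypothetical). It also shows one thing worth knowing before relying on these names: standard base64 output can contain '/' and '+', and a '/' in a filename will be treated as a directory separator. The second function uses the URL-safe base64 alphabet as one possible alternative. That variant is my suggestion here, not what the scripts in this post do, and both the web page script and the generator script would have to switch together.

```python
import base64


def thumb_name(url):
    # Standard base64: 4 output characters for every 3 input bytes,
    # padded with '=' to a multiple of 4.
    return base64.b64encode(url.encode("utf-8")).decode("ascii") + ".jpg"


def thumb_name_urlsafe(url):
    # Assumption/alternative: URL-safe alphabet uses '-' and '_'
    # in place of '+' and '/', so the result is a plain file name.
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii") + ".jpg"


if __name__ == "__main__":
    twenty = "a" * 20
    # 20 bytes -> 7 groups of 3 (rounded up) -> 28 characters, plus ".jpg"
    print(len(thumb_name(twenty)))
    print(thumb_name("???"))          # contains a '/'
    print(thumb_name_urlsafe("???"))  # same data, no '/'
```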

On the machine that has X running on it (on another server in this case), I have this perl script running. It could have been done in PHP, but I don't have PHP installed on that workstation. This is in my cron to run once per minute.

--- begin build.searchimages.pl
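#!/usr/bin/perl

use strict;
use warnings;
use DBI;
use MIME::Base64;

my $basepath = "/host/users/searchimages/thumbnails";

my $db = DBI->connect("DBI:mysql:database:server", 'user', 'pass')
    || die "Connect failed: $DBI::errstr\n";

my $source_query = "SELECT url FROM request_thumb";

my $source = $db->prepare($source_query) || die $db->errstr . ", error on source prepare\n";
$source->execute || print "Error on source execute\n";

while (my @curarray = $source->fetchrow_array) {
    my $req_url = $curarray[0];
    # Build the file name from the untouched URL so it matches the name the web page asks for.
    # The empty second argument stops encode_base64 from inserting newlines.
    my $outfile = encode_base64($req_url, "") . ".jpg";
    # Strip characters that could end the quoted shell argument early.
    $req_url =~ s/[;']//g;
    print "Creating $req_url -> $outfile\n";
    # The command must be on one line; the quotes keep '&' and '?' in URLs away from the shell.
    my $sysstring = "/opt/kde/bin/khtml2png -display :0 --width=800 --height=600 --scaled-width 240 --scaled-height 180 '$req_url' '$basepath/$outfile'";
    # Pass the string itself to system(); backticks would run it and then try to run its output.
    system($sysstring);
    $sysstring = "/usr/local/bin/scp '$basepath/$outfile' search_server:/your_path/thumbnails";
    print "SCP: $sysstring\n";
    system($sysstring);
    # The placeholder avoids quoting problems with odd URLs.
    $db->do("DELETE FROM request_thumb WHERE url = ?", undef, $curarray[0]);
}

$db->disconnect;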
--- end build.searchimages.pl
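
If you port the generator to another language, one way to keep odd URLs from being interpreted by the shell is to build the command as a list of arguments. The sketch below is in Python and only builds the list; it does not run anything. The khtml2png path and flags are copied from the Perl script above, while build_capture_command is a made-up helper name. Passing such a list to something like subprocess.run() executes the program directly, with no shell in between, so characters like '&' or ';' in a URL stay part of the URL.

```python
import base64

KHTML2PNG = "/opt/kde/bin/khtml2png"  # path taken from the Perl script above


def build_capture_command(url, basepath):
    """Return the khtml2png command as an argument list (hypothetical helper)."""
    outfile = base64.b64encode(url.encode("utf-8")).decode("ascii") + ".jpg"
    return [
        KHTML2PNG,
        "-display", ":0",
        "--width=800", "--height=600",
        "--scaled-width", "240",
        "--scaled-height", "180",
        url,                       # one argument, however strange the URL
        f"{basepath}/{outfile}",   # output file
    ]


if __name__ == "__main__":
    cmd = build_capture_command(
        "http://example.com/?a=1&b=2", "/host/users/searchimages/thumbnails"
    )
    print(cmd)
```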


khtml2png is *NOT* fast enough to run directly from the page, by any stretch of the imagination. It has to download the HTML and related images, parse the HTML, and generate the thumbnail. It takes a good 20 seconds or so per thumbnail.

I use scp to move the images over to the web server, but you could change that to do anything you want. I like scp. It does require that you have SSH keys set up for automagic login.

If anyone gets use from this, cool. If it gets included with a future PhpDig, please give me credit somewhere.

This could probably use a cron to blow away thumbnails that are more than x days (or weeks) old, but that's a housekeeping thing that I haven't really put too much thought into yet. Small sites may never need it. Larger sites may need to do it more frequently, or even implement a more extensive directory structure, so you don't end up with too many thumbnails in one directory. But hey, that's the magic of open source, right? Tune it the way *YOU* want/need it.
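
As a starting point for that housekeeping job, here is a minimal Python sketch (not part of what I actually run; prune_thumbnails and its parameters are hypothetical). It assumes all thumbnails sit in one flat directory and end in .jpg, and it deletes the ones whose modification time is older than a given number of days.

```python
import os
import time


def prune_thumbnails(directory, max_age_days, now=None):
    """Delete .jpg files older than max_age_days; return the names removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Only touch regular .jpg files; leave anything else alone.
        if not (name.endswith(".jpg") and os.path.isfile(path)):
            continue
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    import tempfile

    # Demo on a throwaway directory so nothing real gets deleted.
    demo_dir = tempfile.mkdtemp()
    open(os.path.join(demo_dir, "fresh.jpg"), "w").close()
    print(prune_thumbnails(demo_dir, 14))
```

Run from cron once a day or so, pointed at the real thumbnail directory. A deleted thumbnail simply gets requested again the next time someone sees that result.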

rispbiz
08-11-2004, 10:52 PM
Thanks JWSmythe

Great Build!!

I have installed this and am currently using it at http://www.2-surf.net

It was a little in-depth to install, though, and I had to make a couple of script changes to get it working with my server. But well worth it!

Thanks Again!!!!
Check it out at http://www.2-surf.net

Thank You,
2-Surf.net

rispbiz
08-11-2004, 11:07 PM
Here is one change I had to make before it would work correctly.

This Part:

$requestupdate = 1;
trigger_error ("Thumbnail: Couldn't find $encoded_name for $plain_url");
$result_url = "http://search.proadult.com/whitedot.png";
};

I had to comment out.

trigger_error ("Thumbnail: Couldn't find $encoded_name for $plain_url");

So it would look like this.

$requestupdate = 1;
//trigger_error ("Thumbnail: Couldn't find $encoded_name for $plain_url");
$result_url = "http://search.proadult.com/whitedot.png";
};

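The problem was that it wouldn't display the whitedot.png (presumably because the error message was printed before the header() call, so the redirect could not be sent).

Which I changed to an "Updating" picture.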

Hope this helps.

Thanks
2-surf.net

rispbiz
08-18-2004, 12:21 PM
Here is another addition I had to make to build.searchimages.pl

Here is the original part of build.searchimages.pl

$req_url $basepath/$outfile";
system($sysstring);
$sysstring = "/usr/local/bin/scp $basepath/$outfile search_server:/your_path/thumbnails";
print "SCP: $sysstring\n";
system("$sysstring");
$db->do("DELETE FROM request_thumb WHERE url = '$curarray[0]'");
--- end build.searchimages.pl

Here I had to add an OPTIMIZE TABLE after the delete to prevent overhead in the table.

$req_url $basepath/$outfile";
system($sysstring);
$sysstring = "/usr/local/bin/scp $basepath/$outfile search_server:/your_path/thumbnails";
print "SCP: $sysstring\n";
system("$sysstring");
$db->do("DELETE FROM request_thumb WHERE url = '$curarray[0]'");
$db->do("OPTIMIZE TABLE request_thumb");

--- end build.searchimages.pl

ChadK
08-19-2004, 08:29 AM
This mod looks very cool but I'm not sure I have the skill to tackle it just yet.

ChadK
08-20-2004, 06:17 AM
I don't know rispbiz. Oh well.

As to this mod..
khtml2png has many requirements that can't be satisfied by my server..

Cookies:
If the user running khtml2png hasn't been using KDE, or their cookie
policy is set to "Ask", they will be asked what to do about cookies,
causing khtml2png to hang. To avoid this, copy one of the supplied
kcookiejarrc files to ~/.kde/share/config/kcookiejarrc, or run konqueror
interactively.


Also:
khtml2png needs to connect to an X server


On my server I get:
vncserver: command not found

:( So I can't use it..

rispbiz
08-24-2004, 12:01 PM
You can use this to start an X server for khtml2png to connect to:

vncserver -depth 32 -geometry 1000x1000

Here is a webpage that may help http://www.michaelhoover.org/work/2004/07/khtml2png.html

Thanks
2-surf.net