antiword tweaking code


MTSC
02-18-2007, 06:32 AM
I'm wrestling with antiword. In short: MS Word documents are uploaded to the site and diverted to a temp dir, where antiword parses them and the script counts the characters. The script then divides the character count by 5 and outputs a "word" count.

The goal is less than 1 percent variance compared to the character count Word itself reports from its Tools menu.

I have code in place to remove any whitespace beyond two spaces after end-of-sentence punctuation, and to strip tabs and returns:

}
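// Drop antiword's image placeholders.
$content = str_replace('[pic]', '', $content);
// Strip carriage returns, line feeds, and tabs.
$content = preg_replace('/[\r\n\t]/', '', $content);
// Strip runs of spaces after any character that is not sentence-ending punctuation or a quote.
$content = preg_replace('/([^\.\!\?"\'])[ ]+/', '$1', $content);
// Cap runs of three or more spaces after sentence-ending punctuation at two, keeping the punctuation.
$content = preg_replace('/([.!?])[ ]{3,}/', '$1  ', $content);
echo 'Total character count for '. $file.': '. strlen($content).'<br/>';
$total_chars += strlen($content);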

But the results range from near perfect to 5% under or over Word's count.
Does anyone have ideas on how to tweak this antiword code into something more reliable?
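
To make the target concrete, here is a minimal sketch of the pipeline as described in the prose above: clean the text, count characters, divide by 5, and measure the variance against Word's figure. The function names (normalize_for_count, estimate_words, variance_percent) are hypothetical and not part of antiword or the script above. It assumes the antiword output is UTF-8 and that PHP's mbstring extension is available, so it counts characters with mb_strlen rather than bytes with strlen. It follows the stated rule (cap whitespace after end-of-sentence punctuation at two spaces) rather than reproducing every regex in the snippet.

```php
<?php
// Sketch only: helper names are hypothetical, not part of antiword.

// Clean antiword's text output before counting.
function normalize_for_count(string $content): string
{
    // Drop antiword's image placeholders.
    $content = str_replace('[pic]', '', $content);
    // Remove carriage returns, line feeds, and tabs.
    $content = preg_replace('/[\r\n\t]/', '', $content);
    // Cap whitespace after sentence-ending punctuation at two spaces,
    // keeping the punctuation mark itself.
    $content = preg_replace('/([.!?])[ ]{3,}/', '$1  ', $content);
    return $content;
}

// Estimate a "word" count as characters / 5.
// Assumption: the text is UTF-8, so mb_strlen counts characters, not bytes.
function estimate_words(string $content, string $encoding = 'UTF-8'): int
{
    $chars = mb_strlen(normalize_for_count($content), $encoding);
    return (int) round($chars / 5);
}

// Signed percentage difference between the script's count and Word's count.
function variance_percent(int $measured, int $reference): float
{
    if ($reference <= 0) {
        throw new InvalidArgumentException('Reference count must be positive.');
    }
    return ($measured - $reference) / $reference * 100;
}
```

Running variance_percent() with the script's character count and the figure Word reports for the same document, across a handful of sample files, would show whether the error is consistently over or under, which may help narrow down which cleanup step is responsible.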

TIA,
Sarah