PDA

View Full Version : Indexing DOC with VBA


JCL
02-01-2006, 06:58 AM
Hi all,

Here is my projet:
- Intranet with more than 200000 WORD documents.
- PHP interface (Easy-PHP) + phpDIG

DESCRIPTION

Problems:
1- Catdoc doesn't work fine with me, especially with various Word formats and accents
2- I do not index a web site but also a directory subtree. Each subdir contains 3000 files.

Solutions:
1a- I use Word instead of Catdoc to extract the text objects and store them into text files, with a VBA script.
1b- I had to modify the code to replace file.txt by .doc with the links
1c- the VBA script generates an index.html file linking all text extractions used by the spider.

2- I had to modify some constants to make the spider accept a number of files in Level 1 more than 200000

Performance:
1- the VBA text extractor takes roughly 2s per file
2- I used a 2x512Mo RAM to index the base. It took me more than 250 work hours.

The indexing time results are:
File number / Time (s)
1000 0,5
100000 4
150000 6
200000 10

THE VBA SCRIPT

Works better than catdoc but I don't warranty it for your purpose !

The aim of this script is to extract text objects from DOC files. Each directory contains a certain number of files. For instance, you can have "C:\DATA\shelf01, C:\DATA\shelf02, ..." with every directory containing 3000 files.
For each DOC file is created a TXT file.
In each directory is created a index.html file containing links to the extractions.

Microsoft does not recommend the use of Office to automate processes, because of the conversational requests (a dialog box asking you for something). You have to look at you computer to watch if it is suspended. In my db it appended twice. If it occurs, just have to shut down WORD by clicking on the cross.

The script takes care of password protected and rights of the files. Any protected file will not be indexed.

The script is partially fault tolerant, i.e., if the program crashes, you just have to relaunch it. It will start at the beginning of the current directory.


Best Regards,

Jean-Christophe LECOQ



Sub GenerateINDEXHTML()

Dim ShelfList
Dim currentDir As String, currentWORDFile As String, currentTXTFile As String, lastFileProcessed As String
Dim excludedDir As Variant, excludedFiles As Variant
Dim tempres As String
Dim FTPBasePath As String, LogFileName As String, LogErrorFileName As String, CurrentProcessFile As String
Dim ExcludeDocFileName As String
Dim arrFiles As Variant, arrDir As Variant
Dim NumFiles As Long, NumDir As Long
Dim CurrentOpenedFileID As Long, IndexFileID As Long, LogFileID As Long
Dim LogErrorFileID As Long, CurrentProcessFileID As Long, ExcludeDocFileID As Long
Dim TotaDirToExclude As Long, TotalFileToExclude As Long
Dim i, j, k
Dim reloadWORDmax As Long, reloadWORDcounter As Long

Dim dirToDo As Boolean, canProcessFile As Boolean
Dim debugflag As Boolean
Dim ProcessNow As Boolean
Dim oApp As Object
Dim oWord As Object, oTxt As Object


' Where the files are supposed to be stored
FTPBasePath = "c:\FTP\documents\"

' Log file name
LogFileName = "C:\FTP\conv_dir.log"

' Log Error file name
LogErrorFileName = "C:\FTP\conv_err.log"

' Log the current file that is processed
CurrentProcessFile = "C:\FTP\conv_cur.log"

' where the file to exclude of the process are
ExcludeDocFileName = "C:\FTP\conv_exclude.log"

' debug flag
debugflag = True

' reload winWORd every x documents
reloadWORDmax = 200
reloadWORDcounter = 0

NumDir = GesDirList(FTPBasePath, arrDir)

' open and read the list of files to exclude
TotalFileToExclude = 0
If ("" <> Dir(ExcludeDocFileName)) Then
ExcludeDocFileID = FreeFile
Open ExcludeDocFileName For Input As #ExcludeDocFileID ' should test if the file exist
ReDim excludedFiles(0 To 1000) ' should redim if exceeding
Debug.Print "Building excluded file list..."
Do While Not EOF(ExcludeDocFileID)
Input #ExcludeDocFileID, excludedFiles(TotalFileToExclude)
TotalFileToExclude = TotalFileToExclude + 1
Loop
ReDim Preserve excludedFiles(0 To TotalFileToExclude)
Close #ExcludeDocFileID
End If

' check if there is already a log file
TotaDirToExclude = 0
LogFileID = FreeFile
If ("" = Dir(LogFileName)) Then
Open LogFileName For Output As #LogFileID
Else
' open the log file and read it to get directories
Open LogFileName For Input As #LogFileID
ReDim excludedDir(0 To 100)
Debug.Print "Retrieving log file..."
Do While Not EOF(LogFileID)
Input #LogFileID, currentDir
excludedDir(TotaDirToExclude) = currentDir
TotaDirToExclude = TotaDirToExclude + 1
Loop
ReDim Preserve excludedDir(0 To TotaDirToExclude)
Close #LogFileID
End If

' read the last file processed
If (True = debugflag) Then
' set the file in process
Debug.Print "Process is set in debug mode."

If ("" <> Dir(CurrentProcessFile)) Then
CurrentProcessFileID = FreeFile
Open CurrentProcessFile For Input As #CurrentProcessFileID
Input #CurrentProcessFileID, lastFileProcessed
Close #CurrentProcessFileID
ProcessNow = False
Else
Debug.Print "No last file processed."
ProcessNow = True
End If
Else
' process file in any case
ProcessNow = True
End If

Debug.Print "Let's go..."

' list of the shelfs
For i = 1 To NumDir

currentDir = FTPBasePath & arrDir(i)
Debug.Print "Processing " & currentDir

' is the directory excluded ?
dirToDo = True
If (0 <> TotaDirToExclude) Then
For k = 0 To (TotaDirToExclude)
If (0 <> InStr(excludedDir(k), currentDir)) Then
Debug.Print "directory already done ! skipping..."
dirToDo = False
Exit For
End If
Next
End If

If (True = dirToDo) Then
NumFiles = GetFileList(currentDir, "*.doc", arrFiles)

' generate a new text file for index.html
fileName = currentDir & "\index.html"
IndexFileID = FreeFile
Open fileName For Output As #IndexFileID

Print #IndexFileID, "<!DOCTYPE HTML PUBLIC>"
Print #IndexFileID, "<html><head><title>Liste</title></head>"
Print #IndexFileID, "<BODY><TABLE>"

If (NumFiles > 0) Then
For j = 1 To NumFiles
' stopping the current WORD session if needed
If (reloadWORDcounter >= reloadWORDmax) Then
' quit WORD, without changing anything
oApp.Quit SaveChanges:=wdDoNotSaveChanges
Set oApp = Nothing
reloadWORDcounter = 0
End If

' starting a new WORD session
If (reloadWORDcounter = 0) Then
Set oApp = CreateObject("Word.Application")
oApp.Visible = False
oApp.DisplayAlerts = False
'disable spell and grammar checking
oApp.Options.CheckGrammarAsYouType = False
oApp.Options.CheckSpellingAsYouType = False
oApp.Options.CheckGrammarWithSpelling = False
' do not update OLE links
oApp.Options.UpdateLinksAtOpen = False
End If

reloadWORDcounter = reloadWORDcounter + 1

' name of the word file
currentWORDFile = currentDir & "\" & arrFiles(j)

' build a txt equivalent file
If (0 <> InStr(Left(Right(currentWORDFile, 4), 3), ".")) Then
currentTXTFile = LCase(Left(currentWORDFile, Len(currentWORDFile) - 4)) & ".txt"
End If
tempres = currentTXTFile
Do
k = InStr(tempres, "\")
If (0 <> k) Then
tempres = Right(tempres, Len(tempres) - k)
End If
Loop Until (0 = k)

canProcessFile = True
Err.Clear

' is the file excluded ?
If (TotalFileToExclude > 0) Then
For k = 0 To (TotalFileToExclude)
If (0 <> InStr(excludedFiles(k), currentWORDFile)) Then
Debug.Print "the file " & currentWORDFile & " is excluded ! skipping..."
canProcessFile = False
Exit For
End If
Next
End If

ProcessNow = True

' can the process restart ?
If (True = debugflag) Then
If (False = ProcessNow) Then
If (Len(lastFileProcessed) > 0) Then
' is it the current file ?
If (0 <> InStr(lastFileProcessed, currentWORDFile)) Then
Debug.Print "Restarting process from " & currentWORDFile
ProcessNow = True
End If
Else
Debug.Print "No crash before, starting now."
ProcessNow = True
End If
Else
' store the name of the current file in case of crash
CurrentProcessFileID = FreeFile
Open CurrentProcessFile For Output As #CurrentProcessFileID
Print #CurrentProcessFileID, currentWORDFile
Close #CurrentProcessFileID
End If
End If

If ((True = canProcessFile) And (True = ProcessNow)) Then
On Error GoTo OpenErrorHandler
Set oWord = oApp.Documents.Open(fileName:=currentWORDFile, ConfirmConversions:=False, _
AddToRecentFiles:=False, ReadOnly:=True, _
PasswordDocument:="JCisthebest")

If (True = canProcessFile) Then
oWord.ShowGrammaticalErrors = False
CurrentOpenedFileID = FreeFile
Open currentTXTFile For Output As #CurrentOpenedFileID
Close #CurrentOpenedFileID
Set oTxt = oApp.Documents.Open(fileName:=currentTXTFile, AddToRecentFiles:=False)
oTxt.ShowGrammaticalErrors = False
oTxt.range.Text = oWord.range.Text
If (True = canProcessFile) Then
oApp.ActiveDocument.Save
End If

oWord.Close SaveChanges:=wdDoNotSaveChanges
oTxt.Close
Set oWord = Nothing
Set oTxt = Nothing

End If 'canProcessFile (reading a file generates problem)
End If 'canProcessFile (file excluded)

If (True = canProcessFile) Then
Print #IndexFileID, "<TR><TD><A href=" & tempres & _
">" & arrFiles(j) & " converti en txt " & "</A></TD></TR>"
End If

Next

End If

Print #IndexFileID, "</TABLE></BODY></html>"
Close #IndexFileID

' open log file in an append mode, to add a new shelf
LogFileID = FreeFile
Open LogFileName For Append As #LogFileID
Print #LogFileID, currentDir
Close #LogFileID

End If
Next

oApp.Quit
Set oApp = Nothing

Debug.Print "OVER !"
Exit Sub

OpenErrorHandler:
' append error in log file
LogErrorFileID = FreeFile
Open LogErrorFileName For Append As #LogErrorFileID
Print #LogErrorFileID, Time & ": Error " & Err.Number & " on " & currentWORDFile
canProcessFile = False
Close #LogErrorFileID
If (Err.Number = -2147023170) Then
' critical error, relaod WINWORD
Debug.Print "critical error ! reloading WINWORD..."
reloadWORDcounter = reloadWORDmax
End If
Resume Next
End Sub

Function GesDirList(ByVal sPath As String, arrFiles) As Long
Dim NextReDim As Long, nFound As Long
Dim currentRep
NextReDim = 10
nFound = 0

ReDim arrFiles(0 To NextReDim)
If Right(sPath, 1) <> "\" Then sPath = sPath & "\"

currentRep = Dir(sPath, vbDirectory)
Do While currentRep <> ""
If currentRep <> "." And currentRep <> ".." Then
If (GetAttr(sPath & currentRep) And vbDirectory) = vbDirectory Then
nFound = nFound + 1
arrFiles(nFound) = currentRep
If nFound >= NextReDim Then
NextReDim = NextReDim + NextReDim
ReDim Preserve arrFiles(0 To NextReDim)
End If
End If
End If
currentRep = Dir()
Loop
ReDim Preserve arrFiles(0 To nFound)
GesDirList = nFound
End Function

Function GetFileList(ByVal sPath As String, SearchStr As String, arrFiles) As Long
Dim fileName As String
Dim NextReDim As Long, nFound As Long, bAdd As Boolean
Dim LCaseSearchStr As String
LCaseSearchStr = LCase(SearchStr)
NextReDim = 100

ReDim arrFiles(0 To NextReDim)
If Right(sPath, 1) <> "\" Then sPath = sPath & "\"
nFound = 0
fileName = Dir(sPath & SearchStr)
Do While Len(fileName) > 0
Select Case True
Case fileName = "."
Case fileName = ".."
Case Else
nFound = nFound + 1
arrFiles(nFound) = fileName
If nFound >= NextReDim Then
NextReDim = NextReDim + 100
ReDim Preserve arrFiles(0 To NextReDim)
End If
End Select
fileName = Dir()
Loop
ReDim Preserve arrFiles(0 To nFound)
GetFileList = nFound
End Function