|
02-01-2006, 07:58 AM | #1 |
Green Mole
Join Date: Feb 2006
Location: FRANCE
Posts: 1
|
Indexing DOC with VBA
Hi all,
Here is my projet: - Intranet with more than 200000 WORD documents. - PHP interface (Easy-PHP) + phpDIG DESCRIPTION Problems: 1- Catdoc doesn't work fine with me, especially with various Word formats and accents 2- I do not index a web site but also a directory subtree. Each subdir contains 3000 files. Solutions: 1a- I use Word instead of Catdoc to extract the text objects and store them into text files, with a VBA script. 1b- I had to modify the code to replace file.txt by .doc with the links 1c- the VBA script generates an index.html file linking all text extractions used by the spider. 2- I had to modify some constants to make the spider accept a number of files in Level 1 more than 200000 Performance: 1- the VBA text extractor takes roughly 2s per file 2- I used a 2x512Mo RAM to index the base. It took me more than 250 work hours. The indexing time results are: File number / Time (s) 1000 0,5 100000 4 150000 6 200000 10 THE VBA SCRIPT Works better than catdoc but I don't warranty it for your purpose ! The aim of this script is to extract text objects from DOC files. Each directory contains a certain number of files. For instance, you can have "C:\DATA\shelf01, C:\DATA\shelf02, ..." with every directory containing 3000 files. For each DOC file is created a TXT file. In each directory is created a index.html file containing links to the extractions. Microsoft does not recommend the use of Office to automate processes, because of the conversational requests (a dialog box asking you for something). You have to look at you computer to watch if it is suspended. In my db it appended twice. If it occurs, just have to shut down WORD by clicking on the cross. The script takes care of password protected and rights of the files. Any protected file will not be indexed. The script is partially fault tolerant, i.e., if the program crashes, you just have to relaunch it. It will start at the beginning of the current directory. Best Regards, Jean-Christophe LECOQ Code:
Sub GenerateINDEXHTML() Dim ShelfList Dim currentDir As String, currentWORDFile As String, currentTXTFile As String, lastFileProcessed As String Dim excludedDir As Variant, excludedFiles As Variant Dim tempres As String Dim FTPBasePath As String, LogFileName As String, LogErrorFileName As String, CurrentProcessFile As String Dim ExcludeDocFileName As String Dim arrFiles As Variant, arrDir As Variant Dim NumFiles As Long, NumDir As Long Dim CurrentOpenedFileID As Long, IndexFileID As Long, LogFileID As Long Dim LogErrorFileID As Long, CurrentProcessFileID As Long, ExcludeDocFileID As Long Dim TotaDirToExclude As Long, TotalFileToExclude As Long Dim i, j, k Dim reloadWORDmax As Long, reloadWORDcounter As Long Dim dirToDo As Boolean, canProcessFile As Boolean Dim debugflag As Boolean Dim ProcessNow As Boolean Dim oApp As Object Dim oWord As Object, oTxt As Object ' Where the files are supposed to be stored FTPBasePath = "c:\FTP\documents\" ' Log file name LogFileName = "C:\FTP\conv_dir.log" ' Log Error file name LogErrorFileName = "C:\FTP\conv_err.log" ' Log the current file that is processed CurrentProcessFile = "C:\FTP\conv_cur.log" ' where the file to exclude of the process are ExcludeDocFileName = "C:\FTP\conv_exclude.log" ' debug flag debugflag = True ' reload winWORd every x documents reloadWORDmax = 200 reloadWORDcounter = 0 NumDir = GesDirList(FTPBasePath, arrDir) ' open and read the list of files to exclude TotalFileToExclude = 0 If ("" <> Dir(ExcludeDocFileName)) Then ExcludeDocFileID = FreeFile Open ExcludeDocFileName For Input As #ExcludeDocFileID ' should test if the file exist ReDim excludedFiles(0 To 1000) ' should redim if exceeding Debug.Print "Building excluded file list..." Do While Not EOF(ExcludeDocFileID) Input #ExcludeDocFileID, excludedFiles(TotalFileToExclude) TotalFileToExclude = TotalFileToExclude + 1 Loop ReDim Preserve excludedFiles(0 To TotalFileToExclude) Close #ExcludeDocFileID End If ' check if there is already a log file TotaDirToExclude = 0 LogFileID = FreeFile If ("" = Dir(LogFileName)) Then Open LogFileName For Output As #LogFileID Else ' open the log file and read it to get directories Open LogFileName For Input As #LogFileID ReDim excludedDir(0 To 100) Debug.Print "Retrieving log file..." Do While Not EOF(LogFileID) Input #LogFileID, currentDir excludedDir(TotaDirToExclude) = currentDir TotaDirToExclude = TotaDirToExclude + 1 Loop ReDim Preserve excludedDir(0 To TotaDirToExclude) Close #LogFileID End If ' read the last file processed If (True = debugflag) Then ' set the file in process Debug.Print "Process is set in debug mode." If ("" <> Dir(CurrentProcessFile)) Then CurrentProcessFileID = FreeFile Open CurrentProcessFile For Input As #CurrentProcessFileID Input #CurrentProcessFileID, lastFileProcessed Close #CurrentProcessFileID ProcessNow = False Else Debug.Print "No last file processed." ProcessNow = True End If Else ' process file in any case ProcessNow = True End If Debug.Print "Let's go..." ' list of the shelfs For i = 1 To NumDir currentDir = FTPBasePath & arrDir(i) Debug.Print "Processing " & currentDir ' is the directory excluded ? dirToDo = True If (0 <> TotaDirToExclude) Then For k = 0 To (TotaDirToExclude) If (0 <> InStr(excludedDir(k), currentDir)) Then Debug.Print "directory already done ! skipping..." dirToDo = False Exit For End If Next End If If (True = dirToDo) Then NumFiles = GetFileList(currentDir, "*.doc", arrFiles) ' generate a new text file for index.html fileName = currentDir & "\index.html" IndexFileID = FreeFile Open fileName For Output As #IndexFileID Print #IndexFileID, "<!DOCTYPE HTML PUBLIC>" Print #IndexFileID, "<html><head><title>Liste</title></head>" Print #IndexFileID, "<BODY><TABLE>" If (NumFiles > 0) Then For j = 1 To NumFiles ' stopping the current WORD session if needed If (reloadWORDcounter >= reloadWORDmax) Then ' quit WORD, without changing anything oApp.Quit SaveChanges:=wdDoNotSaveChanges Set oApp = Nothing reloadWORDcounter = 0 End If ' starting a new WORD session If (reloadWORDcounter = 0) Then Set oApp = CreateObject("Word.Application") oApp.Visible = False oApp.DisplayAlerts = False 'disable spell and grammar checking oApp.Options.CheckGrammarAsYouType = False oApp.Options.CheckSpellingAsYouType = False oApp.Options.CheckGrammarWithSpelling = False ' do not update OLE links oApp.Options.UpdateLinksAtOpen = False End If reloadWORDcounter = reloadWORDcounter + 1 ' name of the word file currentWORDFile = currentDir & "\" & arrFiles(j) ' build a txt equivalent file If (0 <> InStr(Left(Right(currentWORDFile, 4), 3), ".")) Then currentTXTFile = LCase(Left(currentWORDFile, Len(currentWORDFile) - 4)) & ".txt" End If tempres = currentTXTFile Do k = InStr(tempres, "\") If (0 <> k) Then tempres = Right(tempres, Len(tempres) - k) End If Loop Until (0 = k) canProcessFile = True Err.Clear ' is the file excluded ? If (TotalFileToExclude > 0) Then For k = 0 To (TotalFileToExclude) If (0 <> InStr(excludedFiles(k), currentWORDFile)) Then Debug.Print "the file " & currentWORDFile & " is excluded ! skipping..." canProcessFile = False Exit For End If Next End If ProcessNow = True ' can the process restart ? If (True = debugflag) Then If (False = ProcessNow) Then If (Len(lastFileProcessed) > 0) Then ' is it the current file ? If (0 <> InStr(lastFileProcessed, currentWORDFile)) Then Debug.Print "Restarting process from " & currentWORDFile ProcessNow = True End If Else Debug.Print "No crash before, starting now." ProcessNow = True End If Else ' store the name of the current file in case of crash CurrentProcessFileID = FreeFile Open CurrentProcessFile For Output As #CurrentProcessFileID Print #CurrentProcessFileID, currentWORDFile Close #CurrentProcessFileID End If End If If ((True = canProcessFile) And (True = ProcessNow)) Then On Error GoTo OpenErrorHandler Set oWord = oApp.Documents.Open(fileName:=currentWORDFile, ConfirmConversions:=False, _ AddToRecentFiles:=False, ReadOnly:=True, _ PasswordDocument:="JCisthebest") If (True = canProcessFile) Then oWord.ShowGrammaticalErrors = False CurrentOpenedFileID = FreeFile Open currentTXTFile For Output As #CurrentOpenedFileID Close #CurrentOpenedFileID Set oTxt = oApp.Documents.Open(fileName:=currentTXTFile, AddToRecentFiles:=False) oTxt.ShowGrammaticalErrors = False oTxt.range.Text = oWord.range.Text If (True = canProcessFile) Then oApp.ActiveDocument.Save End If oWord.Close SaveChanges:=wdDoNotSaveChanges oTxt.Close Set oWord = Nothing Set oTxt = Nothing End If 'canProcessFile (reading a file generates problem) End If 'canProcessFile (file excluded) If (True = canProcessFile) Then Print #IndexFileID, "<TR><TD><A href=" & tempres & _ ">" & arrFiles(j) & " converti en txt " & "</A></TD></TR>" End If Next End If Print #IndexFileID, "</TABLE></BODY></html>" Close #IndexFileID ' open log file in an append mode, to add a new shelf LogFileID = FreeFile Open LogFileName For Append As #LogFileID Print #LogFileID, currentDir Close #LogFileID End If Next oApp.Quit Set oApp = Nothing Debug.Print "OVER !" Exit Sub OpenErrorHandler: ' append error in log file LogErrorFileID = FreeFile Open LogErrorFileName For Append As #LogErrorFileID Print #LogErrorFileID, Time & ": Error " & Err.Number & " on " & currentWORDFile canProcessFile = False Close #LogErrorFileID If (Err.Number = -2147023170) Then ' critical error, relaod WINWORD Debug.Print "critical error ! reloading WINWORD..." reloadWORDcounter = reloadWORDmax End If Resume Next End Sub Function GesDirList(ByVal sPath As String, arrFiles) As Long Dim NextReDim As Long, nFound As Long Dim currentRep NextReDim = 10 nFound = 0 ReDim arrFiles(0 To NextReDim) If Right(sPath, 1) <> "\" Then sPath = sPath & "\" currentRep = Dir(sPath, vbDirectory) Do While currentRep <> "" If currentRep <> "." And currentRep <> ".." Then If (GetAttr(sPath & currentRep) And vbDirectory) = vbDirectory Then nFound = nFound + 1 arrFiles(nFound) = currentRep If nFound >= NextReDim Then NextReDim = NextReDim + NextReDim ReDim Preserve arrFiles(0 To NextReDim) End If End If End If currentRep = Dir() Loop ReDim Preserve arrFiles(0 To nFound) GesDirList = nFound End Function Function GetFileList(ByVal sPath As String, SearchStr As String, arrFiles) As Long Dim fileName As String Dim NextReDim As Long, nFound As Long, bAdd As Boolean Dim LCaseSearchStr As String LCaseSearchStr = LCase(SearchStr) NextReDim = 100 ReDim arrFiles(0 To NextReDim) If Right(sPath, 1) <> "\" Then sPath = sPath & "\" nFound = 0 fileName = Dir(sPath & SearchStr) Do While Len(fileName) > 0 Select Case True Case fileName = "." Case fileName = ".." Case Else nFound = nFound + 1 arrFiles(nFound) = fileName If nFound >= NextReDim Then NextReDim = NextReDim + 100 ReDim Preserve arrFiles(0 To NextReDim) End If End Select fileName = Dir() Loop ReDim Preserve arrFiles(0 To nFound) GetFileList = nFound End Function |
Thread Tools | |
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
index only *.doc files ? | ipguy | Troubleshooting | 1 | 01-16-2006 04:45 PM |
indexing for the 1st time but getting "duplicate of existing doc" msg with some files | Morphea | Troubleshooting | 9 | 12-30-2004 04:03 PM |
problem with .pdf and .doc files | mleray | External Binaries | 11 | 12-09-2004 11:26 PM |
No short description from .doc? | Spider | External Binaries | 1 | 09-06-2004 03:25 AM |
For dummies: How can i index word doc?? | dapuse | External Binaries | 3 | 01-27-2004 03:09 PM |