Hi. See how "10 Le livre du Mois" and "1c Miró, un feu dans les ruines" were getting indexed...
- 10 in hexadecimal is 16 in decimal: "Le livre du Mois" without the quotes is 16 characters.
- 1c in hexadecimal is 28 in decimal: "Miró, un feu dans les ruines" without the quotes is 28 characters.
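
A quick way to double-check that arithmetic, as a small Python sketch. It assumes the second title reads "Miró, un feu dans les ruines" with the ó as a single character; the variable names are only illustrative.

```python
# Each pair is (hex prefix, text that followed it), taken from the two examples above.
samples = [
    ("10", "Le livre du Mois"),
    ("1c", "Miró, un feu dans les ruines"),
]

for hex_len, text in samples:
    # int(..., 16) parses the prefix as a base-16 number; len() counts characters.
    print(hex_len, int(hex_len, 16), len(text))

# prints:
# 10 16 16
# 1c 28 28
```

Note that len() counts characters, not bytes. In UTF-8 the ó takes two bytes, so the second string would be 29 bytes there; the match at 28 fits a single-byte encoding such as ISO-8859-1, though that is only an inference.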
These are chunks, where the hexadecimal number in front of each one is the length of that chunk. The {1,3} in the regex assumes those hexadecimal numbers are never more than three digits long, but replacing {1,3} with a + might be better.
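
To illustrate the difference between the two quantifiers, here is a rough Python sketch. The patterns are hypothetical stand-ins, since only the {1,3} part of the real regex is quoted here; they just match a line made up entirely of hex digits.

```python
import re

# Hypothetical patterns: stand-ins for the real regex, which is not shown in full.
LIMITED = re.compile(r"^[0-9a-fA-F]{1,3}$")   # at most three hex digits
OPEN_ENDED = re.compile(r"^[0-9a-fA-F]+$")    # any number of hex digits

for size_line in ["10", "1c", "fff", "1000"]:
    print(size_line, bool(LIMITED.match(size_line)), bool(OPEN_ENDED.match(size_line)))

# prints:
# 10 True True
# 1c True True
# fff True True
# 1000 False True
```

Three hex digits top out at fff (4095), so a chunk of 4096 bytes or more (1000 in hex) would slip past the {1,3} version.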
Speaking of better, I suppose a routine could be written to loop and parse and convert between hex and dec and find string positions and all that, but there probably won't be any header fields in the trailer, so just avoiding the hexadecimals should do as a quick patch.
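
For what it's worth, the fuller routine could look roughly like the Python sketch below. It assumes the body follows the usual HTTP/1.1 chunked layout (hex size, CRLF, data, CRLF, ending with a zero-size chunk), and the function name dechunk is made up for illustration. Anything after the last chunk, i.e. the trailer, is simply ignored.

```python
def dechunk(body: bytes) -> bytes:
    """Join the data of each chunk, dropping the hexadecimal size lines."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing CRLF after chunk size")
        # Drop any chunk extension after ';' and parse the size as hex.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; whatever follows is the trailer, ignored here
        if pos + size > len(body):
            raise ValueError("chunk is shorter than its declared size")
        out += body[pos:pos + size]
        pos += size + 2  # skip the CRLF that ends the chunk data
    return bytes(out)


raw = b"10\r\nLe livre du Mois\r\n0\r\n\r\n"
print(dechunk(raw))  # prints b'Le livre du Mois'
```

This works on bytes rather than text because the sizes count bytes, so decoding to a string should happen after the chunks are joined.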