There have been a lot of complaints about both the competency and the logic behind the latest Epstein archive release by the DoJ: from censoring the names of co-conspirators to censoring pictures o…
We also have very little in the way of error correction, since it's mostly not human-readable.
This is the main point.
Most well-functioning OCR systems have a dictionary-check pass, which goes a long way toward fixing the errors.
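For illustration, a minimal sketch of what such a dictionary-check pass might look like in Python using difflib. The word list and similarity cutoff here are made up; a real pass would load a full dictionary and use confusion-aware scoring.

```python
import difflib

# Hypothetical word list for illustration; a real pass would load a full dictionary.
DICTIONARY = ["grand", "jury", "testimony", "witness", "exhibit", "redacted"]

def dictionary_correct(tokens, cutoff=0.7):
    """Snap each OCR token to its closest dictionary word, if one is similar enough."""
    out = []
    for tok in tokens:
        word = tok.lower()
        if word in DICTIONARY:
            out.append(tok)
            continue
        # difflib ranks candidates by similarity ratio; keep the best one above the cutoff.
        match = difflib.get_close_matches(word, DICTIONARY, n=1, cutoff=cutoff)
        out.append(match[0] if match else tok)
    return out

print(dictionary_correct(["gr4nd", "jury", "testim0ny"]))
# ['grand', 'jury', 'testimony']
```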
On the other hand, if all those files use the same font and size, it should be possible to tune the OCR to match those constraints, and to restrict its candidates to the character set used by the encoding.
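Something like this with Tesseract via pytesseract, for example; the whitelist, page-segmentation mode, and filename are placeholders for whatever the actual documents use.

```python
import pytesseract
from PIL import Image

# Assumed character set for the encoding; replace with the real one.
WHITELIST = "0123456789ABCDEF"

# --psm 6 treats the page as a single uniform block of text, which suits pages that
# are all one font and size; tessedit_char_whitelist limits recognition to the
# characters the encoding can actually contain.
config = f"--psm 6 -c tessedit_char_whitelist={WHITELIST}"

text = pytesseract.image_to_string(Image.open("page_001.png"), config=config)
print(text)
```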
I was recently using OCR for an unrelated project and it was totally unusable as-is, because instead of what it expected (plain text documents), it got text on top of pictures. So now I have to find a way to preprocess the images and isolate the text, removing the graphic lines that can sit behind it, before it becomes readable.
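A rough sketch of that kind of preprocessing with OpenCV, assuming the interfering graphics are straight ruled lines; the filenames and kernel lengths are placeholders.

```python
import cv2

# Load the scan and binarize it (Otsu picks the threshold automatically).
img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Long, thin structuring elements pick out ruled lines while ignoring letter strokes.
horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                              cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40)))

# Subtract the detected lines, leaving (mostly) just the text strokes.
text_only = cv2.subtract(binary, cv2.bitwise_or(horizontal, vertical))

# Back to black-on-white for the OCR engine.
cv2.imwrite("scan_clean.png", cv2.bitwise_not(text_only))
```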
OCR is mostly good enough. The problem here is that we have 76 pages that need to be read perfectly, from low-fidelity input.