Copying text from a PDF produces gibberish

 

Read the full explanation (optional)

This is a “problem” that often happens accidentally, but it is also used intentionally to prevent copying and indexing of PDF files, especially those posted online.

Fonts in PDF files are stored with two tables: one contains the glyphs (the character shapes) and the other is a “ToUnicode” map, which says what character each glyph represents. Acrobat uses the first table to draw the page, so it doesn’t actually know what the text “says”, only which shapes to draw and in what order. When you copy or search the file, the second table is used to work out what the text says. For example, in the word APPLE the first table says what the second shape looks like (a “P”), wherever that shape happens to be stored in the font, while the ToUnicode table says the second letter is U+0050, a capital P.
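The two-table idea can be sketched in a few lines of Python. This is only an illustration, not how Acrobat is implemented: the glyph codes, the CMap fragment and the names `parse_bfchar` and `extract_text` are made up for the example, and only the simple `beginbfchar` form of a ToUnicode map with single-character targets is handled.

```python
import re

# Hypothetical ToUnicode CMap fragment for a subset font. The glyph codes
# were assigned in order of first use, not alphabetically.
TO_UNICODE_CMAP = """
4 beginbfchar
<01> <0041>
<02> <0050>
<03> <004C>
<04> <0045>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {glyph_code: character} from the bfchar entries of a CMap.

    Simplification: each target is assumed to be one BMP character.
    """
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    return {int(code, 16): chr(int(uni, 16)) for code, uni in pairs}

def extract_text(glyph_codes, to_unicode):
    """What copy/search does: look each glyph code up in the ToUnicode map."""
    return "".join(to_unicode[code] for code in glyph_codes)

# The page itself only stores glyph codes; that is all a renderer needs.
PAGE_GLYPHS = [0x01, 0x02, 0x02, 0x03, 0x04]

to_unicode = parse_bfchar(TO_UNICODE_CMAP)
print(extract_text(PAGE_GLYPHS, to_unicode))  # APPLE
```

Drawing the page never touches `to_unicode`, which is why a file can look perfect on screen while its text layer is unusable.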

If the ToUnicode map is corrupted or missing, the PDF will render on screen (and print) just fine, but Acrobat has no idea what the shapes mean. When you use a screen reader, export, search or copy/paste, the result is a default set of mappings. The relationship is still 1:1 (every “A” becomes the same character), but the pairing is not predictable, so it cannot be repaired automatically. You can repair it using plugins, but you would have to work out manually what each pair should be and recreate the map table one letter at a time.
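A rough sketch of why the output is consistent but wrong, and what a manual repair amounts to. The fallback rule used here (glyph code plus 0x20) is purely an assumption for illustration; a real viewer's fallback depends on the viewer and on the font's encoding. The names `fallback_decode` and `REPAIR` are hypothetical.

```python
# Glyph codes for the word APPLE in a hypothetical subset font.
PAGE_GLYPHS = [0x01, 0x02, 0x02, 0x03, 0x04]

def fallback_decode(glyph_codes):
    """Assumed fallback when no ToUnicode map exists: shift each glyph
    code into the printable ASCII range. Not a real viewer's rule."""
    return "".join(chr(0x20 + code) for code in glyph_codes)

garbled = fallback_decode(PAGE_GLYPHS)
print(garbled)  # !""#$

# The mapping is 1:1 (both P glyphs became the same wrong character), so a
# substitution table worked out by hand, one pair at a time, can undo it.
REPAIR = str.maketrans({"!": "A", '"': "P", "#": "L", "$": "E"})
repaired = garbled.translate(REPAIR)
print(repaired)  # APPLE
```

The table has to be built by comparing the copied gibberish with what is visible on the page, which is why this only pays off for short documents or a small character set.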

When this happens intentionally, the document author has removed or rewritten the ToUnicode map using a plugin. When it happens accidentally, it usually means the software exporting the PDF didn’t pass the correct font information to the PDF print driver (in the PostScript stream).

 

Here are the solutions:

#1 – Best Solution – Use Google Docs (platform independent)

  1. Upload the PDF to Google Docs.
  2. Open the document in Google Docs.
  3. Copy and paste whatever you want.

#2 – Use OCR

You need Acrobat 9:

  1. Document → Watermark → Add (add a text watermark, hit the space bar once).
  2. Advanced → Print Production → Flattener Preview → Convert all text to outlines (checkbox on). Save.
  3. Document → OCR text recognition → Recognize text using OCR. Select all text with the type tool and copy it.

#3 – Print to Microsoft XPS Document Writer

  1. Print from Acrobat using “Microsoft XPS Document Writer”. The output is “your file name.oxps”.
  2. Open the “…oxps” file with XPS Viewer.
  3. Print to PDF (Acrobat PDF or CutePDF), using the highest resolution (600 DPI).
  4. Open the result with Acrobat and run OCR with the “Searchable Image (Exact)” option.
