Cleaning Up Text from PDFs

A few weeks ago I mentioned to a friend of mine that I use an MS Word macro to remove the weird line breaks that sometimes occur when I copy text from a PDF and paste it into Word. If you use PDFs as sources in your writing, you’ve probably run across this problem before. I thought other people might be curious as well, so I’ll share two different ways to deal with this problem.

Method 1: Textfixer

Textfixer is a website with a line break removal tool that lets you plop in text from a PDF and remove the line breaks. It’s easy to use.

Method 2: the MS Word Macro

I mentioned in the post on organizing my research life that I prefer Word to Google Docs because it’s more robust. Macros are part of that robustness. I’ve been using this one for a long time, so I prefer it to Textfixer out of habit. From a 1997 article in PC World* (available via ProQuest), here’s how to create a Cleanup macro in Word:

1. Open the document you want to reformat, then start recording your macro: Select Tools:Macro, type a name for your macro, such as Cleanup, click Record, and in the Record Macro dialog box, click OK.

2. Select Edit:Replace, type ^p (a caret and a p) in the Find What box, and type a ~ (a tilde) in the Replace With box. Click Replace All to replace all hard returns with tildes. Click OK at the prompt. The document will look strange, but don’t worry about it.

3. While the Replace dialog box is still on screen, type ~~ in the Find What box, and type ^p in the Replace With box. Click Replace All to replace all the double hard returns (the normal break between paragraphs in text files) with a single hard return. Click OK at the prompt.

4. Type ~ in the Find What box, and type a single space in the Replace With box. Click Replace All to replace what was the single hard return at the end of a line with a space character to separate words. Click OK at the prompt, then click Close.

5. Click the Macro Stop button to turn off the macro recording. You may still have to do some minor editing to finish cleaning up the document, but most of your work will be done.
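For anyone who would rather script it, the three Replace All passes above can be sketched in Python. This is only a minimal sketch of the same idea, not the PC World macro itself: the function name cleanup is made up for illustration, and it assumes the pasted text contains no tildes of its own, since the tilde serves as the temporary placeholder just as it does in the macro.

```python
def cleanup(text: str) -> str:
    """Mimic the macro's three Replace All passes on a plain string.

    Assumption: the text contains no "~" characters of its own,
    because "~" is used as a temporary stand-in for hard returns.
    """
    # Normalize Windows and old Mac line endings to a single "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Pass 1: replace every hard return with a tilde.
    text = text.replace("\n", "~")
    # Pass 2: a double tilde was a paragraph break; restore one hard return.
    text = text.replace("~~", "\n")
    # Pass 3: a remaining single tilde was a mid-sentence break; use a space.
    text = text.replace("~", " ")
    return text


sample = "weird line\nbreaks from\na PDF.\n\nNext paragraph."
print(cleanup(sample))
# prints:
# weird line breaks from a PDF.
# Next paragraph.
```

As with the macro, some minor editing may still be needed afterward, for example where a word was hyphenated across a line break.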

The only drawback I’ve found is that you can’t run the macro on only a selected portion of text, which means you need to have a separate Word window open to drop in the PDF quote, run the Cleanup macro, then copy the clean text and paste it where you want it. Of course, if you use Textfixer you have to keep that open as well. On the other hand, I can use the macro even when I don’t have an Internet connection. I hope this helps someone. It’s saved me a lot of time over the years.

Appendix on Organizing Quotations

Which reminds me of something I meant to mention in the post on organizing my research life. One reason I don’t use Mendeley much, despite its quality, is that I use another program to organize PDFs and other efiles (Calibre) and I don’t annotate PDFs very often (and when I do, I use PDF-XChange Viewer). However, I pull a lot of quotations out of PDFs and print sources and organize them all in a Word file, beginning with the citation and then every quote from that source organized by page number. I then make the citation a “Heading 2,” and use the Document Map to view and navigate among them. (No Document Map feature on Google Docs, btw.) For the book, I then organized the quotes in the order I dealt with the sources. What works better for me than annotation is to have the main parts of everything I’m analyzing laid out in order, so that I can take snippets of quotes when I need them and already have them formatted the same way as my manuscript. It also helps me remember the gist of a book or article by skimming through key quotes. Since the book had many historical primary sources, I did this a lot. Accompanying my 200-page manuscript was another 125-page document of nothing but quotations from about 70 different sources. (On a side note, when I couldn’t copy and paste, I would type from print sources, which ended up being a good way to get my fingers moving on days when I struggled to start writing.)

*Campbell, George. “Make Word 97 Easier on Your Eyes.” PC World 15:10 (Oct 1997), p. 344.

3 thoughts on “Cleaning Up Text from PDFs”

  1. Hi Wayne. Regarding capturing printed resources, have you tried or heard of anyone having success with a pen scanner? I’ve seen a few on Amazon that get good reviews, and I’m intrigued by the idea, though their cost ($150 – $200) is a deterrent. I like these personal posts on how you do research, as it’s always better to see a living example than to talk about raw theory. Thanks for your efforts.

  2. Jason, I haven’t tried one of them, but they do seem pricey. For short bits or passages from a longer work, I usually just type it into Word. The finger exercise seems to help later when I write and I’m a reasonably fast typist. For longer things, I use some scanners in the library that provide OCR on the fly and then send the PDF to my email. They work pretty well, and we’ve put them where the copiers are.
