In the last installment, we looked at the general approach to locating, converting and completing a quality direct marketing house list. Here, I'm going to show you the steps for handling a simple list. In the next installment, we'll dig into something more complex and more useful.
The goal of today's exercise is to take an online directory of nearly 2,000 contacts located two levels beneath a home page, and turn it into a database- or contact manager-ready file. We're going to use two tools to do this: Acrobat 7 (the latest release) and Word 2002.
Our goal will be to create a well-targeted mailing list in Access format with 2,000 names in under 20 minutes (not counting background activities)—starting from scratch.
That works out to less than a second per name (20 minutes is 1,200 seconds, spread over 2,000 names), far below the 45 seconds per name that I consider a reasonable time to find and document a potential lead.
This article and the next are very procedural. To help, I've created a ZIP file you can download; it contains a set of files that give you a more concrete idea of what I'm doing. Those files are the following:
- First: a 535 KB PDF file showing what the initial download page contains.
- Full: a 2.5 MB PDF file containing the entire list as I capture it from the web.
- Sample: a 150 KB PDF file containing only about 10 pages from that full list.
- Initial RTF: a 4.6 MB Word document containing the pre-processed list.
- Sample Initial RTF: a 140 KB file containing the first few pages of Initial RTF, for a quicker download and review.
- Finished RTF: a 1.8 MB RTF document containing the end result.
- Finished TXT: a 625 KB TXT file. This is the file you'll eventually import into your database.
You're encouraged to use these files to execute the steps I describe here.
Read the Rules
The lists we're going to work with today, and in the next installment as well, are real lists. Before you do the same things we're going to do here with lists of relevance to you, make sure you read all terms and conditions of use posted on the site you're gathering from. Some sites specifically disallow what we're about to do. If a site says I can't use their information this way, I don't.
OK. Let's start the 20-minute timer.
Step One—Find The List
I chose an industry at random to use for the examples: I imagined myself a manufacturer of seals and rings (e.g., washers, O-rings and so on) for machinery. My marketplace became industrial equipment manufacturers. My Google search was:
directories industrial manufacturers
and one of the first items on the list was perfect for me: Industrial Quick Search.
Take a look at the Web page—you'll see a list of around 180 categories. Click on any one and you'll see the company information—that's what we're after. We have five pieces of information: name, city, state, phone, and a description of the company (plus, if we wanted, a general category description). This is perfect for a general telemarketing campaign (not enough information for a targeted or personalized approach—but we'll deal with all that in the next installment).
Step Two—Define the Correct Acrobat Settings
We're going to capture that home page and convert it to a PDF file—then from within Acrobat we're going to download the rest of the information. In order to ensure we get the best, most efficient results, we need to adjust the default web capture settings in Acrobat.
Acrobat is going to want to try and capture the web site as faithfully as possible—and you have to stop it from doing that since it means you'll have far too much formatting to contend with. You don't want to maintain table structures, graphics, colors and the like.
Here are the basic Acrobat steps:
1. From File > Create PDF > From Web Page, you'll see this dialog box (your settings need to match the examples here exactly):
2. Hit the Settings button, and you'll see a dialog box with two tabs. Set the General tab as you see here.
In the box above, the only thing you're concerned about is that Acrobat creates PDF tags for the document, to maintain a workable format when we convert the PDF file in a little while.
3. Select HTML under File Type Settings (this is optional) and click the Settings button.
The only thing I recommend here is that you deselect Convert Images, so that the download is faster (we only care about the text anyway). Click OK when you're done.
4. Open the Page Layout Tab
On this box, I've increased the size of the created page, to avoid having relevant information span more than one page.
We also need to tell Acrobat that we want those pages to load in the PDF document we're creating, and not in a browser window. Select Edit > Preferences and then the Web Capture settings at the bottom of the Preferences box—make sure your preferences look like this:
Step Three—Download the File
Copy and Paste the web site URL into the Create PDF from Web Page dialog box. There are other ways to capture web pages (Acrobat 7, for instance, adds buttons on IE that allow you to do this)—but the copy and paste method is available in earlier versions of Acrobat so we'll use that here.
Click the "Create" button. A status box opens letting you know the progress of your download.
When it completes, you'll have a single-page document that looks like the file named First in the abovementioned downloadable ZIP file.
Step Four—Completing the PDF
The links in the center of the page are our target. If we click on them the specific web page will load.
There are a number of ways to capture these pages. You can click on each one individually—but that takes a long time and you have to sit, wait and watch. You can also use the Advanced > Web Capture > Append All Links on Page command. This is more convenient but it will download every link on the page—and as you can see there are some links that you don't want—often many (banner ads for instance).
A better approach is to download only the links you actually want. To do this, select Advanced > Web Capture > View Web Links.
This lists every link that has not already been downloaded on the page (not available in Acrobat 5 and below):
Hit the Select All button. Then, as I've done here, deselect all the irrelevant links (in this case, all those links that do not point to a listing page) by holding down the Control (or Command) key and clicking each one with your mouse.
Hit the Download button and the highlighted pages will be added to the document.
In this case it takes about 10 seconds (at my bandwidth) to download each page, so with around 180 category pages you're looking at a wait of roughly half an hour (but remember, that doesn't get added to our total time, since you can do something else while this is happening).
Look at Full or the much shorter Sample file to see what the file looks like when all the pages are downloaded.
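The select-then-deselect idea from this step can also be sketched outside Acrobat. What follows is a minimal Python sketch, not Acrobat's own mechanism: it collects every link in a saved copy of a page and keeps only those that look like listing pages. The sample HTML, the example.com addresses and the `/listings/` test are all hypothetical; substitute whatever pattern distinguishes listing pages on the site you're working with.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def listing_links(html, marker="/listings/"):
    """Return only the links whose URL contains the (hypothetical) marker."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if marker in link]


# Hypothetical page: two category links plus a banner ad we want to skip.
SAMPLE_PAGE = """
<a href="http://example.com/listings/air-cylinders.htm">Air Cylinders</a>
<a href="http://ads.example.net/banner?id=7">Sponsored</a>
<a href="http://example.com/listings/ball-valves.htm">Ball Valves</a>
"""

if __name__ == "__main__":
    for link in listing_links(SAMPLE_PAGE):
        print(link)
```

Filtering by a positive pattern (keep what matches) tends to be safer than listing everything to exclude, because new ad links don't slip through.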
About the File Structure
Each listing contains five relevant components:
- A company name hyperlinked to the company web site
- The city
- The state
- The phone number
- A description of what the company does
This will form the foundation of our prospect database—using only this information you can make a telephone call and know something about what the company does when you have that conversation. You'll still have to ask for "the person in charge of the supply chain" or whatever—fixing that is what we'll do in the next installment.
Our goal is to turn that into a database of 2,000 names that looks like this:
BECO Manufacturing Co., Inc. | Laguna Hills | CA | 800-926-2326 | BECO is a leading manufacturer of air cylinders, pneumatic valves, hand sprays, tanks and a variety of other products. BECO's PVC air cylinders are noncorrosive, inexpensive, durable and lightweight. At BECO, we are happy to discuss customizing our cylinders to fit your company's needs. Call us today! |
And we want to do it in under 20 minutes of work time.
So far we've spent about two of those minutes.
We have 18 left.
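To make the target format concrete, here is a small Python sketch of how a finished record like the BECO line above could be read back into its five fields. It assumes that no field contains a pipe character and that each record ends with a trailing pipe, as in the sample; the `Listing` type and `parse_listing` function are names invented for this illustration.

```python
from typing import NamedTuple


class Listing(NamedTuple):
    name: str
    city: str
    state: str
    phone: str
    description: str


def parse_listing(line):
    """Split one pipe-delimited record into its five fields."""
    # Drop the trailing delimiter, then split on the remaining pipes.
    parts = [part.strip() for part in line.strip().rstrip("|").split("|")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, found {len(parts)}")
    return Listing(*parts)


# The article's sample record, with the description shortened.
SAMPLE = (
    "BECO Manufacturing Co., Inc. | Laguna Hills | CA | 800-926-2326 | "
    "BECO is a leading manufacturer of air cylinders. |"
)

if __name__ == "__main__":
    record = parse_listing(SAMPLE)
    print(record.name, record.phone, sep=" / ")
```

The field-count check is worth keeping: a record that splits into more or fewer than five parts is usually a sign that the cleanup missed something.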
Step Five—Import Into Word
Our next step is to get this into Word. We'll use the RTF format, since it is common to earlier versions of Acrobat.
Select File > Save As and then the RTF option beneath the filenames. Hit the Settings button. Acrobat wants to recreate the file exactly as it was on the web page, and we don't want that. We only want a clean, single column list, with the URLs for each web site intact.
This one's easy: just Deselect everything.
Save the RTF file, then open it in Word (this might be a good time for lunch, since it could take a while depending on the number of entries in the file). You can see what it looks like by checking out the Initial RTF file. (Make sure to display all formatting marks from the Tools > Options > View menu by checking the All box in the Formatting Marks area.)
Only 16 minutes left to go.
Codes and Wildcards
One thing we have to do is use character codes and wildcards in find and replace specifications. Let's look at what we need to use here:
^p | Paragraph |
^t | Tab |
^# | Any digit |
^& (replace field only) | Use the characters specified in the Find area as the Replace string |
* | Any string of characters (wildcard mode) |
@ | One or more occurrences of the preceding character (wildcard mode) |
A space—used in this article only... In the find and replace box, just use the spacebar |
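If you are more used to regular expressions than to Word's codes, the caret codes in the table map roughly onto Python's `re` syntax. This is only an approximate sketch for comparison, not something Word itself uses. It assumes the text has been saved as plain text, so that each paragraph mark is a single line break; the `bracket_phones` helper is a made-up example.

```python
import re

# Approximate Python regex equivalents of the Word caret codes.
# Assumption: in a plain-text export, each paragraph mark is "\n".
WORD_TO_REGEX = {
    "^p": r"\n",      # paragraph mark
    "^t": r"\t",      # tab
    "^#": r"\d",      # any digit
    "^&": r"\g<0>",   # replacement side only: the text that was matched
}

# Word's ^#^#^#-^#^#^#-^#^#^#^# written as a regular expression.
PHONE = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")


def bracket_phones(text):
    """Wrap every phone number in square brackets, the ^& idea in action."""
    return PHONE.sub(r"[\g<0>]", text)


if __name__ == "__main__":
    print(bracket_phones("Call 800-926-2326 today"))
```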
Step Six—Clean up and Format the List
The key to working through the mess of extraneous information you'll see is to find and replace information based on patterns: repetitive phrases, font conventions, the number of carriage returns used consistently to separate one item from another, and things like that. Every list is different, and every list has its own characteristics. But they all have patterns.
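One way to spot such patterns before committing to a find-and-replace is simply to count them. The sketch below is a hypothetical helper, not part of the Word procedure: it tallies how many times each run length of consecutive line breaks occurs in a plain-text copy of the list, assuming each paragraph mark was exported as a single line break. The sample text is invented.

```python
import re
from collections import Counter


def newline_run_lengths(text):
    """Count how often each run length of consecutive line breaks occurs."""
    return Counter(len(run) for run in re.findall(r"\n+", text))


# Invented fragment: single breaks inside a listing, a longer run between blocks.
SAMPLE = "Acme Co.\nDayton\nOH\n\n\n\n\nBolt Works\nTulsa\nOK\n"

if __name__ == "__main__":
    for length, count in sorted(newline_run_lengths(SAMPLE).items()):
        print(f"{length} break(s) in a row: {count} time(s)")
```

A run length that shows up about once per category is a good candidate for the separator you want to target.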
The most obvious issue here involves the tables that precede each category listing of companies. We need to get rid of them. (There are no border lines in the tables so they don't pop right out at you, but if you search for in Word you'll find the first one.)
There is one area where Wildcard specifications create a little problem for us: Word won't recognize the paragraph code (^p) in Wildcard mode, so we have to replace it with an unusual character, one that doesn't appear anywhere else. In this case, we actually have to replace a string of 11 paragraph marks in a row, which is the number of returns that follow the extraneous tabular information.
We'll use the tilde (~) as the replacement character since it is seldom used and is not a Wildcard character itself:
1. Open Edit > Replace.
With Use Wildcards deselected, enter the following find and replace specification (copy and paste it from here):
Find what: ^p^p^p^p^p^p^p^p^p^p^p
Replace with: ~
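For comparison, the same substitution can be sketched in Python on a plain-text copy of the list. This is an illustration under assumptions, not the Word procedure itself: each paragraph mark is taken to be a single line break, and the `mark_table_breaks` function is a name invented here. The guard reflects the requirement that the placeholder must not already appear anywhere in the text.

```python
def mark_table_breaks(text, run_length=11, placeholder="~"):
    """Replace each run of `run_length` line breaks with the placeholder."""
    # The placeholder is only useful if it is unique in the document.
    if placeholder in text:
        raise ValueError("placeholder already appears in the text; pick another")
    return text.replace("\n" * run_length, placeholder)


if __name__ == "__main__":
    # Invented fragment: leftover table text, 11 breaks, then a listing.
    sample = "table leftovers" + "\n" * 11 + "Acme Co.\nDayton\n"
    print(repr(mark_table_breaks(sample)))
```

Like Word's Replace All, this works left to right on non-overlapping matches, so a run of 12 breaks becomes the placeholder followed by one leftover break.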