
Scraping Business Listings in Omaha with Python

I recently participated in HackOmaha, a small hackathon located in Omaha, NE (coverage and results). I needed a list of all the businesses in Omaha, but did not have one readily available. Here is how I got that list.

Disclaimer: I don’t claim that this is an optimal solution to the problem; rather, I want to provide some insight into jumping into this kind of problem for the first time. There is plenty of room for improvement and increased efficiency.

The Context

There were 3 local government data sets, and our challenge was to leverage that data to do something useful. I was interested in doing something with the city council agendas because they seemed like they would be the most challenging. Each agenda is a PDF containing a bunch of natural language describing the week’s proceedings. There were other like-minded people at HackOmaha, so we teamed up to see what we could accomplish.

Part of the team began downloading the PDFs, extracting the text, and storing it in an ElasticSearch database. I worked with another part of the team to devise a strategy for doing Natural Language Processing (NLP) on the content of these agendas. Our goal was to extract business/company names, people’s names, and addresses from these agendas. To aid this process, I figured we would need a list of businesses in Omaha. This would supposedly make things easier on the NLP algorithm(s).

The Problem

A brief amount of research did not turn up any easily accessible business listing for Omaha. I did find http://www.owhyellowpages.com/, which offered a search feature for Omaha businesses. It seemed pretty comprehensive (there looked to be 40,000+ business listings). The problem, though, was that you could only access the listings 10 at a time in the browser.

The Solution

Web Scraping. We didn’t have a lot of time, but I figured if I could set up an automated process to extract the business information from the HTML pages, then I would be able to move on to something else useful. Here is what I needed to do:

  1. Identify a deterministic way to uniquely access all the business listing pages
  2. Identify the HTML elements that contained the data I wanted (business name, address, city, state)
  3. Scrape the HTML documents for the aforementioned data
  4. Store the business data (preferably in a CSV file to start)

1. The business listings were easy enough to access uniquely and deterministically. The first 10 results would show up at http://www.owhyellowpages.com/search/business?page=0 and simply incrementing the value for page would get the subsequent sets of results. There were pages ranging from 0 to 4215, so I would just need to iterate over that range.

2. Using Firebug in Firefox, I was able to narrow down the source location of the business name, address, and city-state. I took note of each tag and its class, as well as the tag that acted as a container for an entire business listing.

  • Business Listing – <div class="vcard">...</div>
  • Business Name – <span class="fn org"><a ...>NAME</a></span>
  • Business Address – <span class="street-address">ADDRESS</span>
  • Business City-State – <span class="city-state">CITYSTATE</span>

3. Not having any prior experience with web scraping, I decided to break this step up into two pieces in order to manage some of the risk of trying something new. Figuring it would take a little time to learn how to scrape data from an HTML page, I wanted to at least get some part of the process underway. It is quicker and more reliable to read an HTML page from disk (especially with my SSD), so I decided to get an automated process started that would systematically download the business listing HTML documents. Then I would dig into how to actually scrape said HTML documents, which would eventually just be waiting for me on my SSD.

3a. In order to get the HTML documents on my machine, I threw together a quick bash script that utilized wget.

#!/bin/bash

# Download all of the Omaha World Herald Yellow Page businesses
#
# The general form of the URL to wget is:
#  http://www.owhyellowpages.com/search/business?page=0
# To request each subsequent page, the number for page needs to be
# incremented. The files should then be renamed. The naming scheme
# from wget will follow this format:
#  business?page=#
# where the number (#) is the page number from 0 to 4215.

url="http://www.owhyellowpages.com/search/business?page="
file="./business?page="
final_dir="./yellowpages/"
final_file="business"

if [ ! -d "$final_dir" ]; then
    mkdir $final_dir
fi

for a in {0..4215}
do

    wget ${url}$a
    mv ${file}$a ${final_dir}${final_file}$a.html

done

It is not pretty, but it gets the job done (at github). I am sure there are any number of other approaches to doing this, so feel free to roll your own. Regardless, I start the script up and let it run because it is going to be a while. In the meantime, I move on to figuring out how to parse them.
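
If you would rather stay in Python for this step as well, a rough sketch of an equivalent downloader using urllib2 (the same library used for scraping later) might look like the following. This is just an illustration, not the script we actually ran, and the directory and file names are placeholders.

# A rough Python equivalent of the wget script above (a sketch, not the
# script actually used): fetch each results page and save it to disk.
import os
import urllib2

base_url = "http://www.owhyellowpages.com/search/business?page="
out_dir = "yellowpages"

if not os.path.isdir(out_dir):
    os.mkdir(out_dir)

for page in range(0, 4216):
    html = urllib2.urlopen(base_url + str(page)).read()
    with open(os.path.join(out_dir, "business" + str(page) + ".html"), 'w') as f:
        f.write(html)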

3b. I have enjoyed using Python lately, so it seemed like a good choice for throwing together a quick scraping script. I did some searching and found the libraries I would need. To do HTML parsing, I considered using a standard module (HTMLParser), but then realized there were better options out there. I finally settled on BeautifulSoup. I also found out that I could just as easily scrape the webpages from their URLs (using urllib2) as I could from my local filesystem. I threw together a series of generic functions for doing the scraping and then two additional functions for scraping the HTML locally or online.

I start off by importing the necessary libraries and declaring a few global variables:

# BusinessExtractor
# the purpose of this module is to go through the 4000+ html pages from
# www.owhyellowpages.com that contain business listings so as to extract
# the business name, address, and possibly other relevant information.
# Two approaches can be taken:
# - access the file on disk and parse relevant content
# - access the file online and parse the relevant content

from bs4 import BeautifulSoup
import urllib2

# Some global variables
htmlfile = "yellowpages/business" # need to add number and .html to end
firstfile = "yellowpages/business0.html"

This is followed by two functions that can either access a file or a url that is HTML, turn it into a Soup object, and return that to the caller:

# get_file_soup: String -> Soup
# this function takes a file name and generates the soup object for
# it using the BeautifulSoup library.
def get_file_soup(filename):
    f = open(filename, 'r')
    html = f.read()
    soup = BeautifulSoup(html, "html5lib")
    f.close()
    return soup

# get_url_soup: String -> Soup
# given the URL for a website, this function will open it up, read in
# the HTML and then create a Soup object for the contents. That Soup
# object is returned.
def get_url_soup(url):
    f = urllib2.urlopen(url)
    html = f.read()
    soup = BeautifulSoup(html, "html5lib")
    f.close()
    return soup

You may notice the extra argument in the BeautifulSoup constructor (“html5lib”). This specifies a parser that is different from the standard one. The html5lib parser is more flexible in that it can handle poorly-formed HTML, which, in my opinion, is necessary to be able to parse HTML written by random people/machines.
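
As a contrived illustration (not from the actual pages), html5lib will quietly repair markup that stricter parsers would reject, such as an unclosed tag:

# Contrived example: html5lib closes the unclosed tags and fills in the
# missing document structure instead of choking on it.
from bs4 import BeautifulSoup

broken = '<div class="vcard"><span class="street-address">123 Main St'
soup = BeautifulSoup(broken, "html5lib")
print(soup.find('span', { 'class' : 'street-address' }).contents[0])  # 123 Main St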

Now we need a function that will take a soup and find the portions of the HTML document that we are interested in (most of the document will be ignored).

# list_each_vcard: Soup -> String[]
# this function will go through the approx. 10 vcards for a given soup of
# the html page, aggregate the desired pieces of info into a list, and then
# return that list.
def list_each_vcard(soup):
    vcards = soup.findAll('div', { 'class' : 'vcard' })
    businesses = []
    for vcard in vcards:
        items = []
        items.append(get_name_for_business(vcard))
        items.append(get_address_for_business(vcard))
        items.append(get_citystate_for_business(vcard))
        businesses.append(','.join('"{0}"'.format(w) for w in items))
    return businesses

There are three functions in the above bit of code that we haven’t seen yet. Don’t despair, I will show them each next. First, we can see BeautifulSoup shine with its simplicity. I call the findAll function on the Soup object, specifying that I want div tags with the class vcard (<div class="vcard">...</div>). It returns a list of Soup objects that make up the specific div tags that it encounters. I can now iterate on these to grab specific information out of each of them. Remember that each vcard div represents a business, so each contains the pieces of information I want to know about each business.
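
To see those calls in isolation, here is a tiny self-contained example run against a made-up listing (the markup is a hand-written approximation of the real pages, and the business details are invented):

# Illustration only: a hypothetical vcard fragment approximating the real
# markup, run through the same findAll/find calls used above.
from bs4 import BeautifulSoup

sample = """
<div class="vcard">
  <span class="fn org"><a href="#">Example Business</a></span>
  <span class="street-address">123 Main St</span>
  <span class="city-state">Omaha, NE</span>
</div>
"""

soup = BeautifulSoup(sample, "html5lib")
for vcard in soup.findAll('div', { 'class' : 'vcard' }):
    print(vcard.find('span', { 'class' : 'fn org' }).find('a').contents[0])  # Example Business
    print(vcard.find('span', { 'class' : 'street-address' }).contents[0])    # 123 Main St
    print(vcard.find('span', { 'class' : 'city-state' }).contents[0])        # Omaha, NE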

Here is a look at the get_name_for_business function:

# get_name_for_business: Soup -> String
# this function takes a particular vcard business from the overall
# html soup and finds the name of the business and returns that unicode object
def get_name_for_business(vcard):
    name = vcard.find('span', { 'class' : 'fn org' }).find('a')
    if(name != None):
        innerstuff = name.contents
        if len(innerstuff) > 0:
            inner_item = innerstuff[0]
            return inner_item
        else:
            return ""
    else:
        return ""

The business name can be found inside the link tag (<a>) that is found inside the span of class 'fn org'. Across 4000+ pages it is important to recognize the possibility that some business listings may be missing information or that the desired tags appear in excess. A quick solution for dealing with these sorts of situations is to add a check that what we ‘find’ is not equal to None. If it is, then we just return an empty string, which can easily be weeded out of the list later.

Here is a look at the get_address_for_business function:

# get_address_for_business: Soup -> String
# this function takes a particular vcard business from the overall
# html soup and finds the street-address and returns that unicode object
def get_address_for_business(vcard):
    address = vcard.find('span', { 'class' : 'street-address' })
    if(address != None):
        innerstuff = address.contents
        if len(innerstuff) > 0:
            inner_item = innerstuff[0]
            return inner_item
        else:
            return ""
    else:
        return ""

Similar to above, the address can be pulled out of the span tag of class 'street-address' and again we do the None check to add robustness.

Here is a look at the get_citystate_for_business function:

# get_citystate_for_business: Soup -> String
# this function takes a particular vcard business from the overall
# html soup and finds the citystate info and returns that unicode object
def get_citystate_for_business(vcard):
    citystate = vcard.find('span', { 'class' : 'city-state' })
    if(citystate != None):
        innerstuff = citystate.contents
        if len(innerstuff) > 0:
            inner_item = innerstuff[0]
            return inner_item
        else:
            return ""
    else:
        return ""

Lastly, to get the city and state, we find the span tag of class 'city-state' and again the None check is included.
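
Since all three getters follow the same find-check-extract pattern, they could also be collapsed into a single helper. This is just a possible refactor, not part of the original script:

# A possible refactor (not in the original script): find a span with the
# given class inside a vcard and return its first piece of content, or ""
# if the span is missing or empty.
def get_first_content(vcard, css_class):
    span = vcard.find('span', { 'class' : css_class })
    if span != None and len(span.contents) > 0:
        return span.contents[0]
    return ""

The address and city-state getters then reduce to get_first_content(vcard, 'street-address') and get_first_content(vcard, 'city-state'), while the business name still needs the extra find('a') step inside the 'fn org' span.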

4. With all these functions in place, we finally just need a function to put it all together and store all the data off to some CSV file. In this case, there will actually be two functions (one for the local file parsing and the other for the url file parsing):

# scrape_range: String int int -> void
# given the name of an output file, an int for the beginning of the range,
# and an int for the end of the range, this function will go through each
# set of business vcards and get the CSV business data. Each set of this
# data will be written to the out file.
def scrape_range(outfile, begin, end):
    csvfile = open(outfile, 'a')
    for a in range(begin, end+1):
        soup = get_file_soup(htmlfile + str(a) + ".html")
        for business in list_each_vcard(soup):
            csvfile.write(business + "\n")
    csvfile.close()

# scrape_url_range: String int int -> void
# given the name of an output file, an int for the beginning of the range,
# and an int for the end of the range, this function will open up the URLs
# in that range and then scrape out the business attributes from the vcards
# that we are interested in. The scraped data will be written to the out file.
def scrape_url_range(outfile, begin, end):
    csvfile = open(outfile, 'a')
    for a in range(begin, end+1):
        soup = get_url_soup("http://www.owhyellowpages.com/search/business?page=" + str(a))
        for business in list_each_vcard(soup):
            csvfile.write(business + "\n")
    csvfile.close()

The first, scrape_range, allows us to scrape locally stored HTML files whose numbers fall between the two given integer values. The given string, outfile, is the name of the file that the CSV data will be written to. The second, scrape_url_range, instead allows us to scrape HTML pages located at a particular URL whose page numbers fall between the two given integer values. Like the first, the given outfile string is the name of the CSV file to which the business listings will be written. Note: the reason these range values work so well here is that the HTML files and the URLs were deterministically associated with a range of integer values. Other scenarios might not be so clean and straightforward, in which case you would have to devise a different scheme for iterating over the pages.
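
With everything in place, kicking off a run is just a matter of calling one of these functions with an output file name and a page range. For example (the file name here is arbitrary):

# Scrape every locally saved page (0 through 4215) into one CSV file.
scrape_range("omaha_businesses.csv", 0, 4215)

# Or scrape a small range straight from the site instead:
# scrape_url_range("omaha_businesses.csv", 0, 10)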

For the full source code, see the github repository.

In the end, we have a massive CSV file (on github) that contains content like the following:

"Academy Roofing","2407 N 45th Ave","Omaha, NE"
"Accurate Heating & Cooling","11710 N 189th Plz","Bennington, NE"
"Aksarben / ARS Heating, Air Conditioning & Plumbing","7070 S 108th Street","Omaha, NE"
"BRT Construction","718 Avenue K","Carter Lake, IA"
"Complete Industries Inc","9402 Fairview Road","Papillion, NE"
"Critter Control","PO Box 27308","Omaha, NE"
"Electrical Systems Inc","14928 A Cir","Omaha, NE"
"Elite Exteriors","14535 Industrial Rd","Omaha, NE"
"Goslin Contracting","6116 Military Ave","Omaha, NE"
"Mangia Italiana","6516 Irvington Rd","Omaha, NE"
"Papa Murphy's Take 'n' Bake","701 Galvin Rd S","Bellevue, NE"

After writing all the code, it was just a matter of letting the script run for a few hours, aggregating 40,000+ business listings. From there, a person could imagine all sorts of use cases for the data, including the one we intended: training our NLP algorithms and/or matching candidate business names against a database of listings.
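
As a rough sketch of that matching idea (illustrative only, not something we built at the hackathon), the CSV can be read back in and candidate strings from the agendas checked against the known business names:

# Illustrative sketch: load the scraped CSV and check whether a candidate
# string pulled from an agenda matches a known business name.
import csv

def load_business_names(csv_path):
    names = set()
    with open(csv_path, 'rb') as f:
        for row in csv.reader(f):
            if row and row[0]:
                names.add(row[0].lower())
    return names

names = load_business_names("omaha_businesses.csv")  # file name from the example above
print("Critter Control".lower() in names)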

Comment below or join the conversation at Hacker News
