How I Automated My Bike Shop
Web Scraping With Nokogiri and Mechanize #
Ever since I can remember, I have loved mountain biking. From doing Ride the Rockies with my family at the age of 8 to riding gnarly, technical downhill trails up at Whistler Bike Park in British Columbia, Canada, I have spent a huge chunk of my life cruising on two wheels.
Near the end of last year I decided to switch from being a pure consumer of bike parts to a retailer in the industry. Within a couple of weeks I had tackled all the necessary prerequisites for a wholesale account, and my dream of being a retailer was a reality.
For the first couple of months, sales were going well, the majority of them online. Unfortunately, like most small businesses, I hit my first pain point. Instead of spending my time building my brand and advertising, I was bogged down with the menial task of updating the stock of every item in my store and attempting to add as many of the 23,000 available SKUs as I could. I knew there had to be a better way.
Unlike most of their competitors, my distributor had no API. Their site is a simple ecommerce site that requires a page load for each individual product's information (e.g., description, MSRP, stock…).
It was time to automate this task.
I began with a simple Ruby on Rails app hosted on [Heroku](https://heroku.com) to reduce web request latency. The web scraping and site navigation were handled by the Mechanize and Nokogiri gems. The scraper required two steps to fully acquire all the data needed.
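Both stages lean on a BTI.login helper that signs a Mechanize agent into the distributor's back end. That helper isn't shown in the listings below, so here is a minimal sketch of what it might look like; the login path and form field names are my guesses and would need to match BTI's actual markup:

require 'mechanize'

module BTI
  def self.login(agent)
    # The path and field names here are assumptions -- inspect the
    # real login form and adjust accordingly
    page = agent.get('https://bti-usa.com/public/login')
    form = page.forms.first
    form.field_with(name: 'username').value = ENV['BTI_USERNAME']
    form.field_with(name: 'password').value = ENV['BTI_PASSWORD']
    # Submits the form and returns the logged-in landing page
    agent.submit(form)
  end
end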
In order to get to each individual product variation's page (to scrape it), I first needed the Item # of each variant (as shown below) from the search results pages. Through some trickery, I was able to get all product groups to list in the search results.
My scraper would request the URL for each of the 621 product group pages, parse the page, and insert the results into the database.
For each product group above, my script would create one product group record with an associated product for each item within it. Below is the code that tackles this task.
def self.scrape_product_groups(pages)
  a = Mechanize.new
  # Logs into the backend to make sure I can view all details
  page = BTI.login(a)
  # Pages get passed in as an array, which allows background jobs
  # to process different pages concurrently
  #
  # Ex: BTI.scrape_product_groups([1,2,3]) would scrape pages 1, 2 and 3
  pages.to_a.each do |page_num|
    puts "Scraping page #{page_num}"
    # Loads up the search results page
    page = a.get("https://bti-usa.com/public/quicksearch/+/?page=#{page_num}")
    # Parses the page from Mechanize to Nokogiri
    raw_xml = page.parser
    # Grabs all product group rows on the page
    groupRows = raw_xml.css('.groupRow')
    groupRows.each do |item|
      # Grabs the product group 'bti_id' from the element id and parses out junk
      bti_id = item.attributes.first.last.value
                   .gsub('groupItemsDiv__num_', '')
                   .gsub('groupItemsDiv_', '')
      # Finds or creates the product group based on 'bti_id'
      pg = ProductGroup.live.where(bti_id: bti_id).first
      pg ||= ProductGroup.create(bti_id: bti_id)
      # Parses the product group name
      pg.name = item.css('.groupTitleOpen').text
      puts "Updating #{pg.name} product group"
      # Builds up the product group description from all bullet points
      pg.description = ""
      item.css('.groupBullets').css('li').each do |li|
        pg.description += li.text + '. '
      end
      # Iterates through every item number in the product group
      item.css('.itemNo').each do |itemNo|
        # Finds and cleans up the 'bti_id'
        bti_id = itemNo.css('a').text.gsub('-', '')
        # Creates a new product for the product group if none is found
        product = Product.live
                         .where(bti_id: bti_id, product_group_id: pg.id).first
        product ||= Product.create(bti_id: bti_id,
                                   product_group_id: pg.id)
      end
      pg.save
    end
  end
end
Once all these ‘item numbers’ are collected, the second stage of the scraper kicks in.
The scraper then goes to each individual product page (like the one below) and scrapes all the info on it. The different prices (cost and MSRP) and stock are only displayed when logged in.
The code that tackled this challenge is below:
def BTI.parse_product_info(a, product)
  # Passes in 'a' = Mechanize.new and a product record
  page = BTI.login(a)
  # Navigates to the individual product page
  page = a.get("https://bti-usa.com/public/item/#{product.bti_id}")
  # Converts the Mechanize page to Nokogiri data
  raw_xml = page.parser
  # Loads the associated product group
  pg = product.product_group
  # If the product is no longer on the site, archives both its
  # product group and product so they will not be scraped in the future
  if raw_xml.css("#errorCell").any?
    pg.archive
    product.archive
    return
  end
  # Parses categories from the breadcrumbs in the header
  category_parent_name = raw_xml.css('.crumbs')
                                .css('a').first(2).last.try(:text)
  category_child_name = raw_xml.css('.crumbs')
                               .css('a').first(4).last.try(:text)
  # Finds or creates the parent and child categories
  category_parent = Category.where(name: category_parent_name,
                                   parent: true).first_or_create
  category_child = Category.where(name: category_child_name)
                           .first_or_create
  # Moves the product group and/or product to an activated state
  # if they were flagged as needing to be scraped
  pg.activate if pg.scraped?
  product.activate if product.scraped?
  # Adds the parent and/or child category to the product group
  # if it is not already categorized in it
  pg.categories << category_parent unless pg.categories.include?(category_parent)
  pg.categories << category_child unless pg.categories.include?(category_child)
  # Parses the brand out of the page
  pg.brand = raw_xml.css('.headline').css('span').text
  # Updates the product group in the database
  pg.save
  # Grabs the product image element (the second <img> in the item table)
  image = raw_xml.css(".itemTable").css("img")[1]
  # If an image exists, rewrites the URL to point at the largest
  # version stored on the server
  if image
    image_url = image.attributes["src"].value
                     .gsub('thumbnails/large', 'pictures')
    product.photo_url = "https://bti-usa.com" + image_url
  end
  # If the product requires special authorization to sell, marks it as such
  # (no add-to-cart form and no stock alert image means authorization is required)
  product.authorization_required =
    !(!!page.form_with(action: '/public/add_to_cart') or
      !!raw_xml.search('//img/@src')
               .to_s.match('/images/stockalert.gif'))
  # Finds the model of the product by parsing the brand out of the group name
  product.model = pg.name.gsub(pg.brand, '')
  product.save
  # Parses the different product prices (featured below)
  parse_product_price(raw_xml, product)
  # Parses out all product variations
  raw_xml.css('.itemSpecTable').css('tr').each do |row|
    # Grabs the key and value of each spec row
    key = row.css('.specLabel').text
    value = row.css('.specData').text
    # Saves the MPN in the product's MPN field
    if key == "vendor part #:"
      product.mpn = value
      product.save if product.changed?
    end
    # Parses out junk rows
    unless key == "" or value == "" or
           key == "BTI part #:" or
           key == "vendor part #:" or
           key == "UPC:"
      Variation.where(key: key.gsub(':', ''),
                      value: value.gsub('/', ' / ').titleize,
                      product_id: product.id)
               .first_or_create
    end
  end
end
def BTI.parse_product_price(raw_xml, item)
  # Grabs the item name and all HTML containing price info
  title_bar = raw_xml.css("h3")
  name = parse_noko(title_bar).gsub("\"", "")
  tds = raw_xml.css("div#bodyDiv").css("td")
  # Resets product prices and stock
  price = 0.0
  msrp = 0.0
  sale = 0.0
  stock = 0
  # Loops through the HTML table and parses out price and stock info.
  # Have to loop over raw cells due to BTI not believing in CSS classes
  (0..100).to_a.each do |i|
    unless tds[i].nil?
      parsed_item = parse_noko(tds[i])
      case parsed_item
      when "price:"
        price = parse_noko(tds[i + 1], true).to_f
      when "onsale!"
        sale = parse_noko(tds[i + 1], true).to_f
      when "MSRP:"
        msrp = parse_noko(tds[i + 1], true).to_f
      when "remaining:"
        stock = parse_noko(tds[i + 1], true).to_i
      end
    end
  end
  # Updates the product data and commits it to the database
  item.name = name
  item.msrp_price = msrp
  item.sale_price = sale
  item.regular_price = price
  item.stock = stock
  item.save
  # Outputs progress to the screen
  puts " * #{name}\n"
  puts " *** Price - #{price}\n"
  puts " *** Stock - #{stock}\n"
  puts "\n"
end

# Cleans newline, tab, and currency junk out of a Nokogiri node's text
def BTI.parse_noko(raw, strip_spaces = false)
  raw_text = raw.text
  # Spaces are stripped for cells that hold numbers (prices, stock)
  raw_text = raw_text.gsub(" ", "") if strip_spaces
  raw_text.gsub("\r", "").gsub("\n", "").gsub("\t", "")
          .gsub("$", "").gsub(",", "")
end
This second stage parsed over 23,000 items in just under two hours.
To achieve that speed, I set up my app to process the scraping through multi-threaded background jobs using Sidekiq. This allowed me to make 25 concurrent page requests at a time. If one of these requests failed due to a server error on my distributor's side, the job would be re-queued and processed a few minutes later.
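If you're curious what that fan-out might look like, here's a rough sketch of a Sidekiq worker. The class name, queue settings, and batch size are my own guesses rather than the app's exact code:

class ScrapePagesWorker
  include Sidekiq::Worker
  # Failed jobs (e.g. distributor 500s) are retried automatically,
  # which is what re-queues a request a few minutes later
  sidekiq_options queue: :scraping, retry: 5

  def perform(page_nums)
    BTI.scrape_product_groups(page_nums)
  end
end

# Fan the 621 search pages out across jobs in small batches
(1..621).each_slice(25) { |batch| ScrapePagesWorker.perform_async(batch) }

The 25-request ceiling comes from Sidekiq's worker concurrency setting (e.g. :concurrency: 25 in sidekiq.yml), not from anything in the job itself.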
By using Heroku Scheduler, this task would scale up background worker dynos when it started, scrape all the data, and turn the dynos off when complete.
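Scaling dynos from a scheduled task can be done through Heroku's platform API. Here's a sketch using the platform-api gem; the task name, dyno count, and the BTI.enqueue_all_pages helper are placeholders of mine, not the app's actual code:

require 'platform_api'

namespace :scraper do
  desc "Spin up worker dynos and kick off the nightly scrape"
  task nightly: :environment do
    heroku = PlatformAPI.connect_oauth(ENV['HEROKU_OAUTH_TOKEN'])
    # Scale the worker formation up before enqueueing the jobs
    heroku.formation.update(ENV['HEROKU_APP_NAME'], 'worker', 'quantity' => 5)
    # Hypothetical helper that enqueues every search page for scraping
    BTI.enqueue_all_pages
  end
end

A final job can hit the same endpoint with 'quantity' => 0 to shut the workers off once the queue drains.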
When it comes to a lean business model, this scraper was it. The Heroku server only cost me $9 a month, since I had to upgrade from the free Postgres database. I didn't have to pay for worker dynos because Heroku gives 750 free hours of dyno time every month.
I had all the data I needed; now it was just a matter of getting the products in front of customers!
Shopify and their API #
I began by trying to push to my existing WordPress shop, but after running into difficulties with the WooCommerce API, I knew there was a better solution.
When you need a modern ecommerce store with a simple API, Shopify is definitely the solution. Within 15 minutes I had the store set up with the correct sales tax amounts and merchant payment accounts (processed through Stripe).
I went back to my app and built a rake task to manipulate my data into Shopify’s schema.
Within a day the task was done and all 23,000 SKUs were being pushed to my store.
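For flavor, here's a minimal sketch of what pushing a product group through the shopify_api gem looked like in its ActiveResource style of the time. The attribute mapping is illustrative, not the exact rake task:

require 'shopify_api'

# Authenticates against the shop with a private app's credentials
ShopifyAPI::Base.site =
  "https://#{ENV['SHOPIFY_API_KEY']}:#{ENV['SHOPIFY_PASSWORD']}@my-shop.myshopify.com/admin"

ProductGroup.live.find_each do |pg|
  # One Shopify variant per scraped product in the group
  variants = pg.products.map do |p|
    { sku:                  p.bti_id,
      price:                p.regular_price,
      inventory_quantity:   p.stock,
      inventory_management: 'shopify' }
  end
  ShopifyAPI::Product.create(
    title:     pg.name,
    body_html: pg.description,
    vendor:    pg.brand,
    variants:  variants
  )
end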
The store was up and the products were live. For a few months I was able to concentrate on the important things and let one rake task take care of updating stock and uploading new products.
I ended up shutting down the site and shop this month (June 2014) because it was no longer worth my time. It was incredibly difficult to turn a substantial profit given the huge amount of competition in the industry and the fact that I could not lower my prices below MAP.
It was a fun ride but it’s time for the next venture.
Check out the entire app: GitHub
Also, I am always looking for new ideas to work on, so let me know if you are in need of a Ruby dev to kick ass on your project.
Wanna chat? Shoot me an email
Follow me at @matteleonard to see what I’m up to next
Check out my personal site at Mattl.co