Listed below is the code for a scraper I wrote. I need help adding delays to this scraper: I want a page scraped every hour.
require 'open-uri'
require 'json'
require 'nokogiri'
require 'sanitize'

class Scraper
  def initialize(url_to_scrape)
    @url = url_to_scrape
  end

  def scrape
    # TO DO: change to JSON
    # page = Nokogiri::HTML(open(@url))
    puts "Initiating scrape..."
    # URI.open is the open-uri entry point; a bare open(url) no longer works on Ruby 3.0+.
    raw_response = URI.open(@url)
    json_response = JSON.parse(raw_response.read)
    page = Nokogiri::HTML(json_response["html"])
    # your page should now be a hash. You need the page["html"]
    # Change this to parse the a tags with the class "article_title"
    # and build the links array for each href in these article_title links
    puts "Scraping links..."
    links = page.css(".article_title")
    articles = []
    # everything else here should work fine.
    # Limit the number of links to scrape for testing phase
    puts "Building articles collection..."
    links.each do |link|
      article_url = "http://seekingalpha.com" + link["href"]
      article_page = Nokogiri::HTML(URI.open(article_url))
      article = {}
      article[:company] = article_page.css("#about_primary_stocks").css("a")
      article[:content] = article_page.css("#article_content")
      article[:content] = Sanitize.clean(article[:content].to_s)
      # blank? comes from ActiveSupport, so this assumes a Rails environment.
      unless article[:content].blank?
        articles << article
      end
    end

    puts "Clearing all existing transcripts..."
    Transcript.destroy_all
    # Iterate over the articles collection and save each record into the database
    puts "Saving new transcripts..."
    articles.each do |article|
      transcript = Transcript.new
      transcript.stock_symbol = article[:company].text.to_s
      transcript.content = article[:content].to_s
      transcript.save
    end
    #return articles
  end
end
So what are you doing with the articles array when you are done scraping?
I am not sure if this is what you are looking for, but I would just use cron to schedule this script to run every hour. If your script is part of a bigger application, there is a neat gem called whenever, which provides a Ruby wrapper for cron tasks.
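For comparison, the in-process alternative to cron is a loop that sleeps between runs. This is only a minimal sketch: `run_every` and its `times` and `sleeper` parameters are hypothetical names introduced here, not part of your code or of any gem, and the URL in the usage comment is a placeholder.

```ruby
# Hypothetical helper (not from the original code): runs the given block,
# then waits `interval` seconds before the next run. Loops forever when
# `times` is nil. `sleeper` is injectable so the loop can be exercised
# without actually waiting.
def run_every(interval, times: nil, sleeper: ->(seconds) { sleep(seconds) })
  runs = 0
  loop do
    begin
      yield
    rescue StandardError => e
      # Keep the loop alive if a single run fails.
      warn "Run failed: #{e.message}"
    end
    runs += 1
    break if times && runs >= times
    sleeper.call(interval)
  end
  runs
end

# Usage, assuming the Scraper class from the question is loaded
# (the URL is a placeholder):
# run_every(3600) { Scraper.new("http://example.com/feed.json").scrape }
```

The drawback is that this process has to stay alive between runs, which is why I would still reach for cron. With whenever, the same schedule would be an `every 1.hour` block in `config/schedule.rb` that calls `runner` with the scrape command.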
Hope it helps