@basicxman
Created May 27, 2011 03:51
Revisions

  1. basicxman revised this gist May 27, 2011. 1 changed file with 1 addition and 11 deletions.
    12 changes: 1 addition & 11 deletions run.rb
    @@ -76,23 +76,13 @@ def get_pages_first_link(url)
         valid_links.first.attr("href")
       end

    -  # Using .include? is a little inefficient as going in reverse order would be
    -  # faster in this case.
    -  def has_recursed?
    -    last = @breadcrumbs.last
    -    (@breadcrumbs.length - 2).downto(0) do |i|
    -      return true if @breadcrumbs[i] == last
    -    end
    -    return false
    -  end
    -
       def has_reached_end?
         return false if @breadcrumbs.length <= 2
         if @breadcrumbs.last == "/wiki/Philosophy"
           @reached_end = true
           return true
         end
    -    return has_recursed?
    +    return @breadcrumbs[0..-2].include? @breadcrumbs.last
       end

       def add_crumb(url)
  2. basicxman revised this gist May 27, 2011. 1 changed file with 11 additions and 1 deletion.
    12 changes: 11 additions & 1 deletion run.rb
    @@ -76,13 +76,23 @@ def get_pages_first_link(url)
         valid_links.first.attr("href")
       end

    +  # Using .include? is a little inefficient as going in reverse order would be
    +  # faster in this case.
    +  def has_recursed?
    +    last = @breadcrumbs.last
    +    (@breadcrumbs.length - 2).downto(0) do |i|
    +      return true if @breadcrumbs[i] == last
    +    end
    +    return false
    +  end
    +
       def has_reached_end?
         return false if @breadcrumbs.length <= 2
         if @breadcrumbs.last == "/wiki/Philosophy"
           @reached_end = true
           return true
         end
    -    return @breadcrumbs[0..-2].include? @breadcrumbs.last
    +    return has_recursed?
       end

       def add_crumb(url)
  3. basicxman revised this gist May 27, 2011. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion run.rb
    @@ -49,8 +49,9 @@ def remove_bracket_links(text)
             end
           end
         end
    +    return temp if temp[last_bracket + 1] == '"' # First brackets might be within a valid link.
         temp[first_bracket..last_bracket] = ""
    -    temp
    +    return temp
       end

       def print_results
  4. basicxman revised this gist May 27, 2011. 1 changed file with 2 additions and 1 deletion.
    3 changes: 2 additions & 1 deletion run.rb
    @@ -67,7 +67,8 @@ def get_pages_first_link(url)
         page = Nokogiri::HTML(open(url))
         content = page.css("#bodyContent")

    -    content.css(".dablink", ".navbox", ".tocolours", ".image").each(&:remove) # Anchors with a .dablink parent are excluded.
    +    # Remove invalid links.
    +    content.css(".dablink", ".navbox", ".tocolours", ".image").each(&:remove)
         content.css("p").first.inner_html = remove_bracket_links content.css("p").first.inner_html

         valid_links = content.css("p > a")
  5. basicxman revised this gist May 27, 2011. 1 changed file with 1 addition and 4 deletions.
    5 changes: 1 addition & 4 deletions run.rb
    @@ -67,10 +67,7 @@ def get_pages_first_link(url)
         page = Nokogiri::HTML(open(url))
         content = page.css("#bodyContent")

    -    content.css(".dablink").each(&:remove) # Anchors with a .dablink parent are excluded.
    -    content.css(".navbox").each(&:remove) # Remove navigation box.
    -    content.css(".tocolours").each(&:remove) # Remove table of contents side box.
    -    content.css(".image").each(&:remove) # Image links :/
    +    content.css(".dablink", ".navbox", ".tocolours", ".image").each(&:remove) # Anchors with a .dablink parent are excluded.
         content.css("p").first.inner_html = remove_bracket_links content.css("p").first.inner_html

         valid_links = content.css("p > a")
  6. basicxman revised this gist May 27, 2011. 1 changed file with 4 additions and 4 deletions.
    8 changes: 4 additions & 4 deletions run.rb
    @@ -32,24 +32,24 @@ def unwikify(page)
       def remove_bracket_links(text)
         temp = text
         first_bracket = temp.index "("
    +    last_bracket = temp.length - 1

         return temp if first_bracket.nil?

         done = false
         opening_brackets = 0
    -    index = first_bracket
    -    until done
    +    (first_bracket..last_bracket).each do |index|
           if temp[index] == "("
             opening_brackets += 1
           elsif temp[index] == ")"
             opening_brackets -= 1
             if opening_brackets == 0
    +          last_bracket = index
               break
             end
           end
    -      index += 1
         end
    -    temp[first_bracket..index] = ""
    +    temp[first_bracket..last_bracket] = ""
         temp
       end

  7. basicxman revised this gist May 27, 2011. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion run.rb
    @@ -31,7 +31,7 @@ def unwikify(page)

       def remove_bracket_links(text)
         temp = text
    -    first_bracket = temp.index "("
    +    first_bracket = temp.index "("

         return temp if first_bracket.nil?

  8. basicxman revised this gist May 27, 2011. 1 changed file with 19 additions and 2 deletions.
    21 changes: 19 additions & 2 deletions run.rb
    @@ -31,8 +31,25 @@ def unwikify(page)

       def remove_bracket_links(text)
         temp = text
    -    temp.gsub! /\<span.*?\<\/span\>/, ""
    -    temp.gsub! /^.*?\)/, ""
    +    first_bracket = temp.index "("
    +
    +    return temp if first_bracket.nil?
    +
    +    done = false
    +    opening_brackets = 0
    +    index = first_bracket
    +    until done
    +      if temp[index] == "("
    +        opening_brackets += 1
    +      elsif temp[index] == ")"
    +        opening_brackets -= 1
    +        if opening_brackets == 0
    +          break
    +        end
    +      end
    +      index += 1
    +    end
    +    temp[first_bracket..index] = ""
         temp
       end

  9. basicxman revised this gist May 27, 2011. 1 changed file with 9 additions and 5 deletions.
    14 changes: 9 additions & 5 deletions run.rb
    @@ -41,15 +41,19 @@ def print_results
         puts "[#{@breadcrumbs.length}] " + @breadcrumbs.map { |crumb| unwikify(crumb) }.join(" -> ")
       end

    +  def remove(elm)
    +    elm.remove
    +  end
    +
       def get_pages_first_link(url)
         puts url
         page = Nokogiri::HTML(open(url))
         content = page.css("#bodyContent")

    -    content.css(".dablink").each { |elm| elm.remove } # Anchors with a .dablink parent are excluded.
    -    content.css(".navbox").each { |elm| elm.remove } # Remove navigation box.
    -    content.css(".tocolours").each { |elm| elm.remove } # Remove table of contents side box.
    -    content.css(".image").each { |elm| elm.remove } # Image links :/
    +    content.css(".dablink").each(&:remove) # Anchors with a .dablink parent are excluded.
    +    content.css(".navbox").each(&:remove) # Remove navigation box.
    +    content.css(".tocolours").each(&:remove) # Remove table of contents side box.
    +    content.css(".image").each(&:remove) # Image links :/
         content.css("p").first.inner_html = remove_bracket_links content.css("p").first.inner_html

         valid_links = content.css("p > a")
    @@ -85,4 +89,4 @@ def crawl

     end

    -go = ExtendedMind.new
    +go = ExtendedMind.new
  10. basicxman created this gist May 27, 2011.
    88 changes: 88 additions & 0 deletions run.rb
    @@ -0,0 +1,88 @@
    +#!/usr/bin/env ruby
    +
    +# Extended Mind
    +# Wikipedia checker. The concept is that from any Wikipedia article you can
    +# eventually reach the article on Philosophy by repeatedly clicking the first
    +# link in each article.
    +# http://xkcd.com/903 (see alt text)
    +
    +require 'open-uri'
    +require 'nokogiri'
    +
    +class ExtendedMind
    +
    +  def initialize
    +    @start_page = ARGV[0]
    +    @breadcrumbs = []
    +
    +    start
    +    crawl
    +
    +    print_results
    +  end
    +
    +  def wikify(page)
    +    "http://wikipedia.org" + page
    +  end
    +
    +  def unwikify(page)
    +    page.gsub("/wiki/", "")
    +  end
    +
    +  def remove_bracket_links(text)
    +    temp = text
    +    temp.gsub! /\<span.*?\<\/span\>/, ""
    +    temp.gsub! /^.*?\)/, ""
    +    temp
    +  end
    +
    +  def print_results
    +    puts "\n[-] Unable to check, recurses.\n" if @reached_end.nil?
    +    puts "[#{@breadcrumbs.length}] " + @breadcrumbs.map { |crumb| unwikify(crumb) }.join(" -> ")
    +  end
    +
    +  def get_pages_first_link(url)
    +    puts url
    +    page = Nokogiri::HTML(open(url))
    +    content = page.css("#bodyContent")
    +
    +    content.css(".dablink").each { |elm| elm.remove } # Anchors with a .dablink parent are excluded.
    +    content.css(".navbox").each { |elm| elm.remove } # Remove navigation box.
    +    content.css(".tocolours").each { |elm| elm.remove } # Remove table of contents side box.
    +    content.css(".image").each { |elm| elm.remove } # Image links :/
    +    content.css("p").first.inner_html = remove_bracket_links content.css("p").first.inner_html
    +
    +    valid_links = content.css("p > a")
    +    valid_links.first.attr("href")
    +  end
    +
    +  def has_reached_end?
    +    return false if @breadcrumbs.length <= 2
    +    if @breadcrumbs.last == "/wiki/Philosophy"
    +      @reached_end = true
    +      return true
    +    end
    +    return @breadcrumbs[0..-2].include? @breadcrumbs.last
    +  end
    +
    +  def add_crumb(url)
    +    @breadcrumbs << get_pages_first_link(wikify(url))
    +  end
    +
    +  def start
    +    begin
    +      add_crumb("/wiki/#{@start_page}")
    +    rescue
    +      abort "Invalid start!"
    +    end
    +  end
    +
    +  def crawl
    +    until has_reached_end?
    +      add_crumb(@breadcrumbs.last)
    +    end
    +  end
    +
    +end
    +
    +go = ExtendedMind.new
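
The loop check in has_reached_end? rescans the breadcrumb array on every step, which is the cost the has_recursed? comment worries about. A minimal sketch of the same two stop conditions (target reached, or a page repeats) using a Set for membership tests might look like the following. It is not taken from run.rb: the LINKS hash is a made-up stand-in for get_pages_first_link so the example runs offline, and follow_first_links and TARGET are hypothetical names. It also omits the original's "length <= 2" guard.

```ruby
require 'set'

# Hypothetical stand-in for get_pages_first_link: page => its first link.
# These entries are invented for illustration, not real Wikipedia data.
LINKS = {
  "/wiki/Kitten"  => "/wiki/Cat",
  "/wiki/Cat"     => "/wiki/Mammal",
  "/wiki/Mammal"  => "/wiki/Science",
  "/wiki/Science" => "/wiki/Philosophy",
  "/wiki/A"       => "/wiki/B",
  "/wiki/B"       => "/wiki/A"
}.freeze

TARGET = "/wiki/Philosophy"

# Follows first links from `start` until TARGET is reached, a page repeats,
# or no link is known. Returns [breadcrumbs, reached_target].
def follow_first_links(start, links = LINKS)
  breadcrumbs = []
  seen = Set.new
  current = start
  loop do
    nxt = links[current]
    return [breadcrumbs, false] if nxt.nil?            # dead end
    breadcrumbs << nxt
    return [breadcrumbs, true] if nxt == TARGET
    # Set#add? returns nil when the element is already present, i.e. a loop.
    return [breadcrumbs, false] unless seen.add?(nxt)
    current = nxt
  end
end
```

Calling follow_first_links("/wiki/Kitten") walks the made-up chain to Philosophy, while follow_first_links("/wiki/A") stops as soon as "/wiki/B" shows up a second time. As in run.rb, the repeated page is kept as the last breadcrumb so the printed trail shows where the loop closed.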