Created
May 27, 2011 03:51
Gist: basicxman/994609
Revisions
basicxman revised this gist
May 27, 2011 . 1 changed file with 1 addition and 11 deletions.
@@ -76,23 +76,13 @@ def get_pages_first_link(url)
    valid_links.first.attr("href")
  end

  def has_reached_end?
    return false if @breadcrumbs.length <= 2
    if @breadcrumbs.last == "/wiki/Philosophy"
      @reached_end = true
      return true
    end
    return @breadcrumbs[0..-2].include? @breadcrumbs.last
  end

  def add_crumb(url)
basicxman revised this gist
May 27, 2011 . 1 changed file with 11 additions and 1 deletion.
@@ -76,13 +76,23 @@ def get_pages_first_link(url)
    valid_links.first.attr("href")
  end

  # Using .include? is a little inefficient as going in reverse order would be
  # faster in this case.
  def has_recursed?
    last = @breadcrumbs.last
    (@breadcrumbs.length - 2).downto(0) do |i|
      return true if @breadcrumbs[i] == last
    end
    return false
  end

  def has_reached_end?
    return false if @breadcrumbs.length <= 2
    if @breadcrumbs.last == "/wiki/Philosophy"
      @reached_end = true
      return true
    end
    return has_recursed?
  end

  def add_crumb(url)
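The comment in this revision argues that scanning the trail backwards should find a repeated page sooner than calling `include?` on a slice. The following is a minimal standalone sketch of that reverse scan, assuming the trail is a plain array of link strings. The `Trail` class is a hypothetical alternative that is not part of the gist: it keeps a `Set` next to the array so each repeat check is a hash lookup instead of a scan.

```ruby
require 'set'

# Reverse scan, mirroring has_recursed?: compare the newest crumb with the
# earlier crumbs, starting from the most recent one.
def recursed_reverse?(breadcrumbs)
  last = breadcrumbs.last
  (breadcrumbs.length - 2).downto(0) do |i|
    return true if breadcrumbs[i] == last
  end
  false
end

# Hypothetical alternative (not in the gist): remember visited crumbs in a
# Set so that checking for a repeat does not walk the whole trail.
class Trail
  attr_reader :crumbs

  def initialize
    @crumbs = []
    @seen = Set.new
  end

  # Appends a crumb and returns true if it had already been visited.
  def add(crumb)
    repeated = @seen.include?(crumb)
    @crumbs << crumb
    @seen << crumb
    repeated
  end
end
```

For the short trails this script produces, either approach is likely fast enough; the Set version mainly matters if the trail grows long.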
basicxman revised this gist
May 27, 2011 . 1 changed file with 2 additions and 1 deletion.
@@ -49,8 +49,9 @@ def remove_bracket_links(text)
        end
      end
    end
    return temp if temp[last_bracket + 1] == '"' # First brackets might be within a valid link.
    temp[first_bracket..last_bracket] = ""
    return temp
  end

  def print_results
basicxman revised this gist
May 27, 2011 . 1 changed file with 2 additions and 1 deletion.
@@ -67,7 +67,8 @@ def get_pages_first_link(url)
    page = Nokogiri::HTML(open(url))
    content = page.css("#bodyContent")

    # Remove invalid links.
    content.css(".dablink", ".navbox", ".tocolours", ".image").each(&:remove)

    content.css("p").first.inner_html = remove_bracket_links content.css("p").first.inner_html
    valid_links = content.css("p > a")
basicxman revised this gist
May 27, 2011 . 1 changed file with 1 addition and 4 deletions.
@@ -67,10 +67,7 @@ def get_pages_first_link(url)
    page = Nokogiri::HTML(open(url))
    content = page.css("#bodyContent")

    content.css(".dablink", ".navbox", ".tocolours", ".image").each(&:remove) # Anchors with a .dablink parent are exlucded.

    content.css("p").first.inner_html = remove_bracket_links content.css("p").first.inner_html
    valid_links = content.css("p > a")
basicxman revised this gist
May 27, 2011 . 1 changed file with 4 additions and 4 deletions.
@@ -32,24 +32,24 @@ def unwikify(page)

  def remove_bracket_links(text)
    temp = text
    first_bracket = temp.index "("
    last_bracket = temp.length - 1
    return temp if first_bracket.nil?

    done = false
    opening_brackets = 0
    (first_bracket..last_bracket).each do |index|
      if temp[index] == "("
        opening_brackets += 1
      elsif temp[index] == ")"
        opening_brackets -= 1
        if opening_brackets == 0
          last_bracket = index
          break
        end
      end
    end

    temp[first_bracket..last_bracket] = ""
    temp
  end
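This revision finds the closing bracket that matches the first opening bracket by counting nesting depth while walking the string. The following standalone sketch illustrates that technique under some assumptions: the name `strip_first_parenthetical` is hypothetical, it returns a new string rather than mutating its argument, and it leaves unbalanced input untouched (the gist's version would instead delete through the end of the string, because `last_bracket` starts at `temp.length - 1`).

```ruby
# Removes the first parenthesised span from text, honouring nested brackets.
# Hypothetical helper for illustration; not the gist's exact behaviour.
def strip_first_parenthetical(text)
  first = text.index("(")
  return text if first.nil?

  depth = 0
  (first...text.length).each do |i|
    case text[i]
    when "("
      depth += 1
    when ")"
      depth -= 1
      # Depth is back at zero, so this ")" closes the first "(".
      return text[0...first] + text[(i + 1)..-1] if depth.zero?
    end
  end

  text # No matching ")" was found: return the input unchanged.
end

puts strip_first_parenthetical("Ruby (a (nested) aside) is a language")
```

Returning a new string avoids surprising the caller, since the gist's `temp = text` only aliases the argument and the slice assignment then modifies the original.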
basicxman revised this gist
May 27, 2011 . 1 changed file with 1 addition and 1 deletion.
@@ -31,7 +31,7 @@ def unwikify(page)

  def remove_bracket_links(text)
    temp = text
    first_bracket = temp.index "("
    return temp if first_bracket.nil?
basicxman revised this gist
May 27, 2011 . 1 changed file with 19 additions and 2 deletions.
@@ -31,8 +31,25 @@ def unwikify(page)

  def remove_bracket_links(text)
    temp = text
    first_bracket = temp.index "("
    return temp if first_bracket.nil?

    done = false
    opening_brackets = 0
    index = first_bracket
    until done
      if temp[index] == "("
        opening_brackets += 1
      elsif temp[index] == ")"
        opening_brackets -= 1
        if opening_brackets == 0
          break
        end
      end
      index += 1
    end

    temp[first_bracket..index] = ""
    temp
  end
basicxman revised this gist
May 27, 2011 . 1 changed file with 9 additions and 5 deletions.
@@ -41,15 +41,19 @@ def print_results
    puts "[#{@breadcrumbs.length}] " + @breadcrumbs.map { |crumb| unwikify(crumb) }.join(" -> ")
  end

  def remove(elm)
    elm.remove
  end

  def get_pages_first_link(url)
    puts url
    page = Nokogiri::HTML(open(url))
    content = page.css("#bodyContent")

    content.css(".dablink").each(&:remove) # Anchors with a .dablink parent are exlucded.
    content.css(".navbox").each(&:remove) # Remove navigation box.
    content.css(".tocolours").each(&:remove) # Remove table of contents side box.
    content.css(".image").each(&:remove) # Image links :/

    content.css("p").first.inner_html = remove_bracket_links content.css("p").first.inner_html
    valid_links = content.css("p > a")
@@ -85,4 +89,4 @@ def crawl
end

go = ExtendedMind.new
basicxman created this gist
May 27, 2011 .
@@ -0,0 +1,88 @@
#!/usr/bin/env ruby

# Extended Mind
# Wikipedia checker, the concept is that for any Wikipedia article you can
# eventually get to the article on Philosophy if you click the first link
# on the article.
# http://xkcd.com/903 (see alt text)

require 'open-uri'
require 'nokogiri'

class ExtendedMind

  def initialize
    @start_page = ARGV[0]
    @breadcrumbs = []
    start
    crawl
    print_results
  end

  def wikify(page)
    "http://wikipedia.org" + page
  end

  def unwikify(page)
    page.gsub("/wiki/", "")
  end

  def remove_bracket_links(text)
    temp = text
    temp.gsub! /\<span.*?\<\/span\>/, ""
    temp.gsub! /^.*?\)/, ""
    temp
  end

  def print_results
    puts "\n[-] Unable to check, recurses.\n" if @reached_end.nil?
    puts "[#{@breadcrumbs.length}] " + @breadcrumbs.map { |crumb| unwikify(crumb) }.join(" -> ")
  end

  def get_pages_first_link(url)
    puts url
    page = Nokogiri::HTML(open(url))
    content = page.css("#bodyContent")

    content.css(".dablink").each { |elm| elm.remove } # Anchors with a .dablink parent are exlucded.
    content.css(".navbox").each { |elm| elm.remove } # Remove navigation box.
    content.css(".tocolours").each { |elm| elm.remove } # Remove table of contents side box.
    content.css(".image").each { |elm| elm.remove } # Image links :/

    content.css("p").first.inner_html = remove_bracket_links content.css("p").first.inner_html
    valid_links = content.css("p > a")
    valid_links.first.attr("href")
  end

  def has_reached_end?
    return false if @breadcrumbs.length <= 2
    if @breadcrumbs.last == "/wiki/Philosophy"
      @reached_end = true
      return true
    end
    return @breadcrumbs[0..-2].include? @breadcrumbs.last
  end

  def add_crumb(url)
    @breadcrumbs << get_pages_first_link(wikify(url))
  end

  def start
    begin
      add_crumb("/wiki/#{@start_page}")
    rescue
      abort "Invalid start!"
    end
  end

  def crawl
    until has_reached_end?
      add_crumb(@breadcrumbs.last)
    end
  end
end

go = ExtendedMind.new
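The crawl described in the header comment (follow each article's first link until Philosophy is reached or a page repeats) can be tried offline by swapping the network fetch for a hash lookup. This is only a sketch under stated assumptions: `LINKS`, `TARGET` and `follow_first_links` are hypothetical names, the link table is made-up sample data, and unlike the gist the trail here includes the start page and has no minimum length before the end check applies.

```ruby
# Hypothetical stand-in for Wikipedia: maps each page to its first link.
LINKS = {
  "/wiki/Ruby"     => "/wiki/Language",
  "/wiki/Language" => "/wiki/Philosophy",
  "/wiki/Loop_A"   => "/wiki/Loop_B",
  "/wiki/Loop_B"   => "/wiki/Loop_A"
}.freeze

TARGET = "/wiki/Philosophy"

# Follows first links from start until TARGET is reached, a page repeats,
# or a page has no entry in links. Returns [trail, reached_target].
def follow_first_links(start, links)
  trail = [start]
  until trail.last == TARGET
    next_page = links[trail.last]
    return [trail, false] if next_page.nil? # Dead end: no first link known.

    repeated = trail.include?(next_page)
    trail << next_page
    return [trail, false] if repeated       # Loop: this page was already visited.
  end
  [trail, true]
end

trail, reached = follow_first_links("/wiki/Ruby", LINKS)
puts "\n[-] Unable to check, recurses.\n" unless reached
puts "[#{trail.length}] " + trail.map { |crumb| crumb.sub("/wiki/", "") }.join(" -> ")
# prints "[3] Ruby -> Language -> Philosophy"
```

Keeping the link lookup as a parameter makes the loop-detection logic testable without Nokogiri or network access.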