Forked from carolineartz/*nokogiri-commandline-ref.txt
Created
March 19, 2022 17:58
Revisions
carolineartz revised this gist
Apr 9, 2014 . 2 changed files with 0 additions and 0 deletions.
File renamed without changes.
carolineartz renamed this gist
Apr 9, 2014 . 1 changed file with 0 additions and 0 deletions.
File renamed without changes.
carolineartz revised this gist
Apr 9, 2014 . 1 changed file with 118 additions and 0 deletions.
require 'nokogiri'
require 'open-uri'

# Get a Nokogiri::HTML::Document for the page we're interested in...
# (URI.open is used because Kernel#open no longer opens URLs as of Ruby 3.0)

doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=tenderlove'))

# Do funky things with it using Nokogiri::XML::Node methods...

####
# Search for nodes by css
doc.css('h3.r a.l').each do |link|
  puts link.content
end

doc.at_css('h3').content

####
# Search for nodes by xpath
doc.xpath('//h3/a[@class="l"]').each do |link|
  puts link.content
end

####
# Or mix and match.
doc.search('h3.r a.l', '//h3/a[@class="l"]').each do |link|
  puts link.content
end

####
# Work with attributes
xml = "<foo wam='bam'>bar</foo>"
doc = Nokogiri::XML(xml)
doc.at_css("foo").content   # => "bar"
doc.at_css("foo")["wam"]    # => "bam" (already a String, so no .content call)

####
# Work with elements
el = doc.at_css("foo")
el.children                 # => NodeSet of child nodes

####
So for example if we wanted to know all the names of the food items in our document we simply say:

> doc.xpath("//name").collect(&:text)
=> ["carrot", "tomato", "corn", "grapes", "orange", "pear", "apple"]

If we were interested in the entire node we could leave off the .collect(&:text).

What if we wanted to select all the names of food items that were best baked? This requires us to use what’s called an axis: we first need to find the element “baked”, then go back up our XML elements to find which food the item is inside.

> doc.xpath("//tag[text()='baked']/ancestor::node()/name").collect(&:text)
=> ["pear", "apple"]

What if we were only interested in vegetables that were good for roasting? Just add //veggies:

> doc.xpath("//veggies//tag[text()='roasted']/ancestor::node()/name").collect(&:text)
=> ["carrot", "tomato"]

What about if we wanted to know all the tags ‘corn’ had? Again this is very easy:

> doc.xpath("//name[text()='corn']/../tags/tag").collect(&:text)
=> ["raw", "boiled", "grilled"]

We can even do searches matching the first character. Let’s say we wanted to know all the food items that started with the letter ‘c’:

> doc.xpath("//name[starts-with(text(),'c')]").collect(&:text)
=> ["carrot", "corn"]

You could also use [contains(text(),'rot')] and get back just carrot, which is useful when you want to do a partial match.

####
# Traversal
node.ancestors                   # Ancestors for <node>
node.at('xpath')                 # Returns node at given XPATH
node.at_css('selector')          # Returns node at given CSS selector
node.xpath('xpath')              # Returns nodes at given XPATH
node.css('selector')             # Returns nodes at given selector
node.child                       # Returns the first child node
node.children                    # Returns child nodes
node.parent                      # Returns the parent node

####
# Data manipulation
node.name                        # Element name
node.node_type
node.content                     # Returns text as string
                                 # (aka: .inner_text, .text)
node.content = '...'
node.inner_html
node.inner_html = '...'
node.attribute_nodes             # Returns attributes as nodes
node.attributes                  # Returns attributes as hash

####
# Tree manipulation
node.add_next_sibling(other)     # Place <other> after <node>
node.add_previous_sibling(other) # Place <other> before <node>
node.add_child(other)            # Put <other> inside <node>
node.after(data)                 # Put a new node after <node>
node.before(data)                # Put a new node before <node>
node.parent = other              # Reparents <node> inside <other>
carolineartz created this gist
Apr 9, 2014 .
A digest of most of the methods documented at [nokogiri.org](http://nokogiri.org/). Reading [the source](https://github.com/sparklemotion/nokogiri) can help, too.

Topics not covered: [RelaxNG validation](http://nokogiri.org/Nokogiri/XML/RelaxNG.html) or [Builder](http://nokogiri.org/Nokogiri/XML/Builder.html)
See also: http://cheat.errtheblog.com/s/nokogiri

Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.

More Resources

* [sax-machine](https://github.com/pauldix/sax-machine)
* [feedzirra](https://github.com/pauldix/feedzirra)
* [elementor](https://github.com/nakajima/elementor)
* [mechanize](http://mechanize.rubyforge.org/)
* [markup_validity](https://github.com/tenderlove/markup_validity)
* [XPath Reference](http://www.w3.org/TR/xpath/#path-abbrev)
* [XPath Reference 2](http://msdn.microsoft.com/en-us/library/ms256122.aspx)
* [CSS Selector Reference](http://msdn.microsoft.com/en-us/library/ie/hh772056(v=vs.85).aspx)
* [StackOverflow top questions](http://stackoverflow.com/questions/tagged/nokogiri?sort=votes)

## Creating and working with Documents

[Nokogiri::HTML::Document](http://nokogiri.org/Nokogiri/HTML/Document.html)
[Nokogiri::XML::Document](http://nokogiri.org/Nokogiri/XML/Document.html)

``` ruby
doc = Nokogiri(string_or_io) # Nokogiri will try to guess what type of document you are attempting to parse
doc = Nokogiri::HTML(string_or_io) # [, url, encoding, options, &block]
doc = Nokogiri::XML(string_or_io) # [, url, encoding, options, &block]
    # set options with block {|config| config.noblanks.noent.noerror.strict }
    # OR with a bitmask {|config| config.options = Nokogiri::XML::ParseOptions::NOBLANKS | Nokogiri::XML::ParseOptions::NOENT}
    # http://nokogiri.org/Nokogiri/XML/ParseOptions.html

# doc = Nokogiri.parse(...)
# doc = Nokogiri::XML.parse(...) # shortcut to Nokogiri::XML::Document.parse
# doc = Nokogiri::HTML.parse(...) # shortcut to Nokogiri::HTML::Document.parse

# document namespaces
doc.collect_namespaces
doc.remove_namespaces!
doc.namespaces

# shortcuts for creating new nodes
doc.create_cdata(string, &block)
doc.create_comment(string, &block)
doc.create_element(name, *args, &block) # Create an element
    doc.create_element "div" # <div></div>
    doc.create_element "div", :class => "container" # <div class='container'></div>
    doc.create_element "div", "contents" # <div>contents</div>
    doc.create_element "div", "contents", :class => "container" # <div class='container'>contents</div>
    doc.create_element("div") { |node| node['class'] = "container" } # <div class='container'></div>
    # (parentheses are required when passing a brace block)
doc.create_entity
doc.create_text_node(string, &block)
doc.root
doc.root=node

# A document is a Node, so see working_with_a_node
```

## Working with Fragments

[Nokogiri::XML::DocumentFragment](http://nokogiri.org/Nokogiri/XML/DocumentFragment.html)
[Nokogiri::HTML::DocumentFragment](http://nokogiri.org/Nokogiri/HTML/DocumentFragment.html)

Generally speaking, unless you expect to have a DOCTYPE and a single root node, you don’t have a document, you have a fragment. For HTML, another rule of thumb is that documents have html and body tags, and fragments usually do not.

A fragment is a [Node](http://nokogiri.org/Nokogiri/XML/Node.html), but is not a [Document](http://nokogiri.org/Nokogiri/XML/Document.html). If you need to call methods that are only available on Document, like `create_element`, call `fragment.document.create_element`.
```ruby
fragment = Nokogiri::XML.fragment(string)
fragment = Nokogiri::HTML.fragment(string, encoding = nil)

# Note: Searching a fragment relative to the document root with xpath
# will probably not return what you expect. You should search relative to
# the current context instead. e.g.

fragment.xpath('//*').size #=> 0
fragment.xpath('.//*').size #=> 229
```

## Working with a [Nokogiri::XML::Node](http://nokogiri.org/Nokogiri/XML/Node.html)

``` ruby
node = Nokogiri::XML::Node.new('name', document) # initialize a new node
node = document.create_element('name') # shortcut

node.document

node.name # alias of node.node_name
node.name= # alias of node.node_name=

node.read_only?
node.blank?

# Type of Node
node.type # alias of node.node_type
node.cdata? # type == CDATA_SECTION_NODE
node.comment? # type == COMMENT_NODE
node.element? # type == ELEMENT_NODE alias node.elem?
node.fragment? # type == DOCUMENT_FRAG_NODE (Document fragment node)
node.html? # type == HTML_DOCUMENT_NODE
node.text? # type == TEXT_NODE
node.xml? # type == DOCUMENT_NODE (Document node type)

# other types not covered by a convenience method
# ATTRIBUTE_DECL: Attribute declaration type
# ATTRIBUTE_NODE: Attribute node type
# DOCB_DOCUMENT_NODE: DOCB document node type
# DOCUMENT_TYPE_NODE: Document type node type
# DTD_NODE: DTD node type
# ELEMENT_DECL: Element declaration type
# ENTITY_DECL: Entity declaration type
# ENTITY_NODE: Entity node type
# ENTITY_REF_NODE: Entity reference node type
# NAMESPACE_DECL: Namespace declaration type
# NOTATION_NODE: Notation node type
# PI_NODE: PI node type
# XINCLUDE_END: XInclude end type
# XINCLUDE_START: XInclude start type

# Attributes, like a hash that maps string keys to string values
node['src'] # aliases: node.get_attribute, node.attr.
node['src'] = 'value' # alias node.set_attribute
node.key?('src') # alias node.has_attribute?
node.keys
node.values
node.delete('src') # alias of node.remove_attribute
node.each { |attr_name, attr_value| } # Node includes Enumerable, which works on these attribute names and values

# Attribute Nodes
node.attribute('src') # Get the attribute node with name src
    # Returns a Nokogiri::XML::Attr, a subclass of Nokogiri::XML::Node
    # that provides +.content=+ and +.value=+ to modify the attribute value
node.attribute_nodes # returns an array of this Node's attributes as Attr objects.
node.attribute_with_ns('src', 'namespace') # Get the attribute node with name and namespace
node.attributes # Returns a hash containing the node's attributes.
    # The key is the attribute name without any namespace,
    # the value is a Nokogiri::XML::Attr representing the attribute.
    # If you need to distinguish attributes with the same name, but with different namespaces, use #attribute_nodes instead.

# Traversing / Modifying
# +node_or_tags+ can be a Node, a DocumentFragment, a NodeSet, or a string containing markup.

## Self
node.traverse {|node| } # yields all children and self to a block, _recursively_.
node.remove # alias of node.unlink
    # Unlink this node from its current context.
node.replace(node_or_tags) # Replace this Node with +node_or_tags+.
    # Returns the reparented node (if +node_or_tags+ is a Node),
    # or returns a NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
node.swap(node_or_tags) # like +replace+, but returns self to support chaining

## Siblings
node.next # alias of node.next_sibling
    # Returns the next sibling node
node.next=(node_or_tags) # alias of node.add_next_sibling
    # Inserts node_or_tags after this node (as a sibling).
    # Returns the reparented node (if +node_or_tags+ is a Node),
    # or returns a NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
node.after(node_or_tags) # like +next=+, but returns self to support chaining
node.next_element # Returns the next Nokogiri::XML::Element sibling node.

node.previous # alias of node.previous_sibling
    # Returns the previous sibling node
node.previous=(node_or_tags) # alias of node.add_previous_sibling
    # Inserts node_or_tags before this node (as a sibling).
    # Returns the reparented node (if +node_or_tags+ is a Node),
    # or returns a NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
node.before(node_or_tags) # just like +previous=+, but returns self to support chaining
node.previous_element # Returns the previous Nokogiri::XML::Element sibling node.

## Parent
node.parent
node.parent=(node)

## Children
node.child # returns a Node
node.children # Get the list of children of this node as a NodeSet
node.children=(node_or_tags) # Set the inner html for this Node
    # Returns the reparented node (if +node_or_tags+ is a Node),
    # or returns a NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
node.elements # alias: node.element_children
    # Get the list of child Elements of this node as a NodeSet.
node.add_child(node_or_tags) # Add +node_or_tags+ as a child of this Node.
    # Returns the reparented node (if +node_or_tags+ is a Node),
    # or returns a NodeSet (if +node_or_tags+ is a DocumentFragment, NodeSet, or string).
node << node_or_tags # like above, but returns self to support chaining, e.g. root << child1 << child2
node.first_element_child # Returns the first child node of this node that is an element.
node.last_element_child # Returns the last child node of this node that is an element.

## Content / Children
node.content # aliases node.text node.inner_text node.to_str
node.content=(string) # Set the Node's content to a Text node containing +string+. The string gets XML escaped, and will not be interpreted as markup.
node.inner_html # (*args) children.map { |x| x.to_html(*args) }.join
node.inner_html=(node_or_tags) # Sets the inner html of this Node to +node_or_tags+
    # Returns self.
    # Also see related method +children=+

## Searching below (see Working with a Nodeset below)
# see docs for namespace bindings, variable bindings, and custom xpath functions via a handler class
node.search(*paths) # alias: node / path
    # paths can be XPath or CSS
node.at(*paths) # alias node % path
    # Search for the first occurrence of path. Returns nil if nothing is found, otherwise a Node. (like search(path, ns).first)
node.xpath(*paths) # search for XPath queries
node.at_xpath(*paths) # like xpath(*paths).first
node.css(*rules) # search for CSS rules
node.at_css(*rules) # like css(*rules).first
node > selector # Search this node's immediate children using a CSS selector

# Searching above
node.ancestors # list of ancestor nodes, closest to furthest, as a NodeSet.
node.ancestors(selector) # ancestors that match the selector

# Where am I?
node.path # Returns the path associated with this Node
node.css_path # Get the path to this node as a CSS expression
node.matches?(selector) # does this node match this selector?
node.line # line number from input
node.pointer_id # internal pointer number

# Namespaces
node.add_namespace(prefix, href) # alias of node.add_namespace_definition
    # Adds a namespace definition with prefix using href value. The result is as
    # if parsed XML for this node had included an attribute
    # 'xmlns:prefix=value'. A default namespace for this node ("xmlns=") can be
    # added by passing 'nil' for prefix. Namespaces added this way will not show
    # up in #attributes, but they will be included as an xmlns attribute when
    # the node is serialized to XML.
node.default_namespace=(url)
    # Adds a default namespace supplied as a string url href, to self. The
    # consequence is as if an xmlns attribute with the supplied argument were
    # present in parsed XML. A default namespace set with this method will not
    # show up in #attributes, but when this node is serialized to XML an "xmlns"
    # attribute will appear. See also #namespace and #namespace=
node.namespace # returns the default namespace set on this node (as with an "xmlns=" attribute), as a Namespace object.
node.namespace=(ns)
    # Set the default namespace on this node (as would be defined with an
    # "xmlns=" attribute in XML source), as a Namespace object ns. Note that a
    # Namespace added this way will NOT be serialized as an xmlns attribute for
    # this node. You probably want #default_namespace= instead, or perhaps
    # #add_namespace_definition with a nil prefix argument.
node.namespace_definitions
    # returns namespaces defined on self element directly, as an array of
    # Namespace objects. Includes both a default namespace (as in "xmlns="), and
    # prefixed namespaces (as in "xmlns:prefix=").
node.namespace_scopes
    # returns namespaces in scope for self (those defined on self element
    # directly or any ancestor node) as an array of Namespace objects. Default
    # namespaces ("xmlns=" style) for self are included in this array; default
    # namespaces for ancestors, however, are not. See also #namespaces
node.namespaced_key?(attribute, namespace)
    # Returns true if attribute is set with namespace
node.namespaces # Returns a Hash of {prefix => value} for all namespaces on this node and its ancestors.
    # This method returns the same namespaces as #namespace_scopes.
    #
    # Returns namespaces in scope for self (those defined on self element
    # directly or any ancestor node) as a Hash of attribute-name/value pairs.
    # Note that the keys in this hash are the XML attributes that would be used
    # to define this namespace, such as "xmlns:prefix", not just the prefix.
    # Default namespace set on self will be included with key "xmlns". However,
    # default namespaces set on ancestor will NOT be, even if self has no
    # explicit default namespace.
# see also attribute_with_ns

# Rubyisms
node <=> another_node # Compare two Node objects with respect to their Document. Nodes from different documents cannot be compared.
    # uses xmlXPathCmpNodes "Compare two nodes w.r.t document order"
node == another_node # compares pointer_id
node.clone # alias node.dup
    # Copy this node. An optional depth may be passed in, but it defaults to a deep copy. 0 is a shallow copy, 1 is a deep copy.

# Visitor pattern
node.accept(visitor) # calls visitor.visit(self)

# Write it out (sorted from most flexible/hardest to use to least flexible/easiest to use)
node.write_to(io, *options)
    # Write Node to +io+ with +options+. +options+ modify the output of
    # this method. Valid options are:
    #
    # * +:encoding+ for changing the encoding
    # * +:indent_text+ the indentation text, defaults to one space
    # * +:indent+ the number of +:indent_text+ to use, defaults to 2
    # * +:save_with+ a combination of SaveOptions constants.
    # SaveOptions
    #   AS_BUILDER: Save builder created document
    #   AS_HTML: Save as HTML
    #   AS_XHTML: Save as XHTML
    #   AS_XML: Save as XML
    #   DEFAULT_HTML: the default for HTML document
    #   DEFAULT_XHTML: the default for XHTML document
    #   DEFAULT_XML: the default for XML documents
    #   FORMAT: Format serialized xml
    #   NO_DECLARATION: Do not include declarations
    #   NO_EMPTY_TAGS: Do not include empty tags
    #   NO_XHTML: Do not save XHTML
    # e.g. node.write_to(io, :encoding => 'UTF-8', :indent => 2)
node.write_html_to(io, options={}) # uses write_to with :save_with => DEFAULT_HTML option (libxml2.6 does dump_html)
node.write_xhtml_to(io, options={}) # uses write_to with :save_with => DEFAULT_XHTML option (libxml2.6 does dump_html)
node.write_xml_to(io, options={}) # uses write_to with :save_with => DEFAULT_XML option

node.serialize # Serialize Node to a string using +options+, provided as a hash or block. Uses write_to (via StringIO)
    # node.serialize(:encoding => 'UTF-8', :save_with => FORMAT | AS_XML)
    # node.serialize(:encoding => 'UTF-8') do |config|
    #   config.format.as_xml
    # end
node.to_html(options={}) # serializes with :save_with => DEFAULT_HTML option (libxml2.6 does dump_html)
node.to_xhtml(options={}) # serializes with :save_with => DEFAULT_XHTML option (libxml2.6 does dump_html)
node.to_xml(options={}) # serializes with :save_with => DEFAULT_XML option

node.to_s # document.xml? ? to_xml : to_html
node.inspect
node.pretty_print(pp) # to enhance pp

# Utility
node.encode_special_chars(str) # Encodes special characters :P
node.fragment(tags) # Create a DocumentFragment containing tags that is relative to this context node.
node.parse(string_or_io, options={})
    # Parse +string_or_io+ as a document fragment within the context of
    # *this* node. Returns a XML::NodeSet containing the nodes parsed from
    # +string_or_io+.

# External subsets, like DTD declarations
node.create_external_subset(name, external_id, system_id)
node.create_internal_subset(name, external_id, system_id)
node.external_subset
node.internal_subset

# Other:
node.description # Fetch the Nokogiri::HTML::ElementDescription for this node. Returns nil on XML documents and on unknown tags.
    # e.g. if node is an <img> tag: Nokogiri::HTML::ElementDescription['img']
    # #<Nokogiri::HTML::ElementDescription: img embedded image>
node.decorate! # Decorate this node with the decorators set up in this node's Document. Used internally to provide Slop support and Hpricot compatibility via Nokogiri::Hpricot
node.do_xinclude # options as a block or hash
    # Do xinclude substitution on the subtree below node. If given a block, a
    # Nokogiri::XML::ParseOptions object initialized from +options+, will be
    # passed to it, allowing more convenient modification of the parser options.
```

## Working with a [Nokogiri::XML::NodeSet](http://nokogiri.org/Nokogiri/XML/NodeSet.html)

``` ruby
nodes = Nokogiri::XML::NodeSet.new(document, list=[])

# Set operations
nodes | other_nodeset # UNION, i.e. merging the sets, returning a new set
nodes + other_nodeset # UNION, i.e. merging the sets, returning a new set
nodes & other_nodeset # INTERSECTION, i.e. return a new NodeSet with the common nodes only
nodes - other_nodeset # DIFFERENCE Returns a new NodeSet containing the nodes in this NodeSet that aren't in other_nodeset
nodes.include?(node)
nodes.empty?
nodes.length # alias nodes.size
nodes.delete(node) # Delete node from the Nodeset, if it is a member. Returns the deleted node if found, otherwise returns nil.

# List operations (includes Enumerable)
nodes.each {|node| }
nodes.first
nodes.last
nodes.reverse # Returns a new NodeSet containing all the nodes in the NodeSet in reverse order
nodes.index(node) # returns the numeric index or nil
nodes[3] # element at index 3
nodes[3,4] # return a NodeSet of size 4, starting at index 3
nodes[3..6] # or return a NodeSet using a range of indexes
    # alias nodes.slice
nodes.pop # Removes the last element from set and returns it, or nil if the set is empty
nodes.push(node) # alias nodes << node
    # Append node to the NodeSet.
nodes.shift # Returns the first element of the NodeSet and removes it. Returns nil if the set is empty.
nodes.filter(expr) # Filter this list for nodes that match expr.
    # Implemented as find_all { |node| node.matches?(expr) }, so it returns an Array rather than a NodeSet
nodes.children # Returns a new NodeSet containing all the children of all the nodes in the NodeSet

# Content
nodes.inner_html(*args) # Get the inner html of all contained Node objects
nodes.inner_text # alias nodes.text

# Convenience modifiers
nodes.remove # alias of nodes.unlink
    # Unlink this NodeSet and all Node objects it contains from their current context.
nodes.wrap("<div class='container'></div>") # wrap new xml around EACH NODE in a Nodeset
nodes.before(datum) # Insert datum before the first Node in this NodeSet
    # e.g. first.before(datum)
nodes.after(datum) # Insert datum after the last Node in this NodeSet
    # e.g. last.after(datum)
nodes.attr(key, value) # set the attribute key to value on all Node objects in the NodeSet
nodes.attr(key) { |node| 'value' } # set the attribute key to the result of the block on all Node objects in the NodeSet
    # alias nodes.attribute, nodes.set
nodes.remove_attr(name) # removes the attribute from all nodes in the nodeset
nodes.add_class(name) # Append the class attribute name to all Node objects in the NodeSet.
nodes.remove_class(name = nil) # if nil, removes the class attribute from all nodes in the nodeset

# Searching
nodes.search(*paths) # alias nodes / path
nodes.at(*paths) # alias nodes % path
nodes.xpath(*paths)
nodes.at_xpath(*paths)
nodes.css(*rules)
nodes.at_css(*rules)
nodes > selector # Search this NodeSet's nodes' immediate children using CSS selector selector

# Writing out
nodes.to_a # alias nodes.to_ary
    # Return this list as an Array
nodes.to_html(*args)
nodes.to_s
nodes.to_xhtml(*args)
nodes.to_xml(*args)

# Rubyisms
nodes == nodes # Two NodeSets are equal if they contain the same number of elements and if each element is equal to the corresponding element in the other NodeSet
nodes.dup # Duplicate this node set
nodes.inspect
```

## Miscellany

``` ruby
nc = Nokogiri::HTML::NamedCharacters # a Nokogiri::HTML::EntityLookup
nc[key] # like nc.get(key).try(:value)
    # e.g. nc['gt'] (62) or nc['rsquo'] (8217)
nc.get(key) # returns a Nokogiri::HTML::EntityDescription
    # e.g. nc.get('rsquo') #=> #<struct Nokogiri::HTML::EntityDescription value=8217, name="rsquo", description="right single quotation mark, U+2019 ISOnum">

# Adding a Processing Instruction (like <?xml-stylesheet?>)
# Nokogiri::XML::ProcessingInstruction http://nokogiri.org/tutorials/modifying_an_html_xml_document.html
pi = Nokogiri::XML::ProcessingInstruction.new(doc, "xml-stylesheet", 'type="text/xsl" href="foo.xsl"')
doc.root.add_previous_sibling(pi)
```

## [Reader](http://nokogiri.org/Nokogiri/XML/Reader.html) parsers

Reader parsers can be used to parse very large XML documents quickly without the need to load the entire document into memory or write a SAX document parser. The reader makes each node in the XML document available exactly once, only moving forward, like a cursor.

``` ruby
reader = Nokogiri::XML::Reader(string_or_io)
# attrs
#   .encoding
#   .errors
#   .source

# Reading
reader.each {|node| } # node and reader are the same object. shortcut for while(node = self.read) yield(node); end;
reader.read # Move the Reader forward through the XML document.

node.name
node.local_name

# Attributes
node.attribute('src')
node.attribute_at(1)
node.attribute_count
node.attribute_nodes
node.attributes
node.attributes?

# Content
node.empty_element?
node.self_closing?
node.value # Get the text value of the node if present as a utf-8 encoded string. Does NOT advance the reader.
node.value? # Does this node have a text value?
node.inner_xml # Read the contents of the current node, including child nodes and markup into a utf-8 encoded string. Does NOT advance the reader
node.outer_xml # Does NOT advance the reader

node.base_uri # Get the xml:base of the node
node.default? # Was an attribute generated from the default value in the DTD or schema?
node.depth

# Namespaces and the rest
node.namespace_uri # Get the URI defining the namespace associated with the node
node.namespaces # Get a hash of namespaces for this Node
node.prefix # Get the shorthand reference to the namespace associated with the node.
node.xml_version # Get the XML version of the document being read
node.lang # Get the xml:lang scope within which the node resides.
node.node_type # one of
    # TYPE_ATTRIBUTE
    # TYPE_CDATA
    # TYPE_COMMENT
    # TYPE_DOCUMENT
    # TYPE_DOCUMENT_FRAGMENT
    # TYPE_DOCUMENT_TYPE
    # TYPE_ELEMENT
    # TYPE_END_ELEMENT
    # TYPE_END_ENTITY
    # TYPE_ENTITY
    # TYPE_ENTITY_REFERENCE
    # TYPE_NONE
    # TYPE_NOTATION
    # TYPE_PROCESSING_INSTRUCTION
    # TYPE_SIGNIFICANT_WHITESPACE
    # TYPE_TEXT
    # TYPE_WHITESPACE
    # TYPE_XML_DECLARATION
node.state # Get the state of the reader
```

## XSD Validation

[XSD](http://nokogiri.org/XSD.html)
[XSD::XMLParser](http://nokogiri.org/XSD/XMLParser.html)
[XSD::XMLParser::Nokogiri](http://nokogiri.org/XSD/XMLParser/Nokogiri.html)

``` ruby
xsd = Nokogiri::XML::Schema(string_or_io_to_schema_file)
doc = Nokogiri::XML(File.read(PO_XML_FILE))
xsd.valid?(doc) # => true/false
xsd.validate(doc) # returns an array of SyntaxErrors
xsd.validate(doc).each do |syntax_error|
  syntax_error.error?
  syntax_error.fatal?
  syntax_error.none?
  syntax_error.to_s
  syntax_error.warning?

  # undocumented attributes (R: read-only)
  syntax_error.code   # R
  syntax_error.column # R
  syntax_error.domain # R
  syntax_error.file   # R
  syntax_error.int1   # R
  syntax_error.level  # R
  syntax_error.line   # R
  syntax_error.str1   # R
  syntax_error.str2   # R
  syntax_error.str3   # R
end

# http://nokogiri.org/Nokogiri/XML/Schema.html
# http://nokogiri.org/Nokogiri/XML/AttributeDecl.html
# http://nokogiri.org/Nokogiri/XML/DTD.html
# http://nokogiri.org/Nokogiri/XML/ElementDecl.html
# http://nokogiri.org/Nokogiri/XML/ElementContent.html
# http://nokogiri.org/Nokogiri/XML/EntityDecl.html
# http://nokogiri.org/Nokogiri/XML/EntityReference.html

doc.validate # validate it against its DTD, if it has one
```

## CSS Parsing

[Nokogiri::CSS](http://nokogiri.org/Nokogiri/CSS.html)
[Nokogiri::CSS::Node](http://nokogiri.org/Nokogiri/CSS/Node.html)
[Nokogiri::CSS::Parser](http://nokogiri.org/Nokogiri/CSS/Parser.html)
[Nokogiri::CSS::SyntaxError](http://nokogiri.org/Nokogiri/CSS/SyntaxError.html)
[Nokogiri::CSS::Tokenizer](http://nokogiri.org/Nokogiri/CSS/Tokenizer.html)
[Nokogiri::CSS::Tokenizer::ScanError](http://nokogiri.org/Nokogiri/CSS/Tokenizer/ScanError.html)

``` ruby
# http://nokogiri.org/Nokogiri/CSS.html
Nokogiri::CSS.parse('selector') # => returns an AST
Nokogiri::CSS.xpath_for('selector', options={})

# http://nokogiri.org/Nokogiri/CSS/Node.html
# attr: type, value
# methods
#   accept(visitor)
#   find_by_type
#   new
#   preprocess!
#   to_a
#   to_type
#   to_xpath

# http://nokogiri.org/Nokogiri/CSS/Parser.html
# a Racc generated Parser
```

## XSLT Transformation

[Nokogiri::XSLT](http://nokogiri.org/Nokogiri/XSLT.html)
[Nokogiri::XSLT::Stylesheet](http://nokogiri.org/Nokogiri/XSLT/Stylesheet.html)

``` ruby
doc = Nokogiri::XML(File.read('some_file.xml'))
xslt = Nokogiri::XSLT(File.read('some_transformer.xslt'))
puts xslt.transform(doc) # [, xslt_parameters]

# xslt.serialize(doc) # to an xml string
# xslt.apply_to(doc, params=[]) # equivalent to xslt.serialize(xslt.transform(doc, params))
```

## [SAX](http://nokogiri.org/Nokogiri/XML/SAX.html) Parsing

Event-driven XML parsing, appropriate for reading very large XML files without reading the entire document into memory.
[The best documentation is in this file.](https://github.com/sparklemotion/nokogiri/blob/master/lib/nokogiri/xml/sax/document.rb)

``` ruby
# Document template
# Define any or all of these methods to get their notifications:
# Your document doesn't have to subclass Nokogiri::XML::SAX::Document,
# doing so just saves you from having to define all the sax methods,
# rather than the few you need.
class MyDocument < Nokogiri::XML::SAX::Document
  def xmldecl(version, encoding, standalone)
  end
  def start_document
  end
  def end_document
  end
  def start_element(name, attrs = [])
  end
  def end_element(name)
  end
  def start_element_namespace(name, attrs = [], prefix = nil, uri = nil, ns = [])
  end
  def end_element_namespace(name, prefix = nil, uri = nil)
  end
  def characters(string)
  end
  def comment(string)
  end
  def warning(string)
  end
  def error(string)
  end
  def cdata_block(string)
  end
end

# Standard Parser
parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new) # [, encoding = 'UTF-8']
# A block can be passed to the parse methods to get the ParserContext before parsing, but you probably don't need that
parser.parse(string_or_io)
parser.parse_io(io) # [, encoding = "ASCII"]
parser.parse_file(filename)
parser.parse_memory(string)

# If you want HTML correction features, instantiate this parser instead
parser = Nokogiri::HTML::SAX::Parser.new(MyDocument.new)
```

(If you're a weirdo,) you can stream the XML manually using [Nokogiri::XML::SAX::PushParser](http://nokogiri.org/Nokogiri/XML/SAX/PushParser.html)
The best documentation is [this file](https://github.com/sparklemotion/nokogiri/blob/master/lib/nokogiri/xml/sax/push_parser.rb).

## [Slop](http://nokogiri.org/Nokogiri/Decorators/Slop.html) decorator

(Don’t use this)
The ::Slop decorator implements method_missing such that methods may be used instead of CSS or XPath.
See the bottom of [this page](http://nokogiri.org/tutorials/searching_a_xml_html_document.html)

[Nokogiri.Slop](http://nokogiri.org/Nokogiri.html#method-c-Slop)
[Nokogiri::XML::Document#slop!](http://nokogiri.org/Nokogiri/XML/Document.html#method-i-slop-21)
[Nokogiri::Decorators::Slop](http://nokogiri.org/Nokogiri/Decorators/Slop.html)

``` ruby
doc = Nokogiri::Slop(string_or_io)
doc = Nokogiri(string_or_io).slop!
doc = Nokogiri::HTML(string_or_io).slop!
doc = Nokogiri::XML(string_or_io).slop!

doc = Nokogiri::Slop(<<-eohtml)
  <html>
    <body>
      <p>first</p>
      <p>second</p>
    </body>
  </html>
eohtml
assert_equal('second', doc.html.body.p[1].text)

doc = Nokogiri::Slop <<-EOXML
  <employees>
    <employee status="active">
      <fullname>Dean Martin</fullname>
    </employee>
    <employee status="inactive">
      <fullname>Jerry Lewis</fullname>
    </employee>
  </employees>
EOXML

# navigate!
doc.employees.employee.last.fullname.content # => "Jerry Lewis"

# access node attributes!
doc.employees.employee.first["status"] # => "active"

# use some xpath!
doc.employees.employee("[@status='active']").fullname.content # => "Dean Martin"
doc.employees.employee(:xpath => "@status='active'").fullname.content # => "Dean Martin"

# use some css!
doc.employees.employee("[status='active']").fullname.content # => "Dean Martin"
doc.employees.employee(:css => "[status='active']").fullname.content # => "Dean Martin"
```