November 15, 2013

Parsing XML (and HTML) in Ruby with Nokogiri

There are many tools that can help you extract data from XML in Ruby, but Nokogiri appears to be the most popular.

It is also very easy to use. Here is a very quick example.

require 'nokogiri'

HTML = %{
  <ul>
    <li><a href="#">Page 1</a> <span>Awesome page</span></li>
    <li><a href="#">Page 2</a> <span>Not so awesome page</span></li>
    <li>Page 3 <span>(no link yet)</span></li>
    <li><a href="...">Page 4</a> <span>broken link</span></li>
  </ul>
}

So we have some HTML to parse. First, let's create a Nokogiri document from it.

html = Nokogiri::XML(HTML)

Then find all a elements and extract the text from them.

html.css('a').map(&:text)
# => ["Page 1", "Page 2", "Page 4"]

Pretty easy, right? We can do the same with span elements.

html.css('span').map(&:text)
# => ["Awesome page", "Not so awesome page", "(no link yet)", "broken link"]

But what about all span elements inside li elements that also contain an a element? No problem.

html.css('li:has(a) span').map(&:text)
# => ["Awesome page", "Not so awesome page", "broken link"]

Or even better, let's find all a elements inside li elements that contain a span with the text 'broken link'.

html.css('li:has(span[text()="broken link"]) a').map(&:text)
# => ["Page 4"]

So as you can see, it is pretty easy to parse XML (and therefore HTML too) with Nokogiri.

You can use CSS selectors, as shown above. You can also use XPath, although I do not have much experience with it.

Anyway, I believe there is much more you can do with it. Just check the Nokogiri documentation for more information.

Hey there!

My name is Patrik Bóna and I am the only programmer at Memberful. This blog is kind of dead, but I just started my own Ruby on Rails screencast. Follow me on Twitter if you want to be notified about my newest videos.