Today I’m going to focus on a very specific task: getting the HTML content between two tags using Python and the BeautifulSoup module.

This may sound very specific indeed, but if you google for “text between tags” or “extract content between tags” (don’t forget to append your language of choice to the end of these queries), you will find lots of misleading information and advice.

It seems the DOM works perfectly when you need to fetch attributes or data inside tags, but getting at information that doesn’t fit neatly into this model is not so easy.
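For contrast, here is a minimal sketch of the easy case, where the data lives in an attribute or inside a single tag. The snippet and the `example.com` URL are made up for illustration, and the built-in `html.parser` backend is an assumption (any installed parser should behave the same here).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from any real page
snippet = '<p class="intro">See the <a href="https://example.com/docs">docs</a>.</p>'
soup = BeautifulSoup(snippet, "html.parser")

# Attributes and inner text map directly onto the tree
link = soup.find("a")
href = link["href"]                  # attribute lookup works like a dict
link_text = link.get_text()          # text inside the tag
paragraph_classes = soup.p["class"]  # 'class' is multi-valued, so this is a list

print(href, link_text, paragraph_classes)
```

Everything here is one lookup away. The content that sits between two sibling tags has no such direct handle, which is what the rest of this post deals with.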

Let’s assume that we have HTML like this:

<h1>Header 1</h1>
<p>Paragraph 1
with lots of text inside it
</p>
<p>Paragraph 2
with some more text
</p>
<h1>Header 2</h1>
<p>Paragraph 1
with lots of text inside it
</p>
<p>Paragraph 2
with some more text
</p>

As you probably guessed already, we need the text inside those two paragraphs for each header. Of course, we also want to keep the header -> paragraph association while processing.

So, here we go:

#!/usr/bin/env python

from bs4 import BeautifulSoup

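# Open HTML file and parse it with Python's built-in parser
# ... naming the parser keeps results the same on every machine
with open('input.html') as html_file:
  doc = BeautifulSoup(html_file, 'html.parser')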

# Prepare array to store data
entries = []

# Find all 'h1' tags
for section in doc.find_all('h1'):

  # Get header text
  # ... split on '.', so 'header' ends up as a list of parts
  header = section.find_all(text=True)[0].split('.')

  # Get paragraph content
  # ... don't forget about Unicode
  content = u""

  # Find next tag
  for p in section.find_next_siblings():

    # ... if it's 'h1' tag - then stop, as we reached next header
    if p.name == 'h1':
      break

    # We can do some HTML cleanup here
    # ... remove 'span' tags
    if p.span:
      p.span.unwrap()
    # ... delete paragraph class
    del p['class']

    # Take care of newline characters
    # ... and tell Python to treat it as Unicode
    # (unicode() exists only in Python 2; use str() in Python 3)
    content += unicode(p).replace("\n", u' ')

    # Newline tag to properly separate paragraphs
    content += '<br/>'

  # Add new header plus its content into array
  entries.append({'header': header, 'content': content})

# Show the result
print(entries)

And the results:

[
  {'content': u'<p>Paragraph 1 with lots of text inside it </p><br/><p>Paragraph 2 with some more text </p><br/>',
   'header': [u'Header 1']},
  {'content': u'<p>Paragraph 1 with lots of text inside it </p><br/><p>Paragraph 2 with some more text </p><br/>',
   'header': [u'Header 2']}
]
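The same sibling walk can be sketched for Python 3 as a small reusable function. This is only one possible variant, not the script above: the function name `collect_sections`, the `stop_tag` parameter, and the shortened inline sample are assumptions made for illustration, and it keeps plain text per paragraph instead of the HTML markup.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document, inlined so the sketch is self-contained
HTML = """
<h1>Header 1</h1>
<p>Paragraph 1
with lots of text inside it
</p>
<p>Paragraph 2
with some more text
</p>
<h1>Header 2</h1>
<p>Paragraph 3
with lots of text inside it
</p>
"""


def collect_sections(html, stop_tag="h1"):
    """Group the tags that follow each `stop_tag` until the next one appears."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for header in soup.find_all(stop_tag):
        paragraphs = []
        # find_next_siblings() with no arguments yields only tags, not bare text nodes
        for sibling in header.find_next_siblings():
            if sibling.name == stop_tag:
                break  # reached the next header, so this section is complete
            # Collapse newlines and repeated whitespace into single spaces
            paragraphs.append(" ".join(sibling.get_text().split()))
        sections.append({"header": header.get_text(strip=True),
                         "content": paragraphs})
    return sections


sections = collect_sections(HTML)
for section in sections:
    print(section["header"], "->", section["content"])
```

Returning a list of paragraph strings per header, rather than one concatenated string with `<br/>` separators, leaves the choice of how to join them to the caller.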