Sermon Scraping Revisited

2025-11-27

At the beginning of the year, I published an article describing my efforts to archive the sermon audio files from my church's web site using scraping techniques. That solution worked well for a few weeks. Unfortunately, at the end of January, the program began throwing an error whenever it was executed:

Traceback (most recent call last):
  File "C:\Users\chris\Desktop\scrape_peace_subsplash_sermons.py", line 100, in <module>
    media_data = get_sermon_pages(base_url, library_url)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\chris\Desktop\scrape_peace_subsplash_sermons.py", line 75, in get_sermon_pages
    get_sermon_pages(base_url, base_url + link)
  File "C:\Users\chris\Desktop\scrape_peace_subsplash_sermons.py", line 75, in get_sermon_pages
    get_sermon_pages(base_url, base_url + link)
  File "C:\Users\chris\Desktop\scrape_peace_subsplash_sermons.py", line 75, in get_sermon_pages
    get_sermon_pages(base_url, base_url + link)
  File "C:\Users\chris\Desktop\scrape_peace_subsplash_sermons.py", line 66, in get_sermon_pages
    preacher = subs[1]
               ~~~~^^^
IndexError: list index out of range

This error indicated that the subs list was not getting populated with more than one element when a particular sermon listing was encountered during the collection process. The collection process ran within the get_sermon_pages() function and the subs list population occurred between lines 62 and 66:

subs = divs[1].split("  •  ")

title = divs[2]
date = subs[0]
preacher = subs[1]

As noted in the previous article, this collected the title, date, and preacher values from each sermon listing on the media pages. Until the end of January, each sermon listing looked like this:

The date and preacher values were displayed, separated by a bullet character. However, the sermon listing created on January 26, 2025, did not include a preacher name:

It only listed the date. When I saw this, I realized that I had wrongly assumed the preacher name was required when a sermon listing was created. This was obviously not the case. Fortunately, the fix was fairly trivial. After the div content was stored in the divs list, I added a statement to check for the presence of the bullet character:

if ("•" in divs[1]):
    subs = divs[1].split("  •  ")

    title = divs[2]
    date = subs[0]
    preacher = subs[1]
    return_data = [base_url + link, title, date, preacher]

    print(base_url + link)
    media_data.append(return_data)
else:
    title = divs[2]
    date = divs[1]
    preacher = "Guest Pastor"
    return_data = [base_url + link, title, date, preacher]

    print(base_url + link)
    media_data.append(return_data)

This simple if/else statement checks whether the bullet character is present in the value of the second divs element. If it is, then the program continues as it did originally, by assigning the value of subs[1] to the preacher variable. However, if that character isn't present, then the program statically assigns the value of Guest Pastor to the preacher variable.
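The failure and the fix can both be reproduced without touching the network. The sketch below is not taken from the scraping program: parse_listing is a hypothetical helper, "John Doe" is a placeholder name, and the separator and the Guest Pastor fallback are assumed to match the ones used above. It uses str.partition() as one possible way to avoid repeating the assignments in both branches.

```python
SEPARATOR = "  •  "  # two spaces, bullet, two spaces, as in the original split() call


def parse_listing(subtitle, default_preacher="Guest Pastor"):
    """Split a listing subtitle into (date, preacher).

    parse_listing is a hypothetical helper, not a function from the
    original program.
    """
    # partition() always returns three parts; when the separator is absent,
    # the second and third parts are empty strings rather than missing.
    date, _, preacher = subtitle.partition(SEPARATOR)
    return date.strip(), preacher.strip() or default_preacher


# With only a date, split() returns a one-element list, so subs[1] fails.
print("January 26, 2025".split(SEPARATOR))             # ['January 26, 2025']
print(parse_listing("January 19, 2025  •  John Doe"))  # ('January 19, 2025', 'John Doe')
print(parse_listing("January 26, 2025"))               # ('January 26, 2025', 'Guest Pastor')
```

The if/else shown earlier remains the change that was actually made to the program; this is only an alternative shape for the same check.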

Once this change was committed, the program once again ran without throwing any errors.

Unfortunately, that error-free state was short-lived. Within a few months, a new error began occurring on every attempt to download a sermon audio file:

Traceback (most recent call last):
  File "C:\Users\chris\Desktop\scrape_peace_subsplash_sermons.py", line 138, in <module>
    mp3 = requests.get(mp3_link)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\chris\AppData\Local\anaconda3\Lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\chris\AppData\Local\anaconda3\Lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\chris\AppData\Local\anaconda3\Lib\site-packages\requests\sessions.py", line 575, in request
    prep = self.prepare_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\chris\AppData\Local\anaconda3\Lib\site-packages\requests\sessions.py", line 484, in prepare_request
    p.prepare(
  File "C:\Users\chris\AppData\Local\anaconda3\Lib\site-packages\requests\models.py", line 367, in prepare
    self.prepare_url(url, params)
  File "C:\Users\chris\AppData\Local\anaconda3\Lib\site-packages\requests\models.py", line 438, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL 'None': No scheme supplied. Perhaps you meant https://None?

This indicated that the call to the get_mp3_link() function on line 134 was not returning a link, since the value of the mp3_link variable passed to requests.get() on line 138 was None. The get_mp3_link() function is fairly simple; it just finds and returns the value of the only source tag within the embedded media page. Its failure to return a link suggested that Subsplash had made a change that broke this process.
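A guard around the download step would turn a long traceback like this one into a single readable message. The following is a minimal sketch rather than code from the scraping program: download_audio and its http_get parameter are hypothetical names, the example.com URL is a placeholder, and http_get stands in for requests.get so the example runs without a network connection.

```python
def download_audio(mp3_link, page_url, http_get):
    """Return the audio bytes, or None when the page yielded no link.

    download_audio and http_get are hypothetical names; in the real
    program http_get would be requests.get.
    """
    if not mp3_link:
        # get_mp3_link() returned None, so there is nothing to request.
        print(f"No MP3 link found on {page_url}; skipping")
        return None

    response = http_get(mp3_link, timeout=30)
    response.raise_for_status()  # surface HTTP errors instead of saving an error page
    return response.content


# Demonstrates the None case only, so no request is made.
print(download_audio(None, "https://example.com/embed/123", http_get=None))
```

Skipping the file and continuing is one design choice; raising a custom exception would be equally reasonable if a missing link should stop the run.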

Checking the content of the embedded media page revealed that Subsplash had made significant changes. The link to the MP3 file was no longer contained within a set of simple audio and source tags. It was now contained in a JSON-like string that was embedded within a script tag:

\"_links\":{\"download\":{\"href\":\"https://core.subsplash.com/files/download?type=audio-outputs\u0026id=00c6839a-98fa-480e-ba15-d3f24b6a25de\u0026filename={filename}.mp3\"},\"related\":{\"href\":\"https://cdn.subsplash.com/audios/5G6VH2/00c6839a-98fa-480e-ba15-d3f24b6a25de/audio.mp3\"}

Briefly researching this string revealed that it is a content payload embedded within a Next.js hydration script. Although this change made archiving the audio file slightly more complicated, its effects could be overcome with a few changes of my own.

Since I've rarely encountered JSON strings which actually adhere to the JSON standard, I decided that attempting to parse the JSON string and extract the MP3 link was not the best course of action. Instead, I decided to use a regular expression to get the MP3 link.
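For comparison, here is a rough sketch of what the parsing route would involve for the excerpt shown above. It is illustrative only and rests on two assumptions: that the excerpt sits inside a JavaScript string literal (so it must be unescaped first), and that the braces added by hand below match the surrounding structure, which the excerpt does not show.

```python
import json

# Excerpt of the escaped payload from the script tag, as shown above.
fragment = (
    r'\"_links\":{\"download\":{\"href\":\"https://core.subsplash.com/files/'
    r'download?type=audio-outputs\u0026id=00c6839a-98fa-480e-ba15-d3f24b6a25de'
    r'\u0026filename={filename}.mp3\"},\"related\":{\"href\":\"https://cdn.'
    r'subsplash.com/audios/5G6VH2/00c6839a-98fa-480e-ba15-d3f24b6a25de/audio.mp3\"}'
)

# Layer 1: undo the string-literal escaping (\" becomes ", \u0026 becomes &).
unescaped = json.loads('"' + fragment + '"')

# Layer 2: the excerpt lacks its outer braces and the "_links" object is cut
# off, so both are closed by hand here (an assumption made only so that this
# excerpt parses).
data = json.loads('{' + unescaped + '}}')

print(data["_links"]["related"]["href"])
```

Two decoding passes plus locating the object boundaries inside the full script are needed before the link can be read, which is the extra work the regular expression avoids.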

I began this process by getting the MP3 links from a few different embedded media pages:

https://cdn.subsplash.com/audios/5G6VH2/cd8f55ce-6943-43f4-bed5-71270041f380/audio.mp3
https://cdn.subsplash.com/audios/5G6VH2/3c192ef4-da7a-4e95-b6ae-4c642392e030/audio.mp3
https://cdn.subsplash.com/audios/5G6VH2/00c6839a-98fa-480e-ba15-d3f24b6a25de/audio.mp3

Each appeared to follow a consistent pattern:

https://cdn.subsplash.com/audios/5G6VH2/{UID}/audio.mp3

This made developing a regular expression to match the pattern fairly trivial. The expression developed is:

https://cdn\.subsplash\.com/[A-Za-z0-9/_-]+\.mp3

Since 5G6VH2 appeared to be some sort of subscriber ID, I decided to let the character class match that path segment along with the UID rather than hard-coding it, just in case it ever changes. With the regular expression developed, I then used the re library to perform the matching. This started with adding the following line near the top of the main function:

mp3_regex = re.compile(r'https://cdn\.subsplash\.com/[A-Za-z0-9/_-]+\.mp3')
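As a quick check, the compiled expression can be run against the payload excerpt shown earlier. This sketch uses only the re library and strings already quoted in this article; the variable names payload and match are chosen here for illustration.

```python
import re

mp3_regex = re.compile(r'https://cdn\.subsplash\.com/[A-Za-z0-9/_-]+\.mp3')

# Excerpt of the escaped payload shown earlier in the article.
payload = (
    r'\"_links\":{\"download\":{\"href\":\"https://core.subsplash.com/files/'
    r'download?type=audio-outputs\u0026id=00c6839a-98fa-480e-ba15-d3f24b6a25de'
    r'\u0026filename={filename}.mp3\"},\"related\":{\"href\":\"https://cdn.'
    r'subsplash.com/audios/5G6VH2/00c6839a-98fa-480e-ba15-d3f24b6a25de/audio.mp3\"}'
)

# search() returns the first match; the core.subsplash.com download link is
# skipped because the pattern only accepts the cdn.subsplash.com host.
match = mp3_regex.search(payload)
print(match.group(0) if match else None)
# https://cdn.subsplash.com/audios/5G6VH2/00c6839a-98fa-480e-ba15-d3f24b6a25de/audio.mp3
```

Because the escaped quote that follows the link is not in the character class, the match ends cleanly at .mp3 without any unescaping of the payload.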

Then, the get_mp3_link() function was modified to accept a second argument, perform the regex search, and return the matching link:

def get_mp3_link(url, mp3_regex):
    r = requests.get(url, timeout=30)
    soup = BeautifulSoup(r.content, 'html.parser')

    for script in soup.find_all('script'):
        # script.string is None when the tag has no single text child
        text = script.string
        if not text:    # skip script tags with no inline text
            continue

        mp3 = mp3_regex.search(text)    # first CDN MP3 link in this script, if any
        if mp3:
            return mp3.group(0)

    # no script tag contained a matching link
    return None

Finally, the call to the get_mp3_link() function in the main function was updated to include the mp3_regex variable as the second argument. Running this updated program successfully identified the MP3 link on each embedded media page and downloaded the file.

The program once again runs without error, until the next changes are made...

The source of the embedded media page encountered is available here: Embedded Media Content

The scraping program, updated as described in this article, is also available: scrape_peace_subsplash_sermons_v2.py