Yess…avatar…indigenous blue man group
— The Way of Water
My idea of fun is compulsively downloading stuff…
I recently pared back an impromptu script that archives the entirety of a Substack blog as a standardized data format (ePub via JSON). What if I could expand my domain and download from different publications as well?
Ripping a new source
I tried pulling articles from spikeartmagazine.com the same way that substack-to-json does, by just changing what elements it looks for. Well, surprise, after an hour or two it worked fine! It automatically batch-downloaded everything from a specific contributor (and probably also works with subjects and categories). But writing Python gets old fast…
I learned more about webdrivers on the way. Chiefly, that etaoin is a perfectly good webdriver library in Clojure. With a little work, the Python substack-to-json can be adapted, then later generalized.
Why did I want to do this? Lol.. I just wanted Dean Kissick's articles. He made a good playlist, gets referenced in the vibes scene, and now I trust him. I demonstrate my trust by repossessing his content, in some grey area between archiving and piracy.
Compare and contrast
In substack-to-json, the technique is: access an index page and crawl through it to find every relevant URL, then iterate those URLs to grab fully rendered page content. How general is this method for other written content?
Caveats certainly vary across platforms: to go through an entire Substack archive you need to simulate scrolling to the bottom, while on Spike everything loads at once.
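To make the shape concrete, here is a rough sketch of that two-phase crawl in Clojure with etaoin. The helpers collect-article-urls and scrape-article are hypothetical placeholders for whatever a given site needs, not functions from any library:

(require '[etaoin.api :as e])

;; Phase 1: open the index page and collect every article URL.
;; Phase 2: visit each URL and pull the rendered content.
;; `collect-article-urls` and `scrape-article` are hypothetical helpers.
(defn scrape-publication
  [index-url]
  (e/with-chrome-headless driver
    (e/go driver index-url)
    (let [urls (collect-article-urls driver)]
      (mapv (fn [url]
              (e/go driver url)
              (scrape-article driver))
            urls))))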
Styling static pages once articles are rehosted is another story. Substack is fine: put classes on a few tags, include their universal CSS file, and posts look as good as or better than at their origin. It was kind of magic. Spike's styles were a horrifying thicket of jQuery and Drupal, and their resources are CORS-blocked. Other than just being better engineered, Substack may enable cross-origin requests because it is a platform for serial content hosted on custom domains, rather than a single outlet? I don't know what situations warrant one config or the other.
Webdrivers
So, webdrivers seem absurd and a testament to how inverted the web is. You create a temporary engine that simulates an entire browser, invisibly or even visibly, and allows 'RPC'-style calls to interact with DOM minutiae in the most granular way. Much more control than I anticipated is possible, down to filling and submitting individual form elements, individual keypresses and clicks, etc., and that is only scratching the surface. The web is bizarre. Robots imitate human access patterns, humans are conformed to be data robots, and the entire ecosystem has become content farms.
But, in my vague first tests playing with etaoin, I am having a difficult time understanding how to adapt to a basic functional style. It's giving imperative. Interaction with a webdriver leans into mimicking an actual web surfing experience: the driver you reference cloaks a giant mutable browser runtime, one giant side-effecting entity.
Other things
I paid attention to and leveraged head metadata tags for the first time. Instead of throwing myself into the trenches of body content, much of the relevant information, like post dates, authors, and bylines, exists sanitized and isolated in the heads of pages! Who knew lol. (Every single web dev down to the entry level)
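For example, pulling a few common head tags with etaoin looks roughly like this. Which meta properties actually exist varies per site; og:title, author, and article:published_time are just typical guesses, not anything specific to Substack or Spike:

(require '[etaoin.api :as e])

;; A sketch of reading head metadata instead of digging through the body.
;; The exact meta tags are assumptions; inspect a real page to see what's there.
(defn head-meta
  [driver]
  {:title     (e/get-element-attr driver {:css "meta[property='og:title']"} "content")
   :author    (e/get-element-attr driver {:css "meta[name='author']"} "content")
   :published (e/get-element-attr driver {:css "meta[property='article:published_time']"} "content")})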
I had been using XPath since that's what the substack-to-json author was using. Overkill! Webdriver software generally lets the user query the DOM in whatever way is most convenient, like.. CSS selector chains. Easier, more readable, basically just as flexible! I would rather avoid yet another DSL and reuse a structure-referencing syntax I already know.
Trying a clj webdriver
etaoin, referenced above, is a pleasant surprise after I got a little mystified in Python. I know enough Python to adapt and hack on existing scripts to fit my use cases, but joy sparks are not exactly what float through my amygdala.
I will adapt each individual action I'd taken to Clojure. The original script I've used:
Starts a headless (invisible) browser instance:
(require '[etaoin.api :as e])

(e/with-chrome-headless
  {:path-browser "..."} ;; Woops, I didn't have Chrome installed
  driver                ;; A binding, not a reference! Gotcha
  ;; All subsequent forms run with the live driver bound
  )
with-chrome-headless (or with-chrome) completely self-contains a webdriver instance, executes anything in the body with a driver binding, then unalives itself no matter what! That is clean and pleasant.
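If I understand the macro right, it's roughly sugar over managing the driver yourself. A sketch of the hand-rolled equivalent:

(require '[etaoin.api :as e])

;; Roughly what with-chrome-headless spares you from writing by hand
(let [driver (e/chrome-headless)]
  (try
    (e/go driver "https://...")
    ;; ...any other scraping forms...
    (finally
      (e/quit driver)))) ;; teardown happens even if the body throws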
Visits a page which, at some point, will contain a link to every desired article:
(e/go driver "url ...")
;; You then perform subsequent actions.
Easy! (e/go returns nothing relevant)1
Loads the whole page:
I forsook this when parsing a simpler archive; this snippet is just a directly adapted proof of concept to demonstrate waiting.
;; `d` is the driver; keep scrolling until the page height stops changing
(let [heights (atom [1 0])]
  (while (not (zero? (apply - @heights)))
    (e/scroll-bottom d)
    (e/wait 0.5)
    (e/wait-invisible d {:class :post-preview-silhouette})
    (reset! heights [(e/js-execute d "return document.body.scrollHeight")
                     (first @heights)])))
Substack archives have infinite scroll. When you hit the bottom, it temporarily shows their version of a spinner with a particular class; once it disappears, either more posts have loaded or the end has been reached.
This is the exact behavior recreated from the original script. How can it be improved? It's messy: it depends on the temporary element appearing within a given time, then compares the current and previous scroll heights.
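One way I might clean it up (my sketch, not what the original script does): swap the atom-plus-while for loop/recur and stop when the scroll height stops growing.

;; Sketch: scroll until document.body.scrollHeight stops changing
(defn scroll-to-end!
  [d]
  (loop [prev -1]
    (e/scroll-bottom d)
    (e/wait 0.5)
    (e/wait-invisible d {:class :post-preview-silhouette})
    (let [height (e/js-execute d "return document.body.scrollHeight")]
      (when (not= height prev)
        (recur height)))))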
Saves the desired information, like the main author and a list of all posts:
In theory… The entire archive of articles will now be loaded on the page.
(e/get-element-text-el
  driver
  (e/query driver {:css "#content header h3.author"})) ;; e.g.
;; => "Some person"

(mapv #(e/get-element-inner-html-el driver %)
      (e/query-all driver {:css "#content .article-block"}))
;; => ["<h4>My big article</h4>\n<a href='https://...'><img></a>"
;;     "<h4>Another post..</h4>\n<a href='https://...'><img></a>"]
This snippet and the above would either sit inside a with-chrome block, or rely on (def driver (e/chrome ...)).
I hit a bit of a hurdle getting text and HTML results from different parts of a node all at once. But the answer is straightforward!
Simplified mvs:
(defn get-articles
  [d]
  (let [els (e/query-all d {:css ".article"})
        f   (fn [el]
              {:title (e/get-element-text-el d
                        (e/child d el {:css ".article-title"}))
               :url   (e/get-element-attr-el d
                        (e/child d el {:css ".article-title a"})
                        "href")})]
    (mapv f els)))
(defn scrape
  [archive-url]
  (e/with-chrome-headless driver
    (e/go driver archive-url)
    (get-articles driver)))

;; => [{:title "My article" :url "https://..."}
;;     {:title "Another article" :url "https://.."}]
That is actually soooo sexy, simple, and reasonable. Once I got out of the py/selenium style of mutating collectors, wow! This could be further streamlined or greatly expanded; it's just an example for clarity.
And scrapes the content of a post.
Hm, it started to get odd here. The script individuates all top-level elements within the content block, filters out Substack subscribe nudges, and pastes the article back together. It's easy to query all the elements, and retaining each element separately makes later parsing easier. But why can't I find an outer-html function?
Not a hint why, but @borkdude mentioned one solution in 2018 while trying to accomplish an outer-html grab:
(e/get-element-attr-el driver element "outerHTML")
Yeah works fine… But I agree it’s kind of weird it doesn’t exist. What about defining my own to learn more?
Here’s the source for e/get-element-inner-html-el:
(defmethod get-element-inner-html-el
  :default
  [driver el]
  {:pre [(some? el)]}
  (:value (execute {:driver driver
                    :method :get
                    :path [:session (:session driver)
                           :element el
                           :property :innerHTML]})))
Go deeper: etaoin.api/execute → etaoin.impl.client/call → it’s just using clj-http.client to make an HTTP request to the webdriver for the given property of a given element!
It should be trivial to copy the syntax2 and just change the property! And, it is:
(defn get-element-outer-html-el
  [driver el]
  (:value (e/execute {:driver driver
                      :method :get
                      :path [:session (:session driver)
                             :element el
                             :property :outerHTML]})))
Then, it can be plugged into a query for an individual article page’s content:
(defn text-html
  []
  (e/with-chrome-headless driver
    (e/go driver "https://...")
    (mapv #(get-element-outer-html-el driver %)
          (e/query-all driver {:css ".field-name-body .field-item > *"}))))

;; => ["<p>Thanks for reading this essay.</p>"
;;     "<p>Next paragraph</p>"
;;     "<img src=\"photo.jpg\">"]
Wow, that’s really all I need. It’s surprising actually. I have achieved the goal of having a robot visit a website, read an article really quickly, and report back to me what it learned. Thanks robot!
The practice project that implements what I described is here: kees-/mag-stripe.