## Making an Atom Feed using Selenium
by
Christoph Lohmann <20h@r-36.net>
## Intro
* The web is getting more complex.
* Pure-JavaScript, framework-generated websites leave nothing
to parse in the raw HTML.
* The only way to get any content out of them is to execute the JavaScript.
* Sadly, that requires a large part of a browser.
* A standalone JavaScript engine is not enough.
* Everything is intertwingled.
* Google wanted it that way.
## Basic Atom Feed Generation
{
	printf '<?xml version="1.0" encoding="UTF-8"?>\n'
	printf '<feed xmlns="http://www.w3.org/2005/Atom">\n'
	printf '<updated>%s</updated>\n' "$(date "+%FT%T%z")"
	hurl "$uri" \
	| grep content | sed 's,rawcontent,content,g' \
	| while read -r line;
	do
		printf '<entry>'
		printf '<content>%s</content>' "${line}"
		printf '</entry>\n'
	done
	printf '</feed>\n'
} > somefeed.atom
## How it evolved.
* frameworks like python requests
* webkit
* Small scraping browsers evolved.
* PhantomJS
--> They became outdated due to the speed web engines evolved.
--> Feature bloat.
--> Corporate need for new features, regardless of actual demand.
--> Sell more products.
* Intermediate steps of complex control protocols followed.
* I will skip them so you stay sane.
## Current State: WebDriver
https://w3c.github.io/webdriver/
> WebDriver is a remote control interface that enables introspection and
> control of user agents. It provides a platform- and language-neutral
> wire protocol as a way for out-of-process programs to remotely instruct
> the behavior of web browsers.
Web browsers expose HTTP endpoints:
POST /session/...
DELETE /session/...
GET /session/...
* Could be wrapped in C too.
* For fast prototyping we use Selenium and Python.
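The endpoints above can be exercised without Selenium; a minimal sketch using only the Python stdlib, assuming a chromedriver listening on its default port 9515 on localhost (the request is only constructed here, not sent):

```python
import json
from urllib import request

# Assumption: a chromedriver is running on its default port.
BASE = "http://localhost:9515"

def new_session_request():
    # W3C WebDriver "New Session": POST /session with a
    # capabilities object; the response carries the session id.
    body = json.dumps({
        "capabilities": {"alwaysMatch": {"browserName": "chrome"}}
    }).encode()
    return request.Request(BASE + "/session", data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

req = new_session_request()
print(req.get_method(), req.full_url)
# → POST http://localhost:9515/session
# With a driver running, it could be sent like this:
#   resp = json.load(request.urlopen(req))
#   session_id = resp["value"]["sessionId"]
```

Every further command (navigate, find element, get text) is just another HTTP request against /session/<id>/..., which is why a C wrapper is feasible.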
## Selenium Environment
1. Get Selenium
$ pip install selenium
# Huge bloat is installed.
2. Get a Chromium WebDriver
Normally included in your chromium installation at:
/usr/bin/chromedriver
Or:
Gentoo: emerge www-apps/chromedriver-bin
Binary package: https://chromedriver.chromium.org/downloads
## Selenium Environment
Other Web Browsers:
* Edge
* Firefox
* Internet Explorer
* Safari
All have their quirks:
https://www.selenium.dev/documentation/webdriver/browsers/
## Basic Selenium Script
#!/usr/bin/env python
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.bitreich.org")
driver.implicitly_wait(1.0)
print(driver.find_element(By.XPATH, "//*[@class=\"proletariat\"]").text)
Output: gophers://bitreich.org
## Selenium IDE
https://www.selenium.dev/selenium-ide/
Browser Extension for Firefox and Chromium to record interactions with
websites.
* Easily generate scripts from that.
## Select Content in Websites
Use driver.find_element (first match) or driver.find_elements (all matches).
Text to grep:
e = driver.find_element(By.ID, "text").text
e = driver.find_element(By.TAG_NAME, "elem").text
e = driver.find_element(By.CLASS_NAME, "info").text
e = driver.find_element(By.XPATH, "//p/elem").text
Others: By.NAME (forms), By.CSS_SELECTOR, By.LINK_TEXT,
By.PARTIAL_LINK_TEXT
e.get_attribute("meta")
## Stuff we won't handle here.
Selenium can do:
* input
* fill out forms
* emulate key presses
* upload files
* send forms
* scroll web pages
* do pen actions (tablet)
* mouse emulation
* drag and drop elements
* navigate browser history
## Stuff we won't handle here.
Selenium can do:
* window manipulation
* handle multiple tabs / windows
* handle iframes
* move windows around
* take screenshots
* print websites
* handle popup alerts
* set / get cookies
* let you run inline javascript
* do color animations
* debug javascript for you using the bidirectional protocol
* build huge action chains for time-perfect handling
## Complex Example
https://www.kvsachsen.de/
* The 'new modern' website of my doctors' association.
* Everything in a JavaScript framework.
* News is hidden behind loading even more JavaScript.
* No RSS feed.
News:
1. Open https://www.kvsachsen.de/fuer-praxen/aktuelle-informationen/praxis-news
2. Parse the JavaScript-rendered subframe.
## Complex Example
Get stuff ready:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as chromeoptions
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from datetime import datetime
import pytz
link = "https://www.kvsachsen.de/fuer-praxen/\
aktuelle-informationen/praxis-news"
## Complex Example
Get ChromeDriver ready:
options = chromeoptions()
chromearguments = [
"headless",
"no-sandbox",
"disable-extensions",
"disable-dev-shm-usage",
"start-maximized",
"window-size=1900,1080",
"disable-gpu"
]
for carg in chromearguments:
options.add_argument(carg)
driver = webdriver.Chrome(options=options)
## Complex Example
Get the content:
driver.get(link)
## Complex Example
Wait for the content to be ready and loaded with a timeout
of 60 seconds:
isnews = WebDriverWait(driver=driver, timeout=60).until(
EC.presence_of_element_located((By.XPATH,
"//div[@data-last-letter]")
)
)
EC ... Expected Condition
EC can be very many things:
https://www.selenium.dev/selenium/docs/api/py/\
webdriver_support/\
selenium.webdriver.support.expected_conditions.html
Pro Tip: Do not wait for a static time; use some EC instead. You will
be safer and have fewer errors.
## Complex Example
Get the root news element we work from:
newslist = driver.find_elements(By.XPATH,
"//div[@data-filter-target=\"list\"]")[0]
Get some metadata for the atom feed:
title = driver.find_elements(By.XPATH,
"//meta[@property=\"og:title\"]")[0].\
get_attribute("content")
description = title
## Complex example
Print the header of the atom feed to stdout:
print("""<?xml version="1.0" encoding="UTF-8"?>""")
print("""<feed xmlns="http://www.w3.org/2005/Atom">""")
print("\t<title>%s</title>" % (title))
print("\t<subtitle>%s</subtitle>" % (description))
print("\t<id>%s</id>" % (link))
print("\t<link href=\"%s\" rel=\"alternate\"/>" % (link))
print("\t<link href=\"%s\" rel=\"self\"/>" % (link))
Use the current data for updated:
utcnow = datetime.now(pytz.utc)
print("\t<updated>%s</updated>" % (utcnow.isoformat()))
## Complex example
Get the entries:
articles = newslist.find_elements(By.XPATH, "./div")
Prepare a base URI for appending relative links:
baselink = "/".join(link.split("/", 3)[:-1])
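That split is subtle enough to deserve a sanity check; the same slicing as a standalone sketch:

```python
# Split on the first three "/" (after the scheme and the empty
# authority slot), then drop the trailing path component.
link = ("https://www.kvsachsen.de/fuer-praxen/"
        "aktuelle-informationen/praxis-news")
# split("/", 3) → ['https:', '', 'www.kvsachsen.de', 'fuer-praxen/...']
baselink = "/".join(link.split("/", 3)[:-1])
print(baselink)  # → https://www.kvsachsen.de
```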
Loop over all entries in backwards style:
for article in articles[::-1]:
## Complex example
Find the deep link to the article:
link = article.find_elements(By.XPATH, "./a")[0]
plink = link.get_attribute("href")
Normalize the link in case it is relative:
if not plink.startswith("http"):
plink = "%s/%s" % (baselink, plink)
Get the entry title, content and set an absolute author:
ptitle = link.get_attribute("data-title")
pcontent = article.text
pauthor = "sachsen@kvsachsen.de"
## Complex example
Parse the datetime for the article release:
updateds = article.find_elements(By.XPATH, ".//time")[0].text
try:
dtupdated = datetime.strptime(updateds, "%d.%m.%Y")
except ValueError:
continue
Bring the datetime into python native format for further processing:
dtupdated = dtupdated.replace(hour=12, minute=0,\
second=0, tzinfo=pytz.utc)
if dtupdated.year > utcnow.year:
dtupdated = dtupdated.replace(year=utcnow.year)
pupdated = dtupdated
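The parsing and normalization above also works standalone; a sketch with the stdlib's timezone.utc standing in for pytz.utc (an assumption for illustration; the talk's script uses pytz) and an example date string:

```python
from datetime import datetime, timezone

# Example text in the dd.mm.yyyy format the page uses (assumed value).
updateds = "24.06.2023"
dtupdated = datetime.strptime(updateds, "%d.%m.%Y")
# Pin the time of day to noon UTC, since the page only gives a date.
dtupdated = dtupdated.replace(hour=12, minute=0, second=0,
                              tzinfo=timezone.utc)
print(dtupdated.isoformat())  # → 2023-06-24T12:00:00+00:00
```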
## Complex example
Print the entry:
print("\t<entry>")
print("\t\t<id>%s</id>" % (plink))
print("\t\t<title>%s</title>" % (ptitle))
print("\t\t<link href=\"%s\"/>" % (plink))
print("\t\t<author><name>%s</name></author>" % (pauthor))
print("\t\t<updated>%s</updated>" % (pupdated.isoformat()))
print("\t\t<content type=\"text\">%s</content>" % (pcontent))
print("\t</entry>")
Print the footer (out of feeds loop):
print("</feed>")
## Example Script
The full example script and how I use it can be found in:
git://bitreich.org/brcon2023-hackathons
./sfeed-atom/kvssachsen2atom
## Summary
* With Selenium you can script all of the modern web.
* We fight bloat with bloat.
* You run a full web process to parse the web.
* We wanted to avoid that with scraping.
* You can easily prototype web access in, for example, ipython(1).
* There are still privacy concerns.
* You run a huge blob of hundreds of thousands of sloc.
* Plato's cave allegory
## Plato's cave allegory
+--------------;,,.; ..\.|./,.
| .------(_)------
|# # (too bright!) - /,|.\
|# o =| o/ , /. |. .(
|# o|o =|o | / , .|, (_|
| | | = | ,.., ___|_,
+---------+----."' ''''''~~~~~~\____|~~~
* People are in a cave, watching the shadow figure of a hash,
presented to them by a narrator behind the wall to the exit
of the cave.
* When people want to leave the cave, they are blinded by
the sun. The sunlight hurts their eyes. They will go back into
the cave.
* The outside does not look as finely presented and prepared as
the narrator's shadow play, which does not hurt the eyes.
* Only some people are able to adapt their eyes and see the
beauty of not being dependent on a narrator. They will be able
to leave the cave.
## Questions?
Do you have questions?
## Thanks
Thank you for listening.
For further suggestions, contact me at
Christoph Lohmann <20h@r-36.net>