Monolith – CLI tool for saving complete web pages as a single HTML file (github.com/y2z)
772 points by iscream26 on March 24, 2024 | 151 comments


Well this is fun... from the README here I learned I can do this on macOS:

    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
    --headless --incognito --dump-dom https://github.com > /tmp/github.html
And get an HTML file for a page after the JavaScript has been executed.

Wrote up a TIL about this with more details: https://til.simonwillison.net/chrome/headless

My own https://shot-scraper.datasette.io/ tool (which uses headless Playwright Chromium under the hood) has a command for this too:

    shot-scraper html https://github.com/ > /tmp/github.html
But it's neat that you can do it with just Google Chrome installed and nothing else.


Can shot-scraper load a bunch of content on an "infinite scroll" page before saving? I'm guessing Monolith can't as it has no JS. The most effective way I've found to work through the history of a big YouTube channel is to hold page-down for a while then save to a static "Web Page, Complete" HTML file, but it's a bit clunky.


shot-scraper has a feature that's meant to help with this: you can inject additional JavaScript into the page that can then act before the screenshot is taken.

I use that for things like accepting cookie banners, but using it to scroll down to trigger additional loading should work too.
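Something like this is the rough shape of it (a sketch, not tested against a real infinite-scroll page - the selector is made up, and if I remember the flag right it's --javascript):

    shot-scraper https://example.com/ \
      --javascript "
        // hypothetical cookie banner button - adjust the selector per site
        document.querySelector('#accept-cookies')?.click();
        // scroll to the bottom to nudge lazy/infinite loading
        window.scrollTo(0, document.body.scrollHeight);
      "
For a real infinite-scroll page you'd want to repeat the scroll (and give the network a moment) before the shot is taken.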

There's also a --wait-for option which takes a JavaScript expression and polls until it's true before taking the shot - useful if there's custom loading behavior you need to wait for.
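For example, polling until a (hypothetical) lazily-loaded element exists:

    shot-scraper https://example.com/ \
      --wait-for "document.querySelector('#lazy-loaded-content') !== null"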

Documentation here: https://shot-scraper.datasette.io/en/stable/screenshots.html


I found other problems in this area when trying to do this, e.g. a lot of landing pages have hidden content that only animates in when you scroll down, subscribe/cookie overlays and modals covering content, hero headers that take the full height of the viewport ("height: 100vh") so that if you make the page height large for the screenshot the header covers most of it, and sticky headers that get in the way if you try scrolling while taking multiple screenshots to stitch together at the end.

You can come up with workarounds for each, but it's still hacky and there's always going to be other pages that need special treatment.


Yes, it's a neat thing. I use a node script[1] that wraps the Chrome invocation to do CLI-driven acceptance testing, loading the site's acceptance tests[2]. I adopted the simple convention of removing all body elements on successful completion and checking the output string, but I've also considered other methods like embedding a JSON string and parsing it back out.

1 - https://simpatico.io/acceptance.js

2 - https://simpatico.io/acceptance


I guess while we're talking about useful CLI options for chrome, developers and hackers might enjoy this one... You can disable CORS in Chrome if you launch it from the command line with this switch: --disable-web-security

That's handy for when you're developing a front end and IT/devops hasn't approved/enabled the CORS settings on the backend yet, or if you're just hacking around and want to get data from somewhere that doesn't allow cross-domain requests.
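For example (illustrative paths; recent Chrome versions also insist on a separate --user-data-dir before they'll honor the flag):

    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
      --disable-web-security \
      --user-data-dir=/tmp/chrome-insecure-profile
Only do this with a throwaway profile you use for development.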


Can Firefox do the same?


Does shot-scraper have a workaround for sites that detect headless Chrome? e.g. news.com.au, nowsecure.nl


No, nothing like that. I wonder how that detection works?

I tried this and it took a shot of a "bot detected" screen:

    shot-scraper https://news.com.au/
But when I used interactive mode I could take the screenshot - run this:

    shot-scraper -i https://news.com.au/
It opens a Chrome window. Then hit "enter" in the CLI tool to take the screenshot.


Got this to work!

    shot-scraper https://news.com.au/ \
      --init-script 'delete Object.getPrototypeOf(navigator).webdriver' \
      --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0'
Code and screenshots and prototype --init-script feature here: https://github.com/simonw/shot-scraper/issues/147#issuecomme...


If you want to go down the nowsecure.nl rabbithole (often used as a benchmark for passing bot detection) [1] is a good resource. It includes a few fixes that undetected-chromedriver doesn’t.

1. https://seleniumbase.io/help_docs/uc_mode/#uc-mode


Nice! I was going to say that using experimental headless mode with the "User-Agent Switcher for Chrome" plugin might work too.


Love your work!


Thanks for sharing, a similar command is available on Windows as well.

I tested the command below in PowerShell:

    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' `
      --headless `
      --print-to-pdf="$env:USERPROFILE\Downloads\page.pdf" `
      '<url>'
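The --dump-dom variant from the top comment should work the same way (untested sketch):

    & 'C:\Program Files\Google\Chrome\Application\chrome.exe' `
      --headless `
      --dump-dom `
      'https://github.com' > "$env:USERPROFILE\Downloads\github.html"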


I wonder if there's an option to wait for a certain amount of time, or a particular event or something. I was trying to capture a page a few different ways, and most of them ended up with the Cloudflare "checking your browser" page.


Thank you for shot-scraper! I've tested it in the past, but something severely missing from all screenshot tools, shot-scraper included, is a way to avoid screenshotting popups - for instance newsletter or login popups, GDPR popups, etc. If shot-scraper has a reliable way of screenshotting websites while avoiding these popups, I would love to know.

I'm on mobile so don't have access to my notes, but I'm pretty sure that a year ago when I tried there was no reliable way to screenshot e.g. the BBC news website without getting the popups.

Again, thank you.


Try this:

    shot-scraper -h 800 'https://www.spiegel.de/international/' \
      --wait-for "() => {
        const div = document.querySelector('[id^="sp_message_container"]');
        if (div) {
          div.remove();
          return true;
        }
      }"
shot-scraper runs that --wait-for script until it returns true. In this case we're waiting for the cookie consent overlay div to show up and then removing it before we take the screenshot.

Screenshots here: https://gist.github.com/simonw/de75355c39025f9a64548aa3366b1...


I work on a paid screenshot API[0] where we have features to either hide these banners and overlays using CSS, or alternatively run some JavaScript to send a click event to what we detect as the 'accept' button in order to dismiss the popups.

It's quite a painful problem. We screenshot many millions of sites a day, and our success rate at detecting these is high but still not 100%.

We have gotten quite far with heuristics and are exploring whether we can get better results by training a model.

[0]:https://urlbox.com


Interesting service.

Can the API accept a custom tag, comment or ID which is then embedded in the output? Like EXIF in JPEG, metadata chunks in PNG, the description field in PDF, or a meta tag in HTML?


No - we don't currently support this feature as nobody has asked for it so far :)

You could append a new meta tag by running some custom JS to add it, but we don't modify EXIF, PNG metadata or the PDF description at the moment.


Just a thought, but what happens if you orchestrate a browser instance with an installed ad blocker like uBlock Origin?


This works well, I've done it before with Selenium and a headless Firefox with uBlock Origin and Bypass Paywalls installed.


Screenshot the archive.org render?


Mmmm…. This is clever


Yay! I love shot-scraper - I wish you had made it a decade ago!

Thanks for shot scraper.

Off the top of your head, what would be the easiest command to have shot-scraper barf out a directory of HTML snapshots each day from my daily browsing history?

This would be interesting when I have a browsing session for learning something and I'm researching across a bunch of sites - roll it all up into a digi-ography of the sites used in learning that topic?

---

I've always been baffled that this isn't an innate capability of any app/OS - it's a damn computer - I should have a great ability to recall what it displays and what I have been doing.

Heck - we need our machines to write us a daily status report for what we did at the end of each day.

Surely that would change productivity, if you were forced to do a self-digital-confession and stare your ADHD and procrastination right in the face.


Yeah, things like ArchiveBox are probably a better bet there. But... you could write a script that queries the SQLite database of your history, figures out the pages you visited, then loops through and runs `shot-scraper html ... > ...html` against each one.

I just wasted a few minutes trying to get Claude 3 Opus to write me a script - nearly got there but Firefox had my SQLite database locked and I lost interest. My conversation so far is at: https://gist.github.com/simonw/9f20a02f35f7a129b9850988117c0...
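The loop itself is probably something like this (untested sketch - assumes Linux-style Firefox paths and the moz_places schema, and copies the database first to dodge the lock):

    # copy the locked history DB, pull the last day's URLs, snapshot each one
    cp ~/.mozilla/firefox/*.default-release/places.sqlite /tmp/places.sqlite
    sqlite3 /tmp/places.sqlite \
      "select url from moz_places where last_visit_date/1000000 > strftime('%s','now','-1 day');" \
    | while read -r url; do
        shot-scraper html "$url" > "/tmp/history-$(date +%F)-$(echo "$url" | md5sum | cut -c1-8).html"
      done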


My "cheat" for "poke at chrome's sqlite database for current live state" is that they're always locked but none of them are that big, just make a copy and query the copy. `current-chrome-url` runs in `real 0m0.057s` and does a `jq -r .profile.last_used ~/.config/google-chrome/Local\ State` to get the profile, then copies `~/.config/google-chrome/"$PROFILE"` into a mktemp dir, then `sqlite3 -cmd 'select url from urls where id = (select url from visits where visit_time = (select max(visit_time) from visits));' $H .exit` on that copy.


This used to be fairly simple to do before HTTPS everywhere: just install squid (or whatever) and cron the cache folder into a zip file once a day.

There are paid solutions that kinda do what you want, but they capture all text on your screen and OCR it to make it searchable, which at least lets you backtrack and has the added advantage that it makes PDFs, meme images, etc. searchable too. Last I heard it was Mac only, but a few folks mentioned some Windows software that does it as well.

As an aside, I don't consider reading/learning nearly all day to be a net negative, even if ADD is to blame. (I haven't had the "H" since I was a child.) A status report wouldn't "stare" me in the face; in fact, it would be nice to have some language model take the daily report and, over time, suggest other things to read or possible contradictions to link to.


Look at ArchiveBox from the comments below.


> Heck - we need our machines to write us a daily status report for what we did at the end of each day.

I am sure Trump, Xi, Putin, etc. would like that very much.


If anyone is interested, I wrote a long blog post where I analyzed all the various ways of saving HTML pages into a single file, starting back in the 90s. It'll answer a lot of questions asked in this thread (MHTML, SingleFile, web archive, etc.)

https://www.russellbeattie.com/notes/posts/the-decades-long-...


Cool post. You should submit it as its own HN entry.


I always ship single file pages whenever possible. My original reasoning for this was that you should be able to press view source and see everything. (It follows that pages should be reasonably small and readable.)

An unexpected side effect is that they are self contained. You can download pages, drag them onto a browser to use them offline, or reupload them.

I used to author the whole HTML file at once, but lately I am fond of TypeScript, and made a simple build system to let me write games in TS and have them built to one HTML file. (The sprites are base64 encoded.)

On that note, it seems (there is a proposal) that browsers will eventually get support for TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)


I used to only do single-file HTML pages too, until I had a page with multiple occurrences of the same image - it's wasteful to repeat the data URL string every time that img occurs. Maybe I could store the data URL string in JS once and assign it to those imgs from JS, but most of the time my page doesn't have any JS, and it feels bad to add JS just for this.


You could use the <use> element in an inline SVG to duplicate the same bitmap <image> to multiple parts of the page.
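Roughly like this (a sketch - the id, dimensions and truncated data URL are placeholders):

    <!-- define the bitmap once, hidden -->
    <svg width="0" height="0" style="position:absolute">
      <defs>
        <image id="shared-img" width="64" height="64"
               href="data:image/png;base64,..."/>
      </defs>
    </svg>
    <!-- then reuse it anywhere on the page -->
    <svg width="64" height="64"><use href="#shared-img"/></svg>
    <svg width="64" height="64"><use href="#shared-img"/></svg>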


Interesting, I'll try that later. But I guess that won't have some useful attributes on <img> like "alt"?


I found this suggestion (https://stackoverflow.com/questions/4697100/accessibility-re...), although it's pretty old. Maybe screen readers have made improvements to this scenario in the last few years?

    <svg role="img" aria-label="[title + description]">
        <title>[title]</title>
        <desc>[long description]</desc>
        ...
    </svg>


> On that note, it seems (there is a proposal) that browsers will eventually get support for TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)

Careful what you wish for. Assuming the browser is able to run original TS without processing and you want type checking, then that also seems to effectively lock the typechecking abilities of TS to their current level. Even without type checking it would already hinder the ability to add new syntax or standard types to TS.

Given that TS is made for providing expressive typing on top of JS, rather than constructively designing a type system together with a language, there's still a lot of ground to cover, as can be seen from the improvements made in every TS release.


No, the proposal is _not_ to include TypeScript type checking in the browser. The proposal is to make the browser understand where the TS types are so it can _ignore_ them. So you can write TS and have it type checked on your machine, and then ship the TS and have it run on client machines _without_ any type checking running there. The browser will run the TS as though it were JS.

So the types will actually be able to be anything. It can be a completely different type checking superset language than Typescript even! Nothing will be locked at the current level.

It's a frikkin magical proposal.

https://github.com/tc39/proposal-type-annotations


Here we can see concretely the benefit of TypeScript taking the strict stance of not generating any code, so that simply stripping the types works.

Though, weren't there some exceptions to that?

It seems, though, that the syntactic structures to be ignored need to be listed in the proposal, which makes browser support non-trivial and still hinders future extensions of TS and similar languages, because all future constructs would need to fit within this proposal (or whatever version current browsers practically support). If a language introduces a new construct, everyone using that construct has to go back from shipping their source code as-is, increasing the cost of introducing such things in the future.

Personally I don't see great benefits in having straight-up TS work as-is in browsers, since you still need to run the type checking phase locally, but I do see that some would like that to happen and that it would simplify some release processes.

It would not simplify the release process of folks that want to minify and obfuscate their sources, but it's probably fine to make that comparatively even harder ;).


The benefit is that we can forgo any build or compile step and just ship code as-is. There will be no source map errors to worry about. We will be able to run and debug the exact code that we wrote, not some jumbled transpiled code. We can just have a local type checker that helps with correctness and inline suggestions in our IDE.

When I say nothing will be locked down, that only means the type checking itself will not be locked down. Indeed it will not be specified at all. Buuuuut the location of the types in the JS syntax will 100% be locked down and specified. That's what the proposal is. So there will be limitations on coming up with novel ways to integrate type syntax with the JS syntax. But you will of course still be able to make your own compiler if you want that.



Doesn't this make the dead screen time rather long if you have to load all the game assets before you can even display a loader? (I guess you don't even have or need progress bars?)


Do you happen to have any resources on writing games in TS? I will def google, but game development has always been oddly hard for me, and I've been in TypeScript land for a while now, so I figure it's a good time to try again.