  • Depending on what you want to scrape, that’s a lot of overkill and overcomplication. A full website testing framework may not be necessary just to scrape a page. Python, with its tooling and package management, may not be necessary either.

    I’ve recently extracted and downloaded stuff via Nushell.

    1. Requirement: knowledge of CSS selectors
    2. Inspect the website’s DOM in the browser’s developer tools
      1. Identify the structure
      2. Identify adequate selectors; they can be tested in the dev tools console with document.querySelectorAll()
    3. Get and query the data
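
    To make step 2 concrete, here is a minimal sketch of how a selector could be checked in the dev tools console before using it anywhere else. The helper name collect is hypothetical (it is not part of any library), and the selectors are only the example ones used in this comment, so adjust them to the page you are inspecting.

```javascript
// Hypothetical helper (not a browser built-in): list what a CSS selector matches.
// `root` is anything that offers querySelectorAll(), e.g. `document` in the console.
// With `attribute` set, return that attribute's value; otherwise the trimmed text.
function collect(root, selector, attribute) {
  return Array.from(root.querySelectorAll(selector), (el) =>
    attribute ? el.getAttribute(attribute) : el.textContent.trim()
  );
}

// Only runs where a DOM exists (the browser console), not in plain Node.
if (typeof document !== "undefined") {
  // Assumed example selectors; replace them with the ones you identified.
  console.log(collect(document, "#infobox .title, #infobox .tags"));
  console.log(collect(document, "main img", "data-src"));
}
```

    If the printed lists contain what you expect, the same selectors should be usable in whatever tool does the actual fetching.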

    My command-line shell and scripting language of choice is Nushell:

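    # Fetch the page; `query web` comes from Nushell's query plugin
    let $html = http get 'https://example.org/'
    # Select title and tags, then build a record from the first match of each
    let $meta = $html | query web --query '#infobox .title, #infobox .tags' | { title: $in.0.0 tags: $in.1.0 }
    # Collect the data-src attribute of every image inside main
    let $content = $html | query web --query 'main img' --attribute data-src
    $meta | save meta.json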
    

    or

    1..30 | each {|x| http get $'https://example.org/img/($x).jpg' | save $'($x).jpg'; sleep 100ms }
    

    Depending on the tools you use, it’ll be quite similar or very different.

    Selenium is an entire web browser driver, which means it does a lot more and has a more extensive interface as a result. You can talk to it through different interfaces and languages.