Git Scraping

Simon Willison pioneered a technique that he calls git scraping. The idea is to use GitHub Actions to fetch a data source on a schedule and commit each snapshot to a repository, so that the git commit history itself becomes a time series dataset.
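
The scrape-and-commit step can be sketched in a few lines of Python. This is a minimal illustration rather than the setup used for the datasets below: the function names are hypothetical, the URL and file name are supplied by the caller, and it assumes the script runs inside an existing git repository with a commit identity already configured (as a scheduled GitHub Actions job would normally arrange).

```python
import subprocess
from pathlib import Path
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Download the current state of the data source."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def save_if_changed(path: Path, content: bytes) -> bool:
    """Write content to path; return True only if the file actually changed."""
    if path.exists() and path.read_bytes() == content:
        return False
    path.write_bytes(content)
    return True


def commit_snapshot(repo: Path, filename: str, message: str) -> None:
    """Stage and commit one file (assumes git user.name/user.email are set)."""
    subprocess.run(["git", "add", filename], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True)


def scrape(url: str, repo: Path, filename: str) -> bool:
    """Fetch the source and commit a new snapshot only when it differs."""
    content = fetch(url)
    if save_if_changed(repo / filename, content):
        commit_snapshot(repo, filename, f"Update {filename}")
        return True
    return False
```

Calling `scrape(url, Path("."), "data.json")` on a schedule produces one commit per observed change, so `git log -p data.json` shows when each change appeared.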

I’m currently building two datasets:

  1. In November 2021, CISA announced a Known Exploited Vulnerabilities Catalog. Binding Operational Directive 22-01 uses this catalog as the foundation for requiring federal agencies to patch their systems. Git scraping will enable a few kinds of analysis: How long does CISA give federal agencies to patch once it knows a vulnerability is being exploited? How are these vulnerabilities distributed among vendors? Is there a pattern to how regularly CISA updates the list or requires patching? Data available here.

  2. The AWS Status Dashboard. I’m interested not only in the general distribution of outages, but also in recovery times and any patterns in cascading failures. Data available here.
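
As a rough sketch of the first two questions for the CISA dataset, the Python below computes the patch window and the per-vendor counts from one snapshot. It assumes each catalog entry is a dictionary with `vendorProject`, `dateAdded`, and `dueDate` fields holding ISO dates; check those names against the actual scraped file. The sample entries are invented placeholders, not real catalog data.

```python
from collections import Counter
from datetime import date
from statistics import median


def patch_window_days(entry: dict) -> int:
    """Days between an entry being added and its required patch date."""
    added = date.fromisoformat(entry["dateAdded"])
    due = date.fromisoformat(entry["dueDate"])
    return (due - added).days


def summarize(entries: list[dict]) -> dict:
    """Median patch window and entry counts per vendor, most common first."""
    windows = [patch_window_days(e) for e in entries]
    vendors = Counter(e["vendorProject"] for e in entries)
    return {
        "median_window_days": median(windows),
        "by_vendor": vendors.most_common(),
    }


# Invented placeholder entries, for illustration only.
SAMPLE = [
    {"vendorProject": "ExampleVendorA", "dateAdded": "2021-11-03", "dueDate": "2021-11-17"},
    {"vendorProject": "ExampleVendorA", "dateAdded": "2021-11-03", "dueDate": "2022-05-03"},
    {"vendorProject": "ExampleVendorB", "dateAdded": "2022-01-10", "dueDate": "2022-01-24"},
]

summary = summarize(SAMPLE)
```

The third question, how regularly the list is updated, is where the commit history matters: each commit timestamp marks a moment when the catalog was observed to change.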