nbc headlines updates

2026-07-20 16:51:08 +02:00 · 2024-10-02 10:24:49 -07:00 · 2024-10-02 10:24:49 -07:00 · f43ae7a741
commit f43ae7a741
parent 496560cf9a
3 changed files with 1293 additions and 701 deletions
--- a/examples/nbc-headlines/1_scrape.ipynb
+++ b/examples/nbc-headlines/1_scrape.ipynb
@@ -1,5 +1,27 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NBC News Headlines: Scraper\n",
    "\n",
    "This notebook implements a scraper for [NBC News](https://www.nbcnews.com) headlines. It uses [this sitemap](https://www.nbcnews.com/archive/articles/2024/march), which provides a list of article headlines + URLs\n",
    "for every month for the past few years. \n",
    "\n",
    "The goal is mostly to get a simple, small, real-world text dataset for testing embeddings. \n",
    "The headlines are short pieces of text (~dozen words), have a wide range of semantic meaning, and are more \"real-world\"\n",
    "than some other embeddings datasets out there.\n",
    "\n",
    "This notebook uses [Deno](https://deno.com/), [linkedom](https://github.com/WebReflection/linkedom), and a few \n",
    "SQLite extensions to scrape the headlines for a given date range. It creates a single SQL table, `articles`, \n",
    "with a few columns like `headline` and `url`. By default it will get all article headlines from January 2024 -> present\n",
    "and save them to a database called `headlines-2024.db`. Feel free to copy+paste this code into your own custom scraper. \n",
    "\n",
    "This notebook only scrapes the data into a SQLite database; it does NOT do any embeddings + vector search. \n",
    "For examples of those, see [`./2_build.ipynb`](./2_build.ipynb) and [`./3_search.ipynb`](./3_search.ipynb)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
--- a/examples/nbc-headlines/2_build.ipynb
+++ b/examples/nbc-headlines/2_build.ipynb
--- a/examples/nbc-headlines/3_search.ipynb
+++ b/examples/nbc-headlines/3_search.ipynb