nbc headlines updates

This commit is contained in:
Alex Garcia 2024-10-02 10:24:49 -07:00
parent 496560cf9a
commit f43ae7a741
3 changed files with 1293 additions and 701 deletions

@@ -1,5 +1,27 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NBC News Headlines: Scraper\n",
"\n",
"This notebook implements a scraper for [NBC News](https://www.nbcnews.com) headlines. It uses [this sitemap](https://www.nbcnews.com/archive/articles/2024/march), which provides a list of article headlines + URLs\n",
"for every month for the past few years. \n",
"\n",
"This dataset is mostly meant as a simple, small, real-world text dataset for testing embeddings. \n",
"The headlines are short pieces of text (about a dozen words), cover a wide range of semantic meaning, and are more \"real-world\"\n",
"than some other embeddings datasets out there.\n",
"\n",
"This notebook uses [Deno](https://deno.com/), [linkedom](https://github.com/WebReflection/linkedom), and a few \n",
"SQLite extensions to scrape the headlines for a given date range. It creates a single SQL table, `articles`, \n",
"with a few columns like `headline` and `url`. By default it will get all article headlines from January 2024 -> present\n",
"and save them to a database called `headlines-2024.db`. Feel free to copy+paste this code into your own custom scraper. \n",
"\n",
"Note that this notebook only scrapes the data into a SQLite database; it does NOT do any embeddings or vector search. \n",
"For examples of those, see [`./2_build.ipynb`](./2_build.ipynb) and [`./3_search.ipynb`](./3_search.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 43,

File diff suppressed because it is too large

File diff suppressed because it is too large
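
The notebook's intro cell says the scraper walks NBC's monthly archive pages from January 2024 to the present. The scraper code itself is suppressed in this diff, so the following is only a minimal sketch of how that list of monthly URLs might be built, not the notebook's actual implementation. The helper name `archiveUrls` and its parameters are hypothetical, and the URL pattern is inferred from the single example link in the intro (`/archive/articles/2024/march`), on the assumption that other months follow the same lowercase-name convention.

```typescript
// Lowercase English month names, assumed to match the archive URL path
// (only ".../2024/march" is confirmed by the notebook's intro cell).
const MONTHS: string[] = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

const ARCHIVE_BASE = "https://www.nbcnews.com/archive/articles";

// Hypothetical helper: returns one archive URL per month, from
// (startYear, startMonth) through (endYear, endMonth) inclusive.
// Months are 1-based (1 = January).
function archiveUrls(
  startYear: number,
  startMonth: number,
  endYear: number,
  endMonth: number,
): string[] {
  const urls: string[] = [];
  let year = startYear;
  let month = startMonth;
  while (year < endYear || (year === endYear && month <= endMonth)) {
    urls.push(`${ARCHIVE_BASE}/${year}/${MONTHS[month - 1]}`);
    month += 1;
    if (month > 12) {
      // Roll over into January of the next year.
      month = 1;
      year += 1;
    }
  }
  return urls;
}
```

Calling `archiveUrls(2024, 1, 2024, 10)` would give the ten monthly pages a scraper could then fetch and parse (for example with linkedom) before inserting each headline and URL into the `articles` table.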