mirror of
https://github.com/asg017/sqlite-vec.git
synced 2026-04-25 16:56:27 +02:00
nbc headlines updates
parent 496560cf9a
commit f43ae7a741
3 changed files with 1293 additions and 701 deletions
@@ -1,5 +1,27 @@
 {
 "cells": [
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# NBC News Headlines: Scraper\n",
+"\n",
+"This notebook implements a scraper for [NBC News](https://www.nbcnews.com) headlines. It uses [this sitemap](https://www.nbcnews.com/archive/articles/2024/march), which provides a list of article headlines + URLs\n",
+"for every month for the past few years. \n",
+"\n",
+"This dataset is mostly to get a simple, real-world small text dataset for testing embeddings. \n",
+"They're small pieces of text (~dozen words), have a wide range of semantic meaning, and are more \"real-world\"\n",
+"than some other embeddings datasets out there.\n",
+"\n",
+"This notebook uses [Deno](https://deno.com/), [linkedom](https://github.com/WebReflection/linkedom), and a few \n",
+"SQLite extensions to scrape the headlines for a given date range. It creates a single SQL table, `articles`, \n",
+"with a few columns like `headline` and `url`. By default it will get all article headlines from January 2024 -> present\n",
+"and save them to a database called `headlines-2024.db`. Feel free to copy+paste this code into your own custom scraper. \n",
+"\n",
+"This notebook also just scrapes the data into a SQLite database, it does NOT do any embeddings + vector search. \n",
+"For examples of those, see [`./2_build.ipynb`](./2_build.ipynb) and [`./3_search.ipynb`](./3_search.ipynb)."
+]
+},
 {
 "cell_type": "code",
 "execution_count": 43,
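The added markdown cell says the scraper walks one archive page per month, from January 2024 to the present. The notebook's own code is not visible in this diff, so the following is only a rough sketch of how that list of pages might be produced. The helper name `archiveUrls` and its parameters are hypothetical, and the URL shape is assumed to match the March 2024 sitemap link for every month.

```typescript
// Hypothetical helper, not taken from the notebook: list the monthly
// archive pages to visit for an inclusive (year, month) range.
const MONTHS: string[] = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

// Assumption: every month follows the same URL shape as the sitemap link
// https://www.nbcnews.com/archive/articles/2024/march
function archiveUrls(
  startYear: number,
  startMonth: number, // 1-based: 1 = January
  endYear: number,
  endMonth: number, // 1-based, inclusive
): string[] {
  const urls: string[] = [];
  let year = startYear;
  let month = startMonth;
  while (year < endYear || (year === endYear && month <= endMonth)) {
    urls.push(
      `https://www.nbcnews.com/archive/articles/${year}/${MONTHS[month - 1]}`,
    );
    month += 1;
    if (month > 12) {
      // Roll over into January of the next year.
      month = 1;
      year += 1;
    }
  }
  return urls;
}

// Usage: the three archive pages for January through March 2024.
console.log(archiveUrls(2024, 1, 2024, 3));
```

Each returned URL would then be fetched and parsed (the notebook names linkedom for this), with the resulting headline and URL pairs written to the `articles` table.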
File diff suppressed because it is too large (2 files)