mirror of
https://github.com/0xMassi/webclaw.git
synced 2026-06-07 22:15:12 +02:00
feat(extractors): add LinkedIn + Instagram with profile-to-posts fan-out
Three social-network extractors that work entirely without auth, using
public embed/preview endpoints plus Instagram's own SEO-facing API:

- linkedin_post: /embed/feed/update/{urn} returns the full body,
  author, image, and OG tags. Accepts both the urn:li:share and
  urn:li:activity URN forms, plus the pretty
  /posts/{slug}-{id}-{suffix} URLs.
- instagram_post: /p/{shortcode}/embed/captioned/ returns the full
  caption, username, and thumbnail. The same endpoint serves reels
  and IGTV, and the kind is classified correctly.
- instagram_profile: /api/v1/users/web_profile_info/?username=X with
  the x-ig-app-id header (Instagram's public web-app id, sent by
  their own JS bundle). Returns the full profile plus the 12 most
  recent posts with shortcodes, kinds, like/comment counts,
  thumbnails, and caption previews. Falls back to OG-tag scraping of
  the public HTML if the API ever returns 401 or 403.
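
To illustrate the instagram_post mapping described above, here is a minimal sketch of turning a post URL into the embed endpoint. The helper names are hypothetical and not taken from this commit, and accepting the /reel/ and /tv/ path forms is an assumption about how such a matcher might behave.

```rust
/// Hypothetical helper: pull the shortcode out of an Instagram post URL
/// such as https://www.instagram.com/p/{shortcode}/.
/// Assumption: /reel/ and /tv/ paths carry a shortcode in the same position.
fn instagram_shortcode(url: &str) -> Option<&str> {
    // Drop the scheme, then any query string or fragment.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.split(|c: char| c == '?' || c == '#').next()?;

    // Remaining shape: host / kind / shortcode.
    let mut parts = rest.split('/').filter(|s| !s.is_empty());
    let host = parts.next()?;
    if host != "instagram.com" && host != "www.instagram.com" {
        return None;
    }
    let kind = parts.next()?;
    if !matches!(kind, "p" | "reel" | "tv") {
        return None;
    }
    parts.next()
}

/// Build the captioned-embed URL named in the commit message.
fn instagram_embed_url(url: &str) -> Option<String> {
    let code = instagram_shortcode(url)?;
    Some(format!(
        "https://www.instagram.com/p/{}/embed/captioned/",
        code
    ))
}
```

Returning `None` for anything that is not a recognisable post URL lets a caller fall through to the next extractor instead of fetching a bogus endpoint.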

The IG profile output is shaped so callers can fan out cleanly:

    for p in profile.recent_posts:
        scrape('instagram_post', p.url)

giving you 'whole profile + every recent post' in one loop. End-to-end
tested against ticketswave: 1 profile call + 12 post calls in ~3.5s.
Pagination beyond 12 posts requires authenticated cookies and is left
for the cloud where we can stash a session.
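
The fan-out loop above could look roughly like this on the Rust side. This is a sketch under stated assumptions: the `Profile` and `RecentPost` types and the `scrape` callback are hypothetical stand-ins, not the structures this commit defines, and the real extractors are async.

```rust
/// Hypothetical shape of one entry in a profile's recent-posts list.
#[derive(Debug, Clone, PartialEq)]
struct RecentPost {
    shortcode: String,
    url: String,
}

/// Hypothetical, trimmed-down profile output.
struct Profile {
    recent_posts: Vec<RecentPost>,
}

/// Run `scrape` once per recent post. Each shortcode is paired with its own
/// result, so one failed post does not discard the others.
fn fan_out<T, E>(
    profile: &Profile,
    mut scrape: impl FnMut(&str, &str) -> Result<T, E>,
) -> Vec<(String, Result<T, E>)> {
    profile
        .recent_posts
        .iter()
        .map(|p| {
            let result = scrape("instagram_post", p.url.as_str());
            (p.shortcode.clone(), result)
        })
        .collect()
}
```

Keeping per-post results separate matters for a 1 + 12 call pattern: a single removed or rate-limited post should not fail the whole profile scrape.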
Infrastructure change: added FetchClient::fetch_with_headers so
extractors can satisfy site-specific request headers (here x-ig-app-id;
later github_pr will use this for Authorization, etc.) without polluting
the global FetchConfig.headers map. Same retry semantics as fetch().
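
One way to picture the "without polluting the global map" property is a per-request merge that only reads the global headers. The helper below is a hypothetical sketch, not the actual fetch_with_headers implementation, whose signature is not shown in this commit page; the header value is a placeholder.

```rust
use std::collections::HashMap;

/// Hypothetical helper: build the header set for a single request by layering
/// per-request headers over the global ones. The global map is only borrowed,
/// so it is unchanged after the call.
fn merge_request_headers(
    global: &HashMap<String, String>,
    extra: &[(&str, &str)],
) -> HashMap<String, String> {
    // HTTP header names are case-insensitive, so normalise keys to lowercase.
    let mut merged: HashMap<String, String> = global
        .iter()
        .map(|(k, v)| (k.to_ascii_lowercase(), v.clone()))
        .collect();

    // Per-request headers win over global ones with the same name.
    for (name, value) in extra {
        merged.insert(name.to_ascii_lowercase(), (*value).to_owned());
    }
    merged
}
```

Under this design an extractor could pass something like `&[("x-ig-app-id", "<app-id>")]` for one call, and every other request would still see only the global headers.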
Catalog now exposes 17 extractors via /v1/extractors. Total unit tests
across the module: 47 passing. Clippy clean. Fmt clean.

Live test on the maintainer's example URLs:

- LinkedIn post (urn:li:share:7452618582213144577): 'Orc Dev' / full body
  / shipper.club link / CDN image extracted in 250ms.
- Instagram post (DT-RICMjeK5): 835-char Slovak caption, ticketswave
  username, thumbnail. 200ms.
- Instagram profile (ticketswave): 18,473 followers (exact, not
  rounded), is_verified=True, is_business=True, biography with emojis,
  12 recent posts with shortcodes + kinds + likes. 400ms.

Out of scope for this wave (require infra we don't have):
- linkedin_profile: returns 999 to all bot UAs, needs OAuth
- facebook_post / facebook_page: content is JS-loaded, needs cloud Chrome
- facebook_profile (personal): not publicly accessible by design
parent b041f3cddd
commit 3bb0a4bca0
7 changed files with 1085 additions and 1 deletions
@@ -24,6 +24,9 @@ pub mod github_repo;
 pub mod hackernews;
 pub mod huggingface_dataset;
 pub mod huggingface_model;
+pub mod instagram_post;
+pub mod instagram_profile;
+pub mod linkedin_post;
 pub mod npm;
 pub mod pypi;
 pub mod reddit;
@@ -67,6 +70,9 @@ pub fn list() -> Vec<ExtractorInfo> {
         docker_hub::INFO,
         dev_to::INFO,
         stackoverflow::INFO,
+        linkedin_post::INFO,
+        instagram_post::INFO,
+        instagram_profile::INFO,
     ]
 }
 
@@ -171,6 +177,27 @@ pub async fn dispatch_by_url(
                 .map(|v| (stackoverflow::INFO.name, v)),
         );
     }
+    if linkedin_post::matches(url) {
+        return Some(
+            linkedin_post::extract(client, url)
+                .await
+                .map(|v| (linkedin_post::INFO.name, v)),
+        );
+    }
+    if instagram_post::matches(url) {
+        return Some(
+            instagram_post::extract(client, url)
+                .await
+                .map(|v| (instagram_post::INFO.name, v)),
+        );
+    }
+    if instagram_profile::matches(url) {
+        return Some(
+            instagram_profile::extract(client, url)
+                .await
+                .map(|v| (instagram_profile::INFO.name, v)),
+        );
+    }
     None
 }
 
@@ -259,6 +286,24 @@ pub async fn dispatch_by_name(
             })
             .await
         }
+        n if n == linkedin_post::INFO.name => {
+            run_or_mismatch(linkedin_post::matches(url), n, url, || {
+                linkedin_post::extract(client, url)
+            })
+            .await
+        }
+        n if n == instagram_post::INFO.name => {
+            run_or_mismatch(instagram_post::matches(url), n, url, || {
+                instagram_post::extract(client, url)
+            })
+            .await
+        }
+        n if n == instagram_profile::INFO.name => {
+            run_or_mismatch(instagram_profile::matches(url), n, url, || {
+                instagram_profile::extract(client, url)
+            })
+            .await
+        }
         _ => Err(ExtractorDispatchError::UnknownVertical(name.to_string())),
     }
 }
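
The two dispatch functions in this diff follow the same shape for every extractor: check `matches(url)`, then call `extract`. As a reading aid, here is a synchronous, table-driven sketch of that pattern. It is an alternative formulation, not what the commit does (the commit uses explicit `if` blocks and `match` arms over async extractors), and every type and function name below is hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum DispatchError {
    /// No extractor with the requested name exists.
    UnknownVertical(String),
    /// The named extractor exists but does not accept this URL.
    UrlMismatch { vertical: String, url: String },
}

/// One catalog entry: a name, a URL predicate, and an extraction function.
struct Extractor {
    name: &'static str,
    matches: fn(&str) -> bool,
    extract: fn(&str) -> String,
}

/// First extractor whose predicate accepts the URL wins, so catalog order
/// decides the outcome when predicates overlap.
fn dispatch_url(catalog: &[Extractor], url: &str) -> Option<(&'static str, String)> {
    catalog
        .iter()
        .find(|e| (e.matches)(url))
        .map(|e| (e.name, (e.extract)(url)))
}

/// Look the extractor up by name, then verify the URL before extracting.
fn dispatch_named(
    catalog: &[Extractor],
    name: &str,
    url: &str,
) -> Result<String, DispatchError> {
    let ex = catalog
        .iter()
        .find(|e| e.name == name)
        .ok_or_else(|| DispatchError::UnknownVertical(name.to_string()))?;
    if !(ex.matches)(url) {
        return Err(DispatchError::UrlMismatch {
            vertical: name.to_string(),
            url: url.to_string(),
        });
    }
    Ok((ex.extract)(url))
}
```

The explicit arms in the diff trade some repetition for straightforward async calls and per-extractor types; a table like this one would need boxed futures or a trait object to cover the same ground.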