Question
Just to clarify in advance, I don't have a Facebook account and I have no intent to create one. Also, what I'm trying to achieve is perfectly legal in my country and the USA.
Instead of using the Facebook API to get the latest timeline posts of a Facebook page, I want to send a GET request directly to the page URL (e.g. this page) and extract the posts from the HTML source code.
(I'd like to get the text and the creation time of the post.)
When I run this in the web console:
document.getElementsByClassName('userContent')
I get a list of elements containing the text of the latest posts.
But I'd like to extract that information from a Node.js script. I could probably do it quite easily using a headless browser like puppeteer or the like, but that would create a ton of unnecessary overhead. I'd really like a simple approach: download the HTML code, pass it to cheerio, and use cheerio's jQuery-like API to extract the posts.
Here is my attempt at exactly that:
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');
rp.get('https://www.facebook.com/pg/officialstackoverflow/posts/').then(postsHtml => {
  const $ = cheerio.load(postsHtml);
  const timeLinePostEls = $('.userContent');
  console.log(timeLinePostEls.html()); // should NOT be null
  // .first() keeps the cheerio wrapper; .get(0) would return a raw node without .html()/.text()
  const newestPostEl = timeLinePostEls.first();
  console.log(newestPostEl.html()); // should NOT be null
  const newestPostText = newestPostEl.text();
  console.log(newestPostText);
  //const newestPostTime = newestPostEl.parent(??).child('.livetimestamp').title;
  //console.log(newestPostTime);
}).catch(console.error);
Unfortunately, $('.userContent') does not work. However, I was able to verify that the data I'm looking for is embedded somewhere in that HTML code.
But I couldn't come up with a good regex approach or the like to extract that data.
Depending on the post content, the number of HTML tags within the post varies heavily.
Here is a simple example of a post containing one link:
<div class="_5pbx userContent _3576" data-ft="{&quot;tn&quot;:&quot;K&quot;}"><p>We're proud to be named one of Built In NYC's Best Places to Work in 2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for Best Perks and Benefits. See what it took to make the list and check out our profile to see some of our job openings. <a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fbit.ly%2F2H3Kbr2&h=AT29h2HyDsEk0rHRWqJA-Fa4M1qi3nJT1NBi95othaR3qeFuFAMNiVS2Dgtv5KR5m0xqjw6kfwZdhZt0_D3UQT1Oel2UhxRql-KwkA1xqWvrql4u1jDhzrkGVT_XxoUd8_w8_fzLZzzhz23a8yPCK6IPfWKB76_CEFjG3b78y4dFJvY9Z08AYlR01dmi5_FvWVEVytkN-123u6alYE8pqL6Jb6dtIQUTWGXYJPaNMrtxkCUZniEVXEcILkwHGSuHqCTAarboyMP55F1vhYO3OAiVMkvjbN274fVq92YvbK3bi90bU9T-5ADWHDUJ-CwcofSBTW47chstQeY0n_UluD_rBIPLsfXVSnCtpRkR2kXi9zzHLnNeIYeNssv3i7UKS_f5Z2pnVT6xe3zJbNpB68doH1Z__I9nsTCNIyFyKx2VxabecoL03DIawbRrzBoxLAwzNPLACBjTkpEQhdVn4_wdAIjXRL4cLQDcZkLEoG_sspBgRePH23TFbNufQOBly-FNtLHnkUDO2Ca-FYvAGXpcu6J4B1aH3XFPB803lsz-GRdACyOFOgXDXJfwr4WtWzUHxfiOPULWiI43yI5L4aU6wYRhPjxua3RuRZ8oj9fXa1w4Jrht94Ue2wfKtz8" target="_blank" data-ft="{&quot;tn&quot;:&quot;-U&quot;}" rel="noopener nofollow" data-lynx-mode="async">http://*******/2H3Kbr2</a></p></div>
Formatted in a more readable form it looks somewhat like this:
<div class="_5pbx userContent _3576" data-ft="{&quot;tn&quot;:&quot;K&quot;}">
  <p>
    We're proud to be named one of Built In NYC's Best Places to Work in
    2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for
    Best Perks and Benefits. See what it took to make the list and check out our
    profile to see some of our job openings.
    <a href="VERY_LONG_URL.........." target="_blank" data-ft="{&quot;tn&quot;:&quot;-U&quot;}" rel="noopener nofollow" data-lynx-mode="async">SHORT_LINK.....</a>
  </p>
</div>
This regex seems to work okay, but I don't think it is very reliable:
/<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g
If, for example, the post contained another div element, the regex wouldn't work properly. In addition, this approach gives me no way of knowing the time/date the post was created.
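To make that failure mode concrete, here is a minimal sketch that runs the regex above against a made-up post containing a nested div. The sample markup is hypothetical and not taken from a real page. The lazy `(.+?)` stops at the first `</div>` it finds, so the capture is cut off inside the nested element.

```javascript
// Regex from the question, unchanged.
const postRegex = /<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g;

// Hypothetical post markup with a nested <div> inside the post body.
const sampleHtml =
  '<div class="_5pbx userContent _3576" data-ft="x">' +
  '<p>Intro</p><div>nested block</div><p>Outro</p>' +
  '</div>';

const match = postRegex.exec(sampleHtml);
const captured = match ? match[1] : null;

// The capture ends at the first closing </div>, so "<p>Outro</p>" is lost
// and the nested <div> is left unclosed.
console.log(captured);
```

A regex cannot track nesting depth, which is why an HTML parser is the more robust tool here.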
Any ideas how I could relatively reliably extract the most recent 2-3 posts including the creation date/time?
Answer 1:
Okay, I finally figured it out. I hope this will be useful to others. This function will extract the 20 latest posts, including the creation time:
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');
function GetFbPosts(pageUrl) {
  const requestOptions = {
    url: pageUrl,
    headers: {
      // Send a browser-like User-Agent with the request
      'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'
    }
  };
  return rp.get(requestOptions).then(postsHtml => {
    const $ = cheerio.load(postsHtml);
    // Wrap every matching element in its own cheerio object
    const timeLinePostEls = $('.userContent').map((i, el) => $(el)).get();
    const posts = timeLinePostEls.map(post => {
      return {
        message: post.html(),
        // Look up the enclosing .userContentWrapper and read its .timestampContent element
        created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
      };
    });
    return posts;
  });
}
GetFbPosts('https://www.facebook.com/pg/officialstackoverflow/posts/').then(posts => {
  // Log all posts
  for (const post of posts) {
    console.log(post.created_at, post.message);
  }
}).catch(console.error);
Since Facebook messages can have complicated formatting, the message is returned as HTML rather than plain text. To strip the formatting and get only the text, replace message: post.html() with message: post.text().
Edit: If you want to get more than the latest 20 posts, it is more complicated. The first 20 posts are served statically in the initial HTML page. All following posts are retrieved via AJAX in chunks of 8 posts. It can be done like this:
// make sure your node.js version supports async/await (v10 and above should be fine)
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');
class FbScrape {
  constructor(options={}) {
    this.headers = options.headers || {
      'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' // you may have to update this at some point
    };
  }
  async getPosts(pageUrl, limit=20) {
    // The first 20 posts are embedded in the initial HTML page
    const staticPostsHtml = await rp.get({ url: pageUrl, headers: this.headers });
    if (limit <= 20) {
      return this._parsePostsHtml(staticPostsHtml);
    } else {
      // Further posts are loaded from the AJAX URL referenced in the page
      const staticPosts = this._parsePostsHtml(staticPostsHtml);
      const nextResultsUrl = this._getNextPageAjaxUrl(staticPostsHtml);
      const ajaxPosts = await this._getAjaxPosts(nextResultsUrl, limit-20);
      return staticPosts.concat(ajaxPosts);
    }
  }
  _parsePostsHtml(postsHtml) {
    const $ = cheerio.load(postsHtml);
    // Wrap every matching element in its own cheerio object
    const timeLinePostEls = $('.userContent').map((i, el) => $(el)).get();
    const posts = timeLinePostEls.map(post => {
      return {
        message: post.html(),
        created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
      };
    });
    return posts;
  }
  async _getAjaxPosts(resultsUrl, limit=8, posts=[]) {
    const responseBody = await rp.get({ url: resultsUrl, headers: this.headers });
    // Skip the 9-character prefix that precedes the JSON payload
    const extractedJson = JSON.parse(responseBody.slice(9));
    const postsHtml = extractedJson.domops[0][3].__html;
    const newPosts = this._parsePostsHtml(postsHtml);
    const allPosts = posts.concat(newPosts);
    const nextResultsUrl = this._getNextPageAjaxUrl(postsHtml);
    // Fetch further chunks recursively until enough posts have been collected
    if (allPosts.length+1 >= limit)
      return allPosts;
    else
      return await this._getAjaxPosts(nextResultsUrl, limit, allPosts);
  }
  _getNextPageAjaxUrl(html) {
    return 'https://www.facebook.com' + /"(\/pages_reaction_units\/more[^"]+)"/g.exec(html)[1].replace(/&amp;/g, '&') + '&__a=1';
  }
}
const fbScrape = new FbScrape();
// Minimum number of posts to request. The result is rounded up to 20, 28, 36, 44, 52, 60, 68, etc. because of the page sizes (first page = 20, every following page = 8).
const minimum = 28;
fbScrape.getPosts('https://www.facebook.com/pg/officialstackoverflow/posts/', minimum).then(posts => { // get at least the 28 latest posts
  // Log all posts
  for (const post of posts) {
    console.log(post.created_at, post.message);
  }
});
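Note that `_getAjaxPosts` and `_getNextPageAjaxUrl` throw if the response shape changes or if there is no further page, because `exec` then returns `null`. A more defensive variant might look like the sketch below. It assumes, as the class above does, that the response body starts with a 9-character prefix (taken here to be `for (;;);`) and that the HTML fragment lives at `domops[0][3].__html`. It also assumes the URL in the fragment is HTML-escaped, so `&amp;` is decoded to `&`. The helper names `parseAjaxChunk` and `findNextPageUrl`, the `ajaxify` attribute, and the sample body are hypothetical.

```javascript
const AJAX_PREFIX = 'for (;;);'; // assumed guard prefix (9 characters)

// Strip the guard prefix (if present), parse the JSON, and return the
// embedded HTML fragment, or null when the expected structure is missing.
function parseAjaxChunk(responseBody) {
  const jsonText = responseBody.startsWith(AJAX_PREFIX)
    ? responseBody.slice(AJAX_PREFIX.length)
    : responseBody;
  let payload;
  try {
    payload = JSON.parse(jsonText);
  } catch (err) {
    return null; // not JSON, e.g. a login or error page
  }
  const domops = payload && payload.domops;
  const firstOp = Array.isArray(domops) ? domops[0] : null;
  const htmlHolder = Array.isArray(firstOp) ? firstOp[3] : null;
  return htmlHolder && typeof htmlHolder.__html === 'string' ? htmlHolder.__html : null;
}

// Build the absolute URL of the next chunk, or return null when there is none.
function findNextPageUrl(html) {
  const match = /"(\/pages_reaction_units\/more[^"]+)"/.exec(html);
  if (!match) return null;
  return 'https://www.facebook.com' + match[1].replace(/&amp;/g, '&') + '&__a=1';
}

// Hypothetical response body, built only to exercise the two helpers.
const sampleHtml = '<a ajaxify="/pages_reaction_units/more/?page_id=1&amp;cursor=abc">More</a>';
const sampleBody = AJAX_PREFIX +
  JSON.stringify({ domops: [['replace', '#id', false, { __html: sampleHtml }]] });

const html = parseAjaxChunk(sampleBody);
const nextUrl = html ? findNextPageUrl(html) : null;
console.log(nextUrl);
```

With helpers like these, the recursion could stop cleanly when either function returns `null` instead of failing with a TypeError.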
Source: https://stackoverflow.com/questions/54256433/extract-public-posts-from-facebook-page-without-api-app-key-token-secret