
chore(cicd): setup test pipeline #69

Merged 11 commits into main on Jan 3, 2024

Conversation

@marcelovicentegc (Collaborator) commented Nov 26, 2023

The goal of this PR is to set up an automated test pipeline for PRs, to make sure that nothing breaks with upcoming changes and to safeguard this project's quality.

This PR includes the starting point for running automated tests against all APIs (CLI, config file, and Docker images) to guarantee that the program builds and executes correctly with each upcoming change.
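
The workflow file added by this PR is not reproduced in this description, so here is only a minimal sketch of what such a PR test job could look like, assuming standard npm scripts and a Dockerfile at the repo root (the script names, Node version, and image tag are placeholders, not necessarily what this PR ships):

name: Test
on:
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Placeholder script names; the project's package.json defines the real ones.
      - run: npm run build
      - run: npm test

  docker-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Hypothetical image tag; this step only verifies that the Docker image builds.
      - run: docker build -t gpt-crawler:pr-check .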

It also adds a workflow that validates PR titles (using amannn/action-semantic-pull-request; see the sketch below) to make sure that versioning is kept consistent. This is a follow-up from:

Related to:
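
For the PR title check mentioned above, a minimal sketch following the documented basic usage of amannn/action-semantic-pull-request (the version pin, triggers, and job name here are assumptions, not necessarily what this PR ships):

name: Lint PR title
on:
  pull_request_target:
    types: [opened, edited, synchronize]

permissions:
  pull-requests: read

jobs:
  lint-pr-title:
    runs-on: ubuntu-latest
    steps:
      # Fails the check if the PR title does not follow Conventional Commits,
      # which keeps semantic-release versioning consistent.
      - uses: amannn/action-semantic-pull-request@v5
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}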

@marcelovicentegc added the enhancement (New feature or request) label Nov 26, 2023
@marcelovicentegc self-assigned this Nov 26, 2023
@marcelovicentegc marked this pull request as draft November 26, 2023 23:15
@marcelovicentegc changed the title from chore(cicd): setup docker image build test to chore(cicd): setup test pipeline Nov 26, 2023
@mukum5 commented Dec 16, 2023

// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from "crawlee";
import { readFile, writeFile } from "fs/promises";
import { glob } from "glob";
import { config } from "../config.js";
import { Page } from "playwright";

export function getPageHtml(page: Page) {
  return page.evaluate((selector) => {
    const el = document.querySelector(selector) as HTMLElement | null;
    return el?.innerText || "";
  }, config.selector);
}

if (process.env.NO_CRAWL !== "true") {
  // PlaywrightCrawler crawls the web using a headless
  // browser controlled by the Playwright library.
  const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
      if (config.cookie) {
        // Set the cookie for the specific URL
        const cookie = {
          name: config.cookie.name,
          value: config.cookie.value,
          url: request.loadedUrl,
        };
        await page.context().addCookies([cookie]);
      }

      const title = await page.title();
      log.info(`Crawling ${request.loadedUrl}...`);

      await page.waitForSelector(config.selector, {
        timeout: config.waitForSelectorTimeout ?? 1000,
      });

      const html = await getPageHtml(page);

      // Save results as JSON to ./storage/datasets/default
      await pushData({ title, url: request.loadedUrl, html });

      if (config.onVisitPage) {
        await config.onVisitPage({ page, pushData });
      }

      // Extract links from the current page
      // and add them to the crawling queue.
      await enqueueLinks({
        globs: [config.match],
      });
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: config.maxPagesToCrawl,
    // Uncomment this option to see the browser window.
    // headless: false,
  });

  // Add first URL to the queue and start the crawl.
  await crawler.run([config.url]);
}

const jsonFiles = await glob("storage/datasets/default/*.json", {
  absolute: true,
});

const results = [];
for (const file of jsonFiles) {
  const data = JSON.parse(await readFile(file, "utf-8"));
  results.push(data);
}

await writeFile(config.outputFileName, JSON.stringify(results, null, 2));

@mukum5 left a review comment on .github/workflows/pr.yml

@marcelovicentegc marked this pull request as ready for review January 3, 2024 18:29
@marcelovicentegc merged commit de19048 into main Jan 3, 2024
5 checks passed

github-actions bot commented Jan 4, 2024

🎉 This PR is included in version 1.2.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

This was referenced Jan 4, 2024
hirsaeki pushed a commit to hirsaeki/gpt-crawler-y-upstream that referenced this pull request Mar 27, 2024