[Bug] Dead loop when scraping a forbidden webpage #2056

@BrandonStudio

Describe the Bug

The worker keeps retrying the scrape with the PDF engine and never succeeds or gives up.

To Reproduce

POST to /v2/scrape

{
  "url": "https://zhuanlan.zhihu.com/p/1904292801329488682"
}
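
For anyone reproducing this from a script, here is a minimal sketch that builds the same request. The base URL `http://localhost:3002` and the helper name `buildScrapeRequest` are assumptions for illustration and are not taken from the report. Adjust the host and port to your self-hosted instance, and add an authorization header if your deployment requires one.

```typescript
// Hypothetical helper: builds the reproduction request without sending it.
interface ScrapeRequest {
  endpoint: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildScrapeRequest(baseUrl: string, targetUrl: string): ScrapeRequest {
  return {
    // Strip trailing slashes so the path is not doubled.
    endpoint: baseUrl.replace(/\/+$/, "") + "/v2/scrape",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: targetUrl }),
    },
  };
}

// Assumption: the self-hosted API listens on localhost:3002.
const request = buildScrapeRequest(
  "http://localhost:3002",
  "https://zhuanlan.zhihu.com/p/1904292801329488682",
);

// To send it on a runtime with a global fetch:
//   fetch(request.endpoint, request.init)
```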

Expected Behavior

The request either succeeds with content or fails without content. It should never loop forever.

Screenshots

Image

Environment:

  • OS: Windows
  • Deployment Type: Self-Hosted
  • Firecrawl Version: main (d1418c8)
  • Node.js Version: 22.18.0

Logs

Please refer to the screenshot above.

Additional Context

  1. When the request is received, the enabled engines are fetch, pdf, and docx.
  2. The first run of buildFallbackList returns only fetch, which is expected.
  3. The fetch scraper returns some content, with status code 403.
  4. The scrape loop treats this as a "likely proxy error" and attempts to switch to stealth:
    if (isLikelyProxyError && meta.options.proxy === "auto" && !meta.featureFlags.has("stealthProxy")) {
      meta.logger.info("Scrape via " + engine + " deemed unsuccessful due to proxy inadequacy. Adding stealthProxy flag.");
      throw new AddFeatureError(["stealthProxy"]);
    }
  5. The outer loop adds the stealthProxy feature flag and re-calls the scrape loop:
    if (
      error instanceof AddFeatureError &&
      (meta.internalOptions.forceEngine === undefined || Array.isArray(meta.internalOptions.forceEngine))
    ) {
      meta.logger.debug(
        "More feature flags requested by scraper: adding " +
          error.featureFlags.join(", "),
        { error, existingFlags: meta.featureFlags },
      );
      meta.featureFlags = new Set(
        [...meta.featureFlags].concat(error.featureFlags),
      );
  6. In this round, buildFallbackList returns pdf and docx. I'm not sure why. I understand that the fetch engine does not support stealth, but I don't know why pdf and docx appear here when they were removed in the first run. Why are they not removed this time?
  7. The scrape now runs with the pdf engine and fails with AntiBotError.
  8. The outer loop catches the error and removes the pdf feature flag:
    meta.logger.debug("PDF was blocked by anti-bot, prefetching with chrome-cdp");
    meta.featureFlags = new Set(
      [...meta.featureFlags].filter(
        (x) => x !== "pdf",
      ),
    );
  9. In the next round, buildFallbackList still returns pdf and docx.
  10. Steps 7 to 9 repeat indefinitely: a dead loop.
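
The cycle in steps 6 to 10 can be modelled in a few lines. The sketch below is a simplified simulation, not Firecrawl's actual code: `buildFallbackListSketch`, `scrapeSketch`, `stateKey`, and `scrapeWithGuard` are hypothetical names, and the rule that the stealthProxy flag always yields pdf and docx is an assumption taken from the behaviour observed above. It also shows one possible guard: stop when a round starts from a flag set and engine list that has already been tried.

```typescript
// Hypothetical model of the retry loop from the steps above; none of these
// names or rules are Firecrawl's real implementation.
type Engine = "fetch" | "pdf" | "docx";

// Plain marker class standing in for the real AntiBotError.
class AntiBotError {
  constructor(public message: string) {}
}

// Assumption: once stealthProxy is set, the list is pdf/docx whether or not
// the "pdf" flag is present, which is what the report observes.
function buildFallbackListSketch(flags: Set<string>): Engine[] {
  return flags.has("stealthProxy") ? ["pdf", "docx"] : ["fetch"];
}

// Stand-in for a scrape attempt: pdf is always blocked in this model.
function scrapeSketch(engine: Engine): string {
  if (engine === "pdf") throw new AntiBotError("blocked by anti-bot");
  return "content via " + engine;
}

// Snapshot of everything that decides what the next round will do.
function stateKey(flags: Set<string>, engines: Engine[]): string {
  return Array.from(flags).sort().join(",") + "|" + engines.join(",");
}

function scrapeWithGuard(
  initialFlags: string[],
  maxRounds = 10,
): { ok: boolean; rounds: number; reason?: string } {
  let flags = new Set(initialFlags);
  const seen = new Set<string>();
  for (let round = 1; round <= maxRounds; round++) {
    const engines = buildFallbackListSketch(flags);
    const key = stateKey(flags, engines);
    // Same flags and same engines as an earlier round: retrying cannot help.
    if (seen.has(key)) {
      return { ok: false, rounds: round, reason: "no progress since an earlier round" };
    }
    seen.add(key);
    try {
      scrapeSketch(engines[0]);
      return { ok: true, rounds: round };
    } catch (error) {
      if (!(error instanceof AntiBotError)) throw error;
      // Mirrors step 8: drop the "pdf" flag and go around again.
      flags = new Set(Array.from(flags).filter((x) => x !== "pdf"));
    }
  }
  return { ok: false, rounds: maxRounds, reason: "round limit reached" };
}

const looping = scrapeWithGuard(["stealthProxy", "pdf"]);
const healthy = scrapeWithGuard([]);
```

In this model, `scrapeWithGuard(["stealthProxy", "pdf"])` fails on the third round because the second and third rounds start from an identical state. Without the `seen` check it would only stop at `maxRounds`. Whether the real fix belongs in buildFallbackList or in the outer loop is a question for the maintainers.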

Labels: bug (Something isn't working)