Rebuilding My LLM Web Scraper Two Years Later: What Actually Changed
Back in 2024, I wrote an article about web scraping with LLMs. Put together a repo, BeautifulSoup, LangChain, Pydantic, a Streamlit UI. It did the job. People read it. I forgot about it.
A couple weeks ago I rebuilt the same thing from scratch with whatever tools are available now. Same scope, same feature set. The original took me a weekend. The rebuild took about two hours, and the result has error handling, multi-model support, retries, and a proper frontend. All things I never even attempted the first time.
The 2024 Pipeline: Five Steps, Five Failure Points
The original pipeline had five steps, and at the time I thought it was pretty clean. Fetch HTML with requests. Clean it with BeautifulSoup, strip nav, footer, scripts. Define a Pydantic schema. Build a LangChain prompt. Parse the output with PydanticOutputParser.
The parser was the interesting part. It injected format instructions into the prompt, telling the LLM "return JSON that looks like this", and then tried to validate whatever came back. It worked most of the time. When the LLM wrapped the JSON in markdown backticks or changed a field name, it broke.
# 2024: the entire extraction pipeline
chain = prompt | ChatOpenAI(temperature=0) | PydanticOutputParser(pydantic_object=ObjectsToScrape)
PromptTemplate feeds ChatOpenAI, ChatOpenAI feeds PydanticOutputParser. When the output didn't match the expected JSON shape, you got a stack trace. No fallback, no retry. Just a crash.
It was solid work for 2024. But it was a PoC. No error handling, no retries, one model, one provider. If the page exceeded the context window, the request just failed. I published the repo and moved on.
Crawl4AI Replaces BeautifulSoup (and requests, and the entire crawling layer)
The first thing I noticed when I started rebuilding was how much code I didn't need to write.
Crawl4AI takes a URL and gives you clean markdown. JavaScript rendering, retries, timeouts, all handled internally. I pointed it at a page, got back structured content, and realized my entire src/url.py was useless. That file was around 30 lines of BeautifulSoup work: stripping nav and footer tags, preserving hrefs, removing scripts. Careful, boring code that I'd debugged more than once.
And in the original, if the target page rendered content with JavaScript, you needed Selenium or Playwright on top of requests. Another dependency, another layer to manage. Crawl4AI bundles Playwright internally and I didn't even notice until I checked the docs. The "get clean content from a URL" problem went from 40% of my codebase to zero lines.
Native Structured Output Replaces LangChain Parsers
Then the extraction layer. The PydanticOutputParser, that whole pattern of injecting format instructions and hoping the LLM returns valid JSON, doesn't need to exist anymore.
Models now have native structured output. You define a Pydantic schema, pass it to the API call with response_format, and you get validated typed data back. No parsing. No format instructions in the prompt. No retry loops on malformed responses.
This is the change that surprised me the most. In 2024, output parsing was the most fragile part of any LLM pipeline. We all had retry loops, regex fallbacks, "please respond only with valid JSON" appended to prompts. Now the model guarantees the output shape through constrained decoding. When something goes wrong now, it's because the model misunderstood the content, not because it wrapped JSON in backticks. That's a problem I actually know how to fix.
I used LiteLLM as the extraction layer so I wouldn't be locked to a single provider. Same function call works for gpt-4.1-mini, o4-mini, Claude Sonnet 4, or any other model LiteLLM supports. Swapping models is a dropdown change in the UI, not a code change.
# 2026: the entire extraction pipeline
response = litellm.completion(
model="openai/gpt-4.1-mini",
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": markdown},
],
response_format={
"type": "json_schema",
"json_schema": {"name": "result", "strict": True, "schema": schema},
},
)
result = ResultModel.model_validate(json.loads(response.choices[0].message.content))
Compare that to the 2024 chain. The LLM call and the validation happen in the same step now. No intermediate parser, no format instructions injected into the prompt, no hoping the output matches.
The Async Problem Nobody Warns You About
I did hit one wall. Crawl4AI's AsyncWebCrawler is async-only, and FastAPI endpoints that do heavy sync work don't mix well with Playwright's event loop.
I spent a good 20 minutes trying the same wrong thing in different ways before finding a ThreadPoolExecutor pattern that spawns its own event loop per crawl request:
def _run_async(coro):
with ThreadPoolExecutor(1) as pool:
return pool.submit(lambda: asyncio.run(coro)).result()
Three lines. Not the most elegant solution, and I'm still not sure it's the right one for higher concurrency. But for a single-user tool it works.
Side note: Crawl4AI's docs don't mention this problem at all. There's a thread_safe=True parameter on the crawler that I found by reading the source, and their examples all assume you're running async top-level. If you're embedding it in a sync web framework, you're on your own. I ended up finding the pattern in a GitHub issue from the phidata project. Anyway.
This was the only real friction in the whole rebuild.
Where the Time Actually Went
The time I saved on crawling and extraction didn't just make the project faster to build. It changed what the project is.
The original had zero error handling because adding it would have doubled the development time. I had a weekend budget and the plumbing ate most of it. This time the plumbing was done in the first 30 minutes, so the remaining hour and a half went to actual product work: exponential backoff retries, multi-model support in the UI, proper logging, user-facing error messages.
I dropped Streamlit and built a React frontend with a FastAPI backend. The original Streamlit app had 60 lines of custom widget code just to manage schema definition rows. The new one uses a data table component and looks like an actual product.
A colleague saw the before and after and said "wait, is this the same project?" It is. Same 4 Python files in src/, same Pydantic dynamic models, same user flow: define a schema, enter a URL, get structured data.
2024 (original)2026 (rebuild)Crawlingrequests + BeautifulSoup (30 lines)Crawl4AI (1 function call)ExtractionLangChain + PydanticOutputParserLiteLLM + native structured outputModelsOpenAI onlyOpenAI, Anthropic, any LiteLLM providerError handlingNoneRetries with exponential backoffFrontendStreamlit (custom widgets)React + TypeScript + TailwindTime to buildA weekendAbout two hours
What This Means for the Next Project
In 2024 I felt more like an integration engineer than a builder. Understanding how LangChain chains worked, how output parsers validated, how to manage prompt templates. That was the job. In 2026, that layer is mostly gone. You define what you want (a Pydantic schema), point it at content (a URL), and the tools handle the wiring.
The next time I build something with LLMs, I'm not even going to think about the plumbing. I'll go straight to the product question: what should this thing actually do for the user? That's the shift. The interesting work moved up the stack.
Limitations (Because Nothing Is Free)
The new version still has real limitations. I haven't tested it on pages with complex JavaScript flows like SPAs or infinite scroll. Token costs add up if you're scraping at scale, each page runs through an LLM call that costs a few cents. The async workaround is fragile under concurrency. And structured output, while much better than PydanticOutputParser, still occasionally returns wrong data types on ambiguous fields.
It's production-grade for a single-user tool. For a team service, there's more work to do: rate limiting, caching, queue management, monitoring. The two-hour rebuild gets you to a working product, not to a scalable service.
Try It
Here's the rebuilt repo and the original article. Compare them.
If I had to boil this down to three takeaways:
Structured output killed the parser problem. Stop injecting format instructions. Pass a schema and let the model constrain its own output.
The plumbing got cheap, so spend your time on the product. If your LLM app still looks like a prototype, it's not because you lack time. It's because you're spending it on the wrong layer.
Rebuild your old stuff. Seriously. Pick an old repo and see how you evolve it with a few iterations. You'll learn more in two hours than in a week of reading release notes.
Do you have an old PoC gathering dust on GitHub? I'm curious what it would look like rebuilt with today's tools.
Comments
No comments yet. Be the first to comment!