How to Find All Existing and Archived URLs on a Website

There are many good reasons you might need to find all the URLs on a website, but your exact goal will determine what you're looking for. For example, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Gather current and historic URLs Google has seen, especially for site migrations
Uncover all 404 URLs to recover from post-migration errors

In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a “site:example.com” search is limited and hard to extract data from.

In this post, I'll walk you through a few tools to build your URL list before deduplicating the data in a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often give you what you need. But if you're reading this, you probably didn't get that lucky.
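
If you do turn up a saved sitemap, pulling the URLs out of it takes only a few lines. Here's a minimal Python sketch, assuming a standard sitemap.xml saved locally (the filename is hypothetical):

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(path):
    """Return every <loc> value from a saved sitemap.xml file."""
    tree = ET.parse(path)
    return [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

# "old-sitemap.xml" is a hypothetical filename for whatever your team saved
urls = urls_from_sitemap("old-sitemap.xml")
print(len(urls), "URLs recovered")
```

If the saved file is a sitemap index rather than a plain sitemap, run the same function over each child sitemap it references.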

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.

That said, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.

To work around the missing export button, use a browser scraping plugin like Dataminer.io. Even so, these limits mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
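
If you'd rather skip the scraping plugin, the Wayback Machine's public CDX API returns the same data in a scriptable form. Here's a rough Python sketch; the domain and limit below are placeholders, and very large sites may need to page through results:

```python
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",      # your domain
        "output": "json",
        "fl": "original",            # just the captured URL
        "collapse": "urlkey",        # one row per unique URL
        "filter": "statuscode:200",  # skip redirects and errors
        "limit": 50000,
    },
    timeout=120,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the header
print(len(urls), "archived URLs found")
```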

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs on your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
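
Once the export is downloaded, turning it into a clean URL list is straightforward. This is just a sketch: the filename and column name below are hypothetical and may differ from your actual Moz Pro export.

```python
import pandas as pd

# "moz_inbound_links.csv" is whatever you named the Moz Pro inbound links export
links = pd.read_csv("moz_inbound_links.csv")

# The target-URL column name differs between exports; adjust as needed
target_col = "Target URL" if "Target URL" in links.columns else links.columns[0]

urls = links[target_col].dropna().drop_duplicates().sort_values()
urls.to_csv("moz_target_urls.csv", index=False)
print(len(urls), "unique target URLs")
```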

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section offers exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
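
As a sketch of the API route, the query below pulls every page with impressions over a date range using the Search Console API; it assumes a service account that has been added to the property, and the key file, dates, and site URL are placeholders.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES  # placeholder key file
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = [], 0
while True:
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,   # API maximum per request
        "startRow": start_row,
    }
    resp = service.searchanalytics().query(
        siteUrl="https://example.com/", body=body
    ).execute()
    rows = resp.get("rows", [])
    pages.extend(r["keys"][0] for r in rows)
    if len(rows) < 25000:   # last page of results
        break
    start_row += 25000

print(len(pages), "pages with impressions")
```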

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they provide important insights.
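
If the UI exports become unwieldy, the same report can be pulled programmatically with the GA4 Data API. A rough sketch, assuming the google-analytics-data client library, application-default credentials, and a placeholder property ID; the /blog/ filter mirrors the segment from Step 3:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, Filter, FilterExpression, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(len(paths), "blog paths")
```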

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Concerns:

Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process; a minimal parsing sketch follows this list.
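
As a starting point before reaching for a dedicated log analyzer, a few lines of Python will pull the unique requested paths out of a standard access log. This sketch assumes the common/combined log format and a hypothetical access.log filename:

```python
import re
from urllib.parse import urlsplit

# Matches the request portion of a common/combined log line: "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        match = REQUEST_RE.search(line)
        if match:
            # Drop query strings so /page and /page?utm_source=... collapse together
            paths.add(urlsplit(match.group(1)).path)

print(len(paths), "unique paths requested")
```
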
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
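
For the Jupyter route, here's a minimal pandas sketch that concatenates the exports, normalizes the obvious formatting differences, and deduplicates; the filenames are hypothetical, and the normalization rules should match how your site actually treats URLs:

```python
import pandas as pd

# Hypothetical export filenames, each with the URL in its first column
sources = ["archive_org.csv", "gsc_pages.csv", "ga4_paths.csv", "moz_target_urls.csv"]
frames = [pd.read_csv(name, usecols=[0]).set_axis(["url"], axis=1) for name in sources]

urls = pd.concat(frames, ignore_index=True)["url"].astype(str).str.strip()
urls = (
    urls.str.replace(r"#.*$", "", regex=True)  # drop fragments
        .str.rstrip("/")                       # normalize trailing slashes
)

deduped = urls[urls != ""].drop_duplicates().sort_values()
deduped.rename("url").to_csv("all_urls.csv", index=False)
print(len(deduped), "unique URLs")
```

Note that GA4 and log-file exports give you paths rather than absolute URLs, so prefix your domain to those columns before combining if you want everything on the same footing.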

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
