How to Find All Existing and Archived URLs on a Website

There are several reasons you might need to find all the URLs on a website, and your exact goal will determine what you're looking for. For example, you might want to:

Identify every indexed URL to investigate issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In every scenario, one tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through a few tools for building your URL list, then show how to deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from your live site recently, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, search for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is a useful tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which can be insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these constraints mean Archive.org may not offer a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
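If you'd rather skip the browser plugin, the Wayback Machine also exposes its URL inventory through the public CDX API, which you can query directly. Here's a minimal sketch in Python using the requests library; example.com is a placeholder domain, and collapse=urlkey deduplicates repeat snapshots of the same URL:

```python
import requests

# Query the Wayback Machine's CDX API for captured URLs under a domain.
# "fl=original" returns just the URL column; "collapse=urlkey" drops
# duplicate snapshots of the same URL. example.com is a placeholder.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",
        "output": "text",
        "fl": "original",
        "collapse": "urlkey",
        "limit": 10000,
    },
    timeout=60,
)
resp.raise_for_status()

urls = resp.text.splitlines()
print(f"Retrieved {len(urls)} URLs")
```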

Moz Pro
Though you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
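If you do go the API route, the request itself is plain HTTP. Below is a rough sketch of what a call to Moz's v2 Links API might look like; the endpoint, body fields, and credential names here are from memory rather than verified, so treat them as assumptions and check Moz's current documentation before relying on any of them:

```python
import requests

# All names below are assumptions about Moz's v2 Links API; confirm the
# endpoint and body fields against the current docs before use.
ACCESS_ID = "your-access-id"      # placeholder credential
SECRET_KEY = "your-secret-key"    # placeholder credential

resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),  # Moz uses HTTP Basic auth
    json={
        "target": "example.com",   # placeholder: the site to pull links for
        "limit": 50,               # page through larger result sets
    },
    timeout=60,
)
resp.raise_for_status()

# Each result should include the URL on your site the link points to.
for link in resp.json().get("results", []):
    print(link.get("target"))
```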

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
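As a reference point, here is a minimal sketch of pulling every page with impressions through the Search Console API, using the google-api-python-client and google-auth packages. The service-account file, property URL, and date range are placeholders; granting the account access to your property is out of scope here.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder key file; the service account must already have access
# to the Search Console property.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

urls, start_row = set(), 0
while True:
    # The API caps each response at 25,000 rows, so page with startRow.
    response = service.searchanalytics().query(
        siteUrl="https://example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-12-31",
            "dimensions": ["page"],
            "rowLimit": 25000,
            "startRow": start_row,
        },
    ).execute()
    rows = response.get("rows", [])
    if not rows:
        break
    urls.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"{len(urls)} pages with search impressions")
```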

Indexing → Pages report:


This section offers exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
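If clicking through the UI for each segment gets tedious, the same page paths can be pulled programmatically. Here is a minimal sketch using the GA4 Data API via the google-analytics-data Python package, assuming a service account with read access to the property; the property ID and date range are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

# The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS;
# "properties/123456789" is a placeholder GA4 property ID.
client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    limit=100000,
)
response = client.run_report(request)

paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths collected")
```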

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be enormous, and many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (see the bare-bones sketch below).
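For a sense of how simple the bare-bones version can be, here is a minimal Python sketch that extracts unique request paths from an access log. It assumes a gzipped log in the common Apache/Nginx combined format; the filename and regex are placeholders to adapt to your setup.

```python
import gzip
import re

# Matches the request line in Common/Combined Log Format entries,
# capturing the requested path. Adjust for your server's log format.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')

paths = set()
# Placeholder filename; many servers rotate and gzip logs like this.
with gzip.open("access.log.gz", "rt", errors="replace") as f:
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1))

print(f"{len(paths)} unique paths requested")
```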
Combine, and good luck
Once you've gathered URLs from these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
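If you go the Jupyter Notebook route, a short pandas script covers the merge and dedupe. A minimal sketch, assuming each source was exported as a headerless, single-column CSV of full URLs; the filenames and normalization rules are placeholders to adjust:

```python
from urllib.parse import urlsplit

import pandas as pd

# Placeholder filenames: one headerless, single-column CSV per source.
sources = ["archive_org.csv", "gsc_pages.csv", "ga4_pages.csv", "moz_links.csv"]
frames = [pd.read_csv(f, header=None, names=["url"]) for f in sources]
urls = pd.concat(frames, ignore_index=True)["url"].dropna()

def normalize(url: str) -> str:
    # Lowercase the scheme and host (the case-insensitive parts), drop
    # fragments, and trim trailing slashes so one page has one spelling.
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

deduped = sorted({normalize(u) for u in urls if u.startswith("http")})
pd.Series(deduped, name="url").to_csv("all_urls.csv", index=False)
print(f"{len(deduped)} unique URLs written to all_urls.csv")
```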

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
