How to See What’s Behind a Website
IP address
To visit a website, your device needs to know the Internet Protocol address, or IP address, of the computer that hosts it. Hosting a website means making it available to the world; the computers responsible for doing so are often called servers.
An IP address is typically written as a series of four numbers, separated by periods, each of which ranges from 0 to 255.
For example: 172.217.16.174 is the IP address of one of the servers that hosts the “google.com” website, at which visitors can access Google’s search engine.
At any given time, each device that is directly connected to the internet - be it a webserver, an email service or a home WiFi router - is identified by a particular IP address. This allows other devices to find it, to request access to whatever it is hosting and, in some cases, to send it content like search terms, passwords or email messages.
Many devices, including most mobile phones, laptops and desktop computers, connect to the internet indirectly. They can reach out to websites and other services - and they can receive replies - but most other devices cannot reach out to them. In a sense, they are not listening for connections. Many of these devices have what are called “internal IP addresses.” This means that devices on the same local network can connect to them directly, but others cannot. If you look up the IP address of your phone or laptop, you will likely find an internal IP address, but you will rarely find one associated with a website.
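If you want to check this yourself, a few lines of Python can resolve a domain name to its current IP address and distinguish internal from public addresses. This is a minimal sketch using only the standard library; the addresses shown are examples, and the result of a live lookup will vary:

```python
import ipaddress
import socket

def resolve(domain):
    """Ask the system's DNS resolver for the IPv4 address of a domain."""
    return socket.gethostbyname(domain)

def is_internal(ip):
    """True for internal (private) addresses, such as those assigned by a
    home WiFi router; these are not reachable from the wider internet."""
    return ipaddress.ip_address(ip).is_private

print(is_internal("192.168.1.10"))    # typical home-network address -> True
print(is_internal("172.217.16.174"))  # a public address -> False
# print(resolve("google.com"))        # requires a network connection
```

Note that `resolve()` needs network access, which is why the call is left commented out; the private-address check works entirely offline.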
Domain registrar, domain registrants & domain registration
Domain names are unique. There can only be one “google.com,” for example. The process of purchasing a domain name is called domain registration.
This process ensures that domain names remain unique and makes it more difficult for someone to impersonate a website they do not control. When someone registers a domain name, a record is created to keep track of that domain’s official owner and administrator (or their representatives).
A person who registers a domain is called a domain registrant. That registrant - or someone to whom they give access - can then point their domain to a particular IP address. If a webserver is listening at that IP address, a website is born.
The companies that handle the registration process are called domain registrars, and they almost always charge a fee for their services. Example registrars include GoDaddy.com, Domain.com and Bluehost.com, among many others. These companies are required to keep track of certain information about each of their registrants.
A non-profit organisation called the Internet Corporation for Assigned Names and Numbers (ICANN) governs the domain registration process for every website in the world.
Web host
We know that a website has a domain name and that a domain name is translated into an IP address. We also know that every website is actually stored on a computer somewhere in the physical world. The computer that hosts the website is called a web host.
There is an entire industry of companies that store and serve websites. They are called web hosting companies. They have buildings filled with computers that store websites, and they can be located anywhere in the world. While it is most common for websites to be hosted in “data centres” like these, they can actually be hosted from almost any device with an internet connection.
Basic WHOIS Look-up
When researching a website, one of the most useful sources of data can be found in its domain registration details.
Over the course of your investigation, it might be relevant to know who – whether it is an organisation or an individual – owns a particular domain, when it was registered and by which registrar, as well as other details. In many cases, this information can be accessed through third-party services that are detailed below.
Yet, as mentioned earlier, sometimes the owner of a domain may not want to be publicly linked to the site. Whatever the reason - be it not wanting to be associated with the site’s content or just wishing to maintain a degree of privacy - it’s worth noting that domains can be registered through proxy or intermediary organisations that conceal the full details of the registration.
The information collected from domain registrants is called WHOIS data, and it includes contact details for the technical staff assigned to manage the site, as well as contact details of the actual site owner or their proxy.
This data has long been publicly available on sites like ICANN’s WHOIS Lookup. However, there are currently other free or partially-free services (some have fees for advanced searches and extended results) that also aggregate WHOIS information and which often provide more details and more accurate and up-to-date information than ICANN.
Note that if you are making many requests for information in a short period of time, on most of these sites you may receive an error and need to wait or switch to a different service to continue your searches. Similarly, many of these sites require you to complete CAPTCHAs (selecting various items from images) to make sure you are not a robot.
These are some of the sites providing useful WHOIS data for free:
As mentioned above, many registrars offer the ability to act as proxy contacts on the domain registration forms, a service known as “WHOIS privacy”. In such cases, domains registered with WHOIS privacy will not list the actual names, phone numbers, postal and email addresses of the true registrant and owner of the site, but rather the details of the proxy service. While this can frustrate some WHOIS queries, the lookup tool is nonetheless a powerful resource for investigating a domain.
As different search engines return different results for the same query depending on their indexes and algorithms, it may be that searching with different WHOIS query services returns varying amounts of detail about your domain of interest. Checking with multiple sources whenever possible is therefore a good way to make sure you collect as much information as possible, as is standard in any part of an investigation.
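Behind all of these services sits a very simple network protocol: a WHOIS query is just the domain name sent over TCP port 43 (RFC 3912), and the answer comes back as plain text. The sketch below speaks that protocol directly; it is a minimal illustration rather than a replacement for the services above, and the default server shown handles .com domains:

```python
import socket

def build_query(domain):
    """A WHOIS query is just the domain name followed by CRLF (RFC 3912)."""
    return (domain + "\r\n").encode()

def whois_lookup(domain, server="whois.verisign-grs.com"):
    """Query a WHOIS server directly over TCP port 43 and return the
    plain-text response. The default server covers .com domains."""
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall(build_query(domain))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode(errors="replace")

# print(whois_lookup("usps.com"))  # registrar, creation and expiry dates, etc.
```

The live lookup is commented out because it requires a network connection; note that registry servers often refer you on to the registrar’s own WHOIS server for full details.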
To illustrate this, let’s look at what a search for “usps.com” (the website of the United States Postal Service) on several WHOIS services leads to.
A query for WHOIS data for “usps.com” using the ICANN WHOIS Lookup returns:
ICANN WHOIS data for “usps.com” on 19 February 2019
The information we get about the registrant is limited – we can only see the domain’s creation and expiry dates – and the registrar’s details appear in place of those of the registrant.
To show how the information returned from these services may differ, a search for “usps.com” on https://who.is/ returns more information about the Postal Service, including an address, email contact, and phone number.
Who.is WHOIS data for “usps.com” on 19 February 2019
Tip:
In addition to the WHOIS search tools above, IntelTechniques – the website of Michael Bazzell, an open source intelligence consultant – provides an aggregated list of domain search tools that allow you to compare search results from several sources of WHOIS data. Just check the Domain Name search menu on the left-hand side. Also note that IntelTechniques has a rich offering of other tools you can use in your investigations, such as image metadata search and social media search tools.
GDPR implications
The European Union’s (EU) General Data Protection Regulation (GDPR) has created a lot of uncertainty about the status of public WHOIS registries in the EU because, in theory, WHOIS data of owners and administrators of EU-registered domains should not be collected and published by registrars. Under the GDPR, it is considered to be private information.
However, ICANN has sued several European registrars for deviating from its interpretation of the GDPR, which takes a more relaxed approach to the regulation and permits limited access to WHOIS data. Even after the GDPR’s implementation, ICANN continued to demand that EU registrars at least collect data about site owners and administrators, if not make it publicly available. ICANN’s interpretation has been repeatedly rejected by the courts, but its insistence that its policy for EU registrants is GDPR compliant leaves a lot of questions unanswered. Most likely, collection of and access to WHOIS data for EU-based registrants will remain restricted.
Even in these conditions, some researchers are finding ways to work around the restrictions that make some registrants’ data inaccessible at times. This post by GigaLaw - a US law firm specialising in domain name disputes - provides some tips and techniques that can prove successful.
Historic WHOIS
Historic data can be a useful tool when investigating websites, because it can track the transfer of a domain’s ownership. It can also help identify owners of websites who have not consistently chosen to obscure their registration data using a WHOIS privacy service.
Example:
One example where this historic data proved useful was the?investigation of a cybercrime gang ?known as Carbanak, who were believed to have stolen over a billion dollars from banks. Using the historical data provided by DomainTools, a researcher was able to link multiple sites together by going through their historical records and finding hundreds of domains that were initially registered with the same phone number and Yahoo email address. These contact details were later used to establish a link between Carbanak and a Russian security company.
For your own investigations, several companies offer access to historic WHOIS records, though these records may often be restricted to non-EU countries due to the GDPR, as mentioned above.
DomainTools is perhaps the best-known of these companies that offer historic hosting and WHOIS data. Unfortunately, this data is not free, and DomainTools requires you to register for a membership in order to access it.
Whoisology is an alternative to DomainTools that also provides historical WHOIS data. It requires you to create an account for both the basic free and the advanced fee-based services. There is a limit to the number of free basic searches per day, and this option only provides you with the latest historical data archive of a website (not the full history). The full historical archives require payment, and there are several annual fee rates depending on the number of searches and other features the service provides. Whoisology doesn’t work via the Tor Browser, and it may also use CAPTCHAs to verify that you are a real person searching for information.
Safety First!
If you decide to set up an account with these services, it may be a good idea to create a new email address that you can use for this purpose only. This way you avoid sharing your regular contact data and other personal details.
Reverse WHOIS Look-up
Reverse phone directories, which allow you to look up a phone number to determine who it belongs to, were a staple of investigative work for years. These directories contained the same information as a phone book, but they organised it differently: entries were sorted by phone number rather than by name. This allowed investigators to cross-reference phone numbers back to the names of the people to whom those numbers belonged. While printed reverse directories have long since been replaced by online databases (such as White Pages Reverse Phone), the need to cross-reference information has expanded into many other applications.
Investigators often need to look up residents by home address, to get names from email addresses or find businesses by officer or incorporation agent (a person or business that carries out company formation services on behalf of real owners). Reverse directories should be part of any investigator’s toolkit. The notion of tracing little pieces of information back to their sources is central to the investigative mindset.
When you look up the domain names registered to a certain email address, phone number or name, it is called a “reverse WHOIS lookup”. Several sites offer these kinds of searches.
To identify the owner of a domain – especially when that owner has taken some steps to obscure their identity – you will need to locate all the information about the website that can be reverse searched. The tools available to cross-reference information from a website will change, and the information available will vary for each site, but the general principle is consistent. When trying to locate the owner of a domain name, focus on locating information that can help you “reverse” back to an ultimate owner.
Here are some tools you can use for reverse searches:
ViewDNSinfo is free and allows searches by email or phone number. It also provides other useful options, such as searching by an individual or company name and historical IP address searches (a historical list of the IP addresses a given domain name has been hosted on, as well as where each IP address is geographically located). Note that IP address owners are sometimes marked as ‘unknown’, so it helps to use several websites for your searches and combine the results for a fuller picture. It works via the Tor Browser and doesn’t have CAPTCHAs.
You can register on Domain Eye to get 10 free searches per day. It works via Tor Browser and doesn’t have CAPTCHA.
A paid service with no free demos available for reverse WHOIS at the moment. It works via Tor Browser and doesn’t have CAPTCHA.
ViewDNSinfo example of reverse WHOIS search based on email address [email protected] (used by the Internet Archive), date searched 11 January 2019
Finding information with shared hosting and reverse IP search
Often it’s not that simple to determine domain ownership, particularly if the owner has gone to some lengths to hide their identity. At this point we should try to look at the situation from another perspective. If straightforward search queries aren’t providing fruitful results, we can look for smaller, less apparent clues by combing through data that is somehow related to the website, but may not be obviously connected or easy to collect.
Websites are hosted on one or more servers, or computers running server applications that transmit the site’s content to visitors. Web hosting has a cost, either in the form of a monthly subscription or in the form of buying and running physical computer infrastructure. To reduce costs, or sometimes because of prior relationships with web administrators, related websites will often share hosting. Analysing the other domains sharing the same hosting service can sometimes shine a light on the owner or administrator of the site you are investigating.
Note:
There is a difference between a domain’s owner and its administrator. Sometimes a registered administrator may not be the actual owner of the domain. In many instances, a technical point of contact is in charge of registering the domain and administering the website infrastructure on behalf of the owner. This does not necessarily include administration of and responsibility for the website’s content.
You can use the IP address to see which other sites are hosted on the same server. This is useful in identifying websites that, since they are hosted on the same server, might be related.
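As a rough first check before turning to the reverse IP services below, you can compare the addresses two domains currently resolve to. This sketch uses only Python’s standard library; a matching IP suggests shared hosting, but as noted below it is a lead to verify, not proof of a connection:

```python
import socket

def same_ip(domain_a, domain_b):
    """Do two domains currently resolve to the same IP address?
    A match hints at shared hosting; unrelated sites can also share
    a server, so treat the result as a lead, not as evidence."""
    return socket.gethostbyname(domain_a) == socket.gethostbyname(domain_b)

# same_ip("tacticaltech.org", "myshadow.org")  # requires a network connection
```

Dedicated reverse IP services remain more thorough, since they list every domain they have observed on an address rather than comparing just two.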
You can find thorough results by searching for a domain name or IP address in ViewDNSinfo’s Reverse IP search box.
Let’s search for “tacticaltech.org” and see what other domains are hosted at its IP address, 213.108.108.217.
ViewDNSinfo example of reverse IP search for tacticaltech.org
The search returns a list of 19 domains hosted on the same server and sharing the IP address. It so happens that in this case they are all related to the same organisation, Tactical Tech. This won’t always be the case, as it often happens that unrelated domains share the same server IP, so further research is required before you can conclude there is a real connection.
Another way to list sites that share the same IP address is by adding the prefix “IP:” to your IP address query in the Bing search engine, as illustrated in the example below.
Using Bing’s “IP:” search prefix to identify sites hosted on the same IP addresses
It’s worth pointing out that, while ViewDNS provides a list of domains, searching an IP address with the “IP:” search prefix in Bing also returns specific webpage addresses (such as https://myshadow.org/location-tracking , shown above). Given the varied results from any collection of sources, you should again use multiple services and compare the results.
Additional resources that offer similar services include:
Robtex offers information curated from various sources, such as Alexa, which estimates the popularity of websites, and SEMrush, which gives a sense of how visible websites are in search engine results. Some services are free, but you can buy credit to download more detailed findings such as reverse WHOIS reports. Robtex also works via the Tor Browser.
Robtex.com search for tacticaltech.org
Netcraft displays domain information as well as other details that may be useful in investigating a website, such as web trackers, hosting history and site technology. By searching for a domain in the “site contains” search box here: https://searchdns.netcraft.com , you will be able to click on the “site report” icon for the relevant result.
Netcraft’s site report results for archive.org
Webhostinghero shows you which web hosting company is being used by a domain name. The fact that two domains are hosted at the same company does not mean they are related or have the same owner. However, it is common for administrators who manage several websites to use the same hosting provider for the sake of convenience – a practice which could reveal connections. Webhostinghero also works via the Tor Browser.
In some cases, administrators do not use hosting providers, but rather host their websites independently, whether from their own data centre, office, or even home. In these instances, it may be more straightforward to identify links among the websites hosted there.
Other services like this one include: https://www.whoishostingthis.com/ and https://hostingchecker.com/ , both accessible via the Tor Browser.
webhostinghero
Websites that share an owner are often designed and hosted using the same software. BuiltWith will scan a website and try to determine the web technologies upon which the site relies. You can then search other sites you suspect might be related and look for similarities. If you find a match, you may be able to use the other tools presented here to find additional evidence of a connection.
BuiltWith search results for securityinabox.org
Discovering useful information in a webpage’s source code
A webpage that you see in your browser is a graphical translation of code.
Webpages are often written in plain text using a combination of markup and scripting languages, such as HTML (HyperText Markup Language) and JavaScript, among others.
Together, these are referred to as a website’s source code, which includes both content and a set of instructions, written by programmers, that makes sure the content is displayed as intended.
Your browser processes these instructions behind the scenes and produces the combination of text and images you see when accessing a website. With a simple extra step, your browser will let you view the source code of any page you visit.
Give it a try. Open up your browser and take a look at the source code of a website that interests you. You can usually right-click on the page and select “View page source”. On most Windows and Linux browsers, you can also press CTRL+U. For Mac instructions and additional tips, check out this guide on how to read source code (also accessible via the Tor Browser).
For example:
Part of the source code for the White House website https://www.whitehouse.gov , which you can reveal by right-clicking on the page and selecting “View page source”, looks like this: Example of source code
If you’ve never looked at a site’s source code before, you might be struck by how much of the information that is transmitted to your computer does not appear when you view the page in your browser.
For instance, there may be comments left by whoever wrote the source code. These comments are only visible when you view the source – they are never displayed in the rendered page (that is, the page that has been translated into graphics and text). Comments begin with <!--, which indicates that what comes next is a comment and should not be displayed on the page, and they end with -->, which signals the end of the comment.
Comments are often written in plain language and sometimes provide hints about who maintains a website. They may also include personal notes or reveal information such as a street address or copyright designation.
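Hunting for comments by scrolling through source code gets tedious on large pages. A short script can pull them all out at once; this is a minimal sketch using Python’s standard library, and the page snippet shown is invented for illustration:

```python
import re

def html_comments(source):
    """Extract all HTML comments (<!-- ... -->) from a page's source."""
    return re.findall(r"<!--(.*?)-->", source, flags=re.DOTALL)

# A made-up fragment of page source, for illustration:
page = """<html><head>
<!-- Site maintained by the example.org web team -->
</head><body><p>Hello</p>
<!-- TODO: renew domain before March -->
</body></html>"""

for comment in html_comments(page):
    print(comment.strip())
```

In practice you would feed in the source you saved from “View page source” (or fetched with your own tooling) rather than a hard-coded string.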
Finding connections with reverse Google Analytics ID
There are numerous things you can uncover from a page’s source code, but one good example is code that helps website owners and administrators monitor the traffic that a website is receiving. One of the most popular such services is Google Analytics - https://analytics.google.com .
Sites that are related often share a Google Analytics ID. Because Google Analytics allows multiple websites to be managed by one traffic-monitoring account, you can use their ID numbers to identify domains that may be connected by a shared ownership or administrator.
Sites that use Google Analytics embed an ID number into their source code. All Google Analytics IDs begin with “UA-”, and are followed by an account number. They look a bit like this: “UA-12345678-2”.
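Because the IDs follow such a regular pattern, they are easy to pull out of a page’s source automatically. This is a minimal sketch with Python’s standard library; the snippet and ID in it are made up for illustration:

```python
import re

def analytics_ids(source):
    """Find Google Analytics tracking IDs (the "UA-<account>-<property>"
    pattern) embedded anywhere in a page's source code."""
    return sorted(set(re.findall(r"UA-\d{4,10}-\d{1,4}", source)))

snippet = "ga('create', 'UA-12345678-2', 'auto');"  # made-up example
print(analytics_ids(snippet))  # prints ['UA-12345678-2']
```

Any ID this turns up can then be fed into the reverse search tools discussed below.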
For example:
Following on from the White House example above, the Google Analytics ID for www.whitehouse.gov is “UA-12099831-10”. You can find this out yourself by following these steps while on the website:
Whitehouse Analytics code
The number after the first dash (12099831) is the White House’s Google Analytics account number. The number at the end (10, in this case) identifies which of the websites – or “properties” – registered under that account the tracking code belongs to.
Because multiple websites can be managed on one Google Analytics account, you can use Google Analytics ID numbers to identify domains that may be connected by a shared ownership or administrator.
There are several reverse search tools that allow you to locate sites that share a given analytics ID. Examples include:
As usual, it’s advisable to search the same Google Analytics ID on several of these websites, as their results tend to vary.
Note:
Sometimes one website may copy the source code of another even if they are not actually related. This can lead to misleading results when looking up a Google Analytics ID. Reverse lookups of Google Analytics IDs must always be treated as a possible lead and not as hard evidence. The technique is useful, but it is worth repeating the importance of checking multiple sources before drawing conclusions.
For instance, in the case above, searching for the White House ID (UA-12099831-10) with any of these services will return a list of sites sharing the same Google Analytics ID as the White House website. (Also note that results tend to differ from service to service; some will return more sites, others fewer, so search on several to compile a thorough list of findings.) If you do this exercise, you will notice that several websites that are most likely unrelated to the official White House site also appear on the list. Some are parody sites, others are gaming sites, and so on. Although this looks bizarre at first, the explanation is rather simple – the White House source code has been copied and replicated without deleting the Google Analytics ID. Therefore, not all the listed sites are related in this case. It is also worth noting that the unrelated websites are not actually using the Google Analytics ID of the White House and its genuinely related sites; they are merely displaying it.
How can these searches help an investigation?
If a website owner or administrator is obscuring their identity on one site, they may not have taken similar measures on every site they manage or own. Enumerating these sites by reverse searching the Google Analytics IDs can help you locate related websites that may be easier to identify.
Example:
In a 2011 article, Wired columnist Andy Baio revealed that out of a sample of 50 anonymous or pseudonymous blogs he researched, 15 percent were sharing their Google Analytics ID with another website. This finding proved fruitful for unmasking anonymous sites. Out of the sample of 50, Baio claimed to have identified seven of the bloggers in 30 minutes of searching. The full story is available here.
Let’s try an exercise and see if the website Our Revolution uses Google Analytics to monitor traffic.
Screenshot of “Ourrevolution.com”
To determine whether “Our Revolution” has a Google Analytics ID we have to view the source code as described above.
Source code of “ourrevolution.com”
We can then use one of the reverse search tools mentioned above to see if other sites are using that same Google Analytics ID. On DNSlytics, for instance, choose Reverse Analytics from the Reverse Tools top navigation menu.
Searching by Google Analytics ID on DNSlytics
In addition to the “Our Revolution” domain where we found the Analytics ID, the search returns another domain name: “Summer for Progress” - https://web.archive.org/web/20190831040944/https://summerforprogress.com/ (an archived copy, as the actual website “https://summerforprogress.com/” is now offline).
Results of Google Analytics ID search on dnslytics.com
Metadata analysis
When someone creates a file (such as a document, PDF or spreadsheet) on their computer, the programs they use automatically embed information in that file.
We can consider “data” to be the contents you see in a file: the words in a document, the charts in a PDF, the numbers in a spreadsheet or the elements of a photograph.
On the other hand, the automatically embedded information is called “metadata”.
Examples of metadata might include the size of the file, the date when the file was created, or the date when it was last changed or accessed. Metadata might also include the name of the file’s author or the name of the person who owns the computer used to create it.
There are many types of metadata. Here, we look at how to find and make sense of several examples that are useful for investigations.
With documents, even if metadata doesn’t always identify the author or creator of a file (if they take steps to keep this identity hidden, for example, by deleting metadata such as name or dates), it often still provides clues to their identity or other significant facts about them or the devices and software they used to work on those files.
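To see concretely where such metadata lives, consider Office documents: a .docx file is really a zip archive, and its core properties (author, creation and modification dates) sit in an XML entry called docProps/core.xml. The sketch below reads the recorded author using only Python’s standard library; the filename is hypothetical, and the field may of course be empty or deliberately scrubbed:

```python
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "cp": ("http://schemas.openxmlformats.org/package/2006/"
           "metadata/core-properties"),
    "dc": "http://purl.org/dc/elements/1.1/",
}

def docx_author(path):
    """Read the author recorded in a .docx file's embedded metadata.
    Office files are zip archives; the core properties (author, dates)
    live in the docProps/core.xml entry."""
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
    creator = root.find("dc:creator", NS)
    return creator.text if creator is not None else None

# print(docx_author("report.docx"))  # the recorded author's name, if any
```

Document viewers expose the same fields through their “properties” dialogs; a script like this simply lets you sweep many downloaded files at once.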
A similar situation happens when we take photos: the image files our cameras produce often contain a type of metadata called EXIF (Exchangeable image file format). EXIF metadata can reveal information related to when and where the photo was taken: time, date, GPS (Global Positioning System) location, etc.
Users can manually remove this potentially identifying information, and many apps and websites clear metadata from uploaded files in order to protect their users. In some cases, however, EXIF metadata that remains in the final version of a photograph may end up revealing clues about the identity of the photographer, locations, dates and other information that can help you connect the missing links in your investigation.
Example:
For example, American serial killer Dennis Rader was arrested after mailing a disk containing documents from his church to a news organisation. The documents contained metadata that identified their author. Here is an article in The Atlantic showing how it happened.
With this in mind, if you can’t find the owner of a domain name through the means and tools presented above, it can be useful to download all text documents, spreadsheets, PDFs and other files hosted by the site. From there, you can analyse the documents’ metadata and look for an author name or other identifying details. You can do this by checking the properties of the documents after you download them. Keep in mind, however, that documents like these sometimes contain malware that can put you and those with whom you work at risk. To avoid this, you should not open them with a device that you use for any other purposes (work or personal) or that is connected to the internet.
Safety First! - Opening downloaded files from unknown sources
Some investigators maintain a separate laptop that they use only to open untrusted files. These devices are often called ‘air gapped’ computers because, once they are set up, they are never connected to the internet.
As an alternative, you can restart your computer from a USB stick that contains the Tails operating system when you need to analyse suspicious documents. Even if a document contains malware that affects Tails, any damage it might do will become irrelevant once you reboot back into your normal operating system. And the next time you restart into Tails, you will have a clean system once again. Tails is based on the GNU/Linux operating system, however, so it comes with a bit of a learning curve.
To use either of these techniques, you will need a USB stick or an external hard drive so you can transfer the files in question.
Finally, if you are not worried about associating yourself with the documents or about exposing their contents to Google (or to anyone with the authority to access other people’s Google accounts), you can upload them to Google Drive and search for metadata using Google Docs. Don’t worry, Google is pretty good at protecting their servers from malware!
Not all documents will contain metadata. It’s not always embedded in the first place, and the creator can easily delete or modify it, as can anyone else with the ability to edit the document. Moreover, not all metadata relates to the original author. Documents change hands and are sometimes created on devices that belong to people other than the author.
Again, any piece of information you find needs to be verified and corroborated from multiple sources. Despite that, metadata could provide you with additional leads or help to confirm other evidence you have already found.
Case Study
In addition to helping you identify the true owner of a document or website, metadata can also provide clues about employment contracts and other affiliations and connections. For example, a Slate writer analysed the PDFs found on a conservative policy website run by former American media personality Campbell Brown and discovered that all of them were written by staff working for a separate right-leaning policy group. The link between these two groups was not known until the metadata analysis was conducted. The full story is available here.
Let’s look at how this finding can be replicated.
The PDF described in this article was originally found at the following web address on the commonsensecontract.com website: https://commonsensecontract.com/assets/downloads/Rewards_for_Great_Teachers.pdf (document archived here).
It has since been taken down and, indeed, that domain name now points to a completely different website: https://commonsensecontract.com . You can still find the original one archived on the Internet Archive’s Wayback Machine.
To learn more about the Wayback Machine, see our resource on “Retrieving and Archiving Information From Websites.”
Archived webpage from “commonsensecontract.com”
You can follow the steps below to examine the metadata in question. But first:
To view the metadata in this PDF:
Exposing Hidden Web Content
Nearly every site on the internet hides something (and often, many things) from visitors, intentionally or not. For example, the content management systems employed by most sites hide the internal files used to generate posts and maintain the website. Databases that store data for sites and applications are usually hidden from public access. Cookies and other client-side data, while accessible and legible to a knowledgeable user, are concealed from the view of the average user, stored and processed automatically in the background.
There are simple tools and techniques that allow anyone to access such information without doing anything shady. These are just small tricks that let you see what a website is made of and what additional data it might reveal to you. Accessing such information can be helpful when investigating a website to determine its owners or to identify connections to other sites. It can also help turn up contact details or further leads for your research.
Robots.txt
Websites indicate how scrapers and search engines should interact with their content by using a file called “robots.txt”. This file allows site administrators to request that scrapers, indexers, and crawlers limit their activities in certain ways (for instance, some do not want information and files from their websites to be scraped).
Robots.txt files list particular files or subdirectories - or entire websites - that are off-limits to “robots”. As an example, this could be used to prevent the Wayback Machine crawlers from archiving all or part of a website’s content.
Some administrators may add sensitive web addresses to a robots.txt file in an attempt to keep them hidden. This approach can backfire, as the file itself is easy to access, usually by appending “/robots.txt” to the domain name.
Be sure to check the robots.txt file of the websites you investigate, just in case they list files or directories that the sites’ administrators want to hide. If a server is securely configured, the listed web addresses might be blocked. If they are accessible, however, they might contain valuable information.
Each subdomain can have its own robots.txt file. Subdomains have web addresses that include at least one additional word in front of the domain name. For example, the Internet Archive itself has at least two robots.txt files: one for its main site, at https://archive.org/robots.txt, and one for its blog, at https://blog.archive.org/robots.txt.
It is worth noting that robots.txt files are not meant to restrict access by humans using web browsers. Also, websites rarely enforce these restrictions, so email harvesters, spambots, and malicious crawlers often ignore them. If you are scraping a website using automated tools, however, it is considered polite to comply with any directives you might find in a robots.txt file.
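The robots.txt format is simple enough to inspect with a short script. The sketch below pulls the “Disallow:” paths out of a robots.txt file using plain Python; the sample content is hypothetical, and in practice you would first download the file from https://&lt;domain&gt;/robots.txt (Python’s built-in urllib.robotparser module can also fetch and interpret these files).

```python
# Hypothetical robots.txt content; a real file would be fetched from
# the target site, e.g. with urllib.request.
sample = """\
User-agent: *
Disallow: /internal/
Disallow: /drafts/secret-report.pdf
Allow: /public/
"""

def disallowed_paths(robots_txt: str) -> list:
    """Return the paths listed under "Disallow:" rules."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty "Disallow:" means nothing is blocked
                paths.append(path)
    return paths

print(disallowed_paths(sample))
# → ['/internal/', '/drafts/secret-report.pdf']
```

Each extracted path can then be appended to the domain name and opened in a browser, exactly as described below.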
Example:
As a test, we can access the robots.txt file for the Payment Card Industry Security Standards Council.
This is an interesting example not because the Council is trying to hide anything but because their robots.txt file - pcisecuritystandards.org/robots.txt (archived here) - lists a number of digital files - including Word documents, PDFs and spreadsheets - none of which would turn up in regular search results:
Screenshot of robots.txt
To visit a webpage or download a document that you find this way, just copy the partial web address on the right-hand side of a “Disallow:” restriction and paste it into your browser’s address bar after the domain name. In this case, you can download the “SAQ_C_v3.docx” file you see in the image using the following web address: https://www.pcisecuritystandards.org/SAQ_C_v3.docx
Often, such files will be accessible through the website itself, so this might just be a shortcut. In some cases, however, you might stumble upon pages or files that a website administrator was trying to hide.
Remember: digital files can contain malware, so take care when opening them. Consider using an online document viewer, unless you are concerned about sharing the content of those documents with whoever operates your document viewing service.
Sitemap.xml
Sitemap files are, in a sense, the opposite of robots.txt files. Site administrators use them to inform search engines about pages on their site that are available for crawling. Websites often use sitemap files to list all of the parts of the site they want indexed, and how often they want search engine indexes to be updated.
Like robots.txt files, sitemaps live in the topmost folder or directory of the website (sometimes called the ‘root’ directory).
For large and complex websites, the sitemap often links to other Extensible Markup Language (XML) files, which are sometimes compressed, or ‘zipped’. Where these files are accessible, they sometimes point to sections of the website that might be interesting, including URLs that typically do not show up in searches. You can explore those manually.
To access sitemaps, you need to add “/sitemap.xml” to the domain name. Not all sites will have an accessible sitemap.xml file.
The UK-based open-source investigations site Bellingcat has one that you can reach by typing https://www.bellingcat.com/sitemap.xml into your browser’s address bar (note that sitemap.xml files might not display the same way in all browsers). You will get a list of XML files, as seen below.
Sitemap examples for bellingcat.com
You can click any of the addresses listed to see what they contain. In this example we can access https://www.bellingcat.com/attachment-sitemap1.xml
Sitemap examples for www.bellingcat.com/attachment-sitemap1.xml
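Because sitemaps follow a simple XML schema, extracting their URLs can be automated. The sketch below parses a minimal, made-up sitemap fragment with Python’s standard library; in a real investigation you would first fetch the file from https://&lt;domain&gt;/sitemap.xml and may need to follow links to nested sitemap files.

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical sitemap fragment; real files can be much
# larger, nested, or compressed.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/hidden-page/</loc></url>
</urlset>"""

def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> URL listed in a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

print(sitemap_urls(sample))
# → ['https://example.com/', 'https://example.com/hidden-page/']
```

Any URL extracted this way can be visited directly, which is a quick way to spot pages that do not appear in ordinary search results.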
Subdomain enumeration
A subdomain is an extra identifier, typically added before a domain name, that represents a subcategory of content. For example, “google.com” is a domain name whereas “translate.google.com” is a subdomain.
Websites often have unlisted subdomains that their administrators believe are private. These subdomains occasionally point to unfinished content or content that is intended for an internal audience. This might include development subdomains used by programmers to test new content, event pages with links to materials distributed at conferences, or login pages for internal webmail.
Many subdomains are uninteresting from an investigative standpoint, but some can reveal hidden details about your research subject that are not easily accessible through basic online searching.
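At its simplest, active subdomain enumeration means trying candidate names from a wordlist and keeping those that resolve in DNS. The Python sketch below illustrates the idea; the domain and wordlist are placeholders. Note that live DNS lookups expose your interest to whatever resolver your system uses, which is one reason passive lookup services can be preferable.

```python
import socket

def enumerate_subdomains(domain, words, resolve=None):
    """Return {subdomain: ip} for wordlist candidates that resolve.
    `resolve` defaults to a live DNS lookup; pass a stub function
    instead to test the logic without touching the network."""
    if resolve is None:
        def resolve(name):
            try:
                return socket.gethostbyname(name)
            except socket.gaierror:
                return None  # name does not resolve
    found = {}
    for word in words:
        name = f"{word}.{domain}"
        ip = resolve(name)
        if ip is not None:
            found[name] = ip
    return found

# Offline demonstration with a fake resolver standing in for DNS:
fake_dns = {"mail.example.com": "192.0.2.1"}.get
print(enumerate_subdomains("example.com", ["mail", "dev"], resolve=fake_dns))
# → {'mail.example.com': '192.0.2.1'}
```

Real wordlists contain thousands of common labels (“www”, “mail”, “dev”, “staging” and so on); the services below do similar discovery for you without generating traffic from your own connection.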
Here are some tools and techniques you can use when researching website subdomains:
DNSDumpster
DNSDumpster provides data about subdomains, server locations and other domain information. It does not actively scan the website when you request this information, which means that your searches cannot be tracked by the website you are investigating. It also works via the Tor Browser.
Subdomain example for tacticaltech.org via DNSDumpster.com
Although we have already reviewed quite a lot of tools and methods, there is much more out there for those of you passionate about online investigations. For more tips and techniques related to uncovering hidden website content, have a look at another Kit resource: “Search Smarter by Dorking”.
Safety First!
HOW TO STAY SAFE WHEN INVESTIGATING WEBSITES
Investigating websites involves navigating a large number of online tools and services as you search for and collect information about domain ownership, history, website source code, metadata and other elements that can help you build your evidence. Some of these tools work with the Tor Browser, which allows you to protect your privacy to some extent. Others not only fail to work over Tor but also require you to sign up with an email address, name and other personal details.
Here are some suggestions for digital safety tools and techniques you can use to protect your privacy as well as the security of your devices and data when investigating online.
ACCOUNTS
Some services require users to create an account, choose a username, provide payment information, verify an email address or associate a social media profile in order to gain access to information on their platforms.
You should consider establishing a separate set of accounts, for use with services like these, in order to compartmentalise (separate) your investigative work from your personal online identity.
In some cases, you might even want to create a single-use “identity” for a particular investigation and dispose of it once the research is done.
Either way, your first step will be to create a relatively secure, compartmentalised email account, which you can do quite easily with Tutanota (tutanota.de) or Protonmail (protonmail.com).
BROWSERS
As someone who is looking to uncover hidden truths, you probably already use the internet for personal communication and for some of your research.
It’s a good idea to use different browsers for your research and for casual web browsing. By doing so, you are practicing “compartmentalisation”: designating one browser for research and another for everything else. It’s like sorting things into two different boxes or compartments.
We recommend you choose a “privacy aware” browser for your research and avoid logging in to web-based email and social media on that browser. Using a privacy aware browser will prevent a lot of your personal data from being sent to the sites you visit.
Before using any of the online tools we talk about here or in the overall kit, it’s a good idea to download and install one of these browsers. Then, add an extra layer of certainty by testing the browser with a tool like Browser Leaks, Cover Your Tracks (formerly Panopticlick) or other similar tools. The results should look different from what you see when visiting those testing tools with a normal browser, which will usually reveal more weaknesses.
These are some examples of tools that can help protect your privacy while researching online, with some pros and cons of using them.
Tor Browser
Pros: This is widely regarded as the best privacy-aware browser. The code is published openly so anyone can see how it works. It has a built-in way of changing your IP address and encrypting your traffic.
Cons: There are places in the world where Tor Browser usage is blocked or banned. While there are ways around these blocks, such as Tor Bridges, using Tor may also flag your traffic as suspicious in such places.
What if I can’t use Tor Browser? There are cases when Tor Browser might not be the best option for you. The browsers below are not on the same level as Tor, but they can be considered as alternatives. Be sure to always test the browser you choose with Browser Leaks, Cover Your Tracks or other such tools.
NOTE - Tactical Tech’s Security-in-a-Box website includes detailed guides on how to remain anonymous and bypass internet censorship using the Tor Browser on Linux, Mac and Windows, among other topics.
Firefox
Pros: Firefox blocks trackers and cookies with a setting called “Enhanced Tracking Protection”, which is automatically turned on when you set “Content Blocking” to “strict”.
Cons: You need to turn this option on; it is off by default. When you use Firefox, it’s important to remember that your IP address is still visible to the sites you visit. WebRTC is enabled by default and can leak your real IP address, even if you are using, for instance, a VPN.
Brave
Pros: Brave tries to protect privacy without requiring you to turn options on or install add-ons or extensions. It has a security setting to erase all private data when the browser is closed, and a feature called ‘Shields’ that blocks ads and trackers. Brave also allows you to open a new “Private Tab with Tor”, which uses the Tor network to protect your IP address (regular use won’t protect it). This even allows you to visit Tor hidden service sites - sites that end in .onion and are configured to be securely accessed only by Tor-enabled browsers. If you encounter a webpage that blocks Tor, you can decide whether or not to visit it with Tor turned off.
Cons: Brave has a feature called “payments” or “Brave payments” for those wishing to donate to content creators or websites they access via Brave (a portion of the payments goes to the browser to sustain its operations). It’s important to keep this option off, as it sends data that could be used to identify you. When you use Brave, you should use the ‘Private Tab with Tor’ feature to protect your IP address.
Epic
Pros: Epic browser has built-in technology, an encrypted proxy, that hides your IP address.
Cons: Epic is only available for Mac and Windows, not Linux.
Waterfox and Pale Moon
Pros: These are two different projects based on Firefox, but they have removed code that can send information to Mozilla, the maker of Firefox.
Cons: These browsers are based on older versions of the Firefox code. Pale Moon is not available for Apple computers. When you use Waterfox or Pale Moon, it’s important to remember that your IP address is still visible to the sites you visit.
DuckDuckGo
Pros: This is a privacy-aware search engine (not a browser) that claims not to collect any personal data about its users. You can use DuckDuckGo in combination with the Tor Browser to further preserve your privacy.
Cons: DuckDuckGo does save your search queries, though it claims not to collect data that can identify you personally.
VIRTUAL PRIVATE NETWORKS (VPNs)
Unless you are using Tor Browser, we recommend you always use a Virtual Private Network (VPN) when conducting your research.
We have explained that visiting a website is like making a phone call. The website you are visiting can see your “number” - your IP address - which can be used to map where you are coming from.
To illustrate, if you are researching a corporation and frequently visit its board of directors page – a page that typically gets very little traffic - your repeated visits from your specific location might make the company aware of your research.
One way you can work against being identified in this situation is by disguising your IP address. This is what a VPN does: rather than seeing your real IP address, sites you visit will see the IP of the VPN provider.
You can think of the VPN as a concrete tunnel between you and the site you want to visit. The VPN wraps your traffic so it cannot be observed from the outside and routes it through an intermediary server owned by your provider, so to any site you visit your traffic appears to come from a different location than where you actually are. Your internet service provider cannot see which sites you visit, and the sites you visit cannot see your real IP address or easily identify you; they only see traffic coming from the IP address of your VPN provider.
There are many VPN options, and deciding which one to pick can be confusing. To add to the confusion, most VPN reviews and listings are not independent, and some are heavily biased. Safety Detectives is one VPN review site you can check, among many others. Also see the (older but still relevant) guide on how to choose a VPN from “That One Privacy Site” (this site is no longer being updated).
We recommend choosing a VPN provider that claims not to keep logs of your traffic.
While most free VPNs should be avoided because they often fund their operations by selling log data (records of which sites users visit via the VPN), there are some reputable ones we can endorse, such as: