Type a question into Google and press Enter — you get an answer in less than a second. Behind that little search box is a huge system: it visits hundreds of billions of web pages, stores them in a massive library, and uses AI to pick the most relevant results for your question. Here’s how it works.
A Brief History
The story of search engines is really the story of the web growing faster than anyone could keep up with. For thirty years, engineers kept inventing new ways to answer one question: how do you find something in an infinite library where no one printed the catalogue?
The game-changing moment was 1998. Stanford PhD students Sergey Brin and Larry Page published a paper describing PageRank — the insight that every hyperlink on the web is a vote of confidence, and that pages linked to by trustworthy pages should themselves be trusted. That one idea made every previous search engine look crude by comparison.
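The "links as votes" idea can be sketched in a few lines. What follows is a common textbook formulation of PageRank (power iteration with a damping factor of 0.85), not Google's production system; the function name, the toy link graph, and the choice to spread a dead-end page's score evenly are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal trust

    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links votes for everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web used only for this example.
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(toy_web)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {value:.3f}")
```

In this toy graph "home" ends up with the highest score because three pages link to it, while "orphan", which nothing links to, keeps only the baseline.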
The modern era since 2015 is defined by machine learning. Google's AI models — RankBrain, BERT, MUM — no longer just match keywords. They try to understand what you mean, not just what you typed.
The Three Pillars
Despite decades of evolution, every search engine still works in three stages: crawling, indexing, and ranking. Google's own documentation describes nearly the same pipeline: crawling, indexing, and serving search results. Think of it as building and using a library.
Crawling. Automated programs follow links across the web, downloading pages and discovering new ones, like a robot reading every book in every library and noting where they reference each other.
Indexing. Every crawled page is analysed, and its words are stored in a giant lookup table, the index, so any word can instantly point to every page that contains it.
Ranking. When you search, the matching pages, potentially billions of them, are scored using more than 200 signals plus AI models to pick the ten most relevant, trustworthy and useful results.
Crawling: Mapping the Web
A crawler — also called a spider or bot — is a program that downloads pages and follows their links to discover more pages. Google's is Googlebot, Microsoft's is Bingbot. They start from known URLs and fan out, link by link, across the web.
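The "start from known URLs and fan out" loop is essentially a breadth-first traversal. Here is a minimal sketch, assuming a hypothetical fetch_links function that returns the links found on a page; here it reads from an in-memory toy web rather than the network, and the max_pages cap is a crude stand-in for a real crawler's limits.

```python
from collections import deque


def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    fetch_links(url) is a hypothetical helper that returns the URLs
    linked from a page. max_pages stops the crawl from running forever.
    """
    frontier = deque(seed_urls)  # URLs discovered but not yet visited
    seen = set(seed_urls)        # every URL ever queued, to avoid repeats
    visited = []                 # URLs in the order they were fetched

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy web used only for this example: page -> links found on that page.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}

order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A real crawler would replace the lambda with an HTTP fetch and an HTML link extractor, and would add politeness delays and per-site limits.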
Because the web is essentially infinite, every crawler operates within a crawl budget — a quota of how often it visits any given site. Site owners can influence this with robots.txt, a small text file that tells crawlers where they may and may not go. This convention was formally standardized in 2022 as RFC 9309 by the IETF.
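To make the robots.txt idea concrete, here is a small sketch using Python's standard urllib.robotparser module. The file contents, the bot name "ExampleBot", and the URLs are made up for illustration; a real crawler would download the file from the site instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text instead of fetching it


def allowed(url, user_agent="ExampleBot"):
    """Return True if the rules above let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/blog/post.html"))       # True
print(allowed("https://example.com/private/report.html"))  # False
```

A well-behaved crawler runs a check like this before every fetch and simply skips the URLs it is told not to visit.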
Bing has pushed the model further with IndexNow (2021): instead of waiting to be crawled, a site can instantly notify the engine when content changes — reducing the gap between publishing and appearing in results.
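A push notification of this kind is, in practice, a small HTTP request. The sketch below only builds the request and does not send it. The endpoint and the JSON field names (host, key, keyLocation, urlList) follow the public IndexNow documentation as best understood here and should be checked against it before use; the host, key, and URL are placeholders.

```python
import json
from urllib.request import Request

# Shared endpoint as described in the public IndexNow documentation
# (treat this as an assumption and verify before relying on it).
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, changed_urls):
    """Build (but do not send) a POST request announcing changed URLs."""
    payload = {
        "host": host,
        "key": key,
        # The key is expected to be published as a text file on the site.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(changed_urls),
    }
    return Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Placeholder host, key, and URL for illustration only.
request = build_indexnow_request(
    "www.example.com",
    "0123456789abcdef",
    ["https://www.example.com/new-article"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending would be urllib.request.urlopen(request); omitted on purpose.
```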
Imagine the whole web is a city, and every page is a building. A crawler is like a postal worker who starts at one building, reads all the signs and notices inside, then follows every address mentioned on those signs to find the next building — and the next, and the next. The postal worker never stops, because new buildings appear every second. The list of buildings they have visited is the crawl.
Indexing: The Library Behind the Search Box
Once a page is fetched, the engine parses its HTML, executes its JavaScript, extracts text and metadata, and stores the results in a vast database called the index. Google says its index covers hundreds of billions of pages and is well over 100 petabytes in size.
At the heart of every search index is a deceptively simple data structure: the inverted index. Instead of storing pages and listing their words, it stores words and lists every page each word appears on — along with how often it appears and where. When you search for jollof rice, the engine doesn't scan the web; it looks up two short lists and intersects them in milliseconds.
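The inverted index is small enough to sketch directly. This is a minimal version, assuming whitespace tokenization and a handful of made-up documents; real engines add stemming, compression, and sharding across many machines.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the documents it appears in, with word positions.

    docs maps a document id to its text.
    Returns: word -> {doc_id: [positions of the word in that document]}.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    # One posting list per query word, then intersect them.
    postings = [set(index.get(word, {})) for word in words]
    return sorted(set.intersection(*postings))


# Made-up documents for illustration.
DOCS = {
    1: "jollof rice recipe with smoked fish",
    2: "fried rice and plantain",
    3: "the history of jollof in west africa",
    4: "party jollof rice tips",
}

index = build_index(DOCS)
print(search(index, "jollof rice"))  # [1, 4]
```

The query never touches the document text: it looks up the list for "jollof" and the list for "rice", then keeps only the ids that appear in both.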
Remember the index at the back of a school textbook? It lists words alphabetically and next to each word it tells you the page numbers where that word appears. A search engine index works exactly the same way — except instead of one textbook it covers hundreds of billions of web pages, and instead of printing it on paper, it lives in massive warehouses full of computers. When you search a word, the engine doesn't read the whole web. It just looks up that word in its index and instantly finds every page that contains it.
Ranking: Choosing the Best Ten
Of the billions of pages that could match your query, only ten will appear on the first page. The ranking system decides which. It starts with understanding what you actually mean — correcting spelling, expanding synonyms, detecting intent — and then scores candidates against more than 200 signals.
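Scoring against many signals can be pictured as a weighted sum. The sketch below is purely illustrative: the three signal names, their values, and the weights are invented, and real ranking systems combine far more signals with learned models rather than fixed hand-picked weights.

```python
# Hypothetical weights: how much each signal counts toward the final score.
DEFAULT_WEIGHTS = {"text_match": 0.5, "authority": 0.3, "freshness": 0.2}


def score(signals, weights=DEFAULT_WEIGHTS):
    """Combine per-page signal values (each 0.0 to 1.0) into one score."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)


def rank(candidates, k=10, weights=DEFAULT_WEIGHTS):
    """Return the k highest-scoring candidates, best first."""
    ordered = sorted(
        candidates,
        key=lambda candidate: score(candidate["signals"], weights),
        reverse=True,
    )
    return ordered[:k]


# Made-up candidate pages for a single query.
CANDIDATES = [
    {"url": "news.example/today",
     "signals": {"text_match": 0.7, "authority": 0.4, "freshness": 1.0}},
    {"url": "encyclopedia.example/topic",
     "signals": {"text_match": 0.8, "authority": 0.9, "freshness": 0.1}},
    {"url": "keyword-stuffed.example/page",
     "signals": {"text_match": 0.9, "authority": 0.05, "freshness": 0.5}},
]

top = [page["url"] for page in rank(CANDIDATES)]
print(top)

# The same pages with weights that favour recency, as a news query might.
NEWS_WEIGHTS = {"text_match": 0.4, "authority": 0.1, "freshness": 0.5}
top_news = [page["url"] for page in rank(CANDIDATES, weights=NEWS_WEIGHTS)]
print(top_news)
```

With the default weights the authoritative page comes first; with the recency-heavy weights the news page does, which shows how the same candidates can be ordered differently depending on what the query seems to need.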
Layered on top of traditional signals are machine-learning models that transformed how search understands language. RankBrain (2015) introduced AI to interpret never-before-seen queries. BERT (2019) brought deep language understanding; Google said it would affect about one in ten English-language searches in the US. MUM (2021) went further still: Google describes it as trained across 75 languages, 1,000 times more powerful than BERT, and capable of understanding images as well as text.
How the Main Engines Differ
Google holds roughly 90% of global search traffic, but the alternatives are worth understanding: they make very different choices about privacy, independence, and business models.
Google
The biggest index, heaviest AI investment (Gemini, BERT, MUM), and deep personalization. In 2024 it introduced AI Overviews: generative summaries at the top of results pages.
Bing
Independent index, push-based IndexNow protocol, and the engine behind Microsoft Copilot. Estimated 8–14 billion pages indexed — much smaller than Google but significant.
DuckDuckGo
Privacy-first. Uses its own DuckDuckBot crawler plus Bing's index. Proxies all requests so neither DuckDuckGo's partners nor anyone else can build a profile on you.
Brave Search
Fully independent. Announced 100% independence from Bing in April 2023. Unique feature: Goggles, an open system letting users apply their own re-ranking rules on top of results.
Kagi
Paid · Ad-free. A subscription engine ($5/month+) with no ads, no tracking, and no incentive to favour sponsored content. Aggregates results from Brave, Google, and its own Teclis crawler.
The Hard Problems
Modern search faces tensions that no algorithm fully resolves. Spam is a constant arms race — Google fights back with SpamBrain AI, following earlier systems like Panda (2011) and Penguin (2012). Freshness versus authority is a permanent trade-off: breaking news demands the newest sources, medical queries demand the most trustworthy — and the engine must judge which a given query needs, correctly, every time.
The deepest tension today is between AI-generated answers and the open web. A 2025 Pew Research study of nearly 69,000 searches found that users who saw an AI summary clicked through to a website only 8% of the time — versus 15% for those who saw traditional results. That collapse in traffic has fuelled antitrust complaints and, in March 2026, a Google commitment to develop a publisher opt-out for generative features.
The next decade of search will be defined less by who can index the most pages than by who can answer questions honestly, attribute sources fairly, and keep the open web worth searching at all.
Main References
- Google Search Central. In-Depth Guide to How Google Search Works.
developers.google.com/search/docs/fundamentals/how-search-works
- Pandu Nayak. Understanding searches better than ever before (BERT), The Keyword, October 25, 2019.
blog.google/products/search/search-language-understanding-bert
- Pandu Nayak. MUM: A new AI milestone for understanding information, The Keyword, May 18, 2021.
blog.google/products/search/introducing-mum
- Brin, S. & Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine, Stanford InfoLab, 1998.
infolab.stanford.edu/pub/papers/google.pdf
- Koster, M. et al. RFC 9309: Robots Exclusion Protocol, IETF, September 2022.
rfc-editor.org/rfc/rfc9309