The first Internet search engine was created in 1990 at McGill University in Canada and attempted to build a searchable census of all of the files on FTP sites at that time. But as the size of the Web grew exponentially, it soon became apparent that a method of sampling the Internet would be needed to produce the indexes required for searching. A solution was soon developed in which sampling is done by a web crawler or "spider": software that records information from a web page, then from every page it links to, then from the pages those pages link to, and so on.
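The "follow the links, then their links" process is a breadth-first traversal of the Web's link graph. The sketch below illustrates the idea on a small made-up in-memory link graph standing in for real HTTP fetches; the site names and graph are hypothetical.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

def crawl(seed):
    """Breadth-first crawl: visit a page, queue every page it links to,
    then the pages those link to, skipping pages already seen."""
    seen, frontier, order = {seed}, deque([seed]), []
    while frontier:
        page = frontier.popleft()
        order.append(page)  # a real crawler would fetch, parse, and index here
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl("a.example"))  # ['a.example', 'b.example', 'c.example', 'd.example']
```

A production crawler layers politeness rules, revisit schedules, and duplicate detection on top of this basic traversal, but the sampling logic is the same.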
Statistical issues then arise in the indexing component of the process. What variables should be saved? For example, one variable in these indexes counts the number of inbound links to a page, weighted by measures of the quality of the sites that link to the page in question. Interestingly, this is proportional to an estimate of the equilibrium probability of landing on a given page after a large number of clicks in a Markov model of Internet browsing. Next, what data structures and index size make for the quickest computation without sacrificing relevance? Even the large index maintained by Google, which is more than a hundred petabytes in size, holds just a small fraction of the estimated 30 trillion pages on the Internet.
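That equilibrium probability is the idea behind Google's PageRank, and it can be estimated by power iteration: repeatedly redistribute each page's probability mass along its outbound links. A minimal sketch, assuming a toy three-page graph and the commonly quoted damping factor of 0.85 (both illustrative choices, not anyone's actual index):

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration for the equilibrium distribution of a Markov model
    of browsing: with probability `damping` the surfer follows a random
    outbound link; otherwise she jumps to a page chosen uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = links[p]
            if out:
                share = damping * rank[p] / len(out)
                for q in out:
                    new[q] += share
            else:  # dangling page: spread its mass evenly over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
# The ranks form a probability distribution (they sum to 1), and "c",
# which gathers the most inbound weight, ends up ranked highest.
```

At Web scale the same computation is carried out on sparse matrices distributed over many machines, but the statistical object being estimated is exactly this stationary distribution.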
Finally, searching algorithms must produce results in a split second. Results are rank orderings of websites in the index that should be strongly related to the probability that a site is relevant to the user's intent and needs. Models for predicting relevance are constantly updated, partly to ensure that website owners improve their sites through genuine best practices rather than by artificially matching known ranking variables. Current models are based on several hundred variables, continuously examined using variable selection and model-building experiments against user responses to search engine results. Do users click more often on the highest ranked items? Do they stay longer on the sites they go to?
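Those two user responses, clicks and time on site, can be combined into a simple empirical relevance estimate. The sketch below is an assumption-laden illustration, not any engine's actual model: it scores a site by the fraction of impressions that led to a "satisfied" click, meaning a click followed by a reasonably long stay, using a made-up click log and an arbitrary 30-second dwell threshold.

```python
# Hypothetical click log: (site, shown, clicked, seconds_on_site).
LOG = [
    ("site1", True, True, 120),
    ("site1", True, True, 95),
    ("site1", True, False, 0),
    ("site2", True, True, 5),
    ("site2", True, False, 0),
    ("site2", True, False, 0),
]

def relevance(site, log, dwell_threshold=30):
    """Fraction of impressions that produced a 'satisfied' click,
    i.e. a click followed by at least `dwell_threshold` seconds on the site."""
    shown = [r for r in log if r[0] == site and r[1]]
    satisfied = [r for r in shown if r[2] and r[3] >= dwell_threshold]
    return len(satisfied) / len(shown) if shown else 0.0

# Rank sites by estimated relevance; site1's long visits beat site2's quick bounce.
ranking = sorted({r[0] for r in LOG}, key=lambda s: relevance(s, LOG), reverse=True)
print(ranking)  # ['site1', 'site2']
```

Real ranking experiments run such comparisons at scale, with hundreds of candidate variables competing in variable-selection and A/B experiments against live user behavior.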
Thus, statistical issues are addressed in the sampling (web crawling), indexing, and ranking phases of the operation of a good search engine. Without statistics, we would have to search the Internet one site at a time.