The little-known reason why competing with Google is so hard

Before a new search engine can hope to make a run against Google, it has to crawl. But indexing the web by “crawling” sites with automated software doesn’t just require scaling up to the web’s vast scope, even though doing so is a big challenge in itself. Individual sites have no obligation to welcome a new search crawler. Some instead post digital no-trespassing signs, a way to discourage automated traffic that might bog down performance.

“The web has trillions of documents,” says Vivek Raghunathan, cofounder of the ad-free, subscription-based search startup Neeva. “And the web is a lot trickier to crawl than it was a few years ago.”

An October 2020 report on digital competition by the House Judiciary Committee’s Subcommittee on Antitrust aimed a government spotlight at this situation. “The high cost of maintaining a fresh index, and the decision by many large webpages to block most crawlers, significantly limits new search engine entrants,” the report stated. “Today, the only English-language search engines that maintain their own comprehensive webpage index are Google and Bing.”

That leaves many Google competitors renting the index Microsoft maintains for its Bing search, which has 6.4% of the U.S. market (compared to Google’s 87.3%) in Statcounter’s measurements. Bing’s index works well for many queries, but sites leaning on it cede a key way to differentiate themselves.

That’s an issue for Neeva as well as two other privacy-centric search engines, DuckDuckGo and Brave. All three call on Bing for some of the results they provide to users. It’s just one ingredient rather than the entirety of their technology, but still: It would be easier to do without it if creating a new index of the web wasn’t so hard.

Robots not welcome here

Websites control automated access to their pages using standardized “robots.txt” files enumerating where crawlers may go.
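For readers who have never opened one of these files, here is a minimal sketch of how a crawler might read one, using Python's standard `urllib.robotparser` module. The file contents, the `example.com` address, and the "NewSearchBot" name are all made up for illustration; this is not any real site's policy, only an example of a file that names the crawlers it welcomes and turns everyone else away.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two named crawlers may fetch everything,
# and every other bot is shut out. Not copied from any real site.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

# Made-up page address used only for the demonstration.
URL = "https://example.com/some/page"


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "NewSearchBot" stands in for an upstart search engine's crawler.
    for bot in ("Googlebot", "Bingbot", "NewSearchBot"):
        print(bot, allowed(bot, URL))
```

Under these assumed rules, the two named crawlers get through and the newcomer does not. Nothing in the file enforces that outcome; it only states the site's wishes, and it is up to each crawler to check them.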
Crawlers can disregard these instructions, as the Internet Archive began doing in 2017, to improve its backup of the web. But sites can punish a pushy robot by blocking its access.

DuckDuckGo and Neeva pointed to Facebook’s platform as one example. Its robots.txt file takes a guest-list approach, approving Google and Bing as well as such less obvious crawlers as “Applebot,” which gathers data for Apple’s Siri and Spotlight. But it excludes all bots not cited by name.

Jason Grosse, a spokesperson for Facebook’s parent firm Meta, said in an email: “Generally speaking, our robots.txt policy is not out of line with other major platforms.”

Indexing sites that don’t appreciate a new crawler’s attention can demand discretion and diplomacy. “A lot of the work we’ve done in the last year, year and a half, is building a crawler system that is well behaved,” said Neeva’s Raghunathan. “We do things like smart algorithmic estimation of how much can we crawl this site so it looks like a rounding error.”

Sometimes, however, Neeva has to ask for help. From whom? “I’d say it’s been the first person we know, and often the first person we know is the CEO or the head of engineering.”

Brave, meanwhile, operates in a stealth mode by varying its crawler’s identification and only abiding by whatever restrictions a robots.txt file places on Google’s crawler. Josep M. Pujol, chief of search at Brave, founded by Mozilla cofounder Brendan Eich and better known for its privacy-focused browser, said in an email that this requires treading lightly. “We respect the spirit of the law but not the letter,” he said.
“As of today, the data centers that host our crawlers have received a very small number of complaints.” Pujol called asking individual sites’ permission impractical: “How do you scale human interaction to thousands of companies?”

Google, meanwhile, can get another leg up because its nonsearch lines of business (starting with display ads, but including services like Google Analytics) require access to sites that competitors can only request, said Zack Maril, a software engineer and founder of a search-competition group called Knuckleheads’ Club. These other ventures, he wrote in an email, “all can benefit from Google’s search business in various ways that other competitors running only search engines simply cannot compete on.”

Search sites without Google- or Bing-level traffic also lack large-scale metrics about which sites are more or less popular. Google and Bing “can look at everything that people liked, and prioritize all the clicks from there,” says Raghunathan. “When you’re bootstrapping, it’s a lot harder.”

A report on digital competition, published in July 2020 by the U.K.’s Competition and Markets Authority, suggested requiring Google to provide some of these metrics. As DuckDuckGo communications vice president Kamyl Bazbaz approvingly phrased it, “Share a certain amount of click-and-query data that other search engines could use to level the playing field.”

Brave invites itself to a form of that sharing when it asks its users to allow “Google fallback mixing,” in which Brave sends along a query to Google and then analyzes the results to improve its index.

Even a search site that excels at providing web results will struggle to match Google’s full-spectrum information retrieval. For example, I’ve had DuckDuckGo as the default on my iPad Mini for years, but its maps results only cover driving and walking, so I still find myself turning to Apple Maps and Google Maps.
Despite the inherent challenges of competing with Google in search, the fact that new firms are still willing to try speaks well of the stubbornness that these upstarts will need. “We love that there are lots of other search competitors now,” said DuckDuckGo’s Bazbaz. “It’s a market that, historically, people have been really afraid of—and for good reason—because of the way that Google has dominated it.”

https://www.fastcompany.com/90709672/the-little-known-reason-why-competing-with-google-is-so-hard?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

Created 3y | Jan 7, 2022, 11:21:11 AM

