The little-known reason why competing with Google is so hard

Before a new search engine can hope to make a run against Google, it has to crawl. But indexing the web by “crawling” sites with automated software doesn’t just require scaling up to the web’s vast scope, even though doing so is a big challenge in itself. Individual sites have no obligation to welcome a new search crawler. Some instead post digital no-trespassing signs, a way to discourage automated traffic that might bog down performance.

“The web has trillions of documents,” says Vivek Raghunathan, cofounder of the ad-free, subscription-based search startup Neeva. “And the web is a lot trickier to crawl than it was a few years ago.”

An October 2020 report on digital competition by the House Judiciary Committee’s Subcommittee on Antitrust aimed a government spotlight at this situation. “The high cost of maintaining a fresh index, and the decision by many large webpages to block most crawlers, significantly limits new search engine entrants,” the report stated. “Today, the only English-language search engines that maintain their own comprehensive webpage index are Google and Bing.”

That leaves many Google competitors renting the index Microsoft maintains for its Bing search, which has 6.4% of the U.S. market, compared to Google’s 87.3%, in Statcounter’s measurements. Bing’s index works well for many queries, but sites leaning on it cede a key way to differentiate themselves.

That’s an issue for Neeva as well as two other privacy-centric search engines, DuckDuckGo and Brave. All three call on Bing for some of the results they provide to users. It’s just one ingredient rather than the entirety of their technology, but still: It would be easier to do without it if creating a new index of the web wasn’t so hard.

Robots not welcome here

Websites control automated access to their pages using standardized “robots.txt” files enumerating where crawlers may go.
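
For readers who want to see the mechanism, here is a minimal sketch of how a crawler might consult such a file, using the robots.txt parser in Python's standard library. The file contents, the example.com URL, and the "NewSearchBot" crawler name are all made up for illustration; they are not taken from any real site. The sample file names two crawlers it welcomes and turns away everyone else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two named crawlers are welcome,
# and every other bot is told to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "NewSearchBot" is an invented name standing in for a startup's crawler.
    for agent in ("Googlebot", "Bingbot", "NewSearchBot"):
        allowed = can_crawl(ROBOTS_TXT, agent, "https://example.com/page")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

In this sketch the two named crawlers are let in, while the invented newcomer falls through to the catch-all rule and is blocked. The parser only reports what the file asks for; whether a crawler honors the answer is up to the crawler.
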
Crawlers can disregard these instructions, as the Internet Archive began doing in 2017, to improve its backup of the web. But sites can punish a pushy robot by blocking its access.

DuckDuckGo and Neeva pointed to Facebook’s platform as one example. Its robots.txt file takes a guest-list approach, approving Google and Bing as well as such less obvious crawlers as “Applebot,” which gathers data for Apple’s Siri and Spotlight. But it excludes all bots not cited by name.

Jason Grosse, a spokesperson for Facebook’s parent firm Meta, said in an email: “Generally speaking, our robots.txt policy is not out of line with other major platforms.”

Indexing sites that don’t appreciate a new crawler’s attention can demand discretion and diplomacy. “A lot of the work we’ve done in the last year, year and a half, is building a crawler system that is well behaved,” said Neeva’s Raghunathan. “We do things like smart algorithmic estimation of how much can we crawl this site so it looks like a rounding error.”

Sometimes, however, Neeva has to ask for help. From whom? “I’d say it’s been the first person we know, and often the first person we know is the CEO or the head of engineering.”

Brave, meanwhile, operates in a stealth mode by varying its crawler’s identification and only abiding by whatever restrictions a robots.txt file places on Google’s crawler. Josep M. Pujol, chief of search at Brave, founded by Mozilla cofounder Brendan Eich and better known for its privacy-focused browser, said in an email that this requires treading lightly. “We respect the spirit of the law but not the letter,” he said.
“As of today, the data centers that host our crawlers have received a very small number of complaints.” Pujol called asking individual sites’ permission impractical: “How do you scale human interaction to thousands of companies?”

Google, meanwhile, can get another leg up because its nonsearch lines of business (starting with display ads, but including services like Google Analytics) require access to sites that competitors can only request, said Zack Maril, a software engineer and founder of a search-competition group called Knuckleheads’ Club. These other ventures, he wrote in an email, “all can benefit from Google’s search business in various ways that other competitors running only search engines simply cannot compete on.”

Search sites without Google- or Bing-level traffic also lack large-scale metrics about which sites are more or less popular. Google and Bing “can look at everything that people liked, and prioritize all the clicks from there,” says Raghunathan. “When you’re bootstrapping, it’s a lot harder.”

A report on digital competition, published in July 2020 by the U.K.’s Competition and Markets Authority, suggested requiring Google to provide some of these metrics. As DuckDuckGo communications vice president Kamyl Bazbaz approvingly phrased it, “Share a certain amount of click-and-query data that other search engines could use to level the playing field.”

Brave invites itself to a form of that sharing when it asks its users to allow “Google fallback mixing,” in which Brave sends along a query to Google and then analyzes the results to improve its index.

Even a search site that excels at providing web results will struggle to match Google’s full-spectrum information retrieval. For example, I’ve had DuckDuckGo as the default on my iPad Mini for years, but its maps results only cover driving and walking, so I still find myself turning to Apple Maps and Google Maps.
Despite the inherent challenges of competing with Google in search, the fact that new firms are still willing to try speaks well of the stubbornness that these upstarts will need. “We love that there are lots of other search competitors now,” said DuckDuckGo’s Bazbaz. “It’s a market that, historically, people have been really afraid of—and for good reason—because of the way that Google has dominated it.”

https://www.fastcompany.com/90709672/the-little-known-reason-why-competing-with-google-is-so-hard?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

Created 3y | Jan 7, 2022, 11:21:11 AM


