The little-known reason why competing with Google is so hard

Before a new search engine can hope to make a run against Google, it has to crawl. But indexing the web by “crawling” sites with automated software doesn’t just require scaling up to the web’s vast scope—even though doing so is a big challenge in itself. Individual sites have no obligation to welcome a new search crawler. Some instead post digital no-trespassing signs, a way to discourage automated traffic that might bog down performance.

“The web has trillions of documents,” says Vivek Raghunathan, cofounder of the ad-free, subscription-based search startup Neeva. “And the web is a lot trickier to crawl than it was a few years ago.”

An October 2020 report on digital competition by the House Judiciary Committee’s Subcommittee on Antitrust aimed a government spotlight at this situation. “The high cost of maintaining a fresh index, and the decision by many large webpages to block most crawlers, significantly limits new search engine entrants,” the report stated. “Today, the only English-language search engines that maintain their own comprehensive webpage index are Google and Bing.”

That leaves many Google competitors renting the index Microsoft maintains for its Bing search, which has 6.4% of the U.S. market—compared to Google’s 87.3%—in Statcounter’s measurements. Bing’s index works well for many queries, but sites leaning on it cede a key way to differentiate themselves.

That’s an issue for Neeva as well as two other privacy-centric search engines, DuckDuckGo and Brave. All three call on Bing for some of the results they provide to users. It’s just one ingredient rather than the entirety of their technology, but still: It would be easier to do without it if creating a new index of the web wasn’t so hard.

Robots not welcome here

Websites control automated access to their pages using standardized “robots.txt” files enumerating where crawlers may go.
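
For readers who have not seen one of these files, here is a minimal sketch of how a crawler might check a robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The file contents, the `can_crawl` helper, the crawler name "NewSearchBot" and the example.com URL are all made up for illustration and are not taken from any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two named crawlers are admitted (except for
# /private/), and every other bot is excluded from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Return True if the sample rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"
    for bot in ("Googlebot", "Bingbot", "NewSearchBot"):
        print(bot, can_crawl(bot, page))
```

Under these sample rules the two named crawlers may fetch the page, while the unnamed newcomer is turned away everywhere on the site.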
Crawlers can disregard these instructions, as the Internet Archive began doing in 2017, to improve its backup of the web. But sites can punish a pushy robot by blocking its access.

DuckDuckGo and Neeva pointed to Facebook’s platform as one example. Its robots.txt file takes a guest-list approach, approving Google and Bing as well as such less obvious crawlers as “Applebot,” which gathers data for Apple’s Siri and Spotlight. But it excludes all bots not cited by name.

Jason Grosse, a spokesperson for Facebook’s parent firm Meta, said in an email: “Generally speaking, our robots.txt policy is not out of line with other major platforms.”

Indexing sites that don’t appreciate a new crawler’s attention can demand discretion and diplomacy. “A lot of the work we’ve done in the last year, year and a half, is building a crawler system that is well behaved,” said Neeva’s Raghunathan. “We do things like smart algorithmic estimation of how much can we crawl this site so it looks like a rounding error.”

Sometimes, however, Neeva has to ask for help. From whom? “I’d say it’s been the first person we know, and often the first person we know is the CEO or the head of engineering.”

Brave, meanwhile, operates in a stealth mode by varying its crawler’s identification and only abiding by whatever restrictions a robots.txt file places on Google’s crawler. Josep M. Pujol, chief of search at Brave, founded by Mozilla cofounder Brendan Eich and better known for its privacy-focused browser, said in an email that this requires treading lightly. “We respect the spirit of the law but not the letter,” he said.
“As of today, the data centers that host our crawlers have received a very small number of complaints.” Pujol called asking individual sites’ permission impractical: “How do you scale human interaction to thousands of companies?”

Google, meanwhile, can get another leg up because its nonsearch lines of business—starting with display ads, but including services like Google Analytics—require access to sites that competitors can only request, said Zack Maril, a software engineer and founder of a search-competition group called Knuckleheads’ Club. These other ventures, he wrote in an email, “all can benefit from Google’s search business in various ways that other competitors running only search engines simply cannot compete on.”

Search sites without Google- or Bing-level traffic also lack large-scale metrics about what sites are more or less popular. Google and Bing “can look at everything that people liked, and prioritize all the clicks from there,” says Raghunathan. “When you’re bootstrapping, it’s a lot harder.”

A report on digital competition, published in July 2020 by the U.K.’s Competition and Markets Authority, suggested requiring Google to provide some of these metrics. As DuckDuckGo communications vice president Kamyl Bazbaz approvingly phrased it, “Share a certain amount of click-and-query data that other search engines could use to level the playing field.” Brave invites itself to a form of that sharing when it asks its users to allow “Google fallback mixing,” in which Brave sends along a query to Google and then analyzes the results to improve its index.

Even a search site that excels at providing web results will struggle to match Google’s full-spectrum information retrieval. For example, I’ve had DuckDuckGo as the default on my iPad Mini for years—but its maps results only cover driving and walking, so I still find myself turning to Apple Maps and Google Maps.

Despite the inherent challenges of competing with Google in search, the fact that new firms are still willing to try speaks well of the stubbornness that these upstarts will need. “We love that there are lots of other search competitors now,” said DuckDuckGo’s Bazbaz. “It’s a market that, historically, people have been really afraid of—and for good reason—because of the way that Google has dominated it.”

https://www.fastcompany.com/90709672/the-little-known-reason-why-competing-with-google-is-so-hard?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

Created 3y ago | Jan 7, 2022, 11:21:11

