In brief

  • The Graph, a blockchain infrastructure company, experienced an outage that affected some of the most popular DeFi applications.
  • The outage came as a result of a rapidly growing number of complex queries being processed by The Graph.
  • The Graph has outlined a number of actions they’ll take to avoid a similar outage going forward.

DeFi is growing so fast that infrastructure providers are struggling to keep up.

The Graph, a backend service that makes blockchain data more digestible for decentralized applications, released a “post-mortem” report on Thursday covering outages in its system on Wednesday. The outages caused issues for some of the most popular and rapidly growing DeFi applications, such as lending protocol Aave and automated market maker Balancer.

As DeFi users and transactions continue to surge, the pressure is on for The Graph and others to provide a seamless and, eventually, completely decentralized infrastructure.

The first outage returned HTTP 500 errors (the internal server error code) for queries sent to The Graph starting June 24 at 12:00am PST, and the issue was resolved about 11 hours later at 11:10am PST, according to The Graph’s internal report. A second, shorter outage returned similar errors for about 45 minutes, between 11:35 PST and 12:20 PST.

Lending protocol Aave and 1inch.exchange, an aggregator for token swaps on decentralized exchanges, were among the services that experienced frontend disruptions as a result of the outages. The outages made it difficult for users to access the frontend websites of the apps that The Graph serves, though the protocols themselves never went down.

For context, roughly “65% of DeFi [assets under management] flow through dapps” built with The Graph’s tech, according to Eva Beylin, a strategist with The Graph and MolochDAO contributor. Put another way, The Graph is “basically the middleware layer for most of the DeFi ecosystem,” in the words of Digital Assets Capital Management CEO Richard Galvin.

In the post-mortem, The Graph project lead Yaniv Tal explained that in just two weeks, query volume for The Graph had grown 80%, from 25 million per day to more than 45 million. The number of highly complex queries had increased along with overall volume.

A setting that would otherwise have dropped highly complex queries was misconfigured. The resulting strain pushed a Google Cloud database that normally runs with 50% unused capacity to 100% of its capacity, triggering errors for subsequent queries.

In addition to technical difficulties, the human element of database management played a part in the initial outage, which lasted about 11 hours. Tal explained that the outage occurred during late-night hours for the engineering staff, most of whom are located in North and South America. The inopportune timing added significantly to the time needed to reach a resolution, simply because those with the knowledge to fix the problem were asleep for several hours.

Like many decentralized protocols currently in development, The Graph aims to become fully decentralized, but is still months away from launching such a network. In the interim, The Graph team plans to make a number of improvements to the existing system, including optimized query costing and processing, more aggressive triggers and alerts for high-load events, and failover infrastructure for responding to large traffic spikes.

The Graph also intends to hire additional engineers in Asia and Europe to close gaps in time zone coverage should a similar incident occur.