If all goes well with the Large Hadron Collider this week, it will finally get a beam to travel a full circle, almost a month after the first beam injection. From a physics standpoint this will be quite exciting, though it will be much more exciting when they manage head-on collisions between two beams a couple of months later. But the LHC is also very impressive in terms of its supporting computing infrastructure.
The LHC is going to generate an incredible number of collision events, too much data to handle in a single computing center. And I mean a center with more than 100,000 computers. This means they need a computing infrastructure distributed all over the world that can handle the flood of data coming out of the collider. With about one DVD's worth of data being generated every five seconds, the data is first received by CERN's own computing center, which then distributes it to 11 computing sites in Europe, North America, and Asia. These sites in turn give scientists access to the collision data from their own computers, which do the actual CPU-intensive work of analyzing the data for new discoveries.
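To get a feel for what "one DVD every five seconds" means, here's a quick back-of-the-envelope calculation in Python. The 4.7 GB single-layer DVD size and the assumption of continuous running are mine, not official CERN figures:

```python
# Rough estimate of the data rate implied by "one DVD's worth every 5 seconds".
# Assumes a 4.7 GB single-layer DVD; purely illustrative, not official numbers.
DVD_GB = 4.7
SECONDS_PER_DVD = 5

rate_gb_per_s = DVD_GB / SECONDS_PER_DVD        # ~0.94 GB/s
per_day_tb = rate_gb_per_s * 86_400 / 1_000     # ~81 TB per day
per_year_pb = per_day_tb * 365 / 1_000          # ~30 PB per year, if it ran non-stop

print(f"~{rate_gb_per_s:.2f} GB/s, ~{per_day_tb:.0f} TB/day, ~{per_year_pb:.0f} PB/year")
```

In practice the collider doesn't run non-stop, so the yearly total is lower, but even a fraction of that is far too much for any single site to store and analyze, hence the tiered distribution.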
And despite the incredible amount of data that comes out of the collider, it's mind-boggling that this is just a tiny fraction of what is originally produced inside it. Of the roughly 40 million collision events occurring every second, a lot of work goes into filtering out the "boring", well-known events that our current theories of physics can already explain quite well, so that data for "only" about 100 potentially interesting events per second comes out of the collider. Without such filters, even all the computing power and network bandwidth in the world could not handle the data.
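Conceptually, this is a cascade of increasingly selective filters applied to each event before anything is written out. The sketch below is only a toy illustration of that idea, the event structure, the energy threshold, and the selection criterion are all made up by me, and the real LHC triggers are implemented in custom hardware and dedicated reconstruction software, not a few lines of Python:

```python
import random

def generate_event():
    """Toy stand-in for a collision event: just one transverse-energy value (GeV)."""
    return {"transverse_energy_gev": random.expovariate(1 / 5.0)}

def level1_trigger(event, threshold_gev=50.0):
    """First, coarse filter: keep only events with unusually high energy.
    Threshold and criterion are invented for illustration only."""
    return event["transverse_energy_gev"] > threshold_gev

def filter_events(events):
    """Keep only the events that pass the (toy) trigger."""
    return [e for e in events if level1_trigger(e)]

# Simulate a batch of collisions at a much-reduced scale:
events = [generate_event() for _ in range(1_000_000)]
kept = filter_events(events)
print(f"kept {len(kept)} of {len(events)} events "
      f"({100 * len(kept) / len(events):.4f}%)")
```

The point of the toy example is just the ratio: out of a million simulated events, only a handful survive, which mirrors how 40 million events per second get reduced to about 100 worth recording.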