NOTE: Most of the contents of this file are extracted "as is" from various sources, each of which is named in ALL CAPS just before the extracted text. Some of these extracts refer to figures, tables, or pages that can be found in the referenced documents. References are in the accompanying presentation "grid-computing.pdf".

Objective
=========

The objective of my talk is to introduce the concept of Grid Computing and provide an overview of the aspects involved. I will keep to a fairly high level discussion of the major aspects, without getting into too much technical detail. We will also look at some practical implications and implementations. Lastly, I will try to relate the concepts to what we have covered in the course so far.

Overview
========

Here is a brief outline of my talk. First I will explain the need for Grid Computing. After that we will see what Grid Computing is, what a Grid can and cannot achieve for us, and how it differs from other concepts in High Performance Computing that are often considered similar. Then we will take a look at some interesting aspects of Grid Computing, primarily those that give a feel for how this concept can be realized and what design considerations are involved. Finally we will see which real world systems actually implement this concept.

Why Grid
========

Over the duration of this course, and in the presentations covered so far, we have studied or come across numerous enhancements to architecture that increase CPU speed, network speed and storage capacity. An example that really made me appreciate this fact is that the NSFnet (a computational grid from the National Science Foundation which connects supercomputers) backbone in 1985 was a 56 kbit/s network, whereas now it is a 40 Gbit/s network. This has led to the existence of huge amounts of potential computational power. How much of this power do you think is actually utilized?
According to an IBM white paper published in 2005, utilization is astonishingly low. Imagine an organization where 90% of the resources are simply wasted while 10% are stressed to their limits. It seems that the principle of locality comes to haunt us in this case rather than being an advantage, in the sense that 90% of the workload is handled by just 10% of the resources. Yet we know we have tremendously fast computers with huge capacity, and fast means to connect them together.

Why Grid ..contd.
=================

We also have to realize that today's supercomputers, however fast they may be, are not sufficient to give us results for some problems as efficiently as we would like. For example (from an article in Physics Today - http://www.aip.org/pt/vol-55/iss-2/p42.html):

Computer-in-the-loop instrumentation: Scientific instruments such as telescopes, synchrotrons, and electron microscopes generate raw data streams that are archived for subsequent batch processing. But quasi-real-time analysis can greatly enhance an instrument's capabilities. For example, consider an astronomer studying solar flares with a radio telescope array. The deconvolution and analysis algorithms used to process the data and detect flares are computationally demanding. Running the algorithms continuously would be inefficient for studying flares that are brief and sporadic. But if the astronomer could call on substantial computing resources (and sophisticated software) in an on-demand fashion, he or she could use automated detection techniques to zoom in on solar flares as they occurred.

What is Grid
============

In simplistic terms, it is just distributed computing taken to the next level. More formally, here is a high level definition from the IBM Redbook titled "Introduction to Grid Computing with Globus".

What is Grid ..contd.
=====================

The idea is to make use of under-utilized resources, which can participate in this scheme voluntarily.
This helps balance the workload across different resources by making use of available parallelism. This in turn provides a means to solve intensive problems like the solar flare analysis in quasi real time.

What Grid can do
================

FROM IBM REDBOOK

Exploiting underutilized resources: The easiest use of grid computing is to run an existing application on a different machine. The machine on which the application is normally run might be unusually busy due to an unusual peak in activity. The job in question could be run on an idle machine elsewhere on the grid. There are at least two prerequisites for this scenario. First, the application must be executable remotely and without undue overhead. Second, the remote machine must meet any special hardware, software, or resource requirements imposed by the application.

Parallel CPU capacity: The potential for massive parallel CPU capacity is one of the most attractive features of a grid. In addition to pure scientific needs, such computing power is driving a new evolution in industries such as the bio-medical field, financial modeling, oil exploration, motion picture animation, and many others. The common attribute among such uses is that the applications have been written to use algorithms that can be partitioned into independently running parts. A CPU intensive grid application can be thought of as many smaller subjobs, each executing on a different machine in the grid. The less these subjobs need to communicate with each other, the more scalable the application becomes. A perfectly scalable application will, for example, finish 10 times faster if it uses 10 times the number of processors.

Virtual resources and virtual organizations for collaboration: Another important grid computing contribution is to enable and simplify collaboration among a wider audience. In the past, distributed computing promised this collaboration and achieved it to some extent.
Grid computing takes these capabilities to an even wider audience, while offering important standards that enable very heterogeneous systems to work together to form the image of a large virtual computing system offering a variety of virtual resources.

Access to additional resources: In addition to CPU and storage resources, a grid can provide access to increased quantities of other resources and to special equipment, software, licenses, and other services. The additional resources can be provided in additional numbers and/or capacity. For example, if a user needs to increase his total bandwidth to the Internet to implement a data mining search engine, the work can be split among grid machines that have independent connections to the Internet.

Resource balancing: A grid federates a large number of resources contributed by individual machines into a greater total virtual resource. For applications that are grid-enabled, the grid can offer a resource balancing effect by scheduling grid jobs on machines with low utilization, as illustrated in Figure 1-2 on page 9. This feature can prove invaluable for handling occasional peak loads of activity in parts of a larger organization. This can happen in two ways:
An unexpected peak can be routed to relatively idle machines in the grid.
If the grid is already fully utilized, the lowest priority work being performed on the grid can be temporarily suspended or even cancelled and performed again later to make room for the higher priority work.
Without a grid infrastructure, such balancing decisions are difficult to prioritize and execute.

Reliability: High-end conventional computing systems use expensive hardware to increase reliability. They are built using chips with redundant circuits that vote on results, and contain much logic to achieve graceful recovery from an assortment of hardware failures.
The machines also use duplicate processors with hot pluggability so that when they fail, one can be replaced without turning the other off. Power supplies and cooling systems are duplicated. The systems are operated on special power sources that can start generators if utility power is interrupted. All of this builds a reliable system, but at a great cost, due to the duplication of high-reliability components. In the future, we will see a complementary approach to reliability that relies on software and hardware. A grid is just the beginning of such technology. The systems in a grid can be relatively inexpensive and geographically dispersed. Thus, if there is a power or other kind of failure at one location, the other parts of the grid are not likely to be affected. Grid management software can automatically resubmit jobs to other machines on the grid when a failure is detected. In critical, real-time situations, multiple copies of the important jobs can be run on different machines throughout the grid, as illustrated in Figure 1-3. Their results can be checked for any kind of inconsistency, such as computer failures, data corruption, or tampering.

Management: The goal to virtualize the resources on the grid and more uniformly handle heterogeneous systems will create new opportunities to better manage a larger, more dispersed IT infrastructure. It will be easier to visualize capacity and utilization, making it easier for IT departments to control expenditures for computing resources over a larger organization.

What the Grid cannot do
=======================

FROM IBM REDBOOK

A word of caution should be given to the overly enthusiastic. The grid is not a silver bullet that can take any application and run it 1000 times faster without the need for buying any more machines or software. Not every application is suitable or enabled for running on a grid. Some kinds of applications simply cannot be parallelized.
For others, it can take a large amount of work to modify them to achieve faster throughput. The configuration of a grid can greatly affect the performance, reliability, and security of an organization's computing infrastructure. For all of these reasons, it is important for the users to understand how far the grid has evolved today and which features are coming tomorrow or in the distant future.

Grid Computing compared to..
============================

Distributed: Deals with homogeneous resources, whereas a grid is inherently heterogeneous. The scale also differs.

Cluster Computing: Centralized and enforced resource control, whereas a grid has distributed and voluntary resource control.

P2P: P2P and Grid are mostly considered to be merging streams. An interesting paper comparing them is provided in the references. P2P (like KaZaa, BitTorrent) was mainly thought of as a means of data sharing, and Grid as more of a computational resource, but the two are moving towards each other: some Grids are used purely for data streaming, and some P2P techniques provide computational services. For example, SETI@home is a project that both streams associate themselves with.

Types of Grids
==============

FROM IBM REDBOOK

Computational Grid: A computational grid aggregates the processing power from a distributed collection of systems. A well known example of a computational grid is the SETI@home grid. This type of grid is primarily comprised of low powered computers with minimal application logic awareness and minimal storage capacity. Rather than simply painting images of flying toasters, the idle cycles of the personal computers on the SETI@home grid are combined to create a computational grid used to analyze radio transmissions received from outer space in the Search for Extra Terrestrial Intelligence.

Data grid: While computational grids are more suited for aggregating resources, data grids focus on providing secure access to distributed, heterogeneous pools of data.
Through collaboration, data grids can also include a new concept such as a federated database. Within a federated database, as illustrated in Figure 4-1 on page 89, a data grid makes a group of databases available that function as a single virtual database. Through this single interface, the federated database provides a single query point, data modeling, and data consistency. Data grids also harness data, storage, and network resources located in distinct administrative domains, respect local and global policies governing how data can be used, schedule resources efficiently, again subject to local and global constraints, and provide high speed and reliable access to data. Businesses interested in data grids typically have IT initiatives to expand data mining abilities while maximizing the utilization of an existing storage infrastructure investment, and to reduce the complexity of data management.

FROM IBM WHITEPAPER "GRID EXPLAINED"

Enterprise Grid: This is the ultimate form of Grid computing. The Enterprise Grid allows users to execute their most demanding business applications in the minimum time possible and get instant access to all information across the organization. In addition, users experience computing resources that have been described as always there, always constant with no obvious peaks and troughs in performance and no expensive, frustrating outages. The Enterprise Grid provides the functionalities of the Computer and Information Grids with total visibility and control combined with the minimum possible manual intervention from the IT operations team.

General Issues
==============

EXTRACTED FROM "Grids and Grid technologies for wide-area distributed computing"

Multiple administrative domains and autonomy: Grid resources are geographically distributed across multiple administrative domains and owned by different organizations. The autonomy of resource owners needs to be honored along with their local resource management and usage policies.

Heterogeneity: A Grid involves a multiplicity of resources that are heterogeneous in nature and will encompass a vast range of technologies.

Scalability: A Grid might grow from a few integrated resources to millions. This raises the problem of potential performance degradation as the size of Grids increases. Consequently, applications that require a large number of geographically distributed resources must be designed to be latency and bandwidth tolerant.

Dynamicity or adaptability: In a Grid, resource failure is the rule rather than the exception. In fact, with so many resources in a Grid, the probability of some resource failing is high. Resource managers or applications must tailor their behavior dynamically and use the available resources and services efficiently and effectively.

Components of Grid
==================

EXTRACTED FROM "Grids and Grid technologies for wide-area distributed computing"

Grid fabric: This consists of all the globally distributed resources that are accessible from anywhere on the Internet. These resources could be computers (such as PCs or Symmetric Multi-Processors) running a variety of operating systems (such as UNIX or Windows), storage devices, databases, and special scientific instruments such as a radio telescope or particular heat sensor.

Core Grid middleware: This offers core services such as remote process management, co-allocation of resources, storage access, information registration and discovery, security, and aspects of Quality of Service (QoS) such as resource reservation and trading.

User-level Grid middleware: This includes application development environments, programming tools, and resource brokers for managing resources and scheduling application tasks for execution on global resources.

Grid applications and portals: Grid applications are typically developed using Grid-enabled languages and utilities such as HPC++ or MPI.
An example application, such as parameter simulation or a grand-challenge problem, would require computational power, access to remote data sets, and may need to interact with scientific instruments. Grid portals offer Web-enabled application services, where users can submit and collect results for their jobs on remote resources through the Web.

Architectural / Application Considerations
==========================================

FROM IBM REDBOOK

While a grid may offer many advantages, any given application may not necessarily benefit from a grid. For example, a word processor is tightly coupled with a user's interface, and does not consume a large amount of computing resources. Running it on a grid would likely degrade its performance by having to use a remote windowing system and being subject to more potential points of failure. However, other applications may be very suited for exploiting a grid.

CPU Considerations

An application may or may not be possible to parallelize. But repeated experiments with different parameters can still make use of independent resources. For example, in the spacecraft trajectory problem, we might run simulations to see where a spacecraft goes if you alter, in small increments, its starting direction, starting speed, and apply various midcourse corrections. Each of these alternatives can be a separate job running the same simulation application, but with different input values. We call this a parameter space problem.

Data Considerations

What matters is the amount of data a node requires for its calculations and the time it takes to send it. The ideal case would be small chunks of input and output. This may require re-design of algorithms / applications wherever possible. Shared data raises the same kinds of problems as cache coherence (synchronization, multiple redundant copies, etc.), in this case on a far larger scale.
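
The parameter space problem described under CPU Considerations can be sketched roughly as follows. This is only an illustration under stated assumptions: the `simulate` function is a hypothetical stand-in (the range of a drag-free projectile, not a real spacecraft model), and a local thread pool stands in for the machines of a grid. What it is meant to show is that each job takes a small input, returns a small output, and never communicates with the other jobs.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def simulate(job):
    """Hypothetical stand-in simulation: range of a drag-free projectile."""
    angle_deg, speed = job
    g = 9.81  # gravitational acceleration in m/s^2
    distance = speed ** 2 * math.sin(2 * math.radians(angle_deg)) / g
    return {"angle_deg": angle_deg, "speed": speed, "distance": distance}


def parameter_sweep(angles, speeds, workers=4):
    """Run one independent job per (angle, speed) combination."""
    jobs = list(product(angles, speeds))
    # Assumption: the thread pool plays the role of the grid machines.
    # Every job carries only its own small input and shares no state.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, jobs))


if __name__ == "__main__":
    results = parameter_sweep(angles=range(30, 61, 5), speeds=[100.0, 110.0])
    best = max(results, key=lambda r: r["distance"])
    print(best)
```

Because the jobs are independent, adding more workers (or, on a real grid, more machines) changes nothing in the simulation code itself; only the dispatching layer would differ.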
Design
======

EXTRACTED FROM "Grids and Grid technologies for wide-area distributed computing"

Administrative hierarchy: An administrative hierarchy is the way that each Grid environment divides itself up to cope with a potentially global extent. The administrative hierarchy determines how administrative information flows through the Grid.

Communication services: The communication needs of applications using a Grid environment are diverse, ranging from reliable point-to-point to unreliable multicast communications. The communications infrastructure needs to support protocols that are used for bulk-data transport, streaming data, group communications, and those used by distributed objects. The network services used also provide the Grid with important QoS parameters such as latency, bandwidth, reliability, fault-tolerance, and jitter control.

Information services: A Grid is a dynamic environment where the location and types of services available are constantly changing. A major goal is to make all resources accessible to any process in the system, without regard to the relative location of the resource user. It is necessary to provide mechanisms to enable a rich environment in which information is readily obtained by requesting services. The Grid information (registration and directory) services components provide the mechanisms for registering and obtaining information about the Grid structure, resources, services, and status.

Naming services: In a Grid, like in any distributed system, names are used to refer to a wide variety of objects such as computers, services, or data objects. The naming service provides a uniform name space across the complete Grid environment. Typical naming services are provided by the international X.500 naming scheme or DNS, the Internet's scheme.

Distributed file systems and caching: Distributed applications, more often than not, require access to files distributed among many servers.
A distributed file system is therefore a key component in a distributed system. From an application's point of view it is important that a distributed file system can provide a uniform global namespace, support a range of file I/O protocols, require little or no program modification, and provide means that enable performance optimizations to be implemented, such as the usage of caches.

Security and authorization: Any distributed system involves all four aspects of security: confidentiality, integrity, authentication, and accountability. Security within a Grid environment is a complex issue requiring diverse resources autonomously administered to interact in a manner that does not impact the usability of the resources or introduce security holes/lapses in individual systems or the environments as a whole. A security infrastructure is the key to the success or failure of a Grid environment.

System status and fault tolerance: To provide a reliable and robust environment it is important that a means of monitoring resources and applications is provided. To accomplish this task, tools that monitor resources and applications need to be deployed.

Resource management and scheduling: The management of processor time, memory, network, storage, and other components in a Grid is clearly very important. The overall aim is to efficiently and effectively schedule the applications that need to utilize the available resources in the Grid computing environment. From a user's point of view, resource management and scheduling should be transparent; their interaction with it should be confined to a mechanism for submitting their application. It is important in a Grid that a resource management and scheduling service can interact with those that may be installed locally.

Computational economy and resource trading: As a Grid is constructed by coupling resources distributed across various organizations and administrative domains that may be owned by different organizations, it is essential to support mechanisms and policies that help regulate resource supply and demand [16,17]. An economic approach is one means of managing resources in a complex and decentralized manner. This approach provides incentives for resource owners and users to be part of the Grid and develop strategies that help maximize their objectives.

Programming tools and paradigms: Grid applications (multi-disciplinary applications) couple resources that cannot be replicated at a single site, or that may be globally located for other practical reasons. A Grid should include interfaces, APIs, utilities, and tools to provide a rich development environment. Common scientific languages such as C, C++, and Fortran should be available, as should application-level interfaces such as MPI and PVM. A variety of programming paradigms should be supported, such as message passing or distributed shared memory. In addition, a suite of numerical and other commonly used libraries should be available.

User and administrative GUI: The interfaces to the services and resources available should be intuitive and easy to use. In addition, they should work on a range of different platforms and operating systems. They also need to take advantage of Web technologies to offer a view of portal supercomputing. The Web-centric approach to access supercomputing resources should enable users to access any resource from anywhere over any platform at any time. That means the users should be allowed to submit their jobs to computational resources through a Web interface from any of the accessible platforms such as PCs, laptops, or Personal Digital Assistants, thus supporting ubiquitous access to the Grid. The provision of access to scientific applications through the Web (e.g.
RWCP's parallel protein information analysis system [18]) leads to the creation of science portals.
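
To make the "Resource management and scheduling" item of the Design list a little more concrete, here is a minimal sketch of one possible placement policy: send each job to the machine with the lowest current load, in the spirit of the resource balancing described in the Redbook extract. It is not taken from any of the cited sources or from a real grid scheduler. The function name, the load units and the greedy policy are all assumptions made for illustration.

```python
import heapq


def schedule(jobs, machines):
    """Greedily place each job on the machine with the lowest current load.

    jobs: dict mapping job name -> estimated cost (arbitrary units)
    machines: dict mapping machine name -> current load (same units)
    Returns a dict mapping job name -> machine name.
    """
    # Min-heap of (load, machine) so the idlest machine is always on top.
    heap = [(load, name) for name, load in machines.items()]
    heapq.heapify(heap)
    placement = {}
    # Hypothetical policy: place the most expensive jobs first.
    for job, cost in sorted(jobs.items(), key=lambda kv: -kv[1]):
        load, name = heapq.heappop(heap)
        placement[job] = name
        # The chosen machine is now busier; push it back with its new load.
        heapq.heappush(heap, (load + cost, name))
    return placement


if __name__ == "__main__":
    print(schedule({"j1": 30, "j2": 20, "j3": 10},
                   {"a": 90, "b": 10, "c": 50}))
```

A real Grid scheduler would also have to respect the local policies of each resource owner and react to failures; the sketch ignores both and only shows the balancing idea.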