Practice of Building Intelligent Operation and Maintenance Platform for International Securities Enterprises
Customer Profile
The client in this case is a large international securities enterprise in Asia with a registered capital of 2 million yuan. Its business scope mainly covers securities brokerage, securities investment consulting, proprietary securities trading, and securities asset management.
Pain point analysis
The securities industry is both data-intensive and technology-intensive. As a large securities company, the client runs an IT system made up of multiple subsystems covering trading, settlement, risk control, and other functions. The architecture is complex, and a large volume of IT resources must be managed and maintained. In particular, the core trading system must handle large volumes of trading data and highly concurrent trading requests, which places high demands on system performance and stability.
As the client's business grows, its underlying IT infrastructure keeps expanding, and failures of hardware and information systems are becoming more frequent. The existing monitoring system, however, offers only basic functions and has no effective alarm notification. When a fault occurs, staff respond slowly and cannot locate the problem in time. Daily monitoring depends on maintenance personnel constantly watching screens, which adds to their management burden.
Specifically, the customer faced the following issues during the IT operations process:
- IT asset management is disorganized: assets are hard to inventory, and much asset information is missing;
- The company has multiple data centers and complex network zoning, making it difficult to manage equipment centrally;
- There are many dedicated business lines that frequently carry large files, making stability hard to guarantee;
- The company's key portal has no dedicated maintenance staff and is checked manually every day;
- Faults are not detected in time: there is a long lag between the fault occurring, front-end business personnel noticing and reporting it, and operation and maintenance personnel receiving the fault information, so the response is delayed.
To solve these problems, the client planned to build a fully functional monitoring system, aiming to safeguard the entire business system by transforming and upgrading its original operation and maintenance system.
Solution
Based on the structure of IT systems in securities enterprises and on the customer's pain points and actual operation and maintenance needs, Lewei created an intelligent operation and maintenance solution for the client. It covers global monitoring, asset sorting, large screen views, dedicated line links, a management portal, and an alarm center, and provides one-stop operation and maintenance management services that address the customer's pain points in practice.
System Architecture
For high availability and security, the project adopted a distributed deployment based on PostgreSQL streaming replication and Pgpool-II HA as the underlying monitoring database, to cope with massive transaction data and high concurrency. The high-availability cluster for the Zabbix, Web, and Proxy nodes was built with PCS on Corosync and Pacemaker, giving automatic switchover between the two nodes in a disaster and ensuring the reliability of the monitoring system itself.
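
The source does not describe how replication health is checked. As one illustrative sketch, assume the monitoring side can read the primary's current WAL position and a standby's replay position as PostgreSQL LSN strings (for example from `pg_current_wal_lsn()` and the `replay_lsn` column of `pg_stat_replication`). The lag in bytes can then be derived as below. The function names and the 16 MiB threshold are hypothetical and not taken from the project.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to a 64-bit byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits, then the low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Bytes of WAL the standby still has to replay."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)


def standby_is_lagging(primary_lsn: str, standby_lsn: str,
                       max_lag_bytes: int = 16 * 1024 * 1024) -> bool:
    """True when the lag exceeds the (assumed) alerting threshold."""
    return replication_lag_bytes(primary_lsn, standby_lsn) > max_lag_bytes
```

A check like this could feed a monitoring item so that a standby falling behind raises an alarm before a switchover is ever needed.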

Key functional scenarios
- Global monitoring: full monitoring of all customer resources with fully visible status, covering network devices, security devices, servers, storage, operating systems, virtualization, databases, and middleware;
- Asset sorting: manage assets according to rigorous naming conventions and sound, consistent grouping standards;
- Large screen view: provide network topology and business screens, for example a topology diagram showing the complete network architecture and the real-time status of important links between IDCs;
- Dedicated line link: real-time visibility of bandwidth utilization on dedicated business lines, automatic threshold alarms, and monitoring of dedicated line latency and jitter;
- Portal monitoring: simulate login, monitor portal service status in multiple steps, and chart the trends in web access speed and response time;
- Alarm center: integrate with the company's existing SMS platform to push alarms by SMS, with support for customized SMS and email messages, alarm analysis, and alarm history.
Global monitoring:
Lewei Monitoring takes a global perspective and provides unified monitoring and display. Operations and management personnel can see the overall health of the system at a glance and quickly reach information on faulty resources by switching tags. Unified alarms are generated from the unified monitoring data and can be pushed to the desktop (the system itself, PC email, etc.) and to mobile devices (SMS, mobile email, etc.).
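
To illustrate the idea of an "at a glance" health status, here is a minimal sketch that reduces per-resource states to one overall state and groups faulty resources by category. The three status levels, the tuple layout, and the resource names are assumptions made for the example; they are not the product's data model.

```python
from collections import defaultdict

# Assumed severity levels, from least to most severe.
SEVERITY_ORDER = ["ok", "warning", "critical"]


def overall_health(resources):
    """Return the worst status among (name, category, status) tuples."""
    statuses = [status for _name, _category, status in resources]
    return max(statuses, key=SEVERITY_ORDER.index, default="ok")


def faults_by_category(resources):
    """Group every non-ok resource under its category."""
    faults = defaultdict(list)
    for name, category, status in resources:
        if status != "ok":
            faults[category].append((name, status))
    return dict(faults)
```

The grouped result corresponds to what an operator would see after switching to a given category view, while `overall_health` gives the single headline state.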

Asset sorting
The customer needs to manage a large number of assets efficiently, which is difficult because asset categories and brands are diverse and there is no unified naming scheme. The Lewei solution manages assets according to rigorous naming rules and sound, consistent grouping standards.
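
The actual naming rules are not given in the source. As a sketch of how such a convention can be enforced, the code below checks names against a hypothetical pattern `<AREA>-<TYPE>-<SEQ>` (for example `A1-SW-001`) and groups valid assets by area and type. The pattern, the type codes, and the sample names are all assumptions.

```python
import re
from collections import defaultdict

# Hypothetical convention: <AREA>-<TYPE>-<SEQ>, e.g. "A1-SW-001".
NAME_PATTERN = re.compile(
    r"^(?P<area>[A-Z]\d)-(?P<kind>SW|FW|SRV|STO|DB)-(?P<seq>\d{3})$"
)


def group_assets(names):
    """Return (groups, invalid).

    groups maps (area, kind) to the names that follow the convention;
    invalid lists the names that do not match it.
    """
    groups = defaultdict(list)
    invalid = []
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is None:
            invalid.append(name)
        else:
            groups[(match["area"], match["kind"])].append(name)
    return dict(groups), invalid
```

Names that land in `invalid` are the ones that would need to be renamed or reviewed before they can be placed in a group.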

Large screen view
The large screen displays the complete network topology and the real-time status of important links between IDCs, addressing delayed fault detection and difficult fault location. As shown in the figure, the topology view clearly displays how the data centers interconnect and how the subnet areas within each data center are composed. The operating status of each network element can be read directly from the color of the devices and lines.

Dedicated line link:
Link monitoring shows the real-time bandwidth utilization of important dedicated business lines, and an alarm is triggered when utilization reaches the percentage threshold set for that line. The delay and jitter of an individual dedicated line can also be inspected, as shown in the figure.
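
The two measurements described here can be sketched in a few lines. The 80% default threshold is an assumption (the source only says each line has its own percentage threshold), and jitter is computed as the mean absolute difference between consecutive latency samples, which is one common definition and not necessarily the one the product uses.

```python
from statistics import mean


def utilization_percent(bits_per_second, link_capacity_bps):
    """Link utilization as a percentage of the line's own capacity."""
    if link_capacity_bps <= 0:
        raise ValueError("link capacity must be positive")
    return 100.0 * bits_per_second / link_capacity_bps


def should_alarm(bits_per_second, link_capacity_bps, threshold_percent=80.0):
    """True when utilization reaches the per-line threshold (assumed 80%)."""
    return utilization_percent(bits_per_second, link_capacity_bps) >= threshold_percent


def jitter_ms(latency_samples_ms):
    """Mean absolute difference between consecutive latency samples."""
    if len(latency_samples_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(latency_samples_ms, latency_samples_ms[1:])]
    return mean(diffs)
```

For example, 85 Mbit/s of traffic on a 100 Mbit/s line is 85% utilization and would alarm at the assumed threshold, while latency samples of 10, 12, 11, and 14 ms give a jitter of 2 ms.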


Portal monitoring:
Discussions with the client revealed that the company portal had occasionally been unreachable in the past, with significant impact, so company leaders used to check access manually every morning before work. The portal is now under monitoring, with web probes testing it continuously, and the response speed of the portal pages can be viewed in detail. This removes the need for the repeated manual checks of the past, as shown in the figure.
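
A multi-step probe of this kind can be sketched as follows. The example assumes a caller-supplied `fetch(url)` function returning `(status_code, body)`; the step names, URLs, and expected strings are hypothetical, and this is not the product's own web-scenario implementation.

```python
import time
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    ok: bool
    elapsed_seconds: float


def run_scenario(steps, fetch, clock=time.perf_counter):
    """Run (name, url, expected_text) steps in order.

    A step passes when fetch(url) returns status 200 and the body
    contains expected_text. The scenario stops at the first failure,
    since later steps usually depend on earlier ones (e.g. login).
    """
    results = []
    for name, url, expected_text in steps:
        started = clock()
        try:
            status, body = fetch(url)
            ok = status == 200 and expected_text in body
        except Exception:
            # Treat any transport error as a failed step.
            ok = False
        results.append(StepResult(name, ok, clock() - started))
        if not ok:
            break
    return results
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library; the recorded `elapsed_seconds` per step is what a response-time trend chart would be built from.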

SMS notification:
The company has adopted SMS alarm notification, so operation and maintenance personnel receive event notifications as soon as a system fails. This ends the previous situation in which business personnel discovered information system failures before the operations team did. An example is shown in the figure.
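
Message customization can be illustrated with a small template renderer. The template text, the event field names, and the 70-character limit are assumptions chosen for the example; the client's SMS platform and its real message format are not described in the source.

```python
from string import Template

# Hypothetical message layout; field names are illustrative only.
SMS_TEMPLATE = Template("[$severity] $host: $problem at $time")


def render_sms(event, template=SMS_TEMPLATE, max_length=70):
    """Fill the template from an event dict and cap the message length."""
    text = template.substitute(event)
    if len(text) <= max_length:
        return text
    # Truncate and mark the cut so the message still fits the limit.
    return text[: max_length - 3] + "..."
```

An event such as `{"severity": "HIGH", "host": "A1-SW-001", "problem": "Link down", "time": "09:15"}` renders as `[HIGH] A1-SW-001: Link down at 09:15`. The same event dict could be passed to a different template for email.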

Implementation Overview:
Resource requirements
The monitoring system consists of four roles: main collection, Web portal, database, and proxy collection. The main collection, Web portal, and database roles each run on dual nodes, and proxy collection is divided into two groups of two nodes each.
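
The node count implied by this layout can be worked out directly. The sketch assumes that every node is a separate server and that no role shares a host with another; the source does not state this either way.

```python
# (groups, nodes per group) for each role, as described in the text.
ROLES = {
    "main collection": (1, 2),
    "web portal": (1, 2),
    "database": (1, 2),
    "proxy collection": (2, 2),
}


def nodes_per_role(roles=ROLES):
    """Number of nodes each role needs."""
    return {role: groups * per_group for role, (groups, per_group) in roles.items()}


def total_nodes(roles=ROLES):
    """Total nodes, assuming one node per server and no shared hosts."""
    return sum(nodes_per_role(roles).values())
```

Under that assumption the deployment needs 2 + 2 + 2 + 4 = 10 nodes.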

Server distribution:
The main collection, Web portal, database, and proxy collection nodes are located in the A-1 area of the computer room and monitor the A-1 and C-1 areas with 100% coverage. Two sets of proxy collection nodes are located in the B-1 area of the computer room and monitor the A-1 area, B-1 area, 2 area, and C-1 area with 100% coverage.

Monitoring objects:

Customer benefits
This solution analyzes in depth the customer's pain points in basic information management and maintenance. Through detailed design and planning, it establishes an operation and maintenance monitoring system that is high-performance, feature-rich, broad in coverage, and flexible.
- Achieved full-coverage monitoring of the information infrastructure and full awareness of resource status, providing strong support for the stable operation of the customer's business;
- Replaced the previous "screen watching" work mode with timely and reliable alarm notifications for operation and maintenance personnel when sudden failures occur, together with accurate location of the point of failure;
- Reduced the complexity of operation and maintenance work, lowered the daily maintenance cost of information systems, and improved the stability of business systems.
