Objective The digital twin of a natural gas pipeline, constructed from physical entities, continuously generates large volumes of data, which poses significant challenges for data mining in terms of computational cost, efficiency, and scalability. The k-means clustering algorithm, a widely used data mining method, requires the number of clusters, k, to be specified in advance and depends on the choice of initial centroids, which affects both computational efficiency and the risk of converging to a local optimum. Consequently, the traditional k-means algorithm often takes several minutes or longer to process large datasets and produces non-unique results, so multiple runs are needed to obtain the best result.
Methods To meet the real-time data mining requirements of pipeline digital twins, an adaptive k-means clustering algorithm based on grid and domain centroid weight (GDCW-AKM) was proposed. The algorithm operates as follows: (1) Divide the dataset into evenly sized grid spaces, replace each space with its center position, and assign it a weight equal to the number of data points within that space, which yields the sample set and weight set. (2) Further segment the sample set into multiple domains, using the centroid of each domain as a sample and the sum of its weights as the new weight, which yields an updated sample set and weight set. (3) Assign a density (ρ) and a distance (δ) to each sample, define r = δ√ρ, and sort the samples in descending order of r. (4) Run the clustering for several candidate values of k, each time taking the first k samples in the r ordering as the initial centroids. The k whose clustering result maximizes the variance ratio criterion gives the optimal number of clusters, and its initial centroids are the optimal initial centroids; the corresponding clustering is taken as the final result.
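Steps (1) and (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `grid_compress` and `rank_by_r`, the `cell` and `cutoff` parameters, and the specific definitions of ρ (total weight within a cutoff radius) and δ (distance to the nearest sample of higher density, following the usual density-peak convention) are assumptions made for illustration, since the abstract does not define them. Steps (2) and (4) are omitted.

```python
import math
from collections import Counter


def grid_compress(points, cell):
    """Step (1): replace each occupied grid cell by its centre.

    The weight of a centre is the number of points that fell in the cell.
    `cell` (the grid edge length) is a hypothetical parameter.
    """
    counts = Counter(tuple(math.floor(x / cell) for x in p) for p in points)
    samples = [tuple((i + 0.5) * cell for i in idx) for idx in counts]
    weights = list(counts.values())
    return samples, weights


def rank_by_r(samples, weights, cutoff):
    """Step (3): score each sample with r = delta * sqrt(rho).

    Returns sample indices sorted by descending r, plus the r values.
    Assumed definitions (not given in the source):
      rho   = total weight of samples within `cutoff` (itself included)
      delta = distance to the nearest sample of higher density; the
              densest sample gets its largest distance to any sample.
    """
    n = len(samples)
    dist = [[math.dist(a, b) for b in samples] for a in samples]
    rho = [sum(w for d, w in zip(dist[i], weights) if d <= cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Ties in density are broken by index so the ordering is strict.
        higher = [dist[i][j] for j in range(n)
                  if (rho[j], -j) > (rho[i], -i)]
        delta.append(min(higher) if higher else max(dist[i]))
    r = [d * math.sqrt(p) for d, p in zip(delta, rho)]
    order = sorted(range(n), key=lambda i: r[i], reverse=True)
    return order, r
```

Under these assumptions, the first k entries of `order` would serve as the initial centroids for a given candidate k. A sample that is dense but sits next to an even denser one receives a small δ and therefore a small r, which keeps two initial centroids from landing in the same cluster.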
Results The GDCW-AKM algorithm was applied to a station of Shandong Natural Gas Pipeline Co., Ltd., and its performance was evaluated in terms of runtime and several internal clustering indexes. The results showed that the algorithm effectively identified the number of operating conditions in historical pipeline operation data and accurately assigned new data to the corresponding operating condition categories. Furthermore, with clustering accuracy above 99% relative to the k-means algorithm, computational efficiency improved significantly. For a dataset of 20×10⁴ points, efficiency increased nearly 12-fold, and larger datasets yielded even greater gains.
Conclusion The GDCW-AKM algorithm meets the efficiency and accuracy requirements of station pipeline digital twins, making it well suited to wider application in stations.