Rapid development in R with lots of help from ChatGPT
Like many people, I’ve been tinkering with ChatGPT to see what it can do. I was developing some fairly complex analysis code in the R language that I wanted to talk about and open source as part of the talk I gave earlier this week at Monitorama, and decided to ask ChatGPT (the free version) to remind me how to code in R, as an alternative to reading man pages and hunting for Stack Overflow fragments. I’ve been coding in R for decades, but I don’t do it every day; every year or two I have a reason to use it, and spend a lot of time reading man pages. This time I got code that worked, using constructs and packages that I hadn’t used before, and got the coding done more than ten times faster than usual. The rest of this blog post is an edited version of the conversation I had with ChatGPT, which also acts as documentation for the code I’ve open sourced. I’m pretty sure that anyone who wants this algorithm in a different language like Python or Go could have a similar conversation and make it happen.
This work built on my previous blog post — Percentiles Don’t Work — Analyzing the Distribution of Response Times. At that time I developed a function called as.peaks, which takes a histogram and fits normally distributed peaks to it, and shared that code on GitHub. Some minor updates to as.peaks and the new code developed with help from ChatGPT are provided there.
This is a long blog post, as I think it’s interesting to see how the conversation developed, and what worked well and what didn’t in my first attempt at coding with ChatGPT. I’m extremely impressed, and expect to use this kind of development assistance on anything I write from now on.
The very first question in the conversation is shown below. It implied that I’m coding in R, and I was trying to import some data in a format I hadn’t tried before: a CSV of a log file, where the details I wanted were encoded as part of a JSON object.
There were three pages of detailed instructions, which all looked good, but I realized I hadn’t asked for exactly the right thing, because autocorrect had changed CSV to CSS, so I tried again. I’ve reduced the font size here to show the entire response.
This worked fine. Now I needed to pull out some fields from the JSON.
A few more similar small lookups helped me get the right data into the right format: a data frame containing a log file, with each row having a timestamp, the latency of that request in milliseconds, and a tag identifying the type of query being made. For this analysis I want to make a series of histograms by taking the log file and processing each minute separately, then looking at how the histograms change over time. The log file I started with had several minutes of data in it.
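The responses themselves appear here as screenshots, but the gist of the import was something like the sketch below, where the column name message, the JSON field names timestamp, latency and query, and the timestamp format are all hypothetical stand-ins for the real log format:

library(jsonlite)

# Read the log file as CSV; assume one column holds a JSON object per row
logs <- read.csv("logfile.csv", stringsAsFactors = FALSE)

# Parse each JSON string and pull out the fields of interest
parsed <- lapply(logs$message, fromJSON)
df <- data.frame(
  time    = as.POSIXct(sapply(parsed, function(p) p$timestamp),
                       format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
  latency = as.numeric(sapply(parsed, function(p) p$latency)),
  query   = sapply(parsed, function(p) p$query)
)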
This code worked well, but left out the partial minutes at the start and end.
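The minute-splitting code isn’t reproduced here, but a sketch of the approach (using the df layout from the import sketch above) explains where the partial minutes went: cut() labels any rows outside the whole-minute breaks as NA, and they fall out of the analysis.

# Round the start and end to whole minutes and build one-minute breaks
start_time <- round(min(df$time), "mins")
end_time   <- round(max(df$time), "mins")
breaks     <- seq.POSIXt(start_time, end_time, by = "min")

# Label each row with its minute; rows before the first break or after
# the last one come back as NA, which is what lost the partial minutes
df$section <- cut(df$time, breaks = breaks, right = FALSE, labels = FALSE)

# One histogram per minute, all on a shared set of log-latency breaks
hb <- hist(log(df$latency), breaks = 40, plot = FALSE)$breaks
histograms <- lapply(split(df, df$section), function(chunk)
  hist(log(chunk$latency), breaks = hb, plot = FALSE))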
I decided that rather than clustering all the data at once, I wanted to eventually end up with a way to do a sliding window across time, so I asked for a re-write of the code.
# Function to add a new data frame to the iterative clustering algorithm
# (threshold is assumed to be defined in the calling environment)
addDataFrame <- function(data_frame, clusters) {
  # Iterate over each point in the new data frame
  for (i in 1:nrow(data_frame)) {
    point <- data_frame[i, ]  # Get the current point
    # Calculate distances to existing clusters
    distances <- sapply(clusters, function(cluster) sqrt(sum((point - cluster$centroid)^2)))
    # Find the closest cluster
    closest_cluster_index <- which.min(distances)
    closest_cluster <- clusters[[closest_cluster_index]]
    # Check if the closest cluster is within the matching criterion
    if (distances[closest_cluster_index] <= threshold) {
      # Add the point to the closest cluster
      closest_cluster$points <- rbind(closest_cluster$points, point)
      # Update the cluster centroid
      closest_cluster$centroid <- colMeans(closest_cluster$points)
      # Write the updated cluster back into the list
      clusters[[closest_cluster_index]] <- closest_cluster
    } else {
      # Create a new cluster with the point
      new_cluster <- list(points = point, centroid = as.matrix(point))
      clusters <- c(clusters, list(new_cluster))
    }
  }
  return(clusters)
}
# Example usage
# Initialize the clusters using the points from the first data frame
first_data_frame <- list_of_dataframes[[1]]
initial_clusters <- lapply(1:nrow(first_data_frame), function(i) {
  list(points = first_data_frame[i, ], centroid = as.matrix(first_data_frame[i, ]))
})
# Process subsequent data frames
for (i in 2:length(list_of_dataframes)) {
  new_data_frame <- list_of_dataframes[[i]]
  initial_clusters <- addDataFrame(new_data_frame, initial_clusters)
}
# Return the current cluster state
current_clusters <- initial_clusters
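Note that threshold is referenced inside addDataFrame but never defined, so to actually run the example you need to set it and supply a list of numeric data frames. A made-up smoke test might look like this:

threshold <- 0.1  # maximum centroid distance for a point to join a cluster
set.seed(42)
list_of_dataframes <- list(
  data.frame(x = rnorm(5), y = rnorm(5)),
  data.frame(x = rnorm(5), y = rnorm(5))
)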
I copied the above code here as text because it has long lines. After some more back and forth, the function was revised to work on the (x, y) coordinates explicitly:
addDataFrame <- function(data_frame, clusters) {
  for (i in 1:nrow(data_frame)) {
    point <- data_frame[i, c("x", "y")]  # Extract the (x, y) coordinates
    # Calculate distances to existing clusters
    distances <- sapply(clusters, function(cluster) sqrt((point[1] - cluster$centroid[1])^2 + (point[2] - cluster$centroid[2])^2))
    closest_cluster_index <- which.min(distances)
    closest_cluster <- clusters[[closest_cluster_index]]
    if (distances[closest_cluster_index] <= threshold) {
      closest_cluster$points <- rbind(closest_cluster$points, point)
      closest_cluster$centroid <- colMeans(closest_cluster$points)
      # Write the updated cluster back into the list
      clusters[[closest_cluster_index]] <- closest_cluster
    } else {
      new_cluster <- list(points = point, centroid = point)
      clusters <- c(clusters, list(new_cluster))
    }
  }
  return(clusters)
}
The above code was working well, with a few tweaks, and my initial sample data, but I wanted to look at it visually, so I asked for a plot function, as shown below. This is all I said, and the resulting code worked fine. I think this is amazing.
At this point, it was all looking good. I was using 5 minutes of data, so I obtained a larger dataset, covering 15 minutes. This gave an error, and it took ChatGPT and me a while to figure out what was going on. The list of clusters contained a data frame for the centroid when a cluster was first created, but that was overwritten by a vector when it was updated. The mixture of types was breaking the plot routine whenever there was a cluster with a single point. ChatGPT was good at suggesting alternatives and interpreting some of the error messages, but in the end I had to spot that the data types were different and fix it.
This error was quite misleading, and the suggestion to provide the code and data led to another issue. The data structure was quite large, and when I pasted it into the conversation as a question, it seemed to overflow the context window, and ChatGPT forgot what we were doing. I deleted that question and ChatGPT regained its context. The problem was that when displayed as text, the data frame and vector types looked alike. I ended up isolating a small test case that failed and was able to figure it out, but there was a lot of chat that I won’t share here.
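The confusion is easy to reproduce outside the clustering code: a one-row matrix and a named numeric vector print almost identically, but str() reveals the difference. This snippet is my own illustration, not from the conversation:

m <- as.matrix(data.frame(x = 1.2, y = 3.4))                  # centroid as created
v <- colMeans(data.frame(x = c(1.0, 1.4), y = c(3.0, 3.8)))   # centroid after update
print(m)  # both print as x and y values...
print(v)
str(m)    # ...but this is a 1x2 matrix
str(v)    # ...and this is a named numeric vector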
I then tried to get ChatGPT to clean up and document the code.
I found that it documented the parameters of the functions well, but it actually removed some of my comments from the code, so I tried again.
# addPeaks: Add peaks to the clusters
# Input:
# - peaks: Data frame of peaks (x, y) coordinates
# - clusters: List of current clusters
# - normalize: Normalization factor for log latency
# - threshold: Maximum distance threshold to consider a point as part of a cluster
# Output:
# - Updated list of clusters
addPeaks <- function(peaks, clusters, normalize, threshold = 0.05) {
  for (i in 1:nrow(peaks)) {
    point <- peaks[i, c(8, 1)]  # Extract the (x, y) coordinates (Latency and Density)
    if (point[1] <= 0) next  # Skip if latency rounds down to zero (to avoid log(0))
    point[1] <- log(point[1]) / normalize  # Normalize log latency to a consistent max bucket
    # Calculate distances to existing clusters
    distances <- sapply(clusters, function(cluster) sqrt((point[1] - cluster$centroid[1])^2 + (point[2] - cluster$centroid[2])^2))
    closest_cluster_index <- which.min(distances)
    closest_cluster <- clusters[[closest_cluster_index]]
    if (distances[closest_cluster_index] <= threshold) {
      closest_cluster$points <- rbind(closest_cluster$points, point)
      closest_cluster$centroid <- colMeans(closest_cluster$points)
      clusters[[closest_cluster_index]] <- closest_cluster
    } else {
      new_cluster <- list(points = point, centroid = colMeans(point))  # Create a new cluster
      clusters <- c(clusters, list(new_cluster))
    }
  }
  return(clusters)
}
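For a single minute’s worth of peaks, a call would look like this (mhb is the maximum histogram bucket, computed in the guillotine function below):

clusters <- addPeaks(peaks, clusters, normalize = mhb, threshold = 0.05)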
# guillotine: Process log file by chopping it into one-minute chunks, finding peaks, and clustering them
# Input:
# - df: Data frame containing timestamp, latency, and query
# - plot: Boolean indicating whether to plot the clusters
# - epsilon: Epsilon value for peak detection
# - peakcount: Number of peaks to identify
# Output:
# - List of clusters
guillotine <- function(df, plot = FALSE, epsilon = 0.01, peakcount = 10) {
  start_time <- round(min(df$time), "mins")
  end_time <- round(max(df$time), "mins")
  minute_intervals <- seq.POSIXt(start_time, end_time, by = "min")
  msc <- cut(df$time, breaks = minute_intervals, right = FALSE, labels = FALSE)
  last_interval <- max(msc, na.rm = TRUE)
  msc[is.na(msc)] <- last_interval + 1
  df$msc <- msc
  hb <- hist(log(df$latency), breaks = 40, plot = plot)$breaks
  mhb <- max(hb)  # Max histogram bucket - needed to normalize latency
  results_list <- lapply(unique(df$msc), function(section) {
    subset_df <- df[df$msc == section, , drop = FALSE]
    peaks <- as.peaks(hist(log(subset_df$latency), breaks = hb, plot = FALSE), normalize = TRUE, epsilon = epsilon, peakcount = peakcount, plots = plot)
  })
  first_data_frame <- results_list[[1]]
  first_points <- first_data_frame[, c(8, 1)]  # Extract the (x, y) coordinates (Latency and Density)
  first_points[, 1] <- log(first_points[, 1]) / mhb  # Normalize log latency to max bucket
  initial_clusters <- lapply(1:nrow(first_points), function(i) {
    list(points = first_points[i, ], centroid = as.matrix(first_points[i, ]))
  })
  for (i in 2:length(results_list)) {
    new_data_frame <- results_list[[i]]
    initial_clusters <- addPeaks(new_data_frame, initial_clusters, mhb)
  }
  return(initial_clusters)
}
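Putting it together, a typical run over the imported log data is just the following (a sketch, assuming df has the time and latency columns described earlier):

clusters <- guillotine(df, plot = FALSE, epsilon = 0.01, peakcount = 10)
length(clusters)        # number of distinct peaks being tracked
plotClusters(clusters)  # plotting function shown below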
library(ggplot2)
# plotClusters: Plot the current state of the clusters
# Input:
# - clusters: List of clusters
# Output:
# - Plot object
plotClusters <- function(clusters) {
  points <- data.frame(x = numeric(), y = numeric(), cluster = factor())
  for (i in 1:length(clusters)) {
    cluster <- clusters[[i]]
    cluster_points <- cluster$points
    points <- rbind(points, data.frame(x = cluster_points[, 1], y = cluster_points[, 2], cluster = as.factor(i)))
  }
  p <- ggplot(points, aes(x = x, y = y, color = cluster)) +
    geom_point() +
    theme_minimal()
  centroids <- lapply(clusters, function(cluster) cluster$centroid)
  centroids_df <- data.frame(x = sapply(centroids, "[", 1), y = sapply(centroids, "[", 2))
  p <- p + geom_point(data = centroids_df, aes(x = x, y = y), color = "black")
  return(p)
}
I decided not to take the cleaned-up version, as I wanted to keep my comments, but it did seem to do a reasonable job of tidying the code.
I wanted to do some more work on the plots, but I haven’t used the ggplot2 library before, so I asked the expert…
plotClusters <- function(clusters) {
  points <- data.frame(x = numeric(), y = numeric(), cluster = factor())
  for (i in 1:length(clusters)) {
    cluster <- clusters[[i]]
    cluster_points <- cluster$points
    points <- rbind(points, data.frame(x = cluster_points[, 1], y = cluster_points[, 2], cluster = as.factor(i)))
  }
  p <- ggplot(points, aes(x = x, y = y, color = cluster)) +
    geom_point() +
    theme_minimal() +
    labs(x = "X-Axis Label", y = "Y-Axis Label")  # Add x-axis and y-axis labels here
  centroids <- lapply(clusters, function(cluster) cluster$centroid)
  centroids_df <- data.frame(x = sapply(centroids, "[", 1), y = sapply(centroids, "[", 2))
  p <- p + geom_point(data = centroids_df, aes(x = x, y = y), color = "black")
  return(p)
}
Here’s what that plot looks like.
Next I wanted a new plot, to show how the peaks evolve over time, and it took me a few tries before I got the request right and got the plot I wanted. Each of the responses worked, but wasn’t what I was looking for. This was a very fast way to iterate, compared to writing the plot code myself…
plotClusterDensity <- function(clusters) {
  # Create a data frame to store the cluster density and time information
  cluster_data <- data.frame(Time = numeric(), Cluster = numeric(), Density = numeric())
  # Extract cluster density and time information from each cluster
  for (i in 1:length(clusters)) {
    cluster <- clusters[[i]]
    density <- cluster$points$PeakDensity
    time <- cluster$points$Time
    cluster_data <- rbind(cluster_data, data.frame(Time = time, Cluster = i, Density = density))
  }
  # Plot the cluster densities over time
  p <- ggplot(cluster_data, aes(x = Time, y = Density, group = Cluster, color = as.factor(Cluster))) +
    geom_line() +
    geom_point() +
    labs(x = "Time", y = "Peak Density", color = "Cluster") +
    theme_minimal()
  return(p)
}
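This version assumes each cluster’s points data frame carries PeakDensity and Time columns; given clusters built that way, calling it is just:

print(plotClusterDensity(clusters))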
This is what that plot looks like.
I did some more work on yet another kind of plot, but ran out of time before I had to give my talk, and posted the code as it was to GitHub.
I know that this is the kind of thing that GitHub Copilot and similar tools are designed to do, and that ChatGPT 4 is better at this, but I initially just wanted to look up some syntax reminders, and ended up with big chunks of working code written by ChatGPT. I think it’s impressive as it is.