Moonboards are a great tool for a climber to get stronger and work on technique. A moonboard is a grid of standardized holds, with lights indicating which holds are "active" for a given climb. A problem begins with the hands on the starting holds (illuminated green) and the feet on any of the yellow holds in the foot box, if the problem permits. A climb is complete when both hands match on the final hold (illuminated red), having used only the lit holds along the way. An example problem is shown in the "moonboard data scraper" section below. The goal of this project is to build a neural network, trained on climbing YouTube videos, that can solve a moonboard problem. Stay tuned for updates.
I first built a moonboard data scraper library in Python to automatically extract human joint positions and climbing sequence data from hundreds of YouTube climbing videos, which I used as training data for my neural network. A schematic of the pipeline:
The first task was to develop a way to scrape climbing video training data from YouTube and process the footage to extract the joint positions for each move, relative to the board holds. Since the moonboard holds are standardized and have (mostly) unique shapes, I labelled them in approximately 200 moonboard video images using Roboflow. With this data, I retrained an Ultralytics object detection model to identify and label the visible holds, from which I could extract hold positions as pixel coordinates from the image. A rough sketch of the inference step and the resulting detections are shown below:
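As an illustration of the inference step, the snippet below runs an Ultralytics detection model on a single frame and reads the hold centers off the predicted bounding boxes. The weights path, confidence threshold, and frame filename are placeholders, not the values used in the project:

```python
from ultralytics import YOLO

# Placeholder path to weights retrained on the Roboflow-labelled hold images.
model = YOLO("runs/detect/holds/weights/best.pt")

# Run detection on one video frame (placeholder filename and confidence threshold).
results = model("frame.jpg", conf=0.5)

hold_centers = []
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()    # bounding box in pixel coordinates
    label = model.names[int(box.cls[0])]     # predicted hold class label
    hold_centers.append((label, ((x1 + x2) / 2, (y1 + y2) / 2)))

print(hold_centers)  # e.g. [("hold_A5", (312.0, 451.5)), ...]
```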
So far it looks good: the object detection model is able to identify non-blocked holds (shown by blue bounding boxes with a green circle at the center), but it misses holds obstructed by the overlaid video text and the climber's body. The model is also unable to uniquely identify the 10 foot holds on the kickboard (the bottom two rows of holds, with large spaces between them), since those holds have an identical shape. To get around this, I used scikit-learn to perform linear regression and k-means clustering to interpolate the missing hand and foot hold locations. In the first figure below, the grey lines are lines of best fit through the holds detected by the object detection model; the intersections of these lines approximate the positions of the blocked holds. Finally, I used the Ultralytics YOLO11 keypoints model to track the climber's joint positions and bounding box, shown in the second image below.
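To make the interpolation step concrete, the sketch below fits lines of best fit through detected hold centers in one board row and one board column with scikit-learn, then takes their intersection as the approximate center of a blocked hold. The pixel coordinates are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_line(xs, ys):
    """Least-squares fit ys ~ m * xs + b; returns (m, b)."""
    reg = LinearRegression().fit(np.asarray(xs).reshape(-1, 1), np.asarray(ys))
    return reg.coef_[0], reg.intercept_

# Made-up pixel centers of detected holds along one board row and one board column.
row_x, row_y = [102, 158, 214, 326], [415, 412, 409, 404]   # row: fit y as a function of x
col_x, col_y = [270, 268, 266, 263], [120, 220, 320, 520]   # column: fit x as a function of y
                                                            # (columns are near-vertical in the image)

m_r, b_r = fit_line(row_x, row_y)   # y = m_r * x + b_r
m_c, b_c = fit_line(col_y, col_x)   # x = m_c * y + b_c

# Solve the two line equations simultaneously for the missing hold's center.
y_miss = (m_r * b_c + b_r) / (1 - m_r * m_c)
x_miss = m_c * y_miss + b_c
print((x_miss, y_miss))   # approximate pixel position of the blocked hold
```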
The next challenge was to identify the starting, ending, and valid holds for each video. This information is extremely difficult to obtain from the training footage alone, since the lights marking valid holds on the moonboard typically aren't picked up by phone cameras. To make matters worse, the Moonboard website no longer provides a public database of its problems. Thankfully, there has been work in the community to scrape moonboard problem data from the app, which can be found on GitHub here. To extract problem data, I modified that code, ran the Moonboard Android app on my PC using the Link to Windows app (available on Google Play), and used an auto-clicker app to automatically scroll through and extract problem data for over 1200 problems. I plan to make the dataset available on my website at a later date for others to use.
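For illustration, a scraped problem can be stored in a small structure like the one below. The field names and the column-letter/row-number hold labels are my own assumptions, not the format used by the Moonboard app or the GitHub scraper:

```python
from dataclasses import dataclass, field

@dataclass
class MoonboardProblem:
    """Illustrative container for one scraped problem; fields and hold-label
    convention are assumptions, not the app's actual format."""
    name: str
    grade: str
    start_holds: list[str] = field(default_factory=list)         # lit green on the board
    intermediate_holds: list[str] = field(default_factory=list)  # valid holds for the climb
    finish_holds: list[str] = field(default_factory=list)        # lit red on the board

example = MoonboardProblem(
    name="Example Problem",
    grade="6B+",
    start_holds=["F5"],
    intermediate_holds=["E8", "G11", "D13"],
    finish_holds=["F18"],
)
print(example)
```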
To extract the moves for a given climb, I analyzed the x and y components of the velocity and acceleration time series for the hands and feet to determine when a move began and ended. The footage above is choppy: the hold bounding boxes are visibly jittery, and the joint positions move unnaturally. In its current state the video is unsuitable for extracting data, since the resulting time series are extremely noisy. To obtain robust hold positions, I averaged each hold's bounding box and center over every frame in the video. To obtain more robust pose positions, I generated a new "coarse-grained" video by grouping the frames of the original video into batches of 5-10 frames and averaging the joint positions within each batch. The processed video and the corresponding movement detection time series for the hands and feet are shown below. The x-axis denotes the video frame, and the red vertical line indicates the current frame in the video. Sharp peaks in the time series indicate that a move by one of the joints labelled in the legend has just ended (since a move takes a finite amount of time).
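As a rough sketch of the move-detection idea, the code below coarse-grains a single joint's (x, y) trajectory over batches of frames, computes its speed between coarse-grained frames (rather than the full set of velocity and acceleration components), and picks out sharp peaks with SciPy. The trajectory, batch size, and peak thresholds are placeholders:

```python
import numpy as np
from scipy.signal import find_peaks

def coarse_grain(positions, batch=8):
    """Average (x, y) joint positions over batches of frames to suppress jitter."""
    n = len(positions) // batch * batch
    return positions[:n].reshape(-1, batch, 2).mean(axis=1)

# Placeholder trajectory for one joint: (n_frames, 2) pixel positions.
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(scale=2.0, size=(600, 2)), axis=0)

smoothed = coarse_grain(traj, batch=8)

# Speed of the joint between consecutive coarse-grained frames.
speed = np.linalg.norm(np.diff(smoothed, axis=0), axis=1)

# Sharp peaks in the speed time series mark candidate ends of moves.
peaks, _ = find_peaks(speed, height=speed.mean() + 2 * speed.std(), distance=5)
print("candidate move frames (coarse-grained index):", peaks)
```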