Wednesday, May 14, 2008

PSA to all PITCHf/x guys

As of the 12th MLBAM has changed the header for their data. It used to look like this:

atbat num="1" b="4" s="2" o="0" batter="435065" pitcher="458567" des="Reggie Willits walks. " stand="L" event="Walk"

Now it looks like this:

atbat num="1" b="3" s="1" o="1" batter="408299" stand="R" b_height="6-0" pitcher="435043" p_throws="L" des="Omar Infante grounds out, shortstop Brian Bixler to first baseman Adam LaRoche. " event="Ground Out"

Make sure to change your parser accordingly.
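One way to stop caring about where each field sits is to read the attributes by name. Here is a minimal sketch, assuming the lines above are really XML elements of the form `<atbat ... />` (the angle brackets are my assumption, they are not shown in the samples). The helper name `parse_atbat` is made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_atbat(tag_text):
    """Return the atbat attributes as a dict, whatever order they arrive in."""
    elem = ET.fromstring(tag_text)
    fields = dict(elem.attrib)
    # b_height and p_throws only appear in the newer header, so default them.
    for new_field in ("b_height", "p_throws"):
        fields.setdefault(new_field, None)
    return fields

# The two sample headers from this post, wrapped as self-closing elements.
old = ('<atbat num="1" b="4" s="2" o="0" batter="435065" pitcher="458567" '
       'des="Reggie Willits walks. " stand="L" event="Walk"/>')
new = ('<atbat num="1" b="3" s="1" o="1" batter="408299" stand="R" '
       'b_height="6-0" pitcher="435043" p_throws="L" '
       'des="Omar Infante grounds out, shortstop Brian Bixler to first '
       'baseman Adam LaRoche. " event="Ground Out"/>')

old_fields = parse_atbat(old)
new_fields = parse_atbat(new)
```

A parser that splits on position or on a fixed attribute order will break on the new header; a lookup by name like `new_fields["pitcher"]` works for both.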

Tuesday, May 13, 2008

Are there issues with the 2007 data corrections?

Recently at the PITCHf/x summit, Ike Hall presented a talk on data corrections in which he noted that my corrections appear to be overcorrecting the data on off-speed pitches. You can find the talk on the summit's website; it is labeled Data_Improvement.pdf. On page 11 Ike shows a plot of the differences between the drag coefficient at Comerica and PETCO parks. While the data looks great in the fastball region, the drag coefficient appears to differ, 0.3 versus 0.5, for pitches thrown near 60 MPH.

Now this seems like a huge issue. Obviously there is a very large difference between 0.3 and 0.5. The thing is, this difference in drag actually results in a very small effect. If you assume the air density to be 1.2 kg/m^3, the ball's initial velocity to be 60 MPH (26.8 m/s), and the circumference of a baseball to be 9 inches (cross-sectional area 0.004 m^2), then you can calculate the drag force. If you do that you get a drag force between 0.52 and 0.86 N. If you then want to find the difference in final velocity you can use the equations of motion, and if you assume the ball takes 0.5 seconds to reach the plate (which is actually quite large) you get a difference in final velocities of less than a third of a meter per second.
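For anyone who wants to check the drag force numbers, here is a short sketch using the standard drag equation F = 0.5 * rho * Cd * A * v^2 with the rounded inputs quoted above (air density 1.2 kg/m^3, 26.8 m/s, area 0.004 m^2). It only covers the force step, not the final velocity.

```python
RHO = 1.2     # air density in kg/m^3
V0 = 26.8     # 60 MPH expressed in m/s (rounded)
AREA = 0.004  # cross-sectional area of the ball in m^2 (rounded)

def drag_force(cd, rho=RHO, v=V0, area=AREA):
    """Drag force in newtons from the standard drag equation."""
    return 0.5 * rho * cd * area * v ** 2

for cd in (0.3, 0.5):
    print(f"Cd = {cd}: {drag_force(cd):.2f} N")
# Cd = 0.3: 0.52 N
# Cd = 0.5: 0.86 N
```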

So while it appears that my corrections are indeed overcorrecting the data, the results of these overcorrections are small. That said, I will be looking into how to adjust for this to fix my corrections, but don't expect a huge change to the data. Once I get that worked out I should be ready to run the corrections on the 2008 data, as interleague play is nearly upon us and that is what my code really needs.

Saturday, April 19, 2008

Ok another attempt at daily updates

Wow, was that a mess, but I think I might finally have all the bugs worked out for daily updates to the 2008 player cards. In any case it is clear now that 2008 data corrections will be needed. There were some hints of trouble with the data earlier, but now the cameras in Cincinnati are just clearly messed up. Take a look at Aaron Harang's card. Two starts at home, two starts on the road, and the data is split. Ben Sheets just made a start there and you can see the effect there as well (no, it wasn't his injury that caused the data to be that skewed).

So corrections will have to be made again. I am planning on running corrections like I did for 2007, but for the data set to become complete interleague games will have to occur. I could do one correction for the AL and another for the NL, but my code really isn't set up for that. We will see. I will keep you posted.

Monday, March 31, 2008

2008 Player Cards

Well the new season is upon us and even though I haven't updated this blog in ages I do have a treat for anyone who happens to stumble on this blog. The 2008 player cards are here! Just click here or the player cards link to the right and off you go.

A few notes about the new cards. First, absolutely no corrections have been done to the data. Everything is straight from Sportvision. Second, Sportvision was kind enough to add pitch types to the 2008 data, so instead of running my pitch identifying code I am just using their pitch types. Third, I have lowered the number of pitches necessary to have a player card from 100 to 10, at least for the beginning part of the season. This applies to both batters and pitchers. Lastly, while I had to upload today's batch by hand, tomorrow's should be uploaded automatically, so you should have completely up to date player cards at your disposal. Enjoy!

Monday, December 3, 2007

Web based PITCHf/x tool help/comment page

Here is the help/comment page for the web based PITCHf/x tool. If you have any comments please add them to the bottom of this post.

First, let me just make sure everyone is aware of what the PITCHf/x system is. PITCHf/x by Sportvision is a system that tracks the ball with two cameras as it travels to home plate. The cameras take a bunch of pictures of the ball in flight and then send the data to MLB, which puts it online for users to see. The data is in a messy form and needs corrections and pitch classifications before it can really be used. That is why I made the web based tool for anyone to use.

So the first few things the tool will ask you for are simple things like the name of the pitcher and batter. The only restriction is that you must put in either a pitcher or a batter (or both). Sadly, less than a quarter of all pitches were tracked this year, so if you put in certain pitcher/batter matchups it will come back with no results. If that happens please try again.

After you have entered the pitcher/batter it will ask you for the type of pitch, the result of the pitch, and the count. When you start, try leaving these blank to see what a certain pitcher throws; then you can go back and focus on only one type of pitch, for instance. If you feel a certain pitcher's stuff isn't being represented correctly please comment below.

Next, options are available to cut on things like pitch speed and horizontal and vertical movement. For all horizontal measurements, negative numbers mean movement in towards a right handed batter. Speed is measured in MPH and movement in inches.

Lastly, either the location of the pitch or the break of the pitch is shown. The location is simply where the ball crossed home plate. The break is how the ball moved in comparison to a ball thrown without spin. So if there was no spin the ball would end up at (0,0) on the graph.

Please note that sometimes the image posted will be the image from your previous query, pulled from your web browser's cache. This is because it takes the tool a few seconds to produce its result and sometimes your impatient browser will just use the previous image. If this happens please press reload.

There are still a few issues, including not allowing you to cut on the date. This is something that will be included but I am having trouble with it in my database. Also, I am having a bit of trouble with the spin and direction, so sorry that didn't make it. It will be coming soon though. Also, the release point will become an option to plot and cut on, and an extended table with some league averages will be in the next version. Lastly, the biggest issue is that when you make a selection and run it, the tool doesn't store your selection to allow you to alter your query quickly. This is very annoying but hard to fix on my end. I'll have a solution by the next update. If you press the back button hopefully your browser will remember your options, but that is an imperfect solution.

If you would like to use any of the plots you make, go ahead. Just add a link to the tool's webpage, this page, or the Hardball Times article.

A big thanks to my beta testing team Mark (TigsTown.com), Lee (www.detroittigertales.blogspot.com), and the guys at nomaas.org. Sorry for the slow posting of this. Hopefully the next version will be available before Christmas.

Monday, November 12, 2007

Classification Algorithm Explained

Once the data has been corrected we are ready to start classifying the pitches. But first there is a little trick I want to apply. Because the atmospherics can reduce the spin on the ball by up to 25% on a hot day at Coors, I translate each pitch as if it were thrown at sea level at standard temperature (59 degrees Fahrenheit). This is sort of like applying a park factor to correct for runs scored and puts each pitch on a level playing field. This is very important for the classification algorithm because if these pitches weren't translated, pitchers who spent half of their time at Coors would have two separate curve balls. This would really mess the algorithm up, and while Coors is the biggest problem, some other parks during midsummer or during a cold spell can have a higher than 10% change as well. Translating these pitches solves these problems.

OK, so now that the pitches are translated we are ready to classify them. I am using an incredibly simple algorithm that clusters pitches by determining how close a pitch was to every other pitch thrown by that pitcher. It calculates a "distance" between each pair of pitches by comparing the speed the pitch was thrown at and the vertical and horizontal accelerations. The two pitches that are closest together get merged. This process continues until all pitches are in clusters and the clusters are far enough away from each other.
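The merging process can be sketched in a few lines. This is only an illustration, not my actual code: the unscaled Euclidean distance, the use of cluster centroids once pitches have been merged, and the `stop_distance` cutoff are all assumptions made for the example.

```python
import math

def distance(a, b):
    """Euclidean distance between two (speed, ax, az) tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(cluster):
    """Mean (speed, ax, az) of all pitches in a cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(3))

def cluster_pitches(pitches, stop_distance):
    """Merge the closest pair of clusters until all are stop_distance apart."""
    clusters = [[p] for p in pitches]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= stop_distance:
            break  # remaining clusters are far enough apart
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Four made-up pitches: two hard ones and two slow breaking ones.
pitches = [(92, -5, -15), (93, -6, -14), (78, 4, -35), (77, 5, -36)]
clusters = cluster_pitches(pitches, stop_distance=10)
```

In practice the three quantities have different units and spreads, so some scaling of each one before computing the distance would be needed; the sketch leaves that out.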

Once the clusters are formed the algorithm finds the pitcher's fastball. It does this by simply taking the cluster that has the highest speed. Once the fastball is found every other cluster is compared to the fastball in speed and the two accelerations. Now the cluster algorithm is run again on the remaining clusters and pitch types are formed. By first comparing the pitches to the pitcher's fastball, Jamie Moyer's other pitches can be on the same footing as Joel Zumaya's pitches. The algorithm can't say these are curve balls, but it can put all the curve balls together and then I can label the group curve balls. Once this is done it goes back to the fastballs we started with and reclassifies those in case a pitcher only throws sinkers or cutters, for example.
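Here is a rough sketch of the "compare everything to the fastball" step. It assumes that "highest speed" means the highest average speed in a cluster and that the comparison is a simple subtraction of cluster means; both are choices made for this example, and the function names are made up.

```python
def mean_features(cluster):
    """Mean (speed, ax, az) of a cluster of pitches."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(3))

def relative_to_fastball(clusters):
    """Return each cluster's mean minus the fastball cluster's mean."""
    means = [mean_features(c) for c in clusters]
    fastball = max(means, key=lambda m: m[0])  # cluster with the highest mean speed
    return [tuple(m[i] - fastball[i] for i in range(3)) for m in means]

# Made-up clusters for one pitcher: a fastball cluster and a breaking ball.
clusters = [[(92, -5, -15), (94, -7, -13)], [(78, 4, -35)]]
offsets = relative_to_fastball(clusters)
# offsets[0] is (0.0, 0.0, 0.0); offsets[1] is (-15.0, 10.0, -21.0)
```

Because the offsets are measured from each pitcher's own fastball, a soft tosser and a flamethrower end up in the same coordinate system before the second round of clustering.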

Sadly, this algorithm is far from perfect and needs some human intervention. I have to hand edit about 40 pitchers who might have a splitter that looks like a sinker to the algorithm or a slider that looks like a cutter and so on. I have tried to check other references to make sure I have the right pitches for each pitcher but for many pitchers who have just thrown a few pitches in the big leagues this is particularly hard. If you are browsing the player cards and find something you think I got wrong please leave a comment below.

Explanation of the correction code

This post is way overdue but finally here is a detailed explanation of the correction code for the PITCHf/x data. As we have seen in previous posts, the PITCHf/x data needs some serious corrections. This is going to be a pretty hard core post so feel free to skip it if you aren't interested in the method or how to correct the data. I am going to describe the process for one variable, the initial position of the ball in the vertical direction, or z0. After that I will discuss alterations for other variables.

Once I have all the data read in and all initial positions are moved back to 55 feet from home plate I am ready to correct the data from park to park. What we would really like to do is first calculate a league average and then calculate how each park varies from that. But because of the nature of the data this is impossible. For instance, if a home team has a very short pitching staff, that park is going to have a low average z0 if we simply average all the pitches thrown in the park. Having a park average for each park is essential for the league average calculation, so we must do something else.

What I have come up with is this: instead of calculating an average, I calculate the difference between two parks based on the pitchers common to both. I first calculate a mean and a variance for z0 for each pitcher in each park he has pitched in. I then take every pitcher who has thrown a tracked pitch in both park A and park B and calculate the difference between his two means. I also carry out a similar trick, adding the two errors in quadrature, to find the error on this difference. So, if a pitcher had a mean of 6 feet in park A and a mean of 6.5 feet in park B, then his difference would be -0.5 feet. Once I have done this for every pitcher who has thrown in the two parks I can combine the differences. But because some pitchers contributed a lot of pitches in both parks and some just a few, I actually find a weighted average. This is where the error on each pitcher's difference comes in. If a pitcher threw just a few pitches in both parks he is going to have a very large error and won't count as much toward the weighted mean.
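As a sketch of the two-park step, the function below takes per-pitcher statistics for park A and park B and returns the weighted difference. The details are assumptions for illustration: the error on each pitcher's mean is taken as variance divided by number of pitches, and the weights are the usual inverse squared errors. The name `park_difference` and the input layout are made up.

```python
def park_difference(stats_a, stats_b):
    """Weighted mean of per-pitcher (park A - park B) differences.

    stats_a and stats_b map pitcher -> (mean, variance, n_pitches).
    Returns (difference, error), or None if no pitcher threw in both parks.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pitcher in stats_a.keys() & stats_b.keys():
        mean_a, var_a, n_a = stats_a[pitcher]
        mean_b, var_b, n_b = stats_b[pitcher]
        diff = mean_a - mean_b
        # Squared error on the difference: the two squared errors added together.
        err2 = var_a / n_a + var_b / n_b
        weight = 1.0 / err2
        weighted_sum += weight * diff
        weight_total += weight
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total, (1.0 / weight_total) ** 0.5

# Made-up numbers: p1 threw 100 pitches in each park, p2 only 4.
park_a = {"p1": (6.0, 0.04, 100), "p2": (5.5, 0.04, 4)}
park_b = {"p1": (6.5, 0.04, 100), "p2": (5.9, 0.04, 4), "p3": (6.2, 0.04, 50)}
result = park_difference(park_a, park_b)
```

With these numbers the result sits very close to p1's difference of -0.5 feet, because p2's handful of pitches carries little weight.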

So this should give me a park difference between every pair of parks. The problem is there are many park combinations where no pitcher threw in both parks while PITCHf/x was tracking. To solve this problem I carry out the above procedure to higher orders by adding intermediary parks. So instead of only going straight from park A to park B, I also use pitchers who threw in park A and park C and pitchers who threw in park C and park B. Because park C has been added we have two sets of errors, which again need to be combined in quadrature. That means this measurement will be less accurate than going directly from park A to park B, but it is the only solution for parks with no common pitchers. In fact, I carry this procedure out to 4th order to get the best possible results. I could go further, but I have found that 5th order and beyond change the numbers by less than half a percent. Needless to say, this takes a long time. Hours, in fact, on my desktop. But the result is that I now have a difference between every pair of parks. From now on I will call the difference between park A and park B D(A)(B).
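A second-order step through one intermediary park might look like the sketch below. It relies on the identity D(A)(B) = D(A)(C) + D(C)(B) and combines the two errors in quadrature. How the direct and indirect estimates get merged is not spelled out above, so the inverse-variance weighting in `combine` is an assumption, and both function names are made up.

```python
import math

def chain(d_ac, d_cb):
    """Estimate D(A)(B) through park C from (difference, error) pairs."""
    (diff_ac, err_ac), (diff_cb, err_cb) = d_ac, d_cb
    # Differences add; errors combine in quadrature.
    return diff_ac + diff_cb, math.hypot(err_ac, err_cb)

def combine(estimates):
    """Inverse-variance weighted mean of several (difference, error) estimates."""
    weights = [1.0 / err ** 2 for _, err in estimates]
    total = sum(weights)
    diff = sum(w * d for w, (d, _) in zip(weights, estimates)) / total
    return diff, math.sqrt(1.0 / total)

# Made-up values: A to C is -0.3 +/- 0.03 feet, C to B is -0.2 +/- 0.04 feet.
via_c = chain((-0.3, 0.03), (-0.2, 0.04))
# Merge with a made-up direct A to B estimate of -0.4 +/- 0.05 feet.
merged = combine([via_c, (-0.4, 0.05)])
```

Going to 3rd or 4th order just means calling `chain` repeatedly along longer paths, which is why the run time grows so quickly.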

I now have all of the differences but this doesn't get me any closer to the league average. So we will now apply a nifty statistics trick. While I would really like to find the league average, I don't actually need it. What I really need is the difference between each park and the league average. I will denote park A's average as PA. Again, I can't actually find this number but we will need it in the calculation of the difference between each park and league average. Here is how we are going to find that.

By definition, the league average is the sum of the park averages divided by the number of parks. Multiplying each side of that equation by the number of parks, we get:

P1+P2+P3+...+P28+P29 = LgAve * 29

Note we are using 29 here because the system was never turned on in Baltimore. Also, the numbers 1 through 29 are just placeholders for each of the parks. If we want to now find the difference between park 1 and league average we can start by adding P1-P2 to both sides.

2*P1+P3+...+P28+P29 = LgAve * 29 + P1 - P2

We have got P2 out of the left side, which is good, but now it is on the right side, which is bad. The good news is we know what P1 - P2 is: it is D(1)(2), which we have already measured. In fact, I can now add P1 - P3 and P1 - P4 and so on to each side and then replace each difference on the right side with the corresponding D until I get:

29 * P1 = LgAve * 29 +D(1)(2) + D(1)(3) + ... + D(1)(28) + D(1)(29)

Moving the LgAve to the left side and dividing by 29 we get:

P1 - LgAve = (D(1)(2) + D(1)(3) + ... + D(1)(28) + D(1)(29))/29

The left side is exactly what we want, the difference between one park and league average. The right side consists entirely of numbers which we have already calculated. So we can apply this method for each park and just like that we have the park corrections for the initial vertical release point.
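The last formula is easy to turn into code. This sketch assumes the pairwise differences are stored in a square table `d` where `d[i][j]` holds D(i)(j) = Pi - Pj; the table layout and the function name are made up for the example.

```python
def offsets_from_league_average(d):
    """Given d[i][j] = P_i - P_j, return P_i - LgAve for every park i."""
    n = len(d)
    return [sum(d[i][j] for j in range(n) if j != i) / n for i in range(n)]

# Check with three made-up parks whose true averages we pretend to know.
true_parks = [6.0, 6.3, 5.7]  # league average is 6.0
d = [[pi - pj for pj in true_parks] for pi in true_parks]
offsets = offsets_from_league_average(d)
# offsets comes out close to [0.0, 0.3, -0.3]
```

Note that the function never sees `true_parks`, only the differences, which is the whole point of the trick.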

Whew, we now need to do this method for each park for each variable. That is all the initial locations, the initial velocities, and the accelerations. The accelerations are a little bit complicated because they also are affected by the atmospheric conditions. For them I find the altitude and temperature of the game and find the air density. Because the ball is being manipulated by drag and spin (Magnus force) and both forces are proportional to air density I can multiply in the air density then run the correction code. This actually gives me the correction factor times the density but I can divide that out when I go to apply it.

Lastly, the z direction acceleration needs another trick. Gravity is also acting on the ball but it doesn't care about air density. So it must be subtracted first. Once the correction factor is found gravity can be added back in to find the true acceleration in the z direction.
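The gravity bookkeeping for the z direction can be sketched like this. It assumes the vertical acceleration is in ft/s^2 with up as positive, so gravity contributes -32.174. The `correct` argument is a stand-in for whatever park correction gets applied to the aerodynamic part; it is a placeholder for the example, not my real correction.

```python
G = 32.174  # gravitational acceleration in ft/s^2 (assumes az is in ft/s^2, up positive)

def correct_az(az_measured, correct):
    """Apply `correct` only to the aerodynamic part of the vertical acceleration."""
    az_aero = az_measured + G    # take gravity (which is -G) out first
    return correct(az_aero) - G  # correct, then put gravity back in

# Made-up example: a 10% scale-up of the aerodynamic part only.
az_true = correct_az(-20.0, lambda a: a * 1.1)
```

If the correction were applied to the full -20.0 instead, gravity would get scaled along with drag and spin, which is exactly what this trick avoids.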