We have a new #1 on our leaderboard – a competitor who surprisingly joined the platform just two years ago. Shubin Dai, better known as Bestfitting on Kaggle or Bingo by his friends, is a data scientist and engineering manager living in Changsha, China. He currently leads a company he founded that provides software solutions to banks. Outside of work, and off Kaggle, Dai’s an avid mountain biker and enjoys spending time in nature. Here’s Bestfitting:
I majored in computer science and have more than 10 years of experience in software development. For work, I currently lead a team that provides data processing and analysis solutions for banks.
Since college, I’ve been interested in using math to build programs that solve problems. I continually read all kinds of computer science books and papers, and I’m very lucky to have followed the progress made in machine learning and deep learning over the past decade.
As mentioned before, I’ve been reading a lot of books and papers about machine learning and deep learning, but I found it hard to apply the algorithms I learned to the small datasets that were readily available. So I found Kaggle a great platform, with all sorts of interesting datasets, kernels, and great discussions. I couldn’t wait to try something, and first entered the “Predicting Red Hat Business Value” competition.
Within the first week of a competition launch, I create a solution document which I follow and update as the competition continues. To do so, I first try to get an understanding of the data and the challenge at hand, then research similar Kaggle competitions and all related papers.
I choose algorithms case by case, but I prefer simple algorithms such as ridge regression when ensembling, and in deep learning competitions I always like starting from ResNet-50 or designing a similar structure.
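He doesn’t share code in the interview, but a minimal sketch of what ridge regression as an ensembling (stacking) step can look like is below. All data here is synthetic and the setup is hypothetical: two base models are simulated as noisy predictions of the target, and a ridge meta-learner learns blending weights over them.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy example: blend two base models' (out-of-fold) predictions with a
# ridge meta-learner -- a simple, common way to use ridge for ensembling.
rng = np.random.default_rng(0)
y = rng.normal(size=200)                      # synthetic true targets
pred_a = y + rng.normal(scale=0.3, size=200)  # base model A (less noisy)
pred_b = y + rng.normal(scale=0.5, size=200)  # base model B (noisier)

X = np.column_stack([pred_a, pred_b])
meta = Ridge(alpha=1.0).fit(X, y)             # learns the blending weights
blend = meta.predict(X)

def mse(p):
    return float(np.mean((p - y) ** 2))

# The blend fits y better than either base model alone.
print(mse(pred_a), mse(pred_b), mse(blend))
```

In a real competition the meta-learner would be fit on out-of-fold predictions to avoid leaking the training labels into the blend.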
I like PyTorch very much in computer vision competitions. I use TensorFlow or Keras in NLP or time-series competitions. I use seaborn and tools from the SciPy family when doing analysis. And scikit-learn and XGBoost are always good tools.
I try to tune parameters based on my understanding of the data and the theory behind an algorithm; I won’t feel safe if I can’t explain why a result is better or worse.
In a deep learning competition, I often search related papers and try to find what the authors did in a similar situation.
And I compare the results before and after making parameter changes: the prediction distribution, the examples affected, and so on.
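A sanity check of this kind can be sketched as follows, using `scipy.stats.ks_2samp` from the SciPy family he mentions. The prediction arrays here are simulated stand-ins for a model’s outputs before and after a parameter change; the threshold is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical: outputs of the same model on the same rows, before and
# after a parameter change (simulated here with synthetic data).
rng = np.random.default_rng(42)
preds_before = rng.normal(loc=0.50, scale=0.10, size=1000)
preds_after = preds_before + rng.normal(loc=0.02, scale=0.03, size=1000)

# Did the overall prediction distribution shift?
stat, pvalue = ks_2samp(preds_before, preds_after)

# Which examples were materially affected? (0.05 is an arbitrary threshold)
affected = np.mean(np.abs(preds_after - preds_before) > 0.05)

print(f"KS statistic: {stat:.3f}, fraction affected: {affected:.3f}")
```

Plotting both distributions (e.g. with seaborn) alongside these numbers makes it easier to explain *why* a change helped or hurt.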
A good CV is half of success. I won’t go to the next step if I can’t find a good way to evaluate my model.
To build a stable CV, you must have a good understanding of the data and the challenges it poses. I also check that the validation set has a distribution similar to the training and test sets, and I try to make sure my models improve both on my local CV and on the public LB.
In some time series competitions, I set aside data for a period of time as a validation set.
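A time-based holdout of this kind can be sketched as below. The data and the 28-day holdout window are hypothetical; the point is simply that the validation rows come strictly after the training rows, mimicking the gap between the training and test periods.

```python
import numpy as np
import pandas as pd

# Hypothetical daily time series (180 days of synthetic data).
dates = pd.date_range("2024-01-01", periods=180, freq="D")
df = pd.DataFrame({"date": dates, "y": np.sin(np.arange(180) / 7.0)})

# Hold out the final 28 days as the validation set.
cutoff = df["date"].max() - pd.Timedelta(days=28)
train = df[df["date"] <= cutoff]
valid = df[df["date"] > cutoff]

print(len(train), len(valid))  # → 152 28
```

Unlike a random split, this never lets the model "see the future", so the local score is a more honest estimate of test performance.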
I often choose my final submissions conservatively: I always pick a weighted average ensemble of my safe models plus one relatively risky model (in my opinion, more parameters mean more risk). But I never choose a submission I can’t explain, even one with a high public LB score.
Good CV, learning from other competitions and reading related papers, discipline and mental toughness.
Nature protection and medical-related competitions are my favorites. I feel I should, and perhaps can, do something to make our lives and our planet better.
I am interested in all kinds of progress in deep learning. I want to use deep learning to solve problems beyond computer vision and NLP, so I try to apply it in the competitions I enter and in my regular occupation.
To be frank, I don’t think we benefit much from domain expertise, for the following reasons:
But, there are some exceptions. For example, in the Planet Amazon competition, I did get ideas from my personal rainforest experiences, but those experiences might not technically be called domain expertise.
I think it’s preparing the solution document at the very beginning. I force myself to make a list that includes the challenges we face, the solutions and papers I should read, possible risks, possible CV strategies, possible data augmentations, and ways to add model diversity. And I keep updating the document. Fortunately, most of these documents turned out to be the winning solutions I provided to the competition hosts.
We try to use machine learning on all kinds of problems in banking: predicting visitor traffic at bank outlets, predicting the cash we should prepare for ATMs, product recommendation, operational risk control, etc.
Competing on Kaggle has also changed the way I work. When I want to solve a problem, I look for similar Kaggle competitions, as they are precious resources, and I suggest my colleagues study similar winning solutions so that we can glean ideas from them.
Here are my opinions:
Interesting competitions and great competitors on Kaggle make me better.
With so many great competitors here, winning a competition is very difficult; they pushed me to my limit. Last year I tried to finish competitions solo as often as possible, which meant I had to anticipate what all the other competitors would do. To do this, I read a lot of material and built versatile models. I also read all the other competitors’ solutions after each competition.
I hope I can enter a deep reinforcement learning competition on Kaggle this year.
First of all, No. 1 is a measure of how much I’ve learned on Kaggle and how lucky I’ve been.
In my first several competitions, I tried to turn the theories I learned in recent years into skills, and learned a lot from others.
After I gained some understanding of Kaggle competitions, I began to think about how to compete in a systematic way, as I have many years of experience in software engineering.
About half a year later, I received my first prize, and with it some confidence. I thought I might become a grandmaster within a year. In the Planet Amazon competition, I was aiming for a gold medal, so it came as a surprise when I found out I was in first place.
Then I felt I should keep using the strategies and methods I mentioned before, and I had more success. After I won the Cdiscount competition, I climbed to the top of the user rankings.
I think I benefited from the Kaggle platform: I learned so much from others, and Kaggle’s ranking system also played an important role in my progress. I also feel lucky, as I never expected to win six prizes in a row; my goal in many competitions was just the top 10 or the top 1%. I don’t think I could replicate the journey.
However, I’m not here for a good rank. I always treat every competition as an opportunity to learn, so I try to select competitions in fields I’m not so familiar with, which forced me to read hundreds of papers last year.
I respect all the winners and the contributors of wonderful solutions; I know how much effort they put in, and I always read their solutions with admiration.
A few of the most memorable insights: PyTorch and 3D segmentation of medical images from Data Science Bowl 2017; the solutions from Web Traffic Time Series Forecasting, which used sequence models from NLP to solve a time-series problem; and the beautiful solutions from Tom (https://www.Kaggle.com/tvdwiele) and Heng (https://www.Kaggle.com/hengck23).