Thursday, January 16, 2014

What does it take to be a Data Scientist?

I was looking at some job ads for Data Scientists and found the desired skills/qualifications sections quite interesting. The following list gives, for each company, the educational qualification (Edu), statistical/data mining skills (St/DM), and software knowledge (SW) required for a Data Scientist.

SAS and R seem to be very popular. Samsung wants a PhD only. And the ad for CITI looks really amusing - I think they are looking for an army, not a person :)

Edu - Bachelor’s degree in Computer Science, Statistics, Mathematics or a related field. Master’s degree preferred

St/DM - Prior experience with AB testing (or other experimental design), statistical data analysis, model creation and refinement on big data sets

SW - Working knowledge of SQL (Hadoop/Hive preferred), SAS, R, Python, or Java

Edu - PhD/Post Doc from a renowned institution in any advanced quantitative modeling oriented discipline including but not limited to Machine learning, Statistics, Marketing Science, Operations Research, Econometrics, Stochastic Finance, Distributed and parallel computing, Digital media analytics, etc.

St/DM - 
1. Advanced statistical methods including complex multi-variate statistical methods, discrete choice modeling, conjoint based analysis  
2. Machine learning including Bayesian methods, reinforcement learning, Neural networks, Support vector machines, Hidden Markov Models, relevance vector machines, Probabilistic/ Evidential Reasoning
3. Operations Research (Queuing, Markov Models, DEA, Integer Programming, Dynamic programming, Stochastic Programming, Game theory)
4. Macroeconomic modeling, Leading indicator analysis, Long term and near term Forecasting, Time series based methods, Bayesian multivariate regression methods, ARCH/GARCH/VAR models and other advanced regression methods, Mathematical economics, System dynamics, Stochastic control, Nonlinear dynamic models, etc. Prior practical industrial scale modeling exposure is a must.
5. Advanced quantitative methods relevant to modeling consumer experience in the digital world. Experience in web log mining for visitor segmentation, visitor behavior modeling, common path analysis, conversion analysis, abandonment analysis, promotion analytics, buzz analysis, sentiment analysis, social networking analysis, etc. is a must.
6. Latent class models, Multivariate logit/probit/tobit models, Multinomial logistic models, Marketing mix modeling, Hidden Markov models, Conjoint methods, Market research and optimization methods. Prior experience in customer mindset modeling, customer loyalty, customer choice, brand equity, advertisements/promotion mix, etc. is a must.
7. Parallelizing existing traditional or modern (machine learning) based algorithms, Randomized algorithms, Simulations and Simulation based methods including Markov Chain Monte Carlo, parallel and distributed simulations, next gen optimization methods, etc. Knowledge of Hadoop/grid based programming for large scale problem solving is a must.

SW - Proven ability in model building and application experience in data mining techniques and tools (SAS and/or other modeling packages like R, Matlab, Mathematica, ILOG, etc.)

Edu - Master’s degree, or Bachelor’s degree with 5 years’ work experience in business analytics, reporting, and process design

St/DM - Strong understanding and implementation of predictive / analytical modeling techniques, theories, principles, and practices. Specific experience in more than one of: statistical modeling, machine learning, and text mining techniques

SW - Excellent knowledge of data mining / predictive modeling tools such as SAS, R, or SPSS

Edu - M.S. or Ph.D. in a relevant technical field, or 4+ years’ experience in a relevant role

St/DM - Extensive experience solving analytical problems using quantitative approaches

SW - 
1. Fluency with at least one scripting language such as Python or PHP 
2. Familiarity with relational databases and SQL
3. Expert knowledge of an analysis tool such as R, Matlab, or SAS
4. Experience working with large data sets, experience working with distributed computing tools a plus (Map/Reduce, Hadoop, Hive, etc.)

Edu - Master’s, PhD, or equivalent experience in a quantitative field (computer science, physics, mathematics, bioinformatics, etc.)

St/DM - Strong background in Machine Learning, Statistics, Information Retrieval, or Graph Analysis

SW - 
1. Some experience working with large datasets, preferably using tools like Hadoop, MapReduce, Pig, or Hive
2. Experience programming in an object-oriented language (Java, C++, etc.)
3. Knowledge of scripting languages like Ruby or Python, familiarity with web frameworks a plus
4. Comfortable with data analysis & visualization using tools like R, Matlab, or SciPy

Edu - Bachelor’s degree in Mathematics, Statistics or Computer Science with strong statistical background

St/DM - Demonstrated statistical analysis skills in a business environment

SW - Knowledge of structured (SQL) and unstructured (log files) databases

Edu - PhD in a quantitative discipline (Applied Mathematics, Statistics, Computer Science, Operations Research, or related field)

St/DM - Experience utilizing both qualitative analysis (e.g. content analysis, phenomenology, hypothesis testing) and quantitative analysis techniques (e.g. clustering, regression, pattern recognition, descriptive and inferential statistics)

SW - R preferred

Edu - PhD Computer Science Candidate Only

St/DM - Strong background in machine learning

SW - Strong programming experience in Java/C++

Edu - Bachelor’s degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology, or related field and 10 years’ experience in an analytics related field OR Master’s degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology, or related field and 8 years’ experience in an analytics related field

St/DM - Certificate in business analytics, data mining, or statistical analysis

SW -
1. 7+ years’ experience with statistical programming languages (for example, SAS, R)
2. 7+ years’ experience with SQL and relational databases (for example, DB2, Oracle, SQL Server)