My choice for OS in data science

[Image: comic posted by Duncan Hull]

Just sharing a nice comic posted by Duncan Hull. In my opinion this comic accurately reflects the culture of different user groups!

We all know that in many real-life situations there can be many solutions to a single problem. This is also true for data scientists. Many junior or aspiring data scientists wonder what kind of equipment, OS or language they should use at work. In this blog post I will share my thoughts on which OS I would pick.

Mac vs Windows

Usually the first question that comes up is “Should I get a Mac or a Windows machine?”. Personally I use an 11” MacBook Air (2011). This is because I have borrowed it for free! Jokes aside, it is also because I mainly use Python (more on that in later posts) and in my experience it is easier to use Python on a Mac than on Windows. For example, a data scientist very commonly needs to install various packages to complete specific tasks. On a Mac this is easily done by typing “pip install your_package” in the terminal. On Windows you would have to go through a series of steps to get “pip” working in the command line. These little inconveniences add up quickly.
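To make the pip step a bit more concrete, here is a minimal Python sketch of how one might check whether pip is usable from the current interpreter and build the install command. The helper names pip_available and pip_install_command are my own, made up for illustration, and your_package is just a placeholder. Running pip as "python -m pip" is one common way to make sure the package lands in the same Python you are actually using, which can help on any OS.

```python
import importlib.util
import sys


def pip_available():
    """Return True if the running interpreter can find the pip module."""
    return importlib.util.find_spec("pip") is not None


def pip_install_command(package):
    """Build a pip install command tied to the current interpreter.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    print("pip available:", pip_available())
    # "your_package" is a placeholder, as in the text above.
    print("command:", " ".join(pip_install_command("your_package")))
```

If pip is reported as unavailable, that is usually the point where the extra setup steps begin.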

This is not to say that you cannot achieve great things on a Windows machine. For example, my friend Greg currently uses a Windows laptop and produces a lot of fascinating work! Most offices (at least in the corporate world) also use Windows, so being able to conduct analysis efficiently on a Windows platform is a great asset.

You may ask: “So Ricky, it seems like you prefer a Mac. Are you going to get a new Mac for data science?” My answer is “probably…not!”. This is because computing power is important and it doesn’t come cheap with a Mac. After spending a good amount of time with this MacBook Air (1.6GHz dual core, 2GB RAM), I have come to realize that although it does a fantastic job in daily use such as word processing and web browsing (without many tabs), it is much less capable in other areas, such as running a Random Forest classifier with 10-fold cross-validation. The small amount of RAM means you probably cannot run many applications at once or load a big-ish dataset into memory, which can be inconvenient. It is also difficult to use a virtual machine to create isolated environments.
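As a rough back-of-the-envelope check on the RAM point, the sketch below estimates how much memory a purely numeric table would need if every value were stored as an 8-byte float. The dataset sizes are hypothetical, and the 50% headroom factor is my own assumption to leave room for the OS and for the copies that analysis tools tend to make. Real memory use depends on the data types and the library, so treat this as an approximation only.

```python
BYTES_PER_FLOAT64 = 8   # one double-precision number
GIB = 1024 ** 3         # bytes in one gibibyte


def estimated_table_gib(n_rows, n_cols, bytes_per_value=BYTES_PER_FLOAT64):
    """Approximate in-memory size of a dense numeric table, in GiB."""
    return n_rows * n_cols * bytes_per_value / GIB


def fits_in_ram(n_rows, n_cols, ram_gib, headroom=0.5):
    """Return True if the table fits in the assumed usable share of RAM.

    headroom is an assumed fraction of RAM available for the data.
    """
    return estimated_table_gib(n_rows, n_cols) <= ram_gib * headroom


if __name__ == "__main__":
    # Hypothetical datasets, checked against the 2GB machine from the text.
    for rows, cols in [(1_000_000, 20), (10_000_000, 50)]:
        size = estimated_table_gib(rows, cols)
        print(rows, "x", cols, "->", round(size, 2), "GiB,",
              "fits" if fits_in_ram(rows, cols, ram_gib=2) else "does not fit")
```

Under these assumptions a table of a million rows by 20 columns is comfortable on 2GB, while ten million rows by 50 columns is well out of reach.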

Linux

Windows is inconvenient and Mac is expensive so what can we do? Luckily there is a third option – Linux.

Linux is a Unix-like OS, and Mac OS is also built on UNIX. As a result the two share many similarities. This matters because it means obtaining and using data science tools is about as easy on Linux as it is on a Mac!
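One small way to see this shared heritage from Python itself is the sketch below, which uses the standard library's platform.system() and os.name. The is_unix_like helper is a hypothetical name of my own. Mac OS reports itself as "Darwin", and both it and Linux present the same "posix" interface to Python, whereas Windows reports "nt".

```python
import os
import platform

# Names that platform.system() reports on the Unix-like systems discussed here.
UNIX_LIKE = {"Darwin", "Linux"}  # "Darwin" is what Mac OS reports


def is_unix_like(system_name):
    """Return True for the Unix-like systems discussed in this post."""
    return system_name in UNIX_LIKE


if __name__ == "__main__":
    current = platform.system()
    print("system:", current)
    print("unix-like:", is_unix_like(current))
    print("os.name:", os.name)  # "posix" on Mac and Linux, "nt" on Windows
```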

Unlike Mac or Windows, which each have only one current version of the OS distributed at a time (i.e. Windows 10 and OS X El Capitan), Linux comes in many different distributions (e.g. Ubuntu, Debian). Some are light and efficient while others have more features to provide a better everyday experience. Hence Linux can be a versatile option for data scientists and developers.

 

What’s more? It’s completely FREE.

 

One downside of Linux is that it has no official customer service, which can be frustrating at times. However, with a problem-solving mindset and the support of a large community of developers and engineers, a data scientist should be able to overcome most difficulties.

So what will my next laptop be? Most likely a powerful Windows machine with Linux installed for data science work.

Pros

Windows
· Highly available
· Most people have experience with it
· Good hardware specification for the price

Mac
· Efficient OS
· Easy to obtain and integrate data science tools
· Large and knowledgeable community
· Good customer service

Linux
· Free
· Easy to obtain and integrate data science tools
· Large and knowledgeable community
· Can be installed on a Windows machine to take advantage of good hardware specification

Cons

Windows
· Inefficient OS
· Difficult to obtain and integrate data science tools

Mac
· Average hardware specification for a high price

Linux
· No customer service, so troubleshooting can be challenging
· Could be viewed as a geek

Although I have shared my thoughts and preferences in this blog post, I recommend that everyone make use of what they already have at hand! As data scientists we are problem solvers. Troubleshooting and finding out how an OS interacts with you and your tools can be very valuable experience! It is better to work until we hit a bottleneck and only then consider upgrading our machines than to shell out a large amount of cash just to find out later that it’s overkill.

 

Hope this helps!

2 thoughts on “My choice for OS in data science”

  1. Good summary Ricky! Personally I chose Windows primarily because Mac is so overpriced… for the same price, a PC will have twice as much RAM/processing power. Now I also run Oracle VirtualBox with Ubuntu, as for some tasks Linux is so much better… for instance, the Git integration with Linux is much deeper than with Windows, and using it on the latter sometimes leads to unexpected results, to say the least.
