Why becoming a Data Scientist is NOT actually easier than you think.

Most data scientists, and the companies that employ them, are not using Matlab/Octave. They have backend web services written in Java, Python, Scala, or Ruby, and these languages are not covered in the course. Python has libraries like SciPy, NumPy, and scikit-learn that are great for solving numerical problems. Java has a number of libraries too, such as the Mahout math library. R is widely used by statisticians (again, not covered in the course).

When your boss (or a customer) comes to you and says you need to integrate an algorithm into a pre-existing web service (for example, they need a recommendation engine), and you say "I only know Matlab", that is going to be a huge problem. You don't just pick up Java, Python, C++, Scala, or whatever else in a few days on the job. You have to be somewhat familiar with these languages to understand large, pre-existing codebases. It wouldn't hurt to have a decent understanding of existing technologies like Django, Ruby on Rails, Groovy, Lift, etc., because you're going to have to integrate your amazing algorithm into one of them. If you only know Python but the rest of the company is using Java, you had better know about Thrift, Avro, Google Protocol Buffers, or something similar for passing data between the two.

So what skills do you need in order to become a data scientist?

You Need to Be Able to Program

Sure, some companies might use a software package like Matlab or Octave, but the majority of companies instead have their own data analysis software programmed in Java, Python, Scala, or Ruby. In order to be able to use these correctly, you really should know how to program in them.

After all, when a client or your boss wants you to adjust a program or integrate a new algorithm into it and you don’t have the know-how to do so, you’re not going to look very good. In fact, at that moment you’re failing as a data scientist. Not because you don’t know the numbers, but because you can’t apply them.

Think you’ll pick these languages up on the job? It’s possible, but unless you’re a genius, it’s not likely. In fact, you should probably start learning them now, because when they give you a program that is ten thousand lines long and expect you to modify it, it’s probably already too late to learn what you need to.

Big Data Software

The current boom in data is hugely beneficial for finding out what is going on, why people do what they do, and how to either make money from that or deliver better services. That said, it does mean that data scientists suddenly have to work with an entirely new realm of software packages and computers.

The standard desktops you are familiar with generally won’t be able to deal with the data or the software packages that are out there. So you’ll have to learn how to do distributed processing, and that means understanding MapReduce and distributed file systems and being able to use Hadoop. Don’t know what that is? Then you better find out!
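To make the idea a little more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, using word counting as the task. It only illustrates the programming model; it is not how Hadoop is actually invoked, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key.

    In a real cluster the framework does this between the map and
    reduce phases; here it is a plain in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

The point of the split is that each map call looks at one line and each reduce call looks at one key, so in principle both can be spread across many machines.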

Data Cleaning

Then there’s data cleaning. You see, when you’re following a course to learn how to become a data scientist, they’ll generally provide you with the data package right there, scrubbed of strange anomalies and weird entries.

That’s not how the real world works. There, the data is ugly. People and machines have done strange things to it. For that reason, you’ve got to know how to clean up your data, what you can eliminate without creating a problem, and when you’re going to be corrupting your data simply by erasing an entry or altering it.

Of course, you’ll also need to know how to tweak the data. And that means learning yet more, as that’s often best done in UNIX. Are you familiar with it? Have you worked with commands like sed, grep, tr, cut, sort, and awk, and with map/reduce-style processing? If not, add that to your list of things that you really want to have mastered before you actually get a data analysis job.
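As a rough illustration of the kind of decisions involved, here is a small Python sketch that trims whitespace, normalizes a field, drops exact duplicates, and sets suspicious rows aside instead of silently deleting them. The sample data, the column names, and the 0 to 120 age range are all invented for this example; real rules depend on your dataset.

```python
import csv
import io

# Invented sample data with typical problems: a missing value, a negative
# age, an exact duplicate row, a non-numeric age, and inconsistent casing.
RAW = """user_id,age,country
1,34,US
2,,DE
3,-5,us
3,-5,us
4,abc,FR
5,29, gb
"""


def clean_rows(text):
    """Return (kept, rejected) lists of rows parsed from CSV text."""
    kept, rejected, seen = [], [], set()
    for row in csv.DictReader(io.StringIO(text)):
        # Normalize: strip stray whitespace and upper-case country codes.
        row = {key: value.strip() for key, value in row.items()}
        row["country"] = row["country"].upper()

        # Drop exact duplicates (after normalization).
        fingerprint = tuple(row.values())
        if fingerprint in seen:
            continue
        seen.add(fingerprint)

        # Keep rows with a plausible age; set the rest aside for review
        # rather than erasing them.
        if not row["age"].isdigit() or not 0 < int(row["age"]) < 120:
            rejected.append(row)
            continue
        row["age"] = int(row["age"])
        kept.append(row)
    return kept, rejected


if __name__ == "__main__":
    kept, rejected = clean_rows(RAW)
    print("kept:", kept)
    print("rejected:", rejected)
```

Keeping the rejected rows around is a deliberate choice: it lets you check later whether a rule was throwing away entries that were actually fine.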

Probability and statistics

What is a p-value? Is your feature dependent or independent? What are your confidence intervals and how do you set them up? Can you do an F test? What is your standard error? What is the difference between mediators and moderators? How do you set up your hypotheses beforehand and how do you test them correctly?
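A few of these questions can be made concrete with a short sketch that uses only the Python standard library. It computes a standard error, an approximate 95% confidence interval for a mean, and a p-value from a permutation test on the difference between two group means. Note the assumptions: the interval uses a normal approximation (with small samples a t-distribution would usually be more appropriate), the permutation test is just one of several ways to get a p-value, and the sample numbers are invented.

```python
import math
import random
import statistics


def standard_error(sample):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def confidence_interval(sample, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    mean = statistics.mean(sample)
    half_width = z * standard_error(sample)
    return mean - half_width, mean + half_width


def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Estimates how often a random relabeling of the pooled data gives a
    difference at least as large as the one observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(sum(left) / len(left) - sum(right) / len(right)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]  # invented measurements
    group_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]  # invented measurements
    print("standard error:", standard_error(group_a))
    print("95% CI:", confidence_interval(group_a))
    print("p-value:", permutation_p_value(group_a, group_b))
```

A small p-value here says only that a difference this large would be unusual if the group labels did not matter; it does not say how large or how important the difference is.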

Statistics is not easy and – at least initially – not a hell of a lot of fun. (It actually becomes incredibly interesting once you understand it, but you’ll probably have to take my word for it).

And yet it is absolutely vital if you want to be a data scientist, because this is how you get a big-picture understanding of a dataset and know whether you are running each test the way it is meant to be run.

Why should you do that? Because otherwise other data scientists are going to poke holes in every theory you’ve come up with and in every way you’ve tested it.

Don't Let Me Discourage You

Everything I’ve said here is not to discourage you. It is just to make you aware of what you’ll actually have to know. You see, most article writers aren’t writing that article to really educate you. Oh sure, they’ll take that as a bonus, but what they’re really trying to do is attract readers.

And you don’t attract readers by saying something is hard. For example, how many articles say that it is really difficult to develop leadership skills? Sure, they’re out there, but they’re very rarely popular. Instead, it’s the ones that make it seem easy that do well.

Don’t let that fool you. If it were easy, everybody would be doing it and nobody would be making any money.

Nothing in life that is worth doing is done easily. That’s what makes it worth doing. So, if you’re going to become a data scientist, more power to you! I’m happy to hear it. Just be ready to roll up your sleeves and really learn everything you need to know. Then after that, you can join the chorus of people saying how easy it is, even while you know different.

If you follow this algorithm, it will not be that difficult:

Step 1: Understand Data

Step 2: Apply some tests to that data

Step 3: Check Results

Step 4: Go to Step 1

Learn more: https://www.dhirubhai.net/in/pratyushachakraborty/


