Machine Learning 101 - Opening

Since I have been focusing on data science to get comfortable with the computational side of biotechnology, I will start learning machine learning with Python's scikit-learn library. Today I will share some of what I know about machine learning, along with some new things I have just learned from the Udacity course Intro to Machine Learning.

According to Andrew Ng, machine learning is the science of getting computers to act without being explicitly programmed. As the name suggests, the machine learns by examining and analysing data that has already been handed to it. Teaching a machine to learn or decide by itself is a long process.

There are dozens of cool examples of machine learning in use. You are probably well aware of the face-tagging mechanism on Facebook and on our smartphones; both use machine learning to recognize faces. Self-driving cars, some robotics projects, and search engines like Google rely on it too. Nowadays we are surrounded by machine learning technologies.

With this post I officially start learning machine learning and sharing what I learn. Join me on this enjoyable journey. Share your knowledge, ask questions, make contributions. See you soon…

Python Machine Learning Books

Python is a very popular language for machine learning. The machine learning libraries and frameworks in Python (especially around the SciPy stack) are maturing quickly. They may not be as feature rich as R, but they are robust enough for small to medium scale production implementation.

If you are a Python programmer looking to get into machine learning or you are generally interested to get into machine learning via Python, then I want to use this post to point out some key books you might find useful on your journey.

This is by no means a complete list of books, but I think they are the pick of the books you should look at if you are interested in machine learning in Python.

Machine Learning in Python

Building Machine Learning Systems with Python (2013): Master the art of machine learning with Python and build effective machine learning systems with this intensive hands-on guide.

Learning scikit-learn: Machine Learning in Python (2013): Experience the benefits of machine learning techniques by applying them to real-world problems using Python and the open source scikit-learn library.

Machine Learning in Action (2012): Machine Learning in Action is a unique book that blends the foundational theories of machine learning with the practical realities of building tools for everyday data analysis. You’ll use the flexible Python programming language to build programs that implement algorithms for data classification, forecasting, recommendations, and higher-level features like summarization and simplification.

Programming Collective Intelligence: Building Smart Web 2.0 Applications (2007): This fascinating book demonstrates how you can build Web 2.0 applications to mine the enormous amount of data created by people on the Internet.

Machine Learning: An Algorithmic Perspective (2011): The field is ready for a text that not only demonstrates how to use the algorithms that make up machine learning methods, but also provides the background needed to understand how and why these algorithms work. Machine Learning: An Algorithmic Perspective is that text.

Specialty Machine Learning in Python

Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More (2013): You’ll learn how to acquire, analyze, and summarize data from all corners of the social web, including Facebook, Twitter, LinkedIn, Google+, GitHub, email, websites, and blogs.

Natural Language Processing with Python (2009): This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation.

Programming Computer Vision with Python: Tools and algorithms for analyzing images (2012): If you want a basic understanding of computer vision’s underlying theory and algorithms, this hands-on introduction is the ideal place to start. You’ll learn techniques for object recognition, 3D reconstruction, stereo imaging, augmented reality, and other computer vision applications as you follow clear examples written in Python.

Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2012): It is also a practical, modern introduction to scientific computing in Python, tailored for data-intensive applications. This is a book about the parts of the Python language and libraries you’ll need to effectively solve a broad set of data analysis problems. This book is not an exposition on analytical methods using Python as the implementation language.

Things that I hate to Teach in Python

I am currently teaching a Python class for total beginners. My students are not only new to Python; most of them are new to programming altogether. I chose to teach Python because I consider it the best programming language for beginners. Throughout the class, from which we took a two-week break, I covered Python's basic data structures, functions, and control statements. Although these are complicated concepts in the computational world, I think Python makes them easier to understand. What follows is a blog post from Python for Biologists, whose author describes almost every difficulty I am dealing with.

I realized that, while I’ve spent a lot of time talking about why Python is a great language, I have a number of pet peeves that I’ve never written down. I’m not talking about the usual problems, like Python’s relative lack of performance or lack of compile-time type checking – these things are deliberate design trade-offs and changing them would involve making Python not-Python. I’m talking about the small things that cause friction, especially in a teaching environment.

Note: I realize that there are good reasons for all these things to be the way they are, so don’t take this too seriously….

1. Floating point vs. integer division

Anyone who’s written in Python for any length of time probably types this line automatically without really thinking about it:

from __future__ import division

but take a moment to consider how you would explain what’s going on in this piece of code to a beginner. In order to really understand what’s happening here, you have to know about:

  • Python’s system for importing modules
  • Python’s system for grouping modules into packages
  • the fact that there are different versions of Python with slightly different behavior
  • the difference between floating-point and integer numbers
  • the mechanisms of operator overloading, whereby we can define the behavior of things like + and / for different types
  • the concept of polymorphic functions and operators, which allow us to treat different classes the same, some of the time

Explaining all this to someone who has never written a line of code before is unlikely to be productive, but none of the alternatives are particularly attractive either. We can just present this as a magic piece of code and save the explanation for later (this is normally what I do). We can instruct students to use explicit floating point numbers:

answer = float(4)/3
answer = 4.0/3

However, eventually they will forget, use integers, and find that it works only some of the time. We can carefully craft our examples and exercises to avoid the need for floating point division, but this is setting students up for pain further down the line. We can use the command-line argument -Q to force floating-point division, or just use Python 3 for teaching, but both of these options will cause confusion once the student goes back to their own environment.
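As a minimal sketch of the difference being described, assuming the code runs on Python 3 (where `/` always performs true division), the two operators can be compared side by side. The variable names are made up for illustration.

```python
# On Python 3, "/" is true division: the result is a float even when both
# operands are integers. On Python 2 the same behavior requires
# "from __future__ import division" at the very top of the file.
true_quotient = 4 / 3

# "//" is floor division: it rounds down regardless of how "/" behaves.
floor_quotient = 4 // 3

# Floor division with a float operand still rounds down, but returns a float.
float_floor = 4.0 // 3

print(true_quotient, floor_quotient, float_floor)
```

Using `//` whenever integer division is really intended keeps the code's meaning the same across Python versions.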

2. split() vs. join()

“OK class, this is how we take a string and split it up into a list of strings using a fixed delimiter:”

sentence = "The all-England summarize Proust competition"
words = sentence.split(" ")

“So I guess, logically, to put the words back together again we just say:

sentence = words.join(" ")

right? Look at that elegant symmetry…… Wait a minute, you’re telling me it doesn’t work like that? The list and the delimiter actually go the other way around, so that we have to write this ugly line?

sentence = " ".join(words)

Wow, that just looks wrong.”

Yes, I know that there are good reasons for collection classes to only have methods that are type-agnostic, but would it really be so bad to just str() everything?
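To make the point about `str()` concrete, here is a small sketch of what happens when the list holds something other than strings. The `scores` list and the other variable names are invented for this illustration.

```python
words = ["The", "all-England", "summarize", "Proust", "competition"]

# join() is a method of the delimiter string and accepts any iterable of strings.
sentence = " ".join(words)

# Non-string items are not converted automatically: join() raises TypeError.
scores = [3, 1, 4]
try:
    " ".join(scores)
    join_failed = False
except TypeError:
    join_failed = True

# Converting each item explicitly with str() is the usual workaround.
score_line = " ".join(str(score) for score in scores)

print(sentence)
print(join_failed, score_line)
```

Because `join()` lives on the delimiter, it works with lists, tuples, generators, and any other iterable of strings, which is the type-agnostic design mentioned above.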

3. Exhaustible files

It’s perfectly logical that you shouldn’t be able to iterate through a file object twice without re-opening it... once you know a fair bit about how iteration is actually implemented in Python. For a beginner, though, it’s a bit like Python is giving with one hand and taking away with the other – you can use an opened file object just like a list, except in this one specific but very important way:

my_list = [1, 2, 3, 4]
for number in my_list:
    do_something(number)
# second loop works just as you'd expect
for number in my_list:
    do_something_else(number)

my_file = open("some.input")
for line in my_file:
    do_something(line)
# second loop silently never runs
for line in my_file:
    do_something_else(line)

This problem also rears its ugly head when students try to iterate over a file having already consumed its contents using read():

my_file = open("some.input")
my_contents = my_file.read()
....
# this loop silently never runs
for line in my_file:
    do_something(line)

That second line can be difficult to spot for student and teacher alike when there are many intervening lines between it and the loop.
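Two common ways around the exhausted-file problem are sketched below: rewinding the file with `seek(0)`, or reading the lines into a list once. The temporary file and the variable names are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Create a small throwaway input file so the example can run anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".input", delete=False) as handle:
    handle.write("first\nsecond\nthird\n")
    path = handle.name

with open(path) as my_file:
    first_pass = [line.strip() for line in my_file]
    # The file position is now at the end, so a second pass yields nothing.
    second_pass = [line.strip() for line in my_file]
    # Rewinding to the start makes the file iterable again.
    my_file.seek(0)
    third_pass = [line.strip() for line in my_file]

# Alternative: read the lines into a list once, then loop as often as needed.
with open(path) as my_file:
    lines = my_file.read().splitlines()

os.remove(path)  # clean up the throwaway file
print(first_pass, second_pass, third_pass, lines)
```

Reading into a list is usually the easier habit to teach, at the cost of holding the whole file in memory.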

4. Lambda expressions

OK, this one is more annoying when writing code than when teaching it, since I rarely get round to talking about functional programming in introductory courses. I totally get why there should be a big, obvious flag when we are doing something clever (which lambda expressions generally are). Nevertheless, it seems a shame to have a style of coding that lends itself to elegant brevity marred by so many unnecessary keystrokes.

I think that the reason this bugs me so much is that I first got into functional programming by way of Groovy, which has (to me) a very pleasing syntax for anonymous functions (actually closures):

{x,y -> x**y}

compared to Python:

lambda x,y : x**y

Of course, Python lessens the sting of having to type lambda with its various comprehensions:

squares = map(lambda x : x**2, range(10))
squares = [x**2 for x in range(10)]

so I can’t complain too loudly.
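For comparison, here is a short sketch that puts a lambda, an equivalent named function, and the two ways of building the squares side by side. It assumes Python 3, where `map()` returns a lazy iterator rather than a list; the names are illustrative only.

```python
# A lambda and a def can express the same function; the lambda is just anonymous.
power = lambda x, y: x ** y

def power_named(x, y):
    return x ** y

# On Python 3, map() returns a lazy iterator, so list() is needed to get a list.
squares_map = list(map(lambda x: x ** 2, range(10)))

# The list comprehension builds the same list without the lambda keyword.
squares_comp = [x ** 2 for x in range(10)]

print(power(2, 5), power_named(2, 5))
print(squares_map == squares_comp)
```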

5. Variables aren’t declared

It’s just way too easy for beginners to make a typo that brings their progress to a screeching halt. Consider this real-life example from my most recent course:

positions = [0]
for pos in [12,54,76,103]:
    postions = positions + [pos]
print(positions) # prints [0] rather than [0,12,54,76,103]

Leaving aside that this particular example could have been salvaged by using positions.append(), it took way too long for us to track down the typo. In real-life code, this is the kind of thing that would ideally be caught by unit testing. This is one (rare!) case in which I pine for the old days of teaching Perl – use strict and my would have taken care of this type of problem.
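The remark about positions.append() can be illustrated with a small sketch: with the same typo, calling a method on the misspelled name fails loudly instead of quietly creating a new variable. This assumes the misspelled name has not been assigned anywhere else; `error_message` is a name introduced just for this example.

```python
positions = [0]

try:
    for pos in [12, 54, 76, 103]:
        # Same typo as above, but calling a method on a name that was never
        # assigned raises NameError instead of silently creating a variable.
        postions.append(pos)
    error_message = None
except NameError as error:
    error_message = str(error)

# The correctly spelled version mutates the existing list in place.
for pos in [12, 54, 76, 103]:
    positions.append(pos)

print(error_message)
print(positions)
```

Linters can also flag a variable that is assigned but never used, which would catch the original form of the typo without running the code.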