Libraries for data science

These are the best libraries for transforming Python from a general-purpose programming language into a powerful and robust tool for data analysis and visualization.


NumPy is the foundational library for scientific computing in Python, and many downstream libraries use NumPy arrays as inputs and outputs. NumPy introduces objects for multidimensional arrays and matrices, along with routines that let developers apply advanced mathematical and statistical functions to those arrays with as little code as possible.

import numpy as np

# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"
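
As a minimal sketch of the "as little code as possible" point above, the following computes per-column statistics and a matrix product without any explicit loops. The array values and variable names are illustrative only.

```python
import numpy as np

# Illustrative data: 3 samples (rows) with 4 measurements each (columns).
a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]], dtype=float)

col_mean = a.mean(axis=0)    # mean of each column
col_std = a.std(axis=0)      # population standard deviation of each column
centered = a - col_mean      # broadcasting subtracts the column means from every row
gram = a.T @ a               # (4, 3) @ (3, 4) matrix product

print(col_mean)              # [5. 6. 7. 8.]
print(centered.sum(axis=0))  # [0. 0. 0. 0.]
print(gram.shape)            # (4, 4)
```

Each statistic is a single call, and broadcasting handles the row-by-row subtraction that would otherwise need a loop.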


SciPy builds on NumPy by adding a collection of algorithms and high-level commands for manipulating and visualizing data. The package includes functions for computing integrals numerically, solving differential equations, and optimization.

Example - Testing the skewness and kurtosis of a sample against a normal distribution

>>> from scipy import stats
>>> # x is assumed to be a 1-D array of sample values defined beforehand
>>> print('normal skewtest teststat = %6.3f pvalue = %6.4f' % stats.skewtest(x))
normal skewtest teststat = 2.785 pvalue = 0.0054
>>> print('normal kurtosistest teststat = %6.3f pvalue = %6.4f' % stats.kurtosistest(x))
normal kurtosistest teststat = 4.757 pvalue = 0.0000
>>> print('normaltest teststat = %6.3f pvalue = %6.4f' % stats.normaltest(x))
normaltest teststat = 30.379 pvalue = 0.0000
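
The numerical integration, differential equation solving and optimization mentioned above can be sketched as follows. The functions being integrated, solved and minimized are illustrative choices with known exact answers, so the results are easy to check.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2.
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

# Minimize the illustrative function f(x) = (x - 3)^2 + 1, whose minimum is at x = 3.
def f(x):
    return (x - 3.0) ** 2 + 1.0

result = optimize.minimize_scalar(f)

# Solve the ODE dy/dt = -y with y(0) = 1 on [0, 1]; the exact solution is exp(-t).
sol = integrate.solve_ivp(lambda t, y: -y, (0.0, 1.0), [1.0], rtol=1e-8, atol=1e-10)

print(area)          # approximately 2.0
print(result.x)      # approximately 3.0
print(sol.y[0, -1])  # approximately exp(-1)
```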


Pandas adds data structures and tools designed for practical data analysis in finance, the social sciences and engineering. Pandas works well with incomplete, messy and unlabeled data, and provides tools for shaping, merging, reshaping and slicing datasets.

import pandas as pd 
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 
'age': [42, 52, 36, 24, 73], 
'preTestScore': [4, 24, 31, ".", "."], 
'postTestScore': ["25,000", "94,000", 57, 62, 70]} 
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore']) 
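
One possible way to clean up the messy values in this DataFrame is sketched below. It assumes that "." marks a missing value and that the commas in the scores are thousands separators; both are guesses about the data, not something the example states.

```python
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
            'age': [42, 52, 36, 24, 73],
            'preTestScore': [4, 24, 31, ".", "."],
            'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data)

# Assumption: "." is a placeholder for a missing value.
df = df.replace(".", np.nan)

# Assumption: commas are thousands separators. Strip them, then convert to numbers.
for col in ["preTestScore", "postTestScore"]:
    as_text = df[col].astype(str).str.replace(",", "", regex=False)
    df[col] = pd.to_numeric(as_text, errors="coerce")

print(df)
print(df.isna().sum())  # number of missing values per column
```

After this step the score columns are numeric, so aggregations such as `df["preTestScore"].mean()` skip the missing entries instead of failing on strings.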


Matplotlib is the standard Python library for creating 2D plots and graphs. It is low level, which means that it requires more commands to generate nice-looking graphs and figures than some higher-level libraries. It is flexible, and with enough commands almost any kind of graph can be made with matplotlib.

import numpy as np # importing numpy 
import matplotlib.pyplot as plt # importing matplotlib 

# Compute the x and y coordinates for points on sine and cosine curves 
x = np.arange(0, 3 * np.pi, 0.1) 
y_sin = np.sin(x) # finding sin of x 
y_cos = np.cos(x) # finding cosine of x 

# Plot the points using matplotlib 
plt.plot(x, y_sin) 
plt.plot(x, y_cos) 
plt.xlabel('x axis label') 
plt.ylabel('y axis label') 
plt.title('Sine and Cosine') 
plt.legend(['Sine', 'Cosine'])
plt.show() # display the figure

Libraries for machine learning

Machine learning sits at the intersection of artificial intelligence and data science. By training computers on real-world data sets, we can build models that make accurate and sophisticated predictions. Use cases include getting better driving directions and building computers that can identify landmarks by looking at pictures.


Scikit-learn builds on NumPy and SciPy by incorporating algorithms for common machine learning and data mining tasks, including clustering, regression and classification. Scikit-learn's tools are well documented, and its contributors include many machine learning experts. It is a curated library, which means that developers do not need to choose between different versions of the same algorithm. Its power and ease of use make it popular with data-heavy startups like Evernote and Spotify.

Example - Logistic regression using Scikit-learn
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn import linear_model, datasets 

# import some data to play with 
iris = datasets.load_iris() 
X = iris.data[:, :2] # we only take the first two features.
Y = iris.target

h = .02 # step size in the mesh
logreg = linear_model.LogisticRegression(C=1e5)

# we create an instance of the logistic regression classifier and fit the data.
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each 
# point in the mesh [x_min, x_max]x[y_min, y_max]. 
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5 
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5 
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) 
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()]) 

# Put the result into a color plot 
Z = Z.reshape(xx.shape) 
plt.figure(1, figsize=(4, 3)) 
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
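
Clustering, one of the other tasks named in the Scikit-learn description above, follows the same estimator pattern. The sketch below groups the iris measurements into three clusters with k-means; the choice of three clusters and the `random_state` value are illustrative assumptions.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data  # all four features; the species labels are not used

# Assumption: three clusters, chosen to match the three iris species.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.shape)  # (3, 4)
print(sorted(set(labels.tolist())))   # [0, 1, 2]
```

The cluster ids are arbitrary, so they will not necessarily line up with the species numbering in `iris.target`.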


TensorFlow was developed by Google as an open-source successor to DistBelief, its previous framework for training neural networks. TensorFlow uses a system of multi-layered nodes that lets you quickly set up, train and deploy artificial neural networks on large datasets. It is what allows Google to identify objects in photos or understand spoken words in its voice-recognition app, and it is mainly used in deep learning.

Example - MNIST dataset (hand-written digits from 0-9)

import tensorflow as tf
mnist = tf.keras.datasets.mnist

# Load the data and scale pixel values to the range [0, 1]
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)


Keras is a high-level neural network API written in Python that is capable of running on top of TensorFlow, CNTK and Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.

You will use Keras if you need a deep learning library that:

  • Allows for easy and fast prototyping through user friendliness, modularity and extensibility
  • Supports both convolutional networks and recurrent networks
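  • Runs seamlessly on CPU and GPU

import keras
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=100))
model.add(Dense(units=10, activation='softmax'))

# Configure the learning process with the default SGD optimizer...
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
# ...or configure the optimizer further if needed
model.compile(loss='categorical_crossentropy',
              optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True))

# x_train, y_train, x_batch, y_batch, x_test and y_test are assumed to be NumPy arrays prepared beforehand
model.fit(x_train, y_train, epochs=5, batch_size=32)
model.train_on_batch(x_batch, y_batch)
loss_and_metrics = model.evaluate(x_test, y_test, batch_size=128)
classes = model.predict(x_test, batch_size=128)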

Libraries for natural language processing


Scrapy is a library for creating spider bots that systematically crawl the web and extract structured data such as prices, contact info and URLs. Originally designed for web scraping, it can also extract data from APIs.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            '',
            '',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)


NLTK is a set of libraries designed for Natural Language Processing (NLP). NLTK's basic functions allow you to tag text and identify named entities. It can also display parse trees, which are like sentence diagrams that reveal parts of speech and dependencies. More complicated tasks, such as sentiment analysis and automatic summarization, can also be done with NLTK.
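
To give a feel for the parse trees mentioned above, here is a minimal sketch using NLTK's `Tree` class. The bracketed parse is written by hand for illustration; it is not the output of an NLTK parser or tagger, which would require downloading additional models.

```python
from nltk import Tree

# Hand-written bracketed parse of "the cat sat" (illustrative, not parser output).
tree = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")

print(tree.label())   # S
print(tree.leaves())  # ['the', 'cat', 'sat']
print(tree.pos())     # [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]
tree.pretty_print()   # draws the tree as text
```

The `pos()` call pairs each word with its part-of-speech label, which is the same information a sentence diagram conveys.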


SpaCy is minimal and opinionated and, unlike NLTK, it does not flood you with options. Its philosophy is to present the single best algorithm for each purpose. Few choices need to be made, so you can focus on being productive.

As SpaCy is built on Cython, it is lightning fast. It is considered to be ‘state-of-the-art’. Its main weakness is that is only.


Gensim is a well-optimized library for topic modelling and document similarity analysis. Its topic modelling algorithms, such as its Latent Dirichlet Allocation (LDA) implementation, are best-in-class. It is robust, efficient and scalable.

Libraries for plotting and visualizations

An analysis is only valuable if it can be communicated to other people. These libraries, some of them built on matplotlib, make it easy to create sophisticated and visually compelling graphs, charts and maps.


Seaborn is a popular visualization library that is built on matplotlib’s foundation. Seaborn’s default styles are more sophisticated than matplotlib’s. Seaborn is a high-level library, which means that it is easy to generate various kinds of plots, including heatmaps, time series and violin plots.

>>> import seaborn as sns
>>> sns.set(style="whitegrid") 
>>> tips = sns.load_dataset("tips") 
>>> ax = sns.boxplot(x=tips["total_bill"])

Bokeh makes interactive, zoomable plots in modern web browsers using JavaScript widgets. A distinguishing feature of Bokeh is that it comes with three levels of interface, from high-level abstractions that let you quickly generate complex plots to a low-level view that offers maximum flexibility to app developers.

Matplotlib is for creating 2D plots and graphs. It is a low-level library, which means that it requires more commands to generate nice-looking graphs and figures than higher-level libraries do. It is flexible enough that just about any kind of graph can be made with matplotlib.

Scikit-learn vs TensorFlow vs Keras

TensorFlow is low level: it provides the Lego bricks for implementing machine learning algorithms, whereas scikit-learn offers off-the-shelf algorithms, such as classifiers like SVMs, Random Forests and Logistic Regression. TensorFlow shines for deep learning because it takes advantage of GPUs for more efficient training.

Scikit-learn’s deep learning functionality is quite limited. It introduced shallow networks only recently, and its multi-layer perceptron (MLP) is not as well optimized as the equivalents in Keras or TensorFlow.
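
The "off-the-shelf" point can be illustrated with a short sketch: the classifiers named above, plus scikit-learn's MLP, all share the same `fit` and `score` interface. The dataset (scikit-learn's bundled digits), the split and the hyperparameters are illustrative assumptions, and the resulting scores say nothing about how these models compare in general.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values in this dataset range from 0 to 16
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Illustrative hyperparameters; none of them are tuned.
models = {
    "SVM": SVC(),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=2000),
    "MLP (one hidden layer)": MLPClassifier(hidden_layer_sizes=(64,),
                                            max_iter=500, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # same call for every estimator
    scores[name] = model.score(X_test, y_test)   # mean accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```

Swapping one algorithm for another is a one-line change, which is the convenience scikit-learn trades against the lower-level control that TensorFlow offers.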


XGBoost includes both a linear model solver and a tree learning algorithm. It builds its model iteratively, with later rounds giving more weight to hard-to-classify observations. It is fast because it can run computations in parallel on a single machine. It also has additional features for cross-validation and for finding important variables.


  • Speed: It can automatically run computations in parallel on Windows and Linux, and is generally over 10 times faster than the classical gradient boosting model
  • Input type: It takes several types of input data:
      • Dense matrix: It can take a dense matrix
      • Sparse matrix: It can take a sparse matrix
      • xgb.DMatrix: Its own class
  • Sparsity: It accepts sparse input for both the tree booster and the linear booster, and is optimized for sparse input
  • Customization: It supports customized objective functions and evaluation functions