$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-file myPythonScript.py
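The command above ships myPythonScript.py to the cluster and runs it as the mapper, but the slides do not show the script itself. As a minimal sketch, assuming a word-count style job (an assumption for illustration, not the original script), a streaming mapper reads lines from standard input and writes tab-separated key/value records to standard output. The helper name `map_lines` is hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical contents of myPythonScript.py: a word-count style mapper.
import sys


def map_lines(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


if __name__ == "__main__":
    # Hadoop Streaming feeds the input split on stdin and collects stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

Keeping the logic in a generator makes it easy to check locally on a plain list of strings before submitting the job.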
Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed.
The next sections demonstrate the use of various data-science-related Python packages. You can install them separately, as shown in the slides, or skip the individual installations and use the Anaconda Python distribution, which comes bundled with data science and machine learning packages.
# install IPython with pip3
$ pip3 install ipython
# install IPython with pipenv
$ pipenv install ipython
# install Jupyter with pip3
$ pip3 install jupyter
# install Jupyter with pipenv
$ pipenv install jupyter
# navigate to the folder you want to be served:
cd jupyter_demos
# start the server:
jupyter notebook
# install pandas with pip3
$ pip3 install pandas
# install pandas with pipenv
$ pipenv install pandas
# install pandas with conda
$ conda install pandas
The two primary data structures in pandas are Series and DataFrame.
import pandas as pd
ds = pd.Series([1,2,3,4])
print(ds)
0 1
1 2
2 3
3 4
dtype: int64
ds = pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])
print(ds)
a 1
b 2
c 3
d 4
dtype: int64
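ds = pd.Series({
"d":4,
"a":1,
"c":3,
"b":2,
"e":5
})
# pandas >= 0.23 keeps the dict's insertion order (d, a, c, b, e);
# sort the index to get the alphabetical order shown here
ds = ds.sort_index()
print(ds)
a 1
b 2
c 3
d 4
e 5
dtype: int64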
##get index object:
print(ds.index)
#Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
## label-based or positional access
print(ds["a"])
print(ds.iloc[0])  # ds[0] worked in older pandas but is deprecated for label indexes
## indexes as list:
print(ds[['a', 'c', 'e']])
#a 1
#c 3
#e 5
#dtype: int64
## slicing
print(ds["a":"d"])
#a 1
#b 2
#c 3
#d 4
#dtype: int64
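The slice above returns four rows because label slices include the end label. As a small sketch of the contrast, assuming the same five-element Series, a positional slice with `iloc` follows the usual Python rule and excludes the end position:

```python
import pandas as pd

ds = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Label slice: the end label "d" is part of the result.
by_label = ds.loc["a":"d"]
# Positional slice: the end position 3 is not part of the result.
by_position = ds.iloc[0:3]

print(list(by_label.index))     # ['a', 'b', 'c', 'd']
print(list(by_position.index))  # ['a', 'b', 'c']
```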
ds = pd.Series([1,2,3,4,5],index=["a","b","c","d","e"])
ds.index = ["A","B","C","D","E"]
print(ds)
#A 1
#B 2
#C 3
#D 4
#E 5
#dtype: int64
ds = pd.Series([1,2,3,4,5],index=["a","b","c","d","e"])
## filtering by value
ds[ds>2]
#c 3
#d 4
#e 5
#dtype: int64
## multiplication
ds*2
#a 2
#b 4
#c 6
#d 8
#e 10
#dtype: int64
ds = pd.Series([1,2,3,4,5],index=["a","b","c","d","e"])
"a" in ds
#True
"f" in ds
#False
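Note that `in` tests the index labels, not the stored values. A short sketch of the difference, assuming the same Series as above; checking for a value needs `ds.values` or `Series.isin`:

```python
import pandas as pd

ds = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

label_hit = "a" in ds                      # "a" is an index label
value_as_label = 3 in ds                   # 3 is a value, not a label
value_hit = 3 in ds.values                 # searches the underlying array
value_hit_isin = bool(ds.isin([3]).any())  # same check via Series.isin

print(label_hit, value_as_label, value_hit, value_hit_isin)
# True False True True
```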
ds1 = pd.Series([1,3], index=["a","c"])
ds2 = pd.Series([2,3], index=["b","c"])
print(ds1+ds2)
#a NaN
#b NaN
#c 6.0
#dtype: float64
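The NaN values appear because the two Series are aligned on the union of their indexes, and labels present on only one side have nothing to add to. If missing labels should count as zero instead, one option is `Series.add` with `fill_value`; a minimal sketch with the same data:

```python
import pandas as pd

ds1 = pd.Series([1, 3], index=["a", "c"])
ds2 = pd.Series([2, 3], index=["b", "c"])

# Labels missing on one side are treated as 0 before adding.
total = ds1.add(ds2, fill_value=0)
print(total)
# a    1.0
# b    2.0
# c    6.0
# dtype: float64
```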
# create Series Object:
prices_ds = pd.Series([1.5, 2, 2.5, 3],
index=["apples", "oranges", "bananas", "strawberries"])
# create DataFrame Object from prices Series:
prices_df = pd.DataFrame(prices_ds)
print(prices_df)
# 0
#apples 1.5
#oranges 2.0
#bananas 2.5
#strawberries 3.0
# create DataFrame Object from prices Series, naming the column:
prices_df = pd.DataFrame(prices_ds,columns=["prices"])
print(prices_df)
# prices
#apples 1.5
#oranges 2.0
#bananas 2.5
#strawberries 3.0
prices_ds = pd.Series([1.5, 2, 2.5, 3],
index=["apples", "oranges", "bananas", "strawberries"])
suppliers_ds = pd.Series(["supplier1", "supplier2", "supplier4", "supplier3"],
index=["apples", "oranges", "bananas", "strawberries"])
fruits_df = pd.DataFrame({
"prices": prices_ds,
"suppliers": suppliers_ds
})
print(fruits_df)
# prices suppliers
#apples 1.5 supplier1
#oranges 2.0 supplier2
#bananas 2.5 supplier4
#strawberries 3.0 supplier3
read_csv
read_csv is the preferred method for loading CSV data into a DataFrame object.
The merge method in pandas is analogous to the SQL join operation.
The DataFrame.join method uses merge internally for index-on-index (the default) and column(s)-on-index joins.
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=False,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
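To illustrate the relation between join and merge described above, here is a small sketch with two hypothetical frames indexed by `did` (the frames are made up for this example). An index-on-index `join` should give the same result as `merge` with `left_index=True` and `right_index=True`:

```python
import pandas as pd

# Hypothetical frames, both indexed by the key "did".
left = pd.DataFrame({"dname": ["Asen", "Maria"]},
                    index=pd.Index([2, 3], name="did"))
right = pd.DataFrame({"language": ["C++", "Python"]},
                     index=pd.Index([2, 3], name="did"))

# DataFrame.join defaults to a left join on the index.
joined = left.join(right)
# The equivalent explicit merge call.
merged = pd.merge(left, right, left_index=True, right_index=True, how="left")

print(joined.equals(merged))
```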
did; dname
1; Ivan
2; Asen
3; Maria
4; Stoyan
5; Aleks
6; Svetlin
did; language
2; "C++"
3; "Python"
3; "R"
6; "Java"
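The slides do not show how the two listings above are loaded into the `devs` and `langs` DataFrames used in the merge examples. One possible sketch, assuming the data is semicolon-separated text (inlined here as strings, since the file names are not given): `sep=";"` sets the delimiter and `skipinitialspace=True` drops the blank that follows it. With these options the language values may load without the literal double quotes that appear in the printed outputs.

```python
import io

import pandas as pd

# The two listings, inlined so the sketch is self-contained.
DEVS_CSV = """did; dname
1; Ivan
2; Asen
3; Maria
4; Stoyan
5; Aleks
6; Svetlin
"""

LANGS_CSV = """did; language
2; "C++"
3; "Python"
3; "R"
6; "Java"
"""

# sep=";" matches the delimiter; skipinitialspace strips the space after it.
devs = pd.read_csv(io.StringIO(DEVS_CSV), sep=";", skipinitialspace=True)
langs = pd.read_csv(io.StringIO(LANGS_CSV), sep=";", skipinitialspace=True)

print(list(devs.columns))   # ['did', 'dname']
print(list(langs.columns))  # ['did', 'language']
```

For files on disk, the same keyword arguments apply with a path in place of the `io.StringIO` object.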
Returns only the rows whose keys in the left table have matching keys in the right table.
dev_langs_inner = pd.merge(devs,langs,on="did",how='inner')
print(dev_langs_inner)
did dname language
0 2 Asen "C++"
1 3 Maria "Python"
2 3 Maria "R"
3 6 Svetlin "Java"
Returns all rows from both tables, joining records from the left that have matching keys in the right table.
dev_langs_outer = pd.merge(devs,langs,on="did",how='outer')
dev_langs_outer
did dname language
0 1 Ivan NaN
1 2 Asen "C++"
2 3 Maria "Python"
3 3 Maria "R"
4 4 Stoyan NaN
5 5 Aleks NaN
6 6 Svetlin "Java"
Returns all rows from the left table, and any rows with matching keys from the right table.
dev_langs_left_outer = pd.merge(devs,langs,on="did",how='left')
dev_langs_left_outer
did dname language
0 1 Ivan NaN
1 2 Asen "C++"
2 3 Maria "Python"
3 3 Maria "R"
4 4 Stoyan NaN
5 5 Aleks NaN
6 6 Svetlin "Java"
Returns all rows from the right table, and any rows with matching keys from the left table.
dev_langs_right_outer = pd.merge(devs,langs,on="did",how='right')
dev_langs_right_outer
did dname language
0 2 Asen "C++"
1 3 Maria "Python"
2 3 Maria "R"
3 6 Svetlin "Java"
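The `indicator` parameter from the merge signature can show which side each row of a join came from. A minimal sketch, assuming `devs` and `langs` hold the data listed earlier (rebuilt here from dictionaries so the block runs on its own):

```python
import pandas as pd

devs = pd.DataFrame({
    "did": [1, 2, 3, 4, 5, 6],
    "dname": ["Ivan", "Asen", "Maria", "Stoyan", "Aleks", "Svetlin"],
})
langs = pd.DataFrame({
    "did": [2, 3, 3, 6],
    "language": ["C++", "Python", "R", "Java"],
})

# indicator=True adds a "_merge" column: left_only, right_only or both.
marked = pd.merge(devs, langs, on="did", how="outer", indicator=True)

# Developers without any language row are tagged "left_only".
no_language = marked.loc[marked["_merge"] == "left_only", "dname"].tolist()
print(no_language)  # ['Ivan', 'Stoyan', 'Aleks']
```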