Export data from HDFS to MySQL

First, create the database and table you want to populate.

echo "create database staff2; use staff2; CREATE TABLE editorial (id INT(100) unsigned not null AUTO_INCREMENT, name VARCHAR(20), email VARCHAR(20), primary key (id));" | mysql -u root -p

Once that is done, take a look at the data we want to export, which already sits in HDFS.

hdfs dfs -cat /home/training/staff/editorial/part-m-*
1,Peter,peter@example.com
2,Jack,jack@example.com

Now export the data into MySQL using Sqoop.

sqoop export --connect jdbc:mysql://localhost/staff2 --username root -P --table editorial --export-dir /home/training/staff/editorial
17/02/27 12:51:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.2.0
Enter password: 
17/02/27 12:51:58 INFO manager.SqlManager: Using default fetchSize of 1000
17/02/27 12:51:58 INFO tool.CodeGenTool: Beginning code generation
17/02/27 12:51:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `editorial` AS t LIMIT 1
17/02/27 12:51:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `editorial` AS t LIMIT 1
17/02/27 12:51:59 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
Note: /tmp/sqoop-training/compile/e560499b42a9738bbc5ef127712adc7b/editorial.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/02/27 12:52:03 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-training/compile/e560499b42a9738bbc5ef127712adc7b/editorial.jar
17/02/27 12:52:03 INFO mapreduce.ExportJobBase: Beginning export of editorial
17/02/27 12:52:06 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
17/02/27 12:52:08 INFO input.FileInputFormat: Total input paths to process : 2
17/02/27 12:52:08 INFO input.FileInputFormat: Total input paths to process : 2
17/02/27 12:52:09 INFO mapred.JobClient: Running job: job_201702221239_0006
17/02/27 12:52:10 INFO mapred.JobClient:  map 0% reduce 0%
17/02/27 12:52:31 INFO mapred.JobClient:  map 50% reduce 0%
17/02/27 12:52:45 INFO mapred.JobClient:  map 100% reduce 0%
17/02/27 12:52:49 INFO mapred.JobClient: Job complete: job_201702221239_0006
17/02/27 12:52:49 INFO mapred.JobClient: Counters: 24
17/02/27 12:52:49 INFO mapred.JobClient:   File System Counters
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of bytes read=0
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of bytes written=1176756
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of read operations=0
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of large read operations=0
17/02/27 12:52:49 INFO mapred.JobClient:     FILE: Number of write operations=0
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of bytes read=759
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of bytes written=0
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of read operations=19
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of large read operations=0
17/02/27 12:52:49 INFO mapred.JobClient:     HDFS: Number of write operations=0
17/02/27 12:52:49 INFO mapred.JobClient:   Job Counters 
17/02/27 12:52:49 INFO mapred.JobClient:     Launched map tasks=4
17/02/27 12:52:49 INFO mapred.JobClient:     Data-local map tasks=4
17/02/27 12:52:49 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=64216
17/02/27 12:52:49 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
17/02/27 12:52:49 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
17/02/27 12:52:49 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
17/02/27 12:52:49 INFO mapred.JobClient:   Map-Reduce Framework
17/02/27 12:52:49 INFO mapred.JobClient:     Map input records=2
17/02/27 12:52:49 INFO mapred.JobClient:     Map output records=2
17/02/27 12:52:49 INFO mapred.JobClient:     Input split bytes=661
17/02/27 12:52:49 INFO mapred.JobClient:     Spilled Records=0
17/02/27 12:52:49 INFO mapred.JobClient:     CPU time spent (ms)=3390
17/02/27 12:52:49 INFO mapred.JobClient:     Physical memory (bytes) snapshot=422584320
17/02/27 12:52:49 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2940895232
17/02/27 12:52:49 INFO mapred.JobClient:     Total committed heap usage (bytes)=127401984
17/02/27 12:52:49 INFO mapreduce.ExportJobBase: Transferred 759 bytes in 42.9426 seconds (17.6748 bytes/sec)
17/02/27 12:52:49 INFO mapreduce.ExportJobBase: Exported 2 records.

Now we can see the content in the MySQL database staff2.

echo "use staff2; SELECT * FROM editorial;" | mysql -u root -p 
Enter password: 
id	name	email
1	Peter	peter@example.com
2	Jack	jack@example.com
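
Sqoop's export expects comma-separated fields by default, which is exactly what the part-m-* files above contain. If your HDFS data used another delimiter, say tabs, you would have to tell Sqoop explicitly; a minimal sketch of the same export with that one extra argument:

sqoop export --connect jdbc:mysql://localhost/staff2 --username root -P --table editorial --export-dir /home/training/staff/editorial --input-fields-terminated-by '\t'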

Create a CSV file ordered by highest dividend yield

For this script to work, the yahoo_finance library is needed; it can be easily installed with pip. Stock tickers are read from a file called tickers.txt.
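
The script expects one ticker symbol per line; a hypothetical tickers.txt could look like this:

AAPL
MSFT
KO
T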

#!/usr/bin/env python

import time
from yahoo_finance import Share

# Map dividend yield -> ticker. Note: two tickers with exactly the
# same yield collide on the key and only the last one is kept.
yields = {}
f = open("tickers.txt", "r")
for ticker in f.readlines():
    dividend = Share(ticker).get_dividend_yield()
    if dividend is not None:
        print ticker.rstrip() + ": " + dividend
        if float(dividend) > 0:
            yields[float(dividend)] = ticker.rstrip()
f.close()

# One CSV per day, sorted by dividend yield in descending order
filename = "dividends-" + time.strftime("%Y%m%d") + ".csv"
f = open(filename, "w")  # "w" avoids duplicated headers on same-day reruns
f.write("Ticker, Dividend Yield, Dividend Share, Stock Price\n")
print "Ticker; Dividend Yield; Dividend Share; Stock Price\n"
for k in reversed(sorted(yields.keys())):
    share = Share(yields[k])  # fetch the quote once instead of four times per row
    escribir = yields[k] + "," + str(k) + "," + str(float(share.get_dividend_share())) + "," + str(float(share.get_price())) + "\n"
    print yields[k] + ";" + str(k) + ";" + share.get_dividend_share() + ";" + share.get_price()
    f.write(escribir)
f.close()

When done, it creates a CSV file (named dividends-YYYYMMDD.csv) ordered by highest dividend yield.

00:55:40 [me@server Finance]$ head -5 dividends-20160414.csv
Ticker, Dividend Yield, Dividend Share, Stock Price
GLBL,44.0,1.1,2.64
CLM,29.7,4.42,15.624
XRDC,22.99,0.6,2.65
CRF,22.1,3.98,16.75
00:55:49 [me@server Finance]$

Python mechanize to automate WordPress login

I had to play around with the mechanize library to automate a WordPress login. It proved to be pretty handy.

#!/usr/bin/env python

from mechanize import Browser #pip install mechanize

br = Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent","Python Script using mechanize")]

sign_in = br.open("https://www.wordpress.com/wp-login.php")  #the login url

br.select_form(nr = 0)    #select the form by index; the login form is the only form on the page, so nr = 0
br["log"] = "user@domain" #"log" is the form field that takes the username/email
br["pwd"] = "passwd"      #"pwd" is the form field that takes the password

logged_in = br.submit()   #submitting the login credentials

logincheck = logged_in.read()  #read the body of the page returned after submitting the form

print logged_in.code   #print the HTTP status code (200, 404, ...)
print logged_in.info() #print server info
#print logincheck #printing the body of the redirected url after login

Execution:

user@server: ~/Python $ python wordpress-login.py 
200
Server: nginx
Date: Fri, 04 Mar 2016 16:45:40 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 5762
Connection: close
Vary: Accept-Encoding
X-Frame-Options: SAMEORIGIN
X-ac: 1.ams _dfw
Strict-Transport-Security: max-age=15552000
X-UA-Compatible: IE=Edge

user@server: ~/Python $

If you uncomment the last line of the script, it prints the HTML of the page you land on after logging in.
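
Note that the 200 only tells you the request succeeded, not that the credentials were accepted. A minimal sketch of one way to verify the login, assuming the post-login page contains a marker string such as "Dashboard" (adjust it to whatever your WordPress dashboard actually renders):

if "Dashboard" in logincheck:
    print "login successful"
else:
    print "login failed, check the credentials"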

Python script to download all NASDAQ and NYSE ticker symbols

Below is a short Python script to get all NASDAQ and NYSE common stock tickers. You can then use the resulting file to pull a lot of information with the yahoo_finance library, as in the dividend script above.

#!/usr/bin/env python

import ftplib
import re

# Connect to ftp.nasdaqtrader.com as an anonymous user
ftp = ftplib.FTP('ftp.nasdaqtrader.com', 'anonymous', 'anonymous@debian.org')
ftp.cwd("/SymbolDirectory")

# Download nasdaqlisted.txt (NASDAQ) and otherlisted.txt (NYSE and others)
for ficheiro in ["nasdaqlisted.txt", "otherlisted.txt"]:
    localfile = open(ficheiro, 'wb')
    ftp.retrbinary('RETR ' + ficheiro, localfile.write)
    localfile.close()
ftp.quit()

# Both files are pipe-delimited with the ticker in the first column;
# keep only the lines describing common stock and append the tickers
# to tickers.txt
tickers = open("tickers.txt", "a+")
for ficheiro in ["nasdaqlisted.txt", "otherlisted.txt"]:
    localfile = open(ficheiro, 'r')
    for line in localfile:
        if re.search("Common Stock", line):
            ticker = line.split("|")[0]
            tickers.write(ticker + "\n")
    localfile.close()
tickers.close()