The IMDB extractor transforms Internet Movie Database data files into a topic map browsable with Wandora. The extractor has been created for demonstration purposes only. Wandora does not contain any IMDB data files. Also, be aware that neither Wandora nor its authors have any right to grant you permission to use IMDB data. If you plan to use IMDB topic maps beyond personal use, you should contact the IMDB licensing department.
You may download IMDB data files from
As the data files are extremely large, you can't extract the data into memory topic maps; you have to use database topic maps instead. Wandora does not transfer all IMDB files. The current extractor transfers only:
actors
actresses
keywords
countries
language
locations
genres
movies
biographies
producers
directors
plot summaries
running times
release dates
To prepare for the extraction, download all required data files and unpack them to your local file system. Then create a database topic map and start the extractor with File > Extract > Media > IMDB Extractor. Wandora requests a folder containing IMDB data files, or a single data file, and starts the extraction once the data file or folder has been identified. IMDB data files are very large, so be patient; the extraction may take a while.
Below is a screenshot of Wandora viewing the associations of the movie Dr. Strangelove.... Notice the layer structure: each IMDB data file has been extracted into a separate database topic map.
Step by step example of extracting IMDB with Wandora
This chapter is a step-by-step tutorial showing how to use the IMDB extractor and database topic maps. The tutorial extractions were made in Ubuntu Linux 8.10 running on top of Sun's VirtualBox (which itself ran on Windows XP). The next screenshot shows the system properties of the Ubuntu Linux used for the IMDB extractions. Notice the amount of memory given to Linux: we gave Ubuntu 1500 MB. Our experience suggests you should give Linux as much memory as possible; with a small memory footprint the IMDB extraction fails after heavy swapping. Now start Ubuntu Linux and log in.
Downloading IMDB datafiles
After Ubuntu has started, open a web browser in Ubuntu and
Download IMDB data files:
Unzip all data files in a shell with gunzip, or right-click each data file icon and select the Extract Here option.
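Assuming all the dumps were downloaded into one directory, the unzip step can be done in a single shell loop (the file names follow IMDB's *.list.gz convention):

```shell
# Unpack every gzipped IMDB list file in the current directory.
# gunzip replaces each foo.list.gz with an uncompressed foo.list.
for f in *.list.gz; do
    [ -e "$f" ] || continue   # nothing to do if no .gz files are present
    gunzip "$f"
done
```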
Now you should have all required IMDB data files ready for extraction as shown below.
Setting up Wandora
We prepare the Wandora application next. In Ubuntu:
Download Wandora application.
Start a Linux shell with menu option Applications > Accessories > Terminal
Open Wandora's bin directory.
Change execution rights of Wandora-huge.sh to allow execution.
Finally add Java's bin directory to the PATH environment variable.
Here is how I did previous steps:
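The terminal capture itself is not reproduced here, but the steps amount to something like the following sketch (the Wandora and Java paths are assumptions; substitute the locations on your own system):

```shell
# Assumed locations -- adjust both to your installation.
WANDORA_BIN="$HOME/wandora/bin"
JAVA_BIN="/usr/lib/jvm/java-6-sun/bin"

# Enter Wandora's bin directory and make the startup script executable.
if [ -d "$WANDORA_BIN" ]; then
    cd "$WANDORA_BIN"
    chmod +x Wandora-huge.sh
fi

# Add Java's bin directory to the PATH environment variable.
export PATH="$PATH:$JAVA_BIN"
```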
Now you are ready to start the Wandora application in Linux. Type ./Wandora-huge.sh in the terminal and hit Enter. The Wandora application should start.
Setting up databases for IMDB topic maps
As stated at the beginning of the IMDB extractor documentation above, you need a database topic map to store the extracted topic map, as it is very large. To prepare a database topic map, start another terminal window in Ubuntu with Applications > Accessories > Terminal. In the terminal:
Install MySQL server with command sudo apt-get install mysql-server.
Log into the MySQL server with command mysql --user=<your-username> --password=<your-password>
Create empty databases with the MySQL command create database <database-name>; (notice the ending semicolon) for the following database names:
imdb_actors
imdb_actresses
imdb_countries
imdb_genres
imdb_movies
Prepare each created database with the Wandora-specific table structures in wandora/build/resources/conf/database/db_mysql.sql. In detail:
Select a database with the MySQL command use <database-name>;, for example use imdb_actors; (notice the ending semicolon).
Read the table creation clauses from the external file with the MySQL command source wandora/build/resources/conf/database/db_mysql.sql; (notice the ending semicolon). Note that you may have to change the path of db_mysql.sql depending on your Wandora installation directory and your current directory.
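Creating and preparing the databases can also be scripted. The sketch below only prints the MySQL statements; pipe its output into the mysql client from the login step to actually run them (the schema path is the one mentioned above and may need adjusting):

```shell
# Print CREATE/USE/SOURCE statements for each IMDB database.
# Run them with e.g.:  sh prepare_dbs.sh | mysql --user=<user> --password=<pw>
SCHEMA="wandora/build/resources/conf/database/db_mysql.sql"  # adjust if needed

for db in imdb_actors imdb_actresses imdb_countries imdb_genres imdb_movies; do
    echo "CREATE DATABASE $db;"
    echo "USE $db;"
    echo "SOURCE $SCHEMA;"
done
```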
Below is my terminal capture of the previous steps. After these steps I have six empty database topic maps in the local MySQL and I am ready for the actual IMDB extractions.
Extracting IMDB with Wandora
Go back to the Wandora application started earlier and select menu option Layers > New layer. A dialog window opens. Select Database in the drop-down selector labeled Type. The layer creation dialog should now look something like this:
Select MySQL test in the database settings list and click the Edit button. Another dialog opens for the database settings (see image below). In this dialog you can enter the database's name, user name, and password. Change the database name to imdb_actors, the user field to your database user name, and the password field to that user's password.
Now click the OK button; the database configuration window closes, revealing the previous dialog window. Enter a name for the layer, say imdb_actors, keep the MySQL test database configuration selected, and click the OK button. Wandora creates a new topic map layer and shows it in the bottom-left corner of the Wandora application window (see below). Now select the created layer by clicking it. The selected layer is a little darker than unselected ones. All 'write' operations now go to the selected database topic map layer.
If the created layer is dark red, your new layer is broken. A layer breaks when the database connection fails for some reason. Check Wandora's terminal window for the specific error message. I managed to break a layer a couple of times by entering the wrong user name and password for the database.
Next we are going to start the IMDB extraction. Select menu option File > Extract > Media > IMDB extract.... Wandora opens a Files/Urls/Raw selector. Keep the Files tab open and click the Browse button. A file selector opens. Go to the directory where you uncompressed the IMDB data files and select actors.list (see below). To start the extraction, press the Extract button. As IMDB data files are extremely large, it is not surprising that the extraction takes several hours. For example, extracting the >9 million rows of actors.list took about 6 hours in my virtual Ubuntu.
When the extraction finishes, you can request statistics from the database topic map layer with menu option Layers > Statistics > Layer info.... It took my system several minutes to open the layer statistics dialog window:
The extracted topic map contained a little over 2 million topics and nearly 3 million associations. It is very important to understand that browsing such a topic map in Wandora is extremely slow and easily causes OutOfMemory exceptions. As a rule of thumb, do not search for anything that could generate a result set with millions of hits. Also, do not open association type topics, role topics, or class topics, as they probably generate extremely large topic table structures that Wandora can't handle.
Now, to continue extracting the other IMDB files, drop the extracted layer imdb_actors with menu option Layers > Delete layer.... Deleting a database topic map layer doesn't touch the database content, and you can open it again later. It's just more convenient to do the extraction when there are no other topic map layers in the way.
Now repeat all the steps described above for all other IMDB data files. You should extract each data file to its own database topic map:
Merging IMDB database topic map layers
Now you should have all IMDB data files extracted. The final step is to open all generated topic maps in Wandora as separate layers. In Wandora, for each database topic map:
Select menu option Layers > New layer...
Change topic map type to Database
Edit default settings of MySQL test as you did while preparing the extraction.
Give the layer a unique name and hit OK.
As a result, your Wandora should look something like the image below, and you can start browsing the merged IMDB topic map. Be careful: the layer stack is huge, and you easily get OutOfMemory exceptions, as noted above. :)
Below, the user has searched for Brazil. The result set contains Terry Gilliam's movie Brazil. The user opens the movie in the topic panel.
Below, the user has scrolled down to see all associations of the movie Brazil.
Below, the user has double-clicked the topic of Terry Gilliam. The topic is open in the topic panel and the user can see all associations of Terry Gilliam.
If the user scrolls down, the topic panel reveals an association table representing all the movies Terry Gilliam has directed.
One of Terry Gilliam's directions is the movie Twelve Monkeys. The user double-clicks the topic in Terry Gilliam's director table, and the topic Twelve Monkeys is opened in the topic panel.
Retrieved from 'http://wandora.org/w/index.php?title=IMDB_extractor&oldid=8532'
Welcome to my public projects related to IMDB's database dumps :) Here is what this project will cover:
To make it easier to traverse this project as it grows, I am going to add links that take you quickly to the sections you want.
Where to get the data, how to import it, and the idea behind the relational model solution are covered in the 'How to get started' and 'Explanation of the solution' sections.
Below are links to get you quickly to the destinations you require. For the Data Warehouse project I recommend going to the general folder first because of the README; there is important info there.
Relational Model Project
Data Warehouse Project
How to get started
Here is IMDB's simple documentation on their DB dump: http://www.imdb.com/interfaces/
And here's the link from which you can download all the .tsv files: https://datasets.imdbws.com/
Keep in mind this database is roughly around 5 GB after we're done with our solution, and that is not accounting for other testing tables you might create. So keep at least 10 GB free on your PC for this project. First, create a testing database; it can be tempDB or an actual database, it's up to you.
Unfortunately (although justifiably) this dump is not their full database, so you will come across some weird issues when importing. First of all, here is how to import the .tsv files into SQL Server as actual tables:
Although you are importing a flat file, in this case a .tsv (tab-separated values) file, we will go with the 'Import Data…' option. The reason is that the 'Import Flat File' option does not allow you to change how SQL Server deals with the strings you are importing, and if you try without changing that, it will always give you an error and won't work.
After you choose 'Flat File' as the Data Source and browse to the file you want to import, go into Advanced and select all columns. The columns default to string [DT_STR]; we don't want that. Change it to [DT_TEXT] (not Unicode) or [DT_WSTR] (preferably DT_WSTR, for Unicode/UTF-8). The reason is that DT_STR is varchar in ANSI (it also has a different size limit), while this is a UTF-8 database with titles in different languages, so you don't want weird characters popping up.
After that, choose your testing database and import the file there as a table (done by default).
Explanation of the solution
This solution is probably extremely inefficient, but it was the way I solved it, so if you have other solutions, by all means share them.
One of the biggest problems with this database is its arrays: if you want to normalize the database, you can't have string arrays. The link above leads to the folder holding all the scripts responsible for splitting the arrays in the DB.
Depending on how you solve this, you might be able to convert certain strings to their respective types. In my solution, however, dates and pretty much all other attributes except Ratings are stored as nvarchar, so most of your tables will have nvarchar attribute types.
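The actual splitting scripts live in the linked folder; as an illustration of the transformation they perform, here is the same idea applied to a toy two-column file whose second column is a comma-separated array (the sample rows are made up):

```shell
# Turn "id <tab> a,b,c" into one (id, element) row per array element --
# the normalized shape a relational table needs.
printf 'tt0001\tnm0001,nm0002\ntt0002\tnm0003\n' |
awk -F'\t' '{
    n = split($2, parts, ",")        # split the array column on commas
    for (i = 1; i <= n; i++)
        print $1 "\t" parts[i]       # emit one row per element
}'
```

The first input row expands into two output rows, which is why row counts grow sharply after splitting.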
The IMDB_SCHEMA_TABLES script creates all tables and their respective schemas; this will be your "production" database, where your final product will live.
The script for the Title_Crew table ran for around 2 hours on my rig and outputs around 13.6 million rows (around 4 million before splitting), so keep that in mind.
I have also included two scripts with the Title.Type and Cast.Profession data, so you don't have to import them; the link for them is above.
Now for the second problem while importing this database to your real production database:
As mentioned before, this is not the full database, and I am not talking just about the tables but about the data itself. For example, more than a million titles referenced in the name_basics table's knownForTitles attribute don't exist in the actual title_basics table (and even more in the title_crew table). So when you try to import the data, it will never work; it will give you an error saying you are violating a FOREIGN KEY constraint.
The way I solved this is probably not the best way possible, but it works. I made copies of the production tables in the test DB, but without any of the constraints, and joined them together by putting the FOREIGN KEY constraint on the IDENTITY field. After importing the data, I inner joined them on their tconst and nconst fields. This resulted in consistent data that exists in all tables of the given database dump.
There is a link above that takes you straight to the folder holding the script.
After you import the data from the .tsv tables into the dummy tables, all you have to do is write simple SELECT queries with an INNER JOIN on the tconst and nconst fields instead of the generated ID.
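The consistency filter described above is plain inner-join logic: keep only the keys that exist on both sides. Outside SQL Server, the same effect can be sketched with the coreutils join command on two sorted key files (the file names and IDs below are made up):

```shell
cd "$(mktemp -d)"   # work in a scratch directory

# tconst values present in title_basics
printf 'tt0001\ntt0002\ntt0003\n' > titles.txt
# tconst values referenced from name_basics' knownForTitles column
printf 'tt0002\ntt0003\ntt0004\n' > known.txt

# join keeps only keys present in BOTH (sorted) files -- an inner join
# on tconst, so the dangling reference tt0004 is dropped.
join titles.txt known.txt   # prints tt0002 and tt0003
```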