Import Stackoverflow into Neo4j


Download the Stackoverflow Archives

  • https://archive.org/details/stackexchange
  • download
    • stackoverflow.com-Badges.7z
    • stackoverflow.com-Comments.7z
    • stackoverflow.com-PostHistory.7z
    • stackoverflow.com-PostLinks.7z
    • stackoverflow.com-Posts.7z
    • stackoverflow.com-Tags.7z
    • stackoverflow.com-Users.7z
    • stackoverflow.com-Votes.7z

Install p7zip

brew install p7zip

Unzip the Posts, Users and Tags

  • 7za -y -oextracted x *Users.7z
  • 7za -y -oextracted x *Tags.7z
  • 7za -y -oextracted x *Posts.7z

Review the Extracted XML files

  • Users.xml – 3.53GB
  • Tags.xml – 5MB
  • Posts.xml – 74GB
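
With Posts.xml at 74GB, opening the files in an editor is impractical. One way to review them is to stream the first few records in Python. The sketch below assumes each record is a row element whose fields are stored as XML attributes, which is how the Stack Exchange dumps are laid out. The helper names iter_rows and peek are made up for this example and are not part of stackoverflow-neo4j.

```python
import xml.etree.ElementTree as ET
from itertools import islice


def iter_rows(xml_path):
    """Yield the attributes of each <row> element as a dict, one at a time."""
    with open(xml_path, "rb") as fh:
        context = ET.iterparse(fh, events=("start", "end"))
        # The first event is the start of the root element (e.g. <users>).
        _event, root = next(context)
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                yield dict(elem.attrib)
                # Drop rows that were already handled so memory use stays small.
                root.clear()


def peek(xml_path, n=3):
    """Return the first n rows of a dump file without reading the rest."""
    return list(islice(iter_rows(xml_path), n))
```

For example, peek("extracted/Users.xml") would return the first three user records as dictionaries, assuming the files were extracted into the extracted directory as above.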

Clone stackoverflow-neo4j

git clone https://github.com/mdamien/stackoverflow-neo4j

Install Python

  • brew install python3
  • pip3 install xmltodict

Extract the XML files as CSV

python3 to_csv.py extracted
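
This generates the CSV files used by the import step, in the csvs directory:

  • posts.csv
  • users.csv
  • tags.csv
  • posts_rel.csv
  • tags_posts_rel.csv
  • users_posts_rel.csv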
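
To check what was produced before importing, a short Python sketch can list each CSV file with its size and header row. It assumes the files sit in a csvs directory and that the first line of each file is the header; describe_csvs is a hypothetical helper name, not part of stackoverflow-neo4j.

```python
import csv
from pathlib import Path


def describe_csvs(directory):
    """Return (file name, size in bytes, header columns) for each CSV file."""
    summary = []
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            # Read only the first line; the data rows are left untouched.
            header = next(csv.reader(fh), [])
        summary.append((path.name, path.stat().st_size, header))
    return summary


def print_summary(directory):
    """Print one line per CSV file found in the directory."""
    for name, size, header in describe_csvs(directory):
        print(f"{name}: {size} bytes, columns: {', '.join(header)}")
```

Calling print_summary("csvs") from the stackoverflow-neo4j directory prints one line per file.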

Import the Data into Neo4J

  • Open Neo4j Desktop
  • Create a StackOverflow Project
  • Create a Local Graph
  • click Manage
  • click Terminal
  • set a variable DATA to the path of the generated CSV files
    • export DATA=/Development/Neo4j/StackOverflow/stackoverflow-neo4j/csvs
  • import the CSV files into a new stackoverflow graph database
./bin/neo4j-admin import \
--mode=csv \
--database=stackoverflow.db \
--id-type string \
--ignore-missing-nodes=true \
--nodes:Post $DATA/posts.csv \
--nodes:User $DATA/users.csv \
--nodes:Tag $DATA/tags.csv \
--relationships:PARENT_OF $DATA/posts_rel.csv \
--relationships:HAS_TAG $DATA/tags_posts_rel.csv \
--relationships:POSTED $DATA/users_posts_rel.csv
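
The import fails late if one of the CSV paths is wrong, so it can help to check the files first. The Python sketch below rebuilds the same argument list from a data directory and reports missing files. The flags and file names are copied from the command above; the helper names missing_files, build_import_command and as_shell are made up for this example.

```python
import shlex
from pathlib import Path

# Label or relationship type mapped to file name, as in the command above.
NODE_FILES = {"Post": "posts.csv", "User": "users.csv", "Tag": "tags.csv"}
REL_FILES = {
    "PARENT_OF": "posts_rel.csv",
    "HAS_TAG": "tags_posts_rel.csv",
    "POSTED": "users_posts_rel.csv",
}


def missing_files(data_dir):
    """Return the expected CSV file names that are not present in data_dir."""
    expected = list(NODE_FILES.values()) + list(REL_FILES.values())
    return [name for name in expected if not (Path(data_dir) / name).is_file()]


def build_import_command(data_dir, database="stackoverflow.db"):
    """Assemble the argument list for the neo4j-admin import call."""
    args = [
        "./bin/neo4j-admin", "import",
        "--mode=csv",
        f"--database={database}",
        "--id-type", "string",
        "--ignore-missing-nodes=true",
    ]
    for label, name in NODE_FILES.items():
        args += [f"--nodes:{label}", str(Path(data_dir) / name)]
    for rel_type, name in REL_FILES.items():
        args += [f"--relationships:{rel_type}", str(Path(data_dir) / name)]
    return args


def as_shell(args):
    """Render the argument list as a single shell command line."""
    return shlex.join(args)
```

If missing_files returns an empty list, the output of as_shell(build_import_command(data_dir)) can be pasted into the Neo4j terminal, or the list can be passed to subprocess.run from the database's home directory.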

Edit Settings

  • uncomment the line #dbms.active_database=graph.db
  • change its value from graph.db to stackoverflow.db, so the line reads dbms.active_database=stackoverflow.db
  • click Apply
  • Restart the database

Open the Browser

  • click Open Browser
  • count all nodes
match (n)
return count(n);
  • count the nodes per label
match (n) return head(labels(n)) as label, count(*);

Create Indexes and Constraints

create index on :Post(title);
create index on :Post(createdAt);
create index on :Post(score);
create index on :Post(views);
create index on :Post(favorites);
create index on :Post(answers);

create index on :User(name);
create index on :User(createdAt);
create index on :User(reputation);
create index on :User(age);

create index on :Tag(count);

create constraint on (t:Tag) assert t.tagId is unique;
create constraint on (u:User) assert u.userId is unique;
create constraint on (p:Post) assert p.postId is unique;

Check Indexes

:schema

Java Questions with Answers

As a graph:

match (tag:Tag {tagId:"java"})<-[r1:HAS_TAG]-(question)-[r2:PARENT_OF]-(answer)
return tag, r1, question, r2, answer
limit 20

As a table:

match (tag:Tag {tagId:"java"})<-[r1:HAS_TAG]-(question:Post)-[r2:PARENT_OF]-(answer)
return question.title, question.body, answer.score, answer.body
limit 20
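
The same query can be run for other tags from Python. This is a minimal sketch, assuming the official neo4j Python driver is installed (pip3 install neo4j) and the database is reachable over Bolt. The function name, connection URI and credentials are placeholders, and the query returns only the title and answer score to keep the rows short.

```python
# The tag and row limit are passed as query parameters instead of being hard-coded.
QUESTIONS_WITH_ANSWERS = """
MATCH (tag:Tag {tagId: $tag})<-[:HAS_TAG]-(question:Post)-[:PARENT_OF]-(answer)
RETURN question.title AS title, answer.score AS score
LIMIT $limit
"""


def questions_with_answers(uri, user, password, tag="java", limit=20):
    """Run the tag query and return the rows as plain dicts."""
    # Imported here so this module still loads where the driver is not installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(QUESTIONS_WITH_ANSWERS, tag=tag, limit=limit)
            return [record.data() for record in result]
    finally:
        driver.close()
```

A call might look like questions_with_answers("bolt://localhost:7687", "neo4j", "secret", tag="python"), with the URI and password replaced by the values of the local graph.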
